HowTo: Run Cloudera Quickstart in Docker

The best way to get familiar with the Hadoop ecosystem, or to build a proof of concept (POC), is to play with it in a sandbox. For that, Cloudera provides two QuickStart options: a virtual machine image and a Docker image. Both contain a small CDH cluster with a single DataNode, which is more than enough for sandbox purposes.

However, many data engineering newcomers find it difficult to set up because, unfortunately, the official Cloudera installation guide is missing some essential post-installation tweaks. Here you will find a quick and easy how-to for setting up Cloudera QuickStart from the Docker image on Linux.

If you have already installed the Docker image and don’t want to repeat those steps, skip to the part with the solution or to the fixed Docker image.

Run QuickStart according to official documentation

Check that you have Docker installed by typing docker -v in a terminal to print its version. (This HOWTO doesn’t cover Docker installation; for that, see the manual on the official Docker site.)

Check the status of the Docker service on your machine and start it if needed:

sudo systemctl status docker
sudo systemctl start docker
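
The two commands above can be combined into one small helper. This is only a sketch and assumes the host uses systemd; the function name ensure_docker_running is made up for this example. It relies on systemctl is-active --quiet, which returns a non-zero exit status when the service is not active.

```shell
# Start the Docker service only if it is not already active.
# ensure_docker_running is a hypothetical helper name, not part of Docker.
ensure_docker_running() {
    if systemctl is-active --quiet docker; then
        echo "docker is already running"
    else
        echo "docker is stopped, starting it"
        sudo systemctl start docker
    fi
}
```

Call ensure_docker_running once before pulling or running the image.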

In my case Docker was stopped, so I had to start it. Now that Docker is installed and ready to work, we need to download the QuickStart image: sudo docker pull cloudera/quickstart:latest. Depending on your connection, the download can take a while, so be patient :).

After a successful download, the Cloudera QuickStart image should appear in the list of Docker images: docker images. Copy the <IMAGE ID> of ‘cloudera/quickstart’; we will use it later. In my case it is ‘4239cd2958c6’. Finally, the main command to run the Docker image as a container is:

sudo docker run -d --hostname=quickstart.cloudera --privileged=true -t -i -p 8888 -p 7180 4239cd2958c6 /usr/bin/docker-quickstart

/usr/bin/docker-quickstart #Entry point that starts all CDH services. Provided by Cloudera
--hostname=quickstart.cloudera #Required: pseudo-distributed configuration assumes this hostname
--privileged=true #Required: for HBase, MySQL-backed Hive metastore, Hue, Oozie, Sentry, and Cloudera Manager, and possibly others
-t  #Required: once services are started, a Bash shell takes over and will die without this
-i  #Required: if you want to use the terminal, either immediately or attach later
-p 8888  #Recommended: maps the Hue port in the guest (8888) to another port on the host
-p 7180  #Recommended: maps the Cloudera Manager port in the guest (7180) to another port on the host
-p [PORT] #Any other guest ports you want to map to free host ports, to make them accessible from outside the container
-d   #Optional: runs the container in the background. It must come before the image ID, otherwise it is passed to docker-quickstart instead of docker. I recommend this option if you plan to keep the container running in the background.
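
With -p 8888 and -p 7180 Docker picks random free host ports. If fixed ports are more convenient, the hostPort:containerPort form of -p can be used instead, as in the sketch below. The function name run_quickstart and its two arguments are illustrative only; the sketch refers to the image by name rather than by ID and assumes the chosen host ports are free.

```shell
# Run the QuickStart container in the background with fixed host ports.
# run_quickstart is a hypothetical helper; $1 = host port for Hue,
# $2 = host port for Cloudera Manager (defaults: 8888 and 7180).
run_quickstart() {
    hue_port="${1:-8888}"
    cm_port="${2:-7180}"
    sudo docker run -d -t -i \
        --hostname=quickstart.cloudera \
        --privileged=true \
        -p "${hue_port}:8888" \
        -p "${cm_port}:7180" \
        cloudera/quickstart:latest /usr/bin/docker-quickstart
}
```

For example, run_quickstart 18888 17180 would publish Hue on host port 18888 and Cloudera Manager on host port 17180.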

Fix the standard image to get it working

Next comes some tuning to make your sandbox operational. By default, Cloudera Manager is not started in the container, so let’s enable it first.

Obtain the <CONTAINER_ID> value with docker ps. In my case it is ‘5fadd6cb8e0c’.

Connect to the QuickStart container shell with docker attach <CONTAINER_ID> and run the script that enables Cloudera Manager: /home/cloudera/cloudera-manager --express
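
A word of caution about docker attach: it connects you to the container's main Bash shell, so typing exit there stops the container (detach with Ctrl-P Ctrl-Q instead). A possible alternative, sketched below, is to run the script from the host with docker exec. The function name enable_cloudera_manager is hypothetical; the script path is the one used above.

```shell
# Enable Cloudera Manager from the host, without attaching to the container.
# enable_cloudera_manager is a made-up helper name; $1 is the container ID.
enable_cloudera_manager() {
    container_id="${1:-}"
    if [ -z "$container_id" ]; then
        echo "usage: enable_cloudera_manager <CONTAINER_ID>" >&2
        return 1
    fi
    sudo docker exec "$container_id" /home/cloudera/cloudera-manager --express
}
```

With the container ID from my run, the call would be enable_cloudera_manager 5fadd6cb8e0c.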

At the end, the script prints the address at which you can access Cloudera Manager. Surprisingly, you won’t be able to connect to it. That happens because we remapped the port earlier, so now we need to find the host port to which guest port 7180 was mapped.

Find the port mapping from guest to host: sudo docker port <CONTAINER_ID> <guest_port>. In my case, to connect to Cloudera Manager I need to open ‘0.0.0.0:32771’ in the browser, and the Cloudera Manager page finally loads.
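
The port lookup can also be scripted. The sketch below takes the first mapping line printed by docker port for guest port 7180, keeps everything after the last colon, and prints a URL to open. The function names are invented for this example, and using localhost assumes the browser runs on the same machine as Docker.

```shell
# Extract the port number from a mapping line such as "0.0.0.0:32771".
port_from_mapping() {
    echo "${1##*:}"
}

# Print the Cloudera Manager URL for a container ($1 = container ID).
# Both function names are hypothetical helpers, not Docker commands.
cloudera_manager_url() {
    mapping="$(sudo docker port "$1" 7180 | head -n 1)"
    echo "http://localhost:$(port_from_mapping "$mapping")"
}
```

For example, cloudera_manager_url 5fadd6cb8e0c prints the address for the container used in this post.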

But wait a second, something is red in our Hosts. Let’s take a closer look at the error and fix it. It is a clock offset error.

Basically, it means that our ntpd service is either not started or can’t connect to its servers. The solution is:

date      # shows the difference between the real date and the server one
sudo chkconfig --add ntpd
sudo service ntpd restart
date      # check that ntpd is working and the date is in sync
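
Because the fix has to be repeated after each container restart, it may be handy to wrap it in a helper that runs from the host. This is only a sketch: fix_container_clock is a made-up name, and it assumes docker exec runs the commands as root inside the container, so sudo is needed only for docker itself.

```shell
# Register and restart ntpd inside a running container ($1 = container ID),
# then print the container's date as a quick check.
# fix_container_clock is a hypothetical helper name.
fix_container_clock() {
    container_id="${1:-}"
    if [ -z "$container_id" ]; then
        echo "usage: fix_container_clock <CONTAINER_ID>" >&2
        return 1
    fi
    sudo docker exec "$container_id" chkconfig --add ntpd &&
    sudo docker exec "$container_id" service ntpd restart &&
    sudo docker exec "$container_id" date
}
```

Usage: fix_container_clock 5fadd6cb8e0c, with your own container ID in place of mine.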

Wait a couple of minutes and check Cloudera Manager again. As you can see, the error is gone.

From now on, until the container is shut down, the ntpd service will keep working properly. If you don’t feel like running these extra commands in the Cloudera Docker container every time it is restarted, you can simply use my Docker image, based on the Cloudera QuickStart one, or create one yourself.