The best way to familiarize yourself with the Hadoop ecosystem, or to build a proof of concept, is to play with it in a sandbox. Cloudera provides two QuickStart options: a virtual machine image and a Docker image. Both contain a small CDH cluster with a single DataNode, which is more than enough for sandbox purposes.
However, most data engineering neophytes find it difficult to set up because, unfortunately, the official Cloudera installation guide is missing some essential post-installation tweaks. Here you will find a quick and easy guide to setting up Cloudera QuickStart from the Docker image on Linux.
If you have already installed the Docker image and don’t want to repeat those steps, just skip to the part with the solution or to the fixed Docker image.
Run QuickStart according to official documentation
Check if you have Docker installed. In a terminal, type `docker -v` to check the version. (This HOWTO doesn’t cover Docker installation; see the manual on the official Docker site.)
Check the status of the Docker service on your machine and start it if needed:
sudo systemctl status docker
sudo systemctl start docker
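The two checks above can be combined into one small helper. This is only a sketch: `check_docker` is a hypothetical function name of mine, and it assumes a systemd-based Linux host, as in the commands above.

```shell
# Sketch: report whether the docker CLI exists and the docker service is active.
# check_docker is a hypothetical helper name, not part of Docker itself.
check_docker() {
  # command -v succeeds only if a command called "docker" can be found
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker: not installed"
    return 1
  fi
  # is-active --quiet prints nothing and exits non-zero when the unit is not running
  if ! systemctl is-active --quiet docker; then
    echo "docker: installed, service not running"
    return 2
  fi
  echo "docker: ready"
}
```

Calling `check_docker` prints one status line. If it reports that the service is not running, `sudo systemctl start docker` is the fix.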
In my case Docker was stopped, so I had to start it. Now that Docker is installed and ready to work, we need to download the QuickStart image: `sudo docker pull cloudera/quickstart:latest`. Depending on your connection, the download can take a while, so be patient :).
After a successful download, the Cloudera QuickStart image should appear in the list of Docker images: `docker images`. Copy the <IMAGE ID> of ‘cloudera/quickstart’; we will use it later. In my case it is ‘4239cd2958c6’. Finally, the main command to run the Docker image as a container is:
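If you would rather not copy the image ID by hand, it can be captured in a variable. A minimal sketch, assuming the image was pulled with the `latest` tag as above; `quickstart_image_id` is a hypothetical helper name, and you may need to prefix `docker` with `sudo` on your machine.

```shell
# Sketch: print the ID of the pulled QuickStart image.
# quickstart_image_id is a hypothetical helper name.
quickstart_image_id() {
  # -q prints only image IDs; the argument restricts the list to this repository:tag
  docker images -q cloudera/quickstart:latest
}

# Example usage (commented out so the snippet has no side effects):
# IMAGE_ID="$(quickstart_image_id)"
```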
sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i -d -p 8888 -p 7180 4239cd2958c6 /usr/bin/docker-quickstart
/usr/bin/docker-quickstart #Entry point to start all CDH services. Provided by cloudera
--hostname=quickstart.cloudera #Required: pseudo-distributed configuration assumes this hostname
--privileged=true #Required: for HBase, MySQL-backed Hive metastore, Hue, Oozie, Sentry, and Cloudera Manager, and possibly others
-t #Required: once services are started, a Bash shell takes over and will die without this
-i #Required: if you want to use the terminal, either immediately or attach later
-p 8888 #Recommended: maps the Hue port in the guest to another port on the host
-p 7180 #Recommended: maps the Cloudera Manager port in the guest (7180) to another port on the host
-p [PORT] #Any other ports you want to remap from the guest to free host ports, to make them accessible outside the container.
-d #Optional: runs the container in the background. I recommend this option if you plan to keep the container running in the background.
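As an alternative to letting Docker pick random host ports, the ports can be pinned with the `host:guest` form of `-p`. This is a sketch under assumptions, not the official Cloudera command: `start_quickstart` is a hypothetical wrapper name, and host ports 8888 and 7180 are assumed to be free on your machine.

```shell
# Sketch: start the QuickStart container in the background with fixed host ports.
# start_quickstart is a hypothetical wrapper; host ports 8888 and 7180 are an
# assumption and must not already be in use on the host.
start_quickstart() {
  # first argument: image ID or name (defaults to the tag pulled earlier)
  image="${1:-cloudera/quickstart:latest}"
  # all docker options, including -d, go before the image;
  # everything after the image is passed to the container as its command
  docker run --hostname=quickstart.cloudera --privileged=true -t -i -d \
    -p 8888:8888 -p 7180:7180 \
    "$image" /usr/bin/docker-quickstart
}

# Example usage (commented out so the snippet has no side effects):
# start_quickstart 4239cd2958c6
```

With fixed mappings, Hue and Cloudera Manager are reachable on the same port numbers on the host as inside the container, so no port lookup is needed later.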
Fix the standard image to get it to work
Next, let’s tune the sandbox to make it operational. By default, Cloudera Manager is not started in the container, so let’s enable it first.
Obtain the <CONTAINER_ID> value with `docker ps`. In my case, it is ‘26d160291aad’.
Connect to the QuickStart container shell with `docker attach <CONTAINER_ID>` and run the script that enables Cloudera Manager: `/home/cloudera/cloudera-manager --express`
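The same two steps can also be done from the host without attaching to the container shell. A rough sketch: both function names are hypothetical, and it assumes the script behaves the same when launched through `docker exec` as it does from the attached shell, which I have not verified.

```shell
# Sketch: find the running QuickStart container and enable Cloudera Manager in it.
# Both function names are hypothetical helpers.
quickstart_container_id() {
  # -q prints only container IDs; the ancestor filter keeps containers
  # created from the given image (name or ID)
  docker ps -q --filter "ancestor=${1:-cloudera/quickstart:latest}"
}

enable_cloudera_manager() {
  # first argument: container ID; runs the enabling script inside the container
  docker exec -t "$1" /home/cloudera/cloudera-manager --express
}

# Example usage (commented out so the snippet has no side effects):
# enable_cloudera_manager "$(quickstart_container_id)"
```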
At the end, the script prints the address at which you can access Cloudera Manager. Surprisingly, you won’t be able to connect to it. That happens because we published the port earlier without fixing a host port, so Docker mapped it to a random one. Now we need to find the host port to which guest port 7180 was mapped.
Detect the port mapping from guest to host: `sudo docker port <CONTAINER_ID> <guest_port>`. Thus, to connect to Cloudera Manager, I need to type ‘0.0.0.0:32771’ in the browser, and Cloudera Manager finally opens.
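To avoid reading the mapping by eye, the lookup can be wrapped in a helper that prints a ready-to-open address. A minimal sketch: `cm_url` is a hypothetical name, and it assumes `docker port` prints one `address:port` mapping per line, as in the output above.

```shell
# Sketch: print the host URL for Cloudera Manager (guest port 7180).
# cm_url is a hypothetical helper name.
cm_url() {
  # docker port prints the host address:port for the given guest port;
  # keep only the first line in case several mappings are listed
  mapping="$(docker port "$1" 7180 | head -n 1)"
  echo "http://${mapping}"
}

# Example usage (commented out so the snippet has no side effects):
# cm_url 26d160291aad
```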
But wait a second, something is red under `Hosts`. Let’s take a closer look at the error and fix it. It’s a clock offset error.
Basically, it means that the ntpd service is either not started or can’t connect to its time servers. The solution is:
date # will show the difference between the real date and the server one
sudo chkconfig --add ntpd
sudo service ntpd restart
date # to make sure that ntpd is working and the date is in sync
Wait a couple of minutes and check Cloudera Manager again. As you can see, the error is gone.
From now on, until the container is shut down, the ntpd service will keep working properly. If you don’t feel like running these additional commands in the Cloudera Docker container every time it is restarted, you can simply use my Docker image, based on the Cloudera QuickStart one, or create one yourself.
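A third, lighter option is to re-apply the fix from the host after each container start. This is only a sketch: `fix_ntpd` is a hypothetical helper name, and it assumes `docker exec` runs as root inside the container (the default), so `sudo` is dropped.

```shell
# Sketch: re-apply the ntpd fix inside a running container, from the host.
# fix_ntpd is a hypothetical helper; the commands are the same ones used above.
fix_ntpd() {
  # first argument: container ID
  docker exec "$1" chkconfig --add ntpd &&
    docker exec "$1" service ntpd restart &&
    docker exec "$1" date   # print the container date as a quick sanity check
}

# Example usage (commented out so the snippet has no side effects):
# fix_ntpd 26d160291aad
```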