Author: Katya Belova

HowTo: Run Cloudera Quickstart in Docker

The best way to familiarize yourself with the Hadoop ecosystem or to do a POC is to play with it in a sandbox. For that, Cloudera provides two QuickStart options: a virtual machine image and a Docker image. Both contain a small CDH cluster with a single DataNode, but that is more than enough for sandbox purposes.

However, many data engineering newcomers find it difficult to set up, since the official Cloudera installation guide unfortunately omits some essential post-installation tweaks. Here you will find a quick and easy way to set up Cloudera QuickStart from the Docker image on Linux.

If you have already installed the Docker image and don't want to repeat those steps, skip straight to the part with the solution or to the fixed Docker image.

Run QuickStart according to the official documentation

Check whether Docker is installed: in a terminal, type docker -v to see the version. (This HowTo doesn't cover Docker installation; see the manual on the official Docker site.)

Check the status of the Docker service on your machine and start it if needed:

sudo systemctl status docker
sudo systemctl start docker

In my case Docker was stopped, so I had to start it. Now that Docker is installed and ready to work, we need to download the QuickStart image: sudo docker pull cloudera/quickstart:latest. Depending on your connection, the download can take a while – be patient :).

After a successful download, our Cloudera QuickStart image should appear in the list of Docker images: docker images. Copy the <IMAGE ID> of 'cloudera/quickstart'; we will use it later (or grab it programmatically – see the sketch after the option list below). In my case it is '4239cd2958c6'. Finally, the main command to run the Docker image as a container is:

sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i -d -p 8888 -p 7180 4239cd2958c6 /usr/bin/docker-quickstart

/usr/bin/docker-quickstart #Entry point to start all CDH services. Provided by cloudera
--hostname=quickstart.cloudera #Required: pseudo-distributed configuration assumes this hostname
--privileged=true #Required: for HBase, MySQL-backed Hive metastore, Hue, Oozie, Sentry, and Cloudera Manager, and possibly others
-t  #Required: once services are started, a Bash shell takes over and will die without this
-i  #Required: if you want to use the terminal, either immediately or attach later
-p 8888  #Recommended: maps the Hue port in the guest to another port on the host
-p 7180  #Recommended: maps the Cloudera Manager port in the guest (7180) to another port on the host
-p [PORT] #Any other guest ports you want to remap to free host ports, to make them accessible outside the container.
-d   #Optional: runs the container in the background (note that it goes before the image name). Recommended if you plan to keep the container running in the background.
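A small aside: if you would rather not copy the <IMAGE ID> by hand, the following convenience sketch does the same thing (it assumes the image was pulled as cloudera/quickstart:latest):

# list local images, then grab the QuickStart image ID into a shell variable
sudo docker images
IMAGE_ID=$(sudo docker images -q cloudera/quickstart:latest)   # e.g. 4239cd2958c6
sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i -d -p 8888 -p 7180 "$IMAGE_ID" /usr/bin/docker-quickstart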

Fix the standard image to get it working

Now comes the tuning that makes your sandbox operational. By default, Cloudera Manager is not started in the container, so let's enable it first.

Obtain the <CONTAINER_ID> value with docker ps. In my case it is '5fadd6cb8e0c'.

Connect to the QuickStart container shell with docker attach <CONTAINER_ID> and run the script that enables Cloudera Manager: /home/cloudera/cloudera-manager --express
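Put together, the whole sequence looks like this (the <CONTAINER_ID> placeholder is whatever docker ps reports):

# find the running QuickStart container and note its CONTAINER ID
sudo docker ps
# attach to the container's shell
sudo docker attach <CONTAINER_ID>
# inside the container: enable and start Cloudera Manager (Express edition)
/home/cloudera/cloudera-manager --express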

At the end, the script prints out an address at which you can access Cloudera Manager. Surprisingly, you won't be able to connect to it. That happens because we remapped the port earlier, so now we need to find out which host port our guest port 7180 was actually mapped to.

Detect the guest-to-host port mapping with sudo docker port <CONTAINER_ID> <guest_port>, as shown below. In my case guest port 7180 ended up on '0.0.0.0:32771', and opening that address in a browser finally brings up the Cloudera Manager UI.
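Concretely, for Cloudera Manager's guest port 7180:

# ask Docker which host port the guest port 7180 was mapped to
sudo docker port <CONTAINER_ID> 7180
# -> 0.0.0.0:32771   (the exact host port will differ on your machine)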

But wait a second, something is red under Hosts. Let's take a closer look at the error and fix it. It's a clock offset error.

Basically, it means that the ntpd service is either not started or cannot reach the NTP servers. The solution is:

date      # note the server date and its offset from the real time
sudo chkconfig --add ntpd
sudo service ntpd restart
date      # make sure ntpd is working and the date is in sync

Wait a couple of minutes and check Cloudera Manager again. As you can see, the error is gone.

From now on, as long as the container is not turned off, the ntpd service will keep working properly. If you don't want to run these extra commands inside the Cloudera Docker container every time it is rebooted, you can simply use my Docker image, based on the Cloudera QuickStart one, or create one yourself (a sketch follows below).
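For reference, here is a minimal sketch of how such an image could be built. The my/quickstart-ntpd tag and the wrapper script name are purely illustrative choices of mine, not the image linked above:

# write a Dockerfile that wraps the original entry point so ntpd is registered
# and started every time the container boots
cat > Dockerfile <<'EOF'
FROM cloudera/quickstart:latest
RUN printf '#!/bin/bash\nchkconfig --add ntpd\nservice ntpd start\nexec /usr/bin/docker-quickstart "$@"\n' \
        > /usr/bin/docker-quickstart-ntpd && \
    chmod +x /usr/bin/docker-quickstart-ntpd
EOF

# build the image and run it exactly as before, but with the wrapper as the entry point
sudo docker build -t my/quickstart-ntpd .
sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i -d -p 8888 -p 7180 my/quickstart-ntpd /usr/bin/docker-quickstart-ntpd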

How to create an Avro-based table in Impala

Consider the following situation: a bundle of .avro files is stored on HDFS and needs to be converted into Impala tables. Schemas are not provided with the files, at least not externally (each Avro file carries its schema in its header). But Impala has a known issue with Avro tables, and their usage is pretty limited: we can create an Avro-based table only if all the columns are manually declared with their types in the CREATE TABLE statement; otherwise it fails with an error.

(screenshot: the Impala error)
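For reference, the only form Impala accepts directly is one where every column is spelled out by hand – a hedged illustration with made-up table and column names, issued through impala-shell:

# purely illustrative names: this is the manual form Impala does accept
impala-shell -q "
  CREATE TABLE demo_avro_tbl (
    id   INT,
    name STRING
  )
  STORED AS AVRO;"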

But what if we have hundreds of columns, or are just not completely sure about the schema, and would like to automate the table creation process?

Disclaimer: you can't do that directly, but there is a workaround: create a temporary Avro table in Hive, then create a temporary Parquet table from it with CREATE TABLE ... AS SELECT, and finally run INVALIDATE METADATA in Impala so that Impala catches up with the changes to the table set.

Step-by-step algorithm:

  1. Check whether you have a file.avsc along with file.avro; if yes, skip step #2
  2. Create an external Avro schema file (.avsc) from file.avro
    • Download avro-tools-1.7.7.jar – the official tool for working with Avro files. Pay attention to the version – it should be 1.7.7.
    • Place it somewhere on your server's local file system (the snippet below assumes ~/jars/):
      # parameters you will need
      # <tmp_local_path>            e.g. /tmp/get_avro – any scratch path inside /tmp on the local file system
      # <file_name>                 name of the file.avro from which you want to extract the schema
      # <hdfs_file_path>            absolute path on HDFS to the target file.avro
      # <result_schema_file_path>   path on HDFS where you would like the created schema to end up
      
      # create the tmp folder
      mkdir -p <tmp_local_path>
      
      # read the first 50 KB of file.avro with cat and store it as a sample file on the local file system;
      # depending on the number of columns, 50 KB may not be enough – you can try `head --lines 1`
      # to pick only the first line, or increase the size
      hdfs dfs -cat <hdfs_file_path> | head --bytes 50K > <tmp_local_path>/<file_name>_sample
      
      # use the avro-tools jar to retrieve the schema from the sample file and store it
      # under the original file name with the .avsc extension
      java -jar ~/jars/avro-tools-1.7.7.jar getschema <tmp_local_path>/<file_name>_sample > <tmp_local_path>/<file_name>.avsc
      
      # copy the created file.avsc (Avro schema) from the local file system back to HDFS
      hdfs dfs -put <tmp_local_path>/<file_name>.avsc <result_schema_file_path>
      
      # clean up
      rm -rf <tmp_local_path>
      
      
  3. In Hive:
    CREATE EXTERNAL TABLE IF NOT EXISTS <avro_tmp_tbl_name>
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION 'hdfs://<orig_file_dir_path>/'
    TBLPROPERTIES ('avro.schema.url'='hdfs://<path_to_avro_schema_file_on_hdfs>');
    
    
    CREATE TABLE IF NOT EXISTS <parq_tmp_tbl_name>
    STORED AS PARQUET
    AS SELECT * FROM <avro_tmp_tbl_name>;
  4. In Impala: INVALIDATE METADATA <parq_tmp_tbl_name>;

That's all – the job is done. This step-by-step algorithm is easy to wrap into any pipeline tool, such as Oozie or Airflow, and thus completely automate the routine part of the work; a sketch of such a wrapper follows below.
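For illustration, here is a minimal sketch that strings steps 1–4 together in one shell script. The script name, the beeline/impala-shell connection details and the ~/jars/ jar location are my own assumptions – adapt them to your environment:

#!/bin/bash
# make_avro_table.sh – hypothetical wrapper around steps 1-4
set -eu

HDFS_FILE_PATH="$1"   # absolute HDFS path to the source file.avro
SCHEMA_DIR="$2"       # HDFS directory where the .avsc schema should land
AVRO_TBL="$3"         # name of the temporary Avro table
PARQ_TBL="$4"         # name of the resulting Parquet table

FILE_NAME=$(basename "$HDFS_FILE_PATH")
TMP_DIR=$(mktemp -d /tmp/get_avro.XXXXXX)

# step 2: extract the schema from the first 50 KB of the Avro file and push it to HDFS
hdfs dfs -cat "$HDFS_FILE_PATH" | head --bytes 50K > "$TMP_DIR/${FILE_NAME}_sample"
java -jar ~/jars/avro-tools-1.7.7.jar getschema "$TMP_DIR/${FILE_NAME}_sample" > "$TMP_DIR/${FILE_NAME}.avsc"
hdfs dfs -put -f "$TMP_DIR/${FILE_NAME}.avsc" "$SCHEMA_DIR/"
rm -rf "$TMP_DIR"

# step 3: temporary Avro table plus a Parquet copy, created in Hive via beeline
beeline -u "jdbc:hive2://localhost:10000" \
  -e "CREATE EXTERNAL TABLE IF NOT EXISTS $AVRO_TBL
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      LOCATION '$(dirname "$HDFS_FILE_PATH")/'
      TBLPROPERTIES ('avro.schema.url'='$SCHEMA_DIR/${FILE_NAME}.avsc');" \
  -e "CREATE TABLE IF NOT EXISTS $PARQ_TBL STORED AS PARQUET AS SELECT * FROM $AVRO_TBL;"

# step 4: let Impala pick up the new table
impala-shell -i localhost -q "INVALIDATE METADATA $PARQ_TBL;"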