
How to create an Avro-based table in Impala

Consider the following situation: a bundle of .avro files is stored on HDFS and needs to be exposed as Impala tables. Schemas are not provided with the files, at least not externally (each schema is embedded at the beginning of its avro file). But Impala has a known issue with Avro tables, and their usage is pretty limited: we can create an Avro-based table only if all the columns are manually declared with their types in the CREATE TABLE statement, otherwise it fails with an error.
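
For illustration, a declaration with every column listed explicitly is what Impala accepts in this case. A minimal sketch, assuming hypothetical columns id, name and amount and a hypothetical HDFS path (none of these names come from the original case):

    # run the DDL through impala-shell; every column has to be listed with its type
    impala-shell -q "
      CREATE EXTERNAL TABLE my_avro_tbl (id BIGINT, name STRING, amount DOUBLE)
      STORED AS AVRO
      LOCATION 'hdfs:///data/my_avro_files/';
    "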

But what if we have hundreds of columns, or are just not completely sure of the schema, and would like to automate the process of table creation?

Disclaimer: you can’t do that directly, but there is a workaround: create a temporary Avro-based table in Hive, then create a temporary Parquet table from it with CREATE TABLE ... AS SELECT, and finally run INVALIDATE METADATA in Impala so that it picks up the changes in the table set.

Step-by-step algorithm:

  1. Check if you have a file.avsc (Avro schema) next to file.avro; if yes, skip step #2 (a one-line check for this is sketched right after this list)
  2. Create an external Avro schema file from file.avro
    • Download avro-tools-1.7.7.jar – the official tool for working with Avro files. Pay attention to the version – it should be 1.7.7.
    • Place it somewhere on your server’s local file system (the commands below assume ~/jars/)
      # parameters you will need
      # <tmp_local_path>=/tmp/get_avro  any scratch path inside the /tmp folder of the local file system
      # <file_name>  name of the file.avro from which you want to extract the schema
      # <hdfs_file_path>  absolute path on HDFS to the target file.avro
      # <result_schema_file_path>  path on HDFS where the created schema should be placed
      
      # create the tmp folder
      mkdir -p <tmp_local_path>
      
      # read the first 50KB of file.avro and store it as <file_name>_sample on the local file system;
      # depending on the number of columns, 50KB may not be enough - you can try `head --lines 1` to pick only the first line, or increase the size
      hdfs dfs -cat <hdfs_file_path> | head --bytes 50K > <tmp_local_path>/<file_name>_sample
      
      # use the avro-tools jar to retrieve the schema from the sample file and store it under the name of the original file.avro with extension .avsc
      java -jar ~/jars/avro-tools-1.7.7.jar getschema <tmp_local_path>/<file_name>_sample > <tmp_local_path>/<file_name>.avsc
      
      # copy the created file.avsc (Avro schema) from the local file system back to HDFS
      hdfs dfs -put <tmp_local_path>/<file_name>.avsc <result_schema_file_path>
      
      # clean up
      rm -rf <tmp_local_path>
      
      
  3. In Hive:
    CREATE EXTERNAL TABLE IF NOT EXISTS <avro_tmp_tbl_name>
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION 'hdfs://<orig_file_dir_path>/'
    TBLPROPERTIES ('avro.schema.url'='hdfs://<path_to_avro_schema_file_on_hdfs>');
    
    
    CREATE TABLE IF NOT EXISTS <parq_tmp_tbl_name>
    STORED AS PARQUET
    AS SELECT * FROM <avro_tmp_tbl_name>;
  4. In Impala: INVALIDATE METADATA <parq_tmp_tbl_name>;
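
For step #1, a quick way to test whether the schema file is already there (the angle-bracket names are the same kind of placeholders as above, <hdfs_file_dir> meaning the HDFS folder that holds file.avro):

    # exit code 0 means file.avsc is already on HDFS, so step #2 can be skipped
    hdfs dfs -test -e <hdfs_file_dir>/<file_name>.avsc && echo "schema already exists"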

That’s all – the job is done. This step-by-step algorithm is easy to wrap into any pipeline tool such as Oozie or Airflow and thus completely automate this routine part of the work; a minimal wrapper sketch follows below.
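
As a sketch of such automation (table names, HDFS paths and the jar location below are assumptions for illustration, not values from the original case), the whole flow can be collected into one shell script and scheduled from Oozie, Airflow or cron:

    #!/usr/bin/env bash
    set -eu
    
    # placeholders - adjust to your environment
    HDFS_FILE=/data/incoming/file.avro     # source .avro file on HDFS
    SCHEMA_PATH=/data/schemas/file.avsc    # where the extracted schema will live on HDFS
    TMP=/tmp/get_avro_$$                   # scratch folder on the local file system
    AVRO_TBL=avro_tmp_tbl
    PARQ_TBL=parq_tmp_tbl
    
    mkdir -p "$TMP"
    
    # step 2: extract the schema from the first ~50KB of the avro file
    hdfs dfs -cat "$HDFS_FILE" | head --bytes 50K > "$TMP/sample"
    java -jar ~/jars/avro-tools-1.7.7.jar getschema "$TMP/sample" > "$TMP/schema.avsc"
    hdfs dfs -put -f "$TMP/schema.avsc" "$SCHEMA_PATH"
    
    # step 3: create the avro table and its parquet copy in Hive
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS $AVRO_TBL
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      LOCATION 'hdfs://$(dirname "$HDFS_FILE")/'
      TBLPROPERTIES ('avro.schema.url'='hdfs://$SCHEMA_PATH');
      CREATE TABLE IF NOT EXISTS $PARQ_TBL STORED AS PARQUET AS SELECT * FROM $AVRO_TBL;
    "
    
    # step 4: make the new parquet table visible to Impala
    impala-shell -q "INVALIDATE METADATA $PARQ_TBL;"
    
    # clean up
    rm -rf "$TMP"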