Consider the following situation: a bundle of .avro files is stored on HDFS and needs to be converted to Impala tables. Schemas are not provided with the files, at least not externally (each .avro file carries its schema embedded in its header). But Impala has a known issue with Avro tables, and its support for them is pretty limited: we can create an Avro-based table only if all the columns are manually declared with their types in the `CREATE TABLE` statement; otherwise it will fail with an error.
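
To make the limitation concrete: creating an Avro-backed table directly in Impala works only when every column and its type is spelled out by hand, roughly like the hypothetical sketch below (the table name, columns and HDFS path are made up for illustration):

impala-shell -q "
CREATE EXTERNAL TABLE sales_avro (
  id BIGINT,
  customer STRING,
  amount DOUBLE
)
STORED AS AVRO
LOCATION 'hdfs:///data/sales_avro/';"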

But what if we have hundreds of columns, or are simply not completely sure about the schema, and would like to automate the process of table creation?

Disclaimer: you can’t do that directly, but there is a workaround: create a temporary Avro-based table in Hive, then `CREATE TABLE ... AS SELECT` a temporary Parquet table from it, and finally run `INVALIDATE METADATA` in Impala so that Impala picks up the newly created tables.

Step-by-step algorithm:

1. Check if you have a `file.avsc` along with `file.avro`; if yes, skip step #2 (a quick existence check is sketched below)
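
A quick way to do this check from the shell (the <hdfs_file_dir> placeholder is hypothetical and stands for the HDFS directory that contains file.avro; it is not one of the parameters defined in step #2):

# exit code 0 means the schema already exists next to the data, so step #2 can be skipped
hdfs dfs -test -e <hdfs_file_dir>/<file_name>.avsc && echo "found .avsc, skipping schema extraction"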

2. Create an external avro schema file (.avsc) from file.avro:

  • Download avro-tools-1.7.7.jar – the official command-line tool for working with avro files. Pay attention to the version – it should be 1.7.7.
  • Place it somewhere on your server’s local file system.
  • Required input parameters:
    • <tmp_local_path> – any random path inside the /tmp folder of the local file system, e.g. /tmp/get_avro
    • <file_name> – the name of the file.avro from which you want to extract the schema
    • <hdfs_file_path> – the absolute HDFS path to the target file.avro
    • <result_schema_file_path> – the HDFS path where the created schema file should be placed
  • In the terminal, type:
# create tmp folder 
mkdir -p <tmp_local_path> 

# read the first 50 KB of file.avro with cat and store it as a sample file on the local file system
# depending on the number of columns, 50 KB may not be enough: you can increase the size, or try --lines 1 to pick only the first line
hdfs dfs -cat <hdfs_file_path> | head --bytes 50K > <tmp_local_path>/<file_name>_sample 

# use the avro-tools jar to extract the schema from the sample file and save it under the name of the original file.avro with the .avsc extension
java -jar ~/jars/avro-tools-1.7.7.jar getschema <tmp_local_path>/<file_name>_sample > <tmp_local_path>/<file_name>.avsc 

# copy the created file.avsc (avro schema) from the local file system back to hdfs
hdfs dfs -put <tmp_local_path>/<file_name>.avsc <result_schema_file_path> 

# clean up mess 
rm -rf <tmp_local_path>
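
Before relying on the uploaded schema, a quick sanity check does not hurt: the .avsc is plain JSON, so its beginning should show a readable record definition with the expected field names (this assumes <result_schema_file_path> points at the schema file itself rather than at a directory):

# optional sanity check: the uploaded schema should start with a readable JSON record definition
hdfs dfs -cat <result_schema_file_path> | head --bytes 1K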

3. In Hive, run the following queries:

CREATE EXTERNAL TABLE IF NOT EXISTS <avro_tmp_tbl_name> 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' 
LOCATION 'hdfs://<orig_file_dir_path>/' TBLPROPERTIES ('avro.schema.url'='hdfs://<path_to_avro_schema_file_on_hdfs>'); 

CREATE TABLE IF NOT EXISTS <parq_tmp_tbl_name> STORED AS PARQUET AS SELECT * FROM <avro_tmp_tbl_name>;
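
Since the goal is automation, these two statements do not have to be typed into an interactive Hive session; one simple option is to save them into a file and run it with the Hive CLI (on clusters where the Hive CLI is deprecated, beeline with your environment's JDBC URL plays the same role):

# save the two CREATE statements above into create_tables.hql, then run them non-interactively
hive -f create_tables.hql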

4. In Impala, run: INVALIDATE METADATA <parq_tmp_tbl_name>; (the same command can be scripted, as sketched below)
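
This step can also be scripted, since impala-shell accepts a statement on the command line (the <impalad_host> placeholder is hypothetical and stands for any node running the impalad daemon):

# make the freshly created parquet table visible to Impala
impala-shell -i <impalad_host> -q "INVALIDATE METADATA <parq_tmp_tbl_name>;"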

That’s all – the job is done. This step-by-step algorithm is easy to wrap into any pipeline tool such as Oozie or Airflow, and thus completely automates the routine part of the work; the whole flow is sketched as a single shell script below.
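
The sketch below ties the steps together; it assumes hive and impala-shell are on the PATH and the avro-tools jar lives under ~/jars/, all paths, hosts and table names are illustrative, and error handling is reduced to stopping on the first failure:

#!/usr/bin/env bash
# Minimal end-to-end sketch of the steps above; all values below are placeholders.
set -eu

HDFS_FILE_PATH="/data/in/file.avro"      # absolute HDFS path to the source .avro
SCHEMA_HDFS_PATH="/data/in/file.avsc"    # where the extracted schema will live on HDFS
ORIG_FILE_DIR="/data/in"                 # HDFS directory that holds the .avro files
AVRO_TBL="tmp_avro_tbl"
PARQ_TBL="tmp_parq_tbl"
IMPALAD_HOST="impalad.example.com"
TMP_DIR="$(mktemp -d /tmp/get_avro.XXXX)"

# step 2: extract the embedded schema from a 50 KB sample of the file
hdfs dfs -cat "$HDFS_FILE_PATH" | head --bytes 50K > "$TMP_DIR/sample"
java -jar ~/jars/avro-tools-1.7.7.jar getschema "$TMP_DIR/sample" > "$TMP_DIR/schema.avsc"
hdfs dfs -put -f "$TMP_DIR/schema.avsc" "$SCHEMA_HDFS_PATH"
rm -rf "$TMP_DIR"

# step 3: avro-backed external table in Hive, then a parquet copy of it
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS $AVRO_TBL
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://$ORIG_FILE_DIR/'
TBLPROPERTIES ('avro.schema.url'='hdfs://$SCHEMA_HDFS_PATH');
CREATE TABLE IF NOT EXISTS $PARQ_TBL STORED AS PARQUET AS SELECT * FROM $AVRO_TBL;
"

# step 4: let Impala see the new parquet table
impala-shell -i "$IMPALAD_HOST" -q "INVALIDATE METADATA $PARQ_TBL;"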
