Impala, Hive and Spark Parquet File Format Size

I have a few doubts about Parquet compression across Impala, Hive and Spark. Here is the situation:

  1. The table is a Hive table and the data was inserted using Impala. The data files are named like "data.0.parq". Table size:
     59.0 M  177.1 M  /user/hive/warehouse/database.db/tablename (Parquet, created in Impala)
  2. The same table was created in Hive as tablename_snappy, with Snappy compression set through TBLPROPERTIES ("parquet.compression"="SNAPPY"). The data was inserted in Hive from tablename (step 1).
     2a) Why is this table larger?
     2b) The file name is 000000_0. Is this expected?
     64.6 M  193.7 M  /user/hive/warehouse/database.db/tablename_parq (Parquet + Snappy compression, created in Hive)
  3. In Spark, I read tablename from step 1 and called saveAsTable. The size went down as expected, and the file name is ****.snappy.parquet.
     39.0 M  117.1 M  /user/hive/warehouse/atabase.db/tablename_spark (Parquet + Snappy compression, created in Spark)
  4. The same table was created in Impala with STORED AS PARQUET after running SET COMPRESSION_CODEC=snappy; Nothing changed, although I expected the table size to shrink since I applied Snappy compression.
     59.0 M  177.1 M  /user/hive/warehouse/database.db/tablename (Parquet, created in Impala)
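
To make the four size listings easier to compare, here is a minimal Python sketch that puts them side by side. It assumes the two numbers in each listing are the raw size and the replicated on-disk size in MB, as a command like hdfs dfs -du -h would print them; the post does not name the command, so that reading is an assumption. The labels and the compare helper are hypothetical names used only for illustration.

```python
# Sizes in MB copied from steps 1-4 above: (first column, second column).
# Assumption: first column = raw size, second = size including replication.
sizes_mb = {
    "impala (step 1)": (59.0, 177.1),
    "hive snappy (step 2)": (64.6, 193.7),
    "spark snappy (step 3)": (39.0, 117.1),
    "impala snappy (step 4)": (59.0, 177.1),
}


def compare(sizes, baseline_name):
    """Return implied replication factor and % size change vs. the baseline."""
    baseline = sizes[baseline_name][0]
    report = {}
    for name, (raw, replicated) in sizes.items():
        report[name] = {
            # Ratio of the two columns, rounded to one decimal place.
            "replication": round(replicated / raw, 1),
            # Positive means larger than the baseline table.
            "pct_vs_baseline": round((raw - baseline) / baseline * 100, 1),
        }
    return report


report = compare(sizes_mb, "impala (step 1)")
for name, row in report.items():
    print(f"{name:24s} x{row['replication']}  {row['pct_vs_baseline']:+.1f}%")
```

Under that reading, the second column is about three times the first in every listing, the Hive table is roughly 9.5% larger than the Impala one, the Spark table is roughly 33.9% smaller, and step 4 is identical to step 1.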

Please help me understand how Parquet compression works in Impala and Hive.

Content Attribution

This content was originally published by Murali Krishna at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.