https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
LanguageManual LZO
- Created by Lefty Leverenz, last modified on Sep
LZO Compression
- LZO Compression
- General LZO Concepts
- Prerequisites
- LZO/LZOP Installations
- core-site.xml
- Table Definition
- Hive Queries
- Option 1: Directly Create LZO Files
- Option 2: Write Custom Java to Create LZO Files
General LZO Concepts
LZO is a lossless data compression library that favors speed over compression ratio. See http://www.oberhumer.com/opensource/lzo and http://www.lzop.org for general information about LZO, and see Compressed Data Storage for information about compression in Hive.
Imagine a simple data file that has three columns: an ID, a first name, and a last name.
Let's populate a data file containing 4 records:
19630001 John Lennon
19630002 Paul McCartney
19630003 George Harrison
19630004 Ringo Starr
Let's call the data file /path/to/dir/names.txt.
To turn it into an LZO file, we can use the lzop utility, which creates a names.txt.lzo file.
Now copy the file names.txt.lzo to HDFS.
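The three steps above can be sketched as a short shell script. This is only an illustration under stated assumptions: the columns are written tab-separated (to match a table delimited by '\t'), the file is created in a temporary directory rather than /path/to/dir, and HDFS_DIR is a placeholder target directory that does not come from this page. The lzop and hdfs steps are skipped when the tools are not installed.

```shell
#!/bin/sh
# Sketch: build names.txt, compress it with lzop, copy the result to HDFS.
# HDFS_DIR is a hypothetical placeholder; replace it with a real HDFS path.
HDFS_DIR="/path/to/hdfs/dir"

workdir=$(mktemp -d)
data_file="$workdir/names.txt"

# Write the four records as tab-separated columns (assumed delimiter).
# printf reuses the format string for every group of three arguments.
printf '%s\t%s\t%s\n' \
  19630001 John Lennon \
  19630002 Paul McCartney \
  19630003 George Harrison \
  19630004 Ringo Starr > "$data_file"

# lzop writes names.txt.lzo next to the input and keeps the original file.
if command -v lzop >/dev/null 2>&1; then
  lzop "$data_file"
else
  echo "lzop not found; skipping compression" >&2
fi

# Copy the compressed file to HDFS when it exists and the client is installed.
if [ -f "$data_file.lzo" ] && command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -put "$data_file.lzo" "$HDFS_DIR/"
fi
```

Guarding each external tool with command -v keeps the script usable on a workstation that has no Hadoop client, where it simply produces the uncompressed data file.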
Prerequisites
LZO/LZOP Installations
lzo and lzop need to be installed on every node in the Hadoop cluster. The details of these installations are beyond the scope of this document.
core-site.xml
Add the following to your core-site.xml:
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
For example:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Next we run the command to create an LZO index file:
hadoop jar /path/to/jar/hadoop-lzo-cdh4-0.4.15-gplextras.jar com.hadoop.compression.lzo.LzoIndexer /path/to/HDFS/dir/containing/lzo/files
This creates the index file (names.txt.lzo.index) next to names.txt.lzo on HDFS.
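As a rough way to confirm that indexing worked, the sketch below runs the indexer and then checks for the index file. It assumes the indexer names the index by appending .index to the .lzo file name, which matches the .lzo.index naming used elsewhere on this page. The variable names and the index_path_for helper are hypothetical, and the paths are the placeholders from the command above.

```shell
#!/bin/sh
# Placeholders copied from the indexer command; adjust both for your cluster.
HADOOP_LZO_JAR="/path/to/jar/hadoop-lzo-cdh4-0.4.15-gplextras.jar"
LZO_DIR="/path/to/HDFS/dir/containing/lzo/files"

# Hypothetical helper: name of the index file for a given .lzo file,
# assuming the index is the .lzo name with ".index" appended.
index_path_for() {
  printf '%s.index\n' "$1"
}

lzo_file="$LZO_DIR/names.txt.lzo"
index_file=$(index_path_for "$lzo_file")

if command -v hadoop >/dev/null 2>&1 && command -v hdfs >/dev/null 2>&1; then
  hadoop jar "$HADOOP_LZO_JAR" com.hadoop.compression.lzo.LzoIndexer "$LZO_DIR"
  # hdfs dfs -test -e exits with status 0 when the path exists.
  if hdfs dfs -test -e "$index_file"; then
    echo "index present: $index_file"
  else
    echo "index missing: $index_file" >&2
  fi
fi
```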
Table Definition
The following hive -e command creates an LZO-compressed external table:
hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS hive_table_name (column_1 datatype_1......column_n datatype_n)
PARTITIONED BY (partition_col_1 datatype_1 .... col_p datatype_p)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT \"com.hadoop.mapred.DeprecatedLzoTextInputFormat\"
OUTPUTFORMAT \"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\";"
Note: The double quotes have to be escaped so that the hive -e command works correctly.
See CREATE TABLE and Hive CLI for information about command syntax.
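One way to sidestep the quote escaping is to put the statement in a file and run it with hive -f. The sketch below does that for the names.txt example. The table name, the column names and types, and the LOCATION path are all hypothetical, and the PARTITIONED BY clause from the template is left out for brevity.

```shell
#!/bin/sh
# Sketch: write the DDL to a file so the double quotes need no escaping.
# Table name, columns, and LOCATION are hypothetical placeholders.
ddl_file=$(mktemp)

# The quoted heredoc delimiter ('EOF') stops the shell from touching
# the backslash in '\t' or the double quotes around the class names.
cat > "$ddl_file" <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS names (
  id INT,
  first_name STRING,
  last_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
          OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/path/to/hdfs/dir';
EOF

# Run the script only when the Hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f "$ddl_file"
else
  echo "hive not found; DDL left in $ddl_file" >&2
fi
```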
Hive Queries
Option 1: Directly Create LZO Files
- Directly create LZO files as the output of the Hive query.
- Use the lzop command utility or your custom Java to generate .lzo.index files for the .lzo files.
Hive Query Parameters
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec
SET hive.exec.compress.output=true
SET mapreduce.output.fileoutputformat.compress=true
For example:
hive -e "SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec; SET hive.exec.compress.output=true; SET mapreduce.output.fileoutputformat.compress=true; <query-string>"
Note: If the data sets are large or the number of output files is large, then this option does not work.
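To make the <query-string> placeholder concrete, the following sketch wraps the three settings in a small helper and applies them to a sample query. The helper name with_lzo_output, the table name names, and the output directory are hypothetical. When the Hive CLI is not installed, the script only prints the statement it would run.

```shell
#!/bin/sh
# The three compression settings from this section, one per line.
lzo_settings='SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;'

# Hypothetical helper: prepend the compression settings to a query string.
with_lzo_output() {
  printf '%s\n%s\n' "$lzo_settings" "$1"
}

# Hypothetical query: table name and output directory are placeholders.
query="INSERT OVERWRITE DIRECTORY '/path/to/hdfs/output' SELECT * FROM names;"
full_query=$(with_lzo_output "$query")

# Run the query if the Hive CLI is available; otherwise just show it.
if command -v hive >/dev/null 2>&1; then
  hive -e "$full_query"
else
  printf '%s\n' "$full_query"
fi
```

Because the whole statement is passed as one shell variable, nothing inside it needs extra escaping on the command line.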
Option 2: Write Custom Java to Create LZO Files
- Create text files as the output of the Hive query.
- Write custom Java code to
  - convert Hive query generated text files to .lzo files
  - generate .lzo.index files for the .lzo files generated above
Hive Query Parameters
Prefix the query string with these parameters:
SET hive.exec.compress.output=false
SET mapreduce.output.fileoutputformat.compress=false
For example:
hive -e "SET hive.exec.compress.output=false; SET mapreduce.output.fileoutputformat.compress=false; <query-string>"
Write Custom Java to Create LZO Files