Hadoop 2.4 Snappy Support


Our Hadoop 2.4 cluster does not support Snappy compression by default, but recently a business team told us that some of their data is Snappy-compressed (the data was handed to them by another cluster in Snappy-compressed form) and they want to migrate it to our cluster for processing. Running a job against it directly fails with:

Failed with exception java.io.IOException: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
The error message shows that the Snappy native library is not available, and it also suggests that libhadoop has to be compiled specifically with Snappy support. This is different from Hadoop 1.0, where you only had to copy Snappy's native library files into the designated directory, without recompiling the libhadoop native library.

Because the Snappy compression algorithm's compression ratio is not very high (even though it has a slight edge in decompression speed), our cluster does not support Snappy by default; our cluster's data standard is RCFile+gzip. Here is a comparison of the advantages and disadvantages of several compression formats in Hadoop:

Reference: http://www.linuxidc.com/Linux/2014-05/101230.htm

Hadoop currently uses four common compression formats: lzo, gzip, snappy, and bzip2. Based on practical experience, this section introduces the advantages, disadvantages, and application scenarios of these four formats, so that you can choose among them according to your actual situation.

1. gzip compression

Advantages: high compression ratio, with fairly fast compression/decompression speed; Hadoop supports it natively, so gzip files are processed exactly like plain text; there is a Hadoop native library; most Linux systems ship with the gzip command, so it is easy to use.

Cons: splitting is not supported.

Application scenario: when each compressed file stays within about 130 MB (roughly one block), gzip is worth considering. For example, one day's or one hour's logs can be compressed into a single gzip file, and the MapReduce job then runs over multiple gzip files to achieve parallelism. Hive jobs, streaming jobs, and Java MapReduce programs handle such files exactly as they handle plain text, so existing programs need no changes after compression.
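As a minimal illustration of that workflow (the file names and HDFS paths are made up for the example, not taken from this article), one day's log can be gzipped and uploaded as a single file:

# Compress one day's log into a single gzip file and upload it to HDFS
gzip access_20140620.log                                   # produces access_20140620.log.gz
hadoop fs -mkdir -p /logs/2014-06-20
hadoop fs -put access_20140620.log.gz /logs/2014-06-20/
# MapReduce or Hive jobs can then read the whole /logs directory;
# each .gz file is handled by one map task, so many files give parallelism.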

2. LZO compression

Advantages: relatively fast compression/decompression speed with a reasonable compression ratio; splitting is supported, which makes it the most popular compression format in Hadoop; the Hadoop native library supports it; the lzop command can be installed on Linux systems and is easy to use.

Cons: the compression ratio is lower than gzip; Hadoop does not support it out of the box, so it has to be installed; LZO files need some special handling in applications (an index must be built to support splitting, and the InputFormat must be set to the LZO format) — see the indexing sketch after the application-scenario paragraph below.

Application scenario: LZO is worth considering for very large text files that are still larger than 200 MB after compression; the larger a single file is, the more pronounced LZO's advantage becomes.
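A rough sketch of the special handling mentioned above, assuming the hadoop-lzo package is installed (the jar location and file paths are placeholders, and the indexer and input-format class names come from the common hadoop-lzo distribution, not from this article):

# Build an index next to the .lzo file so MapReduce can split it
hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /logs/big_file.lzo
# Jobs must then use the LZO-aware input format (e.g. com.hadoop.mapreduce.LzoTextInputFormat)
# instead of the default TextInputFormat, otherwise the index is ignored.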

3. Snappy compression

Advantages: fast compression speed with a reasonable compression ratio; the Hadoop native library supports it.

Cons: splitting is not supported; the compression ratio is lower than gzip; Hadoop itself does not support it out of the box and it has to be installed; there is no corresponding command on Linux systems.

Application scenario: when the map output of a MapReduce job is large, Snappy works well as the compression format for the intermediate data passed from map to reduce, or as the output of one MapReduce job that is used as the input of another MapReduce job.

4. Bzip2 compression

Strengths: splitting is supported and the compression ratio is very high, higher than gzip; Hadoop itself supports it (although not through the native library); the bzip2 command on Linux systems makes it easy to use.

Disadvantages: compression/decompression speed is slow; the native library does not support it.

Application scenario: suitable when speed requirements are not high but a high compression ratio is needed. It can be used as the output format of a MapReduce job (as sketched below); or when the output data is relatively large and the processed data needs to be compressed and archived to save disk space while being used rarely afterwards; or when a single very large text file needs to be compressed to save storage space while still supporting splitting and staying compatible with existing applications (that is, applications that do not need to change).
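A hedged sketch of compressing a job's final output with bzip2 (the jar, driver class, and paths are placeholders; the -D form assumes the job parses generic options via ToolRunner/GenericOptionsParser):

hadoop jar my-job.jar MyDriver \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
  /input /output
# The same two properties can instead be set in mapred-site.xml to apply to every job.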

Finally, a table compares the features (pros and cons) of the above four compression formats:

Comparison of the features of the 4 compression formats:

Format | Splittable | Native library | Compression ratio | Speed | Shipped with Hadoop | Linux command | Changes to existing applications after switching
gzip   | No  | Yes | Very high       | Fairly fast | Yes, usable directly  | Yes | None, processed the same as plain text
lzo    | Yes | Yes | Relatively high | Very fast   | No, must be installed | Yes | Needs an index, and the input format must be specified
snappy | No  | Yes | Relatively high | Very fast   | No, must be installed | No  | None, processed the same as plain text
bzip2  | Yes | No  | Highest         | Slow        | Yes, usable directly  | Yes | None, processed the same as plain text

Note: the comparison above assumes ordinary text is being compressed when judging whether splitting is supported. Formats such as RCFile and SequenceFile support splitting by themselves, so they still support splitting after compression.

Files in RCFile format support column-based storage as well as splitting, and gzip has a relatively high compression ratio with fairly fast compression/decompression speed, so an RCFile file compressed with gzip remains splittable while also providing a very good compression/decompression speed and compression ratio.

That was a lengthy digression; now to the main topic: how to make the Hadoop client support Snappy compression without replacing the cluster's native library files and without restarting any Hadoop process:

1. Compile the Snappy native library; the compiled Snappy native library files were placed under /data0/liangjun/snappy/ (a build sketch follows the reference link below).

Reference: http://www.tuicool.com/articles/yiiiY3R
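For completeness, a minimal sketch of building Snappy from a source tarball, assuming an autotools-based Snappy release of that era (the exact version number here is an assumption, not stated in the article):

# Unpack a snappy source release and install it into the prefix used above
tar -zxvf snappy-1.1.1.tar.gz
cd snappy-1.1.1
./configure --prefix=/data0/liangjun/snappy/
make && make install
# /data0/liangjun/snappy/lib/ should now contain libsnappy.so.1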

2. Recompile the libhadoop.so file, specifying the Snappy native library location at build time with -Dsnappy.prefix:

mvn clean package -Pdist -Dtar -Pnative -Dsnappy.prefix=/data0/liangjun/snappy/ -DskipTests

Note: I have tested that a libhadoop.so built with -Drequire.snappy also works:

mvn clean package -Pdist,native -DskipTests -Drequire.snappy
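As an optional sanity check on the build output (this assumes you temporarily place the two rebuilt files in a client's lib/native directory, and that your Hadoop 2.x client provides the checknative subcommand):

hadoop checknative -a
# The output should report something like:
#   snappy:  true /path/to/libsnappy.so.1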
3. After the above two steps, you end up needing only two files, libhadoop.so and libsnappy.so.1 (only these two files are required; the rest were filtered out in my tests). Below are examples of MapReduce and Hive using Snappy compression:

(1) MapReduce: add the compiled native libraries to the DistributedCache to enable Snappy:

Add the following two configuration items to the client's mapred-site.xml in the test environment, so that map-side output data is compressed with Snappy:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
  <final>true</final>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <final>true</final>
</property>

Upload libhadoop.so and libsnappy.so.1 to the designated HDFS directory /test/snappy/ and pass them to the job with -files:

hadoop jar hadoop-mapreduce-examples-2.4.0.jar wordcount -files hdfs://ns1/test/snappy/libhadoop.so,hdfs://ns1/test/snappy/libsnappy.so.1 /test/missdisk/ /test/wordcount
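If you would rather not touch mapred-site.xml, the same two properties can be passed per job as generic options; a sketch only (it assumes the job parses -D generic options, which the bundled examples do):

hadoop jar hadoop-mapreduce-examples-2.4.0.jar wordcount \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -files hdfs://ns1/test/snappy/libhadoop.so,hdfs://ns1/test/snappy/libsnappy.so.1 \
  /test/missdisk/ /test/wordcount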

(2) Hive: specify the files with add file:

hive> add file libhadoop.so;
hive> add file libsnappy.so.1;
hive> select count(*) from ct_tmp_objrec;
The data in table ct_tmp_objrec is text-file data compressed with Snappy; the storage format of ct_tmp_objrec is ordinary text.

After executing the HQL, the Snappy-format data was processed correctly, but the 200+ MB file could only be handled by a single map task, i.e. splitting is not supported.

==========================================================

The following section tests whether RCFile+Snappy data supports splitting:

1. Create a table snappy_test whose columns are exactly the same as ct_tmp_objrec, changing only the Hive storage format to RCFile:

CREATE EXTERNAL TABLE `snappy_test` (
  `from_id` string,
  `to_id` string,
  `mention_type` bigint,
  `repost_flag` bigint,
  `weight` double,
  `confidence` double,
  `from_uid` string,
  `to_object_label` string,
  `count` bigint,
  `last_modified` string,
  `time` string,
  `mblog_spam` bigint,
  `mblog_simhash` string,
  `mblog_dupnum` bigint,
  `mblog_attribute` bigint,
  `user_quality` bigint,
  `user_type` bigint,
  `old_weight` double,
  `obj_pos` bigint,
  `quality` bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION 'hdfs://ns1/user/liangjun/warehouse/tables/snappy_test'

2. Convert the plain-text+Snappy compressed data in ct_tmp_objrec into RCFile+Snappy compressed data in snappy_test:

hive> add file libhadoop.so;
hive> add file libsnappy.so.1;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> INSERT OVERWRITE TABLE snappy_test
      SELECT from_id, to_id, mention_type, repost_flag, weight, confidence, from_uid, to_object_label,
             count, last_modified, time, mblog_spam, mblog_simhash, mblog_dupnum, mblog_attribute,
             user_quality, user_type, old_weight, obj_pos, quality
      FROM ct_tmp_objrec;

3. Query the RCFile+Snappy data in snappy_test to see whether it is split:

hive> add file libhadoop.so;
hive> add file libsnappy.so.1;
hive> select count(*) from snappy_test;
After executing the HQL, the RCFile+Snappy data was processed normally, and the 200+ MB file was split across two map tasks at the same time. Test finished.

References:

http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/

http://blog.csdn.net/czw698/article/details/38387657

http://www.linuxidc.com/Linux/2014-05/101230.htm

http://book.2cto.com/201305/21922.html

http://blog.csdn.net/czw698/article/details/38398399

http://www.tuicool.com/articles/yiiiY3R







