Hadoop 2.4 Support for Snappy


Our Hadoop 2.4 cluster does not support Snappy compression by default. Recently, however, a business team reported that some of their data is Snappy-compressed (it was produced on another cluster and delivered to them in Snappy format) and they want to migrate it to our cluster for processing. Running it directly fails with:

Failed with exception java.io.IOException: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support
According to the error message, the Snappy native library is not available, and it appears that libhadoop must be compiled specifically with Snappy support. This is different from Hadoop 1.0, where you only had to copy the Snappy native library files into the designated directory; recompiling the libhadoop native library was not required.

Because Snappy's compression ratio is not very high (although it has some advantage in compression/decompression speed), our cluster does not enable Snappy by default; our standard for cluster data is RCFile+gzip. Below is a comparison of the advantages and disadvantages of several compression formats in Hadoop:

Reference: http://www.linuxidc.com/Linux/2014-05/101230.htm

Four compression formats are widely used in Hadoop: LZO, gzip, Snappy, and bzip2. Based on practical experience, the author introduces the pros, cons, and application scenarios of each, so that you can choose the appropriate format for your own situation.

1. Gzip compression

Advantages: high compression ratio and relatively fast compression/decompression; Hadoop supports it natively, so gzip files are processed exactly like plain text; Hadoop native library support is available; most Linux systems ship with a gzip command, so it is easy to use.

Disadvantages: split is not supported.

Application scenario: consider gzip when each compressed file stays within about 130 MB (roughly one block). For example, compress one day's or one hour's logs into a single gzip file, and let the MapReduce program achieve concurrency by processing multiple gzip files. Hive programs, streaming programs, and MapReduce programs written in Java handle these files exactly as they handle plain text, so existing programs do not need to be modified after compression.
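For example, one day's log can be compressed with the standard gzip command and uploaded to HDFS as-is; this is only an illustrative sketch with placeholder file and directory names:

gzip access_log.2014-05-01
hadoop fs -put access_log.2014-05-01.gz /data/logs/

A MapReduce or Hive job pointed at /data/logs/ then runs one map task per gzip file, which is how concurrency is achieved even though gzip itself is not splittable.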

2. LZO compression

Advantages: relatively fast compression/decompression and a reasonable compression ratio; split is supported, which makes it the most popular compression format in Hadoop; Hadoop native library support is available; the lzop command can be installed on Linux, making it easy to use.

Disadvantages: the compression ratio is lower than gzip; Hadoop does not support it out of the box, so it must be installed; LZO files need some special handling in applications (an index must be built to support split, and the InputFormat must be set to the LZO format; see the example below).

Application scenario: suitable for large text files that are still larger than 200 MB after compression; the larger a single file is, the more pronounced LZO's advantage becomes.
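The indexing step mentioned above is normally done with the indexer that ships in the hadoop-lzo package; the jar location and file path below are assumptions and depend on where hadoop-lzo is installed:

hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /data/logs/big_file.lzo

The indexer writes a big_file.lzo.index file next to the data, which is what allows MapReduce to split the LZO file.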

3. Snappy compression

Advantages: fast compression speed and a reasonable compression ratio; Hadoop native library support is available.

Disadvantages: split is not supported; the compression ratio is lower than gzip; Hadoop does not support it out of the box, so it must be installed; there is no corresponding command on Linux systems.

Application scenario: when a MapReduce job produces large map output, Snappy is a good compression format for the intermediate data passed from map to reduce; it is also suitable for the output of one MapReduce job that serves as the input of another.
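On a cluster whose libhadoop already includes Snappy support, map-output compression can be switched on per job with generic -D options; this is only an illustrative sketch (the example jar, input, and output paths are placeholders):

hadoop jar hadoop-mapreduce-examples-2.4.0.jar wordcount -D mapreduce.map.output.compress=true -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /input /output

The equivalent client-side mapred-site.xml settings are shown later in this article.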

4. Bzip2 compression

Advantages: split is supported; the compression ratio is very high, higher than gzip; Hadoop supports it out of the box (although there is no native library support); Linux systems ship with a bzip2 command, so it is easy to use.

Disadvantages: compression/decompression is slow; there is no native library support.

Application scenario: suitable when speed is not critical but a high compression ratio is needed, for example as the output format of a MapReduce job; or when large output data must be compressed and archived to save disk space and will rarely be used afterwards; or when a single large text file must be compressed to save storage while remaining splittable and compatible with existing applications (that is, applications that do not need to be modified).
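For example, if bzip2 is chosen as the job output format, a Hive session only needs the output-compression settings below; this is an illustrative sketch (the output directory and table name are placeholders):

hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
hive> insert overwrite directory '/archive/demo' select * from some_table;

Because bzip2 output is splittable, later jobs can still process these files with multiple map tasks.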

Finally, the following table compares the features (pros and cons) of the four compression formats:

Comparison of the four compression formats:

Format | Splittable | Native library | Compression ratio | Speed | Bundled with Hadoop | Linux command | Changes to existing applications after switching
gzip | No | Yes | Very high | Relatively fast | Yes, usable directly | Yes | None; processed like plain text
LZO | Yes | Yes | Fairly high | Fast | No, must be installed | Yes | An index must be built and the input format specified
Snappy | No | Yes | Fairly high | Fast | No, must be installed | No | None; processed like plain text
bzip2 | Yes | No | Highest | Slow | Yes, usable directly | Yes | None; processed like plain text

Note: the splittability described above applies when compressing plain text files. Container formats such as RCFile and SequenceFile are splittable by themselves, so they remain splittable after compression.

In conclusion, our Hadoop 2.4 cluster's RCFile+gzip requirement is reasonable: RCFile supports columnar storage and is splittable, while gzip offers a high compression ratio and relatively fast compression/decompression. RCFile files compressed with gzip therefore remain splittable while still delivering a good compression ratio and speed.
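As an illustration of that standard, an RCFile table can be populated with gzip-compressed data entirely from Hive; this is only a sketch, and demo_rc and src_table are hypothetical names:

hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> create table demo_rc (id string, val string) stored as rcfile;
hive> insert overwrite table demo_rc select id, val from src_table;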

After this lengthy digression, back to the topic: how to support Snappy compression purely from the Hadoop client, without replacing the cluster's native library files and without restarting any Hadoop processes:

1. Compile the Snappy native library; in this case it was installed to /data0/liangjun/snappy/.

Reference: http://www.tuicool.com/articles/yiiiY3R
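For reference, a typical build of the Snappy library looks roughly like the following; this is a sketch assuming an autotools-based Snappy source tarball (the version number is only an example), with the install prefix matching the path used in the next step:

tar -zxvf snappy-1.1.1.tar.gz
cd snappy-1.1.1
./configure --prefix=/data0/liangjun/snappy/
make && make install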

2. Recompile libhadoop.so, passing the Snappy installation path to the build with -Dsnappy.prefix:

mvn clean package -Pdist -Dtar -Pnative -Dsnappy.prefix=/data0/liangjun/snappy/ -DskipTests

Note: in my testing, a libhadoop.so built with -Drequire.snappy also works:

mvn clean package -Pdist,native -DskipTests -Drequire.snappy
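After either build finishes, the two native libraries can be collected from the build output; the exact paths below are assumptions based on a typical Hadoop 2.4.0 source tree and the Snappy prefix used above:

find . -name 'libhadoop.so*'
cp hadoop-dist/target/hadoop-2.4.0/lib/native/libhadoop.so .
cp /data0/liangjun/snappy/lib/libsnappy.so.1 .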
3. After completing the two steps above, only two files are ultimately needed: libhadoop.so and libsnappy.so.1 (only these two; everything else was filtered out in my tests). The following are examples of using Snappy compression from MapReduce and from Hive:

(1) MapReduce: add the compiled native libraries to the distributed cache:

In the test environment, add the following two configuration properties to the client-side mapred-site.xml so that map output is compressed with Snappy:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
  <final>true</final>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <final>true</final>
</property>

Upload libhadoop.so and libsnappy.so.1 to the HDFS directory /test/snappy/, then reference them with -files:

hadoop jar hadoop-mapreduce-examples-2.4.0.jar wordcount -files hdfs://ns1/test/snappy/libhadoop.so,hdfs://ns1/test/snappy/libsnappy.so.1 /test/missdisk /test/wordcount

(2) Hive: add the files with ADD FILE:

hive> add file libhadoop.so;
hive> add file libsnappy.so.1;
hive> select count(*) from ct_tmp_objrec;

The data in table ct_tmp_objrec is Snappy-compressed text, and the table's storage format is plain text.

After running the HQL, the Snappy-compressed data was processed correctly, but the 200+ MB file could only be handled by a single map task; split is not supported.

==========================================================

The following section tests whether RCFile+Snappy data supports split:

1. Create a test table snappy_test with exactly the same columns as ct_tmp_objrec, but with the Hive storage format changed to RCFile:

CREATE EXTERNAL TABLE `snappy_test` (
  `from_id` string,
  `to_id` string,
  `mention_type` bigint,
  `repost_flag` bigint,
  `weight` double,
  `confidence` double,
  `from_uid` string,
  `to_object_label` string,
  `count` bigint,
  `last_modified` string,
  `time` string,
  `mblog_spam` bigint,
  `mblog_simhash` string,
  `mblog_dupnum` bigint,
  `mblog_attribute` bigint,
  `user_quality` bigint,
  `user_type` bigint,
  `old_weight` double,
  `obj_pos` bigint,
  `quality` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION
  'hdfs://ns1/user/liangjun/warehouse/tables/snappy_test';

2. Convert the plain-text+Snappy data in ct_tmp_objrec into RCFile+Snappy data in snappy_test:

hive> add file libhadoop.so;
hive> add file libsnappy.so.1;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> INSERT OVERWRITE TABLE snappy_test
      SELECT from_id, to_id, mention_type, repost_flag, weight, confidence, from_uid, to_object_label, count,
             last_modified, time, mblog_spam, mblog_simhash, mblog_dupnum, mblog_attribute, user_quality,
             user_type, old_weight, obj_pos, quality
      FROM ct_tmp_objrec;

3. Query the RCFile+Snappy data in snappy_test to see whether it can be split:

hive> add file libhadoop.so;
hive> add file libsnappy.so.1;
hive> select count(*) from snappy_test;

After running the HQL, the RCFile+Snappy data was processed correctly, and the 200+ MB file was split into two map tasks. The test is complete.

References:

http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/

http://blog.csdn.net/czw698/article/details/38387657

http://www.linuxidc.com/Linux/2014-05/101230.htm

http://book.2cto.com/201305/21922.html

http://blog.csdn.net/czw698/article/details/38398399

http://www.tuicool.com/articles/yiiiY3R







