Hadoop2.4 supports snappy



Our Hadoop 2.4 cluster does not support snappy compression by default. Recently, however, some business teams reported that part of their data is snappy-compressed (it was compressed with snappy on the cluster that handed it over to them), and when they migrated it to our cluster for computation, running against it directly produced the following error:

Failed with exception java.io.IOException: java.lang.RuntimeException:
Native snappy library not available: this version of libhadoop was built without snappy support

According to the error message, the native snappy library is not available; it appears that libhadoop must be compiled against the snappy native library in order to support snappy. This is different from Hadoop 1.0, where you only had to copy the snappy native library files into the specified directory and did not need to recompile libhadoop.
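Before rebuilding anything, it is worth confirming what the client's current libhadoop actually supports. A minimal check, assuming a recent Hadoop 2.x release that ships the checknative command:

# list the native codecs this client's libhadoop was built with
hadoop checknative -a
# a libhadoop built without snappy reports "snappy: false"; after the rebuild
# described below it should report "snappy: true" plus the libsnappy path it loaded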

Because the snappy compression algorithm does not achieve a particularly high compression ratio, even though it has some advantage in decompression efficiency, our cluster does not support snappy by default; the cluster's required data format is RCFile + gzip. The advantages and disadvantages of the common compression formats in Hadoop are as follows:

Reference address:

Currently, lzo, gzip, snappy, and bzip2 are all widely used in Hadoop. Based on practical experience, the advantages, disadvantages, and typical application scenarios of these four compression formats are introduced below, so that the right format can be chosen for each situation.

1. gzip Compression

Advantages: relatively high compression ratio and relatively fast compression/decompression speed; Hadoop itself supports gzip, so applications can process gzip files just like plain text; the Hadoop native library supports it; most Linux systems ship the gzip command, making it easy to use.

Disadvantage: split is not supported.

Application scenario: gzip is suitable when each compressed file stays within about one HDFS block size. For example, the logs of one day or one hour can be compressed into a single gzip file, and many such files can then be processed concurrently by MapReduce. Hive programs, streaming programs, and MapReduce programs written in Java handle them exactly like plain text, so the original programs need no modification after compression. A sketch of this workflow follows.
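A minimal sketch of that workflow, with illustrative file and directory names: compress one hour of logs into a single gzip file and upload it to HDFS, so that many such files can be consumed by concurrent map tasks.

# compress one hour of logs (file names and HDFS paths are illustrative)
gzip -c access_2014060101.log > access_2014060101.log.gz
hdfs dfs -put access_2014060101.log.gz /logs/2014/06/01/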

2. lzo Compression

Advantages: relatively fast compression/decompression speed and a reasonable compression ratio; supports split and is the most popular splittable compression format in Hadoop; supported by the Hadoop native library; the lzop command can be installed on Linux, making it easy to use.

Disadvantages: lower compression ratio than gzip; not supported by Hadoop itself and must be installed separately; lzo files need special handling in applications (an index must be built to support split, and the input format must be set to the LZO input format; see the sketch after the application scenario below).

Application scenario: worth considering for large text files that are still well above one block size after compression; the larger the single file, the more pronounced lzo's advantage.
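A minimal sketch of the indexing step mentioned above, assuming the hadoop-lzo package (codec plus indexer) is installed on the client; the jar path and input path are illustrative:

# build a .index file next to the .lzo file so MapReduce can split it
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /logs/big_file.lzo
# jobs must then use an LZO-aware input format (e.g. LzoTextInputFormat from the
# same package) for the splits to take effect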

3. snappy Compression

Advantages: fast compression speed and a reasonable compression ratio; supported by the Hadoop native library.

Disadvantages: split is not supported; lower compression ratio than gzip; not supported by Hadoop itself and must be installed separately; there is no corresponding command on Linux.

Application scenario: when a MapReduce job's map output is large, snappy is a good compression format for the intermediate data between map and reduce; it also works as the output of one MapReduce job that serves as the input of another. A per-job sketch follows.
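Besides the client-side mapred-site.xml settings shown later in this article, map-output compression can be enabled per job through generic options. A minimal sketch, using the bundled wordcount example and illustrative input/output paths (it still requires a snappy-capable libhadoop, which is the point of the rest of this article):

hadoop jar hadoop-mapreduce-examples-2.4.0.jar wordcount \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /test/input /test/output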

4. bzip2 Compression

Advantages: supports split; high compression ratio, higher than gzip; supported by Hadoop itself, although without native library support; Linux ships the bzip2 command, making it easy to use.

Disadvantages: slow compression/decompression speed; no native library support.

Application scenario: suitable when speed is not critical but a high compression ratio is required, for example as the output format of MapReduce jobs; when output data is large and, after processing, needs to be compressed and archived to save disk space while being used only rarely afterwards; or when a single large text file must be compressed to save storage space while still supporting split and remaining compatible with existing applications (i.e. the applications need no modification). A sketch of compressing job output with bzip2 follows.
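A minimal sketch of compressing a job's final output with bzip2 so that downstream jobs can still split it. The jar name and main class are hypothetical placeholders; the property names are the standard Hadoop 2.x ones:

# my-job.jar and com.example.MyJob are placeholders for an actual Tool-based job
hadoop jar my-job.jar com.example.MyJob \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
  /data/input /data/output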

Finally, the following table compares the characteristics (advantages and disadvantages) of the four compression formats:

Comparison of features in four compression formats

Format  | Split | Native | Compression ratio | Speed           | Built into Hadoop?    | Linux command | Changes to existing applications
gzip    | No    | Yes    | Very high         | Relatively fast | Yes, usable directly  | Yes           | None; handled like plain text
lzo     | Yes   | Yes    | Relatively high   | Fast            | No, must be installed | Yes           | Must build an index and specify the LZO input format
snappy  | No    | Yes    | Relatively high   | Fast            | No, must be installed | No            | None; handled like plain text
bzip2   | Yes   | No     | Highest           | Slow            | Yes, usable directly  | Yes           | None; handled like plain text

Note: the split support above refers to compressing plain text files. Container formats such as RCFile and SequenceFile support split themselves, so they remain splittable after compression.

In summary, the Hadoop 2.4 cluster's requirement of RCFile + gzip is reasonable: the RCFile format supports columnar storage and split, while gzip offers a relatively high compression ratio and relatively fast compression/decompression. Gzip-compressed RCFile files can therefore still be split, while achieving good speed and a good compression ratio.

After this long digression, back to the topic: how to get snappy compression working from the Hadoop client alone, without replacing the cluster's native library files and without restarting any Hadoop processes:

1. Compile the snappy native library. After compilation, the snappy native library files are located at /data0/liangjun/snappy/.

Reference address:

2. Recompile libhadoop.so, specifying the snappy native library location via -Dsnappy.prefix during the build:

mvn clean package -Pdist -Dtar -Pnative -Dsnappy.prefix=/data0/liangjun/snappy/ -DskipTests

Note: in my tests, a libhadoop.so built with -Drequire.snappy also works:

mvn clean package -Pdist,native -DskipTests -Drequire.snappy
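After either build finishes, the rebuilt libhadoop.so typically sits under the dist target tree of the Hadoop source checkout; the paths below are assumptions for a 2.4.0 build, and the libsnappy.so.1 location depends on how snappy was installed under the prefix used above:

# locations to pick up the two required files (paths assumed, not from the original)
ls hadoop-dist/target/hadoop-2.4.0/lib/native/libhadoop.so*
ls /data0/liangjun/snappy/lib/libsnappy.so.1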

3. After the two steps above, you only need the two files libhadoop.so and libsnappy.so.1 (only these two are required; in my tests everything else could be left out). The following are examples of using snappy compression with MapReduce and Hive:

(1) MapReduce: add the compiled native libraries to the DistributedCache.

First, add the following two configuration items to the client-side mapred-site.xml in the test environment so that map output data is compressed with snappy:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
  <final>true</final>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <final>true</final>
</property>

Then upload libhadoop.so and libsnappy.so.1 to the HDFS directory /test/snappy/ and pass them to the job with -files:

hadoop jar hadoop-mapreduce-examples-2.4.0.jar wordcount -files hdfs://ns1/test/snappy/libhadoop.so,hdfs://ns1/test/snappy/libsnappy.so.1 /test/missdisk/ /test/wordcount

(2) Hive: specify the files with add file:

hive> add file libhadoop.so;
hive> add file libsnappy.so.1;
hive> select count(*) from ct_tmp_objrec;

The data in the ct_tmp_objrec table is snappy-compressed text, while the table's storage format is plain text.

After running the HQL, the snappy-format data was processed and computed correctly, but the 200+ MB file could only be handled by a single map task, confirming that split is not supported.

========================================================================

The following section tests whether RCFile + snappy data supports split:

1. Create a test table snappy_test with exactly the same columns as ct_tmp_objrec, but with the Hive storage format changed to RCFile:

CREATE EXTERNAL TABLE `snappy_test`(
  `from_id` string,
  `to_id` string,
  `mention_type` bigint,
  `repost_flag` bigint,
  `weight` double,
  `confidence` double,
  `from_uid` string,
  `to_object_label` string,
  `count` bigint,
  `last_modified` string,
  `time` string,
  `mblog_spam` bigint,
  `mblog_simhash` string,
  `mblog_dupnum` bigint,
  `mblog_attribute` bigint,
  `user_quality` bigint,
  `user_type` bigint,
  `old_weight` double,
  `obj_pos` bigint,
  `quality` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION
  'hdfs://ns1/user/liangjun/warehouse/tables/snappy_test'

2. Convert the plain text + snappy data in ct_tmp_objrec into RCFile + snappy data in snappy_test:

hive> add file libhadoop.so;
hive> add file libsnappy.so.1;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> insert overwrite table snappy_test select from_id, to_id, mention_type, repost_flag, weight, confidence, from_uid, to_object_label, count, last_modified, time, mblog_spam, mblog_simhash, mblog_dupnum, mblog_attribute, user_quality, user_type, old_weight, obj_pos, quality from ct_tmp_objrec;
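As an optional check that is not part of the original steps, it is easy to confirm that the target table really is stored as RCFile before comparing split behavior:

hive -e "describe formatted snappy_test;"
# the InputFormat/OutputFormat fields in the output should name the RCFile classes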

3. Query the RCFile + snappy data in snappy_test to check whether it can be split:

hive> add file libhadoop.so;
hive> add file libsnappy.so.1;
hive> select count(*) from snappy_test;

After running the HQL, the RCFile + snappy data was processed correctly, and the 200+ MB file was split across two map tasks. Test complete.
