Use RawComparator to accelerate Hadoop programs

In the previous two articles [1] [2], we introduced Hadoop serialization: the Writable interface and the Writable classes, how to write a custom Writable class, and how many bytes a serialized Writable occupies and how its byte sequence is laid out. We pointed out that serialization is one of the core parts of Hadoop. Understanding the Writable classes helps us see how Hadoop serialization works and choose appropriate Writable classes as MapReduce keys and values, so that disk space is used efficiently and objects are read and written quickly. In data-intensive computing, network transfer is an important factor in overall efficiency. Choosing a suitable Writable class not only saves disk space but, more importantly, reduces the amount of data sent over the network, which speeds up the program.

In this article, we introduce another way to speed up a program: using RawComparator. A Writable class used as a key must implement the WritableComparable interface so that keys can be sorted. By default, Hadoop compares two keys by first deserializing their byte streams into objects and then calling compareTo, so every comparison pays for a deserialization step. A RawComparator compares the serialized bytes directly, without deserializing them into objects, which removes that step and speeds up the job. IntWritable, LongWritable, and other classes provided by Hadoop already implement this optimization: when they are used as keys, they are compared on their serialized byte arrays rather than being deserialized.
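The difference between the two comparison paths can be sketched without any Hadoop dependency. The example below is only an illustration under stated assumptions: the class name RawIntCompareSketch and its methods are hypothetical, and it assumes a key serialized as a 4-byte big-endian int, which is the layout DataOutput.writeInt produces. The slow path rebuilds the values through a stream before comparing; the raw path reads them straight out of the buffer at a given offset.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical illustration class; not part of Hadoop.
class RawIntCompareSketch {

    // Serialize an int as 4 big-endian bytes (ByteBuffer is big-endian by default).
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Slow path: rebuild both values through a stream, then compare them.
    static int compareByDeserializing(byte[] b1, byte[] b2) {
        try {
            int v1 = new DataInputStream(new ByteArrayInputStream(b1)).readInt();
            int v2 = new DataInputStream(new ByteArrayInputStream(b2)).readInt();
            return Integer.compare(v1, v2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assemble an int directly from the buffer, starting at the given offset.
    static int readInt(byte[] b, int start) {
        return ((b[start] & 0xff) << 24)
             | ((b[start + 1] & 0xff) << 16)
             | ((b[start + 2] & 0xff) << 8)
             |  (b[start + 3] & 0xff);
    }

    // Raw path: no streams and no intermediate objects, only offset arithmetic.
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] a = serialize(-5);
        byte[] b = serialize(42);
        // Both paths must agree on the ordering of the two keys.
        System.out.println(compareByDeserializing(a, b) == compareRaw(a, 0, b, 0));
    }
}
```

Both methods return the same ordering; the raw version simply avoids creating streams and objects for every comparison, which matters because sorting compares keys many times.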

Implementation of RawComparator

A RawComparator for a Writable class in Hadoop usually does not implement the RawComparator interface directly. Instead, it extends WritableComparator, which implements RawComparator, because WritableComparator provides useful helper methods, such as reading int, long, and vlong values from a byte array. The following is the RawComparator implementation for the MyWritable class defined in the previous two articles. MyWritable consists of a pair of VLongWritable values. To support a RawComparator, the class must implement the WritableComparable interface. We do not show the whole MyWritableComparable class here, only the Comparator nested inside it. The complete code can be found on GitHub.

... // omitted for conciseness

    /**
     * A RawComparator that compares serialized VLongWritable pairs.
     * The compare method decodes the long values from the serialized
     * byte array one by one.
     *
     * @author yoyzhou
     */
    public static class Comparator extends WritableComparator {

        public Comparator() {
            super(MyWritableComparable.class);
        }

        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            int cmp = 1;
            // determine how many bytes the first VLong takes
            int n1 = WritableUtils.decodeVIntSize(b1[s1]);
            int n2 = WritableUtils.decodeVIntSize(b2[s2]);
            try {
                // read value from VLongWritable byte array
                long l11 = readVLong(b1, s1);
                long l21 = readVLong(b2, s2);
                cmp = l11 > l21 ? 1 : (l11 == l21 ? 0 : -1);
                if (cmp != 0) {
                    return cmp;
                } else {
                    long l12 = readVLong(b1, s1 + n1);
                    long l22 = readVLong(b2, s2 + n2);
                    return cmp = l12 > l22 ? 1 : (l12 == l22 ? 0 : -1);
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    static { // register this comparator
        WritableComparator.define(MyWritableComparable.class, new Comparator());
    }

...

As the code above shows, to implement a RawComparator for a Writable class we only need to override WritableComparator's public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) method. In our example, readVLong reads the two VLongWritable values one at a time from the serialized byte array, and the values are then compared. WritableUtils.decodeVIntSize tells us how many bytes the first value occupies, and therefore the offset at which the second value starts.
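To make the offset arithmetic concrete, here is a self-contained sketch of the same comparison with no Hadoop classes involved. It is an illustration, not the author's code or the library's: the class VLongPairCompareSketch and its methods are hypothetical, and the encoding is written from my understanding of the variable-length format used by Hadoop's WritableUtils, so treat the byte layout as an assumption rather than a reference.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; the encoding is assumed to mirror Hadoop's VLong format.
class VLongPairCompareSketch {

    // Append one long in a variable-length encoding.
    static void writeVLong(ByteArrayOutputStream out, long i) {
        if (i >= -112 && i <= 127) {
            out.write((byte) i);            // small values fit in a single byte
            return;
        }
        int len = -112;
        if (i < 0) {
            i ^= -1L;                       // store the one's complement of negatives
            len = -120;
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;                          // one step per payload byte
        }
        out.write((byte) len);              // marker byte: sign plus payload length
        int payload = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = payload; idx != 0; idx--) {
            out.write((byte) ((i >> ((idx - 1) * 8)) & 0xff));
        }
    }

    // Total encoded size (marker plus payload), derived from the first byte only.
    static int decodeVIntSize(byte first) {
        if (first >= -112) return 1;
        if (first < -120) return -119 - first;
        return -111 - first;
    }

    // Decode the value that starts at offset start.
    static long readVLong(byte[] b, int start) {
        byte first = b[start];
        int len = decodeVIntSize(first);
        if (len == 1) return first;
        long value = 0;
        for (int idx = 1; idx < len; idx++) {
            value = (value << 8) | (b[start + idx] & 0xff);
        }
        return (first < -120) ? ~value : value;   // undo the complement for negatives
    }

    // Compare two serialized pairs: first field, then second field on a tie.
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        int cmp = Long.compare(readVLong(b1, s1), readVLong(b2, s2));
        if (cmp != 0) return cmp;
        int n1 = decodeVIntSize(b1[s1]);          // size of the first field in b1
        int n2 = decodeVIntSize(b2[s2]);          // size of the first field in b2
        return Long.compare(readVLong(b1, s1 + n1), readVLong(b2, s2 + n2));
    }

    // Serialize a pair of longs back to back.
    static byte[] pair(long first, long second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, first);
        writeVLong(out, second);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] x = pair(300, -7);
        byte[] y = pair(300, 1000);
        // First fields tie, so the result is decided by the second fields.
        System.out.println(compare(x, 0, y, 0) < 0);
    }
}
```

Because the two first fields may have different encoded lengths, each buffer needs its own offset (s1 + n1 and s2 + n2) to locate the second field; using one length for both buffers would read from the wrong position.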

After writing the compare method, do not forget to register the comparator for the Writable class, as the static block above does with WritableComparator.define.

Summary

To write a RawComparator for a Writable class, you must clearly understand the layout of its serialized byte array and know how to read the object's values from it; see the earlier articles on Hadoop serialization and the Writable interface.

Across these three articles, we covered the Hadoop Writable interface, how to write a custom Writable class, the length and composition of a Writable's serialized byte sequence, and how to write a RawComparator for a Writable class to speed up Hadoop programs.

References

Tom White, Hadoop: The Definitive Guide, 3rd Edition

Hadoop serialization and Writable APIs (1)

Hadoop serialization and Writable APIs (2)
