An analysis of a variable-length long encoding for Hadoop

Last Update:2015-06-15 Source: Internet

Author: User

Hadoop defines its own encoding for long values (int values are encoded as long). It is a variable-length, zero-compressed encoding that can shrink redundant data considerably. The algorithm itself is simple and comes down to the following points:

1. For an integer i with -112 <= i <= 127, a single byte holds the value. Outside that range, the first byte encodes the number of bytes used for i, and the bytes of i follow it.

2. If i is greater than 0, the first byte b lies between -113 and -120, and i occupies (-112 - b) bytes, so i is represented by 1 to 8 bytes.

3. If i is less than 0, the first byte b lies between -121 and -128, and i occupies (-120 - b) bytes, again 1 to 8 bytes. (In Hadoop's implementation, a negative i is encoded as its one's complement.)
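The three rules can be checked with a small sketch. The class `VLongRules` and its method names are hypothetical helpers written for this illustration, not Hadoop APIs; they only restate the rules listed above.

```java
final class VLongRules {
    /** Total encoded size in bytes (first byte plus value bytes) for i. */
    static int encodedSize(long i) {
        if (i >= -112 && i <= 127) {
            return 1;                       // rule 1: the byte is the value
        }
        if (i < 0) {
            i = ~i;                         // rule 3: one's complement of a negative i
        }
        // Significant bits of the now non-negative value, rounded up to whole bytes.
        int valueBytes = (64 - Long.numberOfLeadingZeros(i) + 7) / 8;
        return 1 + valueBytes;
    }

    /** Number of value bytes announced by the first byte b. */
    static int valueBytes(byte b) {
        if (b >= -112) {
            return 0;                       // single-byte form, nothing follows
        }
        return (b >= -120) ? (-112 - b) : (-120 - b);
    }

    /** True when the first byte announces a negative value. */
    static boolean isNegative(byte b) {
        return b < -120 || (b >= -112 && b < 0);
    }

    public static void main(String[] args) {
        System.out.println(encodedSize(127));             // 1
        System.out.println(encodedSize(128));             // 2
        System.out.println(encodedSize(-113));            // 2
        System.out.println(encodedSize(Long.MIN_VALUE));  // 9
        System.out.println(valueBytes((byte) -121));      // 1
        System.out.println(isNegative((byte) -121));      // true
    }
}
```

The largest encoding is therefore 9 bytes: one length byte plus 8 value bytes.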

The algorithm is easy to follow: the key point is that the first byte carries both the length and the sign of i. A closer reading of the source, however, shows that Hadoop's implementation contains one subtle trick. Let's look at the code first.

First, the encoding of a variable-length long:

    public static void writeVLong(DataOutput stream, long i) throws IOException {
        if (i >= -112 && i <= 127) {
            stream.writeByte((byte) i);
            return;
        }

        int len = -112;
        if (i < 0) {
            // Key step: take the one's complement.
            // The alternative would be i = -i;
            i ^= -1L;
            len = -120;
        }

        // Count the value bytes by shifting until nothing is left.
        long tmp = i;
        while (tmp != 0) {
            tmp = tmp >> 8;
            len--;
        }

        stream.writeByte((byte) len);

        // Turn the first byte back into the number of value bytes.
        len = (len < -120) ? -(len + 120) : -(len + 112);

        // Write the value bytes, most significant first.
        for (int idx = len; idx != 0; idx--) {
            int shiftbits = (idx - 1) * 8;
            long mask = 0xFFL << shiftbits;
            stream.writeByte((byte) ((i & mask) >> shiftbits));
        }
    }

For convenience, here is my own slightly simplified version of Hadoop's decoder for a variable-length long:

    public static long readVLong(DataInputStream input) throws IOException {
        byte firstByte = input.readByte();
        int len = -112;
        boolean isNegative = false;
        if (firstByte >= -112 && firstByte <= 127) {
            // Single-byte form: the byte is the value itself.
            return firstByte;
        } else if (firstByte <= -121) {
            len = -120;
            isNegative = true;
        }
        // Number of value bytes that follow the first byte.
        len = len - firstByte;
        long res = 0;
        for (int i = 0; i < len; ++i) {
            res <<= 8;
            byte b = input.readByte();
            res = (b & 0xFF) | res;
        }
        // If the encoder used i = -i, this would be: return isNegative ? (-res) : res;
        return isNegative ? (res ^ -1L) : res;
    }
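A round trip through in-memory streams is a quick way to check that the encoder and decoder agree. The sketch below is self-contained: `VLongRoundTrip`, `encode` and `decode` are hypothetical names chosen for this example, and the two codec methods are condensed restatements of the listings above rather than Hadoop's own source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class VLongRoundTrip {
    static void writeVLong(DataOutput out, long i) throws IOException {
        if (i >= -112 && i <= 127) {
            out.writeByte((byte) i);
            return;
        }
        int len = -112;
        if (i < 0) {
            i ^= -1L;                       // one's complement of a negative value
            len = -120;
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;                          // one step per value byte
        }
        out.writeByte((byte) len);
        len = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = len; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.writeByte((byte) ((i >> shift) & 0xFF));
        }
    }

    static long readVLong(DataInputStream in) throws IOException {
        byte first = in.readByte();
        if (first >= -112) {
            return first;                   // single-byte form
        }
        boolean negative = first <= -121;
        int len = (negative ? -120 : -112) - first;
        long res = 0;
        for (int k = 0; k < len; k++) {
            res = (res << 8) | (in.readByte() & 0xFF);
        }
        return negative ? ~res : res;       // undo the one's complement
    }

    /** Encodes v into a fresh byte array. */
    static byte[] encode(long v) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeVLong(new DataOutputStream(buf), v);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes one value from the start of bytes. */
    static long decode(byte[] bytes) {
        try {
            return readVLong(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] samples = {0, 127, 128, -112, -113, -256, 256, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long v : samples) {
            byte[] bytes = encode(v);
            System.out.println(v + " -> " + Arrays.toString(bytes) + " -> " + decode(bytes));
        }
    }
}
```

In this sketch, -256 encodes to the two bytes -121, -1 and decodes back to -256, and both Long.MAX_VALUE and Long.MIN_VALUE take 9 bytes.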

With the earlier description in mind, the overall structure of both routines is easy to follow. One important detail, however, appears in both the encoder and the decoder and relates to rule 3 of the algorithm: when i is negative, Hadoop takes the one's complement of i before encoding it, so the decoder must take the one's complement again as its final step.

Analysis of the idea behind the algorithm

Why do it this way? Consider the principle behind the whole algorithm. Suppose the first byte recorded only the number of bytes in i, without being split into a positive range and a negative range to carry the sign. The variable-length encoding could then no longer tell positive from negative. A simple example: i = 128 and i = -128 would both be encoded as the single byte 0x80. Why? In two's complement, a negative number is the bitwise inverse of the corresponding positive number plus 1 (the plus 1 avoids having two encodings of 0, which lets the negative side represent one extra value). For a given width, the negative range therefore always holds one more value than the positive range; one byte covers -128 to 127. So for i = 128 the sign cannot be recovered from the value byte alone, and the sign information must be carried by the first byte.
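The collision is easy to reproduce. In the sketch below, `SignAmbiguity` and `lowByte` are hypothetical names used only for this illustration; the method keeps the lowest byte of a long, which is what a one-byte payload would store.

```java
final class SignAmbiguity {
    /** Lowest byte of v as an unsigned value in the range 0..255. */
    static int lowByte(long v) {
        return (int) (v & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(lowByte(128)));   // 80
        System.out.println(Integer.toHexString(lowByte(-128)));  // 80
    }
}
```

Both values leave the same byte behind, so a length-only prefix could not tell them apart.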

Once the first byte reserves 8 extra values to carry the sign, the encoder still has to count the bytes of i. If i is negative, its high-order bits are all 1, so a negative i must be inverted before its length is counted. We could instead invert i and add 1, that is, i = -i, which takes the absolute value. In my tests, both the inverse and the absolute value encode and decode correctly, but the inverse has one advantage. For i = -256, inverting i yields a two-byte encoding: -121, -1. Taking the absolute value yields a three-byte encoding: -122, 1, 0. In this case the inverse needs 1 byte fewer than the absolute value.
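The size difference between the two choices can be sketched directly. `ComplementVsNegate` and its methods are hypothetical names for this comparison; the byte count uses an unsigned shift so that the negation variant also terminates for Long.MIN_VALUE, which is an assumption of this sketch and not part of Hadoop's code.

```java
final class ComplementVsNegate {
    /** Bytes needed to hold magnitude m, treating m as unsigned. */
    static int valueBytes(long m) {
        int n = 0;
        for (long tmp = m; tmp != 0; tmp >>>= 8) {
            n++;
        }
        return n;
    }

    /** Encoded size of a negative value when its one's complement is stored. */
    static int sizeWithComplement(long negative) {
        return 1 + valueBytes(~negative);
    }

    /** Encoded size of a negative value when its absolute value is stored. */
    static int sizeWithNegation(long negative) {
        return 1 + valueBytes(-negative);
    }

    public static void main(String[] args) {
        System.out.println(sizeWithComplement(-256));  // 2
        System.out.println(sizeWithNegation(-256));    // 3
        System.out.println(sizeWithComplement(-257));  // 3
        System.out.println(sizeWithNegation(-257));    // 3
    }
}
```

Under this sketch, the saving shows up only where -i is an exact power of 256, such as -256 or -65536; for other negative values the two choices produce the same length.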

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
