An analysis of a variable-length long encoding for Hadoop

Source: Internet
Author: User

Hadoop defines its own encoding for long values (int values are encoded the same way, as longs). It is a variable-length, zero-compressed encoding that can shrink data considerably when most values are small. The algorithm itself is simple and comes down to the following points:

1. For an integer i with -112 <= i <= 127, a single byte represents the value. Outside that range, the first byte of the encoding records how many bytes i occupies, and the bytes of i follow it.

2. If i is positive, the first byte b lies between -113 and -120, and i occupies (-112 - b) bytes, so it is represented with 1 to 8 bytes after the first byte.

3. If i is negative, the first byte b lies between -121 and -128, and i occupies (-120 - b) bytes, again 1 to 8 bytes. (In Hadoop's implementation, a negative i is encoded as its one's complement.)
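The three rules can be read back from the first byte alone. The following is a minimal sketch of that reading, not Hadoop's own code; the class and method names (`VLongHeader`, `payloadLength`, `isNegative`) are hypothetical and chosen only for illustration.

```java
final class VLongHeader {
    /** Number of bytes that follow the first byte, according to rules 1 to 3. */
    static int payloadLength(byte first) {
        if (first >= -112) {
            return 0;              // rule 1: the value is stored in the first byte itself
        }
        if (first >= -120) {
            return -112 - first;   // rule 2: positive value, 1 to 8 bytes
        }
        return -120 - first;       // rule 3: negative value, 1 to 8 bytes
    }

    /** Sign of the encoded value, judged from the first byte only. */
    static boolean isNegative(byte first) {
        // Either a negative multi-byte header, or a small negative value stored inline.
        return first <= -121 || (first >= -112 && first < 0);
    }
}
```

For example, a first byte of -114 announces a positive value stored in 2 following bytes, while -128 announces a negative value stored in 8.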

The algorithm is fairly easy to follow: the first byte carries both the length of i and its sign. Reading the source more closely, though, shows that Hadoop's implementation contains one clever detail. Let's look at the code first.

The first is the encoding of the variable length long:

    public static void writeVLong(DataOutput stream, long i) throws IOException {
        // Small values are stored directly in a single byte.
        if (i >= -112 && i <= 127) {
            stream.writeByte((byte) i);
            return;
        }

        int len = -112;
        if (i < 0) {
            i ^= -1L; // take the one's complement; key step! An alternative would be i = -i;
            len = -120;
        }

        // Decrement len once for every byte that i occupies.
        long tmp = i;
        while (tmp != 0) {
            tmp = tmp >> 8;
            len--;
        }

        // The first byte carries both the sign and the byte count.
        stream.writeByte((byte) len);

        // Recover the byte count from the header value.
        len = (len < -120) ? -(len + 120) : -(len + 112);

        // Write the bytes of i, most significant first.
        for (int idx = len; idx != 0; idx--) {
            int shiftbits = (idx - 1) * 8;
            long mask = 0xFFL << shiftbits;
            stream.writeByte((byte) ((i & mask) >> shiftbits));
        }
    }
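The least obvious step in the encoder is how `len` turns into the first byte. As a rough sketch, the helper below isolates just that step; `VLongHeaderByte` and `headerByte` are hypothetical names, and the method assumes its argument is already outside the single-byte range [-112, 127].

```java
final class VLongHeaderByte {
    /** First byte chosen for a value outside [-112, 127] (precondition, not checked). */
    static byte headerByte(long i) {
        int len = -112;        // starting point for positive values
        if (i < 0) {
            i ^= -1L;          // one's complement, as in the encoder
            len = -120;        // starting point for negative values
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;             // one step down per byte that i occupies
        }
        return (byte) len;
    }
}
```

With this helper, 128 (one byte) gives -113, 65536 (three bytes) gives -115, and -129, whose complement is 128, gives -121.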

For convenience, here is my own slightly simplified version of Hadoop's decoding of a variable-length long:

    public static long readVLong(DataInputStream input) throws IOException {
        byte firstByte = input.readByte();
        int len = -112;
        boolean isNegative = false;
        if (firstByte >= -112 && firstByte <= 127) {
            // The value itself was stored in the first byte.
            return firstByte;
        } else if (firstByte <= -121) {
            len = -120;
            isNegative = true;
        }
        // Number of bytes that follow the first byte.
        len = len - firstByte;
        long res = 0;
        for (int i = 0; i < len; ++i) {
            res <<= 8;
            byte b = input.readByte();
            res = (b & 0xFF) | res;
        }
        // If the encoder used i = -i, this would be: return isNegative ? (-res) : res;
        return isNegative ? (res ^ -1L) : res;
    }

With the earlier description in hand, the overall structure of the implementation is easy to follow. One important detail, however, appears in both the encoder and the decoder. For the third rule, when i is negative, Hadoop takes the one's complement of i before encoding it, so the decoder has to take the one's complement again as its final step.
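To see the encoder and decoder work together, here is a small self-contained round trip. It is only a sketch under stated assumptions: it follows the same rules as the code above but works on in-memory byte arrays instead of `DataOutput` and `DataInputStream`, and the names `VLongRoundTrip`, `encode` and `decode` are hypothetical.

```java
import java.io.ByteArrayOutputStream;

final class VLongRoundTrip {
    /** Encodes a long with the rules described above into a byte array. */
    static byte[] encode(long i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (i >= -112 && i <= 127) {
            out.write((int) i);                 // single-byte case
            return out.toByteArray();
        }
        int len = -112;
        if (i < 0) {
            i = ~i;                             // one's complement for negatives
            len = -120;
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;
        }
        out.write(len);                         // first byte: sign and byte count
        int count = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = count; idx != 0; idx--) {
            out.write((int) (i >>> ((idx - 1) * 8)));  // low 8 bits are written
        }
        return out.toByteArray();
    }

    /** Decodes a byte array produced by encode. */
    static long decode(byte[] bytes) {
        byte first = bytes[0];
        if (first >= -112) {
            return first;                       // value stored inline
        }
        boolean negative = first <= -121;
        int count = negative ? -120 - first : -112 - first;
        long res = 0;
        for (int k = 1; k <= count; k++) {
            res = (res << 8) | (bytes[k] & 0xFF);
        }
        return negative ? ~res : res;           // undo the complement
    }
}
```

A call such as `VLongRoundTrip.decode(VLongRoundTrip.encode(-256L))` returns -256 again, and `encode(127L)` yields a single byte while `encode(128L)` yields two.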

Analysis of the idea behind the algorithm

Why do it this way? Consider the principle behind the whole algorithm. Suppose the first byte only recorded the number of bytes of i, without separate ranges for positive and negative values. Then the variable-length bytes alone could not tell us the sign. A simple example: for i = 128 and i = -128, the single payload byte is 0x80 in both cases. Why? In two's complement, a negative number is encoded as the bitwise inverse of the positive number plus 1 (the plus 1 avoids having two encodings of 0, which is why the negative side can represent one more value). For any given width, then, there is one more negative value than there are positive values; one byte covers -128 to 127. So for i = 128 the payload byte cannot distinguish positive from negative, and the sign has to be carried by the first byte.
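The 128 versus -128 collision is easy to check. The snippet below is a small illustration only; `LowByteAmbiguity` and `lowByte` are hypothetical names.

```java
final class LowByteAmbiguity {
    /** Keeps only the lowest 8 bits of a value, as a one-byte payload would. */
    static byte lowByte(long value) {
        return (byte) value;
    }
}
```

Both `lowByte(128L)` and `lowByte(-128L)` produce the bit pattern 0x80, which Java reports as the byte value -128, so the payload byte by itself says nothing about the sign.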

Once the first byte reserves 8 extra values for the sign, the next step is to count how many bytes i needs. If i is negative, its high bits are all 1, so a negative i must first be inverted (one's complement) before its length is counted. Alternatively, we could add 1 after inverting, that is, i = -i, which takes the absolute value. In my tests both inversion and absolute value encode and decode correctly, but inversion has an advantage. For i = -256, inverting i produces a two-byte encoding: -121, -1. Taking the absolute value produces three bytes: -122, 1, 0. In this case inversion uses one byte less than the absolute value.
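The comparison between the two choices can be sketched as follows. This is an illustration under assumptions, not part of Hadoop: the names `ComplementVsNegate`, `payloadBytes`, `withComplement` and `withNegation` are hypothetical, and the byte count uses an unsigned shift so that it also terminates for inputs that are still negative.

```java
final class ComplementVsNegate {
    /** Number of bytes needed to hold the given bit pattern without leading zero bytes. */
    static int payloadBytes(long magnitude) {
        int n = 0;
        for (long tmp = magnitude; tmp != 0; tmp >>>= 8) {
            n++;
        }
        return n;
    }

    /** Payload size when a negative value is inverted (one's complement). */
    static int withComplement(long negative) {
        return payloadBytes(~negative);
    }

    /** Payload size when a negative value is negated (absolute value). */
    static int withNegation(long negative) {
        return payloadBytes(-negative);
    }
}
```

Since ~i equals -i - 1, the two counts differ only when -i is an exact power of 256: for -256 the complement needs 1 payload byte and the negation needs 2, and for -65536 the counts are 2 and 3. One further point worth keeping in mind is that -i overflows for Long.MIN_VALUE (the result is still Long.MIN_VALUE), whereas ~Long.MIN_VALUE is Long.MAX_VALUE, so the complement handles every negative long.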
