Compressed Integer in .NET/CLI Metadata

Last Update:2018-12-08 Source: Internet

Document directory

Compression Algorithm for Unsigned Integer
Decompression Algorithm for Unsigned Integer
Compression Algorithm for Signed Integer
Decompression Algorithm for Signed Integer
ILASM/ILDASM
Mono Cecil
CCI Metadata
Implementations Not Researched
Expert .NET 2.0 IL Assembler
ECMA-335 -- Common Language Infrastructure (CLI) 4th Edition

URL: http://www.cnblogs.com/AndersLiu/archive/2010/02/09/en-compressed-integer-in-metadata.html

Author: Anders Liu

Abstract: Compressed integers are widely used in .NET/CLI PE files. The algorithm stores a 32-bit integer in 1, 2, or 4 bytes, depending on its value, which effectively reduces the size of a PE file, especially when the integer values are small. This document introduces the compression algorithm for integers and gives a reference implementation of it.

Bibliographies

ECMA-335: Common Language Infrastructure (CLI) 4th Edition, June 2006.
Expert .NET 2.0 IL Assembler, Serge Lidin, Apress, 2006.

Introduction

In short, the compression algorithm stores a 32-bit integer (which normally takes 4 bytes) in as few bytes as possible (1, 2, or 4 bytes).

This compression algorithm is widely used in .NET/CLI PE files, for example in metadata signatures, the #Blob stream and the #US stream. In these places, integers store the number of records or the size of data blocks. Since such counts and sizes are usually very small, storing them as full 32-bit integers would leave most bytes set to 0, which is wasteful. Compressed integers effectively reduce the disk space a PE file takes and save network bandwidth.

Some scenarios of using compressed integer within a PE file are listed below:

At the beginning of each record in the Blob heap (the storage format of the #Blob stream and the #US stream), a compressed unsigned integer stores the size of the record data.
In a method's metadata signature, a compressed unsigned integer stores the number of parameters.
In metadata signatures, the lower bound of each array dimension is stored as a compressed signed integer.
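As an illustration of the first scenario, the sketch below reads one record from a byte array laid out like the Blob heap: a compressed unsigned size followed by that many data bytes. The decoding rules are the ones given in the Decompression Algorithm for Unsigned Integer section of this article. The class and method names (BlobHeapSketch, ReadCompressedUInt, ReadRecord) are hypothetical and not taken from any library, and the sketch does no bounds checking beyond what the array itself enforces.

```csharp
using System;

public static class BlobHeapSketch
{
    // Hypothetical helper: decodes a compressed unsigned integer that
    // starts at 'offset' and advances 'offset' past it.
    public static uint ReadCompressedUInt(byte[] heap, ref int offset)
    {
        byte b0 = heap[offset];
        if ((b0 & 0x80) == 0)
        {
            // 0bbbbbbb: the value is the byte itself.
            offset += 1;
            return b0;
        }
        if ((b0 & 0xC0) == 0x80)
        {
            // 10bbbbbb bbbbbbbb: 14 significant bits, big-endian.
            uint v = (uint)(((b0 & 0x3F) << 8) | heap[offset + 1]);
            offset += 2;
            return v;
        }
        if ((b0 & 0xE0) == 0xC0)
        {
            // 110bbbbb + 3 more bytes: 29 significant bits, big-endian.
            uint v = (uint)(((b0 & 0x1F) << 24) | (heap[offset + 1] << 16)
                          | (heap[offset + 2] << 8) | heap[offset + 3]);
            offset += 4;
            return v;
        }
        throw new NotSupportedException();
    }

    // Hypothetical helper: returns the data bytes of the record that
    // starts at 'offset' (size prefix first, then the data).
    public static byte[] ReadRecord(byte[] heap, int offset)
    {
        int size = (int)ReadCompressedUInt(heap, ref offset);
        var data = new byte[size];
        Array.Copy(heap, offset, data, 0, size);
        return data;
    }
}
```

For example, given the bytes 00 03 41 42 43, calling ReadRecord at offset 1 reads the size prefix 03 and returns the three data bytes 41 42 43.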

Note that all compression and decompression algorithms described here apply to 32-bit integers. Also, unless otherwise mentioned, all integers are presented in big-endian order (the most significant byte appears on the left or on top).

Compression and Decompression for Unsigned Integer

Compression Algorithm for Unsigned Integer

Compression for unsigned integers is simple: split the range of unsigned integers into 3 ranges, and then store the value in 1, 2, or 4 bytes depending on which range it falls into. Table 1 lists the ranges and the format of the compressed value.

Table 1 - Ranges for unsigned integer
Range	Bytes Used	Mask	Binary Format
[00000000h, 0000007Fh]	1	80h	0bbbbbbb
[00000080h, 00003FFFh]	2	C0h	10bbbbbb bbbbbbbb
[00004000h, 1FFFFFFFh]	4	E0h	110bbbbb bbbbbbbb bbbbbbbb bbbbbbbb

In Table 1,

Range: lists the minimum value (inclusive) and the maximum value (inclusive) of the range.
Bytes Used: lists how many bytes the compressed value takes.
Mask: lists the mask value applied to the first byte of the compressed value:
- If the compressed value takes 1 byte, performing & (bitwise AND) with 80h gives 0h;
- If the compressed value takes 2 bytes, performing & with C0h gives 80h;
- If the compressed value takes 4 bytes, performing & with E0h gives C0h.
Binary Format: lists the binary format of the compressed value, where 1 and 0 are fixed bits and b denotes a significant bit.

From Table 1, we know that unsigned integers in [0h, 1FFFFFFFh] are suitable for this algorithm; values larger than 1FFFFFFFh are not supported.
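The range lookup in Table 1 can be sketched as a small function that reports how many bytes a value will need, which is handy for sizing a buffer before writing. The names UnsignedRangeSketch and GetEncodedSize are hypothetical, and returning 0 for unsupported values is a choice made for this sketch only.

```csharp
using System;

public static class UnsignedRangeSketch
{
    // Number of bytes Table 1 assigns to 'value'.
    // Returns 0 when the value is larger than 1FFFFFFFh (not supported).
    public static int GetEncodedSize(uint value)
    {
        if (value <= 0x7F) return 1;        // [00000000h, 0000007Fh]
        if (value <= 0x3FFF) return 2;      // [00000080h, 00003FFFh]
        if (value <= 0x1FFFFFFF) return 4;  // [00004000h, 1FFFFFFFh]
        return 0;
    }
}
```

At the boundaries, 7Fh needs 1 byte, 80h and 3FFFh need 2 bytes, 4000h and 1FFFFFFFh need 4 bytes, and 20000000h is rejected.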

Code 1 shows a reference implementation of unsigned integer compression.

Code 1 - Reference implementation of unsigned integer compression

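public static byte[] CompressUInt(uint data)
{
  if (data <= 0x7F)
  {
    // 1 byte: 0bbbbbbb
    var bytes = new byte[1];
    bytes[0] = (byte)data;
    return bytes;
  }
  else if (data <= 0x3FFF)
  {
    // 2 bytes: 10bbbbbb bbbbbbbb
    var bytes = new byte[2];
    bytes[0] = (byte)(((data & 0xFF00) >> 8) | 0x80);
    bytes[1] = (byte)(data & 0x00FF);
    return bytes;
  }
  else if (data <= 0x1FFFFFFF)
  {
    // 4 bytes: 110bbbbb bbbbbbbb bbbbbbbb bbbbbbbb
    var bytes = new byte[4];
    bytes[0] = (byte)(((data & 0xFF000000) >> 24) | 0xC0);
    bytes[1] = (byte)((data & 0x00FF0000) >> 16);
    bytes[2] = (byte)((data & 0x0000FF00) >> 8);
    bytes[3] = (byte)(data & 0x000000FF);
    return bytes;
  }
  else
    throw new NotSupportedException();
}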

Decompression Algorithm for Unsigned Integer

The decompression algorithm for unsigned integers is as simple as the compression:

If the first byte has the form 0bbbbbbb (bitwise AND with 80h gives 0h), the compressed value is stored in 1 byte (byte value b0), and the original integer value is b0.
If the first byte has the form 10bbbbbb (bitwise AND with C0h gives 80h), the compressed value is stored in 2 bytes (byte values b0, b1 in order), and the original integer value is (b0 & 0x3F) << 8 | b1.
If the first byte has the form 110bbbbb (bitwise AND with E0h gives C0h), the compressed value is stored in 4 bytes (byte values b0, b1, b2, b3 in order), and the original integer value is (b0 & 0x1F) << 24 | b1 << 16 | b2 << 8 | b3.

Code 2 gives a reference implementation of unsigned integer decompression.

Code 2 - Reference implementation of unsigned integer decompression

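public static uint DecompressUInt(byte[] data)
{
  if (data == null)
    throw new ArgumentNullException("data");

  if ((data[0] & 0x80) == 0
    && data.Length == 1)
  {
    return (uint)data[0];
  }
  else if ((data[0] & 0xC0) == 0x80
    && data.Length == 2)
  {
    return (uint)((data[0] & 0x3F) << 8 | data[1]);
  }
  else if ((data[0] & 0xE0) == 0xC0
    && data.Length == 4)
  {
    return (uint)((data[0] & 0x1F) << 24
      | data[1] << 16 | data[2] << 8 | data[3]);
  }
  else
    throw new NotSupportedException();
}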

Compression and Decompression for Signed Integer

Compression Algorithm for Signed Integer

Compressing a signed integer is slightly more complex than compressing an unsigned integer, because we have to deal with the sign bit. In short, after determining how many bytes the compressed value will take, we should left shift the whole integer by 1 bit, place the sign bit in the least significant bit (0 for positive, 1 for negative), and then set the mask bits in the first byte as for a compressed unsigned integer.

To determine how many bytes should be used to store the compressed signed integer, we first get the 'semi-absolute value' of the original integer: for a negative value, take its bitwise complement (not its arithmetic negation); a non-negative value is used as is. Then left shift the 'semi-absolute value' by 1 bit and look the result up in Table 1 to get the number of bytes to use.

Alternatively, you can omit the left shift operation and look up the 'semi-absolute value' directly in Table 2.

Table 2 - Ranges of 'semi-absolute value' of signed integer
Range	Bytes Used	Significant Bit Mask
[00000000h, 0000003Fh]	1	0000003Fh
[00000040h, 00001FFFh]	2	00001FFFh
[00002000h, 0FFFFFFFh]	4	0FFFFFFFh

In Table 2,

Range: lists the minimum 'semi-absolute value' (inclusive) and the maximum 'semi-absolute value' (inclusive) of each range.
Bytes Used: lists the number of bytes that the compressed value takes.
Significant Bit Mask: lists the mask to apply with & to the original integer value to extract all of its significant bits. For a positive value, all bits to the left of the significant bits are 0 and carry no information, so they can be omitted; likewise, for a negative value, all bits to the left are 1 and can be omitted too.
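The two ways of finding the byte count (shift the 'semi-absolute value' and use Table 1, or use Table 2 directly) can be checked against each other with a short sketch. All names here (SignedSizeSketch, SemiAbsolute, SizeViaTable1, SizeViaTable2) are hypothetical, and returning 0 for values that cannot be compressed is a convention of this sketch only.

```csharp
using System;

public static class SignedSizeSketch
{
    // 'Semi-absolute value': the value itself when non-negative,
    // its bitwise complement when negative.
    public static uint SemiAbsolute(int value)
    {
        return value >= 0 ? (uint)value : (uint)~value;
    }

    // Route 1: shift left by one bit, then look the result up in Table 1.
    // A 64-bit intermediate keeps the shifted value from overflowing.
    public static int SizeViaTable1(int value)
    {
        ulong shifted = (ulong)SemiAbsolute(value) << 1;
        if (shifted <= 0x7F) return 1;
        if (shifted <= 0x3FFF) return 2;
        if (shifted <= 0x1FFFFFFF) return 4;
        return 0;
    }

    // Route 2: skip the shift and look the value up in Table 2.
    public static int SizeViaTable2(int value)
    {
        uint u = SemiAbsolute(value);
        if (u <= 0x3F) return 1;
        if (u <= 0x1FFF) return 2;
        if (u <= 0x0FFFFFFF) return 4;
        return 0;
    }
}
```

Both routes give 1 byte for -64 and 63, 2 bytes for -65, 64, -8192 and 8191, and 4 bytes for -8193 and 8192. Note that the negative side of each range reaches one further than the positive side, because the complement of -64 is 63.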

After extracting the significant bits with the bitwise AND operation and the corresponding mask, left shift them by 1 bit. Next, if the original integer is negative, set the least significant bit (the sign bit) to 1.

Finally, apply the mask bits to the first byte of the compressed value, using the same rule as for a compressed unsigned integer.

The signed integers suitable for this compression algorithm are [0h, 0FFFFFFFh] ([0, 268435455]) for non-negative integers and [F0000000h, FFFFFFFFh] ([-268435456, -1]) for negative integers. Integers outside these ranges cannot be compressed.

Code 3 gives a reference implementation of signed integer compression.

Code 3 - Reference implementation of signed integer compression

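public static byte[] CompressInt(int data)
{
    // 'Semi-absolute value', used only to choose the byte count.
    var u = data >= 0 ? (uint)data : ~(uint)data;
    if (u <= 0x3F)
    {
        var uv = ((uint)(data & 0x0000003F)) << 1;
        if (data < 0)
            uv |= 0x01;
        var bytes = new byte[1];
        bytes[0] = (byte)uv;
        return bytes;
    }
    else if (u <= 0x1FFF)
    {
        var uv = ((uint)(data & 0x00001FFF)) << 1;
        if (data < 0)
            uv |= 0x01;
        var bytes = new byte[2];
        bytes[0] = (byte)(((uv & 0xFF00) >> 8) | 0x80);
        bytes[1] = (byte)(uv & 0x00FF);
        return bytes;
    }
    else if (u <= 0x0FFFFFFF)
    {
        var uv = ((uint)(data & 0x0FFFFFFF)) << 1;
        if (data < 0)
            uv |= 0x01;
        var bytes = new byte[4];
        bytes[0] = (byte)(((uv & 0xFF000000) >> 24) | 0xC0);
        bytes[1] = (byte)((uv & 0x00FF0000) >> 16);
        bytes[2] = (byte)((uv & 0x0000FF00) >> 8);
        bytes[3] = (byte)(uv & 0x000000FF);
        return bytes;
    }
    else
        throw new NotSupportedException();
}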

Note that the 'semi-absolute value' is used only to determine the number of bytes the compressed value takes. Once that number is known, use the original integer value, treated as unsigned, for the compression.
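To make the steps concrete, here is a compact sketch of the signed compression that builds the whole compressed word first and then writes it out in big-endian order. It follows the algorithm described in this section but is not the article's reference code; the names SignedEncodeSketch and Encode are hypothetical.

```csharp
using System;

public static class SignedEncodeSketch
{
    public static byte[] Encode(int value)
    {
        // The semi-absolute value decides the size and the significant-bit mask (Table 2).
        uint u = value >= 0 ? (uint)value : (uint)~value;
        int size;
        uint mask;
        if (u <= 0x3F) { size = 1; mask = 0x3F; }
        else if (u <= 0x1FFF) { size = 2; mask = 0x1FFF; }
        else if (u <= 0x0FFFFFFF) { size = 4; mask = 0x0FFFFFFF; }
        else throw new NotSupportedException();

        // Keep the significant bits of the ORIGINAL value, shift left,
        // and put the sign in the least significant bit.
        uint payload = (unchecked((uint)value) & mask) << 1;
        if (value < 0) payload |= 1;

        // Add the size tag in the top bits (0, 10 or 110), as for unsigned values.
        uint tag = size == 1 ? 0u : size == 2 ? 0x8000u : 0xC0000000u;
        uint word = payload | tag;

        // Write the word most significant byte first.
        var bytes = new byte[size];
        for (int i = 0; i < size; i++)
            bytes[i] = (byte)((word >> (8 * (size - 1 - i))) & 0xFF);
        return bytes;
    }
}
```

With this sketch, BitConverter.ToString(SignedEncodeSketch.Encode(x)) gives 06 for 3, 7B for -3, 80-80 for 64, 01 for -64, C0-00-40-00 for 8192, 80-01 for -8192, DF-FF-FF-FE for 268435455 and C0-00-00-01 for -268435456.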

Decompression Algorithm for Signed Integer

Since compressed signed integers and compressed unsigned integers use the same binary format, the decompression of a signed integer can be based on the decompression of an unsigned integer.

First, decompress the compressed value as unsigned to get a 32-bit unsigned integer. Then determine the sign of the original integer from the least significant bit (the sign bit).

If the original integer is positive (the least significant bit, i.e. the sign bit, is 0), right shift the decompressed value by 1 bit and convert it to a signed integer to get the original signed integer.

If the original integer is negative (the least significant bit, i.e. the sign bit, is 1), right shift the decompressed value by 1 bit and restore the omitted 1 bits on the left side of the integer:

If the compressed value takes 1 byte, perform | (bitwise OR) with FFFFFFC0h;
If the compressed value takes 2 bytes, perform | with FFFFE000h;
If the compressed value takes 4 bytes, perform | with F0000000h.

Finally, convert the result to a signed integer to get the original negative integer.
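The three OR masks are not arbitrary: each one is simply "all ones above the magnitude bits", where the number of magnitude bits is the payload width (7, 14 or 29 bits) minus the sign bit. The sketch below derives the mask from that width instead of hard-coding it. The names SignRestoreSketch and Decode are hypothetical; the first argument is the value already decompressed as unsigned, and the second is the number of bytes the compressed form took.

```csharp
using System;

public static class SignRestoreSketch
{
    public static int Decode(uint unsignedValue, int size)
    {
        // Magnitude bits left after removing the tag bits and the sign bit.
        int magnitudeBits;
        switch (size)
        {
            case 1: magnitudeBits = 6; break;   // 7-bit payload
            case 2: magnitudeBits = 13; break;  // 14-bit payload
            case 4: magnitudeBits = 28; break;  // 29-bit payload
            default: throw new NotSupportedException();
        }

        int value = (int)(unsignedValue >> 1);
        if ((unsignedValue & 1) != 0)
        {
            // -(1 << n) has every bit from position n upward set, which gives
            // FFFFFFC0h, FFFFE000h and F0000000h for n = 6, 13 and 28.
            value |= -(1 << magnitudeBits);
        }
        return value;
    }
}
```

For example, Decode(0x7B, 1) yields -3, Decode(0x01, 1) yields -64, and Decode(0x0001, 2) yields -8192.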

Code 4 gives a reference implementation of signed integer decompression.

Code 4 - Reference implementation of signed integer decompression

public static int DecompressInt(byte[] data)
{
    var u = DecompressUInt(data);

    // Sign bit 0: non-negative, a right shift restores the value.
    if ((u & 0x00000001) == 0)
        return (int)(u >> 1);

    // Sign bit 1: negative, restore the omitted leading 1 bits.
    var nb = GetCompressedIntSize(data[0]);
    uint sm;
    switch (nb)
    {
        case 1: sm = 0xFFFFFFC0; break;
        case 2: sm = 0xFFFFE000; break;
        case 4: sm = 0xF0000000; break;
        default: throw new NotSupportedException();
    }

    return (int)((u >> 1) | sm);
}

Here the utility method GetCompressedIntSize is called; it determines how many bytes the compressed value takes by examining its first byte. The method is very simple, see Code 5.

Code 5 - Get the number of bytes of the compressed value from its first byte

public static uint GetCompressedIntSize(byte firstByte)
{
  if ((firstByte & 0x80) == 0)
    return 1;
  else if ((firstByte & 0xC0) == 0x80)
    return 2;
  else if ((firstByte & 0xE0) == 0xC0)
    return 4;
  else
    throw new NotSupportedException();
}
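Because the size is known from the first byte alone, a reader can walk over a run of consecutive compressed integers without decoding them. The sketch below counts the values in such a run; it assumes the buffer contains nothing but compressed integers laid end to end, and the names CompressedRunSketch, SizeFromFirstByte and CountValues are hypothetical.

```csharp
using System;

public static class CompressedRunSketch
{
    // Same rule as Code 5: the top bits of the first byte give the size.
    public static int SizeFromFirstByte(byte firstByte)
    {
        if ((firstByte & 0x80) == 0) return 1;
        if ((firstByte & 0xC0) == 0x80) return 2;
        if ((firstByte & 0xE0) == 0xC0) return 4;
        throw new NotSupportedException();
    }

    // Counts the compressed integers stored back to back in 'buffer'.
    public static int CountValues(byte[] buffer)
    {
        int count = 0;
        int offset = 0;
        while (offset < buffer.Length)
        {
            offset += SizeFromFirstByte(buffer[offset]);
            count++;
        }
        // If the last value claims more bytes than remain, the data is truncated.
        if (offset != buffer.Length)
            throw new FormatException("Truncated compressed integer.");
        return count;
    }
}
```

For the eight bytes 03 80 80 C0 00 40 00 7F, CountValues returns 4 (a 1-byte, a 2-byte, a 4-byte and another 1-byte value).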

Implementation Issues

Compressed signed integers are rarely used in .NET/CLI metadata; as far as I know, they appear only as array lower bounds within metadata signatures (which means that negative array lower bounds are natively supported by .NET/CLI). As a result, almost all CLI implementations have problems, to a greater or lesser degree, when dealing with compressed signed integers, and none of the bibliographies describes the compression of signed integers clearly enough. Fortunately, most high-level programming languages don't support arrays with negative lower bounds, and in the CLS all array lower bounds should be 0, so these problems have no serious implications for actual projects.

In the following sections, I list the problems found in the CLI implementations that I have researched, followed by the issues that appear in the bibliographies.

ILASM/ILDASM

Obviously, Microsoft itself has not clearly defined the compression algorithm for signed integers. ILASM is the only compiler I have used that accepts negative array lower bounds, and it is the compiler I used most while researching this question. ILASM has no problem with positive array lower bounds; for negative lower bounds, however, you get an incorrect compressed value when the lower bound is between -8192 (inclusive) and -8129 (inclusive).

In addition, the algorithm ILASM uses for signed integers differs from the one described in this article and cannot cover all theoretically supported integers ([-268435456, 268435455]): when the lower bound is less than or equal to -268427265, you also get an incorrect value.

ILDASM cannot be tested precisely because of the problems in ILASM. However, when decompressing the incorrect values generated by ILASM, ILDASM and the reference implementation in this article both produce the same value, so I am inclined to consider the decompression algorithm used in ILDASM correct. The incorrect values do, however, make ILDASM crash randomly.

The problems described above appear in versions 2.0, 3.0, and 3.5 of ILASM; in version 4.0 Beta they are all resolved. The ILASM shipped with the .NET Framework SDK 4.0 Beta accepts all suitable signed values as array lower bounds and generates correct compressed values, and ILDASM decompresses them correctly.

Mono Cecil

After reading the source code of Mono Cecil, I found that Mono Cecil follows the ECMA-335 standard faithfully, but ECMA-335 makes a mistake in its description of array lower bounds (see the Revision of Bibliographies section later), where the array lower bound is treated as an unsigned (not signed) integer.

So Mono Cecil provides compression and decompression only for unsigned integers (see the 'Mono.Cecil.Metadata.Utilities.WriteCompressedInteger(BinaryWriter, Int32) : Int32' method and the 'Mono.Cecil.Metadata.Utilities.ReadCompressedInteger(Byte[], Int32, Int32&) : Int32' method in Mono.Cecil.dll). When writing and reading array lower bounds, it also treats them as unsigned integers (see the 'Mono.Cecil.Signatures.SignatureWriter.Write(SigType) : Void' method and the 'Mono.Cecil.Signatures.SignatureReader.ReadType(Byte[], Int32, Int32&) : SigType' method in the same library).

When you inspect an assembly with Mono Cecil, a positive array lower bound is reported as twice its real value (because the right shift operation is missing), and a negative array lower bound yields a completely wrong result.
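The effect of reading a signed lower bound as unsigned can be reproduced with plain arithmetic. The sketch below does not call Mono Cecil; it only encodes a small lower bound with the signed rule from this article and then interprets the resulting byte the way an unsigned-only reader would. The names MisreadSketch, EncodeSmallSigned and ReadAsUnsigned are hypothetical, and the encoder is limited to the 1-byte range [-64, 63] to keep the example short.

```csharp
using System;

public static class MisreadSketch
{
    // Signed compression for values that fit in one byte: [-64, 63].
    public static byte EncodeSmallSigned(int value)
    {
        if (value < -64 || value > 63)
            throw new ArgumentOutOfRangeException("value");
        int payload = (value & 0x3F) << 1;  // 6 significant bits, shifted left
        if (value < 0) payload |= 1;        // sign bit in the lowest position
        return (byte)payload;
    }

    // What a reader that only knows the unsigned format sees for a
    // 1-byte value (top bit 0): the byte itself, with no right shift.
    public static uint ReadAsUnsigned(byte compressed)
    {
        return compressed;
    }
}
```

A lower bound of 5 is stored as 0Ah and read back as 10, twice the real value; a lower bound of -1 is stored as 7Fh and read back as 127.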

I only researched version 0.6 of Mono Cecil and am not sure about other versions; you can examine them yourself.

CCI Metadata

CCI Metadata does treat the array lower bound as a signed integer, but it uses an oversimplified algorithm: left shift the absolute value of the original integer, place the sign bit in the least significant bit (see the 'Microsoft.Cci.BinaryWriter.WriteCompressedInt(Int32) : Void' method in Microsoft.Cci.PeWriter.dll), and compress the result as an unsigned integer. The decompression algorithm is the reverse: decompress the compressed value as an unsigned integer, determine the sign from the least significant bit, right shift the decompressed unsigned value by 1 bit, then convert it to a signed integer and set the sign according to the sign bit (see the 'Microsoft.Cci.UtilityDataStructures.MemoryReader.ReadCompressedInt32() : Int32' method in Microsoft.Cci.PeReader.dll).
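To see where the absolute-value approach diverges, the sketch below computes the value that is handed to the unsigned compressor under both schemes. It is written from the prose description above, not from the CCI Metadata source code, and the names PayloadComparisonSketch, AbsoluteValuePayload and ComplementPayload are hypothetical.

```csharp
using System;

public static class PayloadComparisonSketch
{
    // Scheme described above for CCI Metadata: shift the absolute value
    // left by one bit and put the sign in the lowest bit.
    // (int.MinValue is outside the supported range and is not handled here.)
    public static uint AbsoluteValuePayload(int value)
    {
        uint magnitude = (uint)Math.Abs(value);
        return (magnitude << 1) | (value < 0 ? 1u : 0u);
    }

    // Scheme described in this article: keep the significant bits of the
    // original two's complement value, shift left, then add the sign bit.
    public static uint ComplementPayload(int value)
    {
        uint u = value >= 0 ? (uint)value : (uint)~value;
        if (u > 0x0FFFFFFF) throw new NotSupportedException();
        uint mask = u <= 0x3F ? 0x3Fu : u <= 0x1FFF ? 0x1FFFu : 0x0FFFFFFFu;
        uint payload = (unchecked((uint)value) & mask) << 1;
        return value < 0 ? payload | 1 : payload;
    }
}
```

For non-negative values the two schemes agree (3 gives 06h in both). For negative values they differ: -3 gives 07h with the absolute-value scheme but 7Bh with the article's scheme, and -64 gives 81h (which needs 2 bytes once compressed as unsigned) instead of 01h.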

CCI Metadata uses the same algorithm as Expert .NET 2.0 IL Assembler, which is also problematic (see the Revision of Bibliographies section later).

Version 2.0.49.23471 of CCI Metadata was researched.

Implementations Not Researched

Some other implementations are not covered in this article, such as:

System.Reflection/System.Reflection.Emit
Shared Source CLI (Rotor)

Revision of Bibliographies

Expert .NET 2.0 IL Assembler

This book describes the compression algorithm in Chapter 8, in the paragraph after Table 8-4 (the first paragraph on P150). The description is incorrect; for the correct description, see the Compression Algorithm for Signed Integer section in this article.

ECMA-335 -- Common Language Infrastructure (CLI) 4th Edition

The ECMA-335 standard doesn't distinguish between the terms compressed unsigned integer and compressed signed integer; both are collectively called compressed integer.

Section 23.2 Blobs and signatures in ECMA-335 Partition II: Metadata Definition and Semantics defines the compression algorithm for compressed integers (P153), which is in fact the compressed unsigned integer and is correct when applied to unsigned integers.

Section 23.2.13 ArrayShape in ECMA-335 Partition II: Metadata Definition and Semantics defines the array shape used in metadata signatures (P161), where both the Size element and the LoBound element are called compressed integers, which is incorrect.

The proposed revision is to introduce the term compressed unsigned integer for every compressed integer other than LoBound in ArrayShape, to introduce the term compressed signed integer for LoBound in ArrayShape, and to describe the signed integer compression algorithm according to the Compression Algorithm for Signed Integer section.

(End)
