Contents
- Unsigned integer compression algorithm
- Unsigned integer decompression algorithm
- Signed integer compression algorithm
- Signed integer decompression algorithm
- ILASM/ILDASM
- Mono Cecil
- CCI Metadata
- Other implementations not yet studied
- Expert .NET 2.0 IL Assembler
- ECMA-335 - Common Language Infrastructure (CLI) 4th Edition
Compressed integers in .NET/CLI metadata
Address: http://www.cnblogs.com/AndersLiu/archive/2010/02/09/compressed-integer-in-metadata.html
Author: Anders Liu
Abstract: .NET/CLI PE files make extensive use of an integer compression algorithm that stores a 32-bit integer in 1, 2, or 4 bytes depending on its magnitude. When the values are small, this effectively reduces the size of the PE file. This article describes the algorithm and provides reference implementations of compression and decompression.
References
- ECMA-335 -- Common Language Infrastructure (CLI) 4th Edition, June 2006
- Expert .NET 2.0 IL Assembler, Serge Lidin, Apress, 2006
- .NET Exploration: The Authoritative Guide to MSIL (Chinese edition of Expert .NET 2.0 IL Assembler), Serge Lidin, translated by Bao Jianqiang, Posts & Telecom Press, 2009
Introduction
Put simply, the integer compression algorithm stores a 32-bit integer (which normally occupies 4 bytes) in the smallest possible space (1, 2, or 4 bytes).
Compressed integers are used throughout .NET/CLI PE files, for example in the various metadata signatures and in the #Blob and #US streams. In these places an integer is needed to record a number of entries or the size of a block of data. Because the vast majority of such counts and sizes are small, storing them as plain 32-bit integers would fill the file with meaningless zero bytes. In these scenarios the compression algorithm effectively saves the disk space or network bandwidth that the PE file consumes.
The following are some of the places where PE files use compressed integers:
- At the start of each entry in a blob heap (the storage format of the #Blob and #US streams), a compressed unsigned integer gives the size of the entry;
- In a method's metadata signature, the number of parameters is stored as a compressed unsigned integer;
- In an array's metadata signature, the lower bounds are stored as compressed signed integers.
Note that the compression and decompression algorithms described in this article apply to 32-bit integers. In addition, unless stated otherwise, all integers in this article are written in big-endian order (the most significant byte is on the left or at the top).
Compression and decompression of unsigned integers
Compressing an unsigned integer is relatively simple: the range of unsigned integers is divided into several sections, and a value is stored in 1, 2, or 4 bytes depending on the section it falls in. Table 1 lists the sections and the corresponding encodings.
Table 1 - Sections of unsigned integers
Section | Bytes | Mask | Binary format
[00000000h, 0000007Fh] | 1 | 80h | 0bbbbbbb
[00000080h, 00003FFFh] | 2 | C0h | 10bbbbbb bbbbbbbb
[00004000h, 1FFFFFFFh] | 4 | E0h | 110bbbbb bbbbbbbb bbbbbbbb bbbbbbbb
In Table 1:
- "Section" lists the minimum and maximum values (inclusive) of each section.
- "Bytes" lists the number of bytes occupied by the compressed value.
- "Mask" lists the mask that is applied to the first byte of the compressed value:
- If the compressed integer occupies 1 byte, ANDing (&) the first byte with the mask 80h gives 0h;
- If the compressed integer occupies 2 bytes, ANDing the first byte with the mask C0h gives 80h;
- If the compressed integer occupies 4 bytes, ANDing the first byte with the mask E0h gives C0h.
- "Binary format" lists the bit layout of the compressed result. "1" and "0" are fixed bits, while "b" marks the significant bits of the actual integer.
Table 1 makes it clear that the unsigned compression algorithm applies only to unsigned integers in [0h, 1FFFFFFFh] ([0, 536870911]); unsigned integers larger than 1FFFFFFFh cannot be compressed this way.
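As a quick way to check Table 1 against concrete values, the following minimal sketch computes the encoded size of a value and the compressed bit pattern packed into a single uint. The class and method names (CompressedUIntSizeSketch, EncodedSize, EncodedBits) are hypothetical and exist only for this illustration.

```csharp
using System;

static class CompressedUIntSizeSketch
{
    // Hypothetical helper: the number of bytes Table 1 assigns to a value.
    public static int EncodedSize(uint value)
    {
        if (value <= 0x7F) return 1;
        if (value <= 0x3FFF) return 2;
        if (value <= 0x1FFFFFFF) return 4;
        throw new NotSupportedException("Values above 1FFFFFFFh cannot be compressed.");
    }

    // Hypothetical helper: the compressed form packed into one uint, that is,
    // the value with the fixed marker bits of Table 1 set in the first byte.
    // 1-byte results use the low 8 bits, 2-byte results the low 16 bits.
    public static uint EncodedBits(uint value)
    {
        switch (EncodedSize(value))
        {
            case 1: return value;                 // 0bbbbbbb
            case 2: return value | 0x8000u;       // 10bbbbbb bbbbbbbb
            default: return value | 0xC0000000u;  // 110bbbbb bbbbbbbb bbbbbbbb bbbbbbbb
        }
    }
}
```

With this sketch, 03h stays 03h, 7Fh stays 7Fh, 80h becomes 8080h, 2E57h becomes AE57h, 3FFFh becomes BFFFh, 4000h becomes C0004000h, and 1FFFFFFFh becomes DFFFFFFFh.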
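Code 1 gives a reference implementation of the unsigned integer compression algorithm.
Code 1 - Reference implementation of the unsigned integer compression algorithm
    public static byte[] CompressUInt(uint data)
    {
        if (data <= 0x7F)
        {
            // 1 byte: 0bbbbbbb
            var bytes = new byte[1];
            bytes[0] = (byte)data;
            return bytes;
        }
        else if (data <= 0x3FFF)
        {
            // 2 bytes: 10bbbbbb bbbbbbbb
            var bytes = new byte[2];
            bytes[0] = (byte)(((data & 0xFF00) >> 8) | 0x80);
            bytes[1] = (byte)(data & 0x00FF);
            return bytes;
        }
        else if (data <= 0x1FFFFFFF)
        {
            // 4 bytes: 110bbbbb bbbbbbbb bbbbbbbb bbbbbbbb
            var bytes = new byte[4];
            bytes[0] = (byte)(((data & 0xFF000000) >> 24) | 0xC0);
            bytes[1] = (byte)((data & 0x00FF0000) >> 16);
            bytes[2] = (byte)((data & 0x0000FF00) >> 8);
            bytes[3] = (byte)(data & 0x000000FF);
            return bytes;
        }
        else
            throw new NotSupportedException();
    }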
Unsigned integer decompression algorithm
The unsigned integer decompression algorithm is also very simple, as shown below:
- If the first byte has the binary form 0bbbbbbb (ANDing it with 80h gives 0h), the integer is stored in one byte (b0), and original integer = b0.
- If the first byte has the binary form 10bbbbbb (ANDing it with C0h gives 80h), the integer is stored in two bytes (b0, b1 in order), and original integer = (b0 & 0x3F) << 8 | b1.
- If the first byte has the binary form 110bbbbb (ANDing it with E0h gives C0h), the integer is stored in four bytes (b0, b1, b2, b3 in order), and original integer = (b0 & 0x1F) << 24 | b1 << 16 | b2 << 8 | b3.
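Code 2 gives a reference implementation of the unsigned integer decompression algorithm.
Code 2 - Reference implementation of the unsigned integer decompression algorithm
    public static uint DecompressUInt(byte[] data)
    {
        if (data == null)
            throw new ArgumentNullException("data");

        if ((data[0] & 0x80) == 0 && data.Length == 1)
        {
            // 1 byte: 0bbbbbbb
            return (uint)data[0];
        }
        else if ((data[0] & 0xC0) == 0x80 && data.Length == 2)
        {
            // 2 bytes: 10bbbbbb bbbbbbbb
            return (uint)((data[0] & 0x3F) << 8 | data[1]);
        }
        else if ((data[0] & 0xE0) == 0xC0 && data.Length == 4)
        {
            // 4 bytes: 110bbbbb bbbbbbbb bbbbbbbb bbbbbbbb
            return (uint)((data[0] & 0x1F) << 24 | data[1] << 16 | data[2] << 8 | data[3]);
        }
        else
            throw new NotSupportedException();
    }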
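Code 2 expects an array that holds exactly one compressed value. Inside a heap or a signature the value usually sits in the middle of a larger buffer, so a reader has to work from an offset and report how many bytes it consumed. The following is a minimal sketch of that variation, applied to a length-prefixed blob heap entry. The names CompressedUIntReaderSketch, ReadCompressedUInt, and ReadEntry are hypothetical, and bounds checking is omitted for brevity.

```csharp
using System;

static class CompressedUIntReaderSketch
{
    // Hypothetical helper: decodes the compressed unsigned integer that starts
    // at buffer[offset] and advances offset past it.
    public static uint ReadCompressedUInt(byte[] buffer, ref int offset)
    {
        byte b0 = buffer[offset];
        if ((b0 & 0x80) == 0)            // 0bbbbbbb
        {
            offset += 1;
            return b0;
        }
        if ((b0 & 0xC0) == 0x80)         // 10bbbbbb bbbbbbbb
        {
            uint value = (uint)(((b0 & 0x3F) << 8) | buffer[offset + 1]);
            offset += 2;
            return value;
        }
        if ((b0 & 0xE0) == 0xC0)         // 110bbbbb bbbbbbbb bbbbbbbb bbbbbbbb
        {
            uint value = (uint)(((b0 & 0x1F) << 24) | (buffer[offset + 1] << 16)
                              | (buffer[offset + 2] << 8) | buffer[offset + 3]);
            offset += 4;
            return value;
        }
        throw new NotSupportedException();
    }

    // Hypothetical helper: returns the data of the entry at 'offset' in a
    // blob-style heap, where each entry is a compressed length followed by data.
    public static byte[] ReadEntry(byte[] heap, int offset)
    {
        int length = (int)ReadCompressedUInt(heap, ref offset);
        var data = new byte[length];
        Array.Copy(heap, offset, data, 0, length);
        return data;
    }
}
```

For example, given the heap bytes 00 03 AA BB CC, reading the entry at offset 1 decodes the length prefix 03 and returns the three bytes AA BB CC.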
Compression and decompression of signed integers
Compressing and decompressing signed integers is slightly more complicated because the sign bit has to be handled. Put simply, once the required number of bytes has been determined, the original integer is shifted left by one bit, the sign bit is placed in bit 0 (0 for a non-negative number, 1 for a negative number), and finally the mask is set on the first byte in the same way as for an unsigned integer.
To determine how many bytes a signed integer needs, first take the "quasi-absolute value" of the original integer, that is, invert the bits of a negative number (rather than negating it arithmetically). Then shift the quasi-absolute value left by one bit (to leave room for the sign bit) and look up the number of bytes in the sections listed in Table 1.
Alternatively, skip the one-bit left shift and look the quasi-absolute value up in the sections listed in Table 2.
Table 2 - Sections of the "quasi-absolute value" of signed integers
Section | Bytes | Valid bit mask
[00000000h, 0000003Fh] | 1 | 0000003Fh
[00000040h, 00001FFFh] | 2 | 00001FFFh
[00002000h, 0FFFFFFFh] | 4 | 0FFFFFFFh
In Table 2:
- "Section" lists the minimum and maximum values (inclusive) of each section, expressed in terms of the "quasi-absolute value" of the original integer.
- "Bytes" lists the number of bytes occupied by the compressed value.
- "Valid bit mask" lists the mask that, when ANDed (&) with the original integer, yields its significant bits. This relies on the fact that the leftmost bits of a non-negative integer are all 0 and the leftmost bits of a negative integer are all 1; in both cases they carry no information and can be omitted.
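The two ways of finding the byte count (shift left and use Table 1, or use Table 2 directly) should always agree. The following minimal sketch shows both side by side; the names SignedSizeSketch, QuasiAbs, SizeViaTable1, and SizeViaTable2 are hypothetical.

```csharp
using System;

static class SignedSizeSketch
{
    // "Quasi-absolute value": non-negative values are kept as they are, negative
    // values are bitwise-inverted (so -1 becomes 0 and -64 becomes 63).
    public static uint QuasiAbs(int value)
    {
        return value >= 0 ? (uint)value : (uint)~value;
    }

    // Approach 1: shift left by one bit, then look the result up in Table 1.
    public static int SizeViaTable1(int value)
    {
        uint shifted = QuasiAbs(value) << 1;
        if (shifted <= 0x7F) return 1;
        if (shifted <= 0x3FFF) return 2;
        if (shifted <= 0x1FFFFFFF) return 4;
        throw new NotSupportedException();
    }

    // Approach 2: no shift, look the quasi-absolute value up in Table 2.
    public static int SizeViaTable2(int value)
    {
        uint q = QuasiAbs(value);
        if (q <= 0x3F) return 1;
        if (q <= 0x1FFF) return 2;
        if (q <= 0x0FFFFFFF) return 4;
        throw new NotSupportedException();
    }
}
```

Both approaches give 1 byte for 63 and -64, 2 bytes for 64, -65, and -8192, and 4 bytes for 8192 and -8193. Note the asymmetry: because negative values are inverted rather than negated, each section holds one more negative value than positive values.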
After the significant bits have been extracted with the valid bit mask, shift them left by one bit. Then, if the original integer is negative, set bit 0 (the sign bit) to 1.
Finally, set the mask on the first byte of the compressed value. The rule is the same as for unsigned integers.
The signed integer compression algorithm applies to [0h, 0FFFFFFFh] ([0, 268435455]) for non-negative numbers and [F0000000h, FFFFFFFFh] ([-268435456, -1]) for negative numbers. Integers outside this range cannot be compressed this way.
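Code 3 gives a reference implementation of the signed integer compression algorithm.
Code 3 - Reference implementation of the signed integer compression algorithm
    public static byte[] CompressInt(int data)
    {
        // "Quasi-absolute value": used only to choose the number of bytes.
        var u = data >= 0 ? (uint)data : ~(uint)data;
        if (u <= 0x3F)
        {
            var uv = ((uint)data & 0x0000003F) << 1;
            if (data < 0) uv |= 0x01;       // sign bit in bit 0
            var bytes = new byte[1];
            bytes[0] = (byte)uv;
            return bytes;
        }
        else if (u <= 0x1FFF)
        {
            var uv = ((uint)data & 0x00001FFF) << 1;
            if (data < 0) uv |= 0x01;
            var bytes = new byte[2];
            bytes[0] = (byte)(((uv & 0xFF00) >> 8) | 0x80);
            bytes[1] = (byte)(uv & 0x00FF);
            return bytes;
        }
        else if (u <= 0x0FFFFFFF)
        {
            var uv = ((uint)data & 0x0FFFFFFF) << 1;
            if (data < 0) uv |= 0x01;
            var bytes = new byte[4];
            bytes[0] = (byte)(((uv & 0xFF000000) >> 24) | 0xC0);
            bytes[1] = (byte)((uv & 0x00FF0000) >> 16);
            bytes[2] = (byte)((uv & 0x0000FF00) >> 8);
            bytes[3] = (byte)(uv & 0x000000FF);
            return bytes;
        }
        else
            throw new NotSupportedException();
    }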
Note: the "quasi-absolute value" of the original integer is used only to determine how many bytes the compressed value occupies. Once the byte count is known, the actual compression operates on the original integer, treated as an unsigned integer.
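The note above can be tried on a few values. The following minimal sketch packs the compressed form of a signed value into a single uint so that the results are easy to read as hexadecimal numbers; the names CompressedIntBitsSketch and EncodedBits are hypothetical.

```csharp
using System;

static class CompressedIntBitsSketch
{
    // Hypothetical helper: the compressed form of a signed value packed into one
    // uint (1-byte results use the low 8 bits, 2-byte results the low 16 bits).
    public static uint EncodedBits(int value)
    {
        // The quasi-absolute value only selects the section (Table 2).
        uint q = value >= 0 ? (uint)value : (uint)~value;
        // The bits that get stored come from the original value, reinterpreted.
        uint raw = unchecked((uint)value);
        uint sign = value < 0 ? 1u : 0u;

        if (q <= 0x3Fu)
            return ((raw & 0x3Fu) << 1) | sign;
        if (q <= 0x1FFFu)
            return ((raw & 0x1FFFu) << 1) | sign | 0x8000u;
        if (q <= 0x0FFFFFFFu)
            return ((raw & 0x0FFFFFFFu) << 1) | sign | 0xC0000000u;
        throw new NotSupportedException();
    }
}
```

Under this sketch, 3 becomes 06h and -3 becomes 7Bh; 64 becomes 8080h and -64 becomes 01h; 8192 becomes C0004000h and -8192 becomes 8001h; 268435455 becomes DFFFFFFEh and -268435456 becomes C0000001h.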
Signed integer decompression algorithm
Because the compressed value of a signed integer has the same structure as that of an unsigned integer, signed decompression can be built on top of unsigned decompression.
First, decompress the value with the unsigned integer decompression algorithm. This yields a 32-bit unsigned integer, whose bit 0 (the sign bit) gives the sign of the original integer.
If the original integer is non-negative (bit 0, the sign bit, is 0), shift the decompressed unsigned integer right by one bit and cast it to a signed integer to obtain the original integer.
If the original integer is negative (bit 0, the sign bit, is 1), shift the unsigned integer right by one bit and then restore the omitted "1" bits on the left of the negative number:
- If the compressed value occupies 1 byte, OR (|) it with FFFFFFC0h;
- If the compressed value occupies 2 bytes, OR it with FFFFE000h;
- If the compressed value occupies 4 bytes, OR it with F0000000h.
Finally, cast the unsigned integer to a signed integer to obtain the original integer.
Code 4 gives a reference implementation of the signed integer decompression algorithm.
Code 4 - Reference implementation of the signed integer decompression algorithm
    public static int DecompressInt(byte[] data)
    {
        var u = DecompressUInt(data);

        // Sign bit clear: the value is non-negative.
        if ((u & 0x00000001) == 0)
            return (int)(u >> 1);

        // Sign bit set: restore the omitted leading 1 bits.
        var nb = GetCompressedIntSize(data[0]);
        uint sm;
        switch (nb)
        {
            case 1: sm = 0xFFFFFFC0; break;
            case 2: sm = 0xFFFFE000; break;
            case 4: sm = 0xF0000000; break;
            default: throw new NotSupportedException();
        }
        return (int)((u >> 1) | sm);
    }
This code calls a utility method, GetCompressedIntSize, which determines from the first byte of a compressed value how many bytes the value occupies. The method is very simple, as Code 5 shows.
Code 5 - Determining the number of bytes from the first byte of the compressed value
    public static uint GetCompressedIntSize(byte firstByte)
    {
        if ((firstByte & 0x80) == 0)
            return 1;
        else if ((firstByte & 0xC0) == 0x80)
            return 2;
        else if ((firstByte & 0xE0) == 0xC0)
            return 4;
        else
            throw new NotSupportedException();
    }
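The sign-restoring masks used for negative values can be cross-checked arithmetically: OR-ing in FFFFFFC0h, FFFFE000h, or F0000000h has the same effect as subtracting 2^6, 2^13, or 2^28 from the shifted value. The following minimal sketch compares the two; it takes the already decompressed unsigned value and the byte count as inputs, and the names SignExtensionSketch, ViaMask, and ViaSubtraction are hypothetical.

```csharp
using System;

static class SignExtensionSketch
{
    // Restores the sign by OR-ing in the mask for the given byte count.
    // 'u' is the value returned by unsigned decompression; byteCount is 1, 2, or 4.
    public static int ViaMask(uint u, int byteCount)
    {
        uint shifted = u >> 1;
        if ((u & 1u) == 0) return (int)shifted;
        uint mask = byteCount == 1 ? 0xFFFFFFC0u
                  : byteCount == 2 ? 0xFFFFE000u
                  : 0xF0000000u;
        return unchecked((int)(shifted | mask));
    }

    // Same result by arithmetic: subtract 2^6, 2^13, or 2^28 from the shifted value.
    public static int ViaSubtraction(uint u, int byteCount)
    {
        int shifted = (int)(u >> 1);
        if ((u & 1u) == 0) return shifted;
        int bias = byteCount == 1 ? 1 << 6
                 : byteCount == 2 ? 1 << 13
                 : 1 << 28;
        return shifted - bias;
    }
}
```

For example, the 1-byte value 7Bh decodes to -3 either way (3Dh is 61, and 61 - 64 = -3), the 2-byte value 8001h (unsigned value 1) decodes to -8192, and the 4-byte value C0000001h (unsigned value 1) decodes to -268435456.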
Problems in existing implementations
Compressed signed integers are rarely used in .NET/CLI metadata. As far as I know, only the array lower bounds in metadata signatures use them (which means that, in principle, .NET/CLI supports arrays with negative lower bounds). Almost every existing CLI implementation has run into some problem in this area, and the descriptions of the signed integer compression algorithm in my reference documents are also vague. Fortunately, almost no high-level language lets developers declare an array with a negative lower bound, and the CLS requires array indexes to start at 0, so these problems have little impact on real projects.
The problems I found in the implementations I studied are listed below, followed by corrections to the references.
ILASM/ILDASM
Evidently, even Microsoft was not entirely clear about the algorithm for compressing signed integers. ILASM is the only compiler I have used that accepts arrays with negative lower bounds, and it is the compiler I used most while studying this topic. ILASM handles non-negative lower bounds without any problem, but for negative lower bounds the compressed value it produces is wrong when the value is between -8192 (inclusive) and -8129 (inclusive).
In addition, the signed integer compression algorithm that ILASM implements clearly differs from the one described in this article, so it cannot cover every theoretically supported integer ([-268435456, 268435455]): when the value is less than or equal to -268427265, the compressed value is also wrong.
Because ILASM has these errors, ILDASM cannot be tested fully and accurately. However, even for the incorrect values that ILASM generates, the results ILDASM produces are consistent with those of the signed integer decompression algorithm described in this article, so there is reason to believe that ILDASM's decompression algorithm is correct. Incorrect compressed values do, however, occasionally cause ILDASM to crash.
These problems exist in ILASM versions 2.0, 3.0, and 3.5 but have been fixed in 4.0 Beta: the ILASM shipped with the .NET Framework SDK 4.0 Beta correctly compresses every theoretically acceptable negative array lower bound, and ILDASM decompresses them correctly.
Mono Cecil
A study of the Mono Cecil source code shows that its implementation follows the ECMA-335 standard very faithfully, and ECMA-335's description of the array lower bound happens to be wrong (see the section on corrections to the references below): it treats the array lower bound as a compressed unsigned integer rather than a compressed signed integer.
Mono Cecil therefore provides compression and decompression for unsigned integers only (see the Mono.Cecil.Metadata.Utilities.WriteCompressedInteger(BinaryWriter, Int32) : Int32 and Mono.Cecil.Metadata.Utilities.ReadCompressedInteger(Byte[], Int32, Int32&) : Int32 methods in the Mono.Cecil assembly). When it writes and reads metadata signatures, it also handles the array lower bound as an unsigned integer (see the Mono.Cecil.Signatures.SignatureWriter.Write(SigType) : Void and Mono.Cecil.Signatures.SignatureReader.ReadType(Byte[], Int32, Int32&) : SigType methods).
When the Mono Cecil library is used for reflection, a non-negative array lower bound is reported as twice its actual value (because the right shift required by signed decompression is missing), and a negative array lower bound is reported as a completely wrong value.
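The doubling effect follows directly from the encoding. The following minimal sketch decodes the same 1-byte compressed value both ways. It only illustrates the effect described above and is not taken from the Mono Cecil source; the names LowerBoundMisreadSketch, AsUnsigned, and AsSigned are hypothetical.

```csharp
using System;

static class LowerBoundMisreadSketch
{
    // Reads a 1-byte compressed value (0bbbbbbb) as an unsigned integer.
    public static uint AsUnsigned(byte b)
    {
        if ((b & 0x80) != 0) throw new NotSupportedException("Only 1-byte values are handled here.");
        return b;
    }

    // Reads the same byte as a compressed signed integer: shift right by one
    // bit and, if the sign bit (bit 0) is set, restore the leading 1 bits.
    public static int AsSigned(byte b)
    {
        uint u = AsUnsigned(b);
        uint shifted = u >> 1;
        if ((u & 1u) == 0) return (int)shifted;
        return unchecked((int)(shifted | 0xFFFFFFC0u));
    }
}
```

A lower bound of 3 is stored as 06h, which reads as 3 when treated as signed but as 6 when treated as unsigned. A lower bound of -3 is stored as 7Bh, which reads as 123 when treated as unsigned.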
I examined only the source code of Mono Cecil 0.6. I do not know how other versions behave; readers can check and analyze them on their own.
CCI Metadata
CCI Metadata treats the array lower bound as a signed integer, but it uses a very simple compression algorithm: it shifts the absolute value of the original integer left by one bit, places the sign in the sign bit (see the Microsoft.Cci.BinaryWriter.WriteCompressedInt(Int32) : Void method in the Microsoft.Cci.PeWriter assembly), and then compresses the result as an unsigned integer. Decompression is the reverse: it decompresses the value as an unsigned integer, determines the sign of the result from the sign bit, shifts the unsigned integer right by one bit, and applies the sign (see the Microsoft.Cci.UtilityDataStructures.MemoryReader.ReadCompressedInt32() : Int32 method in the Microsoft.Cci.PeReader assembly).
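To see how this sign-and-magnitude scheme differs from the algorithm in this article, the following minimal sketch implements the packing step as it is described in the paragraph above. It is written from that description only and is not CCI Metadata's actual code; the names SignMagnitudeSketch, Pack, and Unpack are hypothetical.

```csharp
using System;

static class SignMagnitudeSketch
{
    // Sign-and-magnitude packing: absolute value shifted left by one bit, with
    // the sign in bit 0. The result would then be compressed as an unsigned
    // integer. Math.Abs throws for int.MinValue, which is out of range anyway.
    public static uint Pack(int value)
    {
        uint magnitude = (uint)Math.Abs(value);
        return (magnitude << 1) | (value < 0 ? 1u : 0u);
    }

    // The reverse: take the sign from bit 0, shift right, apply the sign.
    public static int Unpack(uint packed)
    {
        int magnitude = (int)(packed >> 1);
        return (packed & 1u) != 0 ? -magnitude : magnitude;
    }
}
```

Non-negative values come out the same under both schemes (3 packs to 06h), but negative values do not: -3 packs to 07h here, whereas the algorithm described in this article produces 7Bh.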
The algorithm that CCI Metadata uses is consistent with the description in Expert .NET 2.0 IL Assembler, but the description in that book is also incorrect (see the section on corrections to the references below).
The CCI Metadata version I examined is 2.0.49.23471.
Other implementations not yet studied
Some .NET/CLI implementations have not been studied yet, for example:
- System.Reflection / System.Reflection.Emit
- Shared Source CLI (Rotor)
Corrections to the references
Expert .NET 2.0 IL Assembler
This book describes the compression algorithm for signed integers in a single paragraph after Chapter 8-4 (the first paragraph on p. 150). That description is incorrect; for a correct description, see the section "Compression and decompression of signed integers" in this article.
Unfortunately, the Chinese edition of the book, .NET Exploration: The Authoritative Guide to MSIL, does not correct this problem (it is in the same paragraph after Chapter 8-4, on p. 132). When Bao Jianqiang was translating the book, I mentioned the problem to him, but at that time I had not yet fully worked out the correct compression algorithm, so he had no choice but to translate the passage as it stood.
ECMA-335-Common Language Infrastructure (CLI) 4th Edition
The ECMA-335 standard does not distinguish between compressed unsigned integers and compressed signed integers; it refers to both simply as "compressed integers".
Section 23.2, Blobs and signatures, of ECMA-335 Partition II: Metadata Definition and Semantics gives the compression algorithm for compressed integers (p. 153). This is in fact the unsigned integer compression algorithm, and it is correct.
Section 23.2.13, ArrayShape, of ECMA-335 Partition II: Metadata Definition and Semantics gives the representation of arrays in metadata signatures (p. 161), where both Size and LoBound are described as "compressed integers". This is inaccurate.
The correction is to introduce two terms: "compressed unsigned integer", which replaces "compressed integer" everywhere else, and "compressed signed integer", which describes the LoBound of an array. A description of the compression algorithm for signed integers then needs to be added.
(End)