Anatomy of SQL Server, Part 13: The integer storage format under row and page compression (translated)
http://improve.dk/the-anatomy-of-row-amp-page-compressed-integers/
While adding row compression support to OrcaMDF, I ran into a number of challenges when parsing integers.
Unlike normal uncompressed integer storage, row-compressed integers are variable length, meaning an int with a value of 50 takes up just 1 byte instead of the usual 4.
Variable-length storage in itself is not new; vardecimal, for instance, already stores decimals as variable length. What differs is the way the data is stored on disk.
Note that although I have only implemented row compression so far, page compression uses exactly the same integer storage format, so there is no difference.
For a detailed explanation of row compression and page compression, see the "SQL Server 2008 Internals" notes.
Tinyint
Tinyint (integer data from 0 to 255, storage size 1 byte) is stored basically the same whether compressed or uncompressed, with a single exception: when the value is 0 and row compression is enabled, it does not take up any bytes at all.
Uncompressed, a 0 would be stored as 0x0 and consume one byte. All of the integer types (tinyint, smallint, int, bigint) treat 0 the same way: the value is described purely by the compressed row metadata, and no value bytes are stored.
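As a trivial sketch of what this means for a parser (my illustration, not code from the post), a zero-length value simply decodes to 0, while a one-byte tinyint is the byte itself:

// A row-compressed tinyint 0 stores no value bytes at all.
byte[] zero = new byte[0];
Console.WriteLine(zero.Length == 0 ? 0 : zero[0]); // 0

// Tinyint 50 is stored as-is in a single byte.
byte[] fifty = { 0x32 };
Console.WriteLine(fifty.Length == 0 ? 0 : fifty[0]); // 50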
Smallint
Let's start by looking at how normal, uncompressed smallint values are stored, using the values -2, -1, 1 and 2 (I'll skip 0 since, as mentioned, it stores nothing once compressed). Note that the values below show exactly how they are stored on disk, in little endian byte order:

-2 = 0xFEFF
-1 = 0xFFFF
 1 = 0x0100
 2 = 0x0200
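To make the byte order concrete, here is a minimal sketch (mine, not from the original post) that decodes the little endian bytes above back into smallint values; BitConverter uses the platform byte order, which is little endian on the Intel/AMD systems SQL Server runs on:

// On-disk bytes for -2 (shown above as 0xFEFF) and 1 (0x0100).
byte[] minusTwo = { 0xFE, 0xFF };
byte[] one = { 0x01, 0x00 };

Console.WriteLine(BitConverter.ToInt16(minusTwo, 0)); // -2
Console.WriteLine(BitConverter.ToInt16(one, 0));      // 1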
The two positive values are straightforward to convert to decimal, and they match the actual values we are after. The negative ones are a bit different though: converting -1 = 0xFFFF to decimal gives 65,535, the largest unsigned value we can store in 2 bytes.
Yet the range of a SQL Server smallint is -32,768 to 32,767.
Calculating the actual value relies on integer overflow. Take a look at the following C# code snippet:
unchecked
{
    Console.WriteLine(0 + (short)32767);
    Console.WriteLine(0 + (short)32768);
    Console.WriteLine(0 + (short)32769);
    // ...
    Console.WriteLine(0 + (short)65534);
    Console.WriteLine(0 + (short)65535);
}
The output is as follows:
32767
-32768
-32767
...
-2
-1
Calculating 0 + the maximum signed short value of 32,767 obviously just yields 32,767, since that is within the range of a short.
However, if we calculate 0 + 32,768, we exceed the range of a short; the value wraps around, flipping the most significant bit, and without an overflow exception we end up at -32,768.
Since these numbers are constants, the compiler will not normally allow the overflow, which is why the code is wrapped in an unchecked {} block.
You may have heard of the magical sign bit. Basically, the most significant bit is used to indicate whether a number is positive or negative.
From the example above, though, it should be obvious that the sign bit is not that special; the sign of a given number simply falls out of the overflow behavior. Look at what happens to the sign bit as the values overflow:

 32767 = 0b0111111111111111
-32768 = 0b1000000000000000
-32767 = 0b1000000000000001

For numbers large enough to overflow, the most significant bit, the "sign bit", ends up being set. There is nothing magical about it; it is simply a consequence of the overflow.
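A quick way to see this for yourself (my own sketch, not part of the original post) is to print the two's complement bit patterns with Convert.ToString:

// Convert.ToString(value, 2) prints the two's complement bit pattern.
Console.WriteLine(Convert.ToString((short)32767, 2));  // 111111111111111 (15 bits; the leading 0 is omitted)
Console.WriteLine(Convert.ToString((short)-32768, 2)); // 1000000000000000
Console.WriteLine(Convert.ToString((short)-32767, 2)); // 1000000000000001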
That covers the necessary background on how regular uncompressed integers are stored. Now let's take a look at how the same smallint values are stored in a row-compressed table:
-2 = 0x7E
-1 = 0x7F
 1 = 0x81
 2 = 0x82
Let's try converting these values to decimal using the same scheme as before:

-2 = 0x7E = 126
-1 = 0x7F = 127
 1 = 0x81 = 129
 2 = 0x82 = 130
Obviously, these values are stored in a different way. The most striking difference is that we now use only a single byte, since the storage has become variable length. When parsing these values, we need to look at the number of bytes actually stored. If only one byte is used, we know it represents a value from 0 to 255 (for tinyint) or from -128 to 127 (for smallint); any smallint value within -128 to 127 is stored in a single byte.
If we used the same scheme as before, we would obviously get the wrong result: 1 <> 0 + 129. The trick is to read the stored value as an unsigned integer and then add the minimum value of the range, for the given length, as an offset.
So instead of using zero as the base of the offset, we use the minimum signed value of a single byte, -128, as the offset:
-2 = 0x7E = -128 + 126
-1 = 0x7F = -128 + 127
 1 = 0x81 = -128 + 129
 2 = 0x82 = -128 + 130
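As a minimal sketch (my illustration, not the OrcaMDF source), decoding a single-byte row-compressed smallint then looks like this:

// Read the stored byte as an unsigned value (0-255), then apply
// the single-byte offset of -128 to get the actual value.
static short DecodeOneByteSmallint(byte stored) => (short)(-128 + stored);

Console.WriteLine(DecodeOneByteSmallint(0x7E)); // -2
Console.WriteLine(DecodeOneByteSmallint(0x82)); // 2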
This means that once we exceed the range of a signed single byte, we just need two bytes to store the value, right? Almost, but there is one very important difference: whereas uncompressed values are always stored in little endian, integer values stored using row compression use big endian!
So besides the per-length offset values, the two formats also differ in byte order. The end result is the same, but the way the value is calculated differs considerably.
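For example (my own sketch following the scheme above; the sample bytes are derived from it rather than dumped from a database), a two-byte row-compressed smallint is decoded by reversing the big endian bytes and applying the two-byte offset of -32,768:

// Hypothetical storage of the smallint value 1000:
// 1000 + 32768 = 33768 = 0x83E8, stored big endian as 0x83, 0xE8.
byte[] stored = { 0x83, 0xE8 };

// Reverse into little endian order for BitConverter, read as unsigned, then offset.
ushort unsignedValue = BitConverter.ToUInt16(new[] { stored[1], stored[0] }, 0);
Console.WriteLine(-32768 + unsignedValue); // 1000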
Int and bigint
Once I had figured out the byte order and offset scheme of row-compressed integer values, implementing int and bigint was simple. Just like the other types, they are variable length, so you may well run into a 5-byte bigint value or a 1-byte int value. The following is the core parsing code from the SqlBigInt type:
switch (value.Length)
{
    case 0:
        return 0;
    case 1:
        return (long)(-128 + value[0]);
    case 2:
        return (long)(-32768 + BitConverter.ToUInt16(new[] { value[1], value[0] }, 0));
    case 3:
        return (long)(-8388608 + BitConverter.ToUInt32(new byte[] { value[2], value[1], value[0], 0 }, 0));
    case 4:
        return (long)(-2147483648 + BitConverter.ToUInt32(new[] { value[3], value[2], value[1], value[0] }, 0));
    case 5:
        return (long)(-549755813888 + BitConverter.ToInt64(new byte[] { value[4], value[3], value[2], value[1], value[0], 0, 0, 0 }, 0));
    case 6:
        return (long)(-140737488355328 + BitConverter.ToInt64(new byte[] { value[5], value[4], value[3], value[2], value[1], value[0], 0, 0 }, 0));
    case 7:
        return (long)(-36028797018963968 + BitConverter.ToInt64(new byte[] { value[6], value[5], value[4], value[3], value[2], value[1], value[0], 0 }, 0));
    case 8:
        return (long)(-9223372036854775808 + BitConverter.ToInt64(new[] { value[7], value[6], value[5], value[4], value[3], value[2], value[1], value[0] }, 0));
    default:
        throw new ArgumentException("Invalid value length: " + value.Length);
}
The variable-length value comes in as a byte array containing the bytes as stored on disk. If the length is 0, nothing was stored, and we know the value is 0.
For each of the remaining valid lengths, we simply use the minimum value representable at that length as the offset and add the stored value to it.
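To make the pattern explicit, here is a condensed, loop-based equivalent plus a couple of sample calls (my own sketch; the method name is illustrative and not necessarily what OrcaMDF uses):

static long ParseRowCompressedBigint(byte[] value)
{
    // Nothing stored means the value is 0.
    if (value.Length == 0)
        return 0;

    // The offset is the minimum signed value representable in
    // value.Length bytes: -2^(8 * Length - 1).
    long offset = -(1L << (8 * value.Length - 1));

    // Accumulate the big endian bytes as an unsigned number.
    long unsignedValue = 0;
    foreach (byte b in value)
        unsignedValue = (unsignedValue << 8) | b;

    // For 8-byte values this addition deliberately overflows, as explained below.
    return offset + unsignedValue;
}

Console.WriteLine(ParseRowCompressedBigint(new byte[0]));               // 0
Console.WriteLine(ParseRowCompressedBigint(new byte[] { 0x82 }));       // 2
Console.WriteLine(ParseRowCompressedBigint(new byte[] { 0x83, 0xE8 })); // 1000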
For non-compressed values we can use the BitConverter class to convert the input bytes directly into the expected value, using the system byte order, which is little endian on the Intel and AMD systems SQL Server runs on (meaning OrcaMDF will not run on a big endian system). For the compressed values, however, since they are stored in big endian, I have to reverse the input array into little endian order and zero-pad it at the end to match the size of a short, int or long.
For the shorts and ints I read the values as unsigned, since that is what I am after, and casting a ushort or uint onto a long base value works fine. I cannot do the same for the 8-byte case, as there is no data type larger than long to cast to. The maximum bigint value of 9,223,372,036,854,775,807 is actually stored as 0xFFFFFFFFFFFFFFFF on disk. Parsing that as a signed long using BitConverter yields -1 due to overflow, but adding a second negative overflow makes the result come out right:
-9,223,372,036,854,775,808 + (0xFFFFFFFFFFFFFFFF => -1) = 9,223,372,036,854,775,807
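A short sketch of my own showing this double overflow in action:

// 0xFFFFFFFFFFFFFFFF parsed as a signed long overflows to -1.
long parsed = BitConverter.ToInt64(new byte[] { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF }, 0);

// Adding it to long.MinValue overflows again, wrapping around to long.MaxValue.
Console.WriteLine(unchecked(long.MinValue + parsed)); // 9223372036854775807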
Conclusion
As usual, I had a lot of fun deducing the storage format by executing SELECT statements and looking at how the values end up as bytes on disk.
It did not take long to implement, with the Internals book serving as a guide, but there is still plenty left to dig into.
End of Part 13