Today, my colleague asked me a question about SQL Server's internal storage structure, so I wrote this blog and summarized it. The problem is as follows:
First, we create two tables. The column length of one table is 4039 bytes, and the length of the other table is 4040 bytes. They only have one byte gap, for example, the SQL statement used to create the table below:
Create table tb4039 (c1 int identity, c2 char (4035) not null)
Create table tb4040 (c1 int identity, c2 char (4036) not null)
Since the INT type is 4 bytes, The tb4039 table we created has a length of 4 + 4035 = 4039 bytes. The c2 field in tb4040 has one more byte than the c2 field in tb4039, the total length is 4040 bytes. Next we will insert data into these two tables. For example, if 100 data records are inserted, the SQL statement is:
DECLARE @ I INT
SET @ I = 1
WHILE I <= 100
BEGIN
Insert into tb4039 (c2) VALUES ('test' + CONVERT (VARCHAR (5), @ I ));
Insert into tb4040 (c2) VALUES ('test' + CONVERT (VARCHAR (5), @ I ));
SET @ I = @ I + 1
END
Well, now we can use SSMS to check the space usage of the two tables. If it is SQL2005, you can use the Report provided by SSMS to view the space usage. If it is SQL2008, you can directly view the table space usage on the details page of the object Resource Manager (if you are using SQL2008 but do not know how to view the table space usage, please refer to a blog I wrote earlier: SQL Server 2008 New Feature-enhanced SSMS ). I am using SQL2008 here and can see the situation
Of course, we can also use the T-SQL to query the view, get the space occupied by the data of the two tables, the query code is:
SELECT OBJECT_NAME (I. object_id) AS TableName, data_pages * 8 AS DataSize -- here the number of data pages is returned, and 1 page is 8 K, So multiply by 8
FROM sys. indexes as I
JOIN sys. partitions as p ON p. object_id = I. object_id and p. index_id = I. index_id
JOIN sys. allocation_units as a ON a. container_id = p. partition_id
Where I. object_id = OBJECT_ID ('tb4039') OR I. object_id = OBJECT_ID ('tb404040 ')
System return result:
TableName DataSize
Tb4039 400
Tb4040 800
Similar to the results we can see through reports or SSMS, the two tables only differ by one byte, but one occupies KB of storage space and the other occupies KB of storage space, double the value of another table !!!
The gap between one byte leads to a doubling of the storage space. Why? This should begin with the SQL Server storage structure.
---------------------------------------------- The gorgeous split line. Enter the topic -----------------------------------------------------------------
The smallest storage unit of SQL Server is Page, and the size of a Page is 8 K = 8192 bytes. A data page consists of three parts: the page header, data row, and row offset matrix.
The page header stores the basic information about the page, such as the page number, the previous page ID, the next page ID, and the number of bytes. The size of the page header is fixed to 96 bytes, so the remaining 8192-96 = 8096 bytes are used to store the data row and row offset matrix.
The row offset matrix is arranged in reverse order at the end of the page. Two bytes are used to indicate the offset of the Data row inside the page, if one row of data exists, the size of the line offset matrix is 2 bytes. If there are two rows of data, the size of the line offset matrix is 4 bytes, and so on.
In addition to the space occupied by the page header and the space occupied by the row offset matrix, the remaining space is used for data rows. Suppose we want to save two rows of data in one page, the two rows of data can use 8096-4 = 8092 bytes of space, that is to say, one row of data can use 8092/2 = 4046 bytes of space. The 4046 bytes are not completely used to store data rows. Other information exists in a data row to indicate the data of the row. The specific structure is as follows:
Status bit |
Status BIT B |
Set the length of the Data Type |
Content of fixed-length data |
Number of Columns |
NULL bitmap |
Number of variable-length columns |
Offset matrix of variable-length columns |
Variable-Length Column data |
1 byte |
1 byte |
2 bytes |
Specific fixed-length data bytes |
2 bytes
|
Number of columns/8 bytes |
2 bytes |
Number of variable-length columns * 2 bytes |
Variable-length data bytes |
No matter how simple the table is, a row of Data occupies at least 1 + 1 + 2 + 2 + 1 = 7 bytes in addition to the space occupied by the data. If many data columns are defined, or there are variable-length data columns, more space may be occupied.
Now let's go back to the two tables tb4039 and tb4040 we mentioned earlier. to store a row of data in tb4039, we need 1 + 1 + 2 + 4039 + 2 + 1 = 4046 bytes, therefore, two rows of data can be saved on a page. Therefore, 100 rows of data are inserted and saved in 50 data pages. The size is 8 K * 50 = 400 K. For tb4040 tables, it takes 4047 bytes to store a row of data, and two rows of data cannot be saved on one page. Therefore, one row of Data occupies one data page, 100 rows of Data occupy 100 data pages. The size is 8 K * 100 = 800 K.
------------------------------------------ After a bunch of addition, subtraction, multiplication, division, the following summary --------------------------------------------
Here is just an extreme example. As a result, a byte deviation is caused to double the occupied storage space. In actual applications, this extreme situation is rare, however, it is very likely that a page stores five data records. Because a column has one or two more bytes, only four data records can be stored. You may think that there is nothing to save one piece of data, but after the data volume is very large, there will be significant performance differences between 4 pieces of data and 5 pieces of data on one page. The reason for storing more data on a page is not to save storage costs, but the hard disk is already very cheap. Many servers use several hundred GB hard disks, 5G data is now 10G longer, which is nothing compared to a hard disk with several hundred G data.
In fact, we want to store more rows of data on a data page mainly for performance considerations. The basic unit for SQL Server to perform database read/write operations is a page. If more data is stored on a page, I/O operations will be reduced when the table is scanned and searched, after all, IO operations consume a lot of time and affect performance. If tb4039 contains million data records, a full table scan will read data pages. If it takes 1 second to read data pages, it will take 5 seconds to scan tb4039, if tb4040 is used to store the 100W data records, it takes 10 seconds to read data pages for full table scanning. The difference between one byte, one is 5 seconds and the other is 10 seconds, which has a significant impact on performance.
To improve the performance of database queries, the following suggestions can be followed during table design:
- The primary key should be as short as possible. If you can use tinyint, do not use int. If you can use char (5), do not use varchar (50 ).
- If the length of the table column is calculated and five data records can be stored on one page, do not set the field length so that only three or four data records can be stored on one page.
- Set the field as far as possible to not allow NULL, because the system needs to handle the NULL value during storage and data processing to reduce performance.
- If you can use a fixed length, do not use a variable-length field. For example, you can use the ID card number CHAR (18) instead of VARCHAR (18 ).
- Do not create too many columns in a table. If there are too many attributes of an object, you can consider Vertical Split. place common fields in one table, and other tables for infrequently used fields, this reduces the space occupied by data columns in common field tables, allowing more data rows to be stored on a data page.
- Do not place large objects, long strings, and common fields in the same table. For performance reasons, for example, a product table contains the product ID, product name, product price, product image, product description, and other fields, then, we can place the frequently used columns (product ID, product name, and product price) that occupy a small amount of space in a table and create a table such as product ID, product image, and product description, you can place big object data and long string data in another table by using foreign key constraints.