Parsing the SQL Server Kernel Architecture


We build management software, and its core is data storage and management, so database design is our top priority. To make our management software stable, scalable, high-performing, easy to troubleshoot, easy to upgrade and deploy, and pluggable, we often build our own development platform for it. We are always eager to study other people's platforms (UF, Kingdee, SAP), and we always lament how many business-processing details management software must handle, while a database makes do with just four commands: SELECT, INSERT, UPDATE, and DELETE. We hope that one day we can build an architecture that makes management software just as simple. Yet we tend to stare at other people's architectures while forgetting the architecture of the database we know best, sitting right beside us. So today I would like to walk you through the structure of a database and explore the architectural ideas behind it. Since SQL Server is the only database product I know well, I will use SQL Server for the analysis.

Before discussing the internal principles of SQL Server, I think it is worth introducing a little SQL Server history.
Let's stand in 1999 and see where the database industry actually stood.

In September 1998, Oracle launched Oracle 8i (it probably reached China in 1999). Oracle 8i supported writing stored procedures in Java, supported XML, and ran on Linux.
SQL Server 7 was officially released in January 1999. It rebuilt the entire database engine (effectively a rewrite of SQL Server) and supported row locks for the first time. (No, that is not a typo; think about what people put up with in earlier database products. Oracle 6 supported row locks back in 1988, the same year Oracle began developing ERP products. Whoever calls Oracle an ERP layman might want to remember that.)

Now look at the versions before those two. If you entered the field late (after 2000), the following may surprise you even more.

In 1992, Oracle 7 was released, with stored procedures, triggers, referential integrity checks, and distributed transaction processing. (Good grief: Oracle 7 already had all of that.)
In 1995, SQL Server 6 was released. SQL Server 6 was Microsoft's first real database product (genuinely surprising material: few of us realized it, or wondered where versions 5, 4, 3, 2, and 1 had gone). The reason is that in 1994 Microsoft and Sybase split (Sybase made the first client/server database product to run on a PC). Microsoft, with no experience in the database field, had partnered with Sybase (Microsoft was then the world's largest software company, having gone public in 1986). Sybase had a product and lacked money; Microsoft lacked a product and had money, so they hit it off. By 1994 Microsoft no longer needed Sybase (it had absorbed the database technology), and Sybase felt Microsoft was too ambitious, so the partnership broke up and Microsoft began building its own database.

That's it for the history. Now let's take a closer look.
Many entry-level management software developers play SQL with real proficiency: from subqueries to HAVING to cross-table statistics, they can write it all; their stored procedures run past 2,000 lines; cursors, user-defined functions, triggers, and constraints are deployed dazzlingly. In the query tool they can optimize indexes with Query Analyzer, trace statements with SQL Profiler, watch SQL Server's memory, CPU, thread, and I/O behavior in the performance monitor, and even feel smug about using DBCC.

You are that familiar with SQL Server, and yet that unfamiliar with it.

Today I want to give you an architectural perspective for analyzing SQL Server's structure and principles. A single blog post can only go so deep; a thorough treatment would need a series of articles or even a brick of a book. Still, I hope this post gives you a chance to see SQL Server from an angle you had not considered before.

As a database product, SQL Server consists, in my personal view, of two most important chunks: the storage engine and the query engine.

Everything else (logs, transactions, locks, indexes, and so on) works around those two.

SQL Server is a client/server product, so a SQL statement must travel to the SQL Server server before it can execute. For transmission we obviously need network protocols such as NetBEUI or TCP/IP, but those alone are not enough. How does the client send? How does the server receive? How do both sides confirm the data was delivered correctly and completely, and that a send or receive has truly finished? How do you send and receive across different network protocols (say, communication among UNIX, Windows, and Novell)? How do you guarantee data integrity checks, and decide whether transmission is synchronous or asynchronous? You need to construct a protocol layer on top of the raw network transport. SQL Server supports both IPC and RPC mechanisms for this. Ask yourself whether your own management-software platform has this layer. Of course, nowadays dedicated message servers provide this professionally: reliable, secure, efficient, asynchronous message transmission with compression, splitting, intelligent routing, and clustering, across different operating systems, programming languages, communication protocols, and hardware platforms. If you never knew what message middleware was for, this layer illustrates its purpose.

Once the SQL statement has been reliably delivered to the server side, the first module in the SQL Server engine receives it. That module is named Open Data Services. It listens for new connections, cleans up failed connections, and returns result sets, messages, and status to the client.

Data traveling between SQL Server clients and servers is carried in formatted packets, known in SQL Server as the Tabular Data Stream (TDS). The stream is token-controlled to keep the client/server dialog straight (otherwise, with the client saying n things and the server returning n answers, nothing would tell you which answer belongs to which request). We normally cannot deal with Open Data Services directly; instead we send the tabular data stream through ODBC, ADO, or DB-Library, and the results SQL Server returns come back as a tabular data stream through the same interfaces. Notice how ingenious the design is: a generic data-access interface stands between you and SQL Server, just as the Windows API masks the kernel so you cannot reach it directly, and just as DirectX masks the UI and peripheral control.

The full round trip: SQL statement → ODBC → encoded into a tabular data stream → IPC or RPC → network protocol → IPC or RPC → tabular data stream decoded → Open Data Services.

Open Data Services monitors client connections. Under heavy concurrency it creates new connections; when a service finishes, it returns the connection to a pool, keeps it there for a lifetime, and releases it when that lifetime expires. If a client connection drops suddenly (say the client machine reboots) and stops responding to its listener, Open Data Services cleans up that connection too. The connections we see among SQL Server's threads are created by Open Data Services.
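
On modern versions of SQL Server (2005 and later), you can watch this connection machinery from the outside through a dynamic management view. A minimal sketch; the view and columns are real, but what you see depends on your workload:

    -- List the connections the server is currently tracking,
    -- including the transport each client negotiated.
    SELECT session_id,
           net_transport,   -- TCP, Named pipe, Shared memory, ...
           protocol_type,   -- TDS for ordinary clients
           auth_scheme,
           connect_time
    FROM   sys.dm_exec_connections;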

Once Open Data Services has a connection (possibly fetched from the pool; pooling, creating, and destroying connections takes real skill: how many to pool, how much context to retain, how long to keep them, when to destroy them, and how badly poor scheduling wastes resources), it catches the incoming SQL. At that point the statement sits in Open Data Services' read buffer. This buffer gives the high-throughput SQL Server a moment of breathing room while it processes data, and that breathing room gives it the capacity to get the work done (does your own design have this?). Open Data Services also has a write buffer: as SQL Server produces result data it immediately drops it into the write buffer, and as soon as the buffer fills, Open Data Services sends it out. When I first studied SQL Server internals, I marveled that such a small peripheral module was so delicately designed. We chase massive storage and caching architectures while ignoring the SQL Server at our fingertips.

Once a SQL statement lands in the read buffer, the SQL Server relational engine begins its work; it is always listening on that buffer.

The first relational-engine module a SQL statement meets is the command parser. The query-analysis output we see in SQL Query Analyzer is its handiwork. Its job is to construct a query tree. The first step is to normalize your SQL statement (think of the software you write: processing input directly without even validating it; how could that be stable?), otherwise the following steps would have nothing sound to work on. If your SQL statement has a syntax error, construction of the query tree fails and processing stops there. To normalize a statement, the parser draws on the library of syntax and functions that SQL Server supports.
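
You can ask the parser to perform just this first step. A small sketch (the Orders table is hypothetical); SET PARSEONLY makes SQL Server syntax-check a batch without optimizing or executing it:

    -- Syntax check only: handled by the command parser, never optimized or run.
    SET PARSEONLY ON;
    SELECT CustomerID, SUM(Amount)
    FROM   Orders
    GROUP  BY CustomerID;
    SET PARSEONLY OFF;

SET NOEXEC ON goes one step further: the batch is parsed and compiled (so object names are resolved), but still not executed.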

Once the tree is built, the second relational-engine module, the query optimizer, prunes it. One SQL statement can yield many execution and optimization plans (if you have used a SQL tuning tool, you will understand), and SQL Server chooses the plan that best conserves memory, CPU utilization, and I/O count (I/O is the deadliest area of performance optimization; the bottleneck usually sits there). The optimizer keeps statistics on each table (for performance tuning you sometimes have to refresh those statistics in step with the data, or the optimizer will misjudge). Based on the query tree it chooses the appropriate indexes (if using an index would cost too much, it automatically falls back to a full table scan), works out which table data to fetch, and then how to merge that data in memory into the result you asked for. Sometimes I think the optimizer really is remarkable: one SQL statement arrives and it must do a great deal in a very short time. To settle on a relatively good plan quickly, it cannot exhaustively enumerate every possible plan; that is why, when doing heavy tuning, we often evaluate several plans ourselves and rewrite our SQL to steer it toward the best one.
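
Both the statistics and the optimizer's chosen plan can be inspected directly. A sketch, assuming a hypothetical dbo.Orders table with an index named IX_Orders_CustomerID (the SET SHOWPLAN_TEXT statements must sit alone in their batches, hence the GO separators):

    -- Refresh the statistics the optimizer relies on.
    UPDATE STATISTICS dbo.Orders;

    -- Look at the histogram behind one index.
    DBCC SHOW_STATISTICS ('dbo.Orders', IX_Orders_CustomerID);

    -- Ask for the estimated plan instead of executing the query.
    SET SHOWPLAN_TEXT ON;
    GO
    SELECT * FROM dbo.Orders WHERE CustomerID = 42;
    GO
    SET SHOWPLAN_TEXT OFF;
    GO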


Normalizing and optimizing a SQL statement yields an execution plan, and the SQL manager is responsible for turning that plan into work. The statement you sent might be a SELECT, an INSERT, or an UPDATE. Even a SELECT is subject to user permissions (you will understand if you have ever set object and column permissions for a SQL Server user). Statements like INSERT and UPDATE additionally involve permissions, default values, constraints, expressions, primary and foreign keys, and triggers. Turning one optimized SQL statement into concrete reads or updates of data in memory or on disk takes many detailed steps.
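
The permission checks mentioned above go down to individual columns. A small sketch with hypothetical table and user names:

    -- Column-level permissions of the kind the execution path must honor.
    GRANT SELECT ON dbo.Employees (EmployeeID, Name) TO ReportingUser;
    DENY  SELECT ON dbo.Employees (Salary)           TO ReportingUser;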

The query executor is responsible for executing the plan. Because execution touches transactions, locks, waits, CPU scheduling, page faults, and I/O, the executor coordinates many other modules rather than arranging everything itself: the transaction manager, lock manager, index manager, page file manager, buffer manager, row manager, and log manager each handle their own part.
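
Part of the executor's work is visible per statement. A sketch (table name hypothetical); the SET options are standard T-SQL:

    -- Report I/O (logical/physical reads) and CPU time for each statement.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    SELECT COUNT(*) FROM dbo.Orders;
    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;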


The query executor is the last module of the query engine; from here on, the modules belong to the storage engine. So from the query engine's perspective, the essentials are: construct a SQL query tree, optimize and prune it, generate an execution plan from it, and then coordinate the plan's execution and return the results.

If you really want to get the data out or put it in, you need the storage engine to work.

First, the execution plan determines which data pages and index pages must be accessed. That is the job of the Access Methods Manager. But actually opening those pages is not something the access methods manager does with its own hands.

The module that does it with its own hands is called the Buffer Manager. Data sitting on disk cannot be computed on; it must be in memory before the CPU can work with it. So when those data and index pages need to be touched, the buffer manager is notified. If the data is not yet in memory, the buffer manager reads it in; if it already is, the buffer manager simply returns it. The whole process is masked by the buffer manager and is transparent to the access methods manager. But don't conclude the access methods manager does nothing beyond issuing scheduling commands; that would be wrong. For SQL Server to guarantee high-speed processing, it must anticipate which data pages and index pages will be needed. It will not do to wait until the buffer manager finishes one batch before the access methods manager computes which pages come next. And remember, these managers do not serve a single user: with a hundred-plus concurrent users sending all kinds of requests, how do you predict which data and index pages will be needed? That requires unified scheduling.

That unified scheduling also affects the buffer manager. You cannot fire one huge request at it and have it frantically enlarge the buffer, then load the data, while the pipeline stalls. The buffer manager must know beforehand that a large load is coming, so a separate thread extends the buffer in parallel. Growing the buffer also involves the operating system: enlarge it at the wrong moment and you trigger Windows page faults, which touch the virtual memory file; page faults cost both CPU and I/O, which is why they have such a performance impact, and why raising the cache hit rate is a focus of performance tuning. If data goes unused for a long time, the buffer manager ages it out of memory and overwrites it with new data; otherwise the buffer would keep loading and never unload. And if data has been updated elsewhere while your cached copy aged, failing to re-read it would hand out stale results.
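
On SQL Server 2005 and later, the buffer manager's cache is visible through a DMV. A sketch; the counts will obviously differ on your server:

    -- How many 8 KB pages each database is holding in the buffer pool.
    SELECT DB_NAME(database_id)   AS database_name,
           COUNT(*)               AS cached_pages,
           COUNT(*) * 8 / 1024    AS cached_mb
    FROM   sys.dm_os_buffer_descriptors
    GROUP  BY database_id
    ORDER  BY cached_pages DESC;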

We know that data pages contain data rows and index pages contain index rows. Data rows are managed by the row manager; index rows are the responsibility of the index manager.

Retrieving, modifying, or executing against a single row is also subject to the transaction manager and the lock manager. Transactions come in two kinds, explicit and implicit. Locks come as shared, exclusive, update, and intent locks; they also come at row, page, table, and database granularity; and with locking comes the possibility of deadlock. The combination of lock types and transaction effects decides whether a row can be read or modified, how it is read (a consistent read or a dirty read), and whether the operation must wait on transactions and locks or can proceed. The rows a data page can hold are limited, especially under a fill factor (with a fill factor of 80%, a page is filled only to 80% before it must split, so that later inserts have room; this has a sizable performance impact, especially under insert-heavy workloads). A SQL Server data page is 8 KB (pages are grouped into 64 KB extents); take away the reserved free space and the page header, and there is not that much room left for data. This is why relational databases encourage "small table, big data": fewer columns, shorter columns, frequently accessed columns in front; the row count can then be massive. Otherwise, think how many data pages a retrieval or update must touch, and how much page scheduling that requires, in the face of page faults and the locking mechanism. Moreover, large text and variable-length rows are stored via pointers and require extra hops to find, wasting a great deal of time.
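The fill factor mentioned above is set when an index is built. A sketch with hypothetical names; leaving 20% free per leaf page trades some read density for fewer page splits on insert:

    -- Fill leaf pages only to 80%, leaving room for future inserts.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    WITH (FILLFACTOR = 80);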

The index manager's most important job is maintaining the index B-trees. Without index pages we are left with full table scans: loading who knows how many data pages and scanning row by row, and if transactions and update locks are involved, it gets worse. So indexes matter a great deal, and one table can carry many of them. With an index you can go straight to the desired rows without a full table scan. But if your index is on a column holding only "male" and "female", or involves variable-length rows, it will not serve you well. Nor should you create too many indexes: maintaining index pages carries real cost and consumption. Index pages are sensitive to inserts and page splits; frequent changes to indexed columns churn the index pages violently, and the larger the data volume, the larger the impact. I will not explain here how to optimize SQL with indexes; a whole book could not finish that.
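
On SQL Server 2005 and later, you can measure how badly inserts and page splits have churned a B-tree. A sketch with hypothetical database and table names:

    -- Fragmentation of each index on one table; high values suggest a rebuild.
    SELECT index_id,
           avg_fragmentation_in_percent,
           page_count
    FROM   sys.dm_db_index_physical_stats(
               DB_ID('MyDatabase'), OBJECT_ID('dbo.Orders'),
               NULL, NULL, 'LIMITED');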

Data is constantly accessed and maintained, loaded into memory or written from memory back to disk. The one actually taking care of all this is the lazy writer. It periodically scans for aged data and brings the disk and memory copies back into agreement. With a lazy writer there is a time window in which memory and disk differ, so anomalies are possible: if the server suddenly loses power, whatever was not yet written to disk is in trouble. That is where another module comes in: the log manager. The log manager maintains the log files with a checkpoint mechanism; when the server restarts, it replays the log to restore the data to consistency. Writing log records is, of course, far simpler and faster than writing data: writing data means touching memory and disk while minding permissions, locks, and transactions, so a sudden power cut leaves no time to react. This lightweight logging approach is therefore a great help in restoring consistency (though data can still be lost if a log page itself did not reach the disk in time).
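
This is write-ahead logging in miniature, and you can exercise it by hand. A sketch (table name hypothetical); CHECKPOINT is a real T-SQL command:

    BEGIN TRANSACTION;
    UPDATE dbo.Orders SET Amount = Amount * 1.1 WHERE CustomerID = 42;
    COMMIT;       -- the log records are hardened to disk here ...

    CHECKPOINT;   -- ... while the dirty data pages may wait until a
                  -- checkpoint (or the lazy writer) flushes them.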

That leaves the transaction manager and the lock manager. These two are entangled with explicit and implicit transactions, explicit and implicit locks, transaction isolation levels, lock granularities, the row manager, and the index manager. Microsoft has the Windows advantage, and it has Jim Gray, a giant of this field (Jim Gray won the Turing Award; he is one of the key figures behind the very concept of the database transaction. To get him to work for Microsoft, which he would not do in rainy Redmond, Microsoft built a separate research lab for him in the California sun). So on performance, I personally think SQL Server does very well (think about what actually determines a database product's performance). As for the industry's received wisdom that SQL Server cannot manage massive data and performs poorly, I personally consider it a baseless assertion, especially in mainland China's IT industry, where most of the noise is entry-level, not least from a crowd of semi-technical journalists and writers.
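
You can see the two managers interact by holding a lock and inspecting it. A sketch for SQL Server 2005 and later with a hypothetical dbo.Orders table; run the middle query from a second session:

    -- Session 1: take and hold an exclusive (X) row lock.
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    BEGIN TRANSACTION;
    UPDATE dbo.Orders SET Amount = 0 WHERE OrderID = 1;

    -- Session 2: list the locks session 1 is holding.
    SELECT request_session_id, resource_type, request_mode, request_status
    FROM   sys.dm_tran_locks
    WHERE  resource_database_id = DB_ID();

    -- Session 1: release the lock.
    ROLLBACK;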

If you really want to argue whether large-scale data management must use one particular database product, I suggest analyzing the claim on three levels, internal principles, internal architecture, and internal implementation, and from many angles, to see whether it actually holds up.

And finally, the I/O manager. I have never agreed with having an I/O manager inside the SQL Server kernel, because SQL Server uses the same page scheduling and page allocation methods as Windows. Why build another copy? SQL Server already hands pages, disk, memory, threads, and CPU over to Windows; as an application on Windows, it should be managed by Windows just like any other software. SQL Server is not cross-platform, so it does not need to manage these things itself.
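
Whatever one thinks of that design choice, the engine does expose its file-level I/O counters (SQL Server 2005 and later). A sketch:

    -- Reads, writes, and stall times per database file.
    SELECT DB_NAME(database_id) AS database_name,
           file_id,
           num_of_reads,
           num_of_writes,
           io_stall_read_ms,
           io_stall_write_ms
    FROM   sys.dm_io_virtual_file_stats(NULL, NULL);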

Beyond the subtleties of the engine itself, SQL Server's peripheral tooling is also well designed: the user security model, the object taxonomy (tables, columns, constraints, defaults, indexes, triggers, stored procedures, views, primary keys), the object permission model, self-describing metadata, the SQL language itself, SQL Query Analyzer, SQL Profiler, the performance monitoring tools, and the system databases (master, msdb, tempdb, model).
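
The "self-describing metadata" point is easy to demonstrate: the catalog describes every object as ordinary rows you can query. A sketch using the modern catalog views (older versions used sysobjects):

    -- Every user table, view, procedure, and trigger, straight from the catalog.
    SELECT name, type_desc, create_date
    FROM   sys.objects
    WHERE  type IN ('U', 'V', 'P', 'TR')
    ORDER  BY type_desc, name;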

Think about it: does your management software platform embody these architectural ideas?
