Anatomy of the SQL Server Kernel Architecture (repost)
This article sat on my computer for a long time. I stumbled across it again today, thought it was very well written, and decided to post it here to share.
You have to admit that good software is accumulated step by step, with both feet on the ground. Many excellent programmers are not simply writing code; they are creating a kind of art.
Like Blizzard, mentioned earlier, they have one thing in common: they only make classics. It is not so much that they concentrate the world's best programmers; rather, they concentrate the world's best ideas and put them into practice.
Success comes not from haste but from foresight. I wish Microsoft SQL Server an ever longer road ahead.
Original post: http://www.sqlserver.com.cn
Those of us who build management software know that its core is data storage and management, so database design is our top priority. To make our management software stable, scalable, high-performing, traceable for troubleshooting, upgradeable in deployment, and pluggable in operation,
we often build our own management software development platform. We are always eager to study other people's platforms (UFIDA, Kingdee, or SAP), yet we constantly lament how wide-ranging and detailed business processing in management software is,
while a database handles management software with just four simple commands: SELECT, INSERT, DELETE, and UPDATE.
We hope that one day we can build an architecture that handles management software just as simply. We tend to stare at other people's architectures while forgetting the architecture of the database we know best, sitting right beside us.
So today I would like to walk you through the structure of a database and explore the architectural ideas behind it. Since SQL Server is the only database product I know well, SQL Server is the one I will analyze.
Before discussing SQL Server's internals, I think it is well worth reviewing its history.
Let's stand in 1999 and look at where the database industry actually was.
In September 1998 (1999 for the Chinese edition), Oracle released Oracle 8i.
Oracle 8i supported writing stored procedures in Java, supported XML, and supported Linux.
In January 1999, SQL Server 7 was officially released. SQL Server 7 rebuilt the entire database engine (essentially a rewrite of SQL Server).
It was the first version of SQL Server to support row locks (no mistake there; one wonders how people used the product before). Oracle 6 had supported row locks back in 1988, the same year Oracle began developing ERP products (whoever says Oracle is an ERP outsider should keep that in mind).
Now look at the versions before that. If you came to this field late (after 2000), the following may surprise you even more.
In 1992, Oracle 7 was released, with stored procedures, triggers, referential integrity checks, and distributed transaction processing (good grief, Oracle 7 already had all that).
In 1995, SQL Server 6 was released. SQL Server 6 was Microsoft's first real database product (genuinely surprising: we never imagined SQL Server 6 was Microsoft's first, so where did versions 5, 4, 3, 2, and 1 go?).
The answer: in 1994, Microsoft and Sybase split up (Sybase had built the first client/server database product to run on a PC). Microsoft had no experience when it entered the database field,
so it partnered with Sybase (Microsoft was already the world's largest software company, having gone public in 1986). Sybase had a product and lacked money; Microsoft lacked a product and had money, so they hit it off.
By 1994 Microsoft no longer needed Sybase (it had learned the database technology), and Sybase felt Microsoft was too ambitious, so the partnership dissolved and Microsoft began building its own database.
So much for history. Now let's take a closer look.
Many entry-level management software developers are fluent in SQL: they can write anything from subqueries to HAVING clauses to cross-table statistics, turn out stored procedures of more than 2,000 lines, and wield cursors, user-defined functions, triggers, and constraints with dazzling ease.
In Query Analyzer they can use the query analysis output to tune indexes, use SQL Profiler to trace statements, monitor SQL Server's memory, CPU, thread, and I/O activity in Performance Monitor, and even feel rather pleased with themselves for using DBCC.
You are that familiar with SQL Server, and yet that unfamiliar with it.
Today I want to analyze SQL Server's architecture and principles from an architect's point of view. A single blog post can only go so deep; covering every layer might take a series of articles or even a brick of a book.
Still, I hope this post gives you a chance to see SQL Server from an angle you had not considered before.
For SQL Server, as a database product, I personally think the two most important pieces are the storage engine and the query engine (relational engine).
Everything else, logs, transactions, locks, indexes, and so on, works in service of these two.
The relational engine (query engine): the upper layer
SQL Server is a client/server product, so a SQL statement must be transmitted to the SQL Server machine before it can execute. For transmission we naturally need network protocols such as NetBEUI or TCP/IP, but a bare transport protocol alone is not enough.
How does the client send and the server receive? How do both sides confirm that what was sent and received is correct and complete, and that the transmission has actually finished? How can send and receive cross different network protocols (say, communication among UNIX, Windows, and Novell)? How is the data integrity-checked for safety,
and how is transmission made synchronous or asynchronous? All of this requires building another protocol layer on top of the network transport. SQL Server supports both IPC and RPC mechanisms for this. Ask yourself whether your own management software development platform has such a layer.
Of course, today message servers provide this mechanism professionally: reliable, secure, efficient, asynchronous message transport with compression, message splitting, intelligent routing, and clustering, across different operating systems, programming languages, communication protocols, and hardware platforms.
If you had never understood what message-oriented middleware is for, this case should make its purpose clear.
Once a SQL statement has been reliably delivered to the server, the first module in the SQL Server engine receives it. That module is Open Data Services.
It listens for new connections, cleans up failed ones, and returns result sets, messages, and status to the client.
Data traveling between the SQL Server client and server is formatted into packets. In SQL Server this format is known as the Tabular Data Stream (TDS).
This data stream uses tokens to control the dialogue between client and server (otherwise, with the client saying n things and the server replying n things, everything would blur together with no way to tell which answer belongs to which request).
We generally cannot talk to Open Data Services directly. Instead, we send the tabular data stream through ODBC, ADO, or DB-Library.
Result data returned by SQL Server comes back as a tabular data stream through these same interfaces. Notice how cleverly SQL Server is designed: a universal data access interface shields you from communicating with SQL Server directly,
just as the Windows API shields the kernel from direct access, and just as DirectX shields the UI and peripheral control.
So the path is: SQL statement → ODBC → encoded into a tabular data stream → IPC or RPC → network protocol → IPC or RPC → tabular data stream decoded → Open Data Services.
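To make the idea of a framed, token-tagged protocol concrete, here is a minimal sketch in Python. It is purely illustrative: the token values and frame layout are invented, and the real TDS wire format is far richer. The point is only that a length prefix tells the receiver where a frame ends, and a token tells it what kind of message the frame carries.

```python
import struct

# Hypothetical mini-protocol, loosely inspired by the idea of TDS framing:
# each frame is [1-byte token][4-byte big-endian length][payload bytes].
TOKEN_QUERY = 0x01
TOKEN_RESULT = 0x02

def encode_frame(token: int, payload: bytes) -> bytes:
    """Prefix the payload with a token and its length so the receiver
    knows exactly where the frame ends and what kind of frame it is."""
    return struct.pack(">BI", token, len(payload)) + payload

def decode_frame(data: bytes):
    """Return (token, payload, remaining_bytes); raise if incomplete."""
    token, length = struct.unpack(">BI", data[:5])
    if len(data) < 5 + length:
        raise ValueError("incomplete frame")
    return token, data[5:5 + length], data[5 + length:]

# A query and its reply can share one stream without getting mixed up,
# because each frame's token says which side of the dialogue it is.
stream = encode_frame(TOKEN_QUERY, b"SELECT 1") + encode_frame(TOKEN_RESULT, b"1")
tok, payload, rest = decode_frame(stream)
print(tok, payload)        # 1 b'SELECT 1'
tok2, payload2, _ = decode_frame(rest)
print(tok2, payload2)      # 2 b'1'
```

This is the kind of bookkeeping that a transport protocol alone does not give you, and that the article says must be built as a layer above it.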
Open Data Services monitors client connections. Under heavy concurrency it creates new connections; when a connection has finished serving, it is kept in a pool, where it lives for a set lifetime before being released.
If a client connection drops abruptly (say the client machine reboots) and stops responding to keepalive probe packets, Open Data Services concludes it is dead and cleans up the connection itself.
The connections you see among SQL Server's threads were all created by Open Data Services.
When Open Data Services takes a connection (possibly from the pool; its pooling, creation, and destruction are all highly tuned),
many questions have already been answered: how many connections to pool, how to retain context resources, how long a pooled connection should live, when it should be destroyed, and how badly poor scheduling would waste resources. Only then is the SQL statement caught, landing in Open Data Services' read buffer.
This buffer gives the hard-working SQL Server a moment of breathing room, and that breathing room is part of what lets it do its job (does your design have one?).
Open Data Services also has a write buffer. As SQL Server retrieves data it immediately places it in the write buffer, and whenever the write buffer is full, Open Data Services sends it out at once.
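The flush-on-full behavior of such a write buffer can be sketched in a few lines. This is a conceptual toy, not the actual Open Data Services implementation; the capacity and the `send` callback are assumptions for illustration.

```python
class WriteBuffer:
    """Toy flush-on-full buffer: rows accumulate until `capacity` bytes
    are queued, then the whole batch is handed to `send` in one call."""
    def __init__(self, capacity: int, send):
        self.capacity = capacity
        self.send = send          # callable that ships one batch to the client
        self.pending = []
        self.size = 0

    def write(self, row: bytes) -> None:
        self.pending.append(row)
        self.size += len(row)
        if self.size >= self.capacity:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.send(b"".join(self.pending))
            self.pending.clear()
            self.size = 0

batches = []
buf = WriteBuffer(capacity=8, send=batches.append)
for row in [b"abc", b"def", b"gh", b"i"]:   # 3+3+2 bytes triggers a flush
    buf.write(row)
buf.flush()                                  # ship the final partial batch
print(batches)   # [b'abcdefgh', b'i']
```

Batching like this trades a little latency for far fewer network sends, which is exactly the "breathing room" the article describes.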
When I studied SQL Server's internals, I often marveled that even a small peripheral module was designed this delicately. We chase massive data storage and caching architectures while ignoring the SQL Server at our fingertips.
Once a SQL statement is in the read buffer, SQL Server's query engine (relational engine) begins its work; it is always listening to that buffer.
The first relational-engine module the statement meets is the command parser. The query analysis output you see in SQL Query Analyzer is its handiwork. Its job is to construct the query tree.
First it normalizes your SQL statement (think of your own software: code that processes whatever input arrives without so much as validating it; how could that ever be stable?). Otherwise the later steps would misbehave. If your statement has a syntax error,
construction of the query tree cannot complete and is aborted. Normalizing a statement means checking it against SQL Server's supported syntax and functions, flattening SELECT *, and verifying field access.
Once the tree is built, the relational engine's second module, the query optimizer, prunes it. One SQL statement can yield many execution and optimization plans (you will understand if you have ever used a SQL tuning tool),
and SQL Server picks the plan that best conserves memory, CPU utilization, and I/O operations (I/O is the deadliest area in performance work; the bottleneck is usually I/O).
The optimizer works from each table's statistics (for performance tuning you sometimes have to update the statistics yourself, or the optimizer will make mistakes).
Based on the query tree, the optimizer also chooses appropriate indexes (if an index would be too expensive it automatically falls back to a full table scan), determines which tables' data to fetch,
and then works out how to merge that data in memory into the result you want. (Sometimes I think the optimizer really is remarkable: you throw a SQL statement at it and it must settle on a reasonably good plan in a very short time. It cannot exhaustively enumerate every possible plan, which is why, for massive data, we often evaluate several plans ourselves and then rewrite our SQL to match the best one.)
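As a toy illustration of a cost-based choice between an index seek and a full scan, here is a sketch with invented cost formulas; it resembles SQL Server's real costing model in spirit only, and every number in it is an assumption.

```python
def pick_access_path(table_rows: int, matching_rows: int,
                     rows_per_page: int = 100, index_depth: int = 3) -> str:
    """Toy cost model: a full scan reads every data page; an index seek
    pays the B-tree traversal plus one page lookup per matching row.
    Whichever predicted page count is lower wins."""
    scan_cost = table_rows / rows_per_page   # pages read by a scan
    seek_cost = index_depth + matching_rows  # traversal + per-row lookups
    return "index seek" if seek_cost < scan_cost else "full scan"

# A selective predicate favors the index; a non-selective one favors a
# scan, which is why the optimizer sometimes ignores an index on purpose.
print(pick_access_path(1_000_000, matching_rows=50))       # index seek
print(pick_access_path(1_000_000, matching_rows=500_000))  # full scan
```

This also shows why stale statistics hurt: if `matching_rows` is estimated badly, the cheaper-looking plan may be the wrong one.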
Normalizing and optimizing the SQL statement yields an execution plan, and the SQL manager is responsible for generating it. The statement you sent might be a SELECT, an INSERT, or an UPDATE. Even a SELECT
is subject to user permissions (you will understand if you have ever set object-level and column-level permissions for a SQL Server user), and an INSERT or UPDATE additionally involves permissions, default values, constraints, expressions, primary and foreign keys, and triggers.
Turning one optimized SQL statement into the concrete operations that let SQL Server find or update the data in memory or on disk takes a great many detailed steps.
The query executor is responsible for running the plan. Because execution touches transactions, locks, waits, CPU scheduling, memory page faults, and I/O, the query executor coordinates many other modules, each of which does its own processing;
the executor does not really arrange everything itself, but delegates to the transaction manager, lock manager, index manager, page file manager, buffer manager, row manager, and log manager.
The query executor is the last module of the query engine; everything after it belongs to the storage engine. From the query engine's perspective, then, the essential work is to build the SQL query tree, optimize and prune it, generate an execution plan from it, coordinate its execution,
and return the results.
The storage engine: the lower layer
To actually get data out or put it in, you need the storage engine.
First, working from the execution plan, something must decide which data pages and index pages to access. That is the job of the Access Methods Manager. But actually opening those pages is not something the access methods manager does with its own hands.
The module that does the hands-on work is the buffer manager. Data on disk cannot be computed on; it must be in memory for the CPU to work on it. So to access those data and index pages, the access methods manager notifies the buffer manager.
If the data is not yet in memory, the buffer manager reads it in; if it already is, the buffer manager simply returns it. The whole process is hidden inside the buffer manager and is transparent to the access methods manager.
Do not conclude that the access methods manager does nothing but issue scheduling commands; that would be wrong. For SQL Server to guarantee high-speed processing, it must anticipate which data and index pages will be needed.
It cannot wait until the buffer manager has finished one batch before computing which pages come next. Remember, no single user drives these managers: with a hundred-plus concurrent users sending all kinds of data processing requests,
how do you predict which data and index pages to touch? That requires unified scheduling, and that scheduling also affects the buffer manager. You cannot wait until a big batch of data is upon you and only then scramble to enlarge the buffer and load the data,
stalling the pipeline. The buffer manager must know in advance that a large batch is coming, so a separate thread extends the buffer in parallel. Growing the buffer also involves the operating system:
enlarge it at the wrong moment and you hit Windows page faults, which touch the virtual memory file. Page faults cost both CPU and I/O, so they are a significant performance concern,
and raising the cache hit rate is the focus of our performance tuning. If data goes unused for a long time, the buffer manager ages it out and overwrites it with new data; otherwise the buffer would keep loading and never unload.
Moreover, once data has been updated elsewhere, a stale copy that is never re-read will cause read errors.
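The aging-out behavior can be sketched with a least-recently-used eviction policy. LRU here is an assumption for illustration; SQL Server's actual page replacement is more sophisticated, but the shape of the mechanism is the same: hits are cheap, misses cost a disk read, and cold pages get overwritten.

```python
from collections import OrderedDict

class BufferPool:
    """Toy page cache: pages are fetched through the pool; when the pool
    is full, the least recently used page is evicted to make room."""
    def __init__(self, capacity: int, read_page):
        self.capacity = capacity
        self.read_page = read_page     # fetches a page from "disk"
        self.pages = OrderedDict()     # page_id -> page data, LRU order
        self.disk_reads = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # hit: mark recently used
            return self.pages[page_id]
        self.disk_reads += 1                  # miss: pay for an I/O
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)    # evict the coldest page
        self.pages[page_id] = self.read_page(page_id)
        return self.pages[page_id]

pool = BufferPool(capacity=2, read_page=lambda pid: f"page-{pid}")
pool.get(1); pool.get(2)
pool.get(1)          # hit: page 1 becomes the most recently used
pool.get(3)          # pool full: evicts page 2, the least recently used
print(pool.disk_reads, sorted(pool.pages))   # 3 [1, 3]
```

Raising the hit rate means more `get` calls return from the `OrderedDict` and fewer pay the `disk_reads` cost, which is the whole point of the buffer manager.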
We know that data pages contain data rows and index pages contain index rows. Data rows are the charge of the row manager, while index rows are the responsibility of the index manager.
Retrieval, modification, and execution on a single row are also influenced by the transaction manager and lock manager. Transactions come in two kinds, explicit and implicit. Locks come as shared locks, exclusive locks, update locks, and intent locks, and by granularity as row locks, page locks, table locks, and database locks.
Locking also brings the possibility of deadlock. Between the various locks and the effects of transactions, whether a row can be read or modified, how it is read (a consistent read or a dirty read), and whether it must wait on transactions and locks, are all influenced by a great many factors.
The rows a data page can hold are limited, particularly because of the fill factor (with a fill factor of 80%, a page is only filled to 80% before a new page must be allocated, leaving room for later inserts; otherwise later inserts suffer,
which is also a sizable performance cost, especially under insert-heavy workloads). A SQL Server data page is 8 KB by default; subtract the reserved free space and the page header and there is not much room left for data.
This is why relational databases urge everyone toward "small tables, big data": few columns, short columns, frequently accessed columns first. The data itself can be massive. But if your rows are long, consider how many pages a retrieval or update must touch and how many page scheduling operations that requires,
all under the influence of page faults and the locking mechanism. Furthermore, large text and variable-length rows are stored via pointers that must be chased, which wastes considerable time.
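A back-of-the-envelope version of the page arithmetic above: 8 KB pages, with a 96-byte header and an 80% fill factor assumed purely for illustration (the real reserved-space accounting is more detailed).

```python
def rows_per_page(row_bytes: int, page_bytes: int = 8192,
                  header_bytes: int = 96, fill_factor: float = 0.80) -> int:
    """How many rows fit on one page, given a page header and a fill
    factor that deliberately leaves free space for future inserts."""
    usable = (page_bytes - header_bytes) * fill_factor
    return int(usable // row_bytes)

def pages_needed(total_rows: int, row_bytes: int) -> int:
    """Pages required to store the table (ceiling division)."""
    per_page = rows_per_page(row_bytes)
    return -(-total_rows // per_page)

# Shorter rows pack far more densely, so the same million rows take far
# fewer pages (and far fewer I/Os) to scan: "small tables, big data".
print(rows_per_page(row_bytes=100))            # 64 rows on one page
print(pages_needed(1_000_000, row_bytes=100))  # 15625 pages
print(pages_needed(1_000_000, row_bytes=800))  # 125000 pages
```

An eight-fold increase in row width here costs an eight-fold increase in pages touched, which is exactly why wide rows magnify page scheduling, page faults, and lock traffic.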
The index manager's most important job is maintaining the index B-tree. Without index pages we are back to full table scans: loading who knows how many data pages and scanning row by row, which is even worse when transactions and update locks are involved.
So indexes matter greatly, and a table can carry many of them. With an index you can go straight to the desired row with no full table scan. But if an index column has only two values, say male and female, or if it involves variable-length rows, it is a poor candidate for indexing.
Nor should you create too many indexes, because the cost of maintaining the index pages is itself considerable. The more index pages there are, the more inserts, page splits, and frequent changes to indexed columns will churn them, and the larger the data volume, the greater the impact.
I will stop here; explaining how to use indexes to optimize SQL could fill a book on its own.
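Still, the core reason an index seek beats a scan fits in a toy comparison. Here a sorted key list stands in for a B-tree (a real B-tree is a multi-level page structure, so this is an analogy, not an implementation): binary search does O(log n) comparisons where a scan does O(n).

```python
import bisect

rows = [(i, f"name-{i}") for i in range(100_000)]   # (id, payload) pairs
index = [r[0] for r in rows]                        # sorted key column

def full_scan(key):
    """Touch every row until the key is found: O(n) comparisons."""
    steps = 0
    for rid, payload in rows:
        steps += 1
        if rid == key:
            return payload, steps
    return None, steps

def index_seek(key):
    """Binary-search the sorted keys: O(log n) comparisons, like
    descending a B-tree from root to leaf."""
    pos = bisect.bisect_left(index, key)
    found = pos < len(index) and index[pos] == key
    return (rows[pos][1] if found else None), pos

payload, scan_steps = full_scan(99_999)
print(scan_steps)            # 100000 comparisons to reach the last row
payload2, _ = index_seek(99_999)
print(payload == payload2)   # True: same row, a handful of comparisons
```

The analogy also hints at why a two-valued "male/female" index helps so little: the search narrows the candidates by almost nothing, so most of the scan work remains.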
Data is constantly accessed and maintained, loaded into memory or written back from memory to disk. The module looking after all this is the lazy writer. The lazy writer periodically scans aged data to keep what is on disk consistent with what is in memory.
With a lazy writer there is a time window between memory and disk, and in that window an exception can occur: if the server suddenly loses power, some data will not have made it to disk. This is where another module comes in, the log manager.
The log manager uses a checkpoint mechanism to maintain the log files. When the server restarts, replaying the log restores the data to consistency. Writing a log record is, of course, far simpler and faster than writing the data itself,
because writing data means manipulating memory and disk while also minding permissions, locks, and transactions, and when the power suddenly fails there is no time for all that. So this lightweight logging approach
is a great help in restoring consistency (though data can still be lost if a log page itself had no time to reach the disk).
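The recovery idea can be sketched as a toy write-ahead log. The record format here is invented and real SQL Server logging and checkpointing are far more involved, but the principle is the same: log each change durably first, and on restart replay the log over the last checkpointed state.

```python
# Toy write-ahead logging: every change is appended to the log before the
# in-memory table is touched; recovery replays the log over the last
# durable snapshot (the "checkpoint").
log = []                 # pretend this list lives on durable storage
checkpoint = {"a": 1}    # last state known to be safely on disk

def update(table, key, value):
    log.append(("set", key, value))   # write-ahead: log first...
    table[key] = value                # ...then change the data in memory

table = dict(checkpoint)
update(table, "a", 2)
update(table, "b", 7)
# -- power failure here: the in-memory `table` is lost --

def recover():
    restored = dict(checkpoint)       # start from the checkpointed state
    for op, key, value in log:        # replay every logged change
        if op == "set":
            restored[key] = value
    return restored

print(recover())   # {'a': 2, 'b': 7} -- consistent again after the crash
```

Appending a small log record is far cheaper than writing the data pages themselves, which is why the log can keep up even when the lazy writer has not yet flushed the pages.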
That leaves the transaction manager and the lock manager. These two interact heavily with explicit and implicit transactions, explicit and implicit locks, transaction isolation levels, lock granularities, the row manager, and the index manager.
Microsoft has the Windows advantage, and it has the giant Jim Gray (Jim Gray is a Turing Award winner; the very concept of the database transaction is his). Gates persuaded the master to work for Microsoft, and since Gray disliked Redmond's daily rain,
Microsoft set up a separate research institute for him in the California sunshine.
Baidu Encyclopedia: Http://baike.baidu.com/link?url=n0vbmC2FrrVB2xWyHFlQqntBeG9yOvX1zqrHYM6J9R0-xFAkjbnTuSh6Vh94c_ M2yq8nrdzaqd3v9llgvlvwxa
So on the question of performance, I personally think SQL Server performs very well (think about what actually determines a database product's performance).
As for the industry view that SQL Server cannot manage massive data and performs poorly, I personally consider that a baseless assertion, especially in mainland China's IT industry, where most of the noise comes from entry-level practitioners, above all a crowd of technically inclined media reporters and writers.
If you really want to argue that large-scale data management must use one particular database product, I suggest analyzing whether the claim holds from three levels: internal principles, internal architecture, and internal implementation.
Finally, the I/O manager. I have never agreed with having an I/O manager in the SQL Server kernel, because SQL Server uses the same page scheduling and page allocation methods as Windows.
Why build another one? SQL Server might as well hand pages, disk, memory, threads, and CPU over to Windows.
As an application on Windows, SQL Server should be managed by Windows just like any other software on Windows. SQL Server is not cross-platform and has no need to manage these itself.
Beyond the subtleties of the SQL Server engine, its peripheral tools are also well designed:
for example, its user security management; its object taxonomy (tables, columns, constraints, defaults, indexes, triggers, stored procedures, views, primary keys); its object permission model; its self-managing metadata; the SQL language itself; SQL Query Analyzer; SQL Profiler; the performance analyzer; and the system databases (master, msdb, tempdb, model).
Think about it: does your management software platform have these architectural ideas?
http://blog.csdn.net/hzbooks/article/details/2247912
Jim Gray, a giant of the database world and winner of the 1998 Turing Award, took a boat out of San Francisco Bay on the morning of January 28, 2007, heading for the Farallon Islands to scatter his mother's ashes, and vanished at sea. Despite a large-scale search launched by the US Coast Guard, there has been no news of him since. The Turing Award, computer science's equivalent of the Nobel Prize, is conferred with extreme rigor and rarely goes to two people in the same field, so a Turing laureate is not merely a scientist of outstanding contribution but the acknowledged leader of his research area. Databases, the traditional mainstream of computing, have produced only three Turing laureates, each a famous figure who defined his own era. Of the three, E. F. Codd died in 2003 and Charles Bachman, the father of the database, had retired and left academia, leaving only Jim Gray still active in research:
Charles W. Bachman (1973), for databases;
Edgar Frank Codd (1981), for the relational data model;
Jim Gray (1998), for databases and transaction processing.
Database textbooks generally devote a chapter or two to transactions and their importance; they are the foundation of database applications in banking, finance, and other industries. Credit card payments and online transfers are performed as database transactions, and the transaction mechanism guarantees the correctness and completeness of these widely used operations. The concept of the transaction and its related technology came from Jim Gray. He was also a founder of on-line analytical processing (OLAP) and contributed important concepts such as the data cube. For the past ten years (and probably the next ten), he has been the leading figure in databases, data warehousing, and data mining. Officials have since declared Jim Gray dead. Might there yet be another miracle?