As an important branch of computer science, databases are a worthwhile direction for programmers to study. We all use the most common statements for adding, deleting, modifying, and querying data: select, update, delete, and so on. I won't go over those again; they are old hat. Instead, we can study the techniques behind advanced databases: setting the business-layer logic aside and understanding the database at a deeper level. Today I will cover three technical points:
1. Data indexing, typified by the B+ tree index family
2. Database failure recovery; here I only discuss log-based recovery
3. Database system architecture, namely the popular distributed database systems
Database indexing determines query speed. When the data volume is small, response time is not a problem; once a table holds millions or tens of millions of records, an index becomes necessary. Scanning disk blocks one by one through massive data is obviously inefficient: if the records happen to be stored contiguously we are lucky, but if they are physically scattered, the result is easy to imagine. Hence the idea of an index. An index does not store the real data; it is essentially an address pointer to the real record. To query, we first locate the index entry, then follow it to the actual record's address. Because the index holds no real data, searching it is fast.

First, ordered indexes. When a data file is stored in order of some search key, it is called an ordered file, and its index is usually built on that key; for example, an index built on an ID column is a sequential index. The index of an ordered file can be searched quickly with binary search.

B+ tree indexes are built on a balanced tree structure; this special design keeps query efficiency high, and I have covered it in earlier articles.

Finally, hash indexes map each search key to one of M buckets holding the storage addresses of the matching records, so a lookup goes directly to the right bucket via the hash function. Hashing is fast, but a good hash function is crucial: ideally it distributes index entries evenly across the buckets.
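The hash-bucket idea above can be sketched in a few lines. This is a minimal illustration, not a real storage engine: the class name, bucket count, and "record address" strings are all invented for the example.

```python
# Minimal sketch of a hash index: search keys are hashed into M buckets,
# and each bucket stores (key, record_address) pairs -- pointers to the
# real records -- rather than the records themselves.

class HashIndex:
    def __init__(self, num_buckets=8):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # A good hash function should spread keys evenly across buckets.
        return hash(key) % self.num_buckets

    def insert(self, key, record_address):
        self.buckets[self._bucket(key)].append((key, record_address))

    def lookup(self, key):
        # Only one bucket is scanned, never the whole file.
        for k, addr in self.buckets[self._bucket(key)]:
            if k == key:
                return addr
        return None

index = HashIndex()
index.insert(1001, "block 7, slot 3")   # address of the real record on disk
index.insert(1002, "block 2, slot 9")
print(index.lookup(1001))   # -> block 7, slot 3
```

A lookup costs one hash computation plus a scan of a single bucket, which is why uneven bucket distribution (many keys in one bucket) destroys the advantage.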
Log-based recovery is also part of database disaster recovery. There are three types of logs: undo, redo, and undo/redo. As the English word suggests, "undo" means to cancel an operation: when a transaction has not fully completed, its partial updates must be rolled back so that no dirty data is left behind, and undo logs exist to make this possible. The structure of an undo log record is as follows:
<T, X, v>, where T is the transaction updating the data, X is the data element updated by T, and v is the old value before the update. If something goes wrong after the data has been updated, the old values let us roll back. A complete undo log for one transaction looks like this:
<START T>       // start the transaction
<T, A, 10>      // modify data element A, recording its old value 10
<T, B, 10>      // modify data element B, recording its old value 10
<COMMIT T>      // the transaction completed successfully
If the system fails before <COMMIT T> is written, no commit record exists and we cannot know whether the data operations completed. Recovery therefore reads the log in reverse, from the end toward the beginning: on encountering <T, B, 10>, B is restored to its old value 10; then <T, A, 10> restores A back to 10; when <START T> is reached, the transaction has been fully rolled back. The write order required by undo logging is:
(1). First write the log record for the data element being updated (with its old value).
(2). Then write the updated data element itself to disk.
(3). Finally write <COMMIT T>, indicating the transaction committed successfully.
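The write rules and the backward recovery scan can be put together in a small sketch. This is a toy model under obvious simplifications: the "disk" is a Python dict, the log is a list, and the crash is simulated by simply never appending the commit record.

```python
# Minimal sketch of undo logging and recovery, following the three rules
# above: log the OLD value first, then update the data, then commit.

db = {"A": 10, "B": 10}      # values on "disk"
log = []

def write(txn, x, new_value):
    log.append((txn, x, db[x]))   # rule 1: log the OLD value first
    db[x] = new_value             # rule 2: then update the data element

log.append(("START", "T"))
write("T", "A", 20)
write("T", "B", 20)
# crash happens here, before ("COMMIT", "T") is ever written

def undo_recover():
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    for rec in reversed(log):            # scan the log backwards
        if rec[0] in ("START", "COMMIT"):
            continue
        txn, x, old_value = rec
        if txn not in committed:         # undo only uncommitted transactions
            db[x] = old_value

undo_recover()
print(db)   # -> {'A': 10, 'B': 10}
```

Because T never reached commit, recovery walks the log backwards and restores B, then A, to their old values, exactly the scan described above.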
However, this exposes a problem: it requires a transaction to flush all of its modifications to disk before it can commit, which clearly increases I/O overhead. It would be better to hold modifications in the buffer and write them to disk only when the buffer fills up, saving many I/O operations. This is where redo logs come in; their write order differs slightly from undo logs:
(1). First write the log records for the updated data elements (this time recording the new values).
(2). Then write <COMMIT T>, indicating the transaction committed successfully.
(3). Finally write the updated data elements themselves to disk.
With this order, if a failure occurs at step 2 or 3 after the commit record has been written, recovery redoes the operation: in a redo log record, v stores the new value precisely so that the update can be replayed. It also means the values of A and B may reach the disk only after a long delay; if the system fails during that window, the committed changes are recovered by redoing them. Undo/redo logging combines both modes and is the most flexible. All three kinds of logs need a checkpoint mechanism so that records before a checkpoint can be ignored; each checkpoint is delimited by two markers, <START CKPT> and <END CKPT>.
During recovery, once an <END CKPT> record is found, all log records before the matching <START CKPT> can be discarded.
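The redo case is the mirror image of the undo sketch: records carry new values, and recovery scans forwards, replaying only committed transactions. Again a toy model with a dict for the disk; here the crash hits after <COMMIT T> was logged but before the new values were flushed.

```python
# Minimal sketch of redo recovery: log records store NEW values, the data
# may reach disk only after <COMMIT>, so recovery replays committed
# transactions from oldest to newest.

db = {"A": 10, "B": 10}   # on-disk values: the new values were never flushed
log = [
    ("START", "T"),
    ("T", "A", 20),       # redo records carry the NEW value
    ("T", "B", 20),
    ("COMMIT", "T"),      # T committed, but the crash hit before flushing
]

def redo_recover():
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    for rec in log:                      # scan forwards, oldest first
        if rec[0] in ("START", "COMMIT"):
            continue
        txn, x, new_value = rec
        if txn in committed:             # redo only committed transactions
            db[x] = new_value

redo_recover()
print(db)   # -> {'A': 20, 'B': 20}
```

A checkpoint would simply let this forward scan start from the last <START CKPT> instead of the beginning of the log.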
Finally, with the rise of distributed systems in recent years, the concept of the distributed database has also emerged. A distributed database system can be seen as a collection of centralized databases that logically belong to one system but are physically distributed across separate sites, connected only by the network. A single query may involve data at multiple sites, and the corresponding distributed transaction management is even more complex than distributed querying. The network becomes the biggest factor in distributed query time, whereas in a traditional local database the CPU processing speed is the main factor affecting query speed; this is a fundamental difference. A distributed database should therefore be designed around access cost: ideally the data is placed so that around 90% of queries can be answered locally, and when a distributed query is unavoidable, it should touch the nearest sites possible. Distributed data can be designed in two ways, bottom-up and top-down; that topic is involved, and I will not expand on it here.