InnoDB Storage Engine Notes

Chapter 1: MySQL architecture and storage engines

1.1 Database vs. instance
Database: a collection of physical operating-system files (or, for some engines, other forms of data). With the NDB storage engine, database files may live in memory rather than on disk.
Instance: a MySQL instance consists of the background threads plus a shared memory area. The instance is what actually operates on the database files.

1.2 MySQL architecture
MySQL consists of the following parts: connection pool components, management services and utility components, the SQL interface, the query parser, the optimizer, caches and buffers, pluggable storage engines, and the physical files. (Architecture diagram omitted.)

1.3 MySQL storage engines
OLTP (online transaction processing) workloads consist of many short reads and writes; OLAP (online analytical processing) workloads are dominated by large, complex read queries.
(1) InnoDB. The default storage engine since MySQL 5.5.8, aimed mainly at OLTP. Features: transactions, row-level lock design, foreign keys, and non-locking reads (reads take no locks by default). Data is stored in a logical tablespace, and each table can be stored in its own .ibd file. High concurrency is achieved through MVCC (multi-version concurrency control); the default isolation level is REPEATABLE READ. It also provides high-performance, high-availability features such as the insert buffer, doublewrite, the adaptive hash index, and read-ahead. Table data is stored clustered, in primary key order. If a table does not explicitly specify a primary key, a 6-byte ROWID is generated as the primary key.
(2) MyISAM. The default storage engine before MySQL 5.5.8, aimed mainly at OLAP. Features: no support for transactions; table-level lock design; full-text indexing; its buffer pool caches only index files, not data files. A table consists of a MYD data file and a MYI index file. Data files can be compressed.
(3) NDB. A high-availability, high-performance cluster storage engine, similar in spirit to an Oracle RAC cluster. Features: data is kept in memory, so primary key lookups are extremely fast, and database performance can be scaled linearly by adding cluster nodes. However, complex operations such as JOINs are carried out in the MySQL layer, which incurs heavy overhead and makes them slow.
(4) Memory. Stores data in memory and suits temporary tables holding transient data. Hash indexes are used by default rather than B+ tree indexes. Features: data is lost on restart; only table locks are supported, so concurrency is poor; TEXT and BLOB column types are not supported. Temporary tables holding intermediate result sets of MySQL queries use the Memory storage engine.
(5) Archive. Designed for high-speed insertion with compression, suitable for storing archived data. Features: only INSERT and SELECT are supported; data rows are compressed.
(6) Maria. Newly developed to replace MyISAM as the default storage engine. Features: caching of both data and index files, row-level lock design, MVCC, transactional and non-transactional modes, and better handling of BLOB character types.
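To make the engine comparison above concrete, here is a minimal SQL sketch. The table names (orders, orders_history) and columns are hypothetical, not from the original notes; the Archive table deliberately has no index, since Archive supports only INSERT and SELECT.

```sql
-- List the storage engines compiled into this server and which is the default.
SHOW ENGINES;

-- Hypothetical OLTP table: InnoDB provides transactions, row locks, and MVCC.
CREATE TABLE orders (
    id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    amount DECIMAL(10,2) NOT NULL
) ENGINE = InnoDB;

-- Hypothetical archive table: Archive compresses rows and suits
-- append-only historical data; only INSERT and SELECT are supported.
CREATE TABLE orders_history (
    id     INT NOT NULL,
    amount DECIMAL(10,2) NOT NULL
) ENGINE = ARCHIVE;
```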
Chapter 2: The InnoDB storage engine

2.1 Overview
InnoDB was developed by Innobase Oy. Its founder and Linus Torvalds are alumni of the University of Helsinki in Finland.

2.2 InnoDB architecture
(1) Background threads
① Master Thread: asynchronously flushes data in the buffer pool to disk to keep it consistent, including flushing dirty pages, merging the insert buffer, and reclaiming undo logs.
② IO Threads: handle the callbacks of the many asynchronous (AIO) requests. There are four kinds: write, read, insert buffer, and log.
③ Purge Thread: after a transaction commits, one or more purge threads reclaim undo logs that are no longer needed, relieving the Master Thread.
④ Page Cleaner Thread: flushes dirty pages, also relieving the Master Thread.
(2) Memory
① Buffer pool. Because of the gap between CPU speed and disk speed, buffer pool technology is used to improve the overall performance of the DBMS. The buffer pool is a large region of memory, and its size directly affects overall performance; running on a 64-bit operating system is strongly recommended, so the pool is not limited to the roughly 3 GB usable by a 32-bit process.
a. A read FIXes the page read from disk into the buffer pool. The next time the same page is read, the buffer pool is checked for a hit first; only on a miss is the disk read.
b. A write first modifies the page in the buffer pool, which is then flushed back to disk at a certain frequency. Flushing is triggered through the Checkpoint mechanism, not on every modification.
Multiple buffer pool instances are allowed, with pages assigned to instances by hash. Benefit: less resource contention and better concurrency. The buffer pool is managed with the LRU (Least Recently Used) algorithm: the head of the list is the hot end, the tail is the cold end, and newly read pages are inserted at the midpoint. The default page size is 16 KB; pages can be compressed.
② LRU list, Free list, and Flush list.
Free list: when the database starts, all pages are on the Free list and the LRU list is empty.
LRU list: manages pages that have been read. When a page must be brought in, the Free list is checked first; if a free page is available, it is removed from the Free list and placed on the LRU list. Otherwise the page at the tail of the LRU list is evicted to make room. A page that has been modified is called a dirty page.
Flush list: the list of dirty pages. The database flushes dirty pages back to disk through the checkpoint mechanism. A dirty page is on both the LRU list and the Flush list: the LRU list manages page availability in the buffer pool, while the Flush list manages flushing pages back to disk.
③ Redo log buffer. Redo log records are first placed in this buffer and then flushed to the redo log files, by default once per second. The default size is 8 MB. The redo log is flushed to disk in three situations: by the Master Thread every second, on transaction commit, and when less than half of the redo log buffer remains free.
④ Additional memory pool. Stores the control objects of the buffer pools and other bookkeeping such as LRU management, locks, and wait information.
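The buffer pool and redo log buffer described above can be inspected from any client session. A minimal sketch, assuming a MySQL 5.5-era or later server where these variable names exist:

```sql
-- Buffer pool size and the midpoint-insertion tuning knob for the LRU list.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_instances';
SHOW VARIABLES LIKE 'innodb_old_blocks_pct';   -- where the LRU midpoint sits

-- Redo log buffer (8 MB by default, per these notes).
SHOW VARIABLES LIKE 'innodb_log_buffer_size';

-- Hit rate, dirty page counts, and free pages are reported in the
-- BUFFER POOL AND MEMORY section of this output.
SHOW ENGINE INNODB STATUS\G
```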
2.3 Checkpoint technology
Write Ahead Log strategy: when a transaction commits, the redo log is written first and the page is modified afterwards, so that if the database crashes before a dirty page reaches disk, the data can be recovered from the redo log and nothing is lost.
Checkpoint: responsible for flushing the dirty pages in the buffer pool back to disk. It solves three problems: it shortens database recovery time, it flushes dirty pages to disk when the buffer pool is short of space, and it flushes dirty pages when the redo log would otherwise become unusable.
There are two kinds of checkpoint:
① Sharp Checkpoint: when the database shuts down, all dirty pages are flushed back to disk. This is the default shutdown behavior.
② Fuzzy Checkpoint: only some dirty pages are flushed, not all. Four fuzzy-checkpoint scenarios: the Master Thread asynchronously flushes from the Flush list every 1 and 10 seconds; fewer than about 100 free pages remain in the LRU list, forcing eviction and flushing; the cyclically reused redo log files become unavailable, forcing a flush; and dirty pages are force-flushed when their proportion of the buffer pool exceeds 75%.

2.4 Insert buffer
Primary key (clustered) insert: data page records are stored in primary key order, so insertion is extremely fast. Secondary index insert: requires discrete, random reads of non-clustered index pages, so performance is poor.
Insert Buffer: a region in the buffer pool whose data structure is a B+ tree, used for insert operations on non-unique secondary indexes. Globally there is only one insert buffer B+ tree; it buffers inserts into the non-unique secondary indexes of all tables and is stored in the shared tablespace.
Change Buffer: introduced in InnoDB 1.0 as an upgrade of the insert buffer. It can buffer INSERT, DELETE, and UPDATE DML operations, via the Insert Buffer, Delete Buffer, and Purge Buffer.
Insert buffer principle: for inserts and updates on a non-clustered index, the change is not always applied directly to the index page. InnoDB first checks whether the target non-clustered index page is in the buffer pool; if it is, the change is applied directly. If not, the change is first placed into an Insert Buffer object, and the database behaves as if the insert into the non-clustered index leaf node had succeeded. Then, at a certain frequency and under certain conditions, the insert buffer is merged with the real secondary index leaf pages.
Insert buffer advantage: by merging multiple inserts to the same index page into one operation, it greatly improves non-clustered index insert performance.
The insert buffer is merged in the following cases: ① when the secondary index page is read into the buffer pool; ② when the Insert Buffer Bitmap page tracks that the secondary index page has no space left; ③ by the Master Thread.
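Change buffering behavior is controlled by a server variable; a short sketch, assuming MySQL 5.5 or later where innodb_change_buffering exists:

```sql
-- 'all' buffers inserts, deletes, and purges for non-unique secondary
-- indexes (the Insert, Delete, and Purge buffers described above).
SHOW VARIABLES LIKE 'innodb_change_buffering';

-- Merge activity appears under INSERT BUFFER AND ADAPTIVE HASH INDEX.
SHOW ENGINE INNODB STATUS\G
```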
2.5 Doublewrite
Partial page write: when a dirty page is being flushed to disk and the server crashes, only part of the 16 KB page (for example 4 KB) may have been written, and data is lost. The redo log records physical operations on pages, for example "write record A at offset 200", so recovery needs an intact copy of the page. Doublewrite improves reliability by keeping a copy of the page in the shared tablespace, so a copy from the time of the crash is available.
Doublewrite has two parts: a doublewrite buffer in memory, 2 MB in size, and two contiguous extents in the shared tablespace on disk, 128 pages and likewise 2 MB in total. The write process:
① When the buffer pool flushes dirty pages, they are not written directly to disk; they are first copied into the in-memory doublewrite buffer with memcpy.
② The buffer is then written in two 1 MB chunks to the doublewrite extents in the shared tablespace, and fsync is called immediately to force the data past the disk cache. Because the addresses in the two extents are contiguous, this is a fast sequential write.
③ Finally, the pages in the doublewrite buffer are written to their places in the tablespace files on disk, which is a discrete, random write.
Crash recovery with doublewrite: find the copy of the page in the two extents of the shared tablespace, copy it over the page in the tablespace file, and then apply the redo log.

2.6 Adaptive hash index
AHI is an automatic optimization inside the database, enabled by default. A hash lookup needs only one probe, whereas a B+ tree is typically 3 to 4 levels high and needs 3 to 4 probes. The AHI is built from B+ tree pages already in the buffer pool, so it is fast to create and no hash index needs to be built over the whole table. The storage engine automatically creates hash indexes for hot pages based on access frequency and pattern. In general: ① the access pattern for the page must be identical and consecutive, not alternating; ② the page has been accessed 100 times in that pattern, or accessed on the order of (records on the page)/16 times.

2.7 Asynchronous IO
AIO: async IO. Advantages: it avoids synchronous waiting and can merge IO operations, improving IOPS. Native AIO support from the operating system is required. In InnoDB, read-ahead, flushing dirty pages to disk, and so on are all done through AIO.

2.8 Flushing neighbor pages
When a dirty page is flushed, InnoDB checks all pages in the same extent; any that are also dirty are flushed together. The benefit is that AIO can merge several page IO operations into one, improving performance.

Chapter 3: MySQL and InnoDB files
Files include parameter files, log files (error log, slow query log, general query log, binary log), socket files, pid files, table structure definition files, and storage engine files (tablespace files, redo log files).

3.1 Binary log
Records all operations that change the MySQL database. It belongs to the MySQL server layer, not the storage engine layer, and is disabled by default. Uses: point-in-time recovery, master-slave replication, and security auditing. When the storage engine supports transactions, the uncommitted binary log entries of a transaction are kept in a session-level cache; when the transaction commits, the cached entries are written straight to the binary log file. There are three formats: STATEMENT (logical SQL statements), ROW (row-level changes), and MIXED (STATEMENT by default, switching to ROW in certain cases). The ROW format gives more reliable replication and recovery, at the cost of larger files.

3.2 Other files
Files with the .sock suffix are socket files; on Unix systems a local connection to MySQL can be made over a Unix domain socket. Files with the .pid suffix are pid files; when a MySQL instance starts, it writes its own process ID into this file. Files with the .frm suffix are table structure definition files, also used to store view definitions.

3.3 Tablespace files
Tablespace files come in two kinds: the shared tablespace file, also known as the default tablespace file, and independent per-table tablespace files with the .ibd suffix. When the innodb_file_per_table parameter is set, each table has its own tablespace; otherwise data is stored in the shared tablespace.
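These file-level settings can be checked from a session; a minimal sketch using standard MySQL statements (SHOW BINARY LOGS assumes the server was started with log-bin):

```sql
-- Binary logging status and format.
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';      -- STATEMENT, ROW, or MIXED
SHOW BINARY LOGS;

-- Per-table tablespaces: when ON, each table gets its own .ibd file;
-- otherwise data goes into the shared tablespace.
SHOW VARIABLES LIKE 'innodb_file_per_table';
```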
3.4 Redo log files
The transaction log of the InnoDB storage engine layer, written in a circular fashion. InnoDB has 51 types of redo log record. Both the redo log and the binary log record transaction activity, but they differ:
① Level and scope. The binlog belongs to the MySQL server layer and records everything related to the MySQL database, covering all storage engines; the redo log belongs to the InnoDB storage engine layer and records only that engine's transaction logs.
② Content. The binlog records the logical log of a transaction, while the redo log records the physical changes to each page.
③ Write timing. The binlog is written once, when the transaction commits, while the redo log is written continuously as the transaction runs.

Chapter 4: Tables

4.1 Index organized tables
HOT: heap organized table. Index and table data are separate, and the index records the rowid of each row's location. The storage location of an inserted row is essentially random, determined mainly by which blocks in the database have free space; rows are fetched via the rowid, and a full table scan does not necessarily return rows in insertion order.
IOT: index organized table. Index and data are stored together; the row data is stored as the index. Difference between the two: queries are faster on index organized tables, inserts are faster on heap organized tables. Oracle supports both heap tables and index organized tables; InnoDB supports only index organized tables.
In the InnoDB storage engine, tables are organized and stored in primary key order. If no primary key is explicitly defined, the engine first checks whether the table has a non-null unique index to use; if not, a 6-byte pointer, "_rowid", is created automatically.

4.2 InnoDB logical storage structure
All data is logically stored in tablespaces. A tablespace is made up of segments, extents, and pages.
(1) Tablespaces. The shared tablespace stores undo logs, insert buffer index pages, the doublewrite buffer, and other system information. Each table's independent tablespace stores its data, indexes, and insert buffer bitmap pages.
(2) Segments. Tablespaces are composed of segments; common ones are data segments, index segments, and rollback segments.
(3) Extents. An extent consists of consecutive pages and is always 1 MB in size: 64 consecutive pages of 16 KB each.
(4) Pages. The page is the smallest unit of InnoDB disk management. Common page types: data pages, undo pages, system pages, transaction data pages, insert buffer bitmap pages, and so on.

4.3 InnoDB row record formats
The formats are Compact and Redundant. Compact is the newer one; Redundant exists for compatibility with older versions. In Compact format, NULL values occupy no storage space. Each row carries 2 to 3 additional hidden columns: a transaction ID column, a rollback pointer column, and possibly a _rowid column.
Row overflow data: part of a record is stored off the actual data page. When a row overflows, the row data may be stored across 2 to N pages; large objects such as BLOB, LOB, and VARCHAR values may be stored on or off the page. A VARCHAR column can be declared up to 65535 bytes, but because of per-row overhead the actual maximum is 65532 bytes. For multi-byte character sets, InnoDB treats CHAR as variable length: in CHAR(N), N is the number of characters.
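As an illustration of row formats, the hypothetical table below is pinned to Compact explicitly; the name t_compact is not from the original notes:

```sql
CREATE TABLE t_compact (
    id   INT NOT NULL PRIMARY KEY,
    note VARCHAR(255)
) ENGINE = InnoDB ROW_FORMAT = COMPACT;

-- The Row_format column confirms the format actually in use.
SHOW TABLE STATUS LIKE 't_compact'\G
```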
4.4 Constraints
InnoDB has five kinds of constraint: Primary Key, Unique Key, Foreign Key, Default, and Not Null.
Difference between constraints and indexes: a constraint is a logical concept used to ensure data integrity; an index is a data structure, both a logical concept and a physical storage method.
Trigger: automatically runs a SQL command or stored procedure before or after an INSERT, DELETE, or UPDATE is executed.

4.5 Views
A view is a virtual table defined over base tables; operations on a view act on the base tables in its definition. Materialized view: not a virtual table over the base tables but an actually stored real table built from them. Materialized views can be used to precompute and store the results of time-consuming SQL operations such as multi-table joins or aggregations.

4.6 Partitioned tables
Partitioning is implemented at the MySQL server layer, not the storage engine layer, though not every storage engine supports it. Local partitioned index: MySQL partitioning uses local partitioned indexes, where each partition stores both its data and its indexes. Global partitioning would store data in the partitions but keep all index data in one object; MySQL does not currently support global partitioning.
MySQL supports four partition types (a RANGE example follows this section):
① RANGE partitioning: rows are placed into partitions according to column values falling in continuous intervals.
② LIST partitioning: like RANGE, but for discrete values.
③ HASH partitioning: partitions on the return value of a user-defined expression, which must not be negative.
④ KEY partitioning: partitions using a hash function supplied by MySQL itself.
Whatever the partition type, if the table has a primary key or unique index, the partitioning column must be an integral part of every unique index; otherwise any column may be chosen as the partitioning column. NULL values are allowed in partitioning; as in sorting, NULL is always treated as the smallest value.
Subpartitioning: also called composite partitioning, it partitions within partitions. MySQL allows HASH or KEY subpartitions on RANGE and LIST partitions.
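The RANGE example promised above, as a hedged sketch: the sales table and year boundaries are hypothetical, and the partitioning column is included in the primary key to satisfy the unique-index rule. EXPLAIN PARTITIONS is the pre-MySQL-5.7 spelling; later versions show partitions in plain EXPLAIN output.

```sql
CREATE TABLE sales (
    id        INT NOT NULL,
    sale_year INT NOT NULL,
    PRIMARY KEY (id, sale_year)   -- partitioning column must be part of it
)
PARTITION BY RANGE (sale_year) (
    PARTITION p2019 VALUES LESS THAN (2020),
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Check which partitions a query actually touches (partition pruning).
EXPLAIN PARTITIONS SELECT * FROM sales WHERE sale_year = 2020;
```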
Chapter 5: Indexes and algorithms

5.1 InnoDB index overview
The InnoDB storage engine supports three kinds of index: B+ tree indexes, full-text indexes, and hash indexes. The B stands for balanced. The (adaptive) hash index cannot be manually controlled.

5.2 Data structures and algorithms
Binary search: also called half-interval search; it finds a record in an ordered array of records.
Binary search tree: the key of the root is always greater than every key in its left subtree and less than every key in its right subtree; an in-order traversal yields sorted order. In the worst case the tree degenerates and lookup becomes sequential search.
Balanced binary tree (AVL tree): first, it satisfies the binary search tree definition; second, the heights of the two subtrees of any node differ by at most 1. Balance is maintained by left and right rotations. Mostly used for in-memory structures.
B+ tree: a balanced search tree designed for disks and other storage devices, evolved from the balanced binary tree and the B tree. In a B+ tree, all records are stored in key order in the leaf nodes, which all sit on the same level and are linked together by pointers. On insert or delete, the B+ tree rebalances by rotation or by splitting.

5.3 B+ tree indexes
B+ tree indexes divide into clustered indexes and secondary indexes. Both are highly balanced B+ trees, typically 2 to 4 levels; the difference is whether the leaf nodes store whole rows.
(1) Clustered index: also called the primary key index. A B+ tree is built on each table's primary key, and the leaf nodes store the full row data of the table; leaf nodes are also called data pages, and internal nodes are called index pages. Data pages are connected by a doubly linked list. Because InnoDB tables are index organized, table data is stored in primary key order, so each table can have only one clustered index. The data is logically contiguous but not necessarily physically contiguous; sorting by primary key is extremely fast.
(2) Secondary index: also called a non-clustered index. The leaf nodes store the primary key value rather than the whole row. A table can have multiple secondary indexes. When searching through a secondary index, the leaf node yields the primary key; if the index cannot cover the query, the row is then fetched through the clustered index. Read-ahead is generally used to mitigate the multiple discrete reads that secondary index lookups cause.
(3) B+ tree index splits. Splits do not always start from the middle record of the page, since that could waste page space.
(4) Creating and dropping B+ tree indexes. There are two syntaxes: ALTER TABLE and CREATE/DROP INDEX. For clustered indexes the process is (disadvantage: a large table is unavailable for a long time): ① create a new temporary table with the structure defined by the ALTER TABLE command; ② import the original table's data into the temporary table; ③ drop the original table; ④ rename the temporary table to the new table name. Secondary indexes are created and dropped with FIC (Fast Index Creation); the process (disadvantage: writes are still blocked) is: InnoDB takes an S lock on the table being indexed, and no table rebuild is needed, which is much faster; dropping only requires deleting the index definition from the internal views. FIC applies only to secondary indexes; clustered index changes still rebuild the table.

5.4 Cardinality
Low-selectivity columns: fields such as gender, region, or type, with a small range of values, do not need B+ tree indexes. High selectivity: a field with a wide range of values and almost no duplicates is well suited to a B+ tree index. The Cardinality value is an estimate, obtained by sampling, of the number of distinct records in the index, not an exact count. The ratio of Cardinality to the total number of rows in the table should be as close to 1 as possible.

5.5 Composite indexes
A composite (joint) index indexes multiple columns of a table. It is also a B+ tree; for a given value of the first column, the second column's key values are already sorted.

5.6 Covering indexes
Also called index covering: the query can be answered entirely from the secondary index, without looking up records in the clustered index. The benefit: a secondary index does not contain the full row, so it is much smaller than the clustered index, and a large amount of IO can be avoided. When the query cannot be covered and would return many rows, the optimizer will choose the clustered index even if the condition matches the secondary index, replacing random reads with sequential reads.
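A covering-index sketch, assuming a hypothetical orders table with customer_id and order_date columns (not from the original notes):

```sql
ALTER TABLE orders ADD INDEX idx_cust_date (customer_id, order_date);

-- Both selected columns live in the secondary index, so the clustered
-- index is never visited; EXPLAIN shows "Using index" in Extra.
EXPLAIN SELECT customer_id, order_date
FROM orders
WHERE customer_id = 42;
```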
5.7 MRR optimization
MRR: Multi-Range Read optimization, which can increase query performance tenfold on IO-bound workloads. Goal: reduce random disk access by converting it into sequential access. How MRR works: ① the secondary index keys matched by the query are stored in a cache, in secondary index order; ② the cached keys are sorted by rowid; ③ the actual data files are accessed in rowid order.

5.8 ICP optimization
ICP: Index Condition Pushdown. Part of the WHERE filtering is pushed down to the storage engine layer, which evaluates the condition while scanning the index. Goal: reduce the number of records requested by the upper SQL layer.

5.9 Full-text search
Supported by InnoDB since version 1.2, and implemented with inverted indexes. An inverted index is also an index structure: it stores, in an auxiliary table, the mapping between each word and its location in one or more documents, using an associated array. There are two forms: ① inverted file index, storing {word, document ID containing the word}; ② full inverted index, storing {word, (document ID, position within the document)}. InnoDB uses the latter.
Auxiliary Table: the table that stores the words, persisted to disk. To improve parallelism there are 6 of them, each partitioned by the Latin encoding of the word. FTS Index Cache: the full-text search index cache, a red-black tree structure.
Full-text search restrictions: ① each table can have only one full-text index; ② a full-text index over multiple columns requires the columns to share the same character set and collation; ③ languages without word delimiters, such as Chinese, Japanese, and Korean, are not supported.
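A minimal full-text sketch, assuming InnoDB 1.2 / MySQL 5.6 or later; the articles table is hypothetical:

```sql
CREATE TABLE articles (
    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body TEXT,
    FULLTEXT INDEX ft_body (body)
) ENGINE = InnoDB;

-- Inverted-index lookup through MATCH ... AGAINST.
SELECT id FROM articles
WHERE MATCH (body) AGAINST ('storage engine' IN NATURAL LANGUAGE MODE);
```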
Chapter 6: Locks

6.1 What is a lock
The lock mechanism manages concurrent access to shared resources and provides data integrity and consistency. Both latch and lock can be called locks.
Latch: a lightweight lock whose hold time must be very short. In InnoDB, latches include mutexes and rwlocks. Their purpose is to guarantee the correctness of critical resources operated on by concurrent threads; there is no deadlock detection for them.
Lock: its object is the transaction; it locks database objects such as tables, pages, and rows, and it is released only when the transaction commits or rolls back. Deadlock detection exists for locks.

6.2 InnoDB storage engine locks
(1) Lock types. InnoDB implements two standard row-level locks: ① shared lock (S lock), allowing a transaction to read a row; ② exclusive lock (X lock), allowing a transaction to delete or update a row. Lock compatibility: both are row locks, and compatibility refers to locks on the same row record.
Intention lock: divides the locked objects into multiple levels. The multi-granularity locking InnoDB supports lets row-level and table-level locks coexist. If the locked objects are viewed as a tree, locking a fine-grained object requires first locking the coarser-grained objects above it: to place a row-level X or S lock, intention locks must first be placed on the table, page, and other higher-level objects. InnoDB has only table-level intention locks, designed to reveal which type of row lock the transaction will request next. Two intention locks are supported: ① intention shared lock (IS lock): the transaction intends to obtain S locks on some rows of a table; ② intention exclusive lock (IX lock): the transaction intends to obtain X locks on some rows of a table. (Lock compatibility matrix omitted.)
(2) Consistent non-locking reads. InnoDB reads a version of the row as of the current point in time through multi-version control. If the row being read holds an X lock, the read does not wait for its release; it reads a snapshot of the row instead. Snapshot data is the row's earlier version, implemented through the transaction's undo segments, with no extra overhead, and reading a snapshot takes no locks. Non-locking reads are the default read mode and greatly improve database concurrency.
MVCC: multi-version concurrency control. A row may have more than one snapshot version (kept in the undo segments); this is called row multi-versioning, and the concurrency control it enables is called MVCC.
Which snapshot a non-locking read sees depends on the isolation level: ① under READ COMMITTED, it always reads the latest committed snapshot of the locked row; ② under REPEATABLE READ, it always reads the version of the row as of the start of the transaction.
(3) Consistent locking reads. InnoDB supports two locking reads for SELECT, both of which must run inside one transaction (see the example after this section): ① SELECT ... FOR UPDATE places an X lock on the rows read; ② SELECT ... LOCK IN SHARE MODE places an S lock on the rows read.
(4) Auto-increment and locking. InnoDB provides a lightweight mutex-based auto-increment mechanism that greatly improves the performance of auto-increment inserts. In InnoDB, the auto-increment column must be indexed and must be the first column of that index.
(5) Foreign keys and locking. Foreign keys are mainly used to enforce referential integrity. In InnoDB, if a foreign key column is not explicitly indexed, InnoDB automatically creates an index on it to avoid table locks. Inserting or updating a foreign key value requires first reading the parent table's records, but a consistent non-locking read cannot be used there because it could produce inconsistency; so the parent table is read with SELECT ... LOCK IN SHARE MODE, placing an S lock on the parent rows.
InnoDB has three row lock algorithms: ① Record Lock: a lock on a single row record; ② Gap Lock: locks a range, but not the record itself; ③ Next-Key Lock: record lock plus gap lock, locking a range including the record itself. Next-Key Locking covers ranges from negative infinity to positive infinity and is designed to solve phantom reads. When the queried index is unique, the Next-Key Lock is downgraded to a Record Lock: only the index record itself is locked, not the range.
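The locking-read example referenced in (3) above; the table and values are illustrative, and the locks are released at COMMIT or ROLLBACK:

```sql
START TRANSACTION;
SELECT amount FROM orders WHERE id = 1 LOCK IN SHARE MODE;  -- S lock
COMMIT;

START TRANSACTION;
SELECT amount FROM orders WHERE id = 1 FOR UPDATE;          -- X lock
UPDATE orders SET amount = amount + 10 WHERE id = 1;
COMMIT;
```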
6.3 Phantom reads
A phantom read occurs when, within one transaction, executing the same SQL statement twice in a row can return different results: the second execution may return rows that did not exist before.

6.4 Deadlocks
A deadlock arises when two or more transactions compete for lock resources during execution. The simplest resolution is a timeout: when two transactions wait on each other and one wait exceeds the threshold, that transaction is rolled back and the other proceeds. Disadvantage: if the timed-out transaction is large and holds a lot of undo, rolling it back is expensive.
Wait-for graph: an active deadlock detection method that detects a deadlock by checking whether the graph contains a cycle, then rolls back the transaction with the least undo. InnoDB uses a wait-for graph; it requires the database to keep two pieces of information: the lock information list and the transaction wait list.

6.5 Lock escalation
Lock escalation means lowering the granularity of the current locks. Advantage: locks are a scarce resource, and escalation reduces their number, cutting the memory and other overhead of locking, which can improve efficiency; the disadvantage is reduced concurrency. The InnoDB storage engine does not need lock escalation, because it does not keep a lock per record; instead it manages locks per page accessed by each transaction, using a bitmap, so the overhead of locking one record or many records on a page is usually the same.

Chapter 7: Transactions

7.1 Transaction types
Five types (InnoDB does not support nested transactions but supports the other four):
① Flat transactions: all operations are at the same level; no part of the transaction can be committed or rolled back on its own.
② Flat transactions with savepoints: the transaction can roll back to an earlier state within itself, though savepoints are lost on a crash (see the savepoint sketch below).
③ Chained transactions: when a transaction commits, unneeded objects are released and the necessary processing context is implicitly passed to the next transaction's begin.
④ Nested transactions: the top-level transaction controls sub-transactions at every level. A sub-transaction can commit or roll back, but its commit does not take effect immediately: a sub-transaction is actually committed only after the top-level transaction commits, and a rollback at any node in the tree rolls back all of its child transactions.
⑤ Distributed transactions: usually flat transactions running in a distributed environment.

7.2 Implementation of transactions
The redo log ensures the atomicity and durability of transactions; the undo log ensures their consistency. They are not inverses of each other, and both can be regarded as a kind of recovery operation: redo recovers the page changes of committed transactions, while undo rolls records back to a specific version. Their contents differ accordingly: the redo log is generally a physical log that records physical modifications to pages, while undo is a logical log, recorded per row. Redo is used for transaction durability; undo is used for transaction rollback and for MVCC. Redo is written essentially sequentially and need not be read while the database is running; undo must be read and written randomly.
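The savepoint sketch referenced in ② above, with illustrative values:

```sql
START TRANSACTION;
INSERT INTO orders (id, amount) VALUES (100, 25.00);
SAVEPOINT after_first_insert;
INSERT INTO orders (id, amount) VALUES (101, 99.00);
ROLLBACK TO SAVEPOINT after_first_insert;  -- undoes only the second insert
COMMIT;                                    -- the first insert persists
```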
7.3 Redo log
The redo log consists of the redo log buffer in memory and the redo log files on disk. InnoDB achieves durability through the Force Log at Commit mechanism: when a transaction commits, all of its redo log must first be written to the redo log files and persisted; only then is the commit complete. To guarantee every log record reaches the redo log file, fsync must be called after each write from the redo log buffer to the file, so disk performance determines transaction commit performance.
The MySQL binary log is written only once, when the transaction commits; the InnoDB redo log is written continuously while the transaction runs.
Redo log block: redo logs are stored in 512-byte blocks, both in the buffer and in the files. Because the block size equals the disk sector size, redo log writes are atomic without needing the doublewrite technique.
LSN: Log Sequence Number, 8 bytes, monotonically increasing. The LSN expresses the total amount of redo written, the position of the checkpoint, and the version of a page.
Redo log recovery: since the checkpoint records the LSN already flushed to disk pages, recovery only needs to replay the log from the checkpoint onward. The storage engine attempts recovery on every startup. Because redo logs are physical logs, recovery is faster than with logical logs such as the binlog. An INSERT is not idempotent in the binary log, but it is idempotent in the redo log, since redo records physical page modifications.

7.4 Undo log
Undo segment: undo is stored in undo segments inside the database's shared tablespace. Undo is a logical log that restores the database logically to its former state.
Undo has two functions: rollback and MVCC. MVCC: when a user reads a row that is already occupied by another transaction, the current transaction can read an earlier version of the row through undo, achieving a non-locking read.
Read-only transactions: the data seen by the whole transaction is fixed at the moment the transaction starts; even if other sessions modify and commit data during the transaction's lifetime, the transaction is unaffected. Database read-only transactions are also implemented with undo segment snapshots.
Writing undo log records onto the pages of an undo log segment itself also requires writing redo log. After a transaction commits, its undo log and undo pages cannot be deleted immediately, because other transactions may still need the undo log to obtain earlier row versions. After commit, the undo log is placed on a linked list, and the purge thread decides when it is finally deleted. One undo page may hold the undo logs of different transactions.

7.5 Group commit
When committing a transaction, InnoDB performs two phases: ① modify the transaction's information in memory and write the log into the redo log buffer; ② call fsync to ensure the log is written from the redo log buffer to disk. In step ②, the redo logs of multiple transactions can be flushed to disk with a single fsync, improving performance: this is group commit.
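The fsync-on-commit behavior described above is governed by a server variable; a short sketch:

```sql
-- 1 (the default) fsyncs the redo log on every commit, matching Force Log
-- at Commit; 0 and 2 trade durability for speed.
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

-- Written / flushed / checkpointed LSNs appear in the LOG section.
SHOW ENGINE INNODB STATUS\G
```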
When binary logging is enabled, the following steps keep the storage engine layer and the binary log consistent: once the server layer has written and fsynced the binary log, the transaction is considered successfully committed, even if a crash occurs while the later steps execute.
BLGC (Binary Log Group Commit): the mechanism by which the MySQL upper layer groups binary log writes into group commits while the storage engine layer's redo log is also group committed. When the MySQL upper layer commits, transactions are first placed into a queue in order; the first transaction in the queue is called the leader and the others are followers, with the leader controlling the followers' behavior. BLGC has three phases: ① Flush stage: write each transaction's binary log into memory; ② Sync stage: flush the binary logs of multiple transactions from memory to disk with a single fsync, which is the group commit; ③ Commit stage: the leader calls the storage engine layer's commit for each transaction in order, using InnoDB's own group commit.

7.6 Distributed transactions
InnoDB supports XA transactions. When using distributed transactions, the InnoDB transaction isolation level must be set to SERIALIZABLE. An XA transaction consists of one application, one transaction manager, and multiple resource managers. MySQL has both external and internal XA transactions: ① in external XA transactions, the MySQL database itself acts as a resource manager; ② internal XA transactions occur between storage engines, or between a storage engine and a plugin; the most common is between the binlog and the InnoDB storage engine. When a transaction commits, InnoDB first receives the Prepare request from the server layer and writes the transaction's xid; then the server layer writes the binary log. If the database crashes while InnoDB is executing the final commit, then after MySQL restarts it first checks whether the prepared transaction's xid was already committed; if not, the storage engine commits it again.

7.7 Bad transaction habits
Long transactions, committing inside a loop, relying on auto-commit, and relying on auto-rollback.

Chapter 8: Backup and recovery

8.1 Backup classification
By how the database is running, backups divide into: ① hot backup: taken directly while the database is running; ② cold backup: taken after the database is stopped; ③ warm backup: taken while the database is running, with some impact on its operation, for example acquiring a global read lock to guarantee the consistency of the backed-up data.
By file content, backups divide into: ① logical backup: readable text files of SQL statements; slow to restore; ② raw file backup: copies of the database's physical files; tools that can copy them while the database runs include ibbackup and xtrabackup; fast to restore.
By scope, backups divide into: ① full backup: a complete backup of the database; ② incremental backup: backs up the data changed since the last full backup; ③ log backup: backs up the binary logs, which can be replayed for point-in-time recovery.
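A minimal warm-backup coordination sketch matching the global-read-lock approach above; the file copy itself happens outside SQL:

```sql
FLUSH TABLES WITH READ LOCK;   -- global read lock: consistent backup point
SHOW MASTER STATUS;            -- record File/Position for binlog replay
-- ... copy the raw files or take the filesystem snapshot here ...
UNLOCK TABLES;
```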
8.2 Snapshot backup
MySQL itself does not provide a snapshot capability, so snapshot backup means backing up the database through a snapshot function supported by the file system. The precondition is that all database files reside in the same file system partition, which is then snapshotted. Linux's LVM (logical volume manager) disk management tool supports snapshots, which it creates using copy-on-write.
Copy-on-write: when a snapshot is created, only the metadata of the data in the original volume is copied; no physical copy of the data is made, so snapshot creation is very fast. After the snapshot exists, writes to the original volume are tracked: before a block is changed, its old contents are copied into the snapshot's reserved space. When reading from the snapshot, a block that has not been modified since the snapshot was created is read directly from the original volume, while a modified block is read from the pre-change copy saved in the snapshot. (Diagram of snapshot reads omitted.)

8.3 Replication
Replication is a high-availability, high-performance solution for MySQL. It works in three steps: ① the master records data changes to its binary log; ② the slave copies the master's binary log into its own relay log; ③ the slave replays the relay log, applying the changes to its own data, achieving eventual consistency.
(Architecture diagram of snapshot backup plus replication omitted.)
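A hedged sketch of steps ① to ③ from the slave's side, using the pre-MySQL-8.0 replication commands; host, credentials, and coordinates are placeholders to be taken from SHOW MASTER STATUS on the master:

```sql
CHANGE MASTER TO
    MASTER_HOST     = 'master.example.com',
    MASTER_USER     = 'repl',
    MASTER_PASSWORD = '***',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS  = 4;

START SLAVE;
SHOW SLAVE STATUS\G   -- check Slave_IO_Running and Slave_SQL_Running
```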
