Written in front 
 
  
  This blog to undertake the last MySQL default engine InnoDB (a) to explore the MySQL database InnoDB storage engine
 
 
 
MySQL default storage engine InnoDB Introduction 
 
  
  InnoDB is the first MySQL storage engine to fully support the ACID transaction (BDB is the first MySQL storage engine to support transactions, and is now discontinued);
 
  The main features are support row lock, MVCC, transaction, foreign key and consistent non-lock read, can effectively utilize CPU and memory;
 
 
 
各版本对比如下:
InnoDB Architecture
innodb存储引擎主要有一些后台线程和相应内存(缓冲池,重做日志缓冲,额外缓冲池)组成;tips : 由于innodb存储引擎是基于磁盘存储的,由于CPU速度和磁盘速度存在巨大差距,因此基于磁盘的数据库通常都会使用缓冲池技术提高数据库的整体性能;
Background thread
后台线程:    master thread : 将缓冲池数据异步刷新到磁盘,保障数据一致性,包括刷新日志到磁盘,脏页刷新,合并插入缓冲,undo页回收等    IO thread :         innodb大量使用AIO(async IO)来提高数据库的性能         innodb有4个IO线程,read thread, write thread, log io thread, insert buffer thread        可以通过 show variables like "innodb_%io_threads"\G 查看读写线程数        可以使用 show engine innodb status\G  查看innodb存储引擎状态        可以通过 innodb_read_io_threads or innodb_write_io_threads配置read & write IO线程数(innodb 1.0x)    purge thread :         事务提交后,其所使用的undolog就不再需要,所以需要存储引擎负责回收并分配新undo页;        innodb1.1x之前 purge动作在master thread中完成;        之后可以通过配置 innodb_purge_threads=1 开启独立的purge thread来处理,以减轻master thread负担,提高CPU使用率;    page cleaner thread :         innodb1.2x版本开始,将脏页刷新操作放在单独的page cleaner thread中,进一步减轻了master thread负担,并减轻用户查询线程的阻塞;
Buffer pool
缓冲池:    缓冲池本质上就是一块内存区域,通过内存的速度弥补磁盘速度较慢对数据库性能的影响;    在数据库中进行读取操作时,首先判断待读取页是否存在缓冲池中;    如果存在,则缓冲池命中,直接从缓冲池中读取进行后续操作;    否则进行io操作,从磁盘中读取页,然后存放在缓冲池中(这个过程称 将页fix到缓冲池);    当数据库进行修改操作(insert update delete)时,则首先修改缓冲池中对应的页,然后再以一定频率刷新到磁盘上;    (注意:缓冲池中是页并不是每次都刷新,而是通过checkpoint技术按一定频率刷盘,下面会详细介绍checkpoint技术)    所以缓冲池的大小直接影响数据库的性能;    可以通过 show variables like "innodb_buffer_pool_size"\G 查看缓冲池大小    具体缓冲池包括索引页,数据页,undo页,插入缓冲,自适应哈希索引等(MySIAM只会缓存索引页,可看上面的介绍)
 
 
  
  Starting from innodb1.0x, run a set of multiple buffer pool instances, each of which is evenly distributed to different buffer pool instances based on the hash value;
So as to reduce the competition of internal resources and increase the concurrent processing ability;
The number of buffer pool instances can be configured through innodb_buffer_pool_instances;
The buffer pool instance configuration can be viewed through show variables like "innodb_buffer_pool_instances";
 
  The InnoDB storage engine uses the LRU (lastest recent used) to manage memory at least using an algorithm, as detailed in Appendix A
 
 
 
重做日志缓冲    重做日志记录了innodb的事务信息,当数据库宕机或者出现其他意外情况;可以还原数据到某个时间点,从而保证数据的一致性和事务的持久性;    以下情况下会将重做日志缓冲刷到磁盘上:        master thread 每秒刷盘一次        事务提交时,强制刷盘        重做日志缓冲池剩余空间小于1/2时,刷新到磁盘    所以重做日志缓冲不需要太大,只要保证一秒的事务量即可,一般设置为8M;    可以通过 show variables like "innodb_log_buffer_size" 查看;    
额外缓冲池    一些数据结构本身内存分配,缓冲相关对象(缓冲控制对象),等待等信息会直接从额外缓冲池中进行申请;    当缓冲池配置很大时,也要适当增加额外缓冲池的配置;
Checkpoint Introduction 
 
  
  When you execute a DML statement (update,delete) modification page, the page is dirty (that is, the page in the buffer pool is newer than the disk's page);
If each page is modified to synchronize to disk, the cost is very large, especially the hotspot data scattered in different pages;
 
  In another case, the data cannot be recovered when the dirty page is flushed to disk and the outage occurs;
To be able to recover data, the current transactional database generally takes the write ahead log policy (when a transaction commits, writes the redo log first, and then refreshes the dirty page to disk); When the outage occurs, the data can be recovered to a point in time by redo log;
 
 
 
Since the size of the buffer pool is limited, the log file cannot be enlarged infinitely; So the checkpoint technology is introduced.
Checkpoint (checkpoint) addresses the following issues: Reduce data recovery time when the database goes down when the buffer pool is not enough, refresh dirty pages to disk when the redo log file is not available, refresh the Dirty page checkpoint is divided into the following two kinds: sharp checkpoint Brush all dirty pages back to disk; it is time-consuming to use this mode of work when the database is closed; the Fuzzy CheckPoint brushes some dirty pages back to disk, which is used in the database runtime, and the InnoDB storage engine triggers checkpoint:master in the following cases THR EAD Checkpoint:master Thread periodically refreshes a certain percentage of dirty pages (per second or less than 10 seconds) to disk; Because this process executes asynchronously, the InnoDB storage engine can perform other operations, such as user query threads that do not block; Flush_lru_list C The HECKPOINT:INNODB storage engine requires a certain amount of free pages to be retained, and when the free pages are smaller than this number, the end of the LRU list page is removed, and if the pages contain dirty pages, the checkpoint is triggered and the dirty pages are flushed to disk; (because checking the LRU list and removing the end page will block User query operation, mysql5.6 start, this CheckPoint is placed in page cleaner thread) Async/sync Flush CheckPoint: When the Redo log file is unavailable (there is not enough space left), this type of Checkpoint force some dirty pages to be flushed back to disk; Set the following parameters: Redo_lsn: LSN written to redo log, Checkpoint_ls: LSN to flush back to disk C Heckpoint_age = Redo_lsn-checkpoint_lsn Async_water_mark = 75% * Total_redo_log_file_size Sync_wat Er_mark = 90% * total_redo_log_file_size when Checkpoint_age < Async_water_mark does not need to flush any dirty pages to disk when async _water_mark < CheckpoInt_age < Sync_water_mark asynchronously refreshes dirty pages when disk, so checkpoint < Async_water_mark when checkpoint_age > Sync_water_mark Synchronize flush dirty pages to disk so that Checkpoint_age < Asyn_water_mark because async flush checkpoint will block the user query thread that discovers the problem, sync Flush checkpoint will hinder Plug all user query threads; mysql5.6 begins to place this part in page cleaner thread; Dirty page too much CheckPoint: When dirty pages are too long, the InnoDB storage engine  Force flushing dirty pages back to disk, can be configured by innodb_max_dirty_page_pct the maximum proportion of dirty pages;
InnoDB Master Thread working mode
InnoDB存储引擎将主要工作都放在 master thread中完成,master thread具有最高线程优先级别(mysql是单进程多线程);
Introduction to Background threading logic
master thread由多个循环组成:    主循环(loop)    后台循环(backgroup loop)    刷新循环(flush loop)    暂停循环(suppend loop)主循环 loop  :     每秒操作:        刷新日志缓冲到磁盘(总是)        合并插入缓冲(当当前IO压力较小时,执行)        刷新至多100脏页到磁盘(当缓冲池脏页比例buf_get_modified_ratio_pct > 配置文件中 innodb_max_dirty_page_pct时刷新)        如果当前没有用户活动,切换到backgroup loop(可能)    每10秒操作:        刷新100个脏页到磁盘(当IO压力不大的情况下执行)        合并至多5个插入缓冲(总是)        刷新日志缓冲到磁盘(总是)        删除无用undo页(总是)        刷新100个或者10个脏页到磁盘(总是)后台循环(backgroup  loop):    当数据库空闲或者数据库关闭时,切换到此loop    删除无用undo页(总是)    合并20个插入缓冲(总是)    跳回到主循环 或者 跳到flush loop(总是 )刷新循环(flush loop ):    不断刷新100个脏页,直到符合条件,跳到suppend loop暂停循环(suppend loop ):    挂起master thread,等待事件发生,再次唤醒master thread
Background thread, pseudo code implementation
voidMaster_thread () {Loop://per second operationfor (int i-0; I<Ten; I++) {Thread_sleep (1);//thread sleeps for 1 seconds and may be less than 1 seconds when load is large             Do LogBuffer Flush toDisk//Refresh log buffering to disk            //If the last second IO count is less than 5, indicating that the current IO pressure is small, insert buffer merge            //if (Last_one_second_io_num < 5) {1.0x before judgment, hard-coded processing            if(Last_one_second_io_num<Innodb_io_capacity*5%){ DoMerge at most5Insert Buffer}refresh up to 100 dirty pages to disk when the current buffered dirty page ratio is greater than the maximum dirty page scale in the profile            if(buf_get_modified_ratio_pct>innodb_max_dirty_page_pct) {//Do buffer pool flush dirty pages 1.0x before hard-coded processing                 DoBuffer Pool Flush Innodb_io_capacity* -%Dirty pages}Else if(Enable_adaptive_flush) { DoBuffer pool Flush Desired Amount Dirty pageIntroduction of adaptive refresh strategy after//1.0x}//Switch to background loop if no active user is currently present            if(no user activity) {goto BackgroupLoop}        }//per 10-second Operation        //If the number of Io is less than 200 in 10 seconds, indicating that the current IO pressure is small, refresh 100 dirty pages to disk        ///1.0x before hard-coded processing        //if (Last_ten_second_io_num < +) {        //Do buffer pool flush dirty pages        // }        if(Last_tem_second_io_num<innodb_io_capacity) { DoBuffer Pool Flush Innodb_io_capacity* -%Dirty pages}//Do merge in most 5 insert buffer always merges up to 5 buffer 1.0x before hard-coded processing         DoMerge at most innodb_io_capacity*5%Insert Buffer//Always merge up to 5 insert buffers         Do LogBuffer toDisk//Always flush log buffers to disk         Do  FullPurge//Always delete useless undo page        //10 dirty pages to disk according to the scale of the buffer dirty page        if(buf_get_modified_pct>  -%){//Do buffer pool flush dirty pages 1.0x before hard-coded processing             DoBuffer Pool Flush Innodb_io_capacity* -%Dirty pages}Else{//Do buffer pool flush dirty pages 1.0x before hard-coded processing             DoBuffer Pool Flush Innodb_io_capactiy*Ten%Dirty pages} BackgroupLoop: Do  FullPage//delete useless undo page        //Do merge Insert buffer combines 20 insert buffers before 1.0x, hard-coded processing         DoMerge innodb_io_capacity Insert Buffer//Merge Insert buffer        //Skip to flush loop if idle, or skip to loop        if( notIdle) {gotoLoop}Else{Goto FlushLoop} flushLoop://Do Buffer pool flush dirty page 1.0x before hard-coded processing         DoBuffer Pool Flush Innodb_io_capacity* -%Dirty pageif(buf_get_modified_ratio_pct>innodb_max_dirty_pages_pct) {goto flushLoop}Else{goto SuppendLoop} suppendLoop: Suppend_thread ()//Suspend master threadWaiting event//wait for eventGotoLoop   //Jump to main loop}
From the above section of the pseudo-code can be clearly understood, innodb1.0x, or 10 seconds per second of Io write dead in the code, due to the advent of SSDs and disk development, the need for higher IO processing, so in the 1.0x version, the introduction of Innodb_io_ Capacity parameters to dynamically adjust IO throughput;
Innodb_max_dirty_pages_pect too Large (eg:90%), if the memory is too large or the server pressure is too large, the Refresh dirty page speed will be reduced, the recommended value is 75%
1.0x version, introduced the innodb_adapitive_flushing (adaptive Refresh) parameter, when the dirty page scale is less than Innodb_max_dirty_pages_pect, will automatically refresh a certain number of dirty pages; (innodb_ Adapitive_flushing determines the most appropriate number of dirty pages to refresh based on the speed at which the redo log redo log is generated)
Before the 1.0x version, full purge operation, up to 20 undo pages, from the 1.0x version, introduced the Innodb_purge_batch_size parameter, can be adjusted to recover undo pages;
Show variables like "Innodb_purge_batch_size" \g
Starting with the 1.2x version, the dirty page operation is refreshed in master thread, separated into a separate page cleaner thread, reducing the master thread burden and further improving system concurrency.
InnoDB Key Features 
 
  
   
   - Insertion buffer (insert buffer) improves performance
- Two writes (double write) for increased security
- Adaptive hash Indexes (adapitive hash index) INNODB internal Auto-optimization
- Asynchronous io (async io) improves concurrency
- Refresh the pick page (Flush neighbor page) for improved performance
 
Insert Buffer Insert Buffer Introduction 
 
  
  InnoDB storage engine, the primary key is the row unique identifier, usually sequential incremental inserts, so the insertion of the clustered index (primary key) is generally stored in order, do not need random discrete read disk, fast;
 
  However, the table usually has not only clustered indexes, but also some nonclustered indexes and is not sequential, the insertion of the leaf node of the nonclustered index is a discrete access to the nonclustered index page, and random reads result in degraded performance.
 
  For an INSERT or update operation on a nonclustered index, INNODB introduces an insert buffer,
When inserting or updating a nonclustered index, first determine whether the non-clustered index page in the buffer pool, if present, is inserted directly;
Otherwise, the merge (merge) operation of the insert buffer and the secondary Index page child node is first put into the Inssert buffer object, and then a certain frequency or special case, (e.g., insert buffer merge in master thread);
 
  Insert Buffer usage conditions
 
   
   - Index-time secondary index (nonclustered index)
- The index is not unique (a unique key will be found before each insertion)
 
INNODB1.0X started introducing change buffer, and insert buffer is different from the DML operation (Insert,update,delete), respectively, with insert buffer, purge Buffer,delete Buffer
 
 
  
  Update operation:
 
   
   - Mark a record for deletion
- Really delete records
Delete buffer corresponds to the deletion of the record mark, purge buffer corresponds to the actual deletion of the record
 
 
 
ps:    当数据库写比较密集时,insert buffer会占用过多缓冲池,默认最大可以占用1/2的缓冲池内存    innodb1.2x开始,可以通过innodb_change_buffer_max_size调节change buffer最大占用缓冲池的百分比
Insert Buffer Internal implementation
The data structure of the insert buffer is a B + tree, which is stored by default in the shared Tablespace (ibdata1) "Tips: Of course the buffer pool page has a corresponding insert buffer buffer";
The non-leaf node of the insert buffer B + is the query key (search key);
 
 
  
   
   - Space: The spatial ID of the table where the record is to be inserted (in the InnoDB storage engine, each table has a unique space ID)
- Marker: For compatibility with older versions of Insert buffer, temporarily ignoring
- Offset: Indicates where the page is located
 
When the secondary index is to be inserted into the secondary index page, if the secondary index page is not in the buffer pool, a search key is constructed according to the rules and inserted into the insert buffer B + tree;
Insert the leaf node structure as follows:
 
 
  
   
   - Ibuf_rec_offset_count: Two bytes, order to sort each record into insert buffer
 
To ensure that each secondary index page has enough space to store each time the merge is inserted, a special page is required to mark the available space for each secondary index page, which is called Insert buffer bitmap
Each insert buffer bitmap page Tracks 16,384 secondary index pages (256 extents, details of subsequent INNODB data will be explained in detail);
Insert Buffer Merge 
 
  
  Merge Insert buffer:
 
   
   - Secondary index pages are read to the buffer pool
 When a select query is read into the buffer pool, the secondary index page needs to be checked for the Insert Buffer bitmap page, and then confirm that the secondary index page is stored in the Insert buffer B + tree, and if present, insert buffer B + The records in the tree are inserted into the secondary index page
 
   - Insert Buffer Bitmap page traced to the secondary index page there is no space available
 The Insert Buffer bitmap page is used to track the available space for each secondary index page, ensuring that at least 1/32 pages of space are available, or forcing a merge to insert records from the page in the Insert buffer B + tree into the secondary index page
 
   - Master Thread
 The main thread, every second, or every 10 seconds, merges insert buffer operations
 
Write two times (double write) 
 
  
  When the InnoDB storage engine is writing a page to a table and writes only part of the data (16k pages, writes 4k), an outage occurs, called partial write failure (partial page write);
Since redo logs are recorded based on offsets (such as offset 299, write ' Insert Data '), a copy of the page must be required if you want to recover by redo log;
Before writing the data, save a copy of the page, that is, doublewrite
 
 
 
 
 
  
  Double write consists of two parts:
 
   
   - Double Write buffer (2MB)
- Shared table space on a physical disk 128 pages (2MB) continuous
 
When the dirty page is refreshed, the dirty page is first copied to double write buffer through the memcpy function, followed by two times, each time the 1MB is written to the physical disk of the shared tablespace (because it is sequential, very fast), and then the Dirty page is flushed back to the corresponding page in a discrete way;
 
 
  
  Tips: Some system files provide protection against partial write failures (EG:ZFS file system), which can be disabled by Innodb_doublewrite = off two write
 
 
 
Adaptive Hash Index
The InnoDB storage engine monitors queries against the indexes on the table, and if it is observed that a hash index can result in a speed increase, a hash index is established, called an Adaptive Hash Index (adapitive hash, index AHI);
The ahi is constructed by a B + tree in the buffer pool, so the data is established quickly;
Because the hash index can only be used for the equivalent query, the scope of the query can not be limited to a certain extent
You can consider whether to disable this feature by observing the searches and non hash searches, which is turned on by default
Asynchronous IO (AIO)
The current database system, in order to improve the performance of disk operation, have adopted asynchronous IO;
You can check if it is turned on by show variables like "Innodb_use_native_aio" \g;
Refresh adjacency Page
When a dirty page is refreshed, the InnoDB storage engine detects all pages in the same area (extend) of the page and, if there are other dirty pages, refreshes them together;
 
 
  
  Tips
 
   
   - Corresponding to the mechanical hard disk, adjacent page refresh can be multiple IO write operations, through the asynchronous IO feature, combined into an IO operation, can significantly improve performance
- For SSDs, it is recommended to turn off this feature due to high IOPS
 
Start, close and resume
No
Postscript
Subsequent exploration of InnoDB storage engine related files
 
InnoDB on MySQL default engine (ii)