InnoDB has used Linux native AIO (hereafter N-AIO) since 5.5, saying goodbye to the earlier simulated AIO. Here we analyze InnoDB's native AIO architecture based on the 5.6.10 source code.
InnoDB has n I/O handler threads (n = 1 ibuf_io_thread + 1 log_io_thread + innodb_read_io_threads read threads + innodb_write_io_threads write threads); these threads, together with the threads that submit the AIO requests, are the users of N-AIO. First, let's look at the core N-AIO data structures and how they relate:
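For example, with the 5.6 defaults innodb_read_io_threads = 4 and innodb_write_io_threads = 4, there are n = 1 + 1 + 4 + 4 = 10 such I/O handler threads.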
Figure 1: N-AIO core data structures and their relationships
os_aio_array_t: the InnoDB AIO object managed by one class of I/O handler (ibuf, log, read, write).
mutex: the mutex protecting the structure. Each os_aio_array_t is a global variable; the read and write arrays may be accessed concurrently by several threads (innodb_read_io_threads, innodb_write_io_threads), while the ibuf and log arrays each have only one thread and thus no concurrency problem.
not_full: a condition-variable event, signalled when the slots of the os_aio_array_t go from full to not full, i.e. when there is free space in the AIO array again; obviously, this is what I/O submitters wait on.
is_empty: another condition-variable event, signalled when the slots go from not empty to empty, i.e. no AIO is pending in the array any more; who waits on this condition variable?
n_slots: the number of pending AIO events the object can hold, i.e. the number of AIO request slots. It is the combined capacity of all threads belonging to the object (for read and write the object may cover several threads): n_slots = number of threads * maximum pending AIO events per thread (256).
n_segments: the number of segments the object is divided into, which is also the number of threads it covers.
cur_seg: the current segment number.
n_reserved: the number of slots currently occupied by pending AIO events.
slots: an array of n_slots os_aio_slot_t (AIO request objects); that is, the n_segments threads share these n_slots for storing pending AIO events.
aio_ctx: an array of n_segments AIO contexts, i.e. each thread has its own AIO context.
aio_events: an array of n_slots io_event structures, the place where io_events are stored after AIO requests complete.
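Put together, the structure looks roughly like the following sketch (field types simplified from the list above; not the literal 5.6 declaration):

typedef struct os_aio_array_struct {
    os_ib_mutex_t    mutex;      /* protects this array */
    os_event_t       not_full;   /* signalled when slots go from full to not full */
    os_event_t       is_empty;   /* signalled when all slots become free */
    ulint            n_slots;    /* = n_segments * 256 pending events per thread */
    ulint            n_segments; /* = number of I/O handler threads for this array */
    ulint            cur_seg;    /* current segment number */
    ulint            n_reserved; /* slots currently occupied */
    os_aio_slot_t*   slots;      /* array of n_slots request slots */
    io_context_t*    aio_ctx;    /* one Linux AIO context per segment */
    struct io_event* aio_events; /* n_slots io_events for completed requests */
} os_aio_array_t;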
os_aio_slot_t: an InnoDB AIO request object.
is_read: TRUE if this is a read operation.
pos: the index of this object in the os_aio_array_t->slots[] array.
reserved: TRUE if this slot is reserved (occupied).
reservation_time: the time when the slot was reserved.
len: the length of the I/O request.
buf: the buffer of the I/O request.
type: the I/O operation type, OS_FILE_READ or OS_FILE_WRITE.
offset: the file offset in bytes.
file: the file to read from or write to.
name: the file name or path.
io_already_done: whether this AIO request has already completed.
message1: the InnoDB file descriptor (fil_node_t) of the AIO operation.
message2: additional information, also used by the completion handlers when the AIO finishes.
control: the AIO request control block (iocb) used by this slot, the most important member of the structure.
n_bytes: the number of bytes written or read.
ret: the AIO return code.
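Likewise, a simplified sketch of the request object (again based on the field list above, not the exact 5.6 declaration):

typedef struct os_aio_slot_struct {
    ibool        is_read;          /* TRUE if a read operation */
    ulint        pos;              /* index in os_aio_array_t->slots[] */
    ibool        reserved;         /* TRUE if this slot is occupied */
    time_t       reservation_time; /* time the slot was reserved */
    ulint        len;              /* length of the I/O request */
    byte*        buf;              /* buffer of the I/O request */
    ulint        type;             /* OS_FILE_READ or OS_FILE_WRITE */
    os_offset_t  offset;           /* file offset in bytes */
    os_file_t    file;             /* file to read or write */
    const char*  name;             /* file name or path */
    ibool        io_already_done;  /* set when the AIO request has completed */
    fil_node_t*  message1;         /* InnoDB file node of the operation */
    void*        message2;         /* extra info for the completion handlers */
    struct iocb  control;          /* the libaio control block for this request */
    int          n_bytes;          /* bytes read/written */
    int          ret;              /* AIO return code */
} os_aio_slot_t;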
fil_node_t: the file descriptor managed by InnoDB for a file in a tablespace or the log space ("file node of a tablespace or the log data space").
fil_space_t: the space (tablespace or log data space) that the above file belongs to. [These two structures are discussed when we cover InnoDB file management.]
io_context: the AIO context.
iocb: the I/O request control block, representing one AIO request. [This is the libaio structure, not the kernel's aio_abi.h iocb, which has no data field.]
data: holds user data; here it stores the os_aio_slot_t corresponding to the AIO request.
io_event: the structure in which Linux AIO returns a completed event.
data: carries the data field of the iocb, but it is not used here.
obj: the iocb that submitted this I/O.
res: the completion status of the I/O; for a successful operation it is the number of bytes transferred, and it is saved into os_aio_slot_t->n_bytes.
res2: the completion status of the I/O; the error code, saved into os_aio_slot_t->ret.
[The above three structures belong to Linux AIO itself. For an introduction to Linux AIO see http://www.cnblogs.com/hustcat/archive/2013/02/05/2893488.html.]
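As a quick reference for these three structures, here is a minimal, self-contained sketch of the libaio interface that InnoDB builds on (io_setup / io_prep_pread / io_submit / io_getevents; compile with -laio; the file name is just an example):

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t    ctx;
    struct iocb     cb;
    struct iocb*    cbs[1];
    struct io_event ev;
    char            buf[4096];

    int fd = open("/tmp/aio_test", O_RDONLY);        /* example file */
    if (fd < 0) { perror("open"); return 1; }

    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(1, &ctx) != 0) {                     /* like InnoDB's per-thread AIO context */
        fprintf(stderr, "io_setup failed\n"); return 1;
    }

    io_prep_pread(&cb, fd, buf, sizeof(buf), 0);      /* read 4 KB at offset 0 */
    cb.data = (void*) 0x1234;                         /* user pointer; InnoDB stores its slot here */
    cbs[0] = &cb;
    if (io_submit(ctx, 1, cbs) != 1) {                /* what os_aio_linux_dispatch does */
        fprintf(stderr, "io_submit failed\n"); return 1;
    }

    /* Wait for at least one completion, like os_aio_linux_collect */
    if (io_getevents(ctx, 1, 1, &ev, NULL) == 1) {
        printf("iocb=%p data=%p res=%ld\n", (void*) ev.obj, ev.data, (long) ev.res);
    }

    io_destroy(ctx);
    close(fd);
    return 0;
}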
Now look at the next figure, which shows the overall relationship between the I/O handler threads and os_aio_array_t:
Figure 2: I/O threads and aio_array
[Note] Figure 2 assumes there are n read_io_threads and m write_io_threads, and that each thread can manage at most io_limit pending I/Os. As the figure shows, the ibuf and log types each have only one thread, so their aio_ctx array has one element while their slots and aio_events arrays have io_limit elements.
Next, let's look at the initialization of InnoDB N-AIO: os_aio_init. This function calls os_aio_array_create to initialize the os_aio_array_t of each class of I/O handler, and finally initializes the AIO context of every thread in each os_aio_array_t (via io_setup()). These arrays are global variables (os_aio_log_array, os_aio_ibuf_array, os_aio_write_array and os_aio_read_array). Taking os_aio_read_array with two innodb_read_io_threads as an example, its members are: n_slots = 2 * 256 = 512, n_segments = 2, slots = os_aio_slot_t[512], aio_ctx = io_context_t[2], aio_events = io_event[512]. That is, the two read threads can each monitor 256 AIO events; in other words, each thread has its own AIO context, and each context manages 256 I/O events.
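How the per-thread contexts come about can be pictured with a rough sketch (an assumed simplification of os_aio_array_create, not the literal 5.6 code): for each of the n_segments segments one io_context is created with room for io_limit events:

    /* Hedged sketch of the per-segment context creation; names follow Figure 2,
    not the exact 5.6 source. */
    ulint io_limit = array->n_slots / array->n_segments;    /* 256 per thread */
    for (ulint seg = 0; seg < array->n_segments; seg++) {
        memset(&array->aio_ctx[seg], 0, sizeof(array->aio_ctx[seg]));
        if (io_setup(io_limit, &array->aio_ctx[seg]) != 0) {
            /* handle the error (the real code retries / reports it) */
        }
    }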
After the AIO objects are initialized, the corresponding number of I/O handler threads are created. The entry function of each I/O handler thread is io_handler_thread, whose main loop is:
while (srv_shutdown_state != SRV_SHUTDOWN_EXIT_THREADS) {
    fil_aio_wait(segment);
}
The main task of fil_aio_wait is to use os_aio_linux_handle to check whether any AIO event monitored by this thread has completed. The mapping between a thread and the AIO events it monitors [see Figure 2] is: each thread has a global_seg, and within each class of os_aio_array_t it is identified by a segment; global_seg determines which os_aio_array_t the current thread belongs to, and segment then gives the thread's AIO context within that object (os_aio_array_t->aio_ctx[segment]). The events monitored by the thread are exactly the I/O events monitored by that AIO context. When an AIO request completes, its event information is stored into os_aio_array_t->aio_events[segment * seg_size] by io_getevents(io_ctx, 1, seg_size, events, &timeout). When a completed I/O event is found, the code checks whether it belongs to a data file (fil_node->space->purpose == FIL_TABLESPACE): if so, buf_page_io_complete finishes the buf page I/O, otherwise log_io_complete finishes the log I/O. [Note: we will discuss these two functions in a later article.]
Let's take a look at the main processing logic of the above functions:
fil_aio_wait(ulint segment)
{
    ret = os_aio_linux_handle(segment, &fil_node, &message, &type);
    // Get an I/O event monitored by this segment's thread (global_seg in Figure 2) that has
    // finished executing; the result is stored into fil_node, message, ...
    if (fil_node->space->purpose == FIL_TABLESPACE) {
        // This I/O was for a buf page
        srv_set_io_thread_op_info(segment, "complete io for buf page");
        buf_page_io_complete(static_cast<buf_page_t*>(message));
    } else {
        srv_set_io_thread_op_info(segment, "complete io for log");
        log_io_complete(static_cast<log_group_t*>(message));
    }
}
The logic of os_aio_linux_handle:
os_aio_linux_handle(ulint global_seg, fil_node_t** message1, void** message2, ulint* type)
{
    segment = os_aio_get_array_and_local_segment(&array, global_seg);
    // Get the array and the local segment (thread index) of global_seg within it;
    // the local segment also starts from 0, see Figure 2
    n = array->n_slots / array->n_segments;
    // Number of I/O events one thread can monitor, i.e. io_limit in Figure 2
    /* Loop until we have found a completed request. */
    for (;;) {
        ibool any_reserved = false;
        os_mutex_enter(array->mutex);
        for (i = 0; i < n; ++i) {
            // Traverse all slots managed by this thread
            slot = os_aio_array_get_nth_slot(array, i + segment * n);
            if (!slot->reserved) {
                // The slot is not occupied
                continue;
            } else if (slot->io_already_done) {
                // The slot is done, i.e. its I/O request has completed
                /* Something for us to work on. */
                goto found;
            } else {
                any_reserved = true;
            }
        }
        os_mutex_exit(array->mutex);
        // No completed I/O was found, so go and collect
        os_aio_linux_collect(array, segment, n);
    }
found:
    // A completed I/O was found; return its information
    *message1 = slot->message1;
    *message2 = slot->message2;  // This information is also what the completion handlers take as parameters
    *type = slot->type;
    if (slot->ret == 0 && slot->n_bytes == (long) slot->len) {
        ret = true;
    }
    ...
}
os_aio_linux_collect waits for I/O requests to complete:
os_aio_linux_collect(os_aio_array_t* array, ulint segment, ulint seg_size)
{
    events = &array->aio_events[segment * seg_size];  // Array used to store the completed I/O events
    /* Which io_context we are going to use: the AIO context of this thread */
    io_ctx = array->aio_ctx[segment];
    /* Starting point of the segment we will be working on. */
    start_pos = segment * seg_size;
    /* End point. */
    end_pos = start_pos + seg_size;
retry:
    ret = io_getevents(io_ctx, 1, seg_size, events, &timeout);
    // Block waiting until at least one AIO event monitored by io_ctx completes
    if (ret > 0) {
        for (i = 0; i < ret; i++) {
            /* slot ends up pointing to the os_aio_slot_t object of this AIO request;
            we mainly store the return values of the I/O operation into it */
            os_aio_slot_t* slot;
            struct iocb*   control;
            control = (struct iocb*) events[i].obj;  // The iocb of the completed AIO request
            ut_a(control != NULL);
            slot = (os_aio_slot_t*) control->data;
            // Get the os_aio_slot_t of this AIO request through iocb->data; this was obviously
            // set when the request was submitted, as we will see later
            /* Some sanity checks. */
            ut_a(slot != NULL);
            ut_a(slot->reserved);
            os_mutex_enter(array->mutex);
            slot->n_bytes = events[i].res;   // Save the I/O result into the slot
            slot->ret = events[i].res2;
            slot->io_already_done = TRUE;    // Mark the I/O as done; this flag is what the caller checks
            os_mutex_exit(array->mutex);
        }
        return;
    }
    ...
}
From the above we can see that the I/O handler threads simply update the corresponding slot information after an I/O completes and then run buf_page_io_complete or log_io_complete.
Now let's look at the AIO request submission process. We don't care here who submits the requests (InnoDB has various tasks that do so); InnoDB's I/O requests all eventually go through the fil0fil.cc::fil_io() interface. This function does a lot of parameter checking and looks up the fil_node_t for the space_id of the I/O request (we can regard it as a file descriptor), then invokes the os_aio macro, which (when Performance Schema I/O instrumentation is not used) calls os_aio_func. That function first determines from mode and type which os_aio_array_t the request goes into, then finds a free slot through os_aio_array_reserve_slot, and finally submits the AIO request through os_aio_linux_dispatch (see the sketch below).
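The routing by mode and type can be pictured roughly as follows (a simplified sketch of what os_aio_func does, not the literal code):

    /* Hedged sketch of the array selection and submission in os_aio_func. */
    os_aio_array_t* array;
    if (mode == OS_AIO_IBUF) {
        array = os_aio_ibuf_array;      /* change-buffer I/O: the single ibuf thread */
    } else if (mode == OS_AIO_LOG) {
        array = os_aio_log_array;       /* redo-log I/O: the single log thread */
    } else if (type == OS_FILE_READ) {
        array = os_aio_read_array;      /* ordinary page reads */
    } else {
        array = os_aio_write_array;     /* ordinary page writes */
    }
    slot = os_aio_array_reserve_slot(type, array, message1, message2,
                                     file, name, buf, offset, len);
    os_aio_linux_dispatch(array, slot);

Now let's take a closer look at os_aio_array_reserve_slot: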
os_aio_array_reserve_slot(ulint type, os_aio_array_t* array, fil_node_t* message1, void* message2,
                          os_file_t file, const char* name, void* buf, os_offset_t offset, ulint len)
{
    /* Maximum number of pending I/Os per thread: io_limit in Figure 2 */
    slots_per_seg = array->n_slots / array->n_segments;
    local_seg = (offset >> (UNIV_PAGE_SIZE_SHIFT + 6)) % array->n_segments;
    // Pick the preferred segment (i.e. which thread in the array) to place this I/O request in
loop:
    os_mutex_enter(array->mutex);
    if (array->n_reserved == array->n_slots) {
        // All slots of this os_aio_array_t are occupied
        os_mutex_exit(array->mutex);
        if (!srv_use_native_aio) {
            /* If the handler threads are suspended, wake them
            so that we get more slots */
            os_aio_simulated_wake_handler_threads();
        }
        // No slot is available, so wait on the not_full condition variable
        os_event_wait(array->not_full);
        goto loop;
    }
    // At this point some slot is free; starting from the preferred segment, look for a free slot
    for (i = local_seg * slots_per_seg, counter = 0; counter < array->n_slots; i++, counter++) {
        i %= array->n_slots;
        // If none is found in local_seg, we may wrap around into the other segments
        slot = os_aio_array_get_nth_slot(array, i);
        if (slot->reserved == false) {
            goto found;
        }
    }
    /* We MUST always be able to get hold of a reserved slot. */
    ut_error;
found:
    // A free slot was found
    ut_a(slot->reserved == false);
    array->n_reserved++;
    if (array->n_reserved == 1) {
        // This slot is the first occupied one in array->slots, so reset the is_empty condition variable
        os_event_reset(array->is_empty);
    }
    if (array->n_reserved == array->n_slots) {
        // All slots are now occupied, so reset the not_full condition variable
        os_event_reset(array->not_full);
    }
    slot->reserved = true;              // This slot is now occupied
    slot->reservation_time = ut_time();
    slot->message1 = message1;          // Save the fil_node_t of the request into the slot
    slot->message2 = message2;          // Additional information, also used by the completion handlers
    slot->file = file;                  // Save the rest of the request information into the slot
    slot->name = name;
    slot->len = len;
    slot->type = type;
    slot->buf = static_cast<byte*>(buf);
    slot->offset = offset;
    slot->io_already_done = false;      // The AIO has not completed yet

    aio_offset = (off_t) offset;
    ut_a(sizeof(aio_offset) >= sizeof(offset)
         || ((os_offset_t) aio_offset) == offset);

    iocb = &slot->control;              // Initialize the iocb
    if (type == OS_FILE_READ) {
        io_prep_pread(iocb, file, buf, len, aio_offset);
    } else {
        ut_a(type == OS_FILE_WRITE);
        io_prep_pwrite(iocb, file, buf, len, aio_offset);
    }
    iocb->data = (void*) slot;
    // Save the slot into iocb->data, i.e. into its own slot->control, so it can be extracted when
    // the AIO completes. This is a somewhat circular relationship: the slot owns the iocb, and the
    // iocb stores the slot's address. The reason is that the kernel only knows about the iocb; when
    // an event completes the iocb is saved into event->obj, so the waiter can find the owning slot
    // from event->obj.
    slot->n_bytes = 0;
    slot->ret = 0;
}
After the slot is initialized, the thread can finally deliver the AIO request to the operating system with os_aio_linux_dispatch(array, slot):
os_aio_linux_dispatch(os_aio_array_t* array, os_aio_slot_t* slot)
{
    iocb = &slot->control;  // Get the iocb
    io_ctx_index = (slot->pos * array->n_segments) / array->n_slots;
    // Determine the segment (thread) that this slot belongs to
    ret = io_submit(array->aio_ctx[io_ctx_index], 1, &iocb);
    // Submit the AIO request to the operating system
}
This completes our introduction to InnoDB's use of native AIO and its architecture; InnoDB combines AIO with multiple threads. Another common way to use native AIO is in combination with epoll; see http://www.pagefault.info/?p=76.