Some time ago I ran MySQL on a self-developed iSCSI-based SAN, and CPU iowait was very high. After switching to native AIO the situation improved greatly. Here is a brief summary of how native AIO is implemented. For a database whose biggest bottleneck is IO, native AIO is almost the best choice; relying on multithreading alone obviously cannot solve the disk IO problem.
1. API and data structures
Main AIO interfaces:
System call      Description
io_setup()       Initializes an asynchronous IO context for the current process
io_submit()      Submits one or more asynchronous IO operations
io_getevents()   Collects completion events of outstanding asynchronous IO operations
io_cancel()      Cancels an outstanding asynchronous IO operation
io_destroy()     Destroys an asynchronous IO context of the current process
1.1 AIO Context
The first step in using AIO is to create an AIO context, which the kernel uses to track the state of the asynchronous IO requests issued by the process. The user-space handle type is aio_context_t:
```c
// linux/aio_abi.h
typedef unsigned long aio_context_t;

// create an AIO context
int io_setup(unsigned nr_events, aio_context_t *ctxp);
```
io_setup creates an AIO context capable of handling up to nr_events in-flight events.
kioctx:

In kernel space, the AIO context corresponds to the data structure kioctx, which holds all the state of the process's asynchronous IO:
```c
// AIO context (fs/aio.c)
struct kioctx {
	atomic_t		users;
	int			dead;
	struct mm_struct	*mm;

	/* This needs improving */
	unsigned long		user_id;	// ring_info.mmap_base, start address of the AIO Ring
	struct kioctx		*next;		// next AIO context

	wait_queue_head_t	wait;		// queue of waiting processes

	spinlock_t		ctx_lock;

	int			reqs_active;
	struct list_head	active_reqs;	/* used for cancellation */
	struct list_head	run_list;	/* used for kicked reqs: the list of running IO requests */

	unsigned		max_reqs;	// maximum number of asynchronous IO operations

	struct aio_ring_info	ring_info;	// AIO Ring

	struct work_struct	wq;
};
```
A process can create multiple AIO contexts, which form a singly linked list hanging off its mm_struct:
```c
struct mm_struct {
	...
	/* aio bits */
	rwlock_t	ioctx_list_lock;
	struct kioctx	*ioctx_list;	// the process's list of AIO contexts
	struct kioctx	default_kioctx;
};
```
AIO Ring
The kioctx object contains an important data structure, the AIO Ring:
```c
// fs/aio.h
// AIO Ring
#define AIO_RING_PAGES	8
struct aio_ring_info {
	unsigned long	mmap_base;	// start address of the AIO Ring in user space
	unsigned long	mmap_size;	// buffer length
	struct page	**ring_pages;	// array of pointers to the AIO Ring pages
	spinlock_t	ring_lock;
	long		nr_pages;
	unsigned	nr, tail;
	struct page	*internal_pages[AIO_RING_PAGES];
};
```
The AIO Ring is a memory buffer in the user-space address range of the process that is accessible both to the user-space process and to the kernel. The kernel first allocates page frames with kmalloc, then maps them into the user address space with do_mmap; see the aio_setup_ring function for details.
The AIO Ring is a ring buffer. The kernel uses it to report the completion of asynchronous IO, and user-space processes can also scan it directly for completed IO, avoiding the overhead of a system call.
The AIO Ring's structure is simple: an aio_ring header followed by an io_event array:
```c
struct aio_ring {
	unsigned	id;	/* kernel internal index number */
	unsigned	nr;	/* number of io_events */
	unsigned	head;
	unsigned	tail;

	unsigned	magic;
	unsigned	compat_features;
	unsigned	incompat_features;
	unsigned	header_length;	/* size of aio_ring */

	struct io_event	io_events[0];
}; /* 128 bytes + ring size */
```
The io_setup system call takes two parameters: (1) nr_events, the maximum number of outstanding asynchronous IO requests, which determines the size of the AIO Ring, i.e. the number of io_events; (2) ctxp, a pointer to the AIO context handle, whose value is also the start address of the AIO Ring (aio_ring_info.mmap_base). See the aio_setup_ring function for details.
1.2 Submitting IO requests
To perform asynchronous IO, you must call io_submit to submit asynchronous IO requests.
```c
// submit asynchronous IO requests (fs/aio.c)
asmlinkage long sys_io_submit(aio_context_t ctx_id, long nr,
                              struct iocb __user * __user *iocbpp);
```
Parameters:
(1) ctx_id: AIO context handle. The kernel uses it to find the corresponding kioctx object;
(2) iocbpp: an array of iocb pointers; each iocb describes one asynchronous IO request;
(3) nr: size of the iocb array.
iocb:
```c
// user-space asynchronous IO request descriptor (linux/aio_abi.h)
struct iocb {
	/* these are internal to the kernel/libc. */
	__u64	aio_data;	/* opaque user data, e.g. a pointer to a completion callback */
	__u32	PADDED(aio_key, aio_reserved1);
				/* the kernel sets aio_key to the req # */

	/* common fields */
	__u16	aio_lio_opcode;	/* see IOCB_CMD_ above; operation type, e.g. IOCB_CMD_PREAD / IOCB_CMD_PWRITE */
	__s16	aio_reqprio;
	__u32	aio_fildes;	// file descriptor for the IO operation
	__u64	aio_buf;	// IO buffer
	__u64	aio_nbytes;	// number of bytes requested
	__s64	aio_offset;	// file offset

	/* extra parameters */
	__u64	aio_reserved2;	/* TODO: use this for a (struct sigevent *) */
	__u64	aio_reserved3;
}; /* 64 bytes */
```
The data structure iocb is used to describe user space asynchronous IO requests. The corresponding kernel data structure is kiocb.
io_submit flow:

The io_submit_one function allocates a kiocb object for each iocb and adds it to the run_list of the kioctx, then calls aio_run_iocb to initiate the IO; this ends up invoking the kiocb's ki_retry method (aio_pread/aio_pwrite).

If ki_retry returns -EIOCBRETRY, the asynchronous IO request has been submitted but is not yet complete; the kiocb's ki_retry method will be called again later to drive the request to completion. Otherwise, aio_complete is called to add an io_event to the AIO Ring, marking the IO as done.
1.3 Collecting completed IO requests
```c
// collect completed IO requests (fs/aio.c)
asmlinkage long sys_io_getevents(aio_context_t ctx_id,
                                 long min_nr, long nr,
                                 struct io_event __user *events,
                                 struct timespec __user *timeout);
```
Parameters:
(1) ctx_id: AIO context handle;
(2) min_nr: return only after at least min_nr completed IO requests have been collected;
(3) nr: collect at most nr completed IO requests;
(4) timeout: how long to wait;
(5) events: a buffer allocated by the caller; the kernel copies completed io_events into it, so it must have room for at least nr io_events.
io_event:
```c
// linux/aio_abi.h
struct io_event {
	__u64	data;	/* the data field from the iocb */
	__u64	obj;	/* what iocb this event came from */
	__s64	res;	/* result code for this event */
	__s64	res2;	/* secondary result */
};
```
io_event describes the result of one request:
(1) data is the aio_data value from the iocb, returning the user-defined cookie;
(2) obj points to the iocb that was submitted;
(3) res and res2 report the completion status of the IO.
io_getevents flow:

The flow is relatively simple: scan the AIO Ring of the kioctx and check for completed io_events. If there are at least min_nr completed events (or the timeout expires), copy the completed io_events to events and return their count (or an error); otherwise, add the current process to the wait queue of the kioctx and put it to sleep.
2. AIO work queue

2.1 Creating the AIO work queue
```c
// fs/aio.c
static struct workqueue_struct *aio_wq;	// the AIO work queue

static int __init aio_setup(void)
{
	...
	aio_wq = create_workqueue("aio");
	...
}
```
2.2 Creating the work_struct
```c
static struct kioctx *ioctx_alloc(unsigned nr_events)
{
	...
	INIT_WORK(&ctx->wq, aio_kick_handler, ctx);
```
The aio_kick_handler function is invoked when the AIO kernel thread processes a piece of AIO work:
```c
static void aio_kick_handler(void *data)
{
	struct kioctx *ctx = data;
	...
	requeue = __aio_run_iocbs(ctx);
	...
	/*
	 * we're in a worker thread already, don't use queue_delayed_work
	 */
	if (requeue)
		queue_work(aio_wq, &ctx->wq);
}
```
The logic is simple: call __aio_run_iocbs to continue processing the pending asynchronous IO in the kioctx, and if there is still work left, put the AIO work back on the AIO work queue to be handled again later.
2.3 Scheduling
After __aio_run_iocbs has initiated the asynchronous IO requests, if the run_list of the kioctx still contains unfinished IO, it calls queue_delayed_work to add the work_struct (kioctx->wq) to the AIO work queue aio_wq, so that the AIO kernel thread will continue issuing the remaining asynchronous IO.
3. AIO and epoll
When using AIO, you must call io_getevents to obtain the completed IO events, and io_getevents blocks, so there are two approaches: (1) use multithreading, with a dedicated thread calling io_getevents; see MySQL 5.5 and later versions; (2) in a single-threaded program, epoll can be combined with AIO, which requires the eventfd system call, available only in kernels 2.6.22 and later.
eventfd is a Linux system call that creates a file descriptor providing an efficient "wait/notify" mechanism for applications. It is similar to a pipe, but better in two ways: it uses only one file descriptor (a pipe needs two), saving kernel resources, and its buffer management is much simpler, since a pipe needs a variable-length buffer while eventfd only needs a fixed 8-byte counter.
For the combination of AIO and epoll, see:
Nginx 0.8.x stable version for linux aio support (http://www.pagefault.info/?p=76)
4. AIO and direct IO
Native AIO must be combined with direct IO (O_DIRECT): with buffered IO, the kernel completes the request synchronously inside io_submit, so the submission is not truly asynchronous.
For a brief introduction to direct IO, refer to:

Introduction to the direct I/O mechanism in Linux
http://www.ibm.com/developerworks/cn/linux/l-cn-directio/index.html
5. Examples
(1) Synchronous IO
(2) Native AIO