Implementation of Native AIO in Linux

Source: Internet
Author: User

Some time ago I ran MySQL on a self-developed iSCSI-based SAN and saw very high CPU iowait. Switching to Native AIO improved this considerably. Here is a brief summary of how Native AIO is implemented. For databases whose biggest bottleneck is IO, Native AIO is close to the best choice; relying on multithreading alone obviously cannot solve disk and network latency problems.

1. API and data structures

Main AIO interfaces:

System call        Description
io_setup()         Initializes an asynchronous context for the current process
io_submit()        Submits one or more asynchronous I/O operations
io_getevents()     Gets the completion status of some outstanding asynchronous I/O operations
io_cancel()        Cancels an outstanding I/O operation
io_destroy()       Removes an asynchronous context for the current process

1.1 AIO Context

The first step in using AIO is to create an AIO context, which tracks the state of the asynchronous IO requests issued by the process. The context handle has type aio_context_t:

    // linux/aio_abi.h
    typedef unsigned long aio_context_t;

    // Create an AIO context
    int io_setup(unsigned nr_events, aio_context_t *ctxp);

io_setup creates an AIO context sized to receive nr_events completion events.
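A minimal sketch of creating and destroying a context from user space. It assumes libaio is not used: glibc provides no wrappers for these system calls, so the code goes through syscall(2) directly. The names sys_io_setup, sys_io_destroy and aio_context_roundtrip are made up for this example.

```c
#include <errno.h>
#include <linux/aio_abi.h>   /* aio_context_t */
#include <sys/syscall.h>     /* SYS_io_setup, SYS_io_destroy */
#include <unistd.h>          /* syscall() */

/* Thin wrappers, since glibc does not export the native AIO calls. */
static int sys_io_setup(unsigned nr_events, aio_context_t *ctxp)
{
    return (int)syscall(SYS_io_setup, nr_events, ctxp);
}

static int sys_io_destroy(aio_context_t ctx)
{
    return (int)syscall(SYS_io_destroy, ctx);
}

/* Hypothetical helper: create a context sized for nr_events requests and
 * tear it down again. Returns 0 on success or a negative errno value. */
static int aio_context_roundtrip(unsigned nr_events)
{
    aio_context_t ctx = 0;   /* must be zero before io_setup, else EINVAL */

    if (sys_io_setup(nr_events, &ctx) < 0)
        return -errno;
    if (sys_io_destroy(ctx) < 0)
        return -errno;
    return 0;
}
```

On a kernel with AIO enabled, aio_context_roundtrip(32) should return 0. The handle has to be zeroed before the call because the kernel rejects an uninitialized value.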

 

kioctx:

In kernel space, the AIO context corresponds to the kioctx data structure, which holds all the state of the context's asynchronous IO:

    // AIO context (kernel side)
    struct kioctx {
        atomic_t             users;
        int                  dead;
        struct mm_struct     *mm;

        /* This needs improving */
        unsigned long        user_id;     // ring_info.mmap_base, start address of the AIO ring
        struct kioctx        *next;       // next AIO context

        wait_queue_head_t    wait;        // queue of waiting processes

        spinlock_t           ctx_lock;

        int                  reqs_active;
        struct list_head     active_reqs; /* used for cancellation */
        struct list_head     run_list;    /* used for kicked reqs: IO requests waiting to be run */

        unsigned             max_reqs;    // maximum number of asynchronous IO operations

        struct aio_ring_info ring_info;   // AIO ring

        struct work_struct   wq;
    };

A process can create multiple AIO contexts; they form a singly linked list anchored in the process's mm_struct:

    struct mm_struct {
        ...
        /* aio bits */
        rwlock_t         ioctx_list_lock;
        struct kioctx    *ioctx_list;     // linked list of the process's AIO contexts

        struct kioctx    default_kioctx;
    };

AIO Ring

The kioctx object contains an important data structure, the AIO ring, which is described by aio_ring_info:

    // aio.h
    // AIO ring
    #define AIO_RING_PAGES 8

    struct aio_ring_info {
        unsigned long    mmap_base;    // user-space start address of the AIO ring
        unsigned long    mmap_size;    // length of the buffer

        struct page      **ring_pages; // array of pointers to the AIO ring's pages
        spinlock_t       ring_lock;
        long             nr_pages;

        unsigned         nr, tail;

        struct page      *internal_pages[AIO_RING_PAGES];
    };

The AIO ring is a memory buffer in the address space of the user-space process, and both the process and the kernel can access it. In the 2.6-era implementation, aio_setup_ring creates the mapping in the user address space with do_mmap and pins its page frames with get_user_pages; a kmalloc-style allocation is only needed for the ring_pages pointer array when the ring spans more than AIO_RING_PAGES pages. For details, see the aio_setup_ring function.

The AIO ring is a ring buffer. The kernel uses it to report completed asynchronous IO, and a user-space process can also check it directly for completions, which avoids the overhead of a system call.

The layout is simple: an aio_ring header followed by an array of io_event entries:

    struct aio_ring {
        unsigned    id;     /* kernel internal index number */
        unsigned    nr;     /* number of io_events */
        unsigned    head;
        unsigned    tail;

        unsigned    magic;
        unsigned    compat_features;
        unsigned    incompat_features;
        unsigned    header_length;  /* size of aio_ring */

        struct io_event io_events[0];
    }; /* 128 bytes + ring size */

The io_setup system call takes two parameters: (1) nr_events sets the maximum number of outstanding asynchronous IO requests and therefore determines the size of the AIO ring, that is, the number of io_event slots; (2) ctxp points to the location where the AIO context handle is stored. The handle value is also the start address of the AIO ring, aio_ring_info.mmap_base. For details, see the aio_setup_ring function.
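Because the handle is the ring's start address, a process could in principle inspect head and tail itself instead of making a system call. The following is a sketch of just the index arithmetic, on a struct that mirrors the aio_ring header shown above. The names aio_ring_hdr, aio_ring_pending and aio_ring_next are invented here; real code would also need to validate magic and use memory barriers, and would depend on a layout that the public headers do not export.

```c
/* User-space mirror of the aio_ring header fields (assumed layout). */
struct aio_ring_hdr {
    unsigned id;
    unsigned nr;      /* number of io_event slots */
    unsigned head;    /* next slot the consumer will read */
    unsigned tail;    /* next slot the kernel will fill */
    unsigned magic;
    unsigned compat_features;
    unsigned incompat_features;
    unsigned header_length;
};

/* Number of completed events waiting between head and tail. */
unsigned aio_ring_pending(const struct aio_ring_hdr *ring)
{
    if (ring->tail >= ring->head)
        return ring->tail - ring->head;
    return ring->nr - ring->head + ring->tail;   /* tail has wrapped */
}

/* Index of the slot that follows idx, wrapping at the end of the ring. */
unsigned aio_ring_next(const struct aio_ring_hdr *ring, unsigned idx)
{
    return (idx + 1) % ring->nr;
}
```

With nr = 8, head = 6 and tail = 1, three events are pending: slots 6, 7 and 0.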

 

1.2 Submitting IO requests

To perform asynchronous IO, call io_submit to submit the requests.

    // Submit asynchronous IO requests (aio.c)
    asmlinkage long sys_io_submit(aio_context_t ctx_id, long nr,
                                  struct iocb __user * __user *iocbpp)

Parameters:

(1) ctx_id: the AIO context handle, which the kernel uses to find the corresponding kioctx object;

(2) iocbpp: an array of pointers to iocb structures, each describing one asynchronous IO request;

(3) nr: the number of entries in the iocbpp array.

iocb

    // User-space asynchronous IO request descriptor (aio_abi.h)
    struct iocb {
        /* these are internal to the kernel/libc. */
        __u64   aio_data;       /* user-defined data returned with the completion event; can hold a pointer, e.g. to a completion callback */
        __u32   PADDED(aio_key, aio_reserved1);
                                /* the kernel sets aio_key to the req # */

        /* common fields */
        __u16   aio_lio_opcode; /* see IOCB_CMD_ above; operation type, e.g. IOCB_CMD_PREAD or IOCB_CMD_PWRITE */
        __s16   aio_reqprio;
        __u32   aio_fildes;     // file descriptor for the IO operation

        __u64   aio_buf;        // IO buffer
        __u64   aio_nbytes;     // number of bytes to transfer
        __s64   aio_offset;     // file offset

        /* extra parameters */
        __u64   aio_reserved2;  /* TODO: use this for a (struct sigevent *) */
        __u64   aio_reserved3;
    }; /* 64 bytes */

The iocb structure describes an asynchronous IO request in user space; its kernel-side counterpart is kiocb.
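As an illustration, filling in an iocb for a read might look like the sketch below. prep_pread is a hypothetical helper modelled loosely on libaio's io_prep_pread; only the struct fields and IOCB_CMD_PREAD come from linux/aio_abi.h.

```c
#include <linux/aio_abi.h>   /* struct iocb, IOCB_CMD_PREAD */
#include <stdint.h>
#include <string.h>

/* Describe a read of len bytes from fd at offset into buf. tag is handed
 * back unchanged in io_event.data when the request completes. */
void prep_pread(struct iocb *cb, int fd, void *buf, size_t len,
                long long offset, void *tag)
{
    memset(cb, 0, sizeof(*cb));          /* reserved fields must be zero */
    cb->aio_data       = (uint64_t)(uintptr_t)tag;
    cb->aio_lio_opcode = IOCB_CMD_PREAD;
    cb->aio_fildes     = (uint32_t)fd;
    cb->aio_buf        = (uint64_t)(uintptr_t)buf;
    cb->aio_nbytes     = len;
    cb->aio_offset     = offset;
}
```

Pointers are stored as 64-bit integers so that the structure has the same layout for 32-bit and 64-bit processes.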

 

io_submit process:

For each iocb, the io_submit_one function allocates a kiocb object, adds it to run_list, the IO request queue of the context's kioctx, and then calls aio_run_iocb to start the IO. aio_run_iocb in turn invokes the kiocb's ki_retry method (aio_pread or aio_pwrite).

If ki_retry returns -EIOCBRETRY, the request has been submitted but has not yet fully completed, and ki_retry will be called again later to continue it. Otherwise, aio_complete is called, which adds an io_event to the AIO ring to signal that the IO has completed.

 

1.3 Collecting completed IO requests

    asmlinkage long sys_io_getevents(aio_context_t ctx_id,
                                     long min_nr,
                                     long nr,
                                     struct io_event __user *events,
                                     struct timespec __user *timeout)

Parameters:

(1) ctx_id: the AIO context handle;

(2) min_nr: return only after at least min_nr completed IO requests have been collected;

(3) nr: collect at most nr completed IO requests;

(4) timeout: the maximum time to wait;

(5) events: allocated by the application. The kernel copies the completed io_event entries into this buffer, so the events array must have room for at least nr entries.

 

io_event:

    // aio_abi.h
    struct io_event {
        __u64   data;   /* the data field from the iocb */
        __u64   obj;    /* what iocb this event came from */
        __s64   res;    /* result code for this event */
        __s64   res2;   /* secondary result */
    };

io_event describes the result of a request:

(1) data corresponds to the iocb's aio_data, so the user-defined pointer is handed back;

(2) obj is the address of the iocb that was submitted for this request;

(3) res and res2 give the completion status of the request.
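One way to use the data field, sketched under the assumption that the application stores the address of its own request record in aio_data at submit time. The struct aio_request and both functions are hypothetical names, not part of any kernel or libaio API.

```c
#include <linux/aio_abi.h>   /* struct io_event */
#include <stdint.h>

/* Hypothetical per-request record; its address travels in iocb.aio_data
 * and comes back in io_event.data. */
struct aio_request {
    void (*done)(struct aio_request *req, long long res);
    long long result;    /* filled in by the callback in this sketch */
    int       failed;
};

/* Example callback: res < 0 is a negated errno, otherwise a byte count. */
void record_result(struct aio_request *req, long long res)
{
    req->failed = res < 0;
    req->result = res;
}

/* Route one completion event back to the request that produced it. */
void dispatch_event(const struct io_event *ev)
{
    struct aio_request *req = (struct aio_request *)(uintptr_t)ev->data;
    req->done(req, (long long)ev->res);
}
```

A loop over the array filled in by io_getevents would call dispatch_event once per entry.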

 

io_getevents process:

The logic is fairly simple: scan the AIO ring of the context's kioctx for completed io_event entries. If at least min_nr events have completed (or the timeout expires), copy them to events and return the number of events, or an error. Otherwise, add the calling process to the kioctx wait queue and put it to sleep.
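Putting the calls together, a single read can be submitted and collected as in the sketch below. It is an illustration under simplifying assumptions: raw syscall(2) instead of libaio, one request per context, and a file opened without O_DIRECT, so the kernel may well finish the read inside io_submit. aio_read_once is a made-up name.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Submit one read and wait for it. Returns the byte count from
 * io_event.res, or a negative errno value on failure. */
long long aio_read_once(const char *path, void *buf, size_t len, long long off)
{
    aio_context_t ctx = 0;
    struct iocb cb, *list[1] = { &cb };
    struct io_event ev;
    long long result;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -errno;
    if (syscall(SYS_io_setup, 1, &ctx) < 0) {
        result = -errno;
        close(fd);
        return result;
    }

    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes     = (uint32_t)fd;
    cb.aio_buf        = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes     = len;
    cb.aio_offset     = off;

    if (syscall(SYS_io_submit, ctx, 1L, list) != 1)
        result = -errno;
    /* min_nr = 1, nr = 1, no timeout: block until the request completes */
    else if (syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) != 1)
        result = -errno;
    else
        result = ev.res;

    syscall(SYS_io_destroy, ctx);
    close(fd);
    return result;
}
```

A real application would keep the context alive and submit many requests per io_submit call; creating a context per read is only done here to keep the example short.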

2. AIO work queue

2.1 Creating the AIO work queue

    // aio.c
    static struct workqueue_struct *aio_wq;   // AIO work queue

    static int __init aio_setup(void)
    {
        ...
        aio_wq = create_workqueue("aio");
        ...
    }

2.2 Creating the work_struct

    static struct kioctx *ioctx_alloc(unsigned nr_events)
    {
        ...
        INIT_WORK(&ctx->wq, aio_kick_handler, ctx);
        ...
    }

The aio_kick_handler function is called when the aio kernel thread processes AIO work:

    static void aio_kick_handler(void *data)
    {
        struct kioctx *ctx = data;
        int requeue;
        ...
        requeue = __aio_run_iocbs(ctx);
        ...
        /*
         * we're in a worker thread already, don't use queue_delayed_work,
         */
        if (requeue)
            queue_work(aio_wq, &ctx->wq);
    }

The logic is simple: call __aio_run_iocbs to continue processing the pending asynchronous IO in the kioctx and, if necessary, put the AIO work back on the AIO work queue so that it is processed again next time.

2.3 Scheduling

After aio_run_iocbs has initiated asynchronous IO requests, if unfinished IO remains on the kioctx's run_list, it calls queue_delayed_work to add the work_struct (kioctx->wq) to the AIO work queue aio_wq, and the aio kernel thread then continues issuing the asynchronous IO.

 

3. AIO and epoll

When using AIO, you must call io_getevents to obtain completed IO events, and the io_getevents system call blocks. There are two ways to deal with this: (1) use multiple threads, with a dedicated thread calling io_getevents (see MySQL 5.5 and later); (2) in a single-threaded program, use AIO through epoll. This requires the eventfd system call, which is only available in kernel 2.6.22 and later.

eventfd is a Linux system call, usable together with native AIO, that creates a file descriptor giving applications an efficient wait/notify event mechanism. It is similar to pipe but better in two ways. First, it uses only one file descriptor (a pipe needs two), which saves kernel resources. Second, its buffer management is much simpler: a pipe needs a variable-length buffer, while an eventfd only needs a fixed 8-byte counter.

 

For the combination of AIO and epoll, see:

Nginx 0.8.x stable version: support for Linux AIO (http://www.pagefault.info/?p=76)
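The pieces of the epoll approach can be sketched as follows. An iocb is tied to an eventfd through the IOCB_FLAG_RESFD flag and the aio_resfd field from linux/aio_abi.h; each completion then adds 1 to the eventfd counter, which makes the descriptor readable for epoll. The function names are invented for this example, and the code assumes the eventfd was already registered with the epoll instance for EPOLLIN.

```c
#include <linux/aio_abi.h>   /* struct iocb, IOCB_FLAG_RESFD */
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Ask the kernel to signal efd when the request described by cb completes. */
void iocb_notify_eventfd(struct iocb *cb, int efd)
{
    cb->aio_flags |= IOCB_FLAG_RESFD;
    cb->aio_resfd  = (uint32_t)efd;
}

/* Wait on epfd until efd becomes readable, then drain its 8-byte counter.
 * Returns the counter value (for AIO: the number of completions signalled
 * since the last read), or 0 on timeout or error. */
uint64_t wait_for_completions(int epfd, int efd, int timeout_ms)
{
    struct epoll_event ev;
    uint64_t count = 0;

    if (epoll_wait(epfd, &ev, 1, timeout_ms) != 1 || ev.data.fd != efd)
        return 0;
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

After wait_for_completions returns a non-zero count, the program would call io_getevents to collect up to that many events; because they have already completed, that call does not need to block.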

 

4. AIO and direct IO

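To be truly asynchronous, AIO must be combined with direct IO (files opened with O_DIRECT); with buffered IO, io_submit may block while the IO is carried out.

For a brief look at how direct IO is implemented, see:

Introduction to the direct I/O mechanism in Linux

http://www.ibm.com/developerworks/cn/linux/l-cn-directio/index.html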
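Direct IO places alignment requirements on requests, which also apply to the aio_buf, aio_nbytes and aio_offset fields of an iocb. The sketch below assumes a 4096-byte alignment; the actual requirement depends on the device and filesystem. DIO_ALIGN, dio_alloc and dio_request_ok are names made up for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment for this sketch; it covers common 512-byte and 4 KiB
 * logical block sizes, but is not guaranteed to suit every device. */
#define DIO_ALIGN 4096u

/* Allocate a buffer whose address is a multiple of DIO_ALIGN.
 * Returns NULL on failure; release with free(). */
void *dio_alloc(size_t len)
{
    void *buf = NULL;

    if (posix_memalign(&buf, DIO_ALIGN, len) != 0)
        return NULL;
    return buf;
}

/* Check that the buffer address, the length and the file offset are all
 * aligned; misaligned direct IO requests typically fail with EINVAL. */
int dio_request_ok(const void *buf, size_t len, long long offset)
{
    return (uintptr_t)buf % DIO_ALIGN == 0
        && len % DIO_ALIGN == 0
        && offset >= 0 && offset % DIO_ALIGN == 0;
}
```

The file itself would be opened with open(path, O_RDONLY | O_DIRECT), which on glibc requires _GNU_SOURCE to be defined before including fcntl.h.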

 

5. Cases

(1) Synchronous IO

 

(2) Native AIO

 
