Linux asynchronous IO analysis


I have known about asynchronous IO for a long time, but only recently did I actually use it to solve a real problem (in a CPU-intensive application, some of the data to be processed may be stored on disk; since the location of the data is known in advance, an asynchronous IO read request can be issued ahead of time, and only when the data is actually needed do we wait for the asynchronous IO to complete. With asynchronous IO, the program can do other things during the window between issuing the IO request and actually using the data).

Taking this opportunity, I also studied the implementation of asynchronous IO in linux.

 

In linux there are mainly two sets of asynchronous IO: one implemented by glibc (hereinafter the glibc version), and one implemented by the linux kernel, with libaio wrapping the system call interface (hereinafter the linux version).

 

 

Glibc version

 

Interface

The glibc version mainly includes the following interfaces:

int aio_read(struct aiocb *aiocbp);   /* submit an asynchronous read */
int aio_write(struct aiocb *aiocbp);   /* submit an asynchronous write */
int aio_cancel(int fildes, struct aiocb *aiocbp);   /* cancel an asynchronous request (or all asynchronous requests on one fd, if aiocbp == NULL) */
int aio_error(const struct aiocb *aiocbp);   /* check the status of an asynchronous request (still in progress, EINPROGRESS? or failed with some error?) */
ssize_t aio_return(struct aiocb *aiocbp);   /* get the return value of an asynchronous request (same meaning as for the synchronous read/write) */
int aio_suspend(const struct aiocb *const list[], int nent, const struct timespec *timeout);   /* block waiting for requests to complete */

 

struct aiocb mainly contains the following fields:

int aio_fildes;                /* fd to read/write */
void *aio_buf;                 /* memory buffer for the read/write */
__off64_t aio_offset;          /* file offset for the read/write */
size_t aio_nbytes;             /* number of bytes to read/write */
int aio_reqprio;               /* request priority */
struct sigevent aio_sigevent;  /* asynchronous event: defines the signal or callback used to notify completion */
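
To see how these interfaces fit together, here is a minimal usage sketch (not from the original article; the file name and buffer size are made up, and the program must be linked with -lrt). It submits one asynchronous read, does other work, then waits and collects the result:

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/testfile", O_RDONLY);      /* hypothetical input file */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }   /* submit; returns immediately */

    /* ... the program can do CPU-intensive work here ... */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);                    /* block until the request completes */

    if (aio_error(&cb) == 0) {                     /* 0 = done, EINPROGRESS = still running */
        ssize_t n = aio_return(&cb);               /* same meaning as read()'s return value */
        printf("read %zd bytes\n", n);
    }
    close(fd);
    return 0;
}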

 

 

 

Implementation

Glibc's aio implementation is easy to understand:

1. Asynchronous requests are submitted to a request_queue;

2. request_queue is effectively a table: the "rows" are fds and the "columns" are individual requests, i.e. requests on the same fd are grouped together;

3. Requests have priorities; requests on the same fd are sorted by priority and ultimately processed in priority order;

4. As asynchronous requests are submitted, asynchronous processing threads are created dynamically. All these threads do is take requests off request_queue and process them;

5. To avoid contention between the processing threads, the requests of one fd are handled by only one thread at a time;

6. A processing thread handles each request synchronously; when a request finishes, it fills the result into the corresponding aiocb and then triggers the signal notification or callback, if any (the callback has to be invoked on a newly created thread);

7. After finishing all requests of an fd, the processing thread becomes idle;

8. When a processing thread is idle and a new fd is added to request_queue, it picks that fd up and processes its requests (the new fd may not be the one it handled last time);

9. After a processing thread has been idle for some time (no new requests), it exits automatically; when new requests arrive again, threads are created dynamically again.

 

If we were to implement asynchronous IO ourselves in user space, the design would probably end up looking quite similar...
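
As a toy illustration of that thought, here is a heavily simplified, hypothetical sketch (this is not glibc's actual code): submission hands the request to a worker thread that performs a synchronous pread(), and completion is signalled through a condition variable. The real glibc implementation additionally keeps a per-fd request_queue, sorts by priority and grows/shrinks its thread pool dynamically.

#include <pthread.h>
#include <unistd.h>

/* hypothetical request structure, loosely mirroring struct aiocb */
struct my_aio_req {
    int fd;
    void *buf;
    size_t nbytes;
    off_t offset;
    ssize_t result;
    int done;
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

static void *worker(void *arg)            /* plays the role of the asynchronous processing thread */
{
    struct my_aio_req *req = arg;
    ssize_t n = pread(req->fd, req->buf, req->nbytes, req->offset);  /* synchronous IO */
    pthread_mutex_lock(&req->lock);
    req->result = n;
    req->done = 1;
    pthread_cond_signal(&req->cond);      /* completion "notification" */
    pthread_mutex_unlock(&req->lock);
    return NULL;
}

static int my_aio_read(struct my_aio_req *req)      /* like aio_read: returns immediately */
{
    pthread_t tid;
    pthread_mutex_init(&req->lock, NULL);
    pthread_cond_init(&req->cond, NULL);
    req->done = 0;
    int rc = pthread_create(&tid, NULL, worker, req);
    if (rc == 0)
        pthread_detach(tid);              /* no join; completion is signalled via the cond var */
    return rc;
}

static ssize_t my_aio_wait(struct my_aio_req *req)  /* like aio_suspend + aio_return */
{
    pthread_mutex_lock(&req->lock);
    while (!req->done)
        pthread_cond_wait(&req->cond, &req->lock);
    pthread_mutex_unlock(&req->lock);
    return req->result;
}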

 

 

Linux version

 

Interface

Next let's take a look at the asynchronous IO in linux. It mainly includes the following system call interfaces:

int io_setup(int maxevents, io_context_t *ctxp);   /* create an asynchronous IO context (io_context_t is a handle) */
int io_destroy(io_context_t ctx);                  /* destroy an asynchronous IO context (in-flight asynchronous IO is cancelled and waited for) */
long io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);   /* submit asynchronous IO requests */
long io_cancel(aio_context_t ctx_id, struct iocb *iocb, struct io_event *result);   /* cancel an asynchronous IO request */
long io_getevents(aio_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout);   /* wait for and fetch the completion events of asynchronous IO requests (i.e. the results) */
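
glibc does not provide wrappers for these system calls; libaio supplies thin wrappers, or you can invoke them through syscall(2) yourself. Here is a minimal, hedged sketch of such wrappers (the my_* names are invented), using the kernel's own definitions from <linux/aio_abi.h>:

#define _GNU_SOURCE
#include <linux/aio_abi.h>   /* aio_context_t, struct iocb, struct io_event, IOCB_CMD_* */
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static long my_io_setup(unsigned nr_events, aio_context_t *ctx)
{
    return syscall(SYS_io_setup, nr_events, ctx);
}

static long my_io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
    return syscall(SYS_io_submit, ctx, nr, iocbpp);
}

static long my_io_getevents(aio_context_t ctx, long min_nr, long nr,
                            struct io_event *events, struct timespec *timeout)
{
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
}

static long my_io_destroy(aio_context_t ctx)
{
    return syscall(SYS_io_destroy, ctx);
}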

 

struct iocb mainly includes the following fields:

__u16 aio_lio_opcode;   /* request type (e.g. IOCB_CMD_PREAD = read, IOCB_CMD_PWRITE = write, etc.) */
__u32 aio_fildes;       /* fd to operate on */
__u64 aio_buf;          /* memory buffer for the read/write */
__u64 aio_nbytes;       /* number of bytes to read/write */
__s64 aio_offset;       /* file offset for the read/write */
__u64 aio_data;         /* private data carried by the request (returned in the io_event obtained by io_getevents) */
__u32 aio_flags;        /* optional IOCB_FLAG_RESFD flag, meaning an eventfd is used to signal completion of the request */
__u32 aio_resfd;        /* the eventfd that receives the notification when IOCB_FLAG_RESFD is set */

 

struct io_event mainly contains the following fields:

__u64 data;   /* the aio_data value of the corresponding iocb */
__u64 obj;    /* pointer to the corresponding iocb */
__s64 res;    /* result of the IO request (>= 0: same as the return value of the equivalent synchronous call; < 0: -errno) */
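
A minimal usage sketch with the libaio wrappers (link with -laio) may help tie iocb and io_event together. The file name and sizes are hypothetical, and, as discussed further below, the read is only truly asynchronous when the file is opened with O_DIRECT:

#define _GNU_SOURCE          /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    posix_memalign(&buf, 512, 4096);      /* O_DIRECT requires aligned buffers */

    io_context_t ctx;
    memset(&ctx, 0, sizeof(ctx));         /* must be zeroed before io_setup */
    if (io_setup(128, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0); /* fills opcode, fd, buf, nbytes, offset */
    cb.data = (void *)0x1234;             /* private data; comes back in io_event.data */

    if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* ... do other work while the disk processes the request ... */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);   /* block until at least one result is available */
    printf("data=%p res=%ld\n", ev.data, (long)ev.res);

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}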

 

 

 

Implementation

The io_context_t handle corresponds to a struct kioctx structure in the kernel, which is used to provide a context for a group of asynchronous IO requests. It mainly includes the following fields:

struct mm_struct *mm;            /* memory management structure of the calling process (represents the caller's virtual address space) */
unsigned long user_id;           /* context ID, i.e. the value of the io_context_t handle (equal to ring_info.mmap_base) */
struct hlist_node list;          /* all kioctx structures belonging to the same address space are linked through this node; the list head is mm->ioctx_list */
wait_queue_head_t wait;          /* wait queue (the io_getevents system call may need to wait; the caller sleeps on this queue) */
int reqs_active;                 /* number of requests in flight */
struct list_head active_reqs;    /* queue of in-flight requests */
unsigned max_reqs;               /* maximum number of requests (corresponds to the int maxevents parameter of io_setup) */
struct list_head run_list;       /* list of requests to be handled by the aio threads (in some cases IO requests are submitted by an aio thread) */
struct delayed_work wq;          /* delayed work item (when an aio thread needs to process requests, wq is queued onto the aio thread's work queue) */
struct aio_ring_info ring_info;  /* ring buffer holding the io_event structures with the request results */

 

The aio_ring_info structure is worth mentioning: it describes the ring buffer that holds the io_event structures with the request results. It mainly contains the following fields:

unsigned long mmap_base;    /* start address of the ring buffer */
unsigned long mmap_size;    /* size of the space allocated for the ring buffer */
struct page **ring_pages;   /* pages backing the ring buffer */
long nr_pages;              /* number of pages in the allocated space (nr_pages * PAGE_SIZE = mmap_size) */
unsigned nr, tail;          /* number of io_events it holds, and the access cursor */

 

This data structure looks a little strange: wouldn't a plain io_event array be enough? Why maintain a whole set of information such as mmap_base, mmap_size, ring_pages and nr_pages, and hide the io_event structures behind it?

The trick here is that the buffer holding the io_event structures is allocated in user-space address space. Note that most data structures we encounter in the kernel are allocated in kernel address space, because they are for the kernel's own use: the user program does not need to see them and must not be able to modify them. Here, however, the io_events are meant to be seen by the user program, and even if the user modifies them the kernel's correctness is not affected, so this approach is taken: the kernel allocates the buffer in the user-space address space. (A more conservative alternative would be for the kernel to keep the io_event buffer in kernel space and copy the relevant io_events out to user space during io_getevents.)

In io_setup, the kernel maps a piece of memory into the caller's user space via mmap; mmap_base and mmap_size are the location and size of that mapping. The mapping alone is not enough, though: physical memory must be allocated immediately, and ring_pages/nr_pages record the physical pages that were allocated. (There are two reasons. First, this memory will be written directly by the kernel, which stores the asynchronous IO results into it. If physical pages were allocated lazily, a page fault would occur when the kernel touches the memory, and handling page faults in kernel mode is troublesome, so it is simpler to allocate the physical memory up front. Second, the kernel does not access this buffer through the virtual address mmap_base. Since the IO is asynchronous, the result may be written back in a different context, with a different virtual address space; to avoid switching address spaces, the kernel simply kmaps ring_pages into high memory and accesses them that way.)

 

Then, a struct aio_ring structure is stored in the address of the user space pointed to by mmap_base to manage the ring buffer. It mainly includes the following fields:

unsigned id;              /* equal to user_id in aio_ring_info */
unsigned nr;              /* equal to nr in aio_ring_info */
unsigned head, tail;      /* cursors into the io_events array */
unsigned magic, compat_features, incompat_features;
unsigned header_length;   /* size of the aio_ring structure itself */
struct io_event io_events[0];   /* the io_event buffer */

Here, at last, is the io_event array we were expecting.

 

If you have followed the above, you may have a question: since the whole aio_ring structure and its io_event buffer live in user space, why does the kernel still provide the io_getevents system call? Couldn't the user program access the io_events and move the cursors directly (the kernel, as producer, advances aio_ring->tail; the user, as consumer, advances aio_ring->head)? I believe that is exactly why aio_ring was placed in user space in the first place.

But how does user space learn the address of the aio_ring structure (aio_ring_info->mmap_base)? In fact, the user_id in the kioctx structure, i.e. the io_context_t that io_setup returns to the user, is equal to aio_ring_info->mmap_base.

The aio_ring structure also carries fields such as magic, compat_features and incompat_features, which user space can check to confirm that the data structure has not been tampered with unexpectedly. If everything checks out, it can handle things itself; otherwise it falls back to the io_getevents system call. io_getevents uses aio_ring_info->ring_pages to reach the aio_ring structure and then copies the relevant io_events to user space.

Here is the io_getevents code from libaio (as mentioned earlier, the linux-version asynchronous IO is wrapped by the user-space libaio library):

int io_getevents_0_4(io_context_t ctx, long min_nr, long nr, struct io_event *events, struct timespec *timeout)
{
    struct aio_ring *ring;
    ring = (struct aio_ring *)ctx;
    if (ring == NULL || ring->magic != AIO_RING_MAGIC)
        goto do_syscall;
    if (timeout != NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
        if (ring->head == ring->tail)
            return 0;
    }
do_syscall:
    return __io_getevents_0_4(ctx, min_nr, nr, events, timeout);
}

So the aio_ring information in user space is indeed used, just not as fully as it could be: in this code it only serves as a fast path for the zero-timeout case.

 

The above is the context structure of asynchronous IO. So why does asynchronous IO in linux require the concept of "context", while glibc does not?

In the glibc version, the asynchronous processing threads are created dynamically by glibc inside the caller's own process, so they are necessarily in the same virtual address space as the caller; the "same context" is implicit.

The kernel, by contrast, may be dealing with any process and any virtual address space. When it handles an asynchronous request it needs to access data in the caller's address space, so it must know which address space that is. That said, the concept of a "context" could certainly have been hidden in the design as well (for example by implicitly attaching one asynchronous IO context to each mm); making it explicit is simply a design choice.

 

Struct iocb corresponds to the struct kiocb structure in the kernel, which mainly includes the following fields:

struct kioctx *ki_ctx;           /* the kioctx (context structure) this request belongs to */
struct list_head ki_run_list;    /* requests that need handling by an aio thread are linked onto ki_ctx->run_list through this node */
struct list_head ki_list;        /* linked onto ki_ctx->active_reqs */
struct file *ki_filp;            /* the corresponding file pointer */
void __user *ki_obj.user;        /* points to the user-space iocb structure (ki_obj is a union) */
__u64 ki_user_data;              /* equal to iocb->aio_data */
loff_t ki_pos;                   /* equal to iocb->aio_offset */
unsigned short ki_opcode;        /* equal to iocb->aio_lio_opcode */
size_t ki_nbytes;                /* equal to iocb->aio_nbytes */
char __user *ki_buf;             /* equal to iocb->aio_buf */
size_t ki_left;                  /* bytes remaining in the request (initially iocb->aio_nbytes) */
struct eventfd_ctx *ki_eventfd;  /* eventfd object corresponding to iocb->aio_resfd */
ssize_t (*ki_retry)(struct kiocb *);   /* submission function selected according to ki_opcode */

 

After io_submit is called, each iocb structure passed in by the user gets a corresponding kiocb structure in the kernel, and an io_event slot is reserved for it in the ring_info of the corresponding kioctx; the request's processing result will later be written into that io_event.

Then the corresponding asynchronous read/write (or other) request is submitted to the virtual file system, which in practice means calling file->f_op->aio_read or file->f_op->aio_write (or another handler). From there the request passes through the page cache layer and the generic block layer of the disk and is submitted to the IO scheduling layer, much like an ordinary file read/write request.

As seen in "linux file read/write analysis", for a non-direct-io read request, if the page cache misses, an IO request is submitted to the lower layers, and then do_generic_file_read performs lock_page and waits until the data has finally been read in. That contradicts asynchronous IO, where the request must return immediately after submission and cannot wait. For a non-direct-io write request, the write usually only updates the page cache and does not need to actually write to disk; writing the page cache back to disk is itself an asynchronous process. So for non-direct-io file reads and writes, using the linux-version asynchronous IO interface gains nothing over the synchronous interface.

Why such a design? Because non-direct-io reads and writes only deal with the page cache, and the page cache is memory; accessing memory does not block, so there is no notion of "asynchronous" there. Any blocking that occurs when the disk is read or written happens when the page cache interacts with the disk, and that has no direct relationship with the application.

For direct-io, however, asynchrony is meaningful, because direct-io means the application's buffer interacts with the disk directly, without going through the page cache.

 

So, with direct-io, file->f_op->aio_{read,write} returns as soon as the IO request has been submitted, and the io_submit system call then returns to the caller. (See the execution flow below.)

Later, IO scheduling is triggered asynchronously inside the kernel (for example by a timer interrupt or by the submission of other IO requests); the submitted IO requests are scheduled and handed to the corresponding device driver, which submits them to the actual device. For a disk, the driver typically initiates a DMA transfer. Some time later the read/write request has been handled by the disk and the CPU receives the interrupt signalling DMA completion; the completion handler registered by the device driver is then invoked in interrupt context, and it calls end_request to finish the request. This part of the process is the same as for the non-direct-io read described in "linux file read/write analysis".

The difference is this: for synchronous non-direct-io, end_request wakes up the blocked reading process by clearing the PG_locked flag of the page structure, and asynchronous IO behaves the same as synchronous IO there. For direct-io, besides waking up the blocked process (the reading process for synchronous IO, or the io_getevents caller for asynchronous IO), the result of the IO request must also be filled into the corresponding io_event.

Finally, when the caller invokes io_getevents, it can fetch the result (io_event) of the request. If the result is not ready yet when io_getevents is called, the process blocks and is woken up during direct-io's end_request processing.

 

Asynchronous IO in linux also has aio threads (one per CPU), but unlike glibc's asynchronous processing threads, these aio threads handle request retries. In some cases file->f_op->aio_{read,write} may return -EIOCBRETRY, meaning the request needs to be retried (only certain special IO devices do this). Since the caller is using an asynchronous interface, it certainly does not want to sit in a wait/retry loop, so when -EIOCBRETRY is encountered the kernel queues a task for the aio thread of the current CPU, which will resubmit the request; the calling process can return immediately without blocking.

 

The biggest difference between a request submitted by an aio thread and one submitted by the caller process is that the aio thread may be running in a different address space from the caller. It must switch to the correct address space, using kioctx->mm, before submitting the request. (See the discussion in "async IO".)

 

Kernel processing flow

Finally, let's walk through the flow of a direct-io asynchronous read:

io_submit. For each iocb (asynchronous request) in the submitted iocbpp array, call io_submit_one to submit it;

io_submit_one. Allocate a kiocb structure for the request and reserve a corresponding io_event for it in the ring_info of the kioctx, then call aio_rw_vect_retry to submit the read request;

aio_rw_vect_retry. Call file->f_op->aio_read; this is usually implemented by generic_file_aio_read or a wrapper around it;

generic_file_aio_read. For non-direct-io, call do_generic_file_read to handle the request (see "linux file read/write analysis"); for direct-io, call mapping->a_ops->direct_IO, which is generally blkdev_direct_IO;

blkdev_direct_IO. Call filemap_write_and_wait_range to flush back or discard any page cache that may exist for the corresponding range (to avoid inconsistency), then call direct_io_worker to handle the request;

direct_io_worker. One read may consist of several segments (as with readv-like system calls); for each of them, call do_direct_IO;

do_direct_IO. Call submit_page_section;

submit_page_section. Call dio_new_bio to allocate the corresponding bio structure, then call dio_bio_submit to submit the bio;

dio_bio_submit. Call submit_bio to submit the request. The rest of the process is the same as for non-direct-io: once the request completes, the driver calls bio->bi_end_io to finish it. For asynchronous IO under direct-io, bio->bi_end_io is dio_bio_end_aio;

dio_bio_end_aio. Call wake_up_process to wake any blocked process (for asynchronous IO, the caller of io_getevents), then call aio_complete;

aio_complete. Write the processing result into the corresponding io_event.

 

 

Comparison

 

From the flow above we can see that the linux-version asynchronous IO only exploits the fact that the CPU and the IO device can work asynchronously: the submission of the IO request is done synchronously, mostly in the caller's thread, and once the request has been submitted the CPU and the IO device can work in parallel, so the call returns and the caller can go on doing other things. Compared with synchronous IO, it does not consume extra CPU resources.

The glibc version, on the other hand, exploits asynchrony between threads and uses extra threads to complete the IO requests. That costs additional CPU (creating, destroying and scheduling threads has CPU overhead, and so does the communication between the caller thread and the processing threads). However, the submission of the IO request is also done by the processing thread (whereas in the linux version the caller submits it), so the caller thread can get back to other work sooner. If CPU resources are plentiful, this implementation is not bad either.

 

 

Another point concerns the case where the caller keeps invoking the asynchronous IO interface and submits many asynchronous requests. In glibc's asynchronous IO, the read/write requests on one fd are all completed by the same processing thread, which handles them synchronously, one by one. The underlying IO scheduler therefore only ever sees one of these requests at a time; only after it completes does the processing thread submit the next one. The kernel-implemented asynchronous IO submits all the requests to the IO scheduler directly, so the scheduler can see them all. With many requests outstanding, the elevator-style algorithm used by the IO scheduler can do its job; with few requests, in the extreme case (for example when all IO requests in the system target the same fd and there is no readahead), the IO scheduler always sees only a single request, the elevator algorithm degenerates into first-come-first-served, and the overhead of disk head movement may increase substantially.
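
To make that difference concrete, here is a hedged sketch (libaio; the fd, buffers and offsets are supplied by the caller, and the sizes are invented) that hands a whole batch of scattered reads to the kernel in a single io_submit, so the IO scheduler sees all of them at once rather than one at a time:

#include <libaio.h>

#define NREQ 8

/* submit NREQ 4KB reads at arbitrary offsets in one go; the caller later
 * collects the results with io_getevents */
int submit_batch(io_context_t ctx, int fd, void *bufs[NREQ], long long offsets[NREQ])
{
    struct iocb cbs[NREQ];
    struct iocb *ptrs[NREQ];
    for (int i = 0; i < NREQ; i++) {
        io_prep_pread(&cbs[i], fd, bufs[i], 4096, offsets[i]);
        ptrs[i] = &cbs[i];
    }
    /* all NREQ requests reach the IO scheduling layer before we wait for any of them,
     * so the elevator algorithm can reorder them to reduce head movement */
    return io_submit(ctx, NREQ, ptrs);
}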

 

 

Finally, glibc's asynchronous IO supports non-direct-io and can benefit from the page cache maintained by the kernel, while the linux version in effect only works with direct-io, so any caching has to be implemented by the user program itself.
