Read System Call Analysis (VFS, address_space, and the page cache)


Processing of read system calls in user space

The implementation of the Linux system call interface (SCI) is essentially a process of multiplexing and demultiplexing. The multiplexing point is the entry of the 0x80 interrupt (on the x86 architecture): every system call funnels from user space through int 0x80, carrying its system call number. When the 0x80 interrupt handler runs, it dispatches on that number, calling a different kernel function for each system call. For more background on system calls, see the references.

The read system call is no exception. When it is invoked, the C library function stores the read system call number and arguments, then triggers the 0x80 interrupt. At that point the library function's work, and with it the user-space part of the read system call, is complete.
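As an illustration, here is a minimal sketch of what such a wrapper does on 32-bit x86, assuming GCC inline assembly. The function name my_read is hypothetical; the register convention (call number in eax, arguments in ebx/ecx/edx) and __NR_read = 3 are the real 32-bit x86 Linux ABI:

/* Hypothetical user-space wrapper: what the C library's read() boils
 * down to on 32-bit x86 Linux. */
#include <sys/types.h>

#define MY_NR_READ 3   /* __NR_read on 32-bit x86 */

static ssize_t my_read(int fd, void *buf, size_t count)
{
    ssize_t ret;
    asm volatile("int $0x80"          /* trap into the kernel */
                 : "=a" (ret)         /* return value comes back in eax */
                 : "0" (MY_NR_READ),  /* system call number in eax */
                   "b" (fd),          /* arg 1 in ebx */
                   "c" (buf),         /* arg 2 in ecx */
                   "d" (count)        /* arg 3 in edx */
                 : "memory");
    return ret;
}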


Processing of read system calls in kernel space

After the 0x80 interrupt handler takes over execution, it first reads the system call number, uses it to index the system call table, fetches from that table sys_read, the kernel function that services the read system call, passes the arguments, and runs sys_read. At this point the kernel has actually begun processing the read system call: sys_read is the kernel entry point of read.
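Conceptually the dispatch looks like the C sketch below. The real code is assembly in arch/i386/kernel/entry.S (roughly `call *sys_call_table(,%eax,4)`); the names here are illustrative only:

/* Conceptual sketch of the 0x80 handler's dispatch; not verbatim kernel
 * code. The saved eax holds the system call number. */
typedef long (*syscall_fn_t)(long, long, long, long, long);
extern syscall_fn_t sys_call_table[];   /* one entry per system call */

long dispatch_syscall(long nr, long a1, long a2, long a3)
{
    /* index the table with the number saved by the user-space stub */
    return sys_call_table[nr](a1, a2, a3, 0, 0);
}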

This section first introduces the layered model the kernel uses to process disk requests, and then walks a disk read request through each layer, following the model from top to bottom.

Layered model of read system call processing in kernel space

Figure 1 shows the layered model of read system call processing in kernel space. As the figure shows, a read request to a disk first passes through the virtual file system layer (VFS layer), then the concrete file system layer (such as ext2), followed by the page cache layer, the generic block layer, the I/O scheduler layer, the block device driver layer, and finally the block device layer.

Figure 1 Processing levels of the read system call in kernel space

  • The virtual file system layer shields the differences between underlying file system implementations and provides a unified interface to the layers above. It is this layer that lets devices be abstracted as files, making device operations as simple as file operations.
  • At the concrete file system layer, different file systems (such as ext2 and NTFS) have different operation flows; each file system defines its own set of operations. For more on file systems, see the references.
  • The page cache layer exists to improve Linux disk access performance. It caches some on-disk data in memory; when a request arrives and the data is already in the cache and up to date, the data is handed straight to the user program, avoiding the underlying disk operation entirely.
  • The generic block layer's main job is to receive disk requests from the layers above and ultimately issue I/O requests. It hides the peculiarities of the underlying hardware block devices and presents an abstract, uniform view of block devices.
  • The I/O scheduler layer receives I/O requests from the generic block layer, queues them, and tries to merge adjacent requests (when two requests address adjacent data on disk). Following the configured scheduling algorithm, it calls back the request-handling function supplied by the driver layer to process individual I/O requests.
  • Each driver in the driver layer corresponds to a specific physical block device. It takes I/O requests from the layer above and, following the information in each request, drives the data transfer by sending commands to the device controller of its block device.
  • The device layer is the physical device itself. It defines the specifications by which the device must be operated.

Related kernel data structures:

  • dentry: links a file name to the file's inode.
  • inode: the file's index node, storing information such as the file's identity, permissions, and data location.
  • file: holds information about an open file, including a pointer to the set of functions that operate on it.
  • file_operations: the set of function interfaces for operating on files.
  • address_space: describes the file's page cache structure and related information, and contains a set of function pointers for operating on the page cache.
  • address_space_operations: the set of function interfaces for operating on the page cache.
  • bio: describes a block I/O request.

Relationship between data structures:

Figure 2 illustrates the relationships between the preceding data structures (except bio). From the dentry object the inode object can be found; from the inode object the address_space object can be retrieved; and from the address_space object the address_space_operations object can be found.

The file object can be obtained from information in the current process descriptor, and from it the dentry object, the address_space object, and the file_operations object can all be reached, as the sketch below shows.
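A small sketch of this navigation, using the 2.6.11 field names (file->f_dentry was renamed in later kernels); show_links is a hypothetical helper used only for illustration:

/* Hypothetical helper: walk the object graph of Figure 2 starting from
 * a struct file, using 2.6.11 field names. */
#include <linux/fs.h>

static void show_links(struct file *file)
{
    struct dentry *dentry = file->f_dentry;           /* file -> dentry */
    struct inode *inode = dentry->d_inode;            /* dentry -> inode */
    struct address_space *mapping = inode->i_mapping; /* inode -> address_space */
    struct file_operations *fops = file->f_op;        /* file -> file_operations */
    struct address_space_operations *aops = mapping->a_ops;
                                      /* address_space -> its operations */
    (void)fops;
    (void)aops;
}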

Figure 2 Data structure relationship diagram

Prerequisites:

For any specific read call, the kernel may take one of many processing paths. This article follows one of them, under the following assumptions:

  • The file being read already exists.
  • The file is accessed through the page cache.
  • The file being read is an ordinary (regular) file.
  • The on-disk file system is ext2. For more on ext2, see the references.

Preparation:

Note: all code in the listings comes from the original Linux 2.6.11 kernel source.

Before data can be read, the file must be opened. The kernel function that handles the open system call is sys_open, so let's first look at what it does. Listing 1 shows the sys_open code (some content is omitted; later listings are abridged the same way).

Listing 1 sys_open Function Code

asmlinkage long sys_open(const char __user *filename, int flags, int mode)
{
    ……
    fd = get_unused_fd();
    if (fd >= 0) {
        struct file *f = filp_open(tmp, flags, mode);
        fd_install(fd, f);
    }
    ……
    return fd;
    ……
}

Code explanation:

  • get_unused_fd(): obtains an unused file descriptor (the smallest unused descriptor is chosen each time).
  • filp_open(): calls open_namei() to look up the dentry and inode for the file (the prerequisites state that the file already exists, so the dentry and inode can be found and need not be created), then calls dentry_open() to create a new file object and initialize it with information from the dentry and inode (the file's current read/write position is kept in the file object). Note that dentry_open() contains the statement:

f->f_op = fops_get(inode->i_fop);

This assignment copies the set of file-system-specific file operation function pointers (kept in the inode object) into the f_op member of the file object; the sys_read function discussed next will call the read member of file->f_op.

  • fd_install(): using the file descriptor as an index, associates the file object with the current process descriptor, preparing for the reads and writes that follow.
  • The function returns the file descriptor. (A minimal user-space usage example follows.)
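To connect this back to user space, here is a minimal usage example; the path /etc/hostname is purely illustrative. Its open() receives the descriptor installed by fd_install(), and its read() travels the path analyzed in the rest of this article:

/* Minimal user-space demonstration; error handling abbreviated. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;
    int fd = open("/etc/hostname", O_RDONLY); /* handled by sys_open */
    if (fd < 0)
        return 1;
    n = read(fd, buf, sizeof(buf));           /* enters sys_read */
    if (n > 0)
        fwrite(buf, 1, n, stdout);
    close(fd);
    return 0;
}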

Figure 3 shows the association between the file object and the current process descriptor after sys_open returns, as well as where the file object's set of function pointers comes from (the i_fop member of the inode object).

Figure 3 Relationship between the file object and the current process descriptor

At this point all preparation is complete. The following sections describe how the read system call is processed at each layer shown in Figure 1.

Processing of the virtual file system layer:

The kernel function sys_read () is the entry point of the read system call at this layer. Listing 2 shows the code of this function.

Listing 2 sys_read Function Code

asmlinkage ssize_t sys_read(unsigned int fd, char __user *buf, size_t count)
{
    struct file *file;
    ssize_t ret = -EBADF;
    int fput_needed;

    file = fget_light(fd, &fput_needed);
    if (file) {
        loff_t pos = file_pos_read(file);
        ret = vfs_read(file, buf, count, &pos);
        file_pos_write(file, pos);
        fput_light(file, fput_needed);
    }
    return ret;
}

Code parsing:

  • fget_light(): retrieves the corresponding file object from the current process descriptor, using fd as the index (see Figure 3).
  • If no file object is found, an error is returned.
  • If the file object is found:
  • Call file_pos_read() to retrieve the current read/write position of the file.
  • Call vfs_read() to perform the read; that function ultimately calls the function pointed to by file->f_op->read (a condensed sketch of vfs_read follows this list). The code:

if (file->f_op->read)
    ret = file->f_op->read(file, buf, count, pos);

  • Call file_pos_write() to update the file's current read/write position.
  • Call fput_light() to update the file's reference count.
  • Finally, return the number of bytes read.
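For context, a condensed outline of vfs_read() as it looks in 2.6-era kernels; the access checks are abbreviated and the aio_read fallback path is omitted, so treat this as a sketch rather than the verbatim function:

/* Condensed outline of vfs_read(); checks abbreviated, aio path omitted. */
#include <linux/fs.h>

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
    if (!(file->f_mode & FMODE_READ))      /* was the file opened for reading? */
        return -EBADF;
    if (!file->f_op || !file->f_op->read)  /* does the fs provide a read method? */
        return -EINVAL;
    /* ... access_ok() / rw_verify_area() checks elided ... */

    /* dispatch to the file system; for ext2 this is generic_file_read() */
    return file->f_op->read(file, buf, count, pos);
}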

At this point, the virtual file system layer's processing is complete, and control is handed to the ext2 file system layer.

Before examining the ext2 file system layer's operations, let's look at where the read pointer in the file object comes from.

Source of the read function pointer in the file object:

From the analysis of sys_open we know that file->f_op comes from inode->i_fop. So where does inode->i_fop come from? It is assigned when the inode object is initialized. See Listing 3.

Listing 3 Code of the ext2_read_inode() function

void ext2_read_inode(struct inode *inode)
{
    ……
    if (S_ISREG(inode->i_mode)) {
        inode->i_op = &ext2_file_inode_operations;
        inode->i_fop = &ext2_file_operations;
        if (test_opt(inode->i_sb, NOBH))
            inode->i_mapping->a_ops = &ext2_nobh_aops;
        else
            inode->i_mapping->a_ops = &ext2_aops;
    }
    ……
}

The code shows that if the inode describes a regular file, the address of the variable ext2_file_operations is assigned to the inode object's i_fop member. So inode->i_fop->read points to whatever the read member of ext2_file_operations points to. Next, let's look at the initialization of the ext2_file_operations variable, shown in Listing 4.

Listing 4 initialize ext2_file_operations

struct file_operations ext2_file_operations = {
    .llseek    = generic_file_llseek,
    .read      = generic_file_read,
    .write     = generic_file_write,
    .aio_read  = generic_file_aio_read,
    .aio_write = generic_file_aio_write,
    .ioctl     = ext2_ioctl,
    .mmap      = generic_file_mmap,
    .open      = generic_file_open,
    .release   = ext2_release_file,
    .fsync     = ext2_sync_file,
    .readv     = generic_file_readv,
    .writev    = generic_file_writev,
    .sendfile  = generic_file_sendfile,
};

The read member points to the generic_file_read function. Therefore inode->i_fop->read points to generic_file_read, and in turn file->f_op->read points to generic_file_read. The conclusion: generic_file_read is the real entry point of the ext2 layer.

Ext2 file system layer Processing

Figure 4 Function call chain for read system call processing in the ext2 layer

Figure 4 shows that this layer's entry function, generic_file_read, calls __generic_file_aio_read, which determines the access mode of the read request. If it is direct I/O (filp->f_flags has the O_DIRECT flag set, meaning the page cache is bypassed), generic_file_direct_IO is called; if the read goes through the page cache, do_generic_file_read is called. do_generic_file_read is merely a wrapper that calls do_generic_mapping_read.

Before explaining what the do_generic_mapping_read function does, let's first look at how a file is organized in the in-memory cache.

File page cache structure

Figure 5 shows the page cache structure of a file. The file is divided into page-sized data blocks, and these blocks (pages) are organized into a multi-way tree called a radix tree. All leaf nodes of the tree are page frame structures (struct page), each representing a page used to cache the file. The leftmost page at the leaf level stores the first 4096 bytes of the file (assuming a page size of 4096 bytes), the next page stores the second 4096 bytes, and so on. The intermediate nodes of the tree are organizational nodes indicating on which page the data at a given offset resides. The tree can have from 0 to 6 levels, supporting file sizes from 0 bytes to 16 TB. The root pointer of the tree is stored in the address_space object related to the file, which can in turn be reached from the inode object associated with the file. (For more on the page cache structure, see the references.)
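As a sketch of how a byte offset maps to a cached page, assuming the 2.6-era page cache API (find_get_page() performs a radix_tree_lookup() on the tree rooted in the address_space); lookup_cached_page is a hypothetical helper:

/* Hypothetical helper: map a byte offset in a file to its page cache
 * page, if present. */
#include <linux/pagemap.h>

static struct page *lookup_cached_page(struct address_space *mapping,
                                       loff_t offset)
{
    /* byte offset -> page index (PAGE_CACHE_SHIFT is 12 for 4 KB pages) */
    unsigned long index = offset >> PAGE_CACHE_SHIFT;

    /* searches the radix tree at mapping->page_tree and takes a
     * reference on the page; returns NULL on a cache miss */
    return find_get_page(mapping, index);
}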

Figure 5 Page cache structure of a file

Now let's look at what the do_generic_mapping_read function does. The function is long, so this article only outlines its main flow (a condensed code sketch follows the list):

  • Based on the current file read position, look up the page that caches the requested data in the page cache.
  • If the page is up to date, copy the requested data to user space.
  • Otherwise, lock the page.
  • Call the readpage function to issue a read request to the disk for this page (the underlying layers will unlock the page when the I/O operation completes):

error = mapping->a_ops->readpage(filp, page);

  • Try to lock the page again. When the lock is acquired, the data is guaranteed to be in the page cache, because the page can be unlocked only after the I/O operation completes. The lock thus serves as the synchronization point for getting data from disk into memory.
  • Unlock the page.
  • The data is now in the page cache; copy it to user space (after which the read call can return to user space).
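A condensed sketch of that flow, assuming 2.6-era page cache primitives (find_get_page, lock_page, PageUptodate); read_one_page is a hypothetical name, and the allocation, readahead, and error paths are omitted:

/* Condensed sketch of the do_generic_mapping_read() core flow for one
 * page; cache-miss allocation and error handling omitted. */
#include <linux/pagemap.h>

static int read_one_page(struct file *filp, struct address_space *mapping,
                         unsigned long index)
{
    int error = 0;
    struct page *page = find_get_page(mapping, index);

    /* ... on a cache miss the real code allocates a page and adds it
     * to the cache before reading ... */
    if (!PageUptodate(page)) {
        lock_page(page);
        if (!PageUptodate(page)) {
            /* start disk I/O; I/O completion unlocks the page */
            error = mapping->a_ops->readpage(filp, page);
            lock_page(page);   /* sleeps until the I/O unlocks the page */
        }
        unlock_page(page);
    }
    /* the page is now up to date: copy its contents to user space */
    return error;
}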

So we now know that when the data in the page is not up to date, this function calls the function pointed to by mapping->a_ops->readpage (the mapping variable is the address_space object embedded in the inode object). What is that function?

The origin of the readpage Function

The address_space object is embedded in the inode object, so it is easy to guess that the a_ops member of the address_space object is initialized when the inode object itself is initialized, as shown in the second half of Listing 3:

if (test_opt(inode->i_sb, NOBH))
    inode->i_mapping->a_ops = &ext2_nobh_aops;
else
    inode->i_mapping->a_ops = &ext2_aops;

The a_ops member of the address_space object points to either the variable ext2_aops or the variable ext2_nobh_aops. The initialization of these two variables is shown in Listing 5.

Listing 5 initialization of the variables ext2_aops and the variables ext2_nobh_aops

struct address_space_operations ext2_aops = {
    .readpage      = ext2_readpage,
    .readpages     = ext2_readpages,
    .writepage     = ext2_writepage,
    .sync_page     = block_sync_page,
    .prepare_write = ext2_prepare_write,
    .commit_write  = generic_commit_write,
    .bmap          = ext2_bmap,
    .direct_IO     = ext2_direct_IO,
    .writepages    = ext2_writepages,
};

struct address_space_operations ext2_nobh_aops = {
    .readpage      = ext2_readpage,
    .readpages     = ext2_readpages,
    .writepage     = ext2_writepage,
    .sync_page     = block_sync_page,
    .prepare_write = ext2_nobh_prepare_write,
    .commit_write  = nobh_commit_write,
    .bmap          = ext2_bmap,
    .direct_IO     = ext2_direct_IO,
    .writepages    = ext2_writepages,
};

The code above shows that in either variable, the readpage member points to the function ext2_readpage. We can therefore conclude that do_generic_mapping_read ultimately calls ext2_readpage to handle the read request.
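For reference, ext2_readpage itself is tiny; in this kernel era (fs/ext2/inode.c) it is essentially a one-line wrapper that passes ext2's block-mapping callback down to the page cache code:

static int ext2_readpage(struct file *file, struct page *page)
{
    /* ext2_get_block translates file block numbers to disk block numbers */
    return mpage_readpage(page, ext2_get_block);
}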

With this, the work of the ext2 file system layer is complete.

Page cache layer Processing

We learned above that the ext2_readpage function is the entry point of this layer; it calls the mpage_readpage function. Listing 6 shows the code of mpage_readpage.

Listing 6 code of the mpage_readpage Function

int mpage_readpage(struct page *page, get_block_t get_block)
{
    struct bio *bio = NULL;
    sector_t last_block_in_bio = 0;

    bio = do_mpage_readpage(bio, page, 1,
                            &last_block_in_bio, get_block);
    if (bio)
        mpage_bio_submit(READ, bio);
    return 0;
}

This function first calls do_mpage_readpage to build a bio request. The bio specifies the location on disk of the data blocks to be read, the number of blocks, and the destination for the data: the page in the cache. It then calls mpage_bio_submit to process the request, which calls submit_bio, which in turn hands the request to generic_make_request; generic_make_request submits the request to the generic block layer.
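Here is a sketch of the kind of bio this builds, assuming the 2.6-era bio API (bio_alloc, bio_add_page, submit_bio, and the bi_sector/bi_bdev fields); submit_read_bio is a hypothetical helper, and the real code derives the starting sector through the get_block callback:

/* Hypothetical helper: build and submit a one-page read bio. */
#include <linux/bio.h>
#include <linux/pagemap.h>

static void submit_read_bio(struct block_device *bdev,
                            sector_t first_sector, struct page *page)
{
    struct bio *bio = bio_alloc(GFP_KERNEL, 1);

    bio->bi_bdev = bdev;           /* which block device to read */
    bio->bi_sector = first_sector; /* where on the device to start */
    bio_add_page(bio, page, PAGE_CACHE_SIZE, 0); /* destination page */
    /* bio->bi_end_io would be set to a completion handler that
     * unlocks the page (see "Follow-up work" below) */
    submit_bio(READ, bio);         /* enters generic_make_request() */
}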

So far, the processing of the page cache layer has ended.

General block layer Processing

The generic_make_request function is the entry point of this layer, and it is the only function that processes requests at this layer. Listing 7 shows part of its code.

Listing 7 code of the generic_make_request Function

void generic_make_request(struct bio *bio)
{
    ……
    do {
        char b[BDEVNAME_SIZE];

        q = bdev_get_queue(bio->bi_bdev);
        ……
        block_wait_queue_running(q);

        /*
         * If this device has partitions, remap block n
         * of partition p to block n+start(p) of the disk.
         */
        blk_partition_remap(bio);

        ret = q->make_request_fn(q, bio);
    } while (ret);
}

Main Operations:

  • Obtain the request queue q from the block device identified in the bio.
  • Check whether the I/O scheduler for this queue is running; if so, continue, otherwise wait until the scheduler is available.
  • Call the function pointed to by q->make_request_fn to add the request (the bio) to the request queue.

So far, the general block layer operation has ended.

I/O scheduler layer Processing

The call to the make_request_fn function can be considered the entry point of the I/O scheduler layer; this function adds a request to the request queue. The function is assigned when the request queue is created, with the following code (in the blk_init_queue function):

q->request_fn = rfn;
blk_queue_make_request(q, __make_request);

The blk_queue_make_request function assigns the address of the function __make_request to the make_request_fn member of request queue q; __make_request is therefore the real entry point of the I/O scheduler layer.

The main tasks of the __make_request function are (sketched in code after the list):

  1. Check whether the request queue is empty. If it is, mark the queue so that the driver delays processing the current request (the goal is to accumulate more requests, giving adjacent requests a chance to be merged and thereby improving performance), and go to step 3; otherwise go to step 2.
  2. Try to merge the current request with an existing request in the request queue. If the merge succeeds, the function returns; otherwise go to step 3.
  3. The request is a brand-new one: create a new request descriptor, initialize its fields, and add it to the request queue. The function returns.
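A sketch of the merge decision, assuming the 2.6-era elevator interface (elv_merge and the ELEVATOR_*_MERGE return codes); queue_bio is a hypothetical name, and the surrounding locking and request allocation are omitted:

/* Hypothetical fragment mirroring __make_request()'s merge decision. */
#include <linux/blkdev.h>
#include <linux/elevator.h>

static void queue_bio(request_queue_t *q, struct bio *bio)
{
    struct request *req;

    switch (elv_merge(q, &req, bio)) {  /* ask the scheduler where bio fits */
    case ELEVATOR_BACK_MERGE:   /* bio continues right after req */
    case ELEVATOR_FRONT_MERGE:  /* bio ends right where req begins */
        /* fold bio into req and update its size; no new request needed */
        break;
    case ELEVATOR_NO_MERGE:
    default:
        /* allocate a fresh struct request, initialize it from bio,
         * and add it to the queue for later dispatch by request_fn */
        break;
    }
}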

Once a request is in the request queue, when it gets processed is decided by the I/O scheduler's scheduling algorithm (for the scheduler algorithms, see the references). When a request is ready to be processed, the function pointed to by the request queue's request_fn member is called. That member is also set when the request queue is created:

q->request_fn = rfn;
blk_queue_make_request(q, __make_request);

The first line assigns rfn, a pointer to the driver's request-handling function, to the request_fn member of the request queue; rfn is passed in as a parameter when the request queue is created.

The call to the request-handling function request_fn marks the end of the I/O scheduler layer's work.

Block device driver layer Processing

The request_fn function is the entry point of the block device driver layer. It is supplied by the driver to the I/O scheduler layer when the driver creates its request queue.

The I/O scheduler layer calls request_fn back to hand requests to the driver. The driver takes the I/O requests issued by the upper layers from the function's argument (the request queue) and operates the device controller according to the information in each request, issuing commands in the form the specific physical device requires.
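A skeleton of how a driver wires this up, assuming the 2.6-era block API (blk_init_queue, elv_next_request, end_request); my_request_fn, my_lock, and my_setup_queue are hypothetical driver names:

/* Hypothetical driver skeleton showing where rfn comes from. */
#include <linux/blkdev.h>
#include <linux/spinlock.h>

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

static void my_request_fn(request_queue_t *q)
{
    struct request *req;

    /* drain the queue that the I/O scheduler has prepared */
    while ((req = elv_next_request(q)) != NULL) {
        /* a real driver programs the controller from req: start sector,
         * buffer, length, and direction; here we just complete it */
        end_request(req, 1);    /* 1 = success */
    }
}

static request_queue_t *my_setup_queue(void)
{
    /* stores my_request_fn in q->request_fn and installs
     * __make_request as q->make_request_fn */
    return blk_init_queue(my_request_fn, &my_lock);
}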

So far, the operation on the driver layer of the block device has ended.

Block device layer Processing

The block device layer is the physical device itself. It accepts requests from the driver layer and performs the actual data transfer. It also defines the specifications that drivers must follow when operating the hardware.

Follow-up work

After the device completes the I/O request, it notifies the CPU with an interrupt, and the interrupt handler calls the request_fn function to continue processing.

When the driver runs the request to completion, it notifies the upper layers whether the I/O operation succeeded, based on the outcome of the data transfer. If it succeeded, the upper-layer functions unlock the pages involved in the I/O (the locks taken in the do_generic_mapping_read function).

Once the page is unlocked, do_generic_mapping_read() can acquire the lock again (the data synchronization point) and continue executing. The sys_read function can then return, and finally the read system call returns to user space.
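The unlocking happens in the bio's completion callback. Below is a simplified sketch modeled on what mpage read completion does in this kernel era (in 2.6.11 the bi_end_io callback receives the bio, the bytes completed, and an error code); my_end_io_read is a hypothetical name, and the real handler walks the bio_vec array slightly differently:

/* Hypothetical completion callback: mark pages up to date and unlock
 * them, waking the reader sleeping in lock_page(). */
static int my_end_io_read(struct bio *bio, unsigned int bytes_done, int err)
{
    const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
    int i;

    if (bio->bi_size)        /* transfer not fully complete yet */
        return 1;

    for (i = 0; i < bio->bi_vcnt; i++) {
        struct page *page = bio->bi_io_vec[i].bv_page;

        if (uptodate)
            SetPageUptodate(page);
        unlock_page(page);   /* releases do_generic_mapping_read() */
    }
    bio_put(bio);
    return 0;
}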

At this point, the entire process from sending to ending the read system call is complete.


Summary

This article has walked through the entire flow of a read call in Linux. The process splits into two parts: user-space processing and kernel-space processing. In user space, control is handed to the kernel via the 0x80 interrupt. After the kernel takes over, the request passes through six layers of processing and is finally handed to the disk, which performs the actual data transfer. Along the way a series of kernel functions is called, as summarized in Figure 6.

Figure 6 The read system call's processing levels in the kernel

References

  • See the article "kernel commands called by Linux" for the basic principles of system calls and how to implement a system call of your own.
  • See the article "Linux File System Analysis" for background on the Linux file system.
  • For more on the ext2 file system, see the section "Ext2 filesystem hard disk layout" in Chapter 18 of Understanding the Linux Kernel (3rd Edition).
  • See Chapters 14 and 15 of Understanding the Linux Kernel (3rd Edition) for I/O scheduling algorithms and page cache technology.
