Linux read system call

Source: Internet
Author: User

Reprint website: http://my.oschina.net/haomcu/blog/468656

    • 1. What is a system call
    • 2. The read system call's processing hierarchy in kernel space
    • 3. Related kernel data structures
    • 4. The process of the read system call
    • 4.1. Preconditions
    • 4.2. Open before read
    • 4.3. Processing at the virtual file system layer
    • 4.4. The ext2 layer and subsequent processing
    • 4.4.1. The page cache structure of a file
    • 4.4.2. Processing at the ext2 layer
    • 4.4.3. Processing at the page cache layer
    • 4.4.4. Processing at the generic block layer
    • 4.4.5. Processing at the IO scheduling layer
    • 4.4.6. Processing at the block device driver layer
    • 4.4.7. Processing at the block device layer
    • 4.4.8. Follow-up work

A recent project involved a device that emulates a USB flash drive, where every read of the virtual drive's contents had to come from the disk itself rather than from the system's cache. Prompted by that problem, we looked into the read system call and parts of the file system. Because the file system touches so many areas, such as the virtual file system (VFS), the page cache, block caching, and data synchronization, a complete analysis is impossible here; these notes only record the read path, which comes in two flavors: cached IO and direct IO.

1. What is a system call

First, what can a system call do? In summary, a system call is needed for roughly the following kinds of work.

    1. Controlling hardware: system calls often serve as the abstract interface between hardware resources and user space, such as the read/write calls used to read and write files.
    2. Setting system state or reading kernel data: because system calls are the only channel of communication between user space and the kernel, setting system state, such as turning a kernel service on or off (setting a kernel variable), or reading kernel data must go through a system call. Examples include getpgid, getpriority, setpriority, and sethostname.
    3. Process management: ensuring that processes can run in a multitasking, virtual-memory environment, for example fork, clone, execve, and exit.

Why must the contents of the operating system be accessed through system calls? This can be seen as protection of the kernel: Linux is divided into user space and kernel space, and user space is not allowed to access kernel-space data. So when a user-space program needs kernel-space resources, it must go through the system call as an intermediary. This restricts what user space can do: only specific, pre-defined behaviors may enter kernel space. In a word, a system call is the interface that the kernel provides so that user space can access kernel resources.

In addition, there are only two ways to switch from user mode to kernel mode: a system call and an interrupt.

To implement a system call, the first requirement is the ability to switch from user space to kernel space. On IA-32 systems this switch is triggered by the assembly instruction int $0x80, which raises a software interrupt; this part is generally implemented in the C standard library. Once in kernel space, central system-call dispatch code runs (all system calls are handled by one central routine), which selects the right function based on the parameters passed in (including the unique system call number, passed in a register) and a static table. For the read system call, for example, after the 0x80 interrupt handler takes over, it checks the system call number, indexes the system call table with that number, obtains from the table the kernel function sys_read that handles read, and finally passes the arguments and runs sys_read. At this point the kernel truly begins to process the read system call (sys_read is the kernel entry point for read).

2. The read system call's processing hierarchy in kernel space

The hierarchy that a read system call passes through in kernel space is as follows: a disk read request first passes through the virtual file system layer (VFS layer), then the concrete file system layer (for example, ext2), then the page cache layer, the generic block layer, the IO scheduling layer (I/O scheduler layer), the block device driver layer, and finally the physical block device layer.

    • The virtual file system layer masks the differences among the underlying file systems and provides a unified interface for the layers above. Because of this level, a device can be abstracted as a file, making operating a device as simple as operating a file.
    • At the concrete file system layer, different file systems (such as ext2 and NTFS) differ in their specific procedures; each file system defines its own set of operations. For more about file systems, see the resources.
    • The page cache layer was introduced to improve Linux's disk-access performance. It caches some of the on-disk data in memory; when a request arrives and the data is present in the cache and up to date, the data is handed directly to the user program without touching the underlying disk, which improves performance.
    • The main task of the generic block layer is to receive disk requests from the layers above and eventually issue IO requests. This layer hides the characteristics of the underlying hardware block devices and provides a common, abstract view of block devices.
    • The IO scheduling layer receives IO requests from the generic block layer, caches them, and tries to merge adjacent requests (when the data of two requests is adjacent on disk). According to the configured scheduling algorithm, it calls back the request-handling function provided by the driver layer to process specific IO requests.
    • Drivers in the driver layer correspond to specific physical block devices. A driver takes IO requests from the layer above and, based on the information specified in each request, drives the data transfer by sending commands to the device controller of the specific block device.
    • The device layer is the specific physical device, which defines the specification for operating that device.
3. Related kernel data structures
    • dentry (directory entry): links a file name to its inode
    • inode (index node): the file's index node, holding information such as the file's identity, permissions, and content location
    • file: stores information about an opened file, including a set of function pointers for the various operations on it
    • file_operations: the set of function interfaces for operating on a file
    • address_space: describes the file's page cache structure and related information, and contains a set of function pointers for operating on the page cache
    • address_space_operations: the set of function interfaces for operating on the page cache
    • bio: describes an IO request

For the definitions of these structures, refer to the VFS documentation and the kernel source code.

The relationships between data structures are as follows:

The inode object can be found through the dentry object; the address_space object can be obtained from the inode object; and the address_space_operations object is reached through the address_space object. The file object can be obtained from information in the current process descriptor, and from it the dentry object, the address_space object, and the file_operations object can all be found.

4. The process of the read system call

4.1. Preconditions

For a specific read call, the kernel may encounter a number of different situations. Here we take one of them as an example:

    • the file to be read already exists;
    • the file is accessed through the page cache;
    • the file being read is a regular file;
    • the file system on disk is ext2 (for more on the ext2 file system, see the resources).
4.2. Open before read

The kernel function corresponding to the open system call is sys_open. sys_open calls do_sys_open:

long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
{
    struct open_flags op;
    int lookup = build_open_flags(flags, mode, &op);
    char *tmp = getname(filename);
    int fd = PTR_ERR(tmp);

    if (!IS_ERR(tmp)) {
        fd = get_unused_fd_flags(flags);
        if (fd >= 0) {
            struct file *f = do_filp_open(dfd, tmp, &op, lookup);
            if (IS_ERR(f)) {
                put_unused_fd(fd);
                fd = PTR_ERR(f);
            } else {
                fsnotify_open(f);
                fd_install(fd, f);
            }
        }
        putname(tmp);
    }
    return fd;
}

Explanation of the main steps:

    • get_unused_fd_flags: obtains an unused file descriptor (the smallest unused descriptor is chosen each time).
    • do_filp_open: calls the open_namei() function to fetch the dentry and inode associated with the file (since the preconditions state that the file already exists, the dentry and inode can be found rather than created), then calls dentry_open() to create a new file object and initialize it with information from the dentry and inode (the file's current read/write position is kept in the file object). Note this statement in dentry_open(): f->f_op = fops_get(inode->i_fop);
      This assignment copies the set of function pointers associated with the concrete file system (stored in the inode object) into the file object's f_op field; the read member of file->f_op is what the upcoming sys_read function will call.
    • fd_install: using the file descriptor as the index, associates the current process descriptor with the file object above, in preparation for the read and write operations.

Finally, the function returns the file's file descriptor.

4.3. Processing at the virtual file system layer

The kernel function corresponding to the read system call is sys_read. Its implementation is as follows (fs/read_write.c):

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct file *file;
    ssize_t ret = -EBADF;
    int fput_needed;

    file = fget_light(fd, &fput_needed);
    if (file) {
        loff_t pos = file_pos_read(file);
        ret = vfs_read(file, buf, count, &pos);
        file_pos_write(file, pos);
        fput_light(file, fput_needed);
    }
    return ret;
}

Code parsing:

    • fget_light(): fetches the corresponding file object from the current process descriptor, using the index given by fd.
    • file_pos_read() is called to fetch the current read/write position of the file.
    • vfs_read() is called to perform the file read; this function ultimately calls the function that file->f_op->read points to. The code is as follows:

if (file->f_op->read)
        ret = file->f_op->read(file, buf, count, pos);

    • file_pos_write() is called to update the file's current read/write position.
    • fput_light() is called to update the file's reference count.
    • Finally, the number of bytes read is returned.

With that, the virtual file system layer's processing is complete, and control passes to the ext2 file system layer.

4.4. The ext2 layer and subsequent processing

Looking at the initialization of ext2_file_operations, we can see that ext2's read points to do_sync_read, that do_sync_read calls ext2's aio_read function, and that aio_read points to generic_file_aio_read. So generic_file_aio_read is the entry point of the ext2 layer.

The general flow of generic_file_aio_read (mm/filemap.c) is illustrated by a figure in the original article, not reproduced here.

4.4.1. The page cache structure of a file

In Linux, when an application needs to read data from a file, the operating system allocates some memory, reads the data from the storage device into that memory, and then hands it to the application; when data needs to be written to a file, the operating system allocates memory to receive the user data and then writes it from memory to disk. File cache management is the management of this memory that the operating system allocates to hold file data. Cache management is measured by two metrics. The first is the cache hit rate: on a hit, the data can be obtained directly from memory without accessing the slow peripheral, which significantly improves performance. The second is the ratio of effective cache, where "effective cache" means cached entries that are actually accessed. If this ratio is low, a considerable share of disk bandwidth is wasted reading useless data into the cache, and those useless cache entries indirectly tighten system memory, which can ultimately hurt performance badly.

The file cache is a copy in memory of a file's data, so file cache management involves both the memory management system and the file system. On the one hand, the file cache is part of physical memory and must participate in the allocation and reclamation of physical memory; on the other hand, the data in the file cache comes from files on storage devices, so it must interact with those devices through the file system for reads and writes. From the operating system's perspective, the file cache can be seen as the link between the memory management system and the file system. File cache management is therefore an important part of the operating system, and its performance directly affects both the file system and the memory management system.

The file read-ahead algorithm in the Linux kernel works as follows. For the first read request on a file, the system reads the requested page plus a few pages immediately following it (no fewer than one page, usually three); this is called synchronous read-ahead. For the second read request, if the requested page is not in the cache, i.e. not in the previously read-ahead group, the file is not being accessed sequentially, and the system continues with synchronous read-ahead. If the requested page is in the cache, the previous read-ahead hit, so the operating system doubles the read-ahead group and has the underlying file system read in the blocks of that group that are not yet cached; this read-ahead is called asynchronous. Whether or not the second request hits, the system updates the size of the current read-ahead group. In addition, the system defines a window consisting of the previous read-ahead group and the current one. Any subsequent read request falls into one of two cases: if the requested page is inside the read-ahead window, asynchronous read-ahead continues and the window and group are updated; if it is outside the window, the system performs synchronous read-ahead and resets the window and group.

A file's data is divided into page-sized blocks, which are organized into a multi-way tree (called a radix tree). Every leaf of the tree is a page frame structure (struct page) representing one page used to cache the file. The leftmost page of the leaf level holds the first 4,096 bytes of the file (assuming a 4,096-byte page size), the next page holds the second 4,096 bytes, and so on. All intermediate nodes are organizational nodes that indicate which page holds the data at a given address. The tree can have from 0 to 6 levels, supporting file sizes from 0 bytes up to 4 TB. The root pointer of the tree can be obtained from the address_space object associated with the file (that object lives in the file's inode object).

A physical page may hold multiple discontiguous physical disk blocks, and because the disk blocks mapped into a page are not necessarily contiguous, detecting whether particular data is already in the page cache is not straightforward. In addition, the Linux page cache is defined very broadly: its target is any page-based object, including all kinds of files and all kinds of memory mappings. To meet this requirement of generality, Linux describes a page in the page cache with a structure defined in linux/fs.h, the address_space structure.

4.4.2. Processing at the ext2 layer

do_generic_file_read does the following:

    • based on the file's current read/write position, look in the page cache for the page that caches the requested data;
    • if the page is up to date, copy the requested data to user space;
    • otherwise, lock the page;
    • call the readpage function to issue a page request to the disk (the lower layers will unlock the page when the IO operation completes); the code: error = mapping->a_ops->readpage(filp, page);
    • lock the page again; if this succeeds, the data is already in the page cache, because the page can only be unlocked once the IO operation is complete. This is the synchronization point for bringing data from disk into memory;
    • unlock the page;
    • the data is now in the page cache; copy it to user space (after which the read call can return in user space).

From this we know: when the data in the page is not up to date, the function calls whatever mapping->a_ops->readpage points to (the variable mapping is the address_space object inside the inode object). So what exactly is this function? In the ext2 file system, readpage points to ext2_readpage.

4.4.3. Processing at the page cache layer

As shown above, the ext2_readpage function is the entry point of this layer. It calls the mpage_readpage function; here is the code of mpage_readpage.

int mpage_readpage(struct page *page, get_block_t get_block)
{
    struct bio *bio = NULL;
    sector_t last_block_in_bio = 0;
    struct buffer_head map_bh;
    unsigned long first_logical_block = 0;

    map_bh.b_state = 0;
    map_bh.b_size = 0;
    bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
                            &map_bh, &first_logical_block, get_block);
    if (bio)
        mpage_bio_submit(READ, bio);
    return 0;
}

The function first calls do_mpage_readpage to build a bio request, which specifies the location on disk of the data blocks to read, the number of blocks, and the destination of the data, i.e. the page in the cache. It then calls mpage_bio_submit to process the request. mpage_bio_submit calls submit_bio, which ultimately hands the request to generic_make_request, and generic_make_request submits the request to the generic block layer.

This concludes the processing of the page cache layer.

4.4.4. Processing at the generic block layer

The generic_make_request function is the entry point of this layer and the only function in it that handles requests. The function's code is in block/blk-core.c.

Its main operations:

    • find the request queue q based on the block device number saved in the bio;
    • check whether the current IO scheduler is usable; if so, continue, otherwise wait until the scheduler becomes usable;
    • call the function pointed to by q->make_request_fn to add the request (the bio) to the request queue.

This concludes the work of the generic block layer.

4.4.5. Processing at the IO scheduling layer

The call to the make_request_fn function can be regarded as the entry of the IO scheduling layer; this function adds requests to the request queue. It is specified when the request queue is created, as the following code shows (from the blk_init_queue function):

q->request_fn = rfn;
blk_queue_make_request(q, __make_request);

The function blk_queue_make_request assigns the address of __make_request to the make_request_fn member of the request queue q, so the __make_request function is the real entry of the IO scheduling layer. The main work of __make_request is:

    1. Check whether the request queue is empty. If it is, delay having the driver handle the current request (the intent is to accumulate more requests and thus get chances to merge adjacent ones, improving throughput) and go to step 3; otherwise go to step 2.
    2. Try to merge the current request with an existing request in the queue. If the merge succeeds, the function returns; otherwise go to step 3.
    3. The request is a new one: create a new request descriptor, initialize the appropriate fields, add the descriptor to the request queue, and return.

Once a request is placed in the request queue, when it is processed is decided by the IO scheduler's scheduling algorithm (see the resources for the algorithms of the IO scheduler). When the request can be processed, the function pointed to by the queue's request_fn member is invoked. This member is also set when the request queue is created:

q->request_fn = rfn;
blk_queue_make_request(q, __make_request);

The first line assigns the request-handling function pointer rfn to the request_fn member of the request queue; rfn is passed in as a parameter when the queue is created. The invocation of the request-handling function request_fn marks the end of the IO scheduling layer's processing.

4.4.6. Processing at the block device driver layer

The request_fn function is the entry of the block device driver layer. It is handed to the IO scheduling layer by the driver when the driver creates the request queue.

The IO scheduling layer passes requests to the driver by calling back request_fn. From the function's parameters the driver obtains an IO request from the layer above and, based on the information specified in the request, operates the device controller (issuing commands according to the specification defined by the physical device).

This concludes the operation of the block device driver layer.

4.4.7. Processing at the block device layer

This layer accepts requests from the driver layer and performs the actual data transfer. It also defines a specification that the driver must follow when operating the hardware.

4.4.8. Follow-up work

When the device completes the IO request, it notifies the CPU by an interrupt, and the interrupt handler again calls the request_fn function for processing.

When the driver processes the request again, it notifies the upper-layer function whether the IO operation succeeded, based on the result of this data transfer; if it succeeded, the upper-layer function unlocks the pages involved in the IO operation.

Once a page is unlocked, the earlier blocked attempt to lock it (the data synchronization point) can succeed, and execution continues. After that, sys_read can return, and finally the read system call itself returns.

At this point, the entire journey of the read system call, from issue to completion, is over.

