The Linux read System Call

Source: Internet
Author: User

A recent project involved creating a device that emulates a USB flash drive. When reading the contents of this virtual drive, the data had to come from the disk itself rather than from the system cache. To solve this problem, I looked into the read system call and some parts of the file system. File systems cover a wide range of topics, such as the virtual file system (VFS), the page cache, the buffer cache, and data synchronization, so a complete analysis is not possible here. This article records only the two ways of reading: cached I/O and direct I/O.
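As a rough illustration of the direct I/O case, the sketch below opens a file with O_DIRECT and reads it into an aligned buffer. The helper name read_direct and the 4096-byte alignment are assumptions made for this example: the real alignment requirement depends on the device and file system, and some file systems reject O_DIRECT altogether, in which case open fails.

```c
#define _GNU_SOURCE            /* needed for O_DIRECT with glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIRECT_ALIGN 4096      /* assumed; the real requirement depends on the device */

/* Hypothetical helper: read len bytes from the start of path with O_DIRECT.
 * len must be a non-zero multiple of DIRECT_ALIGN.
 * Returns the number of bytes read, or -1 on any error. */
ssize_t read_direct(const char *path, void *out, size_t len)
{
    void *buf = NULL;
    ssize_t n;
    int fd;

    if (len == 0 || len % DIRECT_ALIGN != 0)
        return -1;                            /* length must be block-aligned */

    fd = open(path, O_RDONLY | O_DIRECT);     /* fails if O_DIRECT is unsupported */
    if (fd < 0)
        return -1;

    /* O_DIRECT also needs the user buffer itself to be aligned. */
    if (posix_memalign(&buf, DIRECT_ALIGN, len) != 0) {
        close(fd);
        return -1;
    }

    n = read(fd, buf, len);                   /* transfer bypasses the page cache */
    if (n > 0)
        memcpy(out, buf, (size_t)n);

    free(buf);
    close(fd);
    return n;
}
```

A caller would pass a block-sized buffer, for example read_direct(path, block, 4096), and fall back to an ordinary cached read if the function returns -1.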

1. What is a system call

First, what can a system call do? In summary, the following kinds of tasks are implemented through system calls.

  1. Control hardware: System calls often serve as the abstract interface between hardware resources and user space, for example the write/read calls used to write and read files.
  2. Set system state or read kernel data: Because system calls are the only means of communication between user space and the kernel, setting system state (for example, enabling or disabling a kernel service by setting a kernel variable) and reading kernel data must both go through system calls. Examples: getpgid, getpriority, setpriority, and sethostname.
  3. Process management: System calls ensure that the processes in the system can run as multiple tasks in a virtual memory environment. Examples: fork, clone, execve, and exit.

So why must operating system resources be accessed through system calls? This can be seen as kernel protection. Linux is divided into user space and kernel space, and user space is not allowed to access data in kernel space directly. When a user-space program needs a resource in kernel space, it must go through a system call as the intermediary. This restricts what user space can do: only specific, predefined user-space requests can enter kernel space. In short, a system call is an interface that the kernel provides to user space for accessing kernel resources.

In addition, there are only two ways to switch from user mode to kernel mode: system calls and interrupts.

To perform a system call, the CPU must first switch from user space to kernel space. On IA-32 this is done with the assembly instruction int $0x80, which raises a software interrupt. This part is generally implemented in the C standard library. Once in kernel space, the central system call handling code (all system calls pass through it) uses the passed parameters, which are transferred in registers and include the unique system call number, to select the function to run from a static table. For example, when a read system call is executed, the 0x80 interrupt handler checks the system call number, looks it up in the system call table, obtains the kernel function sys_read that handles read, passes the parameters, and runs sys_read. At this point the kernel has actually started to process the read system call (sys_read is the kernel entry point of the read system call).
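To make the role of the system call number concrete, here is a minimal user-space sketch that issues read by number through the C library's generic syscall() function instead of the read() wrapper. The helper name raw_read is an assumption for this example. Note that syscall() chooses the entry mechanism for the platform, so on x86-64 it does not use int $0x80.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_read: the system call number for read */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: invoke the read system call by number.
 * Returns the byte count on success; on failure syscall() returns -1
 * and sets errno, the same convention as read(). */
ssize_t raw_read(int fd, void *buf, size_t count)
{
    return syscall(SYS_read, fd, buf, count);
}
```

Calling raw_read(fd, buf, n) behaves like read(fd, buf, n); the only difference is that the system call number is passed explicitly rather than being hidden inside the library wrapper.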

2. Processing layers of the read system call in kernel space

The read system call passes through a hierarchy of layers in kernel space. A read request to a disk first goes through the virtual file system layer (VFS layer), then the specific file system layer (such as ext2), then the cache layer (page cache layer), the generic block layer, the I/O scheduler layer, the block device driver layer, and finally the block device layer.

  • The virtual file system layer hides the differences between the underlying file systems and provides a unified interface to the layers above. This layer is what allows devices to be abstracted as files, so that operating on a device is as simple as operating on a file.
  • In the specific file system layer, each file system (such as ext2 or NTFS) has its own procedures and defines its own set of operations. For more information about file systems, see references.
  • The cache layer exists to improve disk access performance in Linux. It keeps part of the data on disk in memory. When a request arrives and the data is in the cache and up to date, the data is passed directly to the user program, which avoids any operation on the underlying disk.
  • The generic block layer receives disk requests from the upper layers and eventually issues I/O requests. It hides the characteristics of the underlying hardware block devices and presents a general abstract view of block devices.
  • The I/O scheduler layer receives I/O requests from the generic block layer, queues them, and tries to merge adjacent requests (two requests whose data is adjacent on disk). Following the configured scheduling algorithm, it calls back the request handling function provided by the driver layer to process each I/O request.
  • A driver in the block device driver layer corresponds to a specific physical block device. It takes I/O requests from the upper layer and, based on the information in each request, transfers data by sending commands to the device controller of that block device.
  • The block device layer is the physical device itself, which defines the specifications for operating the device.

3. Related kernel data structures
  • dentry: links a file name to the file's inode.
  • inode (index node): the per-file node that stores information such as the file identifier, permissions, and content.
  • file: stores information about an open file, together with a set of function pointers for operating on it.
  • file_operations: the set of function interfaces for operating on files.
  • address_space: describes the page cache structure and related information for a file, and contains a set of function pointers for operating on the page cache.
  • address_space_operations: the set of function interfaces for page cache operations.
  • bio: describes an I/O request.

For the definitions of these structures, refer to the VFS documentation and the kernel source code.

The relationships between the data structures above (except bio) are as follows. From the dentry object you can find the inode object; from the inode object you can get the address_space object; and from the address_space object you can find the address_space_operations object. The file object is obtained from the information in the current process descriptor, and from it you can find the dentry object, the address_space object, and the file_operations object.
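The pointer chain described above can be sketched with heavily simplified stand-ins for the kernel structures. Everything below is an illustration, not the kernel's definitions: only the linking fields are kept, the member names are modeled on the kernel's but simplified, and the helper file_to_aops is made up for this example.

```c
#include <stddef.h>

/* Simplified stand-ins; real kernel structures have many more members. */
struct page;                                  /* opaque in this sketch */
struct file;

struct address_space_operations {
    int (*readpage)(struct file *filp, struct page *page);
};

struct address_space {
    const struct address_space_operations *a_ops;
};

struct file_operations {
    long (*read)(struct file *filp, char *buf, size_t count, long long *pos);
};

struct inode {
    const struct file_operations *i_fop;      /* copied into file->f_op at open */
    struct address_space *i_mapping;          /* page cache description */
};

struct dentry {
    const char *d_name;                       /* simplified: a plain string */
    struct inode *d_inode;
};

struct file {
    struct dentry *f_dentry;
    const struct file_operations *f_op;
    struct address_space *f_mapping;
    long long f_pos;                          /* current read/write position */
};

/* Follow the chain file -> dentry -> inode -> address_space -> a_ops. */
const struct address_space_operations *file_to_aops(const struct file *f)
{
    return f->f_dentry->d_inode->i_mapping->a_ops;
}
```

The same a_ops pointer is also reachable through f->f_mapping->a_ops, which reflects the shortcut from the file object to the address_space object mentioned above.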

4. read system call process

4.1. Prerequisites

For any given read call, the kernel may take one of many processing paths. This article follows one case, with these assumptions:

  • The file to be read already exists.
  • The file is read through the page cache.
  • The file to be read is a regular file.
  • The file system on the disk is ext2. For more information about the ext2 file system, see references.

4.2. open before read

The kernel function corresponding to the open system call is sys_open, which calls do_sys_open:

    long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
    {
        struct open_flags op;
        int lookup = build_open_flags(flags, mode, &op);
        char *tmp = getname(filename);
        int fd = PTR_ERR(tmp);

        if (!IS_ERR(tmp)) {
            fd = get_unused_fd_flags(flags);
            if (fd >= 0) {
                struct file *f = do_filp_open(dfd, tmp, &op, lookup);
                if (IS_ERR(f)) {
                    put_unused_fd(fd);
                    fd = PTR_ERR(f);
                } else {
                    fsnotify_open(f);
                    fd_install(fd, f);
                }
            }
            putname(tmp);
        }
        return fd;
    }

The main steps are:

  • get_unused_fd_flags: obtains an unused file descriptor (the smallest unused file descriptor is chosen each time).
  • do_filp_open: calls open_namei() to look up the dentry and inode for the file (because the prerequisites state that the file already exists, the dentry and inode can be found and do not need to be created), then calls dentry_open() to create a new file object and initialize it with the information in the dentry and inode (the file's current read/write position is stored in the file object). Note that dentry_open() contains the statement: f->f_op = fops_get(inode->i_fop);
    This assignment stores, in the f_op member of the file object, the set of file operation function pointers that belongs to the specific file system (the set is kept in the inode object). The sys_read function described next calls the read member of file->f_op.
  • fd_install: uses the file descriptor as an index to associate the current process descriptor with the file object above, preparing for subsequent read and write operations.

do_sys_open returns the file descriptor for the file.
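The "smallest unused file descriptor" rule is visible from user space. The sketch below, with the made-up helper name reuses_lowest_fd, opens a file, closes it, and opens it again; in a single-threaded program the second open should get the same descriptor number back.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if re-opening after close() yields the same (lowest free)
 * descriptor number, 0 if not, and -1 if an open() call fails.
 * Assumes no other thread opens or closes descriptors in between. */
int reuses_lowest_fd(const char *path)
{
    int first = open(path, O_RDONLY);
    if (first < 0)
        return -1;
    close(first);                 /* first is now the lowest free slot again */

    int second = open(path, O_RDONLY);
    if (second < 0)
        return -1;
    close(second);

    return second == first;
}
```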

4.3. Processing of the Virtual File System Layer

The kernel function corresponding to the read system call is sys_read. Its implementation is as follows (read_write.c):

    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
        struct file *file;
        ssize_t ret = -EBADF;
        int fput_needed;

        file = fget_light(fd, &fput_needed);
        if (file) {
            loff_t pos = file_pos_read(file);
            ret = vfs_read(file, buf, count, &pos);
            file_pos_write(file, pos);
            fput_light(file, fput_needed);
        }

        return ret;
    }

Walkthrough:

  • fget_light(): fetches the corresponding file object from the current process descriptor, using fd as the index.
  • file_pos_read(): retrieves the file's current read/write position.
  • vfs_read(): performs the file read. It ends up calling the function that file->f_op->read points to. The code is as follows:

    if (file->f_op->read)
        ret = file->f_op->read(file, buf, count, pos);

  • file_pos_write(): updates the file's current read/write position.
  • fput_light(): updates the file's reference count.
  • Finally, the number of bytes read is returned.
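The sequence above (fetch the position, dispatch through f_op->read, store the position back) can be modeled in user space. This is only a toy: the toy_ names, the in-memory "file", and the simplified error handling are assumptions for illustration and are not the kernel's code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

struct toy_file;

struct toy_file_operations {
    ssize_t (*read)(struct toy_file *f, char *buf, size_t count, long long *pos);
};

struct toy_file {
    const struct toy_file_operations *f_op;
    long long f_pos;      /* current read/write position */
    const char *data;     /* backing bytes standing in for the disk */
    size_t size;
};

/* A "file system" read method: copy from the backing bytes and advance *pos. */
static ssize_t mem_read(struct toy_file *f, char *buf, size_t count, long long *pos)
{
    if ((size_t)*pos >= f->size)
        return 0;                               /* end of file */
    size_t left = f->size - (size_t)*pos;
    if (count > left)
        count = left;
    memcpy(buf, f->data + *pos, count);
    *pos += (long long)count;
    return (ssize_t)count;
}

static const struct toy_file_operations mem_fops = { .read = mem_read };

/* Models the vfs_read dispatch: call f_op->read if one is provided. */
static ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t count, long long *pos)
{
    if (!f->f_op || !f->f_op->read)
        return -EINVAL;
    return f->f_op->read(f, buf, count, pos);
}

/* Models sys_read: fetch the position, read, store the updated position. */
ssize_t toy_sys_read(struct toy_file *f, char *buf, size_t count)
{
    long long pos = f->f_pos;                        /* like file_pos_read  */
    ssize_t ret = toy_vfs_read(f, buf, count, &pos);
    f->f_pos = pos;                                  /* like file_pos_write */
    return ret;
}
```

Because the position lives in the file object, two successive calls to toy_sys_read on the same object continue where the previous one stopped, which is the behavior the walkthrough describes for the real call.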

At this point, the processing done by the virtual file system layer is complete, and the control is handed over to the ext2 file system layer.

4.4. ext2 layer and subsequent processing

Looking at the initialization of ext2_file_operations, we can see that the read member for ext2 points to do_sync_read, that do_sync_read calls the file system's aio_read function, and that for ext2 aio_read points to generic_file_aio_read. Therefore, generic_file_aio_read is the entry point of the ext2 layer.

(Figure: general flow of generic_file_aio_read in filemap.c.)

4.4.1. File page cache structure

In Linux, when an application needs to read data from a file, the operating system first allocates some memory, reads the data from the storage device into that memory, and then delivers the data to the application. When an application writes data to a file, the operating system first allocates memory to receive the user data and then writes the data from memory to disk. File cache management is the management of this memory, which the operating system allocates to hold file data. The quality of cache management is measured by two indicators. The first is the cache hit rate: on a hit, data is obtained directly from memory without touching slow peripherals, which can improve performance significantly. The second is the ratio of effective cache, where effective cache means the cache entries that are actually accessed. If this ratio is low, a considerable amount of disk bandwidth is wasted reading useless cache entries, and the useless entries indirectly put pressure on system memory, which can seriously hurt performance.

The file cache is a copy of file data in memory, so file cache management involves both the memory management system and the file system. On the one hand, the file cache is part of physical memory and must take part in the allocation and reclaiming of physical memory. On the other hand, the data in the file cache comes from files on the storage device and must be read from and written to that device through the file system. From the operating system's point of view, the file cache is the link between the memory management system and the file system. File cache management is therefore an important part of the operating system, and its performance directly affects the performance of both the file system and the memory management system.

The file read-ahead algorithm in the Linux kernel works as follows. For the first read request on a file, the system reads the requested page plus a few pages that follow it (at least one page, usually three). This is called synchronous read-ahead. For the second read request, if the requested page is not in the cache, that is, it was not in the previous read-ahead group, then file access is not sequential and the system continues to use synchronous read-ahead. If the requested page is in the cache, the previous read-ahead was a hit: the system doubles the read-ahead group and has the underlying file system read the blocks in the group that are not yet in the cache. This is called asynchronous read-ahead. Whether or not the second read request hits, the system updates the size of the current read-ahead group. The system also defines a window that consists of the previous read-ahead group and the current one. Any later read request falls into one of two cases. In the first case, the requested page is inside the read-ahead window: asynchronous read-ahead continues and the window and group are updated. In the second case, the requested page is outside the read-ahead window: the system performs synchronous read-ahead and resets the window and group.

Files are divided into data blocks of one page each. These blocks (pages) are organized into a multiway tree called a radix tree. All leaf nodes of the tree are page frame structures (struct page), each representing one page used to cache the file. The leftmost leaf holds the first 4096 bytes of the file (if the page size is 4096 bytes), the next leaf holds the second 4096 bytes, and so on. All intermediate nodes are organizational nodes that indicate which page holds the data at a given address. The tree can have from 0 to 6 levels, and the supported file size ranges from 0 bytes to 16 TB. The pointer to the root of the tree is obtained from the address_space object associated with the file (this object is stored in the inode object associated with the file).
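The arithmetic behind this layout can be sketched as follows. The example assumes 4096-byte pages and a fan-out of 64 slots per tree node (6 index bits per level); both constants are assumptions for illustration and the function names are made up. With a 32-bit page index and 4096-byte pages, the largest addressable file is 2^32 × 4096 bytes = 16 TiB, which matches the limit stated above.

```c
#include <stdint.h>

#define PAGE_SHIFT 12                       /* assumes 4096-byte pages */
#define MAP_SHIFT  6                        /* assumed fan-out: 64 slots per node */
#define MAP_MASK   ((1u << MAP_SHIFT) - 1)

/* Index of the page (the tree key) that holds the given byte offset. */
uint64_t page_index(uint64_t offset)
{
    return offset >> PAGE_SHIFT;
}

/* Byte position inside that page. */
uint32_t page_offset(uint64_t offset)
{
    return (uint32_t)(offset & ((1u << PAGE_SHIFT) - 1));
}

/* Slot to follow at a given level; level 0 is the level just above the leaves. */
unsigned slot_at_level(uint64_t index, unsigned level)
{
    return (unsigned)((index >> (level * MAP_SHIFT)) & MAP_MASK);
}
```

For example, byte offset 10000 lies in page 2 at in-page offset 1808, and a lookup of page index 130 in a two-level tree follows slot 2 at level 1 and then slot 2 at level 0.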

A physical page may be made up of several non-contiguous physical disk blocks. Because the disk blocks mapped to a page are not necessarily contiguous, it is not easy to check whether specific data is already in the page cache. In addition, the Linux page cache defines very broadly what can be cached: the cache target is any page-based object, which includes various types of files and various types of memory mappings. To meet these general requirements, Linux uses the address_space structure, defined in linux/fs.h, to describe the pages in the page cache.

4.4.2. Processing of ext2 Layer

do_generic_file_read:

  • Based on the file's current read/write position, find the page in the page cache that caches the requested data.
  • If the page is up to date, copy the requested data to user space.
  • Otherwise, lock the page.
  • Call the readpage function to send a page read request to the disk (the page is unlocked when the lower layers complete the I/O operation). Code: error = mapping->a_ops->readpage(filp, page);
  • Lock the page again. When this succeeds, the data is already in the page cache, because the page is unlocked only after the I/O operation completes. The lock serves as the synchronization point for moving data from disk to memory.
  • Unlock the page.
  • The data is now in the page cache and is copied to user space (after which the read call can return to user space).
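Whether the lookup in the first step is likely to hit can be observed from user space with mincore, which reports whether mapped pages are resident in memory. The sketch below checks only the first page of a file; the helper name first_page_cached is an assumption, and the result is only a snapshot, since the kernel may load or evict the page at any time.

```c
#define _DEFAULT_SOURCE        /* exposes mincore() with glibc */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the first page of the file is resident
 * in memory, 0 if it is not, and -1 on error. */
int first_page_cached(const char *path)
{
    unsigned char vec = 0;
    long page = sysconf(_SC_PAGESIZE);
    int fd, rc;
    void *map;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Mapping the file does not read it; it only gives mincore an address. */
    map = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    rc = mincore(map, (size_t)page, &vec);   /* one status byte per page */
    munmap(map, (size_t)page);
    if (rc != 0)
        return -1;

    return vec & 1;                          /* lowest bit: page is resident */
}
```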

At this point we know that when the data in the page is not up to date, this function calls the function that mapping->a_ops->readpage points to (the mapping variable is the address_space object in the inode object). Which function is that? In the ext2 file system, readpage points to ext2_readpage.

4.4.3. page cache layer Processing

We learned above that ext2_readpage is the entry point of this layer. It calls the mpage_readpage function, whose code is as follows:

    int mpage_readpage(struct page *page, get_block_t get_block)
    {
        struct bio *bio = NULL;
        sector_t last_block_in_bio = 0;
        struct buffer_head map_bh;
        unsigned long first_logical_block = 0;

        map_bh.b_state = 0;
        map_bh.b_size = 0;
        bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
                                &map_bh, &first_logical_block, get_block);
        if (bio)
            mpage_bio_submit(READ, bio);
        return 0;
    }

This function first calls do_mpage_readpage to create a bio request. The request specifies where on disk the data blocks to be read are located, how many blocks there are, and where the data should be copied to, namely the page in the page cache. It then calls mpage_bio_submit to process the request. mpage_bio_submit calls submit_bio, which finally passes the request to generic_make_request, and generic_make_request submits the request to the generic block layer for processing.

This completes the processing in the page cache layer.

4.4.4. General block layer Processing

The generic_make_request function is the entry point of this layer, and it is the only function in this layer that handles requests. For the function code, see blk-core.c.

Main operations:

  • Obtain the request queue q from the block device number stored in the bio.
  • Check whether the current I/O scheduler is available. If it is, continue; otherwise, wait until the scheduler becomes available.
  • Call the function that q->make_request_fn points to, which adds the request (bio) to the request queue.

This completes the processing in the generic block layer.

4.4.5. IO scheduling layer Processing

The call to make_request_fn can be regarded as the entry to the I/O scheduler layer. This function adds requests to the request queue, and it is set when the request queue is created. The code is as follows (in the blk_init_queue function):

    q->request_fn = rfn;
    blk_queue_make_request(q, __make_request);

blk_queue_make_request assigns the address of __make_request to the make_request_fn member of the request queue q, so __make_request is the real entry point of the I/O scheduler layer.

The main tasks of __make_request are:

  1. Check whether the request queue is empty. If it is, the driver delays processing of the current request (the purpose is to accumulate more requests so that adjacent requests have a chance to be merged, which improves performance) and the function jumps to step 3. Otherwise it goes to step 2.
  2. Try to merge the current request with an existing request in the request queue. If the merge succeeds, the function returns; otherwise it goes to step 3.
  3. The request is a new request. A new request descriptor is created, its fields are initialized, and the descriptor is added to the request queue. The function returns.
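The merge test in step 2 comes down to checking whether two sector ranges touch. The sketch below is a toy model under that assumption: struct toy_request and try_merge are made-up names, and the real kernel code also checks things such as the transfer direction and queue limits, which are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy request: a run of consecutive sectors. Not the kernel's struct request. */
struct toy_request {
    uint64_t sector;       /* first sector */
    uint32_t nr_sectors;   /* length in sectors */
};

/* Try to merge incoming I/O into an existing request.
 * Returns true and grows *rq if the two ranges are adjacent on disk. */
bool try_merge(struct toy_request *rq, uint64_t sector, uint32_t nr_sectors)
{
    if (rq->sector + rq->nr_sectors == sector) {   /* back merge: new I/O follows rq */
        rq->nr_sectors += nr_sectors;
        return true;
    }
    if (sector + nr_sectors == rq->sector) {       /* front merge: new I/O precedes rq */
        rq->sector = sector;
        rq->nr_sectors += nr_sectors;
        return true;
    }
    return false;                                  /* not adjacent: needs a new request */
}
```

When try_merge returns false, the caller would fall through to step 3 and queue a new request descriptor.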

After a request is put into the request queue, the scheduling algorithm of the I/O scheduler determines when it is processed. (For more information about I/O scheduler algorithms, see references.) Once the request can be processed, the function that the request_fn member of the request queue points to is called. This member is also set when the request queue is created:

    q->request_fn = rfn;
    blk_queue_make_request(q, __make_request);

The first line assigns the request handling function pointer rfn to the request_fn member of the request queue. rfn is passed in as a parameter when the request queue is created.

The call to the request handling function request_fn marks the end of processing in the I/O scheduler layer.

4.4.6. Handling of block device driver layer

The request_fn function is the entry point of the block device driver layer. The driver passes it to the I/O scheduler layer when the driver creates the request queue.

The I/O scheduler layer calls back request_fn to hand the request to the driver. The driver obtains the I/O request sent by the upper layer from the function's parameter and operates the device controller according to the information in the request (the commands must follow the specification defined by the physical device).

This completes the processing in the block device driver layer.

4.4.7. Block device layer Processing

The block device accepts requests from the driver layer and performs the actual data transfer. It also defines a set of specifications, and the driver must operate the hardware according to them.

4.4.8. Follow-up work

After the device completes the I/O request, it notifies the CPU with an interrupt, and the interrupt handler calls the request_fn function for processing.

When the driver processes the request this time, it notifies the upper-layer function whether the I/O operation succeeded, based on the result of the data transfer. If it succeeded, the upper-layer function unlocks the page involved in the I/O operation.

After the page is unlocked, the lock on the page (the data synchronization point) can be obtained again and execution continues. The sys_read function can then return, and finally the read system call returns as well.

At this point, the entire process of the read system call, from issue to completion, is finished.

From: http://www.coderonline.net/?p=711
