Connecting the VFS objects in Linux

Source: Internet
Author: User

The previous post showed that the four data structures of the common file model provided by the kernel (superblock, inode, dentry, and file objects) are highly generic. How are these objects connected so that different filesystems can coexist? This post focuses on how they are tied into the kernel, including how they relate to processes and which caching mechanisms are involved.


1. Files associated with a process


First, a file must be opened by a process. Each process has its own current working directory and its own root directory. These are just two examples of the data the kernel must maintain to represent the interaction between a process and a filesystem. A whole data structure of type fs_struct is used for this purpose, and the fs field of each process descriptor (task_struct) points to the fs_struct structure of the process:

struct fs_struct {
	atomic_t count;
	rwlock_t lock;
	int umask;
	struct dentry *root, *pwd, *altroot;
	struct vfsmount *rootmnt, *pwdmnt, *altrootmnt;
};


count: number of processes sharing this table
lock: read/write spin lock protecting the fields of the table
umask: bit mask applied when a file is opened to set its permissions
root: dentry of the root directory
pwd: dentry of the current working directory
altroot: dentry of the emulated root directory (always NULL on the 80x86 architecture)
rootmnt: mounted filesystem object of the root directory
pwdmnt: mounted filesystem object of the current working directory
altrootmnt: mounted filesystem object of the emulated root directory (always NULL on the 80x86 architecture)
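The per-process nature of this data can be observed from user space. The following is a minimal sketch, assuming a POSIX system; the function name cwd_is_per_process is made up for this illustration. A child created with plain fork() gets its own copy of the working-directory information, so a chdir() in the child leaves the parent untouched.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the parent's working directory is
 * unchanged after a child process calls chdir("/"), 0 if it changed,
 * and -1 on error.
 */
int cwd_is_per_process(void)
{
    char before[PATH_MAX], after[PATH_MAX];
    int status;
    pid_t pid;

    if (getcwd(before, sizeof(before)) == NULL)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: changes only its own working directory, then exits. */
        _exit(chdir("/") == 0 ? 0 : 1);
    }

    /* Only the parent reaches this point. */
    assert(pid > 0);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    if (getcwd(after, sizeof(after)) == NULL)
        return -1;
    return strcmp(before, after) == 0;
}
```

Calling cwd_is_per_process() from any directory is expected to return 1, because the two processes no longer share the same working-directory state after fork().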


A second table describes the files currently opened by the process. Its address is stored in the files field of the process descriptor task_struct. The table is of type files_struct:

struct files_struct {
	atomic_t count;
	struct fdtable *fdt;
	struct fdtable fdtab;

	spinlock_t file_lock ____cacheline_aligned_in_smp;
	int next_fd;
	struct embedded_fd_set close_on_exec_init;
	struct embedded_fd_set open_fds_init;
	struct file *fd_array[NR_OPEN_DEFAULT];
};

struct fdtable {
	unsigned int max_fds;
	int max_fdset;
	struct file **fd;	/* current fd array */
	fd_set *close_on_exec;
	fd_set *open_fds;
	struct rcu_head rcu;
	struct files_struct *free_files;
	struct fdtable *next;
};

#define NR_OPEN_DEFAULT BITS_PER_LONG
#define BITS_PER_LONG 32	/* asm-i386 */


An fdtable structure (the fdtab field) is embedded in files_struct, and the fdt field points to it.


The fd field of the fdtable structure points to an array of pointers to file objects. The length of the array is stored in the max_fds field. Usually, fd points to the fd_array field of the files_struct structure, which holds 32 file object pointers. If a process opens more than 32 files, the kernel allocates a new, larger array of file pointers, stores its address in the fd field, and updates the max_fds field.
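The switch from the embedded array to a larger one can be modeled in user space. This is only a sketch of the idea, not the kernel code: every toy_ name is invented here, the doubling policy is an assumption of the sketch, and the RCU handling that the real kernel needs is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NR_OPEN_DEFAULT 32          /* size of the embedded array, as in the text */
#define TOY_NR_OPEN (1u << 20)          /* upper bound used by this sketch */

struct toy_file;                        /* opaque stand-in for struct file */

struct toy_files {
    unsigned int max_fds;                            /* length of fd[] */
    struct toy_file **fd;                            /* current array */
    struct toy_file *fd_array[TOY_NR_OPEN_DEFAULT];  /* embedded array */
};

void toy_files_init(struct toy_files *f)
{
    memset(f->fd_array, 0, sizeof(f->fd_array));
    f->fd = f->fd_array;                /* initially points at the embedded array */
    f->max_fds = TOY_NR_OPEN_DEFAULT;
}

/* Make sure slot nr exists. Returns 0 on success, -1 on failure. */
int toy_expand_fd_array(struct toy_files *f, unsigned int nr)
{
    unsigned int new_max = f->max_fds;
    struct toy_file **bigger;

    if (nr < f->max_fds)
        return 0;                       /* already large enough */
    if (nr >= TOY_NR_OPEN)
        return -1;                      /* beyond the hard limit */

    while (new_max <= nr)
        new_max *= 2;                   /* doubling is this sketch's own policy */
    assert(new_max <= TOY_NR_OPEN);

    bigger = calloc(new_max, sizeof(*bigger));
    if (bigger == NULL)
        return -1;
    memcpy(bigger, f->fd, f->max_fds * sizeof(*bigger));
    if (f->fd != f->fd_array)
        free(f->fd);                    /* never free the embedded array */
    f->fd = bigger;
    f->max_fds = new_max;
    return 0;
}

void toy_files_destroy(struct toy_files *f)
{
    if (f->fd != f->fd_array)
        free(f->fd);
    toy_files_init(f);
}
```

As long as at most 32 slots are needed, fd keeps pointing at fd_array and no allocation takes place; the first request for slot 32 or higher moves the table to heap memory while preserving the existing entries.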



For every element of the fd array that refers to a file, the array index is the file descriptor. Usually, the first element of the array (index 0) is the standard input of the process, the second element (index 1) is its standard output, and the third element (index 2) is its standard error. Note that, thanks to the dup(), dup2(), and fcntl() system calls, two file descriptors may refer to the same open file, that is, two elements of the array may point to the same file object. Users see this whenever they use a shell construct such as 2>&1 to redirect standard error to standard output.
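That two descriptors can share one file object is visible from user space: file status flags are stored in the open file, not in the descriptor. The sketch below assumes a POSIX system; the function name dup_shares_file_object is made up for this illustration. It sets O_NONBLOCK through one descriptor of a pipe and reads it back through a duplicate.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a status flag set through one
 * descriptor is visible through its duplicate, 0 if not, -1 on error.
 */
int dup_shares_file_object(void)
{
    int p[2], copy, flags, result = -1;

    if (pipe(p) < 0)
        return -1;

    copy = dup(p[0]);                   /* second descriptor, same open file */
    if (copy < 0)
        goto out;
    assert(copy != p[0]);               /* different slot in the descriptor array */

    flags = fcntl(p[0], F_GETFL);
    if (flags < 0 || fcntl(p[0], F_SETFL, flags | O_NONBLOCK) < 0)
        goto out_copy;

    flags = fcntl(copy, F_GETFL);       /* read the flags through the duplicate */
    if (flags >= 0)
        result = (flags & O_NONBLOCK) != 0;

out_copy:
    close(copy);
out:
    close(p[0]);
    close(p[1]);
    return result;
}
```

The two descriptors are different indexes into the descriptor array, yet the flag change made through one is seen through the other, which is what pointing to the same file object means in practice.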


A process cannot use more than NR_OPEN (usually 1,048,576) file descriptors. The kernel also enforces a dynamic limit on the maximum number of file descriptors through the signal->rlim[RLIMIT_NOFILE] field of the process descriptor. This value is usually 1024, but it can be raised if the process has superuser privileges.
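A process can inspect this dynamic limit with the standard getrlimit() call. A minimal sketch follows; the wrapper name nofile_limits is invented for the example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Fetch the soft and hard RLIMIT_NOFILE limits. Returns 0 on success. */
int nofile_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    assert(soft != NULL && hard != NULL);
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;    /* limit currently enforced */
    *hard = rl.rlim_max;    /* ceiling an unprivileged process may raise it to */
    return 0;
}
```

The soft limit is the one that open() enforces; an unprivileged process may raise it only up to the hard limit.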


The open_fds field initially contains the address of the open_fds_init field, which is a bitmap identifying the file descriptors of the currently opened files. The max_fdset field stores the number of bits in the bitmap. Because the fd_set data structure has 1024 bits, there is usually no need to enlarge the bitmap. If necessary, however, the kernel can expand it dynamically, much as it does for the array of file objects.
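How a bitmap of open descriptors can drive allocation is sketched below. This is a user-space model with invented toy_ names, not the kernel implementation. It follows the POSIX rule that a newly opened file receives the lowest descriptor number that is not in use.

```c
#include <assert.h>
#include <limits.h>

#define TOY_MAX_FDSET 1024
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS (TOY_MAX_FDSET / TOY_BITS_PER_LONG)

struct toy_fdset {
    unsigned long bits[TOY_WORDS];      /* bit n set: descriptor n is open */
};

/* Find the lowest clear bit, set it, and return its index (-1 if full). */
int toy_alloc_fd(struct toy_fdset *set)
{
    for (unsigned int fd = 0; fd < TOY_MAX_FDSET; fd++) {
        unsigned long mask = 1UL << (fd % TOY_BITS_PER_LONG);
        unsigned long *word = &set->bits[fd / TOY_BITS_PER_LONG];

        if (!(*word & mask)) {
            *word |= mask;
            return (int)fd;
        }
    }
    return -1;
}

/* Mark a descriptor as free again. */
void toy_free_fd(struct toy_fdset *set, unsigned int fd)
{
    assert(fd < TOY_MAX_FDSET);
    set->bits[fd / TOY_BITS_PER_LONG] &= ~(1UL << (fd % TOY_BITS_PER_LONG));
}
```

Starting from an empty set, three allocations yield 0, 1, and 2; after freeing 1, the next allocation returns 1 again, which is why closing and reopening standard output works as a redirection trick.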


When the kernel starts using a file object, it calls the fget() function. This function receives a file descriptor fd as its parameter and returns the address stored in current->files->fd[fd], that is, the address of the corresponding file object, or NULL if no file corresponds to fd. In the former case, fget() increases the file object reference counter f_count by 1:

struct file fastcall *fget(unsigned int fd)
{
	struct file *file;
	struct files_struct *files = current->files;

	rcu_read_lock();
	file = fcheck_files(files, fd);
	if (file) {
		if (!atomic_inc_not_zero(&file->f_count)) {
			/* File object ref couldn't be taken */
			rcu_read_unlock();
			return NULL;
		}
	}
	rcu_read_unlock();

	return file;
}

static inline struct file *fcheck_files(struct files_struct *files, unsigned int fd)
{
	struct file *file = NULL;
	/* Leaving the RCU mechanism aside, the files_fdtable macro returns files->fdt */
	struct fdtable *fdt = files_fdtable(files);

	if (fd < fdt->max_fds)
		file = rcu_dereference(fdt->fd[fd]);
	return file;
}


When a kernel control path has finished using a file object, it calls the fput() function. This function receives the address of the file object as its parameter and decrements the reference counter f_count. If the counter drops to 0, the function calls the release method of the file operations (if defined), decrements the i_writecount field of the inode object (if the file was opened for writing), removes the file object from the superblock's list, releases the file object to the slab allocator, and finally decrements the usage counters of the associated dentry object and of the mounted filesystem descriptor:

void fastcall fput(struct file *file)
{
	if (atomic_dec_and_test(&file->f_count))
		__fput(file);
}

void fastcall __fput(struct file *file)
{
	struct dentry *dentry = file->f_dentry;
	struct vfsmount *mnt = file->f_vfsmnt;
	struct inode *inode = dentry->d_inode;

	might_sleep();

	fsnotify_close(file);
	/*
	 * The function eventpoll_release() should be the first called
	 * in the file cleanup chain.
	 */
	eventpoll_release(file);
	locks_remove_flock(file);

	if (file->f_op && file->f_op->release)
		file->f_op->release(inode, file);
	security_file_free(file);
	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
		cdev_put(inode->i_cdev);
	fops_put(file->f_op);
	if (file->f_mode & FMODE_WRITE)
		put_write_access(inode);
	file_kill(file);
	file->f_dentry = NULL;
	file->f_vfsmnt = NULL;
	file_free(file);
	dput(dentry);
	mntput(mnt);
}


The fget_light() and fput_light() functions are faster versions of fget() and fput(). The kernel uses them when it can safely assume that the current process already owns the file object, that is, the process has previously incremented the file object's reference counter. For example, they are used by the service routines of system calls that receive a file descriptor as an argument, because the reference counter was already incremented by an earlier open() system call.
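The counting protocol used by fget() and fput() can be sketched in portable C11. This is a user-space model, not the kernel code: the toy_ names are invented, C11 atomics stand in for the kernel's atomic_t helpers, and a boolean flag stands in for the cleanup done by __fput().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_file {
    atomic_int f_count;     /* reference counter, as in struct file */
    bool released;          /* set when the last reference is dropped */
};

/* Take a reference unless the counter has already reached zero. */
bool toy_inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;           /* object is already being torn down */
}

/* Model of fget(): hand out the object only if a reference was taken. */
struct toy_file *toy_fget(struct toy_file *file)
{
    return toy_inc_not_zero(&file->f_count) ? file : NULL;
}

/* Model of fput(): the caller that drops the count to zero cleans up. */
void toy_fput(struct toy_file *file)
{
    /* atomic_fetch_sub returns the previous value. */
    int prev = atomic_fetch_sub(&file->f_count, 1);

    assert(prev > 0);
    if (prev == 1)
        file->released = true;      /* stands in for __fput() */
}
```

The increment-unless-zero step matters because a lookup may race with the final put: once the counter has hit zero the object is on its way out, and a late reader must get NULL instead of resurrecting it.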


2. The inode cache


The VFS uses a cache to speed up access to inodes. Unlike the page cache that we will discuss later, each entry does not need to be split into two parts, because the inode structure already contains fields similar to those of the buffer head in the block cache. The implementation of the inode cache lives entirely in fs/inode.c, and this code has changed little across kernel versions.


Each inode may be in a hash table and in one of the following "type" lists:

· "in_use": valid inodes that are in use, that is, i_count > 0 and i_nlink > 0 (see the inode structure described earlier)
· "dirty": like "in_use", but the inode is also dirty
· "unused": valid inodes that are not currently in use, that is, i_count = 0


These lists are defined as follows:


static LIST_HEAD(inode_in_use);
static LIST_HEAD(inode_unused);
static struct hlist_head *inode_hashtable;
static LIST_HEAD(anon_hash_chain);	/* for inodes with NULL i_sb */

The inode cache is therefore organized as follows:


· The global hash table inode_hashtable, in which the hash value is computed from the superblock pointer and the 32-bit inode number. Inodes without a superblock (inode->i_sb == NULL) are added to the list headed by anon_hash_chain. The insert_inode_hash() function inserts an inode structure into the hash table.

· The list of inodes in use. The next and prev fields of the global variable inode_in_use point to the first and last elements of the list. Newly allocated inodes are added to this list by the new_inode() function.

· The list of unused inodes. The next and prev fields of the global variable inode_unused point to the first and last elements of the list, respectively.

· The list of dirty inodes. The s_dirty field of the superblock points to the first and last elements of the list.

· The inode object cache, defined as static kmem_cache_t *inode_cachep. This is a slab cache used to allocate and release inode objects.



The i_hash field of an inode links it into the hash table, and the i_list field links it into one of the in_use, unused, or dirty lists. All these lists are protected by a single spin lock, inode_lock. The inode cache is initialized by inode_init(), which is called at system startup by the start_kernel() function in init/main.c. inode_init(unsigned long mempages) takes a single parameter giving the number of physical pages, so the inode cache can be sized according to the amount of physical memory. For example, if enough physical memory is available, a larger hash table can be created.


Statistics about the inode cache are stored in the data structure inodes_stat_t, defined in linux/fs.h as follows:
struct inodes_stat_t {
	int nr_inodes;
	int nr_unused;
	int dummy[5];
};
extern struct inodes_stat_t inodes_stat;

A user program can read /proc/sys/fs/inode-nr and /proc/sys/fs/inode-state to obtain the total number of inodes in the inode cache and the number of unused ones.
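Reading these counters from a program takes only a few lines. The sketch below assumes that /proc/sys/fs/inode-nr contains two whitespace-separated integers, the total followed by the unused count; the helper names parse_inode_nr and read_inode_nr are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "<nr_inodes> <nr_unused>" from a text line. Returns 0 on success. */
int parse_inode_nr(const char *text, long *nr_inodes, long *nr_unused)
{
    assert(text != NULL && nr_inodes != NULL && nr_unused != NULL);
    return sscanf(text, "%ld %ld", nr_inodes, nr_unused) == 2 ? 0 : -1;
}

/* Read the counters from procfs. Returns 0 on success, -1 on failure. */
int read_inode_nr(long *nr_inodes, long *nr_unused)
{
    char buf[128];
    int ret = -1;
    FILE *fp = fopen("/proc/sys/fs/inode-nr", "r");

    if (fp == NULL)
        return -1;          /* not Linux, or procfs is not mounted */
    if (fgets(buf, sizeof(buf), fp) != NULL)
        ret = parse_inode_nr(buf, nr_inodes, nr_unused);
    fclose(fp);
    return ret;
}
```

Keeping the parsing separate from the file access makes it possible to exercise the parser on systems where procfs is not available.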


3. The dentry cache


Reading a directory entry from disk and building the corresponding dentry object takes considerable time, and the same dentry may well be needed again after the current operation on it has finished. Therefore, as with inodes, it pays to keep dentry objects in memory.


To process dentry objects as efficiently as possible, Linux uses a dentry cache, which consists of two kinds of data structures:


- A set of dentry objects in the in-use, unused, or negative state.
- A hash table for quickly obtaining the dentry object that corresponds to a given filename and directory. As usual, if the requested object is not in the dentry cache, the lookup returns a null value.




All "unused" dentry objects are kept in a doubly linked "least recently used" (LRU) list ordered by insertion time. In other words, the most recently released dentry object is placed at the front of the list, so the least recently used dentry objects are always near the end. When the dentry cache has to shrink, the kernel removes elements from the tail of the list, so that the most recently used objects are preserved. The addresses of the first and last elements of the LRU list are stored in the next and prev fields of the dentry_unused variable, which is of type list_head. The d_lru field of the dentry object contains pointers to the adjacent dentries in the list.
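The LRU discipline described above, insert at the head and reclaim from the tail, can be sketched with a small circular list. This is a user-space model with invented toy_ names; the list type imitates the kernel's list_head only loosely.

```c
#include <assert.h>
#include <stddef.h>

struct toy_list {
    struct toy_list *next, *prev;
};

struct toy_dentry {
    const char *name;
    struct toy_list d_lru;              /* links into the unused list */
};

void toy_list_init(struct toy_list *head)
{
    head->next = head->prev = head;     /* empty circular list */
}

/* Insert item right after head, i.e. at the front of the list. */
void toy_list_add(struct toy_list *item, struct toy_list *head)
{
    assert(head->next != NULL && head->prev != NULL);   /* head must be initialized */
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

void toy_list_del(struct toy_list *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
    item->next = item->prev = item;
}

/* A dentry becomes unused: the newest entry goes to the front. */
void toy_dentry_release(struct toy_dentry *d, struct toy_list *unused)
{
    toy_list_add(&d->d_lru, unused);
}

/* Reclaim the least recently used dentry (the tail), or NULL if empty. */
struct toy_dentry *toy_prune_one(struct toy_list *unused)
{
    struct toy_list *tail = unused->prev;

    if (tail == unused)
        return NULL;
    toy_list_del(tail);
    /* Recover the enclosing dentry from its embedded list node. */
    return (struct toy_dentry *)((char *)tail - offsetof(struct toy_dentry, d_lru));
}
```

If dentries a, b, and c are released in that order, pruning returns a first and c last, so the entries released most recently survive the longest.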


Each "in use" dentry object is inserted into a doubly linked list headed by the i_dentry field of the corresponding inode object (a list is needed because each inode may be associated with several hard links). The d_alias field of the dentry object stores the addresses of the adjacent elements in the list. Both fields are of type struct list_head.


An "in use" dentry object may become "negative" when the last hard link to the corresponding file is deleted. In this case, the dentry object is moved to the LRU list of "unused" dentries. Each time the kernel shrinks the dentry cache, the negative dentries move toward the tail of the LRU list, so they are gradually released.


The hash table is implemented by the dentry_hashtable array. Each element of the array is a pointer to a list of dentries that hash to the same value. The size of the array depends on the amount of RAM installed; by default there are 256 entries per megabyte of RAM. The d_hash field of the dentry object contains pointers to the adjacent elements in the list with the same hash value. The hash value is computed from the dentry object of the parent directory and the filename.
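A lookup keyed by parent directory and name can be sketched as follows. This is a user-space model with invented toy_ names: the table size is arbitrary, the hash is FNV-1a mixed with the parent's address rather than the kernel's hash function, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_SIZE 256       /* arbitrary; the kernel sizes its table from RAM */

struct toy_dentry {
    struct toy_dentry *d_parent;
    const char *d_name;
    struct toy_dentry *d_hash_next;     /* next entry in the same bucket */
};

static struct toy_dentry *toy_hashtable[TOY_HASH_SIZE];

/* Hash the name (FNV-1a), then mix in the parent's address. */
static unsigned int toy_hash(const struct toy_dentry *parent, const char *name)
{
    uint32_t h = 2166136261u;

    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        h ^= *p;
        h *= 16777619u;
    }
    h ^= (uint32_t)((uintptr_t)parent >> 4);
    return h % TOY_HASH_SIZE;
}

/* Put a dentry at the front of its bucket's chain. */
void toy_d_hash_insert(struct toy_dentry *d)
{
    unsigned int bucket = toy_hash(d->d_parent, d->d_name);

    assert(d->d_name != NULL);
    d->d_hash_next = toy_hashtable[bucket];
    toy_hashtable[bucket] = d;
}

/* Find the dentry called name inside parent, or NULL on a cache miss. */
struct toy_dentry *toy_d_lookup(const struct toy_dentry *parent, const char *name)
{
    struct toy_dentry *d = toy_hashtable[toy_hash(parent, name)];

    for (; d != NULL; d = d->d_hash_next)
        if (d->d_parent == parent && strcmp(d->d_name, name) == 0)
            return d;
    return NULL;
}
```

Because the parent is part of the key, two entries with the same name in different directories remain distinct even when they happen to land in the same bucket.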


The dcache_lock spin lock protects the dentry cache data structures against concurrent access on multiprocessor systems. The d_lookup() function searches the hash table for a given parent dentry object and filename. To avoid race conditions, it uses a seqlock. The __d_lookup() function is similar, but it assumes that no race can occur and therefore does not use the seqlock.


4. Implementing the VFS objects


Suppose the specific filesystem is ext2. After the command mount -t ext2 /dev/sda2 /mnt/test is executed, the do_mount() function is called to mount the filesystem. For a detailed code analysis, see the "file system installation" post. Here we give only a brief introduction: do_mount() eventually reaches the vfs_kern_mount() function, which calls the filesystem-specific get_sb method:
static struct file_system_type ext2_fs_type = {
	.owner		= THIS_MODULE,
	.name		= "ext2",
	.get_sb		= ext2_get_sb,
	.kill_sb	= kill_block_super,
	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_FIEMAP,
};


The structure above is the filesystem type descriptor of ext2. For the definition of the filesystem type, see the post "File System Registration". As we can see, the get_sb method of ext2 is ext2_get_sb. This function consists of a single line of code:
return get_sb_bdev(fs_type, flags, dev_name, data, ext2_fill_super, mnt);


The get_sb_bdev() function opens the device file whose name is passed in, that is, /dev/sda2 in the mount command above, and obtains a free superblock object for the registered filesystem. It then calls the ext2_fill_super() function passed in as a parameter to read information about the on-disk ext2 superblock into memory. The implementation details are discussed in the post "ext2 super block object". Here we show only the key steps:
sb->s_op = &ext2_sops;
sb->s_export_op = &ext2_export_ops;
sb->s_xattr = ext2_xattr_handlers;
root = iget(sb, EXT2_ROOT_INO);
sb->s_root = d_alloc_root(root);


Therefore, when ext2_fill_super() returns, the s_op field of the superblock corresponding to /dev/sda2 points to the following data structure:
static struct super_operations ext2_sops = {
	.alloc_inode	= ext2_alloc_inode,
	.destroy_inode	= ext2_destroy_inode,
	.read_inode	= ext2_read_inode,
	.write_inode	= ext2_write_inode,
	.put_inode	= ext2_put_inode,
	.delete_inode	= ext2_delete_inode,
	.put_super	= ext2_put_super,
	.write_super	= ext2_write_super,
	.statfs		= ext2_statfs,
	.remount_fs	= ext2_remount,
	.clear_inode	= ext2_clear_inode,
	.show_options	= ext2_show_options,
#ifdef CONFIG_QUOTA
	.quota_read	= ext2_quota_read,
	.quota_write	= ext2_quota_write,
#endif
};


In addition, once get_sb_bdev() returns, the superblock has been linked, through its s_instances field, into the list headed by the fs_supers field of the file_system_type that describes ext2.
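The chain from filesystem type to get_sb to fill_super is an instance of dispatch through function pointers. The sketch below is a heavily simplified user-space model: all toy_ names and signatures are invented, and a string stands in for the s_op method table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_super_block {
    const char *s_dev_name;
    const char *s_op_name;                      /* stands in for the s_op table */
    struct toy_super_block *s_next_instance;    /* link in the type's list */
};

struct toy_fs_type {
    const char *name;
    int (*get_sb)(struct toy_fs_type *type, const char *dev_name,
                  struct toy_super_block *sb);
    struct toy_super_block *fs_supers;          /* all superblocks of this type */
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;    /* registered types */

void toy_register_filesystem(struct toy_fs_type *type)
{
    assert(type->get_sb != NULL);
    type->next = toy_file_systems;
    toy_file_systems = type;
}

/* Generic helper in the role of get_sb_bdev(): fill, then link. */
int toy_get_sb_bdev(struct toy_fs_type *type, const char *dev_name,
                    struct toy_super_block *sb,
                    int (*fill_super)(struct toy_super_block *))
{
    sb->s_dev_name = dev_name;
    if (fill_super(sb) != 0)
        return -1;
    sb->s_next_instance = type->fs_supers;
    type->fs_supers = sb;
    return 0;
}

/* Filesystem-specific part: install this filesystem's methods. */
static int toy_ext2_fill_super(struct toy_super_block *sb)
{
    sb->s_op_name = "toy_ext2_sops";
    return 0;
}

static int toy_ext2_get_sb(struct toy_fs_type *type, const char *dev_name,
                           struct toy_super_block *sb)
{
    return toy_get_sb_bdev(type, dev_name, sb, toy_ext2_fill_super);
}

struct toy_fs_type toy_ext2_fs_type = {
    .name = "ext2",
    .get_sb = toy_ext2_get_sb,
};

/* Role of the mount path: find the type by name and call its get_sb. */
int toy_mount(const char *fstype, const char *dev_name, struct toy_super_block *sb)
{
    for (struct toy_fs_type *t = toy_file_systems; t != NULL; t = t->next)
        if (strcmp(t->name, fstype) == 0)
            return t->get_sb(t, dev_name, sb);
    return -1;      /* unknown filesystem type */
}
```

The generic layer never names ext2 directly: it only follows the pointers that the filesystem registered, which is what lets many filesystems share one mount path.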


To show how the inode cache serves a specific filesystem, we now look at the role the inode plays when a regular file on an ext2 filesystem is opened. Recall the example from the first post, "Linux kernel entry (I) - architecture":

fd = open("file", O_RDONLY);
close(fd);


The open() system call is implemented by the sys_open() function in fs/open.c, and the real work is done by the do_filp_open() function in the same file. The implementation of do_filp_open() relies on a data structure called nameidata.


This data structure is temporary. We are mainly interested in its dentry and mnt fields. The dentry structure, that is, the dentry object, has been described earlier. The vfsmount structure records the mount information of a filesystem, such as its mount point and its root node; we will discuss it in detail in later posts.


The code of do_filp_open() will be analyzed in detail in the post "Implementation of VFS system calls". Here we discuss only the two main functions it calls:


(1) open_namei(): fills in the dentry structure of the directory containing the target file and the vfsmount structure of its filesystem, and saves this information in the nameidata structure. In the dentry structure, dentry->d_inode points to the inode of the target file. This function is large and complex, and it will be covered in detail in a later post.


(2) dentry_open(): creates the "context" of the target file, that is, the file data structure, and hooks it to the task_struct structure of the current process. This function also calls the open method of the specific filesystem, that is, f_op->open(), and returns a pointer to the new file structure. To stay focused, we do not analyze this function in detail here; it will be discussed in a later post.


We saw earlier that ext2_fill_super(), while initializing the superblock, calls iget() to obtain the inode whose number is EXT2_ROOT_INO (normally 2) and uses it for the root directory of the ext2 partition:
static inline struct inode *iget(struct super_block *sb, unsigned long ino)
{
	struct inode *inode = iget_locked(sb, ino);

	if (inode && (inode->i_state & I_NEW)) {
		sb->s_op->read_inode(inode);
		unlock_new_inode(inode);
	}

	return inode;
}
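The point of the I_NEW check is that the on-disk inode is read only on a cache miss. The following user-space sketch models that behavior under simplifying assumptions: all toy_ names are invented, the cache is a small fixed array without eviction, and the superblock is left out of the lookup key.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_I_NEW 1u            /* inode was just allocated, not yet filled in */
#define TOY_CACHE_SLOTS 8

struct toy_inode {
    unsigned long i_ino;
    unsigned int i_state;
    int in_use;                 /* slot holds a cached inode */
};

static struct toy_inode toy_cache[TOY_CACHE_SLOTS];
static int toy_disk_reads;      /* how often the "disk" was consulted */

/* Look the inode up; on a miss, claim a free slot and mark it new. */
static struct toy_inode *toy_iget_locked(unsigned long ino)
{
    struct toy_inode *free_slot = NULL;

    for (int i = 0; i < TOY_CACHE_SLOTS; i++) {
        if (toy_cache[i].in_use && toy_cache[i].i_ino == ino)
            return &toy_cache[i];               /* cache hit */
        if (!toy_cache[i].in_use && free_slot == NULL)
            free_slot = &toy_cache[i];
    }
    if (free_slot == NULL)
        return NULL;                            /* no eviction in this sketch */
    free_slot->in_use = 1;
    free_slot->i_ino = ino;
    free_slot->i_state = TOY_I_NEW;
    return free_slot;
}

/* Stands in for s_op->read_inode(): count the expensive read. */
static void toy_read_inode(struct toy_inode *inode)
{
    assert(inode->i_state & TOY_I_NEW);
    toy_disk_reads++;
}

struct toy_inode *toy_iget(unsigned long ino)
{
    struct toy_inode *inode = toy_iget_locked(ino);

    if (inode && (inode->i_state & TOY_I_NEW)) {
        toy_read_inode(inode);
        inode->i_state &= ~TOY_I_NEW;           /* role of unlock_new_inode() */
    }
    return inode;
}
```

Asking for inode 2 twice returns the same object and increments the read counter only once; a different inode number triggers a second read.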


ext2_read_inode() is the function that implements s_op->read_inode for the ext2 superblock. It calls ext2_get_inode() to read the on-disk inode structure ext2_inode from the page cache and then initializes the VFS inode. The most important initialization code is extracted below:
	if (S_ISREG(inode->i_mode)) {		/* regular file operations */
		inode->i_op = &ext2_file_inode_operations;
		if (ext2_use_xip(inode->i_sb)) {
			inode->i_mapping->a_ops = &ext2_aops_xip;
			inode->i_fop = &ext2_xip_file_operations;
		} else if (test_opt(inode->i_sb, NOBH)) {	/* nobh: no buffer heads attached to pages */
			inode->i_mapping->a_ops = &ext2_nobh_aops;
			inode->i_fop = &ext2_file_operations;
		} else {
			inode->i_mapping->a_ops = &ext2_aops;
			inode->i_fop = &ext2_file_operations;
		}
	} else if (S_ISDIR(inode->i_mode)) {	/* directory operations */
		inode->i_op = &ext2_dir_inode_operations;
		inode->i_fop = &ext2_dir_operations;
		if (test_opt(inode->i_sb, NOBH))
			inode->i_mapping->a_ops = &ext2_nobh_aops;
		else
			inode->i_mapping->a_ops = &ext2_aops;
	} else if (S_ISLNK(inode->i_mode)) {	/* symbolic link operations */
		if (ext2_inode_is_fast_symlink(inode))
			inode->i_op = &ext2_fast_symlink_inode_operations;
		else {
			inode->i_op = &ext2_symlink_inode_operations;
			if (test_opt(inode->i_sb, NOBH))
				inode->i_mapping->a_ops = &ext2_nobh_aops;
			else
				inode->i_mapping->a_ops = &ext2_aops;
		}
	} else {				/* other special file operations */
		inode->i_op = &ext2_special_inode_operations;
		if (raw_inode->i_block[0])
			init_special_inode(inode, inode->i_mode,
			   old_decode_dev(le32_to_cpu(raw_inode->i_block[0])));
		else
			init_special_inode(inode, inode->i_mode,
			   new_decode_dev(le32_to_cpu(raw_inode->i_block[1])));
	}
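The effect of choosing a method table by file type, and of the file object later taking over the inode's table when the file is opened, can be sketched in user space. This is a toy model, not ext2 code: all toy_ names are invented, only regular files and directories are distinguished, and the two read methods are placeholders.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

struct toy_file;

struct toy_file_operations {
    ssize_t (*read)(struct toy_file *file, char *buf, size_t len);
};

struct toy_inode {
    mode_t i_mode;
    const struct toy_file_operations *i_fop;    /* default file methods */
};

struct toy_file {
    struct toy_inode *f_inode;
    const struct toy_file_operations *f_op;     /* methods used by system calls */
};

/* Placeholder read for regular files: always yields "hello". */
static ssize_t toy_regular_read(struct toy_file *file, char *buf, size_t len)
{
    static const char data[] = "hello";
    size_t n = len < sizeof(data) - 1 ? len : sizeof(data) - 1;

    (void)file;
    memcpy(buf, data, n);
    return (ssize_t)n;
}

/* Placeholder read for directories: reading them as files is refused. */
static ssize_t toy_dir_read(struct toy_file *file, char *buf, size_t len)
{
    (void)file;
    (void)buf;
    (void)len;
    return -EISDIR;
}

static const struct toy_file_operations toy_file_ops = { .read = toy_regular_read };
static const struct toy_file_operations toy_dir_ops = { .read = toy_dir_read };

/* Role of the S_ISREG/S_ISDIR branches: pick the table by file type. */
void toy_read_inode(struct toy_inode *inode, mode_t mode)
{
    inode->i_mode = mode;
    inode->i_fop = S_ISDIR(mode) ? &toy_dir_ops : &toy_file_ops;
}

/* Role of dentry_open(): the file object takes over the inode's table. */
void toy_open(struct toy_file *file, struct toy_inode *inode)
{
    assert(inode->i_fop != NULL);
    file->f_inode = inode;
    file->f_op = inode->i_fop;
}
```

After toy_open(), a caller invokes file->f_op->read() without knowing the file type; the decision was made once, when the inode was initialized.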


We consider only the most common case, that is, the operations for regular files when neither the nobh option nor XIP is in use:

(1) Inode operations for regular files:
struct inode_operations ext2_file_inode_operations = {
	.truncate	= ext2_truncate,
#ifdef CONFIG_EXT2_FS_XATTR
	.setxattr	= generic_setxattr,
	.getxattr	= generic_getxattr,
	.listxattr	= ext2_listxattr,
	.removexattr	= generic_removexattr,
#endif
	.setattr	= ext2_setattr,
	.permission	= ext2_permission,
	.fiemap		= ext2_fiemap,
};

(2) File operations for regular files:
const struct file_operations ext2_file_operations = {
	.llseek		= generic_file_llseek,
	.read		= generic_file_read,
	.write		= generic_file_write,
	.aio_read	= generic_file_aio_read,
	.aio_write	= generic_file_aio_write,
	.ioctl		= ext2_ioctl,
	.mmap		= generic_file_mmap,
	.open		= generic_file_open,
	.release	= ext2_release_file,
	.fsync		= ext2_sync_file,
	.readv		= generic_file_readv,
	.writev		= generic_file_writev,
	.sendfile	= generic_file_sendfile,
	.splice_read	= generic_file_splice_read,
	.splice_write	= generic_file_splice_write,
};


(3) Page cache operations for regular files:
const struct address_space_operations ext2_aops = {
	.readpage	= ext2_readpage,
	.readpages	= ext2_readpages,
	.writepage	= ext2_writepage,
	.sync_page	= block_sync_page,
	.prepare_write	= ext2_prepare_write,
	.commit_write	= generic_commit_write,
	.bmap		= ext2_bmap,
	.direct_IO	= ext2_direct_IO,
	.writepages	= ext2_writepages,
	.migratepage	= buffer_migrate_page,
};


Similarly, the open_namei() function uses path_lookup() to consult the dentry cache and obtain the parent dentry. If a component is not cached, path_lookup() calls the inode_operations->lookup() method of the parent inode, which for ext2 is ext2_lookup(). This method finds the directory entry on disk, then uses iget(sb, ino) to read the corresponding inode from disk by its inode number and build the matching inode structure in memory. This is where the inode cache discussed above comes into play. path_lookup() is one of the most important functions in the VFS; we will discuss it in detail in the "pathname lookup" post.


If O_CREAT is set in the access mode flags, the lookup starts with the LOOKUP_PARENT, LOOKUP_OPEN, and LOOKUP_CREATE flags set. Once path_lookup() returns successfully, the kernel checks whether the requested file already exists. If it does not, the kernel calls the create method of the parent inode, that is, ext2_create(), which allocates a new on-disk inode.


After the inode has been read into memory, d_add(dentry, inode) is called to establish the link between the dentry structure and the inode structure. The relationship between the two structures is bidirectional. In one direction, the d_inode pointer of the dentry structure points to the inode structure; this is a one-to-one relationship, because a directory entry corresponds to exactly one file. In the other direction, the same file may have several different filenames or paths (created by the link() system call; note the difference from symbolic links, which are created by the symlink() system call), so the relationship from the inode structure to the dentry structure is one-to-many. For this reason, the i_dentry field of the inode is the head of a list, and each dentry structure is attached to the i_dentry list of its inode through its d_alias field.
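The one-to-one and one-to-many directions can be made concrete with a small model. This is only a sketch: the toy_ names are invented, and a singly linked chain stands in for the kernel's list_head based alias list.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dentry;

struct toy_inode {
    unsigned long i_ino;
    struct toy_dentry *i_dentry;    /* head of the alias chain */
};

struct toy_dentry {
    const char *d_name;
    struct toy_inode *d_inode;      /* exactly one inode per dentry */
    struct toy_dentry *d_alias;     /* next dentry naming the same inode */
};

/* Bind a dentry to an inode and hook it into the inode's alias chain. */
void toy_d_add(struct toy_dentry *dentry, struct toy_inode *inode)
{
    assert(dentry->d_inode == NULL);    /* a dentry is bound only once */
    dentry->d_inode = inode;
    dentry->d_alias = inode->i_dentry;
    inode->i_dentry = dentry;
}

/* Count how many names (hard links known to the cache) refer to the inode. */
unsigned int toy_count_aliases(const struct toy_inode *inode)
{
    unsigned int n = 0;

    for (const struct toy_dentry *d = inode->i_dentry; d != NULL; d = d->d_alias)
        n++;
    return n;
}
```

Binding two dentries named "a" and "b" to the same inode leaves each with a single d_inode pointer, while the inode's chain reports two aliases.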


When dentry_open() returns, the open() system call is essentially complete. At this point, almost all the VFS objects have been linked together into a small team.


Finally, one overview figure shows how the VFS objects are connected once everything is linked together, and with it this article ends.


