Analysis of Linux virtual file system
Virtual File System (VFS)
In my opinion, the word "virtual" has two meanings:
1. Many different file systems can be mounted in the same directory structure. VFS hides their implementation details and gives the user a unified interface;
2. The directory structure itself is not absolute: each process may see a different one. The directory structure is described by a namespace. Different processes may have different namespaces, and different namespaces may have different directory structures (because they may have different file systems mounted).
Operating on Open Files
The user of the VFS is a process (anyone who accesses the file system has to do so through a process). The files pointer in the task_struct structure that describes a process points to a files_struct structure, which describes the set of files the process has opened.
The files_struct structure maintains an array of pointers to the file structures of the open files. The array index is the handle that the user program uses to operate on an open file (usually called the fd). files_struct also maintains a bitmap of fds in use, so that an unused fd can be allocated when a file is opened.
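The fd array and bitmap can be pictured with a small Python model. This is only a sketch: the class name FilesStruct, its fixed table size, and the method names are made up for illustration and do not mirror the kernel's actual code.

```python
class FilesStruct:
    """Toy model of files_struct: an fd-indexed array plus a bitmap of used slots."""

    def __init__(self, size=8):
        self.fd_array = [None] * size   # fd -> open file object (stands in for struct file)
        self.open_fds = 0               # bit i set means fd i is in use

    def alloc_fd(self, file_obj):
        # Scan the bitmap for the lowest clear bit, as POSIX requires of open().
        for fd in range(len(self.fd_array)):
            if not (self.open_fds >> fd) & 1:
                self.open_fds |= 1 << fd
                self.fd_array[fd] = file_obj
                return fd
        raise OSError("too many open files")

    def close_fd(self, fd):
        if not (self.open_fds >> fd) & 1:
            raise OSError("bad file descriptor")
        self.open_fds &= ~(1 << fd)
        self.fd_array[fd] = None


files = FilesStruct()
fds = [files.alloc_fd(name) for name in ("stdin", "stdout", "stderr", "a.txt")]
files.close_fd(1)
reused = files.alloc_fd("b.txt")  # the freed slot is handed out again
```

In this model, closing an fd and opening another file reuses the freed slot, which is why a program that closes fd 1 and then opens a file ends up with that file as its standard output.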
The file structure is an instance of an open file. Operating on an open file through its fd is simple: the fd indexes the corresponding file structure, and then the corresponding operation in that structure's f_op (for example, read or write) is executed.
Different file structures may have different f_op tables, because they correspond to different types of files (for example, regular files, sockets, FIFOs, and so on).
The appropriate f_op is assigned when the file is opened. To operate on an open file, the kernel just calls the functions in f_op and does not need to determine the exact type of the file. How the functions in a particular f_op are implemented is not covered in this article (that part is also quite complex; see "Linux Kernel File Read/Write Analysis").
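The dispatch through f_op can be sketched as follows. The operation tables here are plain dictionaries and the two read functions are invented stand-ins (one behaves like a regular file, one like a /dev/zero-style device); the point is only that the caller follows f_op without checking the file type.

```python
class File:
    """Toy struct file: a position plus a table of operations chosen at open time."""

    def __init__(self, f_op, data=b""):
        self.f_op = f_op
        self.f_pos = 0
        self.data = data


def regular_read(f, count):
    chunk = f.data[f.f_pos:f.f_pos + count]
    f.f_pos += len(chunk)   # regular files advance the position
    return chunk


def zero_read(f, count):
    return b"\0" * count    # a /dev/zero-like device ignores the position


REGULAR_FOPS = {"read": regular_read}
ZERO_FOPS = {"read": zero_read}


def vfs_read(f, count):
    # The caller never inspects the file type; it only follows f_op.
    return f.f_op["read"](f, count)


regular = File(REGULAR_FOPS, b"hello world")
zero = File(ZERO_FOPS)
```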
Not every operation a user program performs on an open file has to do real work in f_op; some operations involve little more than the file structure itself. For example, the file structure maintains the current position in the file (f_pos), and for most files the lseek system call does little more than update that value.
Like f_pos, attributes such as f_mode (the file access mode) are stored in the file structure, which means they belong to one instance of an open file. A file may be opened as several instances (in one process or in several), and these values may differ from instance to instance.
For example, two processes open the same file at the same time for reading. Because each of the two instances (file structures) has its own f_pos, the two sequences of reads do not affect each other.
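The per-instance position is easy to observe from user space. The sketch below opens one scratch file twice within a single process (rather than from two processes, to keep it short) using Python's os module; the file contents are arbitrary.

```python
import os
import tempfile

# Create a small scratch file to open twice.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abcdefgh")
    path = tmp.name

fd1 = os.open(path, os.O_RDONLY)
fd2 = os.open(path, os.O_RDONLY)   # second open() -> second open-file instance

first = os.read(fd1, 4)            # advances only fd1's position
second = os.read(fd2, 2)           # fd2 still starts at offset 0

# lseek with offset 0 from SEEK_CUR just reports the stored position.
pos1 = os.lseek(fd1, 0, os.SEEK_CUR)
pos2 = os.lseek(fd2, 0, os.SEEK_CUR)

os.close(fd1)
os.close(fd2)
os.unlink(path)
```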
Sometimes several processes share the same open file instances. When a child process is created with the clone system call and the CLONE_FILES flag is set, parent and child share one files_struct structure and therefore all open file instances. The typical example is multithreading.
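The multithreading case can be checked with a short experiment. It assumes Linux and CPython, where threads are ordinary pthreads and therefore share the fd table: an fd opened in one thread is usable in another, and both see the same file position.

```python
import os
import tempfile
import threading

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"shared")
    path = tmp.name

opened = {}


def worker():
    # The fd is opened in another thread of the same process.
    opened["fd"] = os.open(path, os.O_RDONLY)
    opened["head"] = os.read(opened["fd"], 3)


t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread uses the same fd number and continues from the shared position.
tail = os.read(opened["fd"], 3)
os.close(opened["fd"])
os.unlink(path)
```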
Opening a File
Compared with operating on an open file, opening a file is complex. As the diagram above shows, operating on open files takes up only a small part of the picture; everything else is related to opening files.
To open a file, you first need a file path, such as "dir0/dir1/file". The path is split into levels by '/', and each level is a file (directories such as dir0 and dir1 are files too).
The lookup of a path needs a starting point. If the path begins with '/', the root directory is the starting point; otherwise, the current working directory is.
Both possible starting points are stored in the fs_struct structure referenced by the process's task_struct. Each file is represented in the directory structure by a directory entry (dentry) structure, and each "starting point" is itself a dentry.
When we run the cd command in a shell, what actually changes is the dentry in the shell's fs_struct that represents the current working directory.
A process can also change the dentry in fs_struct that represents the root directory with the chroot system call. After that, paths above this dentry are no longer visible to the process.
Serving as the index of files, the dentries form a tree, which is the directory structure the user sees. (We will call it the dentry tree for now.)
Each dentry points to an index node (inode) structure, which is the structure that actually describes the file. Several dentries can point to the same inode, which is what makes hard links possible.
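The many-names-to-one-inode relationship is visible from user space through the inode number reported by stat. The following sketch assumes a file system that supports hard links (the usual case for a Linux temporary directory); the file names are arbitrary.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "original")
alias = os.path.join(workdir, "alias")

with open(original, "wb") as f:
    f.write(b"data")

os.link(original, alias)   # second directory entry for the same inode

st_original = os.stat(original)
st_alias = os.stat(alias)

os.unlink(original)        # removing one name leaves the inode reachable
with open(alias, "rb") as f:
    survived = f.read()

os.unlink(alias)
os.rmdir(workdir)
```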
A dentry carries a set of methods (d_op), used mainly for matching child nodes. Dentries are kept in a hash table to speed up the lookup of child nodes.
d_op may vary with the file system type. For example, the hash method may differ, and so may the method for comparing names (file names are case sensitive in some file systems and not in others).
Looking up a path means repeatedly finding the child dentry in this dentry tree until the dentry for the last component of the path is found.
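The walk can be sketched with a toy dentry tree. Everything here is a simplification for illustration: the Dentry class uses a Python dict in place of the hash table, path_walk is a made-up name, and mount points, symbolic links, and permissions are ignored.

```python
class Dentry:
    """Toy dentry: a name, a parent, and a dict standing in for the child hash table."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent if parent is not None else self  # the root is its own parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self


def path_walk(path, root, pwd):
    # Absolute paths start at the root dentry, relative ones at the cwd dentry.
    current = root if path.startswith("/") else pwd
    for component in path.split("/"):
        if component in ("", "."):
            continue
        if component == "..":
            # Never climb above the process root (the chroot effect).
            if current is not root:
                current = current.parent
            continue
        current = current.children[component]   # KeyError models "no such file"
    return current


root = Dentry("/")
dir0 = Dentry("dir0", root)
dir1 = Dentry("dir1", dir0)
target = Dentry("file", dir1)
```

Passing dir0 as the root argument imitates a process that has called chroot: "/dir1/file" then resolves inside dir0, and ".." cannot climb out of it.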
Although the dentry tree depicts the directory structure of the file system, the dentry structures are not all resident in memory. The entire directory structure can be so large that it would not fit in memory at all.
Initially, the system holds only the dentry of the root directory and the inode it points to (these are created when the root file system is mounted; see below). If a file is opened at this point, the nodes along its path do not exist yet, and the root dentry cannot find the child it needs (it has no children at all). The lookup method in inode->i_op is then used to find the required child of that inode (usually by reading the storage medium in the way defined by the specific file system type; see "Linux File System Implementation Analysis"). Once the child inode has been found (and loaded into memory), a dentry is created and associated with it.
This process shows that the inode comes first and the dentry second. The inode itself lives on the file system's storage medium, while the dentry is generated in memory. Dentries exist to speed up inode lookups.
Since the whole directory structure is not necessarily loaded into memory, dentries created in memory are released when nobody is using them. The d_count field holds the dentry's reference count; when it drops to 0, the dentry is released.
Releasing a dentry does not destroy and reclaim it immediately. Instead, the dentry is put on a least recently used (LRU) queue (associated with the corresponding super block). The least recently used dentries are actually freed only when the queue grows too large or system memory runs short.
The LRU queue acts as a cache pool that speeds up repeated access to the same paths. When a dentry is actually freed, the reference count of its inode is decremented by one. If that count reaches 0, the inode is freed as well.
During a path lookup, each node along the way falls into one of three cases:
1. The reference count of the corresponding dentry has not dropped to 0, so the dentry is still in the dentry tree and is used directly;
2. If the corresponding dentry is not in the dentry tree, try to find it in the LRU queue. Dentries in the LRU queue are also kept in a hash table for lookup. Once found, the dentry is removed from the LRU queue and added back to the dentry tree;
3. If the corresponding dentry cannot be found in the LRU queue either, the inode has to be looked up on the file system's storage medium. Once it is found, a dentry is created and added to the dentry tree.
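The three cases can be sketched with a toy cache. This is an illustration under simplifying assumptions, not the kernel's dcache: entries are keyed by full path instead of (parent, name), the "disk" is a dict, and the class name DentryCache and the lru_limit parameter are invented.

```python
from collections import OrderedDict


class DentryCache:
    """Toy dcache keyed by path: in-use entries, an LRU of unused ones, and a 'disk'."""

    def __init__(self, disk, lru_limit=2):
        self.disk = disk              # path -> inode number, stands in for the medium
        self.in_use = {}              # path -> [inode, refcount], refcount > 0
        self.lru = OrderedDict()      # path -> inode, refcount == 0, oldest first
        self.lru_limit = lru_limit

    def get(self, path):
        if path in self.in_use:                   # case 1: still referenced
            self.in_use[path][1] += 1
            return "in_use"
        if path in self.lru:                      # case 2: revive from the LRU
            self.in_use[path] = [self.lru.pop(path), 1]
            return "lru"
        inode = self.disk[path]                   # case 3: ask the file system
        self.in_use[path] = [inode, 1]
        return "disk"

    def put(self, path):
        entry = self.in_use[path]
        entry[1] -= 1
        if entry[1] == 0:                         # unused: park it, do not destroy it
            del self.in_use[path]
            self.lru[path] = entry[0]
            while len(self.lru) > self.lru_limit:
                self.lru.popitem(last=False)      # really free the oldest entry


cache = DentryCache({"/a": 1, "/b": 2, "/c": 3})
trace = [cache.get("/a"), cache.get("/a")]       # disk, then in_use
cache.put("/a")
cache.put("/a")                                  # refcount hits 0 -> LRU
trace.append(cache.get("/a"))                    # lru
```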
File System Mount
VFS allows many different file systems to be mounted in the same directory structure. The path at which a file system is mounted is called a mount point.
For example, a disk has two partitions, A and B. A is mounted at "/" as the root file system, and B is mounted at "/mnt/b/" as a child file system.
For this mount to succeed, the mount point directory "/mnt/b" must already exist in file system A. Once B is mounted, the dentry of "/mnt/b" no longer leads to the contents of that directory in A (whatever they are); the d_mounted flag in this dentry is set to indicate that it is a mount point.
If such a mount point is encountered during a path lookup, the pointer representing the current position is switched from the current dentry to the dentry of "/" in the mounted file system. That is, accessing the "/mnt/b" path in partition A actually accesses the "/" path in partition B.
A mounted file system is described by a vfsmount structure, and the vfsmounts of multiple mounted file systems are also organized into a tree.
The vfsmount structure has two pointers to dentries: mnt_mountpoint points to the mount point dentry in the parent file system (for example, "/mnt/b" in partition A), and mnt_root points to the dentry of this file system's root path (for example, "/" in partition B). These two pointers make the switch of current position described above possible.
Therefore, a path lookup has to record the current vfsmount as well as the current dentry. If the current dentry is a mount point, the lookup uses the current vfsmount to locate the child vfsmount whose mnt_mountpoint is the current dentry, and then takes that child's mnt_root.
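The switch at a mount point can be sketched as below. The field names d_mounted, mnt_root, mnt_parent, and mnt_mountpoint follow the text; the mount_table dictionary and the do_mount and follow_mount helpers are simplified stand-ins written for this example.

```python
class Dentry:
    def __init__(self, name):
        self.name = name
        self.d_mounted = 0        # non-zero when something is mounted here


class VfsMount:
    def __init__(self, mnt_root, mnt_parent=None, mnt_mountpoint=None):
        self.mnt_root = mnt_root
        self.mnt_parent = mnt_parent
        self.mnt_mountpoint = mnt_mountpoint


mount_table = {}                  # (parent mount, mountpoint dentry) -> child mount


def do_mount(parent, mountpoint, fs_root):
    child = VfsMount(fs_root, parent, mountpoint)
    mount_table[(parent, mountpoint)] = child
    mountpoint.d_mounted += 1
    return child


def follow_mount(mnt, dentry):
    # Keep switching while the current dentry is itself a mount point.
    while dentry.d_mounted:
        mnt = mount_table[(mnt, dentry)]
        dentry = mnt.mnt_root
    return mnt, dentry


a_root, a_mnt_b = Dentry("/ on A"), Dentry("/mnt/b on A")
b_root = Dentry("/ on B")
mnt_a = VfsMount(a_root)
mnt_b = do_mount(mnt_a, a_mnt_b, b_root)
```

The lookup key is the pair (current vfsmount, current dentry), which is why the path walk has to carry both along.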
Several vfsmounts may be mounted on the same dentry. In that case only one of them is selected and the others are hidden. A hidden vfsmount can be selected again only after the selected one is unmounted. This feature can be used to hide a directory. For example, if /home/kouu/secret holds files you do not want others to see, you can mount a tmpfs on this directory to hide them.
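The hiding effect can be pictured as a stack in which only the top is visible. The model below is deliberately crude: the MountPoint class is invented, and directory contents are plain sets of made-up file names.

```python
class MountPoint:
    """Toy mount point: a stack of mounts where only the newest one is visible."""

    def __init__(self, underlying):
        self.underlying = underlying   # contents of the directory itself
        self.stack = []

    def mount(self, contents):
        self.stack.append(contents)

    def umount(self):
        return self.stack.pop()

    def listdir(self):
        return sorted(self.stack[-1] if self.stack else self.underlying)


secret = MountPoint({"diary.txt", "keys"})
before = secret.listdir()
secret.mount(set())          # an empty tmpfs-like file system on top
hidden = secret.listdir()
secret.umount()
after = secret.listdir()
```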
A child file system is always mounted on a dentry of its parent file system, while the root file system is referenced by an mnt_namespace object. Different mnt_namespace objects can refer to different root file systems, organize different mount trees, and thus form different directory structures.
Normally, a newly created process shares its parent's mnt_namespace. Since all processes are descendants of process 1 (init), in general all processes use the same mnt_namespace and live in the same directory structure.
However, when a new process is created with the clone system call, the CLONE_NEWNS flag can be specified to create a new mount namespace for the child process (mnt_namespace is only one of the kinds of namespace a process has).
So far we have only talked about mounting a device. In fact, besides the device file of the storage medium, mounting a file system also needs a file system type registered in the kernel (a file_system_type structure, for types such as ext2, ext3, and tmpfs). A file system always has these two attributes: a device and a type.
Registered file_system_type structures are kept in a linked list and are found by their registered names (such as ext3). They act as interpreters of file data: they interpret the data on the physical storage medium behind the device file.
Each file system has a super block (the super_block structure), which is read from the block device by the get_sb method of the file_system_type structure.
A file system can be mounted several times, producing several vfsmount structures that all correspond to the same super_block. In fact, the super_block is read only when the file system is mounted for the first time. On later mounts it already exists and is simply referenced.
During get_sb, the inode for the root path of the file system is also loaded from the storage medium and the corresponding dentry is created. super_block->s_root points to the dentry of the root path.
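The relationship between registered types, super blocks, and mounts can be sketched like this. It is a toy model: the registry is a dict rather than a linked list, the super block and vfsmount are plain dicts, and the device name /dev/sda1 is just an example.

```python
class FileSystemType:
    def __init__(self, name):
        self.name = name
        self.super_blocks = {}      # device -> super block already read
        self.reads = 0              # how many times the "device" was read

    def get_sb(self, device):
        # Only the first mount of a device reads the super block.
        if device not in self.super_blocks:
            self.reads += 1
            self.super_blocks[device] = {
                "dev": device,
                "type": self.name,
                "s_root": "dentry of / on " + device,
            }
        return self.super_blocks[device]


file_systems = {}                   # registered types, looked up by name


def register_filesystem(fs_type):
    if fs_type.name in file_systems:
        raise ValueError("already registered")
    file_systems[fs_type.name] = fs_type


def mount(device, type_name):
    sb = file_systems[type_name].get_sb(device)
    return {"mnt_sb": sb, "mnt_root": sb["s_root"]}   # a vfsmount-like record


register_filesystem(FileSystemType("ext3"))
m1 = mount("/dev/sda1", "ext3")
m2 = mount("/dev/sda1", "ext3")     # second mount of the same device
```

Two mounts yield two vfsmount-like records, but both point at the single super block that was read on the first mount.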
Data Structure Summary
Finally, let us go over the data structures above and their function pointers, since it is really easy to lose one's bearings among them.
file_system_type
Meaning: A file system type, such as ext2 or ext3
Created: A file_system_type structure is created for each file system type at kernel boot or when a kernel module is loaded
Functions: get_sb, the method for obtaining the super block. Provided when the file system type is registered
super_block
Meaning: The super block, corresponding to a device that stores files
Created: When a file system is mounted, it is read from the device through the corresponding file_system_type->get_sb and initialized (so part of the information in the super_block structure comes from the device and part is initialized in memory)
Functions: s_op, the super block's set of functions, mainly operations on inodes and on the file system instance. After file_system_type->get_sb has read the super block from the device, it initializes s_op with the function set specific to that file_system_type
inode
Meaning: An index node, corresponding to a file stored on the device
Created: 1) when the super block is loaded, the root inode is loaded; 2) a new inode is created through calls such as mknod; 3) during a path lookup, read from the device and initialized (as with super_block, part of the information in the inode structure comes from the device and part is initialized in memory)
Functions: i_op, the set of inode functions, mainly creating and deleting child inodes and similar operations. f_op, the set of file functions, mainly read and write operations on this inode. After the inode is created: 1) if it is a special file (a block device, character device, FIFO, and so on), it is given a specific function set based on the file's type (not directly related to the device or file system type); 2) otherwise, the file system type provides the function set, and the sets for directories and regular files are likely to differ
dentry
Meaning: A directory entry, a node of the tree used during path lookup, associated with an inode
Created: After the inode is created, the dentry is created and initialized
Functions: d_op, the set of directory entry functions, mainly lookup-related operations on child dentries. Determined by the file system type
file
Meaning: An instance of an open file
Created: Created by the open call; corresponds to an inode
Functions: f_op, file read, write, and other operations. 1) Equal to inode->f_op for regular files, block device files, and so on; 2) set by the inode->f_op->open function when the file is opened, the typical case being character devices. All character devices have the same inode->f_op; during inode->f_op->open, the f_op registered by the corresponding device driver is found and assigned to file->f_op
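The character device case, where open replaces file->f_op, can be sketched as follows. This is a toy model with dicts in place of structures: the chrdevs table, chrdev_open, and do_open are simplified stand-ins, and the two "drivers" merely imitate /dev/null and /dev/zero with device numbers chosen for the example.

```python
def null_read(f, count):
    return b""                       # a /dev/null-like driver: always end of file


def zero_read(f, count):
    return b"\0" * count             # a /dev/zero-like driver


# Driver-registered operation tables, keyed by an example device number.
chrdevs = {(1, 3): {"read": null_read}, (1, 5): {"read": zero_read}}


def chrdev_open(inode, f):
    # Generic open shared by all character devices: swap in the driver's table.
    f["f_op"] = chrdevs[inode["i_rdev"]]


DEF_CHR_FOPS = {"open": chrdev_open}


def do_open(inode):
    f = {"f_op": inode["f_op"], "f_pos": 0}   # start from inode->f_op
    open_fn = f["f_op"].get("open")
    if open_fn:
        open_fn(inode, f)                     # may replace f["f_op"]
    return f


null_inode = {"i_rdev": (1, 3), "f_op": DEF_CHR_FOPS}
zero_inode = {"i_rdev": (1, 5), "f_op": DEF_CHR_FOPS}
null_file, zero_file = do_open(null_inode), do_open(zero_inode)
```

Both inodes share one f_op table, yet the two open files end up with different tables, so later reads go straight to the right driver.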