UBIFS design overview: a brief introduction to the design of UBIFS

Source: Internet
Author: User

In my spare time I wanted to learn about UBIFS, so I started by translating the UBIFS design document. I hope to get the chance to analyze the UBIFS source code later.


Flash file systems need out-of-place updates. This is because flash must be erased before it can be written, and it can be written only once before it has to be erased again. If eraseblocks were small and quick to erase, they could be treated like disk sectors, but that is not the case. Reading an entire eraseblock, erasing it, and writing back the updated data can take 100 times longer than simply writing the updated data to a different eraseblock that has already been erased. In other words, for small updates, in-place updates are much slower than out-of-place updates, mainly because erasing flash takes so long.


Out-of-place updates require garbage collection. As data is updated out of place, the eraseblocks that held the original data come to contain a mixture of valid data and obsolete data. Eventually the file system runs out of empty eraseblocks, and every eraseblock contains some obsolete data. To write new data, one of those eraseblocks must be emptied so that it can be erased and reused. The process of picking an eraseblock that holds a lot of obsolete data and moving its valid data to another eraseblock is called garbage collection.


Garbage collection is the reason for using a node structure. To garbage collect an eraseblock, a file system must be able to identify the data stored in it. This is the opposite of the usual indexing problem in file systems. A file system normally starts with a file name and has to find the data of that file, whereas garbage collection starts with data and has to find the file it belongs to (or find that it belongs to no file, because it is obsolete). One way to solve this is to store the metadata together with the file data. This combination of metadata and data is called a node. Each node records which file owns it and what data it holds (for example, the file offset and the data length). Both JFFS2 and UBIFS use a node-structured design. This allows their garbage collectors to read an eraseblock directly and determine which data must be moved to another eraseblock and which data can be discarded.
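The following Python sketch illustrates why self-describing nodes make garbage collection possible. It is only a toy model: the `Node` fields and the `garbage_collect` helper are names invented for this example, and the `obsolete` flag stands in for the index lookup that a real file system would use to decide whether a node is still valid.

```python
from dataclasses import dataclass

@dataclass
class Node:
    inum: int       # inode number of the file that owns the node
    offset: int     # file offset at which the data belongs
    length: int     # number of data bytes held by the node
    obsolete: bool  # assumption: True once a newer copy exists elsewhere

def garbage_collect(eraseblocks):
    """Empty the eraseblock holding the most obsolete bytes.

    Returns the victim's number and the valid nodes that must be
    rewritten to some other eraseblock.
    """
    victim = max(
        range(len(eraseblocks)),
        key=lambda i: sum(n.length for n in eraseblocks[i] if n.obsolete),
    )
    moved = [n for n in eraseblocks[victim] if not n.obsolete]
    eraseblocks[victim] = []  # erased and ready for reuse
    return victim, moved
```

Because every node names its owner, the collector never has to start from a file name; it can work from the raw contents of the eraseblock alone.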


The biggest difference between JFFS2 and UBIFS is that UBIFS stores the index on flash, whereas JFFS2 keeps the index only in memory and rebuilds it when the file system is mounted. This limits the maximum practical size of a JFFS2 file system, because mount time and memory consumption grow in proportion to the size of the flash. UBIFS was designed to overcome this limitation.


Unfortunately, storing the index on flash is very complicated, because the index itself must be updated out of place. When one part of the index is updated out of place, every part of the index that references the updated part must be updated as well, and so on in turn. The way to handle this cascade of updates is to use a wandering tree.


The UBIFS wandering tree is a B+ tree. Only the leaves of the tree hold file information; they are the valid nodes of the file system. The internal elements of the tree are index nodes. The wandering tree can therefore be viewed as two parts: the index nodes, which form the structure of the tree, and the leaf nodes, which hold the actual file data. An update to the file system consists of creating a new leaf node and adding it to the tree, or replacing an existing leaf in the tree. To do that, the parent index node must also be replaced, then its parent, and so on up to the root of the tree. The number of index nodes to be replaced is equal to the height of the tree. This raises another question: how to find the root of the tree. In UBIFS, the position of the root index node is stored in the master node.
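A minimal Python sketch of the wandering-tree update may help here. It models flash as an append-only list, an index node as a list of child positions, and a leaf as a string; all of these representations, and the `write` and `update_leaf` helpers, are assumptions made for the illustration rather than anything taken from UBIFS itself.

```python
flash = []  # append-only stand-in for flash: nodes are never changed in place

def write(node):
    """Append a node to flash and return its position."""
    flash.append(node)
    return len(flash) - 1

def update_leaf(root_pos, path, new_leaf):
    """Replace the leaf reached by following the child slots in `path`.

    Returns (new_root_pos, number_of_index_nodes_written).
    """
    # Walk down from the root, remembering each index node on the way.
    ancestors = []
    pos = root_pos
    for slot in path:
        ancestors.append(flash[pos])
        pos = flash[pos][slot]
    # Write the new leaf, then rewrite every ancestor from the bottom up
    # so that each one points at the new position of its child.
    child_pos = write(new_leaf)
    written = 0
    for node, slot in zip(reversed(ancestors), reversed(path)):
        copy = list(node)
        copy[slot] = child_pos
        child_pos = write(copy)
        written += 1
    return child_pos, written
```

The old root and old index nodes stay on flash untouched, which is exactly why something outside the tree has to record where the newest root lives.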



The master node records the positions of the on-flash structures whose logical positions are not fixed, so that those structures can be found through it. The master node itself is stored in logical eraseblocks (LEBs) 1 and 2. LEBs are an abstraction provided by UBI, which maps them to physical eraseblocks. In principle, therefore, LEB 1 and LEB 2 may be anywhere on the flash media (strictly speaking, on the UBI device). Two eraseblocks are used so that a backup copy of the master node is kept for recovery. There are two situations in which the master node may be damaged. First, power may be lost while the master node is being written. Second, the flash media itself may degrade or become corrupted. In the first case, recovery can use the previous version of the master node. In the second case, recovery cannot determine which master node version is valid; a user-space tool would be needed to analyze all the nodes on the media and fix or recreate the damaged or missing nodes. Having two copies of the master node makes it possible to tell which situation has occurred.


The first LEB is not LEB 1 but LEB 0. LEB 0 stores the superblock node, which holds the file system parameters that rarely or never change, for example the flash geometry (eraseblock size, number of eraseblocks, and so on). At present there is only one situation in which the superblock node is rewritten: an automatic resize. UBIFS currently has only a limited ability to grow, up to a maximum size specified when the file system is created. The resize mechanism is needed because the exact size of a flash partition varies with the number of bad blocks. Therefore, when mkfs.ubifs creates a file system, it records both the maximum number of eraseblocks and the number of eraseblocks actually used in the superblock node. When UBIFS is mounted on a partition, if the number of eraseblocks in the partition is greater than the used count recorded in the superblock node and smaller than the maximum number of eraseblocks, the file system is automatically resized to the actual partition size.
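The resize rule can be written down as a small Python sketch. The function and parameter names are hypothetical, and the sketch follows the wording above literally; what happens when the partition is at or beyond the maximum is not described here, so the sketch simply leaves the size unchanged in that case.

```python
def leb_count_after_mount(used_lebs, max_lebs, partition_lebs):
    """Return the eraseblock count the file system uses after mounting.

    used_lebs and max_lebs are the two values recorded in the superblock
    node; partition_lebs is the size of the partition being mounted.
    """
    if used_lebs < partition_lebs < max_lebs:
        return partition_lebs  # grow to fit the actual partition
    return used_lebs           # otherwise keep the recorded size
```

For example, an image created with 100 used eraseblocks and a maximum of 200 would grow to 150 when mounted on a 150-eraseblock partition.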


In fact, UBIFS has six areas whose positions are fixed when the file system is created:

1. superblock area: LEB 0

2. master node area: LEB 1 and LEB 2. Normally both eraseblocks hold the same data.

3. log area

4. LEB properties tree (LPT) area

5. orphan area

6. main area

The first two areas have already been described above. The superblock area is LEB 0, and the superblock node is always at offset 0. The superblock LEB is written with UBI's atomic LEB change function, which guarantees that the LEB is either updated successfully or left unchanged. The next area is the master node area, which occupies LEB 1 and LEB 2. In general, the two LEBs hold the same data. The master node is written to successive positions in each LEB until there is no free space left; at that point the LEB is unmapped and writing starts again from offset 0 (which causes UBI to map a clean, erased LEB). Note that the two master node LEBs are never unmapped at the same time, because that would leave the file system with no valid master node.

The log is part of the UBIFS journal. The purpose of the journal is to reduce how often the on-flash index is updated. Recall that the index is the upper part of the wandering tree, made up of index nodes, and that updating the file system means adding or replacing a leaf node in the wandering tree and updating all of its ancestor index nodes. Updating the on-flash index every time a leaf node is written would be very inefficient, because the same index nodes would be written over and over, especially near the top of the tree. Instead, UBIFS writes leaf nodes to the journal and does not immediately update the on-flash index. Note that the index in memory is updated immediately. Periodically, when the journal is considered full enough, it is committed. The commit process consists of writing out the new version of the index and the corresponding master node.


The existence of the journal means that when UBIFS is mounted, the on-flash index is out of date. To bring it up to date, the leaf nodes in the journal must be read and reindexed. This process is called replay. Note that the larger the journal, the longer replay takes and the longer the mount time. On the other hand, a larger journal needs to be committed less often, which makes the file system more efficient. The journal size is a parameter of mkfs.ubifs, so it can be tuned to suit the file system. By default, UBIFS does not use the fast unmount option and instead runs a commit before unmounting. This leaves the journal almost empty, so the next mount is very fast. The commit itself is quick, generally taking only a fraction of a second, so committing at unmount is a good trade-off.


Note that the commit process does not move the leaf nodes out of the journal. Instead, the journal itself moves, and the purpose of the log is to record where the journal is. The log contains two types of nodes: a commit-start node, which records that a commit has started, and reference nodes, which record the numbers of the main-area LEBs that make up the rest of the journal. Those LEBs are called buds, so the journal consists of the log and the buds. The log has a fixed size and can be regarded as a circular buffer. After a commit, the reference nodes that recorded the previous position of the journal are no longer needed, so the tail of the log is erased at the same rate that the head of the log is extended. While the commit-start node records the start of a commit, the end of a commit is defined to be the point at which the master node is written, because the master node points to the new position of the log tail.

If a commit does not complete because the file system was unmounted uncleanly (for example, after a power loss), the replay process replays both the old and the new journal.

The replay process is complicated by several situations.

First, the leaf nodes must be replayed in order. Because UBIFS uses a multi-headed journal, the order in which leaf nodes were written is not simply the order of the bud eraseblocks referenced in the log. To order the leaf nodes, every node contains a 64-bit sequence number. Replay first reads all the leaf nodes in the journal and inserts them into an RB-tree sorted by sequence number, and then processes the RB-tree in order to update the in-memory index.
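The ordering step can be sketched in a few lines of Python. This is an illustration only: it uses a plain sort instead of an RB-tree, and the dictionary-based nodes with `sqnum` and `key` fields are assumptions made for the example.

```python
def replay_order(buds):
    """Merge the leaf nodes of every bud into one list ordered by sequence number."""
    nodes = [node for bud in buds for node in bud]
    return sorted(nodes, key=lambda node: node["sqnum"])

def replay(index, buds):
    """Apply the nodes in write order so that the newest node for a key wins."""
    for node in replay_order(buds):
        index[node["key"]] = node["sqnum"]
    return index
```

With two journal heads writing to two buds, the nodes of one bud may be interleaved in time with the nodes of the other, so replaying bud by bud could leave an older node indexed in place of a newer one.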


The next complication is that replay must handle deletions and truncations. There are two kinds of deletion: (1) inode deletion, which corresponds to deleting files and directories, and (2) directory entry deletion, which corresponds to unlinking and renaming. In UBIFS, each inode has a corresponding inode node, which records the number of directory entry links, or link count. When an inode is deleted, an inode node with a link count of 0 is written to the journal. In case 1, instead of adding that leaf node to the index, replay removes it from the index along with all index entries for nodes with that inode number. In case 2, a directory entry node is written to the journal, but with the inode number in the directory entry set to 0. Note that a directory entry has two inode numbers associated with it: one is the inode number of the parent directory, and the other is the inode number of the file or directory that the entry refers to. It is the latter that is set to 0. When replay encounters a directory entry with a zero inode number, it removes that directory entry from the index instead of adding it.
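As a rough Python sketch of those two rules, assume the in-memory index is a dictionary whose keys are tuples that start with an inode number. The tuple layout and the node fields (`kind`, `nlink`, `parent`, `target`) are hypothetical and chosen only to make the example short.

```python
def replay_node(index, node):
    """Apply one journal node to the index, honouring deletion markers."""
    if node["kind"] == "inode":
        if node["nlink"] == 0:
            # Deleted inode: drop every index entry with this inode number.
            for key in [k for k in index if k[0] == node["inum"]]:
                del index[key]
        else:
            index[(node["inum"], "inode", None)] = node
    elif node["kind"] == "dentry":
        key = (node["parent"], "dentry", node["name"])
        if node["target"] == 0:
            index.pop(key, None)  # deleted directory entry: remove, do not add
        else:
            index[key] = node
```

Note that directory entries are keyed by the parent directory's inode number, so deleting a directory also removes any entries still indexed under it.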


Truncation changes the length of a file. In fact, truncation can lengthen a file as well as shorten it. In UBIFS, no special handling is needed for a truncation that lengthens a file. In file system terms, lengthening a file by truncation creates a hole, a part of the file that is defined to contain zeros. UBIFS does not index holes and does not store any nodes for them. When UBIFS looks up the index and finds no entry, it knows that the range is a hole. When a truncation shortens a file, all data nodes that fall beyond the new file length must be removed from the index. To make that happen, a truncation node recording the old and new file lengths is written to the journal. Replay processes these nodes by removing the corresponding index entries.
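Both behaviours can be shown in a short Python sketch. It assumes an index keyed by (inode number, block number) and a 4096-byte data block; the helper names are invented for the example, and the handling of a partially truncated final block is deliberately left out.

```python
BLOCK = 4096  # assumed size of the data covered by one data node

def read_block(index, inum, block):
    """Return a block's data, or zeros when there is no index entry (a hole)."""
    return index.get((inum, block), bytes(BLOCK))

def replay_truncation(index, inum, old_size, new_size):
    """Remove index entries for blocks lying wholly beyond the new length."""
    first_dead = -(-new_size // BLOCK)  # ceiling division
    end = -(-old_size // BLOCK)
    for block in range(first_dead, end):
        index.pop((inum, block), None)
```

A lengthening truncation needs no work at all in this model: the new range simply has no index entries, so reads of it return zeros.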


The next complication is that replay must bring the LEB properties tree (LPT) up to date. The LEB properties are three values recorded for every LEB in the main area: free space, dirty space, and whether the eraseblock is an index eraseblock or not. Do not confuse index nodes, non-index nodes, and index eraseblocks. An index eraseblock contains only index nodes, and a non-index eraseblock contains only non-index nodes. Free space is the number of unwritten bytes at the end of an eraseblock, which can still be filled with more nodes. Dirty space is the number of bytes taken up by obsolete nodes and padding, which can potentially be reclaimed by garbage collection. The LEB properties are used to find free space to add to the journal or the index, and to find the dirtiest eraseblocks to garbage collect. Every time a node is written, the free space of its eraseblock must be reduced. Every time a node becomes obsolete, or a padding node, truncation node, or deletion node is written, the dirty space of the eraseblock must be increased. When an eraseblock is allocated to the index, that must be recorded, so that an index eraseblock is not allocated to the journal even if it has free space; otherwise index nodes and non-index nodes would be mixed in the same eraseblock. The reason they must not be mixed has to do with budgeting, which is discussed later.
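The bookkeeping described above can be sketched in Python as follows. The `LebProps` class and the helper functions are hypothetical names; the sketch only mirrors the three properties and the update rules given in the text, not the real on-flash LPT format.

```python
from dataclasses import dataclass

@dataclass
class LebProps:
    free: int               # unwritten bytes at the end of the eraseblock
    dirty: int = 0          # bytes held by obsolete nodes and padding
    is_index: bool = False  # True if the eraseblock holds index nodes

def node_written(lp, size):
    """Writing a node consumes free space."""
    assert size <= lp.free
    lp.free -= size

def node_obsoleted(lp, size):
    """An obsolete node (or padding) adds to the dirty space."""
    lp.dirty += size

def pick_for_journal(props):
    """Return the non-index LEB with the most free space, or None."""
    candidates = [n for n, lp in props.items() if not lp.is_index and lp.free > 0]
    return max(candidates, key=lambda n: props[n].free, default=None)

def pick_for_gc(props):
    """Return the dirtiest LEB as the garbage collection candidate."""
    return max(props, key=lambda n: props[n].dirty)
```

Note how `pick_for_journal` skips index eraseblocks even when they have the most free space, which keeps index and non-index nodes apart.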


In general, the index subsystem takes care of informing the LEB properties subsystem of changes to the LEB properties. The complication for replay arises when a garbage-collected eraseblock is added to the journal. Like the index, the LPT area is updated only when a commit occurs. Like the index, the on-flash LPT is out of date at mount time and must be brought up to date by replay. Therefore, the on-flash LEB properties of the garbage-collected LEB reflect only its state as of the last commit. Replay begins to update the LEB properties, but some of the changes occurred before the garbage collection and others occurred after it. Depending on when the garbage collection happened, the final LEB property values will differ. To handle this, replay inserts a reference into its RB-tree to represent the point at which the LEB was added to the journal. That enables replay to adjust the LEB property values correctly when the replay RB-tree is applied to the index.


Another complication for replay is the effect of recovery. UBIFS records on the master node whether the file system was unmounted cleanly. If it was not, certain error conditions trigger recovery to repair the file system. Replay is affected in two ways. First, a bud eraseblock may be corrupted, for example because it was being written when the unclean unmount occurred. Second, a log eraseblock may be corrupted for the same reason. Replay handles these cases by passing the eraseblock to recovery, which attempts to fix the nodes in it. If the file system is mounted read-write, recovery makes the necessary repairs, and the integrity of the recovered file system is as good as it would have been without the unclean unmount. If the file system is mounted read-only, recovery is postponed until the next read-write mount.


The final complication is that some leaf nodes referenced by the on-flash index may no longer exist. This happens when a node has been deleted and the eraseblock that contained it has since been garbage collected. In general, deleted leaf nodes do not affect replay, because they are not part of the index. However, updating the index sometimes requires reading leaf nodes, namely directory entry nodes and extended attribute entry nodes. In UBIFS, a directory consists of an inode node and one directory entry node per directory entry. The index is accessed with a node key, a 64-bit value that identifies the node. In most cases the node key identifies the node uniquely, so the key alone is enough to update the index. Unfortunately, what uniquely identifies a directory entry or extended attribute entry is its name, which may be very long (up to 255 characters in UBIFS). To fit into the 64-bit key, the name is hashed to a 29-bit value, which is not unique. When two different names produce the same hash value, it is called a hash collision. In that case the leaf nodes must be read, and the collision is resolved by comparing the names stored in them. So what happens if one of those leaf nodes is gone because of garbage collection? It turns out that it does not matter. Directory entry nodes (and extended attribute entry nodes) are only ever added or removed; they are never replaced, because the information they contain never changes. So the outcome of the name comparison is known even though the node that contained one of the names is gone. When adding a hashed-key node, there will be no match. When removing a hashed-key node, there will always be a match, either to an existing node or to a missing node that has the correct key. To provide this special index updating for replay, a separate set of functions is used.
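To make the collision problem concrete, here is a hedged Python sketch of a 64-bit key with a 29-bit name hash. The bit layout (inode number in the upper 32 bits, a 3-bit type field, then the hash), the type value 2, and the CRC32-based hash are all assumptions made for this illustration; in particular, the hash is a stand-in and not the function UBIFS uses.

```python
import zlib

HASH_BITS = 29
HASH_MASK = (1 << HASH_BITS) - 1

def name_hash(name):
    """Stand-in 29-bit name hash (truncated CRC32, for illustration only)."""
    return zlib.crc32(name.encode()) & HASH_MASK

def dentry_key(parent_inum, name, key_type=2):
    """Pack the parent inode number, a type field and the name hash into 64 bits."""
    return (parent_inum << 32) | (key_type << HASH_BITS) | name_hash(name)

def lookup(index, parent_inum, name):
    """Find a directory entry, resolving collisions by comparing stored names."""
    for leaf in index.get(dentry_key(parent_inum, name), []):
        if leaf["name"] == name:
            return leaf
    return None
```

Since up to 255 characters are squeezed into 29 bits, several names in one directory can share a key, and only the name stored in the leaf node can tell them apart.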


After the log area comes the LPT area. The size of the log area is fixed when the file system is created, and so, therefore, is the start of the LPT area. At present, the size of the LPT area is calculated from the LEB size and the maximum number of LEBs specified when the file system is created. Like the log area, the LPT area must never run out of space. Unlike the log area, updates to the LPT area are random rather than sequential. In addition, the amount of LEB properties data may be very large, so access to it must be scalable. The solution is to store the LEB properties in a wandering tree. In fact, the LPT area is much like a small file system in its own right. It has its own LEB properties, that is, the LEB properties of the LEB properties area, called ltab. It has its own garbage collection. It has its own node structure, which packs the nodes as tightly as possible. However, like the index, the LPT area is updated only at commit. Therefore, the on-flash index and the on-flash LPT represent the state of the file system as of the last commit. The difference between that and the current state of the file system is represented by the nodes in the journal.


The LPT has two slightly different forms: the small model and the big model.

The small model is used when the entire LEB properties table can be written into a single eraseblock. In that case, LPT garbage collection simply writes out the whole table, which makes all the other eraseblocks in the LPT area reusable. In the big model, LPT garbage collection selects a dirty LPT eraseblock, marks the nodes in that LEB as dirty, and then writes out only the dirty nodes (as part of the commit). In addition, in the big model a table of LEB numbers is saved, so that the whole LPT does not have to be scanned when UBIFS is first mounted. In the small model, scanning the whole table is assumed to be fast because the table is small.

One of the main tasks of UBIFS is to access the index, which is a wandering tree. To make that efficient, index nodes are cached in an in-memory structure called the tree node cache (TNC). The TNC is a B+ tree that is the same as the on-flash index, plus all the changes made since the last commit. The nodes of the TNC are called znodes. Put another way, a znode on flash is called an index node, and an index node in memory is called a znode. Initially there are no znodes. When a lookup is done on the index, the index nodes needed are read and added to the TNC as znodes. When a znode needs to be changed, it is marked dirty, and it stays dirty until the next commit, after which it is marked clean again. At any time, the UBIFS memory shrinker may decide to free clean znodes in the TNC. The amount of memory used is therefore proportional to the part of the index that is in use, not to the size of the whole index. In addition, a leaf node cache (LNC) hangs off the bottom of the TNC and is used only for directory entries and extended attribute entries. The LNC only needs to cache nodes that were read for collision resolution or for readdir operations. Because the LNC is attached to the TNC, it is effectively shrunk whenever the TNC is.


The TNC is made somewhat more complicated by the requirement that commits interfere as little as possible with other UBIFS operations. The commit is therefore split into two parts. The first part is commit start. During commit start, the commit semaphore is taken for writing (down), which prevents anyone from updating the journal. Meanwhile, the TNC subsystem builds a list of dirty znodes and works out the positions at which they will be written to flash. The commit semaphore is then released and a new journal comes into use, while the commit continues. The second part is commit end. During commit end, the TNC writes the new index nodes without holding any TNC lock. That is, the TNC can be updated while the new index is being written to flash. This is done by marking the znodes as copy-on-write. If a znode that is being committed needs to be changed, it is copied, so that the commit still sees the unmodified znode. In addition, most of the commit work is done by the UBIFS background thread, so user processes rarely have to wait for a commit.
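The copy-on-write idea can be illustrated with a small Python sketch. The `Znode` class, the flat `tnc` list, and the helper names are simplifications invented for the example; locking and the tree structure itself are left out.

```python
class Znode:
    def __init__(self, keys):
        self.keys = list(keys)
        self.dirty = False
        self.cow = False  # set while a commit is writing this znode out

def commit_start(znodes):
    """Collect the dirty znodes and mark them copy-on-write."""
    dirty = [z for z in znodes if z.dirty]
    for z in dirty:
        z.cow = True
    return dirty

def modify(tnc, slot, key):
    """Add a key to the znode in tnc[slot], copying it if it is being committed."""
    z = tnc[slot]
    if z.cow:
        z = Znode(z.keys)  # the commit keeps seeing the original object
        tnc[slot] = z
    z.keys.append(key)
    z.dirty = True
    return z
```

The list returned by `commit_start` keeps referring to the unmodified znodes, so the commit can write them out while the live tree carries on changing.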


Note that the LPT follows the same commit strategy as the TNC, and both are wandering trees implemented as B+ trees. As a result, the LPT and TNC code have many similarities.


There are three main differences between UBIFS and JFFS2:

1. UBIFS has an on-flash index and JFFS2 does not. This is what makes UBIFS scalable.

2. UBIFS runs on top of UBI, which runs on top of MTD, whereas JFFS2 runs directly on MTD. UBIFS benefits from the wear-leveling and error handling provided by UBI, at the cost of the flash space, memory, and other resources that UBI takes.

3. UBIFS allows writeback.


Writeback is a VFS facility that allows written data to be held in the cache instead of being written to the storage media immediately. This makes the system more efficient, because updates to the same file tend to be local and can be grouped together. The difficulty in supporting writeback is that the file system must know how much free space it has, so that the cache never exceeds that space. This is hard for UBIFS to determine, so a whole subsystem, called budgeting, is dedicated to it. The difficulty comes from several sources.


The first reason is that UBIFS supports transparent compression. Because the amount of compression is not known in advance, the space required is not known in advance either. Budgeting must assume the worst case, namely that there is no compression. In many cases, however, that is a poor assumption. To overcome this, budgeting forces writeback when it detects insufficient space. (Translator's note: as flash capacities grow, I think this is not really a problem, and simply assuming no compression would be enough. I do not yet know whether forcing writeback adds complexity, and I will look into it later; if it does, I would just assume no compression.)


The second reason is that garbage collection cannot guarantee that all dirty space will be reclaimed. UBIFS garbage collection processes one eraseblock at a time. The minimum write unit for NAND flash is the NAND page, which UBIFS calls the minimum I/O unit. If the dirty space is smaller than a NAND page, it cannot be reclaimed. When the dirty space in an eraseblock is smaller than the minimum I/O size, that space is called dead space. Dead space cannot be reclaimed.

Similar to dead space, there is dark space. Dark space is dirty space in an eraseblock that is smaller than the maximum node size. In the worst case, the file system is full of maximum-sized nodes, and garbage collection cannot free up a piece of space large enough for another such node. So in the worst case dark space cannot be reclaimed, while in the best case it can. UBIFS budgeting must assume the worst case, so both dead space and dark space are treated as unavailable. However, if there is not enough space but there is a lot of dark space, budgeting will run garbage collection to see whether more free space can be produced.
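A small Python sketch of the dead and dark space rules follows. The function names, the `(free, dirty)` tuple representation, and the sizes used in the usage example are hypothetical, and the worst-case total is a simplification of whatever the real budgeting code computes.

```python
def classify_dirty(dirty, min_io_size, max_node_size):
    """Classify the dirty space of one eraseblock."""
    if dirty < min_io_size:
        return "dead"         # smaller than one NAND page: never reclaimable
    if dirty < max_node_size:
        return "dark"         # may not make room for a maximum-sized node
    return "reclaimable"

def worst_case_available(lebs, min_io_size, max_node_size):
    """Sum the free space plus dirty space that is neither dead nor dark.

    `lebs` is a list of (free, dirty) byte counts, one pair per eraseblock.
    """
    total = 0
    for free, dirty in lebs:
        total += free
        if classify_dirty(dirty, min_io_size, max_node_size) == "reclaimable":
            total += dirty
    return total
```

For instance, with a 2048-byte page and a 4256-byte maximum node size, an eraseblock with 3000 dirty bytes contributes nothing to the worst-case total even though garbage collection might well recover that space in practice.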


The third reason is that the cached data may replace data on flash that thereby becomes obsolete. Whether or not that is the case is not always known, and what the difference in compression may be is certainly not known. This is another reason why budgeting forces writeback when space is insufficient. Only after trying writeback, garbage collection, and committing the journal does budgeting give up and return ENOSPC.
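The order of those attempts can be sketched in Python as below. The `budget` function and its callback arguments are hypothetical; the sketch shows only the control flow of trying each space-making step in turn before reporting ENOSPC.

```python
import errno

def budget(needed, available, reclaimers):
    """Return 0 if `needed` bytes can be budgeted, else -ENOSPC.

    `available` is a callable returning the current worst-case free space;
    `reclaimers` are callables tried in order, for example writeback,
    garbage collection, and a journal commit.
    """
    if available() >= needed:
        return 0
    for reclaim in reclaimers:
        reclaim()
        if available() >= needed:
            return 0  # stop as soon as enough space has appeared
    return -errno.ENOSPC
```

Each step is progressively more expensive, which is one way to see why a nearly full file system becomes slow before it finally refuses writes.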


Of course, this means that UBIFS becomes less efficient as the file system gets close to full. In fact, all flash file systems become less efficient when the flash is nearly full. This is because it becomes less likely that an empty eraseblock has already been erased in the background, and more likely that garbage collection has to run.


The fourth reason is that deletions and truncations require new nodes to be written. If the file system really ran out of space, it would become impossible to delete anything, because there would be no space to write the deletion inode node or the truncation node. To prevent this, UBIFS always reserves some space for deletions and truncations.


The next UBIFS area is the orphan area. An orphan is an inode number whose inode node has been committed to the index with a link count of 0. This happens when an open file is deleted and a commit is then run. Under normal circumstances, the inode is deleted when the file is closed. After an unclean unmount, however, orphans have to be accounted for. The orphans' inodes must be deleted, which means either scanning the entire index to find them or keeping a list of orphans somewhere on flash. UBIFS takes the latter approach.


The orphan area lies between the LPT area and the main area and consists of a fixed number of LEBs. The number of LEBs in the orphan area is specified when the file system is created. The minimum number is 1. The orphan area should be large enough to hold the maximum number of orphans expected to exist at one time. The number of orphans that one LEB can hold is:

(leb_size - 32) / 8

For example, a 15872-byte LEB can hold 1980 orphans, so one LEB may be enough.
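The worked figure above can be checked with a one-line Python function. The two constants are assumptions chosen to be consistent with that figure: 32 bytes of per-LEB node overhead and 8 bytes per orphan inode number. The function name is likewise invented for the example.

```python
def orphans_per_leb(leb_size, overhead=32, bytes_per_orphan=8):
    """Number of orphan inode numbers that fit in one LEB (assumed constants)."""
    return (leb_size - overhead) // bytes_per_orphan

# A 15872-byte LEB, as in the example in the text.
print(orphans_per_leb(15872))
```

Under those assumptions the function reproduces the 1980 orphans quoted for a 15872-byte LEB.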


Orphans are accumulated in an RB-tree. When an inode's link count drops to 0, its inode number is added to the RB-tree. It is removed from the tree when the inode is deleted. Any new orphans that are in the orphan tree when a commit is run are written to the orphan area in one or more orphan nodes. If the orphan area is full, it is consolidated to make space. There is always enough space, because the code checks that the user cannot create more orphans than the maximum allowed.


The last UBIFS area is the main area. The main area contains the nodes that make up the file system data and the index. A main-area LEB can be an index eraseblock or a non-index eraseblock. A non-index eraseblock can be a bud (part of the journal) or can have been committed. A bud may currently be one of the journal heads. A LEB that contains committed nodes can still become a bud if it has free space. A bud LEB therefore has an offset that marks where the journal nodes start, although in most cases that offset is 0.





