UBIFS White Paper: UBIFS Design Overview

Editor's note:

I recently started working with UBIFS, so I downloaded ubifs_whitepaper (the UBIFS White Paper) from the official UBIFS website. I later found that a Chinese translation of this document is available on a ChinaUnix blog; however, after some study, I found that the translation has the following shortcomings:

1. English and Chinese are mixed: many English phrases in the article were left untranslated.

2. It has no structure. I cannot blame the blogger for this, because the white paper itself has no notion of chapters; I have added chapter headings and markers for the reader's convenience.

3. There are some translation errors, probably because the original sentences are long and hard to translate. For such "long sentences" I have done extra work to make sure they are correct and easy to understand.

Translation is tedious work, so I would like to thank the author of that translation, because this article was produced by improving on it.

======================================================================

I. Flash file system design ideas

A file system designed for flash requires out-of-place updates, because flash must be erased before it can be written, and each location can be written only once before it must be erased again. If eraseblocks were small and could be erased quickly, they could be treated like disk sectors; but they are not. Reading a whole eraseblock, erasing it, and writing back the updated data typically takes 100 times longer than simply writing the update somewhere else. In other words, for a small update, an in-place update takes 100 times longer than an out-of-place update.

Out-of-place updating requires garbage collection. As data is updated out-of-place, eraseblocks come to contain a mixture of valid data and obsolete data (data that has been updated elsewhere). Eventually the file system uses up all of its empty eraseblocks, and every eraseblock contains both valid and obsolete data. In order to write new data anywhere, an eraseblock must be made empty so that it can be erased and reused. The process of finding an eraseblock that contains a lot of obsolete data and moving its valid data to another eraseblock is called garbage collection.

Garbage collection suggests the benefits of a node structure. In order to reclaim an eraseblock, the file system must be able to identify the data stored there. This is the reverse of the usual indexing problem a file system faces: a file system usually starts with a file name and must find the data that belongs to that file, whereas garbage collection starts with arbitrary data and must discover which file that data belongs to. One way to solve this problem is to store metadata together with the file data; the combination of data and metadata is called a node. Each node records which file (which inode) it belongs to and what data it contains. Both JFFS2 and UBIFS follow a node-structured design, which enables their garbage collectors to read eraseblocks directly, determine which data must be moved and which may be discarded, and update the index accordingly.
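The node-structured recovery described above can be sketched as follows. This is an illustrative model, not UBIFS code: `Node`, `EraseBlock`, and `garbage_collect` are invented names for this example. Because each node carries its own inode number and key, the collector can scan an eraseblock and ask the index whether each node is still current.

```python
# Illustrative model of node-structured garbage collection; not UBIFS code.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    inode: int      # which file this node belongs to
    key: int        # which part of that file (e.g. a block number)
    data: bytes     # the file data itself

@dataclass
class EraseBlock:
    nodes: list = field(default_factory=list)

def garbage_collect(eb, index, free_eb):
    """Move the still-valid nodes out of `eb` so that it can be erased.

    `index` maps (inode, key) -> the current node for that key.  Because
    every node records its own inode and key, the collector can start
    from raw data and ask the index whether each node is still current.
    """
    for node in eb.nodes:
        if index.get((node.inode, node.key)) is node:   # still valid?
            free_eb.nodes.append(node)                  # move the data
            index[(node.inode, node.key)] = node        # re-point the index
        # obsolete nodes (superseded elsewhere) are simply dropped
    eb.nodes.clear()                                    # now safe to erase
```

Because validity is decided per node, the collector never needs a reverse map from eraseblocks to files; the metadata travels with the data.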

II. Introduction to UBIFS design

The biggest difference between JFFS2 and UBIFS is that UBIFS stores its index on flash, whereas JFFS2 keeps its index in memory and must rebuild it by scanning the media when the file system is mounted. This potentially places a limit on the maximum size of a JFFS2 file system, because both the mount time and the memory usage grow linearly with the size of the flash. UBIFS was designed specifically to overcome this limitation.
Unfortunately, storing the index on flash is very complicated, because the index itself must also be updated out-of-place. When one part of the index is updated out-of-place, every other part of the index that references it must also be updated; then, in turn, the parts that reference those must be updated, and so on, seemingly without end. One solution to this problem is the wandering tree.
In UBIFS, the wandering tree is actually a B+ tree. Only the leaves of the tree contain file information; they are the valid nodes of the file system. The internal elements of the tree are index nodes, which contain only information about their children; that is, each index node records where its child nodes are. The wandering tree of UBIFS can therefore be viewed as two parts: the top part consists of the index nodes that make up the tree structure, and the bottom part consists of the leaf nodes that hold the actual data. The top part can simply be regarded as the index. A file system update consists of creating a new leaf node and adding it to the tree (or replacing a node already in the tree). The leaf's parent index node must then be replaced, then the parent of that parent, and so on all the way up to the root of the tree. The number of index nodes replaced equals the height of the tree. The question that remains is how to know where the root of the tree is: in UBIFS, the position of the root index node is stored in the master node.
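The wandering-tree update can be sketched as below. This is a hypothetical model (`IndexNode` and `replace_leaf` are invented names): replacing one leaf forces a new copy of every index node on the path to the root, so exactly `height` index nodes are rewritten, and the master node is updated to point at the new root.

```python
# Sketch of a wandering (out-of-place) B+ tree update; not UBIFS code.

class IndexNode:
    def __init__(self, children):
        self.children = list(children)   # child index nodes or leaves

def replace_leaf(node, path, new_leaf):
    """Return a new root with the leaf at `path` replaced.

    `path` lists the child slots from the root down to the leaf.  Every
    index node along that path is copied (i.e. written out-of-place),
    so the number of index nodes rewritten equals the tree height;
    untouched subtrees are shared with the old tree.
    """
    if not path:
        return new_leaf                              # reached the leaf
    slot, rest = path[0], path[1:]
    children = list(node.children)
    children[slot] = replace_leaf(node.children[slot], rest, new_leaf)
    return IndexNode(children)                       # fresh copy of this node

# The master node records where the root is; modeled here as a dict slot.
master = {"root": IndexNode([IndexNode(["A", "B"]), IndexNode(["C", "D"])])}
master["root"] = replace_leaf(master["root"], [1, 0], "C2")
```

Note how the old tree remains intact on "flash": only the path to the changed leaf is rewritten, which is exactly why the tree is said to wander.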

2.1 The master node

The master node stores the position of the root index node. The master node itself is written repeatedly to LEB1 and LEB2 (LEB: logical eraseblock). LEBs are an abstraction created by UBI: UBI maps PEBs (physical eraseblocks) to LEBs, so LEB1 and LEB2 may be anywhere on the flash medium (strictly speaking, anywhere on the UBI device), and UBI records where they actually are. Two eraseblocks are used so that there are always two copies of the master node. This is for the sake of recovery, because two situations can corrupt or lose the master node. The first is a sudden power cut while the master node is being written; the second is that the flash medium itself becomes damaged. The first case is recoverable, because the previous master node can still be used. The second case is not recoverable by the file system itself, because it cannot be reliably determined which master node is valid; in that case, a user-space program must analyze all the nodes on the medium and repair or recreate the corrupted or missing ones. Having two copies of the master node makes it possible to tell which of these situations has occurred and to recover accordingly.

2.2 The superblock

LEB0 stores the superblock node. The superblock contains file system parameters that rarely, if ever, change. For example, the flash geometry (eraseblock size, number of eraseblocks, etc.) is stored in the superblock. So far, there is only one situation in which the superblock needs to be rewritten: automatic resizing. UBIFS currently has only a limited ability to change its size: it can only be resized up to the maximum size specified when the file system image was created. This feature is needed because the actual size of a flash partition varies with the number of bad blocks. So when a file system image is created with mkfs.ubifs, the maximum number of eraseblocks must be specified; both the maximum number and the number of eraseblocks actually used are recorded in the superblock. When UBIFS mounts a partition (actually a UBI volume) whose number of eraseblocks is larger than the number recorded in the superblock but smaller than the maximum number (also recorded in the superblock), UBIFS automatically resizes itself to fit the volume.
In fact, UBIFS has six areas, whose positions are fixed when the file system is created. The first two areas have already been described: LEB0 is the superblock area, and the superblock is always at offset 0. The superblock area is written using UBI's atomic LEB change capability, which guarantees that the LEB is either updated entirely successfully or left entirely unchanged. The next area is the master node area, which occupies LEB1 and LEB2; normally the two LEBs contain identical data. Master nodes are written successively to the LEB until it is full, at which point the LEBs are unmapped and the master node is again written at offset zero (UBI maps the unmapped LEB to a fresh, erased PEB). Note that the two master node LEBs are not unmapped at the same time, because that would leave the file system temporarily without a valid master node. The remaining UBIFS areas are the log area, the LPT (LEB properties tree) area, the orphan area, and the main area.
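The fixed six-area layout can be summarized numerically. Below is a small sketch under the assumption that the log, LPT, and orphan area sizes are given directly; in real UBIFS these come from mkfs.ubifs parameters and a computed LPT size, so the numbers here are purely illustrative.

```python
# Sketch of the fixed six-area LEB layout; the area sizes here are
# illustrative parameters, not values UBIFS would actually compute.

def ubifs_layout(log_lebs, lpt_lebs, orphan_lebs, total_lebs):
    """Return the first LEB number of each area, in on-flash order."""
    layout = {"superblock": 0,   # LEB0, rewritten only on resize
              "master": 1}       # LEB1 and LEB2: two master node copies
    nxt = 3
    layout["log"] = nxt;    nxt += log_lebs
    layout["lpt"] = nxt;    nxt += lpt_lebs
    layout["orphan"] = nxt; nxt += orphan_lebs
    layout["main"] = nxt
    layout["main_lebs"] = total_lebs - nxt   # everything else is main area
    return layout

# e.g. 5 log LEBs, 2 LPT LEBs, 1 orphan LEB on a 1024-LEB volume
layout = ubifs_layout(5, 2, 1, 1024)
```

The point of the sketch is simply that every area's position follows from the sizes fixed at creation time, which is why those sizes cannot change later.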

2.3 The log area

The log is part of the UBIFS journal. UBIFS uses a journal to reduce the frequency of updates to the on-flash index. Recall that the index is the top part of the wandering tree (consisting only of index nodes), and that adding or replacing a leaf node of the tree requires updating all of that leaf's ancestor index nodes. It would be very inefficient to write the updated index every time a leaf node is written, because many of the same index nodes would be written repeatedly, particularly those near the top of the tree. Instead, UBIFS defines a journal: leaf nodes are written to the journal rather than immediately added to the on-flash index. Note that the in-memory index is still updated (see the TNC, below). Periodically, when the journal is nearly full, it is committed. The commit process writes out the new version of the index and the master node.
The existence of the journal means that when UBIFS is mounted, the on-flash index is out of date. To bring it up to date, the leaf nodes in the journal must be read and re-indexed; this process is called replay. Note that the larger the journal, the longer replay takes, and the longer it takes to mount a UBIFS file system. On the other hand, a large journal is committed less often, which makes the file system more efficient. The journal size is a parameter of mkfs.ubifs, so it can be tuned to the needs of the file system. In any case, UBIFS does not use a fast-unmount option by default; instead, it runs a commit before unmounting. In this way, when the file system is mounted again, the journal is almost empty, which makes mounting very fast. This is a good trade-off, because the commit process itself is generally very quick and takes little time.

Note that the commit process does not move the leaf nodes out of the journal; instead, the journal itself moves. The purpose of the log is to record where the journal is. The log contains two types of node: the commit start node, which records that a commit has begun, and reference nodes, which record the LEB numbers of the main-area LEBs that make up the rest of the journal. Those LEBs are called buds, so the journal consists of the log plus the buds. The log has a fixed size and can be regarded as a circular buffer. After a commit, the nodes that recorded the previous journal position are no longer needed, so the tail of the log is erased as the head of the log is extended. The write of the master node marks the end of the commit, because the master node points to the new position of the log. If a commit never completes because of an unclean unmount, the replay process replays both the old and the new journal (so that the two remain consistent).
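The log-as-circular-buffer behavior can be sketched like this. It is an illustrative model only; the `Log` class and node tuples are invented, and the "master node" is reduced to a dict. A commit writes a new commit-start node at the head, and only once the master node records the new position is the old tail erased.

```python
# Sketch of the log as a circular buffer of reference nodes; illustrative
# only (node formats and the Log class are invented for this example).
from collections import deque

class Log:
    def __init__(self):
        self.entries = deque()        # ("cs",) or ("ref", bud_leb_number)

    def add_bud(self, leb):
        """The journal grew into main-area LEB `leb`: extend the log head."""
        self.entries.append(("ref", leb))

    def commit(self, master):
        """Commit: new commit-start node at the head, then erase the tail."""
        old = len(self.entries)
        self.entries.append(("cs",))      # commit-start node for the new log
        master["log_pos"] = old           # master node write ends the commit
        for _ in range(old):              # only now can the old tail go
            self.entries.popleft()

log, master = Log(), {}
log.add_bud(100)
log.add_bud(101)
log.commit(master)                        # old references are now erased
```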

Replay is complicated by several factors:

> The first complication is that leaf nodes must be replayed in order. Because UBIFS uses a multiheaded journal, the order in which the leaf nodes were written is not simply the order of the bud eraseblocks referenced by the log. To sort the leaf nodes, every node contains a 64-bit sequence number that is incremented throughout the life of the file system. Replay reads all the leaf nodes in the journal and places them in a red-black tree sorted by sequence number. The red-black tree is then processed in order, updating the in-memory index as appropriate.
> The second complication is that replay must manage deletion and truncation. There are two kinds of deletion: deletion of inodes, which corresponds to deleting files and directories, and deletion of directory entries, which corresponds to unlinking and renaming. In UBIFS, every inode has a corresponding inode node, which records the number of directory entries that link to it, usually called the link count. When an inode is deleted, an inode node with a link count of zero is written to the journal. In that case, instead of adding that leaf node to the index, replay removes from the index all index entries with that inode number. To delete a directory entry, a directory entry node is written to the journal, but with the inode number of the entry's target set to zero. Note that a directory entry has two inode numbers: one is the inode number of the parent directory, and the other is the inode number of the file or subdirectory the entry names; it is the latter that is set to zero when the entry is deleted. When replay processes a directory entry node whose target inode number is zero, it removes the directory entry from the index rather than adding it.

Truncation changes the size of a file; in fact, truncation can both lengthen and shorten a file. For UBIFS, lengthening a file needs no special handling. Lengthening a file by truncation creates a hole in the file: a region that has never been written and reads back as all zeros. UBIFS does not index holes and stores no nodes for them; a hole is simply a gap in the index. When UBIFS looks up the index and finds no entry, it defines that region as a hole and creates zero-filled data. Shortening a file, on the other hand, requires that the index entries beyond the new length be removed. For this purpose, a truncation node recording the old and new file lengths is written to the journal. Replay processes these nodes by removing the relevant index entries.
> The third complication is that the LPT (LEB properties tree) area must be updated during replay. LEB properties are three values kept for every LEB in the main area: free space, dirty space, and whether the eraseblock is an index eraseblock. Note that index nodes and non-index nodes are never stored in the same eraseblock, so an index eraseblock contains only index nodes and a non-index eraseblock contains only non-index nodes. Free space is the number of bytes at the end of an eraseblock that have not yet been written and are available for more nodes. Dirty space is the number of bytes taken up by obsolete nodes and padding, which can potentially be reclaimed by garbage collection. LEB properties are essential both for finding free space in which to place the journal or the index, and for finding the dirtiest eraseblocks to garbage collect. Every time a node is written, the free space of that eraseblock decreases; every time a node is obsoleted, or a padding node or a truncation or deletion node is written, the dirty space of the relevant eraseblock increases. When an eraseblock is allocated as an index eraseblock, that fact must be recorded: an index eraseblock with free space must not, for example, be allocated to the journal, because that would mix index and non-index nodes in one eraseblock. The budgeting section below explains further why index and non-index nodes must not be mixed.
In general, the index subsystem is responsible for notifying the LEB properties subsystem of changes to LEB properties. Complications arise during replay when an eraseblock that has been garbage collected is then added to the journal. Like the index, the LPT area is updated only at commit time; and like the index, the on-flash LPT is therefore out of date when the file system is mounted and must be brought up to date by the replay process. So the on-flash LEB properties reflect the state as of the last commit. Replay updates the LEB properties as it proceeds, but some of the changes occurred before the garbage collection and some after. Depending on the point at which garbage collection took place, the final LEB property values will differ. To handle this, replay inserts a reference into its red-black tree to represent the point at which the LEB was added to the journal (using the sequence number of the log reference node). When the replay tree is applied to the index, replay can then adjust the LEB property values correctly.
> The fourth complication is the effect of recovery on replay. UBIFS records in the master node whether the file system was unmounted cleanly. After an unclean unmount, certain error conditions trigger recovery of the file system. Replay is affected in two ways. First, a bud eraseblock that was being written at the time of the unclean unmount may be corrupt; second, the log eraseblock being written at that time may likewise be corrupt. Replay attempts to recover such eraseblocks by fixing up their nodes. If the file system is mounted read/write, recovery makes the necessary fixes, and the integrity of the recovered UBIFS file system is just as good as if the unclean unmount had never occurred. If the file system is mounted read-only, recovery is deferred until it is remounted read/write.
> The last complication is that a leaf node referenced by the index may no longer exist. This happens when a node is deleted and the eraseblock containing it is subsequently garbage collected. In general, deleted leaf nodes do not affect replay, because they are not part of the index. However, the index structure sometimes requires a leaf node to be read when the index is updated. In UBIFS, a directory consists of an inode node and directory entries. Nodes are accessed in the index via a node key, a 64-bit value that identifies the node. In most cases the node key uniquely identifies the node, so index updates can use the key alone. Unfortunately, the identifying information of a directory entry is its name, which can be a long string (up to 255 characters in UBIFS). To squeeze this information into 64 bits, the name is hashed to a 29-bit value, which is not unique for a given name. When two names yield the same hash value, this is called a hash collision. In that case, the leaf nodes must be read and the collision resolved by comparing the names stored in them. So what happens if one of those leaf nodes is missing for the reason described above? It turns out this is not too bad: directory entry nodes are only ever added and removed, never replaced, because the information they contain never changes. When a hashed-key node is added, there should be no matching node; when a hashed-key node is removed, either the matching node is found, or the node with the matching key has been lost. Replay uses a separate set of functions to perform these special index updates (in the code, their names contain the word "fallible").
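The 29-bit name hash and collision handling can be illustrated as follows. The hash function here is an arbitrary stand-in (UBIFS actually uses the R5 hash), and the index is modeled as a plain dict; the point is only the two-step lookup: the key narrows the search, and colliding entries are disambiguated by comparing full names.

```python
# Sketch of directory-entry lookup via a 29-bit name hash; the hash used
# here is an arbitrary stand-in (UBIFS actually uses the R5 hash).

def name_hash29(name: str) -> int:
    h = 5381
    for ch in name.encode():
        h = (h * 33 + ch) & 0xFFFFFFFF
    return h & 0x1FFFFFFF                 # keep only 29 bits

def lookup_dirent(index, parent_inode, name):
    """The key narrows the search; full names resolve collisions."""
    key = (parent_inode, name_hash29(name))
    for entry in index.get(key, []):      # every entry sharing this key
        if entry["name"] == name:         # compare stored names
            return entry["inode"]
    return None                           # no entry: the name does not exist

index = {}
for nm, ino in [("a.txt", 10), ("b.txt", 11)]:
    index.setdefault((1, name_hash29(nm)), []).append(
        {"name": nm, "inode": ino})
```

The need to read the stored name to resolve a collision is exactly why a missing (garbage-collected) directory entry node requires the "fallible" update path during replay.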

2.4 The LPT area

The log area is followed by the LPT area. The size of the log area is defined when the file system is created, and consequently the position of the LPT area is also fixed at creation time (because it immediately follows the log area). Currently, the size of the LPT area is calculated automatically from the LEB size and the maximum number of LEBs specified when the file system is created. Like the log area, the LPT area must never run out of space. Unlike the log area, updates to the LPT area are not sequential: they are random. In addition, the amount of LEB properties data is potentially large, so it must be stored scalably.
The solution is to store the LEB properties in a wandering tree. In fact, the LPT area is much like a miniature file system in its own right. It has its own LEB properties, namely the LEB properties of the LPT area itself (called the ltab). It has its own garbage collection. It has its own node structure, which packs the nodes tightly at the bit level. And, like the index, the LPT area is updated only at commit time. Thus the on-flash index and the on-flash LPT together describe the file system as it was at the last commit; the difference between that state and the actual state of the file system is described by the nodes in the journal.

The LPT actually comes in two slightly different forms, called the small model and the big model. The small model is used when the entire LEB properties table fits in a single eraseblock; in that case, LPT garbage collection consists of simply writing out the whole table, which makes all the other LPT area eraseblocks reusable. In the big model, only dirty LPT eraseblocks are garbage collected: the dirtiest LPT eraseblock is selected, its nodes are marked dirty, and the dirty nodes are written out. In addition, in the big model, a table of LEBs that contain free space is saved, so that when UBIFS is mounted for the first time after the file system is created, finding an empty eraseblock does not require searching the entire LPT. In the small model, it is assumed that searching the whole table is not slow, because the table is very small.
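The two LPT garbage-collection strategies can be contrasted in a sketch. This is an illustrative model with invented function names, not the real bit-packed LPT format: the small model rewrites the whole table in one go, while the big model touches only dirty nodes.

```python
# Sketch contrasting the two LPT garbage-collection strategies; this is
# an illustrative model, not the real bit-packed LPT node format.

def lpt_gc_small(table, write_leb):
    """Small model: the whole table fits in one LEB, so garbage collection
    rewrites the entire table, freeing every other LPT eraseblock."""
    write_leb(sorted(table.items()))      # one rewrite of everything
    return len(table)                     # entries written

def lpt_gc_big(dirty_nodes, write_node):
    """Big model: only dirty LPT nodes are rewritten; clean ones stay."""
    written = 0
    while dirty_nodes:
        write_node(dirty_nodes.pop())
        written += 1
    return written
```

The trade-off mirrors the text: the small model keeps the code trivial at the cost of rewriting everything, which is acceptable only because the table is tiny.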

A major part of UBIFS's work is reading the index, which is the on-flash wandering tree. To make this more efficient, index nodes are cached in memory in a structure called the TNC (tree node cache). The TNC is a B+ tree of the same shape as the on-flash index, and its nodes are called znodes. Another way to look at it: a znode is called an index node when it is on flash, and an index node is called a znode when it is in memory. Initially there are no znodes; when the index is looked up, the needed index nodes are read from flash and added to the TNC as znodes. When a znode needs to be changed, it is marked dirty in memory until the next commit, when it is marked clean again. At any time, the UBIFS memory shrinker may decide to free clean znodes in the TNC, so that the memory used is proportional to the part of the index currently in use, rather than to the full size of the index. In addition, attached to the bottom of the TNC is the LNC (leaf node cache), which caches only directory entry nodes; these are the nodes that must be cached in order to resolve hash collisions and to perform readdir operations. Because the LNC is attached to the TNC, it shrinks whenever the TNC shrinks.
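TNC-style demand loading can be sketched with a small cache model. The class and method names here are invented for illustration; real znodes mirror B+ tree index nodes, whereas this model just caches opaque values by address.

```python
# Sketch of TNC-style demand loading of index nodes; names are invented.

class TNC:
    def __init__(self, read_index_node):
        self.read_index_node = read_index_node   # reads a node from "flash"
        self.znodes = {}                         # addr -> [node, dirty?]

    def get(self, addr):
        """Return the znode for `addr`, reading it from flash on first use."""
        if addr not in self.znodes:
            self.znodes[addr] = [self.read_index_node(addr), False]
        return self.znodes[addr][0]

    def mark_dirty(self, addr):
        self.get(addr)
        self.znodes[addr][1] = True              # dirty until the next commit

    def shrink(self):
        """The memory shrinker may drop clean znodes at any time."""
        self.znodes = {a: z for a, z in self.znodes.items() if z[1]}

reads = []
tnc = TNC(lambda addr: (reads.append(addr) or f"index-node@{addr}"))
tnc.get(7)
tnc.get(7)          # second access is served from the cache: no new read
```

Dirty znodes must survive shrinking because they are the only copy of the pending index update until the next commit writes them out.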

To minimize contention between commits and other UBIFS operations, the TNC is made more sophisticated: the commit is divided into two main parts. The first part is commit start. During commit start, the commit semaphore is taken down, which prevents journal updates during that period. Also during that period, the TNC subsystem records its dirty znodes and decides where on flash they will be written. Then the commit semaphore is released, a new journal is started, and the commit process continues.
The second part is commit end. During commit end, the TNC writes out the new index nodes without holding the commit semaphore. That means the TNC can be updated at the same time as the new index is being written to flash. This is achieved by marking znodes copy-on-write: if a znode that is part of the running commit needs to be modified, a copy is made, so that the znode being committed remains unchanged. In addition, the commit is mostly run by the UBIFS background thread, so user processes need to wait very little for commits.
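The copy-on-write trick at commit end can be sketched as follows. This is illustrative only; the `Znode` class and both function names are invented. Znodes captured by the commit are frozen, and any concurrent update lands on a copy, so the on-flash index being written stays internally consistent.

```python
# Sketch of copy-on-write znodes during commit end; illustrative only
# (the Znode class and function names here are invented).
import copy

class Znode:
    def __init__(self, data):
        self.data = data
        self.in_commit = False     # set while this znode is being written out

def start_commit_end(znodes):
    """Freeze the dirty znodes that the commit will write to flash."""
    for z in znodes:
        z.in_commit = True
    return list(znodes)            # the snapshot that goes to flash

def cow_modify(znode, new_data):
    """Update a znode; if the commit holds it, modify a copy instead."""
    if znode.in_commit:
        znode = copy.copy(znode)   # the committed original stays frozen
        znode.in_commit = False
    znode.data = new_data
    return znode                   # caller links this (possibly new) znode

z = Znode("v1")
snapshot = start_commit_end([z])   # commit end starts writing "v1"
z2 = cow_modify(z, "v2")           # a concurrent update lands on a copy
```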

Note that the LPT and the TNC adopt the same commit strategy. Both are wandering trees implemented as B+ trees, which results in considerable similarity in their code.


There are three important differences between UBIFS and JFFS2. The first has already been mentioned: UBIFS stores its index on flash while JFFS2 does not (JFFS2's index is in memory), so UBIFS has better scalability. The second difference is that UBIFS runs on top of the UBI layer, which in turn runs on top of the MTD layer, whereas JFFS2 runs directly on top of MTD. UBIFS benefits from UBI's wear-leveling and error management, at the cost of the flash space, memory, and other resources that UBI occupies. The third important difference is that UBIFS allows write-back.

Write-back is a VFS feature whereby data is written to a cache rather than immediately to the medium. This makes the system more efficient, because multiple updates to the same file can be combined. The difficulty with write-back is that the file system must know how much free space is actually available, so that the cache never grows larger than the free space on the medium. This is very difficult for UBIFS, so an entire subsystem called budgeting is dedicated to the job. It is difficult for several reasons:
> The first reason is that UBIFS supports transparent compression. Because the amount of compression is not known in advance, the space required is not known in advance either. Budgeting must therefore assume the worst case: no compression at all. In most cases that is a poor assumption, so to compensate, budgeting forces write-back to begin when it detects that space is running low.
> The second reason is that garbage collection cannot guarantee to reclaim all dirty space. UBIFS garbage collection processes one eraseblock at a time. With NAND flash, only a whole NAND page can be written at a time; a NAND eraseblock consists of a fixed number of NAND pages, and UBIFS treats the NAND page size as its minimum I/O unit. Because UBIFS processes one eraseblock at a time, dirty space smaller than the minimum I/O size cannot be reclaimed; it ends up as padding at the end of a NAND page. Dirty space in an eraseblock that is smaller than the minimum I/O size is called dead space; dead space can never be reclaimed.

Similar to dead space, there is also dark space. Dark space is dirty space in an eraseblock that is smaller than the maximum node size. In the worst case, the file system is full of nodes of the maximum size, and garbage collection would yield only fragments of free space too small to hold any node; so in the worst case, dark space cannot be reclaimed, although in the best case it can. UBIFS budgeting must assume the worst case, so both dead space and dark space are counted as unavailable. Nonetheless, if there is not enough space but there is plenty of dark space, budgeting itself runs garbage collection to see whether more space can in fact be freed.
> The third reason is that cached data may overwrite data already stored on flash, making the old data obsolete. It is not generally known whether that is the case, nor is it known how differently the new data will compress. This is another reason budgeting forces write-back when its calculation shows there is not enough space. Only after trying write-back, garbage collection, and a journal commit does budgeting give up and return ENOSPC (the no-space error code).

Of course, this means that UBIFS becomes less efficient as the file system approaches being full. In fact, that is true of all flash file systems, because a nearly full file system is less likely to have an empty eraseblock already erased in the background, and more likely to have to wait for garbage collection to produce one.
> The fourth reason is that deleting or truncating a node requires writing a new node. So if the file system really had no space at all, it would be impossible to delete anything, because there would be no room to write the deletion or truncation node. To prevent this situation, UBIFS always reserves some space, so that deletion and truncation remain possible.
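The pessimistic space accounting described in the reasons above can be sketched numerically. This is a simplified model with invented names, and the size constants in the example are hypothetical; the point is only that dead and dark space are written off from what garbage collection could theoretically reclaim.

```python
# Sketch of pessimistic space budgeting with dead and dark space; a
# simplified model (the sizes below are hypothetical, not UBIFS defaults).

def available_space(lebs, min_io_size, max_node_size):
    """Estimate usable space, writing off dead and dark space.

    `lebs` is a list of (free_bytes, dirty_bytes) pairs.  Dirty space
    below the minimum I/O size is dead (never reclaimable); dirty space
    below the maximum node size is dark (unreclaimable in the worst case,
    which budgeting must assume).
    """
    total = 0
    for free, dirty in lebs:
        total += free                      # free space always counts
        if dirty < min_io_size:
            continue                       # dead space: written off
        if dirty < max_node_size:
            continue                       # dark space: assume the worst
        total += dirty                     # large dirty runs can be reclaimed
    return total

# e.g. a 2048-byte NAND page as min I/O, a 4256-byte max node (hypothetical)
space = available_space([(1000, 100), (0, 3000), (500, 8000)], 2048, 4256)
```

Under this worst-case accounting, the middle LEB's 3000 dirty bytes are dark space and contribute nothing, which is why budgeting may still try garbage collection before finally returning ENOSPC.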

2.5 The orphan area

The next UBIFS area is the orphan area. An orphan is an inode number whose inode node has been committed to the index with a link count of zero. This happens when an open file is deleted (unlinked) and a commit then runs. In the normal course of events, the inode would be deleted when the file is closed; but in the case of an unclean unmount, orphans must be accounted for. After an unclean unmount, the orphan inodes must be deleted, which means either scanning the entire index looking for them or keeping a list of them somewhere on flash; UBIFS implements the latter.
The orphan area is a fixed number of LEBs located between the LPT area and the main area. The number of orphan area LEBs is specified when the file system is created; the minimum number is 1.

The size of the orphan area must allow for the maximum number of orphans expected at any one time. The number of orphans that fit in a single LEB is:

    (leb_size - 32) / 8

For example, a 15872-byte LEB can hold (15872 - 32) / 8 = 1980 orphans, so one LEB is usually enough.
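The capacity formula above can be checked directly, assuming (per the text) 32 bytes of orphan-node header and 8 bytes, one 64-bit inode number, per orphan:

```python
# Orphans that fit in one LEB: (leb_size - 32) / 8, per the formula above.

def orphans_per_leb(leb_size: int) -> int:
    HEADER = 32        # bytes of header, per the formula in the text
    PER_ORPHAN = 8     # each orphan is one 64-bit inode number
    return (leb_size - HEADER) // PER_ORPHAN

print(orphans_per_leb(15872))   # → 1980, matching the example above
```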
Orphans are accumulated in a red-black tree. When the link count of an inode node drops to zero, the inode number is added to the tree; it is removed from the tree when the inode is deleted. When the commit runs, any new orphans in the orphan tree are written out to the orphan area, packed into one or more orphan nodes. If the orphan area is full, the space is consolidated to make room. There is always enough space in the end, because validation prevents users from creating more than the maximum allowed number of orphans.

2.6 The main area

The last UBIFS area is the main area. The main area contains the data and index nodes that make up the file system. A main area LEB may be an index eraseblock or a non-index eraseblock. A non-index eraseblock may be a bud (part of the journal) or may have been committed. A bud may be one of the currently active journal heads. A LEB that contains committed nodes can still become a bud if it has free space; hence a bud LEB has an offset at which the journal data begins, although that offset is usually zero.

For further reading, see:

[1] https://zh.wikipedia.org/wiki/UBIFS

[2] http://lwn.net/Articles/290057/

[3] http://lwn.net/Articles/276025/

[4] http://www.linux-mtd.infradead.org/faq/ubifs.html

[5] http://www.linux-mtd.infradead.org/doc/ubifs.html
