Document directory
- 3.1 Reliability Assurance (Ensure file consistency before and after power failure)
- 3.2 Recoverability (roll back to a correct intermediate state when an error occurs)
From my new blog: http://blog.chinaunix.net/u3/94771/showart_2106382.html
File Power-Failure Reliability Assurance
Author: bripengandre
E-mail: bripengandre@126.com
I. Background
A recently developed embedded device places high demands on system stability, reliability, and recoverability. In particular, if the system suddenly loses power while updating a key file (such as the kernel), can it be restored after reboot to the exact state it was in just before the power failure? If not, can it at least be restored to the most recent correct configuration so that the system does not crash?
II. Problem Statement
One reason reliability and recoverability are hard to guarantee is that, on a general-purpose operating system, a file write from user space passes through several stages before it reaches the storage device: user-space file buffer (if standard I/O is used) --> kernel file buffer --> kernel output buffer (the queue of writes to the storage device) --> storage device. A problem at any stage of this chain breaks the guarantee. The steps are also non-blocking: a step can report success without waiting for the next step to complete. For example, using fwrite and fflush to push file content into the kernel file buffer does not guarantee that the content has already reached the kernel output buffer. According to APUE (Advanced Programming in the UNIX Environment), to improve read/write efficiency the transfer from the kernel file buffer to the kernel output buffer is generally delayed (typically by about 30 s); this is known as delayed write.
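The first stage of this chain can be observed directly. The following is a minimal sketch, assuming a POSIX system; the helper names kernel_visible_size and show_stdio_buffering are hypothetical and chosen only for illustration. It writes a few bytes with fwrite and asks the kernel (through stat) how large the file is before and after fflush.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Size of the file as the kernel currently sees it, or -1 on error. */
static long kernel_visible_size(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_size;
}

/*
 * Hypothetical demo helper: write len bytes with fwrite, then record the
 * kernel-visible file size before and after fflush.
 * Returns 0 on success, -1 on any I/O error.
 */
static int show_stdio_buffering(const char *path, const char *data, size_t len,
                                long *before_flush, long *after_flush)
{
    FILE *fp = fopen(path, "wb");

    if (fp == NULL)
        return -1;
    /* Request full buffering so that a small write stays in user space. */
    if (setvbuf(fp, NULL, _IOFBF, BUFSIZ) != 0)
        goto fail;
    if (fwrite(data, 1, len, fp) != len)
        goto fail;
    *before_flush = kernel_visible_size(path);  /* data still in the stdio buffer */
    if (fflush(fp) != 0)                        /* stdio buffer -> kernel file buffer */
        goto fail;
    *after_flush = kernel_visible_size(path);
    return fclose(fp) == 0 ? 0 : -1;

fail:
    fclose(fp);
    return -1;
}
```

On a typical Linux system, for a write smaller than the stdio buffer, the size reported before fflush is expected to be 0 and the size after fflush equals the number of bytes written. Note that even the second value only shows that the kernel has accepted the data, not that it is on the storage device.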
III. Solutions
An analysis of existing devices on the market and of other people's articles shows that there are several strategies, which can be used in combination.
3.1 Reliability Assurance (Ensure file consistency before and after power failure)
Strict reliability assurance. This can only be achieved with hardware support. For example, if the device has a UPS (uninterruptible power supply), then when the main power is suddenly lost the UPS raises an interrupt to the operating system and, while it keeps the system powered, the operating system starts a cleanup routine to ensure reliability.
Quasi-reliability assurance. For cost and other reasons, a commercial embedded system usually cannot include a UPS, so reliability has to be approached in software. In principle, the shorter the time spent updating a file, the lower the probability that a random power loss falls inside the update window. For small files, the main bottleneck in update speed is generally the delayed transfer from the kernel file buffer to the kernel output buffer. Therefore, after calling fflush on the file, call fsync to force that transfer immediately instead of waiting for the delayed write. Of course, this comes at the cost of efficiency.
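The fflush-then-fsync sequence can be sketched as follows. This is a minimal illustration, assuming a POSIX system; the function name write_and_sync is hypothetical. fileno is used to obtain the file descriptor that fsync needs from the stdio stream.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a small buffer to path and push it toward the
 * storage device without waiting for the kernel's delayed write.
 * Returns 0 on success, -1 on failure.
 */
static int write_and_sync(const char *path, const void *data, size_t len)
{
    FILE *fp;

    assert(path != NULL && data != NULL);
    fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    if (fwrite(data, 1, len, fp) != len)  /* data -> user-space stdio buffer */
        goto fail;
    if (fflush(fp) != 0)                  /* stdio buffer -> kernel file buffer */
        goto fail;
    if (fsync(fileno(fp)) != 0)           /* ask the kernel to write the file out now */
        goto fail;
    return fclose(fp) == 0 ? 0 : -1;

fail:
    fclose(fp);
    return -1;
}
```

The order matters: fsync only acts on data the kernel already holds, so calling it before fflush would leave the stdio buffer contents behind. Checking every return value is part of the point, since an ignored error at any stage defeats the purpose of the sequence.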
3.2 Recoverability (roll back to a correct intermediate state when an error occurs)
Hardware reset. A hardware button is provided for resetting to the factory configuration or to the most recent configuration. For example, many electronic dictionaries have a reset hole. The advantage of a hardware reset is that recovery is available with one press at any time; the disadvantage is that it is difficult to implement. Consider the startup sequence of an embedded Linux system: bootloader (equivalent to BIOS + GRUB on a PC) --> kernel --> application. A complete hardware reset should be able to recover all three stages, that is, it must verify the correctness of each of the three and then respond accordingly. If the problem is at the bootloader or kernel layer, the file recovery has to be performed without an operating system (NONOS), that is, by implementing raw operations (raw ops) ourselves. Fortunately, in many cases we only update applications, and then we can do whatever we need with the full power of the operating system.
Software reset. The difference between a software reset and a hardware reset is the trigger point. A software reset is triggered by a software self-check, while a hardware reset is generally triggered by a hardware interrupt. For unattended devices nobody is present to trigger the hardware interrupt, so software self-check and reset are required. Take the embedded Linux startup sequence mentioned earlier as an example. If each of the three stages performs a self-check at startup and, when the self-check fails, rolls back to a redundant backup of the old version (which places a small extra demand on storage), recoverability will be almost perfect. As for how to perform the self-check, the simplest approach is to verify file integrity and security. Security is largely an information-security topic, so in practice only integrity is checked. You can generate a verification file (such as an MD5 digest or a PGP signature) for each file; if the new file fails verification during the self-check, use the old backup file. Of course, there is a chicken-and-egg problem here: the self-check is itself code that has to start up, so what if the self-check code is damaged? To be safe, make sure the self-check code cannot be damaged by keeping it read-only. For example, at the application layer you can write a read-only program that is always correct and never updated, and use this program to start the other applications that may be updated.
Finally, this article reflects only my personal opinion. Other good suggestions are welcome.
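The verify-then-fall-back step can be sketched as follows. This is only an illustration under stated assumptions: CRC-32 stands in for the MD5 or PGP verification mentioned above because it fits in a few lines, the verification file is assumed to hold the expected checksum as a hexadecimal number, and all function names (crc32_update, file_crc32, read_expected_crc, choose_file) are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); start with crc = 0. */
static uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    size_t i;
    int bit;

    crc = ~crc;
    for (i = 0; i < len; i++) {
        crc ^= buf[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Compute the CRC-32 of a whole file. Returns 0 on success, -1 on error. */
static int file_crc32(const char *path, uint32_t *out)
{
    unsigned char buf[4096];
    uint32_t crc = 0;
    size_t n;
    FILE *fp;

    assert(path != NULL && out != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        crc = crc32_update(crc, buf, n);
    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    *out = crc;
    return 0;
}

/* Read the expected checksum (hex text) from the verification file. */
static int read_expected_crc(const char *sum_path, uint32_t *out)
{
    FILE *fp = fopen(sum_path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%" SCNx32, out) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Self-check: return new_path if it verifies, otherwise the old backup. */
static const char *choose_file(const char *new_path, const char *sum_path,
                               const char *backup_path)
{
    uint32_t expected, actual;

    if (read_expected_crc(sum_path, &expected) == 0 &&
        file_crc32(new_path, &actual) == 0 && actual == expected)
        return new_path;
    return backup_path;
}
```

A read-only launcher of the kind described above could call choose_file at startup and then start whichever path it returns. Any failure (missing verification file, unreadable new file, checksum mismatch) leads to the backup, so the new version is used only when it is positively verified. A CRC detects accidental corruption such as a truncated write, but unlike a signature it offers no protection against deliberate tampering.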