Research on schemes for automatically cleaning up large numbers of files under Linux

Last Update:2017-01-13 Source: Internet

Author: User

Regularly cleaning up expired and junk files to keep file system space usage reasonable is part of a system administrator's daily work. For small and medium-sized file systems, simple system commands or scripts are enough, but cleanup becomes a daunting task for very large file systems holding hundreds of millions or even billions of files. How to determine which files need to be cleaned up, how to delete them in bulk, and how to keep the cleanup performant are problems that system administrators need to solve. This article discusses the commands and methods relevant to automatically cleaning up large numbers of files under Linux, along with best practices from real-world operation.

Requirements for automatic file cleanup
System administrators hold the most valuable asset of an enterprise in their hands: its data. Because Linux holds half of the enterprise server operating system market, Linux system administrators are among the most important asset managers. The administrator's responsibility is to make limited IT resources store the most valuable data. In 1991, when IBM launched a 3.5-inch 1 GB hard drive, an administrator could know every file on the disk and manage files by hand; with today's petabyte-scale storage devices, file management poses unprecedented challenges.
Anyone who has used Linux should be able to delete a file. But how would you carry out the following deletion tasks?
Delete all files in an entire file system that end with a specific suffix,
Delete one specified file in a file system of 1 million files,
Delete the 100,000 files created on a specified date from a file system of tens of millions of files,
Run a daily cleanup on a file system of a billion files, deleting the millions of files generated one year ago.
The rest of this article discusses strategies and methods for implementing these deletions. If the tasks above are easy for you, you can skip this article.
File system cleanup tasks can be divided into two broad categories: cleaning up expired files and cleaning up junk files.

Expired files
All data has a life cycle. The data life cycle curve tells us that data is most valuable for a period after it is produced, and that its value then decays over time. When the life cycle ends, expired files should be deleted to free storage space for valuable data.
Junk files
A running system produces a variety of temporary files: temporary files created by applications at run time, trace files generated by system errors, core dumps, and so on. Once these files have been processed they are no longer worth keeping, and they can collectively be called junk files. Cleaning up junk files promptly helps system maintenance and management and keeps the system running stably and efficiently.

Overview of automatic file cleanup
Features and methods of automatic file cleaning
To delete a file at a known absolute path, 'rm' is enough; if we know only the file name and not the path, we can locate the file with 'find' and then delete it. More generally, if we can find files according to preset criteria, we can delete them. This is the basic idea of automatic file cleanup: generate a list of files to delete based on preset conditions, and then have a periodic cleanup task perform the deletion.
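The idea of building a list from preset criteria and then deleting from it can be sketched on a small scale with 'find'. This is only an illustration for ordinary file systems, not the method used later for GPFS; the demo directory, the file names, and the 30-day threshold are assumptions, and 'touch -d' relies on GNU coreutils.

```shell
#!/bin/sh
# Sketch: generate a deletion list from preset criteria, then delete from it.
# All paths below are hypothetical demo data created in a temporary directory.
demo=$(mktemp -d)
list=$(mktemp)

touch "$demo/report.dat" "$demo/cache.tmp" "$demo/session.tmp"
# Make one file look 40 days old (GNU touch).
touch -d '40 days ago' "$demo/old.dat"

# Criteria: name ends in .tmp, OR last modified more than 30 days ago.
find "$demo" -type f \( -name '*.tmp' -o -mtime +30 \) > "$list"

# Delete every path in the list, one path per line.
while IFS= read -r path
do
    rm -- "$path"
done < "$list"

remaining=$(find "$demo" -type f | wc -l)
echo "files left after cleanup: $remaining"

rm -rf "$demo" "$list"
```

Keeping the list in a file, rather than piping 'find' straight into 'rm', mirrors the two-phase structure the article describes: the scan and the deletion can be timed, split, and scheduled separately.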
Expired files share a common marker, the timestamp. Depending on the file system, this may be the file creation time, access time, expiration time, or another time attribute. Because expired files mostly live on archive systems, they tend to be numerous; on larger systems, hundreds of thousands or even millions of files may expire each day. With so many files, scanning the file system and generating the file list takes a long time, so cleanup performance has to be considered for this kind of cleanup.
Junk files may be stored in specific directories, may end with a particular suffix, or may be zero-size or oversized files produced by system errors. They are generally few in number but varied in kind, so the situation is more complex. The system administrator needs to draw on experience to define detailed query conditions, scan periodically, generate a file list, and then process it further.
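As a rough sketch of such query conditions, the following combines three typical junk-file tests in one 'find' expression. The size threshold (1024 KiB), the 'core.*' name pattern, and the demo files are assumptions chosen for illustration; real conditions depend on the system.

```shell
#!/bin/sh
# Sketch: list junk-file candidates by size and by name pattern.
demo=$(mktemp -d)

: > "$demo/empty.log"                      # zero-byte file
printf 'data\n' > "$demo/app.log"          # ordinary file, should be kept
printf 'x' > "$demo/core.1234"             # stand-in for a core dump
# 2 MiB file as a stand-in for an oversized trace file.
dd if=/dev/zero of="$demo/huge.trc" bs=1024 count=2048 2>/dev/null

# Zero-size files, files larger than 1024 KiB, or files named core.*
junk=$(find "$demo" -type f \( -size 0 -o -size +1024k -o -name 'core.*' \) | sort)
junk_count=$(printf '%s\n' "$junk" | wc -l)

printf '%s\n' "$junk"
echo "candidates: $junk_count"

rm -rf "$demo"
```

The output is only a candidate list; reviewing it before deleting anything matches the "further processing" step described above.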

Introduction to related Linux commands
Common file system management commands include 'ls', 'rm', 'find', and so on. Because these are everyday system administration commands, they are not described again here; see the command help or a Linux manual for detailed usage. Large-scale file systems are usually stored on dedicated file systems, which provide their own commands for file system management. The practice section of this article uses IBM's GPFS file system as an example, so here is a brief introduction to several GPFS file system management commands.
mmlsattr
This command is mainly used to view the extended attributes of files in a GPFS file system, such as storage pool information and expiration time.
mmapplypolicy
GPFS manages files through policies. This command performs a variety of operations on a GPFS file system according to a user-defined policy file, and it is very efficient.

The difficulty of automatic cleaning of mass files
Linux file deletion mechanism
Linux controls file deletion through link counts: a file is deleted only when no links to it remain. Each file has two link counters, i_count and i_nlink. i_count is the number of current users of the file and i_nlink is the number of links on the storage medium; put another way, i_count is an in-memory reference counter and i_nlink is an on-disk reference counter. i_count increases when a process references the file, and i_nlink increases when a hard link to the file is created.
What 'rm' does is decrease i_nlink. This raises a question: what happens if a process is using a file when the user runs 'rm' on it? After the 'rm', 'ls' and other file management commands can no longer find the file, but the process continues to run normally and can still read the file's contents. This is because 'rm' only reduces i_nlink (to 0 if this was the last hard link); since the file is still held by the process, i_count is not 0, so the system does not actually delete the file. i_nlink reaching 0 is necessary for deletion but not sufficient: the file is actually deleted only when i_count has also dropped to 0.
When deleting a single file we may not need to care about this mechanism at all, but it is a very important factor when deleting files in bulk, as later sections explain in detail. For now, keep the Linux file deletion mechanism in mind.
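The behavior can be observed from a shell. The sketch below is a small demonstration, not part of the cleanup procedure: it watches the hard link count with 'stat -c %h' (GNU coreutils), then removes a file while a file descriptor is still open on it. The file contents and variable names are illustrative.

```shell
#!/bin/sh
# Sketch: a removed file stays readable while a process holds it open.
f=$(mktemp)
echo "still readable" > "$f"

# A hard link raises the on-disk link count; removing it lowers it again.
ln "$f" "$f.hardlink"
links_two=$(stat -c %h "$f")
rm "$f.hardlink"
links_one=$(stat -c %h "$f")

# Open the file on descriptor 3, then remove its last name.
exec 3< "$f"
rm "$f"
if [ -e "$f" ]; then name_gone=no; else name_gone=yes; fi

# The name is gone, but the open descriptor still reaches the data.
content=$(cat <&3)
exec 3<&-      # closing the descriptor lets the system free the file

echo "link counts: $links_two then $links_one"
echo "name gone: $name_gone, content: $content"
```

Only after the descriptor is closed does the space actually return to the file system, which is why a long-running process can keep "deleted" files alive.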

Generate a list to delete
When a folder contains 10 files, 'ls' shows them at a glance, and 'ls -alt' can even show the detailed attributes of every file. With 100 files, 'ls' is probably only good for looking at file names. With 1,000 files, paging through a few screens may still be acceptable. What about 10,000? 'ls' may take a long while to return. At 100,000 files, most systems may not respond at all, or a command such as 'ls *' fails with "Argument list too long". 'ls' is not the only command affected; other common Linux administration commands run into similar problems, because the system limits the total length of a command's arguments. Even if that limit can be raised, doing so does not make the command run any faster. For a very large file system, the time it takes common file management commands such as 'ls' and 'find' to return is unacceptable.
So how can we generate a list of files to delete on a much larger file system? A high-performance file system index is a good approach, but only a few have that capability (which also explains why Google and Baidu are so profitable). Fortunately, file counts of this scale usually occur only on high-performance file systems, which provide very powerful file management functions. For example, 'mmapplypolicy' of the IBM General Parallel File System (GPFS), mentioned earlier, scans the whole file system quickly by scanning inodes directly, and can return a list of files that match specified criteria. The practice section below demonstrates how to obtain a file list based on timestamps and file types.
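On an ordinary file system, one common way around the argument length limit is to let 'find' pass paths to 'rm' in batches instead of expanding a wildcard on the command line. The sketch below uses a hypothetical directory of 2,000 demo files; it avoids the length limit but, as noted above, does nothing for scan speed.

```shell
#!/bin/sh
# Sketch: delete many files without building one huge command line.
demo=$(mktemp -d)

# Create 2,000 empty demo files.
i=0
while [ "$i" -lt 2000 ]
do
    : > "$demo/f$i.tmp"
    i=$((i + 1))
done

# The limit on the combined size of arguments and environment, in bytes.
getconf ARG_MAX

# "-exec ... {} +" makes find invoke rm with as many paths as fit per call.
find "$demo" -type f -name '*.tmp' -exec rm -- {} +

left=$(find "$demo" -type f | wc -l)
echo "files left: $left"
rmdir "$demo"
```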

The effect of deadlocks on file deletion performance
Consider a system that runs a scheduled file deletion task every day: it first generates the list of files to delete and then deletes the files using that list as input. Suppose that one day the list is especially large, so that the first day's deletion task has not finished when the second day's task starts. What happens?
Files that the first day's task has not yet deleted appear again in the second day's list, and the second day's deletion process takes that list as input and tries to delete them. Now the first day's process and the second day's process both try to delete the same files; the system throws a large number of unlink failed errors, and deletion performance suffers badly. The performance drop means the second day's files are not all deleted either, the third day's process makes the deletion deadlock worse, and the system falls into a vicious circle of degrading deletion performance.
Can the problem be solved by simply deleting the pending list generated on the first day? No. As described in the Linux file deletion mechanism above, deleting the first day's list file only brings its i_nlink to zero. While the first day's deletion process is still running and has the list open, the file's i_count is not zero, so the file is not really deleted. Only when that process has worked through every file in the list and exits is the first day's list file actually deleted.
At a minimum, we need to terminate any other file deletion processes in the system before starting a new one, to make sure the deletion deadlock cannot occur. But this still has drawbacks. In the extreme case where, for a period of time, the deletion process cannot finish its task within one cycle, the list of files to delete keeps growing and the file scan takes longer, which squeezes the time left for the deletion process and leads to another vicious circle.
Practical experience also tells us that the deletion process performs worse when the list is especially large, while an input file of appropriate size lets the process run efficiently. Splitting the list into a series of fixed-size files therefore keeps deletion stable and efficient. Moreover, where storage and host performance allow, splitting into multiple files lets us run several deletion processes concurrently.
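A related safeguard, different from terminating the older process, is to have each run take a lock and skip its work if a previous run still holds it. The sketch below uses 'mkdir' as the lock because directory creation is atomic; the lock path and function names are hypothetical, and the three "runs" are simulated in one script purely to show the outcome.

```shell
#!/bin/sh
# Sketch: prevent two cleanup runs from overlapping with a lock directory.
# A real job would use a fixed, well-known lock path; a temporary parent
# directory is used here only so the demo does not collide with anything.
lockparent=$(mktemp -d)
lockdir="$lockparent/trash_clear.lock"

acquire_lock() {
    # mkdir fails if the directory already exists, so only one caller wins.
    mkdir "$lockdir" 2>/dev/null
}

release_lock() {
    rmdir "$lockdir"
}

# First run: gets the lock.
if acquire_lock; then first=started; else first=skipped; fi

# Second run begins while the first still holds the lock: it must skip.
if acquire_lock; then second=started; else second=skipped; fi

# First run finishes and releases the lock; a later run can start again.
release_lock
if acquire_lock; then third=started; release_lock; else third=skipped; fi

echo "$first $second $third"
rmdir "$lockparent"
```

Skipping a run leaves the backlog problem described above untouched, and a crashed run can leave a stale lock behind, so this only addresses the case of two processes deleting the same files.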
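Splitting the list and running several deletion processes at once can be sketched as follows. The chunk size of 10 lines, the two concurrent workers, and the demo paths are assumptions for illustration; 'split -d', 'xargs -d', and 'xargs -P' are GNU options.

```shell
#!/bin/sh
# Sketch: split a deletion list into fixed-size sub-lists and process
# the sub-lists concurrently.
demo=$(mktemp -d)
mkdir "$demo/files" "$demo/lists"

# 25 empty demo files to delete.
i=0
while [ "$i" -lt 25 ]
do
    : > "$demo/files/f$i"
    i=$((i + 1))
done
find "$demo/files" -type f > "$demo/trash.lst"

# 10 paths per sub-list, numeric 4-digit suffixes.
split -l 10 -d -a 4 "$demo/trash.lst" "$demo/lists/trash_split_"
chunks=$(ls "$demo/lists" | wc -l)

# Up to 2 workers at a time; each worker deletes the paths in one sub-list.
printf '%s\n' "$demo"/lists/trash_split_* |
    xargs -d '\n' -n 1 -P 2 sh -c 'xargs -d "\n" rm -- < "$1"' sh

left=$(find "$demo/files" -type f | wc -l)
echo "sub-lists: $chunks, files left: $left"
rm -rf "$demo"
```

Because each path appears in exactly one sub-list, the concurrent workers never compete for the same file, which avoids the unlink failures described earlier.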

Best practices for automatic cleaning of mass files
Best practices for automatic cleanup of massive numbers of files on the GPFS file system
The following is a file cleanup practice on a GPFS file system holding hundreds of millions of files. The hardware environment is two IBM x3650 servers and a DS4200 disk array with 50 TB of storage capacity, running Linux and GPFS v3.2. The goal is to run file cleanup every day at 2:00 AM, deleting files that have not been accessed in the last 30 days and all files ending in .tmp.
The mmapplypolicy scan results show that the system holds 323,784,950 files and 158,696 directories.

The code is as follows:

.............
[I] Directories scan: 323784950 files, 158696 directories,
0 other objects, 0 'skipped' files and/or errors.
.............

Define the lookup rules as follows and save them as trash_rule.txt:

The code is as follows:

RULE EXTERNAL LIST 'trash_list' EXEC ''
RULE 'exp_scan_rule' LIST 'trash_list' FOR FILESET ('data')
WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 30
RULE 'tmp_scan_rule' LIST 'trash_list' FOR FILESET ('data') WHERE NAME LIKE '%.tmp'

Run mmapplypolicy and use grep and awk to generate the complete list of files to delete, then use split to break the complete list into sub-lists of 10,000 files each:

The code is as follows:

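# Keep only the path column of each matching line of the scan output.
mmapplypolicy /data -P trash_rule.txt -L 3 | grep "/data" | awk '{print $1}' > trash.lst
# One path per line, 10,000 lines per sub-list, 4-digit numeric suffixes.
split -a 4 -l 10000 -d trash.lst trash_split_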

Run the following commands to perform the deletion:

The code is as follows:

# Delete the files named in each sub-list, one sub-list at a time.
for a in trash_split_*
do
    rm $(cat "$a")
done

Save the steps above as trash_clear.sh, then define the crontab task as follows:

The code is as follows:

0 2 * * * /path/trash_clear.sh

When the deletion task is run manually, the scan for files to delete reports the following:

The code is as follows:

[I] GPFS Policy Decisions and File Choice Totals:
Chose to migrate 0KB: 0 of 0 candidates;
Chose to premigrate 0KB: 0 candidates;
Already co-managed 0KB: 0 candidates;
Chose to delete 0KB: 0 of 0 candidates;
Chose to list 1543192KB: 1752274 of 1752274 candidates;
0KB of chosen data is illplaced or illreplicated;

During the deletion, the following command can be used to calculate how many files are deleted per minute. In the output below, inode usage changes by 1,546 in 60 seconds, so the deletion rate is 1,546 files per minute:

The code is as follows:

df -i /data; sleep 60; df -i /data
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/data 2147483584 322465937 1825017647 16% /data
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/data 2147483584 322467483 1825016101 16% /data
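The rate calculation, and a rough estimate of the total run time it implies, can be worked through with shell arithmetic. The IUsed values and the candidate count are taken from the outputs in this article; the variable names are illustrative, and the estimate assumes the rate stays constant for the whole run.

```shell
#!/bin/sh
# Sketch: deletion rate from two "df -i" samples, and the implied run time.
used_first=322465937     # IUsed in the first sample
used_second=322467483    # IUsed in the second sample, 60 seconds later
candidates=1752274       # files chosen for the list by mmapplypolicy

# Size of the change between the two samples, regardless of direction.
delta=$(( used_second - used_first ))
if [ "$delta" -lt 0 ]; then
    delta=$(( -delta ))
fi

rate_per_min=$delta                           # samples are one minute apart
eta_min=$(( candidates / rate_per_min ))      # integer minutes, rounded down

echo "rate: $rate_per_min files per minute"
echo "estimated run time: $eta_min minutes"
```

At this rate the list of 1,752,274 files needs a little under 19 hours, which is in line with the measured run time reported next.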

Timing the deletion with the 'time' command shows that this run took 1,168 minutes (about 19.5 hours):

The code is as follows:

time trash_clear.sh

real 1168m0.158s
user 57m0.168s
sys 2m0.056s

Of course, GPFS itself provides other ways to clean up files; for example, deletion can be performed directly through mmapplypolicy, which may make cleanup more efficient. The purpose of this article is to discuss a general method for large-scale file cleanup, so cleanup based on file-system-specific functionality is not discussed further here; interested readers can try it.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
