Seven suggestions for processing large batches of files

Recently, a project has required working with very large numbers of files, and I took plenty of detours along the way. Processing large batches of files takes a long time, and most of that time is spent waiting anxiously. With accumulated experience, I gradually worked out some principles for processing large batches of files. Here they are:

Principle 1: Choose the command line instead of the GUI.

For example, suppose a folder holds millions of files spread across many subfolders, and you want to count all the files in it. Right-clicking the folder and opening Properties to read the file count often leaves Windows unresponsive. The alternative is to count from the command line: use the dir command in the Windows shell, or UnxUtils, a port of common Linux command-line tools to Windows, which lets you run find . -type f | wc -l for a fast count. This is not only several times faster than the GUI; there is also no unfriendly "not responding" window.
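If UnxUtils is not at hand, the same count is easy to script. Below is a minimal Python sketch; the starting path "." is just a placeholder.

```python
import os

def count_files(root):
    """Count regular files under root, recursing into subfolders."""
    total = 0
    for _dirpath, _subdirs, filenames in os.walk(root):
        total += len(filenames)
    return total

print(count_files("."))  # placeholder path
```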
Principle 2: Compress for storage and transmission.

Sometimes massive amounts of data must sit on disk or travel across the network as files. Leaving them scattered over disk blocks not only wastes space but also costs transmission time. For ordinary text files, common compression formats shrink the data to roughly 10% of its original size. For cross-platform use, pick a widely supported format such as zip or tar. Transferring one large archive also saves a lot of time compared with transferring masses of small files; even counting compression and decompression time, it is much faster than moving the scattered files directly (see the sketches after Principle 6).

Principle 3: Cache frequently used information.

If your program often needs to traverse all the files in a folder for processing, and the file set stays stable, the traversal itself costs a lot of time. In that case, maintain a file list: traverse the folder once when the list is first generated, and afterwards read the list directly instead of walking the folder again. The latter costs far less than the former. Of course, frequently used information often does change, which is where Principle 4 of this article comes in.

Principle 4: Modify incrementally.

If the folder from Principle 3 changes frequently, does that mean you must traverse everything again each time to get the latest folder information? Of course not: unless, in the worst case, almost all the files have changed, you can update only the parts that changed and avoid the overhead of recomputing from scratch. For example, if each day the folder gains that day's data and drops the data from seven days earlier, you only need to apply those two changes and keep the six days in the middle untouched (a combined sketch of Principles 3 and 4 follows below).

Principle 5: Process in parallel.

This comes from experience with download tools. If your bandwidth allows at most N KB of data per second but your current task is downloading at only a fraction of that rate, start several more tasks and download in parallel until the total volume approaches the N KB per second limit. That way the bandwidth is used to the fullest (see the sketch below). Of course, once throughput hits its bottleneck, adding more processes or threads no longer increases speed and may instead lead to resource deadlock or exhaustion.

Principle 6: Reduce I/O overhead.

Reads and writes are time-consuming, so skip any I/O that is not strictly necessary. For example, much software offers several log levels. When it is used only by ordinary users, there is no need to write out large amounts of detailed log information; if a fault needs diagnosing, an engineer can switch verbose logging on for debugging, and test engineers can likewise redirect or nohup the logs during routine tests. This not only saves I/O but also keeps enough information for error tracking.
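For Principle 2, here is a minimal sketch using Python's standard tarfile module; data/ and data.tar.gz are placeholder names.

```python
import tarfile

# Pack a whole directory tree into one compressed archive before transfer.
with tarfile.open("data.tar.gz", "w:gz") as tar:
    tar.add("data", arcname="data")

# On the receiving side, unpack it again.
with tarfile.open("data.tar.gz", "r:gz") as tar:
    tar.extractall(".")
```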
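Principles 3 and 4 combine naturally: cache the file list once, then maintain it incrementally. A minimal sketch, assuming the list fits in memory and picking JSON as the cache format (file_list.json is a made-up name):

```python
import json
import os

CACHE = "file_list.json"  # hypothetical cache location

def build_list(root):
    """The one expensive full traversal (Principle 3 pays this only once)."""
    paths = []
    for dirpath, _subdirs, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in filenames)
    return paths

def load_list(root):
    """Read the cached list if present; otherwise build and cache it."""
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            return json.load(f)
    paths = build_list(root)
    with open(CACHE, "w") as f:
        json.dump(paths, f)
    return paths

def apply_changes(paths, added, removed):
    """Principle 4: apply the day's additions and deletions instead of re-walking."""
    removed_set = set(removed)
    return [p for p in paths if p not in removed_set] + list(added)
```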
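For Principle 5, a minimal parallel-download sketch using Python's thread pool; the URLs and the worker count of 4 are placeholders to tune against your actual bandwidth.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

def fetch(url):
    """Download one URL into the current directory."""
    name = os.path.basename(url) or "index.html"
    urlretrieve(url, name)
    return name

urls = ["http://example.com/a.txt", "http://example.com/b.txt"]  # placeholders

# A few workers saturate the link; past the bottleneck, more threads only add overhead.
with ThreadPoolExecutor(max_workers=4) as pool:
    for name in pool.map(fetch, urls):
        print("done:", name)
```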
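For Principle 6, Python's standard logging module shows the idea: detailed messages are written only when the level switch is turned on.

```python
import logging

# Ordinary runs log at WARNING; flip this to logging.DEBUG only when diagnosing.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("batch")

for i in range(1_000_000):
    log.debug("processing item %d", i)  # not emitted at WARNING level, so no I/O
log.warning("batch finished")
```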
Principle 7: Choose batch processing instead of one-by-one processing.

Much software has a batch-processing function: you hand over an execution list and it works through the whole batch, with no need to drive each item yourself. For example, wget supports downloading from a given list of URLs. If you split the list yourself and pass the URLs to wget one by one, every call pays for a wget start-up and shutdown, which is pure overhead; good software finishes the whole batch within a single start-up and shutdown. So before using any software, read its manual and check whether it has a batch mode; it can get you twice the result with half the effort.
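The contrast, sketched with Python's subprocess module; it assumes wget is on the PATH and urls.txt is a placeholder URL list.

```python
import subprocess

# One by one: each call pays wget's start-up and shutdown cost.
with open("urls.txt") as f:
    for url in f:
        subprocess.run(["wget", url.strip()], check=True)

# As a batch: one wget process reads the whole list (wget's -i option).
subprocess.run(["wget", "-i", "urls.txt"], check=True)
```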

This article is from the "cainiao William" blog; please be sure to keep this source: http://williamwhe.blog.51cto.com/720802/155424
