A simple method for quickly processing large files on Linux

Source: Internet
Author: User

1. Background

At work, a MapReduce job exported a batch of files containing paths, more than 3 million rows in total. I needed to check whether each file exists on the corresponding server, and because that server is not part of the Hadoop cluster, I decided to use a bash script. The specific methods are as follows (skip straight to Method 2; Method 1 is less efficient):

2. Methods Used

A. Method 1

I originally intended to use the following script for simple validation:

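    #!/bin/bash
    count=0
    cat oritest.txt | while IFS= read -r data
    do
        # Print a running row count as a progress indicator
        count=$(($count + 1))
        echo $count
        # Extract the path column (assumed to be the fifth field,
        # based on the data format shown below)
        dir=`echo "$data" | awk -F "\t" '{print $5}'`
        # Sort the row into one of two files depending on whether the path exists
        if [ -e "$dir" ]; then
            echo "$data" >> exist.txt
        else
            echo "$data" >> noexist.txt
        fi
    done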

The input data format is as follows (tab-separated columns):

1      name  Mark        ID  dir

At runtime, processing 5,000 rows took nearly 4 to 5 minutes (on an 8-core machine), which was clearly too slow, so I switched to a multi-process approach; see Method 2.
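Part of the per-row cost in a loop like the one in Method 1 likely comes from starting a subshell and an awk process for every line. As a rough sketch (not the approach the article itself takes), the path can be extracted with shell parameter expansion instead. The function name check_paths and its three arguments are hypothetical, and the sketch assumes the path is the last tab-separated column, as in the data format above.

```shell
#!/bin/bash
# Sketch: classify rows by whether the path in the last column exists,
# using only shell built-ins inside the loop (no per-line awk process).
# check_paths and its argument names are hypothetical, for illustration.
check_paths() {
    local input=$1 exist_out=$2 noexist_out=$3
    local tab data dir
    tab=$(printf '\t')              # a literal tab character
    while IFS= read -r data; do
        # Remove everything up to and including the last tab,
        # leaving the final column (assumed to be the path).
        dir=${data##*"$tab"}
        if [ -e "$dir" ]; then
            printf '%s\n' "$data" >> "$exist_out"
        else
            printf '%s\n' "$data" >> "$noexist_out"
        fi
    done < "$input"
}

# Example call (file names taken from the article's scripts):
# check_paths oritest.txt exist.txt noexist.txt
```

This keeps the loop single-process, so it does not replace the splitting idea in Method 2; the two can be combined. How much time it saves depends on how expensive the existence check itself is on the target file system.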

B. Method 2

The idea is to split the large file into small files and then process each small file in a background job. The script is as follows:

    #!/bin/bash
    source ~/.bashrc

    # Determine whether each path in the given file exists
    readdata() {
        cat "$1" | while IFS= read -r data
        do
            # Extract the path column (assumed to be the fifth field of the input format)
            dir=`echo "$data" | awk -F "\t" '{print $5}'`
            if [ -e "$dir" ]; then
                echo "$data" >> "exist_$1.txt"
            else
                echo "$data" >> "noexist_$1.txt"
            fi
        done
    }

    # Split the large file into small files of 10000 lines each.
    # The generated files are named xaa, xab, etc. (you can choose your own prefix)
    split -l 10000 oritest.txt

    declare -a files        # declare an array
    files=($(ls x*))        # store the names of the split files in the array

    # Iterate over the small files and process each one in the background
    for i in ${files[@]}; do
        echo $i
        readdata $i &
    done

Execution time varies with machine performance; an 8-core machine processed the 3 million rows in a little over 10 minutes.
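With 10000 lines per chunk, 3 million rows produce about 300 chunks, so launching every chunk at once starts about 300 background jobs on an 8-core machine. A possible variation, sketched below, runs the jobs in batches and uses wait so the script only finishes after all jobs are done, then merges the per-chunk results. The names check_chunk, merge_results, run_parallel, and the chunk_ prefix are hypothetical, as are the default batch size of 8 and the assumption that the path is the last tab-separated column.

```shell
#!/bin/bash
# Sketch: split the input, check chunks in batches of background jobs,
# then merge the per-chunk outputs. All function names are hypothetical.

# Check every path in one chunk; each chunk writes to its own output files,
# so concurrent jobs never append to the same file.
check_chunk() {
    local chunk=$1 tab data dir
    tab=$(printf '\t')
    while IFS= read -r data; do
        dir=${data##*"$tab"}        # last tab-separated column
        if [ -e "$dir" ]; then
            printf '%s\n' "$data" >> "exist_$chunk.txt"
        else
            printf '%s\n' "$data" >> "noexist_$chunk.txt"
        fi
    done < "$chunk"
}

# Concatenate per-chunk files of one kind (exist or noexist) into <kind>.txt
merge_results() {
    local kind=$1 f
    : > "$kind.txt"
    for f in "${kind}"_chunk_*.txt; do
        if [ -e "$f" ]; then cat "$f" >> "$kind.txt"; fi
    done
}

# Usage: run_parallel INPUT [LINES_PER_CHUNK] [JOBS_PER_BATCH]
run_parallel() {
    local input=$1 lines=${2:-10000} jobs=${3:-8}
    local chunk running=0
    split -l "$lines" "$input" chunk_   # creates chunk_aa, chunk_ab, ...
    for chunk in chunk_*; do
        check_chunk "$chunk" &
        running=$((running + 1))
        # Once a full batch is running, wait for it before starting more
        if [ "$running" -ge "$jobs" ]; then
            wait
            running=0
        fi
    done
    wait                                # wait for the final partial batch
    merge_results exist
    merge_results noexist
}

# Example call (input name taken from the article's scripts):
# run_parallel oritest.txt 10000 8
```

Waiting per batch is simple but leaves some workers idle while the slowest job in a batch finishes; it is meant as an illustration of capping concurrency, not as a tuned scheduler. Run it in an empty working directory so the chunk_ glob only matches files created by split.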
