1. Background
A MapReduce job at work exported a batch of files containing file paths, more than 3 million rows in total. Each path needs to be checked for existence on the corresponding server, and since that server is not part of the Hadoop cluster, a bash script was the natural choice. The specific methods are below (skip straight to Method 2; Method 1 is less efficient):
2. Methods Adopted
A. Method 1
I originally intended to use the following script for a simple check:
#!/bin/bash
count=0
cat oriTest.txt | while read data
do
    count=$(($count + 1))
    echo $count                                      # progress counter
    dir=`echo "$data" | awk -F"\t" '{print $NF}'`    # the path is the last tab-separated field
    if [ -e "$dir" ]; then
        echo "$data" >> exist.txt
    else
        echo "$data" >> noexist.txt
    fi
done
The original data format is as follows (tab-separated; the path is the last field):
1 name Mark ID dir
At runtime, processing 5,000 rows took nearly 4-5 minutes (on an 8-core machine), which was clearly too slow, so I switched to a multi-process approach instead; see Method 2.
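Incidentally, much of Method 1's per-row cost comes from spawning a subshell plus an awk process for every single line. Before moving on, here is a minimal fork-free sketch (not the original script; it assumes, as above, that the path is the last tab-separated field):

#!/bin/bash
# Sketch: extract the path with parameter expansion instead of forking
# awk in a subshell for every row. Assumes the path is the last
# tab-separated field, as in the sample format above.
while IFS= read -r data
do
    dir=${data##*$'\t'}          # strip everything up to the last tab
    if [ -e "$dir" ]; then
        echo "$data" >> exist.txt
    else
        echo "$data" >> noexist.txt
    fi
done < oriTest.txt

Even so, the loop is still serial, which is why the multi-process route below pays off.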
B. Method 2
The main idea is to split the large file into small files, then read each small file in a background process. The script is as follows:
#!/bin/bash
source ~/.bashrc

# Check whether the path on each line of the given file exists
readData() {
    cat $1 | while read data
    do
        dir=`echo "$data" | awk -F"\t" '{print $NF}'`    # the path is the last tab-separated field
        if [ -e "$dir" ]; then
            echo "$data" >> "exist_$1.txt"
        else
            echo "$data" >> "noexist_$1.txt"
        fi
    done
}

# Split the large file into small files of 10000 lines each;
# the generated files are named xaa, xab, etc. (a custom prefix can be given)
split -l 10000 oriTest.txt

declare -a files        # declare an array
files=($(ls x*))        # save the names of the split files into the array

# Traverse the array and check each small file in the background
for i in ${files[@]}
do
    echo $i
    readData $i &
done
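Note that the script above fires off the background jobs and exits immediately, leaving the results scattered across per-chunk files. A small follow-up sketch that could be appended to the end of the script, assuming the exist_*/noexist_* naming used above:

# Sketch: wait for all background readData jobs to finish,
# then merge the per-chunk results and clean up the split files.
# Assumes the naming used above (exist_xaa.txt, noexist_xaa.txt, ...).
wait                                      # block until every background job exits

cat exist_x*.txt   > exist.txt   2>/dev/null
cat noexist_x*.txt > noexist.txt 2>/dev/null

rm -f x?? exist_x*.txt noexist_x*.txt     # remove chunks and partial results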
Execution speed varies with machine performance; on an 8-core machine, the full 3 million rows finish in a little over 10 minutes.
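As an aside, the same split-and-fan-out pattern can also be driven with a bounded number of workers via xargs -P, instead of one unbounded background job per chunk. A hypothetical sketch (the chunk_ prefix, the 8-worker count, and the check_chunk helper are illustrative, not from the original):

#!/bin/bash
# Sketch: bounded parallelism with xargs -P instead of one background
# job per chunk. The chunk_ prefix and 8 workers are assumptions.
split -l 10000 oriTest.txt chunk_

check_chunk() {
    while IFS= read -r data; do
        dir=${data##*$'\t'}               # path is the last tab-separated field
        if [ -e "$dir" ]; then
            echo "$data" >> "exist_$1.txt"
        else
            echo "$data" >> "noexist_$1.txt"
        fi
    done < "$1"
}
export -f check_chunk                     # make the function visible to child shells

# Run at most 8 chunk checkers at a time
ls chunk_* | xargs -P 8 -I {} bash -c 'check_chunk "$1"' _ {}

Capping the worker count keeps the machine responsive when the number of chunks is much larger than the number of cores.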