Transferred from: http://longriver.me/?p=57
Method 1:
Processing a file with millions of lines in a single process is slow. With awk and the modulo operator you can split the work across several worker processes and make full use of a multi-core CPU:
# Start 5 background jobs; worker i handles the lines where NR % 5 == i.
for ((i=0; i<5; i++)); do
    cat query_ctx.20k | awk 'NR%5=='$i | wc -l 1> output_$i 2> err_$i &
done
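The same pattern works for real per-line processing rather than just counting lines. Here is a minimal sketch (not from the original post): it passes the shard index into awk with -v instead of splicing it into the program text, waits for all workers, then merges their output. The worker count and the toupper transformation are placeholder assumptions.

#!/bin/bash
N=5
INPUT=query_ctx.20k           # same input file as above

for ((i=0; i<N; i++)); do
    # -v hands the shard index to awk, avoiding quoting tricks in the program string
    awk -v n=$N -v k=$i 'NR % n == k { print toupper($0) }' "$INPUT" \
        1> part_$i.out 2> part_$i.err &
done

wait                          # block until every background worker has finished
cat part_*.out > merged.out   # shard outputs are concatenated, not in original line order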
Method 2:
Alternatively, you can divide the large file up front, either with split or by hashing on a key. The drawback of this approach is that the large file needs a preprocessing pass: the splitting itself runs as a single process and is time-consuming in its own right.
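The post only shows the hash-key variant; for the split route the standard split utility is enough on its own (the file name, chunk size, and prefix below are arbitrary examples):

# Cut bigfile.txt into pieces of 1,000,000 lines each, named chunk-aa, chunk-ab, ...
split -l 1000000 bigfile.txt chunk-

The hash-key variant, using two of the author's own helpers (awk_c and tools/default), looks like this: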
infile=$1
opdir=querys
opfile=res
s=`date "+%s"`                       # record the start time (seconds since epoch)

while read line
do
    imei=`./awk_c "$line"`           # extract the key (an IMEI) from the line
    no=`./tools/default $imei 1000`  # map the key to one of 1000 buckets
    echo $line >> $opdir/$opfile-$no # append the line to its bucket file
done < $infile
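Since awk_c and tools/default are the author's own binaries, here is an equivalent sketch built only from standard tools. It assumes, purely for illustration, that the key is the first whitespace-separated field and uses cksum as the hash:

#!/bin/bash
infile=$1
opdir=querys
opfile=res
nbuckets=1000
mkdir -p "$opdir"

while read -r line
do
    key=$(echo "$line" | awk '{print $1}')                            # key = first field (assumption)
    no=$(( $(echo -n "$key" | cksum | cut -d' ' -f1) % nbuckets ))    # checksum -> bucket id
    echo "$line" >> "$opdir/$opfile-$no"
done < "$infile"

Like the original, this forks several processes per line; folding the whole loop into a single awk program would be much faster, but the structure above stays close to the post.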
Method 3:
This method extends Method 2. After the preprocessing step, a shell script can run several worker processes in parallel. To keep the workers from garbling each other's output you can use locks, or partition the work by file name. The example below makes clever use of the mv operation: because only one of several concurrent mv calls on the same source file can succeed, mv acts as a mutex. This makes scaling up very flexible: as long as the machine has spare resources, you can start additional worker processes at any time without corrupting the output.
output=hier_res
input=dbscan_res
prefix1=tmp-
prefix2=res-

for file in `ls $input/res*`
do
    tmp=`echo ${file#*-}`            # shard id: the part of the file name after the first '-'
    ofile1=${prefix1}${tmp}          # tmp-<id>: shard is being processed
    ofile2=${prefix2}${tmp}          # res-<id>: shard is finished
    if [ ! -f $output/$ofile1 -a ! -f $output/$ofile2 ]; then
        # Claim the shard: only one of several concurrent mv calls on the same
        # source file can succeed, so the mv below serves as the mutex.
        touch $output/aaa_$tmp
        mv $output/aaa_$tmp $output/$ofile1
        if [ $? -eq 0 ]
        then
            echo "dealing " $file
            cat $file | python hcluster.py 1> $output/$ofile1 2> hier.err
            mv $output/$ofile1 $output/$ofile2   # mark the shard as done
        fi
    fi
done
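Because the mv claim is what serializes access to each shard, scaling out is just a matter of launching more copies of the script. A minimal usage sketch, assuming the loop above is saved as worker.sh (a file name chosen here, not taken from the post):

# Start four workers; each claims unprocessed shards via the mv trick,
# so no res-* file is ever processed twice.
for i in 1 2 3 4; do
    bash worker.sh &
done
wait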
Big Data series: shell script processing (1), a summary of methods