Transferred from: http://longriver.me/?p=57
Method 1:
Processing a large file (millions of lines) in a single process is slow. You can use awk with the modulo operator to split the work across several processes and take full advantage of a multi-core CPU.
for ((i=0; i<5; i++)); do
    cat query_ctx.20k | awk 'NR%5=='$i | wc -l 1> output_$i 2> err_$i &
done
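The loop above only launches the five background jobs. A minimal follow-up sketch, not in the original post, that waits for the shards and sums their counts (it assumes each output_$i holds a single line count):

# wait for all background awk jobs, then add up the per-shard counts
wait
cat output_* | awk '{s += $1} END {print s}'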
Method 2:
Alternatively, you can use the split command, or hash on a key, to divide the large file. The drawback of this approach is that the large file has to be preprocessed first, and the splitting step itself runs in a single process, which is also time-consuming.
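The script below shows only the hash-key variant. For the split-command variant, a minimal sketch (the input file name, chunk size, and prefix are assumptions, not from the original):

# cut the input into 100k-line chunks named part-aa, part-ab, ...
split -l 100000 query_ctx.20k part-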
infile=$1
opdir=querys
opfile=res
s=`date "+%s"`
while read line
do
    # ./awk_c is the author's helper tool; it extracts the imei key from the line
    imei=`./awk_c "$line"`
    # ./tools/default presumably maps the imei into one of 1000 buckets
    no=`./tools/default $imei 1000`
    echo $line >> $opdir/$opfile-$no
done < $infile
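The two helper binaries are specific to the author's environment. If you only need the bucketing behavior, a rough stand-in using standard tools might look like this (the cksum-based hash and the assumption that the key is the first field are mine):

while read line
do
    imei=`echo "$line" | cut -f1`                                  # assume the key is the first field
    no=$(( `echo -n "$imei" | cksum | cut -d' ' -f1` % 1000 ))     # hash the key into 1000 buckets
    echo "$line" >> querys/res-$no
done < "$1"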
Method 3:
This method extends Method 2: after the preprocessing step, a shell script can run several worker processes in parallel. To keep the processes from corrupting each other's output, you can either take a lock or partition the work by file-naming convention. The example below makes clever use of the mv operation: because mv is effectively atomic, it acts as a mutex, which makes scaling up very flexible. As long as the machine has spare resources, you can start additional worker processes at any time without corrupting the output.
output=hier_res
input=dbscan_res
prefix1=tmp-
prefix2=res-
for file in `ls $input/res*`
do
    tmp=`echo ${file#*-}`
    ofile1=${prefix1}${tmp}
    ofile2=${prefix2}${tmp}
    if [ ! -f $output/$ofile1 -a ! -f $output/$ofile2 ]; then
        touch $output/aaa_$tmp
        mv $output/aaa_$tmp $output/$ofile1
        if [ $? -eq 0 ]
        then
            echo "dealing "$file
            cat $file | python hcluster.py 1> $output/$ofile1 2> hier.err
            mv $output/$ofile1 $output/$ofile2
        fi
    fi
done
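Because each worker claims an input file through the mv check before processing it, several copies of the script can run side by side. A minimal launch sketch (the script name worker.sh and the worker count are assumptions):

for i in `seq 1 8`
do
    bash worker.sh 1> worker_$i.log 2>&1 &   # each worker skips files already claimed by another
done
wait   # block until every worker has finished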
This post is Part 1 of the Big Data series: a summary of shell script processing methods.