Objective
The shell is very good at text processing, such as merging several text files and removing duplicates. Recently, however, I ran into a hard problem: deduplicating two large data files against each other. The details follow.
Requirements
There are two text files, A.txt and B.txt.
Each line of A.txt holds a keyword and its search volume, separated by a comma; the file has about 900,000 lines.
Each line of B.txt holds a keyword; the file has about 4 million lines.
The task is to find the keywords in A.txt that also appear in B.txt.
I tried several approaches, but none gave a satisfactory result. The strangest part is that some methods worked on small test files but failed once they were applied to A.txt and B.txt, which left me baffled.
Approach 1:
awk -F, '{print $1}' A.txt > keywords.txt
cat keywords.txt B.txt | sort | uniq -d
# First extract the keywords from A.txt, then concatenate them with B.txt, sort the result, and use uniq -d to print only the duplicated lines
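One possible reason this approach misbehaves on real data is that uniq -d also reports a keyword that is repeated inside a single file, even if the other file never contains it. The sketch below assumes that this is the problem and removes duplicates within each file before merging. The sample data and the names workdir, a_keys.txt and b_keys.txt are hypothetical stand-ins, not part of the original setup.

```shell
# Hypothetical sample data standing in for A.txt and B.txt.
workdir=$(mktemp -d)
printf 'apple,120\nbanana,80\napple,95\ncherry,10\n' > "$workdir/A.txt"
printf 'banana\ndurian\ndurian\napple\n' > "$workdir/B.txt"

# Extract the keyword column and drop duplicates inside each file first.
awk -F, '{print $1}' "$workdir/A.txt" | LC_ALL=C sort -u > "$workdir/a_keys.txt"
LC_ALL=C sort -u "$workdir/B.txt" > "$workdir/b_keys.txt"

# Each file now lists a keyword at most once, so a line that shows up
# twice in the merged stream must be present in both files.
common=$(cat "$workdir/a_keys.txt" "$workdir/b_keys.txt" | LC_ALL=C sort | uniq -d)
printf '%s\n' "$common"
```

Without the two sort -u steps, "durian" (repeated only in B.txt) would be reported as a shared keyword. LC_ALL=C is set so that sorting compares raw bytes, which is usually faster on large files and avoids locale-dependent ordering.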
Approach 2:
awk -F, '{print $1}' A.txt > keywords.txt
# As before, extract the keywords first
comm -1 -2 keywords.txt B.txt
# Use the comm command to display the lines that appear in both files
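comm expects both input files to be sorted in the same collation order; on unsorted input it can silently miss common lines, which may explain why a small test file works and a large one does not. A minimal sketch with sorting added, using hypothetical sample data and file names (workdir, a_sorted.txt, b_sorted.txt):

```shell
# Hypothetical sample data, deliberately unsorted.
workdir=$(mktemp -d)
printf 'pear,5\napple,120\nbanana,80\n' > "$workdir/A.txt"
printf 'durian\nbanana\napple\n' > "$workdir/B.txt"

# Sort both inputs with the same fixed locale before comparing.
awk -F, '{print $1}' "$workdir/A.txt" | LC_ALL=C sort -u > "$workdir/a_sorted.txt"
LC_ALL=C sort -u "$workdir/B.txt" > "$workdir/b_sorted.txt"

# -1 and -2 suppress the lines unique to each file, leaving only the common ones.
common=$(LC_ALL=C comm -12 "$workdir/a_sorted.txt" "$workdir/b_sorted.txt")
printf '%s\n' "$common"
```

The same locale is used for sort and comm on purpose: if the two tools disagree about ordering, comm treats the input as unsorted.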
Approach 3:
awk -F, '{print $1}' A.txt > keywords.txt
for i in $(cat keywords.txt)
do
    A=$(egrep -c "^$i$" B.txt)
    if [ "$A" != 0 ]
    then
        echo "$i" >> repeat_keywords.txt
    fi
done
# This approach is a little more complicated.
# First extract the keywords, then use a for loop to match them against B.txt one by one (note how the regular expression is written: ^$i$). If the number of matches is not 0, the keyword is a duplicate, so it is written to the output file.
# The advantage of this method is that it is reliable. The disadvantage is that it is far too slow: it matches 900,000 keywords against 4 million lines, the shell does not run the loop in parallel by default, and the whole thing takes too long.
Approach 4:
awk -F, '{print $1}' A.txt > keywords.txt
cat keywords.txt B.txt | awk '!a[$1]++'
# I do not really understand how this one works; awk is too powerful and too deep for me. But this approach is simple and fast.
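For reference, the pattern !a[$1]++ uses an associative array as a counter: the first time a key is seen the counter is 0, so the negation is true and awk prints the line; on later occurrences the counter is non-zero and the line is skipped. In other words, it removes duplicates from the merged stream rather than listing the keywords shared by both files. The sketch below shows both behaviours on hypothetical sample data; the array names seen and in_a are illustrative, and the two-file lookup is a swapped-in variant, not the author's command.

```shell
# Hypothetical sample data standing in for A.txt and B.txt.
workdir=$(mktemp -d)
printf 'apple,120\nbanana,80\ncherry,10\n' > "$workdir/A.txt"
printf 'banana\ndurian\napple\nbanana\n' > "$workdir/B.txt"

# Deduplicate the merged stream: each distinct line is printed once,
# in order of first appearance.
deduped=$(awk -F, '{print $1}' "$workdir/A.txt" | cat - "$workdir/B.txt" | awk '!seen[$0]++')
printf '%s\n' "$deduped"

# List the shared keywords instead: while reading the first file
# (NR==FNR) store each A keyword as an array key; while reading B.txt,
# print a line if it is a stored key, then delete the key so that a
# keyword repeated in B.txt is printed only once.
common=$(awk -F, 'NR==FNR { in_a[$1]; next }
                  $0 in in_a { print; delete in_a[$0] }' "$workdir/A.txt" "$workdir/B.txt")
printf '%s\n' "$common"
```

Loading the smaller file into the array keeps memory use down: only the roughly 900,000 keywords from A.txt are held in memory while the 4 million lines of B.txt are streamed once.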
There are also approaches based on grep -v and grep -f, but I have not tried them, so they are not listed here.
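A minimal, untested-at-scale sketch of what the grep variant could look like, on hypothetical sample data. It assumes keywords should match as literal strings covering the whole line: -f reads the patterns from a file, -F treats them as fixed strings instead of regular expressions (so a keyword such as "c++" is safe), -x requires a whole-line match, and -v inverts the match.

```shell
# Hypothetical sample data standing in for A.txt and B.txt.
workdir=$(mktemp -d)
printf 'apple,120\nc++,80\ncherry,10\n' > "$workdir/A.txt"
printf 'c++\ndurian\napple pie\napple\n' > "$workdir/B.txt"
awk -F, '{print $1}' "$workdir/A.txt" > "$workdir/keywords.txt"

# Lines of B.txt that exactly equal one of the keywords from A.txt.
common=$(grep -Fxf "$workdir/keywords.txt" "$workdir/B.txt")
printf '%s\n' "$common"

# Keywords from A.txt that do not appear in B.txt at all.
only_in_a=$(grep -Fvxf "$workdir/B.txt" "$workdir/keywords.txt")
printf '%s\n' "$only_in_a"
```

Because of -x, "apple pie" in B.txt does not count as a match for the keyword "apple". If B.txt repeats a keyword, the first command prints it once per occurrence, so piping the result through sort -u may be needed. How well this performs with hundreds of thousands of patterns depends on the grep implementation and available memory.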
Summary
That is all for this article. I hope it is of some help in your study or work. If you have questions, feel free to leave a message.