I have recently been using shell scripts to help a company streamline its keyword-mining process. Replacing several manual steps with shell commands has greatly improved efficiency.
Shell is very good at text processing, such as merging and deduplicating multiple files, but I recently ran into a hard problem: deduplicating two large files against each other.
There are two text files, A.txt and B.txt.
Each line of A.txt holds a keyword and its search volume, separated by a comma; the file has about 900,000 lines.
Each line of B.txt holds a single keyword; the file has about 4 million lines.
The task is to find the keywords in A.txt that also appear in B.txt.
I tried N methods, but none of the results was satisfactory. The strangest part is that some methods work on small test files yet fail as soon as they are run on the real A.txt and B.txt, and I cannot figure out why.
Method 1:
awk -F, '{print $1}' A.txt > keywords.txt
cat keywords.txt B.txt | sort | uniq -d
# First extract the keywords from A.txt, then concatenate them with B.txt, sort the result, and let uniq -d print the duplicated lines.
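One possible reason the sort | uniq -d approach above misbehaves on real data (a guess, not a confirmed diagnosis) is that uniq -d reports any line that occurs more than once in the merged stream, including a keyword that is repeated only inside B.txt. The sketch below deduplicates each file on its own first. The sample data, the temporary directory, and the file names a_keys.txt, b_keys.txt and method1.txt are made up for illustration.

```shell
# Hypothetical sample data standing in for the real A.txt and B.txt.
workdir=$(mktemp -d)
printf 'apple,120\nbanana,80\ncherry,15\n' > "$workdir/A.txt"
printf 'banana\ndurian\ndurian\napple\n' > "$workdir/B.txt"

# Deduplicate each side separately, so "durian" (repeated only in B.txt)
# cannot be mistaken for a shared keyword.
awk -F, '{print $1}' "$workdir/A.txt" | LC_ALL=C sort -u > "$workdir/a_keys.txt"
LC_ALL=C sort -u "$workdir/B.txt" > "$workdir/b_keys.txt"

# Each file now contributes a keyword at most once, so any line that
# appears twice in the merged, sorted stream is present in both files.
LC_ALL=C sort "$workdir/a_keys.txt" "$workdir/b_keys.txt" | uniq -d > "$workdir/method1.txt"
cat "$workdir/method1.txt"
```

On this sample the result is apple and banana. LC_ALL=C makes sort compare raw bytes, which is usually faster on large files and avoids locale-dependent ordering.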
Method 2:
awk -F, '{print $1}' A.txt > keywords.txt
# Extract the keywords.
comm -1 -2 keywords.txt B.txt
# Use the comm command to print the lines that exist in both files.
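comm compares its two inputs line by line and expects both to be sorted in the same collation order, which the commands above do not guarantee. A minimal sketch that sorts both sides first, assuming made-up sample data and the hypothetical file names a_sorted.txt, b_sorted.txt and method2.txt:

```shell
# Hypothetical sample data; neither file is sorted on purpose.
workdir=$(mktemp -d)
printf 'pear,40\napple,120\nbanana,80\n' > "$workdir/A.txt"
printf 'durian\nbanana\napple\n' > "$workdir/B.txt"

# Sort both inputs with the same byte-wise collation before comparing.
awk -F, '{print $1}' "$workdir/A.txt" | LC_ALL=C sort -u > "$workdir/a_sorted.txt"
LC_ALL=C sort -u "$workdir/B.txt" > "$workdir/b_sorted.txt"

# -1 hides lines found only in the first file, -2 hides lines found only
# in the second file; what remains are the lines common to both.
LC_ALL=C comm -12 "$workdir/a_sorted.txt" "$workdir/b_sorted.txt" > "$workdir/method2.txt"
cat "$workdir/method2.txt"
```

Running comm under the same LC_ALL=C setting as sort keeps the two tools in agreement about what "sorted" means.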
Method 3:
awk -F, '{print $1}' A.txt > keywords.txt
for i in $(cat keywords.txt)
do
    A=$(grep -E -c "^$i$" B.txt)
    if [ "$A" != 0 ]
    then
        echo "$i" >> repeat_keywords.txt
    fi
done
# This method is a little more complex.
# First extract the keywords, then use a for loop to match each one against B.txt (note the regular expression ^$i$). If the number of matches is not 0, the keyword is a duplicate and is written to the output file.
# The advantage of this method is that it is safe. The disadvantage is that it is painfully slow: each of the 900,000 keywords is matched against 4 million lines, the shell does not run the loop in parallel by default, and the whole thing takes far too long.
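Two details of the loop above can give wrong answers on real keywords: for ... in splits on whitespace, so a multi-word keyword is broken into pieces, and the keyword is interpreted as a regular expression, so characters such as + or . change its meaning. The following sketch is a variant that reads whole lines and matches them literally. The sample data and the file name method3.txt are assumptions, and the loop still scans B.txt once per keyword, so it is no faster.

```shell
# Hypothetical sample data with a multi-word keyword and regex metacharacters.
workdir=$(mktemp -d)
printf 'red apple,120\nc++,80\ncherry,15\n' > "$workdir/A.txt"
printf 'c++\nred apple pie\nred apple\nccc\n' > "$workdir/B.txt"

awk -F, '{print $1}' "$workdir/A.txt" > "$workdir/keywords.txt"
: > "$workdir/method3.txt"

# Read one whole line at a time; IFS= and -r keep spaces and backslashes intact.
while IFS= read -r keyword
do
    # -F: treat the keyword as a literal string, not a regular expression
    # -x: the whole line of B.txt must equal the keyword
    # -q: print nothing and stop at the first match
    # -e: keeps a keyword that starts with "-" from being read as an option
    if grep -F -x -q -e "$keyword" "$workdir/B.txt"
    then
        printf '%s\n' "$keyword" >> "$workdir/method3.txt"
    fi
done < "$workdir/keywords.txt"
cat "$workdir/method3.txt"
```

Here "red apple" and "c++" are reported, while "red apple pie" and "ccc" in B.txt are correctly ignored.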
Method 4:
awk -F, '{print $1}' A.txt > keywords.txt
cat keywords.txt B.txt | awk '!a[$1]++'
# I do not really understand how this method works. The awk command is too powerful and too advanced for me, but this method is simple and fast.
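For readers who share the author's puzzlement: in awk, a[key]++ evaluates to the count before the increment, so !a[key]++ is true only the first time a key is seen. The one-liner therefore prints each distinct entry once (a merged list with duplicates removed) rather than listing the keywords the two files share. The sketch below shows both behaviours; the sample data, the array names seen, in_a and printed, and the output file names are made up for illustration, and the second command is a swapped-in two-file technique, not the author's method.

```shell
# Hypothetical sample data standing in for the real A.txt and B.txt.
workdir=$(mktemp -d)
printf 'apple,120\nbanana,80\ncherry,15\n' > "$workdir/A.txt"
printf 'banana\ndurian\napple\nbanana\n' > "$workdir/B.txt"

# Union: '!seen[$0]++' is true only on the first occurrence of a line,
# so every distinct keyword is printed exactly once.
awk -F, '{print $1}' "$workdir/A.txt" | cat - "$workdir/B.txt" \
    | awk '!seen[$0]++' > "$workdir/union.txt"

# Intersection: NR == FNR holds only while the first file is being read,
# so the keywords of A.txt are stored as array keys. Each line of B.txt
# is then printed if it is one of those keys and has not been printed yet.
awk -F, 'NR == FNR { in_a[$1]; next }
         ($0 in in_a) && !printed[$0]++' "$workdir/A.txt" "$workdir/B.txt" \
    > "$workdir/method4.txt"
cat "$workdir/method4.txt"
```

Keying on $0 instead of $1 matters for multi-word keywords, because with the default field separator $1 is only the first word of the line. Loading the smaller file (A.txt) into the array keeps memory use down while B.txt is streamed.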
There are also approaches based on grep -v and grep -f, but I have not tried them, so they are not listed here.
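As a rough idea of what a grep -f approach might look like (the author did not try it, so this is only a sketch under assumed sample data and hypothetical output file names), grep can read all the keywords from a file as patterns and scan B.txt in a single pass:

```shell
# Hypothetical sample data standing in for the real A.txt and B.txt.
workdir=$(mktemp -d)
printf 'apple,120\nbanana,80\ncherry,15\n' > "$workdir/A.txt"
printf 'banana\ndurian\napple\npineapple\n' > "$workdir/B.txt"

awk -F, '{print $1}' "$workdir/A.txt" > "$workdir/keywords.txt"

# -f: read the patterns from keywords.txt, one per line
# -F: treat every pattern as a literal string
# -x: match whole lines only, so "apple" does not match "pineapple"
grep -F -x -f "$workdir/keywords.txt" "$workdir/B.txt" > "$workdir/in_both.txt"

# Adding -v inverts the match: lines of B.txt that are not keywords of A.txt.
grep -v -F -x -f "$workdir/keywords.txt" "$workdir/B.txt" > "$workdir/only_in_b.txt"

cat "$workdir/in_both.txt"
```

With 900,000 patterns the memory use and run time depend on the grep implementation, so it is worth timing this on a sample before running it on the full files.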