When transferring or analyzing a large file, a common approach is to split the file first, process each sub-file, and then merge the results if necessary.
The split command can divide a file by size or by number of lines. Commonly used options:
-a: specify the suffix length
-b: specify the size of each sub-file; units such as k and M are accepted
-d: use numeric suffixes instead of letters
-l: specify the number of lines per sub-file, default 1000
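As a quick illustration of the -l option, the following sketch (using hypothetical file names sample.txt and part_) splits a 25-line file into sub-files of 10 lines each; with the default letter suffixes the outputs are part_aa, part_ab, part_ac:

```shell
# Create a 25-line sample file, then split it into 10-line sub-files.
seq 25 > sample.txt
split -l 10 sample.txt part_
# part_aa and part_ab hold 10 lines each; part_ac holds the remaining 5.
wc -l part_*
```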
Example:
Split a file into 20 MB sub-files: -b specifies the 20 MB size, filename is the input file, and prefix is the prefix of each sub-file. By default the suffixes are aa, ab, ac, and so on:
split -b 20m filename prefix
prefixaa prefixab prefixac prefixad ...
To set the suffix length to 2 (-a 2), use numeric suffixes (-d), and make each sub-file 10 MB (-b 10m):
split -a 2 -d -b 10m access.log haha
haha00 haha01 haha02 haha03 ...
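Merging the pieces back is the easy half of the workflow mentioned at the start: shell glob expansion sorts the suffixes (00, 01, 02, ...) in order, so a plain cat restores the original file. A minimal round-trip sketch, using hypothetical names big.bin and chunk_:

```shell
# Make a 1 MB test file, split it into numeric-suffix chunks of 300 KB.
head -c 1048576 /dev/urandom > big.bin
split -a 2 -d -b 300k big.bin chunk_
# Glob expansion yields chunk_00 chunk_01 chunk_02 chunk_03 in order,
# so concatenating them reproduces the original byte-for-byte.
cat chunk_?? > big.restored
cmp big.bin big.restored && echo "round trip OK"
```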
Interview question: given two large files with contents like the following, find the lines that appear in both files.
# file a
111
222
333
444
# file b
444
555
666
222
Of course, this can be done with cat a b | sort | uniq -d (assuming neither file contains internal duplicates). But if the files are too large, they must be split before searching. The straightforward approach is to split file a and file b into, say, 10 sub-files each; comparing those sub-files is very troublesome, though, because every sub-file of a must be compared against every sub-file of b, which is 10 x 10 = 100 comparisons. Can we get the result with only 10 comparisons?
The answer is to assign each line to a sub-file according to a fixed rule, so that equal lines always land in sub-files with the same suffix. For example, take the remainder modulo 10: 111 mod 10 is 1, so 111 goes into sub-file a_1; if file b also contains the line 111, it must likewise end up in sub-file b_1. Then only the corresponding sub-files of a and b need to be compared, and 10 comparisons are enough.
First use awk to separate each file into 10 sub-files, then compare them:
awk '{mod = $1 % 10; print >> ("a_" mod); close("a_" mod)}' a
awk '{mod = $1 % 10; print >> ("b_" mod); close("b_" mod)}' b
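An end-to-end sketch on the sample data from the question: bucket both files by last digit as above, then compare only the matching buckets. comm -12 prints lines common to two sorted files, so each pair is sorted first (sorting a small sub-file is cheap even when the original files are huge):

```shell
# Sample data from the interview question.
printf '111\n222\n333\n444\n' > a
printf '444\n555\n666\n222\n' > b

# Bucket each file by the remainder modulo 10; close() each target file
# after writing so we never run out of file descriptors on large inputs.
awk '{mod = $1 % 10; print >> ("a_" mod); close("a_" mod)}' a
awk '{mod = $1 % 10; print >> ("b_" mod); close("b_" mod)}' b

# Compare only buckets that exist on both sides: 10 comparisons at most.
for i in 0 1 2 3 4 5 6 7 8 9; do
    [ -f "a_$i" ] && [ -f "b_$i" ] || continue
    sort "a_$i" > "a_$i.sorted"
    sort "b_$i" > "b_$i.sorted"
    comm -12 "a_$i.sorted" "b_$i.sorted"
done
```

For this sample data only buckets 2 and 4 exist in both files, and the loop prints the common lines 222 and 444.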
In short: split divides a large file by size or line count, while awk can divide it by content into corresponding sub-files.