Splitting large files with split, and using awk to split files into corresponding sub-files

Source: Internet
Author: User

When transferring or analyzing a large file, a common approach is to split the file first, process each sub-file, and merge the results if necessary.

The split command can split a file by size or by number of lines.

    • -a: Specify the suffix length

    • -b: Number of bytes per file; the units k and m may be used

    • -d: Use numeric suffixes instead of letters

    • -l: Specify the number of lines per file, default 1000
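Splitting by line count can be sketched as follows. This is a minimal illustration, not part of the original article: the file name sample.txt and the prefix part_ are made up, and the test file is generated on the spot in a temporary directory.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Generate a 2500-line test file (hypothetical name: sample.txt).
awk 'BEGIN { for (i = 1; i <= 2500; i++) print "line " i }' > sample.txt

# Split into pieces of at most 1000 lines each.
# 1000 is also the default when neither -l nor -b is given.
split -l 1000 sample.txt part_

# Count the lines in each piece: two full pieces and one remainder.
wc -l part_*
```

The pieces are named part_aa, part_ab and part_ac, following the default two-letter suffix.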

Example:

Split a file into sub-files of 20M each: -b specifies the 20M size, filename is the name of the file to split, and prefix is the prefix for each sub-file. By default the suffixes are aa, ab, ac, and so on.

split -b 20m filename prefix
prefixaa prefixab prefixac prefixad ...

Set the suffix length to 2 with -a 2, use numeric suffixes with -d, and make each file 10M with -b 10m.

split -a 2 -d -b 10m access.log haha
haha00 haha01 haha02 haha03 ...
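The article mentions merging the sub-files again when necessary. A minimal sketch of that step, assuming the pieces carry split's default letter suffixes so that the shell glob lists them in their original order; the names data.txt, chunk_ and merged.txt are illustrative only.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Generate a test file of roughly 100 KB (hypothetical name: data.txt).
awk 'BEGIN { for (i = 1; i <= 20000; i++) print i }' > data.txt

# Split into pieces of at most 40 KB each.
split -b 40k data.txt chunk_

# Reassemble: the glob expands in sorted order (chunk_aa, chunk_ab, ...),
# which matches the order in which split wrote the pieces.
cat chunk_* > merged.txt

# cmp exits with status 0 only if both files are byte-for-byte identical.
cmp data.txt merged.txt && echo "merged file matches the original"
```

Comparing the merged result against the original (or against a checksum taken before the transfer) is a cheap way to confirm that no piece was lost or reordered.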

Interview question: given two large files with the contents shown below, find the data that appears in both files.

# file a
111
222
333
444
# file b
444
555
666
222

Of course, the common lines can be found with cat a b | sort | uniq -d. If the files are too large, however, they need to be split before searching. The usual first idea is to split file a and file b into, say, 10 sub-files each, but comparing those sub-files is troublesome: every sub-file of a has to be compared with every sub-file of b, which means 100 comparisons. Is it possible to get the result with only 10 comparisons?
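For the small sample above, the one-pipeline approach can be tried directly. This sketch assumes that neither file contains repeated lines of its own, since uniq -d would otherwise report a line that is duplicated within a single file.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two sample files from the question, one value per line.
printf '%s\n' 111 222 333 444 > a
printf '%s\n' 444 555 666 222 > b

# sort brings equal lines next to each other;
# uniq -d then prints one copy of each line that occurs more than once.
common=$(cat a b | sort | uniq -d)
echo "$common"
# prints:
# 222
# 444
```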

The answer is to distribute the lines by a fixed rule, so that identical lines always land in sub-files with the same suffix. Here, for example, we use the remainder of a division: 111 % 10 is 1, so the line 111 goes into sub-file a_1. If file b also contains the line 111, it must end up in sub-file b_1. It is therefore enough to compare the corresponding sub-files of a and b, and the job is done in 10 comparisons.
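To make the rule concrete, the following sketch only prints which sub-file each sample value would be assigned to, without writing any files. The awk variable p (the file prefix) is an illustrative name, not something from the original article.

```shell
# Show the target sub-file for every value of file a.
mapping_a=$(printf '%s\n' 111 222 333 444 |
    awk -v p=a '{ print $0 " -> " p "_" ($0 % 10) }')

# Same rule applied to the values of file b.
mapping_b=$(printf '%s\n' 444 555 666 222 |
    awk -v p=b '{ print $0 " -> " p "_" ($0 % 10) }')

echo "$mapping_a"
echo "$mapping_b"
```

The value 222 maps to a_2 in one file and b_2 in the other, and 444 maps to a_4 and b_4, so each shared value can only be found in a pair of sub-files with the same suffix.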

First use awk to split each file into 10 sub-files, then compare the corresponding sub-files.

awk '{mod = $0 % 10} {print >> ("a_" mod)} {close("a_" mod)}' a
awk '{mod = $0 % 10} {print >> ("b_" mod)} {close("b_" mod)}' b
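The comparison step itself is not shown in the article. One possible sketch, run on the small sample data and assuming that no file contains repeated lines of its own; the output file name common.txt is illustrative.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two sample files, one value per line.
printf '%s\n' 111 222 333 444 > a
printf '%s\n' 444 555 666 222 > b

# Distribute the lines of each file by remainder, as described above.
# close() keeps the number of simultaneously open files small.
awk '{ f = "a_" ($0 % 10); print >> f; close(f) }' a
awk '{ f = "b_" ($0 % 10); print >> f; close(f) }' b

# Compare only sub-files that share a suffix: at most 10 comparisons.
for i in 0 1 2 3 4 5 6 7 8 9; do
    # A remainder that never occurred in one of the files has no sub-file.
    [ -f "a_$i" ] && [ -f "b_$i" ] || continue
    cat "a_$i" "b_$i" | sort | uniq -d
done > common.txt

cat common.txt
# prints:
# 222
# 444
```

Because the awk commands append with >>, stale a_* and b_* files from an earlier run would be mixed into the result, which is why the sketch starts in an empty directory.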
