Compare two UNIX text files to find new content (diff and comm commands)

Source: Internet
Author: User

This article is from http://blog.xuyuan.me/2011/03/17/unix_diff.html

I ran into a strange bug in a recent project. After checking carefully, I traced it to a single line of code. What that line does is simple: it compares two UNIX text files and prints the content that is new in the second file relative to the first. The code calls the diff command, for example:

# Content of the temp1.txt file
$> cat temp1.txt
20110224
20110225
20110228
20110301
20110302

# Content of the temp2.txt file
$> cat temp2.txt
20110228
20110301
20110302
20110303
20110304

# diff command output
$> diff temp1.txt temp2.txt
1,2d0
< 20110224
< 20110225
5a4,5
> 20110303
> 20110304

# Output only the content exclusive to temp2.txt
$> diff temp1.txt temp2.txt | grep ">" | sed 's/> //g'
20110303
20110304

As you can see, the output drops the content common to both files and keeps only the new part of temp2.txt, which is the expected result.
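The filter above matches ">" anywhere in a line, so it would also pick up content lines that happen to contain that character. A slightly stricter sketch, assuming diff's default (normal) output format, anchors both patterns to the start of the line. The sample files are recreated in a temporary directory so the snippet runs on its own; the names workdir and new_lines are illustrative only.

```shell
#!/bin/sh
# Recreate the two sample files in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 20110224 20110225 20110228 20110301 20110302 > "$workdir/temp1.txt"
printf '%s\n' 20110228 20110301 20110302 20110303 20110304 > "$workdir/temp2.txt"

# Keep only the lines diff marks as coming from the second file
# (those starting with "> "), then strip that prefix.
# diff exits with status 1 when the files differ, which is expected here.
new_lines=$(diff "$workdir/temp1.txt" "$workdir/temp2.txt" | grep '^> ' | sed 's/^> //')

printf '%s\n' "$new_lines"
rm -rf "$workdir"
```

Anchoring the patterns only makes the filter more precise. It still trusts diff to classify lines the way you expect.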

 

However, as temp1.txt grows, the diff command produces a result that differs from what is expected:

$> cat temp1.txt
20101216
20101217
20101220
20101221
20101223
20101224
20101227
20101228
20101229
20101230
20101231
20110103
20110104
20110105
20110106
20110107
20110110
20110111
20110112
20110113
20110114
20110117
20110118
20110119
20110120
20110121
20110124
20110125
20110126
20110127
20110128
20110131
20110201
20110202
20110203
20110204
20110207
20110208
20110209
20110210
20110211
20110214
20110215
20110216
20110217
20110218
20110221
20110222
20110223
20110224
20110225
20110228
20110301
20110302
20110303

$> cat temp2.txt
20110228
20110301
20110302
20110303
20110304
20110307
20110308
20110309
20110310
20110311
20110314

$> diff temp1.txt temp2.txt
1,55c1,11
< 20101216
< 20101217
< 20101220
< 20101221
< 20101223
< 20101224
< 20101227
< 20101228
< 20101229
< 20101230
< 20101231
< 20110103
< 20110104
< 20110105
< 20110106
< 20110107
< 20110110
< 20110111
< 20110112
< 20110113
< 20110114
< 20110117
< 20110118
< 20110119
< 20110120
< 20110121
< 20110124
< 20110125
< 20110126
< 20110127
< 20110128
< 20110131
< 20110201
< 20110202
< 20110203
< 20110204
< 20110207
< 20110208
< 20110209
< 20110210
< 20110211
< 20110214
< 20110215
< 20110216
< 20110217
< 20110218
< 20110221
< 20110222
< 20110223
< 20110224
< 20110225
< 20110228
< 20110301
< 20110302
< 20110303
---
> 20110228
> 20110301
> 20110302
> 20110303
> 20110304
> 20110307
> 20110308
> 20110309
> 20110310
> 20110311
> 20110314

$> diff temp1.txt temp2.txt | grep ">" | sed 's/> //g'
20110228
20110301
20110302
20110303
20110304
20110307
20110308
20110309
20110310
20110311
20110314

As you can see, the output contains the new part of temp2.txt (20110304-20110314) but also the content common to both files (20110228-20110303), so it does not match the expected result.

The man page for diff explains that it compares two files and outputs their differences as a list of changes that can convert one file into the other. However, this list is not guaranteed to be the minimal set. In the example, diff does produce a valid comparison of temp1.txt and temp2.txt, but that comparison includes lines common to both files, so the result differs from what was expected.

 

One solution is to replace diff with the comm command, for example:

$> comm -13 temp1.txt temp2.txt
20110304
20110307
20110308
20110309
20110310
20110311
20110314

The comm command compares two sorted files line by line. Usage:

comm [-123] file1 file2
-1: suppress lines unique to file1
-2: suppress lines unique to file2
-3: suppress lines that appear in both file1 and file2
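To make the three options concrete, the sketch below runs comm on the small sample files from the start of the article. Because comm expects sorted input, the files are passed through sort first as a precaution. The temporary file names and the variable names are illustrative only.

```shell
#!/bin/sh
# Recreate the two small sample files in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 20110224 20110225 20110228 20110301 20110302 > "$workdir/temp1.txt"
printf '%s\n' 20110228 20110301 20110302 20110303 20110304 > "$workdir/temp2.txt"

# comm expects sorted input, so sort copies of both files first.
sort "$workdir/temp1.txt" > "$workdir/sorted1.txt"
sort "$workdir/temp2.txt" > "$workdir/sorted2.txt"

# Each option hides one column; hiding two columns leaves exactly one.
only_in_1=$(comm -23 "$workdir/sorted1.txt" "$workdir/sorted2.txt")  # lines unique to file1
only_in_2=$(comm -13 "$workdir/sorted1.txt" "$workdir/sorted2.txt")  # lines unique to file2
in_both=$(comm -12 "$workdir/sorted1.txt" "$workdir/sorted2.txt")    # lines common to both

printf 'only in temp1.txt:\n%s\n' "$only_in_1"
printf 'only in temp2.txt:\n%s\n' "$only_in_2"
printf 'in both files:\n%s\n' "$in_both"
rm -rf "$workdir"
```

The -13 combination is the one used in this article: it hides the lines unique to the first file and the common lines, leaving only what is new in the second file.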

 

PS: the output format of diff consists mainly of the following commands:

N1 a N3,N4
N1,N2 d N3
N1,N2 c N3,N4

For example: "1,2d0", "5a4,5", and "1,55c1,11". N1 and N2 are line numbers in the first file; N3 and N4 are line numbers in the second file. "a" means add, "d" means delete, and "c" means change.
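A small sketch can show one hunk of each kind. The four files below are made up for illustration (they are not from the original article), and the variable names are hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)

printf '%s\n' a b c   > "$workdir/base.txt"
printf '%s\n' a b c d > "$workdir/added.txt"    # one line appended after line 3
printf '%s\n' a c     > "$workdir/deleted.txt"  # line 2 removed
printf '%s\n' a x c   > "$workdir/changed.txt"  # line 2 replaced

# diff exits with status 1 when the files differ, so the status is not checked.
add_hunk=$(diff "$workdir/base.txt" "$workdir/added.txt")
del_hunk=$(diff "$workdir/base.txt" "$workdir/deleted.txt")
chg_hunk=$(diff "$workdir/base.txt" "$workdir/changed.txt")

printf '%s\n\n' "$add_hunk" "$del_hunk" "$chg_hunk"
rm -rf "$workdir"
```

The three hunks start with "3a4", "2d1", and "2c2" respectively. Only the change hunk contains the "---" separator between the old and new lines.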

With the diff output, you can use the patch command to reconstruct one file from the other. For example:

$> cat temp1.txt
20110224
20110225
20110228
20110301
20110302

$> cat temp2.txt
20110228
20110301
20110302
20110303
20110304

$> diff temp1.txt temp2.txt > temp.diff

$> cat temp.diff
1,2d0
< 20110224
< 20110225
5a4,5
> 20110303
> 20110304

# Use temp.diff and temp1.txt to restore temp2.txt
$> patch -i temp.diff -o temp2_restore.txt temp1.txt
Looks like a normal diff.
Done

# After completion, the content of temp2_restore.txt matches the original temp2.txt file.
$> cat temp2_restore.txt
20110228
20110301
20110302
20110303
20110304
