This article is from http://blog.xuyuan.me/2011/03/17/unix_diff.html
I recently ran into a strange bug in a project. After checking carefully, I traced it to a single line of code. That line does something simple: it compares two UNIX text files and prints the content that is new in file 2 relative to file 1. The code calls the diff command, for example:
# Contents of temp1.txt
$> cat temp1.txt
20110224
20110225
20110228
20110301
20110302
# Contents of temp2.txt
$> cat temp2.txt
20110228
20110301
20110302
20110303
20110304
# Output of the diff command
$> diff temp1.txt temp2.txt
1,2d0
< 20110224
< 20110225
5a4,5
> 20110303
> 20110304
# Print only the content unique to temp2.txt
$> diff temp1.txt temp2.txt | grep ">" | sed 's/> //g'
20110303
20110304
As you can see, the output drops the content common to both files and prints only the new part of temp2.txt, which is the expected result.
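One fragile spot in the pipeline above is that grep ">" matches a ">" anywhere in a line, not only the marker diff puts at the start. A minimal sketch of a stricter variant, assuming a POSIX sed and using a scratch directory (the workdir and added variable names are only for illustration):

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
printf '%s\n' 20110224 20110225 20110228 20110301 20110302 > "$workdir/temp1.txt"
printf '%s\n' 20110228 20110301 20110302 20110303 20110304 > "$workdir/temp2.txt"

# sed -n turns off automatic printing; s/^> //p prints a line only when the
# "> " marker at the very start of the line was actually removed.
added=$(diff "$workdir/temp1.txt" "$workdir/temp2.txt" | sed -n 's/^> //p')
printf '%s\n' "$added"

rm -rf "$workdir"
```

This only tightens the filtering. It still depends on which lines diff decides to report as added.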
However, as temp1.txt grows, the diff command no longer produces the expected result:
$> cat temp1.txt
20101216
20101217
20101220
20101221
20101223
20101224
20101227
20101228
20101229
20101230
20101231
20110103
20110104
20110105
20110106
20110107
20110110
20110111
20110112
20110113
20110114
20110117
20110118
20110119
20110120
20110121
20110124
20110125
20110126
20110127
20110128
20110131
20110201
20110202
20110203
20110204
20110207
20110208
20110209
20110210
20110211
20110214
20110215
20110216
20110217
20110218
20110221
20110222
20110223
20110224
20110225
20110228
20110301
20110302
20110303
$> cat temp2.txt
20110228
20110301
20110302
20110303
20110304
20110307
20110308
20110309
20110310
20110311
20110314
$> diff temp1.txt temp2.txt
1,55c1,11
< 20101216
< 20101217
< 20101220
< 20101221
< 20101223
< 20101224
< 20101227
< 20101228
< 20101229
< 20101230
< 20101231
< 20110103
< 20110104
< 20110105
< 20110106
< 20110107
< 20110110
< 20110111
< 20110112
< 20110113
< 20110114
< 20110117
< 20110118
< 20110119
< 20110120
< 20110121
< 20110124
< 20110125
< 20110126
< 20110127
< 20110128
< 20110131
< 20110201
< 20110202
< 20110203
< 20110204
< 20110207
< 20110208
< 20110209
< 20110210
< 20110211
< 20110214
< 20110215
< 20110216
< 20110217
< 20110218
< 20110221
< 20110222
< 20110223
< 20110224
< 20110225
< 20110228
< 20110301
< 20110302
< 20110303
---
> 20110228
> 20110301
> 20110302
> 20110303
> 20110304
> 20110307
> 20110308
> 20110309
> 20110310
> 20110311
> 20110314
$> diff temp1.txt temp2.txt | grep ">" | sed 's/> //g'
20110228
20110301
20110302
20110303
20110304
20110307
20110308
20110309
20110310
20110311
20110314
As you can see, this time the output contains not only the part that is new in temp2.txt (20110304-20110314) but also the content common to both files (20110228-20110303), so the result does not match what was expected.
Reading the man page for diff, I found that diff compares two files and outputs their differences as a list of changes that converts one file into the other. However, this list is not guaranteed to be the smallest possible set of changes. In this example, the comparison diff produces for temp1.txt and temp2.txt is a valid change list, but it reports the lines common to both files as changed, which is why the result differs from what was expected.
One solution is to replace diff with the comm command, for example:
$> comm -13 temp1.txt temp2.txt
20110304
20110307
20110308
20110309
20110310
20110311
20110314
The comm command compares two sorted files line by line. Usage:
comm [-123] file1 file2
-1: suppress lines unique to file1
-2: suppress lines unique to file2
-3: suppress lines common to file1 and file2
So comm -13 prints only the lines unique to file2.
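The option combinations can be tried on the small files from the first example. The sketch below is only an illustration: the workdir directory, the unsorted1.txt file, and the variable names are made up for it. Because comm expects sorted input, it also shows two possible ways to handle an unsorted file: sort it first, or use grep -F -x -v -f, which treats every line of the first file as a fixed, whole-line pattern to exclude.

```shell
workdir=$(mktemp -d)
printf '%s\n' 20110224 20110225 20110228 20110301 20110302 > "$workdir/temp1.txt"
printf '%s\n' 20110228 20110301 20110302 20110303 20110304 > "$workdir/temp2.txt"

# Each option suppresses one column, so two options leave exactly one.
only_in_2=$(comm -13 "$workdir/temp1.txt" "$workdir/temp2.txt")  # unique to temp2.txt
only_in_1=$(comm -23 "$workdir/temp1.txt" "$workdir/temp2.txt")  # unique to temp1.txt
in_both=$(comm -12 "$workdir/temp1.txt" "$workdir/temp2.txt")    # common to both

# Hypothetical unsorted copy of temp1.txt.
printf '%s\n' 20110302 20110224 20110301 20110228 20110225 > "$workdir/unsorted1.txt"

# Option A: sort first, then use comm as before.
sort "$workdir/unsorted1.txt" > "$workdir/sorted1.txt"
after_sort=$(comm -13 "$workdir/sorted1.txt" "$workdir/temp2.txt")

# Option B: print the lines of temp2.txt that match no whole line of the
# unsorted file; this does not need sorted input.
via_grep=$(grep -F -x -v -f "$workdir/unsorted1.txt" "$workdir/temp2.txt")

printf 'only in temp2.txt:\n%s\n' "$only_in_2"
rm -rf "$workdir"
```

Both options give the same lines as comm -13 here because neither file contains duplicate lines. With duplicates the two approaches can differ, so which one fits depends on the data.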
PS: the output of diff mainly consists of change commands in the following forms (shown here with spaces for readability; the real output has none):
N1 a N3,N4
N1,N2 d N3
N1,N2 c N3,N4
For example, "1,2d0", "5a4,5", and "1,55c1,11". N1 and N2 are line numbers in the first file; N3 and N4 are line numbers in the second file. "a" means add, "d" means delete, and "c" means change.
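To make the format concrete, here is a small sketch that pulls the change commands out of a normal-format diff and labels each one. The regular expression and the wording of the summary lines are my own choices for illustration, not anything diff itself provides.

```shell
workdir=$(mktemp -d)
printf '%s\n' 20110224 20110225 20110228 20110301 20110302 > "$workdir/temp1.txt"
printf '%s\n' 20110228 20110301 20110302 20110303 20110304 > "$workdir/temp2.txt"

# Change commands are the only lines shaped like N1[,N2]{a|c|d}N3[,N4];
# content lines start with "< " or "> " and the separator is "---".
summary=$(
  diff "$workdir/temp1.txt" "$workdir/temp2.txt" |
  grep -E '^[0-9]+(,[0-9]+)?[acd][0-9]+(,[0-9]+)?$' |
  while read -r cmd; do
    case $cmd in
      *a*) action=add ;;
      *d*) action=delete ;;
      *c*) action=change ;;
    esac
    old=${cmd%%[acd]*}   # text before the letter: line numbers in file1
    new=${cmd##*[acd]}   # text after the letter: line numbers in file2
    printf '%s: %s (file1 lines %s; file2 lines %s)\n' "$cmd" "$action" "$old" "$new"
  done
)
printf '%s\n' "$summary"
rm -rf "$workdir"
```

For the two small files this prints one line for "1,2d0" and one for "5a4,5".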
With the diff output, you can use the patch command to turn one file into the other. For example:
$> cat temp1.txt
20110224
20110225
20110228
20110301
20110302
$> cat temp2.txt
20110228
20110301
20110302
20110303
20110304
$> diff temp1.txt temp2.txt > temp.diff
$> cat temp.diff
1,2d0
< 20110224
< 20110225
5a4,5
> 20110303
> 20110304
# Use temp.diff and temp1.txt to restore temp2.txt
$> patch -i temp.diff -o temp2_restore.txt temp1.txt
Looks like a normal diff.
done
# Afterwards, the content of temp2_restore.txt matches that of the original temp2.txt
$> cat temp2_restore.txt
20110228
20110301
20110302
20110303
20110304