[Python] File and directory comparison tools: filecmp and difflib

Source: Internet
Author: User


In operations work, you often need to compare an application's directory structure across two environments (were files or directories added or removed?) and to compare the contents of same-named files in those environments (what changed inside a file?). Python ships with two standard-library modules that handle this well: filecmp and difflib. The former compares directory structures and tells you whether two files are the same; the latter shows exactly how the contents of two files differ. Used together, they cover a complete comparison.


filecmp provides several functions for comparing the structure of two directories and for checking whether files are the same. For example:

filecmp.cmp(f1, f2[, shallow]) checks whether two files are the same. shallow can be True or False and defaults to True. When it is True, the files' os.stat() signatures (file type, size, and modification time) are compared first, and files with identical signatures are treated as equal without reading their contents; otherwise the contents are compared. The function returns True or False to tell the caller the result.
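As a minimal sketch of the call described above, the snippet below compares a few throwaway files. The helper name files_equal and the file names are made up for illustration; only filecmp.cmp itself comes from the standard library.

```python
import filecmp
import os
import tempfile


def files_equal(path_a, path_b, check_contents=True):
    """Return True if filecmp considers the two files equal.

    check_contents=True maps to shallow=False, which forces a content
    comparison instead of trusting identical os.stat() signatures.
    """
    return filecmp.cmp(path_a, path_b, shallow=not check_contents)


# Demo on temporary files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    c = os.path.join(tmp, "c.txt")
    for path, text in ((a, "hello\n"), (b, "hello\n"), (c, "world\n")):
        with open(path, "w") as f:
            f.write(text)
    same_ab = files_equal(a, b)
    same_ac = files_equal(a, c)

print(same_ab, same_ac)
```

Passing shallow=False is the safer choice when two files may have the same size and timestamp but different contents.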

filecmp.cmpfiles(d1, d2, common[, shallow]) compares same-named files in two directories. common takes a list or tuple naming the files to compare. The function returns three lists of file names: those that match, those that differ, and those that could not be compared.
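The three returned lists can be seen in a small sketch like the following. The directory and file names are invented for the demo, and shallow=False is chosen here so that the result depends only on file contents.

```python
import filecmp
import os
import tempfile


def write(path, text):
    """Create a small text file (helper for this demo only)."""
    with open(path, "w") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    d1 = os.path.join(tmp, "d1")
    d2 = os.path.join(tmp, "d2")
    os.mkdir(d1)
    os.mkdir(d2)
    write(os.path.join(d1, "same.txt"), "same\n")
    write(os.path.join(d2, "same.txt"), "same\n")
    write(os.path.join(d1, "file.txt"), "old\n")
    write(os.path.join(d2, "file.txt"), "new\n")
    write(os.path.join(d1, "only1.txt"), "x\n")  # missing from d2

    match, mismatch, errors = filecmp.cmpfiles(
        d1, d2, ["same.txt", "file.txt", "only1.txt"], shallow=False
    )

print("match:", match)
print("mismatch:", mismatch)
print("errors:", errors)
```

A name listed in common that is missing from one of the directories ends up in the third list rather than raising an exception.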

In addition to the two module-level functions above, filecmp also has a dircmp class for more complete comparison work.

The dircmp class is defined as follows:

    dircmp(a, b, ignore=None, hide=None)
      a and b are directories.
      ignore is a list of names to ignore,
        defaults to ['RCS', 'CVS', 'tags'].
      hide is a list of names to hide,
        defaults to [os.curdir, os.pardir].

After constructing a dircmp object d, you can call the following methods to print comparison information.

d.report() compares only the contents of the two directories themselves and does not descend into their subdirectories. The output looks something like this:

diff Testdir1 Testdir2
Only in Testdir1 : ['SubDir1']
Only in Testdir2 : ['SubDir2']
Identical files : ['Same.txt']
Differing files : ['file.txt']

The output above falls into four categories: names found only in directory a, names found only in directory b, files with the same name and the same content (identical), and files with the same name but different content (differing). The report can also list common subdirectories and files that could not be compared for some reason (for example, because a file could not be read).


d.report_partial_closure() compares the contents of the current directories and of their immediate common subdirectories.

d.report_full_closure() recursively compares all common subdirectories. In the formatted output, the results are listed per subdirectory.

These three methods are filecmp's way of printing nicely formatted output for us. If you need the raw comparison results, use the other attributes of the dircmp class. For example:

left_list: the subdirectories and files in directory a, equivalent to os.listdir(a) minus the names filtered out by hide and ignore

right_list: the subdirectories and files in directory b

left_only / right_only: subdirectories and files that exist only in directory a / b

common: subdirectories and files with the same name in both directories

common_files: files with the same name in both directories; the contents are not necessarily the same

common_dirs: subdirectories with the same name in both directories; the contents are not necessarily the same

common_funny: names that exist in both directories but cannot be compared

same_files: files with the same name and the same content in both directories

diff_files: files with the same name but different content

funny_files: files with the same name in both directories that could not be compared

Each of these attributes is a list of file or directory names.
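The attributes listed above can be read directly, as in the rough sketch below. The directory layout and the helper names summarize and write are assumptions made up for the demo. One caveat worth knowing: dircmp compares files shallowly by default (by os.stat() signature), so the demo deliberately gives the differing files different sizes.

```python
import filecmp
import os
import tempfile


def summarize(dir_a, dir_b):
    """Collect the main dircmp result lists into a plain dictionary."""
    d = filecmp.dircmp(dir_a, dir_b)
    return {
        "left_only": sorted(d.left_only),
        "right_only": sorted(d.right_only),
        "common_dirs": sorted(d.common_dirs),
        "same_files": sorted(d.same_files),
        "diff_files": sorted(d.diff_files),
        "funny_files": sorted(d.funny_files),
    }


def write(path, text):
    """Create a small text file (helper for this demo only)."""
    with open(path, "w") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    d1 = os.path.join(tmp, "testdir1")
    d2 = os.path.join(tmp, "testdir2")
    os.makedirs(os.path.join(d1, "subdir1"))
    os.makedirs(os.path.join(d2, "subdir2"))
    os.makedirs(os.path.join(d1, "shared"))
    os.makedirs(os.path.join(d2, "shared"))
    write(os.path.join(d1, "same.txt"), "same\n")
    write(os.path.join(d2, "same.txt"), "same\n")
    write(os.path.join(d1, "file.txt"), "old\n")
    write(os.path.join(d2, "file.txt"), "new and longer\n")
    summary = summarize(d1, d2)

for key, names in summary.items():
    print(key, names)
```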

Note that a dircmp object only compares the current directory level by default; comparing descendant directories recursively takes a little extra work. You can use dircmp's subdirs attribute for this.

subdirs is a dictionary. Its keys are the names of the subdirectories common to both directories of the current dircmp object (that is, subdirs.keys() is equivalent to common_dirs), and the value for each key is another dircmp object that compares the two same-named subdirectories. In other words, you can compare recursively by repeatedly following the subdirs attribute. In this application I did not use subdirs; I took the slightly more roundabout route of checking common_dirs myself, but the principle is the same. I only thought to write this down afterwards.
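A recursive walk over subdirs might look like the sketch below. This is not the author's code (which used common_dirs instead); the function name iter_differences and the directory layout are hypothetical.

```python
import filecmp
import os
import tempfile


def iter_differences(dcmp, prefix=""):
    """Yield (kind, relative_path) pairs for dcmp and all common subdirectories."""
    for name in dcmp.left_only:
        yield "left_only", os.path.join(prefix, name)
    for name in dcmp.right_only:
        yield "right_only", os.path.join(prefix, name)
    for name in dcmp.diff_files:
        yield "differs", os.path.join(prefix, name)
    # subdirs maps each common subdirectory name to its own dircmp object.
    for name, sub in dcmp.subdirs.items():
        yield from iter_differences(sub, os.path.join(prefix, name))


def write(path, text):
    """Create a text file, making parent directories as needed (demo helper)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    d1 = os.path.join(tmp, "env1")
    d2 = os.path.join(tmp, "env2")
    write(os.path.join(d1, "shared", "file.txt"), "short\n")
    write(os.path.join(d2, "shared", "file.txt"), "a much longer line\n")
    write(os.path.join(d1, "shared", "deep", "left.txt"), "x\n")
    write(os.path.join(d2, "shared", "deep", "right.txt"), "y\n")
    differences = sorted(iter_differences(filecmp.dircmp(d1, d2)))

for kind, path in differences:
    print(kind, path)
```

Returning plain (kind, path) pairs keeps the walk independent of how the results are later displayed.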



difflib goes inside the file: instead of merely reporting that two files differ, it pinpoints where they differ. It is commonly used to compare text files. In fact, difflib can compare any sequences as long as their elements are hashable. The rest of this section is based on comparing text files.

Since we are now dealing with detailed file contents, we need a way to present the comparison results to the reader. On a text interface, the most common format is the output of the Linux diff command. For example:

    $ diff file1 file2
    1,2c1,2
    < this is file1
    < my name is Takanashi
    ---
    > this is file2
    > my name is takanashi
    4c4
    < tomorrow is holiday
    ---
    > i wouldnt work tomorrow

I won't explain this notation in detail here; see the Linux documentation for diff. The point is that some of the tools in the difflib module produce comparison results in a similar form.

The difflib module provides several classes for comparing text, the simplest of which is Differ.

Differ class

The Differ class has a compare method that compares two texts directly and presents the result in a form similar to the one shown above. For example, file1 and file2 from above can be compared with a script like this:

    import difflib

    d = difflib.Differ()
    # Differ works on lists of lines, so split each file into lines first.
    with open('file1', 'r') as file1:
        content1 = file1.read().splitlines()
    with open('file2', 'r') as file2:
        content2 = file2.read().splitlines()
    print('\n'.join(d.compare(content1, content2)))

Note that the Differ class works not on a single block of text (a string) but on a list of lines obtained by splitting the text on \n. The same is true of the other tools in difflib, so difflib is really a line-based comparison.

The compare method returns a generator that yields the comparison result line by line. Running the code above prints:

    - this is file1
    ?             ^

    + this is file2
    ?             ^

    - my name is Takanashi
    ?            ^

    + my name is takanashi
    ?            ^

      today is a little bit cold
    - tomorrow is holiday
    + i wouldnt work tomorrow

The first two lines of the two files differ. In the result, a line starting with '-' exists only in the first file and a line starting with '+' exists only in the second. For similar lines, difflib compares further to find where the change is and adds an extra line starting with '?' below the affected line, in which the ^ characters mark the positions that changed. The third line is the same in both files, so it is printed only once; the two spaces at its start keep it aligned with the lines above and below. The last two lines differ too much to be treated as similar, so each is shown as unique to its own file, with no '?' line.

Also, if a line in file b adds or removes characters relative to the corresponding line in file a, the '?' line uses + and - characters to mark which characters were added or removed. For example:

    - this a file1
    ?            ^

    + this is a file2
    ?     +++       ^
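The marker lines can also be reproduced from in-memory lists, without any files. The two sample strings below are assumptions chosen for this sketch, not the author's original data.

```python
import difflib

# One-line "files", already split into lists of lines.
old = ["this a file1"]
new = ["this is a file2"]

result = list(difflib.Differ().compare(old, new))

for line in result:
    # The '?' marker lines already end with a newline; strip it for printing.
    print(line.rstrip("\n"))
```

Because the generated '?' lines carry their own trailing newline, joining the results with '\n' produces a blank line after each marker line.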


unified_diff function

As mentioned above, the Differ class prints identical lines once. If the two files are mostly the same with only a few differences, printing all the identical lines, even once, is still noisy. The difflib.unified_diff function solves this problem.

unified_diff accepts a parameter n that sets how many unchanged context lines are shown around each change (the default is 3), so only the lines near a difference are displayed. There is a similar function, context_diff. I did not find it as useful, so I won't go into detail.
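To illustrate the effect of n, here is a small sketch with ten made-up lines and a single change. The labels file1 and file2 are only display names passed to fromfile and tofile; no files are read.

```python
import difflib

# Ten sample lines with a single change on line 5.
old = ["line %d" % i for i in range(1, 11)]
new = list(old)
new[4] = "line five"

# n=1 keeps one unchanged line of context on each side of the change.
# lineterm="" is used because the input lines carry no trailing newlines.
diff = list(
    difflib.unified_diff(old, new, fromfile="file1", tofile="file2",
                         n=1, lineterm="")
)

print("\n".join(diff))
```

Only the changed line and its immediate neighbours appear in the output; the other unchanged lines are omitted.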

SequenceMatcher class

This class can be used, first of all, for comparisons that ignore certain characters. The first argument of its constructor is a function object that takes one element and returns True or False; elements for which it returns True are treated as junk and ignored when matching. For example: s = SequenceMatcher(lambda x: x == " ", 'some string A', 'some string B').

The object s above can call s.find_longest_match(ab, ae, bb, be). This method returns a tuple (i, j, k), meaning that within the slices a[ab:ae] and b[bb:be] of the two strings a and b, the longest common part is a[i:i+k], which equals b[j:j+k].
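A minimal sketch of find_longest_match follows. The two strings are made up for the demo, and None is passed as the first argument so that no characters are treated as junk.

```python
from difflib import SequenceMatcher

a = "some string A"
b = "another string B"

# None as the first argument means "no junk": every character counts.
s = SequenceMatcher(None, a, b)

# Search the full range of both strings.
match = s.find_longest_match(0, len(a), 0, len(b))

# The result is a named tuple with fields a, b and size,
# corresponding to (i, j, k) in the description above.
common = a[match.a:match.a + match.size]
print(match.a, match.b, match.size, repr(common))
```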

Another great thing about this class is that it can compare not only strings but any sequences. For example, two lists can be compared with it as well. In that case the function passed as the first constructor argument receives not a single character but one element of the sequence.
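As a rough sketch of comparing lists, the snippet below uses get_opcodes(), a SequenceMatcher method not mentioned above, to describe how to turn one list into the other. The list contents are invented for the example.

```python
from difflib import SequenceMatcher

old = ["apple", "banana", "cherry", "date"]
new = ["apple", "blueberry", "cherry", "date", "elderberry"]

# Each list element is compared as a whole, so elements must be hashable.
s = SequenceMatcher(None, old, new)
ops = s.get_opcodes()

# Each opcode is (tag, i1, i2, j1, j2): old[i1:i2] relates to new[j1:j2].
for tag, i1, i2, j1, j2 in ops:
    print(tag, old[i1:i2], "->", new[j1:j2])
```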


HtmlDiff class

This is the class I actually used. It builds on the Differ class and turns the text-interface results into a friendlier HTML page. Here it is at a glance:

Constructor: __init__(tabsize=8, wrapcolumn=None, linejunk=None, charjunk=IS_CHARACTER_JUNK)

tabsize is the number of spaces a tab is shown as in the HTML. The default is 8, which I find too wide; 2 or 4 looks better. wrapcolumn sets the maximum width of the text in each comparison column, and lines longer than this are wrapped. The default is None, meaning no wrapping, so a long line makes the page very wide. linejunk and charjunk work like the corresponding parameters of ndiff: both are function objects that say which lines or characters should not count in the comparison.

The HtmlDiff class has two main methods, make_file and make_table. Their parameters are the same, but the former returns a complete, standalone HTML document (with <html>, <head>, and the other tags), while the latter returns only the HTML for the comparison table (starting with <table>). Taking make_file as an example, its signature is make_file(fromlines, tolines[, fromdesc][, todesc][, context][, numlines]). fromlines and tolines are the two lists being compared; as noted above, each is not a string but a list of strings split on '\n'. fromdesc and todesc set the column headings shown for each side in the generated HTML. context defaults to False; if it is set to True, the HTML shows only the changed lines plus numlines lines of context around them (numlines defaults to 5), so large unchanged regions are left out.
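The difference between the two methods can be checked with a short sketch like this one. The sample lines and the tabsize and wrapcolumn values are arbitrary choices for the demo.

```python
from difflib import HtmlDiff

old = ["this is file1", "today is a little bit cold"]
new = ["this is file2", "today is a little bit cold"]

differ = HtmlDiff(tabsize=4, wrapcolumn=60)

# make_file returns a complete HTML document as a string.
page = differ.make_file(old, new, fromdesc="file1", todesc="file2",
                        context=True, numlines=2)

# make_table returns only the <table> markup, for embedding in another page.
table = differ.make_table(old, new, fromdesc="file1", todesc="file2")

print(len(page), len(table))
```

make_table is the one to reach for when the comparison has to be embedded in a page you build yourself.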

After all that explanation, here is how I used the HtmlDiff class:

    import webbrowser
    from difflib import HtmlDiff

    # Method of a larger comparison class (the rest is not shown).
    def check_diff(self, index, wrapcolumn):
        # self.differing holds pairs of paths whose contents differ.
        file1, file2 = self.differing[index]
        with open(file1, 'r') as f:
            content1 = f.read().splitlines()
        with open(file2, 'r') as f:
            content2 = f.read().splitlines()
        htmldiff = HtmlDiff(tabsize=2, wrapcolumn=wrapcolumn)
        with open('tmp.html', 'w') as f:
            f.write(htmldiff.make_file(content1, content2,
                                       fromdesc=self.dir1, todesc=self.dir2))
        # Open the generated page in the default browser.
        webbrowser.open('tmp.html')

This is only part of the code. Combined with the webbrowser module, the generated HTML file can be opened right away. The resulting page is a side-by-side comparison with the changes highlighted, which is fairly friendly to read.



