After downloading a lot of documents and sorting them poorly, many of my folders ended up with duplicate files, so I wanted to write a small Python tool to find duplicates.
The main idea is as follows:
1. Find files of the same size.
2. For each group of same-size files, compute the crc32 checksum and collect files with identical checksums into a list of duplicates.
The following code is reprinted from elsewhere. It does the job, but it is very slow when scanning a large number of files, so I took some time to tune it.
#!/usr/bin/env python
# coding=utf-8
import binascii, os

filesizes = {}
samefiles = []

def filesize(path):
    # Recursively walk the directory tree and group file paths by size
    if os.path.isdir(path):
        files = os.listdir(path)
        for file in files:
            filesize(path + "/" + file)
    else:
        size = os.path.getsize(path)
        if not filesizes.has_key(size):
            filesizes[size] = []
        filesizes[size].append(path)

def filecrc(files):
    # Group a list of same-size files by their crc32 checksum
    filecrcs = {}
    for file in files:
        f = open(file, "rb")
        crc = binascii.crc32(f.read())
        f.close()
        if not filecrcs.has_key(crc):
            filecrcs[crc] = []
        filecrcs[crc].append(file)
    for filecrclist in filecrcs.values():
        if len(filecrclist) > 1:
            samefiles.append(filecrclist)

if __name__ == '__main__':
    path = r"J:\My Work"
    filesize(path)
    for sizesamefilelist in filesizes.values():
        if len(sizesamefilelist) > 1:
            filecrc(sizesamefilelist)
    for samefile in samefiles:
        print "****** same file group ******"
        for file in samefile:
            print file
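One reason the reprinted version struggles with many or large files is that filecrc reads each candidate file into memory in one call (f.read()). A common speedup, shown here only as a sketch and not as the tuned version described above, is to compute crc32 incrementally over fixed-size chunks; crc32_of_file and CHUNK_SIZE are names of my own choosing, and 64 KB is an arbitrary chunk size.

import binascii

CHUNK_SIZE = 64 * 1024  # read 64 KB at a time instead of the whole file

def crc32_of_file(path):
    # Sketch: compute the crc32 of a file chunk by chunk
    crc = 0
    f = open(path, "rb")
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        # binascii.crc32 accepts a running value, so the checksum can be
        # updated per chunk without loading the whole file into memory
        crc = binascii.crc32(chunk, crc)
    f.close()
    return crc & 0xffffffff  # normalize to an unsigned 32-bit value

With a helper like this, filecrc could call crc32_of_file(file) instead of binascii.crc32(f.read()) without changing the grouping logic.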