Python code to remove duplicate files in a directory (step-by-step optimization)

Source: Internet
Author: User
Tags: file size
Over the past couple of days I have been idly collecting pictures from Baidu — not that many, only tens of thousands, and quite a few beautiful ones among them! For now I won't go into how the pictures were obtained; let's talk about what happened after I had them.
The first problem I ran into was that some pictures had no file extension. Under Windows, a file without an extension is not recognized correctly: there is no preview, and when you open it you have to choose a program manually. This problem is easy to solve: add an extension to each picture. There were not many such pictures, fewer than 1000, but renaming them one by one would have been tedious. Fortunately I studied computing, so I wrote a program to rename them in batch (http://www.jb51.net/article/30400.htm). With that solved, a new problem appeared: with this many pictures, duplicates are inevitable. Some pictures are exactly identical and there is no need to keep them, so I wanted to delete all the duplicates.

Let's analyze the problem. First, the number of files is very large, so searching by hand is unrealistic; besides, finding exact duplicates among thousands of pictures by eye is very difficult. If the files were not pictures but some other kind of document, it would be even harder to tell them apart without a preview. So a program is the way to go. But how should the program decide that two files are exactly the same? Judging by file name is unreliable, because a file name can be changed arbitrarily while the contents stay the same; and within a single folder the operating system does not allow two identical file names anyway. Judging by file size sounds promising, but two pictures of the same size are not necessarily identical. Moreover, pictures are generally small — hardly any exceed 3 MB and most are under 1 MB — so in a folder with very many files, same-size collisions are quite likely. File size alone is therefore not reliable either. Another approach is to read the contents of each picture and compare them against every other picture; if the contents are identical, the two pictures must be identical. This method seems perfect, but consider its time and space efficiency. Comparing every picture's contents against every other picture's is a double loop: reading is slow and comparing is slower, so running all the comparisons is very time-consuming. As for memory, reading all the pictures into memory in advance would speed things up, but an ordinary computer has limited memory; with several GB of pictures, that is not realistic.
If you do not read all the files into memory, then the file contents must be read from disk before every comparison — as many reads as there are comparisons — and reading from the hard disk is slow, so that is clearly inappropriate too. Is there a better way? I racked my brains and finally thought of MD5. What is MD5? You don't know? Well, you must be from Mars — hurry up and look it up! You might ask: isn't MD5 for cryptography? What does it have to do with our problem? Good question! MD5 hashes a string of any length into a sequence of 32 hexadecimal characters (digits and letters). Because any tiny change in the input produces a different MD5 output, MD5 can be regarded as a string's "fingerprint" or "message digest". Since there are 2^128 possible MD5 values, the probability that two different strings produce the same MD5 is vanishingly small, practically zero. By the same reasoning, we can compute the MD5 of each file: if several files share an MD5, we can be essentially certain they are identical, because the probability that different files collide on MD5 is negligible. So the plan is: compute the MD5 of each file, and decide whether two pictures are identical by comparing their MD5 values. Here is the implementation in Python.

# -*- coding: cp936 -*-
import md5
import os
from time import clock as now

def getmd5(filename):
    file_txt = open(filename, 'rb').read()
    m = md5.new(file_txt)
    return m.hexdigest()

def main():
    path = raw_input("path: ")
    all_md5 = []
    total_file = 0
    total_delete = 0
    start = now()
    for file in os.listdir(path):
        total_file += 1
        real_path = os.path.join(path, file)
        if os.path.isfile(real_path):
            filemd5 = getmd5(real_path)
            if filemd5 in all_md5:
                total_delete += 1
                print 'delete', file
            else:
                all_md5.append(filemd5)
    end = now()
    time_last = end - start
    print 'total files:', total_file
    print 'delete number:', total_delete
    print 'time consuming:', time_last, 'seconds'

if __name__ == '__main__':
    main()
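To see the "fingerprint" idea in action, here is a small sketch in modern Python 3, where hashlib replaces the old Python 2 md5 module used above (the input strings are invented for illustration):

```python
import hashlib

def fingerprint(data):
    """Return the 32-character hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

# Two inputs that differ by a single byte...
a = fingerprint(b"hello world")
b = fingerprint(b"hello world!")

print(a)        # 32 hex characters
print(b)        # a completely different 32-character digest
print(a == b)   # False
```

Any one-byte change to the input flips the digest entirely, which is exactly why the digest works as a file fingerprint.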

The principle of the program above is simple: read each file in turn and compute its MD5. If the MD5 is not in the MD5 list, add it to the list; if it is already there, then a file with this MD5 has appeared before, this picture is redundant, and we can delete it. The following is a screenshot of the program's run:

As you can see, there are 8,674 files in this folder, 31 of which are duplicates, and finding all the duplicates took 155.5 seconds. That is not very efficient — can it be optimized? Analyzing it, my program spends time in two places: computing the MD5 of every file, which accounts for most of the time, and checking whether an MD5 already exists in the list, which is also fairly expensive. We can optimize on both fronts.

The first thing I tackled was the lookup. We could keep the list sorted and search it, but the list changes constantly and re-sorting each time would be inefficient. Instead I used a dictionary. The defining feature of a dictionary is that each key maps to a value; we can use the MD5 as the key, and the value doesn't matter. A dictionary is a hash table, so a membership test takes constant time on average, whereas checking membership in a list scans the elements one by one — the dictionary lookup is much faster. So we only need to check whether the MD5 value is among the keys. Here is the improved code:

# -*- coding: cp936 -*-
import md5
import os
from time import clock as now

def getmd5(filename):
    file_txt = open(filename, 'rb').read()
    m = md5.new(file_txt)
    return m.hexdigest()

def main():
    path = raw_input("path: ")
    all_md5 = {}
    total_file = 0
    total_delete = 0
    start = now()
    for file in os.listdir(path):
        total_file += 1
        real_path = os.path.join(path, file)
        if os.path.isfile(real_path):
            filemd5 = getmd5(real_path)
            if filemd5 in all_md5:  # tests the dict's keys directly
                total_delete += 1
                print 'delete', file
            else:
                all_md5[filemd5] = ''
    end = now()
    time_last = end - start
    print 'total files:', total_file
    print 'delete number:', total_delete
    print 'time consuming:', time_last, 'seconds'

if __name__ == '__main__':
    main()
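The speedup comes from the data structure: dict membership is an O(1) average-case hash lookup, while `in` on a list scans element by element. A rough micro-benchmark (a sketch in Python 3; the collection size and repeat count are chosen arbitrarily) makes the gap visible:

```python
import time

n = 100000
items = [str(i) for i in range(n)]
as_list = items
as_dict = dict.fromkeys(items)

probe = str(n - 1)  # worst case for the list: the last element

start = time.time()
for _ in range(500):
    _ = probe in as_list   # O(n) scan on every test
list_time = time.time() - start

start = time.time()
for _ in range(500):
    _ = probe in as_dict   # O(1) hash lookup on every test
dict_time = time.time() - start

print("list:", round(list_time, 4), "s  dict:", round(dict_time, 4), "s")
```

On any machine the dict lookups should finish orders of magnitude faster than the list scans.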


Timed, this is indeed a little faster than the original, but still not ideal, so let's optimize further. What else can be optimized? The MD5! The program above computes an MD5 for every file, which is very time-consuming — but does every file really need one? Can we reduce the number of MD5 computations? I thought of a way. In the analysis above we noted that comparing file sizes is fast but inaccurate, while MD5 is accurate but slow — can we combine the two? The answer is yes. We can reason as follows: if two files are identical, then their sizes and their MD5s must both be identical; and if two files have different sizes, they are certainly different! So we only need to check whether a file's size already exists in a size dictionary. If it does not, add it; if it does, then at least two pictures have the same size, and only then do we compute MD5s for the same-size files. If the MD5s match, the two files must be identical and we can delete one; if they differ, we add the new MD5 to that size's list to avoid recomputing it. The specific implementation is as follows:

# -*- coding: cp936 -*-
import md5
import os
from time import clock as now

def getmd5(filename):
    file_txt = open(filename, 'rb').read()
    m = md5.new(file_txt)
    return m.hexdigest()

def main():
    path = raw_input("path: ")
    # size -> [first file's path, its MD5 (filled lazily), further MD5s...]
    all_size = {}
    total_file = 0
    total_delete = 0
    start = now()
    for file in os.listdir(path):
        total_file += 1
        real_path = os.path.join(path, file)
        if os.path.isfile(real_path):
            size = os.stat(real_path).st_size
            name_and_md5 = [real_path, '']
            if size in all_size:
                new_md5 = getmd5(real_path)
                if all_size[size][1] == '':
                    # First collision for this size: hash the stored file now.
                    all_size[size][1] = getmd5(all_size[size][0])
                if new_md5 in all_size[size][1:]:
                    print 'delete', file
                    total_delete += 1
                else:
                    all_size[size].append(new_md5)
            else:
                all_size[size] = name_and_md5
    end = now()
    time_last = end - start
    print 'total files:', total_file
    print 'delete number:', total_delete
    print 'time consuming:', time_last, 'seconds'

if __name__ == '__main__':
    main()
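To sanity-check the combined size-then-MD5 strategy, here is a self-contained Python 3 sketch (file names and contents are invented for illustration) that writes two identical files plus a distinct file of the same size into a temporary directory, then reports which files are duplicates instead of deleting them:

```python
import hashlib
import os
import tempfile

def find_duplicates(path):
    """Return duplicate file names, hashing only files that share a size."""
    all_size = {}    # size -> [first file's path, its MD5 or '', further MD5s...]
    duplicates = []
    for name in sorted(os.listdir(path)):
        real_path = os.path.join(path, name)
        if not os.path.isfile(real_path):
            continue
        size = os.stat(real_path).st_size
        if size in all_size:
            new_md5 = hashlib.md5(open(real_path, 'rb').read()).hexdigest()
            if all_size[size][1] == '':
                # Lazily hash the first file of this size only when needed.
                first = open(all_size[size][0], 'rb').read()
                all_size[size][1] = hashlib.md5(first).hexdigest()
            if new_md5 in all_size[size][1:]:
                duplicates.append(name)
            else:
                all_size[size].append(new_md5)
        else:
            all_size[size] = [real_path, '']
    return duplicates

# Hypothetical test data: a.jpg and c.jpg are identical; b.jpg has the
# same size but different contents.
tmp = tempfile.mkdtemp()
for name, data in [('a.jpg', b'xxxx'), ('b.jpg', b'yyyy'), ('c.jpg', b'xxxx')]:
    with open(os.path.join(tmp, name), 'wb') as f:
        f.write(data)

print(find_duplicates(tmp))  # ['c.jpg']
```

Note that only b.jpg and c.jpg trigger MD5 computations; a unique file size never gets hashed at all, which is where the speedup comes from.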

What about the time efficiency? Look at the screenshot below:

It took only 7.28 seconds — more than ten times faster than the first two versions! That is an acceptable running time.

Algorithms are magical things — applied even casually, they can yield unexpected gains! The code above can be optimized further, for example by improving the search algorithm, and readers with ideas are welcome to discuss them with me. Rewriting it in C might be faster still. Oh well, I like the simplicity of Python!
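As one possible further cleanup for readers on modern Python 3 (a sketch of my own, not the author's code): group files by size first with a defaultdict, then hash only within groups that contain more than one file.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def duplicate_groups(path):
    """Group files by size, then by MD5; return lists of identical files."""
    by_size = defaultdict(list)
    for entry in os.scandir(path):
        if entry.is_file():
            by_size[entry.stat().st_size].append(entry.path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have duplicates; skip hashing
        by_md5 = defaultdict(list)
        for p in paths:
            with open(p, 'rb') as f:
                by_md5[hashlib.md5(f.read()).hexdigest()].append(p)
        groups.extend(g for g in by_md5.values() if len(g) > 1)
    return groups

# Hypothetical demo data: 'a' and 'b' are identical; 'c' has the same size
# but different contents.
tmp = tempfile.mkdtemp()
for name, data in [('a', b'same'), ('b', b'same'), ('c', b'diff')]:
    with open(os.path.join(tmp, name), 'wb') as f:
        f.write(data)

groups = duplicate_groups(tmp)
print(groups)  # one group containing the paths of 'a' and 'b'
```

Returning groups of identical files, rather than deleting on the spot, also lets the caller decide which copy to keep.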
Blogger: ma6174
