Function: Find all duplicate files in one or more specified directories and their subdirectories, list them in groups, and delete the redundant copies either by manual selection or automatically, keeping one copy from each group of duplicates. (File names containing spaces are supported, for example "File name".)
Implementation: find traverses the specified directories, md5sum computes the MD5 checksum of every file found, and the files are grouped by MD5 value; any group holding more than one file is a set of duplicates.
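The grouping idea can be sketched in a few lines. This is a minimal sketch, not the author's script: the function name list_duplicate_groups is hypothetical, and it assumes GNU findutils and coreutils (xargs -r, uniq -w and --all-repeated).

```shell
# Hypothetical helper: print the md5sum lines of all files that share a
# checksum with at least one other file, one blank-line-separated group each.
list_duplicate_groups() {
    find "$@" -type f -print0 |
        xargs -0 -r md5sum |
        LC_ALL=C sort |
        uniq -w32 --all-repeated=separate   # compare only the 32-char MD5 prefix
}
```

Calling list_duplicate_groups dir1 dir2 only lists the groups; nothing is deleted.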
Limitations: Traversing the directories with find is time-consuming;
Computing the MD5 checksum of large files is time-consuming;
Checksumming every file is time-consuming (a first pass that filters by file size would help, since only files of equal size can be duplicates; the gain is noticeable for directories that hold large numbers of files, but this script does not do it);
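The size pre-filter mentioned in the last limitation might look like the following. It is a sketch under assumptions, not part of the original script: the function names size_candidates and hash_candidates are hypothetical, GNU find (-printf) and GNU xargs (-r) are assumed, and file names containing newlines are not handled.

```shell
# Hypothetical helper: print only the files whose size is shared with at
# least one other file (a file with a unique size cannot be a duplicate).
size_candidates() {
    find "$@" -type f -printf '%s\t%p\n' |
        awk -F'\t' '
            {
                # Everything after the first tab is the path.
                path = substr($0, length($1) + 2)
                count[$1]++
                paths[$1] = paths[$1] path "\n"
            }
            END {
                for (size in count)
                    if (count[size] > 1)
                        printf "%s", paths[size]
            }'
}

# Checksum only the candidates instead of every file.
hash_candidates() {
    size_candidates "$@" | tr '\n' '\0' | xargs -0 -r md5sum
}
```

The output of hash_candidates has the same format as plain md5sum output, so it could feed the same grouping logic while skipping files that cannot have a duplicate.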
Demo:
[Image: 1.png]
Notes:
The md5sum output is displayed while the script runs; afterwards, the following statistics are printed:
Files : total number of files checked
Groups: number of groups of duplicate files
Size : total size of the redundant duplicate files that are about to be deleted, in other words, the disk space that will be freed by deleting the duplicates.
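As an illustration of how the three figures relate, the sketch below computes them from md5sum-format input: each group of n identical files contributes (n - 1) times the file size to Size. The function name duplicate_stats is hypothetical, Bash 4 or later and GNU stat (-c %s) are assumed, and file names with leading or trailing whitespace are not handled.

```shell
# Hypothetical helper: read "hash  path" lines (md5sum format) on stdin and
# print "files groups bytes", where bytes is the space freed by keeping one
# copy per group.
duplicate_stats() {
    local -A count size
    local hash path files=0 groups=0 bytes=0
    while read -r hash path; do
        files=$((files + 1))
        count[$hash]=$(( ${count[$hash]:-0} + 1 ))
        # Files with the same hash have the same size, so look it up once.
        [[ -n ${size[$hash]:-} ]] || size[$hash]=$(stat -c %s -- "$path")
    done
    for hash in "${!count[@]}"; do
        if (( ${count[$hash]} > 1 )); then
            groups=$((groups + 1))
            bytes=$(( bytes + (${count[$hash]} - 1) * ${size[$hash]} ))
        fi
    done
    echo "$files $groups $bytes"
}
```

For example, three identical 5-byte files plus one unrelated file give 4 files, 1 group and 10 reclaimable bytes.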
At the "Show detailed information ?" prompt, press "y" to view the groups of duplicate files for confirmation, or press "n" to skip straight to the menu for selecting the deletion mode ("q" quits):
There are two ways to delete files. The first is manual selection (the default): for each group of duplicate files, you choose the file to keep and the other files are deleted; if you make no choice, the first file in the list is kept. A demo follows:
[Image: 2.png]
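The fallback rule of manual mode (an invalid or empty answer keeps the first file) can be isolated in a small function. This is only a sketch of that rule; the name resolve_choice and its two parameters are hypothetical and do not appear in the original script.

```shell
# Hypothetical helper: print the index of the file to keep.
# $1 = what the user typed, $2 = number of files in the group.
resolve_choice() {
    local choice=$1 count=$2
    # Accept only plain decimal numbers inside the valid range;
    # 10# forces base 10 so that input such as "08" is not read as octal.
    if [[ $choice =~ ^[0-9]+$ ]] && (( 10#$choice < count )); then
        echo "$((10#$choice))"
    else
        echo 0
    fi
}
```

With a group of three files, resolve_choice 2 3 selects index 2, while an empty answer or an out-of-range number falls back to index 0.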
The second is automatic selection: the first file of each group is kept and the other duplicates are deleted automatically. (The first mode is recommended, to avoid deleting important files.) As shown below:
[Image: 3.png]
Support for file names containing spaces is shown below:
[Image: 4.png]
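The script joins the file names of a group with "+", which is why names containing "+" cause trouble. One possible alternative, shown here purely as a sketch and not as the author's method, is to join them with newlines and split them back with mapfile. The names files_by_hash, collect_by_hash, group_to_array and group_files are hypothetical; Bash 4 or later is assumed, and names containing newlines remain unsupported.

```shell
# Hypothetical variant: group file names per MD5 value, newline-separated.
declare -A files_by_hash

# Read "hash  path" lines (md5sum format) from stdin.
# Run it in the current shell, e.g. collect_by_hash < results.txt
collect_by_hash() {
    local hash name
    while read -r hash name; do
        files_by_hash[$hash]+="$name"$'\n'
    done
}

# Split one group into the global array group_files.
group_to_array() {
    # Drop the trailing newline so no empty element is produced.
    mapfile -t group_files <<< "${files_by_hash[$1]%$'\n'}"
}
```

Because the separator is a newline, names with spaces or "+" come back intact as single array elements.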
Code area:
#!/bin/bash
# author: lingyi
# date: 2016.4.12
# func: delete duplicate files
# eg  : $0 [ dir1 dir2 ... dirn ]

# Temporary file that stores the md5sum results (assumes /tmp is writable)
md5sum_result_log="/tmp/$(date +%Y%m%d%H%M%S)"
echo -e "\033[1;31mMd5suming ...\033[0m"
# Traverse the directories, checksum every file, print the results and write them to the temporary file
find "$@" -type f -print0 | xargs -0 -I {} md5sum {} | tee -a "$md5sum_result_log"
files_sum=$(wc -l < "$md5sum_result_log")

# Associative array: the index is a file's MD5 value, the element is the list of file names.
# It must be declared in advance (requires Bash 4 or later).
declare -A md5sum_value_arry
while read -r md5sum_value md5sum_filename
do
    # To support file names containing spaces, "+" is used instead of whitespace to separate the file names.
    # As a result, file names containing "+" break the script; if such files may be present,
    # choose manual mode and confirm before deleting.
    md5sum_value_arry[$md5sum_value]="${md5sum_value_arry[$md5sum_value]}+$md5sum_filename"
    # Per-checksum counter kept in a variable named _<md5>
    (( _${md5sum_value}+=1 ))
done < "$md5sum_result_log"

# Count the groups of duplicate files and the total size of the redundant files
groups_sum=0
repfiles_size=0
for md5sum_value_index in "${!md5sum_value_arry[@]}"
do
    count_var="_$md5sum_value_index"
    if [[ ${!count_var} -gt 1 ]]; then
        let groups_sum++
        need_print_indexes="$need_print_indexes $md5sum_value_index"
        repfile_sum=$(( ${!count_var} - 1 ))
        # Size of the first file in the group (all files in a group have the same size)
        repfile_size=$(ls -l "$(echo "${md5sum_value_arry[$md5sum_value_index]}" | awk -F'+' '{print $2}')" | awk '{print $5}')
        repfiles_size=$(( repfiles_size + repfile_sum*repfile_size ))
    fi
done

# Print the statistics
echo -e "\033[1;31mFiles: $files_sum Groups: $groups_sum Size: ${repfiles_size}B $((repfiles_size/1024))K $((repfiles_size/1024/1024))M\033[0m"
[[ $groups_sum -eq 0 ]] && exit

# The user chooses whether to view the details of the duplicate file groups
read -n 1 -s -t 300 -p 'Show detailed information ? ' user_ch
[[ $user_ch == 'n' ]] && echo || {
    [[ $user_ch == 'q' ]] && exit
    for print_value_index in $need_print_indexes
    do
        count_var="_$print_value_index"
        echo -ne "\n\033[1;35m$((++i)) \033[0m"
        echo -ne "\033[1;34m$print_value_index [ ${!count_var} ]:\033[0m"
        echo "${md5sum_value_arry[$print_value_index]}" | tr '+' '\n'
    done | more
}

# The user chooses how files are deleted
echo -e "\n\nManual selection by default !"
echo -e " 1 Manual selection\n 2 Random selection"
echo -ne "\033[1;31m"
read -t 300 user_ch
echo -ne "\033[0m"
[[ $user_ch == 'q' ]] && exit
[[ $user_ch -ne 2 ]] && user_ch=1 || {
    echo -ne "\033[31mWARNING: you have chosen the random selection mode, files will be deleted at random !\nAre you sure ? \033[0m"
    read -t 300 yn
    [[ $yn == 'q' ]] && exit
    [[ $yn != 'y' ]] && user_ch=1
}

# Process each group according to the selected mode
echo -e "\033[31m\nWarn: keep the first file by default.\033[0m"
for exec_value_index in $need_print_indexes
do
    # Build the array of files in this group (reset first so that entries
    # from a larger previous group do not linger)
    file_choices_arry=()
    count_var="_$exec_value_index"
    file_sum=${!count_var}
    for ((i=0,j=2; i<file_sum; i++,j++))
    do
        file_choices_arry[i]="$(echo "${md5sum_value_arry[$exec_value_index]}" | awk -F'+' -v J="$j" '{print $J}')"
    done
    if [[ $user_ch -eq 1 ]]; then
        # Manual mode: print the files of the group and ask which one to keep
        echo -e "\033[1;34m$exec_value_index\033[0m"
        for ((j=0; j<${#file_choices_arry[@]}; j++))
        do
            echo "[ $j ] ${file_choices_arry[j]}"
        done
        read -p "Number of the file you want to keep: " num_ch
        [[ $num_ch == 'q' ]] && exit
        # Any answer that is not a valid index falls back to the first file
        seq 0 $((${#file_choices_arry[@]}-1)) | grep -qxF -- "$num_ch" || num_ch=0
    else
        # Automatic mode: always keep the first file
        num_ch=0
    fi
    # Delete every file in the group except the one being kept
    for ((n=0; n<${#file_choices_arry[@]}; n++))
    do
        [[ $n -ne $num_ch ]] && {
            echo -ne "\033[1mDeleting file \" ${file_choices_arry[n]} \" ... \033[0m"
            rm -f -- "${file_choices_arry[n]}"
            [[ $? -eq 0 ]] && echo -e "\033[1;32mOK" || echo -e "\033[1;31mFAIL"
            echo -ne "\033[0m"
        }
    done
done
This article is from the "retrograde person" blog; reprinting is declined.
One of the scripting apps: Find and delete duplicate files