One of the scripting apps: Find and delete duplicate files

Last Update:2016-04-12 Source: Internet

Author: User


Function: Finds all duplicate files in one or more specified directories and their subdirectories, lists them in groups, and deletes the redundant copies either by manual selection or automatically, keeping one copy from each group of duplicates. File names containing spaces are supported (for example, "File name").

Implementation: find traverses the specified directories, md5sum computes an MD5 checksum for every file found, and the files are then grouped by checksum. Any checksum shared by more than one file marks a group of duplicates.
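The grouping idea can be sketched as a short pipeline. This is not the script from this article, only a minimal illustration that assumes GNU findutils and coreutils (xargs -r, uniq -w and --all-repeated are GNU options); the function name list_duplicate_groups is made up for this example.

```bash
#!/bin/bash
# list_duplicate_groups DIR...
# Print the md5sum lines of all files whose checksum occurs more than once,
# with a blank line between groups. Assumes GNU find, xargs, md5sum and uniq.
list_duplicate_groups() {
    find "$@" -type f -print0 |              # NUL-separated names survive spaces
        xargs -0 -r md5sum |                 # one "checksum  path" line per file
        sort |                               # identical checksums become adjacent
        uniq -w32 --all-repeated=separate    # compare only the 32 hex digits
}
```

Called as list_duplicate_groups dir1 dir2, it only lists the groups and deletes nothing. File names containing a newline or a backslash are not handled, because md5sum escapes such names in its output.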

Shortcomings: Traversing the files with find is time-consuming;

Computing MD5 checksums of large files is time-consuming;

Comparing the checksums of all files is time-consuming (a first-pass filter on file size could narrow down the candidates, which helps noticeably in directories that hold many files, but this script does not use one).
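The size-based first pass mentioned above, which the script does not implement, might look like the sketch below. It assumes GNU find (for -printf) and file names without newlines; the function name same_size_candidates is hypothetical.

```bash
#!/bin/bash
# same_size_candidates DIR...
# Print, one per line, the paths of regular files whose size in bytes is
# shared with at least one other file. Only these can be duplicates, so only
# these would need an MD5 check. Assumes GNU find and newline-free file names.
same_size_candidates() {
    find "$@" -type f -printf '%s\t%p\n' |
        awk -F'\t' '
            {
                size[NR] = $1
                path[NR] = substr($0, length($1) + 2)   # text after the first tab
                count[$1]++
            }
            END {
                for (i = 1; i <= NR; i++)
                    if (count[size[i]] > 1) print path[i]
            }'
}
```

The output could then be passed on for checksumming, for example with same_size_candidates dir | tr '\n' '\0' | xargs -0 -r md5sum, so that files with a unique size are never read at all.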

Demonstration:

[Screenshot: 1.png]

Notes:

The script displays the MD5 checksums as they are computed, and then prints the following statistics:

Files : total number of files checked

Groups: number of groups of duplicate files

Size : total size of the redundant copies that are about to be deleted, in other words, the disk space that deleting the duplicates will free.
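As a small worked illustration of the Size figure: a group of n identical files of s bytes each contributes (n - 1) * s bytes, since one copy is kept. The helper below is a sketch written for this explanation (the name reclaimable_bytes is not part of the script), and it assumes all arguments really are identical copies.

```bash
#!/bin/bash
# reclaimable_bytes FILE...
# All arguments are assumed to be identical copies of one file. Print the
# number of bytes freed if every copy except one is deleted: (n - 1) * size.
reclaimable_bytes() {
    if [ "$#" -lt 2 ]; then
        echo 0
        return
    fi
    size=$(wc -c < "$1")          # size in bytes of the first copy
    echo $(( ($# - 1) * size ))
}
```

Summing this value over all groups gives the Size statistic described above.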

At the "Show detailed information ?" prompt, press "y" to view the groups of duplicate files for confirmation, or press "n" to skip them and go straight to the menu for choosing the deletion mode:

There are two ways to delete files. The first is manual selection (the default): for each group of duplicates, you choose the file to keep and the other files are deleted. If you make no choice, the first file in the list is kept. For example:

[Screenshot: 2.png]

The second is automatic selection: the first file of each group is kept and the other duplicates are deleted automatically. (The first mode is recommended, to avoid deleting important files.) For example:

[Screenshot: 3.png]

File names containing spaces are supported, as shown below:

[Screenshot: 4.png]
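One reason spaces in file names can be handled is the way bash's read builtin splits a line: when it is given two variable names, the first receives the first word and the last receives the whole remainder of the line, inner spaces included. The sketch below illustrates this on one line of md5sum output; the function name split_md5_line is made up for the example.

```bash
#!/bin/bash
# split_md5_line LINE
# Split one line of md5sum output ("checksum  path") into its two parts.
# read assigns the first word to checksum and the rest of the line to path,
# so spaces inside the path are kept.
split_md5_line() {
    printf '%s\n' "$1" | {
        IFS=' ' read -r checksum path
        printf 'checksum=%s\npath=%s\n' "$checksum" "$path"
    }
}
```

Leading and trailing spaces of the path are still trimmed by read, so a file name that ends in a space would not round-trip through this split.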

Code area:

#!/bin/bash
# author: lingyi
# date  : 2016.4.12
# func  : delete duplicate files
# eg    : $0 [ dir1 dir2 ... dirn ]

# Temporary file that holds the md5sum output.
md5sum_result_log="/tmp/$(date +%Y%m%d%H%M%S)"
echo -e "\033[1;31mMd5suming ...\033[0m"
# Traverse the directories, checksum every file, and both display the
# results and write them to the temporary file.
find "$@" -type f -print0 | xargs -0 -I {} md5sum {} | tee -a "$md5sum_result_log"
files_sum=$(wc -l < "$md5sum_result_log")

# Associative array: the index is a file's MD5 value and the element is the
# list of file names with that value, so it must be declared in advance
# (requires bash 4 or later).
declare -A md5sum_value_arry
while read md5sum_value md5sum_filename
do
    # To support file names containing spaces, "+" rather than whitespace
    # separates the file names. A file name that itself contains "+" will
    # therefore give wrong results; when deleting, use manual mode to confirm.
    md5sum_value_arry[$md5sum_value]="${md5sum_value_arry[$md5sum_value]}+$md5sum_filename"
    (( _${md5sum_value}+=1 ))
done < "$md5sum_result_log"

# Count the groups of duplicate files and the total size of the redundant copies.
groups_sum=0
repfiles_size=0
for md5sum_value_index in "${!md5sum_value_arry[@]}"
do
    if eval [[ \${_${md5sum_value_index}} -gt 1 ]]; then
        let groups_sum++
        need_print_indexes="$need_print_indexes $md5sum_value_index"
        eval repfile_sum=\$\(\( \$_$md5sum_value_index - 1 \)\)
        repfile_size=$(ls -l "$(echo "${md5sum_value_arry[$md5sum_value_index]}" | awk -F'+' '{print $2}')" | awk '{print $5}')
        repfiles_size=$(( repfiles_size + repfile_sum * repfile_size ))
    fi
done

# Print the statistics.
echo -e "\033[1;31mFiles: $files_sum    Groups: $groups_sum    Size: ${repfiles_size}B $((repfiles_size/1024))K $((repfiles_size/1024/1024))M\033[0m"
[[ $groups_sum -eq 0 ]] && exit

# Let the user choose whether to view the groups of duplicate files in detail.
read -n 1 -s -t 300 -p 'Show detailed information ?' user_ch
[[ $user_ch == 'n' ]] && echo || {
    [[ $user_ch == 'q' ]] && exit
    for print_value_index in $need_print_indexes
    do
        echo -ne "\n\033[1;35m$((++i)) \033[0m"
        eval echo -ne "\\\033[1\;34m$print_value_index [ \$_${print_value_index} ]:\\\033[0m"
        echo "${md5sum_value_arry[$print_value_index]}" | tr '+' '\n'
    done | more
}

# Let the user choose the deletion mode.
echo -e "\n\nManual selection by default !"
echo -e " 1 Manual selection\n 2 Random selection"
echo -ne "\033[1;31m"
read -t 300 USER_CH
echo -ne "\033[0m"
[[ $USER_CH == 'q' ]] && exit
[[ $USER_CH -ne 2 ]] && USER_CH=1 || {
    echo -ne "\033[31mWARNING: you have chosen the Random selection mode, files will be deleted at random !\nAre you sure ?\033[0m"
    read -t 300 yn
    [[ $yn == 'q' ]] && exit
    [[ $yn != 'y' ]] && USER_CH=1
}

# Process each group according to the chosen mode.
echo -e "\033[31m\nWarn: keep the first file by default.\033[0m"
for exec_value_index in $need_print_indexes
do
    # Build the array of files in this group. It is reset first so that
    # entries left over from a larger previous group do not linger.
    file_choices_arry=()
    for ((i=0, j=2; i<$(echo "${md5sum_value_arry[$exec_value_index]}" | grep -o '+' | wc -l); i++, j++))
    do
        file_choices_arry[i]="$(echo "${md5sum_value_arry[$exec_value_index]}" | awk -F'+' '{print $J}' J=$j)"
    done
    eval file_sum=\$_$exec_value_index
    if [[ $USER_CH -eq 1 ]]; then
        # Manual mode: print the group of duplicate files and ask which one to keep.
        echo -e "\033[1;34m$exec_value_index\033[0m"
        for ((j=0; j<${#file_choices_arry[@]}; j++))
        do
            echo "[ $j ]  ${file_choices_arry[j]}"
        done
        read -p "Number of the file you want to keep: " num_ch
        [[ $num_ch == 'q' ]] && exit
        # Fall back to the first file unless the input is a valid index.
        [[ $num_ch =~ ^[0-9]+$ ]] && (( num_ch < ${#file_choices_arry[@]} )) || num_ch=0
    else
        num_ch=0
    fi
    # Delete every file in the group except the one being kept
    # (in automatic mode this is always the first file).
    for ((n=0; n<${#file_choices_arry[@]}; n++))
    do
        [[ $n -ne $num_ch ]] && {
            echo -ne "\033[1mDeleting file \" ${file_choices_arry[n]} \" ... \033[0m"
            rm -f "${file_choices_arry[n]}"
            [[ $? -eq 0 ]] && echo -e "\033[1;32mOK" || echo -e "\033[1;31mFAIL"
            echo -ne "\033[0m"
        }
    done
done

This article is from the "retrograde person" blog; reprinting is declined.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
