Given a domain name, how do you decide whether its content can be cached? A serious, professional crawler can certainly do this analysis properly, but a rough, less rigorous version can be done with a shell script. Here is my small page-level crawl; first, a screenshot of the result after running the script:
[Screenshot of the script's output: http://s5.51cto.com/wyfs02/M00/7C/D6/wKioL1bZGG_QERYEAAE4F_-W1L8999.png]
The script uses elinks to pull out all the element URLs on the page and gathers statistics on them, then probes each one's response headers with curl and uses the Cache-Control header to decide whether it can be cached. If more than 70% of a domain's URLs are cacheable, I simply treat that host as cacheable. This is fairly rough, but it should be good enough as a reference and for learning.
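Before the full script, here is a minimal sketch of the per-URL decision rule in isolation. This simplified logic is my own illustration, not lifted verbatim from the script, and the URL is just a placeholder:

# Minimal sketch: probe one element's headers and classify it.
# "http://example.com/logo.png" is a placeholder URL.
cc=$(curl -s -I -m 3 "http://example.com/logo.png" | grep -i '^cache-control' | head -1)
if echo "$cc" | grep -Eiq 'no-cache|no-store|max-age=0'; then
    echo "not cacheable"
elif echo "$cc" | grep -Eiq 'max-age=[1-9]'; then
    echo "cacheable"
else
    echo "unknown (no Cache-Control header, or the request timed out)"
fi

The order of the checks matters: no-cache, no-store and max-age=0 are treated as "not cacheable" first, and only a positive max-age counts as "cacheable"; anything else is "unknown".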
The script is as follows:
#!/bin/bash
#### Analyze the cacheability of domain-level page elements ####
# writer: gaolixu
# Usage: $1 = page URL to crawl, $2 = egrep pattern used to filter the extracted links

showprint() {
    tput clear
    echo -e "`tput bold`The web has the url host is (more than 70% item can be cache is yes):"
    echo -e "`tput bold;tput setaf 1`times\thost\t\t\t\tip\t\t\t\t\t\t\tyes%\t\tno%\t\tunknown%\tcache(yes/no)"
    tput sgr0
    i=2
    # Count how many elements each host contributed, most frequent first
    cat /tmp/cachetest/cache.test | awk -F+ '{print $1}' | sort | uniq -c | sort -nr | while read url
    do
        url_host=`echo $url | awk '{print $2}'`
        # Per-host counts of cacheable / non-cacheable / unknown elements
        url_yes=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | awk -F+ '{print $5}' | egrep -c yes`
        url_no=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | awk -F+ '{print $5}' | egrep -c no`
        url_unknown=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | awk -F+ '{print $5}' | egrep -c unknown`
        ((url_sum=url_yes+url_no+url_unknown))
        # Convert the counts to percentages
        url_yes_b=`awk -v url_yes=$url_yes -v url_sum=$url_sum 'BEGIN{printf "%.1f",url_yes/url_sum*100}'`
        url_no_b=`awk -v url_no=$url_no -v url_sum=$url_sum 'BEGIN{printf "%.1f",url_no/url_sum*100}'`
        url_unknown_b=`awk -v url_unknown=$url_unknown -v url_sum=$url_sum 'BEGIN{printf "%.1f",url_unknown/url_sum*100}'`
        url_ip=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | head -1 | awk -F+ '{print $3}'`
        # A host is considered cacheable when more than 70% of its elements are cacheable
        url_status=`awk -v url_yes_b=$url_yes_b 'BEGIN{if(url_yes_b>70) print "yes"; else print "no"}'`
        echo -n -e "$url" | sed 's/ /\t/'
        tput cup $i 40;  echo -n "$url_ip"
        tput cup $i 96;  echo -n "$url_yes_b%"
        tput cup $i 112; echo -n "$url_no_b%"
        tput cup $i 128; echo "$url_unknown_b%"
        tput cup $i 144; echo "$url_status"
        ((i+=1))
    done
}

# Make sure the target host resolves before doing anything else
host $1 &>/dev/null || { echo "The url is error, can't host!!"; exit; }
mkdir /tmp/cachetest &>/dev/null
[ -f /tmp/cachetest/cache.test ] && rm /tmp/cachetest/cache.test

# Dump the page with elinks, keep only lines matching the filter pattern,
# split multiple links per line onto their own lines and count them
num=`elinks --dump $1 | egrep $2 | sed '/http:/s/http:/\nhttp:/' | awk -F'[/|,]' '/^http/{print $0}' | sort | wc -l`
echo "`tput bold`The sum links is $num!"
echo "`tput bold`The analysis is running......"
tput sgr0

elinks --dump $1 | egrep $2 | sed '/http:/s/http:/\nhttp:/' | awk -F'[/|,]' '/^http/{print $0}' | sort | while read url
do
    url_host=`echo $url | awk -F'[/|,]' '/^http/{print $3}'`
    url_url=`echo $url | sed "s/$url_host/\t/" | awk '{print $2}'`
    # Resolve all IPs of the host and join them with "/"
    url_ip=`host $url_host | egrep address | awk '{print $NF}' | sed '{1h;1!H;$g;$!d;s/\n/\//g}'`
    # Probe only the response headers (3s timeout) and keep the Cache-Control line
    cc=`curl -s -I -m 3 $url | egrep -i cache-control | head -1 | sed '{1h;1!H;$g;$!d;s/\n/ /g}'`
    if [ "$cc" ]; then
        # Strip the trailing carriage return of the HTTP header line
        cc_n=${#cc}
        ((cc_n-=1))
        cc=`echo $cc | cut -b1-$cc_n`
    else
        cc="no cache-control flags or time out"
    fi
    # Extract the max-age value, if present
    cc_i=`echo $cc | sed 's/max-age=/\n/' | awk -F"[,| ]" 'NR==2{print $1}'`
    if echo $cc | egrep -i no-cache &>/dev/null || echo $cc | egrep -i no-store &>/dev/null; then
        cc_status="no"
    elif [[ $cc_i = 0 ]]; then
        cc_status="no"
    elif [[ $cc_i > 0 ]]; then
        cc_status="yes"
    else
        cc_status="unknown"
    fi
    # Record: host + path + ip + cache-control header + verdict
    echo -e "$url_host+$url_url+$url_ip+$cc+$cc_status" >> /tmp/cachetest/cache.test
    echo -n "#"    # progress indicator
done

sleep 2
showprint
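Assuming the script above is saved as, say, cache_check.sh (the file name and target site below are placeholders, not from the original post), it takes the page URL as $1 and an egrep pattern as $2 to narrow down which extracted links get analyzed:

# $1: page to crawl with elinks; $2: egrep pattern the dumped link lines must match
bash cache_check.sh http://www.51cto.com 51cto
# Progress is printed as a row of "#" marks; per-URL results accumulate in
# /tmp/cachetest/cache.test and the summary table is printed by showprint at the end.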
This article is from the "Running Linux" blog; please keep this source when reposting: http://benpaozhe.blog.51cto.com/10239098/1747543
Shell script: crawl domain-level page elements and determine whether they are cacheable