Shell script: crawl a domain's page elements and determine whether they are cacheable


Given a domain name, how do you determine whether its content is cacheable? A professional, high-end crawler can certainly do this analysis properly, but a less rigorous version can also be implemented with a shell script. Here is my small, one-level page crawl; a run of the script produces output like this:


[Screenshot: sample script output — http://s5.51cto.com/wyfs02/M00/7C/D6/wKioL1bZGG_QERYEAAE4F_-W1L8999.png]


The script uses elinks to pull all the element URLs out of the page and gather statistics, then probes each URL's header information with curl. The Cache-Control header decides whether a URL can be cached; if more than 70% of a host's URLs are cacheable, I simply consider that host cacheable. This is admittedly rough, but as a rough reference and a learning exercise it should be enough.
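
To make the probing step concrete, here is a minimal sketch of the per-URL test on its own, following the same classification rules the full script applies below. The function name check_cacheable and the example URL are illustrative placeholders, not part of the original script:

#!/bin/bash
# Minimal sketch of the per-URL cacheability test.
# "check_cacheable" and the example URL are illustrative placeholders.
check_cacheable() {
    # -s silent, -I headers only (HEAD request), -m 3 = three-second timeout
    cc=`curl -s -I -m 3 "$1" | egrep -i '^cache-control' | head -1`
    case "$cc" in
        *no-cache*|*no-store*|*max-age=0*) echo "no" ;;      # explicitly uncacheable
        *max-age=*)                        echo "yes" ;;     # positive TTL, cacheable
        *)                                 echo "unknown" ;; # no header or timed out
    esac
}

check_cacheable "http://www.example.com/logo.png"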


The script is as follows:

#!/bin/bash
####  Analysis of cacheable conditions for domain-level page elements  ####
# writer: gaolixu
# usage: cachetest.sh <domain> <element-filter-regex>

showprint() {
    tput clear
    echo -e "`tput bold`The url hosts of the site (cache is yes if more than 70% of items can be cached):"
    echo -e "`tput bold;tput setaf 1`Times\tHost\t\t\t\tIP\t\t\t\t\t\t\tyes%\t\tno%\t\tunknown%\tcache(yes/no)"
    tput sgr0
    i=2
    # count how many element URLs each host contributed, most frequent first
    cat /tmp/cachetest/cache.test | awk -F'+' '{print $1}' | sort | uniq -c | sort -nr | while read url
    do
        url_host=`echo $url | awk '{print $2}'`
        url_yes=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | awk -F'+' '{print $5}' | egrep -c yes`
        url_no=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | awk -F'+' '{print $5}' | egrep -c no`
        url_unknown=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | awk -F'+' '{print $5}' | egrep -c unknown`
        ((url_sum=url_yes+url_no+url_unknown))
        url_yes_b=`awk -v url_yes=$url_yes -v url_sum=$url_sum 'BEGIN{printf "%.1f", url_yes/url_sum*100}'`
        url_no_b=`awk -v url_no=$url_no -v url_sum=$url_sum 'BEGIN{printf "%.1f", url_no/url_sum*100}'`
        url_unknown_b=`awk -v url_unknown=$url_unknown -v url_sum=$url_sum 'BEGIN{printf "%.1f", url_unknown/url_sum*100}'`
        url_ip=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | head -1 | awk -F'+' '{print $3}'`
        # a host counts as cacheable when more than 70% of its URLs are
        url_status=`awk -v url_yes_b=$url_yes_b 'BEGIN{if (url_yes_b>70) print "yes"; else print "no"}'`
        echo -n -e "$url" | sed 's/ /\t/'
        tput cup $i 40;  echo -n "$url_ip"
        tput cup $i 96;  echo -n "$url_yes_b%"
        tput cup $i 112; echo -n "$url_no_b%"
        tput cup $i 128; echo -n "$url_unknown_b%"
        tput cup $i 144; echo "$url_status"
        ((i+=1))
    done
}

host $1 &>/dev/null || { echo "The url is wrong, can't resolve the host!!"; exit; }
mkdir /tmp/cachetest &>/dev/null
[ -f /tmp/cachetest/cache.test ] && rm /tmp/cachetest/cache.test

# pull every element URL matching the filter out of the page and count them
num=`elinks --dump $1 | egrep $2 | sed '/http:/s/http:/\nhttp:/' | awk -F'[/|,]' '/^http/{print $0}' | sort | wc -l`
echo "`tput bold`The sum of links is $num!"
echo "`tput bold`The analysis is running..."
tput sgr0

elinks --dump $1 | egrep $2 | sed '/http:/s/http:/\nhttp:/' | awk -F'[/|,]' '/^http/{print $0}' | sort | while read url
do
    url_host=`echo $url | awk -F'[/|,]' '/^http/{print $3}'`
    url_url=`echo $url | sed "s/$url_host/\t/" | awk '{print $2}'`
    # join multiple A records into a single ip1/ip2/... string
    url_ip=`host $url_host | egrep address | awk '{print $NF}' | sed '{1h;1!H;$g;$!d;s/\n/\//g}'`
    # probe only the response headers, with a three-second timeout
    cc=`curl -s -I -m 3 $url | egrep -i cache-control | head -1 | sed '{1h;1!H;$g;$!d;s/\n/ /g}'`
    if [ "$cc" ]; then
        cc_n=${#cc}
        ((cc_n-=1))
        cc=`echo $cc | cut -b1-$cc_n`    # strip the trailing \r from the header line
    else
        cc="no cache-control flags or time out"
    fi
    # pull out the number that follows max-age=
    cc_i=`echo $cc | sed 's/max-age=/\n/' | awk -F'[, ]' 'NR==2{print $1}'`
    if echo $cc | egrep no-cache &>/dev/null || echo $cc | egrep no-store &>/dev/null; then
        cc_status="no"
    elif [[ $cc_i = 0 ]]; then
        cc_status="no"
    elif [[ $cc_i > 0 ]]; then
        cc_status="yes"
    else
        cc_status="unknown"
    fi
    # record: host+path+ip+cache-control+verdict, '+' as the field separator
    echo -e "$url_host+$url_url+$url_ip+$cc+$cc_status" >> /tmp/cachetest/cache.test
    echo -n "#"    # progress indicator
done

sleep 2
showprint
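
Run the script with the domain as the first argument and an egrep pattern matching the element types you want to test as the second; the pattern is passed straight to egrep over the elinks dump, so anything that matches is treated as a candidate element URL. The file name and filter below are only examples:

chmod +x cachetest.sh
./cachetest.sh www.example.com 'png|jpg|gif|css|js'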


This article is from the "Running Linux" blog; please keep this source: http://benpaozhe.blog.51cto.com/10239098/1747543
