Shell script: crawl a domain's page elements and determine whether they are cacheable


Given a domain name, how do you determine whether its content is cacheable? A professional, high-end crawler can certainly do this analysis properly, but a less rigorous version can also be implemented with a shell script. Here is my small, one-level page crawl; a run of the script produces output like this:


[Screenshot: sample script output — http://s5.51cto.com/wyfs02/M00/7C/D6/wKioL1bZGG_QERYEAAE4F_-W1L8999.png]


The script uses elinks to pull all the element URLs out of the page and gather statistics, then probes each URL's header information with curl. The Cache-Control header decides whether a URL can be cached; if more than 70% of a host's URLs are cacheable, I simply consider that host cacheable. This is admittedly rough, but as a rough reference and a learning exercise it should be enough.
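
To make the probing step concrete, here is a minimal sketch of the per-URL test on its own, following the same classification rules the full script applies below. The function name check_cacheable and the example URL are illustrative placeholders, not part of the original script:

#!/bin/bash
# Minimal sketch of the per-URL cacheability test.
# "check_cacheable" and the example URL are illustrative placeholders.
check_cacheable() {
    # -s silent, -I headers only (HEAD request), -m 3 = three-second timeout
    cc=`curl -s -I -m 3 "$1" | egrep -i '^cache-control' | head -1`
    case "$cc" in
        *no-cache*|*no-store*|*max-age=0*) echo "no" ;;      # explicitly uncacheable
        *max-age=*)                        echo "yes" ;;     # positive TTL, cacheable
        *)                                 echo "unknown" ;; # no header or timed out
    esac
}

check_cacheable "http://www.example.com/logo.png"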


The script is as follows:

#!/bin/bash
####  Analysis of cacheable conditions for domain-level page elements  ####
# writer: gaolixu
# usage: cachetest.sh <domain> <element-filter-regex>

showprint() {
    tput clear
    echo -e "`tput bold`The url hosts of the site (cache is yes if more than 70% of items can be cached):"
    echo -e "`tput bold;tput setaf 1`Times\tHost\t\t\t\tIP\t\t\t\t\t\t\tyes%\t\tno%\t\tunknown%\tcache(yes/no)"
    tput sgr0
    i=2
    # count how many element URLs each host contributed, most frequent first
    cat /tmp/cachetest/cache.test | awk -F'+' '{print $1}' | sort | uniq -c | sort -nr | while read url
    do
        url_host=`echo $url | awk '{print $2}'`
        url_yes=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | awk -F'+' '{print $5}' | egrep -c yes`
        url_no=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | awk -F'+' '{print $5}' | egrep -c no`
        url_unknown=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | awk -F'+' '{print $5}' | egrep -c unknown`
        ((url_sum=url_yes+url_no+url_unknown))
        url_yes_b=`awk -v url_yes=$url_yes -v url_sum=$url_sum 'BEGIN{printf "%.1f", url_yes/url_sum*100}'`
        url_no_b=`awk -v url_no=$url_no -v url_sum=$url_sum 'BEGIN{printf "%.1f", url_no/url_sum*100}'`
        url_unknown_b=`awk -v url_unknown=$url_unknown -v url_sum=$url_sum 'BEGIN{printf "%.1f", url_unknown/url_sum*100}'`
        url_ip=`cat /tmp/cachetest/cache.test | egrep $url_host 2>/dev/null | head -1 | awk -F'+' '{print $3}'`
        # a host counts as cacheable when more than 70% of its URLs are
        url_status=`awk -v url_yes_b=$url_yes_b 'BEGIN{if (url_yes_b>70) print "yes"; else print "no"}'`
        echo -n -e "$url" | sed 's/ /\t/'
        tput cup $i 40;  echo -n "$url_ip"
        tput cup $i 96;  echo -n "$url_yes_b%"
        tput cup $i 112; echo -n "$url_no_b%"
        tput cup $i 128; echo -n "$url_unknown_b%"
        tput cup $i 144; echo "$url_status"
        ((i+=1))
    done
}

host $1 &>/dev/null || { echo "The url is wrong, can't resolve the host!!"; exit; }
mkdir /tmp/cachetest &>/dev/null
[ -f /tmp/cachetest/cache.test ] && rm /tmp/cachetest/cache.test

# pull every element URL matching the filter out of the page and count them
num=`elinks --dump $1 | egrep $2 | sed '/http:/s/http:/\nhttp:/' | awk -F'[/|,]' '/^http/{print $0}' | sort | wc -l`
echo "`tput bold`The sum of links is $num!"
echo "`tput bold`The analysis is running..."
tput sgr0

elinks --dump $1 | egrep $2 | sed '/http:/s/http:/\nhttp:/' | awk -F'[/|,]' '/^http/{print $0}' | sort | while read url
do
    url_host=`echo $url | awk -F'[/|,]' '/^http/{print $3}'`
    url_url=`echo $url | sed "s/$url_host/\t/" | awk '{print $2}'`
    # join multiple A records into a single ip1/ip2/... string
    url_ip=`host $url_host | egrep address | awk '{print $NF}' | sed '{1h;1!H;$g;$!d;s/\n/\//g}'`
    # probe only the response headers, with a three-second timeout
    cc=`curl -s -I -m 3 $url | egrep -i cache-control | head -1 | sed '{1h;1!H;$g;$!d;s/\n/ /g}'`
    if [ "$cc" ]; then
        cc_n=${#cc}
        ((cc_n-=1))
        cc=`echo $cc | cut -b1-$cc_n`    # strip the trailing \r from the header line
    else
        cc="no cache-control flags or time out"
    fi
    # pull out the number that follows max-age=
    cc_i=`echo $cc | sed 's/max-age=/\n/' | awk -F'[, ]' 'NR==2{print $1}'`
    if echo $cc | egrep no-cache &>/dev/null || echo $cc | egrep no-store &>/dev/null; then
        cc_status="no"
    elif [[ $cc_i = 0 ]]; then
        cc_status="no"
    elif [[ $cc_i > 0 ]]; then
        cc_status="yes"
    else
        cc_status="unknown"
    fi
    # record: host+path+ip+cache-control+verdict, '+' as the field separator
    echo -e "$url_host+$url_url+$url_ip+$cc+$cc_status" >> /tmp/cachetest/cache.test
    echo -n "#"    # progress indicator
done

sleep 2
showprint
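
Run the script with the domain as the first argument and an egrep pattern matching the element types you want to test as the second; the pattern is passed straight to egrep over the elinks dump, so anything that matches is treated as a candidate element URL. The file name and filter below are only examples:

chmod +x cachetest.sh
./cachetest.sh www.example.com 'png|jpg|gif|css|js'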


This article is from the "Running Linux" blog; please keep this source: http://benpaozhe.blog.51cto.com/10239098/1747543
