A study of the web-page crawling script from http://life2death.blog.51cto.com/7550586/1657133:
Note: I am reproducing this only to study it; thanks to the experts who wrote it.
The original practical enterprise shell-programming problem: http://oldboy.blog.51cto.com/2561410/1657042
This script works for courses under http://edu.51cto.com/video in general. No bugs have been found so far; if you find one, please resolve it yourself.
---------------------------------------------------------------------------------
#!/bin/bash
# oldboy linux training
# 2015-06-01
# happy children's Day
# Description: This script was developed by Zhang Yao, a student in oldboy Linux class 21!
EduFile=/tmp/edu.html    ## file that stores the page's HTML source
EduFile2=/tmp/edu2.html
Url="$*"    ## URL argument
# Judge url is ok?
curl -I $Url &>/dev/null    ## test whether the URL can be reached
[ $? -ne 0 ] && {
    echo "Bad url,please check it"
    exit 1    ## if it cannot be reached, exit immediately and skip the remaining steps
}
# defined get pagenum and CourseId functions
function getnum() {    ## this function extracts the total number of video pages from the URL's page
    curl -s $Url > $EduFile
    grep 'pagesgoend' $EduFile &>/dev/null    ## check whether this course is already complete
## On 51cto, the HTML of a completed course and of a course that is still being updated looks like this:
Completed course, whose markup contains the text pagesgoend:
[Image: http://s3.51cto.com/wyfs02/M01/6D/E6/wKioL1Vuxr6iA2FiAANcNtlEqGw515.jpg]
Course still being updated, where the last page link has "#" after the page number:
[Image: http://s3.51cto.com/wyfs02/M02/6D/EB/wKiom1Vu1r7zrjZhAAG8FCLMEvQ889.jpg]
    if [ $? -eq 0 ]    ## if the course has finished updating, extract the page count as follows
    then
        num=`sed -rn 's#.*page=([0-9].*)" class="pagesgoend".*$#\1#gp' $EduFile`
####
sed:
-r enables extended regular expressions; -n suppresses the automatic printing of every line.
s///gp: s means substitute, g means global, and p prints the lines on which a substitution was made (used together with -n).
\1: refers to the contents of the first pair of parentheses. Here that is the group in page=([0-9].*), so the command prints the number, which is the number of pages of the course.
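To see the substitution in isolation, here is a minimal sketch run against one hypothetical line of pagination markup. The sample HTML is an assumption made for illustration and is not copied from 51cto; the sketch also assumes GNU sed, since -r is a GNU option.

```shell
# Hypothetical sample line; the real 51cto markup may differ.
sample='<a href="/course/index.html?page=7" class="pagesgoend">Last</a>'

# -r: extended regex; -n: print nothing by default; p: print the substituted line.
# \1 is whatever ([0-9].*) matched between 'page=' and '" class="pagesgoend"'.
num=$(printf '%s\n' "$sample" | sed -rn 's#.*page=([0-9].*)" class="pagesgoend".*$#\1#gp')

echo "$num"   # prints: 7
```

Without -n the line would be printed twice when it matches, and non-matching lines would be printed unchanged, so -n and p are used as a pair.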
    else    ## the course is still being updated, so extract the page count as follows
        num=`sed -rn 's|.*page=([0-9].*)#" class="pagesnum".*$|\1|gp' $EduFile`
    fi
    pagenum=${num:-1}    ## if num is unset or empty, use the value after ":-", which is 1; ":-" is fixed syntax
    CourseId=`echo $Url | awk -F "[-.]" '{print $4}'`    ## extract the course number
}
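The ${num:-1} default can be checked on its own. A small sketch, with variable values chosen purely for illustration (the pagenum_* names are hypothetical):

```shell
# Case 1: num is empty (no page count was extracted), so the default 1 is used.
num=""
pagenum_empty=${num:-1}

# Case 2: num is unset; ":-" treats unset and empty the same way.
unset num
pagenum_unset=${num:-1}

# Case 3: num has a value, so the default is ignored.
num=7
pagenum_set=${num:-1}

echo "$pagenum_empty $pagenum_unset $pagenum_set"   # prints: 1 1 7
```

This is why a course with a single page, whose HTML has no pagination link for sed to match, still gets fetched once.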
### The extracted page count, for example:
[Image: http://s3.51cto.com/wyfs02/M01/6D/E5/wKioL1VuwoeA5pD8AAEI9B-UnOU972.jpg]
The course number is extracted as follows:
[Image: http://s3.51cto.com/wyfs02/M02/6D/E7/wKioL1Vu0bPwNX6zAABe-l9Wi1w456.jpg]
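The course-number extraction can be tried without any network access. This sketch assumes the course URL has the form .../course_id-<number>.html; the URL below is a hypothetical example of that form.

```shell
# Hypothetical course URL (assumed form: .../course_id-<number>.html).
Url="http://edu.51cto.com/course/course_id-839.html"

# -F "[-.]" splits the line on every "-" or ".", giving the fields:
#   $1=http://edu  $2=51cto  $3=com/course/course_id  $4=839  $5=html
CourseId=$(echo "$Url" | awk -F "[-.]" '{print $4}')

echo "$CourseId"   # prints: 839
```

Because the field number is fixed at 4, a URL with extra dots or hyphens before the course number would yield the wrong field, which is one reason the script only suits URLs of this shape.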
# defined curl html functions
function Curl() {    ## store each page's lesson listing in /tmp/edu.html
    getnum
    for i in `seq $pagenum`
    do
        curl "http://edu.51cto.com/index.php?do=course&m=lessions&course_id=$CourseId&page=$i" 1>>$EduFile 2>/dev/null
    done
}
### http://edu.51cto.com/index.php?do=course&m=lessions&course_id=839&page=1
Opening this link shows the following page (so the lesson list can be fetched one page at a time):
[Image: http://s3.51cto.com/wyfs02/M00/6D/E7/wKioL1Vu086goFgeAAPEsf0Gx_A129.jpg]
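The paging loop can be dry-run by printing the URLs instead of fetching them. In this sketch the page count of 3 is a hypothetical value and count is a helper variable added for illustration; the URL pattern and course number 839 come from the link above.

```shell
CourseId=839   # course number from the example link
pagenum=3      # hypothetical page count

count=0
for i in $(seq "$pagenum")
do
    # The same URL pattern the script passes to curl, one URL per page.
    page_url="http://edu.51cto.com/index.php?do=course&m=lessions&course_id=$CourseId&page=$i"
    echo "$page_url"
    count=$((count + 1))
done
```

Quoting the URL matters here: unquoted, each & would be read by the shell as a request to run the preceding text as a background job.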
# defined create table functions
function table() {
    sum=""
    index=1
    sed -rn '/do=lesson/s#<.*(<a href=")(.*)</h4>$#\1http://edu.51cto.com\2#gp' $EduFile > $EduFile2    ## extract each video's URL and its title
The extracted lines look like this:
[Image: http://s3.51cto.com/wyfs02/M00/6D/EB/wKiom1Vu0-rx2E6cAALxatYv-Y0951.jpg]
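The link extraction can be tried on a single line. The lesson markup below is a hypothetical stand-in for what the listing page contains (a relative link wrapped in an h4 element), and GNU sed is assumed for -r.

```shell
# Hypothetical lesson line; the real page markup may differ.
line='<h4><a href="/index.php?do=lesson&id=1234">Lesson 1</a></h4>'

# Only lines containing do=lesson are processed. Group 1 keeps '<a href="',
# group 2 keeps the relative path plus the title, and the site prefix is
# inserted between them to make the link absolute.
link=$(printf '%s\n' "$line" | sed -rn '/do=lesson/s#<.*(<a href=")(.*)</h4>$#\1http://edu.51cto.com\2#gp')

echo "$link"
```

The surrounding h4 tags are dropped because they fall outside both groups, which leaves a bare anchor element ready to be placed in a table cell.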
    while read line
    do
        sum=$sum"<tr><th width=\"\" scope=\"row\">$index</th><td width=\"520\">$line</td>"
        ((index++))    ## count the videos
    done < $EduFile2    ## rearrange the extracted URLs and titles into table rows
}
The rearranged result looks like this:
[Image: http://s3.51cto.com/wyfs02/M01/6D/E7/wKioL1Vu1wzAUfV3AAOiRhlwRno394.jpg]
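The row-building loop can be sketched with a temporary file standing in for /tmp/edu2.html. The two input lines are hypothetical, the th attributes are simplified, and the counter uses $(( )) arithmetic rather than the script's ((index++)) so that the sketch also runs in a plain POSIX shell.

```shell
# Two hypothetical extracted lines stand in for the contents of /tmp/edu2.html.
tmpfile=$(mktemp)
printf '%s\n' '<a href="http://edu.51cto.com/a">Lesson 1</a>' \
              '<a href="http://edu.51cto.com/b">Lesson 2</a>' > "$tmpfile"

sum=""
index=1
while read -r line
do
    # Append one table row per extracted link, numbered by index.
    sum=$sum"<tr><th scope=\"row\">$index</th><td width=\"520\">$line</td>"
    index=$((index + 1))
done < "$tmpfile"
rm -f "$tmpfile"

echo "$sum"
echo "rows: $((index - 1))"   # prints: rows: 2
```

Redirecting the file into the loop with done < file, instead of piping cat into while, keeps the loop in the current shell, so sum and index are still set after the loop ends.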
# defined create html functions
function html() {
    cat >/tmp/oldboy.html<<-end
This article is from the "Tiandaochouqin" blog; please keep this source attribution: http://luzhi1024.blog.51cto.com/8845546/1657977