A study of the script for crawling Web pages

Source: Internet
Author: User

A study of the web-page crawling script published at http://life2death.blog.51cto.com/7550586/1657133:

Disclaimer: I am reproducing this only to study it. Thanks to the experts who came before me.


The original exercise, a practical enterprise Shell programming problem: http://oldboy.blog.51cto.com/2561410/1657042

This script works generally for pages under http://edu.51cto.com/video. No bugs have been found so far; if you find one, please work out the fix yourself.

---------------------------------------------------------------------------------

#!/bin/bash
# oldboy linux training
# 2015-06-01
# happy children's Day
# Description: This script was developed by Zhang Yao, a student of oldboy Linux class 21.

EduFile=/tmp/edu.html      ## file that stores the page's HTML source
EduFile2=/tmp/edu2.html
Url="$*"                   ## URL argument

# Judge url is ok?
curl -I $Url &>/dev/null   ## test whether the URL can be reached
[ $? -ne 0 ] && {
    echo "Bad url,please check it"
    exit 1                 ## if it cannot be reached, exit immediately and skip the remaining steps
}

# defined get Pagenum and courseid functions
function getnum() {
    ## This function mainly extracts the total number of video pages from a URL page
    curl -s $Url > $EduFile
    grep 'pagesgoend' $EduFile &>/dev/null   ## checks whether this video series is already complete

## On 51cto, the HTML of a completed series and of a series still being updated differs as follows.

Completed series, marked with pagesgoend:

[Screenshot: Wang.png, http://s3.51cto.com/wyfs02/M01/6D/E6/wKioL1Vuxr6iA2FiAANcNtlEqGw515.jpg]

Series still being updated, where the page-number link on the last page carries a "#":

[Screenshot: Gengxing.png, http://s3.51cto.com/wyfs02/M02/6D/EB/wKiom1Vu1r7zrjZhAAG8FCLMEvQ889.jpg]
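The detection step can be tried without fetching anything. The following is a minimal sketch, assuming two made-up one-line HTML snippets (the real 51cto markup may differ) and a hypothetical helper named series_state; it uses grep -q, which sets the exit status without printing, in place of redirecting to /dev/null and testing $?.

```shell
# Hypothetical HTML snippets; the real 51cto markup may differ.
finished='<a href="?page=9" class="pagesgoend">Last</a>'
updating='<a href="?page=3#" class="pagesnum">3</a>'

# Hypothetical helper: prints "finished" when the marker class is present,
# otherwise "updating".
series_state() {
    if printf '%s\n' "$1" | grep -q 'pagesgoend'; then
        echo "finished"
    else
        echo "updating"
    fi
}

series_state "$finished"
series_state "$updating"
```

The first call prints finished and the second prints updating, which mirrors the two branches the script takes next.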

    if [ $? -eq 0 ]   ## if the series has finished updating, extract the page count as follows
    then
        num=$(sed -rn 's#.*page=([0-9].*)" class="pagesgoend".*$#\1#gp' $EduFile)

####

sed:

-r enables extended regular expressions; -n suppresses the default output.

s///gp: s means substitute, g means global, and p prints the lines on which a substitution was made (used together with -n).

\1: refers to whatever the first pair of parentheses matched. Here that is the ([0-9].*) in page=([0-9].*), so the command prints the number, that is, the number of pages in the video series.
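To see these options working together, here is a small sketch that runs the same substitution on a single made-up line. The sample HTML is an assumption, not copied from 51cto, and -r assumes GNU sed.

```shell
# Hypothetical one-line sample of pagination HTML; the real page may differ.
sample='<a href="/index.php?do=course&page=7" class="pagesgoend">Last</a>'

# -r: extended regex; -n: no automatic printing;
# p: print only lines where the substitution happened;
# \1: the text captured by ([0-9].*).
pages=$(printf '%s\n' "$sample" |
    sed -rn 's#.*page=([0-9].*)" class="pagesgoend".*$#\1#gp')

echo "$pages"   # prints 7
```

Note that ([0-9].*) accepts any characters after the first digit, so it is looser than ([0-9]+); it works here because the match is bounded by the closing quote and the class attribute.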


    else              ## the series is still being updated, so extract the page count this way
        num=$(sed -rn 's|.*page=([0-9].*)#" class="pagesnum".*$|\1|gp' $EduFile)
    fi
    pagenum=${num:-1}   ## if num is unset or empty, use the value after ":-", here 1; ":-" is fixed syntax
    CourseId=$(echo $Url | awk -F "[-.]" '{print $4}')   ## extract the course ID
}

### The extracted page count looks like this:

[Screenshot: Yeshu.jpg, http://s3.51cto.com/wyfs02/M01/6D/E5/wKioL1VuwoeA5pD8AAEI9B-UnOU972.jpg]

The course ID looks like this:

[Screenshot: Idkec.png, http://s3.51cto.com/wyfs02/M02/6D/E7/wKioL1Vu0bPwNX6zAABe-l9Wi1w456.jpg]
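Both extractions can be checked in isolation. This is a minimal sketch, assuming a course URL of the shape course_id-839.html; that URL shape is a guess made for illustration, chosen so that the fourth field is the ID when splitting on "-" or ".".

```shell
# Hypothetical course URL; the "course_id-839.html" shape is an assumption.
Url="http://edu.51cto.com/course/course_id-839.html"

# Split on "-" or "."; the fields are then:
# http://edu | 51cto | com/course/course_id | 839 | html
CourseId=$(echo "$Url" | awk -F '[-.]' '{print $4}')
echo "course id: $CourseId"

# ${num:-1} falls back to 1 when num is unset or empty.
num=""
pagenum=${num:-1}
echo "empty num -> pagenum=$pagenum"

num=12
pagenum=${num:-1}
echo "num is 12 -> pagenum=$pagenum"
```

With this input the course ID is 839, the empty num yields a page count of 1, and a non-empty num is passed through unchanged.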

# Defined Curl HTML Functions
function Curl() {      ## store the course HTML of every page in /tmp/edu.html
    getnum
    for i in $(seq $pagenum)
    do
        curl "http://edu.51cto.com/index.php?do=course&m=lessions&course_id=$CourseId&page=$i" 1>>$EduFile 2>/dev/null
    done

}

### http://edu.51cto.com/index.php?do=course&m=lessions&course_id=839&page=1

Opening this link shows the following page (so the video list can be fetched one page at a time):

[Screenshot: Yemian.jpg, http://s3.51cto.com/wyfs02/M00/6D/E7/wKioL1Vu086goFgeAAPEsf0Gx_A129.jpg]
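The paging loop can be dry-run by printing the URLs instead of fetching them. The sketch below assumes a page count of 3 and uses a hypothetical helper named page_urls; the course ID 839 comes from the link shown above.

```shell
CourseId=839   # example ID taken from the link shown above
pagenum=3      # hypothetical page count

# Hypothetical helper: prints one listing URL per page instead of calling curl.
page_urls() {
    for i in $(seq "$pagenum"); do
        echo "http://edu.51cto.com/index.php?do=course&m=lessions&course_id=${CourseId}&page=${i}"
    done
}

page_urls
```

Printing first makes it easy to confirm that the query string has no stray spaces before any request is sent.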

# Defined Create Table Functions
function table() {
    sum=""
    index=1
    sed -rn '/do=lesson/s#<.*(<a href=")(.*)</h4>$#\1http://edu.51cto.com\2#gp' $EduFile > $EduFile2   ## extract each video's URL and its title

The extracted lines look like this:

[Screenshot: Biao.png, http://s3.51cto.com/wyfs02/M00/6D/EB/wKiom1Vu0-rx2E6cAALxatYv-Y0951.jpg]
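The link-extraction step can be reproduced on one made-up line. This sketch assumes a lesson entry wrapped in an h4 element with a relative href; the id=1234 query string is invented for illustration, and -r assumes GNU sed.

```shell
# Hypothetical listing line; the real markup and query string may differ.
sample='<h4><a href="/index.php?do=lesson&id=1234">Lesson 1</a></h4>'

# Only lines containing do=lesson are processed. Group 1 keeps '<a href="',
# group 2 keeps the relative path plus the title, and the site prefix is
# inserted between them; the wrapping h4 tags are dropped.
extracted=$(printf '%s\n' "$sample" |
    sed -rn '/do=lesson/s#<.*(<a href=")(.*)</h4>$#\1http://edu.51cto.com\2#gp')

echo "$extracted"
# prints <a href="http://edu.51cto.com/index.php?do=lesson&id=1234">Lesson 1</a>
```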

    while read line
    do
        sum=$sum"<tr><th width=\"\" scope=\"row\">$index</th><td width=\"520\">$line</td>"
        ((index++))    ## count how many videos there are
    done < $EduFile2   ## re-arrange the extracted URLs and titles into table rows
}

The re-arranged result looks like this:

[Screenshot: th.png, http://s3.51cto.com/wyfs02/M01/6D/E7/wKioL1Vu1wzAUfV3AAOiRhlwRno394.jpg]
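The row-building loop can be exercised with a temporary file standing in for $EduFile2. This is a sketch under assumptions: the two input lines are made up, the file name rows_file is hypothetical, each row is closed with a closing tr tag (which the loop above does not show), and POSIX arithmetic replaces ((index++)) so it also runs under plain sh.

```shell
# Hypothetical extracted lines standing in for the contents of $EduFile2.
rows_file=$(mktemp)
printf '%s\n' \
    '<a href="http://edu.51cto.com/a">Lesson A</a>' \
    '<a href="http://edu.51cto.com/b">Lesson B</a>' > "$rows_file"

sum=""
index=1
while IFS= read -r line; do
    # Append one numbered table row per extracted link.
    sum="${sum}<tr><th scope=\"row\">${index}</th><td width=\"520\">${line}</td></tr>"
    index=$((index + 1))
done < "$rows_file"
rm -f "$rows_file"

echo "$sum"
echo "videos: $((index - 1))"
```

Because the loop reads from a redirected file and not from a pipe, sum and index keep their values after the loop ends, which is what lets a later function use them.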

# defined create html functions
function html() {
    cat >/tmp/oldboy.html<<-end


This article is from the "Tiandaochouqin" blog. Please keep this source link when reposting: http://luzhi1024.blog.51cto.com/8845546/1657977
