Topic-focused crawling

Source: Internet
Author: User

Topic-focused crawling: filtering the URLs to crawl, or the content of the crawled web pages, according to a preset topic.


Information filtering methods for crawlers

(1) Filtering with regular expressions


(2) Filtering with XPath expressions (commonly used in Scrapy)


(3) Filtering with XSLT
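As a minimal sketch of methods (1) and (2) — the topic, URLs, and sample page below are made-up assumptions, and the XPath step uses the standard library's limited XPath support rather than Scrapy:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical topic and candidate URLs discovered by the crawler
topic_pat = re.compile(r"python", re.I)
urls = [
    "http://example.com/python/tutorial.html",
    "http://example.com/sports/news.html",
    "http://example.com/blog/python-tips.html",
]

# (1) Keep only the URLs that match the topic pattern
on_topic = [u for u in urls if topic_pat.search(u)]
print(on_topic)

# (2) Extract targeted content from a (well-formed) page with an XPath expression
page = ("<html><body><p>Python crawlers filter pages by topic.</p>"
        "<p>Off-topic text.</p></body></html>")
root = ET.fromstring(page)
paragraphs = [p.text for p in root.findall(".//p")]
on_topic_text = [t for t in paragraphs if topic_pat.search(t)]
print(on_topic_text)
```

In Scrapy the equivalent extraction would be written as something like `response.xpath('//p/text()').getall()`; the lxml library offers full XPath 1.0 if the standard library's subset is too limited.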


Third: crawling comment content with a topic-focused crawler

The following script crawls the comments for a Tencent video from the coral.qq.com comment API, extracts each comment's id, nickname, and content with regular expressions, and pages forward automatically:

    import urllib.request
    import http.cookiejar
    import re

    # Video ID
    vid = "1472528692"
    # Comment ID to start from
    comid = "6173403130078248384"

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Encoding": "gb2312,utf-8",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
        "Connection": "keep-alive",
        "Referer": "http://qq.com",
    }

    # Install a cookie-aware opener that sends the headers above with every request
    cjar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
    opener.addheaders = list(headers.items())
    urllib.request.install_opener(opener)

    # Custom function craw(vid, comid): fetch one page of comments and return the raw text
    def craw(vid, comid):
        url = ("http://coral.qq.com/article/" + vid +
               "/comment?commentid=" + comid + "&reqnum=20")
        data = urllib.request.urlopen(url).read().decode("utf-8")
        return data

    idpat = '"id":"(.*?)"'
    userpat = '"nick":"(.*?)",'
    conpat = '"content":"(.*?)",'

    # Outer loop: one iteration per page of comments
    for i in range(1, 10):
        print("------------------------------------")
        print("Comments on page " + str(i) + ":")
        data = craw(vid, comid)
        idlist = re.compile(idpat, re.S).findall(data)
        userlist = re.compile(userpat, re.S).findall(data)
        conlist = re.compile(conpat, re.S).findall(data)
        # Inner loop: extract and print the 20 comments on this page
        for j in range(0, 20):
            # eval('u"..."') turns the \uXXXX escapes in the raw response into readable text
            print("User name: " + eval('u"' + userlist[j] + '"'))
            print("Comment: " + eval('u"' + conlist[j] + '"'))
            print("\n")
        # Set comid to the last comment id on this page so the next request loads the next page
        comid = idlist[19]
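Since the API response is JSON-like, the eval('u"..."') trick above (which decodes the \uXXXX escapes in the raw text) can be replaced by the standard json module, which is safer than eval. A minimal sketch with a made-up payload in the same shape the regexes expect:

```python
import json

# Made-up fragment shaped like one comment record from the response
raw = '{"id":"6173403130078248385","nick":"\\u5c0f\\u660e","content":"\\u4e0d\\u9519"}'
comment = json.loads(raw)

# json.loads decodes the \uXXXX escapes into readable text automatically
print(comment["nick"])
print(comment["content"])
```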
