Focused (topic-oriented) crawling: the URLs to be crawled, or the content of the fetched pages, are filtered according to a preset topic.
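As a minimal sketch of topic-based URL filtering, the function below keeps only URLs that mention one of a set of theme keywords. The function name, keyword list, and sample URLs are illustrative assumptions, not part of the original tutorial.

```python
import re

# Hypothetical theme keywords for a comment-focused crawl
THEME_KEYWORDS = ("comment", "review")

def matches_theme(url: str) -> bool:
    """Return True if the URL looks relevant to the configured theme."""
    return any(kw in url.lower() for kw in THEME_KEYWORDS)

urls = [
    "http://coral.qq.com/article/1472528692/comment?reqnum=20",
    "http://example.com/sports/news",
]
# Keep only the theme-relevant URLs for the crawl frontier
relevant = [u for u in urls if matches_theme(u)]
```

In a real focused crawler this check would sit between link extraction and the request queue, so off-topic pages are never fetched.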
Information filtering methods for crawlers:
(1) Filtering with regular expressions
(2) Filtering with XPath expressions (commonly used by Scrapy)
(3) Filtering with XSLT
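The first two filtering methods can be sketched side by side on the same snippet. This is an illustrative example using only the standard library; note that `xml.etree.ElementTree` supports only a limited XPath subset, while Scrapy and lxml (which also provides XSLT support) implement full XPath 1.0.

```python
import re
from xml.etree import ElementTree

html = '<html><body><div class="c"><span>hello</span></div></body></html>'

# (1) Regular expression: extract the text inside <span> tags
spans_re = re.findall(r"<span>(.*?)</span>", html)

# (2) XPath-style query: find all <span> elements anywhere in the tree
root = ElementTree.fromstring(html)
spans_xpath = [e.text for e in root.findall(".//span")]
```

Regex filtering treats the page as flat text and is fragile against markup changes; XPath filtering works on the parsed document tree and is usually the more robust choice when the page is well-formed.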
Third, crawling comment content with a focused crawler
import urllib.request
import http.cookiejar
import re

# Video ID
vid = "1472528692"
# ID of the first comment to start from
comid = "6173403130078248384"
url = "http://coral.qq.com/article/" + vid + "/comment?commentid=" + comid + "&reqnum=20"
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gb2312,utf-8",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
    "Connection": "keep-alive",
    "Referer": "qq.com",
}

# Build and install an opener that carries cookies and the custom headers
cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
headall = []
for key, value in headers.items():
    headall.append((key, value))
opener.addheaders = headall
urllib.request.install_opener(opener)

# Define craw(vid, comid): fetch the corresponding comment page
# automatically and return the crawled data
def craw(vid, comid):
    url = "http://coral.qq.com/article/" + vid + "/comment?commentid=" + comid + "&reqnum=20"
    data = urllib.request.urlopen(url).read().decode("utf-8")
    return data

idpat = '"id":"(.*?)"'
userpat = '"nick":"(.*?)",'
conpat = '"content":"(.*?)",'

# Outer loop: how many pages to crawl; each iteration fetches one page
for i in range(1, 10):
    print("------------------------------------")
    print("Comments on page " + str(i))
    data = craw(vid, comid)
    # Inner loop: extract and print the 20 comments on this page
    for j in range(0, 20):
        idlist = re.compile(idpat, re.S).findall(data)
        userlist = re.compile(userpat, re.S).findall(data)
        conlist = re.compile(conpat, re.S).findall(data)
        print("User name: " + eval('u"' + userlist[j] + '"'))
        print("Comment: " + eval('u"' + conlist[j] + '"'))
        print("\n")
    # Set comid to the id of the last comment on this page,
    # so the next request loads the following page automatically
    comid = idlist[19]
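Since the comment endpoint returns JSON, an alternative to the regex-plus-`eval` extraction above is to parse the response with the standard `json` module, which decodes `\uXXXX` escapes safely and avoids executing reconstructed strings. The payload below is a small sample shaped like the comment response for illustration; the real API's field layout may differ.

```python
import json

# Sample payload mimicking the comment API response (structure assumed)
raw = '''{"data": {"commentid": [
    {"id": "100", "content": "\\u597d\\u770b", "userinfo": {"nick": "user1"}},
    {"id": "101", "content": "nice", "userinfo": {"nick": "user2"}}
]}}'''

payload = json.loads(raw)
comments = payload["data"]["commentid"]
for c in comments:
    # Unicode escapes are already decoded by json.loads; no eval needed
    print(c["userinfo"]["nick"], c["content"])

# The id of the last comment can seed the next request,
# mirroring comid = idlist[19] in the regex version
last_id = comments[-1]["id"]
```

Parsing structured data with a structured parser is also more robust: if a comment body happens to contain a quote or a comma, the regex patterns above can misfire, while `json.loads` handles it correctly.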