URL: https://movie.douban.com/top250
The list holds 250 films spread over several pages; the goal is to get the detailed information for each film.
No framework is used: pages are fetched with urllib, links are picked out with regular expressions (re), and the detail fields are looked up with lxml XPath queries.
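As a minimal sketch of the pagination, the list pages can be addressed by appending an offset to the `start=` parameter, 25 films per page. The helper name `page_urls` and the constants are illustrative assumptions, not part of the original script.

```python
# Sketch only: page_urls, PAGE_SIZE and TOTAL_FILMS are assumed names.
BASE_URL = "https://movie.douban.com/top250?start="
PAGE_SIZE = 25      # films per list page
TOTAL_FILMS = 250   # size of the whole list


def page_urls(base=BASE_URL, total=TOTAL_FILMS, page_size=PAGE_SIZE):
    """Return one URL per list page: start=0, 25, ..., 225."""
    return [base + str(offset) for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each URL is built from the unchanged base, so the offsets never run together from one page to the next.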
from film import Film
from urllib import request
import time, re

base_url = r'https://movie.douban.com/top250?start='
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Connection': 'keep-alive'
}

for i in range(10):
    # Build each page URL from the unchanged base; appending to the same
    # variable on every pass would produce start=0, start=025, start=02550, ...
    url = base_url + str(i * 25)
    print(url)

    req = request.Request(url, headers=headers)
    page = request.urlopen(req).read()
    page = page.decode('utf-8')
    # fp = open("page.txt", mode="w", encoding="utf-8")
    # fp.writelines(page)

    # Each entry on the list page: the rank in <em>, followed by the detail link.
    p = re.compile(r'<em\sclass="">\d+</em>\s*<a\shref="https://movie.douban.com/subject/\d+/">')
    result = p.findall(page)
    for item in result:
        # print(item)
        p = re.compile(r'\d+')
        no = p.findall(item)        # no[0] is the rank
        # print(no[0])
        p = re.compile(r'https://movie.douban.com/subject/\d+/')
        rurl = p.findall(item)      # rurl[0] is the detail page URL
        # print(rurl[0])
        filma = Film(no[0], rurl[0], '', '', '', '', '', '')
        filma.getall()
        filma.detail()
        time.sleep(3)               # pause between detail requests
    # print(result)
    time.sleep(3)                   # pause between list pages
    # print(i)
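The matching step can be tried offline. The sketch below captures the rank and the detail URL in one pass using named groups. The `SAMPLE` markup and its subject IDs are made up to match the shape the pattern expects, and the name `extract_entries` is an assumption; the live page may differ.

```python
import re

# Hypothetical markup with placeholder subject IDs, not real page data.
SAMPLE = '''
<div class="pic">
    <em class="">1</em>
    <a href="https://movie.douban.com/subject/1000001/">
</div>
<div class="pic">
    <em class="">2</em>
    <a href="https://movie.douban.com/subject/1000002/">
</div>
'''

# Same shape as the pattern in the script, with named groups for both fields.
ENTRY = re.compile(
    r'<em\sclass="">(?P<no>\d+)</em>\s*'
    r'<a\shref="(?P<url>https://movie\.douban\.com/subject/\d+/)">'
)


def extract_entries(html):
    """Return a list of (rank, detail_url) tuples found in html."""
    return [(m.group("no"), m.group("url")) for m in ENTRY.finditer(html)]


entries = extract_entries(SAMPLE)
for no, url in entries:
    print(no, url)
```

Named groups avoid the second and third `findall` calls per entry, at the cost of a slightly longer pattern.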
film.py (if you want to add data persistence, this class is the place to implement it)
from urllib import request
from lxml import etree


class Film:
    def __init__(self, no, url, name, year, score, director, classification, actor):
        self.name = name
        self.year = year
        self.score = score
        self.director = director
        self.classification = classification
        self.actor = actor
        self.url = url
        self.no = no

    def detail(self):
        # Print all fields on one line.
        temp = "no:%s;url:%s; name:%s; year:%s; score:%s; director:%s; classification:%s; actor:%s ;" % (
            self.no, self.url, self.name, self.year, self.score,
            self.director, self.classification, self.actor)
        print(temp)

    def getall(self):
        # Fetch the detail page and fill in the fields with XPath lookups.
        headers = {
            'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
            'Connection': 'keep-alive'
        }
        req = request.Request(self.url, headers=headers)
        page = request.urlopen(req).read()
        page = page.decode('utf-8')
        selector = etree.HTML(page)
        print(page)  # debug output: dumps the raw HTML of the detail page
        # xpath() returns a list of matching text nodes, so every field is a list.
        self.name = selector.xpath('/html/body/div[3]/div[1]/h1/span[1]/text()')
        self.year = selector.xpath('//*[@id="content"]/h1/span[2]/text()')
        self.score = selector.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()')
        self.director = selector.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
        self.classification = selector.xpath('//*[@id="info"]/span[5]/text()')
        self.actor = selector.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')
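One possible way to add the persistence mentioned above is to write the collected fields to CSV with the standard library. This is a sketch, not the author's implementation. The names `write_films` and `as_text` are assumptions, and the demo object is a stand-in with made-up values that only mirrors the attribute names of the Film class.

```python
import csv
import io
from types import SimpleNamespace

FIELDS = ["no", "url", "name", "year", "score",
          "director", "classification", "actor"]


def as_text(value):
    """XPath lookups yield lists of strings; join them into one CSV cell."""
    if isinstance(value, (list, tuple)):
        return "/".join(str(v).strip() for v in value)
    return str(value)


def write_films(films, fileobj):
    """Write a header row, then one row per film, to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(FIELDS)
    for film in films:
        writer.writerow([as_text(getattr(film, f)) for f in FIELDS])


# Stand-in for a populated Film object; every value here is invented.
demo = SimpleNamespace(
    no="1", url="https://movie.douban.com/subject/1000001/",
    name=["Example Film"], year=["(2000)"], score=["9.0"],
    director=["A. Director"], classification=["Drama"],
    actor=["Actor One", "Actor Two"],
)

buffer = io.StringIO()
write_films([demo], buffer)
print(buffer.getvalue())
```

To write to disk instead of memory, pass a file opened with `open("films.csv", "w", newline="", encoding="utf-8")`.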
Python crawler: getting detailed information about Douban's Top 250 movies