Do a little exercise and grab the page of the latest movie in movie Heaven. Link Address:http://www.dytt8.net/html/gndy/dyzz/index.html
First we need to get the web address of the movie details:
ImportUrllib2ImportOSImportReImportstring#Movie URL CollectionMovieurls = []#get a list of moviesdefquerymovielist (): URL='http://www.dytt8.net/html/gndy/dyzz/index.html'conent=urllib2.urlopen (URL) conent=conent.read () conent= Conent.decode ('gb2312','Ignore'). Encode ('Utf-8','Ignore') Pattern= Re.compile ('<div class= "Title_all" >'+'(.*?) <TD height= "align=" center "bgcolor=" #F4FAE2 ">', Re. S) Items=re.findall (pattern,conent) str="'. Join (items) pattern= Re.compile ('<a href= "(. *?)" class= "Ulink" > (. *?) </A>.*?<TD colspan.*?> (. *?) </td>', Re. S) News=Re.findall (pattern, str) forJinchNews:movieUrls.append ('http://www.dytt8.net'+j[0])
crawl the movie data in the details page
defQuerymovieinfo (movieurls): forIndex, iteminchEnumerate (movieurls):Print('Movie URL:'+Item) conent=Urllib2.urlopen (item) conent=conent.read () conent= Conent.decode ('gb2312','Ignore'). Encode ('Utf-8','Ignore') Moviename= Re.findall (r'<div class= "Title_all" >', Conent, re. S)if(Len (Moviename) >0): Moviename= Moviename[0] +"" #Intercept nameMoviename = Moviename[moviename.find ("the") + 3:moviename.find (""")] Else: Moviename="" Print("Movie Name:"+Moviename.strip ()) Moviecontent= Re.findall (r'<div class= "Co_content8" > (. *?) </tbody>', Conent, re. S) Pattern= Re.compile ('<ul> (. *?) <tr>', Re. S) Moviedate=Re.findall (pattern,moviecontent[0])if(Len (moviedate) >0): Moviedate= Moviedate[0].strip () +"' Else: Moviedate="" Print("Movie release Time:"+ moviedate[-10:]) Pattern= Re.compile ('<br/><br/> (. *?) <br/><br/>') Movieinfo=Re.findall (pattern, moviecontent[0])if(Len (Movieinfo) >0): Movieinfo= movieinfo[0]+"' #Remove <br/> TagsMovieinfo = Movieinfo.replace ("<br/>","") #Split by SymbolMovieinfo= Movieinfo.split ('◎') Else: Movieinfo="" Print("Basic movie Information:") forIteminchMovieinfo:Print(item)#movie posterPattern = Re.compile ('', Re. S) movieimg=Re.findall (pattern,moviecontent[0])if(Len (movieimg) >0): movieimg=Movieimg[0]Else: Movieimg="" Print("movie poster:"+movieimg) Pattern= Re.compile ('<td style= "Word-wrap:break-word" bgcolor= "#fdfddf" ><a href= "(. *?)" >.*?</a></td>', Re. S) Moviedownurl=Re.findall (pattern,moviecontent[0])if(Len (Moviedownurl) >0): Moviedownurl=Moviedownurl[0]Else: Moviedownurl="" Print("Movie:"+ Moviedownurl +"") Print("------------------------------------------------\n\n\n")
Perform crawl
if __name__= ='__main__': print(" start grabbing movie data "); Querymovielist () print(len (movieurls)) querymovieinfo (movieurls) Print (" end crawl movie data ")
Python crawler--capture movie Paradise movie information