Environment: Python 2.7 on macOS
The crawl target is the latest-movies list page of Movie Paradise (dytt8.net). Link address: http://www.dytt8.net/html/gndy/dyzz/index.html
Get the links to the movie detail pages from the list page
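import urllib2
import os
import re
import string

# Movie URL collection
movieurls = []

# Get the movie list
def querymovielist():
    url = 'http://www.dytt8.net/html/gndy/dyzz/index.html'
    content = urllib2.urlopen(url)
    content = content.read()
    # The site serves GB2312; re-encode as UTF-8 and ignore undecodable bytes
    content = content.decode('gb2312', 'ignore').encode('utf-8', 'ignore')
    # First pass: cut out the block that holds the movie list.
    # The HTML tags inside this pattern did not survive; only these fragments remain.
    pattern = re.compile('.*?>' +
                         '(.*?)', re.S)
    items = re.findall(pattern, content)
    listing = ''.join(items)
    # Second pass: capture each entry's link (group 0) and its other fields.
    # The tags inside this pattern did not survive either.
    pattern = re.compile('(.*?).*?(.*?)', re.S)
    news = re.findall(pattern, listing)
    for j in news:
        movieurls.append('http://www.dytt8.net' + j[0])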
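The list-page step is two regex passes: first isolate the block that holds the listing, then pull each link out of that block and prefix it with the site root. Here is a minimal sketch of that idea on a made-up HTML fragment. The `div`/`a` markup, the `class` names, the sample paths, and the helper name `extract_detail_urls` are all assumptions for illustration and are not the real page's markup.

```python
import re

BASE_URL = 'http://www.dytt8.net'

# Hypothetical list-page fragment; the real page's markup may differ.
SAMPLE_LIST_HTML = '''
<div class="list">
  <a href="/html/gndy/dyzz/0001.html" class="title">Movie A</a>
  <a href="/html/gndy/dyzz/0002.html" class="title">Movie B</a>
</div>
<div class="footer"><a href="/about.html">About</a></div>
'''

def extract_detail_urls(html, base_url=BASE_URL):
    """Return absolute detail-page URLs found inside the listing block."""
    # Pass 1: isolate the listing block so links elsewhere on the page are ignored.
    blocks = re.findall(r'<div class="list">(.*?)</div>', html, re.S)
    listing = ''.join(blocks)
    # Pass 2: capture (href, title) pairs; the non-greedy groups stop at the
    # first closing quote and the first closing tag.
    links = re.findall(r'<a href="(.*?)" class="title">(.*?)</a>', listing, re.S)
    return [base_url + href for href, _title in links]

urls = extract_detail_urls(SAMPLE_LIST_HTML)
for url in urls:
    print(url)
```

Returning the list instead of appending to a global makes the function easy to try on saved HTML without touching the network.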
Crawl the movie data from each detail page
# Note: the HTML tags inside the regex patterns below (and a few string
# literals) did not survive; only the remaining fragments are shown.
def querymovieinfo(movieurls):
    for index, item in enumerate(movieurls):
        print('movie URL: ' + item)
        content = urllib2.urlopen(item)
        content = content.read()
        content = content.decode('gb2312', 'ignore').encode('utf-8', 'ignore')

        # Movie name
        moviename = re.findall(r'(.*?)', content, re.S)
        if (len(moviename) > 0):
            moviename = moviename[0] + ""
            # Extract the name between the two delimiter characters (the
            # characters passed to find() are missing; the + 3 presumably
            # skips a 3-byte UTF-8 opening delimiter)
            moviename = moviename[moviename.find("") + 3:moviename.find("")]
        else:
            moviename = ""
        print("movie name: " + moviename.strip())

        # Main content block of the detail page
        moviecontent = re.findall(r'(.*?)', content, re.S)
        if (len(moviecontent) == 0):
            # Nothing to parse on this page; skip it instead of failing on moviecontent[0]
            continue

        # Release date: the last 10 characters of the match
        pattern = re.compile('(.*?)', re.S)
        moviedate = re.findall(pattern, moviecontent[0])
        if (len(moviedate) > 0):
            moviedate = moviedate[0].strip() + ''
        else:
            moviedate = ""
        print("movie release time: " + moviedate[-10:])

        # Basic info
        pattern = re.compile('(.*?)', re.S)
        movieinfo = re.findall(pattern, moviecontent[0])
        if (len(movieinfo) > 0):
            movieinfo = movieinfo[0] + ''
            # Delete the line-break tag (the tag itself is missing here)
            movieinfo = movieinfo.replace("", "")
            # Split movieinfo by the field separator symbol (the symbol is missing here)
            movieinfo = movieinfo.split("")
        else:
            movieinfo = ""
        print("movie base info:")
        for field in movieinfo:
            print(field)

        # Movie poster
        pattern = re.compile('', re.S)
        movieimg = re.findall(pattern, moviecontent[0])
        if (len(movieimg) > 0):
            movieimg = movieimg[0]
        else:
            movieimg = ""
        print("movie poster: " + movieimg)

        # Download address
        pattern = re.compile('.*?', re.S)
        moviedownurl = re.findall(pattern, moviecontent[0])
        if (len(moviedownurl) > 0):
            moviedownurl = moviedownurl[0]
        else:
            moviedownurl = ""
        print("movie download address: " + moviedownurl + "")
        print("------------------------------------------------\n\n\n")
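Every field on the detail page follows the same routine: run a regex, take the first match if there is one, otherwise fall back to an empty string. The sketch below folds that routine into one helper and applies it to a made-up detail fragment. The tags, the `[` `]` name delimiters, the `*` field separator, and the names `first_match` and `parse_detail` are assumptions chosen for illustration and do not reflect the real page's markup.

```python
import re

# Hypothetical detail-page fragment; tag names and field markers are made up.
SAMPLE_DETAIL_HTML = '''
<h1>Download: [Movie A] HD</h1>
<div id="content">
  <span class="date">Released: 2017-01-15</span>
  <img src="http://example.com/poster.jpg" alt="poster">
  <p>*Title: Movie A<br />*Year: 2017<br />*Country: N/A</p>
  <a href="ftp://example.com/movie-a.mkv">download</a>
</div>
'''

def first_match(pattern, text, default=''):
    """Return the first captured group of pattern in text, or default if absent."""
    found = re.search(pattern, text, re.S)
    return found.group(1).strip() if found else default

def parse_detail(html):
    """Extract name, date, info fields, poster and download link from one page."""
    body = first_match(r'<div id="content">(.*?)</div>', html)
    title = first_match(r'<h1>(.*?)</h1>', html)
    # Keep only the text between the two delimiters, if both are present.
    start, end = title.find('['), title.find(']')
    name = title[start + 1:end] if 0 <= start < end else ''
    # Drop the line-break tags, then split the block on the field separator.
    info = first_match(r'<p>(.*?)</p>', body).replace('<br />', '')
    return {
        'name': name,
        'date': first_match(r'<span class="date">(.*?)</span>', body)[-10:],
        'info': [field for field in info.split('*') if field],
        'poster': first_match(r'<img src="(.*?)"', body),
        'download': first_match(r'<a href="(.*?)"', body),
    }

movie = parse_detail(SAMPLE_DETAIL_HTML)
for key in ('name', 'date', 'info', 'poster', 'download'):
    print(key, movie[key])
```

Because `first_match` already supplies the default, a page that lacks a field yields an empty value rather than an `IndexError`.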
Run the crawl
if __name__ == '__main__':
    print("Start capturing movie data")
    querymovielist()
    print(len(movieurls))
    querymovieinfo(movieurls)
    print("Finished capturing movie data")
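One thing the run step does not cover is a detail page that fails to load, which would stop the whole crawl. As a rough sketch of one way to handle that, the driver below takes the fetch and parse steps as arguments and skips pages that raise `IOError` (in Python 2.7, `urllib2.URLError` is a subclass of `IOError`). The names `crawl`, `fake_fetch` and `FAKE_PAGES` are hypothetical, and `str.split` and `str.upper` merely stand in for the real link extraction and page parsing so the example runs offline.

```python
def crawl(list_url, fetch, extract_urls, parse_detail):
    """Fetch the list page, then fetch and parse each detail page it links to."""
    results = []
    for url in extract_urls(fetch(list_url)):
        try:
            results.append(parse_detail(fetch(url)))
        except IOError as error:
            # A single unreachable page should not abort the whole run.
            print('skipping %s: %s' % (url, error))
    return results

# Stand-in "site" so the driver can be exercised without the network.
FAKE_PAGES = {
    'list': 'a.html b.html missing.html',
    'a.html': 'Movie A',
    'b.html': 'Movie B',
}

def fake_fetch(url):
    """Return the fake page body, or raise IOError like a failed download."""
    if url not in FAKE_PAGES:
        raise IOError('not found')
    return FAKE_PAGES[url]

movies = crawl('list', fake_fetch, extract_urls=str.split, parse_detail=str.upper)
print(movies)
```

With the real crawler, `fetch` would wrap `urllib2.urlopen(url).read()` plus the GB2312 decoding step.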
Summary
Learning regular expressions well really is important! Python's syntax also feels pleasant to write compared with Java...