Python code for capturing movie information from Movie Heaven (dytt8.net)
Environment: Python 2.7, Mac OS
Capture the list pages of the latest movies on Movie Heaven. Link: http://www.dytt8.net/html/gndy/dyzz/index.html
Get the links to the movie details pages from the list page
    import urllib2
    import os
    import re
    import string

    # Collection of movie URLs
    movieUrls = []

    # Get the movie list
    def queryMovieList():
        url = 'http://…'
        conent = urllib2.urlopen(url)
        conent = conent.read()
        conent = conent.decode('gb2312', 'ignore').encode('utf-8', 'ignore')
        pattern = re.compile('<div class="title_all">…
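The listing above breaks off in the middle of its regular expression, so the pattern the author actually used is not known. As a minimal sketch of the link-extraction step only, assume the list page points to each details page with a relative href under /html/gndy/ that ends in .html. The sample HTML, the pattern, and the helper name extract_detail_urls are hypothetical and not taken from the site or from the original code.

```python
import re

try:
    from urlparse import urljoin        # Python 2.7, as used in this article
except ImportError:
    from urllib.parse import urljoin    # Python 3

BASE_URL = 'http://www.dytt8.net/html/gndy/dyzz/index.html'

# Assumed link shape: <a href="/html/gndy/.../12345.html" ...>
LINK_PATTERN = re.compile(r'<a\s+href="(/html/gndy/[^"]+\.html)"')


def extract_detail_urls(html, base_url=BASE_URL):
    """Return absolute details-page URLs found in html, without duplicates."""
    urls = []
    for href in LINK_PATTERN.findall(html):
        full = urljoin(base_url, href)   # turn the relative href into a full URL
        if full not in urls:
            urls.append(full)
    return urls


# Hypothetical fragment standing in for the downloaded list page.
SAMPLE_LIST_HTML = '''
<div class="title_all"><p>Latest movies</p></div>
<table><tr><td><a href="/html/gndy/dyzz/20170101/10001.html" class="ulink">Movie A</a></td></tr></table>
<table><tr><td><a href="/html/gndy/dyzz/20170102/10002.html" class="ulink">Movie B</a></td></tr></table>
<a href="/html/gndy/dyzz/20170101/10001.html" class="ulink">Movie A</a>
'''

movie_urls = extract_detail_urls(SAMPLE_LIST_HTML)
print(movie_urls)
```

Returning the list from the function, instead of appending to a global such as movieUrls, makes the step easy to try on a saved page before pointing it at the live site.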
Capture movie data on the details page
    def queryMovieInfo(movieUrls):
        for index, item in enumerate(movieUrls):
            print('movie URL: ' + item)
            conent = urllib2.urlopen(item)
            conent = conent.read()
            conent = conent.decode('gb2312', 'ignore').encode('utf-8', 'ignore')
            movieName = re.findall(r'<div class="title_all">…
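The movieName pattern is also cut off right after the title_all div, so what follows is only a sketch of the idea: decode the GB2312 bytes, then pull the title out with a regular expression. It assumes the title sits in an h1 element inside that div. That assumption, the sample bytes, and the helper names decode_page and extract_movie_name are hypothetical. Unlike the original code, the sketch keeps the decoded text as unicode and does not re-encode it to UTF-8.

```python
import re

# Assumed markup: the movie title is an <h1> inside the "title_all" block.
TITLE_PATTERN = re.compile(r'<div class="title_all">\s*<h1>(.*?)</h1>', re.S)


def decode_page(raw_bytes):
    """Decode GB2312 bytes to text, dropping bytes that cannot be decoded."""
    return raw_bytes.decode('gb2312', 'ignore')


def extract_movie_name(html):
    """Return the first title found in html, or None if nothing matches."""
    match = TITLE_PATTERN.search(html)
    return match.group(1).strip() if match else None


# Hypothetical details-page fragment, encoded as GB2312 like the pages in this article.
SAMPLE_DETAIL_BYTES = (
    u'<div class="title_all"><h1>\u7535\u5f71 Sample</h1></div>'.encode('gb2312')
)

name = extract_movie_name(decode_page(SAMPLE_DETAIL_BYTES))
print(name == u'\u7535\u5f71 Sample')
```

Returning None when nothing matches lets the caller skip a page whose layout differs, where indexing into an empty re.findall result would raise an error.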
Run the capture
    if __name__ == '__main__':
        print("Starting to capture movie data")
        queryMovieList()
        print(len(movieUrls))
        queryMovieInfo(movieUrls)
        print("Finished capturing movie data")
Summary
Learning regular expressions is important! Python's syntax is nice. Compared with Java...