The crawled Web page address is: https://movie.douban.com/top250
When you open the page, you can see that the Top 250 list is split across 10 pages, with 25 movies per page.
To crawl all of the movie information, you therefore need the URL links of the other 9 pages as well.
First page: https://movie.douban.com/top250
Second page: https://movie.douban.com/top250?start=25&filter=
Third page: https://movie.douban.com/top250?start=50&filter=
etc...
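The pattern above can be sketched in a few lines: the `start` query parameter simply advances in steps of 25 (0, 25, 50, ..., 225). This is a minimal sketch; the variable names are illustrative, not from the original post.

```python
# Each page shows 25 movies, so "start" advances in steps of 25: 0, 25, ..., 225.
base_url = "https://movie.douban.com/top250"

urls = [base_url if start == 0
        else f"{base_url}?start={start}&filter="
        for start in range(0, 250, 25)]

for u in urls[:3]:
    print(u)
```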
Analyze the page source (taking the first page as an example).
After inspecting the HTML, the following can be observed:
All of the movie entries sit inside an ol tag whose class attribute is grid_view;
each movie is inside its own li tag;
the movie name is in the first span tag with class title, inside the first div tag with class hd under that li;
the score is in the (unique) span tag with class rating_num inside the li;
the number of ratings is the last number inside the div tag with class star in the li;
the one-line quote is in the span tag with class inq inside the li.
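The structure described above can be exercised on a simplified HTML fragment. The fragment below only mirrors the class names and nesting just listed; the real Douban markup contains many more nested tags, so this is an assumption-laden sketch, not the live page.

```python
import re
from bs4 import BeautifulSoup

# Simplified fragment mirroring the structure described above
# (the real Douban markup is more deeply nested).
html = """
<ol class="grid_view">
  <li>
    <div class="hd"><span class="title">The Shawshank Redemption</span></div>
    <div class="star">
      <span class="rating_num">9.7</span>
      <span>2900000 people rated</span>
    </div>
    <span class="inq">Hope is a good thing.</span>
  </li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
li = soup.find("ol", attrs={"class": "grid_view"}).find("li")
name = li.find("div", attrs={"class": "hd"}).find("span", attrs={"class": "title"}).getText()
score = li.find("span", attrs={"class": "rating_num"}).getText()
# the rating count is the last number inside the star div
evals = re.findall(r"\d+", str(li.find("div", attrs={"class": "star"})))[-1]
quote = li.find("span", attrs={"class": "inq"}).getText()
print(name, score, evals, quote)
```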
Main Python modules: the requests module and the beautifulsoup4 module.
> pip install requests
> pip install beautifulsoup4
Main code:
top250.py
# -*- coding: utf-8 -*-
import requests                  # requests module
from bs4 import BeautifulSoup    # beautifulsoup4 module
import re                        # regular expression module
import time                      # time module
import sys                       # system module


def getHTMLText(url, k):
    """Fetch the HTML document."""
    try:
        if k == 0:   # first page
            kw = {}
        else:        # other pages
            kw = {'start': k, 'filter': ''}
        r = requests.get(url, params=kw,
                         headers={'User-Agent': 'Mozilla/4.0'})
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("Failed!")


def getData(html):
    """Parse the data."""
    soup = BeautifulSoup(html, "html.parser")
    # find the first ol tag whose class attribute is grid_view
    movieList = soup.find('ol', attrs={'class': 'grid_view'})
    for movieLi in movieList.find_all('li'):   # find all li tags
        data = []

        # get the movie name:
        # find the first div tag whose class attribute is hd,
        # then the first span tag whose class attribute is title
        # (the .string property would also work here)
        movieHd = movieLi.find('div', attrs={'class': 'hd'})
        movieName = movieHd.find('span', attrs={'class': 'title'}).getText()
        data.append(movieName)

        # get the movie score
        movieScore = movieLi.find('span', attrs={'class': 'rating_num'}).getText()
        data.append(movieScore)

        # get the number of ratings: the last number
        # inside the div tag whose class attribute is star
        movieEval = movieLi.find('div', attrs={'class': 'star'})
        movieEvalNum = re.findall(r'\d+', str(movieEval))[-1]
        data.append(movieEvalNum)

        # get the movie's one-line quote (not every movie has one)
        movieQuote = movieLi.find('span', attrs={'class': 'inq'})
        if movieQuote:
            data.append(movieQuote.getText())
        else:
            data.append("None")

        print(outputMode.format(data[0], data[1], data[2], data[3], chr(12288)))


# redirect output to a txt file
output = sys.stdout
outputfile = open("moviedata.txt", 'w', encoding='utf-8')
sys.stdout = outputfile
outputMode = "{0:{4}^20}\t{1:^10}\t{2:^10}\t{3:{4}<10}"
print(outputMode.format('movie name', 'score', 'ratings', 'quote', chr(12288)))
basicUrl = 'https://movie.douban.com/top250'
k = 0
while k <= 225:
    html = getHTMLText(basicUrl, k)
    time.sleep(2)   # pause between requests to be polite to the server
    k += 25
    getData(html)
outputfile.close()
sys.stdout = output
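A note on the output format string: chr(12288) is the full-width CJK space (U+3000), passed as the fill character so that Chinese movie titles, which occupy the width of two ASCII characters, still line up in columns. A small standalone sketch (sample values are illustrative, not crawled):

```python
# {0:{4}^20} centers field 0 in a width of 20, filled with argument 4,
# which here is the full-width space chr(12288) (U+3000).
output_mode = "{0:{4}^20}\t{1:^10}\t{2:^10}\t{3:{4}<10}"
line = output_mode.format("肖申克的救赎", "9.7", "2900000", "希望让人自由", chr(12288))
print(line)
```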
Reference Source: 62444947
Python crawler: Douban Movie Top 250