This plugin makes it easy to inspect many parts of a page, including its HTML.
Open the Douban Movie Top 250 page. Each page lists 25 movies and there are 10 pages in total. The page URLs follow this pattern:
http://movie.douban.com/top250?start=0
http://movie.douban.com/top250?start=25
http://movie.douban.com/top250?start=50
http://movie.douban.com/top250?start=75
......
So a single loop over the start values 0, 25, ..., 225 is enough to process every page.
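The URL pattern above can be sketched as a short loop. This is only an illustration: the names BASE_URL, PAGE_SIZE, PAGE_COUNT and page_urls are hypothetical and do not appear in the original code.

```python
# Hypothetical helper names; only the URL pattern comes from the text above.
BASE_URL = 'http://movie.douban.com/top250?start='
PAGE_SIZE = 25    # movies listed on each page
PAGE_COUNT = 10   # 10 pages * 25 movies = 250 movies


def page_urls():
    """Return the page URLs for start=0, 25, ..., 225."""
    return [BASE_URL + str(page * PAGE_SIZE) for page in range(PAGE_COUNT)]


for url in page_urls():
    print(url)
```

Multiplying the zero-based page index by 25 gives the start value directly, so the first URL ends in start=0 and the last in start=225.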
Click the Chinese name of any movie, then right-click and choose "Inspect Element" to view the HTML source:
There you can find the Chinese name and the English name of the movie, each inside a <span class="title"> element.
You can use the regular expression <span class="title">(.*)</span> to match both the Chinese name and the English name of a movie. Since only the Chinese name is wanted, the English name has to be filtered out.
The filtering can be done with the string method find(str, pos_start, pos_end), rejecting any name that contains the characteristic features of English names, "&nbsp;" and "/"; see the code.
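As a minimal sketch of the matching step, the snippet below runs the regular expression on a small hand-written HTML fragment instead of the live page. The fragment is an assumption modeled on the title markup of a list entry, and the names SAMPLE_HTML, TITLE_PATTERN and extract_titles are hypothetical.

```python
import re

# Assumed sample markup, written by hand to resemble one list entry.
SAMPLE_HTML = '''
<span class="title">肖申克的救赎</span>
<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
<span class="other">&nbsp;/&nbsp;月黑高飞(港)</span>
'''

# "." does not match a newline, so each match stays within one line.
TITLE_PATTERN = re.compile(r'<span class="title">(.*)</span>')


def extract_titles(html):
    """Return the text of every <span class="title"> element in html."""
    return TITLE_PATTERN.findall(html)


titles = extract_titles(SAMPLE_HTML)
print(titles)
```

Both the Chinese and the English title come back, which is why a second filtering step is needed.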
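The filter can be sketched on its own with a few hand-written names. The sample list and the names is_chinese_name and keep_chinese_names are hypothetical, and the sketch assumes that English names carry a leading "&nbsp;/&nbsp;" as in the raw page markup.

```python
# Assumed sample of matched names; English names start with "&nbsp;/&nbsp;".
SAMPLE_NAMES = [
    '肖申克的救赎',
    '&nbsp;/&nbsp;The Shawshank Redemption',
    '阿甘正传',
    '&nbsp;/&nbsp;Forrest Gump',
]


def is_chinese_name(name):
    """str.find returns -1 when the substring is absent from the string."""
    return name.find('&nbsp') == -1 and name.find('/') == -1


def keep_chinese_names(names):
    """Drop every name that shows one of the English-name features."""
    return [name for name in names if is_chinese_name(name)]


for name in keep_chinese_names(SAMPLE_NAMES):
    print(name)

# The optional start and end arguments limit the search to name[start:end].
print('a/b'.find('/'))        # searches the whole string
print('a/b'.find('/', 0, 1))  # searches only 'a'
```

The start and end positions are optional; the full script omits them and searches the whole name.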
3. Code implementation
The code here is simple, so you don't have to define a function.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup

print('Fetching data from the Douban Movie Top 250 ...')
for page in range(10):
    # Each page lists 25 movies, so the start values are 0, 25, ..., 225.
    url = 'https://movie.douban.com/top250?start=' + str(page * 25)
    print('--------------------------- crawling page ' + str(page + 1) + ' ... ---------------------------')
    html = requests.get(url)
    html.raise_for_status()
    try:
        soup = BeautifulSoup(html.text, 'html.parser')
        soup = str(soup)  # convert the parsed page to a string for the regular expression
        title = re.compile(r'<span class="title">(.*)</span>')
        names = re.findall(title, soup)
        for name in names:
            # Exclude English names (their features are "&nbsp" and "/")
            if name.find('&nbsp') == -1 and name.find('/') == -1:
                print(name)  # Chinese name only
    except Exception as e:
        print(e)
print('Crawl complete!')