Use Python 2.7 to crawl the Douban Movie Top 250

Source: Internet
Author: User
A browser inspector (the "Inspect Element" tool used below, or a similar plugin) makes it easy to examine many things, including a page's HTML.

Open the Douban Movie Top 250 page. Each page lists 25 movies, 10 pages in total, and each page's URL follows this pattern:

http://movie.douban.com/top250?start=0

http://movie.douban.com/top250?start=25

http://movie.douban.com/top250?start=50

http://movie.douban.com/top250?start=75

...

So you only need to loop over the start values 0, 25, ..., 225 to process all ten pages.
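As a quick sketch of that loop (in the same Python 2.7 style as the full script below; the variable names are just illustrative):

for page in range(10):                              # page = 0, 1, ..., 9
    start = page * 25                               # start = 0, 25, ..., 225
    url = 'https://movie.douban.com/top250?start=' + str(start)
    print url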

Click on any movie's Chinese name, then right-click and choose "Inspect Element" to view the HTML source:

You can find both the movie's Chinese name and its English name in the markup.

You can use a regular expression such as <span class="title">(.*)</span> to match both the Chinese and the English title, but we only want the Chinese name, so the English titles have to be filtered out.

The filtering can be done with the string method find(sub[, start[, end]]), rejecting any match that contains the characters typical of the English titles: "&" and "/"; see the code below.
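To make the matching and filtering concrete, here is a small self-contained sketch in Python 2.7. The sample_html string only approximates the structure of the real Top 250 page, it is not a verbatim copy:

# -*- coding: utf-8 -*-
import re

# Illustrative snippet that mimics the structure of the Top 250 page.
sample_html = '''
<div class="hd">
  <span class="title">肖申克的救赎</span>
  <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
</div>
'''

title = re.compile(r'<span class="title">(.*)</span>')
for name in re.findall(title, sample_html):
    # English titles contain '&' (from &nbsp;) and '/', so skip them.
    if name.find('&') == -1 and name.find('/') == -1:
        print name        # prints only the Chinese title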

3. Code implementation

The code here is simple, so there is no need to define any functions.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests, sys, re
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding('utf-8')

print 'Fetching data from the Douban Movie Top 250 ...'
for page in range(10):
    url = 'https://movie.douban.com/top250?start=' + str(page * 25)
    print '------------------------- Crawling page ' + str(page + 1) + ' -------------------------'
    html = requests.get(url)
    html.raise_for_status()
    try:
        soup = BeautifulSoup(html.text, 'html.parser')
        soup = str(soup)  # convert the parsed page back to a string so the regular expression can search it
        title = re.compile(r'<span class="title">(.*)</span>')
        names = re.findall(title, soup)
        for name in names:
            if name.find('&') == -1 and name.find('/') == -1:  # exclude English titles (they contain '&' and '/')
                print name
        # the year and rating could be extracted in a similar way
    except Exception as e:
        print e
print 'Crawl complete!'
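One practical caveat that is not part of the original script: douban.com may refuse requests that carry the default requests User-Agent, in which case raise_for_status() will fail. This is an assumption about current site behaviour, and the header value below is only illustrative, but passing a browser-like User-Agent is usually enough:

# -*- coding: utf-8 -*-
import requests

# Assumption: Douban may reject the default requests User-Agent, so send a browser-like one.
headers = {'User-Agent': 'Mozilla/5.0'}      # illustrative value, adjust as needed
url = 'https://movie.douban.com/top250?start=0'
html = requests.get(url, headers=headers)
html.raise_for_status()                      # still raises if the request was refused
print html.status_code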