Python crawler: scraping Douban Movie Top250 data

Grinding LeetCode gets tiring, so once in a while I practice Python by writing a little crawler for fun~

Determining the URL format

First, look at the URL of any page of the Douban Movie Top250 list, for example the first page: https://movie.douban.com/top250?start=0&filter=. Breaking the address down:

    • https:// means the resource is transferred over the HTTPS protocol;
    • movie.douban.com is Douban's second-level domain name, pointing to Douban's server;
    • /top250 is a resource path on that server;
    • start=0&filter= are the URL's two query parameters, representing the start position and the filter criteria, respectively.

Comparing different pages shows that only the value of start changes from page to page; everything else stays fixed, as the sketch below makes concrete.
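Here is a tiny illustration (not part of the crawler itself) that generates all ten page URLs by stepping start through 0, 25, ..., 225:

    # Each page shows 25 movies; only `start` changes between pages.
    base = 'https://movie.douban.com/top250?start={}&filter='
    for start in range(0, 250, 25):
        print(base.format(start))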

Getting the page data

The program is written in an object-oriented style, which is a good coding habit to cultivate.

Basic information is initialized in the __init__ function. Notice the headers attribute: what is it for? Some sites have a simple anti-crawler mechanism and refuse to return real data to obvious crawler requests. Generally, a basic anti-crawler check just tests whether the incoming request carries the usual browser information, so we can disguise our crawler requests as a browser by modifying the header of the HTTP packet.

The next question: where does this User-Agent come from? If you are in a hurry, just copy mine and it will work. If you want to see your own browser's information, search for and download the small tool Fiddler; open it, then open any web page, and it will capture the requests being sent, which contain the User-Agent you want.

    from urllib import request
    import re   # used by get_page_info below


    class MovieTop(object):
        def __init__(self):
            self.start = 0
            self.param = '&filter='
            self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) "
                                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                                          "Chrome/65.0.3325.146 Safari/537.36"}
            self.movieList = []
            self.filePath = './DoubanTop250.txt'

        def get_page(self):
            try:
                url = 'https://movie.douban.com/top250?start=' + str(self.start) + self.param
                myRequest = request.Request(url, headers=self.headers)
                response = request.urlopen(myRequest)
                page = response.read().decode('utf-8')
                print('Fetching page ' + str((self.start + 25) // 25) + '...')
                self.start += 25
                return page
            except request.URLError as e:
                if hasattr(e, 'reason'):
                    print('Fetch failed, reason:', e.reason)
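A quick sanity check, assuming the class above has been defined; the printed length is just a rough sign that real HTML came back:

    spider = MovieTop()
    page = spider.get_page()  # fetches page 1 and advances self.start to 25
    print(len(page))          # a non-trivial length suggests we got real HTML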
Extracting page information

The code above retrieves the page source, which is text in HTML format; we need to extract the useful information from it. In Chrome, right-click the page and view its source to see the lightly formatted HTML text, which is identical to the page content we fetched. Locate the block of HTML corresponding to a single record.

From that block you can see the HTML structure of one record. So how do you extract the desired information from it? Regular expression matching! Remember re.compile in the re module?
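As a quick refresher, here is a toy demonstration (not the crawler's actual pattern) of why the re.S flag matters: by default '.' does not match newlines, and re.S lets one lazy pattern span a multi-line HTML record:

    import re

    html = '<div class="item">\n  <span class="title">Movie</span>\n</div>'
    # Without re.S this would find nothing, because '.*?' could not cross the newline.
    pattern = re.compile(r'<div.*?class="item">.*?<span.*?class="title">(.*?)</span>', re.S)
    print(pattern.findall(html))  # ['Movie']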

This part is somewhat tedious. For now, just know that there is a library called Beautiful Soup that can parse page information much more easily (a sketch follows the regex code below); but as regex practice, and because it is still fairly simple, I wrote the whole match directly. Here is the reference code:

One small snag: I originally wanted to extract every field of each record, but then found that some films have no "alias" and some list no "starring" cast, so I had to drop those two fields; in the end each record yields only 10 fields.

    def get_page_info(self):
        # The pattern walks one <div class="item"> record at a time. The literal
        # strings 导演 ("Director") and 人评价 ("people rated") match the Chinese
        # text on the actual Douban page.
        pattern = re.compile(u'<div.*?class="item">.*?'
                             + u'<div.*?class="pic">.*?'
                             + u'<em.*?class="">(.*?)</em>.*?'
                             + u'<div.*?class="info">.*?'
                             + u'<span.*?class="title">(.*?)</span>.*?'
                             + u'<span.*?class="other">(.*?)</span>.*?'
                             + u'<div.*?class="bd">.*?'
                             + u'<p.*?class="">.*?'
                             + u'导演: (.*?)&nbsp;&nbsp;&nbsp;.*?<br>'
                             + u'(.*?)&nbsp;/&nbsp;'
                             + u'(.*?)&nbsp;/&nbsp;(.*?)</p>.*?'
                             + u'<div.*?class="star">.*?'
                             + u'<span.*?class="rating_num".*?property="v:average">(.*?)</span>.*?'
                             + u'<span>(.*?)人评价</span>.*?'
                             + u'<span.*?class="inq">(.*?)</span>', re.S)
        while self.start <= 225:
            page = self.get_page()
            movies = re.findall(pattern, page)
            for movie in movies:
                self.movieList.append([movie[0], movie[1],
                                       movie[2].lstrip('&nbsp;/&nbsp;'),
                                       movie[3],
                                       movie[4].lstrip(),
                                       movie[5],
                                       movie[6].rstrip(),
                                       movie[7], movie[8], movie[9]])
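For comparison, here is a minimal Beautiful Soup sketch of the same idea, extracting just three of the ten fields (an illustrative alternative, not the author's code; it assumes the beautifulsoup4 package is installed):

    from bs4 import BeautifulSoup

    def parse_page(page):
        soup = BeautifulSoup(page, 'html.parser')
        records = []
        # Each movie lives in a <div class="item">, mirroring the regex above.
        for item in soup.find_all('div', class_='item'):
            rank = item.find('em').get_text()
            title = item.find('span', class_='title').get_text()
            rating = item.find('span', class_='rating_num').get_text()
            records.append([rank, title, rating])
        return records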
Writing the file

This part is relatively simple: write the results straight into a txt file.

    def write_page(self):
        print('Writing to file...')
        file = open(self.filePath, 'w', encoding='utf-8')
        try:
            for movie in self.movieList:
                file.write('Rank: ' + movie[0] + '\n')
                file.write('Title: ' + movie[1] + '\n')
                file.write('Alias: ' + movie[2] + '\n')
                file.write('Director: ' + movie[3] + '\n')
                file.write('Year: ' + movie[4] + '\n')
                file.write('Country/Region: ' + movie[5] + '\n')
                file.write('Genre: ' + movie[6] + '\n')
                file.write('Rating: ' + movie[7] + '\n')
                file.write('Number of ratings: ' + movie[8] + '\n')
                file.write('Short review: ' + movie[9] + '\n')
                file.write('\n')
            print('Finished writing file.')
        except Exception as e:
            print(e)
        finally:
            file.close()
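To run everything end to end, a minimal driver might look like the following (hypothetical; the complete code linked below may organize this differently):

    if __name__ == '__main__':
        spider = MovieTop()
        spider.get_page_info()  # fetches all 10 pages and fills movieList
        spider.write_page()     # dumps the results to DoubanTop250.txt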

Complete code: https://github.com/Pacsiy/learnPY.

Copyright belongs to the author AlvinZH and Blog Park. Reprinting and commercial use are welcome, but this statement must be retained without the author's consent, and a link to the original must be given in a prominent position on the article page; otherwise the author reserves the right to pursue legal liability.
