Python crawler: scraping Douban Movie Top250 data

Last Update:2018-03-16 Source: Internet

Author: User


Tired of grinding LeetCode, I practice Python now and then, so this time I wrote a small crawler for fun.

Determine URL format

First, find the URL format used by any page of the Douban Movie Top250, for example the first page: https://movie.douban.com/top250?start=0&filter=, and break the address down:

https:// means the resource is transferred over the HTTPS protocol;
movie.douban.com is a subdomain of Douban and points to the Douban server;
/top250 is a resource on that server;
start=0&filter= holds the URL's two parameters, the start position and the filter criteria, respectively.

The analysis shows that only the value of start changes between pages; the rest of the URL is fixed.
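
As a minimal sketch of the pattern just described, the page URLs can be generated by stepping start in increments of 25, the step the crawler code later in this article uses. The names BASE_URL and page_urls are hypothetical helpers for illustration, not part of the original program.

```python
# Hypothetical helper: list the URL of every Top250 page.
BASE_URL = 'https://movie.douban.com/top250'


def page_urls(total=250, page_size=25):
    """Return one URL per page; only the start parameter changes."""
    return [
        f'{BASE_URL}?start={start}&filter='
        for start in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```

With the defaults this yields ten URLs, from start=0 to start=225.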

Get page data

This program is written in an object-oriented style, which is a good coding habit to develop.

The basic information is initialized in the __init__ function. Notice that there is a headers attribute; what is it for? Some sites have simple anti-crawler mechanisms and refuse to return the real data to a generic crawler request. The most basic check is whether the request carries a browser's identifying information, so we can disguise the crawler as a browser by modifying the headers of the HTTP request.

The next question: where does this User-Agent come from? If you are in a hurry, just copy mine! If you want to see what your own browser sends, search for and download the tool "Fiddler", start it, and open a web page; it captures the request that was sent, which contains what you need.

import re
from urllib import request


class MovieTop(object):
    def __init__(self):
        self.start = 0
        self.param = '&filter'
        # Browser-like User-Agent so the site does not reject the request
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) "
                                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                                      "Chrome/65.0.3325.146 Safari/537.36"}
        self.movieList = []
        self.filePath = './DoubanTop250.txt'

    def get_page(self):
        try:
            url = 'https://movie.douban.com/top250?start=' + str(self.start) + '&filter='
            myRequest = request.Request(url, headers=self.headers)
            response = request.urlopen(myRequest)
            page = response.read().decode('utf-8')
            print('Fetching page ' + str((self.start + 25) // 25) + '...')
            # Advance to the next page (25 records per page)
            self.start += 25
            return page
        except request.URLError as e:
            # On failure the method falls through and returns None
            if hasattr(e, 'reason'):
                print('Fetch failed, reason:', e.reason)
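
The browser disguise can be checked without touching the network. The sketch below only builds the request object and reads the header back; build_request and USER_AGENT are hypothetical names, and the User-Agent string is the one from the article.

```python
from urllib import request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/65.0.3325.146 Safari/537.36')


def build_request(start):
    """Build (but do not send) a request for the page beginning at start."""
    url = 'https://movie.douban.com/top250?start=' + str(start) + '&filter='
    return request.Request(url, headers={'User-Agent': USER_AGENT})


req = build_request(25)
print(req.full_url)
# urllib stores header names capitalized, so the key reads 'User-agent'
print(req.get_header('User-agent'))
```

Passing req to request.urlopen would then send it with the browser-like header attached.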

Extracting page information

The code above fetches the page source, which is text in HTML format; we need to extract the useful information from it. In Chrome, right-click the page and view its source to see lightly formatted HTML, identical to the page content we fetched. Find the key data, as follows:

From this you can see the markup structure of one record. How do we extract the desired information from it? Regular expression matching! Remember compile in the re module?
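
As a reminder of how re.compile and findall work together, here is a small sketch run against a made-up, heavily simplified fragment. SAMPLE_HTML is an assumption for illustration only; the real Douban markup is longer and contains more fields.

```python
import re

# Hypothetical, simplified markup; not the real page source.
SAMPLE_HTML = """
<div class="item">
  <em class="">1</em>
  <span class="title">Example Movie</span>
  <span class="rating_num" property="v:average">9.7</span>
</div>
<div class="item">
  <em class="">2</em>
  <span class="title">Another Movie</span>
  <span class="rating_num" property="v:average">9.6</span>
</div>
"""

# re.S lets '.' match newlines, so one pattern can span a whole record.
# Each (.*?) is a non-greedy capture group: rank, title, rating.
RECORD = re.compile(
    r'<em.*?class="">(.*?)</em>.*?'
    r'<span.*?class="title">(.*?)</span>.*?'
    r'<span.*?class="rating_num".*?>(.*?)</span>',
    re.S,
)

records = RECORD.findall(SAMPLE_HTML)
for rank, title, rating in records:
    print(rank, title, rating)
```

findall returns one tuple per record, with one element per capture group, in the order the groups appear in the pattern.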

This is the more tedious part. People who work in this area know there is a library called Beautiful Soup that parses page content easily, but since the point here is to practice regular expressions, and the structure is fairly simple, I wrote the entire match by hand. Here is the reference code:

There is one small problem. I originally wanted to extract every piece of information in each record, but later found that some films have no "alias" and some have no "starring" entry, so I had to drop those two items. In the end each record yields only 10 fields.

    def get_page_info(self):
        # 导演 ("director") and 人评价 ("people rated") are literal text on
        # the page, so they must stay in Chinese for the pattern to match.
        pattern = re.compile('<div.*?class="item">.*?'
                             '<div.*?class="pic">.*?'
                             '<em.*?class="">(.*?)</em>.*?'
                             '<div.*?class="info">.*?'
                             '<span.*?class="title">(.*?)</span>.*?'
                             '<span.*?class="other">(.*?)</span>.*?'
                             '<div.*?class="bd">.*?'
                             '<p.*?class="">.*?'
                             '导演: (.*?)&nbsp;&nbsp;&nbsp;.*?<br>'
                             '(.*?)&nbsp;/&nbsp;'
                             '(.*?)&nbsp;/&nbsp;(.*?)</p>.*?'
                             '<div.*?class="star">.*?'
                             '<span.*?class="rating_num".*?property="v:average">'
                             '(.*?)</span>.*?'
                             '<span>(.*?)人评价</span>.*?'
                             '<span.*?class="inq">(.*?)</span>', re.S)

        while self.start <= 225:
            page = self.get_page()
            if page is None:
                # get_page failed; stop instead of passing None to findall
                break
            movies = re.findall(pattern, page)
            for movie in movies:
                # replace(..., 1) drops the leading separator only;
                # lstrip would strip a character set, not a prefix
                self.movieList.append([movie[0], movie[1],
                                       movie[2].replace('&nbsp;/&nbsp;', '', 1),
                                       movie[3], movie[4].lstrip(), movie[5],
                                       movie[6].rstrip(), movie[7],
                                       movie[8], movie[9]])
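
One way to tolerate the missing fields mentioned above is to split the page into records first and search each record separately, falling back to an empty string when a field is absent. This is a different technique from the single large pattern used in the article, shown here only as a sketch; SAMPLE, first_group, and parse_items are hypothetical, and the markup is simplified.

```python
import re

TITLE = re.compile(r'<span class="title">(.*?)</span>', re.S)
OTHER = re.compile(r'<span class="other">(.*?)</span>', re.S)


def first_group(pattern, text, default=''):
    """Return the first captured group, or default when nothing matches."""
    match = pattern.search(text)
    return match.group(1).strip() if match else default


def parse_items(html):
    """Split on the record opener, then read each field independently."""
    chunks = html.split('<div class="item">')[1:]
    return [(first_group(TITLE, c), first_group(OTHER, c)) for c in chunks]


# Hypothetical, simplified markup: the second record has no alias.
SAMPLE = (
    '<div class="item"><span class="title">With Alias</span>'
    '<span class="other">Alias A</span></div>'
    '<div class="item"><span class="title">No Alias</span></div>'
)

rows = parse_items(SAMPLE)
for title, alias in rows:
    print(title, '|', alias)
```

Because each field is searched inside its own record, a missing field cannot cause the match to run on into the next record.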

Write file

This part is relatively simple: write the records directly to a .txt file.

    def write_page(self):
        print('Writing to file...')
        file = open(self.filePath, 'w', encoding='utf-8')
        try:
            # One labeled line per field, blank line between records
            for movie in self.movieList:
                file.write('Rank: ' + movie[0] + '\n')
                file.write('Title: ' + movie[1] + '\n')
                file.write('Alias: ' + movie[2] + '\n')
                file.write('Director: ' + movie[3] + '\n')
                file.write('Release year: ' + movie[4] + '\n')
                file.write('Country/region: ' + movie[5] + '\n')
                file.write('Genre: ' + movie[6] + '\n')
                file.write('Rating: ' + movie[7] + '\n')
                file.write('Number of ratings: ' + movie[8] + '\n')
                file.write('One-line review: ' + movie[9] + '\n')
                file.write('\n')
            print('File written successfully.')
        except Exception as e:
            print(e)
        finally:
            file.close()
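
The same output can be produced more compactly with a with block, which closes the file even when an error occurs, and a list of labels paired with each record's fields. This is a sketch under assumptions: write_movies and LABELS are hypothetical names, the labels are an illustrative subset, and the sample record is made up.

```python
import os
import tempfile

# Hypothetical subset of field labels, in the same order as the fields.
LABELS = ['Rank', 'Title', 'Alias']


def write_movies(movies, path, labels=LABELS):
    """Write one 'label: value' line per field, blank line between records."""
    with open(path, 'w', encoding='utf-8') as f:
        for movie in movies:
            for label, value in zip(labels, movie):
                f.write(f'{label}: {value}\n')
            f.write('\n')


# Made-up record, written to a temporary directory for the demonstration.
sample = [['1', 'Example Movie', 'Example Alias']]
out_path = os.path.join(tempfile.mkdtemp(), 'DoubanTop250.txt')
write_movies(sample, out_path)

with open(out_path, encoding='utf-8') as f:
    content = f.read()
print(content)
```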

Complete code: https://github.com/Pacsiy/learnPY

Copyright belongs to the author, Alvinzh, and to Cnblogs (博客园). Reprinting and commercial use are welcome, but unless the author agrees otherwise, this statement must be retained and a link to the original must be placed in a prominent position on the article page; otherwise the author reserves the right to pursue legal liability.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
