We have already introduced a simple Scrapy application; today we work through a complete example, crawling the Douban Movie Top 250, as a small exercise to wrap up the Scrapy stage.
1 Environment Configuration
Language: Python 3.6.1
IDE: PyCharm
Browser: Firefox
Crawler framework: Scrapy 1.5.0
Operating system: Windows 10 Home (Chinese edition)
2 Pre-crawl Analysis
2.1 Data to be saved
First, determine what we want to collect. We define fields in items to turn unstructured data into structured data. The content to extract includes: rank, movie name, score, and number of reviewers.
In Scrapy, these fields are defined in items.py:
import scrapy

class SpItem(scrapy.Item):
    """
    Define the item fields.
    """
    # rank
    ranking = scrapy.Field()
    # movie name
    movie_name = scrapy.Field()
    # score
    score = scrapy.Field()
    # number of reviewers
    people_num = scrapy.Field()
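Once defined, an item behaves much like a Python dict. A quick usage sketch with hypothetical values:

from sp.items import SpItem

item = SpItem()
item['ranking'] = '1'            # fields are assigned like dict keys
item['movie_name'] = 'The Shawshank Redemption'
print(dict(item))                # items convert cleanly to plain dicts
# item['director'] = 'x'         # would raise KeyError: only declared Fields are allowed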
2.2 Writing the spider
The basic framework is as follows:
import scrapy
from sp.items import SpItem

class DoubanSpider(scrapy.Spider):
    """
    Spider that crawls the Douban Movie Top 250; inherits from scrapy.Spider.
    """
    # The spider name: required, and must be unique.
    name = 'douban'
    # Initial URLs. You can use start_requests(self) instead of start_urls;
    # the difference is that start_requests(self) is more flexible and lets
    # you add more, such as headers.
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        """
        Parse the fields out of the response and send them to items.
        """
        # Instantiate the item; the extracted content is added to it.
        item = SpItem()
        pass
Let's look at what we need to parse. Open Firefox, enter developer mode (F12), and pick out the elements we want with the element-selection arrow. The information we want from the current page all sits inside an ol tag with class grid_view:
2.3 Using Scrapy Shell to get content
Since we are unlikely to get the selectors exactly right on the first try, it is recommended to test them in the Scrapy shell first:
scrapy shell "https://movie.douban.com/top250"
The result is as follows:
Did you spot the problem? Yes, a 403 response. The reason is that we did not send a header. OK, let's add one and try again. The scrapy shell command with a header added runs as follows:
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0" "https://movie.douban.com/top250"
The -s option sets a Scrapy setting (see the official Scrapy shell documentation for the full list); here we pass in the USER_AGENT header. And where do we get the header value? From the browser's developer tools:
Let's take a look at the results:
OK, success. Now we can get what we want.
Using the developer tools together with the scrapy shell, we first grab all the li tags under the ol, then loop over each one to extract the content we need. The step-by-step analysis is omitted here; spend some time getting familiar with XPath and CSS selectors. The final result is as follows:
# All the movie tags: each movie's information sits in one li,
# so we first grab all the li tags.
movies = response.css("ol.grid_view li")
# Loop over each li tag, i.e. each movie's information.
for movie in movies:
    # rank
    item['ranking'] = movie.css("div.pic em::text").extract_first()
    # movie name
    item['movie_name'] = movie.css("div.hd a span:first-child::text").extract_first()
    # score
    item['score'] = movie.css("div.star span.rating_num::text").extract_first()
    # number of reviewers
    item['people_num'] = movie.css("div.star span:last-child::text").re(r"\d+")[0]
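For comparison, the same fields can be pulled out with XPath instead of CSS. A rough equivalent (the XPath expressions are my own sketch based on the same page structure):

movies = response.xpath('//ol[@class="grid_view"]/li')
for movie in movies:
    # rank
    item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract_first()
    # movie name
    item['movie_name'] = movie.xpath('.//div[@class="hd"]//a/span[1]/text()').extract_first()
    # score
    item['score'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
    # number of reviewers
    item['people_num'] = movie.xpath('.//div[@class="star"]/span[last()]/text()').re(r'\d+')[0]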
The refined spider code (douban_spider.py) is as follows:
# -*- coding: utf-8 -*-
import scrapy
from sp.items import SpItem

class DoubanSpider(scrapy.Spider):
    """
    Crawl the Douban Movie Top 250.
    """
    name = 'douban'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        """
        Parse the fields out of the response and send them to items.
        """
        item = SpItem()
        movies = response.css("ol.grid_view li")
        for movie in movies:
            # rank
            item['ranking'] = movie.css("div.pic em::text").extract_first()
            # movie name
            item['movie_name'] = movie.css("div.hd a span:first-child::text").extract_first()
            # score
            item['score'] = movie.css("div.star span.rating_num::text").extract_first()
            # number of reviewers
            item['people_num'] = movie.css("div.star span:last-child::text").re(r"\d+")[0]
            yield item
2.4 Running the spider
My spider does not set the header itself; instead I added the User-Agent to settings.py.
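Concretely, that is the USER_AGENT setting, reusing the same Firefox header string we passed to the shell earlier (a sketch of the line):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0'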
It can also be set per spider.
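One option is the spider's custom_settings attribute, which overrides project-wide settings for that spider only (a minimal sketch; this is one of several places Scrapy accepts the setting):

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    # Per-spider settings override the project-wide settings.py values.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
    }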
Run the spider and write the output to a CSV file with the following command:
scrapy crawl douban -o douban.csv
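Scrapy picks the export format from the file extension, so -o douban.json or -o douban.jl would produce JSON output instead.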
The output is as follows:
Here are two places to be aware of:
1. If you open this CSV file directly with Excel, it will be garbled; use a tool such as Notepad++ to transcode it to UTF-8 first, after which it opens fine in Excel. (A settings-based alternative is sketched after point 2.)
2. The output contains blank lines between rows. Fixing this requires modifying Scrapy's source: in PyCharm, press Shift twice to bring up the search box, then search for exporters.py:
Find the following code and add the line newline='', like this:
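For reference, a sketch of what the patched code looks like in CsvItemExporter.__init__ (the exact surrounding lines vary by Scrapy version):

# scrapy/exporters.py, inside CsvItemExporter.__init__
self.stream = io.TextIOWrapper(
    file,
    line_buffering=False,
    write_through=True,
    encoding=self.encoding,
    newline='',  # the added line: stops the csv writer emitting blank rows on Windows
) if six.PY3 else file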
After running it again, the output is as follows:
Isn't it amazing!
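As a side note on point 1: instead of transcoding by hand, Scrapy 1.2 and later can write a UTF-8 byte-order mark itself through the FEED_EXPORT_ENCODING setting, which lets Excel detect the encoding on its own (a sketch):

# settings.py
FEED_EXPORT_ENCODING = 'utf-8-sig'  # the BOM makes Excel recognize UTF-8 CSV files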
2.5 Crawling all pages
As before, we start by analyzing the page. Find the pagination section at the bottom and use the browser tools to locate the corresponding tag; the href attribute holds the content we want, as follows:
Note that this is a relative link, not an absolute one, and there are two ways to follow a relative link:
# For Scrapy versions before 1.4, use this code:
next_page = response.css("span.next a::attr(href)").extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

# Scrapy 1.4 added the follow method, so you can use this instead:
# next_page = response.css("span.next a::attr(href)").extract_first()
# if next_page is not None:
#     yield response.follow(next_page, callback=self.parse)
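As a side note, in Scrapy 1.4+ response.follow also accepts a selector for an a element directly and resolves the relative URL for you, so the extraction step can be skipped (a sketch):

# assumes Scrapy >= 1.4
next_link = response.css("span.next a")
if next_link:
    yield response.follow(next_link[0], callback=self.parse)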
Our complete spider code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from sp.items import SpItem

class DoubanSpider(scrapy.Spider):
    """
    Crawl the Douban Movie Top 250.
    """
    name = 'douban'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        """
        Parse the fields out of the response and send them to items.
        """
        item = SpItem()
        movies = response.css("ol.grid_view li")
        for movie in movies:
            # rank
            item['ranking'] = movie.css("div.pic em::text").extract_first()
            # movie name
            item['movie_name'] = movie.css("div.hd a span:first-child::text").extract_first()
            # score
            item['score'] = movie.css("div.star span.rating_num::text").extract_first()
            # number of reviewers
            item['people_num'] = movie.css("div.star span:last-child::text").re(r"\d+")[0]
            yield item

        # For Scrapy versions before 1.4, use this code:
        next_page = response.css("span.next a::attr(href)").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

        # Scrapy 1.4 added the follow method, so you can use this instead:
        # next_page = response.css("span.next a::attr(href)").extract_first()
        # if next_page is not None:
        #     yield response.follow(next_page, callback=self.parse)
The output is as follows:
Excluding the header row, that's exactly 250 entries.