First, tutorial (ii) (http://blog.csdn.net/u012150179/article/details/32911511) studied how to crawl a single web page, and tutorial (iii) (http://blog.csdn.net/u012150179/article/details/34441655) discussed the Scrapy core architecture. Now, building on (ii) and combining it with the principle of crawling multiple web pages mentioned in (iii), this article studies how to crawl multiple web pages automatically.
Also, in order to better understand the Scrapy core architecture and data flow, scrapy.spider.Spider is still used here as the base class for writing the spider.
First, create the project:

scrapy startproject CSDNBlog
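For orientation, this command generates the usual Scrapy project skeleton; a rough sketch of the layout (directory names follow the project name used here):

CSDNBlog/
    scrapy.cfg            # project deployment configuration
    CSDNBlog/
        __init__.py
        items.py          # item definitions (section I)
        pipelines.py      # item pipelines (section II)
        settings.py       # project settings (section III)
        spiders/
            __init__.py   # spider code goes in this package (section IV)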
I. items.py preparation
For clarity of exposition, only the article name and the article URL are extracted here.
# -*- coding: utf-8 -*-

from scrapy.item import Item, Field


class CsdnblogItem(Item):
    """Data structure that stores the extracted information."""
    article_name = Field()
    article_url = Field()
II. pipelines.py preparation
import json
import codecs


class CsdnblogPipeline(object):

    def __init__(self):
        self.file = codecs.open('CSDNBlog_data.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.decode("unicode_escape"))
        return item
The output file is created and opened in writable mode in the constructor. process_item implements the item handling: each item received is written to the output file as a line of JSON.
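For illustration only, assuming one article has been crawled, a line in CSDNBlog_data.json would look roughly like the following (the title is a made-up placeholder; article_name is a list because extract() returns a list of matches):

{"article_name": ["Example article title"], "article_url": "http://blog.csdn.net/u012150179/article/details/11749017"}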
III. settings.py preparation
The settings file acts as the configuration file and is mainly used to configure the spider. Configuration parameters that change frequently are better placed in the Spider class itself, while parameters that stay fixed while the crawler runs belong in settings.py.
# -*- coding: utf-8 -*-

BOT_NAME = 'CSDNBlog'

SPIDER_MODULES = ['CSDNBlog.spiders']
NEWSPIDER_MODULE = 'CSDNBlog.spiders'

# Disable cookies to avoid being banned
COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'CSDNBlog.pipelines.CsdnblogPipeline': 300,  # order value (0-1000); 300 is a typical choice
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'CSDNBlog (+http://www.yourdomain.com)'
Here the COOKIES_ENABLED parameter is set to False, so that the visited site cannot trace the crawler through its cookies, which helps prevent a ban.
ITEM_PIPELINES is a dictionary used to enable pipelines: each key is a defined pipeline class and the value is an integer in the range 0-1000 that determines the order in which the pipelines run.
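As an illustration of that ordering, if a second (hypothetical) pipeline were added, the one with the smaller value would run first:

ITEM_PIPELINES = {
    'CSDNBlog.pipelines.CsdnblogPipeline': 300,
    # hypothetical second pipeline; it runs after CsdnblogPipeline because 800 > 300
    'CSDNBlog.pipelines.SomeOtherPipeline': 800,
}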
IV. Crawler writing
Writing the crawler is the main event. The principle is to analyze the page, obtain the link to the "next article", and return a Request object for it, then keep crawling the next article in the same way until there is none left.
Here is the code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

# from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from CSDNBlog.items import CsdnblogItem


class CsdnblogSpider(Spider):
    """Spider CsdnblogSpider"""

    name = "CSDNBlog"

    # Slow down the crawl: wait 1s between downloads
    download_delay = 1
    allowed_domains = ["blog.csdn.net"]
    start_urls = [
        # URL of the first article
        "http://blog.csdn.net/u012150179/article/details/11749017"
    ]

    def parse(self, response):
        sel = Selector(response)

        # items = []
        # Extract the article URL and title
        item = CsdnblogItem()

        article_url = str(response.url)
        article_name = sel.xpath('//div[@id="article_details"]/div/h1/span/a/text()').extract()

        item['article_name'] = [n.encode('utf-8') for n in article_name]
        item['article_url'] = article_url.encode('utf-8')

        yield item

        # Extract the URL of the next article
        urls = sel.xpath('//li[@class="next_article"]/a/@href').extract()
        for url in urls:
            print url
            url = "http://blog.csdn.net" + url
            print url
            yield Request(url, callback=self.parse)
Now let's analyze it step by step:
(1) The download_delay parameter is set to 1, so the spider waits 1s before downloading the next page. This is also one of the strategies for avoiding a ban; its main purpose is to reduce the load on the server side.
(2) The article link and article title are extracted from the response and encoded as utf-8. Note the use of yield.
(3) The URL of the "next article" is extracted. Because the extracted href is missing the http://blog.csdn.net part, that prefix is prepended. The two print statements are only for debugging and have no practical significance. The key line is
yield Request(url, callback=self.parse)
which returns the new Request to the engine and implements the loop, i.e. "automatically crawl the next page".
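As a side note, instead of hard-coding the domain prefix, the next-page URL could also be built with urljoin from the standard library. A minimal sketch of the tail of parse(), assuming the same Python 2 / old-style Scrapy environment as above (other imports as in the spider):

from urlparse import urljoin  # Python 3 would use urllib.parse instead

    def parse(self, response):
        # ... extract and yield the item exactly as above ...
        urls = Selector(response).xpath('//li[@class="next_article"]/a/@href').extract()
        for url in urls:
            # urljoin resolves the relative href against the current page URL,
            # so "http://blog.csdn.net" does not need to be hard-coded
            yield Request(urljoin(response.url, url), callback=self.parse)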
V. Execution
scrapy crawl CSDNBlog
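As an aside, for a quick check without the JSON pipeline, Scrapy's built-in feed export can also write the items straight from the command line (the output file name here is arbitrary; on older Scrapy versions the format may additionally need to be given with -t json):

scrapy crawl CSDNBlog -o CSDNBlog_export.json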
Partial screenshot of the stored data:
Original article; please note the source when reprinting: http://blog.csdn.net/u012150179/article/details/34486677