Scrapy Research and Exploration (V) -- Automatic Multi-Page Crawling (Crawling All Articles of a Blog)

Tags: xpath


In tutorial (II) (http://blog.csdn.net/u012150179/article/details/32911511), we studied how to crawl a single web page. In tutorial (III) (http://blog.csdn.net/u012150179/article/details/34441655), the Scrapy core architecture was discussed. Now, building on (II) and combining it with the principle of crawling multiple web pages mentioned in (III), this article studies how to crawl multiple web pages automatically.

Also, in order to better understand the Scrapy core architecture and data flow, scrapy.spider.Spider is still used here as the base class for writing the spider.


First, create the project:

scrapy startproject CSDNBlog
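
For reference, the generated project should have roughly the following layout (the spider file itself is added by hand later):

CSDNBlog/
    scrapy.cfg
    CSDNBlog/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py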

I. Writing items.py

For clarity, only the article name and the article URL are extracted here.

# -*- coding: utf-8 -*-

from scrapy.item import Item, Field


class CsdnblogItem(Item):
    """Data structure that stores the extracted information."""

    article_name = Field()
    article_url = Field()
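
As a quick usage illustration (the title value below is made up, not part of the project code), an Item behaves much like a dict:

item = CsdnblogItem()
item['article_name'] = ['some article title']  # hypothetical value
item['article_url'] = 'http://blog.csdn.net/u012150179/article/details/11749017'
print item['article_url']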

II. Writing pipelines.py

import json
import codecs


class CsdnblogPipeline(object):

    def __init__(self):
        # Open the output file for writing (utf-8 encoded)
        self.file = codecs.open('CSDNBlog_data.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        # Serialize each item to one JSON line and write it to the file
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.decode("unicode_escape"))
        return item

The output file is created and opened for writing in the constructor. process_item does the item handling: each item it receives is serialized to JSON and appended to the output file as one line.
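
One thing the pipeline above leaves out is closing the file when the crawl ends. A minimal sketch of how that could be added to CsdnblogPipeline, assuming Scrapy's standard close_spider hook (not part of the original code):

    def close_spider(self, spider):
        # Called once when the spider finishes; release the file handle.
        self.file.close()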


III. Writing settings.py

settings.py acts as the project's configuration file and is mainly used to configure the spider. Parameters that change often are better placed in the spider class itself, while parameters that stay fixed while the crawler runs belong in settings.py.

# -*- coding: utf-8 -*-

BOT_NAME = 'CSDNBlog'

SPIDER_MODULES = ['CSDNBlog.spiders']
NEWSPIDER_MODULE = 'CSDNBlog.spiders'

# Disable cookies to prevent being banned
COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'CSDNBlog.pipelines.CsdnblogPipeline': 300,
}

# Crawl responsibly by identifying yourself (and your website) on the User-Agent
#USER_AGENT = 'CSDNBlog (+http://www.yourdomain.com)'

Here COOKIES_ENABLED is set to False, so the site being visited cannot trace the crawler through its cookies, which helps prevent a ban.

ITEM_PIPELINES is a dictionary used to enable pipelines: each key is a pipeline class path and each value is its run order, an integer conventionally between 0 and 1000, with lower values running first.
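
For example, if the project had a second pipeline (the DuplicatesPipeline name below is hypothetical), the ordering could be expressed like this:

ITEM_PIPELINES = {
    'CSDNBlog.pipelines.CsdnblogPipeline': 300,
    'CSDNBlog.pipelines.DuplicatesPipeline': 800,  # hypothetical; runs after the first pipeline
}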


IV. Writing the spider

Writing the spider is the main event. The idea is to parse the current page to get the link to the "next article" and return a Request object for it, then keep crawling article after article until there is no next one.

Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

# from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector

from CSDNBlog.items import CsdnblogItem


class CsdnblogSpider(Spider):
    """Spider CsdnblogSpider"""

    name = "CSDNBlog"

    # Slow the crawl down: wait 1s before each download
    download_delay = 1

    allowed_domains = ["blog.csdn.net"]
    start_urls = [
        # URL of the first article
        "http://blog.csdn.net/u012150179/article/details/11749017"
    ]

    def parse(self, response):
        sel = Selector(response)

        # items = []
        # Get the article URL and title
        item = CsdnblogItem()

        article_url = str(response.url)
        article_name = sel.xpath('//div[@id="article_details"]/div/h1/span/a/text()').extract()

        item['article_name'] = [n.encode('utf-8') for n in article_name]
        item['article_url'] = article_url.encode('utf-8')

        yield item

        # Get the URL of the next article
        urls = sel.xpath('//li[@class="next_article"]/a/@href').extract()
        for url in urls:
            print url
            url = "http://blog.csdn.net" + url
            print url
            yield Request(url, callback=self.parse)

Let's analyze it step by step:

(1) download_delay is set to 1, so the spider waits 1s before downloading the next page. This is also one of the strategies to prevent being banned; mainly, it reduces the load on the server side.
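
The delay can also be configured globally in settings.py. A small sketch, assuming the standard DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY settings (the latter is on by default and multiplies the delay by a random factor between 0.5 and 1.5, which makes the request pattern look less mechanical):

# in settings.py
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True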

(2) The article link and article title are extracted from the response and encoded as utf-8. Note the use of yield here.

(3) The URL of the "next article" is extracted. The extracted href lacks the http://blog.csdn.net part, so it is prepended. The two print statements are only for debugging and have no practical significance. The key point is

yield Request(url, callback=self.parse)

That is, a new Request is handed back to the engine, which creates the loop and thereby implements "automatically crawling the next page".
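
As an aside, the commented-out import in the code above hints at an alternative: CrawlSpider plus a Rule can follow the "next article" link declaratively. The following is only a sketch under that assumption (using the old scrapy.contrib paths that match the Scrapy versions this article targets); it is not the method used in this article, and note that with CrawlSpider the pages in start_urls are not themselves passed to the callback by default.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from CSDNBlog.items import CsdnblogItem


class CsdnblogCrawlSpider(CrawlSpider):
    """Sketch: follow the "next article" link with a Rule instead of yielding Requests by hand."""

    name = "CSDNBlogCrawl"
    download_delay = 1
    allowed_domains = ["blog.csdn.net"]
    start_urls = ["http://blog.csdn.net/u012150179/article/details/11749017"]

    rules = (
        # Extract the link inside <li class="next_article"> and keep following it.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next_article"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = CsdnblogItem()
        item['article_url'] = str(response.url).encode('utf-8')
        item['article_name'] = [n.encode('utf-8') for n in sel.xpath(
            '//div[@id="article_details"]/div/h1/span/a/text()').extract()]
        yield item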


V. Running the crawler

scrapy crawl CSDNBlog
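
As a side note, Scrapy's built-in feed export can also write the scraped items to JSON without the custom pipeline; assuming the standard -o and -t command-line options, the run would look like this (the output file name is arbitrary):

scrapy crawl CSDNBlog -o CSDNBlog_export.json -t json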

A partial screenshot of the stored data is shown in the original post.


Original article; please credit the source when reprinting: http://blog.csdn.net/u012150179/article/details/34486677
