I. Conditions of application
CrawlSpider can automatically crawl a site whether or not its URLs follow a regular pattern.
II. Code explanation
(1) Create the Scrapy project
E:\myweb>scrapy startproject mycwpjt
New Scrapy project 'mycwpjt', using template directory 'd:\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\Python35\myweb\part16\mycwpjt

You can start your first spider with:
    cd mycwpjt
    scrapy genspider example example.com
(2) Create the crawler
E:\myweb>scrapy genspider -t crawl weisuen sohu.com
Created spider 'weisuen' using template 'crawl' in module:
    mycwpjt.spiders.weisuen
(3) Item writing
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MycwpjtItem(scrapy.Item):
    # define the fields for your item here, like:
    name = scrapy.Field()    # title of the news page
    link = scrapy.Field()    # URL of the news page
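As a quick sanity check, the item behaves like a dictionary once its fields are set. A minimal sketch of an interactive session (hypothetical, assuming the mycwpjt package is importable, with made-up sample values):

    # Hypothetical interactive check of the item definition.
    from mycwpjt.items import MycwpjtItem

    item = MycwpjtItem()
    item["name"] = ["Example news title"]              # .extract() later returns lists
    item["link"] = ["http://news.sohu.com/example.shtml"]
    print(dict(item))                                   # items convert to plain dicts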
(4) Pipeline writing
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MycwpjtPipeline(object):
    def process_item(self, item, spider):
        print(item["name"])    # print the extracted title
        print(item["link"])    # print the extracted link
        return item
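The pipeline above only prints the fields. As a hedged sketch of a common alternative (not part of the original project), the same hook can persist items to a JSON Lines file instead; the class name and the file name news.jl are assumptions:

    # Hypothetical variant pipeline: write each item to a JSON Lines file.
    import json


    class MycwpjtJsonPipeline(object):
        def open_spider(self, spider):
            # called once when the spider starts
            self.file = open("news.jl", "w", encoding="utf-8")

        def close_spider(self, spider):
            # called once when the spider finishes
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item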
(5) Settings configuration
ITEM_PIPELINES = {
    'mycwpjt.pipelines.MycwpjtPipeline': 300,
}
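The number after the pipeline path is its order value (0 to 1000); items pass through pipelines with lower numbers first. If the hypothetical JSON pipeline sketched in section (4) were also enabled, the setting might look like this (the 400 value is an illustrative assumption):

    # Both pipelines enabled; 300 runs before 400.
    # MycwpjtJsonPipeline is the hypothetical pipeline sketched above.
    ITEM_PIPELINES = {
        'mycwpjt.pipelines.MycwpjtPipeline': 300,
        'mycwpjt.pipelines.MycwpjtJsonPipeline': 400,
    }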
(6) Crawler writing
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from mycwpjt.items import MycwpjtItem

# Show the available spider templates:
#     scrapy genspider -l
# Create the skeleton with the crawl template:
#     scrapy genspider -t crawl weisun sohu.com
# Start crawling:
#     scrapy crawl weisun --nolog


class WeisunSpider(CrawlSpider):
    name = 'weisun'
    allowed_domains = ['sohu.com']
    start_urls = ['http://sohu.com/']

    rules = (
        # News page URLs look like:
        #     http://news.sohu.com/20160926/n469167364.shtml
        # so they can be matched with the regular expression '.*?/n.*?shtml'
        Rule(LinkExtractor(allow=('.*?/n.*?shtml'), allow_domains=('sohu.com')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = MycwpjtItem()
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # Extract the title of the news page with an XPath expression
        i["name"] = response.xpath("/html/head/title/text()").extract()
        # Extract the link of the current news page with an XPath expression
        i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
        return i
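Besides the scrapy crawl command shown in the comments, the spider can also be launched from a plain Python script. A minimal sketch using Scrapy's CrawlerProcess (the script is assumed to sit in the project root next to scrapy.cfg so get_project_settings() can find settings.py; the module path of the spider is also an assumption):

    # run.py - hypothetical launcher script, placed next to scrapy.cfg.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from mycwpjt.spiders.weisun import WeisunSpider   # module path assumed

    if __name__ == "__main__":
        process = CrawlerProcess(get_project_settings())  # load the project settings
        process.crawl(WeisunSpider)                        # schedule the spider
        process.start()                                    # block until crawling finishes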
CrawlSpider is a commonly used spider for crawling a site according to certain rules. It is based on Spider and adds some unique attributes:
rules: a collection of Rule objects that describe which links on the target site to follow and which to exclude.
parse_start_url: handles the responses of the start URLs; it must return an Item or a Request.
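For example, parse_start_url can be overridden when the start page itself should also produce an item. A minimal hedged sketch of such an override inside WeisunSpider (this method is not in the original code; the fields mirror parse_item):

    def parse_start_url(self, response):
        # Hypothetical override: also emit an item for the start page itself.
        i = MycwpjtItem()
        i["name"] = response.xpath("/html/head/title/text()").extract()
        i["link"] = [response.url]
        return i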
Because rules is a collection of Rule objects, here is a description of Rule. It has several parameters: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
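To make these parameters concrete, here is a hedged sketch of a Rule that combines several of them with the LinkExtractor parameters described below; the deny pattern, the restricting XPath and the process_links helper are illustrative assumptions, not part of the original project:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule

    def drop_comment_links(links):
        # process_links receives the extracted Link objects and must return a list;
        # here links whose URL contains 'comment' are dropped (illustrative filter).
        return [link for link in links if "comment" not in link.url]

    rule = Rule(
        LinkExtractor(
            allow=('.*?/n.*?shtml',),           # only news-detail URLs
            deny=('.*?/photo/.*',),             # never follow photo pages (assumed)
            allow_domains=('sohu.com',),        # stay on sohu.com
            restrict_xpaths=('//div[@class="news-list"]',),  # only links in this area (assumed)
        ),
        callback='parse_item',                  # method that parses matched pages
        follow=True,                            # keep following links from matched pages
        process_links=drop_comment_links,       # post-filter the extracted links
    )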
The link_extractor can either be defined by yourself or use the existing LinkExtractor class. Its main parameters (several of which appear in the sketch above) are:
allow: only URLs matching the regular expression (or list of regular expressions) are extracted; if empty, everything matches.
deny: URLs matching the regular expression (or list of regular expressions) are never extracted.
allow_domains: only links in these domains will be extracted.
deny_domains: links in these domains will never be extracted.
restrict_xpaths: XPath expressions used together with allow to filter links.

III. Results display