Scrapy framework: using CrawlSpider to create automatic crawlers

Source: Internet
Author: User
Tags: xpath

I. Conditions of application

CrawlSpider can automatically crawl a site whose links follow a regular pattern, and, with suitable rules, even sites whose structure is less regular.


II. Code explanation

(1) Create a Scrapy project

E:\myweb>scrapy startproject mycwpjt
New Scrapy project 'mycwpjt', using template directory 'd:\python35\lib\site-packages\scrapy\templates\project', created in:
    D:\Python35\myweb\part16\mycwpjt
You can start your first spider with:
    cd mycwpjt
    scrapy genspider example example.com
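
For reference, scrapy startproject generates a project skeleton along these lines (the exact set of files varies slightly by Scrapy version):

mycwpjt/
    scrapy.cfg
    mycwpjt/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py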
(2) Create the crawler

E:\myweb>scrapy genspider -t crawl weisuen sohu.com
Created spider 'weisuen' using template 'crawl' in module:
  mycwpjt.spiders.weisuen
(3) Writing the Item

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MycwpjtItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    link = scrapy.Field()
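
As a quick illustration (not part of the original post), the item behaves like a dictionary once its fields are defined; the values below are placeholders, with the URL borrowed from the news-page example later in this article:

from mycwpjt.items import MycwpjtItem

item = MycwpjtItem()
# XPath .extract() returns a list, so each field holds a list of strings
item["name"] = ["Example news title"]
item["link"] = ["http://news.sohu.com/20160926/n469167364.shtml"]
print(dict(item))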

(4) Writing the Pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MycwpjtPipeline(object):
    def process_item(self, item, spider):
        print(item["name"])
        print(item["link"])
        return item
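
Printing is enough for a demo, but a pipeline usually persists the items. Here is a minimal sketch of a file-writing variant (JsonWriterPipeline and the items.jl filename are illustrative names, not part of the original project):

import json


class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line; keep non-ASCII titles readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item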

(5) Configuring settings

ITEM_PIPELINES = {
    'mycwpjt.pipelines.MycwpjtPipeline': 300,
}
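
The integer is a priority in the 0-1000 range; when several pipelines are registered, items pass through them in ascending order of that value. For example, to run the hypothetical JsonWriterPipeline from the sketch above after the printing pipeline:

ITEM_PIPELINES = {
    'mycwpjt.pipelines.MycwpjtPipeline': 300,
    'mycwpjt.pipelines.JsonWriterPipeline': 400,
}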

(6) Writing the crawler

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from mycwpjt.items import MycwpjtItem

# Show the available templates:                scrapy genspider -l
# Create the skeleton from the crawl template: scrapy genspider -t crawl weisun sohu.com
# Start crawling:                              scrapy crawl weisun --nolog


class WeisunSpider(CrawlSpider):
    name = 'weisun'
    allowed_domains = ['sohu.com']
    start_urls = ['http://sohu.com/']

    rules = (
        # News page URLs look like:
        # "http://news.sohu.com/20160926/n469167364.shtml"
        # so they can be matched with the regular expression '.*?/n.*?shtml'
        Rule(LinkExtractor(allow=('.*?/n.*?shtml'), allow_domains=('sohu.com')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = MycwpjtItem()
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # Extract the title from the news page with an XPath expression
        i["name"] = response.xpath("/html/head/title/text()").extract()
        # Extract the link to the current news page with an XPath expression
        i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
        return i
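
To inspect the scraped items without going through the pipeline, Scrapy's built-in feed export can also write them straight to a file when the spider is run (the filename is arbitrary):

scrapy crawl weisun -o news.csv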
 

CrawlSpider is a commonly used spider for crawling sites that follow certain rules. It is based on Spider and adds a few unique attributes:

rules: a collection of Rule objects that describe how to match links on the target site and filter out the noise.
parse_start_url: handles the responses for the start URLs; it must return an Item or a Request.
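
As a sketch of how parse_start_url can be used (this override is not in the original spider, and the class name here is hypothetical), the start page itself, which the rules never hand to a callback, can also be turned into an item:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from mycwpjt.items import MycwpjtItem


class WeisunStartSpider(CrawlSpider):
    name = 'weisun_start'  # hypothetical name, to avoid clashing with the spider above
    allowed_domains = ['sohu.com']
    start_urls = ['http://sohu.com/']
    rules = (
        Rule(LinkExtractor(allow=('.*?/n.*?shtml',)), callback='parse_item', follow=True),
    )

    # Called for each response from start_urls before the rules take over;
    # it must return an Item or a Request
    def parse_start_url(self, response):
        i = MycwpjtItem()
        i["name"] = response.xpath("/html/head/title/text()").extract()
        i["link"] = [response.url]  # the start page itself
        return i

    def parse_item(self, response):
        i = MycwpjtItem()
        i["name"] = response.xpath("/html/head/title/text()").extract()
        i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
        return i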

Because rules is a collection of Rule objects, here is a description of Rule. It takes the following parameters: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
The link_extractor can either be defined by yourself or use the existing LinkExtractor class. Its main parameters, illustrated in the sketch after this list, are:

allow: links matching the regular expression (or list of regular expressions) in parentheses are extracted; if empty, everything matches.
deny: links matching the regular expression (or list of regular expressions) are never extracted.
allow_domains: only links on these domains will be extracted.
deny_domains: links on these domains will never be extracted.
restrict_xpaths: XPath expressions used together with allow to restrict the regions from which links are extracted.
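
A sketch pulling those parameters together (the deny pattern and the restrict_xpaths region are hypothetical, chosen only to illustrate the syntax):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rule = Rule(
    LinkExtractor(
        allow=('.*?/n.*?shtml',),                   # extract links matching this regex
        deny=('.*?/comment/.*',),                   # hypothetical: skip comment pages
        allow_domains=('sohu.com',),                # stay on this domain
        restrict_xpaths=('//div[@class="news"]',),  # hypothetical: only look inside this region
    ),
    callback='parse_item',
    follow=True,
)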

III. Results display


