I. Conditions of application
CrawlSpider can automatically crawl a site whether or not its URLs follow a regular pattern.
II. Code explanation
(1) Create the Scrapy project
E:\myweb>scrapy startproject mycwpjt
New Scrapy project 'mycwpjt', using template directory 'd:\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\Python35\myweb\part16\mycwpjt

You can start your first spider with:
    cd mycwpjt
    scrapy genspider example example.com
(2) Create the crawler
E:\myweb>scrapy genspider -t crawl weisuen sohu.com
Created spider 'weisuen' using template 'crawl' in module:
    mycwpjt.spiders.weisuen
(3) Item writing
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MycwpjtItem(scrapy.Item):
    # define the fields for your item here, like:
    name = scrapy.Field()    # title of the news page
    link = scrapy.Field()    # URL of the news page
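As a quick sanity check, the item behaves like a dictionary once its fields are set. A minimal sketch of an interactive session (hypothetical, assuming the mycwpjt package is importable, with made-up sample values):

    # Hypothetical interactive check of the item definition.
    from mycwpjt.items import MycwpjtItem

    item = MycwpjtItem()
    item["name"] = ["Example news title"]              # .extract() later returns lists
    item["link"] = ["http://news.sohu.com/example.shtml"]
    print(dict(item))                                   # items convert to plain dicts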
(4) Pipeline writing
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MycwpjtPipeline(object):
    def process_item(self, item, spider):
        print(item["name"])    # print the extracted title
        print(item["link"])    # print the extracted link
        return item
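The pipeline above only prints the fields. As a hedged sketch of a common alternative (not part of the original project), the same hook can persist items to a JSON Lines file instead; the class name and the file name news.jl are assumptions:

    # Hypothetical variant pipeline: write each item to a JSON Lines file.
    import json


    class MycwpjtJsonPipeline(object):
        def open_spider(self, spider):
            # called once when the spider starts
            self.file = open("news.jl", "w", encoding="utf-8")

        def close_spider(self, spider):
            # called once when the spider finishes
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item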
(5) Settings configuration
ITEM_PIPELINES = {
    'mycwpjt.pipelines.MycwpjtPipeline': 300,
}
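The number after the pipeline path is its order value (0 to 1000); items pass through pipelines with lower numbers first. If the hypothetical JSON pipeline sketched in section (4) were also enabled, the setting might look like this (the 400 value is an illustrative assumption):

    # Both pipelines enabled; 300 runs before 400.
    # MycwpjtJsonPipeline is the hypothetical pipeline sketched above.
    ITEM_PIPELINES = {
        'mycwpjt.pipelines.MycwpjtPipeline': 300,
        'mycwpjt.pipelines.MycwpjtJsonPipeline': 400,
    }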
(6) Crawler writing
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from mycwpjt.items import MycwpjtItem

# Show the available spider templates:
#     scrapy genspider -l
# Create the skeleton with the crawl template:
#     scrapy genspider -t crawl weisun sohu.com
# Start crawling:
#     scrapy crawl weisun --nolog


class WeisunSpider(CrawlSpider):
    name = 'weisun'
    allowed_domains = ['sohu.com']
    start_urls = ['http://sohu.com/']

    rules = (
        # News page URLs look like:
        #     http://news.sohu.com/20160926/n469167364.shtml
        # so they can be matched with the regular expression '.*?/n.*?shtml'
        Rule(LinkExtractor(allow=('.*?/n.*?shtml'), allow_domains=('sohu.com')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = MycwpjtItem()
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # Extract the title of the news page with an XPath expression
        i["name"] = response.xpath("/html/head/title/text()").extract()
        # Extract the link of the current news page with an XPath expression
        i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
        return i
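Besides the scrapy crawl command shown in the comments, the spider can also be launched from a plain Python script. A minimal sketch using Scrapy's CrawlerProcess (the script is assumed to sit in the project root next to scrapy.cfg so get_project_settings() can find settings.py; the module path of the spider is also an assumption):

    # run.py - hypothetical launcher script, placed next to scrapy.cfg.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from mycwpjt.spiders.weisun import WeisunSpider   # module path assumed

    if __name__ == "__main__":
        process = CrawlerProcess(get_project_settings())  # load the project settings
        process.crawl(WeisunSpider)                        # schedule the spider
        process.start()                                    # block until crawling finishes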
CrawlSpider is a commonly used spider for crawling a site according to certain rules. It is based on Spider and adds some unique attributes:
rules: a collection of Rule objects that describe which links on the target site to follow and which to exclude.
parse_start_url: handles the responses of the start URLs; it must return an Item or a Request.
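For example, parse_start_url can be overridden when the start page itself should also produce an item. A minimal hedged sketch of such an override inside WeisunSpider (this method is not in the original code; the fields mirror parse_item):

    def parse_start_url(self, response):
        # Hypothetical override: also emit an item for the start page itself.
        i = MycwpjtItem()
        i["name"] = response.xpath("/html/head/title/text()").extract()
        i["link"] = [response.url]
        return i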
Because rules is a collection of Rule objects, here is a description of Rule. It has several parameters: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
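To make these parameters concrete, here is a hedged sketch of a Rule that combines several of them with the LinkExtractor parameters described below; the deny pattern, the restricting XPath and the process_links helper are illustrative assumptions, not part of the original project:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule

    def drop_comment_links(links):
        # process_links receives the extracted Link objects and must return a list;
        # here links whose URL contains 'comment' are dropped (illustrative filter).
        return [link for link in links if "comment" not in link.url]

    rule = Rule(
        LinkExtractor(
            allow=('.*?/n.*?shtml',),           # only news-detail URLs
            deny=('.*?/photo/.*',),             # never follow photo pages (assumed)
            allow_domains=('sohu.com',),        # stay on sohu.com
            restrict_xpaths=('//div[@class="news-list"]',),  # only links in this area (assumed)
        ),
        callback='parse_item',                  # method that parses matched pages
        follow=True,                            # keep following links from matched pages
        process_links=drop_comment_links,       # post-filter the extracted links
    )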
The link_extractor can either be defined by yourself or use the existing LinkExtractor class. Its main parameters (several of which appear in the sketch above) are:
allow: only URLs matching the regular expression (or list of regular expressions) are extracted; if empty, everything matches.
deny: URLs matching the regular expression (or list of regular expressions) are never extracted.
allow_domains: only links in these domains will be extracted.
deny_domains: links in these domains will never be extracted.
restrict_xpaths: XPath expressions used together with allow to filter links.

III. Results display