Scrapy framework: CrawlSpider introduction and usage details



In Scrapy basics - Spider, I briefly covered the Spider class. Spider can already do a lot, but if you want to crawl an entire site such as Jianshu, you need a more powerful weapon. CrawlSpider is built on top of Spider and is designed specifically for full-site crawling.

CrawlSpider is a subclass of Spider. The Spider class only crawls the pages listed in start_urls, while CrawlSpider additionally defines a set of Rule objects that provide a convenient mechanism for following links, which makes it better suited to extracting links from the crawled pages and continuing to crawl them.
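As a first taste, here is a minimal CrawlSpider sketch (the domain example.com and the URL patterns are placeholders, not taken from this article): one Rule simply follows pagination links, the other hands detail pages to a callback.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']        # placeholder domain
    start_urls = ['http://www.example.com/']

    rules = (
        # Follow list/pagination pages without parsing them
        Rule(LinkExtractor(allow=r'/list/page/\d+')),
        # Parse detail pages with parse_item and keep following links found on them
        Rule(LinkExtractor(allow=r'/article/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Yield a simple dict item; a real spider would extract more fields
        yield {'title': response.xpath('//title/text()').extract_first()}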

I. Analyzing the CrawlSpider source code

Below is the CrawlSpider source (slightly abridged), with the original comments translated into English:

import copy

from scrapy.http import Request, HtmlResponse
from scrapy.spiders import Spider
from scrapy.utils.spider import iterate_spider_output


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    # First, parse() handles the response objects returned for start_urls.
    # Each response is passed to _parse_response(), with the callback set to
    # parse_start_url() and the follow flag set to True.
    # parse() therefore yields both items and follow-up Request objects.
    def parse(self, response):
        return self._parse_response(response, self.parse_start_url,
                                    cb_kwargs={}, follow=True)

    # Override this method to process the responses of start_urls.
    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    # Extract every link from the response that matches any user-defined Rule
    # and build a Request object for each one.
    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        # Extract all links; a link is valid as soon as any Rule matches it.
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response)
                     if l not in seen]
            # Run the links through the user-supplied process_links, if any.
            if links and rule.process_links:
                links = rule.process_links(links)
            # Add each link to the seen set, build a Request for it, and set the
            # callback to _response_downloaded().
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                # Call process_request() on every Request. By default it is the
                # identity function, i.e. the Request is returned unchanged.
                yield rule.process_request(r)

    # Handle the response downloaded for a link extracted by a Rule and return
    # items and requests.
    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs,
                                    rule.follow)

    # Parse the response object with the given callback and yield the resulting
    # Request or Item objects.
    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        # First check whether a callback is set. (It may be a Rule callback or
        # parse_start_url().) If so, process the response with it, then hand the
        # result to process_results(), which returns the list cb_res.
        if callback:
            # When called from parse(), the callback yields Request objects;
            # when it is a Rule callback, it usually yields Items.
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        # If following is enabled, yield every follow-up Request object.
        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, basestring):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

II. Introduction to CrawlSpider fields

1. CrawlSpider inherits from the Spider class. In addition to the inherited attributes (such as name and allowed_domains), it provides new attributes and methods. The first of these is scrapy.linkextractors.LinkExtractor. The purpose of Link Extractors is simple: to extract links. Each LinkExtractor has a single public method, extract_links(), which receives a Response object and returns a list of scrapy.link.Link objects.

A Link Extractor is instantiated only once; its extract_links method is then called repeatedly to extract links from different responses.

class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a', 'area'),
    attrs = ('href',),
    canonicalize = True,
    unique = True,
    process_value = None
)

Main parameters (a short usage sketch follows the list):

① allow: only links whose URLs match the regular expression (or list of regular expressions) are extracted. If empty, all links match.
② deny: links whose URLs match this regular expression (or list of regular expressions) are excluded; it takes precedence over allow.
③ allow_domains: only links within these domains are extracted.
④ deny_domains: links within these domains are never extracted.
⑤ restrict_xpaths: XPath expressions that restrict, together with allow, the regions of the page from which links are extracted.
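A minimal sketch of how these parameters combine (the URL patterns and the XPath are illustrative placeholders):

from scrapy.linkextractors import LinkExtractor

# Extract only pagination links that sit inside the paging area of the page.
page_link_extractor = LinkExtractor(
    allow=(r'/p\d+/',),                               # URL must match this regular expression
    deny=(r'/login',),                                # ...and must not match this one
    allow_domains=('youyuan.com',),                   # only links pointing at this domain
    restrict_xpaths=('//div[@class="pagination"]',),  # placeholder XPath for the paging block
)

# Inside a spider callback you would then call:
# links = page_link_extractor.extract_links(response)   # returns a list of Link objects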

2. The rules attribute contains one or more Rule objects. Each Rule defines a specific behavior for crawling the site. If several Rules match the same link, the first one, in the order in which they are defined in this attribute, is used.

class scrapy.spiders.Rule(
    link_extractor,
    callback = None,
    cb_kwargs = None,
    follow = None,
    process_links = None,
    process_request = None
)

① link_extractor: a Link Extractor object that defines which links should be extracted from each crawled page.

② callback: a callable, or the name of a method on the spider, that is called for each link extracted by link_extractor. The callback receives a response as its first argument.

Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so overriding parse breaks the spider.

③ follow: a boolean that specifies whether links should be followed from the responses extracted with this Rule. If callback is None, follow defaults to True; otherwise it defaults to False.

④ process_links: a callable, or the name of a method on the spider, that is called with the list of links extracted by link_extractor. It is mainly used to filter links.

⑤ process_request: a callable, or the name of a method on the spider, that is called for every Request extracted by this Rule. It can be used to filter or modify requests. A short sketch of both hooks follows below.
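A sketch putting these parameters together (the URL pattern and the helper functions are illustrative, not from the original project):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


def drop_ad_links(links):
    # process_links: filter the list of extracted Link objects (illustrative filter).
    return [link for link in links if 'ad.' not in link.url]


def tag_request(request, response=None):
    # process_request: tweak or drop each Request; returning None drops it.
    # (Recent Scrapy versions also pass the source response as a second argument.)
    request.meta['from_rule'] = True
    return request


profile_rule = Rule(
    LinkExtractor(allow=r'/profile/\d+'),   # placeholder URL pattern
    callback='parse_item',                  # never use 'parse' here
    follow=True,                            # keep following links from matched pages
    process_links=drop_ad_links,
    process_request=tag_request,
)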

3. Scrapy provides logging, built on Python's logging module. You can add the following two lines anywhere in settings.py; the output then becomes much clearer.

LOG_FILE = "TencentSpider.log"
LOG_LEVEL = "INFO"

Scrapy provides five logging levels:

① CRITICAL - critical errors
② ERROR - regular errors
③ WARNING - warning messages
④ INFO - informational messages
⑤ DEBUG - debugging messages

You can configure logging with the following settings in settings.py (a short usage sketch follows the list):

① LOG_ENABLED (default: True): whether logging is enabled
② LOG_ENCODING (default: 'utf-8'): the encoding used for logging
③ LOG_FILE (default: None): the file name, relative to the current directory, that log output is written to
④ LOG_LEVEL (default: 'DEBUG'): the lowest level that is logged
⑤ LOG_STDOUT (default: False): if True, all standard output (and errors) of the process is redirected to the log; for example, print "hello" would then appear in the Scrapy log.
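For example, a spider can write its own messages through its built-in logger; with the two settings shown above they end up in TencentSpider.log at INFO level and higher (a minimal sketch, the URL is a placeholder):

import logging

import scrapy


class LogDemoSpider(scrapy.Spider):
    name = 'log_demo'
    start_urls = ['http://www.example.com/']   # placeholder URL

    def parse(self, response):
        # Written through the spider's own logger
        self.logger.info('Parsed %s with status %s', response.url, response.status)
        # The standard logging module works as well
        logging.warning('%s finished parsing a page', self.name)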

III. Crawler case analysis

1. Create a project: scrapy startproject CrawlYouYuan

2. Create a crawler file: scrapy genspider -t crawl youyuan youyuan.com
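The command creates spiders/youyuan.py from Scrapy's crawl template; the generated skeleton looks roughly like this (the exact text depends on your Scrapy version):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YouyuanSpider(CrawlSpider):
    name = 'youyuan'
    allowed_domains = ['youyuan.com']
    start_urls = ['http://youyuan.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item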

3. Project File Analysis

items.py

The item model class:

import scrapy


class CrawlyouyuanItem(scrapy.Item):
    # username
    username = scrapy.Field()
    # age
    age = scrapy.Field()
    # avatar image link
    header_url = scrapy.Field()
    # album image links
    images_url = scrapy.Field()
    # inner monologue
    content = scrapy.Field()
    # birthplace
    place_from = scrapy.Field()
    # education
    education = scrapy.Field()
    # hobbies
    hobby = scrapy.Field()
    # personal homepage
    source_url = scrapy.Field()
    # source website
    sourec = scrapy.Field()
    # UTC time
    time = scrapy.Field()
    # crawler name
    spidername = scrapy.Field()

youyuan.py

The crawler file:

# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from CrawlYouYuan.items import CrawlyouyuanItem


class YouyuanSpider(CrawlSpider):
    name = 'youyuan'
    allowed_domains = ['youyuan.com']
    start_urls = ['http://www.youyuan.com/find/beijing/mm18-25/advance-0-0-0-0-0-0-0/p1/']

    # The generated file needs no other changes; you only have to add the Rules.
    # Matching rule for the list pages
    page_links = LinkExtractor(allow=(r"youyuan.com/find/beijing/mm18-25/advance-0-0-0-0-0-0-0/p\d+/"))
    # Matching rule for each personal profile page
    profile_links = LinkExtractor(allow=(r"youyuan.com/\d+-profile/"))

    rules = (
        # No callback function, so follow defaults to True
        Rule(page_links),
        # With a callback function follow would default to False, so it is set explicitly
        Rule(profile_links, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = CrawlyouyuanItem()
        # username
        item['username'] = self.get_username(response)
        # age
        item['age'] = self.get_age(response)
        # avatar image link
        item['header_url'] = self.get_header_url(response)
        # album image links
        item['images_url'] = self.get_images_url(response)
        # inner monologue
        item['content'] = self.get_content(response)
        # birthplace
        item['place_from'] = self.get_place_from(response)
        # education
        item['education'] = self.get_education(response)
        # hobbies
        item['hobby'] = self.get_holobby(response)
        # personal homepage
        item['source_url'] = response.url
        # source website
        item['sourec'] = "youyuan"
        yield item

    def get_username(self, response):
        username = response.xpath("//dl[@class='personal_cen']//div[@class='main']/strong/text()").extract()
        if len(username):
            username = username[0]
        else:
            username = "NULL"
        return username.strip()

    def get_age(self, response):
        age = response.xpath("//dl[@class='personal_cen']//dd/p/text()").extract()
        if len(age):
            age = re.findall(u"\d+岁", age[0])[0]   # "岁" means "years old"
        else:
            age = "NULL"
        return age.strip()

    def get_header_url(self, response):
        header_url = response.xpath("//dl[@class='personal_cen']/dt/img/@src").extract()
        if len(header_url):
            header_url = header_url[0]
        else:
            header_url = "NULL"
        return header_url.strip()

    def get_images_url(self, response):
        images_url = response.xpath("//div[@class='ph_show']/ul/li/a/img/@src").extract()
        if len(images_url):
            images_url = ",".join(images_url)
        else:
            images_url = "NULL"
        return images_url

    def get_content(self, response):
        content = response.xpath("//div[@class='pre_data']/ul/li/p/text()").extract()
        if len(content):
            content = content[0]
        else:
            content = "NULL"
        return content.strip()

    def get_place_from(self, response):
        place_from = response.xpath("//div[@class='pre_data']/ul/li[2]//ol[1]/li[1]/span/text()").extract()
        if len(place_from):
            place_from = place_from[0]
        else:
            place_from = "NULL"
        return place_from.strip()

    def get_education(self, response):
        education = response.xpath("//div[@class='pre_data']/ul/li[3]//ol[2]/li[2]/span/text()").extract()
        if len(education):
            education = education[0]
        else:
            education = "NULL"
        return education.strip()

    def get_holobby(self, response):
        holobby = response.xpath("//dl[@class='personal_cen']//ol/li/text()").extract()
        if len(holobby):
            holobby = ",".join(holobby).replace(" ", "")
        else:
            holobby = "NULL"
        return holobby.strip()

pipelines.py

import codecs
import json


class CrawlyouyuanPipeline(object):

    def __init__(self):
        self.filename = codecs.open('content.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        html = json.dumps(dict(item), ensure_ascii=False)
        self.filename.write(html + '\n')
        return item

    def close_spider(self, spider):
        # Called automatically when the spider closes; closes the output file
        self.filename.close()

settings.py

BOT_NAME = 'CrawlYouYuan'

SPIDER_MODULES = ['CrawlYouYuan.spiders']
NEWSPIDER_MODULE = 'CrawlYouYuan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:56.0)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'CrawlYouYuan.pipelines.CrawlyouyuanPipeline': 300,
}

begin.py

from scrapy import cmdline

cmdline.execute('scrapy crawl youyuan'.split())
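If you would rather not go through cmdline, an equivalent sketch (assuming the script lives in the project directory, next to scrapy.cfg) uses CrawlerProcess with the project settings:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('youyuan')   # the spider name registered in the project
process.start()            # blocks until the crawl finishes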

Before running the program, make sure the installed Scrapy and Twisted versions are compatible with each other; see the example below.
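For example, you can check what is installed and upgrade with pip (a hedged sketch; adjust the packages and versions to your own environment):

pip show Scrapy Twisted       # check the installed versions of both packages
pip install --upgrade Scrapy  # upgrading Scrapy also pulls in a Twisted release it is compatible with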

This post has walked through the concrete steps of using a CrawlSpider with the Scrapy framework, together with a worked crawler case, which should make it clear how easy it is to crawl data with Scrapy. In the next article I will share distributed crawling of websites with Scrapy. Let's keep learning and discussing crawler techniques together.

That is all for this article. I hope it is helpful for your learning.
