(2) What should Scrapy do for distributed crawlers? Reflections on Scrapy and an introduction to its core objects

Source: Internet
Author: User


This article walks through how to think about a crawler framework: what components a crawler needs, which core objects Scrapy provides for them, and a conventional workflow.

I. Conjecture

A crawler, as we generally understand it, contains at least the following basic elements:

1. A request sender (an encapsulation of the request, e.g. to avoid being blocked)

2. A parser for the document object (treating the fetched page as an HTML document or a string)

3. A carrier for the parsed results (a standard-format data holder)

4. An operator for the obtained objects (saving them to a file or to a database)

5. An error handler for the whole process (an exception monitor over the pipeline)
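The five elements above can be sketched as a minimal pipeline using only the Python standard library. This is an illustrative skeleton, not any framework's actual design; all names (parse_links, store, crawl) are made up for the example, and the request sender is injected as a plain function.

```python
import re

def parse_links(html):
    """Element 2: treat the fetched page as a string and parse it."""
    # Element 3: carry results in a standard format (a list of dicts).
    return [{"title": t, "url": u}
            for u, t in re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)]

def store(items, out):
    """Element 4: operate on the parsed objects (here: append lines)."""
    for item in items:
        out.append("%s\t%s" % (item["title"], item["url"]))

def crawl(fetch, url, out):
    """Element 1 (the sender) is injected as `fetch`; element 5 wraps
    the whole pipeline in a single exception monitor."""
    try:
        store(parse_links(fetch(url)), out)
    except Exception as exc:
        out.append("ERROR: %s" % exc)
```

A framework like Scrapy essentially formalizes each of these injection points into a pluggable component.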

 

II. Verification

 

Let's take a look at what core objects Scrapy provides

 

 

Basic Concepts

Command line tool
Learn the command line tool used to manage Scrapy projects
Items
Define the data to be crawled
Spiders
Write the rules for crawling a website
Selectors
Use XPath to extract data from web pages
Scrapy terminal (Scrapy shell)
Test data-extraction code in an interactive environment
Item Loaders
Populate items with the crawled data
Item Pipeline
Post-process and store the crawled data
Feed exports
Output the crawled data in different formats to different storage back ends
Link Extractors
Convenience classes for extracting follow-up links

Reference: https://scrapy-chs.readthedocs.org/zh_CN/0.24/

 

Essentially every object we conjectured above has a counterpart among Scrapy's core objects.

 

 

 

III. Crawling

 

We know that a crawler generally works according to the following pattern:

Enter the target URL => write processing rules (regular expressions or XPath syntax) => process the extracted data
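The "write processing rules" step can also be expressed without regular expressions. Here is a hedged sketch of the rule step using Python 3's standard-library HTMLParser (class and function names are invented for this example); a real Scrapy spider would express the same rule as an XPath expression instead.

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Rule: collect (text, href) pairs for every <a> whose class
    contains "titlelnk" (the pattern used later in the spider)."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the link currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "titlelnk" in attrs.get("class", ""):
            self._href = attrs.get("href")

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((data, self._href))
            self._href = None

def extract_titles(html):
    """Apply the rule to a fetched page and return the matches."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links
```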

The Scrapy method is as follows:

1) Create a project

Open a command line in the folder where the code should live and enter the following command:

scrapy startproject cnblogs

A cnblogs directory is generated under that folder; switch into it (remember to cd in).
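For orientation, the generated project layout (for Scrapy versions of the 0.24 era used here) looks roughly like this:

```
cnblogs/
    scrapy.cfg          # deploy configuration
    cnblogs/            # the project's Python module
        __init__.py
        items.py        # item definitions
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/        # spiders go here
            __init__.py
```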

 

 

items.py defines the data carrier we need.

Modify it to the following code:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Field, Item


class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Title = Field()
    TitleUrl = Field()
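An Item behaves like a dictionary that only accepts its declared fields, which is what makes it a "standard format data bearer". The sketch below mimics that behavior with the standard library; SimpleItem is an illustrative stand-in, not Scrapy's actual implementation.

```python
class SimpleItem(dict):
    """Rough stand-in for scrapy.Item: only declared fields may be set."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


class CnblogsItemDemo(SimpleItem):
    # Mirrors the Title/TitleUrl fields declared in items.py above.
    fields = ("Title", "TitleUrl")
```

Rejecting undeclared fields is what catches typos in spider code early instead of silently producing malformed output.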

 

Add BasicGroupSpider.py under the spiders folder with the following content:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from cnblogs.items import CnblogsItem


class CnblogsSpider(BaseSpider):
    name = "cnblogs"  # spider name
    allowed_domains = ["cnblogs.com"]
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        self.log("Fetch cnblogs homepage: %s" % response.url)
        hxs = HtmlXPathSelector(response)
        # authors = hxs.select('//a[@class="titlelnk"]')
        items = hxs.select('//a[contains(@class, "titlelnk")]')
        listitems = []
        for author in items:
            # print author.select('text()').extract()
            item = CnblogsItem()
            # field names must match those declared in items.py
            item['Title'] = author.select('text()').extract()
            item['TitleUrl'] = author.select('@href').extract()
            listitems.append(item)
        return listitems

 

OK. Go back to the console interface of step 1 and enter the following command:

scrapy crawl cnblogs --logfile=test.log -o cnblogs.json -t json
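The -o/-t options above make the feed exporter write the returned items to cnblogs.json as a JSON array. Post-processing that file needs nothing beyond the standard library; load_titles below is an illustrative helper, and the field names assume the Title/TitleUrl item defined earlier.

```python
import json

def load_titles(json_text):
    """Return the Title values from an exported item array."""
    return [entry["Title"] for entry in json.loads(json_text)]
```

Note that XPath's extract() returns lists, so each Title value is itself a list of strings.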

 

 

IV. Results

 

 

Take a look at what the code does; anyone who has written a crawler before will recognize the pattern.


 

Summary:

This article analyzed the general components of a crawler framework and verified our conjecture against Scrapy's core objects. There are many Python crawler frameworks, but Scrapy is one that is well worth picking up and studying in depth.
