Scrapy framework architecture


1. The engine opens a domain, locates the spider that handles that domain, and asks the spider for the first URLs to crawl.
2. The engine gets the first URLs to crawl from the spider and schedules them in the scheduler, as requests.
3. The engine asks the scheduler for the next URLs to crawl.
4. The scheduler returns the next URLs to crawl to the engine, and the engine sends them to the downloader, passing through the downloader middleware (request direction).
5. Once the page finishes downloading, the downloader generates a response (with that page) and sends it to the engine, passing through the downloader middleware (response direction).
6. The engine receives the response from the downloader and sends it to the spider for processing, passing through the spider middleware (input direction).
7. The spider processes the response and returns scraped items and new requests (to follow) to the engine.
8. The engine sends the scraped items (returned by the spider) to the item pipeline and the requests (returned by the spider) to the scheduler.
9. The process repeats (from step 2) until there are no more requests in the scheduler, and the engine closes the domain.

To summarize, the entire data flow in Scrapy is controlled by the Scrapy engine, which drives the cycle described in the steps above between the scheduler, the downloader, the spiders, and the item pipeline.
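
To make steps 4 and 5 concrete, here is a minimal downloader middleware sketch; it is not from the original article, and the class name and log messages are illustrative only:

class LoggingDownloaderMiddleware(object):
    """Hypothetical middleware that just logs traffic in both directions."""

    def process_request(self, request, spider):
        # Request direction (step 4): on the way from the engine to the downloader.
        spider.log("Requesting %s" % request.url)
        return None  # returning None lets the request continue normally

    def process_response(self, request, response, spider):
        # Response direction (step 5): on the way from the downloader back to the engine.
        spider.log("Downloaded %s (status %d)" % (response.url, response.status))
        return response

Such a middleware only takes effect if it is enabled through the DOWNLOADER_MIDDLEWARES setting.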

Spider example:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class MySpider(BaseSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # This part yields items (MyItem is an Item class defined elsewhere in the project)
        for h3 in hxs.select('//h3').extract():
            yield MyItem(title=h3)

        # This part yields new requests, e.g. the next page, which are sent back to the scheduler
        for url in hxs.select('//a/@href').extract():
            yield Request(url, callback=self.parse)
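
Once a spider like this lives inside a Scrapy project, it is normally run with the scrapy crawl command using the spider's name ('example.com' here); very old Scrapy releases used scrapy-ctl.py crawl instead.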

 

You can also:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []

    # This part collects items
    for h3 in hxs.select('//h3').extract():
        items.append(MyItem(title=h3))

    # This part collects new requests, e.g. the next page, which are sent to the scheduler
    for url in hxs.select('//a/@href').extract():
        items.append(Request(url, callback=self.parse))

    return items

 

 

The spider produces two kinds of results: one is links that need to be crawled further, such as the "next page" link parsed above, which are handed back to the scheduler; the other is data to be saved, which is sent to the item pipeline, the place where data is post-processed (analyzed in detail, filtered, stored, and so on). It is worth noting that the two kinds of results can be mixed in a single list and are told apart by their types: one is an Item, the other is a Request. Each Request is handed back to Scrapy, scheduled for download, and then processed by the specified callback function.
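
As an illustration of that post-processing stage, here is a minimal item pipeline sketch; it is not taken from the article, and the class name and the 'title' field are assumptions chosen to match the MyItem example above:

from scrapy.exceptions import DropItem

class TitleCleanupPipeline(object):
    """Hypothetical pipeline: drop items without a title, normalize the rest."""

    def process_item(self, item, spider):
        if not item.get('title'):
            # Discard incomplete items instead of passing them downstream.
            raise DropItem("Missing title in %s" % item)
        item['title'] = item['title'].strip()
        return item

A pipeline like this only runs if it is listed in the project's ITEM_PIPELINES setting.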

Spider instructions:

For spiders, the scraping cycle goes through something like this:
1. You start by generating the initial requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL specified in start_urls, with the parse method as the callback function for those requests (a custom start_requests() is sketched after this list).

2. In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those requests will also contain a callback (maybe the same one) and will then be downloaded by Scrapy, with their responses handled by the specified callback.

3. In callback functions, you parse the page contents, typically using XPath selectors (but you can also use BeautifulSoup, lxml, or whatever mechanism you prefer), and generate items with the parsed data.

4. Finally, the items returned from the spider will typically be persisted in some item pipeline.
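
As a hedged illustration of step 1, here is a sketch of a spider that overrides start_requests() instead of relying on start_urls, for example to log in before crawling; the URL, form fields, and class name are assumptions, not taken from the article:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginFirstSpider(BaseSpider):
    name = 'login-example.com'

    def start_requests(self):
        # Issue a login request first and continue the crawl from its callback.
        return [FormRequest('http://www.example.com/login',
                            formdata={'user': 'someone', 'pass': 'secret'},
                            callback=self.after_login)]

    def after_login(self, response):
        # Hypothetical continuation: yield further Requests or Items from here.
        pass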

The general process is as follows:

Spider ------> Items -----> Item Pipeline
  |
  +---------> Requests ----> Engine ----> Scheduler queue

 

Another example:

 

def parse(self, response):
    items = []
    hxs = HtmlXPathSelector(response)
    posts = hxs.x('//h1/a/@href').extract()
    items.extend([self.make_requests_from_url(url).replace(callback=self.parse_post)
                  for url in posts])

    page_links = hxs.x('//div[@class="wp-pagenavi"]/a[not(@title)]')
    for link in page_links:
        if link.x('text()').extract()[0] == u'\xbb':
            url = link.x('@href').extract()[0]
            items.append(self.make_requests_from_url(url))

    return items


The first half parses the links to the blog posts that need to be crawled, and the second half extracts the link to the next page. Note that the entries in the returned list are not plain string URLs: what Scrapy expects are Request objects, which can carry more than a bare URL string, such as cookies or a callback function. Also note that the callback of each post Request is replaced, because the default callback, parse, is used to parse pages such as the article list, while parse_post is defined as follows:

def parse_post(self, response):
    item = BlogCrawlItem()
    item.url = unicode(response.url)
    item.raw = response.body_as_unicode()
    return [item]
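
As a hedged aside (not from the original article), this is roughly what a Request can carry besides the URL; the URL, cookie, and meta values below are made-up placeholders:

from scrapy.http import Request

def parse(self, response):
    # A Request bundles a callback, cookies, and arbitrary metadata with the URL,
    # which is why Scrapy wants Request objects rather than plain URL strings.
    yield Request('http://www.example.com/post/42',   # hypothetical URL
                  callback=self.parse_post,
                  cookies={'session': 'abc123'},      # assumed cookie value
                  meta={'list_page': 1})              # meta travels with the request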

 

 
