Scrapy -- 04

Source: Internet
Author: User

The translated documentation on the official website is of good quality. What follows are my own notes and translation. T_T

Spider
  • class scrapy.spider.Spider # The official documentation describes several other spider classes, such as CrawlSpider, but this one already covers a lot of ground. On GitHub the usage ratio is roughly 30000 to 4300 in favor of Spider. If CrawlSpider turns out to be easier to use, please let me know.

  • Spider is the simplest spider. Every other spider must inherit from this class (including the other spiders bundled with Scrapy as well as your own). Spider does not provide any special functionality; it simply requests the given start_urls/start_requests and calls the spider's parse method on each resulting response.

    • closed(reason)

    • Called when the spider closes. This method provides a shortcut to calling signals.connect() for the spider_closed signal.
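      As an illustration, a minimal sketch of overriding closed() on a spider (the spider name and URL below are placeholders, not from the original text):

      import scrapy

      class MySpider(scrapy.Spider):
          name = "example"
          start_urls = ["http://www.example.com"]

          def parse(self, response):
              pass

          def closed(self, reason):
              # called once when the spider finishes; reason is a string such as 'finished'
              self.log("spider closed: %s" % reason)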

    • log(message[, level, component])

    • Logs a message via the scrapy.log.msg() helper. The spider's name attribute is automatically included in the log record. For more information, see Logging.
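      A small sketch of using it from inside a spider callback (this assumes the older scrapy.log API described here; newer Scrapy versions expose self.logger instead):

      from scrapy import log

      def parse(self, response):
          # the spider's name attribute is attached to the record automatically
          self.log("visited %s" % response.url, level=log.INFO)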

    • parse(response)

    • This is Scrapy's default method for processing a downloaded response when the Request does not specify a callback.

      parse is responsible for processing the response and returning the extracted data and/or follow-up URLs. The same requirements apply to the callbacks of any other Request the spider issues.

      This method, like any other Request callback, must return an iterable containing Request and/or Item objects.

      Parameters: response (Response) – the response to parse
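      As a sketch (the XPath expressions, field names and URLs below are made up, and plain dicts are only accepted as items in Scrapy >= 1.0), a parse method that returns both extracted data and follow-up Requests could look like this:

      import scrapy

      class BooksSpider(scrapy.Spider):
          name = "books"
          start_urls = ["http://www.example.com/"]

          def parse(self, response):
              # yield extracted data as items
              for title in response.xpath("//h2/text()").extract():
                  yield {"title": title}
              # yield follow-up Requests; Scrapy schedules them and calls the callback on each response
              for href in response.xpath("//a/@href").extract():
                  yield scrapy.Request(response.urljoin(href), callback=self.parse)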
    • make_requests_from_url(url)

    • Accepts a URL and returns a Request object for crawling. It is called by start_requests() to build the initial requests, and is also used to convert URLs into Requests.

      Unless overridden, the Request returned by this method uses parse() as its callback and has the dont_filter parameter enabled (see Request for details).
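      For reference, the default implementation is roughly equivalent to this sketch:

      def make_requests_from_url(self, url):
          # no explicit callback: Scrapy falls back to the spider's parse();
          # dont_filter=True means this request is not checked by the duplicates filter
          return scrapy.Request(url, dont_filter=True)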

    • start_requests() (A note when using this for login verification: use a throwaway account and be careful not to get banned!!)

    • This method must return an iterable object containing the first Requests the spider will use to crawl.

      Scrapy calls this method when the spider starts crawling and no particular URLs are specified. When URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is called only once by Scrapy, so it is safe to implement it as a generator.

      The default implementation generates a Request for each URL in start_urls.

      Override this method if you want to change the Requests used to start crawling a site. For example, if you need to log in with a POST request at startup, you could write:

      def start_requests(self):
          return [scrapy.FormRequest("http://www.example.com/login",
                                     formdata={'user': 'john', 'pass': 'secret'},
                                     callback=self.logged_in)]

      def logged_in(self, response):
          # here you would extract links to follow and return Requests for
          # each of them, with another callback
          pass
    • start_urls

    • A list of URLs. When no particular URLs are specified, the spider starts crawling from this list, so the first pages fetched will come from it. Subsequent URLs are then extracted from the fetched data.

    • allowed_domains

    • Optional. A list of the domain names this spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domains are not in this list will not be followed. (Pay close attention to this, or you will be left wondering why nothing gets crawled; the DEBUG message is easy to miss. A combined skeleton appears after this list.)

    • name

    • A string that defines the name of this spider. The name is how Scrapy locates (and instantiates) the spider, so it must be unique. There is, however, nothing stopping you from instantiating multiple instances of the same spider. name is the most important spider attribute, and it is required.

      If the spider crawls a single website (a single domain), a common practice is to name the spider after that website (domain), with or without the TLD suffix. For example, a spider that crawls mywebsite.com would usually be named mywebsite.
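Putting the attributes above together, a minimal spider skeleton (the name, domain and URL here are placeholders) looks roughly like this:

import scrapy

class MywebsiteSpider(scrapy.Spider):
    # required; must be unique among your spiders
    name = "mywebsite"
    # requests to other domains are dropped when OffsiteMiddleware is enabled
    allowed_domains = ["mywebsite.com"]
    # crawling starts from these URLs when no other start is specified
    start_urls = ["http://www.mywebsite.com/"]

    def parse(self, response):
        self.log("visited %s" % response.url)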

The above is what scrapy.spider.Spider consists of. The parse method must return an iterable of Requests and Items. That it returns Items is easy to understand, but why return Requests? Because Scrapy calls the spider's parse method, iterates over the objects it returns, and executes the callback attached to each Request. For details on the Scrapy architecture, this blog post is recommended: http://blog.csdn.net/frylion/article/details/8558538.

With the features above we can do things such as recursive crawling. For example, like this:

# To be improved; the boss won't let us talk about the details T_T
def parse(self, response):
    for i in response.xpath(<xpath expression>):
        url = <chapter URL>
        yield scrapy.Request(url, callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.log("visited %s" % response.url)


In addition, if you need to capture some information on the parent page and other information on a child page, you can use the meta field of scrapy.Request: meta is a dictionary that is copied to response.meta of the response handled by the child callback. You add entries to meta in the parent parse function and read them back via response.meta in the child parse function. Example:

# This is not used at the moment...
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

To be continued
