A little talk about Python scrapy

Source: Internet
Author: User
Tags: setcookie, python, scrapy

Recently, for a course project, I remembered the Python I had learned before and decided to write a web crawler, so I picked up the Scrapy framework. Along the way I also tried the requests library, but it is not as convenient as Scrapy: it does not handle the cookie mechanism for you, so cookies have to be managed manually, which is more trouble. Let me share my understanding of how Scrapy works:

The Scrapy architecture diagram shows the approximate structure of the framework. The running flow is roughly as follows (a minimal spider sketch of this loop comes after the list):

1. The Scrapy engine opens a domain and finds the spider that handles this domain (the spider middleware matters here, for example for adding a Referer header to requests based on the URL of the corresponding response).

2. The spider processes the starting URL and returns requests; the engine puts these requests into the scheduler.

3. The engine asks the scheduler for the next request.

4. The scheduler returns this request to the engine, which sends it through the downloader middleware (which handles cookies, authentication, User-Agent and so on); the downloader then fetches the content and returns the corresponding response, which goes back to the spider.

5. The spider generates items or new requests from the response and hands them on: items go to the item pipeline, requests go back to the scheduler.

6. Repeat from step 2 until there are no requests left.
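
To make this loop concrete, here is a minimal spider sketch; the site, CSS selectors and field names are illustrative assumptions of mine, not part of the original post. The parse() callback yields items (routed to the item pipeline) and new requests (sent back to the scheduler), which is exactly the loop described above.

import scrapy

class FlowDemoSpider(scrapy.Spider):
    # minimal sketch of the engine/scheduler/downloader loop; the site and
    # selectors below are only illustrative assumptions
    name = "flow_demo"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # items generated here are routed to the item pipeline
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

        # new requests generated here go back to the scheduler, and the loop repeats
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)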


A small aside here: the spider middleware (the offsite filtering, specifically) examines the requests coming out of the spider and only lets through the ones that satisfy allowed_domains. And here is a question of my own: Scrapy has a LinkExtractor, and my understanding is that it parses the URLs out of a response and, when a URL matches, the response fetched from that link is sent to the corresponding callback. But when you create a request with scrapy.Request you can also set a callback, so which callback does the response get sent to in that case? Both of them? (That doesn't feel right.) Or is there some priority?
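
To make the question concrete, here is a rough sketch of the two places a callback can be set; the domain, URL patterns and method names are placeholders I made up, not from the original post. A callback on a CrawlSpider Rule handles responses for the links its LinkExtractor pulled out, while a callback set directly on a scrapy.Request handles only that request's response.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DemoCrawlSpider(CrawlSpider):
    # placeholder names and URL patterns, for illustration only
    name = "demo_crawl"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    # callback on a Rule: used for responses of links the LinkExtractor found
    rules = (
        Rule(LinkExtractor(allow=r"/item/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # callback on a Request: used only for this particular request's response
        yield scrapy.Request("http://www.example.com/detail",
                             callback=self.parse_detail)

    def parse_detail(self, response):
        pass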


The most important thing about Scrapy is that it can handle cookies automatically, but there is a strange-looking example in the official documentation:

for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)

My understanding here is that when a single spider only keeps one cookie session, the default cookiejar is used, so we do not need to assign a cookiejar through meta when issuing requests. However, when a single spider has to keep several cookie sessions at the same time, we manually attach a cookiejar to each request, because each response needs a different set of cookies for its next request.
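
Following that understanding, the official example continues roughly like this: the cookiejar meta key is not "sticky", so every follow-up request that should reuse the same cookie session has to pass it along again (the URL and callback names below follow the docs example).

    def parse_page(self, response):
        # the cookiejar meta key is not sticky: pass it along on every
        # follow-up request that should reuse the same cookie session
        return scrapy.Request("http://www.example.com/otherpage",
                              meta={'cookiejar': response.meta['cookiejar']},
                              callback=self.parse_other_page)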


By the way, when our crawler pretends to be a browser, what it really relies on is the cookie plus the Referer and User-Agent fields in the request header. Scrapy takes care of these for us, so we can focus on the business logic, which is quite nice. Be sure to take a look at the source code when you have time.
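
For instance, the User-Agent and default headers that Scrapy attaches for us can be seen (and overridden) in settings.py; USER_AGENT, DEFAULT_REQUEST_HEADERS and COOKIES_ENABLED are real Scrapy settings, while the values below are only examples.

# settings.py -- example values only
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en",
    # Referer is usually set per request (or by a middleware) rather than globally
}

# cookies are handled by the built-in CookiesMiddleware; this just makes the default explicit
COOKIES_ENABLED = True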


Next, a short discussion of the difference between a session and a cookie.

Sessions and cookies are both ways of keeping state across HTTP requests: the session stores the information on the server side, while the cookie is stored on the client.

When cookies are enabled, the first time the client sends a request, the server generates a session_id and puts it in the Set-Cookie header of the response. When the client receives it, it knows that every later request must carry this cookie; when such a request arrives, the server looks up the corresponding session by this session_id (in Tomcat, for example, this is handled by the container itself). If cookies are disabled, the session_id is usually passed through the URL instead, or via a hidden input field.
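
A tiny sketch with the requests library shows the same mechanism (the URLs are placeholders and the cookie name is just the Tomcat convention): the first response carries the session id in a Set-Cookie header, the Session object stores it, and later requests send it back automatically in the Cookie header.

import requests

session = requests.Session()

# first request: the server answers with something like
#   Set-Cookie: JSESSIONID=abc123; Path=/
first = session.get("http://www.example.com/login")
print(first.headers.get("Set-Cookie"))

# later requests: the stored cookie is attached automatically, e.g.
#   Cookie: JSESSIONID=abc123
second = session.get("http://www.example.com/profile")
print(second.request.headers.get("Cookie"))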


Each time the client sends a request, it searches its local cookies for those whose scope (domain and path) covers the requested resource and adds them to the Cookie header of the request. Note that the response does not have a Cookie header at all; if you look at the response headers you will not find one. It only has a Set-Cookie header, which tells the client what kind of cookie to send with the next request.
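
A small sketch of that matching rule using the requests cookie jar (the domain, path and cookie name are made-up examples): a stored cookie is attached to a request only when the cookie's domain and path cover the URL being requested.

import requests

jar = requests.cookies.RequestsCookieJar()
# a cookie scoped to example.com under /shop (illustrative values)
jar.set("cart_id", "42", domain="example.com", path="/shop")

# sent: /shop/items falls under the cookie's path
r1 = requests.get("http://example.com/shop/items", cookies=jar)
print(r1.request.headers.get("Cookie"))   # contains cart_id=42

# not sent: /blog is outside the cookie's path
r2 = requests.get("http://example.com/blog", cookies=jar)
print(r2.request.headers.get("Cookie"))   # no cart_id here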


I hope you got something out of this, and if any expert passing by can clear up my doubts, please do ~

Finally, here is a small new project I built with Scrapy, on my GitHub: https://github.com/yue060904/Spider
