A little talk about Python scrapy

Source: Internet
Author: User
Tags: setcookie, python, scrapy

Recently, for a course project, I remembered the Python I had learned before and decided to write a web crawler, so I picked up the Scrapy framework. Along the way I also tried the requests library, but it is not as convenient as Scrapy: it does not handle the cookie mechanism for you, so cookies have to be managed manually, which is more trouble. Let me share my understanding of how Scrapy works:

The Scrapy architecture diagram shows the approximate structure of the framework. The running flow is roughly as follows (a minimal spider sketch of this loop comes after the list):

1. The Scrapy engine opens a domain and finds the spider that handles this domain (the spider middleware matters here, for example for adding a Referer header to requests based on the URL of the corresponding response).

2. The spider processes the starting URL and returns requests; the engine puts these requests into the scheduler.

3. The engine asks the scheduler for the next request.

4. The scheduler returns this request to the engine, which sends it through the downloader middleware (which handles cookies, authentication, User-Agent and so on); the downloader then fetches the content and returns the corresponding response, which goes back to the spider.

5. The spider generates items or new requests from the response and hands them on: items go to the item pipeline, requests go back to the scheduler.

6. Repeat from step 2 until there are no requests left.
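
To make this loop concrete, here is a minimal spider sketch; the site, CSS selectors and field names are illustrative assumptions of mine, not part of the original post. The parse() callback yields items (routed to the item pipeline) and new requests (sent back to the scheduler), which is exactly the loop described above.

import scrapy

class FlowDemoSpider(scrapy.Spider):
    # minimal sketch of the engine/scheduler/downloader loop; the site and
    # selectors below are only illustrative assumptions
    name = "flow_demo"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # items generated here are routed to the item pipeline
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

        # new requests generated here go back to the scheduler, and the loop repeats
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)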


A small aside here: the spider middleware (the offsite filtering, specifically) examines the requests coming out of the spider and only lets through the ones that satisfy allowed_domains. And here is a question of my own: Scrapy has a LinkExtractor, and my understanding is that it parses the URLs out of a response and, when a URL matches, the response fetched from that link is sent to the corresponding callback. But when you create a request with scrapy.Request you can also set a callback, so which callback does the response get sent to in that case? Both of them? (That doesn't feel right.) Or is there some priority?
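
To make the question concrete, here is a rough sketch of the two places a callback can be set; the domain, URL patterns and method names are placeholders I made up, not from the original post. A callback on a CrawlSpider Rule handles responses for the links its LinkExtractor pulled out, while a callback set directly on a scrapy.Request handles only that request's response.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DemoCrawlSpider(CrawlSpider):
    # placeholder names and URL patterns, for illustration only
    name = "demo_crawl"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    # callback on a Rule: used for responses of links the LinkExtractor found
    rules = (
        Rule(LinkExtractor(allow=r"/item/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # callback on a Request: used only for this particular request's response
        yield scrapy.Request("http://www.example.com/detail",
                             callback=self.parse_detail)

    def parse_detail(self, response):
        pass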


The most important thing about Scrapy is that it can handle cookies automatically, but there is a strange-looking example in the official documentation:

for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)

My understanding here is that when a single spider only keeps one cookie session, the default cookiejar is used, so we do not need to assign a cookiejar through meta when issuing requests. However, when a single spider has to keep several cookie sessions at the same time, we manually attach a cookiejar to each request, because each response needs a different set of cookies for its next request.
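
Following that understanding, the official example continues roughly like this: the cookiejar meta key is not "sticky", so every follow-up request that should reuse the same cookie session has to pass it along again (the URL and callback names below follow the docs example).

    def parse_page(self, response):
        # the cookiejar meta key is not sticky: pass it along on every
        # follow-up request that should reuse the same cookie session
        return scrapy.Request("http://www.example.com/otherpage",
                              meta={'cookiejar': response.meta['cookiejar']},
                              callback=self.parse_other_page)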


By the way, when our crawler pretends to be a browser, what it really relies on is the cookie plus the Referer and User-Agent fields in the request header. Scrapy takes care of these for us, so we can focus on the business logic, which is quite nice. Be sure to take a look at the source code when you have time.
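
For instance, the User-Agent and default headers that Scrapy attaches for us can be seen (and overridden) in settings.py; USER_AGENT, DEFAULT_REQUEST_HEADERS and COOKIES_ENABLED are real Scrapy settings, while the values below are only examples.

# settings.py -- example values only
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en",
    # Referer is usually set per request (or by a middleware) rather than globally
}

# cookies are handled by the built-in CookiesMiddleware; this just makes the default explicit
COOKIES_ENABLED = True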


Next, a short discussion of the difference between a session and a cookie.

Sessions and cookies are both ways of keeping state across HTTP requests: the session stores the information on the server side, while the cookie is stored on the client.

When cookies are enabled, the first time the client sends a request, the server generates a session_id and puts it in the Set-Cookie header of the response. When the client receives it, it knows that every later request must carry this cookie; when such a request arrives, the server looks up the corresponding session by this session_id (in Tomcat, for example, this is handled by the container itself). If cookies are disabled, the session_id is usually passed through the URL instead, or via a hidden input field.
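
A tiny sketch with the requests library shows the same mechanism (the URLs are placeholders and the cookie name is just the Tomcat convention): the first response carries the session id in a Set-Cookie header, the Session object stores it, and later requests send it back automatically in the Cookie header.

import requests

session = requests.Session()

# first request: the server answers with something like
#   Set-Cookie: JSESSIONID=abc123; Path=/
first = session.get("http://www.example.com/login")
print(first.headers.get("Set-Cookie"))

# later requests: the stored cookie is attached automatically, e.g.
#   Cookie: JSESSIONID=abc123
second = session.get("http://www.example.com/profile")
print(second.request.headers.get("Cookie"))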


Each time the client sends a request, it searches its local cookies for those whose scope (domain and path) covers the requested resource and adds them to the Cookie header of the request. Note that the response does not have a Cookie header at all; if you look at the response headers you will not find one. It only has a Set-Cookie header, which tells the client what kind of cookie to send with the next request.
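
A small sketch of that matching rule using the requests cookie jar (the domain, path and cookie name are made-up examples): a stored cookie is attached to a request only when the cookie's domain and path cover the URL being requested.

import requests

jar = requests.cookies.RequestsCookieJar()
# a cookie scoped to example.com under /shop (illustrative values)
jar.set("cart_id", "42", domain="example.com", path="/shop")

# sent: /shop/items falls under the cookie's path
r1 = requests.get("http://example.com/shop/items", cookies=jar)
print(r1.request.headers.get("Cookie"))   # contains cart_id=42

# not sent: /blog is outside the cookie's path
r2 = requests.get("http://example.com/blog", cookies=jar)
print(r2.request.headers.get("Cookie"))   # no cart_id here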


I hope you got something out of this, and if any expert passing by can clear up my doubts, please do ~

Finally, here is a small new project I built with Scrapy, on my GitHub: https://github.com/yue060904/Spider
