Scrapy: manually adding cookies and turning off the duplicate filter (dupefilter)


Scrapy is perfectly capable of handling cookies on its own, and it works much like a browser: the client sends a request, the server's response carries Set-Cookie headers, and the next request includes those cookies in its Cookie header, bringing along whatever the server asked to be stored.

In a browser, the whole process needs no manual intervention; the browser completes it automatically.

In Scrapy, no intervention is needed either: the same behavior is handled automatically by CookiesMiddleware.
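As a quick illustration, here is a minimal sketch of that automatic behavior (the spider name and URLs are hypothetical): any Set-Cookie headers returned for the first page are stored by CookiesMiddleware and attached to the follow-up request with no extra code.

import scrapy

class CookieDemoSpider(scrapy.Spider):
    name = 'cookie_demo'                        # hypothetical name
    start_urls = ['https://example.com/login']  # hypothetical URL

    def parse(self, response):
        # Any Set-Cookie headers in this response are stored by
        # CookiesMiddleware; nothing needs to be done here.
        yield response.follow('/dashboard', callback=self.parse_dashboard)

    def parse_dashboard(self, response):
        # This request already carried the cookies set at /login.
        self.logger.info('fetched %s with stored cookies', response.url)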

To use it: 1. Turn the switches on in settings.py:

COOKIES_ENABLED = True
COOKIES_DEBUG = True   # with this on, the log shows which cookies each request sent and received

2. Run your crawler.
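Equivalently, the same two switches can be set per spider through the custom_settings class attribute rather than project-wide in settings.py; a small sketch (the spider name is hypothetical):

import scrapy

class MySpider(scrapy.Spider):
    name = 'mysite'                  # hypothetical name
    custom_settings = {
        'COOKIES_ENABLED': True,     # already the default
        'COOKIES_DEBUG': True,       # log cookies sent/received per request
    }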

So here is the question: how do you add extra cookies by hand?

After digging through a lot of documentation and source code, the summary is this: cookies must be attached on the Request side, so there is no point digging through Response. There are two ways to build a request (both appear in the sketch below):
1) Request and FormRequest;
2) Response.follow(...), which, once executed, returns a Request object.
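Both construction paths accept the usual Request keyword arguments, so cookies can also be passed straight into the constructor. A sketch written as a spider callback (URLs and the callback name are hypothetical):

from scrapy import Request

def parse(self, response):
    # 1) Build the Request directly; cookies go in the constructor:
    yield Request('https://example.com/page2',        # hypothetical URL
                  callback=self.parse_nextpage,
                  cookies={'person': 'zhouxingchi'})
    # 2) Or let response.follow() build the Request for you:
    yield response.follow('/page3', callback=self.parse_nextpage)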

3. Why would you add a cookie manually?

The answer is that many web pages nowadays use JS to add cookies to the document, for example:

document.cookie = 'person=zhouxingchi';

When a cookie is added on the page by JS like this, a browser will carry it on the next request.

But Scrapy only fetches the page source and cannot execute the JS, so the developer has to add the cookie to the next request manually.
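Before the mechanics, it helps to see where the value can come from: since the JS-set cookie sits in the page source as plain text, it can be scraped out with a regex and attached by hand. A sketch (the URL path and callback name are hypothetical; the attachment details follow in section 4):

import re

def parse(self, response):
    # The raw HTML contains: document.cookie = 'person=zhouxingchi';
    # Scrapy never executes it, so pull the value out with a regex:
    m = re.search(r"document\.cookie\s*=\s*'person=([^']+)'", response.text)
    req = response.follow('/next', callback=self.parse_nextpage)  # hypothetical path
    if m:
        req.cookies.update({'person': m.group(1)})
    yield req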

4. How to add cookies manually

After reading a lot of source code, documentation, and examples, I finally worked it out.

Step one: modify start_requests and enable meta['cookiejar']:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url,
                      meta={'cookiejar': 'mysitecom'},
                      callback=self.parse)
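A side note on the 'cookiejar' value: it is just a key, and requests with different keys get independent cookie jars, so one spider can keep several sessions apart. A minimal sketch in the spirit of the Scrapy docs example:

def start_requests(self):
    for i, url in enumerate(self.start_urls):
        # one independent cookie jar per start URL
        yield Request(url, meta={'cookiejar': i}, callback=self.parse)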

Step two: when building the follow-up request req, carry meta['cookiejar'] along:

def parse(self, response):
    req = response.follow(response.url, self.parse_nextpage)
    # or:
    # req = Request(url, self.parse_nextpage)
    req.meta['cookiejar'] = response.meta['cookiejar']
    # at the same time, update request.cookies (a dict-like attribute)
    req.cookies.update({'person': 'zhouxingchi'})
    yield req

With that, once req is yielded, the request goes out carrying the 'person' cookie.
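Putting step one and step two together, a minimal end-to-end sketch (the spider name and URL are hypothetical):

import scrapy
from scrapy import Request

class MySiteSpider(scrapy.Spider):
    name = 'mysitecom'                      # hypothetical name
    start_urls = ['https://example.com/']   # hypothetical URL

    def start_requests(self):
        # step one: enable a named cookie jar
        for url in self.start_urls:
            yield Request(url, meta={'cookiejar': 'mysitecom'},
                          callback=self.parse)

    def parse(self, response):
        # step two: carry the jar forward and add the JS-set cookie by hand
        req = response.follow(response.url, self.parse_nextpage)
        req.meta['cookiejar'] = response.meta['cookiejar']
        req.cookies.update({'person': 'zhouxingchi'})
        yield req

    def parse_nextpage(self, response):
        self.logger.info('fetched %s', response.url)

Note that parse re-requests response.url, which runs straight into the duplicate filter discussed next.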

5. Bonus tip: turning off the duplicate filter (dupefilter)

If the URL in step 4 has already been requested once, re-submitting it with yield req shows this in the log:

msg = ("Filtered Duplicate Request:% (Request) S"
"-no more duplicates'll be shown"
"(Click Dupefilter_debug to Show all Duplicates) ")

In other words: the first request for that URL went out without the cookie, and the retry with the cookie attached gets dropped by Scrapy's duplicate request filter (DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'). The default filter fingerprints a request by its method, URL, and body, so an added cookie does not make the request look new. Very frustrating.

So the simplest idea is to turn this duplicate filter off.

I combed through the Scrapy source again and found no switch that turns it off outright.

But there is a place where it can be controlled.

The answer is the following function in scrapy/core/scheduler.py:

def enqueue_request(self, request):
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    dqok = self._dqpush(request)
    if dqok:
        self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
    else:
        self._mqpush(request)
        self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
    self.stats.inc_value('scheduler/enqueued', spider=self.spider)
    return True

request.dont_filter defaults to False, as can be seen in Request's definition file:

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

So, since the filter itself cannot be removed, pass dont_filter=True when constructing the request instead.

That's it:

def parse(self, response):
    req = response.follow(response.url, self.parse_nextpage, dont_filter=True)
    req.meta['cookiejar'] = response.meta['cookiejar']
    req.cookies.update({'person': 'zhouxingchi'})
    yield req

Complete.

If you have any questions, feel free to join one of the groups below to discuss (anyone posting ads gets kicked):

Python crawler group 1: 10255808

Python crawler group 2: 429348027
