Scrapy: manually adding cookies and turning off the duplicate filter (dupefilter)


Scrapy is perfectly capable of handling cookies on its own, and it works much like a browser: the client sends a request, the server's response carries Set-Cookie headers, and the next request includes those cookies in its Cookie header, bringing along whatever the server asked to be stored.

In a browser, the whole process needs no manual intervention; the browser completes it automatically.

In Scrapy, no intervention is needed either: the same behavior is handled automatically by CookiesMiddleware.
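As a quick illustration, here is a minimal sketch of that automatic behavior (the spider name and URLs are hypothetical): any Set-Cookie headers returned for the first page are stored by CookiesMiddleware and attached to the follow-up request with no extra code.

import scrapy

class CookieDemoSpider(scrapy.Spider):
    name = 'cookie_demo'                        # hypothetical name
    start_urls = ['https://example.com/login']  # hypothetical URL

    def parse(self, response):
        # Any Set-Cookie headers in this response are stored by
        # CookiesMiddleware; nothing needs to be done here.
        yield response.follow('/dashboard', callback=self.parse_dashboard)

    def parse_dashboard(self, response):
        # This request already carried the cookies set at /login.
        self.logger.info('fetched %s with stored cookies', response.url)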

To use it: 1. Turn the switches on in settings.py:

COOKIES_ENABLED = True
COOKIES_DEBUG = True   # with this on, the log shows which cookies each request sent and received

2. Run your crawler.
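Equivalently, the same two switches can be set per spider through the custom_settings class attribute rather than project-wide in settings.py; a small sketch (the spider name is hypothetical):

import scrapy

class MySpider(scrapy.Spider):
    name = 'mysite'                  # hypothetical name
    custom_settings = {
        'COOKIES_ENABLED': True,     # already the default
        'COOKIES_DEBUG': True,       # log cookies sent/received per request
    }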

So here is the question: how do you add extra cookies by hand?

After digging through a lot of documentation and source code, the summary is this: cookies must be attached on the Request side, so there is no point digging through Response. There are two ways to build a request (both appear in the sketch below):
1) Request and FormRequest;
2) Response.follow(...), which, once executed, returns a Request object.
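Both construction paths accept the usual Request keyword arguments, so cookies can also be passed straight into the constructor. A sketch written as a spider callback (URLs and the callback name are hypothetical):

from scrapy import Request

def parse(self, response):
    # 1) Build the Request directly; cookies go in the constructor:
    yield Request('https://example.com/page2',        # hypothetical URL
                  callback=self.parse_nextpage,
                  cookies={'person': 'zhouxingchi'})
    # 2) Or let response.follow() build the Request for you:
    yield response.follow('/page3', callback=self.parse_nextpage)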

3. Why would you add a cookie manually?

The answer is that many web pages nowadays use JS to add cookies to the document, for example:

document.cookie = 'person=zhouxingchi';

When a cookie is added on the page by JS like this, a browser will carry it on the next request.

But Scrapy only fetches the page source and cannot execute the JS, so the developer has to add the cookie to the next request manually.
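Before the mechanics, it helps to see where the value can come from: since the JS-set cookie sits in the page source as plain text, it can be scraped out with a regex and attached by hand. A sketch (the URL path and callback name are hypothetical; the attachment details follow in section 4):

import re

def parse(self, response):
    # The raw HTML contains: document.cookie = 'person=zhouxingchi';
    # Scrapy never executes it, so pull the value out with a regex:
    m = re.search(r"document\.cookie\s*=\s*'person=([^']+)'", response.text)
    req = response.follow('/next', callback=self.parse_nextpage)  # hypothetical path
    if m:
        req.cookies.update({'person': m.group(1)})
    yield req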

4. How to add cookies manually

After reading a lot of source code, documentation, and examples, I finally worked it out.

Step one: modify start_requests and enable meta['cookiejar']:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url,
                      meta={'cookiejar': 'mysitecom'},
                      callback=self.parse)
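A side note on the 'cookiejar' value: it is just a key, and requests with different keys get independent cookie jars, so one spider can keep several sessions apart. A minimal sketch in the spirit of the Scrapy docs example:

def start_requests(self):
    for i, url in enumerate(self.start_urls):
        # one independent cookie jar per start URL
        yield Request(url, meta={'cookiejar': i}, callback=self.parse)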

Step two: when building the follow-up request req, carry meta['cookiejar'] along:

def parse(self, response):
    req = response.follow(response.url, self.parse_nextpage)
    # or:
    # req = Request(url, self.parse_nextpage)
    req.meta['cookiejar'] = response.meta['cookiejar']
    # at the same time, update request.cookies (a dict-like attribute)
    req.cookies.update({'person': 'zhouxingchi'})
    yield req

With that, once req is yielded, the request goes out carrying the 'person' cookie.
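Putting step one and step two together, a minimal end-to-end sketch (the spider name and URL are hypothetical):

import scrapy
from scrapy import Request

class MySiteSpider(scrapy.Spider):
    name = 'mysitecom'                      # hypothetical name
    start_urls = ['https://example.com/']   # hypothetical URL

    def start_requests(self):
        # step one: enable a named cookie jar
        for url in self.start_urls:
            yield Request(url, meta={'cookiejar': 'mysitecom'},
                          callback=self.parse)

    def parse(self, response):
        # step two: carry the jar forward and add the JS-set cookie by hand
        req = response.follow(response.url, self.parse_nextpage)
        req.meta['cookiejar'] = response.meta['cookiejar']
        req.cookies.update({'person': 'zhouxingchi'})
        yield req

    def parse_nextpage(self, response):
        self.logger.info('fetched %s', response.url)

Note that parse re-requests response.url, which runs straight into the duplicate filter discussed next.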

5. Bonus tip: turning off the duplicate filter (dupefilter)

If the URL in step 4 has already been requested once, re-submitting it with yield req shows this in the log:

msg = ("Filtered Duplicate Request:% (Request) S"
"-no more duplicates'll be shown"
"(Click Dupefilter_debug to Show all Duplicates) ")

In other words: the first request for that URL went out without the cookie, and the retry with the cookie attached gets dropped by Scrapy's duplicate request filter (DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'). The default filter fingerprints a request by its method, URL, and body, so an added cookie does not make the request look new. Very frustrating.

So the simplest idea is to turn this duplicate filter off.

I combed through the Scrapy source again and found no switch that turns it off outright.

But there is a place where it can be controlled.

The answer is the following function in scrapy/core/scheduler.py:

def enqueue_request(self, request):
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    dqok = self._dqpush(request)
    if dqok:
        self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
    else:
        self._mqpush(request)
        self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
    self.stats.inc_value('scheduler/enqueued', spider=self.spider)
    return True

request.dont_filter defaults to False, as can be seen in Request's definition file:

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

So, since the filter itself cannot be removed, pass dont_filter=True when constructing the request instead.

That's it:

def parse(self, response):
    req = response.follow(response.url, self.parse_nextpage, dont_filter=True)
    req.meta['cookiejar'] = response.meta['cookiejar']
    req.cookies.update({'person': 'zhouxingchi'})
    yield req

Complete.

If you have any questions, feel free to join one of the groups below to discuss (anyone posting ads gets kicked):

Python crawler group 1: 10255808

Python crawler group 2: 429348027
