Scrapy can handle cookies by itself, working much like a browser: the server returns a response with a Set-Cookie header, and the browser attaches that cookie to the Cookie header of its next request, bringing back exactly the cookies the server asked it to set.
The whole cycle needs no manual intervention; the browser completes it automatically.
In Scrapy the same is true: no intervention is needed, and the work is done automatically by CookiesMiddleware.
How to use it:

1. Open the switch in settings.py:

COOKIES_ENABLED = True
COOKIES_DEBUG = True   # with this on, the log shows which cookies each request sent and received

2. Run your crawler.
So, here is the question: how do you add extra cookies by hand?
After digging through a lot of documentation and source code, the summary is: cookies must be attached on the Request side, so there is no point going through the Response source. There are two ways to build a request:

1) Request and FormRequest
2) response.follow(...), which returns a Request object when it runs

3. Why would you add a cookie manually?
The answer is that many web pages now use JS to add cookies to the document, for example:

document.cookie = 'person=zhouxingchi';

A cookie added on the page by JS like this will be carried on the next request if you are using a browser.
But Scrapy only fetches the page source, so the JS is never executed, and the developer has to add the cookie to the next request manually.

4. How to add cookies manually
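Before the JS-set cookie can be re-attached to a request, its value first has to be pulled out of the raw page source. A minimal sketch using a regular expression (the quoting pattern and the person cookie are just this article's running example; adjust the pattern to the real page):

```python
import re

# raw page source as Scrapy sees it -- the JS is never executed
body = "<script>document.cookie = 'person=zhouxingchi';</script>"

# pull name=value out of the document.cookie assignment
# (hypothetical pattern; tune the quoting/spacing for the real site)
match = re.search(r"document\.cookie\s*=\s*'([^=]+)=([^;']+)", body)
if match:
    name, value = match.group(1), match.group(2)
    js_cookies = {name: value}   # e.g. {'person': 'zhouxingchi'}
```

The extracted dictionary is what you would later pass to req.cookies.update(...).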
After reading a lot of source code, documentation, and examples, this is what finally worked.

Step one: modify start_requests and enable meta['cookiejar']:
def start_requests(self):
    for url in self.start_urls:
        yield Request(url,
                      meta={'cookiejar': 'mysitecom'},
                      callback=self.parse
                      )
Step two: when building the next request req, carry meta['cookiejar'] forward:
def parse(self, response):
    req = response.follow(response.url, self.parse_nextpage)
    # or:
    # req = Request(url, self.parse_nextpage)
    req.meta['cookiejar'] = response.meta['cookiejar']
    # at the same time, update request.cookies (it behaves like a dictionary)
    req.cookies.update({'person': 'zhouxingchi'})
    yield req
So, once the cookies are added and req is yielded, the request goes out carrying the person cookie.

5. A bonus tip: turning off the duplicate filter (dupefilter)
If the URL in step 4 has already been requested once, then when you yield req for it again, you will see this in the log:
msg = ("Filtered duplicate request: %(request)s"
       " - no more duplicates will be shown"
       " (see DUPEFILTER_DEBUG to show all duplicates)")
That is: the first visit to the URL went out without the cookie, and the second request (now carrying the cookie) gets dropped by Scrapy's duplicate-request filter (DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'). Very frustrating.
So the simplest idea is to turn off this duplicate filter.
After going through the Scrapy source again, it turns out the filter cannot simply be switched off, but there is a place where it can be controlled.
The answer is the following function in scrapy/core/scheduler.py:
def enqueue_request(self, request):
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    dqok = self._dqpush(request)
    if dqok:
        self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
    else:
        self._mqpush(request)
        self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
    self.stats.inc_value('scheduler/enqueued', spider=self.spider)
    return True
request.dont_filter defaults to False, as you can see in the Request definition:

class Request(object_ref):
    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):
So, since the duplicate filter itself cannot be removed, you can instead pass dont_filter=True when the request is created.
Like this:
def parse(self, response):
    req = response.follow(response.url, self.parse_nextpage, dont_filter=True)
    req.meta['cookiejar'] = response.meta['cookiejar']
    req.cookies.update({'person': 'zhouxingchi'})
    yield req
Complete.
If you have any questions, join one of the groups below to discuss (anyone posting ads gets kicked straight away):
Python crawler group 10255808
Python crawler group 2 429348027