Default Request Header
Create a new project and spider from the command line:
scrapy startproject myspider
cd myspider
scrapy genspider scrapy_spider httpbin.org
A request to https://httpbin.org/get?show_env=1 returns the headers the server received, so we can see what client information the request carried. Open the same URL in your own browser to compare it with your browser's information.
To make the returned text easier to read, paste it into https://www.json.cn/ to format it as JSON. This step is not repeated in the rest of the article.
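If you would rather not paste the response into a website, the same formatting can be done locally. The following is a minimal sketch using Python's standard json module. The raw_response string is a made-up, shortened stand-in for what httpbin returns, not real output, and pretty is a hypothetical helper name.

```python
import json

# Hypothetical, shortened stand-in for the text returned by httpbin.org/get
raw_response = (
    '{"args": {"show_env": "1"}, '
    '"headers": {"Host": "httpbin.org", '
    '"User-Agent": "Scrapy/1.1.2 (+http://scrapy.org)"}}'
)


def pretty(text):
    """Parse a JSON string and return it indented for easy reading."""
    return json.dumps(json.loads(text), indent=4, ensure_ascii=False)


print(pretty(raw_response))
```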
Modify the request header
To disguise the crawler as a browser, we modify the request header.
Open the link below and choose a User-Agent string you like:
http://www.useragentstring.com/pages/useragentstring.php?name=Chrome
This article uses a Chrome browser request header.
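The priority section later in this article sets USER_AGENT in settings.py. As a minimal sketch of that project-wide setting, assuming the Chrome string used in this article is the one you picked, the assignment might look like this:

```python
# settings.py (sketch)
# Project-wide User-Agent. It is used unless a spider's custom_settings
# or a request's own headers provide a different value.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
)
```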
Request the URL again, and you will find that the request header has been successfully replaced:
"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
Method Three: Link settings
This method takes effect on a single link only: just the request for that link gets the header.
Set the headers parameter when constructing the Request.
Method Four: Middleware settings
This method changes the request header rules for the entire project. It is flexible: depending on how you write the middleware, you can configure different behaviors. The following is a relatively simple example.
We use Scrapy's default User-Agent middleware as a reference:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
Writing middleware
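The middleware code itself is not shown here, so the following is only a sketch of what such a middleware might look like. RandomUserAgentMiddleware and USER_AGENT_POOL are hypothetical names, and the strings in the pool are sample values. Scrapy calls process_request on every enabled downloader middleware, so a plain class with that method is enough for this sketch; it could equally subclass UserAgentMiddleware.

```python
import random

# Hypothetical pool of User-Agent strings; replace these with strings
# copied from real browsers.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
    "Gecko/20100101 Firefox/40.1",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a User-Agent for each request."""

    def process_request(self, request, spider):
        # Overwrites any User-Agent already present on the request.
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        # Returning None tells Scrapy to keep processing the request.
        return None


# To enable it, assuming the class is saved in myspider/middlewares.py,
# add this to settings.py:
# DOWNLOADER_MIDDLEWARES = {
#     "myspider.middlewares.RandomUserAgentMiddleware": 543,
# }
```

Because this sketch overwrites the header unconditionally, it also replaces a User-Agent set on an individual request. Use request.headers.setdefault instead if per-request headers should win.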
Priority
Suppose you make the following settings:
# settings.py
USER_AGENT = "Settings"
# scrapy_spider.py
custom_settings = {"USER_AGENT": "Custom_settings"}
headers = {"User-Agent": "Header"}
The result is:
"User-Agent": "Header"
Comment out headers:
"User-Agent": "Custom_settings"
Comment out custom_settings:
"User-Agent": "Settings"
Comment out the USER_AGENT line in settings.py:
"User-Agent": "Scrapy/1.1.2 (+http://scrapy.org)"
So the priority is:
headers > custom_settings > settings.py > Scrapy default
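The lookup order can be modeled in a few lines of plain Python. This is a toy model of the order observed above, not Scrapy's actual implementation; effective_user_agent and SCRAPY_DEFAULTS are hypothetical names.

```python
from collections import ChainMap

# Default value taken from the experiment above.
SCRAPY_DEFAULTS = {"USER_AGENT": "Scrapy/1.1.2 (+http://scrapy.org)"}


def effective_user_agent(request_headers, custom_settings, project_settings):
    """Return the User-Agent that wins under the observed priority."""
    # Earlier maps win: custom_settings shadows settings.py,
    # which in turn shadows the built-in default.
    settings = ChainMap(custom_settings, project_settings, SCRAPY_DEFAULTS)
    headers = dict(request_headers)
    # The settings value is only a fallback; a header set on the
    # request itself is kept as-is.
    headers.setdefault("User-Agent", settings["USER_AGENT"])
    return headers["User-Agent"]


print(effective_user_agent(
    {"User-Agent": "Header"},
    {"USER_AGENT": "Custom_settings"},
    {"USER_AGENT": "Settings"},
))
```

Passing empty dictionaries for the layers you have "commented out" walks through the remaining cases.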
Attention
Pay attention to how the User-Agent key is written:
headers={"User-Agent": USER_AGENT}
If it is written wrong, strange things are likely to happen:
headers={"User_Agent": USER_AGENT}
An extra Scrapy entry shows up in the request header:
"User-Agent": "Scrapy/1.1.2 (+http://scrapy.org), Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
In fact, the two are easy to tell apart:
User-Agent: the browser request header parameter, written with the hyphen (-) that is common in HTML code
USER_AGENT: a Python variable
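One possible explanation for the merged value, offered as an assumption rather than something this article confirms: with the underscore spelling, Scrapy still adds its own User-Agent, so the server receives two different header names. Servers that follow the CGI/WSGI convention turn header names into HTTP_ keys by uppercasing them and replacing hyphens with underscores, and some servers join headers that collide on the same key with a comma. The sketch below models only that naming convention; wsgi_environ_key and merge_headers are hypothetical helpers.

```python
def wsgi_environ_key(header_name):
    """Map an HTTP header name to its CGI/WSGI environ key."""
    return "HTTP_" + header_name.upper().replace("-", "_")


def merge_headers(raw_headers):
    """Collect (name, value) pairs, comma-joining names that collide."""
    environ = {}
    for name, value in raw_headers:
        key = wsgi_environ_key(name)
        if key in environ:
            environ[key] = environ[key] + ", " + value
        else:
            environ[key] = value
    return environ


sent = [
    ("User-Agent", "Scrapy/1.1.2 (+http://scrapy.org)"),  # added by Scrapy
    ("User_Agent", "Mozilla/5.0"),  # misspelled key from the spider
]
print(merge_headers(sent))
```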
Suggestions:
Whenever you need to write browser header parameters and are afraid of getting them wrong, open your own browser, load any page, and copy the values from its request headers.
Going from the basics to real projects, I ran into a lot of pitfalls. This article is a brief summary of them and also shares a few useful websites. I hope it gives you a shortcut with fewer detours; if so, it has served its purpose.
By the way, a quick plug.
I have recently been working on an open source library, Chinesename, for Chinese names. The basic naming feature is implemented, but the names still need to be optimized. If you would like to work on it together, you are welcome to join.