Request rejected when crawling a website? Scrapy makes request header settings easy!

Source: Internet
Author: User

Default Request Header

Create a new crawler from the command line:

scrapy startproject myspider
cd myspider
scrapy genspider scrapy_spider httpbin.org

Request https://httpbin.org/get?show_env=1 to see the header information sent with the request. Open the same URL in your browser to confirm that it shows your own browser's information.

Paste the returned text into https://www.json.cn/ to format it as JSON for easier reading. This step is not repeated below.

Modify the request header

Next, we modify the request header so that the crawler disguises itself as a browser.

Open the link below and pick a request header you like:

http://www.useragentstring.com/pages/useragentstring.php?name=Chrome

This example uses a Chrome request header.

Request the page again, and the request header has been replaced:

"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",

Method Three: Link settings

This method applies to a single link; only the request for that link is affected.

Set the headers parameter when constructing the Request.

Method Four: Middleware settings

This method changes the request header rules for the whole project and is the most flexible: different implementations can apply different settings. The following is a relatively simple example.

We use Scrapy's default user agent middleware as a reference:

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

Writing middleware

Action Priority

Suppose you make the following settings:

# settings.py
USER_AGENT = "Settings"

# scrapy_spider.py
custom_settings = {"USER_AGENT": "Custom_settings"}
headers = {"User-Agent": "Header"}

The result is:

"User-Agent": "Header"

Comment out headers:

"User-Agent": "Custom_settings"

Comment out custom_settings:

"User-Agent": "Settings"

Comment out the settings.py value:

"User-Agent": "Scrapy/1.1.2 (+http://scrapy.org)"

So the priority is:

headers > custom_settings > settings.py > Scrapy default

Attention

Note the spelling of the User-Agent parameter:

headers={"User-Agent": USER_AGENT}

If it is misspelled, strange things are likely to happen:

headers={"User_Agent": USER_AGENT}

An extra Scrapy entry appears in the request header:

"User-Agent": "Scrapy/1.1.2 (+http://scrapy.org), Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",

In fact, the two are easy to tell apart:

User-Agent: the browser request header parameter; the hyphen (-) is common in HTML code

USER_AGENT: a Python variable

Suggestion:

Whenever you write browser header parameters and are afraid of misspelling them, open your own browser, load any page, and copy them from there.

Going from introduction to real-world use, I fell into a lot of pitfalls. This article is a brief summary of them and also shares a few useful sites. If it saves you some detours, it has served its purpose.

By the way, a plug.

I have recently been writing an open source library, Chinesename, for Chinese names. Basic naming already works, but the names still need to be optimized. If you would like to work on it together, you are welcome to join.
