Conclusions on the requests "Exceeded 30 redirects" problem
Conclusion first: requests sent with the requests library must carry request headers; otherwise the session between client and server cannot be maintained, and the error above is raised.
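For readers who just want the fix, here is a minimal sketch of what the rest of the article arrives at (the User-Agent string here is only an example, not the one required):

# Minimal sketch of the fix: send a User-Agent header and decode the body explicitly
import requests

headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;)"}
html = requests.get('https://baike.baidu.com', headers=headers).content.decode('utf8')
print(html[:200])  # first part of the page, decoded as utf-8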
Yesterday a friend ran into this problem while crawling a web page, and I later ran some simple tests on it myself.
First, let's briefly describe the problem.
Use urllib to request the page:
# Request initiated with urllib
import urllib.request

response = urllib.request.urlopen("https://baike.baidu.com")
html = response.read().decode('utf8')
print(type(html))
print(html)
The page information of the encyclopedia is displayed normally:
We use requests to request this https page
# Request initiated with requests
import requests

html = requests.get('https://baike.baidu.com')
print(type(html))
print(html)
Then an error is reported:
The number of redirects exceeded 30, at which point requests aborts the request; this is the requests.exceptions.TooManyRedirects: Exceeded 30 redirects error from the title.
Here we come to the first conclusion:
Both urllib and requests follow the Location header of the response and redirect by default.
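As a quick illustration of that default behavior, here is a sketch that inspects the redirect chain requests followed, using http://httpbin.org/redirect/3 (a test endpoint that redirects three times) instead of Baidu:

# Sketch: response.history holds each intermediate 30x response that
# requests followed automatically before the final 200
import requests

r = requests.get('http://httpbin.org/redirect/3')
print(r.status_code)  # 200 after the final hop
for hop in r.history:  # one entry per redirect that was followed
    print(hop.status_code, hop.headers.get('Location'))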
A quick Baidu search turned up the same advice from everyone: disable the allow_redirects field.
In the requests source code, allow_redirects defaults to True.
After the redirection is disabled, the page will not jump.
# Request initiated with requests, with redirection disabled
import requests

html = requests.get('https://baike.baidu.com', allow_redirects=False).text
print(type(html))
print(html)
With redirection disabled, the normal Baike homepage is not displayed; what we get instead is the 302 redirect page itself.
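To make that concrete, a small sketch: with allow_redirects=False the 302 response itself comes back, and its Location header names the target (the exact status and target may differ today from the values seen in the original test):

# Sketch: inspect the 302 response directly instead of following it
import requests

r = requests.get('https://baike.baidu.com', allow_redirects=False)
print(r.status_code)              # 302 in the original test
print(r.headers.get('Location'))  # where the server wants to send us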
A side note: many answers found on Baidu only patch the immediate symptom without explaining the reasoning behind the fix; posting a bare result as an answer is irresponsible.
The real redirection problem here is not that the page jumps, but why it jumps so many times.
I found an article about exceeding the redirect limit when requesting Amazon: http://www.it1352.com/330504.html.
Simply put, without a session established with the server, the page redirects in an endless loop: your original URL A is redirected to URL B, B is redirected to C, C is redirected back to B, and so on.
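Such a loop can be observed by following the redirects by hand; the sketch below (my own illustration, not from the linked article) stops as soon as a URL repeats:

# Sketch: follow redirects manually and detect a loop when a URL repeats
import requests
from urllib.parse import urljoin

url = 'https://baike.baidu.com'
seen = set()
for _ in range(10):  # cap the number of hops
    r = requests.get(url, allow_redirects=False)
    location = r.headers.get('Location')
    if location is None:  # no further redirect, we are done
        break
    url = urljoin(url, location)  # resolve relative Location values
    if url in seen:
        print('redirect loop at', url)
        break
    seen.add(url)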
At the end of that article, adding a request header is suggested as the way to keep the session alive.
# Request initiated with requests, with a request header added
import requests

headers = {"User-Agent": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
html = requests.get('https://baike.baidu.com', headers=headers).text
print(type(html))
print(html)
The page should now come back correctly, but instead we get the following garbled output:
This gives us the second conclusion of this article: the HTTP response headers carry no encoding information, so requests falls back to its own default encoding, and that default is rather capricious. For the causes of and fixes for garbled requests output, see this blog post: https://www.cnblogs.com/billyzh/p/6148066.html.
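A sketch of how to see and fix this from the requests side: apparent_encoding is requests' charset detection on the body, and assigning it back to r.encoding makes r.text decode correctly (assuming the page really is utf-8, as in this case). The article's own fix, decoding .content by hand, follows right after.

# Sketch: compare the encoding requests guessed with the one the body suggests
import requests

headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get('https://baike.baidu.com', headers=headers)
print(r.encoding)           # guessed from the headers
print(r.apparent_encoding)  # detected from the body, e.g. utf-8
r.encoding = r.apparent_encoding  # tell requests the real charset
print(r.text[:200])         # now r.text decodes correctly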
# Request initiated with requests, decoding the content to fix the mojibake
import requests

headers = {"User-Agent": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
html = requests.get('https://baike.baidu.com', headers=headers).content.decode('utf8')
print(type(html))
print(html)
Now the page shows no anomalies and the Baike homepage is displayed correctly.
# Request initiated with requests, with redirection disabled as well
import requests

headers = {"User-Agent": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
html = requests.get('https://baike.baidu.com', headers=headers, allow_redirects=False).content.decode('utf8')
print(type(html))
print(html)
The result is unchanged. So the widely repeated fix of disabling redirects does not actually solve anything; the real fix is keeping the session alive so that the redirects never enter an endless loop.
Conclusion 3: prefer Google over Baidu (for technical questions, at least).
So what exactly goes wrong when we do not fake a request header? We can only analyze the header information step by step.
# Inspect the headers of a urllib Request object before sending it
import urllib.request

request = urllib.request.Request("https://baike.baidu.com")
print(request.headers)                    # {}
print(request.get_header("User-agent"))   # None
But an empty request header cannot actually be sent, so we capture packets to see what was really transmitted.
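An alternative to packet capture, as a sketch: http://httpbin.org/headers echoes back the headers it actually received, so we can see what urllib really puts on the wire (urllib adds a default User-Agent of the form Python-urllib/3.x when the request goes out):

# Sketch: let httpbin.org echo back the headers urllib actually transmitted
import urllib.request

with urllib.request.urlopen('http://httpbin.org/headers') as resp:
    print(resp.read().decode('utf8'))  # JSON showing the default Python-urllib User-Agent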
Urllib Response Header
# Response headers of the urllib request
import urllib.request

response = urllib.request.urlopen("https://baike.baidu.com")
print(response.headers)
urllib did nothing special with the request and put almost nothing in the request header, yet the server treated it kindly and responded with the correct redirect to the final page.
# The requests request exceeds 30 redirects, so its request headers cannot be read this way
import requests

h = requests.get('https://baike.baidu.com')
print(h.request.headers)
Likewise, the response headers cannot be seen this way.
# With redirection disabled, the request headers that requests sent can be printed
import requests

h = requests.get('https://baike.baidu.com', allow_redirects=False)
print(h.request.headers)
{'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
The User-Agent shows the request was sent by the python-requests library itself (here version 2.18.4) rather than by a browser.
# With redirection disabled, print the response headers
import requests

h = requests.get('https://baike.baidu.com', allow_redirects=False).headers
print(h)
{'Connection': 'keep-alive', 'Content-Length': '154', 'Content-Type': 'text/html', 'Date': 'Wed, 31 Jan 2018 04:07:32 GMT', 'Location': 'https://baike.baidu.com/error.html?status=403&uri=/', 'P3p': 'CP=" OTI DSP COR IVA OUR IND COM "', 'Server': 'Apache', 'Set-Cookie': 'BAIDUID=C827DBDDF50E38C0C10F649F1DAAA462:FG=1; expires=Thu, 31-Jan-19 04:07:32 GMT; max-age=31536000; path=/; domain=.baidu.com; version=1'}
The connection is keep-alive, and the redirect target appears in the Location header; note that it points at an error page (error.html?status=403).
# Requests with a browser User-Agent and redirection disabled: print the request headers
import requests

headers = {"User-Agent": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
h = requests.get('https://baike.baidu.com', allow_redirects=False, headers=headers)
print(h.request.headers)
{'User-Agent': 'User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
# Requests carrying the browser User-Agent, with the default redirection enabled
import requests

headers = {"User-Agent": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
h = requests.get('https://baike.baidu.com', headers=headers)
print(h.request.headers)
Same output as above, which again verifies that disabling redirects is not the key to solving the problem.
Based on the tests above, my bold guess is this: when the server receives the urllib request it responds with a Set-Cookie field, a session is established via that cookie, the redirect proceeds normally, and the final page is reached. With requests and no request header, the server does not return Set-Cookie, so the redirect goes in circles and never lands on a page; once the User-Agent field is added to the request header, the server establishes the session and the redirect resolves to the normal page.
Put another way: without the request header, requests cannot maintain a session with the server, so every connection is treated as a brand-new request and redirected again; once the session is kept, the redirect loop disappears.
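If this hypothesis is right, a requests.Session should behave well, since it stores the cookies the server sets and sends them back on the following hops. A sketch (the User-Agent string is just an example):

# Sketch: a Session persists cookies (e.g. Baidu's BAIDUID) across redirects
import requests

s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;)'})
r = s.get('https://baike.baidu.com')
print(r.status_code)         # 200 if the redirect chain resolved
print(s.cookies.get_dict())  # cookies the server set along the way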
# Request headers, with certificate verification disabled (verify=False)
import requests

h = requests.get('https://baike.baidu.com', allow_redirects=False, verify=False)
print(h.request.headers)
Getting the request packet:
# Requests with the request header, redirection disabled, and verify=False
import requests

headers = {"User-Agent": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
h = requests.get('https://baike.baidu.com', allow_redirects=False, headers=headers, verify=False)
print(h.request.headers)
Therefore, you must add the request header information when using requests.
If you happen to read this article, please point out anything I got wrong or understood too shallowly. Thank you.
These tests were not very thorough; in fact, viewing request and response information through http://httpbin.org is more intuitive and convenient, as that site is built specifically for testing HTTP requests.
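For example, a sketch of inspecting exactly what a requests call sends, via that site:

# Sketch: httpbin.org/headers returns, as JSON, the headers it received
import requests

r = requests.get('http://httpbin.org/headers',
                 headers={'User-Agent': 'Mozilla/5.0'})
print(r.json()['headers'])  # exactly what the server saw from us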