Python crawler: fixing request headers that do not take effect
This problem came from calling a function incorrectly, but with a traffic analysis tool it can be located quickly (I recommend Fiddler, an excellent packet capture tool).
The text is as follows:
While crawling data from an app (all of the app's data is requested over HTTP), I used Fiddler to analyze the request information and then wrote the same request headers into a Python program to request the data.
The code is as follows:
import requests

url = 'http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc'
headers = {
    "Host": "xxx.com",
    "Connection": "keep-alive",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://app.jg.eastmoney.com/html_Report/index.html",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-us,en",
    "Cookie": "xxx"
}
r = requests.get(url, headers)
print(r.text)
The request completes with a success status, but the response body is:
{"Id": "6202c187-2fad-46e8-b4c6-b72ac8de0142", "ReturnMsg": "loading failed! "}
The server intercepted the request because it did not look like a normal request.
Then I went to Fiddler to check the record of the request sent by Python (the Host and URL are masked here).
The request details show that the custom request headers were not applied, and the User-Agent is python-requests. A User-Agent that identifies a library, such as the default Java or Python clients, tends to be blocked by any website or app with even basic anti-crawler measures.
The header details are as follows:
GET http://xxx?istartDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc
 &Host=xxx.com
 &Connection=keep-alive
 &Accept=application%2Fjson%2C+text%2Fjavascript%2C+%2A%2F%2A%3B+q%3D0.01
 &User-Agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F29.0.1547.59+Safari%2F537.36
 &X-Requested-With=XMLHttpRequest
 &Referer=xxx
 &Accept-Encoding=gzip%2Cdeflate
 &Accept-Language=en-us%2Cen
 &Cookie=xxx HTTP/1.1
Host: xxx.com
User-Agent: python-requests/2.18.4
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

HTTP/1.1 200 OK
Server: nginx/1.2.2
Date: Sat, 21 Oct 2017 06:07:21 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 75
Connection: keep-alive
Cache-Control: private
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
I did not notice it at first. Only after reading the entire request URL did I find that the program had put the header information into the URL as query parameters.
In other words, the headers were passed to the request function incorrectly.
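A quick way to spot this symptom in a captured URL is to look for header names among the query parameters. The following is a minimal sketch using only the standard library. The `HEADER_NAMES` set and the `header_like_params` helper are illustrative names of my own, not part of Fiddler or Requests, and the URL is an abbreviated placeholder version of the one captured above.

```python
from urllib.parse import urlsplit, parse_qsl

# Header names that would not normally appear as query parameters.
# This set is an illustrative assumption; extend it as needed.
HEADER_NAMES = {
    "host", "connection", "accept", "user-agent", "x-requested-with",
    "referer", "accept-encoding", "accept-language", "cookie",
}


def header_like_params(url):
    """Return the query parameter names in `url` that look like HTTP headers."""
    query = urlsplit(url).query
    return [
        name
        for name, _ in parse_qsl(query, keep_blank_values=True)
        if name.lower() in HEADER_NAMES
    ]


if __name__ == "__main__":
    # Abbreviated placeholder version of the captured request URL.
    captured = (
        "http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1"
        "&Host=xxx.com&Connection=keep-alive"
        "&User-Agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%29&Cookie=xxx"
    )
    print(header_like_params(captured))
```

If the returned list is not empty, the headers have most likely been sent as URL parameters rather than as real request headers.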
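I checked how the Requests library expects headers to be passed and found that one line of my code was wrong. The signature is requests.get(url, params=None, **kwargs), so a dictionary passed as the second positional argument is treated as params (the query string), not as headers. The headers must be passed by keyword, as "headers=".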
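The difference between the positional and the keyword call can be illustrated without sending anything over the network. The sketch below does not call Requests at all: `fake_get` is a hypothetical stand-in that only copies the parameter order of `requests.get(url, params=None, **kwargs)`, and it uses `urllib.parse.urlencode` to approximate how a client turns params into a query string. The URL and header values are placeholders.

```python
from urllib.parse import urlencode


def fake_get(url, params=None, **kwargs):
    """Hypothetical stand-in that mirrors the parameter order of requests.get.

    It sends nothing; it only reports which URL and headers would be used.
    """
    headers = kwargs.get("headers") or {}
    if params:
        # Append params to the query string, as an HTTP client would.
        separator = "&" if "?" in url else "?"
        url = url + separator + urlencode(params)
    return url, headers


if __name__ == "__main__":
    url = "http://example.com/api?pageIndex=1"  # placeholder URL
    headers = {"User-Agent": "Mozilla/5.0", "X-Requested-With": "XMLHttpRequest"}

    # Positional: the dict is bound to `params` and leaks into the URL.
    print(fake_get(url, headers))
    # Keyword: the URL is unchanged and the dict is used as headers.
    print(fake_get(url, headers=headers))
```

The first call reproduces the symptom seen in Fiddler: the header names show up URL-encoded in the query string and no custom headers are set.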
The corrected code is as follows:
import requests

url = 'http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc'
headers = {
    "Host": "xxx.com",
    "Connection": "keep-alive",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://app.jg.eastmoney.com/html_Report/index.html",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-us,en",
    "Cookie": "xxx"
}
r = requests.get(url, headers=headers)
Then I checked the request in Fiddler again.
The headers sent by Python are now correct. The request details are as follows:
GET http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36
Accept-Encoding: gzip,deflate
Accept: application/json, text/javascript, */*; q=0.01
Connection: keep-alive
Host: xxx.com
X-Requested-With: XMLHttpRequest
Referer: http://xxx
Accept-Language: en-us,en
Cookie: xxx

HTTP/1.1 200 OK
Server: nginx/1.2.2
Date: Sat, 21 Oct 2017 06:42:21 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 75
Connection: keep-alive
Cache-Control: private
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
I then sent the request again from the Python program. The request itself went through, but the returned result was still:
{"Id": "6202c187-2fad-46e8-b4c6-b72ac8de0142", "ReturnMsg": "loading failed! "}
This was because the cookie expires after a short time. Once I updated the cookie, the request succeeded and returned the data.
Note that a crawler program should always set its request headers. Things would be more troublesome if this app also blocked IP addresses as an anti-crawling measure.
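Since an HTTP 200 status did not mean that the data was actually returned here, it can help to check the response body explicitly before trusting it. The following is a minimal sketch under stated assumptions: the helper name `looks_rejected` is hypothetical, and treating any `ReturnMsg` that contains "failed" as a rejection is a guess based only on the payload shown above, not on any documented behavior of the app's API.

```python
import json


def looks_rejected(body):
    """Return True if `body` resembles the rejection payload shown above.

    Assumption: a rejected request returns a JSON object whose "ReturnMsg"
    contains the word "failed". Bodies that are not JSON do not match.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    message = payload.get("ReturnMsg")
    return isinstance(message, str) and "failed" in message.lower()


if __name__ == "__main__":
    sample = (
        '{"Id": "6202c187-2fad-46e8-b4c6-b72ac8de0142", '
        '"ReturnMsg": "loading failed! "}'
    )
    if looks_rejected(sample):
        print("Request was rejected; check the headers and refresh the cookie.")
```

In the crawler itself, the text of the response (r.text in the code above) would be passed to the helper, and a match would be the signal to re-check the headers or copy a fresh cookie from Fiddler.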