Python crawler: request headers are set but have no effect (solution)

Last Update:2017-10-27 Source: Internet

Author: User


This problem came from careless use of a function, but with a traffic analysis tool it can be located quickly. (The packet capture tool Fiddler is recommended here.)

The text is as follows:

While crawling data from an app (all of the app's data is requested over HTTP), I used Fiddler to analyze the request information and then wrote the same request headers into a Python program to request the data.

The Code is as follows:

import requests

url = 'http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc'
headers = {
    "Host": "xxx.com",
    "Connection": "keep-alive",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://app.jg.eastmoney.com/html_Report/index.html",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-us,en",
    "Cookie": "xxx"
}
r = requests.get(url, headers)
print(r.text)

The request completes, but the response body is:

{"Id": "6202c187-2fad-46e8-b4c6-b72ac8de0142", "ReturnMsg": "loading failed! "}

The server rejects the request because it does not look like a normal client request.

Then I went to Fiddler to check the record of the request sent by Python. (The two masked parts of the capture are the Host and the URL.)

In the request details, none of my headers had been applied, and the User-Agent was python-requests. (A User-Agent that identifies the client as Java or Python will be blocked by any website or app with even basic anti-crawler measures.)

The Header details are as follows:

GET http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc&Host=xxx.com&Connection=keep-alive&Accept=application%2Fjson%2C+text%2Fjavascript%2C+%2A%2F%2A%3B+q%3D0.01&User-Agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F29.0.1547.59+Safari%2F537.36&X-Requested-With=XMLHttpRequest&Referer=xxx&Accept-Encoding=gzip%2Cdeflate&Accept-Language=en-us%2Cen&Cookie=xxx HTTP/1.1
Host: xxx.com
User-Agent: python-requests/2.18.4
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

HTTP/1.1 200 OK
Server: nginx/1.2.2
Date: Sat, 21 Oct 2017 06:07:21 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 75
Connection: keep-alive
Cache-Control: private
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET

I did not notice it at first. Only after reading the entire request URL did I see that the program had put the header fields into the URL as query parameters.
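A quick way to spot this kind of leak is to parse the captured URL and look for header names among the query parameters. The sketch below uses only the standard library. The `captured` string is a shortened copy of the URL from the capture above, and the `HEADER_NAMES` set is an assumption chosen for this illustration, not an exhaustive list.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the URL seen in the Fiddler capture (placeholders kept as-is).
captured = (
    "http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1"
    "&Host=xxx.com&Connection=keep-alive"
    "&User-Agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%29"
    "&X-Requested-With=XMLHttpRequest&Cookie=xxx"
)

# Illustrative set of names that belong in HTTP headers, not in a query string.
HEADER_NAMES = {
    "Host", "Connection", "Accept", "User-Agent", "X-Requested-With",
    "Referer", "Accept-Encoding", "Accept-Language", "Cookie",
}

# parse_qs decodes percent-escapes and '+' signs back into plain text.
query = parse_qs(urlsplit(captured).query)

# Any header name that shows up as a query key was sent in the wrong place.
leaked = sorted(name for name in query if name in HEADER_NAMES)

print("Header fields found in the query string:", leaked)
print("Decoded User-Agent parameter:", query["User-Agent"][0])
```

If `leaked` is not empty, the headers dict was most likely encoded into the URL instead of being sent as headers.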

That is, I passed the headers argument incorrectly when calling the request function.

I checked the Requests documentation for the headers parameter again and found the faulty line. The second positional parameter of requests.get() is params, so a dict passed positionally is encoded into the query string. The headers must be passed by keyword, as headers=headers.
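The difference can be checked without sending anything over the network by preparing the request and inspecting it. This is a minimal sketch, assuming the Requests library is installed. The URL `http://example.invalid/report` and the two-entry `headers` dict are placeholders for illustration, not the real endpoint. Passing the dict as `params` mirrors what `requests.get(url, headers)` does with a positional second argument.

```python
import requests

# Placeholder URL and a reduced header set, for illustration only.
url = "http://example.invalid/report?pageIndex=1"
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
}

session = requests.Session()

# Equivalent to requests.get(url, headers): the dict is bound to `params`.
wrong = session.prepare_request(requests.Request("GET", url, params=headers))

# Equivalent to requests.get(url, headers=headers): the dict is sent as headers.
right = session.prepare_request(requests.Request("GET", url, headers=headers))

print("positional:", wrong.url)
print("positional User-Agent:", wrong.headers["User-Agent"])
print("keyword:   ", right.url)
print("keyword User-Agent:   ", right.headers["User-Agent"])
```

In the positional case the header names appear in the URL and the User-Agent falls back to the library default; in the keyword case the URL is unchanged and the custom User-Agent is used.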

The changes are as follows:

import requests

url = 'http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc'
headers = {
    "Host": "xxx.com",
    "Connection": "keep-alive",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://app.jg.eastmoney.com/html_Report/index.html",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-us,en",
    "Cookie": "xxx"
}
# Pass the dict by keyword so it is sent as HTTP headers, not as query parameters.
r = requests.get(url, headers=headers)

Then I checked the request in Fiddler again.

This time the headers sent by the Python program are correct. The request details are as follows:

GET http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36
Accept-Encoding: gzip,deflate
Accept: application/json, text/javascript, */*; q=0.01
Connection: keep-alive
Host: xxx.com
X-Requested-With: XMLHttpRequest
Referer: http://xxx
Accept-Language: en-us,en
Cookie: xxx

HTTP/1.1 200 OK
Server: nginx/1.2.2
Date: Sat, 21 Oct 2017 06:42:21 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 75
Connection: keep-alive
Cache-Control: private
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET

Then I sent the request again from the Python program. The request went through, but the returned result was still:

{"Id": "6202c187-2fad-46e8-b4c6-b72ac8de0142", "ReturnMsg": "loading failed! "}

This time the cause was the cookie, which expires after a short period. After I updated the cookie, the request succeeded.
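Because the server answers 200 OK even when it refuses to return data, the status code alone does not show whether the cookie is still valid. One option is to inspect the response body for the failure message. The sketch below is only an illustration: the helper name `load_failed` is hypothetical, and treating a ReturnMsg that contains "loading failed" as the failure marker is an assumption based on the single response shown in this article.

```python
import json


def load_failed(body: str) -> bool:
    """Return True if the JSON body carries the failure message seen above."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all, so this check cannot say anything about it.
        return False
    if not isinstance(payload, dict):
        return False
    message = payload.get("ReturnMsg", "")
    # Assumption: the marker text is the one from the captured response.
    return isinstance(message, str) and "loading failed" in message.lower()


# Response body copied from the article.
sample = '{"Id": "6202c187-2fad-46e8-b4c6-b72ac8de0142", "ReturnMsg": "loading failed! "}'

if load_failed(sample):
    print("Server refused the request; check the headers and refresh the cookie.")
```

A crawler could call such a check on `r.text` after each request and stop to refresh the cookie instead of silently storing error payloads.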

Note that a crawler must set its request headers. This app may also block IP addresses as an anti-crawling measure, which would be more troublesome to deal with.

That is all for this article. I hope it helps with your learning.
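To avoid forgetting the keyword on individual calls, the headers can be attached once to a `requests.Session`, which then applies them to every request it sends. This is a sketch under assumptions: it requires the Requests library, the URL `http://example.invalid/report` is a placeholder, and the header set is reduced for brevity. The request is only prepared, not sent, so the result can be inspected offline.

```python
import requests

session = requests.Session()

# Default headers for every request made through this session.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64)",
    "X-Requested-With": "XMLHttpRequest",
    "Accept-Language": "en-us,en",
})

# Prepare (but do not send) a request to see what would go on the wire.
prepared = session.prepare_request(
    requests.Request("GET", "http://example.invalid/report")
)

for name in ("User-Agent", "X-Requested-With", "Accept-Language"):
    print(f"{name}: {prepared.headers[name]}")
```

With this setup, `session.get(url)` sends the browser-like headers without any per-call `headers=` argument, and a per-call `headers=` dict can still override individual fields.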

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
