Python crawler: solution for request headers that do not take effect

Source: Internet
Author: User

This problem came from calling a function incorrectly, but with a traffic analysis tool you can locate it quickly. (I recommend Fiddler, an excellent packet capture tool.)

The text is as follows:

While crawling data from an app (all of the app's data is requested over HTTP), I used Fiddler to analyze the request and then copied the request header information into a Python program to request the data.

The code is as follows:

import requests

url = 'http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc'
headers = {
    "Host": "xxx.com",
    "Connection": "keep-alive",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://app.jg.eastmoney.com/html_Report/index.html",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-us,en",
    "Cookie": "xxx"
}
r = requests.get(url, headers)
print(r.text)

The request succeeds at the HTTP level (status 200), but the response body is:

{"Id": "6202c187-2fad-46e8-b4c6-b72ac8de0142", "ReturnMsg": "loading failed! "}

The server rejected the request because it did not look like a normal client request.

I then checked Fiddler's record of the request sent by Python. (In the original screenshot, the two masked fields are the Host and the URL.)

The request details show that the custom headers were not applied and that the User-Agent is python-requests. Websites and apps with even basic anti-crawler measures block requests whose User-Agent identifies a Java or Python client.
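To see what the Requests library sends when no custom headers are applied, you can print its default header set without making any network call. This is a small sketch that assumes the `requests` package is installed; the exact version string and encodings depend on the installed release.

```python
import requests

# Headers that Requests attaches by default when none are supplied.
defaults = requests.utils.default_headers()

for name, value in defaults.items():
    print(f"{name}: {value}")
```

The `User-Agent` value starts with `python-requests/`, which is the kind of signature an anti-crawler check can match on.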

The Header details are as follows:

GET http://xxx?istartDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc
  &Host=xxx.com
  &Connection=keep-alive
  &Accept=application%2Fjson%2C+text%2Fjavascript%2C+%2A%2F%2A%3B+q%3D0.01
  &User-Agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F29.0.1547.59+Safari%2F537.36
  &X-Requested-With=XMLHttpRequest
  &Referer=xxx
  &Accept-Encoding=gzip%2Cdeflate
  &Accept-Language=en-us%2Cen
  &Cookie=xxx HTTP/1.1
Host: xxx.com
User-Agent: python-requests/2.18.4
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

HTTP/1.1 200 OK
Server: nginx/1.2.2
Date: Sat, 21 Oct 2017 06:07:21 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 75
Connection: keep-alive
Cache-Control: private
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET

I did not spot the problem at first. Only after reading the full request URL did I notice that the program had appended the request header information to the URL as query parameters.
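The encoded parameters in the captured URL look like what you get when a headers dictionary is percent-encoded as a query string. The sketch below reproduces that encoding with the standard library's `urlencode`, using a shortened copy of the headers; it illustrates the symptom and is not a trace of the Requests internals.

```python
from urllib.parse import urlencode

# A shortened copy of the headers dict from the article.
headers = {
    "Host": "xxx.com",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip,deflate",
}

# Percent-encode the dict as if it were a set of query parameters.
query = urlencode(headers)
print(query)
```

The `Accept` entry comes out as `Accept=application%2Fjson%2C+text%2Fjavascript%2C+%2A%2F%2A%3B+q%3D0.01`, the same form seen in the captured request line.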

In other words, I had passed the headers argument incorrectly when calling the request function.

I checked the Requests documentation again and found the faulty line. The second positional parameter of requests.get() is params, so requests.get(url, headers) sends the dictionary as query parameters. To send it as request headers, it must be passed by keyword as "headers=".
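The difference between the two calls can be checked without sending anything by preparing the requests through a session and inspecting the result. This is a minimal sketch assuming the `requests` package is installed; the URL and the shortened header values are placeholders, not the ones from the article.

```python
import requests

url = "http://example.com/api?pageIndex=1"  # placeholder URL (assumption)
headers = {"User-Agent": "Mozilla/5.0", "X-Requested-With": "XMLHttpRequest"}

session = requests.Session()

# Same effect as requests.get(url, headers): the dict is bound to `params`.
wrong = session.prepare_request(requests.Request("GET", url, params=headers))

# Same effect as requests.get(url, headers=headers).
right = session.prepare_request(requests.Request("GET", url, headers=headers))

print(wrong.url)                    # header names appear in the query string
print(wrong.headers["User-Agent"])  # still the library default
print(right.url)                    # URL is unchanged
print(right.headers["User-Agent"])  # the custom value
```

Nothing is sent over the network here, so this kind of check can run before pointing the crawler at the real service.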

The changes are as follows:

import requests

url = 'http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc'
headers = {
    "Host": "xxx.com",
    "Connection": "keep-alive",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://app.jg.eastmoney.com/html_Report/index.html",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-us,en",
    "Cookie": "xxx"
}
r = requests.get(url, headers=headers)

I then checked the request in Fiddler again.

The headers sent by the Python program are now correct. The request details are as follows:

GET http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36
Accept-Encoding: gzip,deflate
Accept: application/json, text/javascript, */*; q=0.01
Connection: keep-alive
Host: xxx.com
X-Requested-With: XMLHttpRequest
Referer: http://xxx
Accept-Language: en-us,en
Cookie: xxx

HTTP/1.1 200 OK
Server: nginx/1.2.2
Date: Sat, 21 Oct 2017 06:42:21 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 75
Connection: keep-alive
Cache-Control: private
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET

I then sent the request again from the Python program. The request went through, but the returned result was still:

{"Id": "6202c187-2fad-46e8-b4c6-b72ac8de0142", "ReturnMsg": "loading failed! "}

This time the cause was the cookie, which expires after a short time. After I updated the cookie, the request returned the expected data.
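Because the server answers with HTTP 200 even when it rejects the request, the status code alone does not reveal an expired cookie. One way to catch this is to inspect the response body for the failure message. The sketch below is illustrative only: the helper name `has_failure_message` is hypothetical, and it assumes that the "loading failed" text in `ReturnMsg` is a stable indicator, which the article does not confirm.

```python
import json


def has_failure_message(body: str) -> bool:
    """Return True if the body is a JSON object whose ReturnMsg says loading failed."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # not JSON, so this check cannot tell
    if not isinstance(payload, dict):
        return False
    message = str(payload.get("ReturnMsg", ""))
    return "loading failed" in message.lower()


# The response body shown in the article.
sample = '{"Id": "6202c187-2fad-46e8-b4c6-b72ac8de0142", "ReturnMsg": "loading failed! "}'
print(has_failure_message(sample))
```

A crawler could call such a check on `r.text` and refresh the cookie when it returns True.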

Note that a crawler program should always set its request headers. It would be more troublesome if this app also blocked IP addresses as an anti-crawling measure.

That is all for this article. I hope it is helpful for your learning.
