Background
I maintain a proxy pool project on GitHub whose proxies come from crawling a number of free proxy listing sites. This morning a user told me that one of the proxy-fetching interfaces was unavailable and returned status 521. In the spirit of helping to solve the problem, I ran through the code and found that this was indeed the case.
Comparing the Fiddler captures made it fairly clear that the site uses JavaScript to generate an encrypted cookie, and that a request made without this cookie gets a 521.
Finding the problem
Open Fiddler, then open the target site (http://www.kuaidaili.com/proxylist/2/) in your browser. You can see that the browser loads the page twice: the first request returns 521 and the second returns the data normally. Readers who have never built a website, or who are new to crawlers, may find this odd. Why does the browser get the data normally when the code does not?
A closer look at the two responses shows:
1. The second request carries one more cookie than the first: _ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971
2. The first response contains some complicated, hard-to-read JS code, while the second response is the correct content.
This is in fact a common anti-crawler technique. The process is roughly as follows: on the first request, the server returns dynamically obfuscated JS with status code 521, and the job of this JS is to add a new entry to the cookie for the server to verify. The browser then requests the page again with the new cookie, and the server returns the data once it has validated the cookie. The code never executes the JS, so it never has the cookie, which is why it cannot get the data.
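The two-step flow described above can be sketched in Python as follows. This is only an illustration of the control flow, not the site's actual protocol: fetch_with_challenge, fake_fetch, the "token" value and the example URL are all hypothetical names invented for the sketch, and the step that really solves the JS challenge is replaced by a stub.

```python
def fetch_with_challenge(fetch, solve_challenge, url):
    """Fetch url, retrying once with the cookie produced by the JS challenge.

    Both callables are hypothetical and supplied by the caller:
      fetch(url, cookies) -> (status_code, body)
      solve_challenge(body) -> dict of cookies
    """
    status, body = fetch(url, None)
    if status != 521:
        return body  # no challenge, the data came back directly
    cookies = solve_challenge(body)  # run or decode the JS to obtain the cookie
    status, body = fetch(url, cookies)
    if status == 521:
        raise RuntimeError("challenge cookie was rejected")
    return body


def fake_fetch(url, cookies):
    """Stand-in for the server: 521 plus JS unless the expected cookie is sent."""
    if cookies and cookies.get("_ydclearance") == "token":
        return 200, "<html>proxy list</html>"
    return 521, "<script>/* obfuscated JS */</script>"


result = fetch_with_challenge(
    fake_fetch,
    lambda body: {"_ydclearance": "token"},  # stub solver, assumption only
    "http://example.invalid/",
)
print(result)
```

A plain HTTP client stops after the first call to fetch, which is exactly the situation the failing crawler code is in.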
Solving the problem
function LQ (VA) {var qo, MO = "", no = "", oo = [0x8c, 0xCD, 0x4c, 0xf9, 0xd7, 0x4d, 0x25, 0xba, 0x3c, 0x16, 0x96, 0x4 4, 0x8d, 0x0b, 0x90, 0x1e, 0xa3, 0x39, 0xc9, 0x86, 0x23, 0x61, 0x2f, 0xc8, 0x30, 0xDD, 0x57, 0xec, 0x92, 0x84, 0xc4, 0x6a, 0xeb, 0x99, 0x37, 0xeb, 0x25, 0x0e, 0xBB, 0xb0, 0x95, 0x76, 0x45, 0xde, 0x80, 0x59, 0xf6, 0x9c, 0x58, 0x39, 0x12, 0xc7, 0 x9c, 0x8d, 0x18, 0xe0, 0xc5, 0x77, 0x50, 0x39, 0x01, 0xed, 0x93, 0x39, 0x02, 0x7e, 0x72, 0x4f, 0x24, 0x01, 0xe9, 0x66, 0x7 5, 0x4e, 0x2b, 0xd8, 0x6e, 0xe2, 0xfa, 0xc7, 0xa4, 0x85, 0x4e, 0xc2, 0xa5, 0x96, 0x6b, 0x58, 0x39, 0xd2, 0x7f, 0x44, 0xe5, 0x7b, 0x48, 0x2d, 0xf6, 0XDF, 0XBC, 0x31, 0x1e, 0xf6, 0XBF, 0x84, 0x6d, 0x5e, 0x33, 0x0c, 0x97, 0x5c, 0x39, 0x26, 0xf2, 0 x9b, 0x77, 0x0d, 0xd6, 0xc0, 0x46, 0x38, 0x5f, 0xf4, 0xe2, 0x9f, 0xf1, 0x7b, 0xe8, 0xBE, 0x37, 0XDF, 0xd0, 0XBD, 0XB9, 0x3 6, 0x2c, 0xd1, 0xc3, 0x40, 0xe7, 0XCC, 0xa9, 0x52, 0x3b, 0x20, 0x40, 0x09, 0xe1, 0xd2, 0xa3, 0x80, 0x25, 0x0a, 0xb2, 0xd8, 0xce, 0x21, 0x69, 0x3e, 0xe6, 0x80, 0xfd, 0x73, 0xAB, 0x51, 0xde, 0x60, 0x15, 0x95, 0x07, 0x94, 0x6a, 0x18, 0x9d, 0x37, 0x31, 0xde, 0x64, 0XDD, 0 x63, 0xe3, 0x57, 0x05, 0x82, 0xFF, 0XCC, 0x75, 0x79, 0x63, 0x09, 0xe2, 0x6c, 0x21, 0x5c, 0xe0, 0x7d, 0x4a, 0xf2, 0xd8, 0x9 C, 0x22, 0xa3, 0x3d, 0xba, 0xa0, 0xaf, 0x30, 0xc1, 0x47, 0xf4, 0xca, 0xEE, 0x64, 0xf9, 0x7b, 0x55, 0xd5, 0xd2, 0x4c, 0xc9, 0x7f, 0x25, 0xFE, 0x48, 0xCD, 0x4b, 0XCC, 0x81, 0x1b, 0x05, 0x82, 0x38, 0x0e, 0x83, 0x19, 0xe3, 0x65, 0x3f, 0XBF, 0x16, 0 x88, 0x93, 0XDD, 0x3b]; Qo = "qo=241; Do{oo[qo]= (-oo[qo]) &0xff; oo[qo]= (((oo[qo]>>3) | ( (oo[qo]<<5) &0xff)) -70) &0xff;} while (--qo>=2); "; eval (QO); Qo = 240; do {Oo[qo] = (Oo[qo]-oo[qo-1]) & 0xFF; } while (--qo >= 3); Qo = 1; for (;;) {if (Qo >) break; OO[QO] = (((((((((((((((OO[QO) + 2) & 0xFF + +) & 0xff) << 1) & 0xff) | (((((((Oo[qo) (((((2) & 0xff) + +) & 0xff) >> 7); qo++; } PO = ""; for (Qo = 1; qo < oo.length-1; qo++) if (qo% 6) po + = String.fromCharCode (Oo[qo] ^ VA); Eval 
("Qo=eval;qo (PO);");}
Faced with JS code like this, I can only admit that my JS skills are not good enough to deobfuscate it by hand...
Readers with front-end experience will immediately think of a way around this: use the browser's JS debugger. Create a new HTML file, paste in the HTML from the first response, save it, open it in the browser, and set a breakpoint just before the eval. You will see output like this:
You can see that the variable po is document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; domain=.kuaidaili.com; path=/'; window.document.location=document.URL, and that it is followed by eval("qo=eval;qo(po);"). eval in JS works much like eval in Python. The second statement assigns the eval function to qo and then evaluates the string po. The first half of po adds the cookie to the browser, and the second half, window.document.location=document.URL, refreshes the current page.
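To make the structure of po concrete, here is a small sketch that splits the string shown above into its two statements and reads the cookie fields out of the first one. The regular expression and the variable names are my own choices for illustration; they assume po always has exactly this two-statement shape.

```python
import re

# The value of po observed in the debugger, copied from above.
po = ("document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971; "
      "expires=Thu, 23-Mar-17 07:42:51 GMT; domain=.kuaidaili.com; path=/'; "
      "window.document.location=document.URL")

# First statement: the cookie assignment. Second statement: the page reload.
match = re.match(r"document\.cookie='([^']*)';\s*(.*)", po)
cookie_string, reload_statement = match.group(1), match.group(2)

# The first "name=value" pair is the cookie itself; the rest are attributes.
parts = cookie_string.split(';')
name, value = parts[0].split('=', 1)
attributes = dict(part.strip().split('=', 1) for part in parts[1:])

print(name, value)
print(attributes)
print(reload_statement)
```

Only the name and value need to be sent back to the server; the expires, domain and path attributes are instructions to the browser.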
This confirms what I said above: on the first request, made without the cookie, the server returns a piece of JS that generates the cookie and then refreshes the page automatically. The browser executes this code and requests the data again with the new cookie. The Python code only gets as far as the first step, receiving this JS.
So how can Python execute this JS as well? The answer is PyV8. V8 is the JavaScript engine embedded in Chromium and has a reputation for being the fastest around. PyV8 wraps V8's external API in a Python layer, so Python can work with JavaScript directly. For installing PyV8, a web search will turn up instructions.
Code
With the analysis done, here is the code.
First, request the page normally; the response is HTML containing the obfuscated JS function:
import re
import PyV8
import requests

target_url = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    # Browser-like headers, copied from a Chrome request
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


# First access: get the dynamically obfuscated JS
first_html = getHtml(target_url)
Because the response is HTML rather than a bare JS function, we need regular expressions to extract both the JS function and the argument it is called with.
# Extract the obfuscated JS function
js_func = ''.join(re.findall(r'(function .*?)</script>', first_html))
print('Get JS func:\n' + js_func)

# Extract the argument the JS function is called with
js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', first_html))
print('Get JS arg:\n' + js_arg)
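To see what the two patterns capture, here is a sketch that runs them against a tiny hand-written page. The sample_html string is an assumption about the general shape of the 521 response (a setTimeout call followed by the function definition inside one script tag); the argument 161, the delay 200 and the shortened function body are made up for the example.

```python
import re

# Hypothetical, heavily shortened stand-in for the first (521) response.
sample_html = (
    '<html><body><script language="javascript"> '
    'window.onload=setTimeout("lq(161)", 200); '
    'function lq(VA) {var po = "demo"; eval("qo=eval;qo(po);");}'
    '</script></body></html>'
)

# Everything from "function " up to the closing script tag.
js_func = ''.join(re.findall(r'(function .*?)</script>', sample_html))

# The digits inside the call that setTimeout schedules, e.g. lq(161).
js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', sample_html))

print(js_func)
print(js_arg)
```

Note that \D+ matches the function name (non-digit characters) and the capture group keeps only the numeric argument, which comes back as a string.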
Note also that the JS function does not return the cookie; it sets the cookie directly in the browser. So we need to replace eval("qo=eval;qo(po);") with return po. The function then returns the contents of po.
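A minimal sketch of that replacement, using a plain string substitution. The js_func value here is a shortened, made-up stand-in for the extracted function, and the sketch assumes the eval call appears in the page exactly as written, with no extra whitespace.

```python
# Hypothetical, shortened stand-in for the extracted JS function.
js_func = 'function lq(VA) {var qo, po = "demo"; eval("qo=eval;qo(po);");}'

# Make the function hand back po instead of evaluating it in a browser.
patched_func = js_func.replace('eval("qo=eval;qo(po);")', 'return po')

print(patched_func)
```

If the site ever changes the spacing or the variable names inside the eval call, this exact-match replacement will silently do nothing, so it is worth checking that 'return po' is present in the result before executing it.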
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
   File Name:     demo_1.py.py
   Description :  Python crawler: crack a JS encrypted cookie,
                  using the Kuaidaili site as an example:
                  http://www.kuaidaili.com/proxylist/1/
   Document:
   Author :       JHao
   date:          2017/3/23
-------------------------------------------------
   Change Activity:
                   2017/3/23: Crack JS encrypted cookie
-------------------------------------------------
"""
__author__ = 'JHao'

import re
import PyV8
import requests

target_url = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    # Browser-like headers, copied from a Chrome request
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


def executeJS(js_func_string, arg):
    # Evaluate the JS function inside a V8 context and call it with arg
    ctxt = PyV8.JSContext()
    ctxt.enter()
    func = ctxt.eval("({js})".format(js=js_func_string))
    return func(arg)


def parseCookie(string):
    # Keep only the first "name=value" pair of the document.cookie assignment
    string = string.replace("document.cookie='", "")
    clearance = string.split(';')[0]
    return {clearance.split('=')[0]: clearance.split('=')[1]}


# First access: get the dynamically obfuscated JS
first_html = getHtml(target_url)
# first_html = """
#