Preface
I maintain a proxy pool project on GitHub whose proxies are scraped from several websites that publish free proxies. This morning a fellow developer told me that one of the proxy-scraping interfaces had stopped working and was returning status 521. I reran the code to help track down the problem, and that was indeed the case.
Comparing packet captures in Fiddler made it fairly clear what was going on: the site now expects an encrypted Cookie generated by JavaScript, and the plain request gets a 521 because it lacks that Cookie.
Problems found
Start Fiddler and open the target site (http://www.kuaidaili.com/proxylist/2/) in a browser. You will see that the browser loads the page twice: the first response is a 521, and the second returns the data normally. Readers without web or crawler experience may wonder why the browser gets the data while the code does not. Comparing the two requests shows two differences:
1. The second request carries one more Cookie entry than the first: _ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971
2. The first response body is complex, unreadable JavaScript, while the second response body is the correct page.
This is in fact a common anti-crawler technique. The general flow is as follows: on the first request, the server returns a dynamically obfuscated, encrypted piece of JS together with status code 521. The job of that JS is to add a new entry to the Cookie for the server to verify. The browser then repeats the request with the new Cookie, the server validates it, and the data is returned. Code that does not execute the JS never obtains the Cookie, which is why it gets no data.
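The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fetch_with_clearance, fake_fetch and fake_solve are hypothetical names that do not come from the project, and the server is simulated so the example runs offline.

```python
from typing import Callable, Dict, Optional, Tuple

# A response is modelled as (status_code, body); both callables are hypothetical.
Response = Tuple[int, str]
Fetch = Callable[[Optional[Dict[str, str]]], Response]
Solve = Callable[[str], Dict[str, str]]


def fetch_with_clearance(fetch: Fetch, solve: Solve) -> Response:
    """Request once; on 521, derive the cookie from the body and retry once."""
    status, body = fetch(None)
    if status != 521:
        return status, body      # no challenge, use the response as-is
    cookies = solve(body)        # e.g. run the returned JS to obtain the cookie
    return fetch(cookies)        # the second request carries the new cookie


# Stand-in server and solver so the flow can be exercised without a network.
def fake_fetch(cookies: Optional[Dict[str, str]]) -> Response:
    if cookies and cookies.get("_ydclearance") == "token":
        return 200, "<html>proxy list</html>"
    return 521, "document.cookie='_ydclearance=token'"


def fake_solve(body: str) -> Dict[str, str]:
    name, value = body.split("'")[1].split("=", 1)
    return {name: value}


result = fetch_with_clearance(fake_fetch, fake_solve)
print(result)
```

Passing the fetch and solve steps in as callables keeps the retry logic separate from how the cookie is actually computed, which is the hard part discussed in the rest of this article.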
Solve the problem
My first thought on meeting this problem was simple: since the site uses a Cookie generated by JS, I could translate the JS function into Python and run it there. That turned out to be naive, because obfuscated and encrypted JavaScript is now the norm. The original JS looks like this:
function lq(VA) {
    var qo, mo = "", no = "", oo = [0x8c, 0xcd, 0x4c, 0xf9, 0xd7, 0x4d, 0x25, 0xba, 0x3c, 0x16, 0x96, 0x44, 0x8d, 0x0b, 0x90, 0x1e, 0xa3, 0x39, 0xc9, 0x86, 0x23, 0x61, 0x2f, 0xc8, 0x30, 0xdd, 0x57, 0xec, 0x92, 0x84, 0xc4, 0x6a, 0xeb, 0x99, 0x37, 0xeb, 0x25, 0x0e, 0xbb, 0xb0, 0x95, 0x76, 0x45, 0xde, 0x80, 0x59, 0xf6, 0x9c, 0x58, 0x39, 0x12, 0xc7, 0x9c, 0x8d, 0x18, 0xe0, 0xc5, 0x77, 0x50, 0x39, 0x01, 0xed, 0x93, 0x39, 0x02, 0x7e, 0x72, 0x4f, 0x24, 0x01, 0xe9, 0x66, 0x75, 0x4e, 0x2b, 0xd8, 0x6e, 0xe2, 0xfa, 0xc7, 0xa4, 0x85, 0x4e, 0xc2, 0xa5, 0x96, 0x6b, 0x58, 0x39, 0xd2, 0x7f, 0x44, 0xe5, 0x7b, 0x48, 0x2d, 0xf6, 0xdf, 0xbc, 0x31, 0x1e, 0xf6, 0xbf, 0x84, 0x6d, 0x5e, 0x33, 0x0c, 0x97, 0x5c, 0x39, 0x26, 0xf2, 0x9b, 0x77, 0x0d, 0xd6, 0xc0, 0x46, 0x38, 0x5f, 0xf4, 0xe2, 0x9f, 0xf1, 0x7b, 0xe8, 0xbe, 0x37, 0xdf, 0xd0, 0xbd, 0xb9, 0x36, 0x2c, 0xd1, 0xc3, 0x40, 0xe7, 0xcc, 0xa9, 0x52, 0x3b, 0x20, 0x40, 0x09, 0xe1, 0xd2, 0xa3, 0x80, 0x25, 0x0a, 0xb2, 0xd8, 0xce, 0x21, 0x69, 0x3e, 0xe6, 0x80, 0xfd, 0x73, 0xab, 0x51, 0xde, 0x60, 0x15, 0x95, 0x07, 0x94, 0x6a, 0x18, 0x9d, 0x37, 0x31, 0xde, 0x64, 0xdd, 0x63, 0xe3, 0x57, 0x05, 0x82, 0xff, 0xcc, 0x75, 0x79, 0x63, 0x09, 0xe2, 0x6c, 0x21, 0x5c, 0xe0, 0x7d, 0x4a, 0xf2, 0xd8, 0x9c, 0x22, 0xa3, 0x3d, 0xba, 0xa0, 0xaf, 0x30, 0xc1, 0x47, 0xf4, 0xca, 0xee, 0x64, 0xf9, 0x7b, 0x55, 0xd5, 0xd2, 0x4c, 0xc9, 0x7f, 0x25, 0xfe, 0x48, 0xcd, 0x4b, 0xcc, 0x81, 0x1b, 0x05, 0x82, 0x38, 0x0e, 0x83, 0x19, 0xe3, 0x65, 0x3f, 0xbf, 0x16, 0x88, 0x93, 0xdd, 0x3b];
    qo = "qo=241; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>3)|((oo[qo]<<5)&0xff))-70)&0xff;} while(--qo>=2);";
    eval(qo);
    qo = 240;
    do {
        oo[qo] = (oo[qo] - oo[qo - 1]) & 0xff;
    } while (--qo >= 3);
    qo = 1;
    for (; ;) {
        if (qo > 240) break;
        oo[qo] = ((((((oo[qo] + 2) & 0xff) + 76) & 0xff) << 1) & 0xff) | (((((oo[qo] + 2) & 0xff) + 76) & 0xff) >> 7);
        qo++;
    }
    po = "";
    for (qo = 1; qo < oo.length - 1; qo++)
        if (qo % 6) po += String.fromCharCode(oo[qo] ^ VA);
    eval("qo=eval;qo(po);");
}
Faced with JS like this, all I can say is: forgive my limited JS skills, I cannot reverse it...
Readers with front-end experience will quickly think of a simpler approach: the browser's built-in JS debugger. Create an html file, paste in the raw html returned by the first request, save it, open it in a browser, and set a breakpoint just before the final eval. The debugger then shows the following:
The variable po is
document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; domain=.kuaidaili.com; path=/'; window.document.location=document.URL
and the statement that follows it is
eval("qo=eval;qo(po);")
eval in JS behaves much like eval in Python. This statement assigns the eval function to qo and then evaluates the string po. The first half of po adds the Cookie to the browser, and the second half, window.document.location=document.URL, reloads the current page.
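As a small illustration, the two halves of po can be separated with ordinary string handling. The sketch below uses the value shown above; the variable names are illustrative, and a real response may format the cookie string differently.

```python
import re

# The value of `po` observed in the debugger, copied from the text above.
po = (
    "document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-"
    "2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; "
    "domain=.kuaidaili.com; path=/'; window.document.location=document.URL"
)

# Part 1: the quoted string assigned to document.cookie.
cookie_string = re.search(r"document\.cookie='([^']*)'", po).group(1)

# The first "name=value" pair is the cookie itself; the rest are attributes.
pair, *attributes = [part.strip() for part in cookie_string.split(";")]
name, value = pair.split("=", 1)
cookies = {name: value}

# Part 2: whatever follows the cookie assignment is the reload statement.
reload_statement = po.split("';", 1)[1].strip()

print(cookies)
print(attributes)
print(reload_statement)
```

Only the name and value are needed for the follow-up request; expires, domain and path are instructions to the browser's cookie store.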
This confirms the explanation above: the first request carries no Cookie, so the server returns a piece of JS that generates the Cookie and then reloads the page automatically. A browser executes that code and requests the data again with the new Cookie. Python, on the other hand, only ever receives the code from the first step.
So how can Python execute this JavaScript? One answer is PyV8. V8 is the JavaScript engine embedded in Chromium and is known for its speed. PyV8 is a Python wrapper around V8's external API, which lets Python code run JavaScript directly. Installing PyV8 is left to the reader.
Code
With the analysis done, here is the code itself.
First, request the page normally. The response is the html containing the encrypted JS function:
import re
import PyV8
import requests

TARGET_URL = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    # Browser-like request headers
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


# First request: fetch the dynamically generated, obfuscated JS
first_html = getHtml(TARGET_URL)
Because the response is a full html page rather than just the JS function, we use regular expressions to extract the JS function and the argument it is called with.
# Extract the JS encryption function
js_func = ''.join(re.findall(r'(function .*?)</script>', first_html))
print('get js func:\n' + js_func)

# Extract the argument the JS function is called with
js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', first_html))
print('get js arg:\n' + js_arg)
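To see what the two patterns capture, here is a minimal sketch run against a made-up response. The sample_html string is a hypothetical, heavily shortened stand-in for the real 521 body, and the argument 241 is a placeholder chosen for illustration; only the overall shape (a setTimeout call followed by the function, all inside one script tag) is assumed.

```python
import re

# Hypothetical, heavily shortened stand-in for the first (521) response body.
sample_html = (
    '<html><body><script language="javascript"> '
    'window.onload=setTimeout("lq(241)", 200); '
    'function lq(VA) { var po = "demo"; eval("qo=eval;qo(po);"); }'
    '</script></body></html>'
)

# Everything from the "function" keyword up to the closing script tag.
js_func = "".join(re.findall(r"(function .*?)</script>", sample_html))

# The digits passed to the function inside the setTimeout("name(digits)") call.
js_arg = "".join(re.findall(r'setTimeout\("\D+\((\d+)\)"', sample_html))

print(js_func)
print(js_arg)
```

Note that without re.DOTALL the dot does not match newlines, so the first pattern only works when the function and the closing script tag sit on the same line.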
Note that the JS function does not return the cookie; it sets the cookie directly in the browser. We therefore need to replace eval("qo=eval;qo(po);") with return po, so that the contents of po are returned to the caller instead.
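In Python this substitution is a single string replacement. The sketch below applies it to a hypothetical, shortened function body; the only assumption is that the extracted JS ends with exactly the eval call shown above.

```python
# Hypothetical, shortened function ending in the same eval call as the original.
js_func = 'function lq(VA) { var po = "demo" + VA; eval("qo=eval;qo(po);"); }'

# Swap the side-effecting eval for a return statement.
EVAL_CALL = 'eval("qo=eval;qo(po);")'
assert EVAL_CALL in js_func, "the obfuscated tail has changed; inspect the JS again"
patched_func = js_func.replace(EVAL_CALL, "return po")

print(patched_func)
```

The membership check is there because the site regenerates the obfuscated code, so the variable names in the tail may change without notice.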
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
   File Name:     demo_1.py.py
   Description:   Python crawler: cracking a JS-encrypted Cookie,
                  using the kuaidaili proxy website as an example:
                  http://www.kuaidaili.com/proxylist/1/
   Document:
   Author:        JHao
   date:          2017/3/23
-------------------------------------------------
   Change Activity: crack JS-encrypted Cookie
-------------------------------------------------
"""
__author__ = 'JHao'

import re
import PyV8
import requests

TARGET_URL = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    # Browser-like request headers
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


def executeJS(js_func_string, arg):
    # Evaluate the JS function inside a V8 context, then call it with arg
    ctxt = PyV8.JSContext()
    ctxt.enter()
    func = ctxt.eval("({js})".format(js=js_func_string))
    return func(arg)


def parseCookie(string):
    # Keep only the leading "name=value" pair of the document.cookie string
    string = string.replace("document.cookie='", "")
    clearance = string.split(';')[0]
    return {clearance.split('=')[0]: clearance.split('=')[1]}


# First request: fetch the dynamically generated, obfuscated JS
first_html = getHtml(TARGET_URL)

# first_html = """#