How to use Python crawlers to crack JS-encrypted cookies

Source: Internet
Author: User

Preface

I maintain a proxy pool project on GitHub. Its proxies are scraped from websites that publish free proxies. This morning a user told me that one of the proxy-scraping interfaces had stopped working and was returning status 521. I ran the code myself to help track the problem down, and that was indeed the case.

Comparing packet captures in Fiddler made it fairly clear that the original request gets a 521 because it lacks an encrypted Cookie that the site generates with JavaScript.

Problems found

Open Fiddler, then open the target site (http://www.kuaidaili.com/proxylist/2/) in a browser. You will see that the browser loads the page twice: the first request returns 521, and the second returns the data normally. Readers without much website or crawler experience may wonder why the browser gets the data while the code does not. Comparing the two requests closely shows two differences:

1. The second request carries one more Cookie entry than the first: _ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971

2. The first response is complex, unreadable JavaScript code, while the second response is the correct content.

This is in fact a common anti-crawler technique. The general process is as follows: on the first request, the server returns a dynamically obfuscated and encrypted piece of JS together with status code 521. The job of this JS is to add a new entry to the Cookie for the server to verify. The browser then sends the request again with the new Cookie, and the server returns the data once it has verified the Cookie. (This is why the code gets no data: it never runs the JS, so it never has the Cookie.)
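
The request flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fetch` and `solve_challenge` are hypothetical callables supplied by the caller (they are not part of any library), and the stand-in server below simply imitates the 521-then-200 behaviour seen in Fiddler.

```python
def fetch_with_clearance(url, fetch, solve_challenge, max_retries=1):
    """Fetch url, answering a 521 JavaScript challenge if one comes back.

    fetch(url, cookies) -> (status_code, body) and
    solve_challenge(body) -> dict of cookies are hypothetical callables.
    """
    cookies = {}
    status, body = fetch(url, cookies)
    retries = 0
    while status == 521 and retries < max_retries:
        # The 521 body is the obfuscated script; running it yields the cookie.
        cookies.update(solve_challenge(body))
        status, body = fetch(url, cookies)
        retries += 1
    return status, body, cookies


# Stand-in server: rejects any request that lacks the clearance cookie.
def fake_fetch(url, cookies):
    if cookies.get("_ydclearance") == "token":
        return 200, "<html>proxy list</html>"
    return 521, "<script>/* obfuscated JS */</script>"


# Stand-in solver: a real one would execute the returned JS.
def fake_solver(body):
    return {"_ydclearance": "token"}


status, body, cookies = fetch_with_clearance(
    "http://example.invalid/", fake_fetch, fake_solver
)
print(status, cookies)
```

Capping the retries keeps the loop from spinning forever if the server keeps answering 521, for example because the cookie was computed incorrectly.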

Solve the problem

My first thought on hitting this problem was: since the Cookie is generated by JS, I could translate the JS function into Python and run that instead. This turned out to be naive, because JavaScript nowadays is routinely obfuscated and encrypted. The original JS looks like this:

function lq(VA) {
    var qo, mo = "", no = "", oo = [0x8c, 0xcd, 0x4c, 0xf9, 0xd7, 0x4d, 0x25, 0xba, 0x3c, 0x16, 0x96, 0x44, 0x8d, 0x0b, 0x90, 0x1e, 0xa3, 0x39, 0xc9, 0x86, 0x23, 0x61, 0x2f, 0xc8, 0x30, 0xdd, 0x57, 0xec, 0x92, 0x84, 0xc4, 0x6a, 0xeb, 0x99, 0x37, 0xeb, 0x25, 0x0e, 0xbb, 0xb0, 0x95, 0x76, 0x45, 0xde, 0x80, 0x59, 0xf6, 0x9c, 0x58, 0x39, 0x12, 0xc7, 0x9c, 0x8d, 0x18, 0xe0, 0xc5, 0x77, 0x50, 0x39, 0x01, 0xed, 0x93, 0x39, 0x02, 0x7e, 0x72, 0x4f, 0x24, 0x01, 0xe9, 0x66, 0x75, 0x4e, 0x2b, 0xd8, 0x6e, 0xe2, 0xfa, 0xc7, 0xa4, 0x85, 0x4e, 0xc2, 0xa5, 0x96, 0x6b, 0x58, 0x39, 0xd2, 0x7f, 0x44, 0xe5, 0x7b, 0x48, 0x2d, 0xf6, 0xdf, 0xbc, 0x31, 0x1e, 0xf6, 0xbf, 0x84, 0x6d, 0x5e, 0x33, 0x0c, 0x97, 0x5c, 0x39, 0x26, 0xf2, 0x9b, 0x77, 0x0d, 0xd6, 0xc0, 0x46, 0x38, 0x5f, 0xf4, 0xe2, 0x9f, 0xf1, 0x7b, 0xe8, 0xbe, 0x37, 0xdf, 0xd0, 0xbd, 0xb9, 0x36, 0x2c, 0xd1, 0xc3, 0x40, 0xe7, 0xcc, 0xa9, 0x52, 0x3b, 0x20, 0x40, 0x09, 0xe1, 0xd2, 0xa3, 0x80, 0x25, 0x0a, 0xb2, 0xd8, 0xce, 0x21, 0x69, 0x3e, 0xe6, 0x80, 0xfd, 0x73, 0xab, 0x51, 0xde, 0x60, 0x15, 0x95, 0x07, 0x94, 0x6a, 0x18, 0x9d, 0x37, 0x31, 0xde, 0x64, 0xdd, 0x63, 0xe3, 0x57, 0x05, 0x82, 0xff, 0xcc, 0x75, 0x79, 0x63, 0x09, 0xe2, 0x6c, 0x21, 0x5c, 0xe0, 0x7d, 0x4a, 0xf2, 0xd8, 0x9c, 0x22, 0xa3, 0x3d, 0xba, 0xa0, 0xaf, 0x30, 0xc1, 0x47, 0xf4, 0xca, 0xee, 0x64, 0xf9, 0x7b, 0x55, 0xd5, 0xd2, 0x4c, 0xc9, 0x7f, 0x25, 0xfe, 0x48, 0xcd, 0x4b, 0xcc, 0x81, 0x1b, 0x05, 0x82, 0x38, 0x0e, 0x83, 0x19, 0xe3, 0x65, 0x3f, 0xbf, 0x16, 0x88, 0x93, 0xdd, 0x3b];
    qo = "qo=241; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>3)|((oo[qo]<<5)&0xff))-70)&0xff;} while(--qo>=2);";
    eval(qo);
    qo = 240;
    do {
        oo[qo] = (oo[qo] - oo[qo - 1]) & 0xff;
    } while (--qo >= 3);
    qo = 1;
    for (; ;) {
        if (qo > 240) break;
        oo[qo] = ((((((oo[qo] + 2) & 0xff) + 76) & 0xff) << 1) & 0xff) | (((((oo[qo] + 2) & 0xff) + 76) & 0xff) >> 7);
        qo++;
    }
    po = "";
    for (qo = 1; qo < oo.length - 1; qo++) if (qo % 6) po += String.fromCharCode(oo[qo] ^ VA);
    eval("qo=eval;qo(po);");
}

Faced with JS like this, all I can say is: forgive me, my JS skills are not good enough to restore it...

Readers with front-end experience, however, will quickly think of a simpler approach: the browser's JS debugger. Create an html file, paste in the html text returned by the first request, save it, open it in a browser, and set a breakpoint just before the eval. The output looks like this:

The variable po is document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; domain=.kuaidaili.com; path=/'; window.document.location=document.URL, and it is followed by eval("qo=eval;qo(po);"). eval in JS is similar to eval in Python. The evaluated statement assigns the eval function to qo and then evaluates the string po. The first half of po adds the Cookie to the browser, and the second half, window.document.location=document.URL, refreshes the current page.

This confirms the explanation above: the first request carries no Cookie, so the server returns a piece of JS that generates the Cookie and then refreshes the page automatically. The browser executes this code and requests the data again with the new Cookie. Python, on its own, only gets as far as receiving the code in the first step.
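
To make the structure of the decoded string concrete, here is a small sketch that takes it apart in plain Python. The helper name `parse_po` is made up for this illustration; the input is the po value quoted above.

```python
def parse_po(po):
    """Split the decoded script into cookie, cookie attributes and the rest."""
    prefix = "document.cookie='"
    start = po.index(prefix) + len(prefix)
    end = po.index("'", start)
    cookie_text = po[start:end]
    # The first "name=value" pair is the cookie itself; the rest are attributes.
    parts = [p.strip() for p in cookie_text.split(";")]
    name, _, value = parts[0].partition("=")
    attributes = dict(p.partition("=")[::2] for p in parts[1:])
    # Whatever follows the closing quote is the page-refresh statement.
    rest = po[end + 1:].lstrip("; ")
    return {name: value}, attributes, rest


po = (
    "document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-"
    "2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; "
    "domain=.kuaidaili.com; path=/'; window.document.location=document.URL"
)

cookie, attributes, rest = parse_po(po)
print(cookie)      # the name/value pair the second request must carry
print(attributes)  # expires, domain and path
print(rest)        # the statement that reloads the page
```

Only the name/value pair needs to be sent back to the server. Incidentally, the trailing 1490254971 in the value is the Unix timestamp for 2017-03-23 07:42:51 UTC, the same instant as the expires attribute, which suggests the clearance token is time-limited.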

So how can we make Python execute this JavaScript code? The answer is PyV8. V8 is the JavaScript engine embedded in Chromium and is known for its speed. PyV8 is a Python wrapper around V8's external API, which lets Python run JavaScript directly. Installing PyV8 is left to the reader.

Code

With the analysis done, on to the code.

First, request the page normally. The response is the html containing the encrypted JS function:

import re
import PyV8
import requests

TARGET_URL = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    # Browser-like request headers
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


# First request: fetch the dynamically encrypted JS
first_html = getHtml(TARGET_URL)

Because the response is a full html page rather than just the JS function, we use regular expressions to extract the JS function and the argument it is called with.

# Extract the JS encryption function
js_func = ''.join(re.findall(r'(function .*?)</script>', first_html))
print 'get js func:\n', js_func

# Extract the argument used to call the JS function
js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', first_html))
print 'get js arg:\n', js_arg
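
As a quick check of what the two patterns capture, they can be run against a small stand-in page. The html below is not a captured response: its shape (a setTimeout call that invokes the function, followed by the function definition inside a script tag) is an assumption inferred from the regular expressions, and the function body is shortened.

```python
import re

# Stand-in for the first response; the layout is an assumption, not real data.
sample_html = (
    '<html><body><script language="javascript"> '
    'window.onload=setTimeout("lq(241)", 200); '
    'function lq(VA) { var po = "x"; eval("qo=eval;qo(po);"); }'
    '</script> </body></html>'
)

# Everything from "function " up to the closing script tag.
js_func = ''.join(re.findall(r'(function .*?)</script>', sample_html))

# The digits passed to the function inside the setTimeout call.
js_arg = ''.join(re.findall(r'setTimeout\("\D+\((\d+)\)"', sample_html))

print(js_func)
print(js_arg)
```

Note that `.` does not match newlines by default, so if the real script were spread over several lines the first pattern would need the `re.S` flag.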

Note that the JS function does not return the cookie; it sets the cookie directly in the browser. We therefore need to replace eval("qo=eval;qo(po);") with return po, so that the content of po is returned to the caller.
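
A minimal sketch of that rewrite follows. The exact-string replacement matches the variable names shown in this article. The regular-expression fallback is an assumption: it only helps if the site keeps the same `eval("X=eval;X(Y);")` shape while changing the identifiers, which the article does not confirm. The helper name `make_returning` is made up for this example.

```python
import re


def make_returning(js_func):
    """Rewrite the final eval call so the function returns the decoded string."""
    # Exact replacement for the identifiers shown in this article.
    literal = 'eval("qo=eval;qo(po);")'
    if literal in js_func:
        return js_func.replace(literal, 'return po')
    # Fallback (assumed shape): eval("X=eval;X(Y);")  ->  return Y
    pattern = r'eval\("(\w+)=eval;\1\((\w+)\);"\)'
    return re.sub(pattern, r'return \2', js_func)


original = 'function lq(VA) { po = "abc"; eval("qo=eval;qo(po);"); }'
print(make_returning(original))
```

The rewritten function can then be handed to the JS engine and called with the extracted argument, and its return value is the cookie-setting string.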

# -*- coding: utf-8 -*-
"""
-------------------------------------------------
   File Name:     demo_1.py.py
   Description:   Python crawler: cracking the JS-encrypted Cookie,
                  using the kuaidaili proxy site as an example:
                  http://www.kuaidaili.com/proxylist/1/
   Document:
   Author:        JHao
   date:          2017/3/23
-------------------------------------------------
   Change Activity: crack the JS-encrypted Cookie
-------------------------------------------------
"""
__author__ = 'JHao'

import re
import PyV8
import requests

TARGET_URL = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    # Browser-like request headers
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


def executeJS(js_func_string, arg):
    # Evaluate the JS function inside a V8 context and call it with arg
    ctxt = PyV8.JSContext()
    ctxt.enter()
    func = ctxt.eval("({js})".format(js=js_func_string))
    return func(arg)


def parseCookie(string):
    # Keep only the first name=value pair of the document.cookie assignment
    string = string.replace("document.cookie='", "")
    clearance = string.split(';')[0]
    return {clearance.split('=')[0]: clearance.split('=')[1]}


# First request: fetch the dynamically encrypted JS
first_html = getHtml(TARGET_URL)
# first_html = """#
