Python crawler in detail: the steps for cracking a JS-encrypted cookie

Source: Internet
Author: User

Preface

I maintain a proxy pool project on GitHub. Its proxies are crawled from sites that publish free proxies. This morning a user told me that one of the proxy-fetching interfaces had stopped working and was returning status 521. Wanting to help solve the problem, I ran the code and found that this was indeed the case.

Comparing against a Fiddler capture, I could basically confirm the cause: the site uses JavaScript to generate an encrypted cookie, and a request without that cookie gets a 521.

Finding the problem

Start Fiddler and open the target site (http://www.kuaidaili.com/proxylist/2/) in your browser. You will see that the browser loads the page twice: the first request returns 521 and the second returns the data normally. Readers who have never built a website or have little crawler experience may find this odd. Why does the browser get the data back normally while the code does not?


A closer look at the two responses shows:

1. The second request carries one more cookie than the first: _ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971

2. The first response contains some hard-to-read JS code, while the second contains the correct content.

This is in fact a common anti-crawler technique. The process is roughly as follows. On the first request, the server returns dynamically obfuscated, encrypted JS together with status code 521. The job of that JS is to add a new value to the cookie for the server to verify. The browser then repeats the request with the new cookie, and the server returns the data once it has validated the cookie. This is why the code cannot get the data: it never runs the JS, so it never sends the cookie.
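The two-request handshake described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fetch and solve_challenge are hypothetical callables introduced here (they are not part of the original project), standing in for the HTTP request and for whatever executes the returned JS.

```python
def fetch_with_clearance(url, fetch, solve_challenge):
    """Fetch `url`, answering one 521 cookie challenge if the server issues it.

    fetch(url, cookies) -> (status_code, body)      # hypothetical HTTP helper
    solve_challenge(body) -> dict of cookie values  # hypothetical JS runner
    """
    status, body = fetch(url, None)       # first request: no cookie yet
    if status != 521:
        return body                       # no challenge, the data came straight back
    cookies = solve_challenge(body)       # run the JS and collect the new cookie
    status, body = fetch(url, cookies)    # second request carries the cookie
    if status == 521:
        raise RuntimeError("server rejected the generated cookie")
    return body
```

With the requests library, fetch could be a thin wrapper that calls requests.get(url, cookies=cookies) and returns the response's status_code and text.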

Solving the problem

The JS code returned by the first request looks like this:

    function LQ(VA) {
        var qo, MO = "", no = "", oo = [
            0x8c, 0xcd, 0x4c, 0xf9, 0xd7, 0x4d, 0x25, 0xba, 0x3c, 0x16, 0x96, 0x44,
            0x8d, 0x0b, 0x90, 0x1e, 0xa3, 0x39, 0xc9, 0x86, 0x23, 0x61, 0x2f, 0xc8,
            0x30, 0xdd, 0x57, 0xec, 0x92, 0x84, 0xc4, 0x6a, 0xeb, 0x99, 0x37, 0xeb,
            0x25, 0x0e, 0xbb, 0xb0, 0x95, 0x76, 0x45, 0xde, 0x80, 0x59, 0xf6, 0x9c,
            0x58, 0x39, 0x12, 0xc7, 0x9c, 0x8d, 0x18, 0xe0, 0xc5, 0x77, 0x50, 0x39,
            0x01, 0xed, 0x93, 0x39, 0x02, 0x7e, 0x72, 0x4f, 0x24, 0x01, 0xe9, 0x66,
            0x75, 0x4e, 0x2b, 0xd8, 0x6e, 0xe2, 0xfa, 0xc7, 0xa4, 0x85, 0x4e, 0xc2,
            0xa5, 0x96, 0x6b, 0x58, 0x39, 0xd2, 0x7f, 0x44, 0xe5, 0x7b, 0x48, 0x2d,
            0xf6, 0xdf, 0xbc, 0x31, 0x1e, 0xf6, 0xbf, 0x84, 0x6d, 0x5e, 0x33, 0x0c,
            0x97, 0x5c, 0x39, 0x26, 0xf2, 0x9b, 0x77, 0x0d, 0xd6, 0xc0, 0x46, 0x38,
            0x5f, 0xf4, 0xe2, 0x9f, 0xf1, 0x7b, 0xe8, 0xbe, 0x37, 0xdf, 0xd0, 0xbd,
            0xb9, 0x36, 0x2c, 0xd1, 0xc3, 0x40, 0xe7, 0xcc, 0xa9, 0x52, 0x3b, 0x20,
            0x40, 0x09, 0xe1, 0xd2, 0xa3, 0x80, 0x25, 0x0a, 0xb2, 0xd8, 0xce, 0x21,
            0x69, 0x3e, 0xe6, 0x80, 0xfd, 0x73, 0xab, 0x51, 0xde, 0x60, 0x15, 0x95,
            0x07, 0x94, 0x6a, 0x18, 0x9d, 0x37, 0x31, 0xde, 0x64, 0xdd, 0x63, 0xe3,
            0x57, 0x05, 0x82, 0xff, 0xcc, 0x75, 0x79, 0x63, 0x09, 0xe2, 0x6c, 0x21,
            0x5c, 0xe0, 0x7d, 0x4a, 0xf2, 0xd8, 0x9c, 0x22, 0xa3, 0x3d, 0xba, 0xa0,
            0xaf, 0x30, 0xc1, 0x47, 0xf4, 0xca, 0xee, 0x64, 0xf9, 0x7b, 0x55, 0xd5,
            0xd2, 0x4c, 0xc9, 0x7f, 0x25, 0xfe, 0x48, 0xcd, 0x4b, 0xcc, 0x81, 0x1b,
            0x05, 0x82, 0x38, 0x0e, 0x83, 0x19, 0xe3, 0x65, 0x3f, 0xbf, 0x16, 0x88,
            0x93, 0xdd, 0x3b];
        qo = "qo=241; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>3)|((oo[qo]<<5)&0xff))-70)&0xff;} while(--qo>=2);";
        eval(qo);
        qo = 240;
        do {
            oo[qo] = (oo[qo] - oo[qo - 1]) & 0xff;
        } while (--qo >= 3);
        qo = 1;
        for (;;) {
            if (qo > …) break;
            oo[qo] = ((((((oo[qo] + 2) & 0xff) + …) & 0xff) << 1) & 0xff) | (((((oo[qo] + 2) & 0xff) + …) & 0xff) >> 7);
            qo++;
        }
        po = "";
        for (qo = 1; qo < oo.length - 1; qo++) if (qo % 6) po += String.fromCharCode(oo[qo] ^ VA);
        eval("qo=eval;qo(po);");
    }

Faced with JS like this, all I can say is: forgive my weak JS skills, I cannot de-obfuscate it by hand...

But anyone with front-end experience will immediately think of a way around this: use the browser's JS debugger. Create a new HTML file, paste in the HTML from the first response, save it, and open it in the browser. Then set a breakpoint just before the eval and inspect the variables.

You can see that the variable po is document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; domain=.kuaidaili.com; path=/'; window.document.location=document.URL, and that it is followed by eval("qo=eval;qo(po);"). eval in JS works much like eval in Python. That statement assigns the eval function to qo and then evaluates the string po. The first half of po adds the cookie to the browser, and the second half, window.document.location=document.URL, reloads the current page.
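To make the two halves of that string concrete, here is a minimal Python sketch that splits it into the cookie and the reload statement. The function name split_po is made up for this illustration, and the sketch assumes the string always has the shape shown above.

```python
po = ("document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-"
      "2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; "
      "domain=.kuaidaili.com; path=/'; window.document.location=document.URL")


def split_po(po_string):
    """Return (cookie_dict, reload_statement) for a string shaped like `po`."""
    # The cookie assignment ends at the closing quote; the reload follows it.
    assignment, _, reload_stmt = po_string.partition("'; ")
    cookie_text = assignment.replace("document.cookie='", "")
    # Only the first name=value pair is the cookie; the rest are attributes
    # (expires, domain, path) that tell the browser how to store it.
    name, _, value = cookie_text.split(";")[0].partition("=")
    return {name: value}, reload_stmt


cookie, reload_stmt = split_po(po)
```

Only the name and value need to be sent back to the server. The expires, domain and path attributes are instructions to the browser.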

This confirms what I said above: on the first request, which carries no cookie, the server returns a piece of JS that generates the cookie and automatically refreshes the page. A browser executes this code successfully and requests the data again with the new cookie. Python, by contrast, only completes the first step and ends up holding this code.

So how can Python execute this JS as well? The answer is PyV8. V8 is the JavaScript engine embedded in Chromium and has a reputation for being the fastest. PyV8 is a Python wrapper around V8's external API, which lets Python work with JavaScript directly. You can look up how to install PyV8 yourself.

Code

With the analysis done, here is the code.

First, request the page normally. The response is the HTML containing the encrypted JS function:

    import re
    import PyV8
    import requests

    target_url = "http://www.kuaidaili.com/proxylist/1/"


    def gethtml(url, cookie=None):
        # Browser-like headers copied from a real Chrome request
        header = {
            "Host": "www.kuaidaili.com",
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
        }
        html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
        return html


    # First access, to get the dynamically encrypted JS
    first_html = gethtml(target_url)

Because the response is HTML rather than a bare JS function, we need regular expressions to extract the JS function and the argument it is called with.

    # Extract the JS encryption function
    js_func = ''.join(re.findall(r'(function .*?)</script>', first_html))
    print('Get JS func:\n' + js_func)

    # Extract the argument the JS function is called with
    js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', first_html))
    print('Get JS arg:\n' + js_arg)
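To see what the two patterns pick out, here is a small self-contained sketch run against a made-up response body. The sample HTML, the argument 171 and the delay 200 are all invented for illustration. The real 521 response is much longer and may be laid out differently.

```python
import re

# Made-up stand-in for the first (521) response body.
sample_html = ('<html><body><script language="javascript"> '
               'window.onload=setTimeout("LQ(171)", 200); '
               'function LQ(VA) {var po = ""; eval("qo=eval;qo(po);");}'
               '</script></body></html>')

# Everything from "function " up to the closing script tag is the JS function.
js_func = ''.join(re.findall(r'(function .*?)</script>', sample_html))

# The digits inside setTimeout("name(digits)" are the argument to pass in.
js_arg = ''.join(re.findall(r'setTimeout\("\D+\((\d+)\)"', sample_html))
```

Note that . does not match newlines by default, so the first pattern assumes the function sits on a single line. If it does not, passing re.S to re.findall would be one way to handle it.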

Note also that the JS function does not return the cookie. It sets the cookie directly in the browser. So we need to replace eval("qo=eval;qo(po);") with return po, which makes the function return the contents of po.
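A minimal sketch of that substitution, done with a plain string replacement before the function is handed to the JS engine. The helper name make_returning is invented for this example, and the check for a missing eval call is an added safeguard rather than something taken from the original code.

```python
def make_returning(js_func):
    """Swap the final eval call for `return po` so the JS function returns its result."""
    target = 'eval("qo=eval;qo(po);")'
    if target not in js_func:
        # The site may change its obfuscation; fail loudly instead of running unmodified JS.
        raise ValueError("expected eval call not found in the JS function")
    return js_func.replace(target, 'return po')
```

The modified function can then be evaluated and called with the extracted argument, and its return value is the document.cookie string to parse.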

    # -*- coding: utf-8 -*-
    """
    -------------------------------------------------
       File Name:     demo_1.py.py
       Description :  Python crawler, cracking a JS-encrypted cookie,
                      using the kuaidaili site as the example:
                      http://www.kuaidaili.com/proxylist/1/
       Document:
       Author :       JHao
       date:          2017/3/23
    -------------------------------------------------
       Change Activity:
                       2017/3/23: crack JS-encrypted cookie
    -------------------------------------------------
    """
    __author__ = 'JHao'

    import re
    import PyV8
    import requests

    target_url = "http://www.kuaidaili.com/proxylist/1/"


    def gethtml(url, cookie=None):
        # Browser-like headers copied from a real Chrome request
        header = {
            "Host": "www.kuaidaili.com",
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
        }
        html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
        return html


    def executejs(js_func_string, arg):
        # Evaluate the JS function inside a V8 context and call it with arg
        ctxt = PyV8.JSContext()
        ctxt.enter()
        func = ctxt.eval("({js})".format(js=js_func_string))
        return func(arg)


    def parsecookie(string):
        # Keep only the first name=value pair of the document.cookie assignment
        string = string.replace("document.cookie='", "")
        clearance = string.split(';')[0]
        return {clearance.split('=')[0]: clearance.split('=')[1]}


    # First access, to get the dynamically encrypted JS
    first_html = gethtml(target_url)

    # first_html = """
    #
