Python crawler for the Jandan (jandan.net) girl pictures: how the encrypted picture links are decrypted

Source: Internet
Author: User

Some time ago on the FishC forum I saw many people using Python to write crawlers that download the girl pictures from Jandan (jandan.net). I wrote one too and downloaded plenty of pictures. Later, Jandan reworked its girl-picture pages so that the picture addresses are encrypted, and people on the forum often ask why the requested page contains no picture links. This article explains how to obtain the picture links from Jandan's OOXX girl-picture pages.

First, the reason Jandan added an anti-crawler mechanism is probably that too many people were crawling the site. Frequent crawler visits put load on a website, so I suggest that once your crawler works you stop there, and do not scrape other people's content excessively.

Crawler approach

Picture download flowchart

The whole process of crawling the Jandan girl pictures can be shown as a simple flowchart (not in canonical flowchart format).

Flowchart Interpretation

1. To crawl the Jandan girl pictures, first open any girl-picture page, for example http://jandan.net/ooxx/page-44#comments. We request this page and extract two key pieces of information (their roles are explained later). The first is the hash value attached to each picture, which is the key input for decrypting the picture address.

2. Besides the picture hashes, we also extract from the current page the address of a key JS file. This JS file contains another parameter needed to generate the picture address, and to get that parameter we must request the JS address. At the time of writing, the JS address differs from page to page, so it has to be extracted from each page.

3. With the picture hash and the key parameter from the JS file, the picture link can be obtained by following the decryption method that the JS provides. The decryption method is explained later with reference to both the JS code and the Python code.

4. With the picture link in hand, downloading the picture needs little explanation. A follow-up article will cover downloading the pictures with multithreading plus multiprocessing.

Page analysis

Page source interpretation

Open a girl-picture page, again using http://jandan.net/ooxx/page-44#comments as the example, and view the page source (note: the raw source, not Inspect Element). Where the picture address should be, there is no picture address. Instead there is code similar to the following:

<p><span class="img-hash">ece8ozwut/vggxw1hlbitpge0xmz9y/ywpci5rz5f/h2uswgxwv6iql6daeufit9mh2ep3cetllpwyd+ku0yhpshplny6lmhyiqo6stu9/udy5k+vjt3eq</span></p>

As the page source shows, the picture address has been replaced by a JS function: the address is obtained and loaded by the function jandan_load_img(this), while the span only carries the hash. The key step now is to find what this function does in the JS files.

JS file interpretation

Searching each JS file for jandan_load_img eventually leads to a file at an address similar to http://cdn.jandan.net/static/min/1d694f08895d377af4835a24f06090d0.29100001.js that contains the definition of this function. After formatting the minified JS code, the definition looks like the following fragment:

function jandan_load_img(b) {
    var d = $(b);
    var f = d.next("span.img-hash");
    var e = f.text();
    f.remove();
    var c = f_qa8je29jonvwcrmet1ajocgatainwkcn(e, "agc37is2vpayzkfi9wvobfdn5bcfn1px");
    // ... (rest of the function omitted)

This code is easy to follow. It first reads the text of the span whose CSS class is img-hash that follows the current tag, which is the picture hash mentioned at the start. It then passes this value, together with a string parameter (this parameter varies from page to page; on this page it is agc37is2vpayzkfi9wvobfdn5bcfn1px), to another function, f_qa8je29jonvwcrmet1ajocgatainwkcn. So the next thing to look at is that function, which is the one that generates the picture link.

Interpreting the f_ function

The JS file contains two definitions of the f_ function, but that does not matter: the code runs from top to bottom, so the later definition overrides the earlier one and only the later one needs reading. Its complete content is as follows:

var f_qa8je29jonvwcrmet1ajocgatainwkcn = function(m, r, d) {
    var e = "DECODE";
    var r = r ? r : "";
    var d = d ? d : 0;
    var q = 4;
    r = md5(r);
    var o = md5(r.substr(0, 16));
    var n = md5(r.substr(16, 16));
    if (q) {
        if (e == "DECODE") {
            var l = m.substr(0, q);
        }
    } else {
        var l = "";
    }
    var c = o + md5(o + l);
    var k;
    if (e == "DECODE") {
        m = m.substr(q);
        k = base64_decode(m);
    }
    var h = new Array(256);
    for (var g = 0; g < 256; g++) {
        h[g] = g;
    }
    var b = new Array();
    for (var g = 0; g < 256; g++) {
        b[g] = c.charCodeAt(g % c.length);
    }
    for (var f = g = 0; g < 256; g++) {
        f = (f + h[g] + b[g]) % 256;
        tmp = h[g];
        h[g] = h[f];
        h[f] = tmp;
    }
    var t = "";
    k = k.split("");
    for (var p = f = g = 0; g < k.length; g++) {
        p = (p + 1) % 256;
        f = (f + h[p]) % 256;
        tmp = h[p];
        h[p] = h[f];
        h[f] = tmp;
        t += chr(ord(k[g]) ^ (h[(h[p] + h[f]) % 256]));
    }
    if (e == "DECODE") {
        if ((t.substr(0, 10) == 0 || t.substr(0, 10) - time() > 0) && t.substr(10, 16) == md5(t.substr(26) + n).substr(0, 16)) {
            t = t.substr(26);
        } else {
            t = "";
        }
    }
    return t;
};

This function takes 3 parameters. The first is the picture's hash value, the second is the string seen in the jandan_load_img function, and the third is unused, because jandan_load_img does not pass it. All we need to do is rewrite this function in Python, following the logic of the JS code.

Rewriting the function in Python

Rewritten in Python, the f_ function looks like this:

def get_imgurl(m, r='', d=0):
    '''Decrypt the hash and return the picture link'''
    e = "DECODE"
    q = 4
    r = _md5(r)
    o = _md5(r[0:0 + 16])
    n = _md5(r[16:16 + 16])
    l = m[0:q]
    c = o + _md5(o + l)
    m = m[q:]
    k = _base64_decode(m)
    h = list(range(256))
    b = [ord(c[g % len(c)]) for g in range(256)]

    f = 0
    for g in range(0, 256):
        f = (f + h[g] + b[g]) % 256
        tmp = h[g]
        h[g] = h[f]
        h[f] = tmp

    t = ""
    p, f = 0, 0
    for g in range(0, len(k)):
        p = (p + 1) % 256
        f = (f + h[p]) % 256
        tmp = h[p]
        h[p] = h[f]
        h[f] = tmp
        t += chr(k[g] ^ (h[(h[p] + h[f]) % 256]))
    t = t[26:]
    return t

get_imgurl relies on two other functions. The first is the MD5 hash function, which corresponds to this line of the JS:

var o = md5(r.substr(0, 16));

The JS substr() function corresponds to slicing in Python: substr(start, length) becomes [start:start + length]. A quick look at its definition makes this clear, so it is not explained further.
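
To make the substr-to-slice mapping concrete, here is a minimal sketch. The helper name substr and the sample string are made up for illustration, and the helper only covers non-negative start values, which is all the f_ function uses.

```python
def substr(s, start, length=None):
    """Hypothetical helper mimicking JS substr(start, length) for start >= 0."""
    if length is None:
        return s[start:]              # JS: s.substr(start)
    return s[start:start + length]    # JS: s.substr(start, length)


key = "0123456789abcdef0123456789ABCDEF"  # made-up 32-character string
first_half = substr(key, 0, 16)     # JS: key.substr(0, 16)
second_half = substr(key, 16, 16)   # JS: key.substr(16, 16)
tail = substr(key, 26)              # JS: key.substr(26)
print(first_half, second_half, tail)
```

The second argument of substr is a length, not an end index, which is why the Python rewrite uses slices of the form [16:16 + 16] rather than [16:16].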

The MD5 function translated into Python is as follows:

import hashlib

def _md5(value):
    '''MD5 hex digest of a string'''
    m = hashlib.md5()
    m.update(value.encode('utf-8'))
    return m.hexdigest()
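
As an illustration of how the MD5 helper is used at the top of the f_ function, the sketch below derives the three intermediate strings o, n and c. The names md5_hex and derive_keys, and both input values, are made up for this example; only the sequence of MD5 calls follows the JS shown earlier.

```python
import hashlib


def md5_hex(value):
    """Hex MD5 digest of a text string (same idea as _md5)."""
    return hashlib.md5(value.encode('utf-8')).hexdigest()


def derive_keys(page_key, salt):
    """Mirror the first lines of the f_ function and return (o, n, c)."""
    r = md5_hex(page_key)          # 32 hex characters
    o = md5_hex(r[0:16])           # digest of the first half of r
    n = md5_hex(r[16:32])          # digest of the second half of r
    c = o + md5_hex(o + salt)      # 64-character key for the cipher loops
    return o, n, c


# Made-up inputs: a page key and the first 4 characters of a picture hash.
o, n, c = derive_keys('example-page-key', 'abcd')
print(len(o), len(n), len(c))
```

The value c is what the two 256-step loops consume, one character code at a time.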

Next is a base64 decoding function, which is used in this line of the JS:

k = base64_decode(m)

In Python, be aware that calling base64.b64decode directly on this data can raise an error:

binascii.Error: Incorrect padding

So the data has to be padded before it is decoded. The function is:

import base64

def _base64_decode(data):
    '''base64 decoding; pads the string first to avoid the length error'''
    # number of '=' needed to reach a multiple of 4 (0 if already aligned)
    missing_padding = -len(data) % 4
    if missing_padding:
        data += '=' * missing_padding
    return base64.b64decode(data)
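
The padding problem is easy to reproduce in isolation. The following sketch uses a made-up sample value and a hypothetical helper name, pad_base64; it shows the error on an unpadded string and the successful decode after padding.

```python
import base64
import binascii


def pad_base64(data):
    """Append '=' until the length is a multiple of 4."""
    return data + '=' * (-len(data) % 4)


# Encode a sample value, then strip the padding to imitate the page data.
stripped = base64.b64encode(b'hello').decode('ascii').rstrip('=')

try:
    base64.b64decode(stripped)
    failed = False
except binascii.Error as exc:
    failed = True
    print('decode failed:', exc)

decoded = base64.b64decode(pad_base64(stripped))
print(decoded)
```

A string whose length is already a multiple of 4 is returned unchanged by pad_base64, so the helper can be applied unconditionally.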

At this point the function that produces the picture link is complete. It mainly involves 3 functions.
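
As an aside on the two loops inside get_imgurl: they have the same shape as the RC4 stream cipher, with a key-scheduling pass that shuffles a 256-entry table followed by a pass that XORs every data byte with a keystream byte. The block below is a minimal standalone sketch of plain RC4 over bytes, not the site's code; the function name rc4 and the sample key and plaintext are made up for illustration.

```python
def rc4(key, data):
    """Plain RC4 over bytes: key scheduling, then keystream XOR."""
    s = list(range(256))
    j = 0
    for i in range(256):                       # key-scheduling pass
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:                          # keystream pass
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


key = b'example-key'                           # made-up key
plain = b'//example.invalid/picture.jpg'       # made-up plaintext
cipher = rc4(key, plain)
print(rc4(key, cipher) == plain)
```

Because the keystream is only XORed with the data, running the same routine twice with the same key restores the input, which is why the page can ship one routine for decoding.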

We can test get_imgurl by passing in the two parameters copied from the web page:

m = 'ece8ozwut/vggxw1hlbitpge0xmz9y/ywpci5rz5f/h2uswgxwv6iql6daeufit9mh2ep3cetllpwyd+ku0yhpshplny6lmhyiqo6stu9/Udy5k+vjt3eq'
r = 'HPRB2OSFT5RHLSYZAXV8XYPVEAGDTHCA'
print(get_imgurl(m, r))

You can see the following output:

//ww3.sinaimg.cn/mw600/0073ob6pgy1fpet9wku7dj30hs0qljuz.jpg

Note: the r parameter here is copied from the JS file of the page. Each page's JS address changes, and this parameter changes with it.

Getting the hash and the JS address

As said earlier, the hash value is one key parameter for obtaining the picture address, and the other parameter lives in the JS file, which is different for each page. So the next step is to extract these two key parameters.

Getting the hashes in bulk

Getting the picture hashes is straightforward with BeautifulSoup. The code fragment is:

import re

import requests
from bs4 import BeautifulSoup

def get_urls(url):
    '''Get the links of all pictures on one page'''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
        'Host': 'jandan.net'
    }
    html = requests.get(url, headers=headers).text
    # the last matching script tag is the live one (see the regex notes)
    js_url = 'http:' + re.findall(r'<script src="(//cdn.jandan.net/static/min/[\w\d]+\.\d+\.js)"></script>', html)[-1]
    _r = get_r(js_url)
    soup = BeautifulSoup(html, 'lxml')
    tags = soup.select('.img-hash')
    for tag in tags:
        img_hash = tag.text
        img_url = get_imgurl(img_hash, _r)
        print(img_url)

The part that extracts the picture hashes is:

soup = BeautifulSoup(html, 'lxml')
tags = soup.select('.img-hash')
for tag in tags:
    img_hash = tag.text
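
For readers who want to try the extraction step without installing BeautifulSoup or lxml, the same idea can be sketched with the standard library's html.parser. This is a swapped-in technique, not the article's method; the class name ImgHashParser and the sample HTML (including its hash values) are made up for illustration.

```python
from html.parser import HTMLParser


class ImgHashParser(HTMLParser):
    """Collect the text of every <span class="img-hash"> element."""

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'img-hash' in classes:
            self._inside = True
            self.hashes.append('')

    def handle_data(self, data):
        if self._inside:
            self.hashes[-1] += data

    def handle_endtag(self, tag):
        if tag == 'span':
            self._inside = False


# Made-up sample markup in the shape shown in the page source.
sample_html = '''
<p><span class="img-hash">aGFzaC1vbmU</span></p>
<p><span class="other">skip me</span></p>
<p><span class="img-hash">aGFzaC10d28</span></p>
'''
parser = ImgHashParser()
parser.feed(sample_html)
print(parser.hashes)
```

The sketch does not handle nested spans, which is acceptable here because the hash span only contains text.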

Getting the key string from the JS file

The JS address is obtained with a regular expression:

js_url = 'http:' + re.findall(r'<script src="(//cdn.jandan.net/static/min/[\w\d]+\.\d+\.js)"></script>', html)[-1]

Note that the regular expression returns a list, so one link has to be picked from it. On inspection I found that some pages contain two of these JS files, one of which is commented out, so the last one must be used. That is why the expression indexes the list with [-1].
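
The effect of taking the last match can be checked on a small sample. In this sketch the HTML and both JS file names are made up, and the dots in the host name are escaped, which is slightly stricter than the pattern above.

```python
import re

# Made-up page fragment: the first script tag is commented out.
sample_html = '''
<!-- <script src="//cdn.jandan.net/static/min/aaaa1111.00000001.js"></script> -->
<script src="//cdn.jandan.net/static/min/bbbb2222.00000002.js"></script>
'''

pattern = r'<script src="(//cdn\.jandan\.net/static/min/[\w\d]+\.\d+\.js)"></script>'
matches = re.findall(pattern, sample_html)   # matches inside comments too
js_url = 'http:' + matches[-1]               # keep the last, live script
print(matches)
print(js_url)
```

re.findall has no notion of HTML comments, so the commented-out tag is matched as well; indexing with [-1] relies on the live tag appearing after it.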

With the JS address, request it and then find the key string. This can be written as a function:

import re

import requests

def get_r(js_url):
    '''Get the key string'''
    js = requests.get(js_url).text
    _r = re.findall(r'c=f_[\w\d]+\(e,"(.*?)"\)', js)[0]
    return _r
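
The key-extraction pattern can be tried offline on a string shaped like the minified JS. In this sketch the function name f_exampleName123 and the key value are made up, and the guard against an empty result is an addition that the function above does not have.

```python
import re

# Made-up minified JS in the shape of the jandan_load_img fragment.
sample_js = ('function jandan_load_img(b){var d=$(b);'
             'var f=d.next("span.img-hash");var e=f.text();f.remove();'
             'var c=f_exampleName123(e,"ExampleKey0123456789");}')

pattern = r'c=f_[\w\d]+\(e,"(.*?)"\)'
found = re.findall(pattern, sample_js)
key = found[0] if found else None   # avoid IndexError when nothing matches
print(key)
```

The pattern depends on the minified variable names c and e, so if the site's minifier renames them, findall returns an empty list and the guard makes that visible instead of raising.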

Complete code

Here is the complete code that gets all the picture links of one page:

# -*- coding: utf-8 -*-
import base64
import hashlib
import re

import requests
from bs4 import BeautifulSoup


def _md5(value):
    '''MD5 hex digest of a string'''
    m = hashlib.md5()
    m.update(value.encode('utf-8'))
    return m.hexdigest()


def _base64_decode(data):
    '''base64 decoding; pads the string first to avoid the length error'''
    # number of '=' needed to reach a multiple of 4 (0 if already aligned)
    missing_padding = -len(data) % 4
    if missing_padding:
        data += '=' * missing_padding
    return base64.b64decode(data)


def get_imgurl(m, r='', d=0):
    '''Decrypt the hash and return the picture link'''
    e = "DECODE"
    q = 4
    r = _md5(r)
    o = _md5(r[0:0 + 16])
    n = _md5(r[16:16 + 16])
    l = m[0:q]
    c = o + _md5(o + l)
    m = m[q:]
    k = _base64_decode(m)
    h = list(range(256))
    b = [ord(c[g % len(c)]) for g in range(256)]

    f = 0
    for g in range(0, 256):
        f = (f + h[g] + b[g]) % 256
        tmp = h[g]
        h[g] = h[f]
        h[f] = tmp

    t = ""
    p, f = 0, 0
    for g in range(0, len(k)):
        p = (p + 1) % 256
        f = (f + h[p]) % 256
        tmp = h[p]
        h[p] = h[f]
        h[f] = tmp
        t += chr(k[g] ^ (h[(h[p] + h[f]) % 256]))
    t = t[26:]
    return t


def get_r(js_url):
    '''Get the key string'''
    js = requests.get(js_url).text
    _r = re.findall(r'c=f_[\w\d]+\(e,"(.*?)"\)', js)[0]
    return _r


def get_urls(url):
    '''Get the links of all pictures on one page'''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
        'Host': 'jandan.net'
    }
    html = requests.get(url, headers=headers).text
    js_url = 'http:' + re.findall(r'<script src="(//cdn.jandan.net/static/min/[\w\d]+\.\d+\.js)"></script>', html)[-1]
    _r = get_r(js_url)
    soup = BeautifulSoup(html, 'lxml')
    tags = soup.select('.img-hash')
    for tag in tags:
        img_hash = tag.text
        img_url = get_imgurl(img_hash, _r)
        print(img_url)


if __name__ == '__main__':
    get_urls('http://jandan.net/ooxx/page-44')

Running the code above prints the links of all the pictures on this page. Some of the links are:

//ww3.sinaimg.cn/mw600/0073ob6pgy1fpet9wku7dj30hs0qljuz.jpg
//ww3.sinaimg.cn/mw600/0073tlpggy1fpet9mszjwj30hs0g1jsv.jpg
//ww3.sinaimg.cn/mw600/0073ob6pgy1fpesskkgobj31jk1jkk5b.jpg
//wx3.sinaimg.cn/mw600/006xfbarly1fpesq2jn1vj30j60svaz3.jpg
//wx3.sinaimg.cn/mw600/6967abd2gy1fpenoyobrcj20u03d0b2d.jpg
//wx3.sinaimg.cn/mw600/6967abd2gy1fpenp38v9uj20u03zkhdy.jpg
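
The printed links are protocol-relative (they start with //), so a scheme has to be added before they can be requested. One possible way, sketched here under the assumption that plain http is acceptable as it is for the page URLs in this article, is urllib.parse.urljoin. The helper name absolute_url is made up for illustration.

```python
from urllib.parse import urljoin


def absolute_url(link, base='http://jandan.net/'):
    """Resolve a protocol-relative link against the page URL."""
    return urljoin(base, link)


full = absolute_url('//ww3.sinaimg.cn/mw600/0073ob6pgy1fpet9wku7dj30hs0qljuz.jpg')
print(full)
```

urljoin takes only the scheme from the base when the link starts with //, so the host and path of the picture link are kept as they are.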

Summary: this completes the method for extracting the picture links of the Jandan girl pictures. A follow-up will cover downloading the pictures with multithreading plus multiprocessing.

Original article: http://www.tendcode.com/article/jiandan-meizi-spider/
