My previous implementation of this crawler used Selenium to simulate clicks in a browser page, but that approach is relatively inefficient. This time the crawl is done by decrypting the parameters directly.
First open Jandan at http://jandan.net/ooxx and view the page source. Search for the ID of one of the images, such as 3869006, to see whether the image link can be found in the source code.
Looking at the HTML around that ID, the tag's attributes contain no direct image link, only src=//img.jandan.net/blank.gif, which is obviously not the real address because every image ID has this same value. Notice that there is also onload='jandan_load_img(this)'. We can boldly guess that this function loads the real address; it is passed this, that is, the element that called it. Next to it is a span with the class img-hash whose text is a long string of characters. Judging by the name, it should be the hash value of the image. We don't yet know what it is used for, so just make a note of it and move on to the jandan_load_img() function.
Open the Chrome developer tools with F12, refresh the page, filter by JS in the Network tab, and look through the list on the left for the JS file that contains the jandan_load_img() function.
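The markup described above can be picked apart with Python's standard html.parser. The snippet below is only a minimal sketch, assuming each hash is the text of a span with the class img-hash placed right after the placeholder img. The sample HTML and the hash strings in it are made up for illustration and are not copied from the site; ImgHashParser and extract_img_hashes are hypothetical names.

```python
from html.parser import HTMLParser


class ImgHashParser(HTMLParser):
    """Collect the text of every <span class="img-hash"> element."""

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._in_hash_span = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may be missing
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "img-hash" in classes:
            self._in_hash_span = True

    def handle_data(self, data):
        if self._in_hash_span:
            self.hashes.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_hash_span = False


# Made-up page fragment shaped like the structure described above.
SAMPLE_HTML = """
<p><img src="//img.jandan.net/blank.gif" onload="jandan_load_img(this)" />
<span class="img-hash">aGFzaC1vbmU=</span></p>
<p><img src="//img.jandan.net/blank.gif" onload="jandan_load_img(this)" />
<span class="img-hash">aGFzaC10d28=</span></p>
"""


def extract_img_hashes(html):
    """Return the img-hash strings found in the given HTML text."""
    parser = ImgHashParser()
    parser.feed(html)
    parser.close()
    return parser.hashes


print(extract_img_hashes(SAMPLE_HTML))
```

On a real page the same function would be fed the downloaded page source instead of SAMPLE_HTML.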
Let's take this function and analyze it:
function jandan_load_img(b) {
    // The parameter b is passed in and wrapped as d; the span.img-hash that
    // follows d is assigned to f, and the text of f is extracted into e.
    // That is, e equals the hash value of the picture.
    var d = $(b);
    var f = d.next("span.img-hash");
    var e = f.text();
    f.remove();
    // focus on the following two statements
    var c = jdn1vojzudgdm5wyosjf4yblbpryr4ovpc(e, "9e03yzcyohuebmj5es4c2tblvwiqsgn1");
    var a = $('<a href="' + c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.(gif|jpg|jpeg))/, "$1large$3") + '" target="_blank" class="view_img_link">[View original]</a>');
    d.before(a);
    d.before("<br>");
    d.removeAttr("onload");
    d.attr("src", location.protocol + c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.gif)/, "$1thumb180$3"));
    if (/\.gif$/.test(c)) {
        d.attr("org_src", location.protocol + c);
        b.onload = function () {
            add_img_loading_mask(this, load_sina_gif)
        }
    }
}
The text [View original] in there is an obvious clue that this is where the link is built. So a is what we need, but a is a concatenated <a> tag; the real picture address is c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.(gif|jpg|jpeg))/, "$1large$3"). So all we need now is the value of c. Look at the variable c = jdn1vojzudgdm5wyosjf4yblbpryr4ovpc(e, "9e03yzcyohuebmj5es4c2tblvwiqsgn1"): this is a function call that takes two arguments, e (the picture hash value) and a string, and its return value is assigned to c. So let's go on to find the jdn1vojzudgdm5wyosjf4yblbpryr4ovpc function, which lives in the same JS file as jandan_load_img().
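The replace call above can be mirrored in Python with re.sub. This is a small sketch rather than the site's own code: the pattern is ported from the JavaScript, to_large is a hypothetical helper name, and the sample value of c is made up for illustration.

```python
import re

# Python port of the pattern used in jandan_load_img(). The JavaScript regex
# has no g flag, so only the first match is replaced (count=1 here).
SINA_SIZE_RE = re.compile(r"(//\w+\.sinaimg\.cn/)(\w+)(/.+\.(gif|jpg|jpeg))")


def to_large(c):
    """Swap the size segment of a sinaimg path for "large"."""
    return SINA_SIZE_RE.sub(r"\1large\3", c, count=1)


# Made-up example of a decoded value of c.
sample_c = "//wx4.sinaimg.cn/mw600/example123.jpg"
print(to_large(sample_c))  # //wx4.sinaimg.cn/large/example123.jpg
```

A path that does not match the sinaimg pattern is returned unchanged, which matches how String.replace behaves in the JavaScript.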
The function is as follows:
var jdn1vojzudgdm5wyosjf4yblbpryr4ovpc = function (n, t, e) {
    var f = "DECODE";
    var t = t ? t : "";
    var e = e ? e : 0;
    var r = 4;
    t = md5(t);
    // assigns n, the hash value of the picture, to the variable d
    var d = n;
    var p = md5(t.substr(0, 16));
    var o = md5(t.substr(16, 16));
    if (r) {
        if (f == "DECODE") {
            var m = n.substr(0, r)
        }
    } else {
        var m = ""
    }
    var c = p + md5(p + m);
    var l;
    if (f == "DECODE") {
        n = n.substr(r);
        l = base64_decode(n)
    }
    var k = new Array(256);
    for (var h = 0; h < 256; h++) {
        k[h] = h
    }
    var b = new Array();
    for (var h = 0; h < 256; h++) {
        b[h] = c.charCodeAt(h % c.length)
    }
    for (var g = h = 0; h < 256; h++) {
        g = (g + k[h] + b[h]) % 256;
        tmp = k[h];
        k[h] = k[g];
        k[g] = tmp
    }
    var u = "";
    l = l.split("");
    for (var q = g = h = 0; h < l.length; h++) {
        q = (q + 1) % 256;
        g = (g + k[q]) % 256;
        tmp = k[q];
        k[q] = k[g];
        k[g] = tmp;
        u += chr(ord(l[h]) ^ (k[(k[q] + k[g]) % 256]))
    }
    if (f == "DECODE") {
        if ((u.substr(0, 10) == 0 || u.substr(0, 10) - time() > 0) && u.substr(10, 16) == md5(u.substr(26) + o).substr(0, 16)) {
            u = u.substr(26)
        } else {
            u = ""
        }
        // perform base64 decoding
        u = base64_decode(d)
    }
    return u
};
This code is really... long, but we only need to analyze the part that matters. The c in the jandan_load_img function above is the return value of this function, so we only need to care about what it returns. The return value is u, so look for the assignment to u closest to the return statement. That is u = base64_decode(d), a Base64 decode whose argument is d. Everything computed before it, including the MD5 calls and the swap loops driven by the key, is overwritten by this last assignment. So what on earth is d? Only one other statement in the function touches d, namely var d = n, and n is the function's first parameter, that is, the hash value of the picture. In other words, Base64-decoding the image hash gives the return value c, which is the image address.
What?! All that long code just for this?
Let's take the hash of that image as an example, write the decoding in Python, and see whether we get the URL of the image.
#!/usr/bin/env python
# coding: utf-8
import base64

img_hash = 'ly93edquc2luywltzy5jbi9tdzywmc8wmdc2qlntnwx5mwzzbwrxd2f1dzbqmzbnazbrcgfkos5qcgc='
# b64decode returns bytes in Python 3
url = base64.b64decode(img_hash)
print(url)
Open this link in your browser and look at it.
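Since the JavaScript prepends location.protocol to c, the decoded value is a protocol-relative path starting with //, and it needs a scheme before it can be used outside the page. The sketch below shows one way to do that. hash_to_url is a hypothetical helper, the default scheme "http:" is an assumption, and the sample path is made up and encoded on the spot just to have a valid hash to decode.

```python
import base64


def hash_to_url(img_hash, scheme="http:"):
    """Base64-decode an img-hash and prepend a scheme to the '//host/path' result."""
    path = base64.b64decode(img_hash).decode("utf-8")
    return scheme + path


# Made-up path, encoded here only so the example has a valid hash to decode.
sample_path = "//wx4.sinaimg.cn/mw600/example123.jpg"
sample_hash = base64.b64encode(sample_path.encode("utf-8")).decode("ascii")

print(hash_to_url(sample_hash))  # http://wx4.sinaimg.cn/mw600/example123.jpg
```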
I feel insulted!!! I'm going to eat a bowl of noodles to calm my nerves. If you have read this far, you should be able to crawl the Jandan girl pictures yourself. I'll fill in the code another day; I really am off to eat noodles.
For learning only, do not use for commercial purposes.
Python 3 crawler: crawling the girl pictures from Jandan