On the FishC forum I used to see many people writing Python crawlers to scrape the girl pictures on Jandan (jandan.net). I wrote one too and scraped a lot of pictures with it. Later, Jandan changed its picture pages so that the image addresses are encrypted, and people on the forum kept asking why the requested page no longer contains any image links. This article explains how to obtain the image links from Jandan's OOXX picture section.
First of all, Jandan most likely added this anti-crawler mechanism because too many people were scraping the site. Frequent crawler visits put load on a website, so I suggest that once your crawler works in a simple test you stop there, and do not scrape other people's content excessively.
Crawler approach analysis
Image download flowchart
First, a simple flowchart (not in a standard flowchart notation) shows the whole process of scraping the Jandan pictures:
Flowchart Interpretation
1. To scrape the Jandan pictures, first open any page of the picture section, for example http://jandan.net/ooxx/page-44#comments. We request this page and extract two key pieces of information from it (their roles are explained later). The first is the hash value of each picture, which is the key input for decrypting the image address.
2. Besides the picture hashes, we also extract the address of a key JS file from the current page. This JS file contains another parameter needed to generate the image address, and to get that parameter we have to request the JS file. At the time of writing, the JS address differed from page to page, so it has to be extracted from each page.
3. With the picture hash and the key parameter from the JS file, we can follow the decryption method provided in the JS to get the image link. The decryption method is explained below with both the JS code and its Python counterpart.
4. Once we have the image link, downloading the picture needs little explanation. A follow-up article will cover downloading with multithreading plus multiprocessing.
Page analysis
Web page source code interpretation
Open a page of the picture section, again using http://jandan.net/ooxx/page-44#comments as the example, and view the page source (note: the raw source, not Inspect Element). Where you would expect an image address there is none. Instead, there is markup like the following:
<p><span class="img-hash">ece8ozwut/vggxw1hlbitpge0xmz9y/ywpci5rz5f/h2uswgxwv6iql6daeufit9mh2ep3cetllpwyd+ku0yhpshplny6lmhyiqo6stu9/udy5k+vjt3eq</span></p>
In the page source, the image address has been replaced by a JS function: the address is computed and loaded by the function jandan_load_img(this). So the key now is to find the definition of this function in the JS files.
JS file interpretation
Searching each JS file for jandan_load_img eventually leads to a file with an address similar to http://cdn.jandan.net/static/min/1d694f08895d377af4835a24f06090d0.29100001.js, which contains the definition of this function. After formatting the minified JS code, you can see the definition, of which the following is a fragment:
function jandan_load_img(b) {
    var d = $(b);
    var f = d.next("span.img-hash");
    var e = f.text();
    f.remove();
    var c = f_qa8je29jonvwcrmet1ajocgatainwkcn(e, "agc37is2vpayzkfi9wvobfdn5bcfn1px");
This code is easy to follow. It first reads the text of the span tag with the CSS class img-hash that follows the current tag, which is the picture hash mentioned at the start. It then passes this value, together with a string parameter (this parameter varies from page to page; on this page it is agc37is2vpayzkfi9wvobfdn5bcfn1px), to another function, f_qa8je29jonvwcrmet1ajocgatainwkcn. That is the function that generates the image link, so we need to look at it next.
Interpreting the f_ function
The JS file contains two definitions of the f_ function. That does not matter: code runs from top to bottom, so the later definition overrides the earlier one and we only need to read the later one. Its complete content is as follows:
var f_qa8je29jonvwcrmet1ajocgatainwkcn = function(m, r, d) {
    var e = "DECODE";
    var r = r ? r : "";
    var d = d ? d : 0;
    var q = 4;
    r = md5(r);
    var o = md5(r.substr(0, 16));
    var n = md5(r.substr(16, 16));
    if (q) {
        if (e == "DECODE") {
            var l = m.substr(0, q)
        }
    } else {
        var l = ""
    }
    var c = o + md5(o + l);
    var k;
    if (e == "DECODE") {
        m = m.substr(q);
        k = base64_decode(m)
    }
    var h = new Array(256);
    for (var g = 0; g < 256; g++) {
        h[g] = g
    }
    var b = new Array();
    for (var g = 0; g < 256; g++) {
        b[g] = c.charCodeAt(g % c.length)
    }
    for (var f = g = 0; g < 256; g++) {
        f = (f + h[g] + b[g]) % 256;
        tmp = h[g];
        h[g] = h[f];
        h[f] = tmp
    }
    var t = "";
    k = k.split("");
    for (var p = f = g = 0; g < k.length; g++) {
        p = (p + 1) % 256;
        f = (f + h[p]) % 256;
        tmp = h[p];
        h[p] = h[f];
        h[f] = tmp;
        t += chr(ord(k[g]) ^ (h[(h[p] + h[f]) % 256]))
    }
    if (e == "DECODE") {
        if ((t.substr(0, 10) == 0 || t.substr(0, 10) - time() > 0) && t.substr(10, 16) == md5(t.substr(26) + n).substr(0, 16)) {
            t = t.substr(26)
        } else {
            t = ""
        }
    }
    return t
};
This function takes 3 parameters. The first is the picture hash, the second is the string seen in the jandan_load_img function, and the third is unused, because jandan_load_img does not pass it. All we need to do is rewrite this function in Python, following the logic of the JS code.
Rewriting the function in Python
Rewritten in Python, the f_ function looks like this:
def get_imgurl(m, r='', d=0):
    '''Decrypt the hash and return the image link'''
    e = "DECODE"
    q = 4
    r = _md5(r)
    o = _md5(r[0:0 + 16])
    n = _md5(r[16:16 + 16])
    l = m[0:q]
    c = o + _md5(o + l)
    m = m[q:]
    k = _base64_decode(m)  # bytes, so k[g] below is already an int
    h = list(range(256))
    b = [ord(c[g % len(c)]) for g in range(256)]

    f = 0
    for g in range(0, 256):
        f = (f + h[g] + b[g]) % 256
        tmp = h[g]
        h[g] = h[f]
        h[f] = tmp

    t = ""
    p, f = 0, 0
    for g in range(0, len(k)):
        p = (p + 1) % 256
        f = (f + h[p]) % 256
        tmp = h[p]
        h[p] = h[f]
        h[f] = tmp
        t += chr(k[g] ^ (h[(h[p] + h[f]) % 256]))
    # The first 26 characters are a header; the link starts after them
    t = t[26:]
    return t
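One way to check a port like this without requesting the site is a round trip. The two loops have the same shape as RC4's key scheduling and keystream generation, so XOR-ing with the same keystream twice returns the original input. The sketch below restates the decoding steps compactly under that observation. The encode function, the 'abcd' prefix, the 26-character placeholder header, the key string and the example.invalid link are all made up for illustration and are not taken from Jandan.

```python
import base64
import hashlib


def _md5(value):
    return hashlib.md5(value.encode('utf-8')).hexdigest()


def _xor_keystream(data, key):
    """XOR bytes with the keystream produced by the two loops (RC4-style)."""
    h = list(range(256))
    f = 0
    for g in range(256):  # first loop: shuffle h using the key
        f = (f + h[g] + ord(key[g % len(key)])) % 256
        h[g], h[f] = h[f], h[g]
    out = bytearray()
    p = f = 0
    for byte in data:  # second loop: one keystream byte per input byte
        p = (p + 1) % 256
        f = (f + h[p]) % 256
        h[p], h[f] = h[f], h[p]
        out.append(byte ^ h[(h[p] + h[f]) % 256])
    return bytes(out)


def _key(prefix, r):
    """Build the cipher key from the 4-character prefix and the page string."""
    o = _md5(_md5(r)[0:16])
    return o + _md5(o + prefix)


def decode(m, r):
    """Same steps as get_imgurl, written compactly."""
    prefix, body = m[:4], m[4:]
    body += '=' * (-len(body) % 4)  # restore stripped base64 padding
    plain = _xor_keystream(base64.b64decode(body), _key(prefix, r))
    return plain.decode('latin-1')[26:]


def encode(text, r, prefix='abcd'):
    """Hypothetical inverse, only used to build test input."""
    plain = ('0' * 26 + text).encode('latin-1')  # 26 placeholder characters
    cipher = _xor_keystream(plain, _key(prefix, r))
    return prefix + base64.b64encode(cipher).decode('ascii').rstrip('=')


url = '//example.invalid/mw600/sample.jpg'
token = encode(url, 'made-up-key')
print(decode(token, 'made-up-key'))  # prints //example.invalid/mw600/sample.jpg
```

If the decoded text does not match the input, the most likely culprits are the key construction or the order of the swaps inside the loops.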
get_imgurl relies on two helper functions. The first is an MD5 hashing function, corresponding to this line of the JS:
var o = md5(r.substr(0, 16));
The JS substr() function corresponds to slicing in Python: substr(start, length) becomes [start:start + length].
MD5 hashing translated to Python looks like this:
def _md5(value):
    '''MD5 hash of a string, returned as a hex digest'''
    m = hashlib.md5()
    m.update(value.encode('utf-8'))
    return m.hexdigest()
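To make the substr-to-slice mapping concrete, here is a small sketch. The substr helper is a hypothetical stand-in for the JS method, and 'abc' is an arbitrary input. An MD5 hex digest is always 32 characters long, so the two 16-character slices used in the JS cover it exactly.

```python
import hashlib


def substr(s, start, length):
    """Mimic JS s.substr(start, length) for a non-negative start."""
    return s[start:start + length]


digest = hashlib.md5('abc'.encode('utf-8')).hexdigest()
first_half = substr(digest, 0, 16)    # JS: r.substr(0, 16)
second_half = substr(digest, 16, 16)  # JS: r.substr(16, 16)

print(len(digest))                         # 32
print(first_half + second_half == digest)  # True
```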
The second is a base64 decoding function, used in this part of the JS:
k = base64_decode(m)
In Python, note that calling base64.b64decode directly on the data may raise an error:
binascii.Error: Incorrect padding
So the data has to be padded before decoding. The function is:
def _base64_decode(data):
    '''base64 decoding; pads the input first to avoid the incorrect-padding error'''
    # Number of '=' characters needed to reach a multiple of 4 (0 if already aligned)
    missing_padding = (4 - len(data) % 4) % 4
    if missing_padding:
        data += '=' * missing_padding
    return base64.b64decode(data)
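The padding problem is easy to reproduce on its own. In the sketch below, pad_base64 is a hypothetical helper and the input is simply the base64 form of b'hello' with its trailing '=' removed; neither comes from the Jandan page.

```python
import base64
import binascii


def pad_base64(data):
    """Append '=' until the length is a multiple of 4 (hypothetical helper)."""
    return data + '=' * (-len(data) % 4)


unpadded = 'aGVsbG8'  # base64 of b'hello' with its trailing '=' removed

try:
    base64.b64decode(unpadded)
    failed = False
except binascii.Error:
    failed = True

print(failed)                                  # True
print(base64.b64decode(pad_base64(unpadded)))  # b'hello'
```

The expression -len(data) % 4 yields 0 when the length is already a multiple of 4, so nothing is appended to input that is correctly padded.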
At this point, the function for getting the image link is complete. It relies on 3 functions in total.
We can test it by passing in the two parameters copied from the web page:
m = 'ece8ozwut/vggxw1hlbitpge0xmz9y/ywpci5rz5f/h2uswgxwv6iql6daeufit9mh2ep3cetllpwyd+ku0yhpshplny6lmhyiqo6stu9/udy5k+vjt3eq'
r = 'HPRB2OSFT5RHLSYZAXV8XYPVEAGDTHCA'
print(get_imgurl(m, r))
You can see the following output:
//ww3.sinaimg.cn/mw600/0073ob6pgy1fpet9wku7dj30hs0qljuz.jpg
Note: the r parameter here is copied from the JS file of the page. The JS address changes from page to page, and this parameter changes with it.
Getting the hash and the JS address
As mentioned earlier, the hash value is the key parameter for getting the image address, while the other parameter lives in the JS file, and that JS file differs from page to page. So now we need to extract these two key parameters.
Getting the hashes in bulk
Getting the picture hashes is easy with BeautifulSoup. The code is:
def get_urls(url):
    '''Get the links to all pictures on a page'''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
        'Host': 'jandan.net'
    }
    html = requests.get(url, headers=headers).text
    # Some pages contain a commented-out copy of the script tag, so take the last match
    js_url = 'http:' + re.findall(r'<script src="(//cdn.jandan.net/static/min/[\w\d]+\.\d+\.js)"></script>', html)[-1]
    _r = get_r(js_url)
    soup = BeautifulSoup(html, 'lxml')
    tags = soup.select('.img-hash')
    for tag in tags:
        img_hash = tag.text
        img_url = get_imgurl(img_hash, _r)
        print(img_url)
The part that extracts the picture hashes is these lines:
soup = BeautifulSoup(html, 'lxml')
tags = soup.select('.img-hash')
for tag in tags:
    img_hash = tag.text
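For readers who want to try the extraction step without installing anything, the sketch below does the same job with the standard library's html.parser instead of BeautifulSoup. This is a swapped-in technique, not the approach used in this article. The ImgHashParser class and the HASH_ONE / HASH_TWO placeholder values are made up for illustration.

```python
from html.parser import HTMLParser


class ImgHashParser(HTMLParser):
    """Collect the text of every <span class="img-hash"> element."""

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._in_hash_span = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'img-hash' in classes:
            self._in_hash_span = True
            self.hashes.append('')

    def handle_data(self, data):
        if self._in_hash_span:
            self.hashes[-1] += data

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_hash_span = False


# Made-up markup in the same shape as the page source
sample_html = (
    '<p><span class="img-hash">HASH_ONE</span></p>'
    '<p><span class="img-hash">HASH_TWO</span></p>'
    '<p><span class="other">ignored</span></p>'
)
parser = ImgHashParser()
parser.feed(sample_html)
print(parser.hashes)  # ['HASH_ONE', 'HASH_TWO']
```

BeautifulSoup's select('.img-hash') is shorter and copes better with messy markup, which is why the article's code uses it.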
Getting the key string from the JS
The JS address is extracted with a regular expression:
js_url = 'http:' + re.findall(r'<script src="(//cdn.jandan.net/static/min/[\w\d]+\.\d+\.js)"></script>', html)[-1]
Note that re.findall returns a list, so we have to pick one link from it. After checking, I found that some pages contain two of these JS files, one of which is commented out, so we must use the last one. That is why the expression indexes the list with [-1].
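The effect of [-1] can be seen on a small made-up page fragment. The two file names below are invented, and the sketch assumes the commented-out tag comes first, as described above.

```python
import re

# Made-up page fragment: the first script tag sits inside an HTML comment.
sample_html = (
    '<!-- <script src="//cdn.jandan.net/static/min/aaaa1111.00000001.js"></script> -->\n'
    '<script src="//cdn.jandan.net/static/min/bbbb2222.00000002.js"></script>\n'
)

pattern = r'<script src="(//cdn\.jandan\.net/static/min/[\w\d]+\.\d+\.js)"></script>'
matches = re.findall(pattern, sample_html)

print(len(matches))           # 2
print('http:' + matches[-1])  # http://cdn.jandan.net/static/min/bbbb2222.00000002.js
```

A regular expression does not know about HTML comments, so both tags match. Taking the last match only works as long as the live tag really is the last one on the page.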
With the JS address in hand, request the file and find the key string in it. This can be written as a function:
def get_r(js_url):
    '''Get the key string'''
    js = requests.get(js_url).text
    # The minified JS contains a call of the form c=f_xxx(e,"<key string>")
    _r = re.findall(r'c=f_[\w\d]+\(e,"(.*?)"\)', js)[0]
    return _r
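The pattern can be tried offline on a short string shaped like the minified JS. The sample_js text and the function name f_abc123 below are made up. The None fallback is an addition for illustration, since get_r as written raises an IndexError when nothing matches.

```python
import re

# Made-up stand-in for a piece of the minified JS file.
sample_js = 'var e=f.text();f.remove();var c=f_abc123(e,"agc37is2vpayzkfi9wvobfdn5bcfn1px");'

found = re.findall(r'c=f_[\w\d]+\(e,"(.*?)"\)', sample_js)
key = found[0] if found else None  # None when the layout no longer matches

print(key)  # agc37is2vpayzkfi9wvobfdn5bcfn1px
```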
Complete Code
Here is the complete code for getting all the image links on a page:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import hashlib
import re
import base64


def _md5(value):
    '''MD5 hash of a string, returned as a hex digest'''
    m = hashlib.md5()
    m.update(value.encode('utf-8'))
    return m.hexdigest()


def _base64_decode(data):
    '''base64 decoding; pads the input first to avoid the incorrect-padding error'''
    missing_padding = (4 - len(data) % 4) % 4
    if missing_padding:
        data += '=' * missing_padding
    return base64.b64decode(data)


def get_imgurl(m, r='', d=0):
    '''Decrypt the hash and return the image link'''
    e = "DECODE"
    q = 4
    r = _md5(r)
    o = _md5(r[0:0 + 16])
    n = _md5(r[16:16 + 16])
    l = m[0:q]
    c = o + _md5(o + l)
    m = m[q:]
    k = _base64_decode(m)
    h = list(range(256))
    b = [ord(c[g % len(c)]) for g in range(256)]

    f = 0
    for g in range(0, 256):
        f = (f + h[g] + b[g]) % 256
        tmp = h[g]
        h[g] = h[f]
        h[f] = tmp

    t = ""
    p, f = 0, 0
    for g in range(0, len(k)):
        p = (p + 1) % 256
        f = (f + h[p]) % 256
        tmp = h[p]
        h[p] = h[f]
        h[f] = tmp
        t += chr(k[g] ^ (h[(h[p] + h[f]) % 256]))
    t = t[26:]
    return t


def get_r(js_url):
    '''Get the key string'''
    js = requests.get(js_url).text
    _r = re.findall(r'c=f_[\w\d]+\(e,"(.*?)"\)', js)[0]
    return _r


def get_urls(url):
    '''Get the links to all pictures on a page'''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
        'Host': 'jandan.net'
    }
    html = requests.get(url, headers=headers).text
    js_url = 'http:' + re.findall(r'<script src="(//cdn.jandan.net/static/min/[\w\d]+\.\d+\.js)"></script>', html)[-1]
    _r = get_r(js_url)
    soup = BeautifulSoup(html, 'lxml')
    tags = soup.select('.img-hash')
    for tag in tags:
        img_hash = tag.text
        img_url = get_imgurl(img_hash, _r)
        print(img_url)


if __name__ == '__main__':
    get_urls('http://jandan.net/ooxx/page-44')
Running the code above prints the links of all the pictures on the page. Some of the links are:
//ww3.sinaimg.cn/mw600/0073ob6pgy1fpet9wku7dj30hs0qljuz.jpg
//ww3.sinaimg.cn/mw600/0073tlpggy1fpet9mszjwj30hs0g1jsv.jpg
//ww3.sinaimg.cn/mw600/0073ob6pgy1fpesskkgobj31jk1jkk5b.jpg
//wx3.sinaimg.cn/mw600/006xfbarly1fpesq2jn1vj30j60svaz3.jpg
//wx3.sinaimg.cn/mw600/6967abd2gy1fpenoyobrcj20u03d0b2d.jpg
//wx3.sinaimg.cn/mw600/6967abd2gy1fpenp38v9uj20u03zkhdy.jpg
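These links are protocol-relative (they start with //), so they need a scheme before they can be requested. A minimal sketch, assuming the scheme of the page URL is the one to use, resolves them with urljoin from the standard library:

```python
from urllib.parse import urljoin

page_url = 'http://jandan.net/ooxx/page-44'
link = '//ww3.sinaimg.cn/mw600/0073ob6pgy1fpet9wku7dj30hs0qljuz.jpg'

# urljoin keeps the host from the link and borrows only the scheme from the page URL
full_link = urljoin(page_url, link)
print(full_link)  # http://ww3.sinaimg.cn/mw600/0073ob6pgy1fpet9wku7dj30hs0qljuz.jpg
```

Prepending 'http:' by hand, as the code does for the JS address, gives the same result here.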
Summary: this completes the method for extracting the image links of the Jandan pictures. A follow-up article will cover downloading the pictures with multithreading plus multiprocessing.
Originally published at: http://www.tendcode.com/article/jiandan-meizi-spider/