Website:
Autohome, taking the forum at http://club.autohome.com.cn/ as the example
Anti-crawler measures:
A few words in the body of each forum post are picked at random and replaced with span tags. The tag content is empty, but a CSS style makes the replaced text appear in its place. This does not
affect normal reading; the only visible difference is that the replaced text cannot be selected with the mouse. A crawler, however, ends up with incomplete content.
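To make the effect concrete, here is a minimal sketch of what a naive crawler would collect. The HTML fragment and the class name `hs_kw0_mainAB` are made up for illustration; only the `hs_kw` prefix comes from the page analysed in this post.

```python
import re

# Hypothetical post fragment: the visible word lives in a CSS rule,
# so the span itself carries no text at all.
html = "<div>这辆车<span class='hs_kw0_mainAB'></span>不错</div>"


def naive_text(fragment):
    """Strip every tag, the way a simple crawler extracts text."""
    return re.sub(r"<[^>]+>", "", fragment)


text = naive_text(html)
print(text)  # the word hidden behind the span is simply missing
```

A browser would render a word between the two halves of the sentence, but the extracted string has a gap there.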
Principle Analysis:
First, look at the style of the span tags.
This is the HTML panel of Firebug in Firefox. Each span tag in the post body has a style whose content is a piece of text, so finding the text that belongs to
each span's class attribute is enough to restore the post. I therefore looked for where the CSS style is defined and found it listed under a URL
in the Firebug CSS panel, which shows all the CSS files. When I tried opening that URL, though, the result was still the post page rather than a CSS file.
Packet capture did not turn up any such CSS file either. After a few more attempts I realized the CSS must be generated by JS, so I started looking
for the JS code that generates it. Basically this meant searching every JS file and the page source for various keywords, for example: before, content, hs_kw,
and so on. It turned out that the JS code is in the page source itself, and searching for Hs_zy locates it. What follows is the analysis and cracking of that JS code.
Copy the JS code into http://jsbeautifier.org/, a JS formatting site. After formatting, it turned out that the code is obfuscated, which is quite a pain.
Fortunately I had run into this kind of obfuscation before and roughly understood how it works: variable names are replaced at random, complete code is split into pieces that are joined back through extra variables, and so on.
So I went through the tedious work of substituting the variables back by hand until I roughly understood the main logic of the JS code.
(function(hz_) {
    function ew_ () {
        = Dv_ () [decodeURIComponent] ('%e3%80%81%e3%80%82%e4%b8%80%e4%b8%8a%e4%b8%8b%e4%b8%8d%e4%ba%86%e4%ba%94%e5%92%8c%E5%9c%b0%e5%a4%9a%e5%a4%a7%e5%a5%bd%e5%b0%8f%e5%be%88%e5%be%97%e6%98%af%e7%9a%84%e7%9d%80%e8%bf%9c%e9%95%bf%e9%ab%98%ef%bc%81%ef%bc%8c%ef%bc%9f'? Yc_ ());
        = La_ ((yc_ () 23; 3; 19; 17; 9; 1; 8; 12; 18; 13; 2; 4; 16; 5; 6; 21; 15; 11; 22; 14; 24; 0; 10; 7) 20), lf_ (;));
        = La_ ((10 _7, 6 _0; 2 _33, 14 _18; 8 _45, 8 _36; 0 _71, 16 _54; 13 _76, 3 _72; 0 _107, 16 _90; 15 _110, 1 _108; 4 _139, 12 _126; 9 _152, 7 _144; 10 _169, 6 _162; 4 _193, 12 _180; 11 _204, 5 _198; 3 _230, 13 _216; 1 _250, 15 _234; 13 _256, 3 _252; 6 _281, 10 _270; 9 _296, 7 _288; 13 _310, 3 _306; 6 _335, 10 _324; 7 _352, 9 _342; 6 _371, 10 _360; 5 _390, 11 _378; 5 _408, 11 _396; 7 _424, 9 _414; 6 _443, 10_432lf_ (;)), yc_ (;));
        Uj_ ();
        return;;
    }
    function Ms_ () {
        for(gx_ = 0; Gx_ < Nf_.length; gx_++) {
            var Su_ = Pn_ (Nf_[gx_], ', ');
            var Kn_ = ' ';
            for(bk_ = 0; Bk_ < Su_.length; bk_++) {
                Kn_+ = Ui_ (Su_[bk_]) + ";
            }
            kx_ (Gx_, kn_);
        }
    }
    function Nh_ (gx_) {
        return'. hs_kw ' + gx_ + ' _MAINDC ';
    }
    function Ln_ () {
        return':: Before {content: '
}) (document);
The logic is very simple. The words to be replaced are defined in advance: the long percent-encoded string in the code above is the string of replaced text. Then the ordinal of each piece of text
is defined, the text string is reordered by those ordinals, and the CSS styles are created. Note that the class attribute of each span tag contains an ordinal,
which is used to locate the text it corresponds to.
So what remains is to find the text string in the JS code, find the ordering of the text string, rearrange it, and then substitute the text back into the original post
according to each span tag's ordinal to get the full content.
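The decoding and reordering can be sketched in Python. `ENCODED` and `ORDER` are copied from the fragment above (with the stray spaces removed); the direction of the mapping is my assumption, since the fragment alone does not say whether the i-th ordinal picks the character for class number i or the other way round.

```python
from urllib.parse import unquote

# Percent-encoded text string taken from the JS fragment above.
ENCODED = (
    "%e3%80%81%e3%80%82%e4%b8%80%e4%b8%8a%e4%b8%8b%e4%b8%8d%e4%ba%86"
    "%e4%ba%94%e5%92%8c%e5%9c%b0%e5%a4%9a%e5%a4%a7%e5%a5%bd%e5%b0%8f"
    "%e5%be%88%e5%be%97%e6%98%af%e7%9a%84%e7%9d%80%e8%bf%9c%e9%95%bf"
    "%e9%ab%98%ef%bc%81%ef%bc%8c%ef%bc%9f"
)
# Ordinals taken from the same fragment.
ORDER = "23;3;19;17;9;1;8;12;18;13;2;4;16;5;6;21;15;11;22;14;24;0;10;7;20"

# Same job as decodeURIComponent in the JS: UTF-8 percent-decoding.
chars = unquote(ENCODED)
order = [int(n) for n in ORDER.split(";")]

# ASSUMPTION: the i-th ordinal names the character shown by class number i.
table = {i: chars[n] for i, n in enumerate(order)}

for number, char in table.items():
    print(number, char)
```

If the restored posts read as nonsense, the opposite reading (`table = {n: chars[i] ...}`) is the first thing to try.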
Crack steps:
To summarize:
1. Find the replaced text string and its order in the JS code
2. Rearrange the text string
3. Replace the span tags in the original text according to their class numbers
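Step 3 might look like the following sketch. The `table` contents, the sample HTML, and the exact class name layout (`hs_kw<number>_<suffix>`) are assumptions for illustration; in practice the table comes out of steps 1 and 2.

```python
import re

# Hypothetical lookup table produced by steps 1 and 2: class number -> text.
table = {0: "的", 1: "很"}

# ASSUMPTION: classes look like hs_kw<number>_<suffix> and the span is empty.
SPAN_RE = re.compile(r"<span\s+class=['\"]hs_kw(\d+)_\w+['\"]>\s*</span>")


def restore(html, table):
    """Replace each empty hs_kw span with the text its class number maps to."""
    def repl(match):
        # Leave the span untouched when its number is not in the table.
        return table.get(int(match.group(1)), match.group(0))
    return SPAN_RE.sub(repl, html)


html = ("这车<span class='hs_kw1_mainAB'></span>好开"
        "<span class='hs_kw0_mainAB'></span>")
restored = restore(html, table)
print(restored)
```

Leaving unknown spans in place rather than deleting them makes it easy to spot a table that is incomplete.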
Steps 2 and 3 are fairly simple. The focus is step 1, finding the text string and the substitution order. Because the JS code in the page source is obfuscated, you cannot see directly which part
is the text string, so the first task is to deobfuscate the JS code. Deobfuscation here does not have to restore all of the JS code completely; it only needs to go far enough
that we can see what the text string and the order are.
The idea behind the deobfuscation is actually very simple; it is just a bit of a hassle to carry out. Obfuscation takes a simple variable definition and implements it with a complicated piece of JS code.
But this obfuscation is limited: when the tool generates obfuscated code, it has to draw on a handful of patterns that a person defined in advance,
and anything defined by hand is finite, so once you have found all the patterns you can restore the code. As an example,
function iq_() { 'return iq_'; return '3'; }
In effect you can think of this code as the variable iq_() being equal to '3'. Match the code pattern with a regular expression, extract the key items (the function name and the final
return value), and save them for a full-text substitution over the JS code.
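A minimal sketch of that idea in Python follows. The regular expression is my own guess at the pattern's exact layout, and `inline_constants` is a hypothetical helper name, not something taken from the original code.

```python
import re

# ASSUMED layout: function NAME(){'return NAME';return 'VALUE';}
SIMPLE_FN = re.compile(
    r"function\s+(\w+)\s*\(\)\s*\{\s*'return\s+\1';\s*return\s+'([^']*)';\s*\}"
)


def inline_constants(js):
    """Drop constant-returning functions and inline their values at call sites."""
    consts = {}

    def grab(match):
        consts[match.group(1)] = match.group(2)
        return ""  # remove the definition itself

    js = SIMPLE_FN.sub(grab, js)
    for name, value in consts.items():
        call = re.compile(r"\b%s\(\)" % re.escape(name))
        # A function replacement avoids backslash handling in the value.
        js = call.sub(lambda m, v=value: "'%s'" % v, js)
    return js, consts


sample = "function iq_(){'return iq_';return '3';} var a = iq_() + iq_();"
cleaned, consts = inline_constants(sample)
print(cleaned.strip())
print(consts)
```

Running the pass repeatedly until nothing changes is a reasonable way to handle definitions that only become visible after an earlier substitution.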
function cz_() { function _c() { return 'cz_'; }; if (_c() == 'cz__') { return _c(); } else { return 'n'; } }
This one is a bit more complicated because it adds a condition, but it is still simple. Match the pattern with a regular expression and extract the key items: the function name, the value of the first return,
the value after ==, and the value of the last return. Then evaluate the condition yourself to work out what cz_() returns, and save that for the full-text substitution.
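The same approach for the conditional pattern, again as a sketch: the regular expression assumes the layout of the example above, and `resolve_branch` is a hypothetical helper name.

```python
import re

# ASSUMED layout, following the cz_ example:
# function NAME(){function INNER(){return 'A';}; if(INNER()=='B'){return INNER();}else{return 'C';}}
BRANCH_FN = re.compile(
    r"function\s+(\w+)\s*\(\)\s*\{\s*"
    r"function\s+(\w+)\s*\(\)\s*\{\s*return\s+'([^']*)';\s*\};?\s*"
    r"if\s*\(\s*\2\(\)\s*==\s*'([^']*)'\s*\)\s*\{\s*return\s+\2\(\);\s*\}\s*"
    r"else\s*\{\s*return\s+'([^']*)';\s*\}\s*\}"
)


def resolve_branch(js):
    """Return (function name, constant value) for the first match, or None."""
    match = BRANCH_FN.search(js)
    if match is None:
        return None
    name, _inner, inner_value, compared, else_value = match.groups()
    # Evaluate the comparison ourselves instead of running the JS.
    value = inner_value if inner_value == compared else else_value
    return name, value


sample = ("function cz_(){function _c(){return 'cz_';}; "
          "if(_c()=='cz__'){return _c();}else{return 'n';}}")
result = resolve_branch(sample)
print(result)
```

Here `'cz_'` is not equal to `'cz__'`, so the else branch wins and `cz_()` resolves to `'n'`.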
And so on: each pattern can be deobfuscated by extracting its key items with a regular expression and substituting them across the full text. In the end we get roughly restored JS code in which
the text string and the order are clearly visible, and a regular expression can then pull them out. One thing to note is that sometimes the replaced text is not a single character but
a word of several characters, in which case the order looks like "3,1;23,5". These small tricks are no real obstacle and are easy to handle.
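Handling an order such as "3,1;23,5" could look like this sketch. I am assuming that `;` separates one replaced word from the next and `,` separates the characters inside a word; `build_words` is a hypothetical helper, and the alphabet is a stand-in for the decoded text string.

```python
def build_words(chars, spec):
    """Turn an order spec like "3,1;23,5" into the list of replaced words.

    ASSUMPTION: ';' separates words, ',' separates characters within a word.
    """
    words = []
    for group in spec.split(";"):
        indexes = [int(i) for i in group.split(",") if i.strip()]
        words.append("".join(chars[i] for i in indexes))
    return words


# Stand-in for the decoded text string, so the indexes are easy to follow.
chars = "abcdefghijklmnopqrstuvwxyz"
words = build_words(chars, "3,1;23,5;7")
print(words)
```

A spec with no commas at all falls out as the single-character case, so one function covers both forms.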
Conclusion:
I recommend working through this yourself; it is quite fun. For the complete cracking code, see my GitHub:
https://github.com/duanyifei/antispider/blob/master/autohome.py
Original address: http://www.cnblogs.com/dyfblog/p/6753251.html
Anti-crawler cracking series: how to crack Autohome's replacement of text with CSS styles