I wanted to scrape the images from a photo site, but found that it had DDoS protection enabled. The site first displays this message:
This process is automatic. Your browser will redirect to your requested content shortly.
It asks you to wait a few seconds while it checks the browser, and then uses a 302 redirect to jump to the real page (the address of the real page, of course, stays the same).
This is what the browser shows while waiting:
[Screenshot: Im_under_attack_page]
A detailed description of this protection is here: CloudFlare Advanced DDoS Protection.
Let's see how to solve the problem.
Source
Viewing the page source reveals the following JavaScript:
(function () {
    var a = function () { try { return !!window.addEventListener } catch (e) { return !1 } },
        b = function (b, c) { a() ? document.addEventListener("DOMContentLoaded", b, c) : document.attachEvent("onreadystatechange", b) };
    b(function () {
        var a = document.getElementById('cf-content'); a.style.display = 'block';
        setTimeout(function () {
            var t, r, a, f, cgksouw = {"brm": +((+!![]+[])+(!+[]+!![]))};
            // t ends up holding the host name of the current site
            t = document.createElement('div');
            t.innerHTML = "<a href='/'>x</a>";
            t = t.firstChild.href; r = t.match(/https?:\/\//)[0];
            t = t.substr(r.length); t = t.substr(0, t.length - 1);
            a = document.getElementById('jschl-answer');
            f = document.getElementById('challenge-form');
            cgksouw.brm -= +((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));
            cgksouw.brm *= +((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]));
            cgksouw.brm *= +((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]));
            cgksouw.brm += +((!+[]+!![]+[])+(!+[]+!![]));
            cgksouw.brm *= +((!+[]+!![]+!![]+[])+(+[]));
            a.value = parseInt(cgksouw.brm, 10) + t.length;
            f.submit();
        }, 5000);
    }, false);
})();
These are the HTML elements that the JavaScript refers to:
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
<p data-translate="allow_5_secs">Please allow up to 5 seconds…</p>
</div>
<form id="challenge-form" action="/cdn-cgi/l/chk_jschl" method="get">
<input type="hidden" name="jschl_vc" value="c841963d655c6bb040c4ef02fab8d5a3"/>
<input type="hidden" name="pass" value="1441270501.274-wm96e3sqx+"/>
<input type="hidden" id="jschl-answer" name="jschl_answer"/>
</form>
Analysis
The main purpose of the code above is to determine whether the client visiting the site is a real browser.
When you visit the site's home page, the first response is a 503 error whose body is the page shown above. The page waits 5 seconds so that the user can read the message, computes a new value from a set of random values, writes it into jschl-answer, and finally submits it through the challenge-form form.
After the submission succeeds, /cdn-cgi/l/chk_jschl uses a 302 redirect to return the actual page content.
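To make the submission step concrete, here is a minimal sketch of the GET request that submitting the form amounts to. The helper name buildChallengeUrl and the answer value 42 are placeholders of my own; the field names and values come from the sample form above, and the site address is the placeholder used later in this article.

```javascript
// Hypothetical helper: builds the URL that is requested when the
// challenge form is submitted with method "get".
function buildChallengeUrl(site, fields) {
    var parts = [];
    for (var name in fields) {
        if (Object.prototype.hasOwnProperty.call(fields, name)) {
            // Encoding matters here: the sample "pass" value ends in "+"
            parts.push(encodeURIComponent(name) + '=' + encodeURIComponent(fields[name]));
        }
    }
    return site + '/cdn-cgi/l/chk_jschl?' + parts.join('&');
}

var url = buildChallengeUrl('http://examples.com', {
    jschl_vc: 'c841963d655c6bb040c4ef02fab8d5a3',
    pass: '1441270501.274-wm96e3sqx+',
    jschl_answer: 42 // placeholder, not a real answer
});
console.log(url);
```

The "+" at the end of the pass value comes out as %2B in the query string; left unencoded, it would be read as a space on the server side.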
cgksouw is a random name, and the value of the pass input also varies. Both change every time you load this page.
So how exactly is the value of jschl_answer calculated before the form is submitted?
The code following f = document.getElementById('challenge-form'); shows that the calculation is nothing more than a series of subtractions, multiplications, and additions on cgksouw.brm. The result is converted to an integer, the length of the host name (t.length) is added, and the sum is submitted. The real question is: what do those seemingly inexplicable runs of plus signs, exclamation marks, parentheses, and square brackets mean?
It is just simple obfuscation.
Let's look at the initial value of cgksouw.brm, which is equal to the number 12:
+((+!![]+[])+(!+[]+!![]))
We can understand this by splitting the expression into sub-expressions.
Start with the first nested parenthesis, (+!![]+[]). Open the browser console, type +!![] and press Enter: you get the number 1.
Ignore the plus sign for a moment and look at !![] alone: its value is the boolean true. This is because [] is a valid Array object, which is truthy; negating it twice still gives true.
The plus sign on the left has no left-hand operand, so it is the unary plus, which performs a numeric conversion and turns true into 1.
Moving on, there is another plus sign followed by []. In this addition the empty array is converted to a string by default, and [].toString() is the empty string. The number 1 plus an empty string gives the string "1".
By the same reasoning, the value in the second nested parenthesis, (!+[]+!![]), is the number 2.
Next, the string "1" plus the number 2 gives the string "12".
Finally, the leftmost plus sign converts the string "12" to the number 12.
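The steps above can be checked one by one in a console or in Node.js. This is only a small sketch; the array name steps is my own.

```javascript
// Each entry mirrors one step of the explanation above.
var steps = [
    !![],                              // an array is truthy; negated twice it is true
    +!![],                             // unary plus converts true to the number 1
    +!![] + [],                        // number + empty array gives the string "1"
    !+[] + !![],                       // true + true gives the number 2
    (+!![] + []) + (!+[] + !![]),      // string "1" + number 2 gives the string "12"
    +((+!![] + []) + (!+[] + !![]))    // unary plus converts "12" to the number 12
];
steps.forEach(function (value) {
    console.log(typeof value, JSON.stringify(value));
});
```

The typeof column makes the switches between boolean, number, and string visible, which is the whole trick behind the obfuscation.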
So this seemingly garbled code merely produces a few random numbers that are easy for the server to verify. My guess is that these values are meaningful and tied to the server time, and that the server uses them to decide whether the client is a standard browser.
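Applying the same decoding to every statement of the sample challenge gives the whole calculation. The sketch below assumes that each statement follows the same +((a+[])+(b)) pattern as the initial value, and that the host name is examples.com, the placeholder address used later in this article. The variable names brm, host, and answer are my own.

```javascript
// The statements of the sample challenge, applied to a plain variable.
var brm = +((+!![]+[])+(!+[]+!![]));                                          // 12
brm -= +((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![])); // minus 48
brm *= +((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]));                     // times 26
brm *= +((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]));                // times 45
brm += +((!+[]+!![]+[])+(!+[]+!![]));                                         // plus 22
brm *= +((!+[]+!![]+!![]+[])+(+[]));                                          // times 30

// Assumed host name; the page derives it from the href of a link to "/".
var host = 'examples.com';
var answer = parseInt(brm, 10) + host.length;
console.log(brm, answer);
```

For this sample the intermediate value is -1262940 and, with a 12-character host name, the submitted answer is -1262928. A different challenge or a different host name gives a different result.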
Choice of tools
After the analysis, it was time to choose the technology.
Python is naturally the first choice for writing a crawler. At first I planned to use requests to simulate the requests, BeautifulSoup4 to parse the resulting HTML, and then download the resources.
But requests is not good at simulating browser behavior and cannot execute JavaScript, so a library such as PyV8 would also be needed to perform the calculation above and obtain the final answer.
PyExecJS makes it possible to call JavaScript engines from a Python environment; it supports PyV8, Node.js, and others.
After getting past the DDoS protection, you also need to keep the cookies and build the request headers.
Since the calculation depends on another engine to run JavaScript anyway, it is better to write the whole thing in JavaScript directly.
This is where a headless browser comes in.
Headless Browser
A headless browser is, in layman's terms, a fully functional browser without a user interface. The most popular headless browser framework is undoubtedly PhantomJS. It is built on WebKit, and for tasks such as taking a screenshot of an entire web page it is the best tool there is.
However, saving the images in a web page with PhantomJS is a bit of a hassle. Because of the DDoS protection described above, each individual image file is protected too, so you cannot just take the address and download it directly; the image has to be opened in the browser.
So I chose another framework, SlimerJS. Its features and API are largely the same as those of PhantomJS; the difference is that it uses the Gecko engine of Mozilla Firefox.
The convenient thing about SlimerJS is that the response passed to its onResourceReceived callback has a body property, which contains the content of the resource being accessed (CSS, JS, an image, or the page itself). I only need to write it to a file to get the image I want, like this:
function onResourceReceived(response)
{
    // Only handle image requests that have finished and carry a body
    if (state != 'image' || response.stage != "end" || response.stage == "fail" || !response.bodySize) return;
    console.log('[onResourceReceived] id: ' + response.id + ' stage: ' + response.stage + ' contentType: ' + response.contentType + ' url: ' + response.url);
    var curGalleryObj = galleryList[curGallery];
    var curImageObj = curGalleryImages[curImage];
    // Create the gallery directory on first use
    if (!fs.exists(curGalleryObj.dir))
    {
        fs.makeTree(curGalleryObj.dir);
    }
    var fname = curGalleryObj.dir + "/" + curImage + "." + curImageObj.ext;
    curImageObj.file = fname;
    curImageObj.done = true;
    // Write the response body to disk in binary mode
    fs.write(fname, response.body, 'b');
}
Download process
Below are the more important parts of the download implementation. It is actually very simple.
function getGalleryList(htmlFile, callback)
{
    // htmlFile is a local HTML file, used for convenience
    var file = htmlDir + '/' + htmlFile + '.html';
    var p = webpage.create();
    // Do not run JavaScript in this file: some scripts are hosted on Google's servers and download slowly
    p.settings.javascriptEnabled = false;
    // Do not load the images in this file either, for the same reason
    p.settings.loadImages = false;
    p.open(file, function(status)
    {
        // Call a DOM function inside the page to get the list of DOM elements
        var list = p.evaluate(function()
        {
            return document.querySelectorAll('.content .gallery-list div a');
        });
        var newList = [];
        // Build the list of galleries to download from the link list
        for (var i = 0; i < list.length; i++)
        {
            var href = list[i].href.slice(8);
            var arr = href.split('/');
            var obj = {};
            obj.url = href;
            obj.id = arr[1];
            obj.name = arr[2].split('.')[0];
            obj.dir = imageDir + '/[' + obj.id + '] ' + obj.name;
            obj.index = site + '/images/' + obj.id + '/' + obj.name + '.html';
            obj.progress = obj.dir + '/progress.json';
            newList[i] = obj;
        }
        callback(newList);
    });
}
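To show what one entry of the gallery list looks like, here is the same string handling as a standalone function that runs without SlimerJS. The function name galleryFromHref and the sample link are assumptions of mine; the article does not show the real links, so the file:///images/<id>/<name>.html shape is only a guess that fits the slice(8) and the array indexes used above.

```javascript
// Same values as the globals in the article's script.
var imageDir = 'images';
var site = 'http://examples.com';

// Hypothetical standalone version of the loop body in getGalleryList.
function galleryFromHref(href) {
    var path = href.slice(8);        // drop an 8-character prefix such as "file:///"
    var arr = path.split('/');
    var obj = {};
    obj.url = path;
    obj.id = arr[1];
    obj.name = arr[2].split('.')[0];
    obj.dir = imageDir + '/[' + obj.id + '] ' + obj.name;
    obj.index = site + '/images/' + obj.id + '/' + obj.name + '.html';
    obj.progress = obj.dir + '/progress.json';
    return obj;
}

// Assumed link shape, for illustration only.
var sample = galleryFromHref('file:///images/1234/sunset.html');
console.log(sample);
```

With this sample link the gallery is saved under images/[1234] sunset, and its progress file is images/[1234] sunset/progress.json.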
function onIndexOpen(status)
{
    console.log('[onIndexOpen]');
    state = 'index';
    var intervalId = null;
    var t = 0;
    // Give up once the check has run more than 60 times
    var _checkTimeout = function()
    {
        if (t > 60)
        {
            console.log('!!!! Timeout !!!!');
            clearInterval(intervalId);
            slimer.exit();
        }
    };
    var _checkTitle = function()
    {
        console.log('[_checkTitle]');
        console.log('[' + t + '] Check title ... ' + page.title);
        t++;
        _checkTimeout();
        // If the correct title appears, the real page has loaded successfully
        if (page.title.indexOf('Welcome') == 0)
        {
            clearInterval(intervalId);
            t = 0;
            page.close();
            openGallery();
        }
    };
    // Check the page title once per second
    intervalId = setInterval(_checkTitle, 1000);
}
function openGallery()
{
    console.log('[openGallery ' + curGallery + ']');
    if (curGallery >= galleryList.length)
    {
        console.log('=== Download all gallery done! ===');
        slimer.exit();
        return;
    }
    curImage = 0;
    var curGalleryObj = galleryList[curGallery];
    // The image list is already saved locally, no need to fetch it again
    if (fs.exists(curGalleryObj.progress))
    {
        curGalleryImages = JSON.parse(fs.read(curGalleryObj.progress, 'r')).images;
        // This file already exists locally (downloaded successfully last time), move on to the next one
        while (curGalleryImages[curImage] && curGalleryImages[curImage].done)
        {
            console.log('The file ' + curGalleryImages[curImage].file + ' is downloaded.');
            curImage++;
        }
        // Start downloading files
        downloadImage();
    }
    else
    {
        state = 'gallery';
        page.open(curGalleryObj.index, onGalleryOpen);
    }
}
function onGalleryOpen(status)
{
    console.log('[onGalleryOpen ' + curGallery + ']');
    // A script tag holds the image list in JSON format, so it can be read directly as a string
    // and there is no need to dig through the DOM
    var script = page.evaluate(function()
    {
        return document.querySelector('body > div.outer-wrapper.image-page > div.page-wrapper.page-wrapper-full > script');
    });
    if (script)
    {
        var re = /\"images\":(.+\])\}\)/;
        var images = script.innerHTML.match(re)[1];
        curGalleryImages = JSON.parse(images);
        // Split each file name into its name and extension
        curGalleryImages.forEach(function(e, i, a)
        {
            var name = e.f.split('.');
            a[i].ext = name[1];
            a[i].name = name[0];
        });
        curImage = 0;
        downloadImage();
    }
    else
    {
        console.log('NO SCRIPT');
        slimer.exit();
    }
}
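The extraction step in onGalleryOpen can be tried outside the browser. In this sketch the script text is made up (the article does not show the real one); only the "images":[...]}) tail matters. It also assumes the regular expression reads /\"images\":(.+\])\}\)/. The f field is the one the article's code reads.

```javascript
// Made-up script body, for illustration only.
var scriptText = 'gallery.init({"id":1234,"images":[{"f":"001.jpg"},{"f":"002.png"}]});';

// Capture everything from the opening "[" to the "]" that precedes "})".
var re = /\"images\":(.+\])\}\)/;
var images = JSON.parse(scriptText.match(re)[1]);

// Split each file name into name and extension, as onGalleryOpen does.
images.forEach(function (e, i, a) {
    var name = e.f.split('.');
    a[i].name = name[0];
    a[i].ext = name[1];
});
console.log(images);
```

Note that split('.') takes the second piece as the extension, so a file name that contains more than one dot would get the wrong extension.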
function downloadImage()
{
    if (curImage >= curGalleryImages.length)
    {
        var curGalleryObj = galleryList[curGallery];
        console.log('=== Download ' + curGalleryObj.dir + ' has done. ===');
        // Download the next gallery
        curGallery++;
        openGallery();
        return;
    }
    state = 'image';
    var image = getCurImage();
    console.log('[downloadImage] ' + image);
    // Close the current page
    page.close();
    // Reuse the page object
    page.open(image);
}
// Checks whether the file is a standard JPEG
function checkJpeg(path)
{
    console.log('[checkJpeg ' + path + ' ' + fs.exists(path) + ']');
    if (fs.exists(path))
    {
        var content = fs.read(path, 'rb');
        // Require more than 10 KB and the header bytes FF D8 FF E0
        if (content &&
            content.length > 10240 &&
            content.charCodeAt(0) == 0xFF &&
            content.charCodeAt(1) == 0xD8 &&
            content.charCodeAt(2) == 0xFF &&
            content.charCodeAt(3) == 0xE0)
        {
            return true;
        }
    }
    return false;
}
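The byte check in checkJpeg can be tried on in-memory data. The four bytes FF D8 FF E0 identify JFIF files specifically; JPEG files written with an Exif header start with FF D8 FF E1 and would be rejected and downloaded again. If that matters for a given site, a more lenient variant could test only the first three bytes. This is a sketch of such a variant, not the check the script above uses; the function name hasJpegSignature is my own and the size check is left out.

```javascript
// Checks only the JPEG start-of-image marker (FF D8 FF) in a binary string,
// so both JFIF (FF E0) and Exif (FF E1) files pass.
function hasJpegSignature(content) {
    return content.length >= 3 &&
        content.charCodeAt(0) === 0xFF &&
        content.charCodeAt(1) === 0xD8 &&
        content.charCodeAt(2) === 0xFF;
}

var jfif = String.fromCharCode(0xFF, 0xD8, 0xFF, 0xE0);
var exif = String.fromCharCode(0xFF, 0xD8, 0xFF, 0xE1);
var png = String.fromCharCode(0x89, 0x50, 0x4E, 0x47);
console.log(hasJpegSignature(jfif), hasJpegSignature(exif), hasJpegSignature(png));
```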
function onLoadFinished(status, url, isFrame)
{
    loading = false;
    console.log('== [' + state + ']onLoadFinished loading page (' + url + ') ' + status + ' loading: ' + loading);
    if (state == 'image')
    {
        var curGalleryObj = galleryList[curGallery];
        var curImageObj = curGalleryImages[curImage];
        var progressObj = {'gallery': curGalleryObj, 'images': curGalleryImages};
        // Check the JPEG file header to see whether the file needs to be downloaded again
        if (checkJpeg(curImageObj.file))
        {
            curImageObj.done = true;
            curImage++;
        }
        else
        {
            curImageObj.done = false;
        }
        // Write the current download progress to the progress file, so that if the download
        // does not finish in one run, or the script exits abnormally, the next run knows where to resume
        fs.write(curGalleryObj.progress, JSON.stringify(progressObj), 'w');
        // Download the next file, or download this one again
        downloadImage();
    }
}
function onLoadStarted()
{
    loading = true;
    var currentUrl = page.evaluate(function() {
        return window.location.href;
    });
    console.log('== [' + state + ']onLoadStarted current page (' + currentUrl + ') will be gone. loading: ' + loading);
}
function onUrlChanged(targetUrl) {
    console.log('== [' + state + ']onUrlChanged new URL: ' + targetUrl + ' loading: ' + loading);
}
var webpage = require('webpage');
var fs = require('fs');
var page = webpage.create();
var loading = false;
var imageDir = 'images';
var htmlDir = 'html';
var site = 'http://examples.com';
var galleryList = null;
var curGallery = null;
var curGalleryImages = null;
var curImage = 0;
var state = null;
// Register the page callbacks defined above
page.onLoadStarted = onLoadStarted;
page.onLoadFinished = onLoadFinished;
page.onUrlChanged = onUrlChanged;
page.onResourceReceived = onResourceReceived;
// Read the gallery list from the local HTML file, then open the site's home page
getGalleryList('Favorites_0', function(list)
{
    galleryList = list;
    curGallery = 1;
    page.open(site, onIndexOpen);
});
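The resume behavior that openGallery and onLoadFinished implement together can be illustrated without SlimerJS. In the sketch below the progress data is invented, in the shape that onLoadFinished writes to progress.json, and the function name firstPendingIndex is my own.

```javascript
// Hypothetical progress data, serialized as it would be in progress.json.
var saved = JSON.stringify({
    gallery: { id: '1234', dir: 'images/[1234] sunset' },
    images: [
        { f: '001.jpg', done: true },
        { f: '002.jpg', done: true },
        { f: '003.jpg', done: false }
    ]
});

// Same skip loop as in openGallery: find the first image not yet downloaded.
function firstPendingIndex(images) {
    var i = 0;
    while (images[i] && images[i].done) {
        i++;
    }
    return i;
}

var restored = JSON.parse(saved).images;
var resumeAt = firstPendingIndex(restored);
console.log(resumeAt);
```

Here the download would resume at index 2. When every image is marked done, the function returns the length of the list, which is the condition downloadImage uses to move on to the next gallery.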