The previous blog post explained how to crawl Cnblogs with Node.js; this time we will download images from the web.
The third-party modules we need are:
superagent
superagent-charset (manually set the specified encoding, to fix garbled GBK Chinese text)
cheerio
express
async (concurrency control)
The complete code can be downloaded from my GitHub; the main logic is in netbian.js.
We will use the "Landscape" category (http://www.netbian.com/fengjing/index.htm) of the Bi'an Desktop wallpaper site (http://www.netbian.com/) as the example.
1. Analyzing URLs
It is not hard to find the pattern:
Home page: column/index.htm
Pagination: column/index_<page number>.htm
Knowing this rule, you can download wallpapers in bulk.
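As an illustration, the rule above can be captured in a small helper (the function name `pageUrl` is ours, not part of the original project code):

```javascript
// Hypothetical helper: builds the list-page URL for a category and page
// number, following the rule described above.
function pageUrl(category, page) {
  var base = 'http://www.netbian.com/' + category + '/';
  // the first page is index.htm; later pages are index_<n>.htm
  return page === 1 ? base + 'index.htm' : base + 'index_' + page + '.htm';
}

console.log(pageUrl('fengjing', 1)); // http://www.netbian.com/fengjing/index.htm
console.log(pageUrl('fengjing', 3)); // http://www.netbian.com/fengjing/index_3.htm
```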
2. Analyze the wallpaper thumbnails to find the larger image for each wallpaper
Using Chrome's developer tools, you can see that the thumbnail list sits inside a div with class="list", and the href attribute of each a tag points to the page for that single wallpaper.
Part of the code:
```javascript
request
  .get(url)
  .end(function(err, sres) {
    var $ = cheerio.load(sres.text);
    var pic_url = []; // array of medium-image page links
    $('.list ul', 0).find('li').each(function(index, ele) {
      var ele = $(ele);
      var href = ele.find('a').eq(0).attr('href'); // medium-image page link
      if (href != undefined) {
        pic_url.push(url_model.resolve(domain, href));
      }
    });
  });
```
3. Continue the analysis with "http://www.netbian.com/desk/17662.htm"
Open this page, and you will find that the wallpaper displayed here is still not the highest resolution.
Clicking the link under the "Download Wallpaper" button opens a new page.
4. Continue the analysis with "http://www.netbian.com/desk/17662-1920x1080.htm"
Open this page and we finally find the wallpaper we want to download, placed inside a table. For example, http://img.netbian.com/file/2017/0203/bb109369a1f2eb2e30e04a435f2be466.jpg is the final image URL we want to download (the behind-the-scenes BOSS finally shows up (@ ̄ー ̄@)).
The code to download the image:
```javascript
request.get(wallpaper_down_url).end(function(err, img_res) {
  if (img_res.status == 200) {
    // save the image content
    fs.writeFile(dir + '/' + wallpaper_down_title + path.extname(path.basename(wallpaper_down_url)),
      img_res.body, 'binary', function(err) {
        if (err) console.log(err);
      });
  }
});
```
Open a browser and visit http://localhost:1314/fengjing
Select a category and page count, then click the "Start" button:
The server is requested concurrently and the images are downloaded.
Complete ~
The downloaded images are stored by category + page number.
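To make the storage layout concrete, here is a hypothetical sketch of building a directory name from the category and page number (the helper `storageDir` and the exact naming scheme are our assumptions for illustration, not necessarily the project's convention):

```javascript
// Hypothetical sketch: build a storage directory name from the category
// and page number, e.g. ./wallpapers/fengjing-2. The "<category>-<page>"
// naming here is an assumption for illustration only.
function storageDir(base, category, page) {
  return base + '/' + category + '-' + page;
}

console.log(storageDir('./wallpapers', 'fengjing', 2)); // ./wallpapers/fengjing-2
```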
The full image-download code is attached below:
```javascript
/**
 * Download images
 * @param  {[type]} url  [image URL]
 * @param  {[type]} dir  [storage directory]
 * @param  {[type]} res  [description]
 * @return {[type]}      [description]
 */
var down_pic = function(url, dir, res) {

  var domain = 'http://www.netbian.com'; // domain name

  request
    .get(url)
    .end(function(err, sres) {

      var $ = cheerio.load(sres.text);
      var pic_url = []; // array of medium-image page links
      $('.list ul', 0).find('li').each(function(index, ele) {
        var ele = $(ele);
        var href = ele.find('a').eq(0).attr('href'); // medium-image page link
        if (href != undefined) {
          pic_url.push(url_model.resolve(domain, href));
        }
      });

      var count = 0;      // concurrency counter
      var wallpaper = []; // wallpaper array
      var fetchPic = function(_pic_url, callback) {

        count++; // concurrency plus 1
        var delay = parseInt((Math.random() * 10000000) % 2000);
        console.log('Current concurrency: ' + count + ', fetching image URL: ' + _pic_url + ', delay: ' + delay + ' ms');
        setTimeout(function() {
          // get the large-image page link
          request
            .get(_pic_url)
            .end(function(err, ares) {
              var $$ = cheerio.load(ares.text);
              var pic_down = url_model.resolve(domain, $$('.pic-down').find('a').attr('href')); // large-image page link

              count--; // concurrency minus 1

              // request the large-image page
              request
                .get(pic_down)
                .charset('gbk') // fetch the page as GBK
                .end(function(err, pic_res) {

                  var $$$ = cheerio.load(pic_res.text);
                  var wallpaper_down_url = $$$('#endimg').find('img').attr('src');   // URL
                  var wallpaper_down_title = $$$('#endimg').find('img').attr('alt'); // title

                  // download the large image
                  request
                    .get(wallpaper_down_url)
                    .end(function(err, img_res) {
                      if (img_res.status == 200) {
                        // save the image content
                        fs.writeFile(dir + '/' + wallpaper_down_title + path.extname(path.basename(wallpaper_down_url)),
                          img_res.body, 'binary', function(err) {
                            if (err) console.log(err);
                          });
                      }
                    });

                  wallpaper.push(wallpaper_down_title + ' download completed<br/>');
                });
              callback(null, wallpaper); // return the data
            });
        }, delay);
      };

      // concurrency of 2, download the wallpapers
      async.mapLimit(pic_url, 2, function(_pic_url, callback) {
        fetchPic(_pic_url, callback);
      }, function(err, result) {
        console.log('success');
        res.send(result[0]); // take the element at index 0
      });
    });
};
```
Two points deserve special attention:
1. The Bi'an Desktop pages are encoded in GBK, while Node.js itself only supports UTF-8. Here we introduce the superagent-charset module to handle the GBK encoding.
An example is given on its GitHub page:
https://github.com/magicdawn/superagent-charset
2. Node.js is asynchronous; sending a large number of requests at once may be treated as malicious by the server and rejected. We therefore introduce the async module for concurrency control, using its mapLimit method:
mapLimit(arr, limit, iterator, callback)
This method takes 4 parameters:
The 1st parameter is the array to process.
The 2nd parameter is the concurrency limit.
The 3rd parameter is the iterator, usually a function.
The 4th parameter is the callback invoked after all the concurrent work has finished.
This method feeds each element of arr to iterator, with at most limit calls running concurrently, and passes the collected results to the final callback.
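To make those semantics concrete, here is a minimal, dependency-free sketch of what mapLimit does (a simplified illustration only; the real async library handles many more edge cases):

```javascript
// Simplified sketch of async.mapLimit: run `iterator` over `arr` with at
// most `limit` tasks in flight, then hand all results to `callback`.
function mapLimit(arr, limit, iterator, callback) {
  var results = new Array(arr.length);
  var running = 0, next = 0, done = 0, finished = false;

  if (arr.length === 0) return callback(null, results);

  function launch() {
    // start tasks until we hit the concurrency limit or run out of items
    while (running < limit && next < arr.length) {
      (function(i) {
        running++;
        next++;
        iterator(arr[i], function(err, res) {
          if (finished) return;
          if (err) { finished = true; return callback(err); }
          results[i] = res; // results keep the order of arr
          running--;
          done++;
          if (done === arr.length) {
            finished = true;
            return callback(null, results);
          }
          launch(); // a slot freed up, start the next task
        });
      })(next);
    }
  }
  launch();
}

var demoResult;
mapLimit([1, 2, 3], 2, function(x, cb) { cb(null, x * 2); }, function(err, res) {
  demoResult = res;
});
console.log(demoResult); // [2, 4, 6]  (the demo iterator is synchronous)
```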
This completes the image download.
The complete code is already on GitHub; stars are welcome (☆▽☆).
My writing and knowledge are limited; if anything here is wrong, corrections from readers are welcome.