Recently I have been working on a bookstore project that needs crawled data. A quick Baidu search turned up this site, and I will use the novel Ze Tian Ji (择天记) as the example.
The crawler uses a few modules: cheerio, superagent, and async.
superagent is an HTTP request module; see the linked documentation for details.
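For anyone new to it, a minimal superagent GET looks roughly like this (the URL below is just a placeholder, not the novel site):

const superagent = require('superagent')

// issue a GET request; res.text holds the response body as a string
superagent
  .get('http://example.com') // placeholder URL for illustration
  .end(function (err, res) {
    if (err) return console.error(err)
    console.log(res.text)
  })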
cheerio is an HTML parsing module with jQuery-like syntax; you can simply think of it as jQuery for Node.js.
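As a quick illustration (the HTML string below is made up for the example), cheerio.load returns a $ function that behaves much like jQuery:

const cheerio = require('cheerio')

// load an HTML string, then query it with familiar jQuery-style selectors
const $ = cheerio.load('<ul id="list"><li><a href="/1.html">Chapter 1</a></li></ul>')
console.log($('#list li a').text())       // "Chapter 1"
console.log($('#list li a').attr('href')) // "/1.html"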
async is an asynchronous flow-control module; here we mainly use its mapLimit(coll, limit, iteratee, callback) method:
async.mapLimit(urls, 10, function (url, callback) {
  fetchUrl(url, callback, id)
}, function (err, results) {
  // TODO
})
The first parameter, coll, is an array holding the chapter URLs of the novel. The second parameter, limit, controls the number of concurrent requests. The third parameter, iteratee, is a function that receives a single chapter URL as its first argument and a callback as its second; calling that callback hands the result (here, the content of one chapter) over to the fourth parameter, the final callback, whose results argument is an array collecting the contents of all chapters.
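Here is a minimal, self-contained sketch of mapLimit, with a setTimeout standing in for the real chapter request, just to show how the pieces fit together:

const async = require('async')

// three fake chapter URLs; at most 2 tasks run at the same time
async.mapLimit(['url1', 'url2', 'url3'], 2, function (url, callback) {
  // simulate an asynchronous fetch; callback(err, value) collects value into results
  setTimeout(function () {
    callback(null, 'content of ' + url)
  }, 100)
}, function (err, results) {
  // results preserves the order of the input array
  console.log(results) // ['content of url1', 'content of url2', 'content of url3']
})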
We fetch the data for each chapter inside fetchUrl.
First, based on the novel's index page URL, we collect the URLs of all chapters into the urls array:
superagent.get(url)
  .charset('gbk') // the site is GBK-encoded, so superagent-charset is used
  .end(function (err, res) {
    var $ = cheerio.load(res.text) // res.text is the page HTML; after cheerio.load() it can be queried with jQuery syntax
    let urls = []
    total = $('#list dd').length
    console.log(`Total ${$('#list dd').length} chapters`)
    $('#list dd').each(function (i, v) {
      if (i < chapters) {
        urls.push('http://www.zwdu.com' + $(v).find('a').attr('href'))
      }
    })
  })
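Note that .charset('gbk') is not part of superagent itself; it comes from the superagent-charset plugin, which has to be wired in once before use (the full code below does exactly this):

const superagent = require('superagent')
require('superagent-charset')(superagent) // patches superagent with a .charset() method so GBK pages decode correctly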
The fetchUrl function:
function fetchUrl(url, callback, id) {
  superagent.get(url)
    .charset('gbk')
    .end(function (err, res) {
      let $ = cheerio.load(res.text)
      // ...build obj, an object holding the chapter information
      callback(null, obj) // pass obj into results via the fourth argument of mapLimit
    })
}
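For context, the obj that gets passed to callback is built from the parsed chapter page; in the full code below it has roughly this shape:

const obj = {
  id: id,                               // chapter number, passed in by the caller
  err: 0,
  bookName: $('.footer_cont a').text(), // book name scraped from the page footer
  title: $('.bookname h1').text(),      // chapter title
  content: arr.join('-')                // chapter paragraphs joined into one string for MySQL
}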
Full code:
/**
 * Created by tgxh on 2017/7/4.
 */
const cheerio = require('cheerio')
const express = require('express')
const app = express()
const superagent = require('superagent')
require('superagent-charset')(superagent)
const async = require('async')

let total = 0        // total number of chapters
let id = 0           // counter used to number chapters
const chapters = 10  // how many chapters to crawl
const url = 'http://www.zwdu.com/book/8634/'

// remove leading/trailing whitespace and &nbsp; escape characters
function trim(str) {
  return str.replace(/(^\s*)|(\s*$)/g, '').replace(/&nbsp;/g, '')
}

// convert Unicode entities (&#x....;) back to Chinese characters
function reconvert(str) {
  str = str.replace(/(&#x)(\w{1,4});/gi, function ($0) {
    return String.fromCharCode(parseInt(escape($0).replace(/(%26%23x)(\w{1,4})(%3b)/g, '$2'), 16))
  })
  return str
}

function fetchUrl(url, callback, id) {
  superagent.get(url)
    .charset('gbk')
    .end(function (err, res) {
      let $ = cheerio.load(res.text)
      const arr = []
      const content = reconvert($('#content').html())
      // split the parsed HTML into paragraphs
      const contentArr = content.split('<br><br>')
      contentArr.forEach(elem => {
        const data = trim(elem.toString())
        arr.push(data)
      })
      const obj = {
        id: id,
        err: 0,
        bookName: $('.footer_cont a').text(),
        title: $('.bookname h1').text(),
        // MySQL cannot store an array directly, so the paragraphs are joined into
        // a string here and split again on the separator when read back
        content: arr.join('-')
      }
      callback(null, obj)
    })
}

app.get('/', function (req, response, next) {
  superagent.get(url)
    .charset('gbk')
    .end(function (err, res) {
      var $ = cheerio.load(res.text)
      let urls = []
      total = $('#list dd').length
      console.log(`Total ${$('#list dd').length} chapters`)
      $('#list dd').each(function (i, v) {
        if (i < chapters) {
          urls.push('http://www.zwdu.com' + $(v).find('a').attr('href'))
        }
      })
      async.mapLimit(urls, 10, function (url, callback) {
        id++ // chapters need numbering, so count with the id variable
        fetchUrl(url, callback, id)
      }, function (err, results) {
        response.send(results)
      })
    })
})

app.listen(3378, function () {
  console.log('Server listening on 3378')
})
Start the server and visit http://localhost:3378/; the crawled chapters come back as a JSON array of the objects described above.
"Nodejs crawler" uses async to control concurrent writing a novel crawler