"Nodejs crawler" uses async to control concurrent writing a novel crawler

Recently, for a bookstore project, I needed to crawl some data. A quick search on Baidu turned up this site, and I picked the novel Ze Tian Ji as the example.

The crawler uses several modules: cheerio, superagent, and async.

superagent is an HTTP request module; details can be found in its documentation.
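For illustration, a plain GET request with superagent looks roughly like this (a minimal sketch; the URL is the index page used later in this post):

const superagent = require('superagent')

superagent
  .get('http://www.zwdu.com/book/8634/')
  .end(function (err, res) {
    if (err) return console.error(err)
    console.log(res.status) // HTTP status code, e.g. 200
    console.log(res.text)   // raw response body as a string
    // note: for GBK-encoded pages you also need superagent-charset, as shown later
  })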

cheerio is a document-parsing module with jQuery-like syntax; you can simply think of it as jQuery for Node.js.
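A minimal sketch of what that looks like (the HTML string here is made up for illustration):

const cheerio = require('cheerio')

// load an HTML string, then query it with jQuery-style selectors
const $ = cheerio.load('<ul id="list"><li>Chapter 1</li><li>Chapter 2</li></ul>')
console.log($('#list li').length)         // 2
console.log($('#list li').first().text()) // 'Chapter 1'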

async is an asynchronous flow-control module; here we mainly use async.mapLimit(coll, limit, iteratee, callback):

async.mapLimit(urls, 10, function (url, callback) {
  fetchUrl(url, callback, id)
}, function (err, results) {
  // TODO
})

The first parameter, coll, is an array holding the chapter URLs of the novel. The second parameter, limit, controls the number of concurrent requests. The third parameter, iteratee, is a function whose first argument is a single chapter URL and whose second argument is a callback; calling that callback saves the result (here, the content of one chapter) into results, which is passed to the fourth parameter, the final callback. results is an array holding the contents of all chapters.
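To make the concurrency and ordering behaviour concrete, here is a self-contained sketch of mapLimit, with setTimeout standing in for the HTTP request:

const async = require('async')

const coll = [1, 2, 3, 4, 5]

async.mapLimit(coll, 2, function (item, callback) {
  // simulate an asynchronous request; at most 2 of these run at any moment
  setTimeout(function () {
    callback(null, item * 10)
  }, Math.random() * 100)
}, function (err, results) {
  if (err) return console.error(err)
  console.log(results) // [10, 20, 30, 40, 50]: results follow the order of coll, not completion order
})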

We fetch the chapter data in fetchUrl.

First, from the novel's index-page URL, we collect the URLs of all chapters into the array urls:

superagent.get(url)
  .charset('gbk') // the site is GBK-encoded, so superagent-charset is used
  .end(function (err, res) {
    var $ = cheerio.load(res.text) // res.text is the page HTML; after cheerio.load we can use jQuery syntax on it
    let urls = []
    total = $('#list dd').length
    console.log(`Total ${total} chapters`)
    $('#list dd').each(function (i, v) {
      if (i < chapters) {
        urls.push('http://www.zwdu.com' + $(v).find('a').attr('href'))
      }
    })
  })

The fetchUrl function:

function fetchUrl(url, callback, id) {
  superagent.get(url)
    .charset('gbk')
    .end(function (err, res) {
      const $ = cheerio.load(res.text)
      // obj is a constructed object containing the chapter information
      callback(null, obj) // passes obj on to results in the fourth argument
    })
}

Full code:

/**
 * Created by tgxh on 2017/7/4.
 */
const cheerio = require('cheerio')
const express = require('express')
const app = express()
const superagent = require('superagent')
require('superagent-charset')(superagent)
const async = require('async')

let total = 0 // total number of chapters
let id = 0 // counter
const chapters = 10 // how many chapters to crawl
const url = 'http://www.zwdu.com/book/8634/'

// remove leading/trailing whitespace and &nbsp; entities
function trim(str) {
  return str.replace(/(^\s*)|(\s*$)/g, '').replace(/&nbsp;/g, '')
}

// convert numeric HTML entities (&#x....;) back to Chinese characters
function reconvert(str) {
  str = str.replace(/(&#x)(\w{1,4});/gi, function ($0) {
    return String.fromCharCode(parseInt(escape($0).replace(/(%26%23x)(\w{1,4})(%3b)/gi, '$2'), 16))
  })
  return str
}

function fetchUrl(url, callback, id) {
  superagent.get(url)
    .charset('gbk')
    .end(function (err, res) {
      let $ = cheerio.load(res.text)
      const arr = []
      const content = reconvert($('#content').html())
      // split the parsed HTML into paragraphs
      const contentArr = content.split('<br><br>')
      contentArr.forEach(elem => {
        const data = trim(elem.toString())
        arr.push(data)
      })
      const obj = {
        id: id,
        err: 0,
        bookName: $('.footer_cont a').text(),
        title: $('.bookname h1').text(),
        // the result is saved to MySQL, which cannot store an array directly,
        // so the array is joined into a string and split again when read back
        content: arr.join('-')
      }
      callback(null, obj)
    })
}

app.get('/', function (req, response, next) {
  superagent.get(url)
    .charset('gbk')
    .end(function (err, res) {
      var $ = cheerio.load(res.text)
      let urls = []
      total = $('#list dd').length
      console.log(`Total ${total} chapters`)
      $('#list dd').each(function (i, v) {
        if (i < chapters) {
          urls.push('http://www.zwdu.com' + $(v).find('a').attr('href'))
        }
      })
      async.mapLimit(urls, 10, function (url, callback) {
        id++ // chapters need numbering, so count with the id variable
        fetchUrl(url, callback, id)
      }, function (err, results) {
        response.send(results)
      })
    })
})

app.listen(3378, function () {
  console.log('Server listening on 3378')
})
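To try it out, install the dependencies (express, superagent, superagent-charset, cheerio and async) with npm, start the script with node, and open http://localhost:3378/ in a browser: the server crawls the first 10 chapters concurrently and responds with the results.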

The result is an array of chapter objects, each with id, err, bookName, title and content fields.

"Nodejs crawler" uses async to control concurrent writing a novel crawler

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.