"Nodejs crawler" uses async to control concurrent writing a novel crawler

Recently, for a bookstore project, I needed to crawl some data. A quick search on Baidu turned up this site, and I picked the novel Ze Tian Ji as the example.

The crawler uses several modules: cheerio, superagent, and async.

superagent is an HTTP request module; details can be found in its documentation.
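For illustration, a plain GET request with superagent looks roughly like this (a minimal sketch; the URL is the index page used later in this post):

const superagent = require('superagent')

superagent
  .get('http://www.zwdu.com/book/8634/')
  .end(function (err, res) {
    if (err) return console.error(err)
    console.log(res.status) // HTTP status code, e.g. 200
    console.log(res.text)   // raw response body as a string
    // note: for GBK-encoded pages you also need superagent-charset, as shown later
  })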

cheerio is a document-parsing module with jQuery-like syntax; you can simply think of it as jQuery for Node.js.
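A minimal sketch of what that looks like (the HTML string here is made up for illustration):

const cheerio = require('cheerio')

// load an HTML string, then query it with jQuery-style selectors
const $ = cheerio.load('<ul id="list"><li>Chapter 1</li><li>Chapter 2</li></ul>')
console.log($('#list li').length)         // 2
console.log($('#list li').first().text()) // 'Chapter 1'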

async is an asynchronous flow-control module; here we mainly use async.mapLimit(coll, limit, iteratee, callback):

async.mapLimit(urls, 10, function (url, callback) {
  fetchUrl(url, callback, id)
}, function (err, results) {
  // TODO
})

The first parameter, coll, is an array holding the chapter URLs of the novel. The second parameter, limit, controls the number of concurrent requests. The third parameter, iteratee, is a function whose first argument is a single chapter URL and whose second argument is a callback; calling that callback saves the result (here, the content of one chapter) into results, which is passed to the fourth parameter, the final callback. results is an array holding the contents of all chapters.
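To make the concurrency and ordering behaviour concrete, here is a self-contained sketch of mapLimit, with setTimeout standing in for the HTTP request:

const async = require('async')

const coll = [1, 2, 3, 4, 5]

async.mapLimit(coll, 2, function (item, callback) {
  // simulate an asynchronous request; at most 2 of these run at any moment
  setTimeout(function () {
    callback(null, item * 10)
  }, Math.random() * 100)
}, function (err, results) {
  if (err) return console.error(err)
  console.log(results) // [10, 20, 30, 40, 50]: results follow the order of coll, not completion order
})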

We fetch the chapter data in fetchUrl.

First, from the novel's index-page URL, we collect the URLs of all chapters into the array urls:

superagent.get(url)
  .charset('gbk') // the site is GBK-encoded, so superagent-charset is used
  .end(function (err, res) {
    var $ = cheerio.load(res.text) // res.text is the page HTML; after cheerio.load we can use jQuery syntax on it
    let urls = []
    total = $('#list dd').length
    console.log(`Total ${total} chapters`)
    $('#list dd').each(function (i, v) {
      if (i < chapters) {
        urls.push('http://www.zwdu.com' + $(v).find('a').attr('href'))
      }
    })
  })

The fetchUrl function:

function fetchUrl(url, callback, id) {
  superagent.get(url)
    .charset('gbk')
    .end(function (err, res) {
      const $ = cheerio.load(res.text)
      // obj is a constructed object containing the chapter information
      callback(null, obj) // passes obj on to results in the fourth argument
    })
}

Full code:

/**
 * Created by tgxh on 2017/7/4.
 */
const cheerio = require('cheerio')
const express = require('express')
const app = express()
const superagent = require('superagent')
require('superagent-charset')(superagent)
const async = require('async')

let total = 0 // total number of chapters
let id = 0 // counter
const chapters = 10 // how many chapters to crawl
const url = 'http://www.zwdu.com/book/8634/'

// remove leading/trailing whitespace and &nbsp; entities
function trim(str) {
  return str.replace(/(^\s*)|(\s*$)/g, '').replace(/&nbsp;/g, '')
}

// convert numeric HTML entities (&#x....;) back to Chinese characters
function reconvert(str) {
  str = str.replace(/(&#x)(\w{1,4});/gi, function ($0) {
    return String.fromCharCode(parseInt(escape($0).replace(/(%26%23x)(\w{1,4})(%3b)/gi, '$2'), 16))
  })
  return str
}

function fetchUrl(url, callback, id) {
  superagent.get(url)
    .charset('gbk')
    .end(function (err, res) {
      let $ = cheerio.load(res.text)
      const arr = []
      const content = reconvert($('#content').html())
      // split the parsed HTML into paragraphs
      const contentArr = content.split('<br><br>')
      contentArr.forEach(elem => {
        const data = trim(elem.toString())
        arr.push(data)
      })
      const obj = {
        id: id,
        err: 0,
        bookName: $('.footer_cont a').text(),
        title: $('.bookname h1').text(),
        // the result is saved to MySQL, which cannot store an array directly,
        // so the array is joined into a string and split again when read back
        content: arr.join('-')
      }
      callback(null, obj)
    })
}

app.get('/', function (req, response, next) {
  superagent.get(url)
    .charset('gbk')
    .end(function (err, res) {
      var $ = cheerio.load(res.text)
      let urls = []
      total = $('#list dd').length
      console.log(`Total ${total} chapters`)
      $('#list dd').each(function (i, v) {
        if (i < chapters) {
          urls.push('http://www.zwdu.com' + $(v).find('a').attr('href'))
        }
      })
      async.mapLimit(urls, 10, function (url, callback) {
        id++ // chapters need numbering, so count with the id variable
        fetchUrl(url, callback, id)
      }, function (err, results) {
        response.send(results)
      })
    })
})

app.listen(3378, function () {
  console.log('Server listening on 3378')
})
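To try it out, install the dependencies (express, superagent, superagent-charset, cheerio and async) with npm, start the script with node, and open http://localhost:3378/ in a browser: the server crawls the first 10 chapters concurrently and responds with the results.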

The result is an array of chapter objects, each with id, err, bookName, title and content fields.

"Nodejs crawler" uses async to control concurrent writing a novel crawler

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.