Node.js crawler for ancient classics: experience summary and project sharing after crawling 16,000 pages in total


Technical details of the project
The project makes heavy use of ES7 async functions, which give a much more intuitive view of the program's flow. For convenience, the well-known async library was used directly for data traversal, so callbacks and promises were unavoidable; since the data processing happens inside callback functions, some data-passing problems inevitably came up. In fact, the same functionality could also be written directly with ES7 async/await. One nice touch here is using Class static methods to encapsulate the database operations; like prototype methods, static methods take up no extra space per instance.
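The dbhelper code itself isn't shown in the post. A minimal sketch of what such a static-method wrapper around mongoose might look like (the method names are illustrative, chosen only to match the bookHelper.getBookList call that appears later):

<pre>
// dbhelper/bookHelper.js (illustrative sketch, not the project's actual code)
class BookHelper {
    // Static methods live on the class itself, so no instance is ever created
    // and they add no per-instance memory overhead.
    static getBookList(model) {
        // mongoose queries expose a promise via .exec(), so callers can simply await it.
        return model.find({}).exec();
    }

    static saveBook(model, doc) {
        return model.create(doc);
    }
}

module.exports = BookHelper;
</pre>

Callers can then write const list = await BookHelper.getBookList(bookListModel); without ever instantiating the class.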
The project mainly uses:

    • 1 ES7 async/await for asynchronous control flow.
    • 2 The async library from npm for loop traversal and concurrent request operations.
    • 3 log4js for logging.
    • 4 cheerio for DOM manipulation (a short sketch combining cheerio, log4js and an HTTP request follows this list).
    • 5 mongoose to connect to MongoDB and handle saving and querying data.
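The post doesn't show its page-fetching code or name an HTTP client; as a rough illustration of how cheerio and log4js are typically combined with a request (axios is an assumption here, and the CSS selector is hypothetical):

<pre>
// Illustrative sketch only; not the project's actual code.
const axios = require('axios');       // assumed HTTP client; the post doesn't name one
const cheerio = require('cheerio');
const log4js = require('log4js');

const logger = log4js.getLogger();
logger.level = 'info';

// Fetch a page and extract chapter links with cheerio.
const fetchChapterLinks = async (url) => {
    try {
        const { data: html } = await axios.get(url);
        const $ = cheerio.load(html);
        // The selector depends on the target site's markup.
        return $('a.chapter-link').map((i, el) => $(el).attr('href')).get();
    } catch (err) {
        logger.error('Failed to fetch ' + url, err);
        return null;
    }
};
</pre>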

Directory structure

<pre>
├── bin                  // Entry
│   ├── booklist.js      // Book list crawl logic
│   ├── chapterlist.js   // Chapter list crawl logic
│   ├── content.js       // Content crawl logic
│   └── index.js         // Program entry
├── config               // Config files
├── dbhelper             // Database operation methods
├── logs                 // Project logs
├── model                // MongoDB collection operation instances
├── node_modules
├── utils                // Utility functions
├── package.json
</pre>

Project implementation plan analysis

The project is a typical multi-level crawl, currently with only three levels: the book list, the chapter list belonging to each book item, and the content behind each chapter link. There are two ways to crawl such a structure. One is to work from the outer layer inward, finishing each inner crawl before moving on to the next outer item. The other is to crawl the outer layer completely and save it to the database, then use that saved data to crawl all the inner chapter links, save those as well, and finally query the database for the links whose content still needs crawling. Both options have pros and cons, and I actually tried both. The latter has one clear benefit: because the three levels are crawled separately, it is much easier to save as much of the relevant data as possible alongside the corresponding chapters.
You can imagine what the former looks like under normal logic: traverse the first-level directory to get the corresponding second-level chapter directory, then traverse the chapter list to crawl the content, and only save once the third-level content crawl of a unit is complete. If a lot of first-level directory information is still needed at that point, the data has to be passed between levels, which turns out to be fairly complex. Saving the data separately therefore avoids unnecessary, complicated data transfer to some extent.

Now consider how much there actually is to crawl. There are only about 180 ancient Chinese classics and histories in scope. The books and their chapter lists are themselves a very small dataset, i.e. a collection of 180 document records. All the chapters of these 180 books add up to roughly 16,000 chapters, which means 16,000 pages have to be visited to fetch the corresponding content. So choosing the second approach is reasonable.

Project implementation

The main flow has three methods: bookListInit, chapterListInit and contentListInit. They are the publicly exposed initialization methods for crawling the book directory, the chapter lists and the book content, respectively. With async/await the execution order of the three methods can be controlled: the book directory crawl finishes and saves its data to the database, then returns its result to the main program; if it succeeded, the main program goes on to crawl the chapter lists based on the book list, and the book content crawl works the same way.

Project main entry

<pre>
/**
 * Crawler main entry
 */
const start = async () => {
    let bookListRes = await bookListInit();
    if (!bookListRes) {
        logger.warn('Book list crawl failed, terminating...');
        return;
    }
    logger.info('Book list crawl succeeded, now crawling book chapters...');

    let chapterListRes = await chapterListInit();
    if (!chapterListRes) {
        logger.warn('Book chapter list crawl failed, terminating...');
        return;
    }
    logger.info('Book chapter list crawl succeeded, now crawling book content...');

    let contentListRes = await contentListInit();
    if (!contentListRes) {
        logger.warn('Book chapter content crawl failed, terminating...');
        return;
    }
    logger.info('Book content crawl succeeded');
}

// Start entry
if (typeof bookListInit === 'function' && typeof chapterListInit === 'function') {
    // Start fetching
    start();
}
</pre>

Introduction to the three methods: bookListInit, chapterListInit and contentListInit

booklist.js

<pre>
/**
 * Initialization method. Returns the crawl result: true for success, false for failure.
 */
const bookListInit = async () => {
    logger.info('Starting to crawl the book list...');
    const pageUrlList = getPageUrlList(totalListPage, baseUrl);
    let res = await getBookList(pageUrlList);
    return res;
}
</pre>
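getPageUrlList isn't shown in the post; assuming it simply builds the paginated list URLs from a base URL and a page count, it might look roughly like this (the ?page=N scheme is a guess, not the target site's real URL pattern):

<pre>
// Hypothetical sketch of getPageUrlList; the pagination scheme is assumed.
const getPageUrlList = (totalListPage, baseUrl) => {
    const urls = [];
    for (let page = 1; page <= totalListPage; page++) {
        urls.push(baseUrl + '?page=' + page);
    }
    return urls;
};
</pre>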

chapterlist.js

<pre>
/**
 * Initialization entry
 */
const chapterListInit = async () => {
    const list = await bookHelper.getBookList(bookListModel);
    if (!list) {
        logger.error('Initial query of the book directory failed');
    }
    logger.info('Starting to crawl the book chapter lists; the book directory has ' + list.length + ' entries');
    let res = await asyncGetChapter(list);
    return res;
};
</pre>

content.js

<pre>
/**
 * Initialization entry
 */
const contentListInit = async () => {
    // Get the book list
    const list = await bookHelper.getBookList(bookListModel);
    if (!list) {
        logger.error('Initial query of the book directory failed');
        return;
    }
    const res = await mapBookList(list);
    if (!res) {
        logger.error('Crawling chapter content failed: the serial traversal in getCurBookSectionList() finished with a callback error. The error has been logged, please check the logs!');
        return;
    }
    return res;
}
</pre>

Thoughts on content crawling

Fetching the book directory is actually very simple: just use async.mapLimit to traverse the list and save the data. Saving the content would in principle be just as simple, traverse the chapter list and crawl each link, but in reality there are tens of thousands of links. We cannot hold everything in memory in one array and iterate over it, so the content crawl has to be split into units.
A common approach is to crawl a fixed number of chapters per query. Its drawback is that the batches are arbitrary groups with no relationship between the data; bulk inserts also make error handling slightly messy, and it conflicts with the goal of saving each book as its own collection. So the approach used here is to crawl and save the content one book at a time.
async.mapLimit(list, 1, (series, callback) => {}) is used for the traversal, so callbacks are unavoidable, which feels ugly; a callback-free alternative with plain async/await is sketched after the code below. The second parameter of async.mapLimit() sets the number of concurrent requests.

<pre>
/*
 * Content crawl steps:
 * Step 1: get the book list, and use each book in the list to get its full chapter list
 * Step 2: traverse the chapter list, crawl the content and save it to the database
 * Step 3: once step 2 has saved its data, move on to the next book's content crawl and save
 */

/**
 * Initialization entry
 */
const contentListInit = async () => {
    // Get the book list
    const list = await bookHelper.getBookList(bookListModel);
    if (!list) {
        logger.error('Initial query of the book directory failed');
        return;
    }
    const res = await mapBookList(list);
    if (!res) {
        logger.error('Crawling chapter content failed: the serial traversal in getCurBookSectionList() finished with a callback error. The error has been logged, please check the logs!');
        return;
    }
    return res;
}

/**
 * Traverse the chapter lists of the books in the directory
 * @param {*} list
 */
const mapBookList = (list) => {
    return new Promise((resolve, reject) => {
        async.mapLimit(list, 1, (series, callback) => {
            let doc = series._doc;
            getCurBookSectionList(doc, callback);
        }, (err, result) => {
            if (err) {
                logger.error('Asynchronous execution of the book directory crawl failed!');
                logger.error(err);
                reject(false);
                return;
            }
            resolve(true);
        })
    })
}

/**
 * Get the chapter list of a single book and traverse it to crawl the content
 * @param {*} series
 * @param {*} callback
 */
const getCurBookSectionList = async (series, callback) => {
    let num = Math.random() * 1000 + 1000;
    await sleep(num);
    let key = series.key;
    const res = await bookHelper.querySectionList(chapterListModel, { key: key });
    if (!res) {
        logger.error('Failed to get the chapter content of the current book: ' + series.bookName + ', moving on to the next book!');
        callback(null, null);
        return;
    }
    // Check whether this book's data already exists
    const bookItemModel = getModel(key);
    const contentLength = await bookHelper.getCollectionLength(bookItemModel, {});
    if (contentLength === res.length) {
        logger.info('Current book: ' + series.bookName + ' is already complete in the database, moving on to the next task');
        callback(null, null);
        return;
    }
    await mapSectionList(res);
    callback(null, null);
}
</pre>
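As noted above, the same serial traversal could be written without the async library at all, using plain async/await. A minimal sketch under that assumption (crawlOneBook stands in for a callback-free version of getCurBookSectionList; this is not the project's actual code):

<pre>
// Sketch: serial traversal with plain async/await instead of async.mapLimit(list, 1, ...).
// crawlOneBook is a hypothetical callback-free version of getCurBookSectionList.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const crawlBooksSequentially = async (list, crawlOneBook) => {
    for (const series of list) {
        await sleep(Math.random() * 1000 + 1000); // polite 1 to 2 second random delay, as above
        await crawlOneBook(series._doc);
    }
    return true;
};
</pre>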

The next question is how the fetched data should be saved.

Here the data is grouped by key: each pass fetches the links belonging to one key, which has the benefit that the saved data forms a coherent whole. Now consider the options for saving the data:

    • 1 Insert a whole book at once.

      Advantage: fewer database operations, so no time is wasted on the database.

      Disadvantage: some books have hundreds of chapters, which means holding hundreds of pages of content in memory before inserting; this is memory-intensive and may make the program unstable.

    • 2 Insert each chapter into the database as it is fetched.

      Advantages: the data is saved promptly, page by page, so even if an error occurs later the previously saved chapters do not have to be re-crawled.

      Disadvantage: it is obviously slower; crawling tens of thousands of pages means tens of thousands of database operations. A middle ground is to cache the results and save once a certain number of chapters has accumulated, which is also a good option (a sketch of this buffered approach follows the list).
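A minimal sketch of that buffered middle ground, flushing to MongoDB every N chapters (illustrative only: BATCH_SIZE and the helper names are assumptions; insertMany is mongoose's bulk insert):

<pre>
// Illustrative sketch of the "save every N chapters" compromise; not project code.
const BATCH_SIZE = 50;   // flush to MongoDB every 50 chapters (arbitrary choice)
let buffer = [];

// Queue one chapter; write a bulk insert whenever the buffer is full.
const saveChapter = async (model, chapterDoc) => {
    buffer.push(chapterDoc);
    if (buffer.length >= BATCH_SIZE) {
        await model.insertMany(buffer);
        buffer = [];
    }
};

// Call once at the end of a book so the last partial batch is not lost.
const flushChapters = async (model) => {
    if (buffer.length > 0) {
        await model.insertMany(buffer);
        buffer = [];
    }
};
</pre>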

<pre>
/**
 * Traverse all chapters of a single book and crawl them
 * @param {*} list
 */
const mapSectionList = (list) => {
    return new Promise((resolve, reject) => {
        async.mapLimit(list, 1, (series, callback) => {
            let doc = series._doc;
            getContent(doc, callback)
        }, (err, result) => {
            if (err) {
                logger.error('Asynchronous execution of the book directory crawl failed!');
                logger.error(err);
                reject(false);
                return;
            }
            const bookName = list[0].bookName;
            const key = list[0].key;

            // Save with the whole book as the unit
            saveAllContentToDB(result, bookName, key, resolve);

            // Save each chapter as its own unit (the alternative)
            // logger.info(bookName + ' data crawl complete, moving on to the next book...');
            // resolve(true);
        })
    })
}
</pre>

Both options have pros and cons, and both were tried here. Two collections for recording errors were prepared, errContentModel and errorCollectionModel; whichever mode is used, when an insert fails the relevant information is saved to the corresponding collection. The reason for using collections to record errors is that it makes reviewing and following up easy, without having to dig through the logs.

(PS: in fact the errorCollectionModel collection alone would suffice, but the errContentModel collection can store the complete chapter information.)

<pre>
// Schema for failed chapter data
const errorSpider = mongoose.Schema({
    chapter: String,
    section: String,
    url: String,
    key: String,
    bookName: String,
    author: String,
})

// Schema for failed data, keeping only key and bookName
const errorCollection = mongoose.Schema({
    key: String,
    bookName: String,
})
</pre>
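For completeness, a hedged sketch of how a failed chapter might be written into these collections (the registered model names 'ErrContent' and 'ErrorCollection' are assumptions; errContentModel and errorCollectionModel are the names the post uses):

<pre>
// Illustrative sketch; the model registration names are assumptions.
const errContentModel = mongoose.model('ErrContent', errorSpider);
const errorCollectionModel = mongoose.model('ErrorCollection', errorCollection);

// Record a failed chapter so it can be reviewed and retried without digging through logs.
const saveError = async (chapterDoc) => {
    await errContentModel.create(chapterDoc);
    await errorCollectionModel.create({ key: chapterDoc.key, bookName: chapterDoc.bookName });
};
</pre>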

The content of each book is saved into its own new collection, named by its key.
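getModel(key) itself isn't shown in the post; with mongoose, creating one collection per book named by its key could look roughly like this (the chapter schema fields are assumptions):

<pre>
// Hypothetical sketch of getModel; the chapter fields are guesses, not the project's schema.
const mongoose = require('mongoose');

const bookItemSchema = new mongoose.Schema({
    chapter: String,
    content: String,
    url: String,
    key: String,
});

// The third argument of mongoose.model() forces the MongoDB collection name,
// so each book's key becomes its own collection. The mongoose.models lookup
// avoids re-registering a model that already exists.
const getModel = (key) => mongoose.models[key] || mongoose.model(key, bookItemSchema, key);
</pre>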

Summary

The main difficulties in writing this project were keeping the program stable, setting up the fault-tolerance mechanism, and recording errors. The project can now basically run through the whole process in one go. There are certainly still many problems in the design; corrections and discussion are welcome.
