Writing a Wikipedia crawler in Node.js: a worked example

Source: Internet
Author: User

Basic ideas
Idea one (origin:master): Start from a Wikipedia category page (for example Aircraft carrier, the key). Find every link whose title contains the key (aircraft carrier) and add it to the queue of pages to crawl. This way, while grabbing a page's HTML and its images, the crawler also collects the addresses of all other key-related pages linked from it. A breadth-first-style traversal completes the task.
Idea two (origin:cat): Crawl by category. Note that Wikipedia category pages begin with Category:. Because Wikipedia has a good document structure, it is easy to start from any category and crawl everything beneath it. For each category page the algorithm extracts the subcategories and crawls all of the pages in parallel. This is fast and preserves the category structure, but it produces many duplicate pages. Those are easy to deal with later using a script.
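
The bookkeeping behind idea two can be sketched as follows. This is a minimal, synchronous, in-memory illustration and not the author's code: crawlCategoryTree and listMembers are hypothetical names, and listMembers stands in for "download a category page and return the titles it links to". It assumes that subcategories can be recognized by the Category: prefix.

```javascript
// Sketch only: walk a category tree, recording which pages belong to which category.
// listMembers(category) is a hypothetical callback returning an array of titles.
function crawlCategoryTree(startCategory, listMembers) {
  var seenCategories = new Set(); // categories already expanded
  var tree = {};                  // category title -> page titles found under it
  var queue = [startCategory];

  while (queue.length > 0) {
    var category = queue.shift();
    if (seenCategories.has(category)) { continue; }
    seenCategories.add(category);

    var members = listMembers(category);
    // Titles starting with "Category:" are subcategories; everything else is a page.
    tree[category] = members.filter(function (title) {
      return title.indexOf('Category:') !== 0;
    });
    members.forEach(function (title) {
      if (title.indexOf('Category:') === 0) { queue.push(title); }
    });
  }
  return tree;
}
```

Categories are deduplicated here, but pages are not: a page listed under two categories appears twice in the result, which is the duplication mentioned above.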

Selection of libraries
At first I wanted to use jsdom. It seems powerful, but it is also rather "heavy", and worst of all its documentation is not good enough: it only lists the advantages and gives no comprehensive description. So I switched to cheerio, which is lightweight and full-featured, and whose documentation at least gives an overall picture at a glance. Only after doing the work did I discover that no library is needed at all: regular expressions can take care of everything. Using the library just saves a little typing.
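
As a rough illustration of the regular-expression approach, the sketch below pulls article links out of a page without any parsing library. It assumes anchors of the form <a href="/wiki/PAGE" ... title="TITLE">; that shape is an assumption for the example, the real markup may differ, and extractWikiLinks is a hypothetical helper name.

```javascript
// Sketch only: collect { href, title } pairs for links that point under /wiki/.
// Assumed markup: <a href="/wiki/PAGE" ... title="TITLE">
function extractWikiLinks(html) {
  // Created inside the function so the global flag's lastIndex state is never shared.
  var linkPattern = /<a\s+href="\/wiki\/([^"#]+)"[^>]*\stitle="([^"]*)"/g;
  var links = [];
  var m;
  while ((m = linkPattern.exec(html)) !== null) {
    links.push({ href: m[1], title: m[2] });
  }
  return links;
}
```

A pattern like this breaks as soon as the attribute order changes, which is the trade-off against using cheerio.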

Key points
Global variable settings:

var regKey = ['Aircraft carrier', 'Aircraft mothership', 'aircraft carrier'];  // A link whose title contains any of these keywords is a target
var allKeys = [];              // Titles of links already seen; the title is also the page identifier, used to avoid fetching a page twice
var keys = ['category:%e8%88%aa%e7%a9%ba%e6%af%8d%e8%88%b0'];  // Wait queue, seeded with the start page
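
One way these three variables could work together is sketched below. The helper names createCrawlQueue and offer are hypothetical and not taken from the original code; the sketch only shows the keyword test and the duplicate check.

```javascript
// Sketch only: a keyword filter plus a "seen" list in front of the wait queue.
function createCrawlQueue(keywords, startKey) {
  var allKeys = [];       // titles already accepted
  var keys = [startKey];  // wait queue

  return {
    keys: keys,
    // Queue href if the title matches a keyword and has not been seen before.
    offer: function (title, href) {
      var isTarget = keywords.some(function (k) { return title.indexOf(k) > -1; });
      if (!isTarget || allKeys.indexOf(title) > -1) { return false; }
      allKeys.push(title);
      keys.push(href);
      return true;
    }
  };
}
```

Looking titles up with indexOf on an array is linear in the number of pages seen. That is fine for a few hundred pages, and a Set would scale better.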

Image download
Use the streaming interface of the request library so that each download operation forms its own closure, and watch out for the possible side effects of asynchronous operations. The images also need to be renamed. At first I kept the original names, but for reasons I could not determine some images that clearly existed were not displayed. The srcset attribute has to be stripped out as well, otherwise the images do not appear on the saved page.

$ = cheerio.load(downhtml);
var rshtml = $.html();
var imgs = $('#bodyContent .image');    // Images are wrapped in links that carry this class
for (var img in imgs) {
  if (typeof imgs[img].attribs === 'undefined' || typeof imgs[img].attribs.href === 'undefined')
    { continue; }  // The structure is an image inside a link; skip entries whose link does not exist
  var picurl = imgs[img].children[0].attribs.src;  // Image address
  var dirs = picurl.split('.');
  var filename = basedir + uuid.v1() + '.' + dirs[dirs.length - 1];  // Rename

  request("https:" + picurl).pipe(fs.createWriteStream('pages/' + filename));  // Download

  rshtml = rshtml.replace(picurl, filename);  // Replace with the local path
  // console.log(picurl);
}
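
The renaming and rewriting steps can be separated from the network I/O, which makes them easier to check. The following is a sketch under assumptions, not the author's code: planImageDownloads and localImageName are hypothetical helpers, random hex names replace uuid.v1(), and the actual download is left to the caller.

```javascript
var crypto = require('crypto');
var path = require('path');

// Sketch only: derive a fresh local file name that keeps the original extension.
function localImageName(picurl) {
  var ext = path.extname(picurl.split('?')[0]);
  return crypto.randomBytes(8).toString('hex') + ext;
}

// Rewrite the HTML to point at local names and return the list of downloads to run.
function planImageDownloads(html, picurls) {
  var downloads = [];
  picurls.forEach(function (picurl) {
    // Each callback invocation has its own picurl and filename (one closure per image).
    var filename = localImageName(picurl);
    downloads.push({ url: 'https:' + picurl, file: filename });
    html = html.split(picurl).join(filename); // replaces every occurrence
  });
  html = html.replace(/\ssrcset="[^"]*"/g, ''); // drop srcset so the local src is used
  return { html: html, downloads: downloads };
}
```

Each entry in downloads can then be streamed to disk in the same way as the request(...).pipe(fs.createWriteStream(...)) line in the code above.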

Breadth-first traversal
At first I did not fully understand how asynchronous code executes, and I wrote the traversal as a loop. I assumed that using a Promise had turned the work into synchronous code. In fact, a Promise only guarantees ordering among the operations handed to that Promise; it does not order them relative to any other code. For example, the following code is incorrect.

var keys = ['aircraft carrier'];
var key = keys.shift();
while (key) {
  data.get({
    url: encodeURI(key),
  }).then(function (downhtml) {
    keys.push(key);        // (1)
  });
  key = keys.shift();      // (2)
}

The code above looks perfectly normal, but in fact (2) runs before (1): the loop body continues immediately, while the then callback only runs later, after the download finishes. What can be done?
I solved the problem with recursion. Sample code:

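var key = keys.shift();
(function doNext(key) {
  data.get({
  }).then(function (downhtml) {
    keys.push(href);       // href: a link found in the downloaded page (extraction not shown)
    key = keys.shift();
    if (key) {
      doNext(key);         // fetch the next page only after this one has been handled
    } else {
      console.log('Crawl task successfully completed.');
    }
  });
})(key);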
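
On Node versions that support async/await, the same ordering can also be written as an ordinary loop. This is an alternative technique, not the author's recursive solution. The sketch below assumes a hypothetical fetchLinks(key) function that returns a Promise for the list of links found on a page, standing in for data.get plus link extraction.

```javascript
// Sketch only: breadth-first crawl where each fetch finishes before the next dequeue.
// fetchLinks(key) is a hypothetical function returning a Promise of an array of keys.
async function crawl(startKey, fetchLinks) {
  var queue = [startKey];
  var seen = new Set(queue); // keys ever queued, to avoid fetching a page twice
  var visited = [];          // keys in the order they were fetched

  while (queue.length > 0) {
    var key = queue.shift();
    var links = await fetchLinks(key); // the loop pauses here until the page is in
    visited.push(key);
    links.forEach(function (href) {
      if (!seen.has(href)) {
        seen.add(href);
        queue.push(href);
      }
    });
  }
  return visited;
}
```

Because await suspends the loop, the push of new links always happens before the next shift, which is the ordering the plain while loop failed to provide. Like the recursive version, this fetches one page at a time.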

Regex cleanup
Use regular expressions to clean up useless page code. Because there are many patterns to handle, I wrote a loop to process them all in the same way.

var regs = [/<link rel="stylesheet" href="?[^"]*">/g,
  /<script>?[^<]*<\/script>/g,
  /<style>?[^<]*<\/style>/g,
  /<a\b[^>]*>/g,               // opening anchor tags only
  /srcset=("?[^"]*")/g
];
regs.forEach(function (rs) {
  var matches = rshtml.match(rs) || [];  // match() returns null when nothing is found
  for (var i = 0; i < matches.length; i++) {
    // Stylesheet links are pointed at local files wiki1.css, wiki2.css, ...; everything else is removed
    rshtml = rshtml.replace(matches[i], matches[i].indexOf('stylesheet') > -1 ? '<link rel="stylesheet" href="wiki' + (i + 1) + '.css">' : '');
  }
});
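
The same cleanup can also be written with the callback form of String.prototype.replace, which avoids the separate match step. This is a sketch of an alternative, not the author's code: cleanHtml is a hypothetical name, the patterns are slightly more permissive than the ones above ([\s\S]*? also covers script bodies that contain a < character), and anchor stripping is left out for brevity.

```javascript
// Sketch only: renumber stylesheet links and strip scripts, styles and srcset.
function cleanHtml(html) {
  var sheetCount = 0;
  // Point each stylesheet link at a local file: wiki1.css, wiki2.css, ...
  html = html.replace(/<link rel="stylesheet" href="[^"]*"\s*\/?>/g, function () {
    sheetCount += 1;
    return '<link rel="stylesheet" href="wiki' + sheetCount + '.css">';
  });
  var strip = [
    /<script\b[^>]*>[\s\S]*?<\/script>/g,
    /<style\b[^>]*>[\s\S]*?<\/style>/g,
    /\ssrcset="[^"]*"/g
  ];
  strip.forEach(function (re) {
    html = html.replace(re, ''); // with the g flag, every occurrence is removed
  });
  return html;
}
```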

Results
Accessing Wikipedia requires a proxy to get past the firewall. I did a trial run on the aircraft carrier category. During the run the crawler found about 300 related links (including category pages, from which I only collected valid links without downloading the pages themselves), and 209 pages were downloaded correctly. I checked some of the failed links by hand and found that they were invalid: Wikipedia reported that the entry had not been created yet. The whole run took a little under 15 minutes, and the result came to nearly 30 MB compressed. I was pleased with the outcome.

As of last night the basic task was complete. Idea one crawls pages whose content is more relevant, and pages are not repeated, but its crawl efficiency is low and category information cannot be captured accurately. Idea two follows Wikipedia's categories, automatically crawls the pages and files them locally by category, and is efficient: in a measured run on the "warship" category it crawled about 6,000 pages in 50 minutes, which is more than 100 pages per minute, and it preserves the category information accurately.
The biggest gain is a deeper understanding of overall flow control in asynchronous programming.
