Basic ideas
Idea one (origin:master): Starting from a Wikipedia category page (for example Category: Aircraft carriers), find the titles of all links that contain the target keywords (aircraft carrier) and add them to the queue to be crawled. In this way, each page's code and its pictures are grabbed, and at the same time the addresses of all the other keyword-related pages on that page are collected; a breadth-first-traversal-like algorithm completes the task (a rough sketch of the filtering step is given after idea two).
Idea two (origin:cat): Crawl by category. Note that on Wikipedia category pages begin with "Category:". Because Wikipedia has a well-organized document structure, it is easy to start from any category and crawl everything under it. This algorithm parses a category page, extracts its subcategories, and crawls all the pages in parallel; it is fast and preserves the classification structure, but in practice it produces many duplicate pages. Those, however, can easily be dealt with later by a script.
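As a rough illustration of idea one's filtering step, the sketch below checks whether a link title contains one of the target keywords before queueing it. The helper name enqueueIfTarget is made up here; regKey, allkeys and keys are the globals defined under "Key points" below.

function enqueueIfTarget(title) {
    var isTarget = regKey.some(function (kw) {        // the link text must contain one of the keywords
        return title.indexOf(kw) > -1;
    });
    if (isTarget && allkeys.indexOf(title) === -1) {  // skip pages that were already seen
        allkeys.push(title);
        keys.push(title);                              // add to the waiting queue for breadth-first crawling
    }
}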
Selection of libraries
I started out wanting to use jsdom. It felt powerful, but it is also rather "heavy", and its most fatal flaw is that the documentation is not good enough: it only talks about its advantages and never gives a comprehensive description. So I switched to cheerio instead, which is lightweight and full-featured; at least one read of its documentation gives a complete picture. Only after actually doing the work did I discover that no library was really needed at all: regular expressions could take care of everything. Using the library just saves a little bit of typing.
Key points
Global variable settings:
var regKey = ['Aircraft carrier', 'Aircraft mothership', 'aircraft carrier'];  // if a link contains one of these keywords, it is a target
var allkeys = [];  // titles of links already handled; they also identify pages, so the same page is not grabbed twice
var keys = ['Category:%E8%88%AA%E7%A9%BA%E6%AF%8D%E8%88%B0'];  // waiting queue, initialized with the start page
Picture Download
Use the streaming interface of the request library, and let each download operation form a closure. Watch out for the possible side effects of asynchronous operations. In addition, the picture names have to be reset: at first I kept the original names, and for some reason certain images that clearly existed were not displayed. The srcset attribute also has to be cleaned out, otherwise the pictures do not show up when the page is viewed locally.
$ = cheerio.load(downHtml);
var rsHtml = $.html();
var imgs = $('#bodyContent .image');    // pictures are all decorated with this class
for (img in imgs) {
    if (typeof imgs[img].attribs === 'undefined' || typeof imgs[img].attribs.href === 'undefined') {
        continue;    // the structure is a picture inside a link; if the link does not exist, skip it
    } else {
        var picUrl = imgs[img].children[0].attribs.src;    // picture address
        var dirs = picUrl.split('.');
        var filename = basedir + uuid.v1() + '.' + dirs[dirs.length - 1];    // rename
        request("https:" + picUrl).pipe(fs.createWriteStream('pages/' + filename));    // download
        rsHtml = rsHtml.replace(picUrl, filename);    // replace with the local path
        // console.log(picUrl);
    }
}
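The closure mentioned above is not visible in the snippet itself. A minimal sketch of it, assuming request, fs and uuid are already required, is to wrap each download in an immediately invoked function so the asynchronous pipe keeps its own picUrl and filename:

(function (picUrl, filename) {    // IIFE: capture this iteration's values
    request("https:" + picUrl)
        .pipe(fs.createWriteStream('pages/' + filename))
        .on('finish', function () {
            console.log('saved ' + filename);    // this particular image is done
        });
})(picUrl, filename);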
Breadth-first traversal
At the beginning I did not fully understand asynchrony and tried to drive the traversal with a loop, thinking that using a promise would turn it into synchronous code. In fact, a promise only guarantees that the operations chained onto it run in order; it cannot order them relative to other operations outside the chain. For example, the following code is not correct.
var keys = ['aircraft carrier'];
var key = keys.shift();
while (key) {
    data.get({
        url: encodeURI(key),
        qs: null
    }).then(function (downHtml) {
        ...
        keys.push(key);    // (1)
    });
    key = keys.shift();    // (2)
}
The code above looks perfectly normal, but in fact (2) runs before (1)! So what can be done?
I solved this with a recursive call. Sample code:
var key = keys.shift();
(function doNext(key) {
    data.get({
        url: key,
        qs: null
    }).then(function (downHtml) {
        ...
        keys.push(href);
        ...
        key = keys.shift();
        if (key) {
            doNext(key);
        } else {
            console.log('Crawl task successfully completed.');
        }
    });
})(key);
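The snippet above elides the page handling. Below is a self-contained sketch of the same recursive pattern, using a hypothetical fetchPage(url) that returns a promise in place of the real data.get, so the sequencing can be tried in isolation:

var keys = ['Category:Aircraft_carriers'];    // waiting queue (this start page is an assumption)
var allkeys = [];

function fetchPage(url) {                     // stand-in for the real HTTP download
    return Promise.resolve('<html>' + url + '</html>');
}

(function doNext(key) {
    if (!key) {
        console.log('Crawl task successfully completed.');
        return;
    }
    fetchPage(key).then(function (downHtml) {
        allkeys.push(key);                    // mark this page as handled
        // ...parse downHtml here and push newly found titles into keys...
        doNext(keys.shift());                 // only move on after this page is finished
    });
})(keys.shift());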
Regular expression cleanup
Regular expressions are used to clean the useless code out of each page. Since there are quite a few patterns to handle, a loop processes them uniformly.
var regs = [/<link rel="stylesheet" href="?[^"]*">/g,
    /<script>?[^<]*<\/script>/g,
    /<style>?[^<]*<\/style>/g,
    /<a ?[^>]*>/g,
    /<\/a>/g,
    /srcset=("?[^"]*")/g
];
regs.forEach(function (rs) {
    var matches = rsHtml.match(rs) || [];    // match() returns null when there is no hit
    for (var i = 0; i < matches.length; i++) {
        rsHtml = rsHtml.replace(matches[i],
            matches[i].indexOf('stylesheet') > -1 ? '<link rel="stylesheet" href="wiki' + (i + 1) + '.css">' : '');
    }
});
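The article does not show how the cleaned page is written out. A minimal sketch, assuming the rsHtml and key variables from the snippets above and Node's built-in fs module (the file-naming scheme here is an assumption, not the crawler's actual one):

var fs = require('fs');
// derive the local file name from the page title; the real crawler may name files differently
fs.writeFileSync('pages/' + encodeURIComponent(key) + '.html', rsHtml);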
Run results
Wikipedia has to be accessed through a proxy (FQ). I gave it a run and crawled the aircraft carrier category. During the run it found about 300 related links (including category pages; for those I only extracted the valid links and did not download the pages themselves) and correctly downloaded 209 of them. I hand-tested some of the error links and found they were invalid, meaning those entries have not been created yet. The whole process took less than 15 minutes and the result compressed to nearly 30 MB, which feels like a good outcome.
Source
https://github.com/zhoutk/wikiSpider
Summary
By the end of last night the basic task was finished. Idea one crawls page content more accurately and without duplicate pages, but crawling is not very efficient and the classification information cannot be obtained accurately. Idea two follows Wikipedia's own categories and automatically crawls and files pages into categories locally, and it is efficient (measured on the "warship" category: about 6,000 pages crawled in 50 minutes, i.e. more than 100 pages per minute) while saving the classification information accurately.
The biggest gain was a much deeper understanding of overall flow control in asynchronous programming.