Basic ideas
Idea one (origin:master): Starting from a Wikipedia category page (for example Category: Aircraft carriers), find the titles of all links that contain the target keywords (aircraft carrier) and add them to the queue to be crawled. In this way, each page's code and its pictures are grabbed, and at the same time the addresses of all the other keyword-related pages on that page are collected; a breadth-first-traversal-like algorithm completes the task (a rough sketch of the filtering step is given after idea two).
Idea two (origin:cat): Crawl by category. Note that on Wikipedia category pages begin with "Category:". Because Wikipedia has a well-organized document structure, it is easy to start from any category and crawl everything under it. This algorithm parses a category page, extracts its subcategories, and crawls all the pages in parallel; it is fast and preserves the classification structure, but in practice it produces many duplicate pages. Those, however, can easily be dealt with later by a script.
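As a rough illustration of idea one's filtering step, the sketch below checks whether a link title contains one of the target keywords before queueing it. The helper name enqueueIfTarget is made up here; regKey, allkeys and keys are the globals defined under "Key points" below.

function enqueueIfTarget(title) {
    var isTarget = regKey.some(function (kw) {        // the link text must contain one of the keywords
        return title.indexOf(kw) > -1;
    });
    if (isTarget && allkeys.indexOf(title) === -1) {  // skip pages that were already seen
        allkeys.push(title);
        keys.push(title);                              // add to the waiting queue for breadth-first crawling
    }
}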
Selection of libraries
I started out wanting to use jsdom. It felt powerful, but it is also rather "heavy", and its most fatal flaw is that the documentation is not good enough: it only talks about its advantages and never gives a comprehensive description. So I switched to cheerio instead, which is lightweight and full-featured; at least one read of its documentation gives a complete picture. Only after actually doing the work did I discover that no library was really needed at all: regular expressions could take care of everything. Using the library just saves a little bit of typing.
Key points
Global variable settings:
var regKey = ['Aircraft carrier', 'Aircraft mothership', 'aircraft carrier'];  // if a link contains one of these keywords, it is a target
var allkeys = [];  // titles of links already handled; they also identify pages, so the same page is not grabbed twice
var keys = ['Category:%E8%88%AA%E7%A9%BA%E6%AF%8D%E8%88%B0'];  // waiting queue, initialized with the start page
Picture Download
Use the streaming interface of the request library, and let each download operation form a closure. Watch out for the possible side effects of asynchronous operations. In addition, the picture names have to be reset: at first I kept the original names, and for some reason certain images that clearly existed were not displayed. The srcset attribute also has to be cleaned out, otherwise the pictures do not show up when the page is viewed locally.
$ = cheerio.load(downHtml);
var rsHtml = $.html();
var imgs = $('#bodyContent .image');    // pictures are all decorated with this class
for (img in imgs) {
    if (typeof imgs[img].attribs === 'undefined' || typeof imgs[img].attribs.href === 'undefined') {
        continue;    // the structure is a picture inside a link; if the link does not exist, skip it
    } else {
        var picUrl = imgs[img].children[0].attribs.src;    // picture address
        var dirs = picUrl.split('.');
        var filename = basedir + uuid.v1() + '.' + dirs[dirs.length - 1];    // rename
        request("https:" + picUrl).pipe(fs.createWriteStream('pages/' + filename));    // download
        rsHtml = rsHtml.replace(picUrl, filename);    // replace with the local path
        // console.log(picUrl);
    }
}
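The closure mentioned above is not visible in the snippet itself. A minimal sketch of it, assuming request, fs and uuid are already required, is to wrap each download in an immediately invoked function so the asynchronous pipe keeps its own picUrl and filename:

(function (picUrl, filename) {    // IIFE: capture this iteration's values
    request("https:" + picUrl)
        .pipe(fs.createWriteStream('pages/' + filename))
        .on('finish', function () {
            console.log('saved ' + filename);    // this particular image is done
        });
})(picUrl, filename);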
Breadth-first traversal
At the beginning I did not fully understand asynchrony and tried to drive the traversal with a loop, thinking that using a promise would turn it into synchronous code. In fact, a promise only guarantees that the operations chained onto it run in order; it cannot order them relative to other operations outside the chain. For example, the following code is not correct.
var keys = ['aircraft carrier'];
var key = keys.shift();
while (key) {
    data.get({
        url: encodeURI(key),
        qs: null
    }).then(function (downHtml) {
        ...
        keys.push(key);    // (1)
    });
    key = keys.shift();    // (2)
}
The code above looks perfectly normal, but in fact (2) runs before (1)! So what can be done?
I solved this with a recursive call. Sample code:
var key = keys.shift();
(function doNext(key) {
    data.get({
        url: key,
        qs: null
    }).then(function (downHtml) {
        ...
        keys.push(href);
        ...
        key = keys.shift();
        if (key) {
            doNext(key);
        } else {
            console.log('Crawl task successfully completed.');
        }
    });
})(key);
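The snippet above elides the page handling. Below is a self-contained sketch of the same recursive pattern, using a hypothetical fetchPage(url) that returns a promise in place of the real data.get, so the sequencing can be tried in isolation:

var keys = ['Category:Aircraft_carriers'];    // waiting queue (this start page is an assumption)
var allkeys = [];

function fetchPage(url) {                     // stand-in for the real HTTP download
    return Promise.resolve('<html>' + url + '</html>');
}

(function doNext(key) {
    if (!key) {
        console.log('Crawl task successfully completed.');
        return;
    }
    fetchPage(key).then(function (downHtml) {
        allkeys.push(key);                    // mark this page as handled
        // ...parse downHtml here and push newly found titles into keys...
        doNext(keys.shift());                 // only move on after this page is finished
    });
})(keys.shift());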
Regular expression cleanup
Regular expressions are used to clean the useless code out of each page. Since there are quite a few patterns to handle, a loop processes them uniformly.
var regs = [/<link rel="stylesheet" href="?[^"]*">/g,
    /<script>?[^<]*<\/script>/g,
    /<style>?[^<]*<\/style>/g,
    /<a ?[^>]*>/g,
    /<\/a>/g,
    /srcset=("?[^"]*")/g
];
regs.forEach(function (rs) {
    var matches = rsHtml.match(rs) || [];    // match() returns null when there is no hit
    for (var i = 0; i < matches.length; i++) {
        rsHtml = rsHtml.replace(matches[i],
            matches[i].indexOf('stylesheet') > -1 ? '<link rel="stylesheet" href="wiki' + (i + 1) + '.css">' : '');
    }
});
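The article does not show how the cleaned page is written out. A minimal sketch, assuming the rsHtml and key variables from the snippets above and Node's built-in fs module (the file-naming scheme here is an assumption, not the crawler's actual one):

var fs = require('fs');
// derive the local file name from the page title; the real crawler may name files differently
fs.writeFileSync('pages/' + encodeURIComponent(key) + '.html', rsHtml);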
Run results
Wikipedia has to be accessed through a proxy (FQ). I gave it a run and crawled the aircraft carrier category. During the run it found about 300 related links (including category pages; for those I only extracted the valid links and did not download the pages themselves) and correctly downloaded 209 of them. I hand-tested some of the error links and found they were invalid, meaning those entries have not been created yet. The whole process took less than 15 minutes and the result compressed to nearly 30 MB, which feels like a good outcome.
Source
https://github.com/zhoutk/wikiSpider
Summary
By the end of last night the basic task was finished. Idea one crawls page content more accurately and without duplicate pages, but crawling is not very efficient and the classification information cannot be obtained accurately. Idea two follows Wikipedia's own categories and automatically crawls and files pages into categories locally, and it is efficient (measured on the "warship" category: about 6,000 pages crawled in 50 minutes, i.e. more than 100 pages per minute) while saving the classification information accurately.
The biggest gain was a much deeper understanding of overall flow control in asynchronous programming.