This post uses Node.js to crawl data from a web page. It relies on cheerio, which is very handy for parsing HTML and whose API is modeled on jQuery, so selecting and traversing elements will feel familiar.
First install cheerio by running npm install cheerio on the command line (run the command in the root directory of your Node.js project).
Once the installation is complete, we will fetch the page http://www.imooc.com/learn/348 and extract the course information from it.
The code is as follows:
var http = require('http');
var cheerio = require('cheerio');
var url = 'http://www.imooc.com/learn/348';

// Extract the required course information from the page HTML
function filter(html) {
  var $ = cheerio.load(html);
  var chapters = $('.chapter');
  // Format of the result:
  // [{ chapterTitle: '', videos: [{ title: '', id: '' }] }]
  var result = [];
  chapters.each(function () {
    var item = $(this);
    var chapterTitle = item.find('strong').text();
    var videos = item.find('.video').find('li');
    var chapterData = { chapterTitle: chapterTitle, videos: [] };
    videos.each(function () {
      var video = $(this).find('.studyvideo');
      // Drop the whitespace that follows the useful information
      var title = video.text().split(')')[0] + ')';
      // console.log(title);
      // Keep only the numeric id of the course video
      var id = video.attr('href').split('video/')[1];
      chapterData.videos.push({ title: title, id: id });
    });
    result.push(chapterData);
  });
  return result;
}

// Print the crawl results
function printResult(result) {
  var str = '';
  result.forEach(function (item) {
    str += item.chapterTitle + '\n';
    item.videos.forEach(function (item) {
      str += '"' + item.id + '" ' + item.title + '\n';
    });
  });
  console.log(str);
}

http.get(url, function (res) {
  var html = '';
  // Decode chunks as UTF-8 so multi-byte characters split across chunks stay intact
  res.setEncoding('utf8');
  res.on('data', function (data) {
    // Accumulate the content of the whole page
    html += data;
  });
  res.on('end', function () {
    var result = filter(html); // Filter the page down to the course information
    printResult(result);       // Print the crawl results
  });
}).on('error', function () {
  // An error occurred while fetching the page
  console.log('error!');
});
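The two split() calls inside filter are the least obvious part of the code above. Here is a minimal sketch of what they do on their own. The helper names cleanTitle and extractVideoId are hypothetical, and the sample strings are assumptions about what the page markup produces rather than values taken from the page:

```javascript
// Hypothetical helpers that isolate the two split() calls used in filter().

// Keep everything up to and including the first ')' so that trailing
// whitespace or extra text after the duration is dropped.
function cleanTitle(text) {
  return text.split(')')[0] + ')';
}

// Keep only the part of the link that follows 'video/'.
function extractVideoId(href) {
  return href.split('video/')[1];
}

// Sample inputs are assumptions, not real page data.
var sampleText = '1-1 preface (01:20)\n      ';
var sampleHref = '/video/6687';

console.log(cleanTitle(sampleText));     // 1-1 preface (01:20)
console.log(extractVideoId(sampleHref)); // 6687
```

Both helpers depend on the page layout. If a link does not contain 'video/', extractVideoId returns undefined, so a real crawler would probably want to check for that case.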
Results:
Chapter 1 Preface
"6687" 1-1 preface (01:20)
"6688" 1-2 Why Study Nodejs (05:43)
Chapter 2 Installing Nodejs
"6689" 2-1 Course brief (01:19)
"6690" 2-2 Nodejs version basics (01:02)
"6691" 2-3 Installing Nodejs on Windows (04:43)
"6692" 2-4 Installing Nodejs on Linux (06:24)
"6693" 2-5 Installing Nodejs on Mac (03:55)
Chapter 3 Can't wait to try it out
"6694" 3-1 Starting with a web server (05:14)
"6695" 3-2 Command-line experience (02:47)
Chapter 4 Modules and package management tools
"6697" 4-1 Node.js modules and the CommonJS specification (03:44)
"6700" 4-2 Classification of modules (00:45)
"6701" 4-3 A simple Nodejs module (09:23)
Chapter 5 A sweep through the Nodejs API
"6705" 5-1 Do not fall into the abyss of version selection (02:32)
"6710" 5-2 URL: a good helper for parsing addresses (10:30)
"6711" 5-3 QueryString: a handy tool for parameter processing (06:40)
"6712" 5-4 HTTP knowledge: filling in the gaps first (09:43)
"6713" 5-5 HTTP knowledge: filling in the gaps, with imooc as the example (10:13)
"7557" 5-6 HTTP event callbacks, advanced (17:51)
"7558" 5-7 HTTP source code interpretation: first understand scope and context (20:50)
"7963" 5-8 HTTP Source code interpretation (22:08)
"7964" 5-9 HTTP performance Test (09:15)
"7965" 5-10 HTTP crawler (17:33)
"8525" 5-11 Events module interlude (15:15)
"8837" 5-12 Request method (17:56)
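The listing above follows directly from the structure that filter returns: one object per chapter, each carrying a videos array. Here is a small sketch of that mapping, assuming a hand-written sample modeled on the first chapter of the output. The function name formatResult is hypothetical, and unlike printResult it returns the text instead of logging it:

```javascript
// Illustrative sample of the structure built by the crawler:
// [{ chapterTitle: '', videos: [{ title: '', id: '' }] }]
var sample = [
  {
    chapterTitle: 'Chapter 1 Preface',
    videos: [
      { title: '1-1 preface (01:20)', id: '6687' },
      { title: '1-2 Why Study Nodejs (05:43)', id: '6688' }
    ]
  }
];

// Hypothetical helper: one line per chapter title, then one line per video
// with the id in quotes followed by the title.
function formatResult(result) {
  var lines = [];
  result.forEach(function (chapter) {
    lines.push(chapter.chapterTitle);
    chapter.videos.forEach(function (video) {
      lines.push('"' + video.id + '" ' + video.title);
    });
  });
  return lines.join('\n');
}

console.log(formatResult(sample));
```

Returning a string keeps the formatting step separate from the network request, so it can be checked without fetching the page.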
This article was written after following the online course at http://www.imooc.com/learn/348
HTTP crawler, Node.js learning notes (II)