This article introduces a crawler implementation built on the Node.js core module http and the web page analysis tool cheerio.
I. Preface
This is a crawler, but it does not use any crawler-specific third-party library. It relies mainly on the Node.js core module http and the web page analysis tool cheerio: http fetches the page at the given URL directly, and cheerio then parses it. I worked through an example I had studied in order to understand it better. While coding, my first attempt used forEach to traverse the jQuery-style object returned by cheerio, and it threw an error: that object has no forEach method, which is available only on JavaScript arrays.
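The forEach error described above can be reproduced without cheerio. The sketch below uses a hand-built array-like object (`fakeSelection`, a hypothetical stand-in and not a real cheerio object) to show why the call fails and two plain JavaScript ways around it. With an actual cheerio selection, the usual route is its own `.each()` method, or `.toArray()` followed by `forEach`.

```javascript
// Hypothetical stand-in for a selection object: indexed entries plus a length,
// but no Array.prototype methods on its prototype chain.
const fakeSelection = { 0: 'Chapter 1', 1: 'Chapter 2', length: 2 };

// Array-like objects do not inherit forEach, so calling it would throw.
const hasForEach = typeof fakeSelection.forEach === 'function';

// Workaround 1: copy the entries into a real array, then iterate.
const viaArrayFrom = [];
Array.from(fakeSelection).forEach(function (title) {
  viaArrayFrom.push(title);
});

// Workaround 2: borrow forEach from Array.prototype.
const viaBorrowed = [];
Array.prototype.forEach.call(fakeSelection, function (title) {
  viaBorrowed.push(title);
});

console.log(hasForEach, viaArrayFrom, viaBorrowed);
```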
II. Knowledge points
① superagent: a tool for fetching web pages. I have not used it yet.
② cheerio: a web page analysis tool. You can think of it as jQuery on the server, because its selector and traversal syntax closely follows jQuery's.
1. Capture the entire web page.
2. Analyze the data. The example provided here is one concrete implementation.
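Step 1 can be sketched on its own as a small Promise wrapper around `http.get`. The helper name `fetchHtml` and the status-code check are assumptions made for illustration and are not part of the original example; step 2 would pass the resolved string to `cheerio.load`.

```javascript
const http = require('http');

// Fetch a page over plain HTTP and resolve with its body as a string.
// fetchHtml is a hypothetical helper name, not part of the http module.
function fetchHtml(url) {
  return new Promise(function (resolve, reject) {
    http.get(url, function (res) {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status code: ' + res.statusCode));
        return;
      }
      res.setEncoding('utf8'); // decode incoming chunks as UTF-8 text
      let html = '';
      res.on('data', function (chunk) { html += chunk; });
      res.on('end', function () { resolve(html); });
    }).on('error', reject);
  });
}

// Usage (not executed here, since it needs network access):
// fetchHtml('http://www.imooc.com/learn/348').then(function (html) {
//   console.log(html.length);
// });
```

Returning a Promise keeps the fetching step separate from the parsing step, so each can be tried out independently.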
Crawler source code analysis
var http = require('http');
var cheerio = require('cheerio');
var url = 'http://www.imooc.com/learn/348';

// Print the data structure:
// [{ chapterTitle: '', videos: [{ title: '', id: '' }] }]
function printCourseInfo(courseData) {
  courseData.forEach(function (item) {
    var chapterTitle = item.chapterTitle;
    console.log(chapterTitle + '\n');
    item.videos.forEach(function (video) {
      console.log('[' + video.id + '] ' + video.title + '\n');
    });
  });
}

// Parse the page HTML and collect chapter titles and their videos.
function filterChapter(html) {
  var courseData = [];
  var $ = cheerio.load(html);
  var chapters = $('.chapter');
  // Use cheerio's each(), not forEach(): this is not a JavaScript array.
  chapters.each(function () {
    var chapter = $(this);
    var chapterTitle = chapter.find('strong').text(); // locate the unit title
    var videos = chapter.find('.video').children('li');
    var chapterData = { chapterTitle: chapterTitle, videos: [] };
    videos.each(function () {
      var video = $(this).find('.studyvideo');
      var title = video.text();
      var id = video.attr('href').split('/video')[1];
      chapterData.videos.push({ title: title, id: id });
    });
    // Push each chapter once, after all of its videos have been collected.
    courseData.push(chapterData);
  });
  return courseData;
}

http.get(url, function (res) {
  var html = '';
  // The body arrives in chunks; accumulate them until 'end' fires.
  res.on('data', function (data) {
    html += data;
  });
  res.on('end', function () {
    var courseData = filterChapter(html);
    printCourseInfo(courseData);
  });
}).on('error', function () {
  console.log('course data retrieval error');
});
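To make the data structure that filterChapter builds and printCourseInfo consumes more concrete, here is a small sketch that needs neither network access nor cheerio. The sample values and the helper name `formatCourseInfo` are hypothetical; only the field names chapterTitle, videos, title, and id come from the code above.

```javascript
// Hypothetical sample in the shape filterChapter returns; not real course data.
const sampleCourseData = [
  {
    chapterTitle: 'Chapter 1',
    videos: [
      { title: 'Lesson A', id: '101' },
      { title: 'Lesson B', id: '102' }
    ]
  }
];

// Build the report as a string instead of logging directly, so the
// formatting can be checked without capturing console output.
function formatCourseInfo(courseData) {
  const lines = [];
  courseData.forEach(function (chapter) {
    lines.push(chapter.chapterTitle);
    chapter.videos.forEach(function (video) {
      lines.push('  [' + video.id + '] ' + video.title);
    });
  });
  return lines.join('\n');
}

console.log(formatCourseInfo(sampleCourseData));
```

Because courseData is an ordinary JavaScript array, forEach works here, unlike on the cheerio selections inside filterChapter.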
References:
https://github.com/alsotang/node-lessons/tree/master/lesson3
http://www.imooc.com/video/7965