HTTP crawler
A network carries many kinds of requests: from client to server, and from server to server.
In the browser we generally use Ajax to submit a form or fetch data;
in Node.js, the http module's get and request interfaces do the same job of fetching or submitting data.
As an example, let's crawl blog data from 51CTO.
We start with the simplest possible crawler: fetching the page source.
The sample code is as follows:
```javascript
var http = require('http')
var url = 'http://mazongfei.blog.51cto.com/3174958/1909817'

http.get(url, function (res) {
  var html = ''
  res.on('data', function (data) {
    html += data
  })
  res.on('end', function () {
    console.log(html)
  })
}).on('error', function () {
  console.log('Error getting blog page')
})
```
The output is long; here is a short excerpt:
[Screenshot of the raw HTML output.]

The source has been crawled down, but by itself it is not of much use to us; what we want is the title information. (For lack of knowledge, I failed to crawl this blog's pages further; my preliminary understanding is that it is blocked by the site's robots file, so we switch to crawling imooc.com instead.)
At this point we need to parse the source code and filter out the valuable parts.
How do we parse it? A module is recommended here: cheerio.
It works much like jQuery, making it simple to manipulate HTML on the server side.
First install the module: npm install cheerio
```javascript
var http = require('http')
var cheerio = require('cheerio')
var url = 'http://www.imooc.com/learn/348'

function filterChapters(html) {
  var $ = cheerio.load(html)
  var chapters = $('.chapter')
  // We want to get an array with the following format:
  // [{ chapterTitle: '', videos: [{ title: '', id: '' }] }]
  var courseData = []
  // Traverse the chapters to get the data inside
  chapters.each(function (item) {
    var chapter = $(this)
    // Chapter title
    var chapterTitle = chapter.find('strong').text()
    console.log(chapterTitle)
    var videos = chapter.find('.video').children('li')
    var chapterData = {
      chapterTitle: chapterTitle,
      videos: []
    }
    // Traverse the videos
    videos.each(function (item) {
      var video = $(this).find('.J-media-item')
      var videoTitle = video.text()
      var id = video.attr('href').split('video/')[1]
      chapterData.videos.push({
        title: videoTitle,
        id: id
      })
    })
    courseData.push(chapterData)
  })
  return courseData
}

function printCourseInfo(courseData) {
  // Traverse the array
  courseData.forEach(function (item) {
    var chapterTitle = item.chapterTitle
    console.log(chapterTitle)
    item.videos.forEach(function (video) {
      console.log('"' + video.id + '" ' + video.title)
    })
  })
}

http.get(url, function (res) {
  var html = ''
  res.on('data', function (data) {
    html += data
  })
  res.on('end', function () {
    // Process the page
    var courseData = filterChapters(html)
    printCourseInfo(courseData)
  })
}).on('error', function () {
  console.log('Error getting course page')
})
```
The results of the operation are as follows:
[Screenshot of the output: chapter titles followed by their video ids and titles.]
If something goes wrong, debug it with console.log.
With these ids we can assemble a URL for each lesson (for example, id "6712") and fetch each lesson's content individually.
There is more we could do, but we will stop the crawler code here.

One problem remains: fetching every lesson. The code above is callback-based
and only fetches the course homepage. If we request each lesson's page,
we cannot know in advance how long each request will take,
so we end up with callbacks of unknown duration nested inside callbacks. This is where asynchronous
callback programming comes in: we can assemble the requests into a queue and collect the results we want.
This article is from the "IT Rookie" blog; please keep this source: http://mazongfei.blog.51cto.com/3174958/1910188
Node.js (IX): a small HTTP crawler