Node.js (ix) -- a small HTTP crawler


HTTP crawler

There are many kinds of requests on the network: from client to server, and from one server to another.

In the browser we generally use Ajax to submit a form or fetch data; in Node.js, the http module offers both the get and request interfaces for fetching or submitting data.
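
For reference, here is a minimal sketch of the request interface (not from the original article): it POSTs a JSON body to httpbin.org/post, where both the body and the endpoint are made-up examples.

var http = require('http')

// Made-up body and endpoint, just to illustrate http.request.
var body = JSON.stringify({ name: 'test' })

var options = {
  hostname: 'httpbin.org',
  path: '/post',
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(body)
  }
}

var req = http.request(options, function (res) {
  var html = ''
  res.on('data', function (data) { html += data })
  res.on('end', function () { console.log(html) })
})

req.on('error', function (e) { console.log('Request error: ' + e.message) })
req.write(body)
req.end()

http.get is just a convenience wrapper around http.request for GET requests, which is why the crawler below uses get.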

As an example, let's crawl data from a 51CTO blog.

Start with the simplest crawler: fetching the page source.

The sample code is as follows:

var http = require('http')
var url = 'http://mazongfei.blog.51cto.com/3174958/1909817'

http.get(url, function (res) {
  var html = ''

  res.on('data', function (data) {
    html += data
  })

  res.on('end', function () {
    console.log(html)
  })
}).on('error', function () {
  console.log('Error getting blog page')
})

The output is long; an excerpt looks like this:

(Screenshot of the crawled page source.)

Although the source has been crawled down, it is not of much use to us on its own; what we want is the blog's title information. (Because of my limited knowledge I failed to crawl more of this blog; my preliminary understanding is that it is restricted by its robots file.) So let's switch to crawling an imooc course page instead.

At this point we need to analyze the source and filter out the valuable parts.

How do we analyze the source? A recommended module for this is cheerio.

It works like jQuery, and it is simple and easy to use for manipulating HTML on the server side.

First install the module: npm install cheerio
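
As a quick illustration of the jQuery-like API (the HTML fragment here is made up, not taken from the course page):

var cheerio = require('cheerio')

// A tiny, made-up HTML fragment just to show the API.
var html = '<ul class="video"><li><a href="/video/1">Lesson one</a></li></ul>'
var $ = cheerio.load(html)

// Select and read elements just like jQuery in the browser.
console.log($('.video li a').text())        // Lesson one
console.log($('.video li a').attr('href'))  // /video/1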

var http = require('http')
var cheerio = require('cheerio')
var url = 'http://www.imooc.com/learn/348'

function filterChapters(html) {
  var $ = cheerio.load(html)
  var chapters = $('.chapter')

  // We want to build an array with the following format:
  // [{ chapterTitle: '', videos: [{ title: '', id: '' }] }]
  var courseData = []

  // Traverse the chapters and pull out the data we need.
  chapters.each(function (item) {
    var chapter = $(this)

    // Chapter title
    var chapterTitle = chapter.find('strong').text()
    console.log(chapterTitle) // debug: log the chapter title while parsing

    var videos = chapter.find('.video').children('li')

    var chapterData = {
      chapterTitle: chapterTitle,
      videos: []
    }

    // Traverse the videos in this chapter.
    videos.each(function (item) {
      var video = $(this).find('.J-media-item')
      var videoTitle = video.text()
      var id = video.attr('href').split('video/')[1]

      chapterData.videos.push({
        title: videoTitle,
        id: id
      })
    })

    courseData.push(chapterData)
  })

  return courseData
}

function printCourseInfo(courseData) {
  // Traverse the array and print each chapter with its videos.
  courseData.forEach(function (item) {
    var chapterTitle = item.chapterTitle
    console.log(chapterTitle)

    item.videos.forEach(function (video) {
      console.log('[' + video.id + '] ' + video.title)
    })
  })
}

http.get(url, function (res) {
  var html = ''

  res.on('data', function (data) {
    html += data
  })

  res.on('end', function () {
    // Process the page and print the course info.
    var courseData = filterChapters(html)
    printCourseInfo(courseData)
  })
}).on('error', function () {
  console.log('Error getting the course page')
})

The results of the operation are as follows:

(Screenshot of the console output.)

If there is an error, debug with console.log.

With these ids (for example, 6712) we can assemble the URL of each video page and fetch the contents of each section individually.
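
For example, a small helper along these lines could fetch one section by its id; the http://www.imooc.com/video/<id> pattern is an assumption inferred from the hrefs we split on above, not something stated in the article:

var http = require('http')

// Hypothetical helper: the URL pattern is inferred, not confirmed.
function getVideoPage(id, callback) {
  var videoUrl = 'http://www.imooc.com/video/' + id

  http.get(videoUrl, function (res) {
    var html = ''
    res.on('data', function (data) { html += data })
    res.on('end', function () { callback(null, html) })
  }).on('error', function (e) { callback(e) })
}

// Usage with the id from the output above:
getVideoPage('6712', function (err, html) {
  if (err) return console.log('Error getting video page')
  console.log('Got ' + html.length + ' bytes')
})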

We could do more with this, but let's stop the crawler code here for now.

But there is another problem. The code above only fetches the course homepage, in a single callback. If we also request the content of every section, we cannot know in advance how long each request will take, so each callback finishes at an unpredictable time. This is asynchronous callback programming: we need to assemble the requests into a queue and wait for all of them before we get the results we want.
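
One way to assemble such a queue (my own sketch, not the article's code) is to wrap each request in a Promise and wait for all of them with Promise.all; getVideoPage and the id list are the hypothetical examples from above.

// Wrap the callback-style helper in a Promise.
function getVideoPageAsync(id) {
  return new Promise(function (resolve, reject) {
    getVideoPage(id, function (err, html) {
      if (err) reject(err)
      else resolve({ id: id, html: html })
    })
  })
}

// Hypothetical list of ids collected from courseData.
var ids = ['6712', '6713', '6714']

Promise.all(ids.map(getVideoPageAsync))
  .then(function (pages) {
    pages.forEach(function (page) {
      console.log('video ' + page.id + ': ' + page.html.length + ' bytes')
    })
  })
  .catch(function (e) {
    console.log('Error getting video pages: ' + e.message)
  })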


This article is from the "IT Rookie" blog; please keep this source when reposting: http://mazongfei.blog.51cto.com/3174958/1910188
