HTTP crawler
A network carries many kinds of requests: from client to server, and from server to server.
In the browser we generally use Ajax to submit a form or fetch data;
in Node.js, the http module's get and request interfaces do the same job of fetching or submitting data.
As an example, let's crawl blog data from 51CTO.
We start with the simplest possible crawler: fetching the page source.
The sample code is as follows:
```javascript
var http = require('http')
var url = 'http://mazongfei.blog.51cto.com/3174958/1909817'

http.get(url, function (res) {
  var html = ''
  res.on('data', function (data) {
    html += data
  })
  res.on('end', function () {
    console.log(html)
  })
}).on('error', function () {
  console.log('Error getting blog page')
})
```
The output is long; here is a short excerpt:
[Screenshot of the raw HTML output.]

The source has been crawled down, but by itself it is not of much use to us; what we want is the title information. (For lack of knowledge, I failed to crawl this blog's pages further; my preliminary understanding is that it is blocked by the site's robots file, so we switch to crawling imooc.com instead.)
At this point we need to parse the source code and filter out the valuable parts.
How do we parse it? A module is recommended here: cheerio.
It works much like jQuery, making it simple to manipulate HTML on the server side.
First install the module: npm install cheerio
```javascript
var http = require('http')
var cheerio = require('cheerio')
var url = 'http://www.imooc.com/learn/348'

function filterChapters(html) {
  var $ = cheerio.load(html)
  var chapters = $('.chapter')
  // We want to get an array with the following format:
  // [{ chapterTitle: '', videos: [{ title: '', id: '' }] }]
  var courseData = []
  // Traverse the chapters to get the data inside
  chapters.each(function (item) {
    var chapter = $(this)
    // Chapter title
    var chapterTitle = chapter.find('strong').text()
    console.log(chapterTitle)
    var videos = chapter.find('.video').children('li')
    var chapterData = {
      chapterTitle: chapterTitle,
      videos: []
    }
    // Traverse the videos
    videos.each(function (item) {
      var video = $(this).find('.J-media-item')
      var videoTitle = video.text()
      var id = video.attr('href').split('video/')[1]
      chapterData.videos.push({
        title: videoTitle,
        id: id
      })
    })
    courseData.push(chapterData)
  })
  return courseData
}

function printCourseInfo(courseData) {
  // Traverse the array
  courseData.forEach(function (item) {
    var chapterTitle = item.chapterTitle
    console.log(chapterTitle)
    item.videos.forEach(function (video) {
      console.log('"' + video.id + '" ' + video.title)
    })
  })
}

http.get(url, function (res) {
  var html = ''
  res.on('data', function (data) {
    html += data
  })
  res.on('end', function () {
    // Process the page
    var courseData = filterChapters(html)
    printCourseInfo(courseData)
  })
}).on('error', function () {
  console.log('Error getting course page')
})
```
The results of the operation are as follows:
[Screenshot of the output: chapter titles followed by their video ids and titles.]
If something goes wrong, debug it with console.log.
With these ids we can assemble a URL for each lesson (for example, id "6712") and fetch each lesson's content individually.
There is more we could do, but we will stop the crawler code here.

One problem remains: fetching every lesson. The code above is callback-based
and only fetches the course homepage. If we request each lesson's page,
we cannot know in advance how long each request will take,
so we end up with callbacks of unknown duration nested inside callbacks. This is where asynchronous
callback programming comes in: we can assemble the requests into a queue and collect the results we want.
This article is from the "IT Rookie" blog; please keep this source: http://mazongfei.blog.51cto.com/3174958/1910188
Node.js (IX): a small HTTP crawler