Why use Node.js to write a crawler? Because the cheerio library is fully compatible with jQuery syntax; if you are familiar with jQuery, cheerio is really pleasant to use.
Dependency Selection
cheerio: jQuery for Node.js
http: encapsulates an HTTP server and a simple HTTP client
iconv-lite: fixes garbled characters when crawling gb2312-encoded web pages
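Both third-party packages come from npm; http is a Node.js built-in and needs no installation:

npm install cheerio iconv-lite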
Initial Implementation
Since we want to crawl the site's content, we should first look at how the site is put together.
The target website is Movie Heaven (ygdy8.net), and the goal is to grab the download links for all the latest movies.
Analyzing the Page
The page structure is as follows:
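(The original screenshot is not reproduced here. As a rough, hypothetical sketch reconstructed from the CSS selectors used in the code below, the list page looks something like this:)

<div class="co_content8">
  <table>
    <tr>
      <td>
        <a href="/html/gndy/dyzz/.../12345.html" class="ulink">Movie title</a>
      </td>
    </tr>
    ...
  </table>
</div>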
We can see that the title of each movie is an a tag whose class is ulink. Locating outward from there, the outermost box has the class co_content8.
OK. We can start the project.
Getting Movie Titles from One Page
First, require the dependencies and set the URL to be crawled.
var cheerio = require('cheerio');
var http = require('http');
var iconv = require('iconv-lite');

var url = 'http://www.ygdy8.net/html/gndy/dyzz/index.html';
Core code, index.js:
http.get(url, function (sres) {
  var chunks = [];
  sres.on('data', function (chunk) {
    chunks.push(chunk);
  });
  // chunks holds the html of the page. After transcoding, hand it to
  // cheerio.load to get a variable that implements the jQuery interface,
  // conventionally named '$'. The rest is plain jQuery.
  sres.on('end', function () {
    var titles = [];
    // Transcoding is required because this page is encoded as gb2312
    // (its <meta> tag declares charset=gb2312); otherwise the text is garbled.
    var html = iconv.decode(Buffer.concat(chunks), 'gb2312');
    var $ = cheerio.load(html, {decodeEntities: false});
    $('.co_content8 .ulink').each(function (idx, element) {
      var $element = $(element);
      titles.push({
        title: $element.text()
      });
    });
    console.log(titles);
  });
});
Run node index.
The result is as follows:
The movie titles on this page are obtained successfully. But what if we want the titles from several pages? We can't keep changing the url by hand. Of course there is a way to do this. Read on!
Getting Movie Titles from Multiple Pages
We only need to encapsulate the previous code into a function and execute it recursively.
Core code, index.js:
var index = 1; // page-number counter
var url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_';
var titles = []; // used to save the titles

function getTitle(url, i) {
  console.log('Retrieving the content of page ' + i);
  http.get(url + i + '.html', function (sres) {
    var chunks = [];
    sres.on('data', function (chunk) {
      chunks.push(chunk);
    });
    sres.on('end', function () {
      var html = iconv.decode(Buffer.concat(chunks), 'gb2312');
      var $ = cheerio.load(html, {decodeEntities: false});
      $('.co_content8 .ulink').each(function (idx, element) {
        var $element = $(element);
        titles.push({
          title: $element.text(),
          link: $element.attr('href') // also save the detail-page link; getBtLink() below needs it
        });
      });
      if (i < 2) { // for convenience, only two pages are crawled
        getTitle(url, ++index); // recursive call, page number + 1
      } else {
        console.log(titles);
        console.log('All titles retrieved!');
      }
    });
  });
}

function main() {
  console.log('Start crawling');
  getTitle(url, index);
}

main(); // run the main function
The result is as follows:
To locate the download link accurately, we must first find the element whose id is Zoom. The download link is inside it, in the a tag under the tr under the p.
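(Again as a hypothetical sketch based on the selector used in the code below, '#Zoom td a'; the real page may differ in detail:)

<div id="Zoom">
  ...
  <table>
    <tr>
      <td>
        <a href="ftp://...">ftp://... (the download link)</a>
      </td>
    </tr>
  </table>
</div>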
Then we define another function, getBtLink(), to get the download links.
var btLink = []; // used to save the download links
var count = 0; // counter for the detail pages, mirroring index above

function getBtLink(urls, n) { // urls holds the addresses of all the detail pages
  console.log('Retrieving the content of url ' + n);
  http.get('http://www.ygdy8.net' + urls[n].link, function (sres) {
    var chunks = [];
    sres.on('data', function (chunk) {
      chunks.push(chunk);
    });
    sres.on('end', function () {
      var html = iconv.decode(Buffer.concat(chunks), 'gb2312'); // transcode
      var $ = cheerio.load(html, {decodeEntities: false});
      $('#Zoom td').children('a').each(function (idx, element) {
        var $element = $(element);
        btLink.push({
          bt: $element.attr('href')
        });
      });
      if (n < urls.length - 1) {
        getBtLink(urls, ++count); // recursion, next detail page
      } else {
        console.log('All download links retrieved!');
        console.log(btLink);
      }
    });
  });
}
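How getBtLink() gets kicked off is not shown above; presumably getTitle() hands over once all the titles are collected. A sketch (not necessarily the author's original wiring) would replace the final logging branch in getTitle() with:

} else {
  console.log('All titles retrieved, fetching detail pages...');
  getBtLink(titles, 0); // walk the collected detail-page links
}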
Run node index again.
In this way, we get the download links for every movie on the crawled pages. Isn't that easy?
Saving the Data
Now let's talk about how to save the crawled data. Here I chose MongoDB.
Data storage function save():
function save() {
  var MongoClient = require('mongodb').MongoClient; // import the dependency
  // mongo_url is the MongoDB connection string,
  // e.g. 'mongodb://localhost:27017/node-reptitle'
  MongoClient.connect(mongo_url, function (err, db) {
    if (err) {
      console.error(err);
      return;
    } else {
      console.log('Successfully connected to the database');
      var collection = db.collection('node-reptitle');
      collection.insertMany(btLink, function (err, result) { // insert the data
        if (err) {
          console.error(err);
        } else {
          console.log('Successfully saved the data');
        }
        db.close(); // close only after the insert has finished
      });
    }
  });
}
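Where save() is called is likewise not shown; a natural spot, continuing the sketch above, is getBtLink()'s final branch once btLink is complete:

} else {
  console.log('All download links retrieved!');
  save(); // persist btLink to MongoDB
}

(The mongodb driver itself comes from npm: npm install mongodb.)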
The operations here are very simple, so there is no need to bring in Mongoose.
Run node index again.
And that's a crawler implemented in Node.js. I wish you all the data you want ;)