Simple crawling method using Node.js

Why use Node to write a crawler? Because the cheerio library is fully compatible with jQuery syntax: if you already know jQuery, cheerio is genuinely pleasant to use.

Dependency Selection
  • cheerio: jQuery for Node.js

  • http: encapsulates an HTTP server and a simple HTTP client (built into Node.js)

  • iconv-lite: fixes the garbled characters you otherwise get when crawling gb2312-encoded webpages
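
The two third-party libraries can be installed from npm; http ships with Node.js and needs no installation. A minimal setup might look like this:

npm init -y
npm install cheerio iconv-lite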

Initial Implementation

Since we want to crawl the website's content, we should first look at how the website is put together.
The target website is Movie Heaven (ygdy8.net); we want to download all the latest movies.

Analyzing the page

The page structure is as follows:

We can see that the title of each movie sits in an a tag whose class is ulink. Locating outward from there, the outermost box has the class co_content8.
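
Putting that together, the relevant part of the list page presumably looks something like this (a simplified sketch based on the description above, not the site's exact markup):

<div class="co_content8">
  ...
  <a href="/html/gndy/dyzz/..." class="ulink">Movie title goes here</a>
  ...
</div>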

OK, now we can start the project.

Get the movie titles on one page

First, require the dependencies and set the URL to be crawled.

var cheerio = require('cheerio');
var http = require('http');
var iconv = require('iconv-lite');

var url = 'http://www.ygdy8.net/html/gndy/dyzz/index.html';

Core code: index.js

http.get(url, function (sres) {
  var chunks = [];
  sres.on('data', function (chunk) {
    chunks.push(chunk);
  });
  // chunks stores the html content of the webpage. Pass it to cheerio.load and
  // you get back a variable that implements the jQuery interface, named '$'.
  // The rest is plain jQuery.
  sres.on('end', function () {
    var titles = [];
    // Transcoding is required because this webpage is encoded in gb2312
    // (basis: the page's <meta> tag declares that charset); otherwise the
    // text comes out garbled.
    var html = iconv.decode(Buffer.concat(chunks), 'gb2312');
    var $ = cheerio.load(html, { decodeEntities: false });
    $('.co_content8 .ulink').each(function (idx, element) {
      var $element = $(element);
      titles.push({
        title: $element.text()
      });
    });
    console.log(titles);
  });
});

Run node index

The result is as follows: the console prints an array of { title: ... } objects, one per movie on the first page.

One page of movie titles has been fetched successfully. But if we want the titles from multiple pages, we can't keep editing the URL by hand. Of course there is a way to do this. Read on!

Get the movie titles on multiple pages

We only need to encapsulate the previous code into a function and execute it recursively.

Core code: index.js

var index = 1; // page number control
var url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_';
var titles = []; // used to save the titles

function getTitle(url, i) {
  console.log('Retrieving the content of page ' + i);
  http.get(url + i + '.html', function (sres) {
    var chunks = [];
    sres.on('data', function (chunk) {
      chunks.push(chunk);
    });
    sres.on('end', function () {
      var html = iconv.decode(Buffer.concat(chunks), 'gb2312');
      var $ = cheerio.load(html, { decodeEntities: false });
      $('.co_content8 .ulink').each(function (idx, element) {
        var $element = $(element);
        titles.push({
          title: $element.text()
        });
      });
      if (i < 2) { // only two pages are crawled for convenience
        getTitle(url, ++index); // recursive call, page number + 1
      } else {
        console.log(titles);
        console.log('All titles retrieved!');
      }
    });
  });
}

function main() {
  console.log('Start crawling');
  getTitle(url, index);
}

main(); // run the main function

The result is as follows: the console logs its progress page by page and then prints all the collected titles.

Next come the download links. If you want to locate a download link accurately, you must first find the element whose id is Zoom; the download link sits in an a tag inside the td cells under it.
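
As a rough sketch of the assumed markup (simplified, with the actual href elided), the structure being targeted looks like:

<div id="Zoom">
  ...
  <table>
    <tr>
      <td><a href="...">download link</a></td>
    </tr>
  </table>
  ...
</div>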

Then we define another function to get the download link.

getBtLink()

var btLink = []; // used to save the download links

function getBtLink(urls, n) { // urls contains the addresses of all the detail pages
  console.log('Retrieving the content of url ' + n);
  http.get('http://www.ygdy8.net' + urls[n].title, function (sres) {
    var chunks = [];
    sres.on('data', function (chunk) {
      chunks.push(chunk);
    });
    sres.on('end', function () {
      var html = iconv.decode(Buffer.concat(chunks), 'gb2312'); // transcode
      var $ = cheerio.load(html, { decodeEntities: false });
      $('#Zoom td').children('a').each(function (idx, element) {
        var $element = $(element);
        btLink.push({
          bt: $element.attr('href')
        });
      });
      if (n < urls.length - 1) {
        getBtLink(urls, ++n); // recursion: move on to the next detail page
      } else {
        console.log('All btLinks retrieved!');
        console.log(btLink);
      }
    });
  });
}
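
The article doesn't show how getTitle and getBtLink are wired together, so here is a minimal sketch of one way to chain them (an assumption of mine, not code from the original). Note that getBtLink reads urls[n].title as a URL path, so for the chain to work, getTitle would have to store each link's href rather than its text:

// In getTitle's each() callback, store the detail-page path instead of the text
// (assuming the href attribute holds a relative path like /html/gndy/dyzz/...):
titles.push({
  title: $element.attr('href')
});

// And in getTitle's final branch, hand off to the next step:
if (i < 2) { // only two pages are crawled for convenience
  getTitle(url, ++index);
} else {
  console.log('All titles retrieved!');
  getBtLink(titles, 0); // start fetching download links from the first detail page
}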

Run node index again

In this way we get the download links for all the movies on the three pages. Isn't that easy?

Save data

Finally, let's talk about how to save the crawled data. Here I chose MongoDB.
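
The MongoDB driver can also be installed from npm. The callback style used below (connect handing back a db object) matches the older 2.x driver, so pinning that major version is one option; this is an assumption about the environment, not something the article specifies:

npm install mongodb@2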

Data storage function: save()

function save() {
  var MongoClient = require('mongodb').MongoClient; // import the dependency
  MongoClient.connect(mongo_url, function (err, db) {
    if (err) {
      console.error(err);
      return;
    } else {
      console.log('Successfully connected to the database');
      var collection = db.collection('node-reptitle');
      collection.insertMany(btLink, function (err, result) { // insert the data
        if (err) {
          console.error(err);
        } else {
          console.log('Successfully saved the data');
        }
        db.close(); // close the connection once the insert has finished
      });
    }
  });
}
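
The article doesn't show where mongo_url is defined or where save() is called; both of the following are assumptions of this sketch. The connection string can live near the top of the file:

var mongo_url = 'mongodb://localhost:27017/movie'; // assumed local MongoDB instance

and the final else branch of getBtLink can then finish with:

console.log('All btLinks retrieved!');
save(); // persist the collected download links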

The operations here are very simple, so there is no need to bring in Mongoose.
Run node index again

And that's a simple crawler implemented in Node.js. I wish you all the data you want ;)

