Why use Node.js to write a crawler? Because the cheerio library is fully compatible with jQuery syntax; if you are familiar with jQuery, cheerio is really pleasant to use.
Dependency Selection
cheerio: jQuery for Node.js
http: encapsulates an HTTP server and a simple HTTP client
iconv-lite: fixes garbled characters when crawling gb2312-encoded web pages
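Both third-party packages come from npm; http is a Node.js built-in and needs no installation:

npm install cheerio iconv-lite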
Initial Implementation
Since we want to crawl the site's content, we should first look at how the site is put together.
The target website is Movie Heaven (ygdy8.net), and the goal is to grab the download links for all the latest movies.
Analyzing the Page
The page structure is as follows:
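(The original screenshot is not reproduced here. As a rough, hypothetical sketch reconstructed from the CSS selectors used in the code below, the list page looks something like this:)

<div class="co_content8">
  <table>
    <tr>
      <td>
        <a href="/html/gndy/dyzz/.../12345.html" class="ulink">Movie title</a>
      </td>
    </tr>
    ...
  </table>
</div>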
We can see that the title of each movie is an a tag whose class is ulink. Locating outward from there, the outermost box has the class co_content8.
OK. We can start the project.
Getting Movie Titles from One Page
First, require the dependencies and set the URL to be crawled.
var cheerio = require('cheerio');
var http = require('http');
var iconv = require('iconv-lite');

var url = 'http://www.ygdy8.net/html/gndy/dyzz/index.html';
Core code, index.js:
http.get(url, function (sres) {
  var chunks = [];
  sres.on('data', function (chunk) {
    chunks.push(chunk);
  });
  // chunks holds the html of the page. After transcoding, hand it to
  // cheerio.load to get a variable that implements the jQuery interface,
  // conventionally named '$'. The rest is plain jQuery.
  sres.on('end', function () {
    var titles = [];
    // Transcoding is required because this page is encoded as gb2312
    // (its <meta> tag declares charset=gb2312); otherwise the text is garbled.
    var html = iconv.decode(Buffer.concat(chunks), 'gb2312');
    var $ = cheerio.load(html, {decodeEntities: false});
    $('.co_content8 .ulink').each(function (idx, element) {
      var $element = $(element);
      titles.push({
        title: $element.text()
      });
    });
    console.log(titles);
  });
});
Run node index.
The result is as follows:
The movie titles on this page are obtained successfully. But what if we want the titles from several pages? We can't keep changing the url by hand. Of course there is a way to do this. Read on!
Getting Movie Titles from Multiple Pages
We only need to encapsulate the previous code into a function and execute it recursively.
Core code, index.js:
var index = 1; // page-number counter
var url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_';
var titles = []; // used to save the titles

function getTitle(url, i) {
  console.log('Retrieving the content of page ' + i);
  http.get(url + i + '.html', function (sres) {
    var chunks = [];
    sres.on('data', function (chunk) {
      chunks.push(chunk);
    });
    sres.on('end', function () {
      var html = iconv.decode(Buffer.concat(chunks), 'gb2312');
      var $ = cheerio.load(html, {decodeEntities: false});
      $('.co_content8 .ulink').each(function (idx, element) {
        var $element = $(element);
        titles.push({
          title: $element.text(),
          link: $element.attr('href') // also save the detail-page link; getBtLink() below needs it
        });
      });
      if (i < 2) { // for convenience, only two pages are crawled
        getTitle(url, ++index); // recursive call, page number + 1
      } else {
        console.log(titles);
        console.log('All titles retrieved!');
      }
    });
  });
}

function main() {
  console.log('Start crawling');
  getTitle(url, index);
}

main(); // run the main function
The result is as follows:
To locate the download link accurately, we must first find the element whose id is Zoom. The download link is inside it, in the a tag under the tr under the p.
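(Again as a hypothetical sketch based on the selector used in the code below, '#Zoom td a'; the real page may differ in detail:)

<div id="Zoom">
  ...
  <table>
    <tr>
      <td>
        <a href="ftp://...">ftp://... (the download link)</a>
      </td>
    </tr>
  </table>
</div>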
Then we define another function, getBtLink(), to get the download links.
var btLink = []; // used to save the download links
var count = 0; // counter for the detail pages, mirroring index above

function getBtLink(urls, n) { // urls holds the addresses of all the detail pages
  console.log('Retrieving the content of url ' + n);
  http.get('http://www.ygdy8.net' + urls[n].link, function (sres) {
    var chunks = [];
    sres.on('data', function (chunk) {
      chunks.push(chunk);
    });
    sres.on('end', function () {
      var html = iconv.decode(Buffer.concat(chunks), 'gb2312'); // transcode
      var $ = cheerio.load(html, {decodeEntities: false});
      $('#Zoom td').children('a').each(function (idx, element) {
        var $element = $(element);
        btLink.push({
          bt: $element.attr('href')
        });
      });
      if (n < urls.length - 1) {
        getBtLink(urls, ++count); // recursion, next detail page
      } else {
        console.log('All download links retrieved!');
        console.log(btLink);
      }
    });
  });
}
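How getBtLink() gets kicked off is not shown above; presumably getTitle() hands over once all the titles are collected. A sketch (not necessarily the author's original wiring) would replace the final logging branch in getTitle() with:

} else {
  console.log('All titles retrieved, fetching detail pages...');
  getBtLink(titles, 0); // walk the collected detail-page links
}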
Run node index again.
In this way, we get the download links for every movie on the crawled pages. Isn't that easy?
Saving the Data
Now let's talk about how to save the crawled data. Here I chose MongoDB.
Data storage function save():
function save() {
  var MongoClient = require('mongodb').MongoClient; // import the dependency
  // mongo_url is the MongoDB connection string,
  // e.g. 'mongodb://localhost:27017/node-reptitle'
  MongoClient.connect(mongo_url, function (err, db) {
    if (err) {
      console.error(err);
      return;
    } else {
      console.log('Successfully connected to the database');
      var collection = db.collection('node-reptitle');
      collection.insertMany(btLink, function (err, result) { // insert the data
        if (err) {
          console.error(err);
        } else {
          console.log('Successfully saved the data');
        }
        db.close(); // close only after the insert has finished
      });
    }
  });
}
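Where save() is called is likewise not shown; a natural spot, continuing the sketch above, is getBtLink()'s final branch once btLink is complete:

} else {
  console.log('All download links retrieved!');
  save(); // persist btLink to MongoDB
}

(The mongodb driver itself comes from npm: npm install mongodb.)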
The operations here are very simple, so there is no need to bring in Mongoose.
Run node index again.
And that's a crawler implemented in Node.js. I wish you all the data you want ;)