Crawl Web page data using Node.js and Cheerio

Source: Internet
Author: User
Tags: chrome developer tools

Want to automatically grab some data from a web page, or turn a piece of data from a blog into structured data?

There is no ready-made API to fetch the data?!! !@#$... It doesn't matter: web crawling can solve the problem. What is web crawling, you may ask? Web crawling is the process of programmatically retrieving the contents of a web page and extracting data from it (usually without a browser involved). In this article I will show you a powerful crawling technique that can quickly scrape web pages and is easy to get started with; it is implemented with JavaScript and Node.js.

Recently I needed to crawl a (modestly) large number of pages and analyze them to look for patterns. You know, it has been far too long since I last did something like this... and as far as I could tell, there was basically no ready-made tool available.

I have to admit that I really like Node.js. Node.js is a framework for writing JavaScript programs that run outside the browser. Under the guidance of Atwood's Law, Node.js has grown a powerful set of tools for developing network programs. Not only can you use Node.js to develop web server/WebSocket code; I find it can also cover some of my daily scripting needs. So I went looking for a ready-made Node.js library or tool for web crawling, and, sure enough, I found Cheerio. Cheerio is a Node.js library that builds a DOM structure from a piece of HTML and then provides jQuery-like CSS selector queries over it.

Very good! In this world, CSS and CSS-driven style sheets are almost the only way web pages are organized. (CSS is doing well!) These days people routinely use CSS class styles to build web pages of every kind of structure. Don't misunderstand me: this does not solve every problem; I still have to deal with a large number of pages, most of them disorganized. But for me, CSS selectors provide a powerful, quick, and simple tool for picking the data I want out of HTML. My typical crawling workflow is to first analyze the structure of the target page with Firebug or the Chrome developer tools, paying particular attention to the CSS selectors for the data I am interested in, which is the target data. The next step is to start coding with Node.js.

If you don't have Node.js installed, or you haven't upgraded it in a long time, download it from the Node.js site. The installer installs not only Node.js itself but also a package manager called npm, which you can use to quickly download and install Node.js libraries. We use npm to install the Cheerio library by running the following command:
npm install cheerio
Once the Cheerio installation is complete, we are ready to start working. First, let's look at a piece of JavaScript code that can download the contents of any Web page.
var http = require("http");

// Utility function that downloads a URL and invokes
// callback with the data.
function download(url, callback) {
  http.get(url, function(res) {
    var data = "";
    res.on("data", function(chunk) {
      data += chunk;
    });
    res.on("end", function() {
      callback(data);
    });
  }).on("error", function() {
    callback(null);
  });
}
This code downloads any URL asynchronously (via an HTTP GET request) and, when the download completes, invokes the callback function with the downloaded content as its argument. The next piece of code downloads a web page and prints its contents to the console. Note: see download.js in the source code; simply run it as follows:
node download.js
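One caveat worth noting (my addition, not from the original article): Node's built-in http module only speaks plain http:// URLs and does not follow redirects. For an https:// page you would use the built-in https module, which exposes the same get interface; a minimal sketch:

var https = require("https");

// Same pattern as download(), but for https:// URLs.
// Note: like the http version, this does not follow redirects.
function downloadSecure(url, callback) {
  https.get(url, function(res) {
    var data = "";
    res.on("data", function(chunk) {
      data += chunk;
    });
    res.on("end", function() {
      callback(data);
    });
  }).on("error", function() {
    callback(null);
  });
}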
Let's take a look at the code in detail.
var url = "http://www.dailymail.co.uk/news/article-2297585/Wild-squirrels-pose-charming-pictures-photographer-hides-nuts-miniature-props.html";

// download() is the utility function defined above (download.js).
download(url, function(data) {
  if (data) {
    console.log(data);
  }
  else console.log("error");
});
This code downloads the content of the specified URL and prints it to the console. Now that we have a way to download a page's content, let's see how Cheerio can extract the data we are interested in. Before writing any code, we have to do a little investigation and experimentation to understand the layout of the target page, so that we can extract the content we care about. In this specific example, we try to extract the main images from the URL above. We first open the page in a browser and then work out how to locate those images; you can use the Chrome developer tools, or read the page source directly (more difficult), to pin down exactly where the images live. Got it? Let's look at the code. Note: please refer to Squirrel.js.
var cheerio = require("cheerio");
var url = "http://www.dailymail.co.uk/news/article-2297585/Wild-squirrels-pose-charming-pictures-photographer-hides-nuts-miniature-props.html";

download(url, function(data) {
  if (data) {
    //console.log(data);
    var $ = cheerio.load(data);
    $("div.artSplitter > img.blkBorder").each(function(i, e) {
      console.log($(e).attr("src"));
    });
    console.log("done");
  }
  else console.log("error");
});
After requiring the Cheerio module, we download the content of the target page using the download method written earlier. Once we have the data, the cheerio.load method parses the HTML content into a DOM object, which we can then query and filter with CSS selectors, just as with jQuery (note: I named this variable $ so it looks even more like jQuery). On the target page, I noticed that the div containing the images has a class called "artSplitter", and the images themselves carry a class called "blkBorder". To select exactly those elements, I wrote this CSS selector query:
1 $("div.artSplitter > img.blkBorder")
This statement returns a list of image objects. We then use the each method to iterate over the images and print each one's src attribute. The effect is good...
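To make the load/query/extract pattern concrete, here is a tiny self-contained sketch of my own (the HTML string is invented to mirror the structure described above, not taken from the actual page):

var cheerio = require("cheerio");

// Invented HTML mirroring the div/img structure described above.
var html = '<div class="artSplitter"><img class="blkBorder" src="a.jpg"></div>' +
           '<div class="artSplitter"><img class="blkBorder" src="b.jpg"></div>';

var $ = cheerio.load(html);
$("div.artSplitter > img.blkBorder").each(function(i, e) {
  console.log($(e).attr("src"));  // prints "a.jpg", then "b.jpg"
});

Now let's look at another example; please refer to the Echo.js source code: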
var cheerio = require("cheerio");
var url = "http://www.echojs.com/";

download(url, function(data) {
  if (data) {
    // console.log(data);
    var $ = cheerio.load(data);
    $("article").each(function(i, e) {
      var link = $(e).find("h2>a");
      var poster = $(e).find("username").text();
      console.log(poster + ": [" + link.html() + "](" + link.attr("href") + ")");
    });
  }
});
In this example, the target is echojs.com. I want to crawl all the articles on the page and print them in Markdown format. First we use the following statement to find all the article nodes:
1 $("article")
Then we traverse all the nodes and find the a tag under each h2 with the following statement:
var link = $(e).find("h2>a");
Similarly, I can use the following statement to find the author's name
var poster = $(e).find("username").text();
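Each article then prints as one Markdown-style link per line. With hypothetical values (not actual output from the site), a line of output would look like:

SomeUser: [An interesting article](http://example.com/post)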
Hopefully you'll get some fun out of this tour of Node.js and Cheerio. Please see the Cheerio documentation for more information. While it may not be suitable for every large-scale web crawl, it is definitely a powerful tool, especially if you develop in front-end JavaScript/jQuery. Original address: http://www.storminthecastle.com/2013/08/25/use-node-js-to-extract-data-from-the-web-for-fun-and-profit/
