Want to automatically grab some data from a web page, or turn a chunk of content from a blog into structured data, but there is no ready-made API for it? No problem: the job can be done with web crawling. What is web crawling, you ask? Web crawling (or scraping) is the process of retrieving the contents of a web page and extracting data from it programmatically, usually without a browser being involved. This article will walk you through a capable, easy-to-learn crawler implemented in JavaScript on Node.js.

Recently I needed to crawl a (modestly) large number of pages and analyze them to look for patterns. It had been a long time since I'd done anything like that, and as far as I could tell there was basically no ready-made tool available.

I have to admit that I really like Node.js. Node.js is a framework for running JavaScript programs outside the browser. In keeping with Atwood's Law, it has grown a powerful set of tools for writing network programs. You can use Node.js not only for web server/WebSocket code; I find it also covers many of my everyday scripting needs. So I went looking for a ready-made Node.js library or tool for web crawling, and, sure enough, I found cheerio. Cheerio is a Node.js library that builds a DOM structure from a piece of HTML and then lets you query it with jQuery-style CSS selectors. Excellent!

After all, CSS and CSS-driven stylesheets are just about the only game in town for organizing and styling web pages these days, and people routinely use CSS class names to structure pages of all kinds. Don't misunderstand me: this doesn't solve the whole problem by itself, since I still have to deal with a large number of pages, most of them messy. But for me, CSS selectors provide a powerful, quick, and simple tool for picking useful data out of HTML.

My typical crawling workflow is to first analyze the structure of the target page with Firebug or the Chrome developer tools, paying particular attention to the CSS selectors that match the data I'm interested in (the target data). The next step is to move over to Node.js. If you don't have Node.js installed, or you haven't upgraded it in a long while, download it from nodejs.org. The installer sets up not only Node.js itself but also a package manager called npm, which you can use to quickly download and install Node.js libraries. We'll use npm to install the cheerio library. Run the following command:

npm install cheerio
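Before going further, here is a quick sanity check that the install worked. This tiny script is not from the original article, just a minimal sketch: it loads an inline HTML string with cheerio and queries it with a CSS selector.

var cheerio = require("cheerio");

// Load a small inline HTML fragment and query it like jQuery.
var $ = cheerio.load("<ul><li class='item'>one</li><li class='item'>two</li></ul>");

$("li.item").each(function (i, e) {
  console.log($(e).text());   // prints "one", then "two"
});

If this prints the two list items, cheerio is installed and working.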
Once the Cheerio installation is complete, we are ready to start working. First, let's look at a piece of JavaScript code that can download the contents of any Web page.
var http = require("http");

// Utility function that downloads a URL and invokes
// callback with the data.
function download(url, callback) {
  http.get(url, function (res) {
    var data = "";
    res.on("data", function (chunk) {
      data += chunk;
    });
    res.on("end", function () {
      callback(data);
    });
  }).on("error", function () {
    callback(null);
  });
}
This code downloads an arbitrary URL asynchronously (via HTTP GET) and, when the download is complete, invokes the callback function, passing the downloaded contents as its parameter; if the request fails, the callback receives null. Note: see download.js in the accompanying source code.
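One caveat worth noting (this is my addition, not part of the original article): the helper above only speaks plain HTTP, and it ignores the HTTP status code entirely. Since many pages are served over HTTPS nowadays, a minimal variant of the same helper might pick the right core module based on the URL and treat non-200 responses as failures. The sketch below keeps the same callback convention; downloadPage is just a hypothetical name chosen to avoid clashing with the original download.

var http = require("http");
var https = require("https");

// Variant of download() that also handles https:// URLs and non-200 responses.
function downloadPage(url, callback) {
  var client = url.indexOf("https://") === 0 ? https : http;
  client.get(url, function (res) {
    if (res.statusCode !== 200) {
      res.resume();        // discard the response body
      callback(null);      // treat non-200 as a failure, like a socket error
      return;
    }
    var data = "";
    res.on("data", function (chunk) {
      data += chunk;
    });
    res.on("end", function () {
      callback(data);
    });
  }).on("error", function () {
    callback(null);
  });
}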
The next piece of code uses download to fetch a web page and print its contents to the console. Let's take a look at it in detail.
var url = "http://www.dailymail.co.uk/news/article-2297585/Wild-squirrels-pose-charming-pictures-photographer-hides-nuts-miniature-props.html";

download(url, function (data) {
  if (data) {
    console.log(data);
  }
  else console.log("error");
});
This code downloads the content from the specified URL and prints it to the console. Now that we have a way to download a page, let's see how cheerio can extract the data we're interested in. Before writing any extraction code, it pays to do a little research and experimentation to understand the layout of the target page, so we know what to select. In this specific example, we'll try to grab the main images from the page above. Start by opening the page in a browser and finding a way to locate those images, either with the Chrome developer tools or by reading the page source directly (harder), until you can pin down exactly where they live. Got it? Let's look at the code. Note: see Squirrel.js in the accompanying source code.
var cheerio = require("cheerio");

var url = "http://www.dailymail.co.uk/news/article-2297585/Wild-squirrels-pose-charming-pictures-photographer-hides-nuts-miniature-props.html";

download(url, function (data) {
  if (data) {
    //console.log(data);
    var $ = cheerio.load(data);
    $("div.artSplitter > img.blkBorder").each(function (i, e) {
      console.log($(e).attr("src"));
    });
    console.log("done");
  }
  else console.log("error");
});
After requiring the cheerio module, we download the content of the target page using the download function written earlier. Once we have the data, cheerio.load parses the HTML into a DOM-like object that we can then query with jQuery-style CSS selectors (note: I named the variable $ so the code looks more like jQuery). On the target page, I noticed that the div containing the images has the class "artSplitter", and the images themselves have the class "blkBorder". To select only those images, I wrote the following CSS selector query:
$("div.artSplitter > img.blkBorder")
This statement returns the list of matching image elements. We then use the each method to iterate over them and print each image's src attribute. It works nicely. A small optional variation is sketched below; after that we'll look at another example, taken from the Echo.js source file.
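As a small variation (my addition, not part of the original article), instead of printing each src you might collect the URLs into an array for later processing. This fragment assumes it runs inside the same download callback as above, where $ has already been created with cheerio.load.

// Collect the image URLs instead of printing them one by one.
var imageUrls = [];
$("div.artSplitter > img.blkBorder").each(function (i, e) {
  imageUrls.push($(e).attr("src"));
});
console.log("found " + imageUrls.length + " images");
console.log(imageUrls);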
var cheerio = require("cheerio");

var url = "http://www.echojs.com/";

download(url, function (data) {
  if (data) {
    // console.log(data);
    var $ = cheerio.load(data);
    $("article").each(function (i, e) {
      var link = $(e).find("h2>a");
      var poster = $(e).find("username").text();
      console.log(poster + ": [" + link.html() + "](" + link.attr("href") + ")");
    });
  }
});
In this example, the target is echojs.com. I want to crawl all the articles on the page and print them in Markdown format. First we use the following statement to find all the article nodes:

$("article")
Then we traverse all of those nodes and, for each one, find the a tag under its h2 with the following statement:
var link = $(e).find("h2>a");
Similarly, I can use the following statement to find the author's name
var poster = $(e).find("username").text();
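To get closer to the structured data promised at the start, one possible extension (again my own sketch, not from the original article) is to collect each article into a plain object and write the whole list out as JSON with Node's fs module. It reuses the download helper and the same selectors as the Echo.js example above; echojs.json is just a hypothetical output filename.

var cheerio = require("cheerio");
var fs = require("fs");

var url = "http://www.echojs.com/";

download(url, function (data) {
  if (data) {
    var $ = cheerio.load(data);
    var items = [];
    $("article").each(function (i, e) {
      var link = $(e).find("h2>a");
      items.push({
        poster: $(e).find("username").text(),
        title: link.text(),
        href: link.attr("href")
      });
    });
    // Save the scraped list as pretty-printed JSON.
    fs.writeFileSync("echojs.json", JSON.stringify(items, null, 2));
    console.log("wrote " + items.length + " items to echojs.json");
  }
});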
Hopefully this article has shown you how to have some fun with Node.js and cheerio. Please check the cheerio documentation for more information. While it may not be suitable for every large-scale crawl, it is definitely a powerful tool, especially for those of us coming from front-end JavaScript/jQuery. Original address: http://www.storminthecastle.com/2013/08/25/use-node-js-to-extract-data-from-the-web-for-fun-and-profit/