Node.js crawler with superagent and cheerio



Preface

I had heard about crawlers for a long time. Over the past few days I started learning Node.js and wrote a crawler that scrapes the article titles, user names, read counts, recommendation counts, and user avatars from the cnblogs homepage. Here is a small summary.

The following pieces are used:

1. Node core module: File System (fs)

2. Third-party module for HTTP requests: superagent

3. Third-party module for parsing the DOM: cheerio

For detailed explanations and the full APIs of these modules, please refer to their documentation; the demo only shows basic usage.

Preparations

Use npm to manage dependencies; the dependency information is stored in package.json.
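If the project does not have a package.json yet, one can be generated first (assuming npm is already installed):

    # Create a package.json with default values
    npm init -y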

    # Install the third-party modules
    cnpm install --save superagent cheerio

Introduce the required functional modules

    // Introduce third-party modules: superagent for HTTP requests, cheerio for parsing the DOM
    const request = require('superagent');
    const cheerio = require('cheerio');
    const fs = require('fs');

Request and Parse the Page

To crawl the content of the blog homepage, first request the homepage address and obtain the returned html. superagent is used for the HTTP request; the basic usage is as follows:

    request.get(url)
        .end((error, res) => {
            // do something
        });

This initiates a GET request to the specified url. When a request error occurs, error is set (if no error exists, error is null or undefined), and res holds the returned data.
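As a self-contained example (using the same target site as the demo), the following sketch fetches a page and logs the response status and the size of the returned html:

    // Minimal superagent sketch: fetch a page and inspect the response
    const request = require('superagent');

    request.get('https://www.cnblogs.com/').end((error, res) => {
        if (error) {
            console.log(error);
            return;
        }
        console.log(res.status);      // HTTP status code, e.g. 200
        console.log(res.text.length); // size of the returned html
    });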

After getting the html content, you need cheerio to parse the DOM and extract the desired data. cheerio first loads the target html and then parses it; its API is very similar to jQuery's, so anyone familiar with jQuery will pick it up quickly.
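To illustrate the jQuery-like feel, here is a minimal cheerio sketch (the HTML fragment is made up for illustration):

    // Load an HTML fragment, then query it with jQuery-style selectors
    const cheerio = require('cheerio');
    const $ = cheerio.load('<ul><li class="item">a</li><li class="item">b</li></ul>');
    $('.item').each((index, element) => {
        console.log(index, $(element).text()); // prints: 0 'a', then 1 'b'
    });

Now let's look at the demo code directly: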

    // Target link
    let targetUrl = 'https://www.cnblogs.com/';
    // Temporarily hold the text content and the image addresses
    let content = '';
    let imgs = [];

    // Initiate the request
    request.get(targetUrl).end((error, res) => {
        if (error) {
            // Request error: print it and return
            console.log(error);
            return;
        }
        // cheerio needs to load the html first
        let $ = cheerio.load(res.text);
        // Capture the required data; each() is the traversal method cheerio provides
        $('#post_list .post_item').each((index, element) => {
            // Analyze the DOM structure of the required data:
            // locate the target element through a selector, then read the data
            let temp = {
                'title': $(element).find('h3').text(),
                'author': $(element).find('.post_item_foot > a').text(),
                'read': +$(element).find('.article_view').text().slice(3, -2),
                'recommendations': +$(element).find('.diggnum').text()
            };
            // Splice the data
            content += JSON.stringify(temp) + '\n';
            // Obtain the image address in the same way
            if ($(element).find('img.pfs').length > 0) {
                imgs.push($(element).find('img.pfs').attr('src'));
            }
        });
        // Store the data
        mkdir('./content', saveContent);
        mkdir('./imgs', downloadImg);
    });

Store Data

After parsing the DOM above, we have spliced the required text and collected the image URLs. Now we store the text in a txt file in a specified directory and download the images to another directory.

First create the directories, using the Node.js core File System module:

    // Create a directory, then run the callback
    function mkdir(_path, callback) {
        if (fs.existsSync(_path)) {
            console.log(`${_path} directory already exists`);
            callback();
        } else {
            fs.mkdir(_path, (error) => {
                if (error) {
                    return console.log(`failed to create ${_path} directory`);
                }
                console.log(`created ${_path} directory successfully`);
                callback(); // only runs once the directory has been created
            });
        }
    }

With the directory in place, we can write the data. The text content is already assembled in memory, so writeFile() can write it out directly:

    // Save the text content to a txt file
    function saveContent() {
        fs.writeFile('./content/content.txt', content.toString(), (error) => {
            if (error) console.log(error);
        });
    }
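Note that writeFile() overwrites the target file on each run. If the results should accumulate across runs instead, fs.appendFile() takes the same arguments; a minimal sketch (appendContent is a hypothetical helper, not part of the demo):

    // Sketch: append to the file instead of overwriting it
    function appendContent() {
        fs.appendFile('./content/content.txt', content.toString(), (error) => {
            if (error) console.log(error);
        });
    }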

To download the images, superagent is used again: it can return a response stream directly, which we pipe straight into a local file using Node's stream piping.

    // Download the crawled images
    function downloadImg() {
        imgs.forEach((imgUrl, index) => {
            // Get the image name
            let imgName = imgUrl.split('/').pop();
            // Save the downloaded image to the specified directory
            let stream = fs.createWriteStream(`./imgs/${imgName}`);
            let req = request.get('https:' + imgUrl); // response stream
            req.pipe(stream);
            console.log(`start downloading image https:${imgUrl} --> ./imgs/${imgName}`);
        });
    }
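To try the demo, save all the snippets above into a single file (the file name crawler.js here is just an assumption) and run it with Node:

    node crawler.js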

Effect

Run the demo and check the result: the data is crawled as expected.
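For reference, each line of content.txt holds one JSON object shaped like the temp object built above (the values below are made up for illustration):

    {"title":"Some post title","author":"someuser","read":123,"recommendations":4}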

This very simple demo may not be rigorous, but it is a first small step with Node.

Summary

The above described a Node.js crawler built with superagent and cheerio. I hope it helps you. If you have any questions, please leave a message and I will reply promptly. Thank you for your support!
