Node.js crawler with superagent and cheerio



Preface

I had heard about crawlers for a long time. Over the past few days I started learning Node.js and wrote a crawler that scrapes the article titles, user names, read counts, recommendation counts, and user avatars from the cnblogs homepage. Here is a small summary.

The following pieces are used:

1. Node core module: File System (fs)

2. Third-party module for HTTP requests: superagent

3. Third-party module for parsing the DOM: cheerio

For detailed explanations and the full APIs of these modules, please refer to their documentation; the demo only shows basic usage.

Preparations

Use npm to manage dependencies; the dependency information is stored in package.json.
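If the project does not have a package.json yet, one can be generated first (assuming npm is already installed):

    # Create a package.json with default values
    npm init -y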

    # Install the third-party modules
    cnpm install --save superagent cheerio

Introduce the required functional modules

    // Introduce third-party modules: superagent for HTTP requests, cheerio for parsing the DOM
    const request = require('superagent');
    const cheerio = require('cheerio');
    const fs = require('fs');

Request and Parse the Page

To crawl the content of the blog homepage, first request the homepage address and obtain the returned html. superagent is used for the HTTP request; the basic usage is as follows:

    request.get(url)
        .end((error, res) => {
            // do something
        });

This initiates a GET request to the specified url. When a request error occurs, error is set (if no error exists, error is null or undefined), and res holds the returned data.
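As a self-contained example (using the same target site as the demo), the following sketch fetches a page and logs the response status and the size of the returned html:

    // Minimal superagent sketch: fetch a page and inspect the response
    const request = require('superagent');

    request.get('https://www.cnblogs.com/').end((error, res) => {
        if (error) {
            console.log(error);
            return;
        }
        console.log(res.status);      // HTTP status code, e.g. 200
        console.log(res.text.length); // size of the returned html
    });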

After getting the html content, you need cheerio to parse the DOM and extract the desired data. cheerio first loads the target html and then parses it; its API is very similar to jQuery's, so anyone familiar with jQuery will pick it up quickly.
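To illustrate the jQuery-like feel, here is a minimal cheerio sketch (the HTML fragment is made up for illustration):

    // Load an HTML fragment, then query it with jQuery-style selectors
    const cheerio = require('cheerio');
    const $ = cheerio.load('<ul><li class="item">a</li><li class="item">b</li></ul>');
    $('.item').each((index, element) => {
        console.log(index, $(element).text()); // prints: 0 'a', then 1 'b'
    });

Now let's look at the demo code directly: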

    // Target link
    let targetUrl = 'https://www.cnblogs.com/';
    // Temporarily hold the text content and the image addresses
    let content = '';
    let imgs = [];

    // Initiate the request
    request.get(targetUrl).end((error, res) => {
        if (error) {
            // Request error: print it and return
            console.log(error);
            return;
        }
        // cheerio needs to load the html first
        let $ = cheerio.load(res.text);
        // Capture the required data; each() is the traversal method cheerio provides
        $('#post_list .post_item').each((index, element) => {
            // Analyze the DOM structure of the required data:
            // locate the target element through a selector, then read the data
            let temp = {
                'title': $(element).find('h3').text(),
                'author': $(element).find('.post_item_foot > a').text(),
                'read': +$(element).find('.article_view').text().slice(3, -2),
                'recommendations': +$(element).find('.diggnum').text()
            };
            // Splice the data
            content += JSON.stringify(temp) + '\n';
            // Obtain the image address in the same way
            if ($(element).find('img.pfs').length > 0) {
                imgs.push($(element).find('img.pfs').attr('src'));
            }
        });
        // Store the data
        mkdir('./content', saveContent);
        mkdir('./imgs', downloadImg);
    });

Store Data

After parsing the DOM above, we have spliced the required text and collected the image URLs. Now we store the text in a txt file in a specified directory and download the images to another directory.

First create the directories, using the Node.js core File System module:

    // Create a directory, then run the callback
    function mkdir(_path, callback) {
        if (fs.existsSync(_path)) {
            console.log(`${_path} directory already exists`);
            callback();
        } else {
            fs.mkdir(_path, (error) => {
                if (error) {
                    return console.log(`failed to create ${_path} directory`);
                }
                console.log(`created ${_path} directory successfully`);
                callback(); // only runs once the directory has been created
            });
        }
    }

With the directory in place, we can write the data. The text content is already assembled in memory, so writeFile() can write it out directly:

    // Save the text content to a txt file
    function saveContent() {
        fs.writeFile('./content/content.txt', content.toString(), (error) => {
            if (error) console.log(error);
        });
    }
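Note that writeFile() overwrites the target file on each run. If the results should accumulate across runs instead, fs.appendFile() takes the same arguments; a minimal sketch (appendContent is a hypothetical helper, not part of the demo):

    // Sketch: append to the file instead of overwriting it
    function appendContent() {
        fs.appendFile('./content/content.txt', content.toString(), (error) => {
            if (error) console.log(error);
        });
    }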

To download the images, superagent is used again: it can return a response stream directly, which we pipe straight into a local file using Node's stream piping.

    // Download the crawled images
    function downloadImg() {
        imgs.forEach((imgUrl, index) => {
            // Get the image name
            let imgName = imgUrl.split('/').pop();
            // Save the downloaded image to the specified directory
            let stream = fs.createWriteStream(`./imgs/${imgName}`);
            let req = request.get('https:' + imgUrl); // response stream
            req.pipe(stream);
            console.log(`start downloading image https:${imgUrl} --> ./imgs/${imgName}`);
        });
    }
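To try the demo, save all the snippets above into a single file (the file name crawler.js here is just an assumption) and run it with Node:

    node crawler.js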

Effect

Run the demo and check the result: the data is crawled as expected.
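For reference, each line of content.txt holds one JSON object shaped like the temp object built above (the values below are made up for illustration):

    {"title":"Some post title","author":"someuser","read":123,"recommendations":4}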

This very simple demo may not be rigorous, but it is a first small step with Node.

Summary

The above described a Node.js crawler built with superagent and cheerio. I hope it helps you. If you have any questions, please leave a message and I will reply promptly. Thank you for your support!
