Node.js crawler with superagent and cheerio
Preface
I have heard about crawlers for a long time. Having started to learn Node.js over the past few days, I wrote a crawler that scrapes the article titles, author names, read counts, recommendation counts, and author avatars from the cnblogs.com homepage. Here is a small summary.
The following pieces are used:
1. Node core module: File System (fs)
2. Third-party module for HTTP requests: superagent
3. Third-party module for parsing the DOM: cheerio
For detailed explanations and the full APIs of these modules, please refer to their documentation; the demo only uses their basic features.
Preparations
Use npm to manage dependencies. The dependency information is stored in package.json.
// install the third-party modules used
cnpm install --save superagent cheerio
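After installation, package.json records the dependencies, roughly as below (the version numbers here are illustrative, not taken from the original project):

{
  "dependencies": {
    "cheerio": "^1.0.0-rc.2",
    "superagent": "^3.8.3"
  }
}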
Import the required modules
// import the third-party modules: superagent is used for http requests, cheerio is used to parse the DOM
const request = require('superagent');
const cheerio = require('cheerio');
const fs = require('fs');
Request and parse the page
To get the content of the blog homepage, first request the homepage address and obtain the returned HTML. superagent is used here for the HTTP request; its basic usage is as follows:
request.get(url)
    .end((error, res) => {
        // do something
    });
This initiates a GET request to the specified url. If the request fails, error is set (otherwise it is null or undefined), and res holds the returned data.
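Put together, a minimal self-contained request looks like this (the URL is the demo's target; the logged fields are standard superagent response properties):

const request = require('superagent');

request.get('https://www.cnblogs.com/')
    .end((error, res) => {
        if (error) {
            console.log(error);
            return;
        }
        console.log(res.status);      // http status code, e.g. 200
        console.log(res.text.length); // length of the returned html
    });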
After getting the HTML content, use cheerio to parse the DOM and extract the desired data. cheerio must load the target HTML first and then query it; its API is very similar to jQuery's, so anyone familiar with jQuery will pick it up quickly. A minimal standalone sketch follows, and then the full demo code.
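Here is the warm-up sketch of cheerio on its own (the html string is made up for illustration):

const cheerio = require('cheerio');

// load the target html first, then query it with jquery-like selectors
const html = '<ul id="list"><li class="item">a</li><li class="item">b</li></ul>';
const $ = cheerio.load(html);

$('#list .item').each((index, element) => {
    console.log($(element).text()); // prints: a, then b
});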
// target link
let targetUrl = 'https://www.cnblogs.com/';

// used to temporarily save the text content and the image addresses
let content = '';
let imgs = [];

// initiate the request
request.get(targetUrl)
    .end((error, res) => {
        if (error) {
            // request error: print the error and return
            console.log(error);
            return;
        }
        // cheerio needs to load the html first
        let $ = cheerio.load(res.text);

        // capture the required data; each() is the method cheerio provides for traversal
        $('#post_list .post_item').each((index, element) => {
            // analyze the DOM structure of the required data:
            // locate the target element through a selector, then get the data
            let temp = {
                'title': $(element).find('h3').text(),
                'author': $(element).find('.post_item_foot > a').text(),
                'read': +$(element).find('.article_view').text().slice(3, -2),
                'recommend': +$(element).find('.diggnum').text()
            };
            // splice the data
            content += JSON.stringify(temp) + '\n';
            // obtain the image address in the same way
            if ($(element).find('img.pfs').length > 0) {
                imgs.push($(element).find('img.pfs').attr('src'));
            }
        });

        // store the data
        mkdir('./content', saveContent);
        mkdir('./imgs', downloadImg);
    });
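For reference, each line appended to content is one JSON string shaped like the following (the values are made up for illustration; only the shape matters):

{"title":"Some post title","author":"someone","read":123,"recommend":4}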
Store Data
After parsing the DOM above, the required information has been assembled and the image URLs collected. Now store the text content in a txt file in a specified directory, and download the images to another directory.
First create the directories, using the Node core File System module:
// create a directory, then run the callback
function mkdir(_path, callback) {
    if (fs.existsSync(_path)) {
        console.log(`${_path} directory already exists`);
        callback();
    } else {
        fs.mkdir(_path, (error) => {
            if (error) {
                return console.log(`failed to create ${_path} directory`);
            }
            console.log(`created ${_path} directory successfully`);
            callback(); // not executed if the directory could not be created
        });
    }
}
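A note on the design: fs.mkdir is asynchronous, so the callback is invoked from inside its completion handler. On newer Node versions (>= 10.12) a synchronous variant with the recursive option would also work; a sketch, not part of the original demo:

// hypothetical synchronous alternative: creates the directory if missing, then runs the callback
function mkdirSyncThen(_path, callback) {
    fs.mkdirSync(_path, { recursive: true }); // does not throw if the directory already exists
    callback();
}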
With the directory in place, we can write the data. The text content has already been assembled, so writeFile() can write it out directly.
// save the text content to a txt file
function saveContent() {
    fs.writeFile('./content/content.txt', content.toString(), (error) => {
        if (error) console.log(error);
    });
}
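If the crawler were run repeatedly and the results should accumulate, fs.appendFile could be used instead of overwriting; a hypothetical variant, not part of the original demo:

// append to the file instead of overwriting it
function appendContent() {
    fs.appendFile('./content/content.txt', content.toString(), (error) => {
        if (error) console.log(error);
    });
}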
The images are also downloaded with superagent. superagent returns a response stream, which can be piped straight into a local file using Node's streams.
// download the crawled images
function downloadImg() {
    imgs.forEach((imgUrl, index) => {
        // get the image name
        let imgName = imgUrl.split('/').pop();
        // save the downloaded image to the specified directory
        let stream = fs.createWriteStream(`./imgs/${imgName}`);
        let req = request.get('https:' + imgUrl); // response stream
        req.pipe(stream);
        console.log(`start downloading image https:${imgUrl} --> ./imgs/${imgName}`);
    });
}
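Note that pipe() returns immediately; to know when a download has actually finished, listen on the write stream. These are standard Node writable-stream events, added here as a sketch:

// inside downloadImg(), after req.pipe(stream):
stream.on('finish', () => {
    console.log(`finished downloading ./imgs/${imgName}`);
});
stream.on('error', (error) => {
    console.log(error);
});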
Effect
Run the demo and check the result: the data is crawled as expected.
A very simple demo like this may not be entirely rigorous, but it is a first small step with Node.
Summary
The above describes a Node.js crawler built with superagent and cheerio. I hope it helps. If you have any questions, please leave a message and I will reply in a timely manner. Thank you very much for your support!