Use Node.js to write a crawler and scrape a cheeky image gallery


When it comes to crawlers, a lot of people think they're a big deal: wow, can't they scrape pretty girls' photos, can't they grab video clips? The answer is yes, a crawler can do those things. But as upright programmers, we must use crawlers to serve us within the bounds of the law, not to do whatever we want. (PS: there should be applause here, thank you.)

Today I've brought you a crawler written in Node.js. Tutorials usually sound boring, so instead I'll teach you how to scrape a gallery of pretty pictures. Let's get to it:

Did that give you a burst of motivation?

When it comes to crawlers, it is objectively true that "every site can be crawled." The content on the Internet is written by people, and people are lazy about naming things (they won't call the first page A and the next page 8), so there is always a pattern, and that pattern is what makes crawling possible. You could say there is no site in the world that cannot be crawled. Even though sites differ, the principle is similar: most crawlers follow the process of sending a request -> getting the page -> parsing the page -> downloading the content -> saving the content. The tools vary: you may use Python, I use Node.js, someone else uses PHP, but the idea is the same.
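To make that flow concrete, here is a minimal, generic sketch using only Node's built-in modules; the target URL is just a placeholder, and the "parsing" step simply grabs the page title:

const https = require("https");
const fs = require("fs");

// send a request -> get the page -> parse it -> save the content
https.get("https://example.com/", res => {          // 1. send the request
  let html = "";
  res.on("data", chunk => (html += chunk));          // 2. receive the page
  res.on("end", () => {
    const title = (html.match(/<title>(.*?)<\/title>/) || [])[1]; // 3. a stand-in "parse" step
    console.log("page title:", title);
    fs.writeFileSync("page.html", html);             // 4. save the content to disk
  });
});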

Since we are writing the crawler in Node.js, we need a Node environment. If you don't have one set up yet, please refer to my first blog post.

OK, let's start with the crawler process and analyze some of the modules we need.

First, we need to send a request to get the page. Here we use the request-promise module.

const rp = require("request-promise"); // import the request-promise module

async function getPage(url) {
  const data = {
    url,
    res: await rp({ url: url })
  };
  return data; // we return an object holding both the page URL and the page content
}
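As a quick sanity check, here is how the getPage helper above might be called; the URL is the example page used later in this article:

// hypothetical usage of the getPage helper defined above
getPage("http://www.mzitu.com/125685").then(page => {
  console.log(page.url);        // the URL we requested
  console.log(page.res.length); // length of the downloaded HTML
});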

Second, we need to parse the page. We use a module called cheerio to parse the res in the object returned above into something with a jQuery-like calling pattern. Cheerio uses a very simple, consistent DOM model, so parsing, manipulating, and rendering are all very efficient. Preliminary end-to-end benchmarks suggest cheerio is about eight times faster than JSDOM.

const cheerio = require("cheerio"); // import the cheerio module
const $ = cheerio.load(data.res);   // convert the HTML into queryable nodes

Now let's analyze the page we are about to crawl. "www.mzitu.com/125685" is the URL we will crawl; press F12 and look at the DOM structure:

Based on this structure, we can use $(".main-image").find("img")[0].attribs.src to get the address of this picture. (If you don't see why attribs.src works, console.log() each step of the chain and take a look.)
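For example, a minimal sketch of that step-by-step inspection, assuming data is the object returned by getPage for the page above:

const cheerio = require("cheerio");

const $ = cheerio.load(data.res);            // data.res holds the HTML fetched by getPage
const img = $(".main-image").find("img")[0]; // the first <img> inside .main-image
console.log(img.attribs);                    // inspect the raw attributes object first
const imgSrc = img.attribs.src;              // the address of the picture we want to download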

Finally, the most critical part: we use the fs module to create folders and download files. A few notes on the fs methods we use:

1. fs.existsSync(downloadPath): check whether the folder exists.

2. fs.mkdirSync(downloadPath): create the folder.

3. fs.createWriteStream(`${downloadPath}/${index}.jpg`): write the file. Note that fs.createWriteStream does not create a missing folder on its own, so make sure the folder where the file will be saved has been created before you use it.
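A minimal sketch putting those three calls together; the folder name here is just a placeholder:

const fs = require("fs");

const downloadPath = "./example-album";  // hypothetical target folder
if (!fs.existsSync(downloadPath)) {      // 1. does the folder exist yet?
  fs.mkdirSync(downloadPath);            // 2. no: create it
}
// 3. createWriteStream will not create a missing folder for us,
//    so only open the stream once the folder is guaranteed to exist
const file = fs.createWriteStream(`${downloadPath}/1.jpg`);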

OK, that's the general approach: the modules and steps above.

Now let's analyze a few specifics of this particular site:

1. This site shows only one picture per page, and the URL of each page follows a pattern based on "http://www.mzitu.com/125685" (entering "http://www.mzitu.com/125685/1" also jumps to this page), then "http://www.mzitu.com/125685/2", and so on. We can crawl according to this rule, and we can read how many pages this group of images has from the page-number bar at the bottom of the screen:

  

2. We generally don't want to crawl just one group of images, but the six-digit number at the end of each album's URL follows basically no rule, so we can only start from the first directory page. I won't describe the exact method in detail; it's the same as how we get a picture's URL.

3. Similarly, once we have crawled everything on one directory page, we crawl the second directory page, "http://www.mzitu.com/page/2/", on the same principle as the first.

4. However, some sites use anti-hotlinking measures. To get around them, we need to forge the request headers. You can find the values in the Network tab of the F12 developer tools; I trust readers who have made it this far will understand. A short sketch of how the forged headers are used follows the headers object below.

let headers = {
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
  "Accept-Encoding": "gzip, deflate",
  "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
  "Cache-Control": "no-cache",
  Host: "i.meizitu.net",
  Pragma: "no-cache",
  "Proxy-Connection": "keep-alive",
  Referer: data.url, // the Referer is based on the URL of the page being crawled
  "Upgrade-Insecure-Requests": 1,
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.19 Safari/537.36"
};
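And here is a short sketch of how such forged headers might be attached to an image request, following the same request-promise + pipe approach used in the full code below; the image URL and Referer are placeholders for the values the crawler extracts at runtime:

const rp = require("request-promise");
const fs = require("fs");

// hypothetical image URL and Referer standing in for the real ones
const imgSrc = "http://i.meizitu.net/2018/01/01a01.jpg";
const headers = {
  Referer: "http://www.mzitu.com/125685",
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.19 Safari/537.36"
};

rp({ url: imgSrc, headers })               // send the forged headers with the request
  .pipe(fs.createWriteStream("./1.jpg"))   // stream the image straight to disk
  .on("finish", () => console.log("download finished"));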

The above is my whole idea.

The code. First, the business code (the model):
const rp = require("request-promise"), // import the request-promise module
  fs = require("fs"),                  // import the fs module
  cheerio = require("cheerio");        // import the cheerio module

const depositPath = "d:/blog/reptile/meizi/"; // where the photos are stored
let downloadPath;                             // folder for the album being downloaded

module.exports = {
  // request a page and return its URL together with its content
  async getPage(url) {
    const data = {
      url,
      res: await rp({ url: url })
    };
    return data;
  },

  // parse the directory page and collect every album link on it
  getUrl(data) {
    let list = [];
    const $ = cheerio.load(data.res); // convert the HTML into queryable nodes
    $("#pins li a").children().each(async (i, e) => {
      let obj = {
        name: e.attribs.alt,          // album name, later used as the folder name
        url: e.parent.attribs.href    // URL of the album page
      };
      list.push(obj);                 // collect every link found on the directory page
    });
    return list;
  },

  // create the folder for an album; skip it if the folder already exists
  getTitle(obj) {
    downloadPath = depositPath + obj.name;
    if (!fs.existsSync(downloadPath)) {       // does this folder exist yet?
      fs.mkdirSync(downloadPath);             // no: create it
      console.log(`${obj.name} folder was created successfully`);
      return true;
    } else {
      console.log(`${obj.name} folder already exists`);
      return false;
    }
  },

  // read the total number of pictures in the album from its pagination bar
  getImagesNum(res, name) {
    if (res) {
      let $ = cheerio.load(res);
      let len = $(".pagenavi").find("a").find("span").length;
      if (len == 0) {
        fs.rmdirSync(`${depositPath}${name}`); // delete the folder that cannot be downloaded
        return 0;
      }
      let pageIndex = $(".pagenavi").find("a").find("span")[len - 2].children[0].data;
      return pageIndex; // return the total number of pictures
    }
  },

  // download one photo of the album
  async downloadImage(data, index) {
    if (data.res) {
      var $ = cheerio.load(data.res);
      if ($(".main-image").find("img")[0]) {
        let imgSrc = $(".main-image").find("img")[0].attribs.src; // image address
        let headers = {
          Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
          "Accept-Encoding": "gzip, deflate",
          "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
          "Cache-Control": "no-cache",
          Host: "i.meizitu.net",
          Pragma: "no-cache",
          "Proxy-Connection": "keep-alive",
          Referer: data.url, // Referer is based on the page being crawled
          "Upgrade-Insecure-Requests": 1,
          "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.19 Safari/537.36"
        }; // forged headers to get around the anti-hotlinking check
        await rp({
          url: imgSrc,
          resolveWithFullResponse: true,
          headers
        }).pipe(fs.createWriteStream(`${downloadPath}/${index}.jpg`)); // download
        console.log(`${downloadPath}/${index}.jpg downloaded successfully`);
      } else {
        console.log(`${downloadPath}/${index}.jpg failed to load`);
      }
    }
  }
};
Main logic code:
const model = require("./model"),
  basicPath = "http://www.mzitu.com/page/";

let start = 1,
  end = 10;

const main = async url => {
  let list = [],
    index = 0;
  const data = await model.getPage(url);
  list = model.getUrl(data);
  downloadImages(list, index); // start downloading
};

const downloadImages = async (list, index) => {
  if (index == list.length) {
    start++;
    if (start < end) {
      main(basicPath + start); // move on and crawl the next directory page of albums
    }
    return false;
  }
  if (model.getTitle(list[index])) {
    let item = await model.getPage(list[index].url),                // get the album page
      imageNum = model.getImagesNum(item.res, list[index].name);    // get the number of pictures in this album
    for (var i = 1; i <= imageNum; i++) {
      let page = await model.getPage(list[index].url + `/${i}`);    // visit each page of this album in turn
      await model.downloadImage(page, i);                           // download the picture on it
    }
    index++;
    downloadImages(list, index); // this album is done, download the next one
  } else {
    index++;
    downloadImages(list, index); // folder already exists, skip to the next album
  }
};

main(basicPath + start);

This project has been uploaded to my GitHub repository: https://github.com/lunlunshiwo/NodeJs-crawler. Stars are welcome, thank you.

Summary:

As for follow-up operations, such as saving the results locally or to a MongoDB database, I will write about them next time. Please follow me.

A solemn reminder: crawlers are great, but they must never be used to break the law.

If this article infringes on your interests, please leave a message.

If you think this article is good, don't skimp on your likes and follows. Thank you.

  
