Using Node.js to crawl a Web page and check whether it contains specific content

Source: Internet
Author: User

When Node.js fetches a Web page, the content bound to the data event arrives in several chunks. If you want to match against the whole page, you have to wait for the request to finish and operate on the accumulated data inside the end event handler.

For example, to check whether each page contains www.baidu.com, here is the code, with no further ado:

// Import modules
var http = require('http'),
    fs = require('fs'),
    url = require('url');

// Append a result to a file
var writeRes = function (p, r) {
    fs.appendFile(p, r, function (err) {
        if (err) console.log(err);
        else console.log(r);
    });
},

// Send the request, check the content, and write the result to a file
postHttp = function (arr, num) {
    console.log('Entry ' + num + '!');
    var a = arr[num].split('-');
    if (!a[0] || !a[1]) return;
    var address = url.parse(a[1]),
        options = {
            host: address.host,
            path: address.path,
            hostname: address.hostname,
            method: 'GET',
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36'
            }
        };
    var req = http.request(options, function (res) {
        if (res.statusCode === 200) {
            res.setEncoding('utf-8');
            var data = '';
            res.on('data', function (rd) {
                data += rd;
            });
            res.on('end', function () {
                if (!~data.indexOf('www.baidu.com')) {
                    return writeRes('./no2.txt', a[0] + '--' + a[1] + '\n');
                } else {
                    return writeRes('./has2.txt', a[0] + '--' + a[1] + '\n');
                }
            });
        } else {
            writeRes('./error2.txt', a[0] + '--' + a[1] + '--' + res.statusCode + '\n');
        }
    });
    req.on('error', function (e) {
        writeRes('./error2.txt', a[0] + '--' + a[1] + '--' + e + '\n');
    });
    req.end();
},

// Read the file listing the pages to crawl (one "name-url" entry per line)
openFile = function (path, coding) {
    fs.readFile(path, coding, function (err, data) {
        var res = data.split('\n');
        for (var i = 0, rl = res.length; i < rl; i++) {
            if (!res[i]) continue;
            postHttp(res, i);
        }
    });
};

openFile('./sites.log', 'utf-8');

The code above should be easy to follow; if anything is unclear, feel free to leave me a message. How you apply it in practice is up to you.

Below is a brief comparison of Node.js's crawling ability against other languages.

First, PHP. Its advantage: there are plenty of ready-made frameworks and tools on the Internet for fetching and parsing HTML, so you can just pick one up and use it, which saves effort. Its disadvantages: first, speed and efficiency are a problem. Once, while downloading movie posters from a crontab job that ran periodically without any optimization, so many PHP processes were spawned that memory was exhausted. The syntax is also verbose: too many keywords and symbols, not concise, it gives the impression of careless design, and it is tedious to write.

Node.js. Its advantage is efficiency, efficiency, and more efficiency. Because its networking is asynchronous, it is essentially as powerful as hundreds of processes running concurrently, while its memory and CPU footprint stay very small. If no complex processing has to be done on the crawled data, the system bottleneck is basically bandwidth and the I/O speed of writing to a MySQL database. The flip side of this advantage is a disadvantage: asynchronous networking means callbacks. If the business logic is linear, for example you must wait until the previous page has finished crawling and delivered its data before crawling the next page, or there are even multiple layers of dependencies, you end up in terrible nested callbacks, and the code structure and logic become a mess. Of course, flow-control tools such as step can solve these problems.
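When the crawl really must be linear, one way to keep the nesting flat without pulling in a library such as step is a small recursive driver. This is only a sketch under an assumption: fetchPage stands in for whatever http.request wrapper you use and is not defined in the code above.

```javascript
// Sketch: crawl URLs strictly one after another. Each page is fetched
// only after the previous callback has delivered its data, but the
// nesting depth stays constant instead of growing with every page.
// fetchPage(url, cb) is an assumed wrapper around http.request.
function crawlSequentially(urls, fetchPage, onDone, results) {
    results = results || [];
    if (urls.length === 0) return onDone(results); // all pages crawled
    fetchPage(urls[0], function (body) {
        results.push(body);
        // recurse for the next page only after this one has finished
        crawlSequentially(urls.slice(1), fetchPage, onDone, results);
    });
}

// Demo with a synchronous stand-in fetch (no real network involved):
function demoFetch(u, cb) { cb('<html>' + u + '</html>'); }
crawlSequentially(['page1', 'page2'], demoFetch, function (pages) {
    console.log(pages.length); // prints 2
});
```

Because the data from each callback is in scope when the next fetch is issued, the same shape also handles multi-layer dependencies while the code stays flat.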

Finally, Python. If you have no extreme efficiency requirements, I recommend Python. First, its syntax is very concise: the same statement can save you many keystrokes. Second, Python is well suited to data processing, with features such as function-argument unpacking, list comprehensions, and matrix handling that are very convenient.
