When Node.js fetches a web page and you bind to the 'data' event, the body arrives in several chunks. If you want to match against the whole page, you have to wait for the request to finish and operate on the accumulated data in the 'end' event handler.
For example, suppose you want to check whether www.baidu.com appears anywhere in a page. Without further ado, here is the code:
// Introduce modules
var http = require('http'),
    fs = require('fs'),
    url = require('url');

// Write a result line, appending it to the given file
var writeRes = function (p, r) {
        fs.appendFile(p, r, function (err) {
            if (err) console.log(err);
            else console.log(r);
        });
    },

    // Send the request, check the content, and write the result to a file
    postHttp = function (arr, num) {
        console.log('Entry ' + num + '!');
        var a = arr[num].split('-');
        if (!a[0] || !a[1]) { return; }
        var address = url.parse(a[1]),
            options = {
                host: address.host,
                path: address.path,
                hostname: address.hostname,
                method: 'GET',
                headers: {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36'
                }
            };
        var req = http.request(options, function (res) {
            if (res.statusCode === 200) {
                res.setEncoding('utf-8');
                var data = '';
                res.on('data', function (rd) {
                    data += rd;
                });
                res.on('end', function () {
                    if (!~data.indexOf('www.baidu.com')) {
                        return writeRes('./no2.txt', a[0] + '--' + a[1] + '\n');
                    } else {
                        return writeRes('./has2.txt', a[0] + '--' + a[1] + '\n');
                    }
                });
            } else {
                writeRes('./error2.txt', a[0] + '--' + a[1] + '--' + res.statusCode + '\n');
            }
        });
        req.on('error', function (e) {
            writeRes('./error2.txt', a[0] + '--' + a[1] + '--' + e + '\n');
        });
        req.end();
    },

    // Read the file that lists the pages to be crawled
    openFile = function (path, coding) {
        fs.readFile(path, coding, function (err, data) {
            var res = data.split('\n');
            for (var i = 0, rl = res.length; i < rl; i++) {
                if (!res[i]) continue;
                postHttp(res, i);
            }
        });
    };

openFile('./sites.log', 'utf-8');
The code above should be easy to follow; if anything is unclear, feel free to leave me a message. It is up to you to put it to practical use.
Below is a brief look at how Node.js compares with other languages for crawling.
First, PHP. Advantages: there are plenty of ready-made frameworks and tools for fetching and parsing HTML, which saves effort. Disadvantages: speed and efficiency are a problem. Once, while downloading movie posters from a crontab job with no optimization, so many PHP processes were spawned that memory was exhausted. The syntax also feels long-winded: too many keywords and symbols, not concise, giving the impression of a language that was not carefully designed. It is tedious to write.
Next, Node.js. Its advantage is efficiency, efficiency, efficiency: because network I/O is asynchronous, a single process is roughly as powerful as hundreds of concurrent processes, while memory and CPU usage stay very small. If there is no complex computation on the fetched data, the system's bottleneck is basically bandwidth and the I/O speed of writing to the MySQL database. The flip side of this advantage is a disadvantage: asynchronous networking means callbacks. If the business logic is linear, for example you must wait for the previous page to finish crawling and yield its data before you can crawl the next page, or there are even several layers of dependency, you end up with dreadful multi-layer callbacks, and the code structure and logic become a mess. Of course, you can use flow-control tools such as Step to solve these problems.
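The multi-layer callback problem mentioned above can be sketched as follows. fetchPage is a hypothetical stand-in for an http.request wrapper; because each request depends on the previous one finishing, the callbacks nest one level per page.

```javascript
// Hypothetical sketch of linear, dependent requests written with plain
// callbacks: page 2 cannot start until page 1 has finished, and so on,
// so the nesting grows one level per page.
var order = [];

function fetchPage(url, callback) {
    // stand-in for http.request plus 'data'/'end' accumulation
    setImmediate(function () {
        order.push(url);
        callback(null, 'body of ' + url);
    });
}

fetchPage('page1', function (err, body1) {
    fetchPage('page2', function (err, body2) {
        fetchPage('page3', function (err, body3) {
            console.log(order.join(' -> ')); // prints page1 -> page2 -> page3
        });
    });
});
```

With three pages this is already awkward; with multi-layer dependencies it becomes the "terrible multi-layer callback" described above, which is exactly what flow-control libraries flatten out.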
Finally, Python. If you have no extreme efficiency requirements, I recommend Python! First, Python's syntax is very concise: the same statement can save you many times the keystrokes. Second, Python is well suited to data processing; for example, packing and unpacking function arguments, list comprehensions, and matrix handling are all very convenient.