NodeJS: write a crawler and put articles on your Kindle for reading

Source: Internet
Author: User

The so-called crawler can be understood simply as a program that operates on files, except that the files are not local and have to be pulled from a remote server.

I. Crawler code parsing

1. Get the source code of the target page

Node provides many interfaces for fetching the source of a remote address. Let's take the AlloyTeam site as an example and crawl the information of the articles on its home page. Because AlloyTeam is served over http://, this article does not cover using https:// in Node.

```javascript
var http = require("http");

var url = "http://www.alloyteam.com/";
var data = "";

// create a request
var req = http.request(url, function (res) {
    // set the response encoding
    res.setEncoding("utf8");
    // the body is sent chunked, i.e. piece by piece,
    // so we concatenate the pieces into data
    res.on('data', function (chunk) {
        data += chunk;
    });
    // when the response completes, output the data
    res.on('end', function () {
        // dealData(data);
        console.log(data);
    });
});

// send the request
req.end();
```

The code above is only seven or eight lines, and the source of the AlloyTeam home page is ours; it really is that simple. If the site were https://, you would have to require the https module instead (a sketch of that variant follows at the end of this section).

2. Extract the target content with a regular expression

First, look at the content we want to capture: the title, the article link, and the abstract of each article. Since no third-party library is used, we cannot query the target content the way we would with a DOM, but writing a regular expression for it is quite simple:

```javascript
// body of dealData
var reg = /<ul\s+class="articlemenu">\s+<li>\s+<a[^>]*>.*?<\/a>\s+<a href="(.*?)"[^>]*>(.*?)<\/a>[\s\S]*?<div\s+class="text">([\s\S]*?)<\/div>/g;
var res = [];
var match;
while (match = reg.exec(data)) {
    res.push({
        "url": match[1],
        "title": match[2],
        "excerpt": match[3]
    });
}
```

The regular expression looks a bit obscure, but regular expressions are a fundamental part of programming; if you do not know them well, I recommend sorting them out first, so I will not elaborate here. One point deserves emphasis: if you call reg.exec(data) only once, you get only the first match, which is why the while loop is needed; because the expression carries the g flag, each call resumes matching from where the previous match ended. In fact, each call returns an object that includes an index property; for details, refer to the documentation on JavaScript regular expressions (a small standalone demo follows at the end of this section).

The returned data format (res) is:

```javascript
[{ "url": url, "title": title, "excerpt": excerpt }]
```

3. Data filtering

We have the content now, but all we need is plain text, so the remaining tags must be filtered out; excerpt still contains some of them:

```javascript
var excerpt = excerpt.replace(/(<.*?>)((.*?)(<.*?>))?/g, "$3");
```

A full article contains markup that should not simply be deleted, but what we have here is only the summary, so stripping its tags makes the text convenient to store (a one-line demo of this replace also follows at the end of this section). Then trim the length:

```javascript
excerpt = excerpt.slice(0, 120);
```

4. Store in a database (or a file)

I store the result in a file, in the format:

[title](url)
> excerpt

Haha, that is markdown syntax, which I am familiar with, and it reads clearly. Concatenate the data:

```javascript
var str = "";
for (var i = 0, len = data.length; i < len; i++) {
    str += "[" + data[i].title + "](" + data[i].url + ")\n>" +
        data[i].excerpt.replace(/\n\s*\n?/g, "\n>") + "\n";
}
```

Then write it to a file:

```javascript
fs.writeFile('index.md', str, function (err) {
    if (err) throw err;
    console.log('data saved~');
});
```

And with that, it succeeds; the process is really very simple.

[Screenshot: the generated index.md; in Linux, the font is really ugly!]
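A note on the https:// case mentioned in step 1: you would have to require the built-in https module, which mirrors the http API. A minimal sketch (the URL here is a made-up placeholder; the callback logic is unchanged from the http version):

```javascript
// Minimal sketch of the same request over TLS; Node's built-in
// https module mirrors the http API. The URL is a placeholder.
var https = require("https");

var url = "https://example.com/";
var data = "";

var req = https.request(url, function (res) {
    res.setEncoding("utf8");
    res.on('data', function (chunk) {
        data += chunk;
    });
    res.on('end', function () {
        console.log(data);
    });
});
req.end();
```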
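As for the exec behavior described in step 2, here is a small standalone demo: with the g flag, each call to exec picks up after the previous match (tracked by the lastIndex property on the regular expression), and the returned object carries the index of the match. The tags and links below are made up purely for illustration:

```javascript
// Demo: exec() with the g flag advances reg.lastIndex on each call.
var reg = /<a href="(.*?)">/g;
var html = '<a href="/one"></a><a href="/two"></a>';

var m1 = reg.exec(html);
console.log(m1[1], m1.index, reg.lastIndex); // /one 0 15
var m2 = reg.exec(html);
console.log(m2[1], m2.index, reg.lastIndex); // /two 19 34
console.log(reg.exec(html)); // null; lastIndex resets to 0
```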
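And the tag-stripping replace from step 3, checked on a made-up snippet: each match keeps only $3, the text captured between a pair of tags, so the markup disappears and the plain text stays:

```javascript
// Demo: the step 3 replace keeps only the text between tag pairs.
var html = "<p>Hello <strong>world</strong></p>";
console.log(html.replace(/(<.*?>)((.*?)(<.*?>))?/g, "$3")); // Hello world
```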
II. Source code and summary

If you are not familiar with regular expressions, the work above will not go very smoothly. Many developers provide tool libraries for Node that can be installed with npm, and some of these toolkits can parse the fetched data as a DOM, so regular expressions are not strictly required. I have heard of a library named node-jquery that looks pretty good; please search the internet for details, there should be plenty (a sketch with a comparable library follows after the source code).

The code above was written by hand, has no fault-tolerance mechanism, and crawls only the content of the home page. The idea is the same everywhere, though: after collecting the URLs, write a loop and the content of the other pages is yours as well (see the last sketch at the end of this section).

The source code is just a few dozen lines:

```javascript
var http = require("http");
var fs = require("fs");

var url = "http://www.alloyteam.com/";
var data = "";

var req = http.request(url, function (res) {
    res.setEncoding("utf8");
    res.on('data', function (chunk) {
        data += chunk;
    });
    res.on('end', function () {
        dealData(data);
    });
});
req.on('error', function (e) {
    throw e;
});
req.end();

console.log("downloading data...");

// extract url/title/excerpt from the page source
function dealData(data) {
    var reg = /<ul\s+class="articlemenu">\s+<li>\s+<a[^>]*>.*?<\/a>\s+<a href="(.*?)"[^>]*>(.*?)<\/a>[\s\S]*?<div\s+class="text">([\s\S]*?)<\/div>/g;
    var res = [];
    var match;
    while (match = reg.exec(data)) {
        res.push({
            "url": match[1],
            "title": match[2],
            "excerpt": match[3].replace(/(<.*?>)((.*?)(<.*?>))?/g, "$3").slice(0, 120)
        });
    }
    writeFile(res);
}

// write the results to index.md in markdown format
function writeFile(data) {
    var str = "";
    for (var i = 0, len = data.length; i < len; i++) {
        str += "[" + data[i].title + "](" + data[i].url + ")\n>" +
            data[i].excerpt.replace(/\n\s*\n?/g, "\n>") + "\n";
    }
    fs.writeFile('index.md', str, function (err) {
        if (err) throw err;
        console.log('data saved~');
    });
}
```

Run it in the Node environment: node spider.js
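For the DOM-style approach mentioned in the summary, here is a minimal sketch. I have not used node-jquery myself, so this uses cheerio, a comparable npm library with a jQuery-like API, and it assumes the same articlemenu markup the regular expression targets:

```javascript
// Sketch: the same extraction with a DOM-style parser instead of a
// regular expression. Requires: npm install cheerio
var cheerio = require("cheerio");

function dealDataWithDom(html) {
    var $ = cheerio.load(html);
    var res = [];
    // assumes each <li> under <ul class="articlemenu"> holds two
    // links (the second one is the article) and a div.text summary
    $("ul.articlemenu li").each(function () {
        var link = $(this).find("a").eq(1);
        res.push({
            url: link.attr("href"),
            title: link.text(),
            excerpt: $(this).find("div.text").text().slice(0, 120)
        });
    });
    return res;
}
```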
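Finally, the loop over multiple pages suggested above, as a rough sketch. The /page/N/ URL pattern is my assumption about the listing pages, not something confirmed by the article, so adjust it to the real site:

```javascript
// Sketch: crawl several listing pages in a loop and hand each page
// to the same extraction code. The /page/N/ pattern is an assumption.
var http = require("http");

function fetchPage(pageUrl, done) {
    var data = "";
    var req = http.request(pageUrl, function (res) {
        res.setEncoding("utf8");
        res.on('data', function (chunk) { data += chunk; });
        res.on('end', function () { done(data); });
    });
    req.on('error', function (e) { throw e; });
    req.end();
}

var pages = ["http://www.alloyteam.com/"];
for (var i = 2; i <= 3; i++) {
    pages.push("http://www.alloyteam.com/page/" + i + "/");
}
pages.forEach(function (pageUrl) {
    fetchPage(pageUrl, function (html) {
        // dealData(html); // reuse the extraction from spider.js
        console.log("downloaded: " + pageUrl);
    });
});
```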
