NodeJS: write a crawler and put articles on your Kindle for reading

Source: Internet
Author: User

The so-called crawler can be understood simply as a program that operates on files, except that the files are not local and have to be pulled from a remote server.

I. Crawler code parsing

1. Get the source code of the target page

Node provides many interfaces for fetching the source of a remote address. Let's take the AlloyTeam site as an example and crawl the information of the articles on its home page. Because AlloyTeam is served over http://, this article does not cover using https:// in Node.

```javascript
var http = require("http");

var url = "http://www.alloyteam.com/";
var data = "";

// create a request
var req = http.request(url, function (res) {
    // set the response encoding
    res.setEncoding("utf8");
    // the body is sent chunked, i.e. piece by piece,
    // so we concatenate the pieces into data
    res.on('data', function (chunk) {
        data += chunk;
    });
    // when the response completes, output the data
    res.on('end', function () {
        // dealData(data);
        console.log(data);
    });
});

// send the request
req.end();
```

The code above is only seven or eight lines, and the source of the AlloyTeam home page is ours; it really is that simple. If the site were https://, you would have to require the https module instead (a sketch of that variant follows at the end of this section).

2. Extract the target content with a regular expression

First, look at the content we want to capture: the title, the article link, and the abstract of each article. Since no third-party library is used, we cannot query the target content the way we would with a DOM, but writing a regular expression for it is quite simple:

```javascript
// body of dealData
var reg = /<ul\s+class="articlemenu">\s+<li>\s+<a[^>]*>.*?<\/a>\s+<a href="(.*?)"[^>]*>(.*?)<\/a>[\s\S]*?<div\s+class="text">([\s\S]*?)<\/div>/g;
var res = [];
var match;
while (match = reg.exec(data)) {
    res.push({
        "url": match[1],
        "title": match[2],
        "excerpt": match[3]
    });
}
```

The regular expression looks a bit obscure, but regular expressions are a fundamental part of programming; if you do not know them well, I recommend sorting them out first, so I will not elaborate here. One point deserves emphasis: if you call reg.exec(data) only once, you get only the first match, which is why the while loop is needed; because the expression carries the g flag, each call resumes matching from where the previous match ended. In fact, each call returns an object that includes an index property; for details, refer to the documentation on JavaScript regular expressions (a small standalone demo follows at the end of this section).

The returned data format (res) is:

```javascript
[{ "url": url, "title": title, "excerpt": excerpt }]
```

3. Data filtering

We have the content now, but all we need is plain text, so the remaining tags must be filtered out; excerpt still contains some of them:

```javascript
var excerpt = excerpt.replace(/(<.*?>)((.*?)(<.*?>))?/g, "$3");
```

A full article contains markup that should not simply be deleted, but what we have here is only the summary, so stripping its tags makes the text convenient to store (a one-line demo of this replace also follows at the end of this section). Then trim the length:

```javascript
excerpt = excerpt.slice(0, 120);
```

4. Store in a database (or a file)

I store the result in a file, in the format:

[title](url)
> excerpt

Haha, that is markdown syntax, which I am familiar with, and it reads clearly. Concatenate the data:

```javascript
var str = "";
for (var i = 0, len = data.length; i < len; i++) {
    str += "[" + data[i].title + "](" + data[i].url + ")\n>" +
        data[i].excerpt.replace(/\n\s*\n?/g, "\n>") + "\n";
}
```

Then write it to a file:

```javascript
fs.writeFile('index.md', str, function (err) {
    if (err) throw err;
    console.log('data saved~');
});
```

And with that, it succeeds; the process is really very simple.

[Screenshot: the generated index.md; in Linux, the font is really ugly!]
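A note on the https:// case mentioned in step 1: you would have to require the built-in https module, which mirrors the http API. A minimal sketch (the URL here is a made-up placeholder; the callback logic is unchanged from the http version):

```javascript
// Minimal sketch of the same request over TLS; Node's built-in
// https module mirrors the http API. The URL is a placeholder.
var https = require("https");

var url = "https://example.com/";
var data = "";

var req = https.request(url, function (res) {
    res.setEncoding("utf8");
    res.on('data', function (chunk) {
        data += chunk;
    });
    res.on('end', function () {
        console.log(data);
    });
});
req.end();
```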
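As for the exec behavior described in step 2, here is a small standalone demo: with the g flag, each call to exec picks up after the previous match (tracked by the lastIndex property on the regular expression), and the returned object carries the index of the match. The tags and links below are made up purely for illustration:

```javascript
// Demo: exec() with the g flag advances reg.lastIndex on each call.
var reg = /<a href="(.*?)">/g;
var html = '<a href="/one"></a><a href="/two"></a>';

var m1 = reg.exec(html);
console.log(m1[1], m1.index, reg.lastIndex); // /one 0 15
var m2 = reg.exec(html);
console.log(m2[1], m2.index, reg.lastIndex); // /two 19 34
console.log(reg.exec(html)); // null; lastIndex resets to 0
```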
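And the tag-stripping replace from step 3, checked on a made-up snippet: each match keeps only $3, the text captured between a pair of tags, so the markup disappears and the plain text stays:

```javascript
// Demo: the step 3 replace keeps only the text between tag pairs.
var html = "<p>Hello <strong>world</strong></p>";
console.log(html.replace(/(<.*?>)((.*?)(<.*?>))?/g, "$3")); // Hello world
```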
II. Source code and summary

If you are not familiar with regular expressions, the work above will not go very smoothly. Many developers provide tool libraries for Node that can be installed with npm, and some of these toolkits can parse the fetched data as a DOM, so regular expressions are not strictly required. I have heard of a library named node-jquery that looks pretty good; please search the internet for details, there should be plenty (a sketch with a comparable library follows after the source code).

The code above was written by hand, has no fault-tolerance mechanism, and crawls only the content of the home page. The idea is the same everywhere, though: after collecting the URLs, write a loop and the content of the other pages is yours as well (see the last sketch at the end of this section).

The source code is just a few dozen lines:

```javascript
var http = require("http");
var fs = require("fs");

var url = "http://www.alloyteam.com/";
var data = "";

var req = http.request(url, function (res) {
    res.setEncoding("utf8");
    res.on('data', function (chunk) {
        data += chunk;
    });
    res.on('end', function () {
        dealData(data);
    });
});
req.on('error', function (e) {
    throw e;
});
req.end();

console.log("downloading data...");

// extract url/title/excerpt from the page source
function dealData(data) {
    var reg = /<ul\s+class="articlemenu">\s+<li>\s+<a[^>]*>.*?<\/a>\s+<a href="(.*?)"[^>]*>(.*?)<\/a>[\s\S]*?<div\s+class="text">([\s\S]*?)<\/div>/g;
    var res = [];
    var match;
    while (match = reg.exec(data)) {
        res.push({
            "url": match[1],
            "title": match[2],
            "excerpt": match[3].replace(/(<.*?>)((.*?)(<.*?>))?/g, "$3").slice(0, 120)
        });
    }
    writeFile(res);
}

// write the results to index.md in markdown format
function writeFile(data) {
    var str = "";
    for (var i = 0, len = data.length; i < len; i++) {
        str += "[" + data[i].title + "](" + data[i].url + ")\n>" +
            data[i].excerpt.replace(/\n\s*\n?/g, "\n>") + "\n";
    }
    fs.writeFile('index.md', str, function (err) {
        if (err) throw err;
        console.log('data saved~');
    });
}
```

Run it in the Node environment: node spider.js
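For the DOM-style approach mentioned in the summary, here is a minimal sketch. I have not used node-jquery myself, so this uses cheerio, a comparable npm library with a jQuery-like API, and it assumes the same articlemenu markup the regular expression targets:

```javascript
// Sketch: the same extraction with a DOM-style parser instead of a
// regular expression. Requires: npm install cheerio
var cheerio = require("cheerio");

function dealDataWithDom(html) {
    var $ = cheerio.load(html);
    var res = [];
    // assumes each <li> under <ul class="articlemenu"> holds two
    // links (the second one is the article) and a div.text summary
    $("ul.articlemenu li").each(function () {
        var link = $(this).find("a").eq(1);
        res.push({
            url: link.attr("href"),
            title: link.text(),
            excerpt: $(this).find("div.text").text().slice(0, 120)
        });
    });
    return res;
}
```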
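Finally, the loop over multiple pages suggested above, as a rough sketch. The /page/N/ URL pattern is my assumption about the listing pages, not something confirmed by the article, so adjust it to the real site:

```javascript
// Sketch: crawl several listing pages in a loop and hand each page
// to the same extraction code. The /page/N/ pattern is an assumption.
var http = require("http");

function fetchPage(pageUrl, done) {
    var data = "";
    var req = http.request(pageUrl, function (res) {
        res.setEncoding("utf8");
        res.on('data', function (chunk) { data += chunk; });
        res.on('end', function () { done(data); });
    });
    req.on('error', function (e) { throw e; });
    req.end();
}

var pages = ["http://www.alloyteam.com/"];
for (var i = 2; i <= 3; i++) {
    pages.push("http://www.alloyteam.com/page/" + i + "/");
}
pages.forEach(function (pageUrl) {
    fetchPage(pageUrl, function (html) {
        // dealData(html); // reuse the extraction from spider.js
        console.log("downloaded: " + pageUrl);
    });
});
```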
