Study Notes
Objective
I recently built a data crawler. My first version was a C# console application that filtered the data with regular expressions. It worked, but every run depended on the .NET Framework, which I found cumbersome, so I looked for another approach. I prefer JavaScript anyway, so I decided to try server-side JavaScript.
Environment and tool preparation
1. [Required] Install Node.js (download the latest version).
2. [Optional] Install iisnode and the URL Rewrite module. I use IIS as the server, so I need these two IIS extensions. If you only run node from a CMD console, skip this step.
3. Note: once the installation succeeds, a URL Rewrite entry appears in IIS.
Example implementation
I'll use a small example that scrapes test data from the following site: http://www.jj59.com/
1. Scrape a list page: http://www.jj59.com/jingpinwenzhang/list_68_3.html
2. Scrape a detail page: http://www.jj59.com/jingpinwenzhang/082919.html
The original website is as follows:
What I want to do next:
1. From the given list page, extract each article's title and its link, and return them as a JSON array.
2. From the detail page, extract the article's title, author, creation time, and content, and return them as a JSON object.
Node.js code and results
    var http = require('http');
    var _url = require('url'); // URL module: handles URL and query-string parsing
    var cheerio = require('cheerio'); // manipulate the DOM on the server as on the client, without regular expressions
    var iconv = require('iconv-lite'); // character-encoding conversion module
    var BufferHelper = require('bufferhelper'); // buffer concatenation; explained further below

    /*
     * The goal: given a request such as http://www.mynode.com?link=www.abc.com&callback=cb,
     * return JSON, or JSONP when a callback name is supplied.
     */

    http.createServer(function (req, res) {
        var arg = _url.parse(req.url, true).query; // query-string parameters, parsed by the url module
        var link = arg.link; // the link to crawl
        var callback = arg.callback; // name of the callback function
        // If link has no "http" prefix, add it
        var protocol = "http";
        if (link.indexOf("http") < 0) {
            link = protocol + "://" + link;
        }
        // Crawl the page
        download(link, function (data) {
            // download passes null on a network error; without this guard data.toString() would throw
            if (data === null) {
                res.writeHead(502);
                res.end();
                return;
            }
            res.writeHead(200, {
                "Content-Type": "text/html;charset=utf-8",
                "Transfer-Encoding": "chunked"
            });
            var doc = data.toString();
            var $ = cheerio.load(doc);
            var list = [];
            $(".e2 li .title").each(function (i, e) {
                var item = $(e).children("a").last();
                var title = item.text();
                var link = item.attr("href");
                list.push({ "title": title, "link": link });
            });
            var jsonText = JSON.stringify(list);
            if (callback) {
                res.write(callback + "(" + jsonText + ")");
            }
            else {
                res.write(jsonText);
            }
            res.end();
        });
    }).listen(process.env.PORT);

    // Load a third-party page
    function download(url, callback) {
        http.get(url, function (res) {
            var bufferHelper = new BufferHelper(); // solves the Chinese encoding problem
            res.on('data', function (chunk) {
                bufferHelper.concat(chunk);
            });
            res.on("end", function () {
                // Note: this encoding must match the encoding of the crawled page, or the text will be garbled; it can also be detected dynamically
                var val = iconv.decode(bufferHelper.toBuffer(), 'gb2312');
                callback(val);
            });
        }).on("error", function () {
            callback(null);
        });
    }
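The request handling in the server (read link and callback from the query string, add the protocol if it is missing, then answer with JSON or JSONP) can also be written as small pure functions that are easy to try in isolation. This is only a sketch: normalizeLink, parseRequest, and toResponseBody are hypothetical names that do not appear in the code above, and it uses Node's global URL class instead of url.parse.

```javascript
// Hypothetical helpers, not part of the original server code.

// Prefix the link with a protocol only when it does not already start with one.
function normalizeLink(link) {
    return /^https?:\/\//i.test(link) ? link : 'http://' + link;
}

// Read link and callback from a request path such as "/?link=www.abc.com&callback=cb".
function parseRequest(reqUrl) {
    // The base URL is a placeholder; it only lets URL parse a relative path.
    var query = new URL(reqUrl, 'http://localhost').searchParams;
    var link = query.get('link');
    return {
        link: link ? normalizeLink(link) : null, // null when no link was supplied
        callback: query.get('callback')          // null when no callback was supplied
    };
}

// Serialize the payload as JSON, or as JSONP when a callback name is supplied.
function toResponseBody(payload, callback) {
    var jsonText = JSON.stringify(payload);
    return callback ? callback + '(' + jsonText + ')' : jsonText;
}

var request = parseRequest('/?link=www.abc.com&callback=cb');
console.log(request.link);
console.log(toResponseBody([{ title: 't', link: 'l' }], request.callback));
```

Matching the protocol at the start of the string avoids a false positive when "http" only appears later in the link, and returning null for a missing link gives the caller a chance to reject the request before crawling.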
1. List page JSON list
Requested URL: http://myurl?link=http://www.jj59.com/jingpinwenzhang/list_68_3.html&callback=cb
cb([{"title":"Stream gold years scholarly companion line","link":"http://www.jj59.com/jingpinwenzhang/096929.html"},{"title":"Struggle to write No regrets youth","link":"http://www.jj59.com/jingdianmeiwen/088963.html"},{"title":"If he likes you, he will not be ambiguous, if he no longer contact you, do not find reason for him","link":"http://www.jj59.com/jingpinwenzhang/082919.html"},{"title":"The month under the wandering, you have no longer the appearance of the year","link":"http://www.jj59.com/jingdianmeiwen/082295.html"},{"title":"The words roughness of the farm words (4)","link":"http://www.jj59.com/jingpinwenzhang/080410.html"},{"title":"Hrs, hide sleeve A Smile","link":"http://www.jj59.com/jingdianmeiwen/078491.html"},{"title":"Soft, like right and wrong","link":"http://www.jj59.com/jingpinwenzhang/078002.html"},{"title":"Spring","link":"http://www.jj59.com/jingdianmeiwen/077439.html"},{"title":"When the \"beauty\" of life becomes a kind of regret","link":"http://www.jj59.com/jingdianmeiwen/074365.html"},{"title":"The origin of the Mid-Autumn Festival, food customs, Ancient poetry couplets","link":"http://www.jj59.com/jingpinwenzhang/043440.html"}])
2. Detail Page JSON object
Requested URL: http://myurl?link=http://www.jj59.com/jingpinwenzhang/082919.html&callback=cb
cb({"title":"If he likes you, he will not be ambiguous; if he no longer contacts you, do not find reason for him","date":"2011-05-12","auth":"The Midnight is not a shame","content":"1, if he passively reserved.\r\n\r\n\"Maybe he didn't want to ruin our friendship.\"\r\n\"Maybe he's shy.\"\r\n\"Maybe he just doesn't know how to contact me.\"\r\n\r\nGerg said that throughout the history of mankind,\r\nAny man will be close to you and do not care to ruin the \"friendship\",\r\n\r\nHe is not afraid to chase you because of shyness and inferiority,\r\n\r\nHe only \"afraid\" is that he is so \"indifferent\" to you,\r\nHe will not know how to contact you, cell phone, Email, IM, SNS, Twitter......\r\n\r\nHe can use his eyes, mouth, brain, network, Google to find you-unless he doesn't want to find you.\r\n\r\nMaybe someone advocates this is not the Stone Age,\r\ngirls to take the initiative to pursue the people it,\r\nBut believe that the person who really likes you will not let you struggle to find him-because he will come to his own initiative.\r\n\r\n2, if he promised you did not do, even if it is just a phone call.\r\n\r\n\"He was really busy so he forgot,\" at least he really apologized to me. He is very busy,\r\nis about to take office as President of the United States, one hours good hundreds of millions of business to talk about, busy fast crazy,\r\n\r\nA day can not spare time to call you, busy really crazy.\r\n\r\nIf you really like you will not forget, if you forget that he doesn't care about your disappointment.\r\n\"Busy\" is a love of weapons of mass destruction, is a synonym for \"bastard\",\r\n\r\nthe bastard is the person who used to perfunctory you.\r\n\r\n--(mentally sound) men know what \"priorities\" are, as for apologies?\r\nOh, there is no time to listen to his nonsense.\r\n\r\n3, if he is ambiguous.\r\n\r\n\"He's been hurt before\" \"He's in a mess now\"\r\n\"He just broke up/divorced\" \"He wants to take it slow\"\r\n\r\nHe doesn't want to take you into his circle because it's just two people. I just want to use you to kill the time \"I don't like you very much\".\r\n\r\n4, if he doesn't want to be too close to you.\r\n\r\nGerg said very directly,\r\n\r\n\"I am a man, if I like you, I will kiss you, would like to see you wear underwear and do not wear underwear appearance.\"\r\n\r\nAlthough I feel very awkward and cold, I think it's the truth,\r\nIf you like your inner and outer,\r\nYou want someone who likes you to say, \"I love I reckon Plato would not say that to his ladyship.\r\n\r\n5, if he betrays you.\r\n\r\n\"He drank too much\" \"That's just an occasional accident\" \"He's not careful\" ...\r\n\r\nGerg said right, betrayal has no excuse.\r\n\r\nHe can't say, \"Oh, I accidentally fell and fell to someone else's bed.\" It's not 20 pounds,\r\nand 175 pounds--the weight of your worthless boyfriend.\r\n\r\n6, if he's drunk enough to come to you.\r\n\r\nHe's drinking, or taking drugs (this should not be common in the country),\r\nand not willing to change for you, then leave, because the long-term life needs to be sober.\r\n\r\n7, if the time is ripe, he still doesn't want to get married.\r\n\r\n\"Maybe I'm too open-minded\", \"he was in the shadow of his childhood family\" \"He's not ready\" ...\r\n\r\nMany men, women, psychologists, sociologists, anthropologists, feminists ...\r\nYou can talk about a critical marriage system,\r\n\r\nI'm sorry, first of all, you have to figure out \"don't want to get married\" may just mean \"don't want to Marry you\",\r\n\r\nThose who say \"Don't want to marry\" will eventually get married, but not with you.\r\n\r\n8, if he keeps breaking up with you, and then comes to you and good.\r\n\r\nFirst please maintain poise, do not call the message to him, if break up, that is to break up.\r\n\r\nYou don't think that as long as he comes back to you, you can continue to chat with him, meet and watch movies,\r\n\r\nThose who really like you will not want to break up with you, will not toss you,\r\n\r\nSo trouble you sober up, unless you want to become a yo Miss champion.\r\n\r\n9, if he suddenly disappears somehow.\r\n\r\nDon't spend a lot of effort to solve the mystery of the missing man,\r\nYou find all kinds of evidence and excuses to comfort yourself,\r\nThe only fact is that he no longer wants to be with you,\r\nand no guts to tell you clearly.\r\nPlease believe that there is no secret-he is not worthy of you.\r\n\r\n10, if he is married.\r\n\r\nyou have nothing to say, at least before he divorces.\r\n\r\nIf you still can't figure it out, you should probably call the police-someone has lost their brains.\r\n\r\nSometimes we would rather believe a man too afraid, too nervous, too low-self-esteem, too holy,\r\nToo love ex-girlfriend, too sensitive, too mother, too busy, too much shadow of childhood,\r\nFamily pressure is too big, too tired, too crazy, too dark, too suicidal tendencies ...\r\n\r\nBut do not want to see very simple fact,\r\n\r\nYes, he is not too busy, not injured,\r\n\r\nnot be president of the United States, not concussion. Is the phone fell into the hot pot, he did not have amnesia,\r\nHe is not dead-he just doesn't like you that much.\r\n\r\n"})
Summary
1. Scraping data with Node.js turned out to be a different experience: manipulating the DOM on the server is as convenient as in client-side JavaScript. This is thanks to the cheerio module, a fast, flexible implementation of core jQuery designed specifically for the server. Cheerio works on a simple DOM model, which makes parsing, manipulating, and rendering efficient. According to its benchmarks, cheerio is about 8 times faster than jsdom.
Using cheerio:
    // Way one: load the markup first, then query it
    var cheerio = require("cheerio");
    var $ = cheerio.load(doc);
    $("p").attr("id", "test");
    // Way two (older cheerio versions): pass the markup as the selector context
    var $ = require("cheerio");
    $("p", doc).attr("id", "test");
2. The other module is iconv-lite, which solves the encoding problem. It can be thought of as a standard character-set conversion interface for converting between different encodings. Note: Node.js's toString() method cannot solve the Chinese encoding problem, because it does not support encodings such as GBK or GB2312.
Official information: the encodings iconv-lite supports include the Node.js native encodings (UTF8, UCS2, ASCII, Binary, Base64); widely used single-byte encodings (Windows 125x family, ISO-8859 family, IBM/DOS codepages, Macintosh family, KOI8 family, Latin1, US-ASCII); and multi-byte encodings (GBK, GB2312, Big5, CP950).
Using iconv-lite:
    var iconv = require('iconv-lite');
    // Decode a GB2312-encoded buffer into a JavaScript string
    var val = iconv.decode(bufferHelper.toBuffer(), 'gb2312');
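The encoding passed to iconv.decode has to match the crawled page, and the code comment in download mentions that it could be identified dynamically. One possible way, shown here only as a sketch, is to look for a charset declaration in the Content-Type response header and, failing that, in the page's meta tag. detectCharset and its parameters are hypothetical names, and the fallback value is an assumption.

```javascript
// Hypothetical helper: pick an encoding name from the Content-Type header or,
// failing that, from a <meta> tag near the top of the page.
function detectCharset(contentType, htmlHead, fallback) {
    // Matches charset=gb2312, charset="gbk", charset='utf-8', and so on.
    var pattern = /charset\s*=\s*["']?([\w-]+)/i;
    var match = pattern.exec(contentType || '') || pattern.exec(htmlHead || '');
    return match ? match[1].toLowerCase() : (fallback || 'utf-8');
}

console.log(detectCharset('text/html; charset=GB2312', ''));
console.log(detectCharset('text/html', '<meta http-equiv="Content-Type" content="text/html; charset=gbk">'));
```

Because a charset declaration is plain ASCII, the start of the raw buffer can be read safely with buffer.toString('latin1', 0, 1024) before the real decoding step, and the detected name can then be passed to iconv.decode in place of the hard-coded 'gb2312'.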
3. Take care when processing characters, especially in GB2312 or GBK. In GBK, for example, an English character occupies 1 byte while a Chinese character occupies 2 bytes. When the data event fires, the chunk parameter of the anonymous handler is actually a Buffer object, as in the following code:
    res.on('data', function (chunk) {
        bufferHelper.concat(chunk);
    });
If you replace this with result += chunk, chunk is implicitly converted with toString(), so in the end neither of the following approaches decodes the text correctly:
1. var iconv = new Iconv('GBK', 'UTF-8'); iconv.
2. iconv.decode(result, 'gb2312');
Cause: += performs string addition on the Buffer object. A chunk boundary can fall in the middle of a multi-byte character, so the character is truncated and decoding fails.
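The truncation can be reproduced without any network access. The sketch below splits a two-character Chinese string in the middle of a character and compares string addition with byte-level concatenation. It uses UTF-8 rather than GBK, because Node's Buffer can decode UTF-8 on its own; that substitution is an assumption made for the demonstration, and the same thing happens with 2-byte GBK characters.

```javascript
// '中文' is 6 bytes in UTF-8 (3 bytes per character).
var full = Buffer.from('中文', 'utf8');

// Simulate two network chunks whose boundary falls inside the second character.
var chunk1 = full.subarray(0, 4);
var chunk2 = full.subarray(4);

// String addition: each chunk is implicitly converted with toString(),
// so the split character is decoded in two broken halves.
var broken = '';
broken += chunk1;
broken += chunk2;

// Byte-level concatenation: decode once, after all bytes have arrived.
var joined = Buffer.concat([chunk1, chunk2]).toString('utf8');

console.log(broken === '中文'); // false
console.log(joined === '中文'); // true
```

The broken string contains U+FFFD replacement characters where the split character used to be, and no later decoding step can recover the original bytes from it.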
Using BufferHelper:
    var BufferHelper = require('bufferhelper');
    var bufferHelper = new BufferHelper();
    bufferHelper.concat(chunk);
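For readers who prefer not to add a dependency, roughly the same effect can be had with Node's built-in Buffer.concat. The sketch below is a hypothetical stand-in, not the bufferhelper implementation: ChunkCollector is an invented name, and only the concat and toBuffer methods used in this article are mirrored.

```javascript
// Hypothetical stand-in for bufferhelper, built only on Node's Buffer.
function ChunkCollector() {
    this.chunks = [];
}

// Store the raw chunk; no decoding happens here.
ChunkCollector.prototype.concat = function (chunk) {
    this.chunks.push(chunk);
    return this;
};

// Join all stored chunks into a single Buffer.
ChunkCollector.prototype.toBuffer = function () {
    return Buffer.concat(this.chunks);
};

// Usage: feed chunks as they arrive, decode once at the end.
var collector = new ChunkCollector();
var whole = Buffer.from('数据抓取', 'utf8');
collector.concat(whole.subarray(0, 5)).concat(whole.subarray(5));
var text = collector.toBuffer().toString('utf8');
console.log(text === '数据抓取'); // true
```

In the download function, the Buffer returned by toBuffer() would be handed to iconv.decode exactly as before.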
That's all. Time to sleep.
Using Node.js to scrape data