Contents
- Preface
- Example
- Sample requirements
- Collector
- Adding a proxy
- Requesting HTTPS
- Afterword
Preface
Many people need to do data collection, and it can be done in many languages and in many ways. I used to write collectors in C#. Sending all kinds of requests and parsing the data with regular expressions was cumbersome; nothing was really wrong with it, but it was inefficient.
Writing a collector in Node.js is relatively efficient (perhaps only relative to C#). Today I walk through an example of a data collector implemented in Node.js, mainly using request and cheerio.
request: for sending HTTP requests
https://github.com/request/request
cheerio: for extracting the required information from the HTML returned by request (its usage matches jQuery)
https://github.com/cheeriojs/cheerio
Example
Talking about API usage alone is pointless, and there is no need to memorize every API, so here is an example.
But first, a short digression:
There are many development tools for Node.js. I used to recommend Sublime, but since Microsoft released Visual Studio Code I have switched to it for Node.js development.
It is comfortable to develop with: no configuration needed, fast startup, auto-completion, go to definition and find references, fast search, and the familiar Visual Studio style. It should keep getting better, so I recommend it ^_^!
Sample requirements
Grab the "title", "address", "publish time", and "cover image" of each article from http://36kr.com/
Collector
1. Create a project folder Sampledau
2. Create a package.json file
{
    "name": "wilson_sampledau",
    "version": "0.0.1",
    "private": false,
    "dependencies": {
        "request": "*",
        "cheerio": "*"
    }
}
3. Install the dependencies from the terminal with npm
npm install
4. Create app.js and write the collector code
First open the URL you want to collect in a browser, use the developer tools to inspect the HTML structure, and then write the parsing code according to that structure.
/*
 * Function: data collection
 * Created by: wilson
 * Date: 2015-07-29
 */
var request = require('request'),
    cheerio = require('cheerio'),
    URL_36KR = 'http://36kr.com/'; // 36Kr

/* Start the data collector */
function dataCollectorStartup() {
    dataRequest(URL_36KR);
}

/* Data request */
function dataRequest(dataUrl) {
    request({
        url: dataUrl,
        method: 'GET'
    }, function (err, res, body) {
        if (err) {
            console.log(dataUrl);
            console.error('[ERROR]Collection' + err);
            return;
        }
        // Pick the parser that matches the requested URL
        switch (dataUrl) {
            case URL_36KR:
                dataParse36Kr(body);
                break;
        }
    });
}

/* Parse the 36Kr data */
function dataParse36Kr(body) {
    console.log('============================================================================================');
    console.log('======================================36kr==================================================');
    console.log('============================================================================================');

    var $ = cheerio.load(body);
    var articles = $('article');

    for (var i = 0; i < articles.length; i++) {
        var article = articles[i];
        var descDoms = $(article).find('.desc');
        // Skip articles that have no description block
        if (descDoms.length == 0) {
            continue;
        }

        var coverDom = $(article).children().first();
        var titleDom = $(descDoms).find('.info_flow_news_title');
        var timeDom = $(descDoms).find('.timeago');

        var titleVal = titleDom.text();
        var urlVal = titleDom.attr('href');
        var timeVal = timeDom.attr('title');
        var coverUrl = coverDom.attr('data-lazyload');

        // Convert the publish time to seconds
        var timeDateSecs = new Date(timeVal).getTime() / 1000;

        if (urlVal != undefined) {
            console.info('--------------------------------');
            console.info('Title: ' + titleVal);
            console.info('Address: ' + urlVal);
            console.info('Time: ' + timeDateSecs);
            console.info('Cover: ' + coverUrl);
            console.info('--------------------------------');
        }
    }
}

dataCollectorStartup();
Test results
That completes the collector. It simply sends a GET request; the body passed to the request callback is the HTML, which the cheerio library parses with jQuery-style syntax to pull out the desired data!
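The collector turns each article's time string into Unix seconds with new Date(timeVal).getTime() / 1000. Here is a minimal sketch of that step on its own, using only core JavaScript. The helper name toUnixSeconds and the null return for unparseable input are assumptions for illustration, not part of the original app.js, and the sample string is made up rather than taken from 36kr.

```javascript
// Hypothetical helper: convert a date string to whole Unix seconds.
// Returns null when the string cannot be parsed (getTime() yields NaN).
function toUnixSeconds(timeStr) {
    var ms = new Date(timeStr).getTime();
    if (isNaN(ms)) {
        return null;
    }
    return Math.floor(ms / 1000);
}

// An ISO 8601 string with an explicit UTC offset parses the same everywhere
console.log(toUnixSeconds('2015-07-29T00:00:00Z')); // 1438128000
console.log(toUnixSeconds('not a date'));           // null
```

Date parsing of non-ISO strings can differ between JavaScript engines, so it is worth checking what format the page actually puts in the title attribute before relying on the result.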
Adding a proxy
The collector demo above is basically done. For long-term use, to keep the site from blocking you, you still need to add a proxy list.
As an example, I took some free proxies from the Internet and put them into proxylist.js, which provides a function that returns a random proxy.
var PROXY_LIST = [
    {"ip": "111.1.55.136", "port": "55336"}, {"ip": "111.1.54.91", "port": "55336"}, {"ip": "111.1.56.19", "port": "55336"},
    {"ip": "112.114.63.16", "port": "55336"}, {"ip": "106.58.63.83", "port": "55336"}, {"ip": "119.188.133.54", "port": "55336"},
    {"ip": "106.58.63.84", "port": "55336"}, {"ip": "183.95.132.171", "port": "55336"}, {"ip": "11.12.14.9", "port": "55336"},
    {"ip": "60.164.223.16", "port": "55336"}, {"ip": "117.185.13.87", "port": "8080"}, {"ip": "112.114.63.20", "port": "55336"},
    {"ip": "188.134.19.102", "port": "3129"}, {"ip": "106.58.63.80", "port": "55336"}, {"ip": "60.164.223.20", "port": "55336"},
    {"ip": "106.58.63.78", "port": "55336"}, {"ip": "112.114.63.23", "port": "55336"}, {"ip": "112.114.63.30", "port": "55336"},
    {"ip": "60.164.223.14", "port": "55336"}, {"ip": "190.202.82.234", "port": "3128"}, {"ip": "60.164.223.15", "port": "55336"},
    {"ip": "60.164.223.5", "port": "55336"}, {"ip": "221.204.9.28", "port": "55336"}, {"ip": "60.164.223.2", "port": "55336"},
    {"ip": "139.214.113.84", "port": "55336"}, {"ip": "112.25.49.14", "port": "55336"}, {"ip": "221.204.9.19", "port": "55336"},
    {"ip": "221.204.9.39", "port": "55336"}, {"ip": "113.207.57.18", "port": "55336"}, {"ip": "112.25.62.15", "port": "55336"},
    {"ip": "60.5.255.143", "port": "55336"}, {"ip": "221.204.9.18", "port": "55336"}, {"ip": "60.5.255.145", "port": "55336"},
    {"ip": "221.204.9.16", "port": "55336"}, {"ip": "183.232.82.132", "port": "55336"}, {"ip": "113.207.62.78", "port": "55336"},
    {"ip": "60.5.255.144", "port": "55336"}, {"ip": "60.5.255.141", "port": "55336"}, {"ip": "221.204.9.23", "port": "55336"},
    {"ip": "157.122.96.50", "port": "55336"}, {"ip": "218.61.39.41", "port": "55336"}, {"ip": "221.204.9.26", "port": "55336"},
    {"ip": "112.112.43.213", "port": "55336"}, {"ip": "60.5.255.138", "port": "55336"}, {"ip": "60.5.255.133", "port": "55336"},
    {"ip": "221.204.9.25", "port": "55336"}, {"ip": "111.161.35.56", "port": "55336"}, {"ip": "111.161.35.49", "port": "55336"},
    {"ip": "183.129.134.226", "port": "8080"}, {"ip": "183.87.117.44", "port": "80"}, {"ip": "61.234.249.107", "port": "8118"},
    // the ports for 58.220.10.86 and 211.23.19.130 are missing, so both entries are left out
    {"ip": "200.20.168.140", "port": "80"}, {"ip": "111.1.46.176", "port": "55336"}, {"ip": "120.203.158.149", "port": "8118"},
    {"ip": "70.39.189.6", "port": "9090"}, {"ip": "210.6.237.191", "port": "3128"}, {"ip": "122.155.195.26", "port": "8080"}
];

// Return a randomly chosen proxy as a URL string
module.exports.GetProxy = function () {
    var randomNum = Math.floor(Math.random() * PROXY_LIST.length);
    var proxy = PROXY_LIST[randomNum];
    return 'http://' + proxy.ip + ':' + proxy.port;
};
proxylist.js
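To check the random pick without depending on luck, the random source can be passed in. The following is only a sketch under my own assumptions: pickProxyUrl and its random parameter are hypothetical names that are not in proxylist.js, and the three entries use reserved documentation addresses rather than real proxies.

```javascript
// Illustrative entries only (192.0.2.x is reserved for documentation)
var SAMPLE_PROXIES = [
    { ip: '192.0.2.10', port: '8080' },
    { ip: '192.0.2.11', port: '3128' },
    { ip: '192.0.2.12', port: '8118' }
];

// Hypothetical variant of GetProxy: `random` defaults to Math.random,
// but a fixed function can be injected to make the choice predictable.
function pickProxyUrl(list, random) {
    random = random || Math.random;
    // random() is in [0, 1), so the index is always in [0, list.length - 1]
    var index = Math.floor(random() * list.length);
    var proxy = list[index];
    return 'http://' + proxy.ip + ':' + proxy.port;
}

console.log(pickProxyUrl(SAMPLE_PROXIES));                            // one of the three URLs
console.log(pickProxyUrl(SAMPLE_PROXIES, function () { return 0; })); // http://192.0.2.10:8080
```

Math.floor already returns an integer, so no extra parseInt is needed around it.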
Make the following changes to the app.js code
/*
 * Function: data collection
 * Created by: wilson
 * Date: 2015-07-29
 */
var request = require('request'),
    cheerio = require('cheerio'),
    URL_36KR = 'http://36kr.com/', // 36Kr
    Proxy = require('./proxylist.js');

...

/* Data request */
function dataRequest(dataUrl) {
    request({
        url: dataUrl,
        proxy: Proxy.GetProxy(), // use a random proxy for each request
        method: 'GET'
    }, function (err, res, body) {
        ...
    });
}

...

dataCollectorStartup();
setInterval(dataCollectorStartup, 10000);
That completes the change: the proxy is added to the request, and setInterval reruns the collector every 10 seconds!
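The pattern above runs the collector once right away and then again on a timer that never stops. As a hedged sketch of the same idea with a stop condition, here is a small helper built on setInterval and clearInterval. The name startPolling and the maxRuns limit are my own assumptions for illustration; the original app.js simply calls setInterval directly.

```javascript
// Hypothetical helper: run `task` once immediately, then every
// `intervalMs` milliseconds, and stop after `maxRuns` runs in total.
// Returns a function that cancels the timer early.
function startPolling(task, intervalMs, maxRuns) {
    var runs = 0;
    var timer = null;

    function tick() {
        runs++;
        task(runs);
        // Clear the timer once the run limit is reached so Node can exit
        if (runs >= maxRuns && timer !== null) {
            clearInterval(timer);
        }
    }

    tick();
    if (runs < maxRuns) {
        timer = setInterval(tick, intervalMs);
    }

    return function stop() {
        if (timer !== null) {
            clearInterval(timer);
        }
    };
}

// Usage: three runs, 10 ms apart
startPolling(function (n) {
    console.log('collector run #' + n);
}, 10, 3);
```

In the collector, task would be dataCollectorStartup and intervalMs would be 10000.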
Requesting HTTPS
The example above collects over HTTP. What if you switch to HTTPS?
Create a new app2.js with the following code
/*
 * Function: request HTTPS
 * Created by: wilson
 * Date: 2015-07-29
 */
var request = require('request'),
    URL_INTERFACELIFE = 'https://interfacelift.com/wallpaper/downloads/date/wide_16:10/';

/* Start the data collector */
function dataCollectorStartup() {
    dataRequest(URL_INTERFACELIFE);
}

/* Data request */
function dataRequest(dataUrl) {
    request({
        url: dataUrl,
        method: 'GET'
    }, function (err, res, body) {
        if (err) {
            console.log(dataUrl);
            console.error('[ERROR]Collection' + err);
            return;
        }
        console.info(body);
    });
}

dataCollectorStartup();
Run it and you will find that the returned body is empty ^_^!
Add some code and try again
/*
 * Function: request HTTPS
 * Created by: wilson
 * Date: 2015-07-29
 */
var request = require('request'),
    URL_INTERFACELIFE = 'https://interfacelift.com/wallpaper/downloads/date/wide_16:10/';

/* Start the data collector */
...

/* Data request */
function dataRequest(dataUrl) {
    request({
        url: dataUrl,
        method: 'GET',
        headers: {
            'User-Agent': 'wilson' // send a custom User-Agent header
        }
    }, function (err, res, body) {
        if (err) {
            console.log(dataUrl);
            console.error('[ERROR]Collection' + err);
            return;
        }
        console.info(body);
    });
}

...
Run it again and the body now contains the requested HTML! (I won't post the output; try it yourself!)
For details, see: https://github.com/request/request#custom-http-headers
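If several collectors need the same header, it can be merged into the options object in one place before calling request. This is only a sketch: withUserAgent is a hypothetical helper name of mine, and the snippet just builds and prints the options, so it runs without the request library installed.

```javascript
// Hypothetical helper: return a copy of `options` with a User-Agent
// header added, leaving the original object and other headers intact.
function withUserAgent(options, userAgent) {
    var headers = Object.assign({}, options.headers, { 'User-Agent': userAgent });
    return Object.assign({}, options, { headers: headers });
}

var baseOptions = {
    url: 'https://interfacelift.com/wallpaper/downloads/date/wide_16:10/',
    method: 'GET'
};
var options = withUserAgent(baseOptions, 'wilson');

console.log(options.headers); // { 'User-Agent': 'wilson' }

// With the request library installed, the object is passed on as usual:
// request(options, function (err, res, body) { ... });
```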
Afterword
It has been almost half a year since my last post ^_^! I plan to write several hands-on posts soon that skip API theory and only walk through examples!
For the request library I still recommend reading more of the API; the forms part, for example, is what I use most when testing in real projects!
For example, interface testing:
1. Submit two parameters (parameter one: string, parameter two: number)
var request = require('request');

request.post({
    url: 'interface url',
    form: {
        'parameter one name': 'parameter one value',
        'parameter two name': parameterTwoValue
    }
}, function (err, res, body) {
    if (err) {
        return;
    }
    console.log(body);
});
body is the interface's response
2. Submit a string parameter and a file parameter (such as uploading an avatar)
var fs = require('fs');
var request = require('request');

var r = request.post('interface url', function (err, res, body) {
    if (err) {
        return;
    }
    console.log(body);
});
// Build a multipart form on the pending request
var form = r.form();
form.append('parameter one name', 'parameter one value');
form.append('parameter two name', fs.createReadStream('1.jpg'), { filename: '1.jpg' });
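For the first case above, the form option sends the fields as a URL-encoded body (application/x-www-form-urlencoded), so the number in parameter two reaches the interface as text. The sketch below uses Node's core querystring module to show roughly what such a body looks like. It is an approximation for illustration, not the request library's exact encoder, and encodeForm and the field names are made up.

```javascript
var querystring = require('querystring');

// Hypothetical helper: URL-encode a flat object of form fields
function encodeForm(fields) {
    return querystring.stringify(fields);
}

// One string parameter and one number parameter
console.log(encodeForm({ name: 'wilson', age: 30 })); // name=wilson&age=30

// Characters such as spaces are percent-encoded
console.log(encodeForm({ q: 'a b' }));                // q=a%20b
```

Because everything arrives as text, an interface that expects a number has to convert the value on the server side.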
There is really not much to say about the cheerio library: if you know jQuery you are set, and you barely need to read its API!
Node.js Study Notes (11): Data Collector Example (request and cheerio)