Node.js Study Notes (11): Data Collector Example (request and cheerio)


Directory
    • Foreword
    • Example
      • Sample requirements
      • Collector
    • Adding a proxy
    • Request HTTPS
    • Afterword
Foreword

Many people need to do data collection, and it can be done in many languages and in many ways. I used to write collectors in C#, where sending all kinds of requests and parsing the data with regular expressions was rather tedious. Overall it worked fine, but development was inefficient.

Writing a collector in Node.js is comparatively efficient (though perhaps only compared with C#). Today I'll use an example to implement a data collector in Node.js, mainly using request and cheerio.

  request: for making HTTP requests

https://github.com/request/request

  cheerio: for extracting the required information from the HTML returned by request (its usage is consistent with jQuery)

https://github.com/cheeriojs/cheerio

Example

There's little point in describing API usage in isolation, and no need to memorize every API, so let's go straight to the example.

  But first, a short aside:

There are many development tools for Node.js. I used to recommend Sublime, but since Microsoft released Visual Studio Code I have switched to it for Node.js development.

It is quite comfortable to develop with: no configuration needed, fast startup, auto-completion, go to definition and find references, fast search, and so on. It keeps the familiar VS style and should keep getting better, so I recommend it ^_^!

  Sample requirements

Grab the "title", "address", "publish time", and "cover picture" of each article from http://36kr.com/

  Collector

1. Create a project folder named Sampledau

2. Create a package.json file

    {
      "name": "wilson_sampledau",
      "version": "0.0.1",
      "private": false,
      "dependencies": {
        "request": "*",
        "cheerio": "*"
      }
    }

3. Install the dependencies in the terminal with npm

    npm install

4. Create app.js and write the collector code

First, open the URL you want to scrape in a browser and use the developer tools to inspect the HTML structure. Then write the parsing code according to that structure.

    /*
     * Function: data collection
     * Created by: wilson
     * Date: 2015-07-29
     */
    var request = require('request'),
        cheerio = require('cheerio'),
        URL_36KR = 'http://36kr.com/'; // 36Kr

    /* Start the data collector */
    function dataCollectorStartup() {
        dataRequest(URL_36KR);
    }

    /* Data request */
    function dataRequest(dataUrl) {
        request({
            url: dataUrl,
            method: 'GET'
        }, function (err, res, body) {
            if (err) {
                console.log(dataUrl);
                console.error('[ERROR] Collection ' + err);
                return;
            }

            // Dispatch to the parser that matches the requested URL
            switch (dataUrl) {
                case URL_36KR:
                    dataParse36Kr(body);
                    break;
            }
        });
    }

    /* 36Kr data parsing */
    function dataParse36Kr(body) {
        console.log('============================================================================================');
        console.log('======================================36kr==================================================');
        console.log('============================================================================================');

        var $ = cheerio.load(body);
        var articles = $('article');

        for (var i = 0; i < articles.length; i++) {
            var article = articles[i];
            var descDoms = $(article).find('.desc');

            // Skip articles that have no description block
            if (descDoms.length == 0) {
                continue;
            }

            var coverDom = $(article).children().first();
            var titleDom = $(descDoms).find('.info_flow_news_title');
            var timeDom = $(descDoms).find('.timeago');

            var titleVal = titleDom.text();
            var urlVal = titleDom.attr('href');
            var timeVal = timeDom.attr('title');
            var coverUrl = coverDom.attr('data-lazyload');

            // Convert the publish time to Unix seconds
            var timeDateSecs = new Date(timeVal).getTime() / 1000;

            if (urlVal != undefined) {
                console.info('--------------------------------');
                console.info('Title: ' + titleVal);
                console.info('Address: ' + urlVal);
                console.info('Time: ' + timeDateSecs);
                console.info('Cover: ' + coverUrl);
                console.info('--------------------------------');
            }
        }
    }

    dataCollectorStartup();
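The time handling in the collector turns the `title` attribute of the `.timeago` element into Unix seconds. Below is a minimal sketch of that step in isolation, assuming the attribute holds a date string that `Date` can parse. The helper name `toUnixSeconds` and the sample strings are made up for illustration and are not part of the original code.

```javascript
// Hypothetical helper: convert a date string to Unix seconds.
// Returns null when the string cannot be parsed, instead of NaN.
function toUnixSeconds(dateString) {
    var millis = new Date(dateString).getTime();
    if (isNaN(millis)) {
        return null;
    }
    return Math.floor(millis / 1000);
}

console.log(toUnixSeconds('2015-07-29T00:00:00Z')); // 1438128000
console.log(toUnixSeconds('not a date'));           // null
```

Returning null makes a missing or malformed time easy to detect before it is printed or stored.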

Test results

Running app.js prints the title, address, time, and cover URL of each article to the console.

That completes the collector. In short, it sends a GET request, the request callback receives the HTML in body, and the cheerio library parses that HTML with jQuery-style syntax to pull out the desired data!

     

Adding a proxy

The collector demo above is basically complete. If it has to run long-term, you should also add a proxy list so that the site does not block you.

For this example I took some free proxies from the Internet and put them into proxylist.js, which provides a function that returns one proxy at random.

    var PROXY_LIST = [
        {"ip": "111.1.55.136", "port": "55336"}, {"ip": "111.1.54.91", "port": "55336"}, {"ip": "111.1.56.19", "port": "55336"},
        {"ip": "112.114.63.16", "port": "55336"}, {"ip": "106.58.63.83", "port": "55336"}, {"ip": "119.188.133.54", "port": "55336"},
        {"ip": "106.58.63.84", "port": "55336"}, {"ip": "183.95.132.171", "port": "55336"}, {"ip": "11.12.14.9", "port": "55336"},
        {"ip": "60.164.223.16", "port": "55336"}, {"ip": "117.185.13.87", "port": "8080"}, {"ip": "112.114.63.20", "port": "55336"},
        {"ip": "188.134.19.102", "port": "3129"}, {"ip": "106.58.63.80", "port": "55336"}, {"ip": "60.164.223.20", "port": "55336"},
        {"ip": "106.58.63.78", "port": "55336"}, {"ip": "112.114.63.23", "port": "55336"}, {"ip": "112.114.63.30", "port": "55336"},
        {"ip": "60.164.223.14", "port": "55336"}, {"ip": "190.202.82.234", "port": "3128"}, {"ip": "60.164.223.15", "port": "55336"},
        {"ip": "60.164.223.5", "port": "55336"}, {"ip": "221.204.9.28", "port": "55336"}, {"ip": "60.164.223.2", "port": "55336"},
        {"ip": "139.214.113.84", "port": "55336"}, {"ip": "112.25.49.14", "port": "55336"}, {"ip": "221.204.9.19", "port": "55336"},
        {"ip": "221.204.9.39", "port": "55336"}, {"ip": "113.207.57.18", "port": "55336"}, {"ip": "112.25.62.15", "port": "55336"},
        {"ip": "60.5.255.143", "port": "55336"}, {"ip": "221.204.9.18", "port": "55336"}, {"ip": "60.5.255.145", "port": "55336"},
        {"ip": "221.204.9.16", "port": "55336"}, {"ip": "183.232.82.132", "port": "55336"}, {"ip": "113.207.62.78", "port": "55336"},
        {"ip": "60.5.255.144", "port": "55336"}, {"ip": "60.5.255.141", "port": "55336"}, {"ip": "221.204.9.23", "port": "55336"},
        {"ip": "157.122.96.50", "port": "55336"}, {"ip": "218.61.39.41", "port": "55336"}, {"ip": "221.204.9.26", "port": "55336"},
        {"ip": "112.112.43.213", "port": "55336"}, {"ip": "60.5.255.138", "port": "55336"}, {"ip": "60.5.255.133", "port": "55336"},
        {"ip": "221.204.9.25", "port": "55336"}, {"ip": "111.161.35.56", "port": "55336"}, {"ip": "111.161.35.49", "port": "55336"},
        {"ip": "183.129.134.226", "port": "8080"}, {"ip": "183.87.117.44", "port": "80"}, {"ip": "61.234.249.107", "port": "8118"},
        {"ip": "200.20.168.140", "port": "80"}, {"ip": "111.1.46.176", "port": "55336"}, {"ip": "120.203.158.149", "port": "8118"},
        {"ip": "70.39.189.6", "port": "9090"}, {"ip": "210.6.237.191", "port": "3128"}, {"ip": "122.155.195.26", "port": "8080"}
        // The ports for 58.220.10.86 and 211.23.19.130 are illegible in the source, so those two entries are left out.
    ];

    /* Return a randomly chosen proxy as a URL string */
    module.exports.GetProxy = function () {
        var randomNum = Math.floor(Math.random() * PROXY_LIST.length);
        var proxy = PROXY_LIST[randomNum];
        return 'http://' + proxy.ip + ':' + proxy.port;
    };

proxylist.js
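The random selection can be checked in isolation. The sketch below uses a shortened list (three entries taken from the list above) and a variant of the function that accepts the random source as a parameter. That parameter is an addition for illustration only, so that the first and last index can be exercised deterministically.

```javascript
// Shortened list; the entries are taken from the full list above.
var SAMPLE_PROXIES = [
    { ip: '111.1.55.136', port: '55336' },
    { ip: '117.185.13.87', port: '8080' },
    { ip: '188.134.19.102', port: '3129' }
];

// Pick one proxy and format it as a URL string. `random` is a
// hypothetical, injectable replacement for Math.random.
function pickProxy(list, random) {
    random = random || Math.random;
    // Math.random() is in [0, 1), so the index is in [0, list.length - 1]
    var index = Math.floor(random() * list.length);
    var proxy = list[index];
    return 'http://' + proxy.ip + ':' + proxy.port;
}

console.log(pickProxy(SAMPLE_PROXIES, function () { return 0; }));    // http://111.1.55.136:55336
console.log(pickProxy(SAMPLE_PROXIES, function () { return 0.99; })); // http://188.134.19.102:3129
console.log(pickProxy(SAMPLE_PROXIES));                               // one of the three, at random
```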

Make the following modifications to the app.js code

    /*
     * Function: data collection
     * Created by: wilson
     * Date: 2015-07-29
     */
    var request = require('request'),
        cheerio = require('cheerio'),
        URL_36KR = 'http://36kr.com/', // 36Kr
        Proxy = require('./proxylist.js');

    ...

    /* Data request */
    function dataRequest(dataUrl) {
        request({
            url: dataUrl,
            proxy: Proxy.GetProxy(), // route each request through a random proxy
            method: 'GET'
        }, function (err, res, body) {
            ...
        });
    }

    ...

    dataCollectorStartup();
    setInterval(dataCollectorStartup, 10000); // run again every 10 seconds

That completes the change: the proxy code is added, and setInterval makes the collector run at a fixed interval!
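One thing to keep in mind is that setInterval fires on schedule whether or not the previous request has finished, which can matter when proxies are slow. The following is a sketch of one possible guard, not part of the original code: `makeGuardedTask` is a hypothetical helper, and the task here is a stand-in that only counts its runs.

```javascript
// Wrap a task so that a new run is skipped while the previous one
// has not yet signalled completion through its `done` callback.
function makeGuardedTask(task) {
    var running = false;
    return function guarded() {
        if (running) {
            return false; // previous run still in progress, skip this tick
        }
        running = true;
        task(function done() {
            running = false;
        });
        return true;
    };
}

// Stand-in for the collector: it records that it ran and keeps the
// `done` callback so that completion can be triggered by hand.
var runs = 0;
var finish = null;
var guardedCollect = makeGuardedTask(function (done) {
    runs++;
    finish = done;
});

guardedCollect();  // starts run 1
guardedCollect();  // skipped, run 1 has not finished
finish();          // run 1 completes
guardedCollect();  // starts run 2
console.log(runs); // 2
```

With the real collector, the wrapped function would be the one passed to setInterval, and `done` would be called at the end of the request callback.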

Request HTTPS

The example above collects data over HTTP. What if we switch to HTTPS?

Create a new app2.js with the following code

    /*
     * Function: request HTTPS
     * Created by: wilson
     * Date: 2015-07-29
     */
    var request = require('request'),
        URL_INTERFACELIFE = 'https://interfacelift.com/wallpaper/downloads/date/wide_16:10/';

    /* Start the data collector */
    function dataCollectorStartup() {
        dataRequest(URL_INTERFACELIFE);
    }

    /* Data request */
    function dataRequest(dataUrl) {
        request({
            url: dataUrl,
            method: 'GET'
        }, function (err, res, body) {
            if (err) {
                console.log(dataUrl);
                console.error('[ERROR] Collection ' + err);
                return;
            }

            console.info(body);
        });
    }

    dataCollectorStartup();

Run it and you will find that the returned body is empty ^_^!

Add a little code and try again

    /*
     * Function: request HTTPS
     * Created by: wilson
     * Date: 2015-07-29
     */
    var request = require('request'),
        URL_INTERFACELIFE = 'https://interfacelift.com/wallpaper/downloads/date/wide_16:10/';

    /* Start the data collector */
    ...

    /* Data request */
    function dataRequest(dataUrl) {
        request({
            url: dataUrl,
            method: 'GET',
            headers: {
                'User-Agent': 'wilson' // custom request header
            }
        }, function (err, res, body) {
            if (err) {
                console.log(dataUrl);
                console.error('[ERROR] Collection ' + err);
                return;
            }

            console.info(body);
        });
    }

    ...

Run it again and you will find that body now contains the requested HTML! (I won't post the results here; try it yourself!)

For details, see: https://github.com/request/request#custom-http-headers
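If several collectors need the same header, the options object can be built in one place. This is only a sketch: `buildRequestOptions` is a hypothetical helper, and the User-Agent value is whatever string you choose to send.

```javascript
// Hypothetical helper: build the options object that is passed to
// request(), including a User-Agent header.
function buildRequestOptions(url, userAgent) {
    return {
        url: url,
        method: 'GET',
        headers: {
            'User-Agent': userAgent
        }
    };
}

var options = buildRequestOptions(
    'https://interfacelift.com/wallpaper/downloads/date/wide_16:10/',
    'wilson'
);
console.log(options.headers['User-Agent']); // wilson
```

The object would then be passed as the first argument, as in `request(options, callback)`.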

Afterword

It has been almost half a year since the last post ^_^! I plan to write several more hands-on posts soon that skip the theory behind the APIs and only walk through examples!

I still recommend reading more of the request library's API. For example, I use the forms part a lot for testing in real projects!

For example, for interface testing:

1. Submit two parameters (parameter 1: a string, parameter 2: a number)

    request.post({
        url: 'interface url',
        form: {
            'parameter one name': 'parameter one value',
            'parameter two name': 2 // parameter two value (a number)
        }
    }, function (err, res, body) {
        if (err) {
            return;
        }
        console.log(body);
    });

body is the interface's response
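With the `form` option, request sends the parameters as an application/x-www-form-urlencoded body. To see what such a body looks like, the sketch below encodes two placeholder parameters with Node's built-in URLSearchParams. request performs its own encoding internally, so this only makes the resulting format visible. The parameter names and values are made up.

```javascript
// Placeholder parameters: one string and one number.
var params = { name: 'wilson', age: 30 };

// URLSearchParams is built into Node.js and produces the
// key=value&key=value encoding used for URL-encoded form bodies.
var encoded = new URLSearchParams(params).toString();

console.log(encoded); // name=wilson&age=30
```

The number is converted to a string during encoding, so the interface receives both parameters as text.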

2. Submit one string parameter and one file parameter (for example, uploading an avatar)

    var fs = require('fs');

    var r = request.post('interface url', function (err, res, body) {
        if (err) {
            return;
        }
        console.log(body);
    });

    // Attach a multipart form to the pending request
    var form = r.form();
    form.append('parameter one name', 'parameter one value');
    form.append('parameter two name', fs.createReadStream('1.jpg'), { filename: '1.jpg' });

There is really not much to say about the cheerio library: if you know jQuery you are all set, and you hardly need to read its API!
