Using Node.js and PhantomJS to crawl page information and take site screenshots

PhantomJS on its own is fine for taking screenshots, but its API is limited and doing anything beyond that with it is laborious. For example, its built-in web server, Mongoose, can only handle 10 concurrent requests, so it is impractical to expect PhantomJS alone to act as a standalone service. We therefore need another language to provide the service layer, and here we choose Node.js.





Install PhantomJS

First, go to the PhantomJS website and download the build for your platform, or download the source code and compile it yourself. Then add PhantomJS to your PATH environment variable and enter

$ phantomjs

If the PhantomJS interactive shell starts, the installation works and you can proceed to the next step.

Using PhantomJS to take a simple screenshot

var webpage = require('webpage')
  , page = webpage.create();

page.viewportSize = { width: 1024, height: 800 };
page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
page.settings = {
  javascriptEnabled: false,
  loadImages: true,
  userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
};

page.open('http://www.baidu.com', function (status) {
  if (status === 'fail') {
    console.log('open page fail!');
  } else {
    page.render('./snapshot/test.png');
  }
  // Release the memory
  page.close();
});

Here we set the viewport size to 1024 * 800:

page.viewportSize = { width: 1024, height: 800 };

Clip a 1024 * 800 area starting from (0, 0):

page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };

Disable JavaScript, allow images to load, and change the userAgent to "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0":

page.settings = {
  javascriptEnabled: false,
  loadImages: true,
  userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
};

Then use page.open to open the page and render the final screenshot to ./snapshot/test.png:

page.render('./snapshot/test.png');

Node.js and PhantomJS communication

Let's take a look at the ways in which PhantomJS can communicate with Node.js.

Command-line arguments

For example:

phantomjs snapshot.js http://www.baidu.com

Command-line arguments can only be passed when PhantomJS starts; nothing more can be passed in while it is running.
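Inside the PhantomJS script, those arguments are read through the system module. A minimal sketch (the script name snapshot.js and the URL argument are only placeholders):

// snapshot.js -- read the url passed on the command line
var system = require('system');

if (system.args.length < 2) {
  // system.args[0] is always the script name itself
  console.log('Usage: phantomjs snapshot.js <url>');
  phantom.exit(1);
} else {
  var url = system.args[1];
  console.log('Will open: ' + url);
  phantom.exit();
}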

Standard output
Standard output can pass data from PhantomJS to Node.js, but it cannot pass data from Node.js back to PhantomJS.

However, in our tests standard output was the fastest of these methods, so it is worth considering when a large amount of data has to be transferred.
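On the Node.js side this simply means spawning PhantomJS as a child process and listening to its stdout, which is also what we do later in extract.js. A minimal sketch (snapshot.js is assumed to console.log its results):

// Node.js side: collect whatever the PhantomJS script writes to stdout
var spawn = require('child_process').spawn;

var child = spawn('phantomjs', ['snapshot.js', 'http://www.baidu.com']);

child.stdout.on('data', function (data) {
  // every console.log() in the PhantomJS script arrives here
  console.log('from phantomjs: ' + data);
});

child.on('close', function (code) {
  console.log('phantomjs exited with code ' + code);
});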

HTTP
PhantomJS sends an HTTP request to the Node.js service, and Node.js returns the corresponding data.

This approach is simple, but requests can only be initiated from the PhantomJS side.
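On the PhantomJS side an HTTP request is just another page.open call. A minimal sketch, assuming a Node.js service listening on a hypothetical /bridge endpoint on port 3000:

// PhantomJS side: POST data to the Node.js service and read the reply
var webpage = require('webpage');
var page = webpage.create();
var data = 'campaignId=123&status=1'; // hypothetical payload

page.open('http://localhost:3000/bridge', 'POST', data, function (status) {
  if (status === 'success') {
    // page.plainText holds the raw response body returned by Node.js
    console.log('node replied: ' + page.plainText);
  }
  page.close();
  phantom.exit();
});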

Websocket
It is worth noting that PhantomJS 1.9.0 supports WebSocket, though unfortunately only the hixie-76 draft. Still, it gives Node.js a way to initiate communication with PhantomJS.

In our tests, however, it took PhantomJS about 1 second just to connect to a local WebSocket service, so we will not consider this method for now.

phantomjs-node

phantomjs-node successfully wraps PhantomJS as a Node.js module, but let's look at the author's rationale:

I'll answer that question with a question: how do you communicate with a process that doesn't support shared memory, sockets, FIFOs, or standard input?

Well, there's one thing PhantomJS does support, and that's opening web pages. In fact, it's really good at opening web pages. So we communicate with PhantomJS by spinning up an instance of ExpressJS, opening Phantom in a child process, and pointing it at a special web page that turns socket.io messages into alert() calls. Those alert() calls are picked up by Phantom and there you go!

The communication itself happens via James Halliday's fantastic dnode library, which fortunately works well enough when combined with browserify to run straight out of PhantomJS's pidgin JavaScript environment.

In fact, phantomjs-node also uses HTTP or WebSocket under the hood, but it pulls in a lot of dependencies. Since we only want to do one simple thing, we will not consider it for now.

Design

Let's get started.
In the first version we use HTTP for the communication.

The first step is to use cluster to set up a simple process daemon (index.js):

module.exports = (function () {
  "use strict";

  var cluster = require('cluster')
    , fs = require('fs');

  if (!fs.existsSync('./snapshot')) {
    fs.mkdirSync('./snapshot');
  }

  if (cluster.isMaster) {
    cluster.fork();

    cluster.on('exit', function (worker) {
      console.log('Worker ' + worker.id + ' died :(');
      process.nextTick(function () {
        cluster.fork();
      });
    });
  } else {
    require('./extract.js');
  }
})();

Then we use connect to provide our external API (extract.js):

module.exports = (function () {
  "use strict";

  var connect = require('connect')
    , fs = require('fs')
    , spawn = require('child_process').spawn
    , jobMan = require('./lib/jobman.js')
    , bridge = require('./lib/bridge.js')
    , pkg = JSON.parse(fs.readFileSync('./package.json'));

  var app = connect()
    .use(connect.logger('dev'))
    .use('/snapshot', connect.static(__dirname + '/snapshot', { maxAge: pkg.maxAge }))
    .use(connect.bodyParser())
    .use('/bridge', bridge)
    .use('/api', function (req, res, next) {
      if (req.method !== "POST" || !req.body.campaignId) return next();
      if (!req.body.urls || !req.body.urls.length) return jobMan.watch(req.body.campaignId, req, res, next);

      var campaignId = req.body.campaignId
        , imagesPath = './snapshot/' + campaignId + '/'
        , urls = []
        , url
        , imagePath;

      function _deal(id, url, imagePath) {
        // Just push into the url list
        urls.push({
          id: id,
          url: url,
          imagePath: imagePath
        });
      }

      for (var i = req.body.urls.length; i--;) {
        url = req.body.urls[i];
        imagePath = imagesPath + i + '.png';
        _deal(i, url, imagePath);
      }

      jobMan.register(campaignId, urls, req, res, next);

      var snapshot = spawn('phantomjs', ['snapshot.js', campaignId]);
      snapshot.stdout.on('data', function (data) {
        console.log('stdout: ' + data);
      });
      snapshot.stderr.on('data', function (data) {
        console.log('stderr: ' + data);
      });
      snapshot.on('close', function (code) {
        console.log('snapshot exited with code ' + code);
      });
    })
    .use(connect.static(__dirname + '/html', { maxAge: pkg.maxAge }))
    .listen(pkg.port, function () { console.log('listen: http://localhost:' + pkg.port); });
})();

Here we reference two modules: bridge and jobMan.

bridge is the HTTP communication bridge and jobMan is the job manager. A campaignId identifies a job; we hand the job and the response over to jobMan to manage, and then start PhantomJS to do the processing.

The communication bridge is responsible for receiving and returning job information and handing it over to jobMan (bridge.js):

module.exports = (function () {
  "use strict";

  var jobMan = require('./jobman.js')
    , fs = require('fs')
    , pkg = JSON.parse(fs.readFileSync('./package.json'));

  return function (req, res, next) {
    if (req.headers.secret !== pkg.secret) return next();

    // The snapshot app can POST url information
    if (req.method === "POST") {
      var body = JSON.parse(JSON.stringify(req.body));
      jobMan.fire(body);
      res.end('');

    // The snapshot app can GET the urls it should extract
    } else {
      var urls = jobMan.getUrls(req.url.match(/campaignId=([^&]*)(\s|&|$)/)[1]);
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ urls: urls }));
    }
  };
})();

If the request method is POST, we assume PhantomJS is pushing job information to us; if it is GET, we assume it wants to fetch job information.

jobMan is responsible for managing the jobs and sending the current job information back to the client through the response (jobman.js):

module.exports = (function () {
  "use strict";

  var fs = require('fs')
    , fetch = require('./fetch.js')
    , _jobs = {};

  function _send(campaignId) {
    var job = _jobs[campaignId];

    if (!job) return;

    if (job.waiting) {
      job.waiting = false;
      clearTimeout(job.timeout);

      var finished = (job.urlsNum === job.finishNum)
        , data = {
            campaignId: campaignId,
            urls: job.urls,
            finished: finished
          };

      job.urls = [];

      var res = job.res;

      if (finished) {
        _jobs[campaignId] = null;
        delete _jobs[campaignId];
      }

      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(data));
    }
  }

  function register(campaignId, urls, req, res, next) {
    _jobs[campaignId] = {
      urlsNum: urls.length,
      finishNum: 0,
      urls: [],
      cacheUrls: urls,
      res: null,
      waiting: false,
      timeout: null
    };

    watch(campaignId, req, res, next);
  }

  function watch(campaignId, req, res, next) {
    _jobs[campaignId].res = res;
    // 20s timeout
    _jobs[campaignId].timeout = setTimeout(function () {
      _send(campaignId);
    }, 20000);
  }

  function fire(opts) {
    var campaignId = opts.campaignId
      , job = _jobs[campaignId]
      , fetchObj = fetch(opts.html);

    if (job) {
      if (+opts.status && fetchObj.title) {
        job.urls.push({
          id: opts.id,
          url: opts.url,
          image: opts.image,
          title: fetchObj.title,
          description: fetchObj.description,
          status: +opts.status
        });
      } else {
        job.urls.push({
          id: opts.id,
          url: opts.url,
          status: +opts.status
        });
      }

      if (!job.waiting) {
        job.waiting = true;
        setTimeout(function () {
          _send(campaignId);
        }, 500);
      }

      job.finishNum++;
    } else {
      console.log('Job can not be found!');
    }
  }

  function getUrls(campaignId) {
    var job = _jobs[campaignId];
    if (job) return job.cacheUrls;
  }

  return {
    register: register,
    watch: watch,
    fire: fire,
    getUrls: getUrls
  };
})();

Here we use fetch to extract the title and description from the crawled HTML. The fetch implementation is relatively simple (fetch.js):

module.exports = (function () {
  "use strict";

  return function (html) {
    if (!html) return { title: false, description: false };

    var title = html.match(/<title>(.*?)<\/title>/)
      , meta = html.match(/<meta\s(.*?)\/?>/g)
      , description;

    if (meta) {
      for (var i = meta.length; i--;) {
        if (meta[i].indexOf('name="description"') > -1 || meta[i].indexOf("name='description'") > -1) {
          description = meta[i].match(/content="(.*?)"/)[1];
        }
      }
    }

    (title && title[1] !== '') ? (title = title[1]) : (title = 'No title');
    description || (description = 'No description');

    return {
      title: title,
      description: description
    };
  };
})();

Finally, here is the PhantomJS source. snapshot.js fetches the job information from bridge over HTTP, and then reports each completed URL back to bridge over HTTP as well (snapshot.js):

var webpage = require('webpage')
  , args = require('system').args
  , fs = require('fs')
  , campaignId = args[1]
  , pkg = JSON.parse(fs.read('./package.json'));

function snapshot(id, url, imagePath) {
  var page = webpage.create();

  page.viewportSize = { width: 1024, height: 800 };
  page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
  page.settings = {
    javascriptEnabled: false,
    loadImages: true,
    userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
  };

  page.open(url, function (status) {
    var data;

    if (status === 'fail') {
      data = [
        'campaignId=', campaignId,
        '&url=', encodeURIComponent(url),
        '&id=', id,
        '&status=', 0
      ].join('');
    } else {
      page.render(imagePath);

      var html = page.content;

      // Callback to Node.js
      data = [
        'campaignId=', campaignId,
        '&html=', encodeURIComponent(html),
        '&url=', encodeURIComponent(url),
        '&image=', encodeURIComponent(imagePath),
        '&id=', id,
        '&status=', 1
      ].join('');
    }

    postman.post(data);

    // Release the memory
    page.close();
  });
}

var postman = {
  postPage: null,
  posting: false,
  datas: [],
  len: 0,
  currentNum: 0,
  init: function (snapshot) {
    var that = this
      , postPage = webpage.create();

    postPage.customHeaders = {
      'secret': pkg.secret
    };

    postPage.open('http://localhost:' + pkg.port + '/bridge?campaignId=' + campaignId, function () {
      var urls = JSON.parse(postPage.plainText).urls
        , url;

      that.len = urls.length;

      if (that.len) {
        for (var i = that.len; i--;) {
          url = urls[i];
          snapshot(url.id, url.url, url.imagePath);
        }
      }
    });

    this.postPage = postPage;
  },
  post: function (data) {
    this.datas.push(data);

    if (!this.posting) {
      this.posting = true;
      this.fire();
    }
  },
  fire: function () {
    if (this.datas.length) {
      var data = this.datas.shift()
        , that = this;

      this.postPage.open('http://localhost:' + pkg.port + '/bridge', 'POST', data, function () {
        that.fire();

        // Kill the child process once every url has been reported
        setTimeout(function () {
          if (++that.currentNum === that.len) {
            that.postPage.close();
            phantom.exit();
          }
        }, 500);
      });
    } else {
      this.posting = false;
    }
  }
};

postman.init(snapshot);


Result
