PhantomJS is an economical and practical way to take screenshots of web pages, but its API is small, so doing anything beyond that is laborious. For example, its built-in web server, Mongoose, can handle only 10 requests at a time, so it is impractical to expect PhantomJS to run as a standalone service. Another language is needed to provide the service layer, and here we choose Node.js.
Installing PhantomJS
First, download the build for your platform from the PhantomJS website, or download the source code and compile it yourself. Then add PhantomJS to your PATH environment variable and run
$ phantomjs
If PhantomJS starts and shows its interactive prompt, the installation works and you can proceed to the next step.
Taking a simple screenshot with PhantomJS
The code is as follows:
var webpage = require('webpage')
  , page = webpage.create();

page.viewportSize = { width: 1024, height: 800 };
page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
page.settings = {
  javascriptEnabled: false,
  loadImages: true,
  userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
};

page.open('http://www.baidu.com', function (status) {
  if (status === 'fail') {
    console.log('open page fail!');
  } else {
    page.render('./snapshot/test.png');
  }
  // Release the memory
  page.close();
  // End the PhantomJS process; without this the script keeps running
  phantom.exit();
});
Here we set the viewport size to 1024 × 800:
page.viewportSize = { width: 1024, height: 800 };
Capture a 1024 × 800 region of the page, starting from (0, 0):
page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
Disable JavaScript, allow images to load, and change the user agent to "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0":
page.settings = {
  javascriptEnabled: false,
  loadImages: true,
  userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
};
Then page.open opens the page, and page.render writes the final screenshot to ./snapshot/test.png:
page.render('./snapshot/test.png');
Communication between Node.js and PhantomJS
Let's first look at the ways in which PhantomJS can communicate.
Command-line arguments
For example:
$ phantomjs snapshot.js http://www.baidu.com
Command-line arguments can only be passed when PhantomJS is started. Once it is running, there is no way to send it anything more this way.
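A minimal sketch of how a script might read such an argument. In PhantomJS the argument list comes from require('system').args, where index 0 is the script name. The helper name readTarget and its return shape are illustrative assumptions, not part of the original project. It is written as a plain function so it can also be tried under Node.js, where process.argv.slice(1) has the same layout.

```javascript
// Hypothetical helper: pick the target URL out of an argument list.
// args[0] is the script name, args[1] the first user-supplied argument.
function readTarget(args) {
  if (!args || args.length < 2) {
    throw new Error('usage: phantomjs snapshot.js <url>');
  }
  return { script: args[0], url: args[1] };
}

// Inside PhantomJS one would call: readTarget(require('system').args)
// Under Node.js the equivalent list is process.argv.slice(1).
var target = readTarget(['snapshot.js', 'http://www.baidu.com']);
console.log(target.url);
```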
Standard output
Standard output can carry data from PhantomJS to Node.js, but it cannot carry data from Node.js to PhantomJS.
In our tests, however, standard output was the fastest of these channels, so it is worth considering when a large amount of data has to be transferred.
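Because stdout arrives in Node.js as arbitrary chunks, the receiver usually has to reassemble messages itself. Below is a minimal sketch, assuming the PhantomJS script prints one JSON object per line. That convention is chosen here for illustration and is not something the original code does, and createLineParser is a hypothetical helper.

```javascript
// Hypothetical helper: reassemble newline-delimited JSON from stdout chunks.
function createLineParser(onMessage) {
  var buffer = '';
  return function (chunk) {
    buffer += String(chunk);
    var lines = buffer.split('\n');
    buffer = lines.pop(); // keep the trailing partial line for the next chunk
    lines.forEach(function (line) {
      if (line.trim()) onMessage(JSON.parse(line));
    });
  };
}

// Simulate two chunks that split one message in the middle.
var received = [];
var feed = createLineParser(function (msg) { received.push(msg); });
feed('{"id":0,"status":1}\n{"id":1,');
feed('"status":0}\n');
console.log(received.length);
```

With a real child process, the parser would be attached with child.stdout.on('data', feed).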
HTTP
PhantomJS sends an HTTP request to the Node.js service, and Node.js returns the corresponding data.
This approach is simple, but requests can only be initiated by PhantomJS.
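As a rough illustration of the Node.js half of this approach, the sketch below answers every request with a list of URLs as JSON. The pendingUrls list, the handleRequest name and port 3000 are assumptions made for this example only.

```javascript
var http = require('http');

// Hypothetical job list that the PhantomJS side would ask for.
var pendingUrls = ['http://www.baidu.com'];

// Request handler: always answers with the pending URLs as JSON.
function handleRequest(req, res) {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ urls: pendingUrls }));
}

var server = http.createServer(handleRequest);
// server.listen(3000); // port 3000 is an arbitrary choice for this sketch
```

On the PhantomJS side, a page opened on that address could then read the reply through page.plainText and JSON.parse it.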
WebSocket
Notably, PhantomJS 1.9.0 supports WebSocket, although unfortunately only the outdated hixie-76 draft. Still, it does give Node.js a way to initiate communication with PhantomJS.
In our tests, however, PhantomJS took about 1 second just to connect to a local WebSocket service, so we set this method aside for now.
phantomjs-node
phantomjs-node successfully wraps PhantomJS as a Node.js module, but let's look at the author's explanation of how it works:
I will answer that question with a question. How do you communicate with a process that doesn't support shared memory, sockets, FIFOs, or standard input?
So, there's one thing PhantomJS does support, and that's opening webpages. In fact, it's really good at opening web pages. So we communicate with PhantomJS by spinning up an instance of ExpressJS, opening Phantom in a subprocess, and pointing it at a special webpage which turns socket.io messages into alert() calls. Those alert() calls are picked up by Phantom and there you go!
The communication itself happens via James Halliday's fantastic dnode library, which fortunately works well enough when combined with browserify to run straight out of PhantomJS's pidgin Javascript environment.
In other words, phantomjs-node also communicates over HTTP or WebSocket, but it pulls in a lot of dependencies. We only want to do one simple thing, so we set it aside for now as well.
Design diagram
Let's get started.
The first version uses HTTP for communication.
The first step is to use cluster as a simple process daemon (index.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  var cluster = require('cluster')
    , fs = require('fs');

  // Make sure the output directory exists
  if (!fs.existsSync('./snapshot')) {
    fs.mkdirSync('./snapshot');
  }

  if (cluster.isMaster) {
    cluster.fork();

    // Start a new worker whenever the current one dies
    cluster.on('exit', function (worker) {
      console.log('Worker ' + worker.id + ' died :(');
      process.nextTick(function () {
        cluster.fork();
      });
    });
  } else {
    require('./extract.js');
  }
})();
Then use connect (the 2.x API, which bundles the logger, bodyParser and static middleware) to build our external API (extract.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  var connect = require('connect')
    , fs = require('fs')
    , spawn = require('child_process').spawn
    , jobman = require('./lib/jobman.js')
    , bridge = require('./lib/bridge.js')
    , pkg = JSON.parse(fs.readFileSync('./package.json'));

  var app = connect()
    .use(connect.logger('dev'))
    .use('/snapshot', connect.static(__dirname + '/snapshot', { maxAge: pkg.maxAge }))
    .use(connect.bodyParser())
    .use('/bridge', bridge)
    .use('/api', function (req, res, next) {
      if (req.method !== "POST" || !req.body.campaignId) return next();
      // No URLs supplied: the client is polling an existing job
      if (!req.body.urls || !req.body.urls.length) return jobman.watch(req.body.campaignId, req, res, next);

      var campaignId = req.body.campaignId
        , imagesPath = './snapshot/' + campaignId + '/'
        , urls = []
        , url
        , imagePath;

      function _deal(id, url, imagePath) {
        // Just push into the URL list
        urls.push({
          id: id,
          url: url,
          imagePath: imagePath
        });
      }

      for (var i = req.body.urls.length; i--;) {
        url = req.body.urls[i];
        imagePath = imagesPath + i + '.png';
        _deal(i, url, imagePath);
      }

      jobman.register(campaignId, urls, req, res, next);

      // Start a PhantomJS child process to work through this job
      var snapshot = spawn('phantomjs', ['snapshot.js', campaignId]);
      snapshot.stdout.on('data', function (data) {
        console.log('stdout: ' + data);
      });
      snapshot.stderr.on('data', function (data) {
        console.log('stderr: ' + data);
      });
      snapshot.on('close', function (code) {
        console.log('snapshot exited with code ' + code);
      });
    })
    .use(connect.static(__dirname + '/html', { maxAge: pkg.maxAge }))
    .listen(pkg.port, function () { console.log('listen: ' + 'http://localhost:' + pkg.port); });
})();
Here we require two modules, bridge and jobman.
bridge is the HTTP communication bridge, and jobman is the job manager. We use a campaignId to identify a job, then hand the job and its response object over to jobman to manage. After that, PhantomJS is started to do the processing.
The communication bridge is responsible for receiving or returning information about a job and handing it to jobman (bridge.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  var jobman = require('./jobman.js')
    , fs = require('fs')
    , pkg = JSON.parse(fs.readFileSync('./package.json'));

  return function (req, res, next) {
    // Only requests carrying the shared secret are handled here
    if (req.headers.secret !== pkg.secret) return next();
    // Snapshot app can POST URL information
    if (req.method === "POST") {
      var body = JSON.parse(JSON.stringify(req.body));
      jobman.fire(body);
      res.end('');
    // Snapshot app can GET the URLs it should extract
    } else {
      var urls = jobman.getUrls(req.url.match(/campaignId=([^&]*)(\s|&|$)/)[1]);
      // writeHead takes the status code as its first argument
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ urls: urls }));
    }
  };
})();
If the request method is POST, we treat it as PhantomJS pushing information about the job to us. If it is a GET, we treat it as a request for the job's information.
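The GET branch reads campaignId out of the query string with a regular expression, which throws when the parameter is missing. As a hedged alternative, the sketch below uses the WHATWG URL class that is available as a global in current Node.js releases (newer than the Node.js version this article targets). getCampaignId is a hypothetical helper name.

```javascript
// Hypothetical helper: read campaignId from a request URL such as
// "/?campaignId=42". Returns null when the parameter is absent.
function getCampaignId(reqUrl) {
  // req.url is relative, so a dummy base is needed to parse it.
  var parsed = new URL(reqUrl, 'http://localhost');
  return parsed.searchParams.get('campaignId');
}

console.log(getCampaignId('/?campaignId=42&x=1'));
console.log(getCampaignId('/'));
```

Returning null instead of throwing lets the caller decide how to answer a malformed request.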
jobman is responsible for managing jobs and for sending the current job information back to the client through the response object (jobman.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  var fs = require('fs')
    , fetch = require('./fetch.js')
    , _jobs = {};

  // Flush the results collected so far to the waiting client
  function _send(campaignId) {
    var job = _jobs[campaignId];
    if (!job) return;
    if (job.waiting) {
      job.waiting = false;
      clearTimeout(job.timeout);
      var finished = (job.urlsNum === job.finishNum)
        , data = {
          campaignId: campaignId,
          urls: job.urls,
          finished: finished
        };
      job.urls = [];
      var res = job.res;
      if (finished) {
        _jobs[campaignId] = null;
        delete _jobs[campaignId];
      }
      // writeHead takes the status code as its first argument
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(data));
    }
  }

  function register(campaignId, urls, req, res, next) {
    _jobs[campaignId] = {
      urlsNum: urls.length,
      finishNum: 0,
      urls: [],
      cacheUrls: urls,
      res: null,
      waiting: false,
      timeout: null
    };
    watch(campaignId, req, res, next);
  }

  function watch(campaignId, req, res, next) {
    _jobs[campaignId].res = res;
    // 20s timeout
    _jobs[campaignId].timeout = setTimeout(function () {
      _send(campaignId);
    }, 20000);
  }

  // Called by bridge each time PhantomJS reports one finished URL
  function fire(opts) {
    var campaignId = opts.campaignId
      , job = _jobs[campaignId]
      , fetchObj = fetch(opts.html);
    if (job) {
      if (+opts.status && fetchObj.title) {
        job.urls.push({
          id: opts.id,
          url: opts.url,
          image: opts.image,
          title: fetchObj.title,
          description: fetchObj.description,
          status: +opts.status
        });
      } else {
        job.urls.push({
          id: opts.id,
          url: opts.url,
          status: +opts.status
        });
      }
      // Batch results that arrive within 500 ms into one response
      if (!job.waiting) {
        job.waiting = true;
        setTimeout(function () {
          _send(campaignId);
        }, 500);
      }
      job.finishNum++;
    } else {
      console.log('job can not found!');
    }
  }

  function getUrls(campaignId) {
    var job = _jobs[campaignId];
    if (job) return job.cacheUrls;
  }

  return {
    register: register,
    watch: watch,
    fire: fire,
    getUrls: getUrls
  };
})();
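The waiting flag and the 500 ms timer in fire amount to a small batching scheme: the first result starts a timer, results that arrive before it fires are queued, and everything is flushed in one response. Below is a minimal standalone sketch of that idea, under the assumption that the timer function can be injected. The createBatcher name and the schedule parameter are illustrative and do not appear in the original code.

```javascript
// Hypothetical batcher: collect items and flush them together after a delay.
// `schedule` defaults to setTimeout but can be replaced, e.g. in tests.
function createBatcher(flush, delayMs, schedule) {
  var items = [];
  var waiting = false;
  schedule = schedule || setTimeout;
  return function add(item) {
    items.push(item);
    if (waiting) return; // a flush is already scheduled
    waiting = true;
    schedule(function () {
      var batch = items;
      items = [];
      waiting = false;
      flush(batch);
    }, delayMs);
  };
}

// Demonstration with a manual scheduler so the flush runs on demand.
var pendingTimers = [];
var batches = [];
var add = createBatcher(
  function (batch) { batches.push(batch); },
  500,
  function (fn) { pendingTimers.push(fn); }
);
add('a');
add('b');
pendingTimers.shift()(); // simulate the 500 ms timer firing
console.log(batches);
```

Injecting the scheduler keeps the sketch deterministic. In a real service the default setTimeout would be used.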
Here we use fetch to extract the title and description from the HTML. The fetch implementation is fairly simple (fetch.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  return function (html) {
    if (!html) return { title: false, description: false };

    var title = html.match(/\<title\>(.*?)\<\/title\>/)
      , meta = html.match(/\<meta\s(.*?)\/?\>/g)
      , description;

    if (meta) {
      for (var i = meta.length; i--;) {
        // Accept both lower-case and capitalized attribute values
        if (meta[i].indexOf('name="description"') > -1 || meta[i].indexOf('name="Description"') > -1) {
          description = meta[i].match(/content\=\"(.*?)\"/)[1];
        }
      }
    }

    // Fall back to placeholders when nothing was found
    (title && title[1] !== '') ? (title = title[1]) : (title = 'No title');
    description || (description = 'No description');

    return {
      title: title,
      description: description
    };
  };
})();
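To make the behaviour concrete, here is a compact standalone sketch of the same regular-expression approach applied to a sample document. extractMeta is a hypothetical re-implementation for illustration, and the case-insensitive flags and the guard around the content match are additions. Regular expressions are only a rough fit for HTML: attribute order, single quotes and multi-line tags are not handled here.

```javascript
// Hypothetical re-implementation of the title/description extraction.
function extractMeta(html) {
  if (!html) return { title: false, description: false };
  var titleMatch = html.match(/<title>(.*?)<\/title>/i);
  var metas = html.match(/<meta\s(.*?)\/?>/gi) || [];
  var description;
  metas.forEach(function (tag) {
    if (/name="description"/i.test(tag)) {
      var content = tag.match(/content="(.*?)"/i);
      if (content) description = content[1];
    }
  });
  return {
    title: (titleMatch && titleMatch[1]) || 'No title',
    description: description || 'No description'
  };
}

var sample = '<html><head><title>Demo</title>' +
  '<meta name="description" content="A test page"></head></html>';
console.log(extractMeta(sample));
console.log(extractMeta('<html><head></head></html>'));
```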
Finally, here is the source that PhantomJS runs. It fetches the job information from bridge over HTTP, and each time a URL in the job is finished it reports the result back to bridge over HTTP (snapshot.js):
The code is as follows:
var webpage = require('webpage')
  , args = require('system').args
  , fs = require('fs')
  , campaignId = args[1]
  , pkg = JSON.parse(fs.read('./package.json'));

function snapshot(id, url, imagePath) {
  var page = webpage.create();

  page.viewportSize = { width: 1024, height: 800 };
  page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
  page.settings = {
    javascriptEnabled: false,
    loadImages: true,
    userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
  };

  page.open(url, function (status) {
    var data;
    if (status === 'fail') {
      // The status value is missing from the source listing; 0 is assumed
      // for failure, which matches the +opts.status check in jobman
      data = [
        'campaignId=',
        campaignId,
        '&url=',
        encodeURIComponent(url),
        '&id=',
        id,
        '&status=',
        0
      ].join('');
      // Report through the shared queue so the completion count stays correct
      // (the listing referenced an undefined postPage variable here)
      postman.post(data);
    } else {
      page.render(imagePath);
      var html = page.content;
      // Callback to Node.js; status 1 (assumed, see above) marks success
      data = [
        'campaignId=',
        campaignId,
        '&html=',
        encodeURIComponent(html),
        '&url=',
        encodeURIComponent(url),
        '&image=',
        encodeURIComponent(imagePath),
        '&id=',
        id,
        '&status=',
        1
      ].join('');
      postman.post(data);
    }
    // Release the memory
    page.close();
  });
}

var postman = {
  postPage: null,
  posting: false,
  datas: [],
  len: 0,
  currentNum: 0,
  init: function (snapshot) {
    var postPage = webpage.create()
      , that = this;
    postPage.customHeaders = {
      'secret': pkg.secret
    };
    postPage.open('http://localhost:' + pkg.port + '/bridge?campaignId=' + campaignId, function () {
      var urls = JSON.parse(postPage.plainText).urls
        , url;

      // Use `that`: inside this callback `this` is not the postman object
      that.len = urls.length;

      if (that.len) {
        for (var i = that.len; i--;) {
          url = urls[i];
          snapshot(url.id, url.url, url.imagePath);
        }
      }
    });
    this.postPage = postPage;
  },
  post: function (data) {
    this.datas.push(data);
    if (!this.posting) {
      this.posting = true;
      this.fire();
    }
  },
  fire: function () {
    if (this.datas.length) {
      var data = this.datas.shift()
        , that = this;
      this.postPage.open('http://localhost:' + pkg.port + '/bridge', 'POST', data, function () {
        that.fire();
        // Kill the child process once every URL has been reported
        setTimeout(function () {
          if (++that.currentNum === that.len) {
            that.postPage.close();
            phantom.exit();
          }
        }, 500);
      });
    } else {
      this.posting = false;
    }
  }
};
postman.init(snapshot);
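The POST bodies in snapshot.js are assembled by hand as application/x-www-form-urlencoded strings. The sketch below shows that encoding in isolation, together with how the Node.js side could decode it using the built-in querystring module. The toFormBody helper and the sample values are assumptions for illustration.

```javascript
var querystring = require('querystring');

// Hypothetical helper: encode a flat object as a urlencoded form body.
function toFormBody(fields) {
  return Object.keys(fields).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]);
  }).join('&');
}

var body = toFormBody({
  campaignId: 'demo',
  url: 'http://www.baidu.com/?a=1&b=2',
  id: 0,
  status: 1
});
console.log(body);

// Decoding on the receiving side restores the original strings.
var decoded = querystring.parse(body);
console.log(decoded.url);
```

Only toFormBody would be used inside PhantomJS. The querystring module belongs to Node.js and is shown here to confirm that the round trip preserves URLs containing & and =.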
Result