This article shows how to use NodeJS together with PhantomJS to capture screenshots and page information from websites, with sample code for reference. PhantomJS is well suited to rendering web pages, but it offers few APIs of its own, so building anything else on top of it is difficult. For example, its built-in web server, Mongoose, supports at most 10 concurrent requests, so it is not realistic to run PhantomJS as a standalone service. Another language is needed to provide the service layer; NodeJS is used here.
Install PhantomJS
First, download the build for your platform from the official PhantomJS website, or download the source code and compile it yourself. Then add PhantomJS to your PATH environment variable and run
$ phantomjs
If PhantomJS responds, you can proceed to the next step.
A simple PhantomJS example
The code is as follows:
var webpage = require('webpage')
  , page = webpage.create();
page.viewportSize = { width: 1024, height: 800 };
page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
page.settings = {
  javascriptEnabled: false,
  loadImages: true,
  userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
};
page.open('http://www.baidu.com', function (status) {
  if (status === 'fail') {
    console.log('open page fail!');
  } else {
    page.render('./snapshot/test.png');
  }
  // release the memory
  page.close();
  // exit PhantomJS, otherwise the process keeps running
  phantom.exit();
});
Here we set the window size to 1024*800:
The code is as follows:
page.viewportSize = { width: 1024, height: 800 };
Clip a 1024*800 region of the page, starting at (0, 0):
The code is as follows:
page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
Disable JavaScript, allow image loading, and set the userAgent to "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0":
The code is as follows:
page.settings = {
  javascriptEnabled: false,
  loadImages: true,
  userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
};
Use page.open to open the page, then render it to ./snapshot/test.png:
The code is as follows:
page.render('./snapshot/test.png');
Communication between NodeJS and PhantomJS
Let's look at the ways PhantomJS can communicate.
Command line parameter passing
For example:
phantomjs snapshot.js http://www.baidu.com
Command line parameters can only be passed when PhantomJS is started.
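As a rough sketch of this approach, the NodeJS side can pass the URL as an extra argument when it spawns PhantomJS. The helper names `phantomArgs` and `runSnapshot` are made up for illustration, and the `phantomjs` binary is assumed to be on the PATH.

```javascript
var spawn = require('child_process').spawn;

// Build the argument list for `phantomjs <script> <url>`.
// (hypothetical helper, not part of the original project)
function phantomArgs(script, url) {
  return [script, url];
}

// Start PhantomJS with the URL as a command line parameter.
// Nothing is spawned until this function is called.
function runSnapshot(url) {
  return spawn('phantomjs', phantomArgs('snapshot.js', url));
}
```

Inside snapshot.js the URL would then be read with require('system').args[1]; args[0] holds the script name.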
Standard output
Standard output can carry data from PhantomJS to NodeJS, but it cannot carry data from NodeJS to PhantomJS.
In our tests, however, standard output was the fastest of these methods, so it is worth considering when a large amount of data has to be transferred.
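A minimal sketch of how NodeJS could consume such output follows. It assumes, beyond what the article states, that the PhantomJS script prints one JSON document per line with console.log(JSON.stringify(...)); `createLineParser` is a hypothetical helper name.

```javascript
// Returns a function that accepts stdout chunks and calls onMessage
// once for every complete line, parsed as JSON.
function createLineParser(onMessage) {
  var buffer = '';
  return function (chunk) {
    buffer += chunk.toString();
    var lines = buffer.split('\n');
    // The last element is an incomplete line (or ''); keep it for later.
    buffer = lines.pop();
    lines.forEach(function (line) {
      if (line.trim()) onMessage(JSON.parse(line));
    });
  };
}
```

It would be attached to a spawned child with something like child.stdout.on('data', createLineParser(handler)). Buffering is needed because a 'data' chunk may end in the middle of a line.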
HTTP
PhantomJS sends an HTTP request to the NodeJS service, and NodeJS returns the corresponding data.
This method is simple, but requests can only be initiated by PhantomJS.
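To make the idea concrete, here is a small sketch of the NodeJS side using only the built-in http module. The handler name, the response shape and the port passed to `startServer` are illustrative assumptions, not the implementation used later in this article.

```javascript
var http = require('http');

// Answer every request with the list of URLs PhantomJS should process.
// The single hard-coded URL is only an example value.
function handleRequest(req, res) {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ urls: ['http://www.baidu.com'] }));
}

// Nothing listens until startServer is called, e.g. startServer(3000).
function startServer(port) {
  return http.createServer(handleRequest).listen(port);
}
```

PhantomJS could then load this address with page.open and read the response body from page.plainText.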
WebSocket
Note that PhantomJS 1.9.0 supports WebSocket, but unfortunately only the outdated hixie-76 draft. Still, it does give NodeJS a way to initiate communication with PhantomJS.
In our tests, PhantomJS took about 1 second to connect to a local WebSocket service, so we set this method aside for now.
phantomjs-node
phantomjs-node successfully wraps PhantomJS as a Node.js module, but let's look at how its author explains the way it works:
I will answer that question with a question. How do you communicate with a process that doesn't support shared memory, sockets, FIFOs, or standard input?
Well, there's one thing PhantomJS does support, and that's opening webpages. In fact, it's really good at opening web pages. So we communicate with PhantomJS by spinning up an instance of ExpressJS, opening Phantom in a subprocess, and pointing it at a special webpage that turns socket.io messages into alert() calls. Those alert() calls are picked up by Phantom and there you go!
The communication itself happens via James Halliday's fantastic dnode library, which fortunately works well enough when combined with browserify to run straight out of PhantomJS's pidgin Javascript environment.
In other words, phantomjs-node also communicates over HTTP or WebSocket, but it pulls in a lot of dependencies. We only want to do something simple, so we leave it aside for now.
Design Diagram
Let's get started.
We chose HTTP for the first version.
First, use cluster as a simple process daemon (index.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  var cluster = require('cluster')
    , fs = require('fs');
  // make sure the output directory exists
  if (!fs.existsSync('./snapshot')) {
    fs.mkdirSync('./snapshot');
  }
  if (cluster.isMaster) {
    cluster.fork();
    // restart the worker whenever it dies
    cluster.on('exit', function (worker) {
      console.log('worker ' + worker.id + ' died :(');
      process.nextTick(function () {
        cluster.fork();
      });
    });
  } else {
    require('./extract.js');
  }
})();
Then we use connect to build our external API (extract.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  var connect = require('connect')
    , fs = require('fs')
    , spawn = require('child_process').spawn
    , jobMan = require('./lib/jobMan.js')
    , bridge = require('./lib/bridge.js')
    , pkg = JSON.parse(fs.readFileSync('./package.json'));
  var app = connect()
    .use(connect.logger('dev'))
    .use('/snapshot', connect.static(__dirname + '/snapshot', { maxAge: pkg.maxAge }))
    .use(connect.bodyParser())
    .use('/bridge', bridge)
    .use('/api', function (req, res, next) {
      if (req.method !== "POST" || !req.body.campaignId) return next();
      // no urls given: the client is only polling an existing job
      if (!req.body.urls || !req.body.urls.length) return jobMan.watch(req.body.campaignId, req, res, next);
      var campaignId = req.body.campaignId
        , imagesPath = './snapshot/' + campaignId + '/'
        , urls = []
        , url
        , imagePath;
      function _deal(id, url, imagePath) {
        // just push into urls list
        urls.push({
          id: id,
          url: url,
          imagePath: imagePath
        });
      }
      for (var i = req.body.urls.length; i--;) {
        url = req.body.urls[i];
        imagePath = imagesPath + i + '.png';
        _deal(i, url, imagePath);
      }
      jobMan.register(campaignId, urls, req, res, next);
      // start PhantomJS and pass the campaignId on the command line
      var snapshot = spawn('phantomjs', ['snapshot.js', campaignId]);
      snapshot.stdout.on('data', function (data) {
        console.log('stdout: ' + data);
      });
      snapshot.stderr.on('data', function (data) {
        console.log('stderr: ' + data);
      });
      snapshot.on('close', function (code) {
        console.log('snapshot exited with code ' + code);
      });
    })
    .use(connect.static(__dirname + '/html', { maxAge: pkg.maxAge }))
    .listen(pkg.port, function () { console.log('listen: ' + 'http://localhost:' + pkg.port); });
})();
Here we reference two modules: bridge and jobMan.
bridge is the HTTP communication bridge, and jobMan is the job manager. Each campaignId corresponds to one job; we hand the job and its response over to jobMan for management, and then start PhantomJS to do the processing.
The communication bridge receives or returns job information and passes it on to jobMan (bridge.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  var jobMan = require('./jobMan.js')
    , fs = require('fs')
    , pkg = JSON.parse(fs.readFileSync('./package.json'));
  return function (req, res, next) {
    // only requests carrying the shared secret are handled here
    if (req.headers.secret !== pkg.secret) return next();
    // Snapshot APP can post url information
    if (req.method === "POST") {
      var body = JSON.parse(JSON.stringify(req.body));
      jobMan.fire(body);
      res.end('');
    // Snapshot APP can get the urls it should extract
    } else {
      var urls = jobMan.getUrls(req.url.match(/campaignId=([^&]*)(\s|&|$)/)[1]);
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ urls: urls }));
    }
  };
})();
If the request method is POST, we treat it as PhantomJS pushing job results to us; for GET, PhantomJS is fetching the job information.
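Note that the regular expression used for the GET case throws when the query string has no campaignId, because match returns null. One possible alternative, shown here only as a sketch, is to read the parameter with the URL class that is global in current NodeJS versions (it was not available when PhantomJS 1.9 was current). The helper name `getCampaignId` and the placeholder base address are assumptions.

```javascript
// Extract campaignId from a request path such as '/?campaignId=42'.
// Returns null when the parameter is missing instead of throwing.
function getCampaignId(reqUrl) {
  // req.url is only a path, so a placeholder base is needed to parse it.
  var parsed = new URL(reqUrl, 'http://localhost');
  return parsed.searchParams.get('campaignId');
}
```

The caller can then answer with an error status or call next() when the result is null.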
jobMan manages the jobs and sends the current job information to the client through the stored response (jobMan.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  var fs = require('fs')
    , fetch = require('./fetch.js')
    , _jobs = {};
  // send the results collected so far to the waiting client
  function _send(campaignId) {
    var job = _jobs[campaignId];
    if (!job) return;
    if (job.waiting) {
      job.waiting = false;
      clearTimeout(job.timeout);
      var finished = (job.urlsNum === job.finishNum)
        , data = {
          campaignId: campaignId,
          urls: job.urls,
          finished: finished
        };
      job.urls = [];
      var res = job.res;
      if (finished) {
        _jobs[campaignId] = null;
        delete _jobs[campaignId];
      }
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(data));
    }
  }
  function register(campaignId, urls, req, res, next) {
    _jobs[campaignId] = {
      urlsNum: urls.length,
      finishNum: 0,
      urls: [],
      cacheUrls: urls,
      res: null,
      waiting: false,
      timeout: null
    };
    watch(campaignId, req, res, next);
  }
  function watch(campaignId, req, res, next) {
    // unknown job: let the next middleware handle the request
    if (!_jobs[campaignId]) return next();
    _jobs[campaignId].res = res;
    // 20s timeout
    _jobs[campaignId].timeout = setTimeout(function () {
      _send(campaignId);
    }, 20000);
  }
  // called by bridge when PhantomJS reports the result for one url
  function fire(opts) {
    var campaignId = opts.campaignId
      , job = _jobs[campaignId]
      , fetchObj = fetch(opts.html);
    if (job) {
      if (+opts.status && fetchObj.title) {
        job.urls.push({
          id: opts.id,
          url: opts.url,
          image: opts.image,
          title: fetchObj.title,
          description: fetchObj.description,
          status: +opts.status
        });
      } else {
        job.urls.push({
          id: opts.id,
          url: opts.url,
          status: +opts.status
        });
      }
      // batch the results that arrive within 500 ms into one response
      if (!job.waiting) {
        job.waiting = true;
        setTimeout(function () {
          _send(campaignId);
        }, 500);
      }
      job.finishNum++;
    } else {
      console.log('job can not be found!');
    }
  }
  function getUrls(campaignId) {
    var job = _jobs[campaignId];
    if (job) return job.cacheUrls;
  }
  return {
    register: register,
    watch: watch,
    fire: fire,
    getUrls: getUrls
  };
})();
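The bookkeeping behind jobMan can be reduced to three small operations, sketched below without any timers or HTTP objects. The function names `createJob`, `record` and `drain` are hypothetical and only illustrate the idea: results accumulate until the next response, and a job counts as finished once every URL has reported back.

```javascript
// A job tracks how many urls it expects and which results are not yet sent.
function createJob(total) {
  return { total: total, done: 0, pending: [] };
}

// Store one result reported by PhantomJS.
function record(job, result) {
  job.pending.push(result);
  job.done++;
}

// Take everything collected so far, as one response payload.
function drain(job) {
  var batch = { urls: job.pending, finished: job.done === job.total };
  job.pending = [];
  return batch;
}
```

In jobMan.js the equivalent of drain runs about 500 ms after the first result of a batch arrives, so results reported close together end up in the same response.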
Here we use fetch to extract the title and description from the HTML. The implementation of fetch is fairly simple (fetch.js):
The code is as follows:
module.exports = (function () {
  "use strict";
  return function (html) {
    if (!html) return { title: false, description: false };
    var title = html.match(/<title>(.*?)<\/title>/)
      , meta = html.match(/<meta\s(.*?)\/?>/g)
      , description;
    if (meta) {
      for (var i = meta.length; i--;) {
        if (meta[i].indexOf('name="description"') > -1 || meta[i].indexOf('name="Description"') > -1) {
          description = meta[i].match(/content="(.*?)"/)[1];
        }
      }
    }
    // fall back to placeholders when nothing was found
    title = (title && title[1] !== '') ? title[1] : 'No Title';
    description = description || 'No Description';
    return {
      title: title,
      description: description
    };
  };
})();
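To show what this kind of extraction returns, here is a standalone variant that can be run on its own. It is not the article's fetch.js: the name `extractMeta` is made up, and the case-insensitive matching and the guard for a missing content attribute are additions.

```javascript
// Pull the <title> text and the description <meta> content out of an
// HTML string, using the same fallbacks as fetch.js.
function extractMeta(html) {
  if (!html) return { title: false, description: false };
  var titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  var metaTags = html.match(/<meta\s[^>]*>/gi) || [];
  var description;
  metaTags.forEach(function (tag) {
    if (/name\s*=\s*["']description["']/i.test(tag)) {
      var content = tag.match(/content\s*=\s*"([^"]*)"/i);
      if (content) description = content[1];
    }
  });
  return {
    title: (titleMatch && titleMatch[1].trim()) || 'No Title',
    description: description || 'No Description'
  };
}

var sample = '<html><head><TITLE> Hi </TITLE>' +
  '<meta name="Description" content="A page"></head></html>';
console.log(extractMeta(sample)); // { title: 'Hi', description: 'A page' }
```

Regular expressions are enough for this narrow purpose, but they will miss unusual markup such as single-quoted content attributes.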
Finally, here is the source code that runs in PhantomJS. After it starts, it fetches the job information from bridge over HTTP, and then reports the result for each URL of the job back to bridge over HTTP (snapshot.js):
The code is as follows:
var webpage = require('webpage')
  , args = require('system').args
  , fs = require('fs')
  , campaignId = args[1]
  , pkg = JSON.parse(fs.read('./package.json'));
function snapshot(id, url, imagePath) {
  var page = webpage.create();
  page.viewportSize = { width: 1024, height: 800 };
  page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
  page.settings = {
    javascriptEnabled: false,
    loadImages: true,
    userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
  };
  page.open(url, function (status) {
    var data;
    if (status === 'fail') {
      data = [
        'campaignId=',
        campaignId,
        '&url=',
        encodeURIComponent(url),
        '&id=',
        id,
        '&status=',
        0
      ].join('');
      // report failures through the same queue so they are counted too
      postMan.post(data);
    } else {
      page.render(imagePath);
      var html = page.content;
      // callback NodeJS
      data = [
        'campaignId=',
        campaignId,
        '&html=',
        encodeURIComponent(html),
        '&url=',
        encodeURIComponent(url),
        '&image=',
        encodeURIComponent(imagePath),
        '&id=',
        id,
        '&status=',
        1
      ].join('');
      postMan.post(data);
    }
    // release the memory
    page.close();
  });
}
var postMan = {
  postPage: null,
  posting: false,
  datas: [],
  len: 0,
  currentNum: 0,
  init: function (snapshot) {
    var postPage = webpage.create()
      , that = this;
    postPage.customHeaders = {
      'secret': pkg.secret
    };
    // fetch the list of urls for this job from bridge
    postPage.open('http://localhost:' + pkg.port + '/bridge?campaignId=' + campaignId, function () {
      var urls = JSON.parse(postPage.plainText).urls
        , url;
      // inside this callback `this` is not postMan, so use `that`
      that.len = urls.length;
      if (that.len) {
        for (var i = that.len; i--;) {
          url = urls[i];
          snapshot(url.id, url.url, url.imagePath);
        }
      }
    });
    this.postPage = postPage;
  },
  post: function (data) {
    this.datas.push(data);
    if (!this.posting) {
      this.posting = true;
      this.fire();
    }
  },
  // send the queued results to bridge one at a time
  fire: function () {
    if (this.datas.length) {
      var data = this.datas.shift()
        , that = this;
      this.postPage.open('http://localhost:' + pkg.port + '/bridge', 'post', data, function () {
        that.fire();
        // kill child process once every url has been reported
        setTimeout(function () {
          if (++that.currentNum === that.len) {
            that.postPage.close();
            phantom.exit();
          }
        }, 500);
      });
    } else {
      this.posting = false;
    }
  }
};
postMan.init(snapshot);
Effect