The previous article introduced how to take screenshots with Node.js + PhantomJS. Because every screenshot operation started a new PhantomJS process, efficiency became a concern as concurrency went up. So we rewrote all the code and separated it into an independent module that is convenient to call.
How was it improved? It controls the number of threads and the number of URLs each thread processes. It uses standard output and WebSocket for communication. It adds a caching mechanism, currently implemented with a JavaScript object. It provides a simple external interface.
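The object-based caching idea can be sketched as follows. This is only an illustration, not url-extract's actual code: the helper name createCache, its get/set/size methods, and the eviction policy (drop the oldest entry when the limit is exceeded) are all assumptions.

```javascript
'use strict';

// Hypothetical helper: a bounded cache backed by a plain JavaScript object.
function createCache(maxSize) {
  var store = Object.create(null); // no inherited keys such as "toString"
  var keys = [];                   // insertion order, oldest first

  return {
    get: function (url) {
      return store[url]; // undefined when the URL is not cached
    },
    set: function (url, value) {
      if (!(url in store)) {
        keys.push(url);
        // Assumed policy: evict the oldest entry once the limit is exceeded.
        if (keys.length > maxSize) {
          delete store[keys.shift()];
        }
      }
      store[url] = value;
    },
    size: function () {
      return keys.length;
    }
  };
}

var cache = createCache(2);
cache.set('http://a.example', { image: 'a.png', status: true });
cache.set('http://b.example', { image: 'b.png', status: true });
cache.set('http://c.example', { image: 'c.png', status: true });
console.log(cache.get('http://a.example'));       // the oldest entry was evicted
console.log(cache.get('http://c.example').image); // the newest entry is still cached
```

A plain object keeps lookups simple, but it needs an explicit size limit like the one above, otherwise the cache grows with every new URL.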
Design Drawings
Dependency & Installation
PhantomJS has supported WebSocket since 1.9.0, so first make sure that the phantomjs in your PATH is version 1.9.0 or later. At the command line, type:
$ phantomjs -v
If it returns a 1.9.x version number, you can continue. If the version is too low, or an error occurs, download the latest version from the PhantomJS website.
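If you would rather check the version from a script, the comparison can be sketched like this. The helper name isAtLeast is hypothetical, and the version string is assumed to be the text printed by the command above.

```javascript
'use strict';

// Hypothetical helper: returns true when `version` (e.g. "1.9.2") is at
// least `minimum` (e.g. "1.9.0"), comparing numeric parts left to right.
function isAtLeast(version, minimum) {
  var a = version.trim().split('.').map(Number);
  var b = minimum.split('.').map(Number);
  for (var i = 0; i < Math.max(a.length, b.length); i++) {
    var x = a[i] || 0; // a missing part counts as 0
    var y = b[i] || 0;
    if (x !== y) {
      return x > y;
    }
  }
  return true; // all parts are equal
}

console.log(isAtLeast('1.9.2', '1.9.0'));
console.log(isAtLeast('1.8.1', '1.9.0'));
```

Comparing the parts as numbers rather than as strings matters here: a plain string comparison would rank "1.10.0" below "1.9.0".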
If you already have Git installed, or have a Git shell, type the following at the command line to install:
$ npm install url-extract
A simple example
For example, to take a screenshot of the Baidu home page, you can write:
module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('This is a snapshot example.');
    console.log(job);
    process.exit();
  });
})();
This is the printed output:
The image property is the path of the screenshot relative to the working directory. We can use the job's getData interface to get cleaner data, for example:
module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('This is a snapshot example.');
    console.log(job.getData());
    process.exit();
  });
})();
The output becomes:
image is the path of the screenshot relative to the working directory. status indicates whether the state is normal: true means normal, false means the screenshot failed.
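A callback will usually branch on status before using image. A minimal sketch, assuming only the two fields described above; the helper name describeResult and the sample objects are made up for illustration:

```javascript
'use strict';

// Hypothetical helper: turns the object returned by job.getData() into a
// log message. Only the `image` and `status` fields are used.
function describeResult(data) {
  if (data.status) {
    return 'Screenshot saved to ' + data.image;
  }
  return 'Screenshot failed';
}

// Sample objects standing in for real job.getData() results.
console.log(describeResult({ image: 'snapshot/example.png', status: true }));
console.log(describeResult({ image: 'snapshot/example.png', status: false }));
```

Inside a real snapshot callback this would be called as describeResult(job.getData()).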
For more examples, see: https://github.com/miniflycn/url-extract/tree/master/examples
Main API
.snapshot
URL snapshot
.snapshot(url, [callback])
.snapshot(urls, [callback])
.snapshot(url, [option])
.snapshot(urls, [option])
url {String} the address to capture
urls {Array} an array of addresses to capture
callback {Function} callback function
option {Object} optional parameters
┝ id {String} custom ID for the URL; ignored if the first argument is urls
┝ image {String} custom path for saving the screenshot; ignored if the first argument is urls
┝ groupId {String} groupId for a set of URLs, used on return to identify which group a URL belongs to
┝ ignoreCache {Boolean} whether to ignore the cache
┗ callback {Function} callback function
.extract
Crawls URL information and takes a snapshot
.extract(url, [callback])
.extract(urls, [callback])
.extract(url, [option])
.extract(urls, [option])
url {String} the address to capture
urls {Array} an array of addresses to capture
callback {Function} callback function
option {Object} optional parameters
┝ id {String} custom ID for the URL; ignored if the first argument is urls
┝ image {String} custom path for saving the screenshot; ignored if the first argument is urls
┝ groupId {String} groupId for a set of URLs, used on return to identify which group a URL belongs to
┝ ignoreCache {Boolean} whether to ignore the cache
┗ callback {Function} callback function
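Both methods accept the same four call forms: a single URL or an array, followed by either a callback or an option object. One way such overloads could be folded into a single shape is sketched below. The helper normalizeArgs is hypothetical and is not taken from the url-extract source; the URLs and the groupId value are placeholders.

```javascript
'use strict';

// Hypothetical helper, not part of url-extract: maps the four call forms
// onto one shape so later code only deals with an array plus an option object.
function normalizeArgs(target, second) {
  var isGroup = Array.isArray(target);
  return {
    urls: isGroup ? target : [target],
    isGroup: isGroup,
    // A bare function is treated as option.callback.
    option: typeof second === 'function' ? { callback: second } : (second || {})
  };
}

function done(job) { /* handle the finished job here */ }

var single = normalizeArgs('http://www.baidu.com', done);
var group = normalizeArgs(['http://www.baidu.com', 'http://example.com'],
                          { groupId: 'portals', callback: done });

console.log(single.urls.length, single.option.callback === done);
console.log(group.urls.length, group.option.groupId);
```

Normalizing once at the entry point means the rest of the code never has to check whether it received a string or an array, or a function or an object.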
Job (Class)
Each URL corresponds to a Job object, and the information about the URL is stored by the Job object.
Fields
url {String} link address
content {Boolean} whether to crawl the page title and description
id {String} ID of the job
groupId {String} group ID shared by a set of jobs
cache {Boolean} whether caching is turned on
callback {Function} callback function
image {String} image path
status {Boolean} whether the job is currently normal
Prototype
getData() returns the job's related data
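To make the shape of a job concrete, here is a rough stand-in for the class. It is not the library's source: the constructor signature, the default status, and the idea that getData() returns a plain object without the callback are all assumptions made for illustration.

```javascript
'use strict';

// Hypothetical stand-in for the Job class; not the library's source.
function Job(url, option) {
  option = option || {};
  this.url = url;
  this.id = option.id;
  this.groupId = option.groupId;
  this.image = option.image;
  this.callback = option.callback;
  this.status = true; // assumed default until a failure is reported
}

// Assumed behaviour: return only plain data, leaving out the callback.
Job.prototype.getData = function () {
  return {
    url: this.url,
    id: this.id,
    groupId: this.groupId,
    image: this.image,
    status: this.status
  };
};

var job = new Job('http://www.baidu.com', { id: 'demo', image: 'snapshot/demo.png' });
console.log(JSON.stringify(job.getData()));
```

Returning a plain object keeps functions out of the result, which makes it safe to log or serialize.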
Global Configuration
The config file in the url-extract root directory provides global configuration. The defaults are as follows:
module.exports = {
  wsPort: 3001,
  maxJob: 100,
  maxQueueJob: 400,
  cache: 'object',
  maxCache: 10000,
  workerNum: 0
};
wsPort {Number} port used by WebSocket
maxJob {Number} number of jobs each PhantomJS thread can run concurrently
maxQueueJob {Number} maximum number of waiting jobs; 0 means no limit. Beyond this number, any job directly returns failure (that is, status = false)
cache {String} cache implementation; currently only the object implementation is available
maxCache {Number} maximum number of cached links
workerNum {Number} number of PhantomJS threads; 0 means the same as the number of CPUs
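The two "0 means ..." rules above can be illustrated with a small sketch. The helper resolveConfig is hypothetical and the camelCase key spelling is assumed; only os.cpus() is a real Node.js API here.

```javascript
'use strict';

var os = require('os');

// Hypothetical helper: applies the special meaning of 0 for two settings.
function resolveConfig(config) {
  return {
    // 0 workers means one PhantomJS thread per CPU.
    workerNum: config.workerNum === 0 ? os.cpus().length : config.workerNum,
    // 0 means the waiting queue has no limit.
    maxQueueJob: config.maxQueueJob === 0 ? Infinity : config.maxQueueJob
  };
}

var resolved = resolveConfig({ workerNum: 0, maxQueueJob: 400 });
console.log(resolved.workerNum, resolved.maxQueueJob);
```

Resolving the special values once at startup keeps the queueing logic free of repeated zero checks.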
A simple service example
https://github.com/miniflycn/url-extract-server-example
Note that you need to install connect and url-extract:
$ npm install
If you downloaded the files from the network disk, install connect:
$ npm install connect
And then type:
$ node bin/server
Then open:
http://localhost:3000
to see the result.