This article describes how to use url-extract, a Node.js module for capturing URL information and page snapshots, and provides example code for reference. In the original approach a separate PhantomJS process was spawned for every capture, so efficiency became a real concern as concurrency grew. We therefore rewrote the code and packaged it as a standalone module that is easy to call.
How did we improve it?
- Control the number of PhantomJS threads and the number of URLs handled by a single thread.
- Use standard output and WebSocket for communication.
- Add a cache mechanism (currently a plain JavaScript object).
- Provide simple external interfaces.
Design Diagram
Dependency & installation
Since PhantomJS supports WebSocket starting with version 1.9.0, first make sure the PhantomJS on your PATH is version 1.9.0 or later. On the command line, type:
$ phantomjs -v
If it prints a version such as 1.9.x, you can continue. If the version is too old or the command fails, download the latest version from the official PhantomJS website.
If you have Git installed or use Git Shell, type:
$ npm install url-extract
A simple example
For example, to take a snapshot of the Baidu homepage, the code is as follows:
module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('This is a snapshot example.');
    console.log(job);
    process.exit();
  });
})();
The printed output is as follows:
The image property is a path relative to the working directory. We can use the job's getData interface to get cleaner data. The code is as follows:
module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('This is a snapshot example.');
    console.log(job.getData());
    process.exit();
  });
})();
The printed output then becomes:
Here image is the snapshot path relative to the working directory, and status indicates whether the job completed normally: true means success, false means failure.
For more examples, see: https://github.com/miniflycn/url-extract/tree/master/examples
Main APIs
.snapshot
Takes a snapshot of a URL.
.snapshot(url, [callback])
.snapshot(urls, [callback])
.snapshot(url, [option])
.snapshot(urls, [option])
Parameters:
url {String} the URL to capture
urls {Array} an array of URLs to capture
callback {Function} callback function
option {Object} optional parameters
  id {String} a custom id for the URL; ignored if the first argument is urls
  image {String} a custom save path for the snapshot; ignored if the first argument is urls
  groupId {String} a groupId for a set of URLs, used to identify which group a URL belongs to when it is returned
  ignoreCache {Boolean} whether to ignore the cache
  callback {Function} callback function
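For instance, a minimal sketch of a call using the option form, based on the parameter list above (the id and image values here are made up for illustration):

var urlExtract = require('url-extract');

urlExtract.snapshot('http://www.baidu.com', {
  id: 'baidu-home',              // custom id for this URL (illustrative)
  image: './snapshot/baidu.png', // custom save path for the snapshot (illustrative)
  ignoreCache: true,             // do not use a cached result
  callback: function (job) {
    console.log(job.getData());
  }
});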
.extract
Captures URL information and takes a snapshot.
.extract(url, [callback])
.extract(urls, [callback])
.extract(url, [option])
.extract(urls, [option])
Parameters:
url {String} the URL to capture
urls {Array} an array of URLs to capture
callback {Function} callback function
option {Object} optional parameters
  id {String} a custom id for the URL; ignored if the first argument is urls
  image {String} a custom save path for the snapshot; ignored if the first argument is urls
  groupId {String} a groupId for a set of URLs, used to identify which group a URL belongs to when it is returned
  ignoreCache {Boolean} whether to ignore the cache
  callback {Function} callback function
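Similarly, a sketch of a batch call with a groupId (the URLs and the group name are just placeholders):

var urlExtract = require('url-extract');

urlExtract.extract([
  'http://www.baidu.com',
  'http://www.qq.com'
], {
  groupId: 'portals', // identifies which group these URLs belong to when they are returned
  callback: function (job) {
    // assumed to be called for each finished job
    console.log(job.groupId, job.getData());
  }
});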
Job (class)
Each URL corresponds to a Job object, which stores the information related to that URL.
Fields
url {String} the URL
content {Boolean} whether to capture the page's title and description
id {String} the job's id
groupId {String} the id of the group the job belongs to
cache {Boolean} whether caching is enabled for the job
callback {Function} callback function
image {String} the snapshot image path
status {Boolean} whether the job is currently normal
Prototype
getData() obtains the job-related data.
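A brief sketch of how a Job might be read inside a callback, based on the fields listed above (the exact shape of getData()'s return value may differ):

var urlExtract = require('url-extract');

urlExtract.snapshot('http://www.baidu.com', function (job) {
  // individual fields, as listed above
  console.log(job.id, job.groupId); // the job's id and the id of its group
  console.log(job.image);           // snapshot path relative to the working directory
  console.log(job.status);          // true on success, false on failure
  // or the aggregated form
  console.log(job.getData());
});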
Global Configuration
Global settings live in the config file in the url-extract root directory. The defaults are as follows:
module.exports = {
  wsPort: 3001,
  maxJob: 100,
  maxQueueJob: 400,
  cache: 'object',
  maxCache: 10000,
  workerNum: 0
};
wsPort {Number} the port used by the WebSocket service
maxJob {Number} the number of jobs each PhantomJS thread can run concurrently
maxQueueJob {Number} the maximum number of waiting jobs; 0 means no limit, and once the limit is exceeded any new job fails immediately (status = false)
cache {String} the cache implementation; currently only 'object' is implemented
maxCache {Number} the maximum number of cached links
workerNum {Number} the number of PhantomJS threads; 0 means the same as the number of CPUs
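For example, an annotated configuration (the specific numbers here are only illustrative) might be edited like this:

module.exports = {
  wsPort: 3001,      // port used by the WebSocket service
  maxJob: 50,        // concurrent jobs per PhantomJS thread
  maxQueueJob: 200,  // maximum number of waiting jobs (0 = no limit)
  cache: 'object',   // cache implementation (currently only 'object' exists)
  maxCache: 5000,    // maximum number of cached links
  workerNum: 2       // number of PhantomJS threads (0 = same as the number of CPUs)
};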
A simple service example
https://github.com/miniflycn/url-extract-server-example
Note: connect and url-extract must be installed:
$ npm install
If you downloaded the files from an online storage link instead, install connect separately:
$ npm install connect
Then type:
$ node bin/server
Then open http://localhost:3000 in a browser to see the result.
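As a rough sketch of what such a service might look like, here is a minimal connect-based server built on the API described above. This is an assumption for illustration, not the actual code of the example repository; the route and response format are made up:

var connect = require('connect');
var http = require('http');
var urlParse = require('url').parse;
var urlExtract = require('url-extract');

var app = connect();

// Take a snapshot of the URL passed as ?url=... and reply with the job data
app.use(function (req, res) {
  var target = urlParse(req.url, true).query.url;
  if (!target) {
    res.end('Please pass a ?url= parameter');
    return;
  }
  urlExtract.snapshot(target, function (job) {
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify(job.getData()));
  });
});

http.createServer(app).listen(3000);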