This article describes how to use url-extract, a Node.js module for capturing URL information and page snapshots, and provides example code for reference. In the original approach a separate PhantomJS process was spawned for every capture, so efficiency became a real concern as concurrency grew. We therefore rewrote the code and packaged it as a standalone module that is easy to call.
How did we improve it?
- Control the number of PhantomJS threads and the number of URLs handled by a single thread.
- Use standard output and WebSocket for communication.
- Add a cache mechanism (currently a plain JavaScript object).
- Provide simple external interfaces.
Design Diagram
Dependency & installation
Since PhantomJS supports WebSocket starting with version 1.9.0, first make sure the PhantomJS on your PATH is version 1.9.0 or later. On the command line, type:
$ phantomjs -v
If it prints a version such as 1.9.x, you can continue. If the version is too old or the command fails, download the latest version from the official PhantomJS website.
If you have Git installed or use Git Shell, type:
$ npm install url-extract
A simple example
For example, to take a snapshot of the Baidu homepage, the code is as follows:
module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('This is a snapshot example.');
    console.log(job);
    process.exit();
  });
})();
The printed output is as follows:
The image property is a path relative to the working directory. We can use the job's getData interface to get cleaner data. The code is as follows:
module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('This is a snapshot example.');
    console.log(job.getData());
    process.exit();
  });
})();
The printed output then becomes:
Here image is the snapshot path relative to the working directory, and status indicates whether the job completed normally: true means success, false means failure.
For more examples, see: https://github.com/miniflycn/url-extract/tree/master/examples
Main APIs
.snapshot
Takes a snapshot of a URL.
.snapshot(url, [callback])
.snapshot(urls, [callback])
.snapshot(url, [option])
.snapshot(urls, [option])
Parameters:
url {String} the URL to capture
urls {Array} an array of URLs to capture
callback {Function} callback function
option {Object} optional parameters
  id {String} a custom id for the URL; ignored if the first argument is urls
  image {String} a custom save path for the snapshot; ignored if the first argument is urls
  groupId {String} a groupId for a set of URLs, used to identify which group a URL belongs to when it is returned
  ignoreCache {Boolean} whether to ignore the cache
  callback {Function} callback function
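For instance, a minimal sketch of a call using the option form, based on the parameter list above (the id and image values here are made up for illustration):

var urlExtract = require('url-extract');

urlExtract.snapshot('http://www.baidu.com', {
  id: 'baidu-home',              // custom id for this URL (illustrative)
  image: './snapshot/baidu.png', // custom save path for the snapshot (illustrative)
  ignoreCache: true,             // do not use a cached result
  callback: function (job) {
    console.log(job.getData());
  }
});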
.extract
Captures URL information and takes a snapshot.
.extract(url, [callback])
.extract(urls, [callback])
.extract(url, [option])
.extract(urls, [option])
Parameters:
url {String} the URL to capture
urls {Array} an array of URLs to capture
callback {Function} callback function
option {Object} optional parameters
  id {String} a custom id for the URL; ignored if the first argument is urls
  image {String} a custom save path for the snapshot; ignored if the first argument is urls
  groupId {String} a groupId for a set of URLs, used to identify which group a URL belongs to when it is returned
  ignoreCache {Boolean} whether to ignore the cache
  callback {Function} callback function
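Similarly, a sketch of a batch call with a groupId (the URLs and the group name are just placeholders):

var urlExtract = require('url-extract');

urlExtract.extract([
  'http://www.baidu.com',
  'http://www.qq.com'
], {
  groupId: 'portals', // identifies which group these URLs belong to when they are returned
  callback: function (job) {
    // assumed to be called for each finished job
    console.log(job.groupId, job.getData());
  }
});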
Job (class)
Each URL corresponds to a Job object, which stores the information related to that URL.
Fields
url {String} the URL
content {Boolean} whether to capture the page's title and description
id {String} the job's id
groupId {String} the id of the group the job belongs to
cache {Boolean} whether caching is enabled for the job
callback {Function} callback function
image {String} the snapshot image path
status {Boolean} whether the job is currently normal
Prototype
getData() obtains the job-related data.
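A brief sketch of how a Job might be read inside a callback, based on the fields listed above (the exact shape of getData()'s return value may differ):

var urlExtract = require('url-extract');

urlExtract.snapshot('http://www.baidu.com', function (job) {
  // individual fields, as listed above
  console.log(job.id, job.groupId); // the job's id and the id of its group
  console.log(job.image);           // snapshot path relative to the working directory
  console.log(job.status);          // true on success, false on failure
  // or the aggregated form
  console.log(job.getData());
});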
Global Configuration
Global settings live in the config file in the url-extract root directory. The defaults are as follows:
module.exports = {
  wsPort: 3001,
  maxJob: 100,
  maxQueueJob: 400,
  cache: 'object',
  maxCache: 10000,
  workerNum: 0
};
wsPort {Number} the port used by the WebSocket service
maxJob {Number} the number of jobs each PhantomJS thread can run concurrently
maxQueueJob {Number} the maximum number of waiting jobs; 0 means no limit, and once the limit is exceeded any new job fails immediately (status = false)
cache {String} the cache implementation; currently only 'object' is implemented
maxCache {Number} the maximum number of cached links
workerNum {Number} the number of PhantomJS threads; 0 means the same as the number of CPUs
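For example, an annotated configuration (the specific numbers here are only illustrative) might be edited like this:

module.exports = {
  wsPort: 3001,      // port used by the WebSocket service
  maxJob: 50,        // concurrent jobs per PhantomJS thread
  maxQueueJob: 200,  // maximum number of waiting jobs (0 = no limit)
  cache: 'object',   // cache implementation (currently only 'object' exists)
  maxCache: 5000,    // maximum number of cached links
  workerNum: 2       // number of PhantomJS threads (0 = same as the number of CPUs)
};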
A simple service example
https://github.com/miniflycn/url-extract-server-example
Note: connect and url-extract must be installed:
$ npm install
If you downloaded the files from an online storage link instead, install connect separately:
$ npm install connect
Then type:
$ node bin/server
Then open http://localhost:3000 in a browser to see the result.
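As a rough sketch of what such a service might look like, here is a minimal connect-based server built on the API described above. This is an assumption for illustration, not the actual code of the example repository; the route and response format are made up:

var connect = require('connect');
var http = require('http');
var urlParse = require('url').parse;
var urlExtract = require('url-extract');

var app = connect();

// Take a snapshot of the URL passed as ?url=... and reply with the job data
app.use(function (req, res) {
  var target = urlParse(req.url, true).query.url;
  if (!target) {
    res.end('Please pass a ?url= parameter');
    return;
  }
  urlExtract.snapshot(target, function (job) {
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify(job.getData()));
  });
});

http.createServer(app).listen(3000);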