NodeJS URL capture module url-extract usage example

This article describes how to use url-extract, a NodeJS module for capturing URL information, and provides example code for reference. Because the earlier approach launched a separate PhantomJS process for each capture, efficiency suffered as concurrency increased, so the code was rewritten and packaged as an independent module that is convenient to call.

How was it improved?

  • Control the number of PhantomJS threads and the number of URLs handled by each thread.
  • Use standard output & WebSocket for communication.
  • Add a cache mechanism; currently a plain JavaScript object is used.
  • Provide simple external interfaces.

Design Diagram

Dependency & installation

PhantomJS supports WebSocket starting from version 1.9.0, so first make sure the PhantomJS on your PATH is version 1.9.0 or later. On the command line, type:

$ phantomjs -v

If it prints a version such as 1.9.x, you can continue. If the version is too low or an error occurs, download the latest version from the official PhantomJS website.

If you have Git installed or use Git Shell, type:
$ npm install url-extract


A simple example

For example, to capture a snapshot of the Baidu homepage:

The code is as follows:

module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('this is a snapshot example.');
    console.log(job);
    process.exit();
  });
})();

The printed output looks like this:

The image attribute is a path relative to the working directory. We can use the Job's getData interface to obtain cleaner data, for example:

The code is as follows:

module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('this is a snapshot example.');
    console.log(job.getData());
    process.exit();
  });
})();

The output then becomes:

image is the snapshot path relative to the working directory; status indicates whether the capture succeeded: true means success, false means failure.

For more examples, see: https://github.com/miniflycn/url-extract/tree/master/examples

Main APIs

.snapshot

Takes a snapshot of a URL.

.snapshot(url, [callback])
.snapshot(urls, [callback])
.snapshot(url, [option])
.snapshot(urls, [option])

Parameters:

url {String} the URL to capture
urls {Array} an array of URLs to capture
callback {Function} callback function
option {Object} optional parameters
  id {String} custom URL id; ignored if the first parameter is urls
  image {String} custom save path for the snapshot; ignored if the first parameter is urls
  groupId {String} defines the groupId of a set of URLs, used to identify which group a URL belongs to when it is returned
  ignoreCache {Boolean} whether to ignore the cache
  callback {Function} callback function
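As a rough sketch of how an option object might be passed (the id, image, and URL values below are illustrative, and the shape simply follows the parameter list above):

module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  // Sketch only: pass an option object instead of a bare callback.
  // id and image are illustrative values, not required settings.
  urlExtract.snapshot('http://www.baidu.com', {
    id: 'baidu-home',                // custom id for this URL
    image: './snapshot/baidu.png',   // custom save path for the snapshot
    ignoreCache: true,               // skip the cache for this capture
    callback: function (job) {
      console.log(job.getData());
      process.exit();
    }
  });
})();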

.extract

Captures URL information (title and description) and takes a snapshot.

.extract(url, [callback])
.extract(urls, [callback])
.extract(url, [option])
.extract(urls, [option])

url {String} the URL to capture
urls {Array} an array of URLs to capture
callback {Function} callback function
option {Object} optional parameters
  id {String} custom URL id; ignored if the first parameter is urls
  image {String} custom save path for the snapshot; ignored if the first parameter is urls
  groupId {String} defines the groupId of a set of URLs, used to identify which group a URL belongs to when it is returned
  ignoreCache {Boolean} whether to ignore the cache
  callback {Function} callback function
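A minimal sketch of calling .extract with an array of URLs and a groupId, again assuming the parameter shape listed above (the URLs and the groupId value are illustrative):

module.exports = (function () {
  "use strict";
  var urlExtract = require('url-extract');
  // Sketch only: capture information for a group of URLs; the groupId is
  // carried back with each Job so the callback can tell which group it
  // belongs to.
  urlExtract.extract(['http://www.baidu.com', 'http://www.qq.com'], {
    groupId: 'portal-pages',
    callback: function (job) {
      console.log(job.getData());
    }
  });
})();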

Job (class)

Each url corresponds to a job object, and url-related information is stored by the job object.

Fields

url {String} the URL
content {Boolean} whether to capture the page's title and description information
id {String} the id of the Job
groupId {String} the id of the group of Jobs this Job belongs to
cache {Boolean} whether caching is enabled
callback {Function} callback function
image {String} the snapshot image path
status {Boolean} whether the Job is currently in a normal state

Prototype

getData() obtains the Job's data.
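For example, a callback might branch on the status field described above. This is only a sketch, assuming getData() returns the url, image, and status fields shown earlier:

function handleJob(job) {
  "use strict";
  var data = job.getData();
  if (data.status) {
    // Capture succeeded: data.image is the snapshot path relative to the
    // working directory.
    console.log('Snapshot for ' + data.url + ' saved to ' + data.image);
  } else {
    // Capture failed, for example because the page errored or the job
    // exceeded the queue limit.
    console.log('Capture failed for ' + data.url);
  }
}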

Global Configuration

The config file in the url-extract root directory provides global configuration. The defaults are as follows:

module.exports = {
  wsPort: 3001,
  maxJob: 100,
  maxQueueJob: 400,
  cache: 'object',
  maxCache: 10000,
  workerNum: 0
};
wsPort {Number} the port used by the WebSocket
maxJob {Number} the number of jobs each PhantomJS thread can run concurrently
maxQueueJob {Number} the maximum number of queued jobs; 0 means no limit; jobs beyond this limit are returned immediately as failed (status = false)
cache {String} the cache implementation; currently only 'object' is implemented
maxCache {Number} the maximum number of cached links
workerNum {Number} the number of PhantomJS threads; 0 means the same as the number of CPUs

A simple service example

https://github.com/miniflycn/url-extract-server-example

Note: connect and url-extract must be installed:

$ npm install

If you downloaded the packaged files from online storage instead, install connect:

$ npm install connect

Then type:

$ node bin/server

Open:

http://localhost:3000

View the effect.
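As a rough sketch of what such a service can look like (this is not the code of the linked repository; the route handling and the query parameter name are illustrative):

var connect = require('connect');
var http = require('http');
var urlParse = require('url').parse;
var urlExtract = require('url-extract');

var app = connect();

// Sketch only: take a target address from the "url" query parameter,
// capture it with url-extract, and return the Job data as JSON.
app.use(function (req, res) {
  var target = urlParse(req.url, true).query.url;
  if (!target) {
    res.statusCode = 400;
    return res.end('Missing url parameter');
  }
  urlExtract.snapshot(target, function (job) {
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify(job.getData()));
  });
});

http.createServer(app).listen(3000);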

