An example of using url-extract, a Node.js URL snapshot module


Last time we introduced how to take screenshots with Node.js + PhantomJS, but because every screenshot operation launched a new PhantomJS process, efficiency suffered badly once concurrency went up. So we rewrote all the code and split it out into a standalone module that is convenient to call.
How did we improve it? Control the number of worker processes and the number of URLs each worker handles. Use standard output and WebSocket for communication. Add a caching mechanism, currently implemented with a plain JavaScript object. Expose a simple external interface.
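The object cache mentioned above can be sketched as a plain JavaScript object with a size cap. This is only an illustration of the idea, not url-extract's actual implementation; the names ObjectCache and maxEntries are invented here:

```javascript
'use strict';

// Minimal sketch of a capped plain-object cache (hypothetical; not the
// actual url-extract code). Oldest entries are evicted first (FIFO).
function ObjectCache(maxEntries) {
  this.maxEntries = maxEntries;
  this.keys = [];   // insertion order, used for eviction
  this.store = {};  // url -> image path
}

ObjectCache.prototype.set = function (url, imagePath) {
  if (!(url in this.store)) {
    if (this.keys.length >= this.maxEntries) {
      var oldest = this.keys.shift(); // evict the oldest link
      delete this.store[oldest];
    }
    this.keys.push(url);
  }
  this.store[url] = imagePath;
};

ObjectCache.prototype.get = function (url) {
  return this.store[url]; // undefined on a cache miss
};

var cache = new ObjectCache(2);
cache.set('http://a.example', 'a.png');
cache.set('http://b.example', 'b.png');
cache.set('http://c.example', 'c.png'); // evicts http://a.example
console.log(cache.get('http://a.example'), cache.get('http://c.example'));
```

A real implementation would also need to invalidate entries over time, which is why the module exposes an ignoreCache option per request.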

Design Diagram

Dependency & Installation

PhantomJS supports WebSocket from version 1.9.0 onwards, so first make sure the phantomjs on your PATH is at least 1.9.0. At the command line, type:

$ phantomjs -v

If it prints a version number of 1.9.x, you can continue. If the version is too low, or an error occurs, please download the latest version from the PhantomJS website.

If you have Git installed, or you are using a Git shell, type the following at the command line to install it:

$ npm install url-extract

A simple example

For example, to capture a snapshot of the Baidu home page, we can write:

The code is as follows:

module.exports = (function () {
  'use strict';
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('This is a snapshot example.');
    console.log(job);
    process.exit();
  });
})();

The printed Job object looks like this (shown as a screenshot in the original article):

The image property is the path of the screenshot, relative to the working directory. We can use the job's getData interface to get cleaner data, for example:

The code is as follows:

module.exports = (function () {
  'use strict';
  var urlExtract = require('url-extract');
  urlExtract.snapshot('http://www.baidu.com', function (job) {
    console.log('This is a snapshot example.');
    console.log(job.getData());
    process.exit();
  });
})();

The printed output then becomes (again a screenshot in the original article):

image is the path of the screenshot relative to the working directory; status indicates whether the job succeeded: true means success, false means the screenshot failed.
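Since the original screenshots of the output are not reproduced here, a hypothetical shape for the object returned by getData, inferred from the fields just described, might look like this (the path value is a placeholder; actual output may differ between versions):

```javascript
'use strict';

// Hypothetical getData() result, inferred from the fields described above.
var data = {
  url: 'http://www.baidu.com',   // the captured address
  image: 'snapshot/example.png', // screenshot path, relative to the working directory
  status: true                   // true = success, false = the screenshot failed
};

// Typical consumer code branches on status:
if (data.status) {
  console.log('saved to ' + data.image);
} else {
  console.log('snapshot failed for ' + data.url);
}
```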

For more examples, see: https://github.com/miniflycn/url-extract/tree/master/examples

Main API

.snapshot

Takes snapshots of URLs.

.snapshot(url, [callback])
.snapshot(urls, [callback])
.snapshot(url, [option])
.snapshot(urls, [option])
url {String} the address to capture
urls {Array} an array of addresses to capture
callback {Function} callback function
option {Object} optional parameters
┝ id {String} a custom id for the url; ignored if the first argument is urls
┝ image {String} a custom save path for the screenshot; ignored if the first argument is urls
┝ groupId {String} a groupId for a set of urls, used to identify which group a returned url belongs to
┝ ignoreCache {Boolean} ignore the cache
┗ callback {Function} callback function
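As a sketch, an option object for .snapshot could be assembled like this. The field names follow the parameter list above, but the values are invented for illustration, and url-extract itself is not loaded so the sketch stays self-contained:

```javascript
'use strict';

// Hypothetical option object for .snapshot (values are examples only).
var option = {
  id: 'baidu-home',             // custom id for the url (ignored for an array of urls)
  image: 'snapshots/baidu.png', // custom save path (ignored for an array of urls)
  groupId: 'homepage-batch',    // identifies which group a returned job belongs to
  ignoreCache: true,            // skip the cache for this request
  callback: function (job) {    // called when the snapshot finishes
    console.log(job.getData());
  }
};

// Usage would look like:
//   require('url-extract').snapshot('http://www.baidu.com', option);
console.log(Object.keys(option).join(','));
```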

.extract

Crawls a URL's information (title and description) and takes a snapshot.

.extract(url, [callback])
.extract(urls, [callback])
.extract(url, [option])
.extract(urls, [option])

url {String} the address to capture

urls {Array} an array of addresses to capture

callback {Function} callback function

option {Object} optional parameters

┝ id {String} a custom id for the url; ignored if the first argument is urls

┝ image {String} a custom save path for the screenshot; ignored if the first argument is urls

┝ groupId {String} a groupId for a set of urls, used to identify which group a returned url belongs to

┝ ignoreCache {Boolean} ignore the cache

┗ callback {Function} callback function
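A batch call to .extract with an array of URLs and a groupId could look like the sketch below. The URLs and groupId value are illustrative, and url-extract is not loaded so the sketch runs on its own:

```javascript
'use strict';

// Hypothetical batch call shape for .extract(urls, option).
var urls = [
  'http://www.baidu.com',
  'http://www.qq.com'
];

var option = {
  groupId: 'portal-batch', // lets the callback tell which batch a job came from
  callback: function (job) {
    var data = job.getData();
    if (data.status) {
      console.log(data.url + ' -> ' + data.image);
    }
  }
};

// Usage would look like:
//   require('url-extract').extract(urls, option);
console.log(urls.length);
```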

Job (Class)

Each URL corresponds to a Job object, and the information about the URL is stored in its Job object.

Fields

url {String} the link address
content {Boolean} whether to crawl the page's title and description
id {String} the job's id
groupId {String} the id of the group of jobs this job belongs to
cache {Boolean} whether caching is enabled
callback {Function} callback function
image {String} the screenshot path
status {Boolean} whether the job is currently normal

Prototype

getData() gets the job's data

Global Configuration

The config file in the url-extract root directory holds the global configuration; the defaults are as follows:

module.exports = {
  wsPort: 3001,
  maxJob: 100,
  maxQueueJob: 400,
  cache: 'object',
  maxCache: 10000,
  workerNum: 0
};
wsPort {Number} the port the WebSocket service listens on
maxJob {Number} how many jobs each PhantomJS process can run concurrently
maxQueueJob {Number} the maximum number of queued jobs; 0 means no limit; beyond this number, any new job immediately returns failure (that is, status = false)
cache {String} the cache implementation; currently only the plain-object implementation exists
maxCache {Number} the maximum number of cached links
workerNum {Number} the number of PhantomJS processes; 0 means the same as the number of CPUs

A simple example of a service

Https://github.com/miniflycn/url-extract-server-example

Note that you need to install connect and url-extract:

$ npm install

If you downloaded the file from the network disk instead, you only need to install connect:

$ npm install connect

And then type:

$ node bin/server

Then open:

http://localhost:3000

to see the result.

