Technical Solution for Website Collectors

Last Update:2018-12-04 Source: Internet

Author: User

Document directory

1 Overview
1.1 Purpose
1.2 requirement Overview
1.3 System Requirements
1.3.1 diverse collection targets
1.3.2 diversified data formats
1.3.3 distributed massive data
1.3.4 horizontal and vertical data collection
1.3.5 simple and quick user operations
1.4 interaction target
1.4.1 collection target
1.4.2 release target
2 System Design
2.1 Operating System
2.2 System Structure
2.2.1 filter container
2.2.2 Cache
2.2.3 plug-in Manager
2.2.4 Input and Output
2.2.5 Filter
2.3 policies and ideas
2.3.1 System Strategy
2.3.2 collection policy
2.4 module structure
2.5 filter pipelines
2.6 rule File
2.7 unit Association
2.7.1 association between parent and child
2.7.2 association between siblings
2.8 Interface Design
2.8.1 Main Interface Structure
2.8.2 collection interface structure
2.8.3 rule definition interface structure
2.8.4 information publishing setting page
3. Function Description
3.1 structured collection
3.2 Visual metadata Definition
3.3 plugin support
3.4 client environment simulation
3.5 multi-thread collection
3.6 global release
3.7 page collection
3.8 download associated files
3.9 save rules
3.10 template Modification
3.11 filter and replace results
3.12 duplicate Filtering
4. Support information

Source: Visual mining website collector

1 Overview

1.1 Purpose

This article analyzes system requirements and describes the system structure and solutions.

This article is intended as a reference for technical personnel.

1.2 requirement Overview

Websites, enterprises, and marketing personnel all have information requirements. Information domains, users, and the methods used to obtain information vary greatly. The collection system must support diverse collection applications and accommodate future growth in demand.

1.3 system requirements

1.3.1 diversified collection targets

Information is distributed in various information storage systems. Various storage systems have their own interaction mechanisms. The collection system must provide multiple and scalable connection modules.

1.3.2 diversified data formats

Information exists in multiple forms, such as web pages, Word documents, and PDF files. Different data formats require different collection mechanisms.

1.3.3 distributed massive data

Network latency and limited bandwidth slow down the collection of massive, distributed data. Concurrent multi-threaded communication effectively reduces the overall delay and helps the collector secure network resources.

1.3.4 horizontal and vertical data collection

The system must automatically collect the next page of the data being collected, automatically collect the attachments associated with that data, and automatically collect further data based on the current collection results.

1.3.5 simple and quick user operations

Diverse and complex data formats make the user's job harder. Users want a what-you-see-is-what-you-get (WYSIWYG) experience, and the system should provide prompts in a timely manner.

1.4 interaction target

1.4.1 collection target

The collection targets are as follows:

Web System
File System
Database System
Other text data sources

1.4.2 release target

The release targets are as follows:

File System
Database System
Other text data storage systems or receiving devices

2 system design

2.1 Operating System

The basic components of the collection system include the input subsystem, memory mixer, and output subsystem. Data is extracted by multiple filters in multiple depths and stored in the cache. As follows:

2.2 System Structure

The output subsystem, input subsystem, and filters are integrated into the system as plug-ins. The filter container uses the plug-in manager to reference the plug-in modules and drive system execution.

2.2.1 filter container

The container creates a filter instance of the current type and passes it the current input/output handle and the global cache handle. The container also controls the number of filters that run concurrently. When the lifecycle of all filters ends, the container triggers the execution of the output subsystem.

The container generates a plug-in keyword from the rule file and the target address, and looks up the plug-in manager by that keyword to obtain the factory handles of the current filter plug-in and the current input/output plug-ins.

2.2.2 Cache

The filter sends the collected data to the cache area. The cache area maintains the sequence and context of the collected data. The output subsystem looks up a unit and its context units by unit identifier.

2.2.3 plug-in Manager

The collection system supports a wide range of plug-ins. The plug-in manager is responsible for loading and indexing plug-ins. There are three types of plug-ins: input plug-ins, output plug-ins, and filter plug-ins. Their functions are as follows:

The input plug-in supports reading from different external objects, such as HTTP servers, FTP servers, and file systems.
The filter (collection) plug-in supports collection of different data formats and special information, such as web page collection, Word document collection, and email address collection.
The output plug-in supports release to various systems, such as BBS systems and information systems.

The plug-in Manager uses keywords to index various plug-in factories.
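The keyword-to-factory lookup described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the `PluginManager` class, its `register` and `lookup` methods, the `HttpInput` class, and the keyword `"http"` are all hypothetical names chosen for the example.

```python
class PluginManager:
    """Indexes plug-in factories by (plug-in type, keyword). Hypothetical API."""

    def __init__(self):
        self._factories = {}

    def register(self, kind, keyword, factory):
        # kind is one of "input", "filter", "output" (an assumed convention)
        self._factories[(kind, keyword)] = factory

    def lookup(self, kind, keyword):
        try:
            return self._factories[(kind, keyword)]
        except KeyError:
            raise LookupError(f"no {kind} plug-in registered for {keyword!r}") from None


class HttpInput:
    """Stand-in input plug-in; a real one would open a connection."""

    def __init__(self, address):
        self.address = address


manager = PluginManager()
manager.register("input", "http", HttpInput)

# The container would derive the keyword from the rule file and target address.
factory = manager.lookup("input", "http")
plugin = factory("http://example.invalid/")
```

Registering factories rather than instances lets the container create a fresh plug-in object for each filter it starts.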

2.2.4 Input and Output

The collection system uses a unified input/output interface to exchange data with various external targets. The data exchange itself is implemented by specific modules, each of which acts as a bridge between the collection system and an external target. As with device drivers in Windows, different input/output mechanisms correspond to different input/output modules, and the I/O system manages and schedules these modules. The input/output modules comprise standard modules and extended modules. An extended input/output module inherits from a standard module and adds the session processing that a particular external target connection requires.

Standard input modules include the following types:

FTP protocol input module: supports FTP server access
HTTP input module: supports web server access
File protocol input module: supports file reading
JDBC input module: supports database access through the JDBC interface
ODBC input module: supports database access through the ODBC interface

Standard output modules include the following types:

File protocol output module: supports file writing
JDBC output module: supports database access through the JDBC interface
ODBC output module: supports database access through the ODBC interface
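The relationship between the unified interface, a standard module, and an extended module might look like the sketch below. The class names (`InputModule`, `FileInputModule`, `StrippedFileInputModule`) and the `read` method are assumptions made for illustration; the source does not specify the real interface.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class InputModule(ABC):
    """Unified input interface (hypothetical): every module reads from an address."""

    @abstractmethod
    def read(self, address: str) -> str:
        ...


class FileInputModule(InputModule):
    """Standard module: plain file reading."""

    def read(self, address: str) -> str:
        with open(address, encoding="utf-8") as handle:
            return handle.read()


class StrippedFileInputModule(FileInputModule):
    """Extended module: inherits the standard one and adds target-specific processing."""

    def read(self, address: str) -> str:
        return super().read(address).strip()


def read_from(module: InputModule, address: str) -> str:
    # Callers depend only on the unified interface, not on a concrete module.
    return module.read(address)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("  hello  \n")
    standard_text = read_from(FileInputModule(), path)
    extended_text = read_from(StrippedFileInputModule(), path)
```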

2.2.5 Filter

The filter handle is created by the container and executed concurrently. The output of one filter is the input to the next filter, and the results are stored in the cache for global reference by the output subsystem.

2.3 policies and assumptions

2.3.1 system policies

To adapt to different collection targets and collection mechanisms, the collection system uses a plug-in system and a container management system. You can install plug-in packages to support special applications. The collection system contains three types of extensible plug-ins: input plug-ins, collection plug-ins, and output plug-ins.

The three types of plug-ins work together, driven by the container. The container creates an entry filter based on the collection rule file and starts the filter in multiple threads. The filter asks the corresponding input module to read data from the collection address, stores the filter results in the cache, and then asks the container for its next filters. If the returned value is not empty, those filters are started in multiple threads. When the container receives such a request and there is no next filter, it calls the output module. The output module reads data from the cache area globally and publishes the collection result.
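The container-driven flow can be approximated with a thread pool. The sketch below is a simplification under stated assumptions: it runs filters level by level instead of letting each filter request its successors individually, and the `RULES` table, `read_input`, `run_filter`, `run_container`, and `publish` names are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rule table: filter name -> names of the filters that follow it.
RULES = {"list_page": ["title", "body"], "title": [], "body": []}


class Cache:
    """Thread-safe store for filter results."""

    def __init__(self):
        self._lock = threading.Lock()
        self.items = []

    def put(self, filter_name, value):
        with self._lock:
            self.items.append((filter_name, value))


def read_input(address):
    # Stand-in for an input module; a real one would fetch the address.
    return f"data from {address}"


def run_filter(name, address, cache):
    data = read_input(address)
    cache.put(name, f"{name}: {data}")
    return RULES[name]  # the container's answer to "what are my next filters?"


def publish(cache):
    # Stand-in for the output module: reads the whole cache at once.
    return sorted(value for _, value in cache.items)


def run_container(entry, address, max_workers=4):
    cache = Cache()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = [entry]
        while pending:
            # Run every filter of the current level concurrently.
            results = list(pool.map(lambda n: run_filter(n, address, cache), pending))
            pending = [nxt for following in results for nxt in following]
    # No next filter remains, so the output module is called.
    return publish(cache)


published = run_container("list_page", "http://example.invalid/list")
```

The `max_workers` argument plays the role of the container's limit on concurrent filters.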

2.3.2 collection policy

The collection system uses a different collection mechanism for each kind of collection target, performing semantic analysis on semi-structured data to capture data intelligently. For web page collection, the filter analyzes the HTML tags and then captures the specified data based on tag type and attributes. For Word documents, the filter analyzes the Word document format and Word objects to capture data. For special information such as email addresses and mobile phone numbers, the filter captures information in template mode.
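Two of these mechanisms can be illustrated with the Python standard library: capture by HTML tag and attribute, and template-style capture of email addresses. This is only a sketch. The `TagCapture` class, the sample markup, and the deliberately simple email pattern are assumptions for the example, not the system's actual rules.

```python
import re
from html.parser import HTMLParser


class TagCapture(HTMLParser):
    """Collects the text inside every <tag attr_name="attr_value"> element."""

    def __init__(self, tag, attr_name, attr_value):
        super().__init__()
        self.tag = tag
        self.attr = (attr_name, attr_value)
        self._depth = 0  # nesting depth of the target tag while capturing
        self.captured = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == self.tag:
                self._depth += 1
        elif tag == self.tag and self.attr in attrs:
            self._depth = 1
            self.captured.append("")

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.captured[-1] += data


# A simplified email template; real address syntax is more permissive.
EMAIL_TEMPLATE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

page = (
    '<div class="title">First</div>'
    '<div class="other">skip</div>'
    '<div class="title">Second <b>item</b></div>'
)
parser = TagCapture("div", "class", "title")
parser.feed(page)
parser.close()
titles = parser.captured

emails = EMAIL_TEMPLATE.findall("Contact a@example.com or b.c@example.org.")
```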

2.4 module structure

Modules are as follows:

Collection rule file, collection rule definition module, collection rule parsing module, collection rule management module

Collection container management module, collection container

Cache manager module

HTTP input plug-in, file system input/output plug-in, JDBC input/output plug-in, filter plug-in

Input plug-in management module, output plug-in management module, filter plug-in management module, plug-in manager

Filter Status Report Module

2.5 filter pipelines

A filter pipe is a tree-like data channel formed by the filter engine based on the input relationship of the filter. As follows:

The figure shows five pipelines: 1 -> 1-1 -> 1-1-1; 1 -> 1-1 -> 1-1-2; 1 -> 1-2 -> 1-2-1; 1 -> 1-2 -> 1-2-2; 1 -> 1-2 -> 1-2-3
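Each pipeline is a root-to-leaf path in the filter tree, so the five pipelines listed above can be enumerated mechanically. In this sketch the `TREE` dictionary is a hand-written encoding of the figure's filter labels, and the `pipelines` function is a hypothetical helper.

```python
# Filter tree from the figure: filter label -> labels of the filters it feeds.
TREE = {
    "1": ["1-1", "1-2"],
    "1-1": ["1-1-1", "1-1-2"],
    "1-2": ["1-2-1", "1-2-2", "1-2-3"],
}


def pipelines(tree, node, prefix=()):
    """Return every root-to-leaf path below node as a tuple of filter labels."""
    path = prefix + (node,)
    children = tree.get(node, [])
    if not children:
        return [path]
    found = []
    for child in children:
        found.extend(pipelines(tree, child, path))
    return found


paths = pipelines(TREE, "1")
for path in paths:
    print(" -> ".join(path))
```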

2.6 rule File

Collection rules are defined according to an agreed syntax. The description language can be XML, and the rule logic can be written with regular expressions, a custom scripting language, or a combination of the two.

The collection system checks the validity of the collection logic: it checks whether inputs and outputs form a loop, and whether any output result is left unpublished.

Cases:

1) Input and output are connected to each other in a loop.

2) An output result is not released.
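Both checks can be expressed as simple graph operations once the rule file has been parsed into a mapping from each filter to the filters that consume its output. The sketch below assumes that representation, and it also assumes that "unpublished" means a final (leaf) result that no publishing rule references. The function names are hypothetical.

```python
def has_loop(edges):
    """Detect an input/output loop with a depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in edges.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:  # reached a filter that is still being explored
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(node, WHITE) == WHITE and visit(node) for node in edges)


def unpublished_outputs(edges, published):
    """Return final results that no publishing rule references."""
    nodes = set(edges) | {nxt for following in edges.values() for nxt in following}
    leaves = {node for node in nodes if not edges.get(node)}
    return leaves - set(published)


loop_found = has_loop({"a": ["b"], "b": ["a"]})
loop_free = has_loop({"a": ["b"], "b": []})
missing = unpublished_outputs({"a": ["b", "c"]}, published={"b"})
```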

2.7 unit Association

The system has the following forms of value association during the release process:

2.7.1 association between parent and child

Filter A has multiple matching results A1, A2, and A3. These values serve as the input source for a second match by filter B, which generates the results (A1B1, A1B2), (A2B1, A2B2), and (A3B1, A3B2). The system must preserve the associations [A1, (A1B1, A1B2)], [A2, (A2B1, A2B2)], and [A3, (A3B1, A3B2)] during release. Figure:

2.7.2 association between siblings

Data A, B, and C are filtered from the data source. The subscript of a letter distinguishes the multiple values matched by a filter. B and C can be associated only in the form [b1-1, c1-1], [b1-2, c1-2], [b2-1, c2-1], [b2-2, c2-2]. We call this sequential association. Figure:

The collection memory must be able to represent the above data relationships in its storage data structure, and must support indexing and upward (parent) lookup by filter subscript.
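One way to meet this requirement is to key every cached value by its filter name and an index tuple, so that a child's index extends its parent's index. The sketch below is an assumption about a possible structure, not the system's actual cache; `UnitCache` and its methods are hypothetical.

```python
from collections import defaultdict


class UnitCache:
    """Stores values per filter, keyed by index tuples such as (1,) or (1, 2)."""

    def __init__(self):
        self._units = defaultdict(dict)  # filter name -> {index tuple: value}

    def put(self, filter_name, index, value):
        self._units[filter_name][tuple(index)] = value

    def parent(self, parent_filter, child_index):
        # Upward lookup: drop the last subscript to find the parent value.
        return self._units[parent_filter][tuple(child_index)[:-1]]

    def children(self, child_filter, parent_index):
        prefix = tuple(parent_index)
        units = self._units[child_filter]
        return [units[idx] for idx in sorted(units) if idx[:-1] == prefix]

    def sequential(self, left_filter, right_filter):
        # Sibling association: pair values that share the same index.
        left, right = self._units[left_filter], self._units[right_filter]
        return [(left[idx], right[idx]) for idx in sorted(left) if idx in right]


cache = UnitCache()
cache.put("A", (1,), "A1")
cache.put("A", (2,), "A2")
cache.put("B", (1, 1), "A1B1")
cache.put("B", (1, 2), "A1B2")
cache.put("B", (2, 1), "A2B1")
cache.put("C", (1, 1), "A1C1")
cache.put("C", (1, 2), "A1C2")

a1_children = cache.children("B", (1,))
a2b1_parent = cache.parent("A", (2, 1))
pairs = cache.sequential("B", "C")
```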

2.8 UI design

2.8.1 Main Interface Structure

Module list
Public information
Function list	Work Zone
	Help information prompt
Status Bar

The system uses a two-level navigation bar to guide user operations and displays help prompts in real time, so that users can resolve problems and complete their work easily and smoothly.

2.8.2 collection interface structure

Rule file window

Collection Unit status window area

Filter status window area

You can select a previous rule file in the Rule file window for collection. The collection unit status window area reports the collected unit data to the user in a timely manner. The filter status window area reports the status information of the filters that have been completed or are being executed to the user, so that the user can understand the execution status of the system at all times.

The collection interface is displayed in the workspace on the main interface.

2.8.3 rule definition interface structure

Rule Properties window

Collection object display area

Unit definition Area

Source code area of the collection object

Function button Area

Rule Properties window: set basic attributes of a rule
Collection object display area: displays collection objects visually. You can select collection objects directly on the visualization objects.
Unit definition area: Specify the data to be collected on the collection object
Source code area of the collection object: displays the source code of the collection object

2.8.4 information publishing setting page

Release target attribute setting area
Target large unit list Area	Target small unit list Area	Collection Unit list Area
		Variable setting area
	Associate button Area
	Unit Association display area
OPERATION button Area

The user connects to the publishing target. The system displays the unit list hierarchically in the target unit list area, and the user specifies the correspondence between target units and collection units. The unit association display area lists the units that are currently associated.

3 function description

3.1 structured collection

The system performs semantic analysis on semi-structured data and intelligently extracts data based on semantic rules.

3.2 Visual metadata Definition

You can specify the content to be collected directly on the visual target interface.

3.3 plugin support

The system has a wide range of plug-ins that support collection from various targets and release to various systems, for example FTP collection, HTTP collection, database publishing, and file publishing.

3.4 client environment simulation

The system simulates the client environment and supports the basic session functions between client and server, such as the browser's session and cookie mechanisms. User logon is supported.
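With the Python standard library, a browser-like session that keeps cookies between requests can be sketched as below. The `build_session` and `login` helpers, the User-Agent string, and any login URL or form field names passed to `login` are hypothetical; a real site's logon form will differ.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session(user_agent="Mozilla/5.0 (compatible; collector-sketch)"):
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def login(opener, login_url, fields):
    """POST the logon form; cookies set by the server land in the opener's jar."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(login_url, data=data, timeout=10) as response:
        return response.status


opener, jar = build_session()
# Example call (not executed here; the URL and field names are placeholders):
# login(opener, "http://example.invalid/login", {"user": "name", "password": "secret"})
```

Later requests made through the same `opener` carry the session cookies automatically, which is what allows collection behind a logon.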

3.5 multi-thread collection

The system supports multi-task concurrency and multi-thread collection, with control over the number of concurrent threads and monitoring of their status.

3.6 global release

The system provides a global cache area associated with the context. The publishing module can combine unit data of different levels. You can check and edit the unit data in the cache.

3.7 page collection

The system automatically collects the next page of the content based on page number rules.
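A page number rule can be as simple as a URL template with a page placeholder. The sketch below assumes that form; the `page_urls` helper and the example URL are hypothetical.

```python
def page_urls(template, first, last, step=1):
    """Expand a URL template containing {page} into the list of page URLs."""
    return [template.format(page=number) for number in range(first, last + 1, step)]


urls = page_urls("http://example.invalid/news?page={page}", first=1, last=3)
```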

3.8 download associated files

The system can automatically download other files contained in the page, such as Flash files and images, according to settings.

3.9 save rules

Information such as collection objects, filtering rules, and publishing targets is stored in rule files. You can import and export rule files to share or exchange them with others. The system provides a friendly wizard page for configuring the rule file.

3.10 template Modification

You can publish data according to the predefined template structure.

3.11 filter and replace results

The system automatically filters format and markup syntax, such as HTML markup and Word formatting, out of the collected data. Constant replacement and environment variable replacement are supported.
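A minimal sketch of both steps, using only the standard library: markup is stripped with `html.parser`, constants are replaced literally, and `$NAME` placeholders are filled from environment variables with `string.Template`. The helper names and the `$NAME` placeholder syntax are assumptions for illustration.

```python
import os
from html.parser import HTMLParser
from string import Template


class _TextOnly(HTMLParser):
    """Keeps text nodes and discards all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def replace_values(text, constants, env=None):
    """Apply literal constant replacements, then fill $NAME from the environment."""
    env = os.environ if env is None else env
    for old, new in constants.items():
        text = text.replace(old, new)
    # safe_substitute leaves unknown placeholders untouched instead of failing.
    return Template(text).safe_substitute(env)


plain = strip_html("<p>Hello <b>world</b></p>")
replaced = replace_values(
    "Price: USD, by $AUTHOR", {"USD": "US dollars"}, env={"AUTHOR": "editor"}
)
```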

3.12 duplicate Filtering

Automatically delete duplicate data from the collection results.
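Duplicate filtering can be sketched as an order-preserving pass that remembers a normalized key for each record. Treating records that differ only in letter case or whitespace as duplicates is an assumption of this example, as is the `drop_duplicates` name.

```python
def drop_duplicates(records):
    """Keep the first occurrence of each record, comparing normalized text."""
    seen = set()
    unique = []
    for record in records:
        key = " ".join(record.split()).lower()  # collapse whitespace, ignore case
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


deduplicated = drop_duplicates(["A  b", "a b", "c"])
```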

4. Support information

Resources	Description
Http://www.caijiqi.net/	The project official website publishes project documents and provides system downloads.
QQ: 107175884
Mail: hotheartboy@gmail.com

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
