Document directory
- 1 Overview
- 1.1 Purpose
- 1.2 requirement Overview
- 1.3 System Requirements
- 1.3.1 diverse collection targets
- 1.3.2 diversified data formats
- 1.3.3 distributed massive data
- 1.3.4 horizontal and vertical data collection
- 1.3.5 simple and quick user operations
- 1.4 interaction target
- 1.4.1 collection target
- 1.4.2 release target
- 2 System Design
- 2.1 Operating System
- 2.2 System Structure
- 2.2.1 filter container
- 2.2.2 Cache
- 2.2.3 plug-in Manager
- 2.2.4 Input and Output
- 2.2.5 Filter
- 2.3 policies and ideas
- 2.3.1 System Strategy
- 2.3.2 collection policy
- 2.4 module structure
- 2.5 filter pipeline
- 2.6 rule File
- 2.7 unit Association
- 2.7.1 association between parent and child
- 2.7.2 association between siblings
- 2.8 Interface Design
- 2.8.1 Main Interface Structure
- 2.8.2 collection interface structure
- 2.8.3 rule definition interface structure
- 2.8.4 information publishing setting page
- 3. Function Description
- 3.1 structured collection
- 3.2 Visual metadata Definition
- 3.3 plugin support
- 3.4 client environment simulation
- 3.5 multi-thread collection
- 3.6 global release
- 3.7 page collection
- 3.8 download associated files
- 3.9 save rules
- 3.10 template Modification
- 3.11 filter and replace results
- 3.12 duplicate Filtering
- 4. supported information
Source: Visual mining website collector
1 Overview
1.1 Purpose
This article analyzes the system requirements and describes the system structure and the proposed solution.
It is intended as a reference for technical personnel.
1.2 requirement Overview
Websites, enterprises, and marketing personnel all need information, but the information fields, the users, and the methods of obtaining information differ greatly. The collection system must support these diverse collection applications and accommodate future growth in demand.
1.3 system requirements
1.3.1 diversified collection targets
Information is distributed across many kinds of storage systems, each with its own interaction mechanism. The collection system must therefore provide multiple, extensible connection modules.
1.3.2 diversified data formats
Information exists in multiple forms, such as web pages, Word documents, and PDF files. Different data formats require different collection mechanisms.
1.3.3 distributed massive data
Network latency and limited bandwidth slow down the collection of large, distributed data sets. Concurrent multi-threaded communication effectively reduces the impact of this latency and makes fuller use of the available resources.
1.3.4 horizontal and vertical data collection
The system must automatically collect the next page of the target data, automatically collect the attachments associated with that data, and automatically collect further data based on the current collection results.
1.3.5 simple and quick user operations
Diverse and complex data formats make the user's work harder. The system should therefore offer what-you-see-is-what-you-get operation and provide prompts in a timely manner.
1.4 interaction target
1.4.1 collection target
The collection targets are as follows:
- Web System
- File System
- Database System
- Other text data sources
1.4.2 release target
The release targets are as follows:
- File System
- Database System
- Other text data storage systems or receiving devices
2 system design
2.1 Operating System
The collection system has three basic components: the input subsystem, the memory mixer, and the output subsystem. Data is extracted by multiple filters at multiple depths and stored in the cache. (Figure omitted.)
2.2 System Structure
The output subsystem, the input subsystem, and the filters are integrated into the system as plug-ins. The filter container obtains references to the plug-in modules through the plug-in manager and uses them to drive system execution.
2.2.1 filter container
The container creates a filter instance of the current type and passes it the current input/output handle and the global cache handle. The container also controls how many filters run concurrently. When the lifecycle of all filters has ended, the container triggers the output subsystem.
The container generates a plug-in keyword from the rule file and the target address, then queries the plug-in manager with that keyword to obtain the factory handles of the current filter plug-in and the current input/output plug-in.
2.2.2 Cache
Each filter sends its collected data to the cache. The cache maintains the collection order and the context of the data. The output subsystem looks up each unit, and the units in its context, by unit identifier.
2.2.3 plug-in Manager
The collection system supports a wide range of plug-ins. The plug-in manager is responsible for loading and indexing them. There are three types of plug-ins: input plug-ins, collection (filter) plug-ins, and output plug-ins. Their functions are as follows:
- Input plug-ins support reading from different external objects, such as HTTP servers, FTP servers, and file systems.
- Collection plug-ins support collecting different data formats and special information, such as web page collection, Word document collection, and email address collection.
- Output plug-ins support publishing to various systems, such as BBS systems and information systems.
The plug-in manager uses keywords to index the plug-in factories.
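The keyword-indexed factory lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the collector's actual API: the class names, the `register` and `lookup` methods, and the keyword format (`"input:http"`) are all hypothetical.

```python
from typing import Callable, Dict


class PluginManager:
    """Indexes plug-in factories by keyword (hypothetical interface)."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}

    def register(self, keyword: str, factory: Callable[[], object]) -> None:
        # A factory is any callable that returns a new plug-in instance.
        self._factories[keyword] = factory

    def lookup(self, keyword: str) -> Callable[[], object]:
        try:
            return self._factories[keyword]
        except KeyError:
            raise LookupError(f"no plug-in registered for {keyword!r}") from None


class HttpInput:
    """Placeholder input plug-in (assumed name)."""
    kind = "input"


class WebPageFilter:
    """Placeholder filter plug-in (assumed name)."""
    kind = "filter"


manager = PluginManager()
manager.register("input:http", HttpInput)
manager.register("filter:html", WebPageFilter)

# The container would build the keyword from the rule file and target address.
plugin = manager.lookup("input:http")()
```

Storing factories rather than instances lets the container create a fresh plug-in object for every filter it starts.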
2.2.4 Input and Output
The collection system uses a unified input/output interface to exchange data with external targets. The actual exchange is carried out by specific modules, each of which acts as a bridge between the collection system and one kind of external target. As with device drivers in Windows, each input/output mechanism has its own input/output module, and the I/O system manages and schedules these modules. The modules fall into two groups: standard input/output modules and extended input/output modules. An extended module inherits from a standard module and adds the session handling required to connect to a particular external target.
Standard input modules include the following types:
- FTP protocol input module: supports FTP server access
- HTTP input module: supports Web server access
- File protocol input module: supports file reading
- JDBC input module: supports database access through a JDBC interface
- ODBC input module: supports database access through an ODBC interface
Standard output modules include the following types:
- File protocol output module: supports file writing
- JDBC output module: supports database access through a JDBC interface
- ODBC output module: supports database access through an ODBC interface
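A unified input/output interface of this kind might look like the sketch below. The abstract classes `InputModule` and `OutputModule`, their `read` and `write` methods, and the file-based implementations are assumptions made for illustration; the source does not specify the real interface.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class InputModule(ABC):
    """Unified input interface (hypothetical): read data from an address."""

    @abstractmethod
    def read(self, address: str) -> str: ...


class OutputModule(ABC):
    """Unified output interface (hypothetical): write data to an address."""

    @abstractmethod
    def write(self, address: str, data: str) -> None: ...


class FileInput(InputModule):
    """Standard file input module: the address is a local path."""

    def read(self, address: str) -> str:
        return Path(address).read_text(encoding="utf-8")


class FileOutput(OutputModule):
    """Standard file output module: the address is a local path."""

    def write(self, address: str, data: str) -> None:
        Path(address).write_text(data, encoding="utf-8")


# Round trip through the two standard file modules.
with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "unit.txt")
    FileOutput().write(target, "collected text")
    round_trip = FileInput().read(target)
```

An extended module would subclass one of these standard modules and override `read` or `write` to add its own session handling.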
2.2.5 Filter
Filter handles are created by the container and executed concurrently. The output of one filter becomes the input of the next filter, and the result is also stored in the cache so that the output subsystem can reference it globally.
2.3 policies and assumptions
2.3.1 system policies
To adapt to different collection targets and collection mechanisms, the collection system is built on a plug-in system and a container management system. You can install plug-in packages to support special applications. The collection system contains three types of extensible plug-ins: input plug-ins, collection plug-ins, and output plug-ins.
The three types of plug-ins work together, driven by the container. The container creates an entry filter from the collection rule file and starts it in multiple threads. Each filter asks the corresponding input module to read data from the collection address, stores its results in the cache, and then asks the container for its next filters. If the container returns any, they are started in multiple threads. If there is no next filter, the container calls the output module, which reads data globally from the cache and publishes the collection result.
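The container-driven flow described above can be sketched roughly as follows. This is a simplified illustration, not the collector's implementation: `FilterRule`, `Container`, and their members are hypothetical names, filters are processed level by level rather than requesting their successors individually, and each result of a filter is assumed to serve as the collection address for the next filter.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from threading import Lock
from typing import Callable, Dict, List


@dataclass
class FilterRule:
    """One filter taken from the rule file (hypothetical structure)."""
    name: str
    extract: Callable[[str], List[str]]  # raw data -> extracted units
    next_filters: List["FilterRule"] = field(default_factory=list)


class Container:
    """Starts filters in threads, caches results, then triggers output."""

    def __init__(self, read, publish, max_threads: int = 4) -> None:
        self._read = read          # stands in for the input module
        self._publish = publish    # stands in for the output module
        self._max_threads = max_threads
        self._lock = Lock()
        self.cache: Dict[str, List[str]] = {}

    def _run_filter(self, job):
        rule, address = job
        results = rule.extract(self._read(address))
        with self._lock:  # the cache is shared by all filter threads
            self.cache.setdefault(rule.name, []).extend(results)
        # Assumption: each result is the address for the next filters.
        return [(nxt, result) for nxt in rule.next_filters for result in results]

    def run(self, entry: FilterRule, address: str) -> None:
        pending = [(entry, address)]
        with ThreadPoolExecutor(max_workers=self._max_threads) as pool:
            while pending:
                follow_ups = list(pool.map(self._run_filter, pending))
                pending = [job for batch in follow_ups for job in batch]
        # No next filter remains, so the output module publishes the cache.
        self._publish(self.cache)


# In-memory stand-in for an external target.
site = {"index": "page-a page-b", "page-a": "Title A", "page-b": "Title B"}
published: Dict[str, List[str]] = {}

detail = FilterRule("detail", lambda data: [data.upper()])
entry = FilterRule("links", lambda data: data.split(), [detail])
Container(read=site.__getitem__, publish=published.update).run(entry, "index")
```

Because the detail filters run in parallel, the order of the `detail` results in the cache is not guaranteed in this sketch.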
2.3.2 collection policy
The collection system applies a different collection mechanism to each kind of collection target, performing semantic analysis on semi-structured data in order to capture it intelligently. For web page collection, the filter analyzes the HTML tags and captures the specified data based on tag category and attributes. For Word documents, the filter analyzes the document format and the Word objects and captures data accordingly. For special information such as email addresses and mobile phone numbers, the filter captures data by template matching.
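Two of these mechanisms, tag-based capture for web pages and template capture for email addresses, can be illustrated with the Python standard library. The `TagCapture` class, its parameters, and the simplified email pattern are assumptions made for this sketch; the collector's real filters are not documented here.

```python
import re
from html.parser import HTMLParser


class TagCapture(HTMLParser):
    """Collects the text of elements matching a tag and attribute value."""

    def __init__(self, tag: str, attr: str, value: str) -> None:
        super().__init__()
        self.want_tag, self.want_attr, self.want_value = tag, attr, value
        self.open_depth = 0  # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.open_depth and tag == self.want_tag:
            self.open_depth += 1  # nested element of the same tag
        elif (not self.open_depth and tag == self.want_tag
              and dict(attrs).get(self.want_attr) == self.want_value):
            self.open_depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.open_depth and tag == self.want_tag:
            self.open_depth -= 1

    def handle_data(self, data):
        if self.open_depth:
            self.results[-1] += data


# Simplified email template; real-world address syntax is broader.
EMAIL_TEMPLATE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

page = (
    '<div><h1 class="title">Visual mining</h1>'
    "<p>Contact: admin@example.com</p>"
    '<h1 class="other">skip</h1></div>'
)

parser = TagCapture("h1", "class", "title")
parser.feed(page)
titles = parser.results
emails = EMAIL_TEMPLATE.findall(page)
```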
2.4 module structure
Modules are as follows:
- Collection rule file, collection rule definition module, collection rule parsing module, collection rule management module
- Collection container management module, collection container
- Cache manager module
- HTTP input plug-in, file system input/output plug-in, JDBC input/output plug-in, filter plug-in
- Input plug-in management module, output plug-in management module, filter plug-in management module, plug-in manager
- Filter status report module
2.5 filter pipeline
A filter pipeline is a tree-shaped data channel that the filter engine builds from the input relationships between filters. (Figure omitted.)
The figure shows five pipelines: 1 -> 1-1 -> 1-1-1; 1 -> 1-1 -> 1-1-2; 1 -> 1-2 -> 1-2-1; 1 -> 1-2 -> 1-2-2; 1 -> 1-2 -> 1-2-3
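As a small illustration, the five pipelines listed above are exactly the root-to-leaf paths of the filter tree. The dictionary representation and the `pipelines` function below are assumptions chosen for this sketch.

```python
from typing import Dict, List


def pipelines(tree: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return every root-to-leaf path in a filter tree."""
    children = tree.get(root, [])
    if not children:
        return [[root]]
    return [[root] + tail for child in children for tail in pipelines(tree, child)]


# Each key is a filter; its list holds the filters that take its output as input.
filter_tree = {
    "1": ["1-1", "1-2"],
    "1-1": ["1-1-1", "1-1-2"],
    "1-2": ["1-2-1", "1-2-2", "1-2-3"],
}

paths = pipelines(filter_tree, "1")
```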
2.6 rule File
Collection rules are defined according to an agreed syntax. The description language can be XML, and the rule logic can be expressed with regular expressions, a custom scripting language, or a combination of the two.
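To make this concrete, the sketch below parses a small XML rule whose logic is written as regular expressions. The element and attribute names (`rule`, `filter`, `id`, `pattern`, `address`) are invented for illustration; the source does not define the actual rule syntax.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file: one filter per unit, each with a regex pattern.
RULE_XML = r"""
<rule name="news" address="http://example.com/news">
  <filter id="title" pattern="&lt;h1&gt;(.*?)&lt;/h1&gt;"/>
  <filter id="date" pattern="(\d{4}-\d{2}-\d{2})"/>
</rule>
"""


def load_rule(xml_text: str) -> dict:
    """Map each unit id to its compiled pattern."""
    root = ET.fromstring(xml_text.strip())
    return {f.get("id"): re.compile(f.get("pattern")) for f in root.iter("filter")}


def apply_rule(filters: dict, text: str) -> dict:
    """Run every pattern over the text and collect the captured groups."""
    return {unit: pattern.findall(text) for unit, pattern in filters.items()}


filters = load_rule(RULE_XML)
units = apply_rule(filters, "<h1>Release notes</h1> published 2010-05-01")
```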
The collection system checks the validity of the collection logic: it checks whether inputs and outputs form a loop and whether any output result is left unpublished.
Invalid cases: 1) an input and an output are connected in a loop;
2) an output result is not published.
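Both checks can be sketched on a simple graph model of a rule. The representation is an assumption: `inputs` maps each filter to the filters whose output it reads, and an output counts as unpublished here when it is neither consumed by another filter nor listed as published.

```python
from typing import Dict, Iterable, List


def has_loop(inputs: Dict[str, List[str]]) -> bool:
    """Detect a cycle in the 'takes its input from' graph by depth-first search."""
    in_progress, finished = set(), set()

    def visit(node: str) -> bool:
        in_progress.add(node)
        for source in inputs.get(node, []):
            if source in in_progress:
                return True  # walked back into the current path
            if source not in finished and visit(source):
                return True
        in_progress.discard(node)
        finished.add(node)
        return False

    return any(node not in finished and visit(node) for node in inputs)


def unpublished_outputs(inputs: Dict[str, List[str]], published: Iterable[str]) -> List[str]:
    """Filters whose output is neither read by another filter nor published."""
    consumed = {source for sources in inputs.values() for source in sources}
    return sorted(set(inputs) - consumed - set(published))


looped = {"a": ["b"], "b": ["a"]}
valid = {"list": [], "detail": ["list"], "author": ["list"]}

loop_found = has_loop(looped)
valid_has_loop = has_loop(valid)
missing = unpublished_outputs(valid, published=["detail"])
```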
2.7 unit Association
The system handles the following forms of value association during release:
2.7.1 association between parent and child
Filter A produces multiple matching results A1, A2, and A3. Each of these values then serves as the input source for a second match by filter B, which produces the results (A1B1, A1B2), (A2B1, A2B2), and (A3B1, A3B2). The system must preserve the associations [A1, (A1B1, A1B2)], [A2, (A2B1, A2B2)], and [A3, (A3B1, A3B2)] during release. (Figure omitted.)
2.7.2 association between siblings
Data A, B, and C are filtered from the data source. The subscript of each letter identifies one of the multiple values matched by that filter. B and C can be associated only in the form [B1-1, C1-1], [B1-2, C1-2], [B2-1, C2-1], [B2-2, C2-2]. This is called sequential association. (Figure omitted.)
The collection memory must be able to represent these relationships in its storage data structure, and it must support indexing and upward lookup by filter subscript.
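One possible storage structure for these relationships is a tree of units in which each unit keeps its parent and its child results grouped by filter. The `Unit` class and `sequential_pairs` function are hypothetical, and the sketch assumes that B and C are sibling filters applied to each result of A.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Unit:
    """A cached value with links to its parent and child units."""
    value: str
    parent: Optional["Unit"] = None
    children: Dict[str, List["Unit"]] = field(default_factory=dict)

    def add(self, filter_name: str, value: str) -> "Unit":
        child = Unit(value, parent=self)
        self.children.setdefault(filter_name, []).append(child)
        return child


def sequential_pairs(parent: Unit, left: str, right: str) -> List[Tuple[str, str]]:
    """Pair sibling results under one parent by position."""
    return [
        (l.value, r.value)
        for l, r in zip(parent.children.get(left, []), parent.children.get(right, []))
    ]


a1, a2 = Unit("A1"), Unit("A2")
for parent, n in ((a1, 1), (a2, 2)):
    for i in (1, 2):
        parent.add("B", f"B{n}-{i}")
        parent.add("C", f"C{n}-{i}")

pairs = sequential_pairs(a1, "B", "C") + sequential_pairs(a2, "B", "C")
# Upward lookup: from a child unit back to the value it was matched from.
origin = a2.children["C"][0].parent.value
```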
2.8 UI design
2.8.1 Main Interface Structure
- Module list
- Public information
- Function list
- Work zone
- Help information prompt
- Status bar
The system uses a two-level navigation bar to guide user operations and shows help prompts in real time, so that users can resolve problems and complete their work easily and smoothly.
2.8.2 collection interface structure
- Rule file window
- Collection unit status window area
- Filter status window area
You can select a previous rule file in the Rule file window for collection. The collection unit status window area reports the collected unit data to the user in a timely manner. The filter status window area reports the status information of the filters that have been completed or are being executed to the user, so that the user can understand the execution status of the system at all times.
The collection interface is displayed in the workspace on the main interface.
2.8.3 rule definition interface structure
- Rule properties window
- Collection object display area
- Unit definition area
- Source code area of the collection object
- Function button area
- Rule Properties window: set basic attributes of a rule
- Collection object display area: displays collection objects visually. You can select collection objects directly on the visualization objects.
- Unit definition area: Specify the data to be collected on the collection object
- Source code area of the collection object: displays the source code of the collection object
2.8.4 information publishing setting page
- Release target attribute setting area
- Target large unit list area
- Target small unit list area
- Collection unit list area
- Variable setting area
- Associate button area
- Unit association display area
- Operation button area
The user connects to the publishing target, and the target unit list areas display its units hierarchically. The user then specifies the correspondence between target units and collection units. The unit association display area lists the units that are currently associated.
3 function description
3.1 structured collection
The system performs semantic analysis on semi-structured data and intelligently extracts data based on semantic rules.
3.2 Visual metadata Definition
You can specify the content to be collected directly on a visual display of the target.
3.3 plugin support
The system has a wide range of plug-ins that support collection from various targets and release to various systems, for example FTP collection, HTTP collection, database publishing, and file publishing.
3.4 client environment simulation
The system simulates the client environment and supports the basic session functions between client and server, such as the browser's session and cookie mechanisms. User login is supported.
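With the Python standard library, a browser-like client that keeps cookies across requests could be set up as in the sketch below. The `make_client` helper and the User-Agent string are assumptions for illustration; no request is sent in this example.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_client(user_agent: str):
    """Build an opener that stores cookies and sends a browser-like header."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_client("Mozilla/5.0 (compatible; collector-sketch)")
# Cookies set by a login response would be stored in `jar` and sent
# automatically with every later request made through `opener`.
```

Posting login form data through `opener.open` and then reusing the same opener for the collection requests keeps the logged-in session alive.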
3.5 multi-thread collection
The system supports concurrent tasks and multi-threaded collection, with control over the number of concurrent threads and monitoring of their status.
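Thread control and status monitoring of this kind can be sketched with a bounded thread pool. The `collect` function is a stand-in that sleeps instead of reading from the network, and all names here are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect(address: str) -> str:
    """Stand-in for a network read; sleeps briefly to mimic latency."""
    time.sleep(0.05)
    return f"content of {address}"


def collect_all(addresses, max_threads: int = 2):
    """Collect concurrently with at most max_threads worker threads."""
    results, status = {}, []
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {pool.submit(collect, a): a for a in addresses}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            status.append(f"{done}/{len(futures)} finished")  # progress report
    return results, status


addresses = [f"http://example.com/page/{i}" for i in range(1, 4)]
results, status = collect_all(addresses)
```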
3.6 global release
The system provides a global cache area associated with the context. The publishing module can combine unit data of different levels. You can check and edit the unit data in the cache.
3.7 page collection
The system automatically collects the following pages of the content based on page number rules.
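A page number rule can be as simple as an address template with a page placeholder. The `{page}` placeholder syntax and the `page_addresses` function below are assumptions made for this sketch.

```python
from typing import List


def page_addresses(template: str, first: int, last: int, step: int = 1) -> List[str]:
    """Expand a page number rule into the list of page addresses."""
    return [template.format(page=n) for n in range(first, last + 1, step)]


pages = page_addresses("http://example.com/list?page={page}", 1, 3)
```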
3.8 download associated files
Depending on its settings, the system can automatically download other files contained in the page, such as Flash files and images.
3.9 save rules
Information such as collection objects, filtering rules, and publishing targets is stored in Rule files. You can import and export rule files to share or exchange rule files with others. The system provides a friendly Wizard Page for you to configure the rule file.
3.10 template Modification
You can publish data according to the predefined template structure.
3.11 filter and replace results
The system automatically filters format markup and syntax, such as HTML markup and Word formatting, out of the collected data. Constant replacement and environment variable replacement are supported.
3.12 duplicate Filtering
The system automatically deletes duplicate data from the collection results.
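A minimal sketch of duplicate filtering, assuming units are compared by exact value and the first occurrence should be kept; the collector's actual comparison rule is not stated in the source.

```python
from typing import List


def drop_duplicates(units: List[str]) -> List[str]:
    """Remove repeated units while keeping first-seen order."""
    return list(dict.fromkeys(units))


collected = ["a@example.com", "b@example.com", "a@example.com"]
unique = drop_duplicates(collected)
```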
4. supported information
Resources | Description
Http://www.caijiqi.net/ | The official project website, which publishes project documents and provides system downloads.
QQ: 107175884 |
Mail: hotheartboy@gmail.com |