Website Collector Product White Paper

Source: Internet
Author: User
Document directory
  • 1 Overview
  • 1.1 Purpose
  • 1.2 Product Introduction
  • 1.3 Market Analysis
  • 1.3.1 Internet Applications
  • 1.3.2 Information Search
  • 1.3.3 Data Entry
  • 1.4 Requirement Overview
  • 1.4.1 Website Collection
  • 1.4.2 Information Collection
  • 1.4.3 Data Structuring
  • 2 User Features
  • 2.1 Website Administrator
  • 2.2 Information Collection Users
  • 2.3 Structured Data Users
  • 3 Runtime Environment
  • 4 Operating System
  • 5 System Features
  • 5.1 I/O System
  • 5.2 Container System
  • 5.3 Cache System
  • 5.4 Plug-in System
  • 6 Function Description
  • 6.1 Structured Collection
  • 6.2 Visual Metadata Definition
  • 6.3 Plug-in Support
  • 6.4 Client Environment Simulation
  • 6.5 Multi-thread Collection
  • 6.6 Global Release
  • 6.7 Page Collection
  • 6.8 Download Associated Files
  • 6.9 Save Rules
  • 6.10 Template Modification
  • 6.11 Filter and Replace Results
  • 6.12 Duplicate Filtering
  • 7 Support Information

Source: Visual mining website collector

1 Overview

1.1 Purpose

This paper describes the structure, characteristics, and functions of the collection system from a technical point of view. It also analyzes the market conditions and current user requirements for such a system.

It is intended as a reference for both users and technical staff.

1.2 Product Introduction

The website collector is open-source information collection software. It can be used for website information collection, full-text search within a site, data exchange between software systems, data structuring, and other applications.

1.3 Market Analysis

1.3.1 Internet Applications

With the development and spread of the Internet, the number of Internet users is growing rapidly, and going online has become part of daily life. People read, post, search, communicate, and shop through websites, and taken together these online activities add up to enormous commercial value. The Internet has therefore become a land of opportunity for many people, rich or poor, and what counts there is that information is king and service comes first. Creating, collecting, organizing, and reprocessing information is thus the foundation on which a website survives. Given the website addresses specified by the website administrator and a set of predefined crawling rules, the information collection system automatically fetches webpage content, extracts data according to the data structure of the website system, and publishes it to that system, so that a website can be filled with content quickly and with little effort or cost.

1.3.2 Information Search

Because all kinds of user groups are connected to it, the Internet has become an all-encompassing library of information where commercial, academic, and personal information can be published and obtained. Enterprises can therefore use the Internet to find customer leads, market quotations, and business information. In this vast sea of information, however, they often lack a tool for discovering what they care about and for organizing and storing it effectively so that it becomes an internal resource of the enterprise. The information collection system can automatically search for data that matches a given data pattern and present the matching information on the user's desktop.

1.3.3 Data Entry

Enterprise management systems, enterprise information management systems, customer service systems, and other information processing systems can only process structured data. Student information, for example, consists of attributes such as name, gender, and age, which must be saved in a predefined structure. In practice, however, an enterprise holds a large amount of unstructured data, such as materials submitted by customers and internal company documents. Getting this data into the information processing systems usually means compiling and entering it by hand. The information collection system can automatically split a document into multiple fields according to the data structure of the information system and import those fields into the enterprise's information processing systems.

1.4 Requirement Overview

A website administrator's greatest wish is to offer the richest possible content and so attract more visits. Marketers are thrilled when a lead turns up hidden customer resources. A company's clerical staff dream of escaping tedious text entry. The collection system is like a pair of sharp eyes that lets you see more and obtain more.

1.4.1 Website Collection

A website administrator wants to save content from other websites to their own server, extract the relevant fields from that content, and publish them to their own website system. Sometimes the files associated with a webpage, such as embedded files and attachments, also need to be saved locally.

The administrator crawls the same website regularly and does not want content that has already been captured to be published to the website system again. Some websites require a login before a page can be fetched. Starting from a content list page, the administrator wants to obtain all the related content, including the other pages of the list. When the same website is crawled a second time, the settings from the first time should not have to be entered again.

1.4.2 Information Collection

Website administrators collect pictures, jokes, news, technical articles, and other information from the Internet, then classify, edit, and publish it to their own website systems. They typically search for keywords with a search engine to obtain target URLs and then extract content from those pages. The choice of keywords determines the accuracy and quantity of the content obtained. Because the content comes from different websites, the extraction method differs from site to site. For a given type of information, however, the data structure published to the website system is the same.

Website administrators also want to offer search within their own site, which means organizing and indexing the content of its pages.

Enterprises search the Internet for email addresses and phone numbers and want to see the information surrounding each hit in order to learn the basics about the person or organization behind it. They may want to search for a particular type of customer, for example female customers aged between 20 and 30. The collected information can then be stored in the enterprise's customer management system.

Enterprises also need information about products: they want to obtain quotations and manufacturers, compare them, and keep up with the latest quotations and manufacturer information. This information needs to be stored in the enterprise's internal ERP system or other systems.

1.4.3 Data Structuring

Electronic documents produced in the office, customer information submitted by customers, and similar data usually take a great deal of manual effort to enter into the enterprise's ERP or information system. Enterprises want software that extracts the relevant data from these documents and imports it into the system automatically. Such data generally follows a fixed template, and documents of the same type share the same template. For example, the home-information forms of Customer 1 and Customer 2 use the same template and differ only in content.
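
Template-based extraction of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "Label: value" form layout, the field names, and the helper `extract_fields` are hypothetical and are not the product's actual rule format.

```python
import re
from typing import Optional

# Hypothetical template: one named group per field of the target data structure.
CUSTOMER_TEMPLATE = re.compile(
    r"Name:\s*(?P<name>.+?)\s*\n"
    r"Gender:\s*(?P<gender>.+?)\s*\n"
    r"Age:\s*(?P<age>\d+)"
)

def extract_fields(document: str) -> Optional[dict]:
    """Return the fields matched by the template, or None if the document does not fit it."""
    match = CUSTOMER_TEMPLATE.search(document)
    if match is None:
        return None
    record = match.groupdict()
    record["age"] = int(record["age"])  # convert to the type the target system expects
    return record

# Two documents that share the template and differ only in content.
customer_1 = "Name: Alice Wang\nGender: female\nAge: 28\n"
customer_2 = "Name: Bob Li\nGender: male\nAge: 35\n"
records = [extract_fields(doc) for doc in (customer_1, customer_2)]
```

Each resulting dictionary has the same keys, so it could be handed to whatever import routine the target ERP or information system offers.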

2 User Features

2.1 Website Administrator

Website administrators are among the system's end users. Collection may fail on websites that hide their target addresses, and some websites deliberately scramble their page content, which makes it harder for administrators to define accurate rules. To help with these problems, the system provides examples and matching wildcards that show how to handle such cases. The official website also hosts forums where users can share experience and learn from one another.

2.2 Information Collection Users

For information collection users, the system provides a variety of template modes, such as an email matching mode and a phone number matching mode. Users only need to select a template to get the information they want. The official website also offers a large collection of templates for download.
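
As a rough idea of what such matching modes might look like, the Python sketch below maps a mode name to a regular expression. The patterns and the `collect` helper are assumptions made for illustration; they are deliberately simple and do not cover every real-world address or number format.

```python
import re

# Illustrative patterns only; real address and number formats vary widely.
MATCH_MODES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{6,}\d"),
}

def collect(text: str, mode: str) -> list:
    """Return every substring of text that matches the selected mode."""
    return MATCH_MODES[mode].findall(text)

sample = "Contact sales@example.com or call +86 10-5555-0100 for a quote."
emails = collect(sample, "email")
phones = collect(sample, "phone")
```

Selecting a template then amounts to choosing a key such as `"email"`; a downloadable template would add a new entry to the table.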

2.3 Structured Data Users

Third-party technical support is available for structured data applications.

The system exposes input and output programming interfaces. Some users extend these interfaces so that the system can be applied in more scenarios. For these users, the system provides detailed interface descriptions and sample extension code, along with a development kit that documents the purpose and usage of every class and method.

Another type of user is the rule merchant. These users only write rules and exchange or sell them online, and they care most about finding websites and about content quality. They fall into two kinds. The first are hunters, who discover all sorts of information and can meet the content needs of many kinds of website systems. They earn their money through volume, although rare content can fetch a price like gold. The second are hackers, who are proficient in web technology, clever, and persistent. What they offer are hard-to-find, top-quality products, at correspondingly high prices.

Because the collection system is open-source software, many people analyze and reuse its components in order to extend and improve it. They work not only with the system's interfaces but also with its internal code, so they need to consult the system's technical documents. Development documents are therefore just as essential as the user manual.

[Figure: Collaboration relationship]

3 Runtime Environment

To adapt to multiple operating environments, the collection system comes in several architectures and language versions. It is available in a standalone edition and a web version. The web version is implemented in several languages: Java, PHP, and .NET.

Software Structure   Programming Language   Operating System   Database             Running Environment
Standalone Edition   VC                     Windows            Access               Windows
Standalone Edition   Java                   Windows/Unix       MySQL                JDK
Web Version          Java                   Windows/Unix       MySQL/MSSQL/Oracle   Servlet container + JDK
Web Version          PHP                    Windows/Unix       MySQL                PHP container
Web Version          .NET                   Windows            MSSQL                IIS server

4 Operating System

The basic components of the collection system are the input subsystem, the cache, and the output subsystem. Data is extracted by multiple filters at multiple depths and stored in the cache.
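
The flow from input through filters to cache and output can be pictured with a small Python sketch. The function `run_pipeline` and the two example filters are hypothetical names chosen for this illustration; the real system's components are not documented here in enough detail to reproduce.

```python
from typing import Callable, Iterable, List

# A filter takes one piece of data and returns the pieces extracted from it.
Filter = Callable[[str], Iterable[str]]

def run_pipeline(source: Iterable[str], filters: List[Filter],
                 publish: Callable[[List[str]], None]) -> List[str]:
    """Pass input through one filter per depth level, cache the results, then publish them."""
    current = list(source)                      # input subsystem: raw data
    for extract in filters:                     # each filter works one level deeper
        current = [piece for item in current for piece in extract(item)]
    cache = current                             # cache: extracted data awaiting output
    publish(cache)                              # output subsystem
    return cache

def split_lines(page: str) -> List[str]:
    """First-level filter: break a page into lines."""
    return page.splitlines()

def titles(line: str) -> List[str]:
    """Second-level filter: keep the text after a 'title:' label."""
    return [line[len("title:"):].strip()] if line.startswith("title:") else []

published: List[str] = []
run_pipeline(["title: A\nbody\ntitle: B"], [split_lines, titles], published.extend)
```

Here a plain list stands in for the output target; any callable that accepts the cached records could take its place.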

5 System Features

5.1 I/O System

The system uses a unified input/output interface to read data from and publish data to a variety of external targets. This transparently supports interaction with both current and future types of external systems.

5.2 Container System

The container management system lets the system run more efficiently and improves user interaction. Features:

  1. The number of concurrent filters can be controlled to respect the limits of different targets.
  2. Filter status reports keep the user informed of collection progress at all times.
  3. Reuse and scheduling policies improve concurrency efficiency.
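
The three features above could be approximated with Python's standard thread pool, as in the sketch below. `FilterContainer` and its methods are hypothetical names for this illustration, and the choice of `ThreadPoolExecutor` is an assumption, not a description of how the product schedules its filters.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

class FilterContainer:
    """Runs filter tasks on a fixed pool of reusable worker threads and tracks their status."""

    def __init__(self, max_concurrent: int):
        # The pool caps how many filters run at once and reuses its threads.
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        self._status = Counter()
        self._lock = threading.Lock()

    def _run(self, task, item):
        try:
            result = task(item)
        except Exception:
            with self._lock:
                self._status["failed"] += 1
            raise
        with self._lock:
            self._status["done"] += 1
        return result

    def submit_all(self, task, items):
        """Schedule task(item) for every item and return the results in input order."""
        futures = [self._pool.submit(self._run, task, item) for item in items]
        return [future.result() for future in futures]

    def status(self) -> dict:
        """Report how many filter tasks have finished or failed so far."""
        with self._lock:
            return dict(self._status)

container = FilterContainer(max_concurrent=2)
lengths = container.submit_all(len, ["<html>a</html>", "<html>bb</html>"])
```

Lowering `max_concurrent` is one way to stay within the request limits of a particular target site.
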

5.3 Cache System

The cache provides a global data index for the output subsystem, which gives the output subsystem the following capabilities:

  1. Data can be checked and processed globally.
  2. Unit data can be associated across layers, so that intermediate data collected along the way can also be published.
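
One possible reading of "associating unit data across layers" is sketched below in Python: every collected unit is stored under a global key together with a link to the unit it was reached from. The class `CollectionCache`, its methods, and the key scheme are assumptions made for this example.

```python
class CollectionCache:
    """Keeps every collected unit under a global key and remembers its parent unit."""

    def __init__(self):
        self._units = {}  # key -> (parent key or None, data dict)

    def put(self, key, data, parent=None):
        self._units[key] = (parent, dict(data))

    def merged(self, key):
        """Combine a unit with all its ancestors; deeper layers override shallower ones."""
        chain = []
        while key is not None:
            parent, data = self._units[key]
            chain.append(data)
            key = parent
        result = {}
        for data in reversed(chain):
            result.update(data)
        return result

cache = CollectionCache()
cache.put("list/1", {"category": "news"})                    # collected from a list page
cache.put("detail/42", {"title": "Hello"}, parent="list/1")  # collected from a detail page
article = cache.merged("detail/42")
```

With such an index, the output side can publish a detail record together with fields that were only visible one level up.
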

5.4 Plug-in System

The collection system supports a wide range of plug-ins. The plug-in manager is responsible for loading and indexing plug-ins. There are three types of plug-in: input plug-ins, filter (collection) plug-ins, and output plug-ins. Their functions are as follows:

  1. Input plug-ins support reading from different external objects, such as HTTP servers, FTP servers, and file systems.
  2. Filter (collection) plug-ins support collecting different data formats and special kinds of information, such as webpage collection, Word document collection, and email address collection.
  3. Output plug-ins support publishing to various systems, such as BBS systems and information systems.
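
A plug-in manager that loads and indexes plug-ins by type might look like the following Python sketch. The registry design, the class `PluginManager`, and the two sample plug-ins are hypothetical; the product's real plug-in interface is not described in this paper.

```python
from collections import defaultdict

class PluginManager:
    """Loads plug-ins and indexes them by kind ("input", "filter", "output") and name."""

    KINDS = ("input", "filter", "output")

    def __init__(self):
        self._index = defaultdict(dict)

    def register(self, kind, name):
        """Return a decorator that adds a function to the index under kind and name."""
        if kind not in self.KINDS:
            raise ValueError(f"unknown plug-in kind: {kind}")

        def decorator(func):
            self._index[kind][name] = func
            return func

        return decorator

    def get(self, kind, name):
        return self._index[kind][name]

    def names(self, kind):
        return sorted(self._index[kind])

plugins = PluginManager()

@plugins.register("input", "file")
def read_file(path):
    """Input plug-in: read a document from the local file system."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()

@plugins.register("filter", "email")
def find_emails(text):
    """Filter plug-in: a naive email collector, for illustration only."""
    return [word.strip(".,;") for word in text.split() if "@" in word]
```

A collection task can then look up the plug-ins it needs by name, for example `plugins.get("filter", "email")`.
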

6 Function Description

6.1 Structured Collection

The system performs semantic analysis on semi-structured data and extracts data intelligently based on semantic rules.

6.2 Visual Metadata Definition

You can specify the content to be collected directly in a visual view of the target.

6.3 Plug-in Support

The system offers a wide range of plug-ins that support collection from various targets and publishing to various systems, for example FTP collection, HTTP collection, database publishing, and file publishing.

6.4 Client Environment Simulation

The system simulates the client environment and supports basic client-server session features, such as the browser's session and cookie mechanisms. User login is supported.
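
The cookie part of this simulation can be illustrated with Python's standard `http.cookies` module. The class `ClientSession` and the sample header are assumptions for this sketch, which handles header text only and performs no network requests.

```python
from http.cookies import SimpleCookie

class ClientSession:
    """Remembers cookies set by the server and replays them on later requests."""

    def __init__(self):
        self._jar = SimpleCookie()

    def receive(self, set_cookie_header: str) -> None:
        """Store the cookie carried by one Set-Cookie response header."""
        self._jar.load(set_cookie_header)

    def cookie_header(self) -> str:
        """Build the value of the Cookie request header for the next request."""
        return "; ".join(f"{name}={morsel.value}" for name, morsel in self._jar.items())

session = ClientSession()
session.receive("sessionid=abc123; Path=/; HttpOnly")  # e.g. from a login response
next_request_cookie = session.cookie_header()
```

For real HTTP traffic, the standard library's `http.cookiejar.CookieJar` combined with `urllib.request.HTTPCookieProcessor` performs this bookkeeping automatically.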

6.5 Multi-thread Collection

The system supports concurrent tasks and multi-threaded collection, with control over the number of concurrent threads and monitoring of their status.

6.6 Global Release

The system provides a global cache tied to the collection context. The publishing module can combine unit data from different levels, and the unit data in the cache can be checked and edited.

6.7 Page Collection

The system automatically collects the following pages of the content based on page-number rules.
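
A page-number rule can be as simple as a URL pattern with a placeholder, as in this Python sketch. The pattern syntax, the function `page_urls`, and the stand-in check `fake_has_content` are assumptions for illustration; a real rule would fetch each page to decide whether it still has content.

```python
def page_urls(pattern: str, has_content, start: int = 1, limit: int = 100):
    """Yield page URLs built from a pattern such as '...?page={page}' until a page is empty."""
    for page in range(start, start + limit):
        url = pattern.format(page=page)
        if not has_content(url):
            break
        yield url

def fake_has_content(url: str) -> bool:
    """Stand-in for a real fetch: pretend only pages 1 to 3 exist."""
    return int(url.rsplit("=", 1)[1]) <= 3

urls = list(page_urls("http://example.com/list?page={page}", fake_has_content))
```

The `limit` argument guards against sites that return a non-empty page for every page number.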

6.8 Download Associated Files

Depending on its settings, the system can automatically download other files contained in the page, such as Flash files and images.
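
Finding the files to download might be done by scanning the page for embedding tags, as in the Python sketch below, which uses the standard `html.parser` module. The class `AssociatedFileFinder` and the tag-to-attribute table are assumptions; the sketch only lists URLs and leaves the download itself out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssociatedFileFinder(HTMLParser):
    """Collects absolute URLs of files a page embeds, such as images and Flash objects."""

    # tag -> attribute holding the file URL; which tags to follow would be a user setting
    SOURCES = {"img": "src", "embed": "src", "object": "data"}

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.files = []

    def handle_starttag(self, tag, attrs):
        wanted = self.SOURCES.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative paths against the address of the page itself.
                self.files.append(urljoin(self.base_url, value))

finder = AssociatedFileFinder("http://example.com/news/1.html")
finder.feed('<p><img src="a.png"><embed src="/media/intro.swf"></p>')
```

After `feed` returns, `finder.files` holds the absolute addresses that a download step could fetch.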

6.9 Save Rules

Information such as collection objects, filtering rules, and publishing targets is stored in rule files. You can import and export rule files to share or exchange them with others. The system provides a friendly wizard for configuring a rule file.
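
The paper does not specify the rule file format. Purely as an illustration, the Python sketch below assumes a JSON document with three hypothetical parts (`source`, `filters`, `target`) and shows how export and import could round-trip such a rule.

```python
import json

# Hypothetical top-level parts of a rule: what to collect, how to filter, where to publish.
REQUIRED_KEYS = {"source", "filters", "target"}

def dump_rule(rule: dict) -> str:
    """Serialize a rule to text that can be written to a rule file and shared."""
    return json.dumps(rule, indent=2, ensure_ascii=False, sort_keys=True)

def load_rule(text: str) -> dict:
    """Parse rule-file text and check that the three expected parts are present."""
    rule = json.loads(text)
    missing = REQUIRED_KEYS - rule.keys()
    if missing:
        raise ValueError(f"rule is missing: {sorted(missing)}")
    return rule

example_rule = {
    "source": {"url": "http://example.com/list?page={page}"},
    "filters": [{"field": "title", "pattern": "<h1>(.*?)</h1>"}],
    "target": {"type": "file", "path": "out.csv"},
}
restored = load_rule(dump_rule(example_rule))
```

Validating on import gives an early error when a shared rule file is incomplete.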

6.10 Template Modification

You can publish data according to the predefined template structure.

6.11 Filter and Replace Results

The system automatically filters format and markup syntax out of the data, such as HTML markup and Word formatting. Constant replacement and environment variable replacement are supported.
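
A minimal Python sketch of filtering and replacement follows. It assumes that placeholders are written as `$NAME` and that "environment variables" means operating-system environment variables; both are guesses, and the regular expression is a simplification that will not handle every HTML document.

```python
import html
import os
import re
from string import Template

TAG = re.compile(r"<[^>]+>")  # simplistic tag matcher, adequate for a sketch

def clean(raw: str, constants: dict) -> str:
    """Strip HTML markup, then replace $NAME placeholders from constants or the environment."""
    text = html.unescape(TAG.sub("", raw))
    values = {**os.environ, **constants}  # constants take precedence over the environment
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(text).safe_substitute(values)

result = clean("<p>Price: <b>$PRICE</b> &amp; tax</p>", {"PRICE": "10"})
```

A production filter would use a real HTML parser rather than a regular expression to remove markup.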

6.12 Duplicate Filtering

The system automatically removes duplicate data from the collection results.
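
One common way to detect duplicates is to compare a hash of each record after normalizing it, as in the Python sketch below. The normalization (collapsing whitespace and ignoring case) and the function `drop_duplicates` are assumptions; the paper does not say how the product decides that two records are the same.

```python
import hashlib

def drop_duplicates(records):
    """Keep the first occurrence of each record, comparing by a hash of the normalized text."""
    seen = set()
    unique = []
    for record in records:
        normalized = " ".join(record.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

kept = drop_duplicates(["Hello  World", "hello world", "Other"])
```

Storing digests rather than full records keeps the memory cost low when the collected texts are long.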

7 Support Information

Resource                  Description
Http://www.caijiqi.net/   The official project website, which publishes project documents and provides system downloads.
QQ: 107175884
Mail: hotheartboy@gmail.com
