1: configuration module:
Collection targets: news, user comments, blogs, forums, etc.
Provides a visual integrated development environment for configuring collection sources.
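As a rough illustration of what a collection-source configuration record might hold, here is a minimal sketch; the field names and the validation helper are hypothetical, not the product's actual schema.

```python
# Hypothetical collection-source configuration record; field names are
# illustrative only.
source_config = {
    "name": "example-news-site",
    "type": "news",            # news / comment / blog / forum
    "entry_urls": ["http://example.com/news/"],
    "crawl_depth": 2,
    "update_interval_minutes": 30,
    "encoding": "utf-8",
}

def validate(cfg):
    """Minimal sanity check a visual configuration tool might run."""
    required = {"name", "type", "entry_urls"}
    missing = required - cfg.keys()
    if missing:
        raise ValueError("missing fields: %s" % sorted(missing))
    return True
```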
2: crawler module:
Automatic identification of a website's content organization structure (site map).
Supports secondary cookie verification (e.g. Xinghua Network), and supports pop-up entry of verification codes for logins that require them
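The cookie-handling side of such logins can be sketched with two small helpers: capture the name=value pairs a login response sets, then replay them on the follow-up verification request. This is a simplified illustration (it ignores cookie attributes such as Path and Expires), not the system's actual implementation.

```python
def parse_set_cookie(headers):
    """Collect name=value pairs from Set-Cookie header values
    (simplified: attributes such as Path and HttpOnly are dropped)."""
    cookies = {}
    for h in headers:
        pair = h.split(";", 1)[0]
        name, _, value = pair.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies

def cookie_header(cookies):
    """Build the Cookie header to send on the second (verification) request."""
    return "; ".join("%s=%s" % kv for kv in cookies.items())
```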
3: first-pass filter module (filters useless content such as advertisements and navigation)
Rule-based identification, address filtering, and link conversion for a selected region of an overview (list) page.
Precise data identification and format conversion (character-encoding conversion, address conversion, time conversion, etc.)
Based on vision-based page segmentation (VIPS), automatically labels the type and features of each region after the page is partitioned.
DOM tree structure analysis; partition-based web page structure analysis; visual region selection and configuration
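A much-simplified stand-in for this kind of region analysis is the link-density heuristic: navigation and advertisement regions tend to be link-dense, while body regions are text-dense. The sketch below, using only the standard-library HTML parser, measures what fraction of a fragment's text sits inside links; it is an assumption-laden toy, not the VIPS algorithm itself.

```python
from html.parser import HTMLParser

class BlockStats(HTMLParser):
    """Count how much of a fragment's text is inside <a> links."""
    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.text_len = 0
        self.link_len = 0
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
    def handle_data(self, data):
        n = len(data.strip())
        self.text_len += n
        if self.in_link:
            self.link_len += n

def link_density(html):
    """Ratio of link text to all text; near 1.0 suggests navigation."""
    p = BlockStats()
    p.feed(html)
    return p.link_len / p.text_len if p.text_len else 0.0
```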
4: task scheduling module: update policy, scheduling policy, and log management
Preset thresholds for each target website and raise alarms on exceptions. With second-level URL mapping, crawler servers can be added or removed dynamically while exchanging as little data as possible.
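One common way to get the "add or remove servers with minimal data exchange" property is consistent hashing: servers are hashed onto a ring, and only URLs on the departing or arriving server's arc get remapped. The original text does not say which mapping scheme is used, so this is a sketch of one plausible technique, not the system's actual design.

```python
import hashlib
from bisect import bisect_right

class ConsistentHash:
    """Map URLs to crawler servers on a hash ring, so that adding a
    server only remaps the URLs falling on its arc."""
    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self.ring = []          # sorted (hash, server) points
        for s in servers:
            self.add(s)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        # Place several virtual nodes per server for an even spread.
        for i in range(self.replicas):
            self.ring.append((self._hash("%s#%d" % (server, i)), server))
        self.ring.sort()

    def server_for(self, url):
        h = self._hash(url)
        idx = bisect_right(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

With three servers and a fourth added, only roughly a quarter of the URLs should move, versus nearly all of them under naive modulo hashing.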
5: Data Mining module:
Text classification, text clustering, similarity search, automatic summarization, automatic word segmentation, information extraction, sensitive-information filtering, sentiment analysis, pinyin search, and related-phrase search
5.1 text classification:
Statistics-based text classification (trained on a corpus; supports amending and supplementing the corpus and the rule repository); supports multi-level and compound scoring. Also supports a vector space model based on semantic analysis: you can build a knowledge dictionary, and the module automatically calls knowledge-base resources to further improve classification accuracy.
Rule-based text classification (hand-written classification rules):
Rules support logical operations such as "and" and "or", plus word-frequency (term-count) conditions.
For example: Author = (Liu Xiang + Gu Baogang) - Body = (competition); Title = (comeback) + Body = (US + treatment)
K-nearest neighbor and support vector machine algorithms: http://www.360doc.com/content/070716/23/11966_615236.html
An SVM classifier (LIBSVM): http://www.csie.ntu.edu.tw/~cjlin/libsvm/
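The rule expressions above can be evaluated with a small matcher. The representation below (a list of per-field required terms plus optional exclusions, where "+" means all terms must appear and "-" vetoes a match) is a hypothetical simplification of the syntax shown, not the product's actual rule engine.

```python
def matches(doc, conditions, exclusions=()):
    """Evaluate a simplified classification rule.

    doc: dict of field name -> text.
    conditions: list of (field, [terms that must ALL appear]).
    exclusions: same shape; if all terms of any entry appear, veto.
    """
    for field, terms in conditions:
        text = doc.get(field, "")
        if not all(t in text for t in terms):
            return False
    for field, terms in exclusions:
        text = doc.get(field, "")
        if terms and all(t in text for t in terms):
            return False
    return True
```

For instance, the rule "Title = (comeback) + Body = (US + treatment)" becomes `[("title", ["comeback"]), ("body", ["US", "treatment"])]`.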
5.2 text clustering:
Aggregates texts with similar or identical features
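A minimal illustration of such aggregation is single-pass clustering over a word-overlap (Jaccard) similarity: each text joins the first cluster whose representative is similar enough, otherwise it starts a new cluster. The similarity measure and threshold here are assumptions for the sketch; the original does not specify the clustering algorithm used.

```python
def jaccard(a, b):
    """Word-set overlap similarity between two texts."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def single_pass_cluster(texts, threshold=0.5):
    """Assign each text to the first cluster whose representative is
    similar enough; otherwise open a new cluster."""
    clusters = []   # list of (representative, members)
    for t in texts:
        for rep, members in clusters:
            if jaccard(rep, t) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((t, [t]))
    return [members for _, members in clusters]
```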
5.3 similarity search
User-defined similarity threshold
Extracts features such as the page abstract and keywords, automatically generates a unique signature (an information fingerprint), and automatically checks whether fingerprints are equal
Improves efficiency with an inverted-index mechanism
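The fingerprint-plus-inverted-index idea can be sketched as follows: hash a page's most frequent words into a fingerprint so near-duplicates collide, and keep a word-to-documents index to narrow similarity candidates. This is a toy scheme built on word frequency; the actual feature set (abstract, keywords) and hashing method are not specified in the original.

```python
import hashlib
from collections import defaultdict

def fingerprint(text, top_k=5):
    """Hash the top-k most frequent words into an 'information
    fingerprint'; texts with the same dominant vocabulary collide."""
    freq = defaultdict(int)
    for w in text.lower().split():
        freq[w] += 1
    top = sorted(freq, key=lambda w: (-freq[w], w))[:top_k]
    return hashlib.md5(" ".join(top).encode()).hexdigest()

def build_inverted_index(docs):
    """word -> set of doc ids, for narrowing similarity candidates."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in set(text.lower().split()):
            index[w].add(doc_id)
    return index
```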
5.4 automatic summary
You can create professional (domain) dictionaries and customize clue words.
Automatic extraction of the keywords contained in web pages
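A simple way to combine clue words with keyword extraction is to boost the score of any term found in the clue list. The stop list, clue words, and weights below are illustrative placeholders; the original does not describe the actual scoring.

```python
# Illustrative stop-word and clue-word lists; in a real system both
# would be user-maintained dictionaries.
STOP = {"the", "a", "of", "and", "to", "in", "is"}
CLUES = {"conclusion", "importantly", "results"}

def keywords(text, top_k=3):
    """Frequency-based keyword extraction with a boost for clue words."""
    scores = {}
    for w in text.lower().split():
        if w in STOP:
            continue
        scores[w] = scores.get(w, 0) + (3 if w in CLUES else 1)
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```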
5.5 Automatic Word Segmentation
Stage 1: forward maximum matching + reverse maximum matching
Stage 2:
A combination of rules and statistics, embedding a word segmentation ambiguity rule repository
Provides part-of-speech tagging to accurately identify person names, place names, organization names, and other entities.
Word segmentation dictionaries: the system supports creating segmentation word tables, synonym/antonym dictionaries, and stop-word dictionaries, with on-demand dictionary maintenance.
Word segmentation rule repository: a large number of statistically derived ambiguity-exclusion rules, which effectively improve segmentation accuracy.
Supports automatic query expansion over topic dictionaries, synonyms and antonyms, full-width/half-width forms, and simplified/traditional Chinese characters (based on an authoritative knowledge-base system that can help correct and complete metadata).
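Forward maximum matching, the first stage named above, can be sketched in a few lines: at each position, greedily take the longest dictionary word, falling back to a single character. (Reverse maximum matching is the same scan run from the end of the string; the ambiguity rule repository then arbitrates when the two disagree.)

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Forward maximum matching segmentation: at each position take the
    longest dictionary word, else a single character."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in dictionary:
                words.append(piece)
                i += length
                break
    return words
```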
5.6 Information Extraction
Extraction targets: structured data (e.g. time fields), semi-structured data (HTML), unstructured data (person names, place names, organization names, times, currency amounts, etc.)
Extraction Method:
1: template technology (manually tag various template libraries, then extract automatically; where possible, train automatically with a neural network)
2: heuristic acquisition (the news body is generally in the nearest region below the title)
3: automatic analysis of a web page's semantic structure using visual similarity (currently popular)
JS-rendered content (parse with a local JS interpreter or simulate JS events, e.g. the Sohu Forum)
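The heuristic-acquisition idea (method 2) can be illustrated with a line-density sketch: after stripping tags, the article body is usually the longest consecutive run of long text lines, while navigation and footers are short. The length threshold is an arbitrary assumption for the toy example.

```python
def extract_body(lines, min_len=40):
    """Pick the longest consecutive run of lines whose stripped length
    meets min_len; short nav/footer lines break the run."""
    best, cur = [], []
    for line in lines:
        if len(line.strip()) >= min_len:
            cur.append(line.strip())
        else:
            if len("".join(cur)) > len("".join(best)):
                best = cur
            cur = []
    if len("".join(cur)) > len("".join(best)):
        best = cur
    return "\n".join(best)
```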
5.7 sentiment analysis
6. storage module:
Structured data: various relational databases
Unstructured data: indexed with Lucene on the file system; Bigtable-style stores (HBase, Hypertable)
Distributed: Hadoop cluster, MogileFS with automatic backup, etc.