Research on the Application of the Suffix Tree and Its Algorithms in Text Mining

Li Haitao (Institute of Scientific and Technical Information of China, Beijing 100038, China)

Abstract: This paper first introduces the suffix tree, a novel data structure, and its related concepts. On this basis, it discusses the characteristics of the suffix tree and how the tree is constructed, and then examines the application of suffix trees and their algorithms to Chinese word segmentation and association analysis. Finally, taking the clustering of Chinese documents as an example and considering the needs of Chinese word segmentation, it designs the structure of a clustering system based on the suffix tree clustering algorithm.

Keywords: phrase, phrase string, suffix tree, association analysis, clustering

1. Introduction

Text is the most widely used form of information storage. Recent research shows that 80% of company information is contained in text documents, so text mining is considered to have higher commercial potential than data mining. Text mining is the process of extracting valuable knowledge that is valid, novel, useful, and understandable from the text files in which it is scattered, and of using this knowledge to organize information better.



The main research topics of text mining include association analysis, text classification, and text clustering. Association analysis first collects frequently occurring keywords, words, or phrases, and then finds the associations and relationships among them [1]. Here I divide association analysis into three levels: character, word, and phrase. Text classification assigns each document in a document set to one of a set of predefined topic categories. In this way, users can not only browse documents conveniently but also restrict the search scope, which makes finding documents easier and faster. The goal of text clustering is the same as that of text classification, but the method differs: text clustering is unsupervised machine learning. No classes are defined in advance. During clustering, similar documents are grouped together so that documents in the same group are as similar as possible and documents in different groups are as distinct as possible. The clustering criterion can be the attributes of the text or its content.

2. Suffix tree related concepts

2.1 Phrase

A phrase in this article is an ordered sequence of one or more words. A phrase may be of any length, but the sequence never crosses a phrase boundary. A phrase boundary is inserted between phrases whenever the document parser identifies a special syntactic mark. These marks can be punctuation marks (period '.', comma ',', semicolon ';', question mark '?', etc.) or HTML tags (for example, <p>, <br>, <li>, <TD>). The beginning and end of a document are also treated as phrase boundaries [2].

2.2 Phrase string

A phrase string consists of a phrase shared by at least two documents together with the set of all documents that contain the phrase. A maximal phrase string is one whose phrase cannot be extended by any word of the language without reducing the number of documents that contain it.
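The definitions in Sections 2.1 and 2.2 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: words are taken to be whitespace-separated (Chinese text would first need a segmenter), the function names `split_phrases` and `phrase_strings` and the `max_len` cap are hypothetical, and shared phrases are found by brute-force enumeration rather than with a suffix tree.

```python
import re
from collections import defaultdict

# Phrase boundaries from Section 2.1: punctuation marks and HTML tags.
# The start and end of the document act as boundaries implicitly.
BOUNDARY = re.compile(r"<[^>]+>|[.,;?!]")

def split_phrases(document):
    """Split a document into word sequences that cross no phrase boundary."""
    segments = BOUNDARY.split(document)
    return [seg.split() for seg in segments if seg.split()]

def phrase_strings(documents, max_len=4):
    """Return {phrase: set of document indices} for phrases shared by >= 2 documents.

    A phrase is a tuple of up to max_len consecutive words (max_len is an
    illustrative cap, not part of the definition).
    """
    containing = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for words in split_phrases(doc):
            for start in range(len(words)):
                stop = min(start + max_len, len(words))
                for end in range(start + 1, stop + 1):
                    containing[tuple(words[start:end])].add(doc_id)
    return {p: ids for p, ids in containing.items() if len(ids) >= 2}

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
for phrase, ids in sorted(phrase_strings(docs).items()):
    print(" ".join(phrase), sorted(ids))
```

Here "cat ate" is kept because documents 0 and 2 share it, while "cheese too" is dropped because only one document contains it.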
2.3 Suffix tree

A suffix tree is a data structure that supports efficient string matching and querying. A suffix tree T for a string S of m words is a directed tree with a root node and exactly m leaves, labeled 1 to m. Each internal node other than the root has at least two children, and each edge is labeled with a non-empty substring of S. No two edges leaving the same node have labels that begin with the same word. The key property of the suffix tree is that, for any leaf i, concatenating the edge labels on the path from the root to that leaf spells out the suffix of S that starts at position i, that is, S[i..m]. The label of a node is defined as the concatenation of the labels of all edges on the path from the root to that node.

Figure 1 shows the suffix tree of the string "I know you know I know". Internal nodes are drawn as circles and leaves as rectangles. In this example there are six leaves, labeled 1 to 6. The terminating character is omitted from the figure.

[Figure 1. Suffix tree of the string "I know you know I know"]

Similarly, a suffix tree built from several strings is called an extended suffix tree. Given n strings S_1, ..., S_n, where S_k has length m_k, the extended suffix tree T is a directed tree with a root node and exactly m_1 + ... + m_n leaves. Each leaf is labeled with a pair (k, l), where k ranges from 1 to n and l ranges from 1 to m_k. Each internal node other than the root has at least two children, and each edge is labeled with a non-empty substring of words from one of the strings. No two edges leaving the same node have labels that begin with the same word. For any leaf (i, j), concatenating the edge labels on the path from the root to that leaf spells out the suffix of S_i that starts at position j, that is, S_i[j..m_i].

3. Suffix tree and algorithm features

The suffix tree treats a document as a string composed of phrases, rather than as a set of words [3].
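To make the word-level view concrete, the following Python sketch builds the suffix tree for "I know you know I know" and checks the leaf property from Section 2.3. It is a minimal sketch, not a definitive implementation: it inserts one suffix at a time in the manner Section 4 describes, which is quadratic in the worst case rather than linear, and the names `Node`, `build_suffix_tree`, `spell_leaves`, and the "$" terminator are assumptions made for the example.

```python
class Node:
    def __init__(self):
        self.edges = {}   # first word of edge label -> (label tuple, child Node)
        self.leaf = None  # 1-based start position of the suffix (leaves only)

def build_suffix_tree(words, terminator="$"):
    """Insert the suffixes of `words` one by one (naive construction).

    Assumes `terminator` does not occur in `words`; it guarantees that
    every suffix ends at its own leaf.
    """
    seq = tuple(words) + (terminator,)
    root = Node()
    for i in range(len(words)):
        node, rest = root, seq[i:]
        while True:
            first = rest[0]
            if first not in node.edges:
                # No edge starts with this word: hang a new leaf here.
                leaf = Node()
                leaf.leaf = i + 1
                node.edges[first] = (rest, leaf)
                break
            label, child = node.edges[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch in the middle of an edge: split it with a new node.
                mid = Node()
                mid.edges[label[k]] = (label[k:], child)
                node.edges[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root

def spell_leaves(node, prefix=()):
    """Yield (leaf number, concatenated edge labels from the root)."""
    if node.leaf is not None:
        yield node.leaf, prefix
    for label, child in node.edges.values():
        yield from spell_leaves(child, prefix + label)

words = "I know you know I know".split()
root = build_suffix_tree(words)
for number, path in sorted(spell_leaves(root)):
    print(number, " ".join(path))
```

Each printed line shows a leaf number i followed by the suffix that starts at position i (plus the terminator), which is the property that defines the tree.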
As a novel, incremental, linear-time method, the suffix tree algorithm generates a compact data structure and saves a great deal of storage space. It is well suited to basic string problems such as finding the longest repeated substring [4], approximate string matching, string comparison, text compression, and English document clustering.

4. Algorithm construction

To build the suffix tree of a string S of length m, first add the suffix S[1..m] to the tree as a single edge. Then add the suffixes S[i..m] to the growing tree one by one, for i from 2 to m. The details are as follows:

1. Let N_i denote the intermediate tree that encodes all suffixes from 1 to i.
2. The tree N_1 consists of a single edge from the root to a leaf labeled 1. This edge is labeled with the string S.
3. The tree N_{i+1} is built from N_i as follows:
3.1 Starting from the root of N_i, the algorithm finds the longest path whose label matches a prefix of the suffix S[i+1..m]. The path is found by comparing the words of S[i+1..m] one by one with the words on the unique matching path from the root, until no further match is possible.
3.2 When no deeper match is possible, the algorithm is either at a node, called w, or in the middle of an edge, called (u, v). If it is in the middle of an edge, it inserts a new node w that splits (u, v) into two edges.
3.3 The new node w is placed immediately after the last word on the edge that matched a word of the suffix S[i+1..m].
3.4 In both cases (w already existed or was newly created), the algorithm creates a new edge (w, i+1) that runs from w to a new leaf labeled i+1, and labels this edge with the unmatched part of the suffix S[i+1..m].

5. Application discussion

5.1 Chinese word segmentation

[Figure 2. Word segmentation process: document --> generate suffix tree --> extract high-frequency words --> filter out stop words and words already in the word table --> collect and review new words --> add to the word table]
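The flow in Figure 2 can be sketched in Python as follows. This is an illustration under assumptions, not the program used in the experiment: it counts character n-grams directly instead of reading frequencies off a suffix tree, the function name `candidate_new_words` and the `max_len` cap are hypothetical, and the defaults mirror the thresholds used in the test in this section (frequency greater than 3, length greater than 1).

```python
import re
from collections import Counter

def candidate_new_words(text, vocabulary, stop_words,
                        min_freq=4, min_len=2, max_len=4):
    """Return {string: frequency} for frequent strings not yet in the word table.

    Simplification: frequencies come from plain n-gram counting, which
    stands in here for the suffix tree traversal.
    """
    counts = Counter()
    # Punctuation acts as a boundary: count only inside runs of word characters.
    for run in re.findall(r"\w+", text):
        for n in range(min_len, max_len + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return {
        s: c for s, c in counts.items()
        if c >= min_freq and s not in vocabulary and s not in stop_words
    }

text = "后缀树算法很好。后缀树很快。后缀树节省空间。后缀树是一种结构。"
word_table = {"后缀"}
new_words = candidate_new_words(text, word_table, stop_words=set())
print(new_words)           # candidates still need manual review
word_table |= set(new_words)  # after review, add accepted words to the word table
```

The candidates include fragments such as "缀树" alongside the real word "后缀树", which is why the flow keeps a review step before anything is added to the word table.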

 


The main purpose of applying the suffix tree algorithm to Chinese word segmentation is to handle unregistered words, that is, words missing from the word table. A word table cannot contain all words. On the one hand, language is constantly evolving and new words keep emerging. On the other hand, derived words are very common and there is no need to include every derivative in the dictionary. In particular, proper nouns such as personal names and place names account for a high frequency and proportion of the text [5]. In addition, segmentation errors caused by unregistered words are often more serious than other segmentation errors. The word segmentation system must therefore be able to recognize unregistered words in order to improve segmentation accuracy and provide a solid foundation for further processing of Chinese information.

This segmentation algorithm must be combined with a segmentation algorithm based on a word table. Its advantage is that it recognizes unregistered new words automatically and quickly, given a word frequency threshold. The segmentation criteria are word frequency and word length. I used the suffix tree program together with an existing word-table-based segmentation program to run a segmentation test on a large number of texts. In this test, the word frequency had to be greater than 3 and the word length greater than 1. The process is shown in Figure 2: first discover all high-frequency words, then remove stop words and words that already exist in the word table. The remaining words are mostly unregistered new words, which are then reviewed and added to the existing word table. The experiment shows that this approach is feasible.

5.2 Association analysis

Suffix tree algorithms can also be used for association analysis.
In the past, association analysis was usually based on characters or words, whereas suffix tree algorithms make association analysis based on phrases possible. I used the suffix tree program to test a suffix tree generated from multiple documents: phrases that appear together in several documents are used to find the topics that were discussed frequently in a given period. This is also the basis for merging phrase strings in the clustering module.

5.3 Cluster system design

5.3.1 System functions

• Clustering without predefined categories. The system divides documents into categories dynamically, based on the content of all the documents to be clustered. As the number and content of the documents change, categories of different numbers and topics may be generated.
• Knowledge discovery. Knowledge discovery can be realized through the interaction between phrase strings and the word table. It helps the viewer discover previously unknown connections and the aspects that a topic covers, which provides new ideas for decision-making on that topic.

5.3.2 Overall structure design

[Figure 3. Overall structure. Components: topic searcher, pre-processor, clustering tool, knowledge manager, browser]

The overall structure is: topic searcher --> pre-processor --> knowledge manager --> clustering tool --> browser. The components are described below.

• Topic searcher. The searcher mainly uses a web crawler to collect topic-related web pages. It can use the I-know topic search engine [6] of Wanfang Data.

• Pre-processor. Word segmentation has two parts: segmentation based on the word table and segmentation of words that are not in the word table. The former can use the word segmentation program of DM. The latter should be implemented with the suffix tree algorithm, as described above. In addition to word segmentation, preprocessing removes unnecessary punctuation marks, stop words, and XML and HTML tags.

• Clustering tool. There are three modules: suffix tree generation module --> basic phrase string extraction module --> phrase string merging module. The suffix tree generation module loads the pre-processed documents and the suffix tree structure into memory. The basic phrase string extraction module then extracts all phrase strings and filters their phrases, keeping the phrase strings that contain words from the word table and increasing their weight. Because the word table has already been expanded, the remaining phrase strings can be treated as invalid or handed to the administrator (when a phrase string consists of non-Chinese content such as formulas, it can still be added to the knowledge base). Then, based on Formula 1, the phrase string merging module merges the phrase strings whose document overlap rate reaches 50% to form the final clustering result.

$$\mathrm{Sim}[\mathrm{cluster}(i), \mathrm{cluster}(j)] = \frac{\mathrm{docnum}[\mathrm{phrase}(i) \cap \mathrm{phrase}(j)]}{\mathrm{docnum}[\mathrm{phrase}(i)] + \mathrm{docnum}[\mathrm{phrase}(j)]} \qquad \text{(Formula 1)}$$
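Formula 1 and the 50% merging rule can be sketched in Python as below. This is a sketch under assumptions rather than the system's actual code: the numerator is read as the number of documents that contain both phrases, each phrase string is represented as a set of document identifiers, and grouping linked phrase strings into connected components is borrowed from the STC method cited as [2]. The names `similarity` and `merge_phrase_strings` are hypothetical.

```python
from itertools import combinations

def similarity(docs_i, docs_j):
    """Formula 1: documents containing both phrases / sum of the two counts."""
    return len(docs_i & docs_j) / (len(docs_i) + len(docs_j))

def merge_phrase_strings(phrase_docs, threshold=0.5):
    """Merge phrase strings whose similarity reaches the threshold.

    phrase_docs maps each phrase to the set of documents containing it.
    Returns a list of (set of phrases, set of documents) clusters.
    """
    phrases = list(phrase_docs)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        if similarity(phrase_docs[a], phrase_docs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in phrases:
        members, docs = clusters.setdefault(find(p), (set(), set()))
        members.add(p)
        docs |= phrase_docs[p]
    return list(clusters.values())

phrase_docs = {
    "suffix tree": {1, 2, 3},
    "tree clustering": {1, 2, 3},
    "word table": {3, 4},
}
for members, docs in merge_phrase_strings(phrase_docs):
    print(sorted(members), sorted(docs))
```

Under this reading of Formula 1 the value can never exceed 0.5, and it equals 0.5 only when the two document sets are identical, so `threshold` is left as a parameter for experimenting with looser settings.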

In Formula 1, Sim[cluster(i), cluster(j)] is the document overlap rate, cluster(i) is the phrase string containing phrase i, cluster(j) is the phrase string containing phrase j, docnum[phrase(i)] is the number of documents containing phrase i, and docnum[phrase(j)] is the number of documents containing phrase j. The formula as a whole is the ratio of the number of documents that contain both phrase i and phrase j to the sum of the two individual document counts.

• Knowledge manager. The word-table-based knowledge manager is in direct contact with the pre-processor, the clustering tool, and the browser. It is responsible for collecting and reviewing new words during word segmentation, for judging the validity of phrase strings during clustering, and for setting the parameters of segmentation and clustering, such as the minimum word length and the word frequency range during segmentation, and the number of documents in a phrase string and the effective phrase length during clustering.

• Browser. Used for browsing results, including the category tree, summaries, and detail pages.

5.3.3 Experimental test

The experimental corpus comes from the Chinese dissertation database of the National Science and Technology Library (http://www.nstl.gov.cn). Because this database is organized strictly according to the library classification, and the subject of each dissertation accurately summarizes its topic, the dissertation subjects are used as the corpus in this experiment. The results obtained by clustering the dissertation records with the STC algorithm agree closely with the manual classification of the dissertations according to the library classification.
Measured by the document coverage rate, the agreement reaches 86%, which supports the validity and correctness of clustering with the STC algorithm.

6. Conclusion

As a novel method, the suffix tree and its algorithms can efficiently handle strings and the clustering of English documents. Because of these advantages, I applied it to Chinese word segmentation, association analysis, and Chinese document clustering in text mining, and obtained initial results under practical conditions. This provides a new idea for future research in Chinese text mining.

References:
1. Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. China Machine Press, 2001.
2. Oren Eli Zamir. Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. University of Washington, 1999.
3. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Chapter 6. Cambridge University Press, 1997.
4. Mark Nelson. Fast String Searching with Suffix Trees. Dr. Dobb's Journal, August 1996.
5. Shi Zhongzhi. Knowledge Discovery. Tsinghua University Press, 2002.
6. Wang Shenghai. Design and Implementation of a Network Intelligent Knowledge Service System. Document Information Center, Chinese Academy of Sciences, 2001.
