1. Why Chinese Word Segmentation?
A word is the smallest meaningful unit of language that can be used independently. English uses spaces as natural delimiters between words, whereas Chinese is written with the character as the basic unit, and nothing visibly separates one word from the next. Chinese word segmentation is therefore the foundation of, and the key to, Chinese information processing.
Lucene handles Chinese by automatically splitting text into single characters (unigram segmentation) or into two-character units (bigram segmentation). Other approaches include maximum segmentation (forward, reverse, and a combination of the two), minimum segmentation, full segmentation, and so on.
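To make the two simplest strategies concrete, here is a minimal Python sketch of single-character and two-character segmentation. It is not Lucene's implementation; the function names are made up for illustration, and whitespace is simply dropped.

```python
def unigram_segment(text):
    """Split text into single characters, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def bigram_segment(text):
    """Split text into overlapping two-character tokens, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        # A single leftover character becomes its own token.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(unigram_segment("中文分词"))  # ['中', '文', '分', '词']
print(bigram_segment("中文分词"))   # ['中文', '文分', '分词']
```

Neither strategy needs a dictionary, which is why both are easy to apply by default, but neither produces real words: "文分" above is not a word.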
2. Classification of Chinese Word Segmentation Techniques
The word segmentation algorithms discussed here fall into three categories: methods based on matching against a dictionary or lexicon, methods based on word frequency statistics, and methods based on knowledge and understanding.
The first category segments text using dictionary matching, Chinese lexical rules, or other knowledge of the Chinese language; examples are the maximum matching method and the minimum segmentation method. These methods are simple and efficient, but the complexity of Chinese, the incompleteness of dictionaries, and the difficulty of keeping rules consistent make them hard to apply to open-domain, large-scale text. The second category relies on statistical information about characters and words, such as information between adjacent characters, word frequencies, and co-occurrence counts. Because this information is gathered from real corpora, statistical methods tend to be more practical.
The following describes several common methods:
1). Word-by-word traversal
The word-by-word traversal method takes every word in the dictionary, in order from longest to shortest, and searches for it through the article character by character until the end of the article. In other words, no matter how short the document or how large the dictionary, the entire dictionary must be traversed. This is inefficient, so larger systems generally do not use it.
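The traversal method can be sketched in a few lines of Python. This is only one reading of the description above: the function name is hypothetical, characters not covered by any dictionary word are emitted as single-character tokens, and a position already claimed by a longer word is not reused.

```python
def traversal_segment(text, dictionary):
    """Search the text for every dictionary word, longest words first."""
    # owner[i] holds the (start, end) span of the word covering character i.
    owner = [None] * len(text)
    for word in sorted(dictionary, key=lambda w: (-len(w), w)):
        if not word:
            continue
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if all(o is None for o in owner[start:end]):
                for i in range(start, end):
                    owner[i] = (start, end)
                start = text.find(word, end)
            else:
                start = text.find(word, start + 1)
    # Read the tokens back out from left to right.
    tokens, i = [], 0
    while i < len(text):
        if owner[i] is None:
            tokens.append(text[i])
            i += 1
        else:
            start, end = owner[i]
            tokens.append(text[start:end])
            i = end
    return tokens


words = {"我们", "学习", "中文", "分词", "中文分词"}
print(traversal_segment("我们在学习中文分词", words))
# ['我们', '在', '学习', '中文分词']
```

The outer loop runs once per dictionary entry regardless of the text, which is exactly the cost the paragraph above warns about.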
2). Word segmentation based on dictionary and lexicon matching (mechanical word segmentation)
This approach matches the Chinese character string being analyzed against the entries of a "sufficiently large" machine dictionary. If a string is found in the dictionary, the match succeeds and a word has been identified. By scanning direction, the methods are divided into forward matching and reverse matching. By which length is given priority, they are divided into maximum (longest) matching and minimum (shortest) matching. By whether segmentation is combined with part-of-speech tagging, they are divided into pure segmentation methods and integrated methods that combine segmentation with tagging. The common methods are as follows:
(1) The Maximum Matching Method, usually abbreviated MM. The basic idea: suppose the longest word in the segmentation dictionary contains i Chinese characters. Take the first i characters of the current string in the document being processed as the matching field and look it up in the dictionary. If the dictionary contains such an i-character word, the match succeeds and the matching field is split off as a word. If not, the match fails, the last character of the matching field is removed, and the remaining string is matched again. This continues until a match succeeds, that is, until a word has been split off or the length of the remaining string is 0. That completes one round of matching; the next i-character string is then taken, and so on until the whole document has been scanned.
The algorithm is described as follows (lengths are counted in bytes; a Chinese character is assumed to occupy 2 bytes and any other character 1 byte):
(1) Initialize the current position counter to 0;
(2) Starting from the current position counter, take the next 2i bytes as the matching field, continuing until the end of the document;
(3) If the length of the matching field is not 0, look up a dictionary entry of the same length.
    If the match succeeds,
    then
        a) add the matching field to the word segmentation statistics table as a word;
        b) add the length of the matching field to the current position counter;
        c) go to step (2);
    otherwise
        a) if the last character of the matching field is a Chinese character,
           then
               ① remove the last character of the matching field;
               ② reduce the length of the matching field by 2;
           otherwise
               ① remove the last byte of the matching field;
               ② reduce the length of the matching field by 1;
        b) go to step (3);
Otherwise (the length of the matching field is 0)
    a) if the character at the current position is a Chinese character,
       then increase the current position counter by 2;
       otherwise increase the current position counter by 1;
    b) go to step (2).
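The steps above can be sketched in Python as follows. This is an illustrative version, not the original author's code: it works on Unicode characters rather than bytes, so the "2 bytes per Chinese character" bookkeeping disappears, and a character that matches nothing is emitted as a single-character token. The function name and the toy dictionary are assumptions.

```python
def forward_max_match(text, dictionary):
    """Forward maximum matching (MM) over a set of dictionary words."""
    # i in the description: the length of the longest dictionary word.
    max_len = max([len(word) for word in dictionary] + [1])
    tokens = []
    pos = 0  # current position counter
    while pos < len(text):
        # Take up to max_len characters as the matching field.
        size = min(max_len, len(text) - pos)
        # On failure, drop the last character and try again.
        while size > 1 and text[pos:pos + size] not in dictionary:
            size -= 1
        tokens.append(text[pos:pos + size])
        pos += size
    return tokens


words = {"硕士", "研究", "研究生", "生产", "硕士研究生"}
print(forward_max_match("硕士研究生产", words))  # ['硕士研究生', '产']
```

Because the longest candidate always wins, "硕士研究生" is taken first and the stranded "产" is left over, which illustrates how a purely greedy forward scan can go wrong.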
(2) The Reverse Maximum Matching Method, usually abbreviated RMM. Its basic principle is the same as that of MM; the differences are the direction of segmentation and the dictionary used. RMM scans from the end of the document. Each time it takes the last 2i bytes (an i-character string) as the matching field; if the match fails, it removes the first character of the matching field and tries again. Correspondingly, it uses a reverse dictionary, in which every entry is stored in reverse order. In practice, the document is first reversed to produce a reverse document, which is then processed with the forward maximum matching method against the reverse dictionary.
Because modifier-head constructions are common in Chinese, matching from back to front improves accuracy somewhat, so the reverse maximum matching method makes fewer errors than the forward one. Statistics show an error rate of 1/169 for forward maximum matching alone and 1/245 for reverse maximum matching alone. For example, for the string "硕士研究生产" (roughly "master's research and production"), forward maximum matching yields "硕士研究生/产" ("master's graduate student / produce"), whereas reverse maximum matching, scanning backward, yields the correct segmentation "硕士/研究/生产" ("master's / research / production").
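The "reverse document plus reverse dictionary" trick described for RMM can be sketched as below. This is a minimal illustration under assumptions: it works on Unicode characters rather than bytes, the function names and toy dictionary are made up, and the sample string 硕士研究生产 is the one this sketch takes the example above to refer to.

```python
def forward_max_match(text, dictionary):
    """Plain forward maximum matching, used here as the building block."""
    max_len = max([len(word) for word in dictionary] + [1])
    tokens, pos = [], 0
    while pos < len(text):
        size = min(max_len, len(text) - pos)
        while size > 1 and text[pos:pos + size] not in dictionary:
            size -= 1
        tokens.append(text[pos:pos + size])
        pos += size
    return tokens


def reverse_max_match(text, dictionary):
    """RMM: reverse the text and every entry, run MM, then undo the reversal."""
    reverse_dictionary = {word[::-1] for word in dictionary}
    reversed_tokens = forward_max_match(text[::-1], reverse_dictionary)
    # Restore both the token order and the characters inside each token.
    return [token[::-1] for token in reversed(reversed_tokens)]


words = {"硕士", "研究", "研究生", "生产", "硕士研究生"}
print(forward_max_match("硕士研究生产", words))  # ['硕士研究生', '产']
print(reverse_max_match("硕士研究生产", words))  # ['硕士', '研究', '生产']
```

With the same dictionary, only the scanning direction changes the outcome: starting from the end, "生产" is matched first, so no stray single character is left behind.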
Of course, maximum matching is a mechanical method driven by the segmentation dictionary. It cannot segment according to the semantics of the surrounding context and depends heavily on the dictionary, so some segmentation errors are unavoidable in practice. To improve accuracy, the forward and reverse maximum matching methods can be combined (the bidirectional matching method; see (4)).
(3) Minimum segmentation: minimize the number of words cut out of each sentence.
(4) Bidirectional matching: this combines the forward and reverse maximum matching methods. The document is first split into sentences at punctuation marks, and each sentence is then scanned and segmented with both methods. If the two methods produce the same result, the segmentation is considered correct; otherwise, the result is chosen by the minimum-set criterion.
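A hedged sketch of bidirectional matching follows. The text does not spell out the "minimum set" rule, so this sketch assumes a common reading: prefer the result with fewer words, then the one with fewer single-character tokens, and fall back to the reverse result on a tie. All function names, the punctuation set, and the toy dictionary are illustrative assumptions.

```python
import re


def fmm(text, dictionary, max_len):
    """Forward maximum matching."""
    tokens, pos = [], 0
    while pos < len(text):
        size = min(max_len, len(text) - pos)
        while size > 1 and text[pos:pos + size] not in dictionary:
            size -= 1
        tokens.append(text[pos:pos + size])
        pos += size
    return tokens


def rmm(text, dictionary, max_len):
    """Reverse maximum matching, scanning from the end of the text."""
    tokens, end = [], len(text)
    while end > 0:
        size = min(max_len, end)
        while size > 1 and text[end - size:end] not in dictionary:
            size -= 1
        tokens.append(text[end - size:end])
        end -= size
    tokens.reverse()
    return tokens


def bidirectional_match(sentence, dictionary):
    """Run both scans; if they disagree, apply the assumed minimum-set rule."""
    max_len = max([len(word) for word in dictionary] + [1])
    forward = fmm(sentence, dictionary, max_len)
    backward = rmm(sentence, dictionary, max_len)
    if forward == backward:
        return forward

    def cost(tokens):
        # Assumption: fewer words first, then fewer single-character tokens.
        return (len(tokens), sum(1 for t in tokens if len(t) == 1))

    return forward if cost(forward) < cost(backward) else backward


def segment_document(document, dictionary):
    """Split the document into sentences at punctuation, then segment each."""
    sentences = [s for s in re.split(r"[，。！？；,.!?;]", document) if s]
    return [bidirectional_match(s, dictionary) for s in sentences]


words = {"研究", "研究生", "生命", "的", "起源"}
print(segment_document("研究生命的起源。", words))
# [['研究', '生命', '的', '起源']]
```

Here the forward scan gives 研究生/命/的/起源 and the reverse scan gives 研究/生命/的/起源; both have four tokens, so the tie-break on single-character tokens selects the reverse result.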
3). Full segmentation and word segmentation based on word frequency statistics
Word segmentation based on word frequency statistics is a full segmentation method, so full segmentation needs to be explained first.
Full segmentation
Full segmentation requires obtaining every acceptable segmentation of the input sequence, whereas partial segmentation obtains only one or a few of them. Because partial segmentation ignores the other possibilities, a method built on it may miss the correct segmentation whatever disambiguation strategy it adopts, producing wrong or failed segmentations. A method built on full segmentation obtains all possible segmentations, which fundamentally rules out such omissions and overcomes this defect of partial segmentation.
Full segmentation algorithms obtain every possible segmentation, so both sentence coverage and word coverage are 100%. Nevertheless, they are not widely used in text processing, for the following reasons:
1) Full segmentation is only a prerequisite for correct segmentation. It has no ability to detect ambiguity by itself, so the correctness and completeness of the final result depend on a separate disambiguation step; if that step judges wrongly, the result is wrong too.
2) The number of segmentations grows exponentially with sentence length. On the one hand, this floods storage with useless data; on the other hand, once sentences reach a certain length, the sheer number of segmentations greatly reduces segmentation efficiency.
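Both points can be seen in a small Python sketch that enumerates every segmentation. It assumes that any single character is always an acceptable token in addition to the dictionary words; the function name and the toy dictionaries are illustrative.

```python
def full_segment(text, dictionary):
    """Return every segmentation of text into dictionary words or single characters."""
    results = []

    def extend(pos, prefix):
        if pos == len(text):
            results.append(prefix)
            return
        for end in range(pos + 1, len(text) + 1):
            piece = text[pos:end]
            # A single character is always allowed; longer pieces must be words.
            if end == pos + 1 or piece in dictionary:
                extend(end, prefix + [piece])

    extend(0, [])
    return results


for candidate in full_segment("研究生命", {"研究", "研究生", "生命"}):
    print("/".join(candidate))
# 研/究/生/命
# 研/究/生命
# 研究/生/命
# 研究/生命
# 研究生/命

# Worst case: if every substring is a word, n characters give 2**(n-1) results.
print(len(full_segment("aaaa", {"aa", "aaa", "aaaa"})))  # 8
```

The output contains the correct segmentation but gives no hint as to which one it is, and the candidate count doubles with each added character in the worst case.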
Word segmentation based on word frequency statistics:
This is a full segmentation method. Rather than relying on a dictionary, it counts how often any two characters occur together in the text; the more often a pair co-occurs, the more likely it is to be a word. It first splits out all candidate words that match the word list, then uses a statistical language model and a decision algorithm to determine the optimal segmentation. Its advantages are that it can detect all segmentation ambiguities and can extract new words easily.
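As a rough illustration of "candidate words plus a statistical model plus a decision algorithm", the sketch below scores candidates with a unigram model and picks the best path by dynamic programming. The text does not name a specific model, so the unigram model, the add-one smoothing, the function name, and the frequency counts (which are invented) are all assumptions.

```python
import math


def best_segmentation(text, word_freq, max_len=4):
    """Pick the segmentation with the highest unigram log-probability."""
    total = sum(word_freq.values())

    def log_prob(word):
        # Add-one smoothing so unseen single characters keep a small probability.
        return math.log((word_freq.get(word, 0) + 1) / (total + len(word_freq) + 1))

    n = len(text)
    # best[end] = (best score of text[:end], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Multi-character candidates must appear in the word list.
            if len(word) > 1 and word not in word_freq:
                continue
            score = best[start][0] + log_prob(word)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the words.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    tokens.reverse()
    return tokens


# Invented counts, for illustration only.
freq = {"研究": 50, "研究生": 10, "生命": 40, "的": 100, "起源": 20, "命": 1}
print(best_segmentation("研究生命的起源", freq))
# ['研究', '生命', '的', '起源']
```

The dynamic program considers every candidate path implicitly, so it keeps the coverage of full segmentation without materializing the exponential list of results.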
4). Knowledge-based word segmentation
This method is based mainly on syntactic and grammatical analysis, combined with semantic analysis, and delimits words by analyzing the information provided by the context. It usually consists of three parts: a word segmentation subsystem, a syntactic and semantic subsystem, and a general control part. Coordinated by the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to resolve segmentation ambiguity. The method tries to give the machine a human-like ability to understand, and it requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so broad and complex, it is difficult to organize it into a form that machines can read directly, so knowledge-based segmentation systems are still at the experimental stage.
5). A new word segmentation method
Parallel word segmentation: this method works by means of a pipeline that holds the segmentation lexicon. Comparison and matching proceed in steps, and at each step the words entering the pipeline can be compared with the corresponding dictionary entries simultaneously. Because many words are compared and matched at the same time, segmentation speed improves greatly. The method involves a multi-level internal code theory and a pipelined dictionary data structure. (For the detailed algorithm, see Wu Shengyuan's parallel word segmentation method.)
Common Chinese Word Segmentation Packages
1. Paoding Jieniu (庖丁解牛) word segmentation package, suitable for integration with Lucene. http://www.oschina.net/p/paoding
Paoding is a Chinese word segmentation component for search engines, developed in Java, that can be integrated into Lucene applications for the Internet and for enterprise intranets.
Paoding fills a gap in domestic open-source Chinese word segmentation components and aims to become the open-source component of choice for Chinese word segmentation on Internet sites. It pursues efficient segmentation and a good user experience.
Paoding is highly efficient and highly extensible. It is built around a metaphor, uses a fully object-oriented design, and is conceptually advanced.
High efficiency: on a personal machine with a PIII CPU and 1 GB of memory, it can accurately segment 1 million Chinese characters in one second.
It segments text using dictionary files with no limit on the number of entries, and allows vocabulary categories to be defined.
It can parse unknown words reasonably well.
2. LingPipe, an open-source Java toolkit for natural language processing. http://alias-i.com/lingpipe/
It is very powerful and, most importantly, its documentation is extremely detailed; the reference papers for each model are even listed. It is not only easy to use but also well suited to learning about the models.
Features: topic classification, named entity recognition, part-of-speech tagging, sentence detection, query spell checking, interesting phrase detection, clustering, character language modeling, MEDLINE download, parsing and indexing, database text mining, Chinese word segmentation, sentiment analysis, and language identification.
3. JE word segmentation package
4. libmmseg http://www.oschina.net/p/libmmseg
Developed in C++ and supporting both Linux and Windows. As of the current version (0.7.1), the segmentation speed is approximately 300 K/s (PM-1.2G).
libmmseg has not been carefully optimized for speed, so further improvement in segmentation speed is still possible.
5. IKAnalyzer http://www.oschina.net/p/ikanalyzer
IKAnalyzer is developed against the Lucene 2.0 API and implements a forward/reverse full segmentation algorithm based on dictionary segmentation. It is an implementation of the Lucene Analyzer interface.
The algorithm suits the search habits of Internet users and enterprise knowledge base retrieval. Users can search for articles using any Chinese word contained in a sentence, for example finding articles that contain "人民币" (RMB) by searching for "人民" (people). This matches the way most users think about search.
It is not suitable for knowledge mining or web crawler technology. Full segmentation easily causes semantic ambiguity there, because "人民" (people) and "人民币" (RMB) are semantically unrelated.
6. phpcws http://www.oschina.net/p/phpcws
phpcws is an open-source Chinese word segmentation extension for PHP; it currently supports only Linux/Unix systems.
phpcws first performs an initial segmentation through the API of the "ICTCLAS 3.0 shared-version Chinese word segmentation algorithm", then applies its own "reverse maximum matching algorithm" to merge the resulting segments and words, and adds punctuation filtering to produce the final segmentation.
ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is a Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences on the basis of a multi-layer hidden Markov model. Its main functions are Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition, and it also supports user dictionaries. ICTCLAS has been refined over five years and six upgrades and is now at ICTCLAS 3.0, with a segmentation accuracy of 98.45% and less than 3 MB of compressed dictionary data. ICTCLAS took first place in the evaluation organized by China's 973 expert group and won first places in the first evaluation organized by SIGHAN, the international Chinese language processing research body; it has been described as the best Chinese lexical analyzer in the world.
The commercial version of ICTCLAS 3.0 is paid, and the free shared version is not open source. Its lexicon is derived from one month of People's Daily text, so many words are missing. For that reason, I apply a reverse maximum matching algorithm to the ICTCLAS output: using a custom dictionary of 90,000 words that I added (none of which duplicate entries in the ICTCLAS lexicon), I merge the ICTCLAS segments and output the final result.
Because the ICTCLAS 3.0 shared version supports only GBK encoding, a UTF-8 string should first be converted to GBK with PHP's iconv function, then segmented with the phpcws_split function, and finally converted back to UTF-8.
7. KTDictSeg, a simple, fast, and accurate open-source Chinese word segmentation component written in C# for .NET (its segmentation algorithm is also good). http://www.cnblogs.com/eaglet/archive/2007/05/24/758833.html