When we looked at indexing in Lucene, we met the concept of the analyzer (word breaker). Word segmentation is also an important step in information retrieval. In English, words are naturally separated by spaces, so splitting text into words is easy. A Chinese sentence, by contrast, is a continuous run of characters with no separators. A single character often does not carry enough information about the sentence, and the word can be regarded as the smallest unit of meaning. Searches usually match on words, so words play an important role in processing the whole text.
1. Word segmentation based on string matching (mechanical word segmentation, dictionary-based word segmentation)
This approach follows some strategy to match the string of Chinese characters being analyzed against the entries of a "sufficiently large" machine dictionary. If a string is found in the dictionary, the match succeeds and a word is identified. The method has three elements: the segmentation dictionary, the text scanning order, and the matching principle. The scanning order can be forward, reverse, or bidirectional. The matching principles are mainly maximum matching, minimum matching, word-by-word matching, and best matching.
Maximum matching method (MM). The basic idea: suppose the longest entry in the segmentation dictionary contains I Chinese characters. Take the first I characters of the unprocessed part of the text as the matching field and look it up in the dictionary. If the dictionary contains the field, the match succeeds and the field is cut off as a word. If not, the match fails; the last character of the field is dropped and the remaining characters form a new matching field, and so on until a match succeeds. Reported statistics put the error rate of this method at 1/169.
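The MM procedure can be sketched in a few lines of Python. This is only an illustration: the function name forward_max_match and the toy dictionary are made up for the example, and the rule that a lone character is always accepted as a word is an assumption added so that the scan can always advance.

```python
def forward_max_match(text, dictionary):
    """Segment text by forward maximum matching (MM)."""
    # I in the description above: length of the longest dictionary entry.
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Start with the longest field and drop the last character on each failure.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            field = text[pos:pos + size]
            # Assumption: a single character is accepted even if it is not listed.
            if size == 1 or field in dictionary:
                words.append(field)
                pos += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
print(forward_max_match("研究生命的起源", toy_dict))
# → ['研究生', '命', '的', '起源']
```

The greedy choice of "研究生" at the start leaves "命" stranded, which is a typical MM error on this sentence.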
Reverse maximum matching method (RMM). The segmentation process is the same as in MM, except that scanning starts from the end of the sentence (or article), and each time a match fails the first character of the matching field is dropped. Reported statistics put the error rate of this method at 1/245.
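A minimal Python sketch of RMM follows. The function name reverse_max_match and the toy dictionary are invented for the example, and accepting an unlisted single character as a word is an assumption that keeps the scan moving.

```python
def reverse_max_match(text, dictionary):
    """Segment text by reverse maximum matching (RMM)."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    end = len(text)
    while end > 0:
        # Take the longest field ending at `end`; drop its first character on failure.
        for size in range(min(max_len, end), 0, -1):
            field = text[end - size:end]
            # Assumption: a single character is accepted even if it is not listed.
            if size == 1 or field in dictionary:
                words.append(field)
                end -= size
                break
    words.reverse()  # words were collected from the end of the text backwards
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
print(reverse_max_match("研究生命的起源", toy_dict))
# → ['研究', '生命', '的', '起源']
```

On this sentence the reverse scan finds "生命" before it can be split, so it avoids the "研究生 / 命" cut that a forward scan would make with the same dictionary.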
Word-by-word traversal method. The words of the dictionary are taken one at a time, from longest to shortest, and searched for throughout the text to be processed, until all words have been cut out. However large the dictionary and however small the text, the entire dictionary has to be matched against it.
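One possible reading of this procedure, sketched in Python: every dictionary word is tried against the whole text, longest first, and a position that has already been cut out is never claimed again. The function name, the tie-breaking order among words of equal length, and the handling of leftover characters as single-character words are all assumptions of this sketch.

```python
def wordwise_traversal(text, dictionary):
    """Cut out dictionary words from text, trying the longest words first."""
    # owner[i] is the start index of the word covering position i, or None.
    owner = [None] * len(text)
    # Longest first; ties are broken alphabetically so the result is repeatable.
    for word in sorted(dictionary, key=lambda w: (-len(w), w)):
        if not word:
            continue
        start = text.find(word)
        while start != -1:
            span = range(start, start + len(word))
            if all(owner[i] is None for i in span):
                for i in span:
                    owner[i] = start
                start = text.find(word, start + len(word))
            else:
                start = text.find(word, start + 1)
    # Rebuild the word list; unclaimed characters become one-character words.
    words, pos = [], 0
    while pos < len(text):
        end = pos + 1
        while owner[pos] is not None and end < len(text) and owner[end] == owner[pos]:
            end += 1
        words.append(text[pos:end])
        pos = end
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
print(wordwise_traversal("研究生命的起源", toy_dict))
# → ['研究生', '命', '的', '起源']
```

The outer loop visits every dictionary entry regardless of how short the text is, which is exactly the cost the description points out.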
Segmentation mark method. Segmentation marks are either natural or non-natural. Natural marks are the non-text symbols that appear in the article, such as punctuation. Non-natural marks are affixes and words that do not combine into other words (including monosyllabic words, polysyllabic words, and onomatopoeia). The method first collects a large number of segmentation marks. During segmentation it first locates the marks, cuts the sentence into shorter fields, and then processes each field in detail with MM, RMM, or another method. It is not a true segmentation method but a preprocessing step for automatic segmentation: it spends extra time scanning for marks and extra storage on the non-natural marks.
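As a rough sketch of this preprocessing step, the Python below cuts the text at natural marks only and hands each field to a small forward maximum matching routine. The set of punctuation marks, the function names, and the toy dictionary are assumptions for the example; non-natural marks are not handled here.

```python
import re

# Natural segmentation marks: a small illustrative set of Chinese punctuation.
NATURAL_MARKS = "，。！？；：、"


def split_on_marks(text, marks=NATURAL_MARKS):
    """Cut text into shorter fields at every natural segmentation mark."""
    pattern = "[" + re.escape(marks) + "]"
    return [field for field in re.split(pattern, text) if field]


def forward_max_match(text, dictionary):
    """Fine processing of one field by forward maximum matching."""
    max_len = max((len(word) for word in dictionary), default=1)
    words, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            field = text[pos:pos + size]
            # Assumption: an unlisted single character is kept as a word.
            if size == 1 or field in dictionary:
                words.append(field)
                pos += size
                break
    return words


def segment_with_marks(text, dictionary):
    """Split at the marks first, then segment each field separately."""
    words = []
    for field in split_on_marks(text):
        words.extend(forward_max_match(field, dictionary))
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"来到", "北京", "天气", "很好"}
print(segment_with_marks("我来到北京，天气很好。", toy_dict))
# → ['我', '来到', '北京', '天气', '很好']
```

Because each field is segmented on its own, no match can ever straddle a punctuation mark.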
Best matching method (OM). This method comes in a forward and a reverse variant. Its starting point is to order the dictionary entries by word frequency so that dictionary lookups take less time, which lowers the time complexity of segmentation and speeds it up. In essence it is not a segmentation method in its own right but a way of organizing the segmentation dictionary. An OM dictionary must store a data item giving the length in front of each entry, so its space complexity increases. It does nothing to improve segmentation accuracy; it only reduces processing time.
From the algorithms above, the advantages and disadvantages of string-matching segmentation are easy to see:
Advantages: simple and easy to implement.
Disadvantages: 1) matching is slow; 2) overlapping (intersection-type) and combinatorial ambiguities arise; 3) the word itself has no standard definition, and there is no unified standard word list; 4) different dictionaries produce different ambiguities; 5) there is no capacity for self-learning.
2. Word segmentation based on understanding
This method, also called segmentation based on artificial intelligence, performs syntactic and semantic analysis at the same time as segmentation and uses the syntactic and semantic information to resolve ambiguity. It usually has three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control component. Coordinated by the control component, the segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to judge ambiguous segmentations. In other words, it simulates how a person understands a sentence. This requires a great deal of linguistic knowledge and information. Current understanding-based methods are mainly the expert system method and the neural network method. Because Chinese linguistic knowledge is so broad and complex, it is hard to organize it into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.
Expert system segmentation method. Following the expert system approach, the knowledge needed for segmentation (common-sense segmentation knowledge and heuristic disambiguation knowledge, that is, rules for ambiguous segmentation) is kept separate from the inference engine that carries out the segmentation. Maintaining the knowledge base and implementing the inference engine then do not interfere with each other, which makes the knowledge base easy to maintain and manage. The method can also detect overlapping ambiguity fields and combinatorial ambiguity fields, and it has some capacity for self-learning.
Neural network segmentation method. This method simulates the parallel, distributed processing of the human brain and builds a numerical computation model of it. Segmentation knowledge is stored implicitly, spread across the network. Through self-learning and training the network adjusts its internal weights so as to reach correct segmentations, and finally outputs its automatic segmentation result.
Combined neural network and expert system method. This method first runs the neural network to segment the text. When the network cannot segment a new word accurately, the expert system is activated to analyze and judge: it reasons from the knowledge base, produces a preliminary analysis, and triggers the learning mechanism to train the network. The method draws on the strengths of both the neural network and the expert system and further improves segmentation efficiency.
3. Word segmentation based on statistics
The main idea: a word is a stable combination of characters, so the more often adjacent characters occur together in context, the more likely they are to form a word. The frequency or probability with which characters occur next to each other therefore reflects well how credible a candidate word is. One can count how often each combination of adjacent characters occurs in a training text and compute the mutual information between them. Mutual information expresses how tightly two Chinese characters are bound together. When it exceeds a certain threshold, the character group can be taken to form a word. This method is also called dictionary-free segmentation.
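To make the computation concrete, here is a small Python sketch that scores every adjacent character pair with pointwise mutual information, MI(x, y) = log2( P(xy) / (P(x) P(y)) ), and keeps the pairs above a threshold. The function names, the four-sentence corpus, the threshold value, and the min_count filter (which stops very rare pairs from scoring high by accident) are all assumptions of this example.

```python
import math
from collections import Counter


def pair_statistics(corpus):
    """Return (mutual information per adjacent pair, raw pair counts)."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for (x, y), count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[x] / n_chars
        p_y = chars[y] / n_chars
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    counts = {x + y: count for (x, y), count in pairs.items()}
    return scores, counts


def candidate_words(corpus, threshold, min_count=2):
    """Adjacent pairs whose mutual information reaches the threshold."""
    scores, counts = pair_statistics(corpus)
    return sorted(
        pair for pair, score in scores.items()
        if score >= threshold and counts[pair] >= min_count
    )


# Hypothetical miniature training text, for illustration only.
corpus = ["北京很大", "北京很美", "我很爱北京", "他很好"]
print(candidate_words(corpus, threshold=2.0))
# → ['北京']
```

In this corpus "京很" also occurs twice, but "很" appears in many other contexts, so the pair scores lower than "北京" and falls below the threshold.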
The main statistical models used are the N-gram model, the hidden Markov model, and the maximum entropy model. In practice, statistical segmentation is generally combined with dictionary-based segmentation. This keeps the speed and efficiency of matching while using the dictionary-free method's sensitivity to context to recognize new words and resolve ambiguity automatically.
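One simple way to combine a dictionary with statistics is a unigram (1-gram) model: among all segmentations the dictionary allows, pick the one whose words have the highest joint probability. The Python sketch below does this with dynamic programming. It is an illustration under stated assumptions, not a description of any particular system: the word frequencies are invented, the dictionary is assumed to be non-empty, and an unlisted single character is given a small made-up probability so that every sentence can be segmented.

```python
import math


def best_segmentation(text, word_freq):
    """Most probable segmentation under a unigram model over word_freq."""
    total = sum(word_freq.values())
    max_len = max(len(word) for word in word_freq)
    # Assumption: unlisted single characters get a small smoothed probability.
    unknown_logp = math.log(0.5 / total)
    # best[i] = (best log-probability of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_freq:
                logp = math.log(word_freq[word] / total)
            elif end - start == 1:
                logp = unknown_logp
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the text.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical word frequencies, for illustration only.
word_freq = {"研究": 50, "研究生": 10, "生命": 40, "命": 2, "的": 200, "起源": 20}
print(best_segmentation("研究生命的起源", word_freq))
# → ['研究', '生命', '的', '起源']
```

With these counts the path through "研究" and "生命" is far more probable than the path through "研究生" and "命", so the statistics resolve an ambiguity that the dictionary alone cannot.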
4. Word segmentation based on semantics
Semantic segmentation introduces semantic analysis and makes fuller use of the linguistic information in natural language itself. Examples include the augmented transition network method, knowledge-based semantic analysis segmentation, the adjacency constraint method, the comprehensive matching method, the suffix segmentation method, the feature dictionary method, the matrix constraint method, and the syntactic analysis method.
Augmented transition network method. This method builds on the concept of the finite-state machine. A finite-state machine can only recognize regular languages; its first extension gives it recursion, forming a recursive transition network (RTN). In an RTN, the label on an arc can be a terminal symbol (a word of the language), a non-terminal symbol (a part of speech), or the name of another subnetwork standing for a non-terminal (such as the conditions under which characters or strings form a word). While running one subnetwork, the computer can therefore call another subnetwork, including recursively. Using augmented transition networks at the lexical level makes it possible for segmentation to interact with the syntactic stage of language understanding, which resolves Chinese segmentation ambiguity effectively.
Matrix constraint method. The basic idea is to build a syntactic constraint matrix and a semantic constraint matrix. Their elements state, respectively, whether a word of one part of speech may grammatically stand next to a word of another part of speech, and whether a word of one semantic class may logically stand next to a word of another semantic class. The machine uses these matrices to constrain the segmentation result.
Word segmentation faces several difficult problems:
1. Ambiguity
Handling ambiguous segmentation fields. A Chinese sentence is written as a continuous string of characters. Because ambiguity is possible, segmentation is not simply a matter of finding legal words in the input string. One sentence often corresponds to several legal word sequences, so an important problem in Chinese segmentation is choosing the correct one among all the possible sequences. Ambiguous segmentation is unavoidable in automatic segmentation and is one of its thorniest problems. How well a system handles ambiguous fields strongly affects its accuracy. Practice shows that mechanical matching alone cannot reach high accuracy: it may satisfy undemanding applications, but not the requirements of high-standard Chinese processing.
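The claim that one sentence can correspond to several legal word sequences is easy to check on a small case. The Python sketch below enumerates every way of covering a sentence with dictionary words. The function name and the toy dictionary are invented for the example, and the recursion is exponential in the worst case, so it is only suitable for short demonstrations.

```python
def all_segmentations(text, dictionary):
    """Every sequence of dictionary words that exactly covers text."""
    if not text:
        return [[]]
    results = []
    for size in range(1, len(text) + 1):
        head = text[:size]
        if head in dictionary:
            for rest in all_segmentations(text[size:], dictionary):
                results.append([head] + rest)
    return results


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
for words in all_segmentations("研究生命的起源", toy_dict):
    print(" / ".join(words))
# 研究 / 生命 / 的 / 起源
# 研究生 / 命 / 的 / 起源
```

Both sequences consist entirely of legal words; the character "生" can attach to the left ("研究生") or to the right ("生命"), which is the overlapping kind of ambiguity.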
2. Recognition of unknown (out-of-vocabulary) words
Recognizing unknown words. Unknown (out-of-vocabulary) words include Chinese and foreign personal names, Chinese place names, organization names, event names, currency names, abbreviations, derived words, technical terms from various fields, and newly coined words that are still emerging and becoming established. They form a large class of words that varies widely in kind, form, and composition. Identifying them automatically is very difficult.
3. Synonyms
As an important unit of meaning, a word cannot be identified by its characters alone; it should be treated as a living unit whose real linguistic value lies in what it means. Many everyday words have the same or similar meanings, and a good segmentation method should treat them as equivalent. For example, a search for "computer" should also find content that uses a synonym for it.
4. Contextual issues
The meaning of a word depends on the linguistic context in which it appears; as the saying goes, "a dog's mouth cannot spit out ivory."
Automatic segmentation systems are generally evaluated and compared on the following criteria:
1. Segmentation accuracy
2. Segmentation speed
3. Completeness of functionality
4. Ease of extension and maintenance
5. Portability
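For the first criterion, one common way to put a number on accuracy is to compare the word boundaries a system produces with a hand-segmented reference and report precision, recall, and F1. The Python sketch below does that; the function names and the example sentences are assumptions, and the text itself does not prescribe a particular metric.

```python
def to_spans(words):
    """Turn a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def score_segmentation(predicted, reference):
    """Precision, recall and F1 of predicted words against a reference."""
    pred_spans, ref_spans = to_spans(predicted), to_spans(reference)
    correct = len(pred_spans & ref_spans)  # words with both boundaries right
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(ref_spans) if ref_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical system output and hand-made reference, for illustration only.
reference = ["研究", "生命", "的", "起源"]
predicted = ["研究生", "命", "的", "起源"]
print(score_segmentation(predicted, reference))
# → (0.5, 0.5, 0.5)
```

Only "的" and "起源" have both boundaries in the right place, so two of the four predicted words count as correct.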