http://langiner.blog.51cto.com/1989264/379166
Original work. Reprinting is permitted, but please be sure to indicate the original source of the article, the author information, and this statement in the form of a hyperlink. Otherwise, legal liability will be pursued. http://langiner.blog.51cto.com/1989264/379166
Word Segmentation with Conditional Random Fields (CRF), Part 1
Langiner
Discriminative machine learning techniques can be used to solve the word segmentation problem. Their main representatives are conditional random fields, maximum entropy / maximum entropy Markov models, perceptrons, and support vector machines. Their similarities and differences will be discussed when the opportunity arises; today's topic is mainly the use of conditional random fields to solve the word segmentation problem.
Conditional random fields were proposed by John Lafferty and applied to natural language processing, where they are mainly used for sequence labeling problems such as word segmentation, named entity recognition, part-of-speech tagging (when the number of part-of-speech categories is relatively small), and shallow parsing.
Discriminative approaches solve word segmentation from the viewpoint of character-based word formation. The segmentation problem is transformed into a classification problem: each character is assigned a position tag (the position of the character within its word), and segmentation becomes the task of predicting the sequence of these tags. The position tags can be defined in any of the following ways:
Two classes (word-initial, word-internal), three classes (word-initial, word-internal, word-final), four classes (word-initial, word-internal, word-final, single-character word), and so on. In general, the more classes there are, the better the characters are distinguished and the more accurate the classification. However, the label space also grows, so the model and the decoding space become larger and decoding becomes slower. In practical systems, three classes (word-initial, word-internal, word-final) are a good balance.
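The tag schemes above can be illustrated with a minimal Python sketch. The tag names B/M/E/S, the function names, and the choice to tag a single-character word as B in the three-class scheme are assumptions made for illustration; they are not taken from the original article or from any particular toolkit.

```python
def tag_word(word, scheme=4):
    """Return one position tag per character of `word`.

    scheme=2: B (word-initial), M (every other character)
    scheme=3: B, M, E (word-final); a single-character word gets B (assumption)
    scheme=4: B, M, E, S (single-character word)
    """
    if scheme not in (2, 3, 4):
        raise ValueError("scheme must be 2, 3, or 4")
    n = len(word)
    if n == 0:
        return []
    if scheme == 2:
        return ["B"] + ["M"] * (n - 1)
    if n == 1:
        return ["S"] if scheme == 4 else ["B"]
    return ["B"] + ["M"] * (n - 2) + ["E"]


def tag_sentence(words, scheme=4):
    """Turn a segmented sentence (list of words) into parallel char/tag lists."""
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        tags.extend(tag_word(w, scheme))
    return chars, tags


def tags_to_words(chars, tags):
    """Rebuild words from predicted tags: a new word starts at every B or S."""
    words = []
    for ch, t in zip(chars, tags):
        if t in ("B", "S") or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


words = ["我", "喜欢", "自然语言"]
chars, tags = tag_sentence(words, scheme=4)
print(list(zip(chars, tags)))
print(tags_to_words(chars, tags))
```

Because every word starts with B (or S), the same decoding function works for all three schemes; the extra tags only give the classifier more information about where a word ends.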
There are many open-source CRF projects on the Internet. The most typical and most widely used is CRF++, which comes with complete source code and application examples, so it is easy to learn from and to use. The biggest problem with the open-source CRF++ code is that only a Linux version is available. Tracing and debugging under Linux is inconvenient, whereas setting up a Visual C++ or Visual Studio project under Windows and stepping through the code in the debugger makes it easier to learn the algorithm. In the course of my own study I ported the Linux version to the Windows platform and released it as open source on SourceForge (CRF Chinese word segmentation, open-source version).
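As a rough illustration of how segmentation data reaches a tool like CRF++: its training files use one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences. The sketch below writes character/tag pairs in that layout. The function name `write_crfpp` and the B/E/S tag names are hypothetical choices for this example.

```python
import io


def write_crfpp(sentences, out):
    """Write (chars, tags) pairs in a column format: "char<TAB>tag" per line,
    with an empty line after each sentence.

    `sentences` is an iterable of (chars, tags) pairs of equal length;
    `out` is any writable text stream.
    """
    for chars, tags in sentences:
        if len(chars) != len(tags):
            raise ValueError("chars and tags must have the same length")
        for ch, tag in zip(chars, tags):
            out.write(f"{ch}\t{tag}\n")
        out.write("\n")  # sentence boundary


buf = io.StringIO()
write_crfpp([(["我", "喜", "欢"], ["S", "B", "E"])], buf)
print(buf.getvalue())
```

A file produced this way, together with a feature template file, is what the CRF++ command-line trainer `crf_learn` takes as input; `crf_test` then labels unseen text in the same column layout.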
How features are used is the key to the performance of a machine learning algorithm. Chinese word segmentation mainly uses knowledge about a character's context; the context window can be 3, 5, or 7 characters wide. Because character-based word formation is relatively weak at handling long words, it is also worth considering features such as pattern-like words ("imitation words") and idioms or set phrases. Some studies report that adding a core dictionary improves the classification of dictionary words, but this needs to be weighed. If the training corpus covers the core dictionary fairly completely, the word-formation knowledge of the core dictionary is usually already contained in the corpus. If the corpus does not cover the core words well enough, adding core word-formation knowledge can be considered, but this places higher demands on the core dictionary. We think that the GKB dictionary published by the Institute of Computational Linguistics can serve as a core dictionary. If no good core dictionary is available, it is better not to add the word-formation knowledge of core words at all.
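The context-window and dictionary features described above might be sketched as follows. This is only an illustration under stated assumptions: the feature names (`c[-1]`, `c[0]`, ...), the padding symbol, and the "longest dictionary match starting here" feature are hypothetical design choices, not the feature set of the original article or of CRF++.

```python
def window_features(chars, i, size=5, pad="<PAD>"):
    """Characters in a window of `size` (3, 5, 7, ...) centred on position i.

    Positions outside the sentence are filled with `pad`.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd number")
    half = size // 2
    feats = {}
    for off in range(-half, half + 1):
        j = i + off
        feats[f"c[{off}]"] = chars[j] if 0 <= j < len(chars) else pad
    return feats


def longest_dict_match(chars, i, lexicon, max_len=4):
    """Length of the longest lexicon word starting at position i (0 if none).

    A hypothetical way to expose core-dictionary knowledge as a feature.
    """
    for n in range(min(max_len, len(chars) - i), 0, -1):
        if "".join(chars[i:i + n]) in lexicon:
            return n
    return 0


chars = list("我喜欢自然语言")
lexicon = {"喜欢", "自然", "自然语言"}  # toy core dictionary
print(window_features(chars, 3, size=3))
print(longest_dict_match(chars, 3, lexicon))
```

A wider window gives the classifier more context at the cost of more features; the dictionary feature is only as useful as the lexicon behind it, which matches the caution in the text about requiring a good core dictionary.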
This article is from the "Focus on Natural Language Technology" blog. Please be sure to keep this source: http://langiner.blog.51cto.com/1989264/379166