Notes on Word Segmentation with Conditional Random Fields (CRF), Part 1 (Repost)

Source: Internet
Author: User

http://langiner.blog.51cto.com/1989264/379166

This is an original work. Reprinting is permitted, provided that the original source of the article, the author information, and this statement are cited with a hyperlink. Otherwise, legal liability will be pursued. http://langiner.blog.51cto.com/1989264/379166

Notes on Word Segmentation with Conditional Random Fields (CRF), Part 1
Langiner

Word segmentation can be solved with discriminative machine learning techniques. The main representatives are conditional random fields, maximum entropy and maximum entropy Markov models, perceptrons, and support vector machines. Their similarities and differences are left for a later occasion; today's topic is mainly how to use conditional random fields to solve the word segmentation problem.

Conditional random fields were proposed by John Lafferty and colleagues and applied to natural language processing. They are mainly used for sequence labeling problems such as word segmentation, entity recognition, part-of-speech tagging (when the number of parts of speech is relatively small), and shallow syntactic analysis.

Discriminative machine learning approaches segment text based on the idea that words are formed from characters. Each character is given a position tag (the position of the character within its word), which turns word segmentation into a classification problem: predicting the sequence of position tags for the characters of a sentence. The position tags can be defined in any of the following ways:
Two classes (word-initial and non-initial), three classes (word-initial, word-middle, and word-final), four classes (word-initial, word-middle, word-final, and single-character word), and so on. In general, the more classes there are, the better the characters are distinguished and the more accurate the classification. However, the classification space, the model, and the decoding space all grow as well, so decoding becomes slower. In practical systems, the three-class scheme (word-initial, word-middle, and word-final) is a good balance.
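
To make the tagging schemes concrete, here is a minimal Python sketch that converts a segmented sentence into per-character position tags and back. The tag names B, M, E, S, and I, the function names, and the choice to tag a single-character word as B in the three-class scheme are all assumptions made for illustration; the article does not specify them.

```python
def words_to_tags(words, scheme=4):
    """Turn a list of segmented words into (character, tag) pairs.

    Tag names are illustrative assumptions, not taken from the article:
      scheme=2: B (word-initial), I (any other position)
      scheme=3: B, M (word-middle), E (word-final); a one-character word gets B
      scheme=4: B, M, E, plus S for a one-character word
    """
    if scheme not in (2, 3, 4):
        raise ValueError("scheme must be 2, 3, or 4")
    pairs = []
    for word in words:
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                tag = "S" if (scheme == 4 and last == 0) else "B"
            elif scheme == 2:
                tag = "I"
            elif i == last:
                tag = "E"
            else:
                tag = "M"
            pairs.append((ch, tag))
    return pairs


def tags_to_words(pairs):
    """Rebuild the segmentation: every B or S tag opens a new word."""
    words = []
    for ch, tag in pairs:
        if tag in ("B", "S") or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


sentence = ["我", "喜欢", "自然语言"]
tagged = words_to_tags(sentence, scheme=4)
print(" ".join(tag for _, tag in tagged))   # S B E B M M E
print(tags_to_words(tagged) == sentence)    # True
```

A trained sequence labeler would predict the tag column from the characters alone; `tags_to_words` is then the step that turns its predictions back into words.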

There are many open-source CRF projects on the Internet. The most typical and most widely used is CRF++, which comes with complete source code and application examples, so it is easy to learn and use. The biggest problem with the CRF++ source code is that it is available only as a Linux version, and tracing and debugging under Linux is inconvenient. Building a Visual C++ or Visual Studio project under Windows and stepping through the code in a debugger is a more effective way to learn the algorithm. In the course of my own study and practice I ported the Linux version to the Windows platform, and I am making it available on SourceForge (CRF Chinese word segmentation, open-source version).

How features are used is key to the performance of a machine learning algorithm. Chinese word segmentation mainly uses knowledge about the context of each character, and the context window can be 3, 5, or 7 characters wide. Because character-based word formation is comparatively weak at handling long words, features such as imitation word patterns and idioms or set phrases can also be introduced. Some studies suggest that adding a core dictionary improves the classification of dictionary words, but this needs to be weighed. If the training corpus already covers the core dictionary fairly completely, the word-formation knowledge of the dictionary is largely contained in the corpus. If the training corpus covers the core vocabulary poorly, adding word-formation knowledge from a core dictionary is worth considering, but this places higher demands on the dictionary. We think that the GKB dictionary published by the Institute of Computational Linguistics can be used as a core dictionary. If no good core dictionary is available, it is better not to add the word-formation knowledge of core words at all.
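
As a rough illustration of context-window features, the sketch below builds character unigram and adjacent-bigram features for one position, plus two optional dictionary flags. The feature names, the `<PAD>` symbol, the `max_len` parameter, and the way the dictionary is encoded are hypothetical choices for this example; the article does not say how its features were actually defined.

```python
PAD = "<PAD>"  # hypothetical symbol for positions outside the sentence


def window_features(chars, i, width=5):
    """Unigram and adjacent-bigram features from a window centered on chars[i]."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number, e.g. 3, 5, or 7")
    half = width // 2

    def at(j):
        return chars[j] if 0 <= j < len(chars) else PAD

    feats = {}
    for off in range(-half, half + 1):
        feats[f"C[{off}]"] = at(i + off)
    for off in range(-half, half):
        feats[f"C[{off}]C[{off + 1}]"] = at(i + off) + "/" + at(i + off + 1)
    return feats


def lexicon_features(chars, i, lexicon, max_len=4):
    """Flag whether a multi-character dictionary word starts or ends at position i."""
    text = "".join(chars)
    begins = any(
        text[i:i + n] in lexicon
        for n in range(2, max_len + 1)
        if i + n <= len(text)
    )
    ends = any(
        text[i - n + 1:i + 1] in lexicon
        for n in range(2, max_len + 1)
        if i - n + 1 >= 0
    )
    return {"LEX_B": begins, "LEX_E": ends}


chars = list("我喜欢自然语言")
feats = window_features(chars, 1, width=3)
feats.update(lexicon_features(chars, 1, {"喜欢", "自然语言"}))
print(feats["C[-1]C[0]"])  # 我/喜
print(feats["LEX_B"])      # True
```

A wider window adds more features per character, which is one place where the trade-off between accuracy and model size described above shows up.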

This article is from the "Focus on Natural Language Technology" blog. Please be sure to keep this source: http://langiner.blog.51cto.com/1989264/379166
