Random Talk of the Week (Week 8): Chinese Word Segmentation


Chinese Word Segmentation

NLP (natural language processing) has always been a fairly hot field. Nowadays, whether you are doing search, recommendation, or whatever else, you basically need to deal with NLP, and the first step in Chinese NLP is word segmentation, so Chinese word segmentation has always played a pivotal role. Of course, the algorithms are endless: from the early dictionary matching to the later statistical models, from HMM to CRF, segmentation accuracy keeps improving. Below I briefly introduce the basic segmentation algorithms.

Dictionary Matching

The simplest segmentation is based on dictionary matching. Take the sentence "乱弹中文分词" ("random talk on Chinese word segmentation"): if my dictionary contains the three words "乱弹", "中文", and "分词", I can naturally segment the sentence accordingly. The problem with dictionary matching is that multiple matches are possible. For example, given the two dictionary words "北京" (Beijing) and "北京大学" (Peking University) and the sentence "北京大学在哪里" ("where is Peking University"), should we match "北京" or "北京大学"? So people proposed many heuristic matching schemes, the most classic being maximum matching, which matches the longest word possible, along with similar heuristic rules such as preferring segmentations with as few words as possible. The MMSEG algorithm packages several good heuristic rules of this kind; it works very well in practice and is therefore widely used. Search for the details yourself; I won't repeat them here.
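To make the idea concrete, here is a minimal sketch of forward maximum matching; the toy dictionary and the maximum word length are made up for illustration, and MMSEG layers further disambiguation rules on top of this basic scheme:

```python
# A minimal sketch of forward maximum matching; the toy dictionary and the
# maximum word length are made up for illustration.
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedily take the longest dictionary word starting at each position."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a dictionary hit;
        # fall back to a single character so the loop always advances.
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + j]
            if j == 1 or candidate in dictionary:
                words.append(candidate)
                i += j
                break
    return words

dictionary = {"北京", "北京大学", "在", "哪里"}
print(forward_max_match("北京大学在哪里", dictionary))
# ['北京大学', '在', '哪里'] -- the longer word "北京大学" wins over "北京"
```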

Statistical Models

Dictionary-based segmentation soon exposes a number of problems, the main one being ambiguity. Take "武汉市长江大桥" (the Wuhan Yangtze River Bridge): different segmentations yield "武汉/市长/江大桥" (Wuhan / mayor / Jiang Bridge) or "武汉市/长江/大桥" (Wuhan City / Yangtze River / Bridge). Dictionary matching obviously cannot resolve this kind of ambiguity, so statistical segmentation algorithms appeared. An earlier article of mine introduced a unigram-model segmentation algorithm: given a sentence, i.e. a character sequence $a_1 a_2 a_3 \ldots a_n$, we want the word sequence $w_1 w_2 \ldots w_m$ that maximizes

$$\arg\max \prod_{i=1}^{m} P(w_i)$$

Similarly, an n-gram model seeks

$$\arg\max \prod_{i} P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-n+1})$$

That article of mine used the unigram-model method. The birth of statistical models thus solved the ambiguity problem in word segmentation.
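As a sketch of what the unigram model does in practice, the following picks, by dynamic programming, the segmentation that maximizes the product of word probabilities; the probability table is invented for illustration, not estimated from a real corpus:

```python
import math

# A minimal sketch of unigram-model segmentation: among all ways of splitting
# the sentence into dictionary words, pick the one maximizing prod P(w_i),
# i.e. sum log P(w_i), by dynamic programming. The probabilities are invented
# for illustration, and out-of-vocabulary handling is omitted.
word_prob = {"武汉": 0.02, "武汉市": 0.01, "市长": 0.01,
             "长江": 0.02, "江大桥": 0.0001, "大桥": 0.01}

def unigram_segment(sentence, max_len=4):
    n = len(sentence)
    # best[i] = (best log-probability of the prefix of length i, split point)
    best = [(-math.inf, 0) for _ in range(n + 1)]
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = sentence[j:i]
            if word in word_prob and best[j][0] > -math.inf:
                score = best[j][0] + math.log(word_prob[word])
                if score > best[i][0]:
                    best[i] = (score, j)
    words, i = [], n  # backtrack through the stored split points
    while i > 0:
        j = best[i][1]
        words.append(sentence[j:i])
        i = j
    return words[::-1]

print(unigram_segment("武汉市长江大桥"))
# ['武汉市', '长江', '大桥'] beats ['武汉', '市长', '江大桥'] on total probability
```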

Hidden Markov Model (HMM)

The n-gram models above can resolve ambiguity, but they cannot handle out-of-vocabulary words, i.e. words that have never been seen before or are not in our dictionary. So people proposed character-based tagging for segmentation. For example, the sentence "我喜欢天安门" ("I like Tiananmen") can be tagged as "我/S 喜/B 欢/E 天/B 安/M 门/E": with the labels S (single), B (begin), M (middle), and E (end), the segmentation problem turns into a tagging problem. When character tagging was first proposed, it achieved astonishing accuracy in the segmentation bakeoffs.

The HMM (hidden Markov model) is one such tagging-based segmentation algorithm. Suppose the original character sequence is $a_1 a_2 a_3 \ldots a_n$ and the tag sequence is $c_1 c_2 \ldots c_n$; then the HMM seeks

$$\arg\max \prod_{i} P(c_i \mid c_{i-1}) \, P(a_i \mid c_i)$$

My project SnowNLP implements HMM segmentation.
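For illustration, here is a minimal Viterbi decoder for BMES tagging that maximizes the HMM objective above; the transition and emission probabilities are toy values I made up, not the parameters SnowNLP actually learns from a corpus:

```python
import math

# A minimal Viterbi decoder for BMES tagging, maximizing
# prod P(c_i | c_{i-1}) * P(a_i | c_i). All probabilities are toy values.
STATES = "BMES"
trans = {  # P(next tag | tag); missing entries are (near-)impossible moves
    "B": {"M": 0.3, "E": 0.7}, "M": {"M": 0.3, "E": 0.7},
    "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.6, "S": 0.4},
}

def viterbi(chars, emit):
    """emit[tag][char] ~ P(char | tag); returns the best BMES tag string."""
    # A word can only start with B or S, so M/E are impossible at position 0.
    V = [{t: ((math.log(emit[t].get(chars[0], 1e-8)), None) if t in "BS"
              else (-math.inf, None)) for t in STATES}]
    for ch in chars[1:]:
        row = {}
        for t in STATES:
            p = max(STATES, key=lambda s: V[-1][s][0] + math.log(trans[s].get(t, 1e-12)))
            row[t] = (V[-1][p][0] + math.log(trans[p].get(t, 1e-12))
                      + math.log(emit[t].get(ch, 1e-8)), p)
        V.append(row)
    tag = max("ES", key=lambda t: V[-1][t][0])  # a word can only end in E or S
    tags = [tag]
    for row in reversed(V[1:]):  # follow the stored back-pointers
        tag = row[tag][1]
        tags.append(tag)
    return "".join(reversed(tags))

emit = {"B": {"喜": 0.2, "天": 0.2}, "M": {"安": 0.3},
        "E": {"欢": 0.3, "门": 0.3}, "S": {"我": 0.4}}
print(viterbi("我喜欢天安门", emit))  # SBEBME, i.e. 我/喜欢/天安门
```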

Maximum Entropy Model (ME)

An earlier article of mine introduced the concept of information entropy: the greater the uncertainty, the greater the entropy, and entropy is maximized when all outcomes are equally probable, i.e. an unbiased guess. The maximum entropy model finds, among all distributions satisfying the known constraints, the one with maximum entropy. We generally define feature functions, and the known constraint is that the sample expectation of each feature equals the model expectation, i.e.

$$\tilde{P}(f) = \sum \tilde{P}(a_i, c_i) \, f(a_i, c_i) = P(f) = \sum P(c_i \mid a_i) \, \tilde{P}(a_i) \, f(a_i, c_i)$$

Subject to these constraints, we look for the distribution that maximizes the conditional entropy:

$$\arg\max H(c_i \mid a_i)$$

Here $H$ is the entropy function. Solving this gives $P(c_i \mid a_i)$, i.e. the tag $c$ for each character $a$. One advantage of the maximum entropy model is that we can introduce all kinds of features instead of segmenting based only on occurrence frequencies; for example, we can add domain knowledge, known dictionary information, and so on.
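As a rough sketch of this flexibility: logistic regression in scikit-learn is a conditional maximum entropy model, so something like the following shows how arbitrary features, including dictionary hits, can be mixed in. The feature names and the single training sentence are toy examples of my own:

```python
# A rough sketch of maximum-entropy tagging with scikit-learn's logistic
# regression (a conditional maxent model). Toy features and toy data.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def char_features(sent, i):
    return {
        "char": sent[i],
        "prev": sent[i - 1] if i > 0 else "<s>",
        "next": sent[i + 1] if i < len(sent) - 1 else "</s>",
        # Domain knowledge plugs in as extra features, e.g. dictionary hits:
        "in_dict": int(sent[i] in {"门", "桥"}),
    }

train = [("我喜欢天安门", "SBEBME")]  # one toy sentence with gold BMES tags
X = [char_features(s, i) for s, tags in train for i in range(len(s))]
y = [t for _, tags in train for t in tags]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

sent = "我喜欢天安门"
print(clf.predict(vec.transform([char_features(sent, i) for i in range(len(sent))])))
```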

Maximum Entropy Markov Model (MEMM)

One problem with the maximum entropy model is that it tags each character in isolation. So clever people combined the Markov chain with maximum entropy to get the maximum entropy Markov model, which can both exploit the rich features of the maximum entropy model and incorporate sequence information, so that tagging is handled from the perspective of maximizing over the whole sequence rather than treating each character separately. The MEMM therefore seeks

$$\arg\max \prod_{i} P(c_i \mid c_{i-1}, a_i)$$
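A minimal sketch of how this differs from the plain maxent model: the previous tag becomes a feature of the classifier, and decoding walks the sequence (greedily here for brevity; a real MEMM decodes with Viterbi). The training data is again a toy example:

```python
# A minimal MEMM sketch: a maxent classifier whose features include the
# previous tag, decoded left to right. Greedy decoding for brevity.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def memm_features(sent, i, prev_tag):
    return {"char": sent[i], "prev_tag": prev_tag}

train_sent, train_tags = "我喜欢天安门", "SBEBME"
X = [memm_features(train_sent, i, train_tags[i - 1] if i else "<s>")
     for i in range(len(train_sent))]
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), list(train_tags))

sent, prev, tags = "我喜欢天安门", "<s>", []
for i in range(len(sent)):
    prev = clf.predict(vec.transform([memm_features(sent, i, prev)]))[0]
    tags.append(prev)
print("".join(tags))
```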

Conditional Random Field (CRF)

The weakness of the MEMM is the Markov chain: the Markov assumption is that each state depends only on the state before it, which is clearly biased. Hence the CRF model, in which each state depends not only on the state before it but also on the state after it. Viewed as graphs, the HMM is a directed graph based on Bayesian networks, while the CRF is an undirected graph.

$$P(Y_v \mid Y_w, w \neq v) = P(Y_v \mid Y_w, w \sim v)$$

where $w \sim v$ means that $w$ and $v$ are neighbors in $G$.

The formula above is the definition of a conditional random field: a graph is a conditional random field if each node in it depends only on its neighboring nodes. Finally, since this is an undirected graph rather than a Bayesian network, the CRF is computed in terms of cliques, and the final formula is as follows:

$$P(Y \mid X, \lambda) = \frac{1}{Z(X)} \exp\Big( \sum_j \lambda_j f_j(Y, X) \Big)$$

Because the conditional random field can add all kinds of features just like the maximum entropy model, and has no biased Markov-chain assumption, the CRF has in recent years come to be recognized as the best-performing segmentation approach. Stanford NLP has a good CRF implementation for Chinese word segmentation; their paper mentions that adding a dictionary as features to the CRF improves segmentation performance considerably.
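As a sketch of a CRF segmenter, assuming the third-party sklearn_crfsuite package (pip install sklearn-crfsuite); the feature set and toy data below are my own, not Stanford's, but they show how dictionary membership enters as a feature:

```python
# A minimal CRF sketch assuming the third-party sklearn_crfsuite package.
# The feature set and toy training data are invented for illustration.
import sklearn_crfsuite

DICT = {"喜欢", "天安门"}

def features(sent, i):
    return {
        "char": sent[i],
        "prev": sent[i - 1] if i > 0 else "<s>",
        "next": sent[i + 1] if i < len(sent) - 1 else "</s>",
        # Does some dictionary word start at this position?
        "dict_hit": any(sent[i:i + k] in DICT for k in (2, 3)),
    }

X = [[features("我喜欢天安门", i) for i in range(6)]]  # one toy sentence
y = [list("SBEBME")]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # [['S', 'B', 'E', 'B', 'M', 'E']] on the toy data
```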

I recently saw a paper that applies deep learning to word segmentation, also with good results.

In short, Chinese word segmentation technology is now fairly mature, so unless you have special needs, using these open-source segmenters is enough; but for learning purposes, implementing a segmentation algorithm yourself is still very interesting.

