Original address: http://blog.csdn.net/mspinyin/article/details/6137815#t12
Natural language processing is currently a very hot research direction, driven largely by the growth of the Internet. The Internet is flooded with information, most of it text, and processing that information is inseparable from natural language processing technology. So what exactly are natural language and natural language processing?
Basic tasks of natural language processing
Natural language is simply human language, and natural language processing (NLP) is the processing of human language, mainly by computer. It is an interdisciplinary field spanning computer science and linguistics, and common research tasks include:
· Word segmentation (Word Segmentation or Word Breaker, WB)
· Information extraction (Information Extraction, IE): named entity recognition and relation extraction (Named Entity Recognition & Relation Extraction, NER)
· Part-of-speech tagging (Part-of-Speech Tagging, POS)
· Coreference resolution (Coreference Resolution)
· Syntactic analysis (Parsing)
· Word sense disambiguation (Word Sense Disambiguation, WSD)
· Speech recognition (Speech Recognition)
· Speech synthesis (Text to Speech, TTS)
· Machine translation (Machine Translation, MT)
· Automatic summarization (Automatic Summarization)
· Question answering (Question Answering)
· Natural language understanding (Natural Language Understanding)
· Optical character recognition (OCR)
· Information retrieval (Information Retrieval, IR)
Early natural language processing systems were mainly based on manually written rules, which is time-consuming, labor-intensive, and hard to cover all kinds of linguistic phenomena. In the late 1980s, machine learning algorithms were introduced into natural language processing, thanks to steadily increasing computing power. Research shifted to statistical models, which automatically learn model parameters from large-scale training corpora; compared with the earlier rule-based methods, this approach is more robust.
Statistical language models
Statistical language models were proposed against this background. They are widely used in all kinds of natural language processing problems, such as speech recognition, machine translation, word segmentation, POS tagging, and so on. Simply put, a language model is a model for computing the probability of a sentence, i.e., P(S) for a word sequence S.
With a language model, we can determine which word sequence is more likely, or, given several words, predict the next most likely word. For example, the input pinyin string nixianzaiganshenme can correspond to several outputs, such as 你现在干什么 (what are you doing now) or 你西安在干什么 (you, in Xi'an, doing what). Which is the right conversion result? Using a language model, we find that the former has higher probability, so converting to the former is more reasonable in most cases. Machine translation is another example: given the Chinese sentence 李明在家看电视, possible translations include "Li Ming is watching TV at home", "Li Ming at home is watching TV", and so on. Again, according to the language model, the former has higher probability, so translating to the former is more reasonable.
So how do we compute the probability of a sentence? Given a sentence (word sequence)

S = w_1 w_2 \cdots w_m,

its probability can be expressed as:

P(S) = P(w_1 w_2 \cdots w_m) = P(w_1) P(w_2 \mid w_1) \cdots P(w_m \mid w_1 w_2 \cdots w_{m-1})    (1)
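As a small worked example (the sentence is illustrative, not taken from the original article), formula (1) expands the probability of a three-word sentence as:

```latex
% Chain-rule expansion of formula (1) for an illustrative three-word sentence
% ("李明 看 电视", "Li Ming watches TV"):
P(\text{李明 看 电视}) = P(\text{李明}) \cdot P(\text{看} \mid \text{李明}) \cdot P(\text{电视} \mid \text{李明 看})
```

Each factor conditions on the entire history so far, which is exactly what makes the full model impractical to estimate directly.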
Because the model above has far too many parameters, approximate methods are needed. Common methods include the N-gram model, decision trees, maximum entropy models, maximum entropy Markov models, conditional random fields, neural networks, and so on.
The N-gram language model

The concept of the N-gram model
The N-gram model, also known as the (n-1)-order Markov model, makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words. Formula (1) can therefore be approximated as:

P(S) = P(w_1 w_2 \cdots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1} \cdots w_{i-1})    (2)
When n is 1, 2, or 3, the N-gram model is called a unigram, bigram, or trigram language model, respectively. The parameters of the N-gram model are the conditional probabilities P(w_i \mid w_{i-n+1} \cdots w_{i-1}). Assuming a vocabulary of 100,000 words, the number of parameters in the N-gram model is 100,000^n (for example, 10^10 for a bigram model).
The larger n is, the more accurate but also the more complex the model, and the more computation it requires. The most commonly used is the bigram, followed by the unigram and the trigram; n ≥ 4 is used less often.
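To make formula (2) concrete, here is a minimal Python sketch of scoring a sentence with a bigram model; the probability table and the sentence are toy values invented for illustration, not data from the article:

```python
import math

# Toy bigram probabilities P(w_i | w_{i-1}); "<s>" marks the sentence start.
# All values are made up purely for illustration.
bigram_prob = {
    ("<s>", "李明"): 0.2,
    ("李明", "在"): 0.3,
    ("在", "看"): 0.1,
    ("看", "电视"): 0.4,
}

def sentence_logprob(words, probs, floor=1e-8):
    """Score a sentence with the bigram approximation of formula (2).

    Unseen bigrams fall back to a small floor probability so the
    log-probability stays finite (a crude stand-in for real smoothing).
    """
    logp = 0.0
    prev = "<s>"
    for w in words:
        logp += math.log(probs.get((prev, w), floor))
        prev = w
    return logp

print(sentence_logprob(["李明", "在", "看", "电视"], bigram_prob))
```

Working in log space, as the sketch does, avoids numerical underflow when multiplying many small probabilities over a long sentence.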
Parameter estimation for the N-gram model
Estimating the model parameters is also called training the model. Maximum likelihood estimation (MLE) is used to estimate the parameters:

P(w_i \mid w_{i-n+1} \cdots w_{i-1}) = \frac{C(w_{i-n+1} \cdots w_{i-1} w_i)}{C(w_{i-n+1} \cdots w_{i-1})}    (3)
C(x) denotes the number of times x appears in the training corpus; the larger the training corpus, the more reliable the parameter estimates. However, even with very large training data, say several gigabytes, many linguistic phenomena will still not appear in the training corpus, which makes many parameters (probabilities of n-grams) zero. For example, IBM (Brown et al.) trained a trigram model on a 366M-word English corpus, and in the test corpus 14.7% of the trigrams and 2.2% of the bigrams had never appeared in training. Similarly, in one laboratory's statistics, a bigram model trained on 5 million words of People's Daily text and tested on 1.5 million words of People's Daily found that 23.12% of the bigrams had not appeared in training.
This problem is known as data sparseness, and it can be addressed with data smoothing techniques.
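A minimal sketch of MLE training as in formula (3), assuming a tiny invented tokenized corpus (real training corpora run to millions or billions of words):

```python
from collections import Counter

# A toy tokenized corpus, invented for illustration.
corpus = "我 在 看 电视 你 在 看 书 我 在 看 书".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram(prev, word):
    """P(word | prev) = C(prev word) / C(prev), the MLE estimate of formula (3)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("在", "看"))    # frequent bigram -> high probability
print(mle_bigram("看", "电视"))  # less frequent bigram -> lower probability
print(mle_bigram("你", "电视"))  # unseen bigram -> 0
```

The last call returns 0 because that bigram never occurs in the corpus, which is precisely the sparseness problem smoothing is meant to fix.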
Data smoothing for the N-gram model
Data smoothing deals with n-grams whose frequency is 0. Typical smoothing algorithms include additive smoothing, Good-Turing smoothing, Katz smoothing, interpolation smoothing, and so on; a small code sketch after this list illustrates two of the simpler estimates.
· Additive smoothing
The basic idea is to add a constant δ (0 < δ ≤ 1) to the count of every n-gram, thereby avoiding the zero-probability problem:

P(w_i \mid w_{i-n+1} \cdots w_{i-1}) = \frac{\delta + C(w_{i-n+1} \cdots w_{i-1} w_i)}{\delta |V| + C(w_{i-n+1} \cdots w_{i-1})}    (4)

where |V| is the vocabulary size.
· Good-Turing smoothing
The counts-of-counts information is used to smooth the raw frequencies, replacing a raw count c with an adjusted count c*:

c^{*} = (c+1)\,\frac{N(c+1)}{N(c)}    (5)

where N(c) denotes the number of n-grams that occur exactly c times.
· Linear interpolation smoothing
This technique linearly interpolates the lower-order N-gram models with the higher-order N-gram model. When there is not enough data to estimate the higher-order probabilities, the lower-order models can often provide useful information.

P_{interp}(w_i \mid w_{i-n+1} \cdots w_{i-1}) = \lambda\,P(w_i \mid w_{i-n+1} \cdots w_{i-1}) + (1-\lambda)\,P_{interp}(w_i \mid w_{i-n+2} \cdots w_{i-1})    (6)

The interpolation weight λ can be estimated with the EM algorithm.
· Katz smoothing
Also known as back-off smoothing. Its basic idea: when an n-gram has been seen often enough, its probability is estimated by maximum likelihood; when its count is small but nonzero, the Good-Turing estimate is used to smooth it, discounting part of the probability mass and reserving it for unseen n-grams; when the n-gram's count is 0, the model backs off to the lower-order model:

P_{katz}(w_i \mid w_{i-n+1} \cdots w_{i-1}) =
\begin{cases}
P_{GT}(w_i \mid w_{i-n+1} \cdots w_{i-1}) & \text{if } C(w_{i-n+1} \cdots w_i) > 0 \\
\alpha(w_{i-n+1} \cdots w_{i-1})\,P_{katz}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \text{otherwise}
\end{cases}    (7)

The back-off weight α is determined by the normalization constraint that the conditional probabilities must sum to 1, i.e., \sum_{w_i} P_{katz}(w_i \mid w_{i-n+1} \cdots w_{i-1}) = 1.
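As promised above, here is a minimal sketch of two of the simpler estimates, additive smoothing (4) and linear interpolation (6); the toy corpus and the δ and λ values are assumptions made purely for illustration:

```python
from collections import Counter

corpus = "我 在 看 电视 你 在 看 书".split()
vocab = set(corpus)
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
total_words = len(corpus)

def additive_bigram(prev, word, delta=0.5):
    """Add-delta smoothing, formula (4): every bigram gets delta extra counts."""
    return ((delta + bigram_counts[(prev, word)])
            / (delta * len(vocab) + unigram_counts[prev]))

def interpolated_bigram(prev, word, lam=0.7):
    """Linear interpolation, formula (6): mix the bigram MLE with the unigram MLE."""
    p_bigram = (bigram_counts[(prev, word)] / unigram_counts[prev]
                if unigram_counts[prev] else 0.0)
    p_unigram = unigram_counts[word] / total_words
    return lam * p_bigram + (1 - lam) * p_unigram

print(additive_bigram("你", "电视"))      # unseen bigram, but probability > 0
print(interpolated_bigram("你", "电视"))  # backs off toward the unigram P(电视)
```

Both estimates give the unseen bigram a nonzero probability, which is the whole point of smoothing; in a real system λ would be tuned, e.g. with EM on held-out data, rather than fixed by hand.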
Decoding algorithm for the N-gram model
Why does an N-gram model need a decoding algorithm? Take pinyin-to-character conversion as an example: the input pinyin nixianzaiganshenme can correspond to many conversion results. For this example, the possible conversion results (only some of the word nodes are shown in the figure) form a complex lattice structure, and any path from the start to the end is a possible conversion result. Selecting the most suitable result from the many candidates requires a decoding algorithm.
The commonly used decoding algorithm is the Viterbi algorithm, which uses dynamic programming to quickly determine the most suitable path. The algorithm is not described in detail here.
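A minimal sketch of Viterbi decoding over such a conversion lattice, assuming a hypothetical candidate list for each pinyin syllable and toy bigram log-probabilities (none of these values come from the article):

```python
# Hypothetical candidate characters for each pinyin syllable of "ni xian zai".
candidates = [["你", "尼"], ["现", "先"], ["在", "再"]]

# Toy bigram log-probabilities; anything missing falls back to a small value.
bigram_logp = {
    ("<s>", "你"): -0.5, ("<s>", "尼"): -3.0,
    ("你", "现"): -0.7, ("你", "先"): -2.0,
    ("现", "在"): -0.3, ("先", "在"): -1.5,
}
FLOOR = -10.0

def viterbi(candidates, logp):
    """Dynamic programming: keep, for each candidate at step i, the best-scoring
    path ending in that candidate, then return the best path at the last step."""
    best = {"<s>": (0.0, ["<s>"])}            # candidate -> (score, path)
    for step in candidates:
        new_best = {}
        for cand in step:
            score, path = max(
                (prev_score + logp.get((prev, cand), FLOOR), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    score, path = max(best.values())
    return path[1:], score

print(viterbi(candidates, bigram_logp))  # expected: ['你', '现', '在']
```

Because each step keeps only the best-scoring path ending in each candidate, the cost grows linearly with sentence length and quadratically with the number of candidates per position, instead of exponentially in the number of paths.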
Applications of the N-gram model
The N-gram language model is very widely used; its earliest applications were speech recognition, machine translation, and so on. Professor Wang Xiaolong of Harbin Institute of Technology first applied it to the pinyin-to-character conversion problem and proposed the "sentence-level pinyin input method". The technology was later transferred to Microsoft and became the Microsoft Pinyin input method. Starting with Windows 95, the input method was installed automatically, and newer versions of Microsoft Pinyin have been integrated into later versions of Windows and Office. Years later, the newer input methods (such as Sogou and Google) also adopted N-gram technology.