Basic concepts of language models
This article introduces the basic concepts of language models. Before doing so, let's briefly review the broader problem of natural language processing. Natural language processing is currently a very active research direction, driven largely by the growth of the Internet. The Internet contains an enormous amount of information, most of it text, and processing that information is inseparable from natural language processing technology. So what are natural language and natural language processing?
1. Basic tasks of Natural Language Processing
Natural language is simply human language, and Natural Language Processing (NLP) is the processing of human language, mainly by computers. Natural language processing is an interdisciplinary field of computer science and linguistics. Common research tasks include:
· Word Segmentation (also called word breaking, WB)
· Information Extraction (IE)
· Relation Extraction (RE)
· Named Entity Recognition (NER)
· Part-of-Speech Tagging (POS)
· Coreference Resolution
· Parsing
· Word Sense Disambiguation (WSD)
· Speech Recognition
· Speech Synthesis (Text to Speech, TTS)
· Machine Translation (MT)
· Automatic Summarization
· Question Answering (QA)
· Natural Language Understanding (NLU)
· Optical Character Recognition (OCR)
· Information Retrieval (IR)
In the early days, natural language processing systems were mainly based on manually written rules. This approach was time-consuming and labor-intensive, and could not cover the full range of linguistic phenomena. In the late 1980s, thanks to steadily increasing computing power, machine learning algorithms were introduced into natural language processing. These methods use a large-scale training corpus to learn model parameters automatically, and compared with the earlier rule-based methods they are more robust.
2. Statistical Language Model
The statistical language model was proposed in this environment and context. It is widely used in many natural language processing problems, such as speech recognition, machine translation, word segmentation, and part-of-speech tagging. Simply put, a language model computes the probability of a sentence, that is, P(w1, w2, ..., wk). With a language model, we can determine which word sequence is more likely, or, given several words, predict the most likely next word. For example, for the input pinyin string "nixianzaiganshenme", the output can take several forms, such as "what are you doing now" or "what are you rushing to again in Xi'an". Which is the correct conversion result? Using the language model, we know that the probability of the former is greater than that of the latter, so in most cases it is reasonable to convert it to the former. Another example is machine translation: given the Chinese sentence meaning "Li Ming is watching TV at home", it could be translated as "Li Ming is watching TV at home" or "Li Ming at home is watching TV". According to the language model, the probability of the former is greater than that of the latter, so it is reasonable to translate it as the former.
How can we calculate the probability of a sentence? Given a sentence (word sequence) S = w1, w2, ..., wk, its probability can be expressed as:
P(S) = P(w1, w2, ..., wk) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wk|w1, w2, ..., wk-1)    (1)
Because the above formula has far too many parameters, an approximate calculation method is needed. Common methods include the n-gram model, decision trees, maximum entropy models, maximum entropy Markov models, conditional random fields, and neural networks.
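To make formula (1) concrete, here is a minimal Python sketch (my own illustration, not from the original article) that multiplies hypothetical conditional probabilities along the chain rule. The word tokens and probability values are invented; note how each key stores the full history, which is exactly why the number of parameters in formula (1) explodes.

```python
# A minimal sketch of the chain-rule decomposition in formula (1).
# The probability table is purely hypothetical and only shows the bookkeeping.

sentence = ["li_ming", "is", "watching", "tv", "at", "home"]

# P(w_i | w_1 ... w_{i-1}): keys are (full history, word) pairs.
# A real model would need one entry per possible history.
cond_prob = {
    ((), "li_ming"): 0.01,
    (("li_ming",), "is"): 0.40,
    (("li_ming", "is"), "watching"): 0.20,
    (("li_ming", "is", "watching"), "tv"): 0.30,
    (("li_ming", "is", "watching", "tv"), "at"): 0.25,
    (("li_ming", "is", "watching", "tv", "at"), "home"): 0.50,
}

def sentence_probability(words):
    """Multiply P(w_i | w_1..w_{i-1}) over the sentence, as in formula (1)."""
    prob = 1.0
    for i, word in enumerate(words):
        history = tuple(words[:i])
        prob *= cond_prob[(history, word)]
    return prob

print(sentence_probability(sentence))  # ~3e-05 with these toy values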
3. n-gram Language Model
3.1 Concept of the n-gram Model
The n-gram model, also known as the (n-1)-order Markov model, makes a finite-history assumption: the probability of the current word depends only on the preceding n-1 words. Under this assumption, formula (1) can be approximated as:
P(S) ≈ P(w1) P(w2|w1) ... P(wk|wk-n+1, ..., wk-1) = ∏i P(wi|wi-n+1, ..., wi-1)    (2)
When n is 1, 2, or 3, the n-gram model is called the unigram, bigram, or trigram language model, respectively. The parameters of the n-gram model are the conditional probabilities P(wi|wi-n+1, ..., wi-1). If the vocabulary size is 100,000, then the n-gram model has about 100,000^n parameters. The larger n is, the more accurate but also the more complex the model, and the greater the required computation. The bigram is the most commonly used, followed by the unigram and trigram; models with n greater than or equal to 4 are rarely used.
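The Markov assumption is easy to see in code. The sketch below (again my own illustration, with invented probability values and a hypothetical <s> start symbol) scores a sentence with a bigram model, i.e., formula (2) with n = 2: each word is conditioned only on the single preceding word instead of the full history.

```python
# A small sketch of the bigram (n = 2) approximation in formula (2).
# All probabilities are invented; <s> marks the sentence start.
bigram_prob = {
    ("<s>", "li_ming"): 0.01,
    ("li_ming", "is"): 0.40,
    ("is", "watching"): 0.10,
    ("watching", "tv"): 0.30,
    ("tv", "at"): 0.05,
    ("at", "home"): 0.20,
}

def bigram_sentence_probability(words):
    """Approximate P(S) as the product of P(w_i | w_{i-1}), as in formula (2)."""
    prob = 1.0
    prev = "<s>"
    for word in words:
        prob *= bigram_prob.get((prev, word), 0.0)  # unseen bigram -> 0 (see 3.2)
        prev = word
    return prob

print(bigram_sentence_probability(["li_ming", "is", "watching", "tv", "at", "home"]))
```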
3.2 Parameter Estimation of n-gram model
Model parameter estimation is also called model training. Generally, maximum likelihood estimation (MLE) is used to estimate model parameters:
P(wi|wi-n+1, ..., wi-1) = C(wi-n+1, ..., wi-1, wi) / C(wi-n+1, ..., wi-1)    (3)
C(X) denotes the number of times X appears in the training corpus. The larger the training corpus, the more reliable the parameter estimates. However, even with a very large training corpus (for example, several gigabytes), many linguistic phenomena will still never appear in it, which leaves many parameters (the probabilities of certain n-grams) equal to 0. For example, when IBM trained a trigram model on a large corpus, 14.7% of the trigrams and 2.2% of the bigrams in the test corpus had never appeared in the training corpus. According to statistics from the author's laboratory, with a bigram model trained on 5 million words of People's Daily text and tested on 1.5 million words of People's Daily text, 23.12% of the bigrams in the test data had never appeared in training.
This problem is known as data sparseness, and it can be addressed with data smoothing techniques.
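As a concrete illustration of formula (3) and of the sparseness problem, here is a small Python sketch (my own, using a toy corpus invented for illustration) that estimates bigram probabilities by maximum likelihood; any bigram absent from the training data gets probability 0.

```python
# Maximum likelihood estimation for a bigram model, as in formula (3).
# The tiny "corpus" is invented; a real corpus has millions of words.
from collections import Counter

corpus = [
    ["li_ming", "is", "watching", "tv"],
    ["li_ming", "is", "reading"],
    ["she", "is", "watching", "tv"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence          # <s> marks the sentence start
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def mle_bigram(prev, word):
    """P(word | prev) = C(prev, word) / C(prev), the MLE of formula (3)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("is", "watching"))   # 2/3: "watching" follows "is" twice
print(mle_bigram("is", "sleeping"))   # 0.0: unseen bigram -> data sparseness
```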
3.3 Data Smoothing of the n-gram model
Data smoothing re-estimates the probabilities of n-grams whose frequency is 0, so that they are no longer impossible. Typical smoothing algorithms include additive smoothing, Good-Turing smoothing, Katz smoothing, and interpolation smoothing.
· Additive smoothing
To avoid the zero-probability problem, add a constant δ (0 < δ ≤ 1) to the count of every n-gram:
P(wi|wi-n+1, ..., wi-1) = (δ + C(wi-n+1, ..., wi-1, wi)) / (δ|V| + C(wi-n+1, ..., wi-1)),  where |V| is the vocabulary size    (4)
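A short sketch of formula (4) on toy bigram counts follows (the counts, vocabulary size, and δ are invented for illustration); with δ = 1 this is the classical add-one (Laplace) smoothing.

```python
# Additive (add-delta) smoothing, formula (4), on invented bigram counts.
bigram_counts = {("is", "watching"): 2, ("is", "reading"): 1}
history_counts = {"is": 3}     # C(w_{i-1})
vocab_size = 6                 # |V|, the word-table size

def additive_bigram(prev, word, delta=1.0):
    """P(word | prev) = (delta + C(prev, word)) / (delta * |V| + C(prev))."""
    count = bigram_counts.get((prev, word), 0)
    return (delta + count) / (delta * vocab_size + history_counts.get(prev, 0))

print(additive_bigram("is", "watching"))   # (1 + 2) / (6 + 3) = 0.333...
print(additive_bigram("is", "sleeping"))   # (1 + 0) / (6 + 3) = 0.111... (non-zero)
```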
· Good-Turing smoothing
Use frequency-of-frequency information to smooth the raw counts:
c* = (c + 1) · N(c+1) / N(c)    (5)
where N(c) denotes the number of n-grams whose frequency is c.
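The sketch below applies formula (5) to an invented frequency-of-frequency table (the numbers are purely illustrative) to show how low counts are discounted.

```python
# Good-Turing adjusted count c* = (c + 1) * N(c+1) / N(c), formula (5).
freq_of_freq = {1: 500, 2: 200, 3: 90, 4: 50}   # N(c): how many n-grams occur c times

def good_turing_count(c):
    """Return the smoothed count c* for an n-gram observed c times."""
    n_c = freq_of_freq.get(c)
    n_c1 = freq_of_freq.get(c + 1)
    if not n_c or not n_c1:
        return c          # fall back to the raw count when N(c) or N(c+1) is missing
    return (c + 1) * n_c1 / n_c

print(good_turing_count(1))  # (2 * 200) / 500 = 0.8: singleton counts are discounted
print(good_turing_count(3))  # (4 * 50) / 90 ≈ 2.22
```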
· Linear interpolation smoothing
This smoothing technique linearly interpolates the higher-order n-gram model with lower-order n-gram models. When there is not enough data to estimate the probability of the higher-order n-gram, the lower-order models can still provide useful information.
Pinterp(wi|wi-n+1, ..., wi-1) = λn P(wi|wi-n+1, ..., wi-1) + (1 - λn) Pinterp(wi|wi-n+2, ..., wi-1)    (6)
The coefficients λn can be estimated with the EM algorithm.
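Here is a tiny sketch of formula (6) for the bigram/unigram case; the probabilities and the weight λ are invented and fixed by hand, whereas in practice λ would be learned (for example, with EM as noted above).

```python
# Linear interpolation smoothing, formula (6): mix the bigram estimate
# with the unigram estimate. All values are invented for illustration.
p_bigram = {("is", "watching"): 0.6}                  # P(w_i | w_{i-1})
p_unigram = {"watching": 0.02, "sleeping": 0.001}     # P(w_i)

def interpolated_bigram(prev, word, lam=0.7):
    """lambda * P(w_i | w_{i-1}) + (1 - lambda) * P(w_i)."""
    return lam * p_bigram.get((prev, word), 0.0) + (1 - lam) * p_unigram.get(word, 0.0)

print(interpolated_bigram("is", "watching"))   # 0.7*0.6 + 0.3*0.02  = 0.426
print(interpolated_bigram("is", "sleeping"))   # 0.7*0.0 + 0.3*0.001 = 0.0003 (non-zero)
```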
· Katz smoothing
Also called back-off smoothing. The basic idea: when an n-gram occurs often enough, estimate its probability with maximum likelihood; when it occurs, but not often enough, use the Good-Turing estimate to discount it, redistributing part of its probability mass to unseen n-grams; and when the n-gram does not occur at all, back off to the lower-order model.
Pkatz(wi|wi-n+1, ..., wi-1) =
    PML(wi|wi-n+1, ..., wi-1)          if C(wi-n+1, ..., wi) ≥ K
    α · PGT(wi|wi-n+1, ..., wi-1)      if 0 < C(wi-n+1, ..., wi) < K
    β · Pkatz(wi|wi-n+2, ..., wi-1)    if C(wi-n+1, ..., wi) = 0    (7)
where PML is the maximum likelihood estimate, PGT is the Good-Turing estimate, and K is the count threshold.
The parameters α and β guarantee that the model probabilities satisfy the normalization constraint, that is, the conditional probabilities P(wi|wi-n+1, ..., wi-1) sum to 1 over all wi.
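The sketch below shows only the back-off control flow of formula (7); it is a simplified illustration, not full Katz smoothing. The Good-Turing discounting and the exact computation of the normalization weights α and β are omitted, and all counts, thresholds, and probabilities are invented.

```python
# Simplified back-off sketch in the spirit of formula (7); not full Katz.
K = 5                                    # "large enough" count threshold
bigram_counts = {("is", "watching"): 8, ("is", "reading"): 2}
history_counts = {"is": 12}
p_unigram = {"watching": 0.02, "reading": 0.01, "sleeping": 0.001}

def katz_like_bigram(prev, word, alpha=0.8, beta=0.4):
    count = bigram_counts.get((prev, word), 0)
    if count >= K:
        # Seen often enough: plain maximum likelihood estimate.
        return count / history_counts[prev]
    if count > 0:
        # Seen, but rarely: discounted estimate (real Katz uses Good-Turing here).
        return alpha * count / history_counts[prev]
    # Never seen: back off to the lower-order (unigram) model.
    return beta * p_unigram.get(word, 0.0)

print(katz_like_bigram("is", "watching"))   # 8/12 ≈ 0.667 (frequent bigram)
print(katz_like_bigram("is", "reading"))    # 0.8 * 2/12 ≈ 0.133 (discounted)
print(katz_like_bigram("is", "sleeping"))   # 0.4 * 0.001 = 0.0004 (backed off)
```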
3.4 Decoding Algorithm for the n-gram Model
Why does the n-gram model need a decoding algorithm? Take pinyin-to-character conversion as an example: the input pinyin string nixianzaiganshenme can correspond to many conversion results. The candidate words at each position (only some of the word nodes were drawn in the original figure) form a complex lattice, and any path from the start node to the end node is a possible conversion result. A decoding algorithm is needed to select the most suitable result from the many candidates.
The most common decoding algorithm is the Viterbi algorithm. It uses dynamic programming to find the best path efficiently. The algorithm is not described in detail here.
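For readers who want a feel for the idea, here is a compact Viterbi sketch over a toy conversion lattice (my own illustration; the candidate words and bigram probabilities are invented, and a real pinyin-to-character converter would build the lattice from a dictionary and score it with a trained n-gram model).

```python
# Viterbi decoding over a small, invented conversion lattice with a bigram model.
import math

lattice = [["ni"], ["xianzai", "xian"], ["ganshenme", "zai_gan"]]  # candidates per step
bigram_prob = {
    ("<s>", "ni"): 0.1,
    ("ni", "xianzai"): 0.05, ("ni", "xian"): 0.01,
    ("xianzai", "ganshenme"): 0.2, ("xian", "ganshenme"): 0.02,
    ("xianzai", "zai_gan"): 0.001, ("xian", "zai_gan"): 0.005,
}

def viterbi(lattice):
    """Return the highest-probability path through the lattice under a bigram model."""
    # best[word] = (log probability of the best path ending in word, that path)
    best = {"<s>": (0.0, ["<s>"])}
    for candidates in lattice:
        new_best = {}
        for word in candidates:
            # Extend every surviving path and keep only the best one ending in `word`.
            scored = [
                (lp + math.log(bigram_prob.get((prev, word), 1e-12)), path + [word])
                for prev, (lp, path) in best.items()
            ]
            new_best[word] = max(scored)
        best = new_best
    return max(best.values())

log_prob, path = viterbi(lattice)
print(path[1:], math.exp(log_prob))   # best conversion path and its probability
```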
3.5 Application of the n-gram Model
N-gram language models are widely used; the earliest applications were speech recognition and machine translation. Professor Wang Xiaolong of Harbin Institute of Technology first applied them to the pinyin-to-character conversion problem and proposed a sentence-level pinyin input method. The technology was later transferred to Microsoft and became the Microsoft Pinyin input method. Starting with Windows 95, the system installed this input method automatically, and later versions of Windows and Office integrated the latest Microsoft Pinyin input method. Years later, newer input methods (such as Sogou and Google pinyin input) also adopted n-gram technology.