Language model
The language model P(S) is the model used to compute the probability of a sentence S = w_1 w_2 ... w_n. By the chain rule:

$$P(S) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$
So, how do we calculate it? The simplest and most straightforward method is to count and then divide, that is, maximum likelihood estimation (MLE):

$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) = \frac{COUNT(w_1, w_2, \ldots, w_{i-1}, w_i)}{COUNT(w_1, w_2, \ldots, w_{i-1})}$$

where COUNT(w_1, w_2, ..., w_{i-1}, w_i) denotes the frequency of the word sequence (w_1, w_2, ..., w_{i-1}, w_i) in the corpus. Two serious problems arise here: sparse data and an impractically large parameter space.
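As a minimal sketch of this count-and-divide estimate (the toy corpus and the helper name `mle_probability` are illustrative assumptions, not part of the original text):

```python
# Toy corpus; in practice this would be a large collection of sentences.
corpus = [
    "the cat sat on the mat".split(),
    "the cat ate the fish".split(),
]

def mle_probability(history, word, corpus):
    """MLE estimate: COUNT(history + word) / COUNT(history)."""
    history = tuple(history)
    target = history + (word,)
    count_history = count_target = 0
    for sentence in corpus:
        for i in range(len(sentence)):
            if tuple(sentence[i:i + len(history)]) == history:
                count_history += 1
            if tuple(sentence[i:i + len(target)]) == target:
                count_target += 1
    return count_target / count_history if count_history else 0.0

# "the cat" occurs twice and "the cat sat" once, so P(sat | the cat) = 0.5
print(mle_probability(["the", "cat"], "sat", corpus))
```

With long histories most sequences never occur in the corpus, which is exactly the sparsity problem mentioned above.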
In practice, we typically use the N-gram model, which adopts the Markov assumption: each word depends only on the N-1 words that precede it.
- Assuming that the next word depends only on the one word immediately before it, we get the bigram model (a small sketch follows this list):

  $$P(S) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$
- Assuming that the next word depends only on the two words immediately before it, we get the trigram model:

  $$P(S) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$$
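A minimal bigram sketch under these assumptions (toy corpus, hypothetical function names, and `<s>`/`</s>` boundary markers added for illustration):

```python
from collections import Counter

def train_bigram(corpus):
    """Collect unigram and bigram counts, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_sentence_probability(sentence, unigrams, bigrams):
    """P(S) ~= product of P(w_i | w_{i-1}) estimated by MLE from the counts."""
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

corpus = ["the cat sat".split(), "the dog sat".split()]
unigrams, bigrams = train_bigram(corpus)
print(bigram_sentence_probability("the cat sat".split(), unigrams, bigrams))  # 0.5
```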
So, when facing a practical problem, how should we choose N, the number of preceding words to condition on?
- Larger N: more contextual information constrains the next word, giving greater discriminative power;
- Smaller N: each N-gram occurs more often in the training corpus, giving more reliable statistics.
In theory, the larger N the better; in practice, trigrams are used the most, and if a bigram can solve the problem, a trigram should never be used.
In essence, this kind of statistical language model describes the regular grammar of a finite-state language, whereas natural language is a language full of uncertainty. The model therefore differs from real language and its expressive power is limited; in particular, it handles long-distance dependencies poorly. However, it does capture the local constraints of natural language, so it has been very successful in practical applications.
Evaluating language models
Once a language model has been built, how do we judge whether it is good or bad? There are currently two main evaluation methods:
- Practical method: evaluate the model by its performance in real applications (such as spell checking or machine translation); the advantage is that this is intuitive and practical, the drawbacks are that it lacks focus and objectivity;
- Theoretical method: perplexity. The basic idea is that the language model assigning a higher probability to the test set is the better one. The formula is as follows:

$$PP(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \cdots w_N)}}$$
The formula shows that the smaller the perplexity, the larger the probability assigned to the sentences and the better the language model.
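A sketch of the perplexity computation under the same assumptions (it expects counts like those built by the hypothetical bigram sketch above; working in log space avoids numerical underflow on long test sets):

```python
import math

def perplexity(test_corpus, unigrams, bigrams):
    """Perplexity = exp(-1/N * sum of log P(w_i | w_{i-1})) over all test tokens."""
    log_prob, n_tokens = 0.0, 0
    for sentence in test_corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens[:-1], tokens[1:]):
            p = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
            if p == 0.0:
                return float("inf")  # one unseen bigram makes unsmoothed perplexity infinite
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)
```

The infinite-perplexity case is precisely the zero-probability problem that motivates the smoothing techniques below.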
Data sparsity and smoothing techniques
The data sparsity problem inevitably arises when large-scale statistical methods meet a limited training corpus, producing zero-probability events, which is consistent with the classic Zipf's law. For example, IBM Brown: a trigram model was trained on a 366M-word English corpus, yet 14.7% of the trigrams and 2.2% of the bigrams in the test corpus had never appeared in the training corpus.
To address data sparsity, many attempts have been made to adapt the theoretical model to practical use, and a series of classic smoothing techniques have emerged. Their basic idea is to "discount the conditional probabilities of the n-grams that do appear, so that the conditional probabilities of unseen n-grams are no longer 0", and after smoothing the probabilities must still sum to 1.
- Add-one (Laplace) Smoothing
Add-one smoothing, also known as Laplace's law, guarantees that every n-gram is counted at least once in the training corpus. Taking bigrams as an example, the formula is as follows:

$$P_{add1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$

where V is the vocabulary size (the number of distinct words).
When V >> c(w_{i-1}, w_i), i.e., when most n-grams do not appear in the training corpus (which is usually the case), add-one smoothing shifts too much probability mass onto the unseen n-grams, so its results are poor.
A simple improvement is add-δ smoothing (Lidstone, 1920; Jeffreys, 1948), where δ is a number smaller than 1:

$$P_{add\delta}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + \delta}{c(w_{i-1}) + \delta V}$$
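A minimal sketch of add-δ (Lidstone) smoothing for bigrams, where δ = 1 recovers add-one/Laplace; the count dictionaries and `vocab_size` are assumed to come from the earlier hypothetical sketches:

```python
def lidstone_probability(prev, word, unigrams, bigrams, vocab_size, delta=1.0):
    """Add-delta estimate: (c(prev, word) + delta) / (c(prev) + delta * V).

    delta = 1.0 gives add-one (Laplace) smoothing; smaller delta moves less
    probability mass away from the bigrams that were actually observed.
    """
    return (bigrams[(prev, word)] + delta) / (unigrams[prev] + delta * vocab_size)

# vocab_size would be the number of distinct word types, e.g. len(unigrams)
```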
- Good-Turing Smoothing

The basic idea is to use count-of-counts information (how many n-grams occur a given number of times) to smooth the raw frequencies: an n-gram that occurs c times is assigned the discounted count c*,

$$c^{*} = (c + 1)\frac{N_{c+1}}{N_c}$$

where N_c denotes the number of n-grams that occur exactly c times.
However, for larger c it may happen that N_{c+1} = 0 or N_c < N_{c+1}, which degrades the quality of the model.
A direct improvement strategy is to leave n-grams whose count exceeds a certain threshold unsmoothed; the threshold is generally taken to be 8~10.
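A sketch of the Good-Turing adjusted counts c* = (c+1) N_{c+1} / N_c, including the threshold trick just described; the input is assumed to be an n-gram count table like the hypothetical ones built earlier:

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts, threshold=8):
    """Return a dict mapping each observed n-gram to its Good-Turing adjusted count c*."""
    count_of_counts = Counter(ngram_counts.values())  # N_c: how many n-grams occur exactly c times
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c, n_c1 = count_of_counts[c], count_of_counts[c + 1]
        if c > threshold or n_c1 == 0:
            adjusted[ngram] = float(c)               # leave large or unsupported counts unsmoothed
        else:
            adjusted[ngram] = (c + 1) * n_c1 / n_c   # c* = (c+1) N_{c+1} / N_c
    return adjusted
```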
- Linear Interpolation Smoothing

Whether add-one or Good-Turing smoothing is used, all unseen n-grams are treated equally, which is inevitably unreasonable (different unseen events may have different probabilities). Linear interpolation smoothing is therefore introduced: the basic idea is to combine higher-order and lower-order models as a linear combination, interpolating the higher-order n-gram model with lower-order n-gram models. When there is not enough data to estimate the probabilities of the higher-order n-gram model, the lower-order models can often provide useful information.
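A sketch of linear interpolation for a trigram model, combining trigram, bigram, and unigram MLE estimates with weights that sum to 1 (the weights here are illustrative assumptions; in practice they are tuned on held-out data):

```python
def interpolated_probability(w2, w1, w, unigrams, bigrams, trigrams, total_tokens,
                             lambdas=(0.6, 0.3, 0.1)):
    """P(w | w2 w1) = l3*P_trigram + l2*P_bigram + l1*P_unigram, with l3 + l2 + l1 = 1."""
    l3, l2, l1 = lambdas
    p3 = trigrams[(w2, w1, w)] / bigrams[(w2, w1)] if bigrams[(w2, w1)] else 0.0
    p2 = bigrams[(w1, w)] / unigrams[w1] if unigrams[w1] else 0.0
    p1 = unigrams[w] / total_tokens if total_tokens else 0.0
    return l3 * p3 + l2 * p2 + l1 * p1
```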
- Backoff Smoothing

The basic idea is that if an n-gram has a count of 0, the model backs off to the (n-1)-gram and uses it as an approximate estimate.
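A simplified backoff sketch in the same spirit (closer to a scaled "stupid backoff" than to full Katz backoff, which also redistributes the discounted probability mass; the counts and the fixed `alpha` are illustrative assumptions):

```python
def backoff_probability(w1, w, unigrams, bigrams, total_tokens, alpha=0.4):
    """If the bigram (w1, w) was seen, use its MLE estimate;
    otherwise back off to the (scaled) unigram estimate."""
    if bigrams[(w1, w)] > 0 and unigrams[w1] > 0:
        return bigrams[(w1, w)] / unigrams[w1]
    return alpha * unigrams[w] / total_tokens if total_tokens else 0.0
```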