The Beauty of Mathematics, Series 3: The Application of Hidden Markov Models in Language Processing
Posted by: Wu Jun, Google researcher
The Hidden Markov Model (HMM) is a mathematical model that has long been regarded as the most successful method for building fast and accurate speech recognition systems. A complex speech recognition problem can be expressed and solved in a remarkably simple way with a hidden Markov model, which makes me marvel at the beauty of mathematical models.
Natural language is a tool for humans to exchange information. Many natural language processing problems can be cast as decoding problems in a communication system: a listener guesses the speaker's meaning from the received information. This is just like communication, where we analyze the signal arriving at the receiving end in order to understand and restore the information transmitted from the sending end. The figure below shows a typical communication system:
s1, s2, s3, ... denote the signals sent by the information source, and o1, o2, o3, ... the signals received by the receiver. Decoding in communication means restoring the signals s1, s2, s3, ... from the received signals o1, o2, o3, ....
In fact, when we speak, our brain is the information source; our throat (vocal cords) and the air are the channel, like wires and optical cables; the listener's ears are the receiver; and the sound is the transmitted signal. Inferring the speaker's meaning from the acoustic signal is speech recognition. If the receiver is a computer rather than a human, the computer must perform automatic speech recognition. Similarly, if a computer infers the speaker's Chinese meaning from received English text, that is machine translation; if it infers the intended sentence from a statement containing spelling mistakes, that is automatic spelling correction.
So how do we guess what the speaker means from the received information? We can use the Hidden Markov Model to solve such problems. Take speech recognition as an example: when we observe the speech signals o1, o2, o3, ..., we need to infer the sentence s1, s2, s3, ... that produced them. Obviously, we should find the most likely sentence among all candidates. In mathematical language, this is the sentence s1, s2, s3, ... that, given o1, o2, o3, ..., maximizes the conditional probability

P (s1, s2, s3, ... | o1, o2, o3, ...)
Of course, this probability is not easy to obtain directly, so we compute it indirectly. Using the Bayes formula and dropping a constant term (the probability of the observations, which is the same for every candidate sentence), maximizing the formula above is equivalent to maximizing

P (o1, o2, o3, ... | s1, s2, s3, ...) * P (s1, s2, s3, ...)
where

P (o1, o2, o3, ... | s1, s2, s3, ...) is the probability that the sentence s1, s2, s3, ... is pronounced as o1, o2, o3, ..., and

P (s1, s2, s3, ...) is the probability that s1, s2, s3, ... forms a reasonable sentence. So the formula says: multiply the probability that s1, s2, s3, ... would produce the received signals by the probability that s1, s2, s3, ... is a sentence at all.
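The argument above can be sketched in a few lines of code. The sentences and probability values below are entirely made up for illustration; the point is only that the constant denominator P(o1, o2, o3, ...) cancels out, so we can pick the candidate with the largest product P(o|s) * P(s).

```python
# Toy illustration of the Bayes decomposition above (hypothetical numbers):
# among candidate sentences s, pick the one maximizing P(o|s) * P(s).
# The denominator P(o) is the same for every candidate, so it can be dropped.

candidates = {
    "recognize speech":   {"p_o_given_s": 0.30, "p_s": 0.010},
    "wreck a nice beach": {"p_o_given_s": 0.35, "p_s": 0.0001},
}

def best_sentence(candidates):
    # argmax over s of P(o|s) * P(s)
    return max(candidates,
               key=lambda s: candidates[s]["p_o_given_s"] * candidates[s]["p_s"])

print(best_sentence(candidates))  # the language model P(s) picks the sensible reading
```

Note how the acoustically slightly better candidate loses because its language-model probability P(s) is tiny; this is exactly why the second factor matters.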
(The reader may ask whether we have made the problem more complicated, since the formula keeps getting longer. Don't worry; we now simplify it.) Here we make two assumptions:
First, s1, s2, s3, ... form a Markov chain; that is, si is determined only by si-1 (see Series 1);

Second, the signal oi received at time i is determined only by the signal si sent at time i (this is called the independent output assumption), that is, P (o1, o2, o3, ... | s1, s2, s3, ...) = P (o1 | s1) * P (o2 | s2) * P (o3 | s3) ....
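Under these two assumptions, the joint probability of a state sequence and an observation sequence factorizes into a product of transition and output probabilities. A minimal sketch, using hypothetical two-state toy tables (not anything from the article):

```python
# Joint probability P(s1..sn, o1..on) under the two assumptions:
# Markov chain over states + independent outputs.
# All probability tables are hypothetical toy numbers.

start_p = {"A": 0.6, "B": 0.4}               # P(s1)
trans_p = {"A": {"A": 0.7, "B": 0.3},        # P(si | si-1)
           "B": {"A": 0.4, "B": 0.6}}
emit_p  = {"A": {"x": 0.5, "y": 0.5},        # P(oi | si)
           "B": {"x": 0.1, "y": 0.9}}

def joint_probability(states, observations):
    """P(s1..sn, o1..on) = P(s1)*P(o1|s1) * prod of P(si|si-1)*P(oi|si)."""
    p = start_p[states[0]] * emit_p[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= trans_p[prev][cur] * emit_p[cur][obs]
    return p

print(joint_probability(["A", "B"], ["x", "y"]))
```

Each factor corresponds to one term of the factorized formula above, which is what makes the maximization tractable.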
We can then easily use the Viterbi algorithm to find the sentence s1, s2, s3, ... that maximizes the formula above; this is the recognized sentence.
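The Viterbi step can be sketched as follows. This is a minimal dynamic-programming implementation over the same kind of hypothetical toy model as above (real speech systems use much larger state spaces and log probabilities); it keeps, for each state, the best path ending there, and extends those paths one observation at a time.

```python
# Minimal Viterbi sketch (hypothetical toy model, not the article's speech models):
# finds the state sequence maximizing P(s1..sn) * P(o1..on | s1..sn).

start_p = {"A": 0.6, "B": 0.4}               # P(s1)
trans_p = {"A": {"A": 0.7, "B": 0.3},        # P(si | si-1)
           "B": {"A": 0.4, "B": 0.6}}
emit_p  = {"A": {"x": 0.5, "y": 0.5},        # P(oi | si)
           "B": {"x": 0.1, "y": 0.9}}

def viterbi(observations):
    states = list(start_p)
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for cur in states:
            # extend the best previous path into state `cur`
            prob, path = max(
                (best[prev][0] * trans_p[prev][cur] * emit_p[cur][obs],
                 best[prev][1])
                for prev in states
            )
            new_best[cur] = (prob, path + [cur])
        best = new_best
    prob, path = max(best.values())
    return path, prob

path, prob = viterbi(["x", "y", "y"])
print(path, prob)
```

Because only the best path into each state survives each step, the cost is linear in the sequence length rather than exponential in the number of candidate sentences.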
A model satisfying these two assumptions is called a Hidden Markov Model. The word "hidden" refers to the fact that the states s1, s2, s3, ... cannot be observed directly.
The applications of the Hidden Markov Model go far beyond speech recognition. In the formula above, if we take s1, s2, s3, ... to be Chinese sentences and o1, o2, o3, ... to be the corresponding English text, the model solves machine translation; if we take o1, o2, o3, ... to be the image features of scanned text, the model can be used to recognize printed and handwritten characters.
P (o1, o2, o3, ... | s1, s2, s3, ...) goes by different names in different applications: in speech recognition it is called the "acoustic model", in machine translation the "translation model", and in spelling correction the "correction model". P (s1, s2, s3, ...) is the language model we introduced in Series 1.
Before using the hidden Markov model to solve language processing problems, we must first train the model. The common training method was proposed by Baum in the 1960s and is named after him. Speech recognition was the earliest successful application of the Hidden Markov Model to language problems. In the 1970s, Fred Jelinek of IBM and Jim and Janet Baker of Carnegie Mellon University (the Bakers, a married couple and senior schoolmates of Kai-Fu Lee) independently proposed using the Hidden Markov Model for speech recognition, cutting the error rate to one third of that of the artificial-intelligence and pattern-matching approaches (from about 30% to 10%). In the 1980s, Dr. Kai-Fu Lee persisted with the Hidden Markov Model framework and successfully developed Sphinx, the world's first large-vocabulary continuous speech recognition system.
I first came into contact with the hidden Markov model almost twenty years ago. At the time I learned the model in "Stochastic Processes" (a famous course at Tsinghua), but I could not figure out its practical use. A few years later, when I studied speech recognition with Professor Wang Zuoying at Tsinghua, he gave me dozens of papers to read. The ones that impressed me most were the articles by Jelinek and Kai-Fu Lee, whose core idea is the hidden Markov model. That complex speech recognition problems can be expressed and solved in such a simple way makes me sincerely marvel at the beauty of mathematical models.