This article mainly introduces the traditional speech recognition system based on GMM/HMMs.
Outline:
1. Recognition principle
2. Statistical models
3. System framework
First, it should be made clear that the object discussed in this article is continuous speech recognition (Continuous Speech Recognition, CSR); isolated word recognition based on DTW (Dynamic Time Warping) is out of date and not within the scope of the discussion. The article also focuses on the decoding process of automatic speech recognition (the recognition process).

1. Principle of Recognition
First, understand that our voice is a sound wave, i.e., an analog signal. It is generally stored on a computer as a WAV file (an uncompressed format), or it can be acquired directly from a microphone (online).
Preprocessing and digitization are required first: filtering and noise reduction, pre-emphasis (boosting high frequencies), endpoint detection, and windowed framing, which decomposes the speech signal into many short segments (voice frames). Generally, each frame is 25 ms long and adjacent frames are shifted by 10 ms (so they overlap by 15 ms); this is the often-quoted frame length of 25 ms with a frame shift of 10 ms.
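The framing step above can be sketched as follows. This is a minimal illustration assuming a 16 kHz mono signal held in a NumPy array; the function name `frame_signal` and the Hamming window choice are illustrative, not part of any particular toolkit.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)    # 25 ms -> 400 samples
    frame_shift = int(sample_rate * shift_ms / 1000)  # 10 ms -> 160 samples
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len]
        for i in range(num_frames)
    ])
    # Apply a Hamming window to each frame to reduce spectral leakage.
    return frames * np.hamming(frame_len)

# One second of audio yields 1 + (16000 - 400) // 160 = 98 overlapping frames.
frames = frame_signal(np.random.randn(16000))
print(frames.shape)  # (98, 400)
```

Note that consecutive frames share 240 samples (15 ms), which is what keeps the short-time analysis smooth across frame boundaries.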
Then, signal analysis is performed on each frame to further compress the data; this is known as feature extraction. Common feature parameters are MFCC and PLP. After feature extraction, each frame is compressed from the original several hundred sample points into a 39-dimensional MFCC feature vector (which makes subsequent processing much easier).
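A simplified MFCC computation for a single frame can be sketched with NumPy alone. This is an assumption-laden sketch (filter count, FFT size, and function names are illustrative, and real toolkits add pre-emphasis, liftering, and energy terms): power spectrum, then a triangular mel filterbank, then log, then a DCT-II to decorrelate. Appending first- and second-order deltas to the 13 coefficients gives the 39 dimensions mentioned above.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def mfcc(frame, n_fft=512, n_ceps=13):
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # power spectrum
    feats = np.log(mel_filterbank(n_fft=n_fft) @ spec + 1e-10)
    # DCT-II basis decorrelates the log filterbank energies.
    n = feats.shape[0]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(n_ceps)[:, None])
    return basis @ feats

coeffs = mfcc(np.random.randn(400))   # one 25 ms frame at 16 kHz
print(coeffs.shape)  # (13,)
```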
Next comes the question of how to transform this sequence of feature vectors into a passage of text. This is where the acoustic model (GMM-HMMs) and the language model come into play. First, we need to know that a sentence consists of a sequence of words, and each word consists of a string of phonemes (for example, ball: /b/ /ɔː/ /l/). In English we usually build a hidden Markov model per phoneme (for Chinese the modeling unit is usually the initial/final), that is, one phoneme corresponds to one HMM, and an HMM usually consists of three states.

Now work backwards: we have a sequence of feature vectors, and the recognition process solves how each feature vector is mapped to a state, then from states to phonemes, phonemes to words, and words to a word sequence (a sentence). The mapping from feature vectors to states is solved by GMMs (Gaussian mixture models); from three states to one phoneme, by the HMM; from phonemes to words, by the pronunciation dictionary; and from words to a word sequence, by the language model. Throughout the process, everything takes place in a state network (time-state), all based on HMMs. This is also why it is said that HMMs solved the problem of speech recognition.
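The frame-to-state step above can be illustrated with a toy Viterbi decoder over a 3-state left-to-right HMM (one phoneme), assuming the per-frame state likelihoods have already been produced by the GMMs. All the probabilities here are made up for illustration.

```python
import numpy as np

def viterbi(log_like, log_trans):
    """Best state path for per-frame log-likelihoods (T x S) and an
    S x S log transition matrix, starting in state 0."""
    T, S = log_like.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0] = log_like[0, 0]            # left-to-right: start in state 0
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_like[t, s]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):           # backtrace
        path.append(back[t, path[-1]])
    return path[::-1]

# Left-to-right transitions: self-loop or advance to the next state.
log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                             [0.0, 0.6, 0.4],
                             [0.0, 0.0, 1.0]]) + 1e-12)
# Six frames whose GMM likelihoods favor states 0, 0, 1, 1, 2, 2 in turn.
log_like = np.log(np.array([[0.8, 0.1, 0.1],
                            [0.7, 0.2, 0.1],
                            [0.2, 0.7, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.1, 0.2, 0.7],
                            [0.1, 0.1, 0.8]]))
print(viterbi(log_like, log_trans))  # [0, 0, 1, 1, 2, 2]
```

In a real decoder the same dynamic program runs over the full state network spanning phonemes, words, and the language model, not over a single phoneme's three states.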
2. Statistical Models
The task of automatic speech recognition (Automatic Speech Recognition, ASR) is to map a segment of acoustic signal to a string of text. Modeling is the first step in actually solving the problem:
$$W^* = \mathop{\arg\max}_{W} P(W|X) \tag{1}$$
where $X = x_1^T = x_1 x_2 \cdots x_t \cdots x_T$ represents an acoustic signal (sequence of voice frames) of length $T$, $W = w_1^N = w_1 w_2 \cdots w_N$ represents a word sequence of length $N$, and $W^*$ is the most likely word sequence of all, i.e., our recognition result.
However, formula (1) is difficult to compute directly, so we turn to a generative formulation by applying Bayes' rule:
$$P(W|X) = \frac{P(X|W)\,P(W)}{P(X)} \propto P(X|W)\,P(W) \tag{2}$$
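Since $P(X)$ does not depend on $W$, the decoder can rank hypotheses by $P(X|W)\,P(W)$ alone, the product of the acoustic model score and the language model score. A toy illustration (the candidate words and all probabilities are invented for this sketch; real systems work in log space over vast hypothesis networks):

```python
import math

# word: (acoustic model score P(X|W), language model score P(W))
candidates = {
    "ball": (0.020, 0.30),
    "bowl": (0.025, 0.10),
    "bell": (0.001, 0.60),
}

def decode(cands):
    # argmax_W  log P(X|W) + log P(W)  -- P(X) is the same for every W.
    return max(cands, key=lambda w: math.log(cands[w][0]) + math.log(cands[w][1]))

print(decode(candidates))  # ball
```

Note how "bell" loses despite the highest language model score: the combined score, not either model alone, picks the winner.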