Speech Recognition overview
Speech recognition is the process of converting a speech signal into words. The framework of a current speech recognition system is as follows:
The signal processing module extracts the most important features of speech, guided by the auditory perception of the human ear, and converts the speech signal into a sequence of feature vectors. Acoustic features commonly used in current speech recognition systems include linear predictive coding (LPC), Mel-frequency cepstral coefficients (MFCC), and Mel-scale filter bank features (FBank).
The decoder converts the input sequence of speech feature vectors into a character sequence according to the acoustic model and the language model.
The acoustic model is a knowledge representation of the variability in acoustics, phonetics, and the environment, as well as of differences in speaker gender and accent. The language model is a knowledge representation of a set of word sequences.
Training of models
In modern speech recognition systems, both the acoustic model and the language model are built mainly by statistical analysis of large corpora.
Acoustic model
Acoustic models in speech recognition make full use of information about acoustics, phonetics, environmental characteristics, and speaker gender and accent to model speech. Current speech recognition systems mostly use the hidden Markov model (HMM) to model the probability of a speech feature vector sequence given a state sequence. The hidden Markov model is a probabilistic graphical model that can express dependencies within a sequence and is often used to model time-series data.
The hidden Markov model is a weighted directed graph in which each node is called a state. At each time step, the model jumps from one state to another with a certain probability and emits an observation symbol with a certain probability. The transition probabilities are expressed as the weights on the edges. In the figure, S0 and S1 are states, and A and B are the observation symbols that can be emitted.
The hidden Markov model assumes that each state transition depends only on the previous state and not on any earlier states, which is the Markov assumption. It also assumes that the symbol emitted in a state depends only on the current state and not on other states or other symbols, which is the output independence assumption.
The hidden Markov model is generally written as the triple $\lambda = (A, B, \pi)$, where $A$ is the state transition probability matrix, giving the probability of moving to each state from a given state; $B$ is the emission probability matrix, giving the probability of emitting each symbol in a given state; and $\pi$ is the initial state probability vector, giving the probability that the model starts in each state.
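To make the triple concrete, here is a minimal sketch of a two-state model and of the joint probability of one state path and one observation sequence under the two assumptions above. All probability values are made up for illustration, and the function name `joint_probability` is hypothetical.

```python
# Toy HMM lambda = (A, B, pi). States: 0 = S0, 1 = S1. Symbols: 0 = A, 1 = B.
# Every number below is an illustrative assumption, not taken from the figure.
pi = [0.6, 0.4]            # pi[i]   = P(first state is i)
A = [[0.7, 0.3],           # A[i][j] = P(next state is j | current state is i)
     [0.4, 0.6]]
B = [[0.9, 0.1],           # B[i][k] = P(symbol k is emitted | state is i)
     [0.2, 0.8]]


def joint_probability(state_seq, obs_seq):
    """P(state_seq, obs_seq | lambda).

    Markov assumption: each transition depends only on the previous state.
    Output independence: each symbol depends only on the current state.
    """
    first = state_seq[0]
    prob = pi[first] * B[first][obs_seq[0]]
    for t in range(1, len(obs_seq)):
        prev, cur = state_seq[t - 1], state_seq[t]
        prob *= A[prev][cur] * B[cur][obs_seq[t]]
    return prob


# Path S0 -> S0 -> S1 emitting the symbols A, A, B.
p = joint_probability([0, 0, 1], [0, 0, 1])
print(p)
```

Each factor in the product is a single table lookup, which is what makes the dynamic-programming algorithms for this model cheap.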
The hidden Markov model produces two random sequences, a state sequence and a sequence of observation symbols, so it is a doubly stochastic process. Only the observation sequence is visible from the outside; the state sequence is hidden. The Viterbi algorithm finds the state sequence with the highest probability given an observation sequence. The probability of an observation sequence can be computed efficiently with the forward-backward algorithm. The transition probabilities and the emission probabilities of each state can be estimated with the Baum-Welch algorithm.
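The two inference tasks above can be sketched in a few lines. This is only an illustration on a made-up two-state model: `forward` implements just the forward pass of the forward-backward algorithm, and Baum-Welch training is not shown. The function names and all probability values are assumptions.

```python
# Toy HMM with illustrative (assumed) parameters.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]


def forward(obs):
    """P(obs | lambda): sums over all state sequences with the forward pass."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def viterbi(obs):
    """Most probable state sequence for obs, and its joint probability."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        new_delta, pointers = [], []
        for j in range(n):
            # Best predecessor of state j at this time step.
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        delta = new_delta
        backpointers.append(pointers)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


best_path, best_prob = viterbi([0, 0, 1])
total_prob = forward([0, 0, 1])
print(best_path, best_prob, total_prob)
```

The only difference between the two recursions is the operator applied over predecessor states: a sum in `forward`, a maximum in `viterbi`.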
In speech recognition, hidden Markov models are used to model the relationship between acoustic units and speech feature sequences. Generally speaking, the smaller the acoustic unit, the fewer distinct units there are, but the more sensitive each unit is to its context. Large-vocabulary continuous speech recognition systems generally use sub-word units as acoustic units, such as phonemes in English and initials and finals in Chinese.
In acoustic models, the hidden Markov model generally uses a three-state left-to-right topology in which each state has an arc pointing to itself. The figure shows the phoneme /t/ modeled by three states.
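As a rough illustration of this topology, the sketch below builds the transition matrix of a left-to-right model in which each state can only stay where it is or move one state forward. The self-loop probability of 0.6 and the extra exit column are assumptions made for the example.

```python
def left_to_right_transitions(num_states=3, self_loop=0.6):
    """Transition matrix of a left-to-right HMM.

    Row i has two non-zero entries: the self-loop on state i and the arc to
    state i + 1. The last column stands for leaving the phoneme model.
    """
    rows = []
    for i in range(num_states):
        row = [0.0] * (num_states + 1)
        row[i] = self_loop              # arc pointing back to the same state
        row[i + 1] = 1.0 - self_loop    # arc to the next state (or the exit)
        rows.append(row)
    return rows


transitions = left_to_right_transitions()
for row in transitions:
    print(row)
```

The self-loops are what let one phoneme model absorb a variable number of frames.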
Because of coarticulation in continuous speech, a phoneme has to be considered together with its left and right neighbors, which gives the triphone model. Introducing triphones sharply increases the number of hidden Markov models, so their states are generally clustered; a clustered state is called a senone.
Acoustic feature vectors in speech recognition take continuous values. To avoid the error that quantization would introduce, a continuous probability density function is used to model the probability of a feature vector given a state. The Gaussian mixture model (GMM) can approximate any probability density function, so it became the first choice for this modeling.
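A minimal sketch of how a Gaussian mixture scores one feature vector for one state is shown below. It assumes diagonal covariance matrices, which is a simplification chosen for the example rather than something stated above, and all parameter values are made up.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at the vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k weights[k] * N(x; means[k], variances[k]), via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical two-component mixture over two-dimensional feature vectors.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [2.0, 1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
score = gmm_log_likelihood([1.8, 0.9], weights, means, variances)
print(score)
```

Working in the log domain and subtracting the largest term before exponentiating keeps the sum from underflowing when the densities are very small.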
Since Deng Li introduced deep learning into acoustic modeling for speech recognition, the relationship between acoustic feature vectors and states has been modeled with deep neural networks. This greatly improved recognition accuracy, and the application of deep learning to acoustic modeling has flourished ever since. Examples include recurrent neural networks (RNN), which exploit the context of the acoustic feature vectors, and their special case, long short-term memory networks (LSTM).
Language model
A language model represents the probability that a word sequence occurs. The language model commonly used in speech recognition is the N-gram, which is built from statistics on sequences of N consecutive words. The N-gram model assumes that the probability of a word depends only on the N-1 words that precede it.
Given a word sequence $W = (w_1, w_2, w_3, \dots, w_n)$, its probability can be decomposed into the following form:
$$P(W) = P(w_1, w_2, w_3, \dots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, w_3, \dots, w_{n-1})$$
However, such probabilities cannot be estimated reliably from data. Under the Markov assumption, each word only needs to be conditioned on the N-1 words before it. For N = 2 this gives
$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})$$
By the definition of conditional probability, the probability of one word given the preceding word is
$$P(w_n \mid w_{n-1}) = \frac{P(w_{n-1}, w_n)}{P(w_{n-1})}$$
Thus it is enough to count how often each pair of adjacent words occurs in a large corpus, count how often each single word occurs, and take the ratio.
Some rare phrases inevitably never appear in the corpus even though their probability of occurring is not zero, so an algorithm is needed to assign probability to them. This is called smoothing. Commonly used methods include Good-Turing smoothing and Katz smoothing.
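The counting procedure for N = 2 can be sketched as follows. The three-sentence corpus is made up, and the boundary markers `[BOS]` and `[EOS]` are an assumed convention that replaces the unconditioned first-word term with a probability given the sentence start. No smoothing is applied, so any unseen word pair gets probability zero, which is exactly the problem that Good-Turing or Katz smoothing is meant to address.

```python
from collections import Counter

# Tiny made-up corpus; every sentence is a list of words.
corpus = [
    ["i", "like", "speech"],
    ["i", "like", "music"],
    ["you", "like", "speech"],
]

history_counts = Counter()   # how often each word appears as a history
bigram_counts = Counter()    # how often each adjacent word pair appears
for sentence in corpus:
    tokens = ["[BOS]"] + sentence + ["[EOS]"]
    history_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_prob(prev, word):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]


def sentence_prob(sentence):
    """Probability of a whole sentence under the bigram model."""
    tokens = ["[BOS]"] + sentence + ["[EOS]"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob(["i", "like", "speech"]))
print(bigram_prob("like", "you"))   # unseen pair, so the estimate is zero
```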
Decoding based on weighted finite-state transducers
The decoding problem in speech recognition can be stated as follows: given an acoustic observation sequence $X = (x_1, x_2, x_3, \dots, x_T)$ of length $T$, find the corresponding word sequence $W = (w_1, w_2, w_3, \dots, w_U)$ of length $U$ that maximizes the posterior probability $P(W \mid X)$. That is, find the word sequence $\hat{W}$ such that
$$\hat{W} = \arg\max_{W} P(W \mid X)$$
The posterior probability $P(W \mid X)$ is not easy to obtain directly. Applying Bayes' formula gives:
$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}$$
Since the acoustic observation sequence is given, its probability $P(X)$ is a constant, so the expression simplifies to:
$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\,P(W)$$
At present, large-vocabulary speech recognition introduces knowledge of acoustics, phonetics, and linguistics into the system. Let $H$ denote the hidden Markov model state sequence, $C$ the context-dependent phoneme sequence, and $L$ the phoneme sequence. Assuming that each level depends only on the level directly above it (the acoustic feature sequence on the state sequence, the state sequence on the context-dependent phoneme sequence, the context-dependent phoneme sequence on the phoneme sequence, and the phoneme sequence on the word sequence), the equation can be expanded as:
$$\hat{W} = \arg\max_{W} \sum_{H} \sum_{C} \sum_{L} P(X \mid H)\,P(H \mid C)\,P(C \mid L)\,P(L \mid W)\,P(W)$$
In this equation, $P(X \mid H)$ is called the acoustic model and gives the probability of the acoustic feature sequence given the hidden Markov model state sequence. $P(H \mid C)$, $P(C \mid L)$, and $P(L \mid W)$ give, respectively, the probability of the state sequence given the context-dependent phoneme sequence, of the context-dependent phoneme sequence given the phoneme sequence, and of the phoneme sequence given the word sequence. $P(W)$ is the probability that the sentence occurs and is called the language model. All of these probabilities come from the training process described earlier.
At present, decoding in speech recognition is generally based on weighted finite-state transducers (WFST).
A weighted finite-state transducer is a weighted directed graph. Each node represents a state. When a state accepts an input symbol, it follows the corresponding arc to another state and emits an output symbol; each arc can also carry a weight. The formal description is as follows:
A weighted finite-state transducer over a semiring $K$ is an eight-tuple $T = (\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$, where $\Sigma$ is the input symbol set, $\Delta$ is the output symbol set, $Q$ is the state set, $I \subseteq Q$ is the set of initial states, $F \subseteq Q$ is the set of final states, the five-way relation $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times K \times Q$ is the set of transitions, the mapping $\lambda: I \to K$ gives the weights of the initial states, and the mapping $\rho: F \to K$ gives the weights of the final states.
To widen the range of applications of weighted finite-state transducers, the notion of "weight" is generalized to an algebraic structure called a semiring. Given a set $K$ and two operations $\oplus$ and $\otimes$ on it, $K$ is a semiring if $(K, \oplus, \bar{0})$ is a commutative monoid with identity element $\bar{0}$, $(K, \otimes, \bar{1})$ is a monoid with identity element $\bar{1}$, $\otimes$ distributes over $\oplus$, and $\bar{0}$ is an annihilator for $\otimes$, that is, for any $a \in K$, $a \otimes \bar{0} = \bar{0} \otimes a = \bar{0}$. The total weight of a path from an initial state to a final state is then obtained by combining the weights of its arcs with $\otimes$, and the total weight of several paths is obtained by combining the path weights with $\oplus$.
The following figure shows a simple weighted finite-state transducer. Its input symbol set is {a, b, c}, shown as the symbol before the colon on each arc, and its output symbol set is {x, y, z}, shown as the symbol after the colon. The semiring is the real numbers, with each weight shown as the number after the slash, and a double circle marks a final state.
In speech recognition, the total weight of a path can be regarded as the joint probability of the output sequence given the input sequence. Because of the memoryless property of the Markov chain, the total weight is the product of the weights along the path. To prevent floating-point underflow on a computer, these probabilities are usually stored as negative logarithms, which gives the log semiring in the table below, where the $\oplus_{\log}$ operation is defined as $x \oplus_{\log} y = -\log(e^{-x} + e^{-y})$. Because speech recognition usually needs the path with the best weight in the weighted finite-state transducer, the tropical semiring is also defined.
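| Semiring | Set | $\oplus$ | $\otimes$ | $\bar{0}$ | $\bar{1}$ |
|---|---|---|---|---|---|
| Log | $\mathbb{R} \cup \{-\infty, +\infty\}$ | $\oplus_{\log}$ | $+$ | $+\infty$ | $0$ |
| Tropical | $\mathbb{R} \cup \{-\infty, +\infty\}$ | $\min$ | $+$ | $+\infty$ | $0$ |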
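The two semirings in the table can be sketched directly. In this illustration each semiring is a plain tuple `(plus, times, zero, one)`, which is a representation chosen for the example, and weights are assumed to be negative log probabilities. The case where a weight is $-\infty$ is not handled.

```python
import math

INF = float("inf")


def log_plus(x, y):
    """x (+)_log y = -log(e^-x + e^-y), computed without underflow."""
    if x == INF:
        return y
    if y == INF:
        return x
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))


# (plus, times, zero, one), matching the columns of the table.
LOG = (log_plus, lambda x, y: x + y, INF, 0.0)
TROPICAL = (min, lambda x, y: x + y, INF, 0.0)


def path_weight(arc_weights, semiring):
    """Combine the arc weights along one path with the times operation."""
    _, times, _, one = semiring
    total = one
    for w in arc_weights:
        total = times(total, w)
    return total


def combine_paths(path_weights, semiring):
    """Combine the weights of several paths with the plus operation."""
    plus, _, zero, _ = semiring
    total = zero
    for w in path_weights:
        total = plus(total, w)
    return total


# Two paths with probabilities 0.5 * 0.5 and 0.5 * 0.25.
p1 = path_weight([-math.log(0.5), -math.log(0.5)], LOG)
p2 = path_weight([-math.log(0.5), -math.log(0.25)], LOG)
print(math.exp(-combine_paths([p1, p2], LOG)))       # sum of the two paths
print(math.exp(-combine_paths([p1, p2], TROPICAL)))  # best single path
```

In the log semiring the two paths are summed, whereas the tropical semiring keeps only the better of the two.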
The composition operation combines weighted finite-state transducers at different levels. For example, a real speech recognition system typically builds four transducers: H, which maps hidden Markov model state sequences to context-dependent phoneme sequences; C, which maps context-dependent phoneme sequences to phoneme sequences; L, which maps phoneme sequences to word sequences; and the language model G. Composing the four yields HCLG, which encodes the phonetic and linguistic knowledge. The weight on each of its arcs can be regarded as the probability of outputting the corresponding word given the input hidden Markov model state.
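The idea of composition can be sketched on two tiny transducers. This is a simplified illustration, not how a production toolkit does it: it assumes there are no epsilon symbols, uses the tropical semiring (weights add along matched arcs), and stores each transducer in a hypothetical dictionary layout with arcs as `(source, input, output, weight, destination)` tuples.

```python
from collections import deque


def compose(t1, t2):
    """Epsilon-free composition: match outputs of t1 with inputs of t2."""
    start = (t1["start"], t2["start"])
    arcs, finals = [], {}
    queue, seen = deque([start]), {start}
    while queue:
        p, q = queue.popleft()
        # A pair state is final when both of its components are final.
        if p in t1["finals"] and q in t2["finals"]:
            finals[(p, q)] = t1["finals"][p] + t2["finals"][q]
        for src1, in1, out1, w1, dst1 in t1["arcs"]:
            if src1 != p:
                continue
            for src2, in2, out2, w2, dst2 in t2["arcs"]:
                if src2 == q and in2 == out1:
                    nxt = (dst1, dst2)
                    arcs.append(((p, q), in1, out2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


# Hypothetical toy transducers: t1 maps a, b to x, y and t2 maps x, y to u, v.
t1 = {"start": 0, "finals": {2: 0.0},
      "arcs": [(0, "a", "x", 0.5, 1), (1, "b", "y", 1.0, 2)]}
t2 = {"start": 0, "finals": {1: 0.0},
      "arcs": [(0, "x", "u", 0.25, 0), (0, "y", "v", 0.75, 1)]}

composed = compose(t1, t2)
for arc in composed["arcs"]:
    print(arc)
```

The result maps the input symbols of the first transducer straight to the output symbols of the second, which is the same mechanism that chains H, C, L, and G into a single graph.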
The acoustic model $P(X \mid H)$ is obtained from training. Running a forward pass of the trained network on a speech input produces a matrix whose columns correspond to frames and whose rows give the probability distribution over hidden Markov model states. In other words, it is a lookup table of the probability of each state for each frame.
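A minimal sketch of such a lookup table is shown below. The raw network outputs are made-up numbers, the normalization by log-softmax is an assumption about the output layer, and the helper name `state_log_prob` is hypothetical.

```python
import math


def log_softmax(scores):
    """Turn raw scores into log probabilities that sum to one."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


# Hypothetical raw network outputs: one list per frame, one score per HMM state.
network_outputs = [
    [2.0, 0.5, -1.0],   # frame 0
    [0.1, 1.7, 0.3],    # frame 1
]

per_frame = [log_softmax(frame) for frame in network_outputs]
# Transpose so that rows are HMM states and columns are frames.
table = [list(row) for row in zip(*per_frame)]


def state_log_prob(state, frame):
    """Look up the log probability of an HMM state at a given frame."""
    return table[state][frame]


print(state_log_prob(0, 0), state_log_prob(1, 1))
```

During decoding, each arc that consumes a frame only needs one such lookup, so the network is evaluated once per utterance rather than once per hypothesis.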
Therefore, decoding in speech recognition reduces to an optimal path search in the weighted finite-state transducer HCLG. The only difference is that the total weight of a path must account for both the weights on the HCLG arcs and the acoustic model scores, so that the total weight is maximized.
The search follows the single-source shortest-path algorithm for weighted directed acyclic graphs. It relies on the following fact: for a node u on a shortest path in the graph, if the predecessor of u on this path is σ, then σ must also lie on a shortest path from the source to u. The shortest-path tree can therefore be built layer by layer, starting from the source. In a real system the search graph is very large, so to reduce memory consumption a heuristic beam search is often used: a threshold is set, paths within the threshold are kept in the search tree, and paths outside it are pruned. The decoding process on a weighted finite-state transducer can be described briefly in pseudocode as
foreach frame:
    foreach token:
        if token->cost > cut_off:
            foreach arc:
                if arc.weight > cut_off:
                    add arc to token
        else:
            delete token
Here a token is the data structure that stores a path: each token holds an arc and the total cost of the path so far.
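The token-passing loop can be sketched as runnable code on a toy graph. This is an illustration under several assumptions that are not in the text above: costs are negative log probabilities, so lower is better and the comparisons run the opposite way from a score-based formulation; only the best token per state is kept; and the beam is measured relative to the best token of the current frame. The graph, the acoustic costs, and the name `beam_decode` are all hypothetical.

```python
def beam_decode(arcs, start, finals, acoustic_costs, beam=5.0):
    """Token passing over a decoding graph, one frame at a time."""
    tokens = {start: (0.0, ())}   # state -> (path cost, output words so far)
    for frame_costs in acoustic_costs:
        new_tokens = {}
        for state, (cost, words) in tokens.items():
            for in_label, out_label, weight, nxt in arcs.get(state, []):
                # Graph cost on the arc plus acoustic cost of its HMM state.
                new_cost = cost + weight + frame_costs[in_label]
                new_words = words + (out_label,) if out_label else words
                if nxt not in new_tokens or new_cost < new_tokens[nxt][0]:
                    new_tokens[nxt] = (new_cost, new_words)
        if not new_tokens:
            return None
        # Prune every token that falls outside the beam.
        cutoff = min(c for c, _ in new_tokens.values()) + beam
        tokens = {s: t for s, t in new_tokens.items() if t[0] <= cutoff}
    finished = [(c + finals[s], w) for s, (c, w) in tokens.items() if s in finals]
    return min(finished) if finished else None


# Toy graph: arcs[state] = [(HMM state id, output word, graph cost, next state)].
arcs = {
    0: [(0, "yes", 0.7, 1), (1, "no", 0.7, 2)],
    1: [(0, None, 0.1, 1)],   # self-loop while "yes" continues
    2: [(1, None, 0.1, 2)],   # self-loop while "no" continues
}
finals = {1: 0.0, 2: 0.0}
# acoustic_costs[frame][HMM state id], made-up negative log probabilities.
acoustic_costs = [[0.2, 1.5], [0.3, 1.2], [0.4, 1.0]]

result = beam_decode(arcs, 0, finals, acoustic_costs, beam=5.0)
print(result)
```

With a narrower beam, such as `beam=1.0`, the token for "no" is already pruned after the first frame, which shows how the threshold trades search effort against the risk of discarding the eventual best path.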
In a real speech recognition system, the optimal path does not necessarily match the actual word sequence, so we generally want the several highest-scoring candidate paths, known as the N-best list. To store the candidate paths compactly and avoid excessive memory use, a word lattice is generally used to hold the recognized candidate sequences. There is no universal definition of a word lattice; a common approach is to build it with a finite-state automaton data structure.
Postscript: This is my summary of speech recognition, and also a chapter of my thesis. I am sharing it here; if there are mistakes, please do not hesitate to point them out.
References
Huang X, Acero A, Hon H, et al. Spoken Language Processing[J]. 2000.
Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition[J]. Proceedings of the IEEE, 1989, 77(2): 257-286.
Mohri M, Pereira F C, Riley M, et al. Weighted finite-state transducers in speech recognition[J]. Computer Speech & Language, 2002, 16(1): 69-88.