This article is a summary of my recent study of speech recognition. The main references are as follows:
Analyzing Deep Learning: The Practice of Speech Recognition
http://licstar.net/archives/328 (word vectors and language models)
Several papers; see the references for details.
The task of speech recognition is to convert sound data into text; the broader goal of the research is natural-language human-computer interaction. Over the past few years speech recognition has become a focus of attention, and a number of new voice applications have appeared, such as voice search and virtual voice assistants (Apple's Siri).

Pipeline of a speech recognition task
The input of a speech recognition task is sound data. The raw sound data is first put through a series of processing steps (a short-time Fourier transform, taking the cepstrum, and so on) and turned into a sequence of vectors or a matrix, called the feature sequence. This process is feature extraction, and the most commonly used features are MFCCs. This step involves no deep learning; just know that it is the first step of speech recognition.
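To make this first step concrete, here is a minimal sketch of frame-level feature extraction. It assumes the librosa library and a local file named speech.wav, neither of which comes from the original text; it simply turns raw audio into an MFCC feature sequence with one vector per frame.

```python
# Minimal feature-extraction sketch, assuming librosa and a file "speech.wav".
import librosa

# Load the raw waveform; sr=16000 resamples to a common speech sample rate.
y, sr = librosa.load("speech.wav", sr=16000)

# Short-time analysis + mel filterbank + log + DCT, i.e. MFCC features.
# Result shape: (n_mfcc, n_frames), one 13-dimensional feature vector per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

print(mfcc.shape)  # e.g. (13, number_of_frames)
```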
Next, we model these feature sequences. Traditional speech recognition uses the Gaussian mixture model. In fact, the Gaussian distribution is very popular not only in speech recognition but in many fields of engineering and science. Its popularity comes not only from its convenient computational properties, but also because, by the central limit theorem, many quantities that arise in real problems are approximately Gaussian.
The Gaussian mixture distribution:
A continuous scalar random variable $x$ that follows a Gaussian mixture distribution has the probability density function

$$p(x) = \sum_{m=1}^{M} \frac{c_m}{\sqrt{2\pi}\,\sigma_m} \exp\!\left(-\frac{(x-\mu_m)^2}{2\sigma_m^2}\right) = \sum_{m=1}^{M} c_m\,\mathcal{N}(x;\mu_m,\sigma_m^2),$$

where the mixture weights satisfy $c_m \ge 0$ and $\sum_{m=1}^{M} c_m = 1$.

Generalized to the multivariate case, a random vector $\mathbf{x} \in \mathbb{R}^{D}$ following a multivariate Gaussian mixture distribution has the joint probability density function

$$p(\mathbf{x}) = \sum_{m=1}^{M} \frac{c_m}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}_m|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_m)^{\top}\boldsymbol{\Sigma}_m^{-1}(\mathbf{x}-\boldsymbol{\mu}_m)\right) = \sum_{m=1}^{M} c_m\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m).$$
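As a quick numerical illustration of the scalar formula, the following sketch evaluates a two-component mixture density with NumPy; all parameter values are invented for illustration only.

```python
# Evaluate a scalar Gaussian-mixture density with NumPy (illustrative values).
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """Scalar mixture density p(x) = sum_m c_m * N(x; mu_m, sigma_m^2)."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    components = np.exp(-(x - means) ** 2 / (2.0 * variances)) / np.sqrt(2.0 * np.pi * variances)
    return float(np.sum(weights * components))

# Two components (M = 2); the mixture weights sum to 1.
print(gmm_pdf(0.5, weights=[0.3, 0.7], means=[0.0, 2.0], variances=[1.0, 0.5]))
```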
The most notable property of a Gaussian mixture distribution is its multimodality (M > 1), in contrast to the unimodal nature of a single Gaussian (M = 1). This lets Gaussian mixtures describe the many kinds of data that exhibit multimodal behavior, including speech data, for which a single Gaussian distribution is not appropriate. The multimodality of the data may come from several underlying factors, each of which determines a particular mixture component of the distribution.
Once the speech data has been through feature extraction and turned into a feature sequence, the Gaussian mixture distribution is well suited to fitting these speech features, provided the temporal ordering information is ignored. In other words, the speech features can be modeled with a GMM frame by frame, taking each frame as the unit.
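As an illustration of this frame-as-unit modeling, here is a minimal sketch using scikit-learn's GaussianMixture (an assumed dependency, not something the references prescribe); random numbers stand in for real MFCC frames, and the temporal order of the frames plays no role.

```python
# Fit a GMM to (stand-in) MFCC frames, ignoring their temporal order.
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for real MFCC frames: 1000 frames of 13-dimensional features.
frames = np.random.randn(1000, 13)

# Fit an 8-component multivariate Gaussian mixture with diagonal covariances.
gmm = GaussianMixture(n_components=8, covariance_type="diag", max_iter=100)
gmm.fit(frames)

# Per-frame log-likelihood under the fitted mixture.
log_likelihood = gmm.score_samples(frames)
print(log_likelihood.shape)  # (1000,)
```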
If the temporal ordering of the speech is to be taken into account, the GMM is no longer a good model, because it contains no sequential information.
At this point, a hidden Markov model (HMM) is used to model the temporal information. However, given a particular HMM state, the GMM is still a good model for the probability distribution of the speech feature vectors belonging to that state.
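Before moving on, here is a rough sketch of this combination using the hmmlearn library (an assumed dependency on my part): the HMM states carry the temporal structure, and each state emits feature vectors from its own GMM.

```python
# GMM-HMM sketch with hmmlearn; random data stands in for one utterance.
import numpy as np
from hmmlearn.hmm import GMMHMM

# One "utterance": 200 frames of 13-dimensional features (stand-in data).
features = np.random.randn(200, 13)

# 5 hidden HMM states, each emitting from a 3-component diagonal GMM.
model = GMMHMM(n_components=5, n_mix=3, covariance_type="diag", n_iter=20)
model.fit(features, lengths=[200])

# Viterbi decoding: the most likely hidden-state sequence for the frames.
states = model.predict(features)
print(states[:10])
```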
So in traditional speech recognition, GMM+HMM together serve as the acoustic model.

Language Model
A language model essentially judges whether a sentence sounds like something a person would normally say. It can be used in many places: for example, in machine translation or speech recognition, when several candidate outputs have been produced, a language model can be used to pick the most likely result. It is also used in other NLP tasks.
Formally, a language model takes a word sequence and outputs the probability P(w1, w2, ..., wt) that it is natural language, where w1 through wt denote, in order, the words of the sentence. This probability has a very simple decomposition (the chain rule):
P(w1, w2, ..., wt) = P(w1) × P(w2|w1) × P(w3|w1, w2) × ... × P(wt|w1, w2, ..., wt−1)
What common language models do is approximate P(wt|w1, w2, ..., wt−1). For example, the n-gram model uses P(wt|wt−n+1, ..., wt−1) as the approximation.
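To make the n-gram approximation concrete, here is a tiny bigram (n = 2) sketch over a made-up three-sentence corpus; everything in it is illustrative, and a real language model would also need smoothing and a much larger corpus.

```python
# Count-based bigram language model on a toy corpus.
from collections import Counter, defaultdict

corpus = [
    "i like speech recognition",
    "i like machine translation",
    "speech recognition is fun",
]

unigram = Counter()
bigram = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, cur in zip(words, words[1:]):
        unigram[prev] += 1
        bigram[prev][cur] += 1

def p(cur, prev):
    """Approximate P(w_t | w_{t-1}) by relative counts."""
    return bigram[prev][cur] / unigram[prev] if unigram[prev] else 0.0

def sentence_prob(sentence):
    """Chain rule with the bigram approximation: product of P(w_t | w_{t-1})."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= p(cur, prev)
    return prob

print(sentence_prob("i like speech recognition"))
```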
The classic language model:

The most classic work on training language models is the paper Bengio et al. published at NIPS in 2001, "A Neural Probabilistic Language Model". Of course, if you read it now, you should read the paper of the same title that he submitted to JMLR in 2003.
Bengio used a three-layer neural network to build the language model, which is likewise an n-gram model in the sense that it predicts the next word from the previous n−1 words.
At the bottom of the figure, wt−n+1, ..., wt−2, wt−1 are the previous n−1 words. The task is to predict the next word wt from these known n−1 words. C(w) denotes the word vector corresponding to the word w; a single shared set of word vectors is used throughout the model.
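Below is a rough PyTorch sketch of that three-layer architecture (my own simplification, assuming PyTorch is available; it omits the optional direct input-to-output connections of the original paper): a shared word-vector table C, a tanh hidden layer, and a softmax output over the next word wt.

```python
# Simplified neural probabilistic language model; sizes and data are invented.
import torch
import torch.nn as nn

class NPLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hidden_dim=100, context=3):
        super().__init__()
        # C: the shared word-vector table (|V| x m), looked up for every input word.
        self.C = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context * emb_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_words):
        # context_words: (batch, n-1) indices of the previous n-1 words.
        x = self.C(context_words).flatten(start_dim=1)  # concatenate the word vectors
        h = torch.tanh(self.hidden(x))                  # tanh hidden layer
        return self.output(h)                           # scores over the next word wt

model = NPLM(vocab_size=10000)
batch = torch.randint(0, 10000, (4, 3))  # 4 examples, each with n-1 = 3 context words
logits = model(batch)
print(logits.shape)  # (4, 10000); a softmax over this gives P(wt | context)
```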