On speech recognition


This article summarizes my recent study of speech recognition. The main references are:

Analytical Deep Learning: The Practice of Speech Recognition

http://licstar.net/archives/328 (word vectors and language models)

Several papers; see the references for details.

The task of speech recognition is to convert sound data into text; the broader goal of the field is natural human-computer interaction through language. In the past few years, speech recognition has become a focus of attention, with a number of new voice applications emerging, such as voice search and virtual voice assistants (Apple's Siri) and so on.

The pipeline of a speech recognition task

The input to a speech recognition task is sound data. First, the raw sound is put through a series of processing steps (short-time Fourier transform, cepstral analysis, and so on) that turn it into a vector or matrix form called a feature sequence. This process is feature extraction, and the most commonly used features are MFCCs. This step is not studied in depth here; it is enough to know that it is the first stage of speech recognition.
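The framing and short-time-spectrum step can be sketched in a few lines of numpy. This is a minimal illustration, not a full MFCC pipeline (a real one adds mel filterbanks, a log, and a DCT); the frame sizes and the synthetic tone used as input are assumptions for the example.

```python
import numpy as np

def stft_features(signal, frame_len=400, hop=160, n_fft=512):
    """Slice a waveform into overlapping windowed frames and take the
    magnitude spectrum of each frame (the first step toward MFCCs)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One spectral vector per frame -> a (time, frequency) feature matrix.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

# A 1-second synthetic 16 kHz tone stands in for real audio.
sr = 16000
t = np.arange(sr) / sr
feats = stft_features(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)  # one row per 10 ms hop, one column per frequency bin
```

Each row of `feats` is one element of the feature sequence that the acoustic model described below will consume.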

Next, we model these feature sequences. Traditional speech recognition uses a mixed Gaussian distribution (Gaussian mixture model, GMM). In fact, not only in speech recognition but in many fields of engineering and science, the Gaussian distribution is very popular. Its popularity comes not only from its convenient computational properties, but also from the central limit theorem, which explains why it approximates many quantities that arise naturally in practice.

Mixed Gaussian distribution:

For a continuous random scalar x following a mixed Gaussian distribution, the probability density function is:

p(x) = \sum_{m=1}^{M} c_m \, \frac{1}{\sqrt{2\pi}\,\sigma_m} \exp\!\left(-\frac{(x-\mu_m)^2}{2\sigma_m^2}\right), \qquad \sum_{m=1}^{M} c_m = 1, \; c_m \ge 0

Generalized to the multivariate mixed Gaussian distribution over x \in \mathbb{R}^D, the joint probability density function is:

p(\mathbf{x}) = \sum_{m=1}^{M} c_m \, \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_m|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_m)^{\top} \boldsymbol{\Sigma}_m^{-1} (\mathbf{x}-\boldsymbol{\mu}_m)\right)
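The scalar density can be checked numerically. The parameters below are arbitrary example values chosen for illustration; the check is simply that the mixture integrates to 1.

```python
import numpy as np

# Example parameters (assumed for illustration): a 2-component scalar GMM.
weights = np.array([0.3, 0.7])   # c_m, must sum to 1
means = np.array([-2.0, 1.5])    # mu_m
stds = np.array([0.5, 1.0])      # sigma_m

def gmm_pdf(x):
    """p(x) = sum_m c_m * N(x; mu_m, sigma_m^2), evaluated pointwise."""
    comp = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) \
           / (stds * np.sqrt(2 * np.pi))
    return comp @ weights

# A valid density integrates to 1 (Riemann sum over a wide grid).
grid = np.linspace(-10.0, 10.0, 20001)
total = gmm_pdf(grid).sum() * (grid[1] - grid[0])
print(round(total, 4))  # close to 1.0
```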

The most obvious characteristic of a mixed Gaussian distribution is its multimodality (M > 1), in contrast to the unimodal (M = 1) Gaussian distribution. This allows mixed Gaussian distributions to describe many kinds of data that exhibit multimodal behavior, including speech data, for which a single Gaussian is not appropriate. The multimodality of the data may come from multiple underlying factors, each of which determines one particular mixture component of the distribution.

Once the speech data has gone through feature extraction and become a feature sequence, the mixed Gaussian distribution is well suited to fitting such speech features, provided timing information is ignored. In other words, taking a frame as the unit, the speech features can be modeled by a mixture of Gaussians.
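GMM parameters are usually fitted with the EM algorithm. Below is a minimal scalar two-component EM sketch on synthetic "frames"; the data, initialization, and fixed iteration count are all assumptions for the example, not a production recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic frame-level features: two well-separated clusters.
data = np.concatenate([rng.normal(-4.0, 0.7, 600),
                       rng.normal(3.0, 1.0, 400)])

# Initial guesses for weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibility of each component for each data point.
    dens = w * np.exp(-0.5 * (data[:, None] - mu) ** 2 / var) \
           / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities.
    n_m = resp.sum(axis=0)
    w = n_m / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / n_m
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / n_m

print(np.round(mu, 1), np.round(w, 2))
```

After convergence the estimated means and weights should sit close to the values used to generate the two clusters.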

If the temporal order information of speech is taken into account, a GMM alone is no longer a good model, because it contains no sequential information.

At this point, a class of Markov models called hidden Markov models (Hidden Markov Model, HMM) is used to model the timing information. However, given a state of the HMM, a GMM is still a good model for the probability distribution of the speech feature vectors belonging to that state.
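How an HMM assigns a likelihood to a whole observation sequence can be sketched with the forward algorithm. For brevity the toy model below uses discrete emission probabilities standing in for per-state GMM densities; all the numbers are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Toy 2-state HMM; B[i, o] stands in for a per-state GMM likelihood.
pi = np.array([0.6, 0.4])       # initial state distribution
A = np.array([[0.7, 0.3],       # transitions: A[i, j] = P(j | i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],       # emissions: B[i, o] = P(o | i)
              [0.2, 0.8]])
obs = [0, 1, 1, 0]

def forward(obs):
    """Sum over all state paths in O(T * S^2) time."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def brute(obs):
    """Enumerate every state path (exponential; for checking only)."""
    total = 0.0
    for path in product(range(2), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

print(forward(obs), brute(obs))
```

The dynamic-programming recursion and the brute-force path enumeration agree, which is exactly why the forward algorithm makes HMM likelihoods tractable.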

So in traditional speech recognition, GMM+HMM serves as the acoustic model.

Language Model

A language model essentially judges whether a sentence is something a person would naturally say. It can be used in many places: for example, when machine translation or speech recognition produces several candidate outputs, a language model can be used to choose the most likely result. It is also used in other NLP tasks.

The formal description of a language model: given a string, it is the probability P(w1, w2, ..., wt) that the string is natural language, where w1 through wt denote, in order, the words of the sentence. There is a very simple corollary (the chain rule):

P(w1, w2, ..., wt) = P(w1) × P(w2|w1) × P(w3|w1, w2) × ... × P(wt|w1, w2, ..., wt−1)
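As a concrete illustration of estimating these conditional factors from data, here is a minimal count-based bigram sketch (an n-gram with n = 2); the toy corpus and the unsmoothed maximum-likelihood estimates are assumptions for the example.

```python
from collections import Counter

corpus = [["<s>", "i", "like", "speech", "</s>"],
          ["<s>", "i", "like", "nlp", "</s>"],
          ["<s>", "you", "like", "speech", "</s>"]]

# Count single words and adjacent word pairs.
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def p(w, prev):
    """Maximum-likelihood estimate of P(w | prev), no smoothing."""
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(sent):
    """Chain rule with the bigram approximation P(w_t | w_{t-1})."""
    prob = 1.0
    for prev, w in zip(sent, sent[1:]):
        prob *= p(w, prev)
    return prob

print(sentence_prob(["<s>", "i", "like", "nlp", "</s>"]))
```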

Common language models approximate P(wt|w1, w2, ..., wt−1). For example, the n-gram model approximates it with P(wt|wt−n+1, ..., wt−1).

The classic language model:

The most classic work on training a language model is the paper Bengio et al. published at NIPS in 2001, "A Neural Probabilistic Language Model". Of course, the version usually read today is the 2003 JMLR journal version of the same paper.

Bengio uses a three-layer neural network to build the language model, which is likewise an n-gram model.

At the bottom of the figure, wt−n+1, ..., wt−2, wt−1 are the previous n−1 words; the task is to predict the next word wt from these known n−1 words. C(w) denotes the word vector corresponding to the word w, and the table C is shared across the whole model.
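The forward pass of such a model can be sketched in numpy: look up and concatenate the n−1 word vectors C(w), pass them through a tanh hidden layer, and apply a softmax over the vocabulary. All sizes and the random weights are illustrative assumptions, and the direct word-to-output connections of the original paper are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n = 10, 4, 8, 3      # vocab size, embedding dim, hidden dim, n-gram order

C = rng.normal(size=(V, d))               # shared word-vector table C(w)
H = rng.normal(size=(h, (n - 1) * d))     # hidden-layer weights
d_b = rng.normal(size=h)                  # hidden-layer bias
U = rng.normal(size=(V, h))               # output weights
b = rng.normal(size=V)                    # output bias

def predict(context):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) over the whole vocabulary."""
    x = C[context].reshape(-1)            # concatenated context embeddings
    a = np.tanh(H @ x + d_b)              # hidden layer
    logits = U @ a + b
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

probs = predict([1, 5])                   # two previous word ids (n - 1 = 2)
print(probs.sum())  # a proper distribution sums to 1
```

Training adjusts C, H, U, and the biases jointly by gradient descent, so the word vectors C(w) are learned as a by-product of the language-modeling objective.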
