On speech recognition


This article summarizes my recent study of speech recognition. The main references are:

Analytical Deep Learning: The Practice of Speech Recognition

http://licstar.net/archives/328 (word vectors and language models)

Several papers; see the references for details.

The task of speech recognition is to convert sound data into text; the broader goal of the field is natural human-computer interaction through language. In the past few years, speech recognition has become a focus of attention, with a number of new voice applications emerging, such as voice search and virtual voice assistants (Apple's Siri) and so on.

The pipeline of a speech recognition task

The input to a speech recognition task is sound data. First, the raw sound is put through a series of processing steps (short-time Fourier transform, cepstral analysis, and so on) that turn it into a vector or matrix form called a feature sequence. This process is feature extraction, and the most commonly used features are MFCCs. This step is not studied in depth here; it is enough to know that it is the first stage of speech recognition.
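The framing and short-time-spectrum step can be sketched in a few lines of numpy. This is a minimal illustration, not a full MFCC pipeline (a real one adds mel filterbanks, a log, and a DCT); the frame sizes and the synthetic tone used as input are assumptions for the example.

```python
import numpy as np

def stft_features(signal, frame_len=400, hop=160, n_fft=512):
    """Slice a waveform into overlapping windowed frames and take the
    magnitude spectrum of each frame (the first step toward MFCCs)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One spectral vector per frame -> a (time, frequency) feature matrix.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

# A 1-second synthetic 16 kHz tone stands in for real audio.
sr = 16000
t = np.arange(sr) / sr
feats = stft_features(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)  # one row per 10 ms hop, one column per frequency bin
```

Each row of `feats` is one element of the feature sequence that the acoustic model described below will consume.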

Next, we model these feature sequences. Traditional speech recognition uses a mixed Gaussian distribution (Gaussian mixture model, GMM). In fact, not only in speech recognition but in many fields of engineering and science, the Gaussian distribution is very popular. Its popularity comes not only from its convenient computational properties, but also from the central limit theorem, which explains why it approximates many quantities that arise naturally in practice.

Mixed Gaussian distribution:

For a continuous random scalar x following a mixed Gaussian distribution, the probability density function is:

p(x) = \sum_{m=1}^{M} c_m \, \frac{1}{\sqrt{2\pi}\,\sigma_m} \exp\!\left(-\frac{(x-\mu_m)^2}{2\sigma_m^2}\right), \qquad \sum_{m=1}^{M} c_m = 1, \; c_m \ge 0

Generalized to the multivariate mixed Gaussian distribution over x \in \mathbb{R}^D, the joint probability density function is:

p(\mathbf{x}) = \sum_{m=1}^{M} c_m \, \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_m|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_m)^{\top} \boldsymbol{\Sigma}_m^{-1} (\mathbf{x}-\boldsymbol{\mu}_m)\right)
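The scalar density can be checked numerically. The parameters below are arbitrary example values chosen for illustration; the check is simply that the mixture integrates to 1.

```python
import numpy as np

# Example parameters (assumed for illustration): a 2-component scalar GMM.
weights = np.array([0.3, 0.7])   # c_m, must sum to 1
means = np.array([-2.0, 1.5])    # mu_m
stds = np.array([0.5, 1.0])      # sigma_m

def gmm_pdf(x):
    """p(x) = sum_m c_m * N(x; mu_m, sigma_m^2), evaluated pointwise."""
    comp = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) \
           / (stds * np.sqrt(2 * np.pi))
    return comp @ weights

# A valid density integrates to 1 (Riemann sum over a wide grid).
grid = np.linspace(-10.0, 10.0, 20001)
total = gmm_pdf(grid).sum() * (grid[1] - grid[0])
print(round(total, 4))  # close to 1.0
```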

The most obvious characteristic of a mixed Gaussian distribution is its multimodality (M > 1), in contrast to the unimodal (M = 1) Gaussian distribution. This allows mixed Gaussian distributions to describe many kinds of data that exhibit multimodal behavior, including speech data, for which a single Gaussian is not appropriate. The multimodality of the data may come from multiple underlying factors, each of which determines one particular mixture component of the distribution.

Once the speech data has gone through feature extraction and become a feature sequence, the mixed Gaussian distribution is well suited to fitting such speech features, provided timing information is ignored. In other words, taking a frame as the unit, the speech features can be modeled by a mixture of Gaussians.
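GMM parameters are usually fitted with the EM algorithm. Below is a minimal scalar two-component EM sketch on synthetic "frames"; the data, initialization, and fixed iteration count are all assumptions for the example, not a production recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic frame-level features: two well-separated clusters.
data = np.concatenate([rng.normal(-4.0, 0.7, 600),
                       rng.normal(3.0, 1.0, 400)])

# Initial guesses for weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibility of each component for each data point.
    dens = w * np.exp(-0.5 * (data[:, None] - mu) ** 2 / var) \
           / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities.
    n_m = resp.sum(axis=0)
    w = n_m / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / n_m
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / n_m

print(np.round(mu, 1), np.round(w, 2))
```

After convergence the estimated means and weights should sit close to the values used to generate the two clusters.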

If the temporal order information of speech is taken into account, a GMM alone is no longer a good model, because it contains no sequential information.

At this point, a class of Markov models called hidden Markov models (Hidden Markov Model, HMM) is used to model the timing information. However, given a state of the HMM, a GMM is still a good model for the probability distribution of the speech feature vectors belonging to that state.
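How an HMM assigns a likelihood to a whole observation sequence can be sketched with the forward algorithm. For brevity the toy model below uses discrete emission probabilities standing in for per-state GMM densities; all the numbers are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Toy 2-state HMM; B[i, o] stands in for a per-state GMM likelihood.
pi = np.array([0.6, 0.4])       # initial state distribution
A = np.array([[0.7, 0.3],       # transitions: A[i, j] = P(j | i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],       # emissions: B[i, o] = P(o | i)
              [0.2, 0.8]])
obs = [0, 1, 1, 0]

def forward(obs):
    """Sum over all state paths in O(T * S^2) time."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def brute(obs):
    """Enumerate every state path (exponential; for checking only)."""
    total = 0.0
    for path in product(range(2), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

print(forward(obs), brute(obs))
```

The dynamic-programming recursion and the brute-force path enumeration agree, which is exactly why the forward algorithm makes HMM likelihoods tractable.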

So in traditional speech recognition, GMM+HMM serves as the acoustic model.

Language Model

A language model essentially judges whether a sentence is something a person would naturally say. It can be used in many places: for example, when machine translation or speech recognition produces several candidate outputs, a language model can be used to choose the most likely result. It is also used in other NLP tasks.

The formal description of a language model: given a string, it is the probability P(w1, w2, ..., wt) that the string is natural language, where w1 through wt denote, in order, the words of the sentence. There is a very simple corollary (the chain rule):

P(w1, w2, ..., wt) = P(w1) × P(w2|w1) × P(w3|w1, w2) × ... × P(wt|w1, w2, ..., wt−1)
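As a concrete illustration of estimating these conditional factors from data, here is a minimal count-based bigram sketch (an n-gram with n = 2); the toy corpus and the unsmoothed maximum-likelihood estimates are assumptions for the example.

```python
from collections import Counter

corpus = [["<s>", "i", "like", "speech", "</s>"],
          ["<s>", "i", "like", "nlp", "</s>"],
          ["<s>", "you", "like", "speech", "</s>"]]

# Count single words and adjacent word pairs.
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def p(w, prev):
    """Maximum-likelihood estimate of P(w | prev), no smoothing."""
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(sent):
    """Chain rule with the bigram approximation P(w_t | w_{t-1})."""
    prob = 1.0
    for prev, w in zip(sent, sent[1:]):
        prob *= p(w, prev)
    return prob

print(sentence_prob(["<s>", "i", "like", "nlp", "</s>"]))
```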

Common language models approximate P(wt|w1, w2, ..., wt−1). For example, the n-gram model approximates it with P(wt|wt−n+1, ..., wt−1).

The classic language model:

The most classic work on training a language model is the paper Bengio et al. published at NIPS in 2001, "A Neural Probabilistic Language Model". Of course, the version usually read today is the 2003 JMLR journal version of the same paper.

Bengio uses a three-layer neural network to build the language model, which is likewise an n-gram model.

At the bottom of the figure, wt−n+1, ..., wt−2, wt−1 are the previous n−1 words; the task is to predict the next word wt from these known n−1 words. C(w) denotes the word vector corresponding to the word w, and the table C is shared across the whole model.
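The forward pass of such a model can be sketched in numpy: look up and concatenate the n−1 word vectors C(w), pass them through a tanh hidden layer, and apply a softmax over the vocabulary. All sizes and the random weights are illustrative assumptions, and the direct word-to-output connections of the original paper are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n = 10, 4, 8, 3      # vocab size, embedding dim, hidden dim, n-gram order

C = rng.normal(size=(V, d))               # shared word-vector table C(w)
H = rng.normal(size=(h, (n - 1) * d))     # hidden-layer weights
d_b = rng.normal(size=h)                  # hidden-layer bias
U = rng.normal(size=(V, h))               # output weights
b = rng.normal(size=V)                    # output bias

def predict(context):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) over the whole vocabulary."""
    x = C[context].reshape(-1)            # concatenated context embeddings
    a = np.tanh(H @ x + d_b)              # hidden layer
    logits = U @ a + b
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

probs = predict([1, 5])                   # two previous word ids (n - 1 = 2)
print(probs.sum())  # a proper distribution sums to 1
```

Training adjusts C, H, U, and the biases jointly by gradient descent, so the word vectors C(w) are learned as a by-product of the language-modeling objective.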
