Some background knowledge on speech recognition

Source: Internet
Author: User

Everything below is copied from other sources, haha.

1. Mel frequency:

The Mel frequency scale is designed to mimic how the human ear perceives the different frequencies of speech.

Human perception differs across the frequencies of speech: below 1 kHz, perceived pitch is roughly linear in frequency; above 1 kHz, it is roughly logarithmic in frequency. The higher the frequency, the poorer the ear's resolution. For this reason, applications often keep only the low-frequency MFCCs and discard the mid- and high-frequency ones.

In the Mel frequency domain, human perception of pitch is linear: if the Mel frequencies of two speech segments differ by a factor of two, the pitch people perceive also differs by a factor of two. Conversion formula: B(f) = 1125 ln(1 + f/700), where f is the frequency in Hz and B is the Mel frequency.
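As a minimal sketch, the conversion formula can be written directly in Python. The function names hz_to_mel and mel_to_hz are made up for this example; the inverse is obtained by solving the same formula for f.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to Mel: B(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Invert the formula above: f = 700 * (exp(B / 1125) - 1)."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)

# Equal steps in Hz shrink on the Mel scale as the frequency rises.
for f in (0, 1000, 2000, 4000, 8000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "Mel")
```

With these constants, 1000 Hz lands close to 1000 Mel, and each further 1000 Hz step adds fewer Mel than the previous one, which matches the compression at high frequencies described above.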

2. Cepstrum:

The cepstrum is the result of homomorphic processing. It comes in two forms, the complex cepstrum and the real cepstrum; the real cepstrum is the one commonly used, and its coefficients are important features in speech recognition.

3. Mel-frequency cepstral coefficients (MFCC): one step of the extraction is to split the signal into frames and then apply a window to each frame. The reason is written below:
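The framing and windowing step can be sketched as follows. Everything specific here is an assumption, since the text does not fix it: the Hamming window, the 25 ms frame length, the 10 ms hop, the 16 kHz sample rate, and the helper names. A commonly cited motivation (not necessarily the one the original post had in mind) is that speech is only approximately stationary over short intervals, and tapering the frame edges reduces spectral leakage.

```python
import math

def frame_signal(signal, frame_len, hop_len):
    """Split a signal into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop_len)]

def hamming(n):
    """Hamming window of length n (an assumed choice of window)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def window_frames(frames):
    """Multiply every frame sample by the matching window coefficient."""
    if not frames:
        return []
    w = hamming(len(frames[0]))
    return [[s * wi for s, wi in zip(frame, w)] for frame in frames]

# Assumed settings: 16 kHz audio, 25 ms frames, 10 ms hop.
sample_rate = 16000
frame_len = int(0.025 * sample_rate)   # 400 samples
hop_len = int(0.010 * sample_rate)     # 160 samples

# One second of a 440 Hz tone as a stand-in for real speech.
tone = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(sample_rate)]
windowed = window_frames(frame_signal(tone, frame_len, hop_len))
print(len(windowed), "frames of", len(windowed[0]), "samples")
```

Each windowed frame would then be passed on to the spectrum computation.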

4. To summarize, cepstral analysis is actually the following process:

1) Take the Fourier transform of the original speech signal to obtain the spectrum: X[k] = H[k]E[k];

Considering only the magnitude: |X[k]| = |H[k]||E[k]|;

2) Take the logarithm of both sides: log|X[k]| = log|H[k]| + log|E[k]|.

3) Then take the inverse Fourier transform of both sides to get: x'[k] = h'[k] + e'[k], where the primes mark cepstral-domain sequences, which are not the original time-domain signals.
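The three steps can be sketched as a small real-cepstrum function. This is an illustration only: it uses a naive O(N^2) DFT from the standard library to stay dependency-free (in practice an FFT routine such as numpy.fft would be the usual choice), and the names dft, idft, real_cepstrum, and the small eps guard against log(0) are assumptions of this sketch.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform, fine for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def real_cepstrum(frame, eps=1e-12):
    """Real cepstrum of one frame, following steps 1) to 3)."""
    spectrum = dft(frame)                                  # 1) spectrum X[k]
    log_mag = [math.log(abs(v) + eps) for v in spectrum]   # 2) log|X[k]|
    return [c.real for c in idft(log_mag)]                 # 3) inverse transform

frame = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]
print([round(c, 4) for c in real_cepstrum(frame)])
```

Because the log-magnitude spectrum of a real signal is symmetric, the result is real and symmetric too, which is why only the real part is kept.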

This has a technical name: homomorphic signal processing. Its purpose is to turn a nonlinear problem into a linear one. Mapped onto the steps above, the original speech signal is actually a convolutive signal (the vocal tract acts as a linear time-invariant system, and the production of sound can be understood as an excitation passing through that system). The first step converts the convolutive signal into a multiplicative one (convolution in the time domain is equivalent to multiplication in the frequency domain). The second step converts the multiplicative signal into an additive one by taking the logarithm, and the third step applies the inverse transform, returning to a time-like sequence. At this point, although the sequences before and after are both time-domain sequences, the discrete time domains they live in are clearly different, so the latter is called the cepstral domain.
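A small numerical check of the convolution-to-addition property is sketched below. It makes several assumptions for the sake of a clean demonstration: circular convolution is used so that the DFT product rule holds exactly, the two short sequences h and e are toy values chosen so that their spectra have no zeros, and all helper names are invented here.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circular_convolve(h, e):
    """Circular convolution of two equal-length sequences."""
    n = len(h)
    return [sum(h[m] * e[(t - m) % n] for m in range(n)) for t in range(n)]

def real_cepstrum(x):
    """Real cepstrum; assumes the spectrum of x has no zeros."""
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [c.real for c in idft(log_mag)]

# Toy stand-ins: h plays the vocal tract response, e the excitation.
h = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
e = [1.0, 0.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.0]
x = circular_convolve(h, e)   # convolutive signal

cx, ch, ce = real_cepstrum(x), real_cepstrum(h), real_cepstrum(e)
max_error = max(abs(a - (b + c)) for a, b, c in zip(cx, ch, ce))
print("largest deviation from additivity:", max_error)
```

The deviation is at the level of floating-point rounding, showing that the cepstrum of the convolved signal is the sum of the two individual cepstra.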


References:

http://www.cnblogs.com/gogly/archive/2013/11/24/3440441.html

A very good write-up that is also worth consulting: http://blog.csdn.net/zouxy09/article/details/9156785/
