The following are all copied, hahaha
1. Mel frequency:
The Mel scale is designed to mimic how the human ear perceives different frequencies of speech.
Human perception differs across frequencies: below about 1 kHz, perceived pitch is roughly linear in frequency, and above 1 kHz it is roughly logarithmic in frequency. The higher the frequency, the poorer the ear's resolution. For this reason, applications often keep only the low-frequency MFCCs and discard the mid- and high-frequency ones.
On the Mel scale, human pitch perception is linear: if the Mel frequencies of two sounds differ by a factor of two, the perceived pitch also differs by about a factor of two. Conversion formula: B(f) = 1125 ln(1 + f/700), where f is the frequency in Hz and B is the Mel frequency.
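As a small sketch of the conversion formula above, here it is in Python together with its inverse. The function names hz_to_mel and mel_to_hz are my own choice for illustration, and the inverse is simply derived by solving the formula for f.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to Mel: B(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel: f = 700 * (exp(B / 1125) - 1)."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)

if __name__ == "__main__":
    # Equal steps in Hz shrink on the Mel scale as frequency rises.
    for f in (100, 500, 1000, 4000, 8000):
        print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

With these constants, 1000 Hz maps to roughly 1000 mel, and a 1 kHz step near 8 kHz covers far fewer mels than the same step near 1 kHz, which matches the logarithmic behavior described above.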
2. Cepstrum:
The cepstrum is the result of homomorphic processing. It comes in two forms, the complex cepstrum and the real cepstrum. The real cepstrum is the one commonly used, and its coefficients are important features in speech recognition.
3. The Mel-frequency cepstral coefficient (MFCC) procedure mentioned above includes one step: framing, followed by windowing. The reason is written below:
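The reasoning usually given for this step is that speech is only approximately stationary over short spans, so the signal is cut into short overlapping frames, and each frame is tapered by a window to reduce spectral leakage at its edges. A minimal sketch with NumPy follows. The function name frame_and_window, the Hamming window, and the 25 ms / 10 ms frame and hop sizes are my assumptions for illustration, not something stated in these notes.

```python
import numpy as np

def frame_and_window(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames and apply a Hamming window to each."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Number of complete frames that fit; trailing samples are dropped.
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx]
    # Taper every frame with the same window (broadcast over rows).
    return frames * np.hamming(frame_len)

if __name__ == "__main__":
    sample_rate = 16000                 # assumed sampling rate in Hz
    frame_len = int(0.025 * sample_rate)  # 25 ms frames (assumed)
    hop_len = int(0.010 * sample_rate)    # 10 ms hop (assumed)
    one_second = np.random.randn(sample_rate)
    print(frame_and_window(one_second, frame_len, hop_len).shape)
```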
4. To summarize, cepstral analysis is really the following process:
1) Take the Fourier transform of the original speech signal to obtain its spectrum: X[k] = H[k] E[k];
Considering only the magnitude: |X[k]| = |H[k]| |E[k]|;
2) Take the logarithm of both sides: log|X[k]| = log|H[k]| + log|E[k]|.
3) Take the inverse Fourier transform of both sides to get: x[k] = h[k] + e[k], where x, h, and e now denote cepstral sequences rather than the original time-domain signals.
This has a formal name: homomorphic signal processing. Its purpose is to turn a nonlinear problem into a linear one. In terms of the steps above, the original speech signal is a convolutional signal (the vocal tract acts as a linear time-invariant system, and speech production can be understood as an excitation passing through that system). The first step turns the convolutional signal into a multiplicative one, because convolution in the time domain is equivalent to multiplication in the frequency domain. The second step turns the multiplicative signal into an additive one by taking the logarithm. The third step applies the inverse transform, which brings the additive signal back to a time-like domain. Although the sequences before and after the processing are both indexed by time, the discrete time domains they live in are clearly different, so the latter is called the cepstral (quefrency) domain.
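The three steps can be sketched in a few lines of NumPy. This is only an illustration of the real cepstrum under my own assumptions: the name real_cepstrum, the small eps added before the logarithm to avoid log(0), and the toy signals h and e are all hypothetical. Circular convolution is used so that the DFT product X[k] = H[k] E[k] holds exactly.

```python
import numpy as np

def real_cepstrum(frame, eps=1e-10):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame)               # step 1: convolution -> product
    log_mag = np.log(np.abs(spectrum) + eps)   # step 2: product -> sum
    return np.fft.ifft(log_mag).real           # step 3: back to a time-like domain

# Toy example (hypothetical signals): a decaying "system" response h
# and a sparse "excitation" e with one echo 8 samples later.
n = 64
h = 0.9 ** np.arange(n)
e = np.zeros(n)
e[0], e[8] = 1.0, 0.5

# Circular convolution of h and e, computed through the frequency domain.
x = np.fft.ifft(np.fft.fft(h) * np.fft.fft(e)).real

# The cepstrum of the convolved signal is the sum of the two cepstra.
additive = np.allclose(real_cepstrum(x), real_cepstrum(h) + real_cepstrum(e), atol=1e-6)
```

Here additive should come out True, which is the whole point of the homomorphic approach: the two components that were convolved in time become a plain sum in the cepstral domain, where they can be separated more easily.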
Finally, a few figures:
References:
http://www.cnblogs.com/gogly/archive/2013/11/24/3440441.html
A very good article, well worth reading: http://blog.csdn.net/zouxy09/article/details/9156785/