I have recently been learning the pipeline for automatic music tagging and came across MFCC as a way to extract audio features, so I looked up related material online and studied it. Most of this note is taken from the blog post http://blog.csdn.net/zouxy09/article/details/9156785, with my own annotations and corrections added for future reference.
Speech Signal Processing (4): Mel-Frequency Cepstral Coefficients (MFCC)
http://blog.csdn.net/zouxy09
In any automatic speech recognition system, the first step is to extract features. In other words, we need to extract the identifying components of the audio signal and discard everything else, such as background noise, emotion, and so on.
Understanding how speech is produced is a great help in understanding speech itself. A person produces sound through the vocal tract, and the shape of the vocal tract determines what sound comes out. The shape of the vocal tract includes the tongue, the teeth, and so on. If we can determine this shape accurately, we can describe the resulting phoneme accurately. The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and MFCCs are a feature that accurately describes this envelope.
MFCCs (Mel-frequency cepstral coefficients) are widely used features in automatic speech and speaker recognition, developed by Davis and Mermelstein in 1980. Since then, among hand-crafted features for speech recognition, MFCCs have stood alone and have never really been surpassed (features learned automatically by deep learning are another story).
So we have just mentioned a very important keyword: the shape of the vocal tract. We know it is important and that it shows up in the envelope of the short-time power spectrum of speech. But what is a power spectrum? What is an envelope? What are MFCCs? Why do they work? How do we obtain them? Let's take it step by step.
I. The Spectrogram
We are dealing with speech signals, so how we describe them matters, because different representations expose different information. Which representation is convenient for us to observe and to understand? Here we first introduce something called a spectrogram.
Here the speech is divided into many frames, and each frame corresponds to a spectrum (computed with a short-time FFT) that describes the relationship between frequency and energy. In practice there are three kinds of spectra: the linear amplitude spectrum, the logarithmic amplitude spectrum, and the self-power spectrum (in the logarithmic amplitude spectrum, the amplitude of each spectral line is taken logarithmically, so the ordinate is in dB (decibels); this transformation raises the lower-amplitude components relative to the higher-amplitude ones, making it possible to observe periodic signals hidden in low-amplitude noise).
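For concreteness, here is a minimal sketch (not from the original post) of computing each frame's logarithmic amplitude spectrum with NumPy. The frame length (400 samples), hop (160 samples), and FFT size are assumptions typical of 16 kHz speech:

```python
import numpy as np

def frame_log_spectrum(signal, frame_len=400, hop=160, n_fft=512):
    """Split a 1-D signal into frames and return each frame's
    log magnitude spectrum in dB (the 'logarithmic amplitude spectrum')."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame, n_fft))         # linear amplitude spectrum
        spectra.append(20.0 * np.log10(mag + 1e-10))    # log amplitude in dB
    return np.array(spectra)                            # shape: (n_frames, n_fft//2 + 1)
```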
We first draw the spectrum of one frame of speech with coordinates, as in the left picture. Now rotate the spectrum on the left by 90 degrees to get the middle picture. Then map the amplitudes to grayscale (this can be understood as quantizing the continuous amplitude into 256 levels), where 0 is black and 255 is white: the larger the amplitude, the darker the corresponding region. This gives the right-most picture. Why do this? To add the dimension of time, so that we can display the spectrum of a whole utterance rather than a single frame, and see both static and dynamic information at a glance. The advantages of this will come up later.
In this way we get a spectrum that changes over time, and this is the spectrogram that describes the speech signal.
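Stacking the per-frame spectra along the time axis and rendering them as a grayscale image gives exactly the spectrogram described above. A small sketch building on the previous one (the two-tone test signal is made up purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

fs = 16000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)  # toy input

S = frame_log_spectrum(signal)            # (n_frames, n_bins), from the sketch above
plt.imshow(S.T, origin="lower", aspect="auto", cmap="gray_r")  # darker = more energy
plt.xlabel("frame (time)")
plt.ylabel("frequency bin")
plt.title("spectrogram")
plt.show()
```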
The figure above is a spectrogram; the very dark regions are the peaks of the spectrum (the formants).
So why do we represent speech with a spectrogram?
First, the properties of phonemes (phones) can be observed better there. In addition, sounds can be identified better by observing the formants and their transitions. Hidden Markov Models implicitly model spectrograms to achieve good recognition performance. Another use is to evaluate a TTS (text-to-speech) system intuitively: simply compare how well the spectrogram of the synthesized speech matches that of natural speech.
By applying a time-frequency transform to the speech frames, the FFT spectrum of each frame is obtained; arranging the spectra of all frames in chronological order yields a time-frequency-energy distribution map, which shows very intuitively how the frequency content of the speech signal changes over time.
II. Cepstrum Analysis
Below is a spectrum of one frame of speech. The peaks are the dominant frequency components of the speech; we call them formants, and the formants carry the identifying properties of the sound (like a personal ID card). They are therefore especially important, and we can use them to recognize different sounds.
Since they are so important, we want to extract them! And we need to extract not only the positions of the formants but also the way they transition. So what we actually extract is the spectral envelope, a smooth curve that connects the formant peaks.
We can think of the original spectrum as consisting of two parts: the envelope and the spectral details. The logarithmic spectrum is used here, so the unit is dB. Now we need to separate the two parts so that we can obtain the envelope.
How do we separate them? That is, given log X[k], how do we obtain log H[k] and log E[k] such that log X[k] = log H[k] + log E[k]?
To achieve this we need a mathematical trick. What is it? Doing an FFT on the spectrum. A Fourier transform applied to the spectrum is equivalent to an inverse Fourier transform (IFFT). One thing to note is that we operate in the logarithmic domain of the spectrum, which is also part of the trick. Taking the IFFT of the log spectrum is equivalent to describing the signal on a pseudo-frequency axis.
From the figure above we can see that the envelope consists mainly of low-frequency components (here we need to switch our thinking: the horizontal axis is no longer frequency but can be thought of as time), and we can regard it as a sinusoid with 4 cycles per second, so we give it a peak at 4 Hz on the pseudo-frequency axis. The spectral details consist mainly of high frequencies; we regard them as a sinusoid with 100 cycles per second, so we give it a peak at 100 Hz on the pseudo-frequency axis.
Adding them together gives back the original spectrum.
In practice we already know log X[k], so we can obtain x[k]. From the figure we know that h[k] is the low-frequency part of x[k], so by passing x[k] through a low-pass filter we can obtain h[k]! That's right, this is how we separate them and get what we want: h[k], the spectral envelope.
x[k] is in fact the cepstrum (a newly coined word: reversing the first four letters of "spectrum" gives "cepstrum"), and the h[k] we care about is the low-frequency part of the cepstrum. h[k] describes the spectral envelope, which is widely used as a feature in speech recognition.
To summarize, cepstrum analysis is actually the following process:
1) Take the Fourier transform of the original speech signal to obtain the spectrum: X[k] = H[k] E[k];
considering magnitudes only: |X[k]| = |H[k]| |E[k]|;
2) take the logarithm of both sides: log|X[k]| = log|H[k]| + log|E[k]|;
3) take the inverse Fourier transform of both sides: x[k] = h[k] + e[k].
This actually has a technical name: homomorphic signal processing. Its purpose is to turn a nonlinear problem into one that can be handled by linear methods. In the case above, the original speech signal is a convolutional signal (the vocal tract can be seen as a linear time-invariant system, and the production of sound as an excitation passing through this system); the first step converts the convolution into multiplication (convolution in the time domain corresponds to multiplication in the frequency domain); the second step converts the multiplicative signal into an additive one by taking the logarithm; and the third step transforms back. Although both the input and the output are time-domain sequences, the domains they live in are clearly different, so the latter is called the quefrency (cepstral) domain.
In conclusion, the cepstrum is obtained by taking the Fourier transform of a signal, then the logarithm, then the inverse Fourier transform. The calculation process is as follows:
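Here is a minimal sketch (not from the original post) of that FFT -> log -> IFFT process, plus the low-pass "lifter" that keeps the low-quefrency part h[k], i.e. the spectral envelope. The lifter cutoff of 30 samples is an assumption:

```python
import numpy as np

def cepstral_envelope(frame, n_fft=512, cutoff=30):
    log_spec = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-10)  # log|X[k]|
    cepstrum = np.fft.irfft(log_spec)                 # x[k] = h[k] + e[k]
    lifter = np.zeros_like(cepstrum)
    lifter[:cutoff] = 1.0                             # keep low quefrencies
    lifter[-cutoff + 1:] = 1.0                        # mirror (the cepstrum is symmetric)
    envelope = np.fft.rfft(cepstrum * lifter).real    # back to the log-spectral domain: log|H[k]|
    return cepstrum, envelope
```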
III. Mel-Frequency Analysis
OK, let's review what we have just done: given a piece of speech, we can obtain its spectral envelope (the smooth curve connecting all the formant peaks). However, experiments on human auditory perception show that our hearing focuses only on certain regions, not the entire spectral envelope.
Mel-frequency analysis is based on such human auditory perception experiments. Experimental observation shows that the human ear acts like a filter bank, attending only to certain frequency components (human hearing is selective in frequency). In other words, it lets signals at certain frequencies pass while simply ignoring frequencies it does not want to perceive. These filters, however, are not uniformly distributed along the frequency axis: in the low-frequency region there are many densely spaced filters, while in the high-frequency region the filters become fewer and sparser.
The human auditory system is a special nonlinear system whose sensitivity varies with frequency. It is very good at extracting speech features: it can extract not only semantic information but also the speaker's personal characteristics, which is beyond what existing speech recognition systems can do. If the perceptual processing of the human auditory system could be simulated in a speech recognition system, the recognition rate might improve.
Mel-frequency cepstral coefficients (MFCC) take human auditory characteristics into account: the linear spectrum is first mapped onto the Mel nonlinear spectrum, which is based on auditory perception, and then converted to the cepstrum.
The formula for converting an ordinary frequency f (in Hz) to the Mel frequency is: mel(f) = 2595 * log10(1 + f / 700).
As can be seen, it converts frequencies that are non-uniform to the ear into a uniform scale, that is, into a uniform filter bank.
In the Mel frequency domain, human perception of pitch is linear. For example, if the Mel frequencies of two speech segments differ by a factor of two, the pitches the ear perceives will also differ by roughly a factor of two.
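A short sketch of this standard conversion in both directions (not from the original post):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

print(hz_to_mel([100, 1000, 8000]))  # low frequencies map nearly linearly,
                                     # high frequencies are compressed
```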
IV. Mel-Frequency Cepstral Coefficients (MFCC)
We obtain the Mel spectrum by passing the spectrum through a bank of Mel filters. In formula form: log X[k] = log(Mel spectrum). We then perform cepstrum analysis on log X[k]:
1) Take the logarithm: log X[k] = log H[k] + log E[k].
2) Take the inverse transform: x[k] = h[k] + e[k].
The cepstral coefficients h[k] obtained on the Mel spectrum are called Mel-frequency cepstral coefficients, or MFCC for short.
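A minimal sketch of this step, building on the hz_to_mel / mel_to_hz functions above: apply a triangular Mel filter bank to a frame's power spectrum, take logs, then a DCT; the low-order coefficients are the MFCCs. The filter count (26) and sizes are common assumptions, not values from the original post:

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=26, n_fft=512, fs=16000):
    """Triangular filters spaced uniformly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def frame_mfcc(power_spec, fbank, n_coeffs=12):
    mel_spec = power_spec @ fbank.T                    # Mel spectrum
    log_mel = np.log(mel_spec + 1e-10)                 # log(Mel spectrum)
    return dct(log_mel, norm="ortho")[1:n_coeffs + 1]  # keep coefficients 2..13
```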
Now let's summarize the process of extracting MFCC features (there is plenty of mathematical detail on the web, so it is not repeated here):
1) Pre-emphasize, frame, and window the speech signal (preprocessing that improves properties of the speech signal such as SNR and processing accuracy);
2) for each short-time analysis window, obtain the corresponding spectrum via the FFT (giving the spectra of the different time windows distributed along the time axis);
3) pass the spectra above through the Mel filter bank to obtain the Mel spectrum (which converts the natural linear spectrum into a Mel spectrum that reflects human auditory characteristics);
4) perform cepstrum analysis on the Mel spectrum (take the logarithm, then the inverse transform; in practice the inverse transform is usually realized with the DCT, the discrete cosine transform, and the 2nd through 13th DCT coefficients are taken as the MFCC coefficients) to obtain the Mel-frequency cepstral coefficients, the MFCC, which are the features of this frame of speech (cepstrum analysis yields the MFCC as the speech feature).
At this point, the speech can be described by a series of cepstral vectors, each of which is the MFCC feature vector of one frame.
A speech classifier can then be trained, and recognition performed, using these cepstral vectors. An end-to-end sketch of the whole pipeline follows.
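For an end-to-end check of the pipeline above, a library can do the framing, windowing, Mel filtering, and DCT internally. A short sketch assuming librosa is installed (the bundled "trumpet" example clip is just a stand-in; any mono audio works):

```python
import librosa

y, sr = librosa.load(librosa.example("trumpet"))    # load example audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
print(mfcc.shape)  # one 13-dimensional cepstral vector per frame
```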
V. References
[1] A better tutorial can be found here:
http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
[2] The main reference for this article, CMU's tutorial:
http://www.speech.cs.cmu.edu/15-492/slides/03_mfcc.pdf
[3] A C library for computing Mel-frequency cepstral coefficients (MFCC):
http://code.google.com/p/libmfcc/