Speech Signal Processing (IV): Mel-Frequency Cepstral Coefficients (MFCC)

Zouxy09@qq.com

http://blog.csdn.net/zouxy09

 

This semester I am taking a speech signal processing course, and the exam is coming up, so I need to get familiar with the material. I rarely attend lectures, so now I have to cram. While I am at it, I would like to organize my knowledge more clearly and share it with you. The fourth topic is MFCC. Since this was written in a hurry, there may well be mistakes in it; I hope you will point them out. Thank you.

 

In any automatic speech recognition system, the first step is feature extraction. In other words, we need to keep the components of the audio signal that are useful for identifying the linguistic content, and discard everything else, such as background noise and emotion.

It helps to understand how speech is produced. A person produces sound through the vocal tract, and the shape of the vocal tract determines what sound comes out. The shape of the vocal tract includes the positions of the tongue, the teeth, and so on. If we could determine this shape accurately, we could accurately describe the phoneme being produced. The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and MFCCs are a feature that accurately describes this envelope.

MFCCs (Mel-frequency cepstral coefficients) are features widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in 1980, and since then they have remained an outstanding hand-crafted feature in speech recognition that has never really been surpassed.

Well, we have just mentioned a very important keyword: the shape of the vocal tract. We know it is important, and we know it shows up in the envelope of the short-time power spectrum. So what is the power spectrum? What is the envelope? What are MFCCs? Why do they work, and how do we compute them? Let's go through it step by step.

 

I. The Spectrogram

We are dealing with speech signals, and how we describe them matters, because different representations expose different information. So which kind of representation is best for observation and analysis? Let's first look at something called the spectrogram.

Here, the speech is divided into many frames, and each frame corresponds to a spectrum (computed with a short-time FFT) that describes the relationship between frequency and energy. In practice, three kinds of spectra are used: the linear amplitude spectrum, the log amplitude spectrum, and the power spectrum. (In the log amplitude spectrum the amplitude of each spectral line is log-scaled, so the vertical axis is in decibels (dB). The purpose of this transformation is to raise the low-amplitude components relative to the high-amplitude ones, making it possible to observe periodic signals buried in low-amplitude noise.)

First we plot the spectrum of one frame of speech with coordinates, as in the left figure. Now rotate that spectrum 90 degrees to get the middle figure. Then map the amplitudes to grayscale (this can be understood as quantizing the continuous amplitudes into 256 levels), where 0 is black and 255 is white: the larger the amplitude, the darker the corresponding region. This gives the rightmost figure. Why do this? So that we gain a time dimension: we can now display the spectrum of a whole segment of speech rather than a single frame, and see both static and dynamic information at a glance. The advantages of this will become clear later.

In this way, we obtain a spectrum that changes over time: the spectrogram, which describes the speech signal.
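To make this concrete, here is a minimal numpy sketch of how such a spectrogram can be computed. The function name and the defaults are my own choices (400-sample frames and a 160-sample hop correspond to 25 ms and 10 ms at a 16 kHz sampling rate):

    import numpy as np

    def log_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
        # Frame the signal, window each frame, and stack the
        # log-magnitude spectra column by column.
        n_frames = 1 + (len(signal) - frame_len) // hop
        window = np.hamming(frame_len)
        spec = np.empty((n_fft // 2 + 1, n_frames))
        for t in range(n_frames):
            frame = signal[t * hop : t * hop + frame_len] * window
            mag = np.abs(np.fft.rfft(frame, n_fft))   # linear amplitude spectrum
            spec[:, t] = 20 * np.log10(mag + 1e-10)   # log amplitude spectrum, in dB
        return spec   # rows: frequency bins, columns: frames (time)

Plotting spec as a grayscale image, with larger values drawn darker, gives exactly the spectrogram described above.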

The figure shows the spectrogram of a segment of speech. The dark parts are the peaks of the spectrum (the formants).

Why do we represent speech with a spectrogram?

First, the properties of phonemes (phones) can be observed better this way. In addition, sounds can be recognized better by observing the formants and how they change over time. Hidden Markov models, for example, implicitly model the spectrogram to achieve good recognition performance. Another use is to visually assess the quality of a TTS (text-to-speech) system, by directly comparing how well the spectrogram of the synthesized speech matches that of natural speech.

 

II. Cepstrum Analysis

Below is a spectrum of one frame of speech. The peaks represent the dominant frequency components of the speech; we call these peaks formants, and the formants carry the identifying properties of the sound (like a personal ID card). So they are especially important, and can be used to recognize different sounds.

Since they are so important, we need to extract them! And we need to extract not only the positions of the formants but also how they change over time. So we extract the spectral envelope: a smooth curve that connects the formant peaks.

We can think of the original spectrum as consisting of two parts: the spectral envelope and the spectral details. The log spectrum is used here, so the unit is dB. Now we need to separate the two parts to obtain the envelope.

How do we separate them? In other words, given log X[k], how do we obtain log H[k] and log E[k] such that log X[k] = log H[k] + log E[k]?

To achieve this we need a mathematical trick. What is the trick? Taking an FFT of the spectrum. A Fourier transform applied to a spectrum is, in effect, an inverse Fourier transform (IFFT). Note that we operate on the log of the spectrum, which is also part of the trick. An IFFT of the log spectrum can be viewed as describing the signal on a pseudo-frequency axis.

From the figure above we can see that the envelope is mainly a low-frequency component (we need to shift our thinking here: the horizontal axis of the log spectrum is no longer viewed as frequency but as if it were time). If we regard the envelope as a sinusoid with 4 cycles per second, it gets a peak at 4 Hz on the pseudo-frequency axis. The spectral details are mainly high-frequency: if we regard them as a sinusoid with 100 cycles per second, they get a peak at 100 Hz on the pseudo-frequency axis.

Superimposing the two gives back the original log spectrum.
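This decomposition is easy to verify numerically. The following toy numpy demo builds a synthetic "log spectrum" out of a 4-cycle sinusoid (the envelope) plus a 100-cycle sinusoid (the details), following the numbers in the text; the amplitudes are arbitrary. Its IFFT then shows exactly two peaks, at pseudo-frequencies 4 and 100:

    import numpy as np

    k = np.arange(1024)                               # "frequency" axis
    envelope = 10 * np.sin(2 * np.pi * 4 * k / 1024)  # slow: 4 cycles
    detail = 2 * np.sin(2 * np.pi * 100 * k / 1024)   # fast: 100 cycles
    log_spectrum = envelope + detail                  # superposition

    cepstrum = np.abs(np.fft.ifft(log_spectrum))      # pseudo-frequency domain
    print(np.argsort(cepstrum[1:512])[-2:] + 1)       # -> bins 100 and 4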

In practice we already know log X[k], so we can compute its IFFT to obtain x[k]. As the figure shows, h[k], the envelope part, is the low-frequency portion of x[k]. So we can obtain h[k] from x[k] with a low-pass filter! That is how we separate the two parts and obtain the h[k] we want: the envelope of the spectrum.

x[k] is in fact the cepstrum ("cepstrum" is a coined word, made by reversing the first four letters of "spectrum"). h[k] is the low-frequency part of the cepstrum; it describes the spectral envelope and is widely used as a feature in speech recognition.

Now let's summarize cepstrum analysis. It is really just the following process:

1) Fourier-transform the original speech signal to obtain the spectrum: X[k] = H[k] E[k];

Considering only the magnitudes: |X[k]| = |H[k]| |E[k]|;

2) Take the logarithm of both sides: log|X[k]| = log|H[k]| + log|E[k]|;

3) Take the inverse Fourier transform of both sides to obtain the cepstrum: x[k] = h[k] + e[k].

There is actually a technical name for this: homomorphic signal processing. Its purpose is to turn a nonlinear problem into a linear one. Correspondingly, the original speech signal is a convolved signal (the vocal tract can be regarded as a linear time-invariant system, and speech production can be understood as an excitation passing through this system). The first step converts the convolution into multiplication (convolution in the time domain corresponds to multiplication in the frequency domain); the second step converts the multiplication into addition by taking logarithms; and the third step takes an inverse transform of the sum. Although the result is again a time-like sequence, it is clearly different from the original discrete time domain, so the resulting domain is called the quefrency domain.

In summary, the cepstrum is the spectrum obtained by taking the logarithm of a signal's spectrum and then applying an inverse Fourier transform. The computation proceeds as: signal → FFT → magnitude → log → IFFT → cepstrum.
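Here is a minimal numpy sketch of this process for one windowed frame, recovering the envelope by low-pass filtering the cepstrum (the cutoff of 30 cepstral samples is an illustrative choice, not a standard):

    import numpy as np

    def spectral_envelope(frame, n_fft=512, cutoff=30):
        # FFT -> log magnitude: log|X[k]| = log|H[k]| + log|E[k]|
        log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-10)
        # IFFT of the log spectrum gives the cepstrum x[k] = h[k] + e[k]
        cepstrum = np.fft.irfft(log_mag, n_fft)
        # keep only the low-quefrency part (a symmetric low-pass "lifter")
        lifter = np.zeros(n_fft)
        lifter[:cutoff] = 1.0
        lifter[-cutoff + 1:] = 1.0
        # transform back: the smooth curve approximates log|H[k]|
        envelope = np.fft.rfft(cepstrum * lifter).real
        return log_mag, envelope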

 

III. Mel-Frequency Analysis

Now, let's review what we have done so far: for a segment of speech, we can obtain its spectral envelope (the smooth curve connecting the formant peaks). However, experiments on human auditory perception show that the ear focuses only on certain regions, not the whole spectral envelope.

Mel-frequency analysis is based on such auditory perception experiments. They show that the human ear acts like a bank of filters, attending only to certain frequency components (human hearing is frequency-selective): it lets signals in certain frequency bands through and simply ignores frequencies it does not want to perceive. These filters are not uniformly distributed along the frequency axis: in the low-frequency region there are many filters, densely spaced, while in the high-frequency region the filters become fewer and sparsely spaced.

The human auditory system is an unusual nonlinear system, whose sensitivity to signals differs across frequencies. In terms of extracting speech features, the human auditory system does an excellent job: it extracts not only the semantic content but also the speaker's personal characteristics, which existing speech recognition systems cannot match. If a speech recognition system could mimic the processing characteristics of human auditory perception, it might improve the recognition rate.

Mel-frequency cepstral coefficients (MFCCs) take human auditory characteristics into account: the linear spectrum is first mapped onto the Mel scale, a nonlinear scale based on auditory perception, and then converted to the cepstrum.

The formula for converting an ordinary frequency f (in Hz) to a Mel frequency is:

mel(f) = 2595 * log10(1 + f / 700)
As the formula shows, this mapping converts the ear's nonuniform frequency resolution into a uniform one, i.e., a uniform filter bank on the Mel scale.
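In code, the conversion in both directions looks like this (a small sketch using the constants from the formula above; the 8000 Hz upper limit and the 12 sample points are only for illustration):

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Points spaced uniformly on the Mel scale come out dense at low
    # frequencies and sparse at high frequencies when mapped back to Hz:
    centers = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(8000), 12))
    print(np.round(centers))   # the gaps widen as frequency grows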

In the Mel-frequency domain, human perception of pitch is linear. For example, if the Mel frequencies of two speech segments differ by a factor of two, the pitches perceived by the human ear also differ by about a factor of two.

 

IV. Mel-Frequency Cepstral Coefficients (MFCC)

Passing the spectrum through a bank of Mel filters gives the Mel spectrum, and we take its logarithm: log X[k] = log(Mel-spectrum). We then perform cepstrum analysis on log X[k]:

1) log X[k] = log H[k] + log E[k];

2) take the inverse transform: x[k] = h[k] + e[k].

The cepstral coefficients h[k] obtained on the Mel spectrum are called Mel-frequency cepstral coefficients (MFCCs).

Now let's summarize the MFCC extraction process (the detailed math is easy to find online, so I won't reproduce it here; see the code sketch at the end of this section):

1) Pre-emphasize the speech signal, split it into frames, and apply a window to each frame;

2) For each short-time analysis window, use the FFT to obtain the corresponding spectrum;

3) Pass the spectrum through the Mel filter bank to obtain the Mel spectrum;

4) Perform cepstrum analysis on the Mel spectrum (take the logarithm, then take an inverse transform; in practice the inverse transform is implemented with a discrete cosine transform (DCT), and the 2nd through 13th DCT coefficients are kept) to obtain the Mel-frequency cepstral coefficients (MFCCs). These MFCCs are the features of the speech frame.

At this point the speech can be described by a sequence of cepstral vectors, one MFCC feature vector per frame.

A speech classifier can then be trained on these cepstral vectors and used for recognition.
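Putting the four steps together, here is a compact numpy/scipy sketch of MFCC extraction. It assumes 16 kHz audio, 25 ms frames with a 10 ms hop, and 26 triangular Mel filters, and it keeps DCT coefficients 2 through 13 as described above; all names and defaults are mine, and real implementations differ in many details:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc(signal, sr=16000, n_filters=26, n_fft=512,
             frame_len=400, hop=160, n_ceps=12):
        # 1) pre-emphasis, framing, Hamming window
        emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
        n_frames = 1 + (len(emph) - frame_len) // hop
        frames = np.stack([emph[t*hop : t*hop+frame_len] * np.hamming(frame_len)
                           for t in range(n_frames)])
        # 2) power spectrum of each frame
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
        # 3) triangular Mel filter bank: centers spaced uniformly on the Mel scale
        mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
        hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
        bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
        log_mel = np.log(power @ fbank.T + 1e-10)                 # log Mel spectrum
        # 4) DCT of the log Mel spectrum; keep coefficients 2..13
        return dct(log_mel, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]

Each row of the returned array is the MFCC feature vector of one frame, i.e., one of the cepstral vectors mentioned above.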

 

V. References

[1] A good tutorial: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

[2] The CMU tutorial this article mainly draws on: http://www.speech.cs.cmu.edu/15-492/slides/03_mfcc.pdf

[3] libmfcc, a C library for computing Mel-frequency cepstral coefficients: http://code.google.com/p/libmfcc/
