[Automatic Speech Recognition Course] Lesson 1 Statistical Speech Recognition
Address: http://blog.csdn.net/joey_su/article/details/36414877
Please credit the source when reprinting, and feel free to contact us.
Overview
ASR Speech Signal Analysis
- Features
- Spectrum Analysis
- Cepstrum Analysis
- Standard features: MFCC and PLP Analysis
- Dynamic Features
At the end of the first lesson we showed the block diagram of a speech recognition system, which indicates where signal analysis sits in the overall system:
Let's first get to know the speech generation process:
Speech is produced by the articulators together with the vocal tract. When we speak, the vocal cords vibrate with a certain period (the pitch period T0); the excitation passes through the pharynx, nasal cavity, oral cavity and other articulators and, shaped finally by the lips, forms the speech signal x(t). Taking the Fourier transform of x(t) gives the spectrum X(Ω), whose resonant peaks are the formants (F1, F2, F3, ...).
Speech is an analog signal; to analyze and process it on a computer, it must first be converted to digital form.
Sampling is what converts the analog signal into a digital one:
The air vibration excited by speech is a sound pressure wave, which is recorded with a microphone.
The analog signal recorded by the microphone is multiplied by an impulse train with sampling period T, and the resulting pulses are read off as a discrete-time sequence x[n].
The sampling frequency is Fs = 1/T. According to the Nyquist sampling theorem, the sampling frequency must be at least twice the highest frequency present in the signal. In practice, sampling frequencies such as 8 kHz (telephone speech) or 16 kHz (wideband speech) are typical.
Note that an analog low-pass (anti-aliasing) filter is applied before sampling to prevent aliasing.
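To make this concrete, here is a minimal Python/numpy sketch (not from the original lecture; the signal, rates and tone frequencies are made up) showing why an anti-aliasing filter matters when going from a densely sampled "analog" waveform down to 16 kHz:

```python
import numpy as np
from scipy import signal

# "Analog" stand-in: a densely sampled waveform at 48 kHz containing
# a 1 kHz tone plus a 9 kHz tone (above the 8 kHz Nyquist limit of 16 kHz).
fs_high = 48000
t = np.arange(0, 0.1, 1.0 / fs_high)
x_analog = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 9000 * t)

# Naive downsampling to 16 kHz (keep every 3rd sample) lets the 9 kHz
# component alias into the band of interest.
x_naive = x_analog[::3]

# scipy.signal.decimate applies an anti-aliasing low-pass filter first,
# mirroring the analog low-pass filter used before A/D conversion.
x_sampled = signal.decimate(x_analog, 3)

print(len(x_analog), len(x_sampled))  # 4800 -> 1600 samples
```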
After digitization, the next step is to extract acoustic features from the speech signal:
The sampled signal is pre-processed and acoustic feature vectors are extracted; these vectors are what the acoustic model is built from.
Acoustic features used for speech recognition should include the following features:
- Features should contain information that distinguishes one phoneme from another
- Good time resolution (10 ms)
- Good frequency resolution (~ 20 channels)
- Separation of the fundamental frequency (F0) and its harmonics
- Robust against different speakers
- Robust against noise and channel distortion
- Good Pattern Recognition features
- Low-dimensional features
- Feature independence
Below is the MFCC-based front-end processing:
The original speech signal is A/D converted to obtain a digital signal, and pre-emphasis then boosts its high-frequency components. The signal is windowed, and each windowed frame is processed along two paths. One path extracts the cepstral features: a discrete Fourier transform is applied, the spectral magnitudes are squared, the result is passed through the Mel filter bank and log-compressed, and finally an inverse discrete Fourier transform yields the cepstral features. The other path computes the energy of the windowed frame. The two are combined, dynamic features are appended, and a final feature transformation produces the features passed to the acoustic model.
Next, analyze each step.
A/D conversion was discussed above, so I won't repeat it here. Let's start with pre-emphasis.
We know that speech is produced by a glottal excitation driving a system (the vocal tract, etc.), so speech energy is concentrated at low frequencies, with much less energy at high frequencies. Boosting the high-frequency components therefore improves the signal-to-noise ratio there. This can be done with pre-emphasis, which is also commonly used in communication systems.
The pre-emphasis (first-order) filter boosts the high frequencies; a commonly used form is y[n] = x[n] − α·x[n−1], with α typically around 0.95 to 0.97:
Applying pre-emphasis to a vowel gives the following:
We can see that the high-frequency components have been boosted.
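A minimal sketch of the pre-emphasis filter in Python, assuming the common coefficient value α = 0.97 (the lecture may use a slightly different value):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

# Example: a made-up frame of speech samples (25 ms at 16 kHz).
x = np.random.randn(400)
y = pre_emphasis(x, alpha=0.97)   # high frequencies are boosted
```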
Windowing
We know that the speech signal is constantly changing (non-stationary), and non-stationary signals are hard to process directly, so speech processing algorithms generally assume the signal is stationary over short intervals.
Short-time stationarity: the speech signal is treated as a sequence of frames, each of which is assumed to be stationary.
Windowing: in the time domain, the waveform is multiplied by a window function to obtain the windowed waveform, x_w[n] = w[n]·x[n].
If we simply cut the speech signal into many short segments, each segment (frame) is effectively a rectangular window. The edges of a rectangular window are abrupt (discontinuous), so we should instead choose a window function that is continuous at the edges, allowing adjacent frames to transition smoothly.
In speech processing, a tapered window such as the Hamming or Hanning window is usually used instead of a rectangular window. The window function is
w[n] = (1 − α) − α·cos(2πn / (L − 1)),
where α is the window coefficient: α = 0.46 gives the Hamming window and α = 0.5 gives the Hanning window.
The effect of windowing in the time domain is shown below:
We can see that the edges of the tapered window transition much more smoothly.
The effect of windowing in the frequency domain is shown below:
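Here is a small numpy sketch of the two window functions above; the frame length of 400 samples (25 ms at 16 kHz) is just an assumed example:

```python
import numpy as np

L = 400  # frame length in samples (25 ms at 16 kHz)
n = np.arange(L)

# Generalised cosine window: w[n] = (1 - a) - a * cos(2*pi*n / (L - 1))
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))
hanning = 0.50 - 0.50 * np.cos(2 * np.pi * n / (L - 1))

frame = np.random.randn(L)    # stand-in for one frame of speech
windowed = frame * hamming    # multiply the waveform by the window
```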
Discrete Fourier Transform (DFT)
Purpose of the DFT: to extract spectral information (for example, the energy in each frequency band) from a windowed signal.
Input: the windowed signal (time domain)
Output: complex values X[k], giving the magnitude and phase of the k-th of N frequency components
DFT formula: X[k] = Σ_{n=0}^{N−1} x[n]·e^{−j2πkn/N}
Fast Fourier Transform (FFT): an efficient algorithm for computing the DFT, where N is a power of 2 with N > L (the frame length).
Windowing and Spectral Analysis
First the speech signal x[n] is windowed; x_m[n] denotes the samples in the m-th window. A Fourier transform is then applied to each frame to obtain the short-time power spectrum.
Two points deserve attention in this process. One is the frame length: a shorter frame gives a wideband analysis with high time resolution but low frequency resolution, while a longer frame gives a narrowband analysis with lower time resolution but higher frequency resolution. The other is that, to make the transition between frames smoother, a frame shift is used so that adjacent frames overlap.
For speech recognition we use a 20 ms frame length and a 10 ms frame shift.
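The whole framing-plus-DFT procedure can be sketched in a few lines of numpy. The 20 ms / 10 ms values follow the convention above, while the FFT size of 512 and the use of a Hamming window are assumptions for illustration:

```python
import numpy as np

def stft_power(x, fs=16000, frame_ms=20, shift_ms=10, n_fft=512):
    """Frame the signal, apply a Hamming window and return the
    short-time power spectrum (one row per frame)."""
    frame_len = int(fs * frame_ms / 1000)   # 20 ms -> 320 samples
    shift = int(fs * shift_ms / 1000)       # 10 ms -> 160 samples
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // shift
    power = np.empty((n_frames, n_fft // 2 + 1))
    for m in range(n_frames):
        frame = x[m * shift : m * shift + frame_len] * window
        spectrum = np.fft.rfft(frame, n=n_fft)   # N > L, N a power of 2
        power[m] = np.abs(spectrum) ** 2         # |X[k]|^2
    return power

x = np.random.randn(16000)   # 1 s of made-up "speech" at 16 kHz
P = stft_power(x)            # P.shape == (n_frames, 257)
```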
Comparison of wideband and narrowband spectrograms:
The concept here is the spectrogram. Essentially, each frame's spectrum is rotated 90 degrees, with the magnitude indicated by colour intensity (the larger the magnitude, the darker the colour); the columns for each frame are then laid out in time order. So the horizontal axis of a spectrogram is time (frames), the vertical axis is frequency, and the colour encodes the magnitude.
Short-term Spectrum Analysis
If the spectrogram is still hard to picture, think of a three-dimensional surface like the Longji rice terraces in Guangxi (no advertising intended). The X and Y axes are the same as in the spectrogram above (time and frequency), and the Z axis is the magnitude: the higher the "mountains", the greater the energy.
DFT Spectrum
Shown here is a 25 ms Hamming-windowed segment of a vowel, with its spectrum computed by the DFT.
DFT Spectrum features
From the description so far, the frequency bins are equally spaced. But the human ear is itself an extremely powerful speech recognition system, and when we study speech recognition the answers come largely from humans themselves. From the point of view of the human auditory system, our ears are selective about sound: above about 1000 Hz, our sensitivity to frequency differences decreases, which is related to the physiological structure of the ear.
The power spectrum also contains the harmonics of the fundamental frequency F0 (mentioned earlier), which makes it hard to estimate the spectral envelope; we will see a solution shortly.
In addition, adjacent frequency bins of the short-time Fourier transform are highly correlated, i.e., the power-spectrum representation is highly redundant.
Human Hearing
Let's take a look at the powerful human auditory system.
| Physiology | Perception |
| --- | --- |
| Intensity | Loudness |
| Fundamental frequency | Pitch |
| Spectral shape | Timbre |
| Onset/offset time | Timing |
| Interaural phase difference | Location |
Technical terms:
- Equal loudness contours
- Critical bandwidth
- Auditory filter (Critical Band Filter)
- Critical bandwidth
Equal Loudness Contours
Nonlinear frequency scale
As mentioned above, the human auditory system becomes less sensitive as frequency increases, which shows that human frequency perception is non-linear. In other words, the ear divides the frequency range into bands, and these bands are not equally spaced.
Below are three non-linear scales: the Mel scale, the Bark scale, and the logarithmic (ln) scale. The Mel scale, mel(f) = 2595·log10(1 + f/700), is the one most often used in practical speech processing.
Mel filter banks
I think a good figure is worth a lot of words (including this sentence).
First, why do we need a Mel filter bank at all? It is a bank of triangular band-pass filters with unequal spacing. From the discussion above we know that the Mel scale can replace the linear frequency scale so as to match human hearing, so we need to group the frequency bins of the linear scale accordingly, and that grouping is exactly what the Mel filter bank does. With 12 triangular filters, as in the figure, the frequency range is divided into 12 bands, giving a 12-dimensional Mel-scale power spectrum. Note that below 1000 Hz the filters are spaced linearly, while above 1000 Hz they are spaced logarithmically.
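Below is a rough numpy sketch of such a Mel filter bank. The 12 filters follow the figure's count (in practice 20 to 40 filters are more common), the helper names are mine rather than the lecture's, and it reuses the power spectrum P from the STFT sketch above:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=12, n_fft=512, fs=16000):
    """Triangular filters whose centres are equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / (right - centre)
    return fbank

fbank = mel_filterbank()        # shape (12, 257)
mel_energies = P @ fbank.T      # P is the power spectrum from the STFT sketch
```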
Logarithm energy
Why do we need to calculate the logarithm energy?
- You can use the logarithm to compress the dynamic range.
- Human sensitivity to signal energy is logarithmic: we are less sensitive to small changes at high energy than at low energy, i.e., changes in the low-energy region are more noticeable to us.
- The logarithm makes the features less sensitive to variations in acoustic coupling (a multiplicative channel effect becomes an additive offset, which is easier to remove).
- The phase information is discarded; phase is not very important for speech recognition (though not everyone agrees).
The log energies are obtained by taking the logarithm of the power (squared magnitude) output of each Mel filter.
Cepstrum Analysis
What is the cepstrum? The word is formed by reversing the first four letters of "spectrum" (spec → ceps). So we can roughly think of the cepstrum as a special inverse transform of the (log) spectrum, one designed for homomorphic processing.
The speech generation model can be viewed as the source-filter model:
Source: vibration of the vocal cords generates the source waveform.
Filter: the source waveform then passes through the vocal tract; the positions of the tongue, jaw and so on give the tract a particular shape, and hence a particular filtering characteristic.
Note that the characteristics of the source (F0, the dynamics of the glottal pulse) are not helpful for distinguishing phonemes;
the filter, determined by the positions of the articulators, is what allows phonemes to be distinguished.
So what does all this buy us?
Cepstral analysis lets us separate the source and the filter! With the filter part isolated, we can use it to distinguish phonemes, and this is why cepstral features matter so much. The most basic unit of speech recognition is the phoneme, i.e., the phone (not a mobile phone); bi-phones, tri-phones and so on will come later. If we can distinguish (recognize) this most basic unit, the rest of the work can follow.
Now, the long-awaited diagram: separating the power spectrum into the spectral envelope and the F0 harmonics.
The log spectrum (frequency domain) is converted into the cepstrum (quefrency domain) by an inverse Fourier transform. Homomorphic filtering (in practice, applying two lifters in the cepstral domain) splits it into a low-quefrency part and a high-quefrency part; transforming each back with a Fourier transform gives, from the low part, the smoothed spectral envelope, and, from the high part, the fine structure of the log spectrum.
The third figure is what we want: the envelope of the original power spectrum. The residual in the fourth figure is the separated source, which can be discarded. It is like squeezing orange juice: we keep the juice and filter out the pulp.
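A small numpy sketch of this homomorphic separation, assuming a 512-point FFT and a cutoff of 30 cepstral bins (both arbitrary illustrative choices): liftering the low quefrencies and transforming back gives the smoothed envelope, while the discarded high quefrencies correspond to the source.

```python
import numpy as np

n_fft = 512
frame = np.random.randn(400) * np.hamming(400)   # stand-in windowed frame
log_spec = np.log(np.abs(np.fft.rfft(frame, n=n_fft)) + 1e-10)

# Real cepstrum: inverse FFT of the (real, symmetric) log spectrum.
cepstrum = np.fft.irfft(log_spec, n=n_fft)

# Low-quefrency lifter: keep only the first few cepstral bins
# (and their mirror image), which carry the vocal-tract envelope.
n_keep = 30
lifter = np.zeros(n_fft)
lifter[:n_keep] = 1.0
lifter[-(n_keep - 1):] = 1.0

# Back to the frequency domain: a smoothed log-spectral envelope.
envelope = np.fft.rfft(cepstrum * lifter).real
```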
Cepstrum
We have already introduced what the cepstrum is, so let's look at it in more detail.
As mentioned above, the cepstrum is obtained by taking the inverse discrete Fourier transform of the log magnitude spectrum. It lives in a time-like domain whose variable is usually called quefrency, because the cepstrum is the inverse DFT of a spectrum.
The inverse discrete Fourier transform is x[n] = (1/N) Σ_{k=0}^{N−1} X[k]·e^{j2πkn/N}; here it is applied to the log magnitude spectrum.
Note that because the log power spectrum is real and symmetric, its inverse DFT is equivalent to a discrete cosine transform (DCT). The DCT concentrates the energy in the first few coefficients, so the low-quefrency components, which carry most of the energy, can be extracted simply by keeping those leading coefficients.
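Continuing the earlier sketches, the cepstral coefficients can therefore be obtained by a DCT of the log Mel filter-bank energies, keeping the first 12 coefficients (a sketch under those assumptions, using scipy's DCT and the mel_energies array from the filter-bank sketch):

```python
import numpy as np
from scipy.fftpack import dct

# mel_energies: (n_frames, n_filters) filter-bank outputs from earlier.
log_mel = np.log(mel_energies + 1e-10)

# Because the log power spectrum is real, the inverse DFT reduces to a DCT;
# keeping the first 12 coefficients keeps the low-quefrency (envelope) part.
mfcc = dct(log_mel, type=2, axis=1, norm='ortho')[:, :12]
```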
Mel-frequency cepstral coefficients (MFCCs)
Let's answer the obvious question first: what are MFCCs?
Mel: the Mel scale mentioned above; frequency: the (non-linear) Mel frequency; cepstral: the features are obtained by cepstral analysis; coefficients: well, that one needs no explanation.
This also gives the notion of a smoothed spectrum: transform to the cepstral domain, truncate, and transform back to the frequency domain.
So what are MFCCs good for?
- They are widely used as acoustic features in HMM-based speech recognition systems.
- The first 12 MFCCs are usually taken as the feature vector (which discards the F0 information).
- Compared with spectral features they are less correlated, which makes them easier to model.
- The representation is very compact: just 12 features describe a 20 ms frame of speech (compare this with the spectrogram above).
- For standard HMM-based systems, MFCCs give better recognition performance than filter-bank or spectral features.
- Unfortunately, MFCCs are not very robust to noise.
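In practice you rarely write the whole pipeline by hand; for example, librosa can compute MFCCs directly. A minimal usage sketch, where "speech.wav" is a placeholder file name and the hop and FFT sizes are assumptions matching the frame settings above:

```python
import librosa

# Load a waveform at 16 kHz and compute 12 MFCCs per frame.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc_feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                  n_fft=512, hop_length=160)  # 10 ms hop
print(mfcc_feats.shape)   # (12, n_frames)
```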
Perceptual Linear Prediction (PLP)
What is this? Let's take a look.
PLP (Hermansky, JASA 1990)
PLP uses equal-loudness pre-emphasis and cube-root compression (motivated by perception) instead of the log compression used in MFCCs, and obtains the cepstral coefficients from a linear-prediction (autoregressive) model.
It has been shown that, compared with MFCCs, PLP gives better recognition accuracy and better robustness to noise.
It looks pretty cool. Let's see how it works:
The speech signal is Fourier transformed and the magnitude is squared to obtain the power spectrum. Critical-band integration is then performed, followed by equal-loudness pre-emphasis and cube-root (intensity-to-loudness) compression. An inverse Fourier transform is applied, and linear prediction finally yields the PLP coefficients.
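Here is a very rough, simplified sketch of the PLP idea in Python: cube-root compression of (made-up) critical-band energies followed by an autoregressive fit. It omits the equal-loudness pre-emphasis and the final cepstral recursion, so it only illustrates the flow and is not Hermansky's full method:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def plp_like(band_energies, order=12):
    """Rough PLP-style sketch for one frame of critical-band energies:
    cube-root compression, then an autoregressive (LP) fit via the
    autocorrelation method. Equal-loudness weighting is omitted."""
    loudness = band_energies ** (1.0 / 3.0)               # cube-root compression
    # Treat the bands as half of a symmetric "spectrum" and take an
    # inverse FFT to get an autocorrelation-like sequence.
    power = np.concatenate([loudness, loudness[-2:0:-1]])
    autocorr = np.fft.ifft(power).real[: order + 1]
    # Solve the normal equations R a = r for the LP coefficients.
    return solve_toeplitz(autocorr[:order], autocorr[1 : order + 1])

frame_bands = np.abs(np.random.randn(21)) + 0.1   # stand-in band energies
lp_coeffs = plp_like(frame_bands, order=12)
```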
Dynamic Features
PLP looks fancy, but let's go back to MFCCs and study their dynamic features.
Speech is not constant from frame to frame, so we add features that describe how the cepstral coefficients change over time; in other words, we want the features to capture the motion of speech rather than just the static snapshots drawn above.
These are called delta features, i.e., dynamic features or time derivatives.
Next we compute the delta of the cepstral features at time t:
A more refined method estimates the time derivative as the slope of a regression over neighbouring frames (typically a few frames on each side).
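A common way to compute such regression-based deltas is the HTK-style formula sketched below (which may differ in detail from the lecture's); it reuses the mfcc array, of shape (n_frames, 12), from the DCT sketch above:

```python
import numpy as np

def delta(features, n=2):
    """Regression-based deltas:
    d_t = sum_{k=1..n} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..n} k^2)."""
    padded = np.pad(features, ((n, n), (0, 0)), mode='edge')
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = np.zeros_like(features)
    for k in range(1, n + 1):
        out += k * (padded[n + k : n + k + len(features)]
                    - padded[n - k : n - k + len(features)])
    return out / denom

deltas = delta(mfcc)          # first-order deltas
delta_deltas = delta(deltas)  # second-order (acceleration) features
```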
The standard speech recognition feature vector has 39 dimensions:
This is easier to show than to describe, so here is a figure.
Dynamic Feature Estimation
We all know that a slope indicates how fast something changes, so to some extent it reflects the dynamics of the speech signal; of course, the "signal" here refers to the feature parameters rather than the raw waveform.
Note that energy (together with its first- and second-order deltas) is also one of the feature parameters. Where does the energy come from? Let's look back at the MFCC extraction process:
Haha, did you spot the energy term?
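Putting the pieces together, here is a sketch of assembling the 39-dimensional vector from the earlier arrays (mfcc, mel_energies and the delta function). For simplicity the log energy is taken from the filter-bank outputs; in practice it is often computed from the raw windowed frame instead:

```python
import numpy as np

# Per-frame log energy (here approximated from the filter-bank outputs).
log_energy = np.log(mel_energies.sum(axis=1) + 1e-10)[:, None]

static = np.hstack([mfcc, log_energy])      # 12 MFCCs + energy = 13
d1 = delta(static)                          # 13 deltas
d2 = delta(d1)                              # 13 delta-deltas
features_39 = np.hstack([static, d1, d2])   # 39-dimensional vector
print(features_39.shape)                    # (n_frames, 39)
```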
Feature Transformation
Orthogonal Transformation
- DCT (discrete cosine transform)
- PCA (principal component analysis)
Maximizing separability between classes
- LDA (linear discriminant analysis) / Fisher's linear discriminant
- HLDA (heteroscedastic linear discriminant analysis)
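For completeness, a tiny scikit-learn sketch of two of these transforms applied to made-up 39-dimensional features (the labels and dimensionalities are arbitrary, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up data: 39-dimensional frames with per-frame phone labels.
X = np.random.randn(1000, 39)
labels = np.random.randint(0, 10, size=1000)

# PCA: orthogonal transform that decorrelates and ranks dimensions by variance.
X_pca = PCA(n_components=20).fit_transform(X)

# LDA: supervised projection that maximizes between-class separability.
X_lda = LinearDiscriminantAnalysis(n_components=9).fit_transform(X, labels)
```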
Summary
ASR features
- MFCCs
- Short-time discrete Fourier analysis
- Mel filter bank
- Log of the squared magnitude
- Inverse discrete Fourier transform (discrete cosine transform)
- Keep the first 12 coefficients
- Delta features
- 39-dimensional vector:
12 MFCCs + energy; + deltas; + delta-deltas