The first part of this article briefly explains the AR spectrum without going into much detail. The second part introduces several common speech features, some of which (such as the zero-crossing rate) have appeared in previous blog posts. The third part describes the actual procedure and the recognition results. The goal is to extract features from the sound signals collected by the DSP in order to distinguish trucks from airplanes.
For more information, see xiahouzuoxin.github.io.
AR spectrum
The AR model (Auto-Regressive model) is a parametric method for estimating a signal's power spectrum. Computing the AR spectrum in MATLAB is easy: assume there is a vehicle signal x with 1024 points,
Y = pyulear(x, 256, 128);
[Figure: AR spectrum]
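As a slightly fuller sketch, the call can also return a frequency axis and the result can be plotted; the sampling rate fs, the synthetic test signal, and the order/FFT length below are illustrative assumptions only, not the values used in the experiment:

fs = 10e3;                             % assumed sampling rate in Hz (illustrative)
x  = randn(1024, 1);                   % stand-in for one 1024-point frame of the collected signal
[pxx, f] = pyulear(x, 32, 1024, fs);   % AR order 32, 1024-point spectrum (illustrative values)
plot(f, 10*log10(pxx));
xlabel('Frequency (Hz)'); ylabel('Power/frequency (dB/Hz)');
title('AR (Yule-Walker) spectrum');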
The calculation of the AR spectrum has two important parameters: the model order p (the number of AR coefficients) and the number of FFT points N. The AR model is a recursive model that estimates the current sample at time n from the previous p samples:

x(n) = -\sum_{k=1}^{p} a_k x(n-k) + u(n)

where u(n) is the white-noise input and p is the coefficient order in the formula above. The FFT is involved because the AR power spectrum

P(\omega) = \sigma^2 / | 1 + \sum_{k=1}^{p} a_k e^{-j\omega k} |^2

can be computed quickly by sampling the denominator polynomial on the unit circle with a discrete N-point FFT. In this formula, a_0 = 1 after the conversion, and when the coefficient sequence is extended to N points for the FFT, a_{p+1} = ... = a_{N-1} = 0; the number of FFT points is the length N used in the FFT calculation.
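A minimal sketch of this relation in MATLAB (aryule is from the Signal Processing Toolbox; p and N are illustrative choices, and x is the vehicle signal from above):

p = 32;  N = 1024;
[a, sigma2] = aryule(x, p);        % a = [1, a1, ..., ap], sigma2 = driving-noise variance
A = fft(a, N);                     % zero-pads a(p+1) ... a(N-1) = 0 and samples the unit circle
Pxx = sigma2 ./ abs(A).^2;         % AR power spectrum at N equally spaced frequencies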
What does the AR spectrum mean? My understanding of the power spectrum is that it describes how much energy the signal has at each frequency. These values are not necessarily true energy values, but the relative sizes of the spectrum at different frequencies are close to reality, so the relative relationship between the power-spectrum values at different frequencies matters more than the absolute values. For example, when a vehicle is far away its received energy is small, and when it is close the energy is large; yet for a stationary signal, although the absolute energy values differ, the spectral envelopes are similar. What we care about, therefore, is how the spectrum is distributed over frequency, much like a probability distribution: which frequency (or frequency band) carries a large share of the power.
By observing the AR spectrum we can see clearly in which frequency bands the main energy is concentrated, and can then focus the analysis of the signal on those bands.
For details on the AR spectrum, see Hu Guangshu's book Digital Signal Processing; for a C implementation, refer to my GitHub project; for the basic theory of AR models, see the earlier blog post "Modern digital signal processing - the AR model".
Audio signal feature extraction
1 Short-time average energy (Short Time Energy, STE)
The short-time average energy of one frame is E = (1/N) \sum_{n=0}^{N-1} x^2(n), where N is the frame length. It can be used to detect silent frames: the short-time energy of a silent frame is small, and this is more robust than judging directly from the maximum amplitude of x(n). Silent frames should be removed before further processing. In general, speech contains more silence than music (speech is not as continuous as music), so the average energy of speech is much smaller than that of music.
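A minimal sketch of the frame-by-frame computation and a simple silence rule (the frame length N and the threshold are assumptions for illustration):

N = 256;                               % frame length (assumed)
nFrames = floor(length(x) / N);
ste = zeros(nFrames, 1);
for i = 1:nFrames
    frame = x((i-1)*N + (1:N));
    ste(i) = sum(frame.^2) / N;        % short-time average energy of frame i
end
isSilent = ste < 0.1 * max(ste);       % illustrative threshold for silent frames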
2 Short-time zero-crossing rate (Short Time Zero-Crossing Rate, ZCR)
The short-time zero-crossing rate is the number of times the signal changes sign (from negative to positive or from positive to negative) within one audio frame.
In engineering terms, the zero-crossing rate reflects how rapidly the signal jumps between positive and negative values and is a crude measure of frequency. Combined with the average energy, it is commonly used for voice endpoint detection. The earlier blog post on an adaptive zero-crossing-rate algorithm for noisy signals also tried an improved zero-crossing rate for vibration-signal recognition.
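A minimal sketch, using the same framing as in the energy sketch above (the sign-change count is the usual simple definition):

zcr = zeros(nFrames, 1);
for i = 1:nFrames
    frame = x((i-1)*N + (1:N));
    zcr(i) = sum(abs(diff(sign(frame)))) / 2;   % number of sign changes in frame i
end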
3 Sub-band energy ratio (SER)
The sub-band energy ratio describes how the main energy is distributed over frequency. The procedure is to divide the frequency axis into B equal sub-bands; on the AR spectrum, the sub-band energy E_i is obtained by integrating the spectrum over each sub-band, and the sub-band energy ratio is then D_i = E_i / \sum_{j=1}^{B} E_j.
Different audio signals distribute their energy differently, and the sub-band energies reveal the main frequency bands in which that energy lies. The sub-band energy ratio is therefore a good feature for distinguishing targets whose energy is concentrated at different frequencies. The same idea can of course also be applied to an FFT spectrum.
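A minimal sketch of the sub-band energy ratio computed from the N-point AR spectrum Pxx of the earlier sketch (the number of sub-bands B is an assumption):

B = 8;                                          % number of equal-width sub-bands (assumed)
half = Pxx(1:N/2);                              % non-redundant half of the spectrum
edges = round(linspace(1, length(half)+1, B+1));
E = zeros(B, 1);
for b = 1:B
    E(b) = sum(half(edges(b):edges(b+1)-1));    % "integrate" the spectrum over sub-band b
end
ser = E / sum(E);                               % sub-band energy ratio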
4 Spectrum centroid (Spectrum Centroid, SC)
Treating the amplitude of the AR spectrum as a weight w_i, the spectral centroid is the weighted mean of the frequencies: SC = \sum_i w_i f_i / \sum_i w_i.
The spectral centroid is a statistical center of the whole spectrum; it is not necessarily equal to (though it may coincide with) the frequency of the dominant AR spectrum peak.
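A minimal sketch of the plain centroid, continuing the sub-band sketch above (fs is the assumed sampling rate from the first sketch):

f  = (0:length(half)-1)' * fs / N;     % frequency axis of the half spectrum, in Hz
w  = half(:);                          % spectrum amplitudes used as weights
sc = sum(w .* f) / sum(w);             % spectral centroid = weighted mean frequency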
5 Bandwidth (Band Width, BW)
Bandwidth is the difference between the upper and lower frequencies at which the spectrum magnitude falls to 0.707 of its value at the center frequency: BW = f_H - f_L.
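One reasonable reading of this definition, sketched in MATLAB and continuing the previous sketches (taking the "center" to be the main spectral peak is my assumption):

[pk, ipk] = max(half);                               % main peak value and its index
th = 0.707 * pk;                                     % 0.707 (-3 dB) level relative to the peak
iL = find(half(1:ipk) < th, 1, 'last');              % lower crossing below the peak
iH = ipk - 1 + find(half(ipk:end) < th, 1, 'first'); % upper crossing above the peak
if isempty(iL), iL = 1; end                          % fall back to the band edges if needed
if isempty(iH), iH = length(half); end
bw = f(iH) - f(iL);                                  % BW = fH - fL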
Support Vector Machine-based recognition
The engine speeds of trucks and airplanes are likely to differ, so the frequency Fmax corresponding to the highest peak of the AR spectrum of the sound signal is used as one feature dimension; in addition, the spectral centroid and the sub-band energy ratio are used as the other two feature dimensions. The final combined feature vector is therefore {Fmax, SC, SER}.
It is worth noting that the spectral centroid used in this article is not the simple statistic computed over the entire frequency range; instead:
First, the spectrum org_psd is sorted from high to low, giving the sorted PSD and the corresponding frequency indices idx.
Then only the largest spectral values are kept for the centroid calculation (enough of them that their sum reaches 0.707 of the total spectral energy over the whole frequency range), which avoids the influence of some high-frequency noise (see the sketch below).
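A sketch of my reading of this procedure (org_psd denotes the AR spectrum as in the text, f is the matching frequency axis, and the "sum reaches 0.707 of the total" rule is my interpretation of the description above):

p_col = org_psd(:);  f_col = f(:);                   % spectrum values and frequency axis as columns
[psd_sorted, idx] = sort(p_col, 'descend');          % sort the spectrum from high to low
k = find(cumsum(psd_sorted) >= 0.707 * sum(p_col), 1, 'first');
sel = idx(1:k);                                      % bins whose values sum to 0.707 of the total
sc_mod = sum(p_col(sel) .* f_col(sel)) / sum(p_col(sel));   % centroid over the selected bins only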
Recognition uses a support vector machine (SVM) model. For background on SVMs, see the introduction in July's blog post ("Understanding the three levels of SVM"); for the basics, Prof. Chih-Jen Lin's lecture notes are also worth reading. The toolbox used here is Chih-Jen Lin's LibSVM, which can be downloaded from its homepage http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html.
In the actual experiment, a self-designed DSP + FPGA board controls an AD7606 to acquire the sound signal and upload it to MATLAB on the PC for feature extraction and training. Whichever classification algorithm is used (the LibSVM support vector machine here, but neural networks or other algorithms would also work), the precondition is that the data itself is separable. Below is a visualization of the truck and airplane features; the two classes can clearly be separated with the features constructed above, so the recognition work can proceed.
[Figure: distribution of the truck and airplane features]
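A minimal sketch of how such a visualization might be produced (it assumes car_feat and plane_feat are 3-by-nSamples matrices holding {Fmax, SC, SER} per column, which is consistent with the indexing in the training code below but is still an assumption):

figure; hold on; grid on;
plot3(car_feat(1,:),   car_feat(2,:),   car_feat(3,:),   'b.');   % trucks
plot3(plane_feat(1,:), plane_feat(2,:), plane_feat(3,:), 'r.');   % airplanes
xlabel('Fmax'); ylabel('SC'); zlabel('SER');
legend('truck', 'airplane'); view(3);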
LibSVM is trained with an RBF kernel, which generally performs better than the alternatives; most parameters are left at their defaults (mainly gamma and C). For how to get better results with LibSVM, chiefly how to search for the best parameters, see my other post "LibSVM notes series (2): how to improve LibSVM classification performance".
There are 1400 groups of data in total in the experiment; 200 groups each of truck and airplane samples are used for training (the snippets below are only part of the actual code, which cannot be published in full for other reasons):
n_trian = 200;
label_car   = zeros(length(car_feat), 1);
label_plane = ones(length(plane_feat), 1);
instance = [car_feat(idx, 1:n_trian) plane_feat(idx, 1:n_trian)];   % idx selects the feature dimensions used
instance = instance';
label = [label_car(1:n_trian); label_plane(1:n_trian)];
model = svmtrain(label, instance, '-s 0 -t 2');   % the trained SVM model, used for the predictions below
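If the defaults are not good enough, explicit C and gamma values can be passed in the option string (the numbers here are placeholders, not tuned values):

model = svmtrain(label, instance, '-s 0 -t 2 -c 1 -g 0.5');   % C-SVC, RBF kernel, explicit C and gamma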
The remaining 1000 groups are used for testing:
tests = [car_feat(idx, (n_trian+1):end) plane_feat(idx, (n_trian+1):end)];
test_label = [label_car((n_trian+1):end); label_plane((n_trian+1):end)];
tests = tests';
pd_label = svmpredict(test_label, tests, model);
fprintf('\nRecognition accuracy %.4f\n', length(pd_label(test_label == pd_label)) / length(test_label));
The final prediction result is as follows:
[Figure: recognition result]
The prediction accuracy reaches 86.50%, which is good enough for practical use.