Reposted from the original: http://blog.csdn.net/shichaog/article/details/52399354. Many thanks to the author.
VAD (Voice Activity Detection) algorithms detect whether a signal contains speech. In far-field speech interaction scenarios, VAD faces two challenges: 1. How to successfully detect speech at the lowest energies (sensitivity).
2. How to detect reliably in heavy-noise environments (miss rate and false-detection rate).
A miss means a segment really is speech but goes undetected; the false-detection rate is the probability that a non-speech signal is classified as speech. Misses are generally unacceptable, whereas false detections can be filtered out further by the back-end ASR and NLP algorithms. However, false detections drive up system resource usage, and power consumption and heat rise with them, which becomes a real problem for mobile and portable devices.
Following the earlier discussion of WebRTC's AEC algorithm, this post covers WebRTC's VAD, which adopts a Gaussian model; this model is used extremely widely.
Gaussian distribution
The Gaussian distribution is also called the normal distribution.
If a random variable X follows a Gaussian distribution with mathematical expectation μ and variance σ², then:
X ~ N(μ, σ²)
Its probability density function is:
f(x) = 1/(√(2π)·σ) · e^(−(x−μ)²/(2σ²))
How the Gaussian is used in WebRTC:
f(x_k | Z, r_k) = 1/(√(2π)·σ_z) · e^(−(x_k−μ_z)²/(2σ_z²))
x_k is the selected feature vector. In WebRTC, x_k is the energy of six sub-bands (the sub-bands are 80~250 Hz, 250~500 Hz, 500 Hz~1 kHz, 1~2 kHz, 2~3 kHz, and 3~4 kHz; the variable feature_vector stores the sequence of sub-band energies). r_k is the combination of the mean μ_z and the variance σ_z; together they determine the probability given by the Gaussian distribution. The condition z = 0 computes the probability of noise; z = 1 computes the probability of speech.
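As an illustration, here is a minimal floating-point sketch of this likelihood computation; the actual WebRTC implementation (WebRtcVad_GaussianProbability in common_audio/vad/vad_gmm.c) computes the same quantity in fixed-point arithmetic:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Gaussian likelihood of a sub-band energy x given mean mu and
   standard deviation sigma. Floating-point sketch only; WebRTC's
   vad_gmm.c does this in fixed point. */
static double gaussian_probability(double x, double mu, double sigma) {
  double d = (x - mu) / sigma;
  return exp(-0.5 * d * d) / (sqrt(2.0 * M_PI) * sigma);
}

For each sub-band this is evaluated twice: once with the noise parameters (z = 0) and once with the speech parameters (z = 1).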
The highest frequency used here is 4 kHz because the WebRTC code downsamples the input (48 kHz, 32 kHz, 16 kHz) to 8 kHz, so by the Nyquist sampling theorem the useful spectrum lies below 4 kHz.
Of course, an 8 kHz cutoff frequency could also be used, but that would require retraining and modifying the parameters of the Gaussian model. I have tried this algorithm; compared with DNN-based methods it is more flexible, which shows in its adaptive parameter updates. For example, in a quiet home scene at night the noise mean is relatively low, while during the day there is more ambient noise and the noise mean is adjusted accordingly. For a DNN method, once the parameters are trained, the applicable scenario is fixed; to extend it to new scenes you must first collect target-scene data, label it, and retrain (usually increasing the number of parameters). This process leads to: 1. high data-collection costs; 2. expensive computation from the larger parameter count (VAD generally runs continuously).
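To make the adaptive-update point concrete, here is a deliberately simplified sketch of the idea: when a frame is judged to be noise, the per-band noise means are nudged toward the new observation. The function name update_noise_means and the smoothing factor alpha are illustrative assumptions, not WebRTC code; the actual update in vad_core.c is more elaborate (fixed-point, with minimum-statistics tracking):

#define NUM_BANDS 6

/* Illustrative exponential smoothing of the per-band noise means.
   update_noise_means() and alpha are hypothetical, not WebRTC code. */
static void update_noise_means(double noise_mean[NUM_BANDS],
                               const double feature[NUM_BANDS],
                               double alpha) {
  int i;
  for (i = 0; i < NUM_BANDS; i++) {
    /* Slowly track the current observation while classified as noise. */
    noise_mean[i] = (1.0 - alpha) * noise_mean[i] + alpha * feature[i];
  }
}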
WebRTC in fact uses a GMM (Gaussian mixture model).
(Video link address to be added.)
WebRTC Algorithm Flow

1. Setting the VAD aggressiveness mode

There are four modes, identified by the numbers 0~3; the degree of aggressiveness is positively correlated with the number:
0: normal; 1: low bitrate; 2: aggressive; 3: very aggressive.
These aggressiveness modes are closely tied to the parameters below (a usage sketch showing how the mode is set follows them).
<common_audio/vad/vad_core.c>

// Mode 0, Quality.
static const int16_t kOverHangMax1Q[3] = { 8, 4, 3 };
static const int16_t kOverHangMax2Q[3] = { 14, 7, 5 };
static const int16_t kLocalThresholdQ[3] = { 24, 21, 24 };
static const int16_t kGlobalThresholdQ[3] = { 57, 48, 57 };
// Mode 1, Low bitrate.
static const int16_t kOverHangMax1LBR[3] = { 8, 4, 3 };
static const int16_t kOverHangMax2LBR[3] = { 14, 7, 5 };
static const int16_t kLocalThresholdLBR[3] = { 37, 32, 37 };
static const int16_t kGlobalThresholdLBR[3] = { 100, 80, 100 };
// Mode 2, Aggressive.
static const int16_t kOverHangMax1AGG[3] = { 6, 3, 2 };
static const int16_t kOverHangMax2AGG[3] = { 9, 5, 3 };
static const int16_t kLocalThresholdAGG[3] = { 82, 78, 82 };
static const int16_t kGlobalThresholdAGG[3] = { 285, 260, 285 };
// Mode 3, Very aggressive.
static const int16_t kOverHangMax1VAG[3] = { 6, 3, 2 };
static const int16_t kOverHangMax2VAG[3] = { 9, 5, 3 };
static const int16_t kLocalThresholdVAG[3] = { 94, 94, 94 };
static const int16_t kGlobalThresholdVAG[3] = { 1100, 1050, 1100 };

These constants are used when calculating the probabilities of the Gaussian model.
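On the application side, the aggressiveness mode is selected through the public API declared in webrtc_vad.h. A minimal usage sketch follows; note that the exact signatures (in particular WebRtcVad_Create) vary slightly between WebRTC versions, and the newer style is assumed here:

#include <stdint.h>
#include <stddef.h>
#include "webrtc_vad.h"  /* common_audio/vad/include/webrtc_vad.h */

/* Classify one audio frame. Returns 1 for speech, 0 for non-speech,
   -1 on error. */
int classify_frame(const int16_t* frame, size_t samples, int sample_rate_hz) {
  VadInst* vad = WebRtcVad_Create();  /* older versions: WebRtcVad_Create(&vad) */
  int is_speech = -1;
  if (vad != NULL && WebRtcVad_Init(vad) == 0 &&
      WebRtcVad_set_mode(vad, 2) == 0) {  /* mode 0..3; 2 = aggressive */
    is_speech = WebRtcVad_Process(vad, sample_rate_hz, frame, samples);
  }
  WebRtcVad_Free(vad);
  return is_speech;
}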
2. Frame length setting

a) Three frame lengths can be used: 80, 160, and 240 samples, corresponding to 10 ms, 20 ms, and 30 ms; in fact, only the 10 ms frame length is currently supported.
b) Other sample rates (48 kHz, 32 kHz, 24 kHz, 16 kHz) are resampled to 8 kHz before the VAD is computed.
These frame lengths are chosen because speech is a short-time stationary signal: it can be regarded as stationary over 10 ms~30 ms. Signal-processing methods such as Gauss-Markov models presuppose that the signal is stationary, so within the 10 ms~30 ms window, stationary-signal processing methods can be applied. (A helper for validating the rate and frame-length combination is sketched below.)
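WebRTC exposes a helper for checking a (sample rate, frame length) combination before processing. A short sketch of its use; 80, 160, and 240 samples correspond to 10, 20, and 30 ms at 8 kHz:

#include <stddef.h>
#include "webrtc_vad.h"

/* Returns 1 if the combination is supported by the VAD.
   Example: 10 ms at 8 kHz -> 8000 * 0.010 = 80 samples. */
int frame_is_valid(int rate_hz, size_t frame_length) {
  return WebRtcVad_ValidRateAndFrameLength(rate_hz, frame_length) == 0;
}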
3. Selecting the feature vector for the Gaussian model
The WebRTC VAD algorithm uses the idea of clustering, with only two classes: one is speech, the other is noise. For each frame the probabilities of speech and of noise are computed, and the frame is assigned by these probabilities; of course, to avoid misjudging a single frame, the algorithm also contains statistical decision logic (a simplified decision sketch follows this paragraph). The question then is: which features should be the input to the Gaussian distributions? This determines the accuracy of the clustering, in other words the VAD's performance. Since VAD must distinguish noise from speech, where do these two kinds of signal differ the most? Features chosen around that difference will naturally discriminate better.
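A hedged sketch of that two-class decision for a single sub-band: evaluate the likelihood under each model and compare the log-ratio with a threshold. The names and the scalar threshold are illustrative; WebRTC's GmmProbability in vad_core.c actually sums weighted log-likelihood ratios over all six bands and applies both per-band (local) and summed (global) thresholds:

#include <math.h>

double gaussian_probability(double x, double mu, double sigma);  /* earlier sketch */

/* Illustrative speech/noise decision for one sub-band energy x. */
static int band_is_speech(double x,
                          double mu_speech, double sigma_speech,
                          double mu_noise, double sigma_noise,
                          double threshold) {
  double log_ratio = log(gaussian_probability(x, mu_speech, sigma_speech))
                   - log(gaussian_probability(x, mu_noise, sigma_noise));
  return log_ratio > threshold;  /* plays the role of a local threshold */
}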
As is well known, signal processing mainly divides into the time domain, the frequency domain, and the spatial domain. In the spatial domain, WebRTC VAD is built on the single-microphone assumption, so noise and speech cannot be separated spatially (in multi-microphone scenarios there are indeed multi-microphone VAD algorithms). In the time domain, speech is a time-varying signal whose short-time rate of change is relatively small, so only features computed in the frequency domain give good discrimination.
[Figures: automotive noise spectrum, pink noise spectrum, white noise spectrum, and speech spectrum]
As the above four figures show, the spectra of noise and speech differ quite a lot, and the differences appear as characteristic patterns of peaks and troughs.
WebRTC is built precisely on this assumption, dividing the spectrum into six sub-bands:
80 Hz~250 Hz, 250 Hz~500 Hz, 500 Hz~1 kHz, 1 kHz~2 kHz, 2 kHz~3 kHz, and 3 kHz~4 kHz, corresponding to feature[0], feature[1], feature[2], ..., feature[5] respectively.
Taking 1 kHz as the dividing line: below it there are three segments with widths of 500 Hz, 250 Hz, and 170 Hz, and above it another three segments, each 1 kHz wide. These bands cover the majority of the energy in a speech signal, and the sub-bands carrying more speech energy are divided more finely, giving greater sensitivity there. (The band edges are written out as data in the sketch below.)
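Written out directly as data, the band edges and their feature indices look like this (a plain declaration for reference; the actual energy computation lives in WebRTC's vad_filterbank.c):

/* The six sub-bands and their feature indices, per the text above. */
typedef struct { int low_hz; int high_hz; } Band;

static const Band kBands[6] = {
  { 80,   250  },  /* feature[0], width  170 Hz */
  { 250,  500  },  /* feature[1], width  250 Hz */
  { 500,  1000 },  /* feature[2], width  500 Hz */
  { 1000, 2000 },  /* feature[3], width 1000 Hz */
  { 2000, 3000 },  /* feature[4], width 1000 Hz */
  { 3000, 4000 },  /* feature[5], width 1000 Hz */
};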
China's AC mains standard is 220 V / 50 Hz; 50 Hz power-line interference gets mixed into the data collected by the microphone, and physical vibration also has an effect, so only the signal above 80 Hz is taken.
In WebRTC these sub-band energies are computed in the vad_filterbank.c file; the DNN-based method mentioned earlier can likewise be built on FBank features.
High-pass filter design

The high-pass filter serves two purposes: 1. filtering out the DC component; 2. emphasizing the higher-frequency components (the human ear is most sensitive around 3.5 kHz).
// High pass filtering, with a cut-off frequency at 80 Hz, if the |data_in| is
// sampled at 500 Hz.
//
// - data_in      [i]   : Input audio data sampled at 500 Hz.
// - data_length  [i]   : Length of input and output data.
// - filter_state [i/o] : State of the filter.
// - data_out     [o]   : Output audio data in the frequency interval
//                        80 - 250 Hz.
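For illustration only, below is a textbook first-order DC-blocking high-pass filter; this is not WebRTC's filter (WebRTC uses a fixed-point pole-zero filter behind the comment above), it just shows the simplest structure that achieves point 1, removing the DC component:

#include <stddef.h>

/* First-order DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1], 0 < r < 1.
   The closer r is to 1, the lower the cut-off frequency. */
static void dc_block(const float* x, float* y, size_t n, float r,
                     float* x_prev, float* y_prev) {
  size_t i;
  for (i = 0; i < n; i++) {
    y[i] = x[i] - *x_prev + r * *y_prev;
    *x_prev = x[i];
    *y_prev = y[i];
  }
}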