Speech enhancement usually assumes that the noise is additive, random, and stationary, and that the speech is short-time stationary. Everything described below relies on these two assumptions. The whole enhancement process can be broadly divided into two parts:
First, noise estimation
Second, calculation of the attenuation factor (also called the gain factor in some places)
Finally, applying the attenuation factor to the noisy speech gives the "clean speech" we expect. The hardest part of speech enhancement is noise estimation rather than the determination of the attenuation factor, so this article concentrates on the key points of noise estimation. To estimate the noise, we first need to understand the characteristics of noisy speech and then build the estimator on those characteristics. So what characteristics does noisy speech have?
(1) Noise does not affect the speech spectrum uniformly: some spectral regions are affected strongly and others only slightly. It is therefore natural to estimate the noise per frequency band. Whenever the signal-to-noise ratio or the speech presence probability in a particular band is low, the noise spectrum of that band can be updated independently of the other bands. This is the starting point of the time-recursive averaging noise estimation algorithms.
(2) Even during speech activity, the noisy speech power in a single band frequently decays to the noise power level. A rough estimate of the noise level in a band can therefore be obtained by tracking the minimum of the noisy speech power in that band. This is the starting point of the minimum-tracking noise estimation algorithms.
The noise estimation algorithm used by Speex combines these two characteristics. We first discuss separately the two families of noise estimation algorithms built on them: minimum tracking and time-recursive averaging. We start with minimum tracking, which has three main variants: minimum statistics, minimum search, and continuous spectral minimum tracking.
The minimum statistics algorithm estimates the noise by finding the minimum at each frequency bin over the past D frames and applying a corresponding bias compensation factor. For details, see the paper "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics"; it is not elaborated here.
Minimum search estimates the noise by scanning the past D frames for the minimum at each frequency bin; it is also known as minimum lookup.
For continuous spectral minimum tracking, see the article on the implementation and improvement of a frequency-domain speech denoising algorithm; it is not described in detail here either.
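As an illustration of the minimum search idea, here is a minimal Python sketch. It assumes the noisy power spectrum is already available as a NumPy array with one row per frame and one column per frequency bin; the function name `minimum_search` and the array layout are choices made for this example, not something taken from Speex.

```python
import numpy as np

def minimum_search(power_frames, D):
    """Per-bin minimum of the noisy power over the last D frames.

    power_frames: array of shape (num_frames, num_bins) holding |Y(lambda, k)|^2.
    Returns an array of the same shape; row lambda holds the minimum over
    frames max(0, lambda - D + 1) .. lambda for every bin k.
    """
    power_frames = np.asarray(power_frames, dtype=float)
    num_frames = power_frames.shape[0]
    minima = np.empty_like(power_frames)
    for lam in range(num_frames):
        start = max(0, lam - D + 1)  # the window is shorter for the first frames
        minima[lam] = power_frames[start:lam + 1].min(axis=0)
    return minima
```

The returned minimum is only a rough noise level. Because the minimum of a fluctuating power sequence lies below its mean, the minimum statistics method additionally applies a bias compensation factor, which this sketch omits.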
Next comes the time-recursive averaging noise estimation algorithm. It estimates the noise using the following general form:
\[\hat \sigma_d^2(\lambda, k) = \alpha(\lambda, k)\,\hat \sigma_d^2(\lambda-1, k) + [1-\alpha(\lambda, k)]\,|Y(\lambda, k)|^2\]
Here \(\lambda\) is the frame index, \(k\) is the frequency bin index, \(Y\) is the spectrum of the noisy speech, \(\hat \sigma_d^2\) is the estimated noise power spectrum, and \(\alpha\) is the smoothing factor. Algorithms of this kind first need a time- and frequency-dependent smoothing factor, after which the noise can be estimated. The smoothing factor can be derived from the signal-to-noise ratio, or it can be a fixed value. More commonly, however, it is computed from the probability that speech is present or absent at frequency bin \(k\); the relationship between the smoothing factor and the speech presence probability is derived later.
We begin with recursive averaging based on the signal-to-noise ratio. The main idea is that when the signal-to-noise ratio is high (speech is more likely to be present), the smoothing factor tends to 1, so the noise estimate of the previous frame is largely carried over as the estimate for the current frame. When the signal-to-noise ratio is low, the smoothing factor tends to 0, so the power of the current frame is used as the noise estimate as far as possible. The main work in this method is establishing a function, possibly piecewise, that maps the signal-to-noise ratio to the smoothing factor.
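The following Python sketch shows one possible shape of such an SNR-driven update. The sigmoid mapping, its `midpoint_db` and `slope` parameters, and the use of the ratio between the current noisy power and the previous noise estimate as the SNR are all assumptions made for illustration; the text above only requires that the smoothing factor approach 1 at high SNR and 0 at low SNR.

```python
import numpy as np

def snr_smoothing_factor(snr_db, midpoint_db=5.0, slope=0.5):
    """Hypothetical sigmoid mapping from SNR (dB) to a smoothing factor in (0, 1).

    High SNR -> factor near 1 (keep the previous noise estimate).
    Low SNR  -> factor near 0 (follow the current frame power).
    """
    snr_db = np.asarray(snr_db, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint_db)))

def recursive_average_update(noise_prev, noisy_power, alpha):
    """General time-recursive averaging step for one frame."""
    return alpha * noise_prev + (1.0 - alpha) * noisy_power

def snr_based_noise_update(noise_prev, noisy_power, eps=1e-12):
    """One frame of SNR-driven recursive averaging, applied per frequency bin."""
    # Assumed SNR measure: current noisy power over the previous noise estimate.
    snr_db = 10.0 * np.log10((noisy_power + eps) / (noise_prev + eps))
    alpha = snr_smoothing_factor(snr_db)
    return recursive_average_update(noise_prev, noisy_power, alpha)
```

With these illustrative settings, a bin whose power jumps far above the previous noise estimate barely changes that estimate, while a bin whose power stays near the noise level keeps tracking the current frame.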
Next, we focus on recursive averaging based on the speech presence probability. Let us first see how introducing conditional probabilities changes the way the noise is computed. We first express the noise power spectral density as
\[\sigma_d^2(\lambda, k) = E\{|D(\lambda, k)|^2\}\]
Then, the optimal noise power spectral density estimate in the minimum mean square error sense can be expressed as
\[\hat \sigma_d^2(\lambda, k) = E[\sigma_d^2(\lambda, k) \mid Y(\lambda, k)] = E[\sigma_d^2(\lambda, k) \mid H_0^k]\,P(H_0^k \mid Y(\lambda, k)) + E[\sigma_d^2(\lambda, k) \mid H_1^k]\,P(H_1^k \mid Y(\lambda, k))\]
In other words, once the probabilities are introduced, the noise power spectral density estimate is a weighted sum of the noise power spectral density conditioned on speech absence and the noise power spectral density conditioned on speech presence. The weights are the conditional probabilities, given the noisy speech spectrum at frequency bin \(k\), that speech is absent and that speech is present.
The two conditional probabilities in the above equation are easily calculated using Bayes' theorem, as shown below:
\[\begin{array}{l}
R \buildrel \Delta \over = P(H_1^k)/P(H_0^k) \\
\Lambda(\lambda, k) \buildrel \Delta \over = \frac{p(Y(\lambda, k) \mid H_1^k)}{p(Y(\lambda, k) \mid H_0^k)} \\
P(H_0^k \mid Y(\lambda, k)) = \frac{p(Y(\lambda, k) \mid H_0^k)\,P(H_0^k)}{p(Y(\lambda, k) \mid H_0^k)\,P(H_0^k) + p(Y(\lambda, k) \mid H_1^k)\,P(H_1^k)} = \frac{1}{1 + R\Lambda(\lambda, k)} \\
P(H_1^k \mid Y(\lambda, k)) = \frac{R\Lambda(\lambda, k)}{1 + R\Lambda(\lambda, k)}
\end{array}\]
Of the four formulas above, the first defines R, the ratio of the prior probability of speech presence to the prior probability of speech absence, and the second defines the likelihood ratio. The third and fourth give the conditional probability that speech is absent at frequency bin \(k\) and the conditional probability that speech is present at frequency bin \(k\), respectively. Substituting these two conditional probabilities into the formula for the noise power spectral density gives a new estimate:
\[\hat \sigma_d^2(\lambda, k) = \frac{1}{1 + R\Lambda(\lambda, k)}\,E[\sigma_d^2(\lambda, k) \mid H_0^k] + \frac{R\Lambda(\lambda, k)}{1 + R\Lambda(\lambda, k)}\,E[\sigma_d^2(\lambda, k) \mid H_1^k]\]
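The two posterior probabilities that serve as weights here depend only on the product of the prior ratio and the likelihood ratio. A minimal Python sketch, with function and argument names chosen for this example, makes that explicit:

```python
def speech_posteriors(prior_ratio, likelihood_ratio):
    """Return (P(H0 | Y), P(H1 | Y)) for one frequency bin.

    prior_ratio:      P(H1) / P(H0), the ratio of the speech-present prior
                      to the speech-absent prior.
    likelihood_ratio: p(Y | H1) / p(Y | H0).
    """
    weighted = prior_ratio * likelihood_ratio
    p_absent = 1.0 / (1.0 + weighted)
    p_present = weighted / (1.0 + weighted)
    return p_absent, p_present
```

For example, with equal priors (ratio 1) and a likelihood ratio of 3, the call `speech_posteriors(1.0, 3.0)` gives a speech-absent probability of 0.25 and a speech-present probability of 0.75. The two values always sum to 1.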
When speech is absent at frequency bin \(k\), the conditional mean of the noise power spectrum can be approximated by the short-time power spectrum of the current frame at that bin. When speech is present at frequency bin \(k\), it can be approximated by the noise estimate of the previous frame, as follows:
\[\begin{array}{l}
E[\sigma_d^2(\lambda, k) \mid H_0^k] \rightarrow |Y(\lambda, k)|^2 \\
E[\sigma_d^2(\lambda, k) \mid H_1^k] \rightarrow \hat \sigma_d^2(\lambda-1, k)
\end{array}\]
In this way, the estimated noise becomes the following form
\[\hat \sigma_d^2(\lambda, k) = \frac{R\Lambda(\lambda, k)}{1 + R\Lambda(\lambda, k)}\,\hat \sigma_d^2(\lambda-1, k) + \frac{1}{1 + R\Lambda(\lambda, k)}\,|Y(\lambda, k)|^2\]
Comparing this recursion with the general form of time-recursive averaging shows that the time- and frequency-dependent smoothing factor \(\alpha\) is exactly the conditional probability that speech is present at frequency bin \(k\). That is,
\[\begin{array}{*{20}{c}}
{\alpha(\lambda, k) = \frac{R\Lambda(\lambda, k)}{1 + R\Lambda(\lambda, k)}} & {1-\alpha(\lambda, k) = \frac{1}{1 + R\Lambda(\lambda, k)}} \\
\end{array}\]
In other words, the smoothing factor is a function of the likelihood ratio and is tied to the speech presence probability. The higher the conditional probability that speech is present at frequency bin \(k\), the more the estimator relies on the noise estimate of the previous frame, that is, the more it tends to stop updating the noise estimate. Conversely, the lower that probability, the more the estimator uses the current power at frequency bin \(k\) to update the noise estimate.
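A short Python sketch of this update, under the assumption that the prior ratio and the per-bin likelihood ratio have already been computed elsewhere (how to obtain them is outside this sketch, and the names are illustrative):

```python
import numpy as np

def probability_weighted_update(noise_prev, noisy_power, prior_ratio, likelihood_ratio):
    """One frame of recursive averaging where the smoothing factor equals
    the speech presence probability at each frequency bin."""
    weighted = prior_ratio * np.asarray(likelihood_ratio, dtype=float)
    alpha = weighted / (1.0 + weighted)  # P(H1 | Y): near 1 when speech is likely
    return alpha * noise_prev + (1.0 - alpha) * noisy_power
```

When the likelihood ratio at a bin is very large, the result stays at the previous estimate; when it is close to zero, the result follows the current frame power.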
We have now introduced two kinds of noise estimation methods: minimum tracking and time-recursive averaging. Can the two be combined to make the noise estimate more accurate? Indeed they can, and the result is the minima-controlled recursive averaging (MCRA) algorithm. Let us look at the idea and starting point of this approach. The noise update rule is: when speech is absent, the noise estimate is updated, and when speech is present, the noise estimate of the previous frame is kept as the estimate for the current frame, as shown below:
\[\begin{array}{l}
H_0^k: \hat \sigma_d^2(\lambda, k) = \alpha\,\hat \sigma_d^2(\lambda-1, k) + (1-\alpha)\,|Y(\lambda, k)|^2 \\
H_1^k: \hat \sigma_d^2(\lambda, k) = \hat \sigma_d^2(\lambda-1, k)
\end{array}\]
Thus, the mean square estimate of the noise power spectral density can be expressed as follows:
\[\begin{array}{l}
\hat \sigma_d^2(\lambda, k) = E[\sigma_d^2(\lambda, k) \mid Y(\lambda, k)] \\
= E[\sigma_d^2(\lambda, k) \mid H_0^k]\,P(H_0^k \mid Y(\lambda, k)) + E[\sigma_d^2(\lambda, k) \mid H_1^k]\,P(H_1^k \mid Y(\lambda, k)) \\
= [\alpha\,\hat \sigma_d^2(\lambda-1, k) + (1-\alpha)\,|Y(\lambda, k)|^2]\,P(H_0^k \mid Y(\lambda, k)) + \hat \sigma_d^2(\lambda-1, k)\,P(H_1^k \mid Y(\lambda, k)) \\
= [\alpha\,\hat \sigma_d^2(\lambda-1, k) + (1-\alpha)\,|Y(\lambda, k)|^2]\,(1-p(\lambda, k)) + \hat \sigma_d^2(\lambda-1, k)\,p(\lambda, k) \\
= [\alpha\,\hat \sigma_d^2(\lambda-1, k)\,(1-p(\lambda, k)) + \hat \sigma_d^2(\lambda-1, k)\,p(\lambda, k)] + (1-\alpha)\,|Y(\lambda, k)|^2\,(1-p(\lambda, k)) \\
= [\alpha(1-p(\lambda, k)) + p(\lambda, k)]\,\hat \sigma_d^2(\lambda-1, k) + [(1-\alpha)(1-p(\lambda, k))]\,|Y(\lambda, k)|^2 \\
= [\alpha - \alpha p(\lambda, k) + p(\lambda, k)]\,\hat \sigma_d^2(\lambda-1, k) + [1 - p(\lambda, k) - \alpha + \alpha p(\lambda, k)]\,|Y(\lambda, k)|^2 \\
= [\alpha + (1-\alpha)p(\lambda, k)]\,\hat \sigma_d^2(\lambda-1, k) + [1 - \alpha - (1-\alpha)p(\lambda, k)]\,|Y(\lambda, k)|^2 \\
= [\alpha + (1-\alpha)p(\lambda, k)]\,\hat \sigma_d^2(\lambda-1, k) + [1 - (\alpha + (1-\alpha)p(\lambda, k))]\,|Y(\lambda, k)|^2
\end{array}\]
Here,
\[p(\lambda, k) = P(H_1^k \mid Y(\lambda, k))\]
denotes the probability that speech is present. Finally, the mean square estimate of the noise power spectral density simplifies to:
\[\begin{array}{*{20}{c}}
{\hat \sigma_d^2(\lambda, k) = \alpha_d(\lambda, k)\,\hat \sigma_d^2(\lambda-1, k) + [1-\alpha_d(\lambda, k)]\,|Y(\lambda, k)|^2} & {\alpha_d(\lambda, k) \buildrel \Delta \over = \alpha + (1-\alpha)\,p(\lambda, k)} \\
\end{array}\]
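As a quick numerical check of this smoothing factor, take a fixed factor of 0.95 (a value assumed here purely for illustration) and three values of the speech presence probability:
\[\begin{array}{l}
p(\lambda, k) = 0: \quad \alpha_d = 0.95 + 0.05 \times 0 = 0.95 \\
p(\lambda, k) = 0.5: \quad \alpha_d = 0.95 + 0.05 \times 0.5 = 0.975 \\
p(\lambda, k) = 1: \quad \alpha_d = 0.95 + 0.05 \times 1 = 1
\end{array}\]
So the smoothing factor always lies between the fixed factor and 1. With speech certainly absent, the update is ordinary recursive averaging; with speech certainly present, the previous noise estimate is kept unchanged.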
From the derivation process above, we can see that the main flow of the MCRA algorithm is:
(1) Obtain the minimum of the noisy speech power using minimum tracking; this serves as a preliminary noise estimate.
(2) Use this minimum to calculate the speech presence probability p.
(3) Calculate the smoothing factor for the noise estimate from the formula above.
(4) Estimate the noise using recursive averaging.
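The four steps can be sketched in Python as follows. This is a simplified illustration, not the Speex implementation. It assumes that speech presence is decided by comparing the smoothed noisy power with its tracked minimum against a threshold, and that the resulting 0/1 decision is smoothed over time to give p. The parameter names and values (`D`, `alpha`, `alpha_s`, `alpha_p`, `delta`) are assumptions chosen for this example.

```python
import numpy as np

def mcra_noise_estimate(power_frames, D=50, alpha=0.95, alpha_s=0.8,
                        alpha_p=0.2, delta=5.0):
    """Simplified MCRA-style noise tracking.

    power_frames: array (num_frames, num_bins) of noisy power |Y(lambda, k)|^2.
    Returns the noise power estimate for every frame and bin.
    """
    power_frames = np.asarray(power_frames, dtype=float)
    num_frames, num_bins = power_frames.shape
    smoothed = power_frames[0].copy()   # time-smoothed noisy power
    noise = power_frames[0].copy()      # noise estimate, initialised from frame 0
    p = np.zeros(num_bins)              # speech presence probability per bin
    history = [smoothed.copy()]         # last D smoothed frames for the minimum
    noise_track = np.empty_like(power_frames)
    noise_track[0] = noise
    for lam in range(1, num_frames):
        smoothed = alpha_s * smoothed + (1.0 - alpha_s) * power_frames[lam]
        history.append(smoothed.copy())
        if len(history) > D:
            history.pop(0)
        # (1) minimum tracking: preliminary noise level per bin
        s_min = np.min(history, axis=0)
        # (2) speech presence probability from the ratio to the minimum
        speech_present = (smoothed / np.maximum(s_min, 1e-12)) > delta
        p = alpha_p * p + (1.0 - alpha_p) * speech_present
        # (3) time- and frequency-dependent smoothing factor
        alpha_d = alpha + (1.0 - alpha) * p
        # (4) recursive averaging of the noise estimate
        noise = alpha_d * noise + (1.0 - alpha_d) * power_frames[lam]
        noise_track[lam] = noise
    return noise_track
```

In this sketch, a short burst of high power in one bin pushes p toward 1 for that bin, so its noise estimate is nearly frozen during the burst, while bins that contain only noise keep being averaged normally.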
Speex's noise estimation is based on this idea; the specific implementation details are not covered here.