Preface:
Voice communication is one of the most common, natural, and basic ways for humans to share information. The carrier of that information, the speech signal, is time-varying and non-stationary; it can be treated as stationary only over very short intervals (usually 10 to 30 ms). During speech production, processing, and transmission, the signal is inevitably corrupted by environmental noise, which greatly degrades the performance of speech processing systems such as speech coders and speech recognizers. To improve speech quality and intelligibility, various speech enhancement methods have been developed that suppress background noise by exploiting the characteristics of speech and noise. Speech denoising is nonetheless a very complicated problem: it must account for the characteristics of speech itself, the variability of the noise, the perceptual properties of the human ear, and how the brain processes the signal. For this reason, speech enhancement remains a long-standing research topic in speech signal processing.
Although the problem of speech denoising is still far from solved, researchers have proposed many methods for different noises and applications over the past 40-plus years. Popular speech enhancement methods include Wiener filtering, Kalman filtering, spectral subtraction, and adaptive filtering. The Wiener filter is the optimal minimum mean square error estimator under stationary conditions, but it is poorly suited to non-stationary signals such as speech. The Kalman filter removes the stationarity requirement of the Wiener filter and remains optimal in the minimum mean square error sense under non-stationary conditions, but it is only applicable to unvoiced speech. Spectral subtraction is a common method, but at low SNR it damages the intelligibility and naturalness of speech and produces musical noise in the reconstructed speech. Adaptive filtering is one of the most effective speech enhancement methods, but it needs a reference noise source that is difficult to obtain in real environments, so it performs poorly in practice and, like spectral subtraction, is accompanied by musical noise. In addition, all of these methods need to know some characteristics or statistics of the noise; without prior knowledge of the noise, it is difficult to extract the speech from the noisy signal.
The wavelet transform is a time-frequency localized analysis method that has developed rapidly over the past ten years or so. It overcomes the fixed resolution of the short-time Fourier transform and can decompose a signal at multiple scales and resolutions. The wavelet coefficients at each scale represent the signal at a different resolution. The wavelet transform also closely resembles the auditory characteristics of the human ear, which lets researchers exploit those characteristics. This makes it a powerful tool for analyzing non-stationary signals such as speech, and many researchers have applied it to speech in recent years. The principle of wavelet denoising is that the energy of the speech signal is concentrated in the low-frequency bands, while the noise energy is concentrated mainly in the high-frequency bands. The wavelet coefficients at scales dominated by noise can therefore be set to zero or given a very small weight, and the signal is then reconstructed from the processed coefficients. As wavelet theory has developed, wavelet denoising methods have multiplied and produced good results. In 1992 Mallat proposed wavelet transform modulus maxima denoising, and in 1995 Donoho proposed nonlinear wavelet threshold denoising, which brought wavelet denoising into wide use and attracted many researchers.
1. Wavelet Decomposition
In speech enhancement, the purpose of decomposing the signal is to concentrate its energy in a small number of coefficients in certain frequency bands, so that noise can be suppressed effectively. Researchers generally use orthogonal wavelets, because the orthogonal wavelet transform decorrelates the original signal as much as possible and concentrates its energy in a few sparse, relatively large wavelet coefficients. Ordinary wavelet decomposition, however, splits only the low-frequency component at each level; the high-frequency components are not decomposed further. This cannot satisfy applications that need good time resolution and good frequency resolution together, so researchers turned to orthogonal wavelet packet decomposition of speech, which also makes it easier to exploit the auditory masking effect of the human ear. One study uses the wavelet packet algorithm, which offers flexible time-frequency analysis and better matches the analysis characteristics of the basilar membrane: following the conversion between the Bark scale and the frequency scale, a fixed wavelet packet decomposition divides the 0 to 4000 Hz band into 52 bands corresponding to 18 Bark bands. In the single-channel condition, the enhanced speech is clearer and more intelligible than with traditional spectral subtraction. Other studies use a 5-level wavelet packet decomposition to obtain 17 bands corresponding to the Bark scale, or a 6-level decomposition to obtain 24 critical bands. These designs aim to make full use of the auditory characteristics of the ear: speech enhancement does not need to suppress the noise completely, only to keep the residual noise below what can be perceived, which reduces unnecessary speech distortion during denoising.
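The difference between the two decompositions described above can be sketched in a few lines. This is a minimal illustration only, not the method of any study mentioned here: it assumes the Haar wavelet for simplicity (the text does not name a wavelet), a signal length divisible by 2 raised to the number of levels, and hypothetical function names.

```python
import math

def haar_step(x):
    """One level of the orthonormal Haar transform: returns (approx, detail).
    Assumes len(x) is even."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def wavelet_decompose(x, levels):
    """Ordinary wavelet decomposition: only the low-frequency
    (approximation) branch is split again at each level."""
    bands = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands  # [d1, d2, ..., dL, aL]

def wavelet_packet_decompose(x, levels):
    """Wavelet packet decomposition: both the low- and high-frequency
    branches are split at every level."""
    nodes = [list(x)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            approx, detail = haar_step(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    return nodes  # 2**levels bands of equal length, in filter-bank order
```

For a 16-sample frame and 3 levels, `wavelet_decompose` returns bands of lengths 8, 4, 2, and 2, while `wavelet_packet_decompose` returns 8 bands of 2 samples each. The packet tree gives the uniform, finer split of the high frequencies that the Bark-scale groupings start from.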
Researchers usually fix the number of decomposition levels in wavelet packet decomposition, generally at 5 or more. Many experiments show that the number of levels strongly affects the noise reduction achieved. With too many levels, important local features of the signal are lost, the SNR drops, and both computation and delay grow. With too few levels, the modulus maxima associated with the noise are not attenuated enough, so the noise reduction is unsatisfactory and the SNR improvement is limited. A fixed number of levels therefore limits denoising performance to a large extent. To address this, one study proposes a novel method that selects the number of decomposition levels adaptively; it effectively improves the performance of the wavelet threshold denoising algorithm, but at the cost of additional delay and computation.
Because the first-generation wavelet transform introduces delay, is relatively complex, and requires a lot of memory, one study uses an adaptive lifting wavelet for speech enhancement. The experimental results show that this method reduces the complexity of the algorithm while removing the noise and keeping the speech intelligible.
Wavelet theory shows that orthogonal wavelet decomposition cannot guarantee linear phase in the intermediate stages, which is a disadvantage for speech processing, whereas biorthogonal wavelet decomposition keeps the phase of the intermediate stages undistorted. One study therefore uses biorthogonal wavelet packets for speech denoising and reports good results. In addition, because wavelet decomposition is ultimately implemented with filter banks, it inevitably introduces delay, which limits where wavelet methods can be applied. Designing low-delay filters for wavelet decomposition is therefore necessary and would lay a solid foundation for further applications of wavelet theory.
In short, wavelet decomposition for speech enhancement has progressed from the initial wavelet decomposition to wavelet packet decomposition, to wavelet packet decomposition matched to the auditory characteristics of the ear, to adaptive selection of the number of decomposition levels, to improved decomposition schemes, to biorthogonal wavelet packet decomposition that guarantees linear phase, and to low-delay filter design. All of these prepare the ground for the subsequent processing and for better denoising performance.
2. Modulus Maxima Denoising Method
The principle of modulus maxima denoising is that the modulus maxima of the speech signal grow or stay the same as the scale increases, while the modulus maxima of the noise shrink as the scale increases. Based on this property, the modulus maxima caused by noise are removed, those caused by speech are kept, and the speech is reconstructed from the retained maxima. The specific steps are: (1) apply a discrete dyadic wavelet transform to the noisy speech, typically with 4 or 5 decomposition scales; (2) find the modulus maxima of the wavelet coefficients at each scale and, at the largest scale, discard the maxima whose magnitude is below a threshold; (3) search for the propagation points across scales, keeping the modulus maxima produced by speech and removing those produced by noise; (4) reconstruct the denoised speech from the modulus maxima retained at each scale.
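The detection and thresholding part of these steps can be sketched as follows. This is a minimal illustration under stated assumptions: `w` is taken to be the list of wavelet coefficients at one scale, a modulus maximum is defined as a point whose magnitude is at least that of both neighbors and strictly greater than one of them, and the function names are hypothetical. The cross-scale propagation search and the reconstruction from maxima are not covered.

```python
def modulus_maxima(w):
    """Return the indices k where |w[k]| is a local maximum of the modulus."""
    m = [abs(v) for v in w]
    indices = []
    for k in range(1, len(m) - 1):
        not_smaller = m[k] >= m[k - 1] and m[k] >= m[k + 1]
        strictly_larger = m[k] > m[k - 1] or m[k] > m[k + 1]
        if not_smaller and strictly_larger:
            indices.append(k)
    return indices

def prune_maxima(w, threshold):
    """Keep only the modulus maxima whose magnitude reaches the threshold,
    as would be done at the largest scale."""
    return [k for k in modulus_maxima(w) if abs(w[k]) >= threshold]
```

With coefficients `[0.0, 0.2, -0.1, 3.0, 0.5, -0.3, 0.1, 0.0]`, the modulus maxima sit at indices 1 and 3, and a threshold of 1.0 keeps only index 3.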
Although denoising based on wavelet transform modulus maxima has a sound theoretical foundation, many factors affect its numerical accuracy in practice, and the denoising results are often unsatisfactory. One study interpolates the frequency response of the wavelet transform to rebuild the modulus maxima at the low scales, and then reconstructs the signal with an iterative projection operator based on the contraction mapping principle, but the improvement is limited. The method also leaves several practical problems open, such as how to choose a suitable decomposition scale and, since reconstruction uses only a finite set of modulus maxima and therefore differs from the original signal, how to build wavelet coefficients that approximate those of the original signal. These problems restrict further application of the method, and little has been published on it.
3. Correlation Denoising Method
The principle of the correlation denoising method is that the wavelet coefficients of the speech signal are strongly correlated across scales, whereas the wavelet coefficients of the noise show no obvious correlation across scales. The main steps are: (1) compute the correlation of the wavelet coefficients at the same position in adjacent scales, $Cw_{j,k} = w_{j,k}\, w_{j+1,k}$, where $j$ is the scale, $k$ is the position, $w_{j,k}$ is the $k$-th wavelet coefficient at scale $j$, and $w_{j+1,k}$ is the $k$-th wavelet coefficient at scale $j+1$; (2) compare the correlation with the magnitude of the wavelet coefficient: if the correlation is the larger of the two, the point is taken to be signal and the coefficient is kept; otherwise it is taken to be noise and the coefficient is set to zero; (3) reconstruct the denoised signal from the processed coefficients.
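The comparison step can be sketched as below. This is a minimal illustration, not a definitive implementation: it assumes the coefficients at the two scales are aligned and of equal length (as in an undecimated transform), and it rescales the correlation so that its energy matches that of the coefficients before comparing, which is an added assumption the text does not spell out. The function name is hypothetical.

```python
import math

def correlation_filter(w_j, w_j1):
    """Keep w_j[k] where the rescaled correlation w_j[k] * w_j1[k]
    is at least as large in magnitude as w_j[k]; zero it otherwise."""
    corr = [a * b for a, b in zip(w_j, w_j1)]
    energy_w = sum(a * a for a in w_j)
    energy_c = sum(c * c for c in corr)
    if energy_c == 0.0:
        return [0.0] * len(w_j)
    # Assumption: normalize so corr and w_j have equal energy,
    # which makes the two quantities comparable.
    scale = math.sqrt(energy_w / energy_c)
    return [a if abs(c) * scale >= abs(a) else 0.0
            for a, c in zip(w_j, corr)]
```

A coefficient that is large at both scales produces a large product and survives, while small coefficients whose neighbors at the next scale are also small are zeroed.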
In this method, the correlation calculation sharpens the edge features of the signal and makes the signal features easier to extract. Good denoising results have been reported with it.
However, if the wavelet decomposition introduces a deviation, the computed correlation no longer reflects the true correlation at point k, and the performance of correlation-based denoising falls. One study proposes a regional correlation denoising method that handles this problem better. It considers not only the wavelet coefficient at point k but also the coefficients near k, which weakens the effect of coefficient deviations. However, the method computes a correlation coefficient at every point, so it is relatively expensive, and it has not attracted extensive research.
4. Threshold Denoising Method
The threshold denoising method was first introduced by Donoho in 1995. His nonlinear wavelet threshold denoising led to wavelet denoising being studied in depth and widely used.
The theoretical basis of the threshold denoising method is that the wavelet coefficients of the noise and those of the useful signal differ in amplitude: in the low-frequency bands the speech coefficients are larger than the noise coefficients, while in the high-frequency bands the reverse holds. A suitable threshold applied to the wavelet coefficients of each level can therefore separate the signal from the noise. The specific steps of the algorithm are: (1) apply the wavelet transform to the noisy signal; (2) apply nonlinear thresholding to the wavelet coefficients; (3) reconstruct the denoised signal from the processed coefficients.
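The three steps above can be sketched end to end. This is a minimal illustration under stated assumptions, not the exact procedure of any study mentioned here: it uses the Haar wavelet, a signal length divisible by 2 raised to the number of levels, and the widely used universal threshold sigma * sqrt(2 ln N) with sigma estimated from the median absolute value of the finest-scale coefficients. The function names are hypothetical.

```python
import math
from statistics import median

def haar_forward(x):
    """One Haar analysis step; assumes len(x) is even."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """One Haar synthesis step, the exact inverse of haar_forward."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x

def hard_threshold(w, t):
    """Zero coefficients with magnitude at most t; keep the rest unchanged."""
    return [v if abs(v) > t else 0.0 for v in w]

def soft_threshold(w, t):
    """Zero small coefficients and shrink the rest toward zero by t."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in w]

def universal_threshold(finest_detail, n):
    """sigma * sqrt(2 ln n), with sigma from the median absolute coefficient."""
    sigma = median(abs(v) for v in finest_detail) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n))

def denoise(x, levels, shrink=soft_threshold):
    """Step 1: transform; step 2: threshold details; step 3: reconstruct."""
    details = []
    approx = list(x)
    for _ in range(levels):
        approx, d = haar_forward(approx)
        details.append(d)
    t = universal_threshold(details[0], len(x))
    details = [shrink(d, t) for d in details]
    for d in reversed(details):
        approx = haar_inverse(approx, d)
    return approx
```

Passing `shrink=hard_threshold` switches the thresholding rule without changing the rest of the pipeline, which keeps the rule and the threshold estimate as two separate choices.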
Wavelet threshold denoising has two key problems: (1) how the threshold is applied, and (2) how the threshold value is estimated. Both directly affect denoising performance.
5. Mixed Denoising Method
Although pure wavelet denoising methods can give good results, speech intelligibility is still not very high at low SNR or in colored noise. To exploit the advantages of the wavelet transform and remove noise more effectively, the current research trend is to combine wavelet methods with other methods. To eliminate musical noise, one study proposes a low-variance spectral estimation method based on wavelet thresholding; the results show that multi-band spectral estimation combined with wavelet thresholding suppresses musical noise and improves speech quality better than spectral subtraction. However, this method is more effective for colored noise than for Gaussian white noise. Another approach works in the wavelet domain: the low-scale wavelet coefficients are adaptively filtered, and the high-scale coefficients are processed by spectral subtraction or Wiener filtering.
Experiments show that this method combines the advantages of wavelet denoising, adaptive filtering, and spectral subtraction: it damages the speech less than threshold denoising and reduces musical noise, but it adds computational complexity and delay. To prevent the high-frequency content of unvoiced speech from being removed as noise, one study first makes a voiced/unvoiced decision from the energy of the wavelet coefficients. If the frame is unvoiced, only the low-frequency components at the smallest scale are denoised so that the unvoiced speech is retained; otherwise all scales are denoised. This suppresses the noise while keeping as much unvoiced information as possible and reduces the distortion of the enhanced speech, but in colored noise the noise is not removed cleanly. Another study proposes a speech enhancement method with the following steps: the noisy speech is first decomposed into critical bands by the wavelet transform; a series of components is then extracted by a feedforward subsystem, whose threshold is guided by the average normalized time-frequency energy, to suppress stationary noise; an improved wavelet threshold is used to suppress non-stationary and colored noise; and finally the speech is enhanced with a fixed soft threshold. Combined with artificial neural network denoising, this method has been used successfully in speech recognition, but its delay is large and it cannot be used for real-time processing. Another study combines the advantages of the Kalman filter with the ability of wavelet decomposition to mimic the human ear; working in the wavelet domain, it suppresses non-stationary and colored noise with very little speech distortion.
Another study computes the masking threshold and the optimal weighting coefficients by spectral subtraction in the wavelet domain and uses parametric noise estimation to enhance unvoiced speech; it performs well on many kinds of noise, but the algorithm is rather complicated. The methods above can remove stationary, non-stationary, white, and colored noise to some extent, but at very low SNR the denoising is poor and some musical noise remains. To address this, one study decomposes the speech with the bionic wavelet transform. The advantage of the bionic wavelet transform is that its time-frequency resolution adapts not only to the frequency of the signal but also to the instantaneous amplitude of the signal and its first-order derivative. Experiments show that this method preserves the original clean speech better. Applying the methods above on top of the bionic wavelet transform may give still better results.
Reference: A review of speech enhancement algorithms based on wavelet transform
Thank you.