Theory
(Figure: mathematical model of the echo canceller)
Echo cancellation essentially builds a mathematical model of the echo path between the far-end (reference) signal and the echo it produces at the microphone, and trains the model's parameters from the incoming data. The training target: when only the far end is speaking and the near end is silent, the recording should be silent, i.e. the echo is completely removed. The algorithm adapts in that direction, and once the residual echo reaches zero the filter has converged. There are many adaptive filter algorithms, but the most popular are the classic LMS and NLMS; NLMS is an optimization of LMS.
Judging criteria: fast convergence, low computational complexity, good stability, small misadjustment (steady-state error).
LMS algorithm
In practice the statistical characteristics of the signal and the noise are not known in advance, so adaptive filters are used.
Commonly used adaptive filtering techniques: the LMS (least mean squares) adaptive filter, the recursive least squares (RLS) filter, the lattice filter, and the infinite impulse response (IIR) filter. As the name implies, LMS minimizes the mean squared error between the filter output and the desired response, which amounts to following the (negative) gradient.
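To make the idea concrete, here is a minimal time-domain NLMS sketch (illustrative only, not the WebRTC code; filter_len, mu and eps are made-up names/parameters):

#include <stddef.h>

/* Minimal time-domain NLMS sketch (illustrative only, not the WebRTC code).
 * x: the last filter_len far-end (reference) samples, newest first;
 * w: adaptive filter taps; d: current near-end (microphone) sample.
 * Returns the error e = d - y, which is both the echo-cancelled output
 * and the signal that drives the weight update. */
static float nlms_step(float* w, const float* x, float d,
                       size_t filter_len, float mu, float eps) {
  float y = 0.0f;      /* filter output = estimated echo */
  float power = 0.0f;  /* ||x||^2, used for normalization (the N in NLMS) */
  for (size_t i = 0; i < filter_len; ++i) {
    y += w[i] * x[i];
    power += x[i] * x[i];
  }
  float e = d - y;                  /* what remains after echo removal */
  float step = mu / (power + eps);  /* plain LMS would use a fixed step here */
  for (size_t i = 0; i < filter_len; ++i) {
    w[i] += step * e * x[i];        /* gradient-descent weight update */
  }
  return e;
}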
The AEC algorithm in WebRTC belongs to the class of partitioned-block frequency-domain adaptive filters: Partitioned Block Frequency Domain Adaptive Filter (PBFDAF).
To determine whether the far end and the near end are talking, also known as double-talk detection, the following four states need to be monitored:
1. Only the far end is speaking. There is an echo; use this state to adapt (update) the adaptive filter coefficients so that it converges as quickly as possible.
2. Only the near end is speaking. There is no echo; nothing needs to be done.
3. Both ends are talking (double talk). Freeze the coefficients; do not update them.
4. Neither end is speaking. You might as well hang up the phone... This requires near-end VAD.
A VAD is needed at the far end. On the near side, whenever there is sound there may also be an echo, so a VAD alone is useless there; it has to be combined with a DTD (double-talk detector).
The technique that goes hand in hand with silence detection is comfort noise generation (CNG), which is widely used in VoIP and telephony but not in ASR. It is estimated that using voice activity detection together with comfort noise generation can cut the bandwidth requirement of an audio channel by about 50%.
There are two commonly used DTD approaches: energy-based, such as the Geigel algorithm, whose basic principle is to check whether the near-end signal is strong enough to conclude that someone is speaking; and correlation-based, which uses correlation measures such as cosine similarity.
Geigel double-talk detector: double-talk detection could be done with a threshold on the microphone signal only, but that approach is very sensitive to the threshold level. A more robust approach is to compare the microphone level with the loudspeaker level, so that the threshold becomes a relative one. Because we are dealing with echo, it is not sufficient to compare only the current levels; previous levels have to be considered too. The Geigel DTD puts these ideas into one simple formula: the last L levels of the loudspeaker signal (index 0 for now, index L-1 for L samples ago) are compared with the current microphone level. To avoid problems with phase, absolute values are used. Double talk is declared if:

|d| >= c * max(|x[0]|, |x[1]|, ..., |x[L-1]|)

where |d| is the absolute level of the current microphone signal, c is a threshold (typical values: 0.5 for -6 dB, 0.71 for -3 dB), |x[0]| is the absolute level of the current loudspeaker signal, and |x[L-1]| is the absolute level of the loudspeaker signal L samples ago.
See references 3, 7, 9.
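A minimal sketch of the Geigel detector described above (illustrative only; L and the threshold c are the values discussed in the text, and the caller is assumed to maintain the loudspeaker history buffer):

#include <stdlib.h>

/* Geigel DTD sketch: declare double talk when the current microphone level
 * exceeds c times the maximum of the last L loudspeaker levels.
 * x holds the loudspeaker history with x[0] being the newest sample. */
static int geigel_dtd(short mic_sample, const short* x, size_t L, float c) {
  int max_far = 0;
  for (size_t i = 0; i < L; ++i) {
    int level = abs((int)x[i]);
    if (level > max_far) max_far = level;
  }
  return abs((int)mic_sample) >= (int)(c * (float)max_far);  /* 1 = double talk */
}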
The diagram above is a functional block diagram. BPF is a band-pass filter, used to filter out the high- and low-frequency components of the far-end signal (similar to noise reduction). DCF is a correlation filter used to make the NLMS converge faster. VAD monitors whether there is voice on the far end, and the NLP removes the residual echo.

Interface
/*
 * Inserts an 80 or 160 sample block of data into the farend buffer.
 *
 * Inputs                       Description
 * -------------------------------------------------------------------
 * void*    aecmInst            Pointer to the AECM instance
 * int16_t* farend              In buffer containing one frame of
 *                              farend signal
 * int16_t  nrOfSamples         Number of samples in farend buffer
 *
 * Outputs                      Description
 * -------------------------------------------------------------------
 * int32_t  return              0: OK
 *                              1200-12004, 12100: error/warning
 */
int32_t WebRtcAecm_BufferFarend(void* aecmInst,
                                const int16_t* farend,
                                size_t nrOfSamples);
/*
 * Runs the AECM on an 80 or 160 sample blocks of data.
 *
 * Inputs                       Description
 * -------------------------------------------------------------------
 * void*    aecmInst            Pointer to the AECM instance
 * int16_t* nearendNoisy        In buffer containing one frame of
 *                              reference nearend + echo signal. If
 *                              noise reduction is active, provide
 *                              the noisy signal here.
 * int16_t* nearendClean        In buffer containing one frame of
 *                              nearend + echo signal. If noise
 *                              reduction is active, provide the
 *                              clean signal here. Otherwise pass a
 *                              NULL pointer.
 * int16_t  nrOfSamples         Number of samples in nearend buffer
 * int16_t  msInSndCardBuf      Delay estimate for sound card and
 *                              system buffers
 *
 * Outputs                      Description
 * -------------------------------------------------------------------
 * int16_t* out                 Out buffer, one frame of processed nearend
 * int32_t  return              0: OK
 *                              1200-12004, 12100: error/warning
 */
int32_t WebRtcAecm_Process(void* aecmInst,
                           const int16_t* nearendNoisy,
                           const int16_t* nearendClean,
                           int16_t* out,
                           size_t nrOfSamples,
                           int16_t msInSndCardBuf);
nearendNoisy is the near-end signal with noise, nearendClean is the near-end signal after noise reduction, out is the AEC-processed output, nrOfSamples can only be 80 or 160, i.e. 10 ms of audio data, and msInSndCardBuf is the input/output delay, i.e. the time difference between the far-end signal being taken as reference and the AEC processing it.
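A rough usage sketch of the two calls above, assuming 16 kHz audio (160 samples per 10 ms) and a fixed delay estimate; creation/initialization of the AECM instance and error handling are not shown, and the include path may differ between WebRTC versions:

#include <stdint.h>
#include <stddef.h>
#include "echo_control_mobile.h"  /* AECM public API; the include path may differ */

/* Process one 10 ms frame at 16 kHz (160 samples). The AECM instance must
 * already be created and initialized (those calls are not shown here). */
static void aecm_process_frame(void* aecm,
                               const int16_t* farend,   /* loudspeaker reference */
                               const int16_t* nearend,  /* microphone: speech + echo */
                               int16_t* out,            /* echo-suppressed output */
                               int16_t delay_ms) {      /* sound-card/buffer delay estimate */
  const size_t kSamples = 160;
  /* Feed the reference signal first, then run the canceller on the capture. */
  (void)WebRtcAecm_BufferFarend(aecm, farend, kSamples);
  /* nearendClean is NULL because no separately denoised signal is available. */
  (void)WebRtcAecm_Process(aecm, nearend, NULL, out, kSamples, delay_ms);
}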
About this time difference (msInSndCardBuf): when the speaker and the microphone are close together, the sound propagation time can be ignored, so the delay is:
Sets the |delay| in ms between AnalyzeReverseStream() receiving a far-end
frame and ProcessStream() receiving a near-end frame containing the
corresponding echo. On the client side this can be expressed as

delay = (t_render - t_analyze) + (t_process - t_capture)

where
- t_analyze is the time a frame is passed to AnalyzeReverseStream() and
  t_render is the time the first sample of the same frame is rendered by
  the audio hardware.
- t_capture is the time the first sample of a frame is captured by the
  audio hardware and t_process is the time the same frame is passed to
  ProcessStream().
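A made-up numeric example of that formula (all timestamp values are invented, in ms):

/* Illustrative only: all timestamp values (in ms) are made up. */
static int example_delay_ms(void) {
  int t_analyze = 1000;  /* far-end frame handed to AnalyzeReverseStream() */
  int t_render  = 1030;  /* first sample of that frame reaches the loudspeaker */
  int t_capture = 1040;  /* first sample of the echo frame captured by the mic */
  int t_process = 1060;  /* that captured frame handed to ProcessStream() */
  /* (1030 - 1000) + (1060 - 1040) = 30 + 20 = 50 ms */
  return (t_render - t_analyze) + (t_process - t_capture);
}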
So the closer the AEC module is to the hardware the better (it should be placed in the signal chain as close to the audio hardware abstraction layer (HAL) as possible). This avoids a lot of software processing, keeps the delay as small and as constant as possible (since everything runs right next to the hardware), and the signal level matches what actually comes out of the speaker.
One processing frame is 80 samples; NB (narrowband) corresponds to 1 frame and WB (wideband) to 2 frames.
WebRtcAecm_ProcessFrame is called once per 80 samples.
int WebRtcAecm_ProcessBlock(AecmCore* aecm,
                            const int16_t* farend,
                            const int16_t* nearendNoisy,
                            const int16_t* nearendClean,
                            int16_t* output) {

It handles one block of 64 samples at a time, but the output is still produced as 80-sample frames (a rough re-blocking sketch follows).
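A rough sketch of how 80-sample frames can be re-blocked into 64-sample blocks (the real code uses internal ring buffers inside the AECM core; the names here are illustrative):

#include <stdint.h>
#include <string.h>

#define FRAME_LEN 80  /* one NB frame */
#define PART_LEN  64  /* one processing block */

/* Illustrative re-blocking: queue incoming 80-sample frames and hand
 * 64-sample blocks to the block processor whenever enough data is buffered. */
typedef struct {
  int16_t buf[FRAME_LEN + PART_LEN];
  size_t filled;
} Reblocker;

static void reblock_frame(Reblocker* rb, const int16_t* frame,
                          void (*process_block)(const int16_t* block)) {
  memcpy(rb->buf + rb->filled, frame, FRAME_LEN * sizeof(int16_t));
  rb->filled += FRAME_LEN;
  while (rb->filled >= PART_LEN) {
    process_block(rb->buf);  /* one 64-sample block */
    rb->filled -= PART_LEN;
    memmove(rb->buf, rb->buf + PART_LEN, rb->filled * sizeof(int16_t));
  }
}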
Inside WebRtcAecm_ProcessBlock:
TimeToFrequencyDomain converts from the time domain to the frequency domain; what comes out is 64 complex points, expressed as real and imaginary parts.
aecm->real_fft = WebRtcSpl_CreateRealFFT(PART_LEN_SHIFT); the order of this FFT is 7, i.e. log2 of (PART_LEN * 2) = log2(128).
WebRtcSpl_RealForwardFFT is also computed via WebRtcSpl_ComplexFFT.
far_q = TimeToFrequencyDomain(aecm,
                              aecm->xBuf,      // 64 * 2
                              dfw,             // 64 * 2
                              xfa,             // 64
                              &xfaSum);

static int TimeToFrequencyDomain(AecmCore* aecm,
                                 const int16_t* time_signal,      // 64 * 2
                                 ComplexInt16* freq_signal,       // 64 * 2
                                 uint16_t* freq_signal_abs,       // 64
                                 uint32_t* freq_signal_sum_abs)

int16_t fft_buf[PART_LEN4 + 16];
static void WindowAndFFT(AecmCore* aecm,
                         int16_t* fft,                   // 64 * 4
                         const int16_t* time_signal,     // 64 * 2
                         ComplexInt16* freq_signal,      // 64 * 2
                         int time_signal_scaling)

WebRtcSpl_RealForwardFFT(aecm->real_fft,
                         fft,                        // 64 * 4
                         (int16_t*)freq_signal);     // 64 * 2
A window is applied before the FFT, a Hanning window, to prevent spectral leakage.
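Roughly what that windowing step does, in floating point for clarity (the actual code applies a fixed-point window table; this is only a sketch):

#include <math.h>

#define PART_LEN 64

/* Floating-point sketch of the windowing step. The analysis buffer is the
 * previous 64-sample block followed by the current one, and a sine
 * (square-root Hanning) window tapers it to reduce spectral leakage.
 * The real code applies a fixed-point window table instead. */
static void apply_window(const float* time_signal /* 2 * PART_LEN */,
                         float* windowed /* 2 * PART_LEN */) {
  const float kPi = 3.14159265f;
  for (int n = 0; n < 2 * PART_LEN; ++n) {
    float w = sinf(kPi * (float)n / (2.0f * PART_LEN));  /* sine window */
    windowed[n] = time_signal[n] * w;
  }
}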
// Approximation for magnitude of complex FFT output
//   magn = sqrt(real^2 + imag^2)
//   magn ~= alpha * max(|imag|, |real|) + beta * min(|imag|, |real|)
// The parameters alpha and beta are stored in Q15
This is a simple way of estimating the magnitude of a complex number, a classic DSP trick:
http://dspguru.com/dsp/tricks/magnitude-estimator
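A floating-point sketch of that alpha-max-plus-beta-min trick (the coefficients below are a commonly quoted pair, not necessarily the Q15 values WebRTC stores):

#include <math.h>

/* Alpha-max-plus-beta-min magnitude estimate: avoids sqrt() by combining
 * the larger and smaller of |real|, |imag| with fixed coefficients. */
static float magnitude_estimate(float re, float im) {
  const float alpha = 0.947543636291f;  /* commonly quoted minimum-RMS-error pair */
  const float beta  = 0.392485425092f;
  float a = fabsf(re);
  float b = fabsf(im);
  float mx = a > b ? a : b;
  float mn = a > b ? b : a;
  return alpha * mx + beta * mn;  /* ~= sqrtf(re * re + im * im) */
}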
WebRtcAecm_UpdateFarHistory stores the magnitude spectrum of the far-end signal.
What is the Q-domain of the current frequency values? It seems to be derived from the maximum absolute value of the time-domain signal, and then... I don't know.
if (WebRtc_AddFarSpectrumFix(aecm->delay_estimator_farend,
                             xfa,
                             PART_LEN1,
                             far_q) == -1)
This computes a fixed delay estimate, based on a patent, "low complex and robust delay estimation" (a low-complexity, robust delay estimation algorithm, quite impressive): http://patents.justia.com/patent/20130163698. It is computed probabilistically.
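The rough idea, greatly simplified compared to the real estimator (the bin count, history length and thresholding rule below are assumptions for illustration): turn each magnitude spectrum into a binary fingerprint, keep a history of far-end fingerprints, and pick the lag whose fingerprint differs from the near-end one in the fewest bits.

#include <stdint.h>

#define NUM_BINS    32  /* bins used for the fingerprint (assumption) */
#define HISTORY_LEN 64  /* past far-end fingerprints kept (assumption) */

/* Build a binary fingerprint: bit i is 1 if bin i is above its running mean. */
static uint32_t binary_spectrum(const uint16_t* mag, const uint16_t* mean) {
  uint32_t bits = 0;
  for (int i = 0; i < NUM_BINS; ++i) {
    if (mag[i] > mean[i]) bits |= (1u << i);
  }
  return bits;
}

/* Pick the delay (in blocks) whose stored far-end fingerprint best matches
 * the current near-end fingerprint, i.e. has the smallest Hamming distance. */
static int estimate_delay(const uint32_t* far_history, uint32_t near_bits) {
  int best_delay = 0;
  int best_cost = NUM_BINS + 1;
  for (int d = 0; d < HISTORY_LEN; ++d) {
    /* __builtin_popcount is a GCC/Clang builtin counting the set bits */
    int cost = __builtin_popcount(far_history[d] ^ near_bits);
    if (cost < best_cost) {
      best_cost = cost;
      best_delay = d;
    }
  }
  return best_delay;
}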
After estimating the delay, the far-end and near-end waveforms are aligned.
// Returns a pointer to the far end spectrum aligned to the current near end
// spectrum. The function WebRtc_DelayEstimatorProcessFix(...) should have been
// called before AlignedFarend(...). Otherwise, you get the pointer to the
// previous frame. The memory is only valid until the next call of
// WebRtc_DelayEstimatorProcessFix(...).
//
// Inputs:
//      - self          : Pointer to the AECM instance.
//      - delay         : Current delay estimate.
//
// Output:
//      - far_q         : The Q-domain of the aligned far end spectrum
//
// Return value:
//      - far_spectrum  : Pointer to the aligned far end spectrum
//                        NULL - Error
//
const uint16_t* WebRtcAecm_AlignedFarend(...)
Then it calculates the energies of the near end, the far end, and the estimated echo; this is actually done for the VAD.
// WebRtcAecm_CalcEnergies(...)
//
// This function calculates the log of energies for nearend, farend and
// estimated echoes. There is also an update of energy decision levels,
// i.e. internal VAD.
//
// @param  aecm          [i/o]  Handle of the AECM instance.
// @param  far_spectrum  [in]   Pointer to farend spectrum.
// @param  far_q         [in]   Q-domain of farend spectrum.
// @param  nearEner      [in]   Near end energy for current block in
//                              Q(aecm->dfaQDomain).
// @param  echoEst       [out]  Estimated echo in Q(xfa_q+RESOLUTION_CHANNEL16).
//
void WebRtcAecm_CalcEnergies(AecmCore* aecm,
                             const uint16_t* far_spectrum,
                             const int16_t far_q,
                             const uint32_t nearEner,
                             int32_t* echoEst) {
The far-end VAD estimate: aecm->currentVADValue = 1 indicates that the far end does have voice activity.
if (!aecm->currentVADValue)
// Far end energy level too low, no channel update
As for the step size, that is part of the LMS algorithm:
WebRtcAecm_CalcStepSize(...)
//
// This function calculates the step size used in channel estimation.
//
// @param  aecm  [in]   Handle of the AECM instance.
// @param  mu    [out]  (Return value) Stepsize in log2(), i.e. number of shifts.
//
int16_t WebRtcAecm_CalcStepSize(AecmCore* const aecm) {
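Since mu is returned as a number of shifts (log2 of the step), the fixed-point update can apply it with a shift instead of a multiply. A rough illustration of that idea (not the actual WebRTC arithmetic; Q-domain handling is omitted):

#include <stdint.h>

/* Illustrative fixed-point LMS-style tap update where the step size mu is
 * given as log2 of the step, i.e. a shift count (larger mu = smaller step).
 * Not the actual WebRTC arithmetic. */
static void update_tap(int32_t* tap, int32_t error, uint16_t far_bin, int16_t mu) {
  int64_t grad = (int64_t)error * far_bin;  /* gradient-like term for this bin */
  *tap += (int32_t)(grad >> mu);            /* step = 2^-mu applied as a shift */
}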
Updating the channel is the NLMS part of the algorithm:
WebRtcAecm_UpdateChannel(...)
//
// This function performs channel estimation: NLMS and decision on channel storage.
//
// @param  aecm          [i/o]  Handle of the AECM instance.
// @param  far_spectrum  [in]   Absolute value of the farend signal in Q(far_q)
// @param  far_q         [in]   Q-domain of the farend signal
// @param  dfa           [in]   Absolute value of the nearend signal (Q[aecm->dfaQDomain])
// @param  mu            [in]   NLMS step size.
// @param  echoEst       [i/o]  Estimated echo in Q(far_q+RESOLUTION_CHANNEL16).
//
void WebRtcAecm_UpdateChannel(AecmCore* aecm,
                              const uint16_t* far_spectrum,
                              const int16_t far_q,
                              const uint16_t* const dfa,
                              const int16_t mu,
                              int32_t* echoEst) {
WebRtcAecm_StoreAdaptiveChannelNeon - per the source comment, the snippet below is the C code corresponding to the optimized (NEON) version.
// During startup we store the channel every block.
memcpy(aecm->channelStored,
       aecm->channelAdapt16,
       sizeof(int16_t) * PART_LEN1);
// Recalculate echo estimate
for (i = 0; i < PART_LEN; i += 4) {
  echo_est[i] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i],
                                      far_spectrum[i]);
  echo_est[i + 1] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i + 1],
                                          far_spectrum[i + 1]);
  echo_est[i + 2] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i + 2],
                                          far_spectrum[i + 2]);
  echo_est[i + 3] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i + 3],
                                          far_spectrum[i + 3]);
}
echo_est[i] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i],
                                    far_spectrum[i]);
Once we have enough data, calculate the MSE of the "adapt" and "stored" versions (it is actually not MSE but the average absolute error), and based on whose error is smaller, decide which one to store: the adaptive one or the old one.
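Roughly what that decision looks like (a hedged sketch only; the real code also uses thresholds, counters and Q-domain alignment before switching):

#include <stdint.h>

#define PART_LEN1 65  /* number of frequency bins (PART_LEN + 1) */

/* Hedged sketch: compare how well the adaptive and stored channels predict
 * the near-end magnitude spectrum (average absolute error, called "MSE" in
 * the source) and keep whichever does better. Q-domain alignment, thresholds
 * and divergence handling from the real code are omitted. */
static void select_channel(int16_t* stored, const int16_t* adapt,
                           const uint16_t* far_spectrum, const uint16_t* near_mag) {
  int64_t err_stored = 0;
  int64_t err_adapt = 0;
  for (int i = 0; i < PART_LEN1; ++i) {
    int64_t diff_s = (int64_t)stored[i] * far_spectrum[i] - near_mag[i];
    int64_t diff_a = (int64_t)adapt[i] * far_spectrum[i] - near_mag[i];
    err_stored += diff_s < 0 ? -diff_s : diff_s;
    err_adapt  += diff_a < 0 ? -diff_a : diff_a;
  }
  if (err_adapt < err_stored) {
    for (int i = 0; i < PART_LEN1; ++i) {
      stored[i] = adapt[i];  /* the adaptive channel predicts the echo better */
    }
  }
  /* else: keep the stored channel (the real code may also reset the adaptive
   * channel back to the stored one when it has diverged). */
}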
Then calculate the Wiener filter gain: determine the suppression gain used in the Wiener filter. The gain is based on a mix of far-end energy and echo estimation error.
// CalcSuppressionGain(...)
//
// This function calculates the suppression gain that is used in the Wiener filter.
//
// @param  aecm     [i/n]  Handle of the AECM instance.
// @param  supGain  [out]  (Return value) Suppression gain with which to scale
//                         the noise level (Q14).
//
int16_t WebRtcAecm_CalcSuppressionGain(AecmCore* const aecm) {
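To make the Wiener step concrete, here is a per-bin suppression sketch in floating point (illustrative only; the real AECM works in fixed point and derives the gain from far-end energy and the echo estimation error as described above):

/* Illustrative per-bin suppression: attenuate each frequency bin by how much
 * of its energy is believed to be echo. near_mag and echo_est are magnitude
 * spectra; the gain is clamped to [0, 1]. */
static void apply_suppression(const float* near_mag, const float* echo_est,
                              float* out_mag, int num_bins) {
  for (int i = 0; i < num_bins; ++i) {
    float gain = 1.0f;
    if (near_mag[i] > 0.0f) {
      gain = 1.0f - echo_est[i] / near_mag[i];  /* Wiener-like magnitude gain */
      if (gain < 0.0f) gain = 0.0f;
      if (gain > 1.0f) gain = 1.0f;
    }
    out_mag[i] = near_mag[i] * gain;
  }
}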
Inside this, a DTD-like judgment can be made: based on the estimated echo signal and the actual input signal, decide whether double talk is present or not.
Then come the Wiener filter and the Hanning window, as well as comfort noise generation, which I have not fully understood. Disadvantages:
There is no good DTD. As a result, without double talk the echo is removed cleanly, but during double talk the near-end speech gets cancelled as well.
WebRTC is not planning to fix this; see Google's mailing list:
Andrew MacDonald
9/29/11
Just to set the record straight here, no, we don't have any explicit double-talk detection. It's handled implicitly by limiting the magnitude of the error used in adaptation. Additionally, we disregard the filter output if it is higher than the input, since this indicates the filter has likely diverged.
braveyao@webrtc.org, Dec 3 2013
Status: WontFix
We once stated that AECM offers a decent double-talk feature which is not equivalent to AEC but better than nothing, given the light complexity of AECM. But people tended to have higher expectations. So it is safer to say there is NO double-talk feature in AECM.
And from another thread, we are working on other methods to replace AECM, instead of improving it further. So I would mark this issue as WontFix too.
BTW: @boykinjim, recently I found out that currently AECM is limited to 8k & 16k codecs only. So try not to use Opus on Android phones for now.