Theory
(Figure: mathematical model of an echo canceller)
Echo cancellation essentially builds a mathematical model of the echo path between the output signal and the echo it produces, then trains the model's parameters on the incoming data. How is it trained?
When the far end is talking but the near end is not, the recording should be silent, i.e. the echo should be completely removed. So the algorithm drives toward that target: once the residual echo reaches 0, the filter has converged. Adaptive filter algorithms come in many varieties, but the most popular are still the classic LMS and NLMS, where NLMS is an optimization of LMS.
Evaluation criteria: fast convergence, low computational complexity, good stability, small misadjustment.
The LMS Algorithm
In practice, the statistics of the signal and the noise are often unknown in advance; this is where adaptive filters come in.
Commonly used adaptive filtering techniques: the LMS (least mean squares) adaptive filter, the recursive least squares (RLS) filter, the lattice filter, and the infinite impulse response (IIR) filter. As the name suggests, LMS minimizes the mean square error between the filter output and the desired response, i.e. it follows a gradient.
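To make this concrete, here is a minimal time-domain NLMS step in plain C (an illustrative sketch with hypothetical names, not WebRTC code; the AECM actually runs its NLMS on magnitude spectra in the frequency domain, as described later):

#include <stddef.h>

#define TAPS 128  /* filter length, illustrative */

/* One NLMS step. x holds the last TAPS far-end samples (x[0] newest),
 * d is the current near-end (microphone) sample, w is the adaptive
 * filter. The returned error e = d - y is the echo-cancelled output. */
static float NlmsStep(float* w, const float* x, float d, float mu) {
  float y = 0.0f;
  float energy = 1e-6f;  /* small epsilon avoids division by zero */
  for (size_t i = 0; i < TAPS; i++) {
    y += w[i] * x[i];          /* filter output = estimated echo */
    energy += x[i] * x[i];
  }
  float e = d - y;             /* residual echo */
  float step = mu * e / energy;
  for (size_t i = 0; i < TAPS; i++)
    w[i] += step * x[i];       /* w += mu * e * x / ||x||^2 */
  return e;
}

Plain LMS is the same update without the division by the input energy; that normalization is exactly what makes NLMS an optimization of LMS.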
The AEC algorithm in WebRtc is a partitioned block frequency-domain adaptive filter: Partitioned Block Frequency Domain Adaptive Filter (PBFDAF).
Deciding whether the far end and near end are talking is called double-talk detection; four situations have to be distinguished:
1. Only the far end is talking. There is echo; use this state to update the adaptive filter coefficients and converge as fast as possible.
2. Only the near end is talking. There is no echo; nothing to do.
3. Both ends are talking (Double Talk). Freeze the coefficients; no updates.
4. Neither end is talking; might as well hang up... This is where the near-end VAD is needed.
The far end needs a VAD. When the far end is producing sound, there is echo even if nobody at the near end is speaking, so a VAD alone does not help there; a DTD (double talk detector) has to be used.
A technique that goes hand in hand with silence detection is comfort noise generation, widely used in VoIP and telephony but unnecessary for ASR. By one estimate, voice activity detection combined with comfort noise generation can cut the bandwidth requirement of a set of audio channels by 50%.
Two kinds of DTD algorithms are common today: energy-based ones, such as the Geigel algorithm, whose basic idea is to declare near-end speech when the near-end signal is strong enough; and correlation-based ones, which use a similarity measure such as cosine similarity.
Geigel Double Talk Detector. Talk detection can be done with a threshold for the microphone signal only, but this approach is very sensitive to the threshold level. A more robust approach is to compare the microphone level with the loudspeaker level, which makes the threshold a relative one. Because we deal with echo, it is not sufficient to compare only the current levels; previous levels have to be considered too. The Geigel DTD puts these ideas into one simple formula: the last L levels from the loudspeaker signal (index 0 for now, index L-1 for L samples ago) are compared with the current microphone signal. To avoid problems with phase, absolute values are used. Double talk is declared if:

|d| >= c * max(|x[0]|, |x[1]|, ..., |x[L-1]|)

where |d| is the absolute level of the current microphone signal, c is a threshold value (typically 0.5 for -6 dB or 0.71 for -3 dB), |x[0]| is the absolute level of the current loudspeaker signal, and |x[L-1]| is the absolute level of the loudspeaker signal L samples ago. See references 3, 7, 9.
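Translated directly into code, the Geigel test could look like this (a minimal sketch with hypothetical names and a full-scan maximum; a real implementation would track the running maximum incrementally):

#include <stdint.h>
#include <stdlib.h>  /* abs */

#define GEIGEL_L 1024  /* how far back to look into the loudspeaker signal */

/* Returns 1 if double talk is declared for the current microphone
 * sample. x_hist holds the last GEIGEL_L loudspeaker samples and c is
 * the relative threshold (e.g. 0.5 for -6 dB, 0.71 for -3 dB). */
static int GeigelDtd(int16_t mic, const int16_t* x_hist, float c) {
  int32_t max_x = 0;
  for (int i = 0; i < GEIGEL_L; i++) {
    int32_t a = abs((int32_t)x_hist[i]);
    if (a > max_x) max_x = a;
  }
  return (float)abs((int32_t)mic) >= c * (float)max_x;
}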
The figure above is the functional block diagram. BPF is a Band Pass Filter, which strips excessively high and low frequency components from the far-end signal (similar to noise reduction); DCF is a correlation filter, used to make the NLMS converge quickly; VAD detects whether the far end carries a voice signal; NLP removes the residual echo.

Interface
/*
 * Inserts an 80 or 160 sample block of data into the farend buffer.
 *
 * Inputs              Description
 * -------------------------------------------------------------------
 * void* aecmInst      Pointer to the AECM instance
 * int16_t* farend     In buffer containing one frame of farend signal
 * int16_t nrOfSamples Number of samples in farend buffer
 *
 * Outputs             Description
 * -------------------------------------------------------------------
 * int32_t return      0: OK
 *                     1200-12004,12100: error/warning
 */
int32_t WebRtcAecm_BufferFarend(void* aecmInst,
                                const int16_t* farend,
                                size_t nrOfSamples);

/*
 * Runs the AECM on an 80 or 160 sample block of data.
 *
 * Inputs                  Description
 * -------------------------------------------------------------------
 * void* aecmInst          Pointer to the AECM instance
 * int16_t* nearendNoisy   In buffer containing one frame of
 *                         reference nearend+echo signal. If noise
 *                         reduction is active, provide the noisy
 *                         signal here.
 * int16_t* nearendClean   In buffer containing one frame of
 *                         nearend+echo signal. If noise reduction is
 *                         active, provide the clean signal here.
 *                         Otherwise pass a NULL pointer.
 * int16_t nrOfSamples     Number of samples in nearend buffer
 * int16_t msInSndCardBuf  Delay estimate for sound card and system
 *                         buffers
 *
 * Outputs                 Description
 * -------------------------------------------------------------------
 * int16_t* out            Out buffer, one frame of processed nearend
 * int32_t return          0: OK
 *                         1200-12004,12100: error/warning
 */
int32_t WebRtcAecm_Process(void* aecmInst,
                           const int16_t* nearendNoisy,
                           const int16_t* nearendClean,
                           int16_t* out,
                           size_t nrOfSamples,
                           int16_t msInSndCardBuf);
nearendNoisy is the near-end signal including noise; nearendClean is the near-end signal with the noise removed; out is the AEC-processed output; nrOfSamples can only be 80 or 160, i.e. 10 ms of audio; msInSndCardBuf is the input/output delay, the time from a far-end sample being taken as reference to its echo being processed by the AEC.
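Putting the two calls together, a per-frame processing loop might look like this (a sketch assuming an already created and initialized AECM instance; the header path differs between WebRTC versions):

#include <stdint.h>
#include "echo_control_mobile.h"  /* path varies across WebRTC versions */

#define FRAME 160  /* 10 ms at 16 kHz */

/* One 10 ms iteration: feed the loudspeaker frame, then process the
 * microphone frame that contains its echo. nearendClean is NULL since
 * no separate noise suppressor runs in front of the AECM here. */
void AecmLoopOnce(void* aecm,
                  const int16_t far[FRAME],
                  const int16_t near[FRAME],
                  int16_t out[FRAME],
                  int16_t delay_ms) {
  WebRtcAecm_BufferFarend(aecm, far, FRAME);
  WebRtcAecm_Process(aecm, near, NULL, out, FRAME, delay_ms);
}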
Regarding this delay: when the loudspeaker and the microphone are close together, the acoustic propagation time can be ignored, so the delay is:
// Sets the |delay| in ms between AnalyzeReverseStream() receiving a far-end
// frame and ProcessStream() receiving a near-end frame containing the
// corresponding echo. On the client-side this can be expressed as
// delay = (t_render - t_analyze) + (t_process - t_capture)
// where,
// - t_analyze is the time a frame is passed to AnalyzeReverseStream() and
// t_render is the time the first sample of the same frame is rendered by
// the audio hardware.
// - t_capture is the time the first sample of a frame is captured by the
// audio hardware and t_process is the time the same frame is passed to
// ProcessStream().
Hence the closer the AEC module sits to the hardware, the better (it "should be placed in the signal chain as close to the audio hardware abstraction layer (HAL) as possible"). That way: a lot of software processing is avoided, keeping the delay minimal; since everything runs right at the hardware, the delay barely varies; and the level matches the sound that actually comes out of the speaker.
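A quick worked example of the delay formula, with made-up timestamps:

/* Hypothetical timestamps in milliseconds. */
int t_analyze = 0;    /* far-end frame handed to AnalyzeReverseStream() */
int t_render  = 40;   /* first sample of that frame hits the speaker    */
int t_capture = 40;   /* echo of that sample enters the microphone      */
int t_process = 70;   /* near-end frame handed to ProcessStream()       */

/* delay = (t_render - t_analyze) + (t_process - t_capture) = 40 + 30 */
int16_t msInSndCardBuf = (t_render - t_analyze) + (t_process - t_capture);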
Each call processes 80 samples, referred to as one FRAME; narrowband (nb) corresponds to 1 FRAME, wideband (wb) to 2.
WebRtcAecm_ProcessFrame runs once for every 80 samples.
int WebRtcAecm_ProcessBlock(AecmCore* aecm,
const int16_t* farend,
const int16_t* nearendNoisy,
const int16_t* nearendClean,
int16_t* output) {
It processes the data in blocks of 64 samples, but the output is still delivered in 80-sample frames, so the frames are re-blocked through a small internal buffer, as sketched below.
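A minimal sketch of that 80-in/64-per-block re-blocking (hypothetical bookkeeping, not the actual WebRTC buffers; the output side mirrors this in reverse):

#include <stdint.h>
#include <string.h>

#define FRAME_LEN 80
#define PART_LEN  64

/* Accumulate 80-sample input frames and consume them in 64-sample
 * blocks; at most FRAME_LEN + PART_LEN samples are ever pending. */
typedef struct {
  int16_t in[FRAME_LEN + PART_LEN];
  size_t in_len;
} Reblocker;

static void ReblockFrame(Reblocker* rb, const int16_t* frame) {
  memcpy(rb->in + rb->in_len, frame, FRAME_LEN * sizeof(int16_t));
  rb->in_len += FRAME_LEN;
  while (rb->in_len >= PART_LEN) {
    /* ProcessBlock(rb->in); -- consume one 64-sample block here */
    rb->in_len -= PART_LEN;
    memmove(rb->in, rb->in + PART_LEN, rb->in_len * sizeof(int16_t));
  }
}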
WebRtcAecm_ProcessBlock
TimeToFrequencyDomain converts from the time domain to the frequency domain; the result is 64-odd complex points (strictly PART_LEN1 = 65 bins, DC through Nyquist), each represented by a real and an imaginary part.
aecm->real_fft = WebRtcSpl_CreateRealFFT(PART_LEN_SHIFT); — the order of this FFT is 7, i.e. the length of (PART_LEN * 2) in base 2 (PART_LEN * 2 = 128 = 2^7).
WebRtcSpl_RealForwardFFT is itself computed via WebRtcSpl_ComplexFFT.
far_q = TimeToFrequencyDomain(aecm,
                              aecm->xBuf,  // 64 * 2
                              dfw,         // 64 * 2
                              xfa,         // 64
                              &xfaSum);

static int TimeToFrequencyDomain(AecmCore* aecm,
                                 const int16_t* time_signal,     // 64 * 2
                                 ComplexInt16* freq_signal,      // 64 * 2
                                 uint16_t* freq_signal_abs,      // 64
                                 uint32_t* freq_signal_sum_abs)

int16_t fft_buf[PART_LEN4 + 16];

static void WindowAndFFT(AecmCore* aecm,
                         int16_t* fft,                // 64 * 4
                         const int16_t* time_signal,  // 64 * 2
                         ComplexInt16* freq_signal,   // 64 * 2
                         int time_signal_scaling)

WebRtcSpl_RealForwardFFT(aecm->real_fft,
                         fft,                      // 64 * 4
                         (int16_t*)freq_signal);   // 64 * 2
A window is applied beforehand: a Hanning window, to prevent spectral leakage.
// Approximation for magnitude of complex fft output
// magn = sqrt(real^2 + imag^2)
// magn ~= alpha * max(|imag|,|real|) + beta * min(|imag|,|real|)
//
// The parameters alpha and beta are stored in Q15
This is a cheap way to estimate the magnitude of a complex number, a classic DSP trick:
http://dspguru.com/dsp/tricks/magnitude-estimator
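In floating point the trick reads as follows (a sketch; the alpha/beta values are the minimum-RMS-error pair from the dspguru page, while WebRTC keeps its own pair in Q15 fixed point):

#include <math.h>

/* Alpha-max-plus-beta-min magnitude estimate:
 * |re + j*im| ~= alpha * max(|re|, |im|) + beta * min(|re|, |im|). */
static float MagnitudeEstimate(float re, float im) {
  float a = fabsf(re);
  float b = fabsf(im);
  float mx = a > b ? a : b;
  float mn = a > b ? b : a;
  return 0.947543636291f * mx + 0.392485425092f * mn;
}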
WebRtcAecm_UpdateFarHistory stores the far-end magnitude spectrum into the far-end history.
What is "The Q-domain of current frequency values"? In WebRTC's fixed-point code, a value in Q-domain q is the true value scaled by 2^q. Before the FFT, the maximum absolute value of the time-domain block is taken and the block is left-shifted until it fills the 16-bit range; the number of shifts used is reported as the Q-domain (far_q), so later stages know how the spectrum is scaled.
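A sketch of that normalization (WebRtcSpl_MaxAbsValueW16 and WebRtcSpl_NormW16 are real SPL helpers; the surrounding lines are illustrative and skip the windowing):

/* Find the block's headroom, shift everything up to use the full
 * 16-bit range before the fixed-point FFT, and report the shift
 * count as the Q-domain of the resulting spectrum. */
int16_t max_abs = WebRtcSpl_MaxAbsValueW16(time_signal, PART_LEN2);
int16_t time_signal_scaling = WebRtcSpl_NormW16(max_abs);
for (int i = 0; i < PART_LEN2; i++) {
  fft[i] = (int16_t)(time_signal[i] << time_signal_scaling);
}
/* The far_q handed back to the caller is derived from time_signal_scaling. */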
if (WebRtc_AddFarSpectrumFix(aecm->delay_estimator_farend,
xfa,
PART_LEN1,
far_q) == -1)
Next the fixed delay is computed, following a patent: LOW COMPLEX AND ROBUST DELAY ESTIMATION, a low-complexity and robust delay estimation algorithm (impressive stuff), http://patents.justia.com/patent/20130163698. It works with probabilities.
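As far as I can tell, the core idea is to reduce each spectrum to a one-bit-per-band signature and match signatures across candidate delays. A heavily simplified, hypothetical sketch (__builtin_popcount is the GCC/Clang bit count):

#include <stdint.h>

#define BANDS 32       /* one bit per frequency band */
#define MAX_DELAY 100  /* candidate delays, in blocks */

/* Threshold each band against a slow running mean to get a one-bit-
 * per-band signature of the spectrum. */
static uint32_t BinarySpectrum(const uint16_t* spectrum, int32_t* mean) {
  uint32_t bits = 0;
  for (int k = 0; k < BANDS; k++) {
    if ((int32_t)spectrum[k] > mean[k]) bits |= (1u << k);
    mean[k] += ((int32_t)spectrum[k] - mean[k]) / 16;
  }
  return bits;
}

/* Match the near-end signature against the far-end history: the delay
 * whose XOR differs in the fewest bits is the best candidate. The real
 * estimator adds probability/robustness logic on top of this. */
static int BestDelay(uint32_t near_bits, const uint32_t* far_history) {
  int best = 0, best_cost = BANDS + 1;
  for (int d = 0; d < MAX_DELAY; d++) {
    int cost = __builtin_popcount(near_bits ^ far_history[d]);
    if (cost < best_cost) { best_cost = cost; best = d; }
  }
  return best;
}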
Once the delay has been estimated, the far-end and near-end waveforms are aligned.
// Returns a pointer to the far end spectrum aligned to current near end
// spectrum. The function WebRtc_DelayEstimatorProcessFix(…) should have been
// called before AlignedFarend(…). Otherwise, you get the pointer to the
// previous frame. The memory is only valid until the next call of
// WebRtc_DelayEstimatorProcessFix(…).
//
// Inputs:
// - self : Pointer to the AECM instance.
// - delay : Current delay estimate.
//
// Output:
// - far_q : The Q-domain of the aligned far end spectrum
//
// Return value:
// - far_spectrum : Pointer to the aligned far end spectrum
// NULL - Error
//
const uint16_t* WebRtcAecm_AlignedFarend(AecmCore* self, int* far_q, int delay);
Compute the energies of the near end and the far end; this is actually done for the sake of the VAD.
// WebRtcAecm_CalcEnergies(…)
//
// This function calculates the log of energies for nearend, farend and estimated
// echoes. There is also an update of energy decision levels, i.e. internal VAD.
//
//
// @param aecm [i/o] Handle of the AECM instance.
// @param far_spectrum [in] Pointer to farend spectrum.
// @param far_q [in] Q-domain of farend spectrum.
// @param nearEner [in] Near end energy for current block in
// Q(aecm->dfaQDomain).
// @param echoEst [out] Estimated echo in Q(xfa_q+RESOLUTION_CHANNEL16).
//
void WebRtcAecm_CalcEnergies(AecmCore* aecm,
const uint16_t* far_spectrum,
const int16_t far_q,
const uint32_t nearEner,
int32_t* echoEst) {
Estimate the far-end VAD: aecm->currentVADValue = 1 indicates far-end activity (as the check below shows, 0 means the far end is silent).
if (!aecm->currentVADValue)
// Far end energy level too low, no channel update
As for the Step Size, that is part of the LMS algorithm.
// WebRtcAecm_CalcStepSize(…)
//
// This function calculates the step size used in channel estimation
//
//
// @param aecm [in] Handle of the AECM instance.
// @param mu [out] (Return value) Stepsize in log2(), i.e. number of shifts.
//
//
int16_t WebRtcAecm_CalcStepSize(AecmCore* const aecm) {
Updating the channel is the NLMS part of the algorithm.
// WebRtcAecm_UpdateChannel(…)
//
// This function performs channel estimation. NLMS and decision on channel storage.
//
//
// @param aecm [i/o] Handle of the AECM instance.
// @param far_spectrum [in] Absolute value of the farend signal in Q(far_q)
// @param far_q [in] Q-domain of the farend signal
// @param dfa [in] Absolute value of the nearend signal (Q[aecm->dfaQDomain])
// @param mu [in] NLMS step size.
// @param echoEst [i/o] Estimated echo in Q(far_q+RESOLUTION_CHANNEL16).
//
void WebRtcAecm_UpdateChannel(AecmCore* aecm,
const uint16_t* far_spectrum,
const int16_t far_q,
const uint16_t* const dfa,
const int16_t mu,
int32_t* echoEst) {
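Stripped of the fixed-point bookkeeping, the per-bin, magnitude-domain NLMS that this function performs might be sketched as follows (an illustrative floating-point sketch with hypothetical normalization; the real code works in the Q-domains listed above and also decides between the adaptive and stored channels):

/* Per-bin NLMS on magnitude spectra: h[k] maps the far-end magnitude
 * to an estimated echo magnitude. mu plays the role of the step size
 * from WebRtcAecm_CalcStepSize(). */
static void UpdateChannelSketch(float* h,
                                const uint16_t* far_spectrum,
                                const uint16_t* dfa,
                                float mu,
                                int len) {
  for (int k = 0; k < len; k++) {
    float x = (float)far_spectrum[k];
    if (x <= 0.0f) continue;              /* no far-end energy, no update */
    float err = (float)dfa[k] - h[k] * x; /* error in the magnitude domain */
    h[k] += mu * err / x;                 /* normalized (NLMS-style) step */
  }
}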
WebRtcAecm_StoreAdaptiveChannelNeon
// This is C code of following optimized code.
// During startup we store the channel every block.
// memcpy(aecm->channelStored,
// aecm->channelAdapt16,
// sizeof(int16_t) * PART_LEN1);
// Recalculate echo estimate
// for (i = 0; i < PART_LEN; i += 4) {
// echo_est[i] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i],
// far_spectrum[i]);
// echo_est[i + 1] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i + 1],
// far_spectrum[i + 1]);
// echo_est[i + 2] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i + 2],
// far_spectrum[i + 2]);
// echo_est[i + 3] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i + 3],
// far_spectrum[i + 3]);
// }
// echo_est[i] = WEBRTC_SPL_MUL_16_U16(aecm->channelStored[i],
// far_spectrum[i]);
// We have enough data.
// Calculate MSE of “Adapt” and “Stored” versions.
// It is actually not MSE, but average absolute error.
Whichever version has the smaller MSE is the one that gets stored: the adaptive one or the old one.
Then the Wiener filter gain is computed.
// Determine suppression gain used in the Wiener filter. The gain is based on a mix of far
// end energy and echo estimation error.
// CalcSuppressionGain(…)
//
// This function calculates the suppression gain that is used in the Wiener filter.
//
//
// @param aecm [i/n] Handle of the AECM instance.
// @param supGain [out] (Return value) Suppression gain with which to scale the noise
// level (Q14).
//
//
int16_t WebRtcAecm_CalcSuppressionGain(AecmCore* const aecm) {
The DTD decision can be made in here: judge whether double talk is occurring by comparing the estimated echo signal with the echo actually present in the input.
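For intuition, a Wiener-style per-bin suppression gain could be sketched like this (illustrative only; the actual fixed-point routine works on energies in their Q-domains and, as the comment above says, mixes in the far-end energy):

/* Keep the fraction of the mic spectrum that is not explained by the
 * echo estimate: gain = max(0, (|Y| - |E_est|) / |Y|), in [0, 1]. */
static float SuppressionGainSketch(float mic_mag, float echo_est_mag) {
  if (mic_mag <= 0.0f) return 0.0f;
  float gain = (mic_mag - echo_est_mag) / mic_mag;
  return gain < 0.0f ? 0.0f : gain;
}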
After that come the Wiener filtering and the Hanning window, plus comfort noise generation, which I have not worked out yet.

Drawbacks
There is no good DTD. As a result, echo removal is very clean when there is no double talk, but during double talk the near-end talker gets cancelled as well.
WebRtc does not plan to fix this; see the Google mailing list:
Andrew MacDonald
9/29/11
Just to set the record straight here, no, we don’t have any explicit
double-talk detection. It’s handled implicitly by limiting the
magnitude of the error used in adaptation.
Additionally, we disregard the filter output if its energy is higher
than the input, since this indicates the filter has likely diverged.
braveyao@webrtc.org, Dec 3 2013
Status: WontFix
We once states AECM offers decent double-talk feature which is not equivalent to AEC but better than nothing, giving the light complexity of AECM. But people used to have higher expectation then. So it’s more safer to say NO double-talk feature in AECM.
And from another thread, we are working on other methods to replace AECM, instead of improving it further. So I would mark this issue to WontFix too.
BTW: @boykinjim, recently I found out that currently AECM is limited to 8k&16k codec only. So try not to use Opus on Android phone so far.