Speech Endpoint Detection (1): Double-Threshold Method (Simple Teaching Version), MATLAB

Source: Internet
Author: User

Why do we need speech endpoint detection, also known as silence detection or mute detection?

The following is an excerpt from Baidu.

Voice activity detection (VAD), also called speech endpoint detection or voice boundary detection, detects the presence of speech in a noisy environment. It is commonly used in speech coding, speech enhancement, and other speech processing systems to reduce the speech coding rate, save communication bandwidth, reduce the energy consumption of mobile devices, and improve the recognition rate. An early representative VAD method is ITU-T G.729 Annex B.

Seems very important.

How do you do that?

The most direct approach is to look at the energy.

If the energy of a frame is greater than a threshold, the frame is considered part of a speech segment. Of course, isolated bursts of noise also have to be excluded.
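As a rough sketch of this idea: a frame only counts as the start of speech if the next m frames also stay above the threshold, so an isolated burst is skipped. The frame energies, the threshold thr, and the look-ahead length m are made-up values for illustration, not measurements from a real recording.

```matlab
% Hypothetical per-frame energies in dB (illustrative values only).
Er  = [-40 -41 -12 -40 -39 -14 -10 -8 -9 -11 -38 -40];
thr = -25;          % assumed energy threshold in dB
m   = 3;            % a candidate start must be followed by m frames above thr

start_frame = 0;    % 0 means "no speech found"
for c = 1:(length(Er) - m)
    % Frame c is accepted only if it and the next m frames are all above thr.
    if Er(c) >= thr && all(Er(c+1:c+m) >= thr)
        start_frame = c;
        break;
    end
end
disp(start_frame);
```

Frame 3 is loud but is followed by quiet frames, so it is treated as a burst; the first frame that passes the check is frame 6.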

In addition, initial consonants have low power, yet they also belong to the speech segment. What can be done about them? We can exploit the fact that their zero-crossing rate (ZCR) is relatively high.
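A small sketch of why ZCR helps here, using two synthetic frames rather than real speech: a loud low-frequency tone stands in for a vowel and a quiet high-frequency tone stands in for a consonant. The sample rate, frequencies, and amplitudes are assumptions chosen only to make the contrast visible.

```matlab
fs = 8000;                          % assumed sample rate in Hz
n  = (0:159)';                      % one 20 ms frame at 8 kHz
vowel     = 1.00 * sin(2*pi*200*n/fs  + 0.1);   % loud, low frequency
consonant = 0.05 * sin(2*pi*3000*n/fs + 0.1);   % quiet, high frequency

% Zero-crossing rate: fraction of adjacent sample pairs whose sign differs.
zcr = @(x) sum(abs(sign(x(1:end-1)) - sign(x(2:end)))) / 2 / (length(x) - 1);

energy_vowel = sum(vowel.^2);
energy_cons  = sum(consonant.^2);
zcr_vowel    = zcr(vowel);
zcr_cons     = zcr(consonant);
fprintf('vowel:     energy %.3f, ZCR %.3f\n', energy_vowel, zcr_vowel);
fprintf('consonant: energy %.3f, ZCR %.3f\n', energy_cons,  zcr_cons);
```

The quiet frame would fail an energy threshold, but its zero-crossing rate is far higher than that of the loud frame, which is what the second threshold looks for.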

So there is a teaching version of the algorithm that is simple and works well.

Assume that the first segment of the signal is pure noise (the noise is not useless; it is useful) ...

The double-threshold VAD algorithm is described everywhere; a Baidu search will turn it up.

The corresponding MATLAB code is posted below.

What is well worth learning here is the 3-sigma principle (sigma is the standard deviation, std).

In the part that measures the ambient noise parameters, the 3-sigma idea is used to choose the thresholds. Personally, I think it is pretty ingenious.
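A minimal sketch of the 3-sigma threshold, assuming the leading noise-only frames are roughly Gaussian (in which case about 99.9% of noise frames fall below mean + 3 sigma). The two sets of noise energies are invented for illustration; ITU and the max(...) rule follow the code in this post.

```matlab
% Hypothetical short-time energies (dB) of the leading noise-only frames.
noise_quiet = [-42 -40 -41 -39 -43 -40 -41 -42 -38];   % quiet room
noise_loud  = [-22 -20 -21 -19 -23 -20 -21 -22 -18];   % noisy room

ITU = -15;   % upper energy threshold in dB, as in the code in this post

% Energy threshold: mean + 3*std of the noise, but never below ITU - 10.
ITR_quiet = max([ITU - 10, mean(noise_quiet) + 3*std(noise_quiet)]);
ITR_loud  = max([ITU - 10, mean(noise_loud)  + 3*std(noise_loud)]);

fprintf('quiet room: ITR = %.2f dB\n', ITR_quiet);
fprintf('noisy room: ITR = %.2f dB\n', ITR_loud);
```

In the quiet case the fixed floor ITU - 10 wins; in the noisy case the threshold rises with the measured noise, so the same code adapts to the recording conditions.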

%% Voice silence detection
% Content: time-domain features + double threshold for voice silence detection
% Author: qcy
% Version: v1.1
% Start time: November 1, 2016 9:01:54
% End time: November 1, 2016 10:13:49
% Packaged the get_st_zcr function
% Packaged the get_st_energy function
%
% Version: v1.0
% Reference: Rabiner L R, Schafer R W. Introduction to digital speech processing[J].
% Foundations & Trends in Signal Processing, 2007, 1(1): 1-194.
% Start time: October 24, 2016 21:10:58

clear;
close all;
clc

%% Import the file
[x, fs] = audioread('bluesky1.wav');
% [x, fs] = audioread('silent Detection.m4a');
x = x - mean(x);                 % remove the DC offset
x = x ./ max(abs(x));            % normalize the peak amplitude to 1
t = (0:length(x)-1) / fs;        % time axis, same length as x
% sound(x, fs);
figure(1);
subplot(411);
plot(t, x);
title('Signal waveform');

%% Compute the short-time zero-crossing rate
% 1. Set the frame length and the step
wlen_time = 0.02;                % [s]
step_time = 0.01;                % [s]
wlen = round(wlen_time * fs);
nstep = round(step_time * fs);
nframes = fix((length(x) - wlen) / nstep) + 1;        % number of frames
frame_time = frame2time(nframes, wlen, nstep, fs);    % time corresponding to each frame
Zr = get_st_zcr(x, fs, wlen_time, step_time);
figure(1);
subplot(412);
plot(frame_time, Zr);
title('Short-time zero-crossing rate');

%% Compute the short-time energy
Er = get_st_energy(x, fs, wlen_time, step_time, 'hamming', 'dB');
figure(1);
subplot(413);
plot(frame_time, Er);
% plot(Er);

title('Short-term energy');

%%%%%%%%%%%%%%%%%%%%% Silence detection below %%%%%%%%%%%%%%%%%%%%%
%% 1. Obtain the noise characteristics and define the parameters needed for silence detection
% (1) Measure the ambient noise and obtain some noise parameters
noise_frame_idx = floor((0.1 - wlen_time) / step_time) + 1;  % assume the first 100 ms is ambient noise
eavg  = mean(Er(1:noise_frame_idx));   % mean short-time power of the ambient noise
esig  = std(Er(1:noise_frame_idx));    % standard deviation of the short-time power of the ambient noise
zcavg = mean(Zr(1:noise_frame_idx));   % mean short-time zero-crossing rate of the ambient noise
zcsig = std(Zr(1:noise_frame_idx));    % standard deviation of the zero-crossing rate of the ambient noise

% (2) Set the thresholds according to the background noise
IF = 35;
IZCT = max([IF, zcavg + 3*zcsig]);     % mean plus 3 standard deviations, as the short-time ZCR threshold
% Note: get_st_zcr returns a rate normalized to [0, 1], so with IF = 35 the
% max() always selects IF and the ZCR counter in step (2) never exceeds M_counter.
ITU = -15;                             % constant in the range [-10, -20] dB (intensity threshold upper)
ITR = max([ITU - 10, eavg + 3*esig]);  % mean plus 3 standard deviations, as the short-time power threshold

%% 2. Start silence detection
B1 = 1;            % starting point, coarse estimate (unit: frame index)
B2 = 1;            % starting point, fine estimate (unit: frame index)
E1 = length(Er);   % end point, coarse estimate (same unit)
E2 = length(Er);   % end point, fine estimate (same unit)

% (1) Coarse estimate of the starting and end points --> energy threshold
% (1a) Coarse estimate: starting point
c = 1;             % cursor
is_continue = 1;
while is_continue  % outer loop, needed to reject burst noise
    % Scan forward and skip all low-power frames (bounds are checked before indexing)
    while c <= length(Er) && Er(c) < ITR
        c = c + 1;
    end
    c = min(c, length(Er));
    is_continue = 0;   % scan finished
    B1 = c;            % for now, this is the coarse estimate of the starting point
    % To guard against burst noise, look ahead m frames to see whether this really is speech
    m = 3;
    for k = c+1:c+m
        if k > length(Er)   % bounds check
            break;
        end
        if Er(k) < ITR
            % The energy drops in a later frame, so the earlier frame is most likely noise;
            % the outer loop lets c skip past the noise frame
            c = k + 1;
            is_continue = 1;
        end
    end
end

% (1b) Coarse estimate: end point
c = length(Er);    % cursor
is_continue = 1;
while is_continue  % outer loop, needed to reject burst noise
    % Scan backward from the end and skip all low-power frames
    while c >= 1 && Er(c) < ITR
        c = c - 1;
    end
    c = max(c, 1);
    is_continue = 0;   % scan finished
    E1 = c;            % for now, this is the coarse estimate of the end point
    % To guard against burst noise, look back m frames to see whether this really is speech
    m = 3;
    for k = c-1:-1:c-m
        if k < 1       % bounds check
            break;
        end
        if Er(k) < ITR
            % The energy of an earlier frame is below the threshold, so the later frame is
            % most likely noise; the outer loop lets c skip past the noise frame
            c = k - 1;
            is_continue = 1;
        end
    end
end

% (2) Fine estimate of the starting and end points --> short-time ZCR threshold
% (2a) Fine estimate: starting point
B2 = B1;
% Over the m_left frames to the left of B1, count the frames whose ZCR exceeds the ZCR threshold.
% If the count exceeds M_counter, the preceding stretch is considered speech as well;
% otherwise B2 stays equal to B1.
m_left = 20;
M_counter = 4;
nframes_zcr_higher_than_threshold_counter = 0;
possible_B2_idx = 0;   % candidate index for the fine estimate of the starting point
for k = (B1-1):-1:(B1-m_left)
    if k < 1           % bounds check
        break;
    end
    if Zr(k) > IZCT
        nframes_zcr_higher_than_threshold_counter = ...
            nframes_zcr_higher_than_threshold_counter + 1;
        possible_B2_idx = k;
    end
end
if nframes_zcr_higher_than_threshold_counter > M_counter
    B2 = possible_B2_idx;
end

% (2b) Fine estimate: end point
E2 = E1;
% Over the m_right frames to the right of E1, count the frames whose ZCR exceeds the ZCR threshold.
% If the count exceeds M_counter, the following stretch is considered speech as well;
% otherwise E2 stays equal to E1.
m_right = 20;
M_counter = 4;
nframes_zcr_higher_than_threshold_counter = 0;
possible_E2_idx = 0;   % candidate index for the fine estimate of the end point
for k = E1+1:E1+m_right
    if k > length(Zr)  % bounds check
        break;
    end
    if Zr(k) > IZCT
        nframes_zcr_higher_than_threshold_counter = ...
            nframes_zcr_higher_than_threshold_counter + 1;
        possible_E2_idx = k;
    end
end
if nframes_zcr_higher_than_threshold_counter > M_counter
    E2 = possible_E2_idx;   % the original assigned B2 here, which looks like a copy-paste slip
end

% (3) Conservative re-estimate of the starting and end points
% Conservative: better to keep a little noise than to miss a single frame of speech.
% Look up to m_left2 frames before B2 and up to m_right2 frames after E2;
% frames whose energy is also above the energy threshold are treated as speech.
m_left2 = 20;
m_right2 = 20;
% (3a) Starting point
for k = (B2-1):-1:(B2-m_left2)
    if k < 1           % bounds check
        break;
    end
    if Er(k) > ITR
        B2 = k;
    end
end
% (3b) End point
for k = E2+1:E2+m_right2
    if k > length(Er)  % bounds check
        break;
    end
    if Er(k) > ITR
        E2 = k;
    end
end

% At this point B2 and E2 hold the starting and ending frame indices; convert them to seconds
B_time = (B2-1) * step_time + wlen_time/2;
E_time = (E2-1) * step_time + wlen_time/2;
%%%%%%%%%%%%% End of detection %%%%%%%%%%%%%%%

%% Plot
figure(1);
subplot(414);
plot(t, x);
title('Silence detection result');
hold on;
line([B_time B_time], [-1 1], 'Color', [1 0 0], 'LineWidth', 2);
line([E_time E_time], [-1 1], 'Color', [1 0 0], 'LineWidth', 2);

%% Listen to the result
x_speech = x(round(B_time*fs):round(E_time*fs));
sound(x_speech, fs)


The functions that compute the short-time energy and the short-time zero-crossing rate were already covered in the previous article:

Short-time energy, short-time zero-crossing rate

This algorithm is very simple and effective, but it cannot detect pauses within the speech.

The test results are shown below.


Improving it would be somewhat cumbersome.

In addition, this algorithm can only be used for post-processing, that is, offline processing.

To detect the end of the speech, all of the data must be read first, because the end point is found by scanning from right to left.
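Both limitations show up in a tiny sketch with invented frame energies: the end point is found by walking in from the last frame, which is only possible once the whole signal is available, and a pause in the middle of the utterance ends up inside the detected segment. The values and the threshold thr are assumptions for illustration, and the burst check from the full code is left out to keep it short.

```matlab
% Hypothetical per-frame energies in dB; frames 5 and 6 are a pause.
Er  = [-40 -12 -10 -9 -30 -35 -11 -10 -40 -41];
thr = -25;   % assumed energy threshold in dB

% Start point: scan from the left, skipping low-energy frames.
start_frame = 1;
while start_frame <= length(Er) && Er(start_frame) < thr
    start_frame = start_frame + 1;
end

% End point: scan from the right, which requires the last frame to exist already.
end_frame = length(Er);
while end_frame >= 1 && Er(end_frame) < thr
    end_frame = end_frame - 1;
end

% Frames inside the detected segment that are actually below the threshold.
pause_frames = start_frame - 1 + find(Er(start_frame:end_frame) < thr);
disp([start_frame end_frame]);
disp(pause_frames);
```

Only one start and one end are produced, so the pause frames are reported as part of the speech segment.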

But it is simple, so that is acceptable.


Addendum: frame2time

CSDN used to host this code, but in the new system the code snippet can no longer be found ... the earlier code snippet references seem to be missing.

-_-!! What the hell ...

function frametime = frame2time(framenum, framelen, inc, fs)
% After framing, compute the time corresponding to each frame (the frame center)
frametime = (((1:framenum) - 1) * inc + framelen/2) / fs;

function zcr = get_st_zcr(x, fs, wlen_time, step_time, win_type)
% function zcr = get_st_zcr(x, fs, wlen_time, step_time, win_type)
% Computes the short-time zero-crossing rate.
% Input parameters
%   x:         speech signal --> mono
%   fs:        sample rate
%   wlen_time: window time (s)
%   step_time: step time (s)
%   win_type:  'hamming', 'hanning', ..., default 'hamming'
% Return parameter
%   zcr:       short-time zero-crossing rate (the horizontal axis is the frame index)
%
% Author: qcy
% Version: v1.0
% Version description: computes the short-time zero-crossing rate.
%   If the signal does not divide evenly into frames, the last partial frame is discarded.
% Time: October 31, 2016 21:08:22

if min(size(x)) > 1
    % not mono
    % ...
end

wlen = round(wlen_time * fs);
nstep = round(step_time * fs);

% Select the window; anything other than 'hanning' falls back to Hamming
if nargin < 5
    win = hamming(wlen);
elseif strcmp(win_type, 'hanning')
    win = hanning(wlen);
else
    win = hamming(wlen);
end

nframes = floor((length(x) - wlen) / nstep) + 1;   % total number of frames

zcr = zeros(1, nframes);
for k = 1:nframes
    idx = (k-1) * nstep + (1:wlen);
    x_sub = x(idx) .* win;
    x_sub1 = x_sub(1:end-1);
    x_sub2 = x_sub(2:end);
    % Each sign change contributes 2 to the sum, hence the division by 2
    zcr(k) = sum(abs(sign(x_sub1) - sign(x_sub2))) / 2 / length(x_sub1);
end
end

function E = get_st_energy(x, fs, wlen_time, step_time, win_type, energy_unit)
% function E = get_st_energy(x, fs, wlen_time, step_time, win_type, energy_unit)
% Computes the short-time energy (not divided by the frame length, so it is energy, not power).
% Input parameters
%   x:           speech signal --> mono
%   fs:          sample rate
%   wlen_time:   window time (s)
%   step_time:   step time (s)
%   win_type:    'hamming', 'hanning', ..., default 'hamming'
%   energy_unit: 'dB' returns the normalized energy in dB;
%                otherwise the energy is on a linear scale.
% Return parameter
%   E:           short-time energy (the horizontal axis is the frame index)
%
% Author: qcy
% Version: v1.0
% Version description: computes the short-time energy.
%   If the signal does not divide evenly into frames, the last partial frame is discarded.
% Time: October 31, 2016 21:21:23

wlen = round(wlen_time * fs);
nstep = round(step_time * fs);

% Select the window. The original only honored win_type when nargin == 5,
% so it was ignored whenever energy_unit was also passed.
if nargin < 5
    win = hamming(wlen);
elseif strcmp(win_type, 'hanning')
    win = hanning(wlen);
else
    win = hamming(wlen);
end

nframes = floor((length(x) - wlen) / nstep) + 1;   % total number of frames

E = zeros(1, nframes);
for k = 1:nframes
    idx = (k-1) * nstep + (1:wlen);
    x_sub = x(idx) .* win;
    E(k) = sum(x_sub.^2);
end

% Convert to dB if requested (normalized so that the maximum is about 0 dB)
if nargin == 6
    if strcmp(energy_unit, 'dB')
        E = 10*log10(E/max(E) + eps);
    end
end
