DTW algorithm (speech recognition)

Last Update:2017-07-23 Source: Internet

Author: User


DTW is mainly used for isolated word recognition, where it works well for recognizing a set of specific commands. The algorithm was developed on the basis of dynamic programming (DP). Before getting to DTW itself, here is the overall framework of the recognizer. First we need a template recording to compare against. Then we need to cut out the part of each recording that actually contains speech; this is the job of a VAD (voice activity detection) algorithm. The most commonly used VAD method is double-threshold endpoint detection. VAD determines where the speech begins and ends by comparing the volume against threshold values, which is very simple to do in the time domain.

Figure. Speech (voice signal), energy (short-time energy), ZCR (short-time zero-crossing rate)
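As a rough illustration of the double-threshold idea, the sketch below computes the short-time energy and zero-crossing count of a synthetic signal and then locates the speech segment. The signal, frame length, hop size, and both thresholds are assumptions chosen for this example, and the code is not the vad function used later in this article.

```matlab
% Synthetic test signal (assumption): silence, a 200 Hz tone, silence.
fs = 8000;
t = (0:fs/2-1) / fs;                          % 0.5 s of "speech"
x = [zeros(1, 2000), 0.5 * sin(2*pi*200*t), zeros(1, 2000)];

frameLen = 200;                               % 25 ms frames (assumed)
hop = 100;                                    % 50 percent overlap (assumed)
nFrames = floor((length(x) - frameLen) / hop) + 1;

energy = zeros(1, nFrames);
zcr = zeros(1, nFrames);
for k = 1:nFrames
    seg = x((k-1)*hop + (1:frameLen));
    energy(k) = sum(seg .^ 2);                % short-time energy
    zcr(k) = sum(abs(diff(sign(seg)))) / 2;   % short-time zero-crossing count
end

% Double threshold: the high threshold finds frames that are surely speech,
% then the low threshold extends the segment outward to its endpoints.
highThr = 0.5 * max(energy);                  % assumed value
lowThr = 0.05 * max(energy);                  % assumed value

sure = find(energy > highThr);
x1 = sure(1);
x2 = sure(end);
while x1 > 1 && energy(x1-1) > lowThr
    x1 = x1 - 1;
end
while x2 < nFrames && energy(x2+1) > lowThr
    x2 = x2 + 1;
end
fprintf('speech from frame %d to frame %d\n', x1, x2);
```

Only the energy drives the decision here; zcr is computed to mirror the figure. A fuller detector would typically also use the zero-crossing rate to catch low-energy consonants at the edges of a word.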

Next we need a feature vector. Speech recognition often uses MFCC, the Mel-frequency cepstral coefficients, as the feature vector. Ordinary spectral analysis uses the spectrum or wavelets, which differ only in the measure they use, and both address the filtering of additive noise. There are also the cepstrum and the order spectrum, which are spectral methods constructed for specific needs; these were discussed earlier in connection with NI. The cepstrum is a spectral method designed to filter out multiplicative noise: put simply, take the logarithm of the power spectrum and then apply the inverse Fourier transform. This approach is very useful for signal separation. The rest of this article analyzes DTW speech recognition with MATLAB.
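The description of the cepstrum above, c = IFFT(log |FFT(x)|^2), can be sketched in a few lines. The synthetic frame and the Hamming window are assumptions for illustration. This is the plain real cepstrum, not MFCC, which additionally applies a Mel filter bank before the logarithm and uses a DCT.

```matlab
% Real cepstrum of one frame: inverse FFT of the log power spectrum.
fs = 8000;
n = 0:255;
frame = sin(2*pi*440*n/fs) + 0.5 * sin(2*pi*880*n/fs);   % synthetic frame (assumption)
frame = frame .* (0.54 - 0.46 * cos(2*pi*n/255));        % Hamming window

powerSpec = abs(fft(frame)) .^ 2;
c = real(ifft(log(powerSpec + eps)));                    % eps guards against log(0)

disp(c(1:5));
```

The logarithm turns a product of spectra into a sum, which is why a multiplicative disturbance becomes an additive one that can be separated in the cepstral domain.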

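    % i is the index of the reference recording, e.g. 1a.wav
    fname = sprintf('%da.wav', i);
    [x, fs] = wavread(fname);   % wavread is absent from recent MATLAB releases; audioread replaces it
    [x1, x2] = vad(x);          % start and end of the speech
    m = mfcc(x);                % MFCCs of the whole signal
    m = m(x1-2:x2-2, :);        % keep only the frames inside the speech segment
    ref(i).mfcc = m;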

The code first reads a speech recording with wavread and then uses the vad function to find the start and end of the speech; many of the functions used here come from the speech processing library Voicebox. x1 and x2 are the two endpoints of the speech. The MFCCs are first computed over the whole signal, and then only the frames that belong to the speech segment are kept as the feature values.

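    % i is the index of the recording to be recognized, e.g. 1b.wav
    fname = sprintf('%db.wav', i);
    [x, fs] = wavread(fname);   % wavread is absent from recent MATLAB releases; audioread replaces it
    [x1, x2] = vad(x);          % start and end of the speech
    m = mfcc(x);                % MFCCs of the whole signal
    m = m(x1-2:x2-2, :);        % keep only the frames inside the speech segment
    test(i).mfcc = m;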

The Mel cepstrum coefficients of the speech file to be recognized are computed in the same way, and the template and the file to be recognized are then compared. This is where the DTW algorithm comes in. The whole recognition pipeline is often called DTW speech recognition, but DTW itself is mainly used to compare two Mel cepstrum sequences. The comparison is distance-based, so it can also be regarded as a kind of clustering method based on supervised learning.

Before discussing DTW itself, a word on matching. For image matching earlier we used spectral analysis; here the signal is one-dimensional, so the relevant template matching methods are explained first. A key problem in speech recognition is that a speaker cannot pronounce the same word identically twice. The differences include not only the sound intensity and shifts of the spectrum but, more importantly, the syllable lengths, which are never exactly the same, and the syllables of two utterances often do not correspond linearly.

Let the reference template have M frame vectors {R(1), R(2), ..., R(m), ..., R(M)}, where R(m) is the speech feature vector of frame m, and let the test template have N frame vectors {T(1), T(2), ..., T(n), ..., T(N)}, where T(n) is the speech feature vector of frame n. d(T(i_n), R(i_m)) denotes the Euclidean distance between the feature of frame i_n in T and the feature of frame i_m in R. Direct matching assumes that the test template and the reference template are of equal length, i.e. i_n = i_m. Linear time warping assumes that the speaking speed is distributed in proportion to the length of the speech units, i.e. i_n = (N/M) i_m. Neither assumption matches how speech is actually pronounced, so we need a more realistic nonlinear time-warping technique: the DTW algorithm.

Fig. Comparison of three matching modes
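To see why a straight-line mapping falls short, the sketch below compares a linear pairing of frames with a hand-picked nonlinear pairing on two made-up one-dimensional sequences that have the same shape but different timing. The sequences, the endpoint-aligned linear mapping, and the nonlinear index list are all assumptions for illustration.

```matlab
% Two 1-D "feature" sequences (assumption): same shape, different timing.
r = [0 1 2 3 2 1 0]';            % reference, M = 7 frames
t = [0 0 1 2 3 3 3 2 1 0]';      % test, N = 10 frames
M = size(r, 1);
N = size(t, 1);

% Linear time warping: test frame n is paired with reference frame
% round(1 + (n-1)*(M-1)/(N-1)), a straight line through the grid.
idx = round(1 + ((1:N)' - 1) * (M - 1) / (N - 1));
linDist = sum((t - r(idx)) .^ 2);

% A nonlinear pairing chosen by hand: the reference index may stay put
% or advance by one from one test frame to the next.
nonlinIdx = [1 1 2 3 4 4 4 5 6 7]';
nonlinDist = sum((t - r(nonlinIdx)) .^ 2);

fprintf('linear: %g, nonlinear: %g\n', linDist, nonlinDist);
```

The linear pairing leaves a nonzero distance even though the two sequences differ only in timing, while the nonlinear pairing matches every frame exactly. DTW finds such a pairing automatically instead of requiring it to be chosen by hand.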

In the schematic of the DTW algorithm, the frame numbers n = 1 to N of the test template are marked on the horizontal axis of a two-dimensional Cartesian coordinate system, and the frame numbers m = 1 to M of the reference template on the vertical axis. Drawing horizontal and vertical lines through these integer coordinates forms a grid, and each intersection (t_i, r_j) in the grid represents the pairing of one frame of the test pattern with one frame of the training pattern. The DTW algorithm proceeds in two steps: first compute the distance between every pair of frames of the two patterns, which gives the frame matching distance matrix; then find the best path through that matrix. The path search can be described as follows. The search starts from the point (1,1). Under the local path constraints, the only grid points that can precede (i_n, i_m) are (i_n - 1, i_m), (i_n - 1, i_m - 1) and (i_n - 1, i_m - 2). The point (i_n, i_m) must take as its predecessor whichever of these three points has the smallest cumulative distance, so the cumulative distance of the path is:

$$D(i_n, i_m) = d(T(i_n), R(i_m)) + \min\{D(i_n - 1, i_m),\ D(i_n - 1, i_m - 1),\ D(i_n - 1, i_m - 2)\}$$

Starting the search from the point (1,1) (with D(1,1) = 0) and applying this recursion repeatedly until (N,M) is reached yields the optimal path, and D(N,M) is the matching distance of the best matching path. For recognition, the test template is matched against every reference template, and the reference template with the smallest matching distance D_min(N,M) gives the recognition result.

Figure. DTW Algorithm principle

Figure. Local constraint path
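To make the recursion concrete, here is a small hand-worked example. The numbers are made up for illustration: a test template T = (1, 2, 3) with N = 3 frames, a reference template R = (1, 3) with M = 2 frames, scalar features, and the squared difference as the local distance d. Because d(1,1) happens to be 0 here, starting from D(1,1) = 0 or from D(1,1) = d(1,1) gives the same result.

```latex
\begin{aligned}
d &= \begin{pmatrix} 0 & 4 \\ 1 & 1 \\ 4 & 0 \end{pmatrix}
  \quad \text{(rows: test frames } i_n \text{, columns: reference frames } i_m \text{)} \\
D(1,1) &= 0, \qquad D(1,2) = \infty \ \text{(not reachable from the start)} \\
D(2,1) &= d(2,1) + D(1,1) = 1 + 0 = 1 \\
D(2,2) &= d(2,2) + \min\{D(1,2),\ D(1,1)\} = 1 + 0 = 1 \\
D(3,1) &= d(3,1) + D(2,1) = 4 + 1 = 5 \\
D(3,2) &= d(3,2) + \min\{D(2,2),\ D(2,1)\} = 0 + 1 = 1
\end{aligned}
```

The matching distance is D(N,M) = D(3,2) = 1, and one best path is (1,1), (2,2), (3,2).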

The following is a MATLAB implementation of the DTW algorithm:

    function dist = dtw(t, r)
    % t: test template (n frames), r: reference template (m frames), one frame per row
    n = size(t, 1);
    m = size(r, 1);

    % Frame matching distance matrix (squared Euclidean distance per frame pair)
    d = zeros(n, m);
    for i = 1:n
        for j = 1:m
            d(i,j) = sum((t(i,:) - r(j,:)).^2);
        end
    end

    % Cumulative distance matrix; realmax marks points not yet reached
    D = ones(n, m) * realmax;
    D(1,1) = d(1,1);

    % Dynamic programming
    for i = 2:n
        for j = 1:m
            D1 = D(i-1, j);
            if j > 1
                D2 = D(i-1, j-1);
            else
                D2 = realmax;
            end
            if j > 2
                D3 = D(i-1, j-2);
            else
                D3 = realmax;
            end
            D(i,j) = d(i,j) + min([D1, D2, D3]);
        end
    end

    dist = D(n, m);
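The final decision step, picking the reference template with the smallest matching distance, might look like the sketch below. The distance values are made up for illustration; in practice each entry would presumably be filled by a call such as dtw(test(i).mfcc, ref(j).mfcc), using the function and structures shown earlier in this article.

```matlab
% dist(i, j): DTW distance between test utterance i and reference template j.
% These numbers are hypothetical, chosen only to illustrate the decision rule.
dist = [12 85 90;
        77 15 64;
        95 70  9];

% For every test utterance, pick the template with the smallest distance.
[minDist, best] = min(dist, [], 2);
for i = 1:size(dist, 1)
    fprintf('test %d -> template %d (distance %g)\n', i, best(i), minDist(i));
end
```

A practical system might also reject a match whose minimum distance is above some threshold, so that words outside the vocabulary are not forced onto the nearest template.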

Figure. DTW speech recognition algorithm test results

In the final test, the system correctly identifies specific isolated words. This idea of local optimization is used in many other places; for the N-queens problem, for example, local search is a relatively fast approach. Besides the Mel cepstrum, speech recognition can also use LPCC (linear prediction cepstrum coefficients), which are derived from LPC (linear prediction coefficients), but these are said to give good results for vowels and poor results for consonants. The DTW algorithm described above actually achieves a relatively high recognition rate on English, and English recognition should in fact be simpler than Chinese. Alongside speech recognition, speech synthesis also needs to be considered, for example methods for composing a voice from a series of units.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
