DTW (dynamic time warping) is mainly used in isolated word recognition, where a spoken word is identified by comparing it against a set of pre-recorded commands. The algorithm is built on DP (dynamic programming). Before looking at DTW itself, here is the overall framework of the recognizer. We start with a template recording to compare against. From each recording we then need to cut out the part that actually contains speech; this is the job of a VAD (voice activity detection) algorithm. The most commonly used VAD method is double-threshold endpoint detection. VAD determines where the speech begins and ends by comparing the volume with threshold values, which is very simple to do in the time domain.
Figure. Speech (voice signal), energy (short-time energy), ZCR (short-time zero-crossing rate)
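The double-threshold idea can be sketched roughly as follows. This is a Python illustration, not the `vad` function used later in this post (its source is not shown here). The non-overlapping frames, the threshold values, and the function names `short_time_energy`, `zero_crossing_rate` and `detect_endpoints` are all assumptions made for the example, and only the energy is thresholded.

```python
def short_time_energy(signal, frame_len):
    """Sum of squared samples for each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(s * s for s in signal[k * frame_len:(k + 1) * frame_len])
        for k in range(n_frames)
    ]


def zero_crossing_rate(signal, frame_len):
    """Number of sign changes inside each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    rates = []
    for k in range(n_frames):
        frame = signal[k * frame_len:(k + 1) * frame_len]
        rates.append(sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0))
    return rates


def detect_endpoints(energy, high, low):
    """Double-threshold endpoint detection on a list of frame energies.

    Returns (start, end) as 0-based frame indices, or None when no frame
    exceeds the high threshold. Both thresholds are hypothetical values
    that would have to be tuned for real recordings.
    """
    above_high = [k for k, e in enumerate(energy) if e > high]
    if not above_high:
        return None
    start, end = above_high[0], above_high[-1]
    # Extend outwards while the energy stays above the low threshold.
    while start > 0 and energy[start - 1] > low:
        start -= 1
    while end < len(energy) - 1 and energy[end + 1] > low:
        end += 1
    return start, end


# Toy signal: silence, a quiet onset, a loud part, a quiet tail, silence.
signal = [0.0] * 8 + [0.1, -0.1] * 2 + [1.0, -1.0] * 4 + [0.1, -0.1] * 2 + [0.0] * 8
energy = short_time_energy(signal, frame_len=4)
print(detect_endpoints(energy, high=1.0, low=0.01))
```

The high threshold finds frames that are certainly speech, and the low threshold then widens the segment so that quiet onsets and tails are not cut off.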
Next we need a feature vector. Speech recognition most often uses MFCC, that is, Mel cepstrum coefficients, as the feature vector. In ordinary spectral analysis we work with the spectrum or with wavelets, which differ only in the measure they use; both address the filtering of additive noise. There are also the cepstrum and the order spectrum, which are further spectral methods constructed for specific needs; these were discussed last time in connection with NI. The cepstrum is a spectral method for filtering out multiplicative noise. Simply put, it takes the logarithm of the power spectrum and then applies the inverse Fourier transform: c(n) = IFFT(log |FFT(x(n))|^2). The logarithm turns a product of spectra into a sum, which is why the method is very useful for signal separation. Below, DTW speech recognition is walked through step by step in MATLAB.
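As a small illustration of "log of the power spectrum, then inverse Fourier transform", here is a Python sketch using a naive DFT. It is not the MFCC computation itself (there is no Mel filter bank here), and the function names and the `floor` guard against log(0) are assumptions made for the example.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def power_cepstrum(x, floor=1e-300):
    """Inverse DFT of the log power spectrum, real part only.

    `floor` is a hypothetical guard that avoids log(0) for empty bins.
    """
    log_power = [math.log(max(abs(v) ** 2, floor)) for v in dft(x)]
    return [c.real for c in idft(log_power)]


def circular_convolve(x, h):
    """Circular convolution, i.e. multiplication of the two DFTs."""
    n = len(x)
    return [sum(x[j] * h[(t - j) % n] for j in range(n)) for t in range(n)]


# A convolution (a product of spectra) becomes a sum in the cepstrum.
x = [1.0, 0.5, 0.0, 0.0]
h = [1.0, -0.3, 0.0, 0.0]
y = circular_convolve(x, h)
for cx, ch, cy in zip(power_cepstrum(x), power_cepstrum(h), power_cepstrum(y)):
    print(round(cx + ch, 6), round(cy, 6))
```

Each printed pair should agree up to rounding, which is the sense in which the cepstrum separates multiplicatively combined components.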
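% i is the index of the current template (the enclosing loop is not shown)
fname = sprintf('%da.wav', i);   % reference recording, e.g. 1a.wav
x = fname;
[x, fs] = wavread(x);            % read the samples (newer MATLAB releases use audioread)
[x1, x2] = vad(x);               % first and last speech frame
m = mfcc(x);                     % Mel cepstrum of the whole signal
m = m(x1-2:x2-2, :);             % keep only the frames inside the speech segment
ref(i).mfcc = m;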
Here we first read a piece of speech with wavread and then use the vad function to get the beginning and end of the speech; many of the functions used come from the Voicebox speech processing library. x1 and x2 are the two endpoints of the speech. The MFCC (Mel cepstrum) is first computed over the whole signal, and then only the frames belonging to the speech part are kept as the feature values.
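% i is the index of the current test recording (the enclosing loop is not shown)
fname = sprintf('%db.wav', i);   % recording to be recognized, e.g. 1b.wav
x = fname;
[x, fs] = wavread(x);            % read the samples (newer MATLAB releases use audioread)
[x1, x2] = vad(x);               % first and last speech frame
m = mfcc(x);                     % Mel cepstrum of the whole signal
m = m(x1-2:x2-2, :);             % keep only the frames inside the speech segment
test(i).mfcc = m;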
The Mel cepstrum coefficients of the speech file to be recognized are computed in the same way. The template and the file to be recognized are then compared, and this is where the DTW algorithm comes in. The whole recognizer is often called DTW speech recognition, but DTW itself is mainly applied to comparing two Mel cepstrum sequences. The comparison is distance-based, so it can also be regarded as a kind of clustering method based on supervised learning.
Before the matching itself, a word about matching in general. For image matching we previously used spectral analysis; here the signals are one-dimensional, so the relevant template matching methods are explained below. A key problem in speech recognition is that a speaker cannot pronounce the same word identically twice. The differences include not only loudness and shifts of the spectrum but, more importantly, syllable lengths that are never exactly the same, and the syllables of two utterances usually do not correspond linearly.
Let the reference template have M frame vectors {R(1), R(2), ..., R(m), ..., R(M)}, where R(m) is the speech feature vector of frame m, and let the test template have N frame vectors {T(1), T(2), ..., T(n), ..., T(N)}, where T(n) is the speech feature vector of frame n. d(T(i_n), R(i_m)) denotes the Euclidean distance between the feature of frame i_n in T and the feature of frame i_m in R. Direct matching assumes that the test and reference templates have equal length, i.e. i_n = i_m. Linear time warping assumes that the speaking speed is distributed in proportion to the length of the speech units, i.e. i_m = (M/N) i_n. Neither assumption matches real pronunciation, so we need a more realistic nonlinear time warping technique, which is the DTW algorithm.
Figure. Comparison of the three matching modes
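To make the first two matching modes concrete, here is a rough Python sketch. The function names are hypothetical, and rounding the linear mapping i_m = (M/N) i_n up to the next frame index is an assumption; the text does not say how fractional indices are handled.

```python
def frame_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def direct_match(test, ref):
    """Direct matching: frame n of the test is compared with frame n of the reference."""
    if len(test) != len(ref):
        raise ValueError("direct matching needs templates of equal length")
    return sum(frame_distance(t, r) for t, r in zip(test, ref))


def linear_warp_match(test, ref):
    """Linear time warping: stretch the reference uniformly to the test length."""
    n_frames, m_frames = len(test), len(ref)
    total = 0.0
    for i_n in range(1, n_frames + 1):
        # 1-based mapping i_m = ceil(i_n * M / N); the ceiling is an assumption.
        i_m = (i_n * m_frames + n_frames - 1) // n_frames
        total += frame_distance(test[i_n - 1], ref[i_m - 1])
    return total


ref = [[0.0], [1.0], [2.0]]
uniform = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]    # every frame held twice
uneven = [[0.0], [0.0], [0.0], [0.0], [1.0], [2.0]]     # only the first frame is drawn out
print(linear_warp_match(uniform, ref))
print(linear_warp_match(uneven, ref))
```

The uniformly stretched utterance matches with zero distance, while the unevenly stretched one does not, even though it contains the same three sounds in the same order. That mismatch is what nonlinear warping is meant to remove.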
In the schematic of the DTW algorithm, the frame numbers n = 1..N of the test template are marked on the horizontal axis of a two-dimensional Cartesian coordinate system, and the frame numbers m = 1..M of the reference template on the vertical axis. Drawing horizontal and vertical lines through these integer coordinates forms a grid, and each intersection (t_i, r_j) represents the pairing of one frame of the test pattern with one frame of the training pattern. The DTW algorithm is carried out in two steps: first, compute the distance between every pair of frames of the two patterns, which gives the frame matching distance matrix; second, find the best path through that matrix. The search can be described as follows. It starts at the point (1,1). Under the local path constraint, the only possible predecessors of the point (i_n, i_m) are (i_n - 1, i_m), (i_n - 1, i_m - 1) and (i_n - 1, i_m - 2). The point (i_n, i_m) takes as its predecessor whichever of these three has the smallest cumulative distance, so the cumulative distance of the path is:
D(i_n, i_m) = d(T(i_n), R(i_m)) + min{ D(i_n - 1, i_m), D(i_n - 1, i_m - 1), D(i_n - 1, i_m - 2) }
Starting from the point (1,1) with D(1,1) = 0 (the MATLAB implementation in this post initializes D(1,1) with d(1,1) instead), the recursion is repeated until the point (N,M) is reached. This yields the optimal path, and D(N,M) is the matching distance of the best matching path. For recognition, the test template is matched against all reference templates, and the reference with the smallest matching distance D_min(N,M) gives the recognition result.
Figure. DTW Algorithm principle
Figure. Local constraint path
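The recursion and the final "smallest distance wins" step can be sketched in Python as follows. This is a sketch, not the post's MATLAB code: the names `dtw_distance` and `recognize` are hypothetical, the first cell is initialized with its own frame distance (an assumption), and `math.inf` is used to mark cells that no path can reach.

```python
import math


def dtw_distance(test, ref):
    """Cumulative DTW distance with predecessors (i-1, j), (i-1, j-1), (i-1, j-2)."""
    n, m = len(test), len(ref)
    # Frame matching distance matrix (Euclidean distance between frames).
    d = [[math.dist(test[i], ref[j]) for j in range(m)] for i in range(n)]
    # Cumulative distance matrix; math.inf marks unreachable cells.
    D = [[math.inf] * m for _ in range(n)]
    D[0][0] = d[0][0]  # assumption: start from the first frame distance
    for i in range(1, n):
        for j in range(m):
            best = D[i - 1][j]
            if j >= 1:
                best = min(best, D[i - 1][j - 1])
            if j >= 2:
                best = min(best, D[i - 1][j - 2])
            D[i][j] = d[i][j] + best
    return D[n - 1][m - 1]


def recognize(test, references):
    """Return the label of the reference template with the smallest DTW distance."""
    return min(references, key=lambda label: dtw_distance(test, references[label]))


references = {
    "rising": [[0.0], [1.0], [2.0]],
    "flat": [[5.0], [5.0], [5.0]],
}
# The first sound is drawn out, so the frames do not correspond linearly.
utterance = [[0.0], [0.0], [0.0], [0.0], [1.0], [2.0]]
print(dtw_distance(utterance, references["rising"]))
print(recognize(utterance, references))
```

In this toy case the warped path absorbs the drawn-out first sound completely, so the distance to the "rising" template is zero.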
The following is the MATLAB implementation of the DTW algorithm:
function dist = dtw(t, r)
% t: test template (n frames), r: reference template (m frames), one frame per row
n = size(t, 1);
m = size(r, 1);
% Frame matching distance matrix (squared Euclidean distance between frames)
d = zeros(n, m);
for i = 1:n
    for j = 1:m
        d(i,j) = sum((t(i,:) - r(j,:)).^2);
    end
end
% Cumulative distance matrix, initialized to realmax (unreachable)
D = ones(n, m) * realmax;
D(1,1) = d(1,1);
% Dynamic programming
for i = 2:n
    for j = 1:m
        D1 = D(i-1, j);
        if j > 1
            D2 = D(i-1, j-1);
        else
            D2 = realmax;
        end
        if j > 2
            D3 = D(i-1, j-2);
        else
            D3 = realmax;
        end
        D(i,j) = d(i,j) + min([D1, D2, D3]);
    end
end
dist = D(n, m);
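One consequence of this local constraint is worth checking before trusting a distance. Each step advances the test frame by exactly one and the reference frame by at most two, so the end point can only be reached when m <= 2n - 1; otherwise the MATLAB function above returns a realmax-sized value rather than a meaningful distance. The Python sketch below, with hypothetical function names, marks the reachable cells by brute force and compares the result with that bound.

```python
def reachable_cells(n, m):
    """Mark the cells a path from (1, 1) can reach under the local constraint."""
    reach = [[False] * m for _ in range(n)]
    reach[0][0] = True
    for i in range(1, n):
        for j in range(m):
            # Predecessors are (i-1, j), (i-1, j-1) and (i-1, j-2).
            reach[i][j] = any(
                reach[i - 1][j - step] for step in (0, 1, 2) if j - step >= 0
            )
    return reach


def end_is_reachable(n, m):
    """True if the final cell (n, m) can be reached from (1, 1)."""
    return reachable_cells(n, m)[n - 1][m - 1]


# Compare the brute-force result with the closed-form bound m <= 2n - 1.
for n in range(1, 7):
    for m in range(1, 13):
        assert end_is_reachable(n, m) == (m <= 2 * n - 1)
print(end_is_reachable(3, 5), end_is_reachable(3, 6))
```

In practice this means a reference template with more than about twice as many frames as the test utterance cannot be aligned under this constraint.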
Figure. DTW speech recognition algorithm test results
In the final test we can recognize specific isolated words. This idea of local optimization is in fact used in many other places; for the N-queens problem, for example, local search is a relatively fast method. Besides the Mel cepstrum, speech recognition can also use LPC (linear prediction coefficients) to derive LPCC (linear prediction cepstrum coefficients), although these reportedly give poor results for consonants and good results for vowels. The DTW algorithm introduced above achieves a relatively high recognition rate on English; English recognition should in fact be simpler than Chinese. Alongside speech recognition we also need to consider speech synthesis, that is, ways of composing a voice.
DTW algorithm (speech recognition)