Speech Signal Processing (1): Dynamic Time Warping (DTW)

Source: http://blog.csdn.net/zouxy09

This semester I am taking the course "Speech Signal Processing". The exam is coming up, so I need to review the relevant topics. I have not paid much attention in class, so this is last-minute cramming. While I am at it, I am also writing summaries, both to organize my own knowledge and to share it with everyone. Below is a summary of the first topic: DTW. Since I did not spend much time on it, there may be many mistakes; corrections are welcome. Thank you.

Dynamic time warping (DTW) has a fairly long history (it was introduced by the Japanese scholar Itakura). The idea is relatively simple: it is a method for measuring the similarity of two time series of different lengths. It is also widely applicable, mainly in template matching, for example in isolated-word speech recognition (deciding whether two speech segments are the same word), gesture recognition, data mining, and information retrieval.

I. Overview

In most disciplines, time series are a common representation of data. A common task in time-series processing is to compare the similarity of two sequences.

The two time series to be compared may not be of equal length; in the speech recognition field this shows up as different people speaking at different speeds. Because the speech signal is quite random, even the same person pronouncing the same sound at different times will not produce segments of exactly the same duration. Moreover, different phonemes within the same word are stretched differently: one person may drag out the "A" sound, while another pronounces the "i" very short. In these complicated cases, the distance (or similarity) between two time series cannot be computed effectively with the traditional Euclidean distance.
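To make this concrete, here is a minimal sketch (my own toy example, not from the original post) of the lock-step Euclidean distance: it can only compare points at the same time index, so two sequences with an identical shape but a small time shift still get a large distance.

```python
def euclidean(a, b):
    """Lock-step Euclidean distance; only defined for equal-length sequences."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

pulse   = [0, 0, 1, 1, 0, 0]
shifted = [0, 1, 1, 0, 0, 0]   # same shape, just one step earlier in time
dist = euclidean(pulse, shifted)   # large despite the shapes being identical
```

Here `dist` comes out as sqrt(2), even though a human would call the two pulses "the same". This is exactly the failure mode DTW is designed to fix.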

As shown in Figure A, the solid and dashed lines are two speech waveforms of the same word "pen" (offset along the y-axis for easier viewing). You can see that the overall waveform shapes are similar, but they are not aligned on the time axis. For example, at time point 20, point a of the solid waveform would be paired with point b' of the dashed waveform, so the traditional way of computing similarity by comparing points at the same time index is clearly unreliable: point a of the solid line obviously corresponds to point b of the dashed line. In Figure B, DTW finds the correct point-to-point alignment of the two waveforms, so the distance is computed between truly corresponding points.

That is, in most cases the two sequences have very similar overall shapes, but these shapes are not aligned on the x-axis. So before comparing their similarity, we need to warp one (or both) of the sequences along the time axis to achieve a better alignment. DTW is an effective way to realize this kind of warping: it computes the similarity of two time series by stretching and shortening them in time.

But how do we know when the two waveforms are correctly aligned? In other words, which warping is the right one? Intuitively, a warping is correct when one sequence, after being warped, coincides with the other as closely as possible; at that point the sum of the distances between all corresponding points of the two sequences is minimal. So intuitively, a correct warping is one that aligns "feature to feature".

II. Dynamic time warping (DTW)

Dynamic time warping (DTW) is a typical optimization problem. It describes the temporal correspondence between a test template and a reference template by a time-warping function w(n) that satisfies certain conditions, and solves for the warping function corresponding to the minimum accumulated distance when the two templates are matched.

Suppose we have two time series Q and C, of lengths n and m. (In an actual speech-matching application, one sequence is the reference template and the other is the test template; each point in a sequence is the feature value of one frame of the speech signal. For example, the speech sequence Q has n frames, and the feature value of frame i — a number or a vector — is qi. Which features are used does not affect the discussion of DTW.) What we need is to measure the similarity of the two speech sequences, in order to identify which word our test utterance is.

Q = q1, q2, ..., qi, ..., qn;

C = c1, c2, ..., cj, ..., cm;

If n = m, there is no need for any fuss: simply compute the distance between the two series directly. But if n ≠ m, we need to align them. The simplest way to align is linear scaling: linearly stretch the shorter sequence to the same length as the longer one, or linearly shrink the longer one to the same length as the shorter one, and then compare point by point. However, this does not take into account that the duration of each segment within an utterance can be longer or shorter in different situations, so the recognition accuracy is not optimal. Therefore a dynamic programming approach is more commonly adopted.
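For reference, the naive linear-scaling baseline mentioned above can be sketched as follows. This is my own minimal implementation, assuming scalar-valued sequences; it resamples a sequence to a target length by linear interpolation, after which the lock-step distance applies.

```python
def linear_scale(seq, target_len):
    """Linearly resample seq to target_len points (simple linear interpolation)."""
    n = len(seq)
    if target_len == 1:
        return [seq[0]]
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)   # fractional index into seq
        i = int(pos)
        frac = pos - i
        if i + 1 < n:
            out.append(seq[i] * (1 - frac) + seq[i + 1] * frac)
        else:
            out.append(seq[i])                  # last point, nothing to interpolate
    return out

short = [0.0, 1.0, 0.0]
stretched = linear_scale(short, 5)   # -> [0.0, 0.5, 1.0, 0.5, 0.0]
```

Note how the warping is uniform: every segment is stretched by the same factor, which is exactly the assumption that fails for real speech, where individual phonemes stretch by different amounts.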

To align the two sequences, we construct an n × m matrix grid, in which element (i, j) holds the distance d(qi, cj) between the two points qi and cj (i.e. the similarity between each point of sequence Q and each point of C: the smaller the distance, the higher the similarity; for now, ignore ordering). The Euclidean distance is generally used, d(qi, cj) = (qi − cj)², which can also be interpreted as a distortion. Each matrix element (i, j) represents an alignment of the points qi and cj. The dynamic programming algorithm then amounts to finding a path through grid points in this matrix; the grid points the path passes through are the aligned point pairs of the two sequences.

So how do we find this path, and which path is the best one? This is exactly the question raised above: what kind of warping is the best?

We define this path as the warping path, denoted W. The k-th element of W is defined as wk = (i, j)k, which gives the mapping between sequences Q and C. So we have:

W = w1, w2, ..., wk, ..., wK,  with max(m, n) ≤ K < m + n − 1

First, this path cannot be chosen arbitrarily; it must satisfy the following constraints:

1) Boundary conditions: w1 = (1, 1) and wK = (n, m). The speed of any utterance may vary, but the order of its parts cannot change, so the chosen path must start at the lower-left corner and end at the upper-right corner.

2) Continuity: if wk−1 = (a′, b′), then the next point on the path, wk = (a, b), must satisfy (a − a′) ≤ 1 and (b − b′) ≤ 1. That is, matching cannot skip over any point; each step may only align with adjacent points. This guarantees that every coordinate of Q and C appears in W.

3) Monotonicity: if wk−1 = (a′, b′), then the next point on the path, wk = (a, b), must satisfy (a − a′) ≥ 0 and (b − b′) ≥ 0. This forces the points on W to be monotonic in time, which guarantees that the alignment lines (the dashed lines in Figure B) do not cross.

Combining the continuity and monotonicity constraints, each grid point on the path has only three possible successors. For example, if the path has passed through grid point (i, j), the next grid point can only be one of the following three: (i+1, j), (i, j+1), or (i+1, j+1).
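The three constraints above can be sketched as a small validity check. This is my own illustrative helper (not from the original post), using 1-based (i, j) pairs as in the text:

```python
def is_valid_warping_path(path, n, m):
    """Check the three DTW path constraints: boundary, continuity, monotonicity.

    path is a list of (i, j) pairs with 1-based indices; n and m are the
    lengths of the two sequences Q and C.
    """
    if path[0] != (1, 1) or path[-1] != (n, m):       # boundary conditions
        return False
    for (a_prev, b_prev), (a, b) in zip(path, path[1:]):
        # continuity (step at most 1) and monotonicity (step at least 0)
        if not (0 <= a - a_prev <= 1 and 0 <= b - b_prev <= 1):
            return False
        if (a, b) == (a_prev, b_prev):                # the path must advance
            return False
    return True
```

For instance, `[(1,1), (2,1), (2,2), (3,3)]` is a valid path on a 3 × 3 grid, while `[(1,1), (3,2), (3,3)]` is not, because the first step skips a point.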

There can be many paths satisfying these constraints; among them, we are interested in the path that minimizes the following warping cost:

DTW(Q, C) = min{ √(Σₖ₌₁ᴷ wₖ) / K }

where wₖ is the distance stored at the k-th grid point on the path.

The K in the denominator compensates for warping paths of different lengths. What is our goal, or what is the idea behind DTW? It is this: by stretching and shortening the two time series, find the warping that makes the distance between them shortest (i.e. makes them most similar); that shortest distance is then taken as the final distance measure between the two time series. So all we have to do is choose the path that minimizes the resulting total distance.

To do this, we define a cumulative distance. Starting the match of the two sequences Q and C from point (1, 1), at every point we accumulate all the distances computed so far. When the end point (n, m) is reached, this cumulative distance is the final total distance mentioned above, i.e. the similarity between sequences Q and C.

The cumulative distance γ(i, j) can be expressed recursively: γ(i, j) is the distance d(i, j) of the current grid point — i.e. the (squared) Euclidean distance between the points qi and cj — plus the minimum of the cumulative distances of the neighboring elements from which that point can be reached:

γ(i, j) = d(qi, cj) + min{ γ(i−1, j−1), γ(i−1, j), γ(i, j−1) }
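This recurrence translates almost directly into code. Below is a minimal sketch, assuming scalar-valued sequences and the squared-difference local cost from above (a real speech system would use per-frame feature vectors):

```python
def dtw_distance(q, c):
    """Minimal DTW via dynamic programming over the cumulative-distance table.

    Implements gamma[i][j] = d(q[i], c[j])
                             + min(gamma[i-1][j-1], gamma[i-1][j], gamma[i][j-1]).
    """
    n, m = len(q), len(c)
    INF = float("inf")
    # (n+1) x (m+1) table with an infinite border so boundary cases fall out
    gamma = [[INF] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2               # local squared distance
            gamma[i][j] = d + min(gamma[i - 1][j - 1],   # diagonal step
                                  gamma[i - 1][j],       # vertical step
                                  gamma[i][j - 1])       # horizontal step
    return gamma[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0: the second sequence is just the first with one point repeated, and DTW aligns the two repeated points to the same sample, which a lock-step distance could not do.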

The best path is the one along which the accumulated distance reaches its minimum; this path can be obtained with a dynamic programming algorithm.

For a visual example of the concrete search and solution process, see:

http://www.cnblogs.com/tornadomeet/archive/2012/03/23/2413363.html

III. The application of DTW in speech recognition

Consider an isolated-word speech recognition system that uses template matching for recognition; the whole word is usually taken as the recognition unit. In the training phase, the user speaks each word in the vocabulary; its features are extracted and stored as a template in a template library. In the recognition phase, features are likewise extracted from a new word to be recognized, and the DTW algorithm is then used to match them against each template in the library and compute the distance. The template with the shortest distance — i.e. the most similar one — gives the recognized word.
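The recognition phase described above can be sketched as follows. This is a self-contained toy example of mine: the template library, the word labels, and the 1-D "feature" sequences are all invented for illustration — a real system would store per-frame feature vectors (e.g. MFCCs) for each vocabulary word.

```python
def dtw_distance(q, c):
    """Compact DTW with squared-difference local cost (scalar frames)."""
    n, m = len(q), len(c)
    INF = float("inf")
    gamma = [[INF] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = d + min(gamma[i - 1][j - 1],
                                  gamma[i - 1][j],
                                  gamma[i][j - 1])
    return gamma[n][m]

def recognize(test_features, template_library):
    """Return the word whose template has the smallest DTW distance to the test."""
    return min(template_library,
               key=lambda word: dtw_distance(test_features, template_library[word]))

# Hypothetical 1-D templates standing in for real per-frame features.
library = {"yes": [0.0, 1.0, 1.0, 0.0],
           "no":  [1.0, 0.0, 0.0, 1.0]}
word = recognize([0.0, 1.0, 0.0], library)   # -> "yes"
```

Note that the test utterance has a different length than every template; that mismatch is exactly what DTW absorbs, so no prior linear scaling is needed.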

IV. References

[1] http://baike.baidu.com/view/1647336.htm

[2] http://www.cnblogs.com/tornadomeet/archive/2012/03/23/2413363.html

[3] http://www.cnblogs.com/luxiaoxun/archive/2013/05/09/3069036.html (with MATLAB/C++ code)

[4] Eamonn J. Keogh, "Derivative Dynamic Time Warping"

[5] Zhao Li, "Speech Signal Processing"
