Speech Signal Processing (I): Dynamic Time Warping (DTW)



Zouxy09@qq.com

http://blog.csdn.net/zouxy09

 

This semester I am taking a speech signal processing course, and the exam is coming up, so I need to get familiar with the material. I normally skip lectures, but now I have to buckle down. Along the way I would like to organize my knowledge more clearly and share it with you. The first topic is DTW. Since this was written in limited time, there may well be mistakes; corrections are welcome. Thank you.

 

Dynamic Time Warping (DTW) has some history behind it (it was proposed by the Japanese scholar Itakura), and its purpose is simple: it is a method for measuring the similarity between two time series of different lengths. It is widely used, mainly in template matching, for example in isolated-word speech recognition (deciding whether two speech segments represent the same word), gesture recognition, data mining, and information retrieval.

 

I. Overview

In many disciplines, time series are a common form of data representation. A common task in time-series processing is comparing the similarity of two sequences.

The two time series being compared may not have equal lengths. In the field of speech recognition, speaking rate varies from person to person. Speech signals are quite random: even when the same person utters the same word at different times, the durations will not be identical. Moreover, different phonemes within the same word are pronounced at different speeds; for example, one speaker may drag out the "A" sound for a long time, or make the "i" sound very short. In such complicated cases, the distance (or similarity) between two time series cannot be measured effectively with the traditional Euclidean distance.

As shown in figure A, the solid line and dotted line are two speech waveforms of the same word, "pen" (separated along the Y axis for easier viewing). Their overall shapes are very similar, but they are not aligned along the time axis. For example, at time point 20, point a of the solid waveform would be matched to point b' of the dotted waveform, so computing similarity with the traditional point-to-point distance is clearly unreliable. Obviously, point a of the solid line should correspond to point b of the dotted line. As figure B shows, DTW can find such an alignment between the points of the two waveforms, so that the distance computed between them is meaningful.

In other words, in most cases the two sequences have very similar overall shapes, but those shapes are not aligned along the X axis. Before comparing their similarity, we therefore need to warp one (or both) of the sequences along the time axis to achieve a better alignment. DTW is an effective method for performing this warping: it stretches and compresses the time axes of the two series in order to compute their similarity.

But how do we know when the two waveforms are aligned? That is, what kind of warping is correct? Intuitively, a warping is correct if, after warping, one sequence can overlap the other; in that case the sum of the distances between all corresponding points of the two sequences is minimal. So, intuitively, the correctness of a warping generally means "feature-to-feature" alignment.

 

II. Dynamic Time Warping (DTW)

DTW is a typical optimization problem. It uses a time-warping function W(n), subject to certain constraints, to describe the temporal correspondence between a test template and a reference template, and solves for the warping function that yields the minimum cumulative distance when the two templates are matched.

Suppose we have two time series Q and C of lengths n and m. (In actual speech matching, one sequence is a reference template and the other is a test template; the value at each point of a sequence is the feature of one frame of the speech signal. For example, sequence Q has n frames, and the feature of the i-th frame (a number or a vector) is qi. This does not affect the discussion of DTW; what we need is to measure the similarity between the two speech sequences in order to recognize the word in the test speech.)

Q = q1, q2, ..., qi, ..., qn;

C = c1, c2, ..., cj, ..., cm;

If n = m, there is nothing to worry about: simply compute the distance between the two sequences directly. But if n ≠ m, we need to align them. The simplest alignment method is linear scaling: stretch the shorter sequence to the length of the longer one (or compress the longer one to the length of the shorter one) and then compare. However, this ignores the fact that the duration of each segment of an utterance may lengthen or shorten independently, so the recognition accuracy will not be optimal. Dynamic programming (DP) is therefore used instead.
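For contrast with DTW, the naive linear-scaling alignment described above can be sketched as follows. This is a minimal illustration in Python with one-dimensional feature values; the function names are my own, not from any library:

```python
def linear_scale(seq, target_len):
    """Stretch or shrink seq to target_len by nearest-neighbor resampling."""
    n = len(seq)
    if target_len == 1:
        return [seq[0]]
    return [seq[min(n - 1, round(i * (n - 1) / (target_len - 1)))]
            for i in range(target_len)]

def scaled_euclidean(q, c):
    """Linearly scale the shorter sequence to the longer one's length,
    then sum the squared point-to-point differences."""
    m = max(len(q), len(c))
    q2, c2 = linear_scale(q, m), linear_scale(c, m)
    return sum((a - b) ** 2 for a, b in zip(q2, c2))
```

Because every segment is stretched by the same factor, a locally slowed-down phoneme still ends up misaligned, which is exactly the weakness DTW addresses.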

To align the two sequences, we construct an n × m matrix grid, where element (i, j) holds the distance d(qi, cj) between the points qi and cj (that is, the similarity between each point of Q and each point of C: the smaller the distance, the higher the similarity; the ordering of the points is irrelevant here). The Euclidean distance is generally used, d(qi, cj) = (qi − cj)², which can also be understood as a distortion. Each matrix element (i, j) represents an alignment of the points qi and cj. The DP algorithm then finds a path through a set of grid points; the grid points the path passes through are the aligned point pairs of the two sequences.
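The distance grid itself is straightforward to build. A minimal sketch, assuming one-dimensional feature values and the squared Euclidean distance above:

```python
def distance_matrix(q, c):
    """n x m grid where element (i, j) = d(q[i], c[j]) = (q[i] - c[j])**2."""
    return [[(qi - cj) ** 2 for cj in c] for qi in q]
```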

How do we find this path? And what makes one path the best? This is exactly the question raised above: what kind of warping is best?

We call this path a warping path and denote it by W. Its k-th element is defined as wk = (i, j)k, which defines a mapping between Q and C. Thus we have:

W = w1, w2, ..., wk, ..., wK,   where max(m, n) ≤ K ≤ m + n − 1.

First, this path is not chosen arbitrarily; it must satisfy the following constraints:

1) Boundary condition: w1 = (1, 1) and wK = (n, m). Although the speaking rate of any utterance may vary, the order of its parts cannot change, so the chosen path must start at the lower-left corner and end at the upper-right corner.

2) Continuity: if wk−1 = (a', b'), then the next point of the path, wk = (a, b), must satisfy (a − a') ≤ 1 and (b − b') ≤ 1. That is, a point cannot be skipped when matching; each point can only be aligned with its own neighbors. This ensures that every coordinate of Q and C appears in W.

3) Monotonicity: if wk−1 = (a', b'), then the next point of the path, wk = (a, b), must satisfy (a − a') ≥ 0 and (b − b') ≥ 0. The points along W must be monotonic in time; this guarantees that the dotted alignment lines in figure B do not cross.

Combining the continuity and monotonicity constraints, from each grid point the path can move in only three directions. For example, if the path has passed through the grid point (i, j), the next grid point can only be one of (i + 1, j), (i, j + 1), or (i + 1, j + 1).
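The three constraints above can be checked mechanically. A small sketch, using 1-based grid indices as in the text (the function name is illustrative):

```python
def is_valid_path(path, n, m):
    """Check the boundary, continuity, and monotonicity constraints on a
    warping path given as a list of 1-based (i, j) grid points."""
    if path[0] != (1, 1) or path[-1] != (n, m):
        return False  # boundary condition violated
    for (a0, b0), (a1, b1) in zip(path, path[1:]):
        da, db = a1 - a0, b1 - b0
        # continuity: each step moves at most 1 in each direction;
        # monotonicity: steps are never negative; and each step must move
        if da not in (0, 1) or db not in (0, 1) or (da, db) == (0, 0):
            return False
    return True
```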

Exponentially many paths satisfy the constraints above. What we are interested in is the path that minimizes the following warping cost:

DTW(Q, C) = min{ sqrt( w1 + w2 + ... + wK ) / K }

where wk denotes the distance d(qi, cj) of the grid point at the k-th step of the path.

The K in the denominator compensates for warping paths of different lengths. So what is our goal, i.e., what is DTW's idea? It is to stretch and compress the two time series so as to find the warping under which the distance between them is smallest, i.e., under which they are most similar; that shortest distance is the final distance measure between the two series. We therefore need to choose the path that minimizes the final total distance.

For this we define a cumulative distance. Matching the sequences Q and C from (0, 0), we accumulate at each point the distances computed at all previous points. On reaching the end point (n, m), this cumulative distance is the final total distance mentioned above, i.e., the similarity between sequences Q and C.

The cumulative distance γ(i, j) can be expressed as follows: it is the distance d(i, j) at the current point, i.e., the Euclidean distance (similarity) between the points qi and cj, plus the cumulative distance of the smallest neighboring element from which the point can be reached:

γ(i, j) = d(qi, cj) + min{ γ(i − 1, j − 1), γ(i − 1, j), γ(i, j − 1) }

The optimal path is the one that minimizes the cumulative distance along it, and it can be found with the dynamic programming algorithm.
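Putting the distance grid, the cumulative-distance recurrence, and the dynamic programming fill together, DTW can be sketched in a few lines of Python. This assumes one-dimensional features and the squared Euclidean distance; it is an illustration, not an optimized implementation:

```python
def dtw(q, c):
    """Return the DTW distance between sequences q and c using the
    cumulative-distance recurrence
        gamma(i, j) = d(i, j) + min(gamma(i-1, j-1), gamma(i-1, j), gamma(i, j-1))
    """
    n, m = len(q), len(c)
    inf = float("inf")
    # gamma[i][j]: cumulative distance; row 0 and column 0 are padding
    # set to infinity so the boundary condition w1 = (1, 1) is enforced
    gamma = [[inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2  # local distance d(qi, cj)
            gamma[i][j] = d + min(gamma[i - 1][j - 1],  # diagonal step
                                  gamma[i - 1][j],      # step in Q only
                                  gamma[i][j - 1])      # step in C only
    return gamma[n][m]
```

Note that stretching a point (e.g., matching [1, 2, 3] against [1, 2, 2, 3]) costs nothing here, which is exactly the alignment behavior motivated in the overview.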

For an intuitive example of the search process, see:

http://www.cnblogs.com/tornadomeet/archive/2012/03/23/2413363.html

 

III. Application of DTW in speech recognition

Suppose an isolated-word speech recognition system uses template matching for recognition, generally with the whole word as the recognition unit. In the training phase, features are extracted from each word in the vocabulary and stored in the template library as templates. In the recognition phase, features are likewise extracted from the new word to be recognized, and the DTW algorithm is used to match them against every template in the library and compute the distances. The template with the smallest distance is the most similar word, which is returned as the recognition result.
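The recognition loop described above can be sketched as follows. The template dictionary and feature values here are invented one-dimensional illustrations; a real system would extract per-frame feature vectors (such as MFCCs) for each word:

```python
def dtw(q, c):
    """DTW distance via the cumulative-distance recurrence."""
    inf = float("inf")
    gamma = [[inf] * (len(c) + 1) for _ in range(len(q) + 1)]
    gamma[0][0] = 0.0
    for i in range(1, len(q) + 1):
        for j in range(1, len(c) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = d + min(gamma[i - 1][j - 1],
                                  gamma[i - 1][j],
                                  gamma[i][j - 1])
    return gamma[len(q)][len(c)]

def recognize(test_features, templates):
    """Return the vocabulary word whose stored template is closest to the
    test utterance under the DTW distance."""
    return min(templates, key=lambda word: dtw(test_features, templates[word]))

# Hypothetical one-dimensional "feature" templates for a two-word vocabulary
templates = {"pen": [1, 3, 2, 1], "pin": [1, 5, 4, 1]}
```

A test utterance spoken more slowly than the template (e.g., [1, 3, 3, 2, 1]) still matches "pen" with zero distance, because DTW absorbs the stretched frame.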

 

IV. References

[1] http://baike.baidu.com/view/1647336.htm

[2] http://www.cnblogs.com/tornadomeet/archive/2012/03/23/2413363.html

[3] http://www.cnblogs.com/luxiaoxun/archive/2013/05/09/3069036.html (with Matlab/C ++ code)

[4] Eamonn J. Keogh, Derivative Dynamic Time Warping

[5] Zhao Li, Speech Signal Processing
