Speech Recognition Process Analysis 1


1. Sampling: the spectrum of the voice signal is concentrated mainly within the range of 300 Hz to 3400 Hz, so the G.711 digital telephony standard proposed by the ITU uses a sampling frequency of 8 kHz;

2. Quantization precision: 8 bits;

3. Frame splitting: the speech signal is approximately stationary over short intervals of 10 to 30 ms, so the frame length is chosen as 20 to 30 ms. For example, if a frame contains 160 sampling points, the frame length is 160 × 1/8000 = 20 ms. The frame shift is 10 ms;

4. Frame size: 160 samples × 8 bit = 1280 bit (160 bytes);

5. Buffer segment size: 8 KB.
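For reference, the front-end parameters above can be collected in one header. The following C sketch is only illustrative; all identifiers are our own, and the compile-time checks simply verify the arithmetic in points 3 and 4.

    /* speech_params.h - illustrative constants for the front end described
     * above. Identifiers are hypothetical; only the values come from the text. */
    #ifndef SPEECH_PARAMS_H
    #define SPEECH_PARAMS_H

    #define SAMPLE_RATE_HZ  8000u          /* G.711 telephony sampling rate */
    #define SAMPLE_BITS     8u             /* quantization precision        */
    #define FRAME_SAMPLES   160u           /* samples per analysis frame    */
    #define FRAME_SHIFT_MS  10u            /* frame shift                   */
    #define SEGMENT_BYTES   (8u * 1024u)   /* one cache segment             */

    /* Derived quantities, matching points 3 and 4 above. */
    #define FRAME_LEN_MS    (FRAME_SAMPLES * 1000u / SAMPLE_RATE_HZ)  /* 20 ms */
    #define FRAME_BYTES     (FRAME_SAMPLES * SAMPLE_BITS / 8u)        /* 160 B */

    /* Compile-time sanity checks (C11). */
    _Static_assert(FRAME_LEN_MS == 20u, "frame length should be 20 ms");
    _Static_assert(FRAME_BYTES == 160u, "frame should occupy 160 bytes");

    #endif /* SPEECH_PARAMS_H */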

Detailed analysis:

The analog voice signal is sampled by a UDA1341TS audio codec chip whose sampling frequency is set to 8 kHz, i.e. 8000 sampling points per second. Each sample is quantized with 8-bit precision, so one second of speech occupies 8000 × 8 bit = 64 × 10³ bit = 8000 bytes (slightly less than 8 KB). The sampled signal is sent to RAM;

In system RAM, a three-segment cache area is set up, each segment being 8 KB, so one segment fills up in about one second. The three segments are used cyclically (as a ring), which improves the real-time performance of the system; a minimal sketch of such a buffer is shown below;
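The text does not give code for the buffer rotation, so the following C fragment is only a sketch of one common way to implement it, assuming the codec's interrupt handler fills one segment while the recognizer drains another. All names are hypothetical.

    #include <stddef.h>
    #include <stdint.h>

    #define NUM_SEGMENTS  3
    #define SEGMENT_BYTES (8 * 1024)

    /* Three 8 KB segments used as a ring: the ISR fills one segment with
     * codec samples while the main loop processes an already-full one. */
    static uint8_t      segments[NUM_SEGMENTS][SEGMENT_BYTES];
    static volatile int full[NUM_SEGMENTS];   /* 1 = ready for processing */
    static volatile int write_seg = 0;        /* segment the ISR fills    */
    static volatile int write_pos = 0;
    static int          read_seg  = 0;        /* segment being analyzed   */

    /* Called for each sample delivered by the codec (e.g. from an ISR). */
    void on_sample(uint8_t s)
    {
        segments[write_seg][write_pos++] = s;
        if (write_pos == SEGMENT_BYTES) {      /* segment full (~1 second) */
            full[write_seg] = 1;
            write_seg = (write_seg + 1) % NUM_SEGMENTS;
            write_pos = 0;
        }
    }

    /* Called from the main loop: returns a full segment, or NULL if none. */
    uint8_t *next_full_segment(void)
    {
        if (!full[read_seg])
            return NULL;
        uint8_t *p = segments[read_seg];
        full[read_seg] = 0;
        read_seg = (read_seg + 1) % NUM_SEGMENTS;
        return p;
    }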

When a cache segment is full, the system takes its data and analyzes it. First, the digital voice signal is windowed and split into frames: each frame is 20 ms long, the frame shift is 10 ms, and each frame contains 160 sampling points. Endpoint detection is then performed frame by frame. If a frame is found not to be the start of the voice signal, it is simply discarded; once the starting point of the voice signal is detected, feature parameters are extracted from every frame beginning there and stored in the feature-parameter template library (a segment of memory). When the end point of the voice signal is detected, feature extraction and storage stop. During training this cycle is repeated once per training utterance; for example, to train on 100 speech samples, the cycle runs 100 times.

The endpoint detection algorithm computes the average energy of each frame of the framed audio signal. When the energy exceeds a threshold for several consecutive frames, the signal is considered to be inside a voice segment; the algorithm then searches back toward the front of the signal, and the last point at which the zero-crossing rate remains above a preset threshold is taken as the starting point of the speech segment. The end point is determined from short-term energy: if the energy of several consecutive frames falls below the preset threshold, the speech segment ends. Endpoint detection accuracy directly affects recognition accuracy, and many methods exist to improve it; this module analyzes the speech endpoint detection algorithm and proposes an improved endpoint detection algorithm to raise the accuracy of the decision.
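The text only describes the detection rules, so the C fragment below is an illustrative sketch of the basic double-threshold (energy plus zero-crossing-rate) scheme, not the improved algorithm this module proposes; the threshold values and frame layout are assumptions.

    #include <stdlib.h>

    #define FRAME_SAMPLES 160

    /* Short-term average magnitude of one frame (a simple energy measure). */
    static long frame_energy(const signed char *frame)
    {
        long e = 0;
        for (int i = 0; i < FRAME_SAMPLES; i++)
            e += abs(frame[i]);
        return e / FRAME_SAMPLES;
    }

    /* Zero-crossing rate: number of sign changes within the frame. */
    static int frame_zcr(const signed char *frame)
    {
        int z = 0;
        for (int i = 1; i < FRAME_SAMPLES; i++)
            if ((frame[i - 1] >= 0) != (frame[i] >= 0))
                z++;
        return z;
    }

    /* Assumed, tunable thresholds. */
    #define ENERGY_TH 10   /* energy threshold per frame            */
    #define ZCR_TH    25   /* zero-crossing-rate threshold          */
    #define MIN_RUN   3    /* consecutive frames required as speech */

    /* Double-threshold start-point decision over a sequence of frames.
     * Returns the index of the first speech frame, or -1 if none found. */
    int find_speech_start(const signed char (*frames)[FRAME_SAMPLES], int n)
    {
        int run = 0;
        for (int i = 0; i < n; i++) {
            run = (frame_energy(frames[i]) > ENERGY_TH) ? run + 1 : 0;
            if (run >= MIN_RUN) {           /* energy confirms speech here */
                int s = i - run + 1;
                /* search back toward the front while the ZCR stays high */
                while (s > 0 && frame_zcr(frames[s - 1]) > ZCR_TH)
                    s--;
                return s;
            }
        }
        return -1;
    }

The end point is found symmetrically: once inside a speech segment, the segment is declared finished when the energy of several consecutive frames falls back below the threshold.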

During speech recognition, the feature parameters of the input, processed in the same way, are matched against the feature parameters in the template library using the DTW algorithm.

DTW algorithm:

The Dynamic Time Warping (DTW) algorithm nonlinearly warps the time axis of the feature parameters of the speech signal to be recognized against that of the reference speech signal, so that the two pronunciations can be better matched and their similarity measured. Each template in the feature template library is called a reference template, denoted R, and the feature vector sequence obtained by processing the speech signal to be recognized is called a test template, denoted T. Comparing the test template T with the reference template R yields the distortion between them: the smaller the distortion, the higher the similarity. The overall distortion between the test template and the reference template is written d[T, R]. To compute it, the distortion between corresponding frames of the test and reference templates must be accumulated. Let n and m be arbitrary frame indices in T and R respectively; the distortion between the two frames is then written d[T(n), R(m)]. The distance function depends on the distance measure adopted; the DTW algorithm usually uses the Euclidean distance. The formula is as follows:
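The formula itself did not survive in the source; the standard Euclidean distance between two K-dimensional feature vectors, which matches the description, is

    d[T(n), R(m)] = \sqrt{ \sum_{k=1}^{K} \bigl( T_k(n) - R_k(m) \bigr)^2 }

where K is the dimension of the feature vectors and T_k(n), R_k(m) are the k-th components of frame n of T and frame m of R.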

When computing the overall distortion d[T, R], the number of frames in the test template and the reference template usually differs. In that case the test template must be mapped onto the reference template by a time-warping method, and the frame-to-frame distortions along the new alignment are then accumulated to obtain the total distortion d[T, R]. The specific method is as follows:

Mark the N frames of the test template and the M frames of the reference template on the axes of a two-dimensional Cartesian coordinate system to construct the grid shown in Figure 1. Each intersection (n, m) in the grid represents the pairing of frame n of the test template with frame m of the reference template, with distortion d[T(n), R(m)]. Computing the distortion at every intersection yields the frame-matching distance matrix.

Figure 1. Minimum-distortion path in dynamic time warping

With the start and end points of the test template and the reference template aligned, DTW searches for a warping function that nonlinearly maps the time axis n of the test template onto the time axis m of the reference template, such that the path from the starting intersection to the ending intersection minimizes the total frame distortion accumulated over all intersections on the path, that is:
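The equation is also missing from the source; the standard formulation consistent with the description above is

    d[T, R] = \min_{\text{path}} \sum_{(n, m) \in \text{path}} d[T(n), R(m)]

which is usually computed with the dynamic-programming recurrence

    D(n, m) = d[T(n), R(m)] + \min \{ D(n-1, m),\ D(n-1, m-1),\ D(n, m-1) \}

with D(1, 1) = d[T(1), R(1)] and d[T, R] = D(N, M).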

In speech recognition, a reference template is created for each voice command. DTW pattern matching is performed between the test template to be recognized and every reference template, giving a cumulative distance d[N, M] for each; the reference template with the smallest cumulative distance has the highest similarity, and its corresponding speech is the recognition result, as sketched below.
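The source gives no implementation, so the following C function is a minimal sketch of the recurrence above, assuming K-dimensional float feature vectors and an unconstrained warping window; a real embedded system would likely add path constraints and fixed-point arithmetic.

    #include <math.h>

    #define K          12   /* assumed feature dimension per frame      */
    #define MAX_FRAMES 64   /* assumed upper bound on template length   */

    /* Euclidean distance between one frame of T and one frame of R. */
    static float frame_dist(const float t[K], const float r[K])
    {
        float s = 0.0f;
        for (int k = 0; k < K; k++) {
            float diff = t[k] - r[k];
            s += diff * diff;
        }
        return sqrtf(s);
    }

    /* DTW cumulative distance between a test template T (N frames) and a
     * reference template R (M frames), using the recurrence
     * D(n,m) = d(n,m) + min(D(n-1,m), D(n-1,m-1), D(n,m-1)). */
    float dtw_distance(const float T[][K], int N, const float R[][K], int M)
    {
        static float D[MAX_FRAMES][MAX_FRAMES];

        D[0][0] = frame_dist(T[0], R[0]);
        for (int n = 1; n < N; n++)
            D[n][0] = D[n - 1][0] + frame_dist(T[n], R[0]);
        for (int m = 1; m < M; m++)
            D[0][m] = D[0][m - 1] + frame_dist(T[0], R[m]);

        for (int n = 1; n < N; n++) {
            for (int m = 1; m < M; m++) {
                float best = D[n - 1][m - 1];
                if (D[n - 1][m] < best) best = D[n - 1][m];
                if (D[n][m - 1] < best) best = D[n][m - 1];
                D[n][m] = best + frame_dist(T[n], R[m]);
            }
        }
        return D[N - 1][M - 1];
    }

Recognition then reduces to calling dtw_distance once per reference template and choosing the template that returns the smallest cumulative distance.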

The DTW algorithm is suitable for the recognition of isolated words: when the test and reference templates are long, the amount of computation for training and recognition becomes very large, making it difficult to meet the high real-time requirements of a recognizer. Since what we need to implement for the moment is isolated-word recognition, this algorithm can be used. DTW has also been improved to raise its real-time performance, such as the improved DTW algorithm described in the journal paper "Design and Test of an On-board Speech Recognition System".
