I. Introduction
This article describes the implementation of embedded speech recognition technology on the 80251 kernel.
Abbreviations and terminology | Explanation
Speaker-dependent isolated-word speech recognition | Recognition of isolated words spoken by a specific, pre-trained person
Endpoint detection | Detection of the start and end points of valid speech segments
Feature parameter extraction | Extraction of feature parameters from the speech signal
DTW | Dynamic Time Warping
LPCC | Linear Prediction Cepstrum Coefficients
II. Introduction to Speech Recognition Technology
1. Application categories
(1) Speaker-dependent vs. speaker-independent recognition. Speaker-dependent recognition is relatively simple to implement and achieves a high recognition rate for trained speakers, but a very low rate for speakers who have not trained the system. Speaker-independent recognition has no such restriction, but its implementation is complicated and its recognition rate is relatively lower.
(2) Speech recognition vs. speaker (identity) recognition: the former extracts the features that utterances of the same word share across speakers, while the latter extracts the features that differentiate speakers. Speaker-based identity recognition is mainly used in access control and other security fields; speech recognition is widely used in word recognition, industrial control, and other fields.
(3) Continuous vs. non-continuous (isolated-word) speech recognition; continuous speech recognition is obviously more difficult. Embedded products focus on isolated-word speech recognition.
(4) Small-vocabulary vs. large-vocabulary speech recognition. The two call for different methods, and the choice involves a trade-off between recognition rate and recognition speed.
(5) Keyword recognition, such as spotting a keyword within a spoken sentence, or searching for a song based on its melody.
Limited by the computing and storage performance of the 80251, this system mainly implements speaker-dependent isolated-word speech recognition.
2. Implementation Principle
Speech recognition includes preprocessing, feature extraction, training, and recognition.
Preprocessing mainly includes noise reduction, pre-emphasis (to compensate for the high-frequency attenuation caused by lip radiation), and endpoint detection (detection of valid speech segments).
Feature extraction analyzes the feature parameters of the preprocessed speech signal. This process extracts the parameters that reflect the essential character of the voice from the original signal to form a feature vector sequence. The main feature parameters include linear prediction coding coefficients (LPC), linear prediction cepstrum coefficients (LPCC), and Mel-frequency cepstrum coefficients (MFCC).
Speech pattern database: a set of acoustic parameter templates. It is trained, using clustering analysis and similar methods, from the speech parameters of words repeated many times by one or more speakers.
Speech pattern matching: compares the feature parameters of the input speech against the trained speech pattern database to obtain the recognition result. Common methods include dynamic time warping (DTW), artificial neural networks (ANN), hidden Markov models (HMM), and so on. DTW is simple and practical, and is well suited to isolated-word speech recognition; HMM is complex and suited to large-vocabulary continuous speech recognition.
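As an illustration of how DTW matches two feature sequences, here is a minimal C sketch. The city-block frame distance, the rolling one-row cost array, and all names are illustrative assumptions, not this project's exact implementation:

#define ORDER  16           /* feature parameters per frame          */
#define MAXF   250          /* maximum frames per word               */
#define INF32  0x3FFFFFFFL  /* "infinity" with headroom for adds     */

typedef signed short int16;
typedef signed long  int32;

/* City-block distance between two feature frames. */
static int32 frame_dist(const int16 *a, const int16 *b)
{
    int32 d = 0;
    int i;
    for (i = 0; i < ORDER; i++)
        d += (a[i] > b[i]) ? (a[i] - b[i]) : (b[i] - a[i]);
    return d;
}

/* DTW distance between a template of n frames and an input of m
 * frames, using one rolling cost row to keep RAM usage small. */
int32 dtw(const int16 tpl[][ORDER], int n, const int16 in[][ORDER], int m)
{
    static int32 row[MAXF + 1];   /* ~1 KB of cost cells */
    int32 diag, up, left, best;
    int i, j;

    for (j = 0; j <= m; j++) row[j] = INF32;
    row[0] = 0;

    for (i = 1; i <= n; i++) {
        diag = row[0];            /* cost(i-1, j-1) */
        row[0] = INF32;
        for (j = 1; j <= m; j++) {
            up = row[j]; left = row[j - 1];
            best = diag;
            if (up < best)   best = up;
            if (left < best) best = left;
            diag = row[j];        /* save cost(i-1, j) for next j */
            row[j] = best + frame_dist(tpl[i - 1], in[j - 1]);
        }
    }
    return row[m];
}

A single rolling row of 251 32-bit costs occupies about 1 KB, roughly matching the size of the DTW intermediate buffer mentioned in the porting section below.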
III. Difficulties in Embedded Speech Recognition
The key metric of speech recognition is the recognition rate. On a PC, the recognition rate is mainly limited by the methods the system selects, such as the accuracy of endpoint detection, the effectiveness of the feature parameters, and the effectiveness of the pattern matching method. Embedded speech recognition is affected not only by the selected methods but also by the numerical precision of the algorithms. A PC mainly uses floating-point arithmetic, while embedded systems mainly use fixed-point algorithms, so controlling operating precision and error is very important. A speech recognition system comprises multiple modules and many computation steps, and accumulated error has a fatal impact on the results. Therefore, during algorithm design, the precision of fixed-point values must be considered carefully: accuracy should be improved as much as possible, while overflow of intermediate results must also be prevented.
Embedded speech recognition must also consider recognition speed. A PC runs fast, and the user experience of medium-vocabulary speech recognition on a PC is quite good. In embedded speech recognition, however, the vocabulary size has a serious impact on the user experience: even if the recognition rate is high, a product whose recognition is very slow is hard to promote. Embedded speech recognition therefore requires a compromise between recognition rate and recognition speed. The recognition speed is limited not only by the computational complexity of the selected algorithm but also by the embedded hardware, such as the size of the data space. If the data RAM is not large enough, other media (such as flash or a memory card) must be used as a cache, and the frequent accesses to that slower cache during subsequent processing seriously affect the recognition speed. The recognition speed is thus constrained by the hardware conditions.
IV. Hardware Conditions and Considerations for 80251 Platform Speech Recognition
1. Hardware conditions for speech recognition on the 80251 platform
Here we assume an SoC that integrates the 251 kernel and implements a recording function, which is the most basic requirement for speech recognition. The hardware budget for speech recognition is the set of resources the system allocates to the recording application.
Recording Application resources generally include:
1) Audio buffer (512 bytes), mainly used as the data cache after sampling during recording. After sampling, the audio module places 512 bytes of recording data into this buffer, and the application then calls the file system write interface to write the buffer contents to flash.
2) edata variable data space (1024 bytes), mainly used as variable data space for the recording application and the middleware module. 622 bytes are already used by the recording application and middleware, leaving 402 bytes.
3) pcmram (12 KB), mainly used for data sampling during recording.
4) Code space (9 KB), the runtime space for recording application and middleware code.
2. Resources available for speech recognition on the 80251 platform
To implement speech recognition on the 80251 platform, the speech recognition function must be added on top of the recording application, so it must share the above resources with the recording application. Speech recognition divides into four sub-processes: preprocessing, feature extraction, training, and matching. These four sub-processes do not run at the same time, so reuse of the above resources is planned according to when each sub-process runs. Since the 80251 uses a hardware bank mechanism, code space is not a constraint for the speech recognition code; the emphasis is therefore on reuse of the data space. In addition, speech recognition generally processes PCM data. PCM data on the 80251 is quantized at 16 bits, that is, two bytes per sample. The useful frequency content of human speech is below 4 kHz, so an 8 kHz sampling rate is adopted to satisfy the Nyquist sampling theorem.
Preprocessing is a low-pass filtering stage that does not occupy additional data space; this system does not consider noise removal for the moment. The main part of preprocessing is speech endpoint detection, that is, removing the silent segments and retaining the valid speech segments. It is similar to the voice-activated recording feature of the recorder, but endpoint detection for speech recognition is more complex and requires more accuracy. The system uses a real-time online endpoint detection algorithm, so the audio buffer data must be processed in real time during detection (that is, while recording and listening for commands) to check whether the speech is valid. In this phase, neither the audio buffer nor pcmram is available; only the 402 bytes of edata space can be used. Since all stages of speech recognition require framing (described later), the system uses 128 samples per frame, so a 256-byte frame processing buffer must be allocated. The remaining 402 - 256 = 146 bytes serve as the variable data space for speech recognition.
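A sketch of the per-frame measurements that drive this kind of online endpoint detection, assuming the 128-sample frames chosen above; the threshold logic and state machine of the real detector are omitted, and the sum-of-magnitudes energy is a judgment call for this sketch, not necessarily the project's exact measure:

typedef signed short int16;
typedef signed long  int32;

#define FRAME_LEN 128   /* samples per frame, as chosen above */

/* Short-term energy of one frame. Sum of magnitudes is used instead
 * of squares to limit the dynamic range of the accumulator. */
int32 frame_energy(const int16 *x)
{
    int32 e = 0;
    int i;
    for (i = 0; i < FRAME_LEN; i++)
        e += (x[i] >= 0) ? x[i] : -(int32)x[i];
    return e;
}

/* Zero-crossing rate: the number of sign changes in the frame. */
int16 frame_zcr(const int16 *x)
{
    int16 z = 0;
    int i;
    for (i = 1; i < FRAME_LEN; i++)
        if ((x[i] >= 0) != (x[i - 1] >= 0))
            z++;
    return z;
}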
During endpoint detection, once the start of speech is confirmed from the short-term energy or zero-crossing rate of the current frame, there are two ways to handle the frame's data: one is to copy the current 512 bytes (one sector) into a speech buffer; the other is to call the file system write interface and write it to flash. The former is certainly faster, because it avoids file system calls during detection. In addition, a sudden noise pulse can make the system mistake noise for the start of speech, so after the noise is identified, detection must restart. A large speech buffer would therefore greatly improve performance. The speech duration of an isolated word is generally under 2 seconds, that is, 2 × 8000 × 2 ≈ 32 KB, but the 80251 platform clearly does not have that much RAM. We therefore chose to write each sector into a file, using flash as the speech buffer. The 80251 file system has an additional limitation: fseek cannot be called during a write. This slows endpoint detection. As mentioned above, pulse noise can satisfy the speech-start condition, so a stretch of sound data after the false start point gets written into the flash file; when the detection algorithm later determines that it was pulse noise, that data should be overwritten. If fseek could be called, overwriting in place would be the end of the story. Because of this limitation, the file must instead be closed with fsclose, deleted with fs_remove, and recreated with fs_create.
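A hedged sketch of that recovery path; fsclose, fs_remove, and fs_create are the interfaces named above, but the parameter lists shown are assumptions:

/* Platform file-system calls named above; the parameter lists shown
 * here are assumptions for illustration only. */
extern void fsclose(int fd);
extern void fs_remove(const char *path);
extern int  fs_create(const char *path);

/* Discard a false speech start caused by pulse noise: with no fseek
 * during write, the partially written file is closed, deleted, and
 * recreated before detection resumes. */
void restart_speech_file(int *fd, const char *path)
{
    fsclose(*fd);
    fs_remove(path);
    *fd = fs_create(path);
}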
The three sub-processes after endpoint detection run outside recording time, so the audio buffer and pcmram are both available. Since subsequent processing reads data from the flash file one sector at a time, the audio buffer is used as the read buffer. The data space available to the later stages of speech recognition is therefore the 12 KB of pcmram.
The feature extraction stage mainly needs a buffer for the feature parameters; other buffers it requires can be placed in the far data space (that is, code space used as data, though only by code in the current bank). If each speech frame's features were written to the flash feature file immediately after extraction, no feature parameter buffer would be needed; however, the file system on the 80251 platform currently does not support reading and writing at the same time, so a feature parameter buffer must be allocated, and each voice command is written to the flash feature file only after it has been fully processed. A command utterance can last up to 2 seconds. With a frame shift of 64 samples, that is 2 × 8000/64 = 250 frames; with 16 LPC feature parameters per frame (the algorithm uses a 16th-order LPC, two bytes per parameter), a total of 250 × 16 × 2 = 8000 bytes is required. Since pattern matching must buffer the feature parameters of one reference entry and one entry to be recognized at the same time, two 8000-byte buffers would exceed the 12 KB of pcmram. The feature parameter buffer can therefore be at most 6 KB, which leaves two options. One is to keep 16 parameters per frame and limit the valid speech duration to 6 × 1024/(2 × 16 × (8000/64)) = 1.536 seconds; constraining a spoken command to about 1.5 seconds is acceptable for a normal speaker. The other is to reduce the LPC order per frame: 12th order is the usual choice for speech recognition, while this program uses 16th order; choosing 8th order would be faster, but precision would decrease.
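The sizing arithmetic above can be pinned down as compile-time constants; the names are illustrative:

#define SAMPLE_RATE   8000                /* Hz                          */
#define FRAME_SHIFT   64                  /* samples between frames      */
#define LPC_ORDER     16                  /* feature parameters / frame  */
#define BYTES_PER_PAR 2                   /* 16-bit parameters           */
#define FEAT_BUF      (6 * 1024)          /* per-word feature buffer     */

/* Frames that fit in one buffer: 6144 / (16 * 2) = 192. */
#define MAX_FRAMES    (FEAT_BUF / (LPC_ORDER * BYTES_PER_PAR))

/* Maximum speech duration: 192 * 64 / 8000 Hz = 1.536 s = 1536 ms. */
#define MAX_SPEECH_MS (MAX_FRAMES * FRAME_SHIFT * 1000L / SAMPLE_RATE)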
The pattern matching stage mainly uses the two feature parameter buffers described above. Because the DTW algorithm is used, the intermediate cache required by the algorithm is allocated in the far data space.
To sum up, the 80251 platform can implement the speech recognition function, albeit with limited recognition speed.
V. Speech Recognition Algorithm Design
Any algorithm should be verified by testing in the PC environment before it is ported to the embedded platform, and the PC-side algorithm design should be grounded in familiarity with the target platform's hardware structure and its compilation environment. Understanding both as deeply as possible pays off greatly during the subsequent port; otherwise, the work takes twice the effort for half the result.
VI. PC-side Speech Recognition Algorithm Design
1. Simulating the 80251 recording process for endpoint detection
The PC reads files to simulate the 80251 recording process. Details of the endpoint detection setup are as follows:
To keep the data consistent, the PC debugging files are all recordings made on the 80251 (8 kHz, PCM, 16-bit, mono). In addition, because the four sub-processes of speech recognition are banked on the 80251, the number of valid frames found by endpoint detection is written into a reserved area of the WAV file header to keep the modules decoupled; subsequent stages can then read this parameter directly from the file.
When recording ends, the 80251 automatically prepares a sector of file-header data, and the data directly after the header begins with a silent segment. Therefore, some header parameters (such as the file length) must be modified before the header is written back.
In fact, the voice data could be kept as a bare binary file with no header information. For ease of debugging, however, a correct file header is still written so that the endpoint detection output remains a WAV file; by listening to that WAV file, you can tell whether the algorithm is effective.
The algorithm's results benefit from visualization, so MATLAB is used for plotting to assist debugging and speed up the debugging process.
2. Data Processing Accuracy
The PCM data of the 80251 platform uses 16-bit quantization, but in practice 8-bit quantization is sufficient for speech processing. Therefore, only the high 8 bits of each recorded sample are processed. This halves the computational workload and helps prevent overflow.
Although only the high 8 bits are used, each sample is still represented as a 16-bit signed short integer so that the program as a whole runs smoothly: short int in VC, signed int on the 80251. The resulting feature parameters are also expressed as 16-bit signed integers. During computation, 32-bit signed integers are sometimes used to improve accuracy.
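A one-function sketch of the idea (not necessarily the exact code used):

typedef signed short int16;

/* Reduce a 16-bit PCM sample to its high 8 bits while keeping the
 * int16 representation used throughout the pipeline. */
int16 hi8(int16 s)
{
    return (int16)(s >> 8);   /* arithmetic shift preserves the sign */
}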
3. Type migration
As mentioned in the preceding section, to ease porting, the variables in the VC program must have the same types as the variables in the Keil code. Therefore, a typeext.h file from the Keil environment needs to be redefined in VC. For example:
#define int16 signed short int
#define int32 signed int
And so on.
In this way, the main VC algorithm code can be compiled and debugged in the Keil environment without any modification.
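A minimal sketch of what such a compatibility header might look like, assuming the VC build is detected via _MSC_VER (an assumption, not the project's actual file); the Keil side mirrors the type widths stated in the previous section:

/* typeext.h -- sketch of the shared type mapping; only the two
 * macros shown above are mirrored here. */
#ifndef TYPEEXT_H
#define TYPEEXT_H

#ifdef _MSC_VER                    /* building under VC              */
#define int16 signed short int
#define int32 signed int
#else                              /* Keil C251: int is 16-bit       */
#define int16 signed int
#define int32 signed long
#endif

#endif /* TYPEEXT_H */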
4. Fixed-point conversion of the floating-point algorithms
This is the most important part of the speech recognition work: controlling the cumulative error while converting the floating-point algorithms to fixed point is the most critical issue in the system. The algorithms described earlier are all floating-point algorithms. Speech recognition on the PC achieves a very high recognition rate partly because it can afford sophisticated algorithms, and partly because floating-point arithmetic controls cumulative error better: all parameters are normalized so that operations stay between -1 and 1.
A floating-point library places a huge burden on an embedded platform and runs very slowly. The key for an embedded product is therefore how to control the error introduced while converting the floating-point algorithms to fixed point. An effective way to control error is to increase the precision of the data operations, but higher precision makes overflow more likely, so a compromise must be struck between the two.
A major difficulty in debugging this part was the handling of E (the prediction error) in the LPC computation: E is used as a divisor inside the recursion loop, and if error control within the loop fails, E can reach 0 at some iteration and the result overflows. The debugging was genuinely hard; more than once I was tempted to fall back on a floating-point library, but in the end I persisted with the fixed-point algorithm.
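For reference, here is the textbook floating-point Levinson-Durbin recursion, which shows where E enters as a divisor at every order; this is a generic sketch, not this project's fixed-point code:

#define ORDER 16

/* Levinson-Durbin recursion: from autocorrelation r[0..ORDER],
 * compute LPC coefficients a[1..ORDER] and reflection coefficients
 * k[1..ORDER]. E is the residual energy; note that it divides at
 * every order, which is exactly where a fixed-point version must
 * keep it from collapsing to zero. Returns -1 on failure. */
int levinson(const double *r, double *a, double *k)
{
    double E = r[0], acc, tmp[ORDER + 1];
    int i, j;

    if (E <= 0.0) return -1;
    for (i = 1; i <= ORDER; i++) {
        acc = r[i];
        for (j = 1; j < i; j++)
            acc -= a[j] * r[i - j];
        k[i] = acc / E;                 /* E as divisor */
        a[i] = k[i];
        for (j = 1; j < i; j++)
            tmp[j] = a[j] - k[i] * a[i - j];
        for (j = 1; j < i; j++)
            a[j] = tmp[j];
        E *= (1.0 - k[i] * k[i]);
        if (E <= 0.0) return -1;        /* error control failed */
    }
    return 0;
}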
For this system's algorithms, the fixed-point conversion mainly considers the following points.
1. As stated in the "Data Processing Accuracy" section above, only the high 8 bits of the recording data are processed.
2. The autocorrelation coefficients of the LPC algorithm are normalized to 12 bits of valid data, that is, R(0) = 4096 (see the sketch after this list). In addition, if an accumulated value exceeds 0x7FFF during the computation, a corresponding shift is applied to keep it below 0x7FFF. Because all autocorrelation coefficients are divided by the same value at the same time, this has no impact on subsequent results.
3. The E of the LPC algorithm is a 32-bit signed number, and the intermediate variables of the recursion are also 32-bit.
4. The A and K parameters calculated by the LPC recursion are kept to 12 bits of valid data, that is, values below 4096.
5. Division in the LPC algorithm is carried out with an error below 0.5, that is, the remainder is weighed against the divisor for rounding.
6. The DTW algorithm uses 32-bit precision.
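A sketch of point 2's normalization (names are illustrative, and this is one plausible reading of the text, not the author's exact code):

typedef signed short int16;
typedef signed long  int32;

#define FRAME_LEN 128
#define LPC_ORDER 16

/* Autocorrelation R(0..16) of one frame, normalized so that
 * R(0) = 4096 (12 bits of valid data, point 2 above). Dividing every
 * R(k) by the same value leaves the later LPC recursion unaffected.
 * With high-8-bit samples, acc[0] <= 128 * 128 * 128 = 2^21, so the
 * int32 accumulators cannot overflow here; the shift-down guard from
 * point 2 would be needed for wider input data. */
void autocorr_q12(const int16 *x, int16 *r)
{
    int32 acc[LPC_ORDER + 1];
    int32 den;
    int i, k;

    for (k = 0; k <= LPC_ORDER; k++) {
        acc[k] = 0;
        for (i = k; i < FRAME_LEN; i++)
            acc[k] += (int32)x[i] * x[i - k];
    }

    if (acc[0] < 4096) {                  /* near-silent frame */
        for (k = 0; k <= LPC_ORDER; k++) r[k] = 0;
        return;
    }
    den = acc[0] >> 12;                   /* so acc[0]/den ~= 4096 */
    for (k = 0; k <= LPC_ORDER; k++)
        r[k] = (int16)(acc[k] / den);     /* |r[k]| <= r[0] < 0x7FFF */
}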
A practice that worked well for fixed-point debugging: implement the floating-point and fixed-point algorithms as two separate projects in VC, then track the result of every computation step in both. If the error grows too large, adjust immediately.
5. Speed Optimization
When designing the algorithms, given that the 80251's multiplication and division capabilities are limited, shifts and lookup tables are used to increase computing speed. Examples (a combined sketch follows the list):
1. Pre-emphasis and similar steps need to multiply by 0.9375; instead, multiply by 120 and then shift right 7 places, that is, divide by 128 (120/128 = 0.9375).
2. If cos() and other math functions were called directly from the window formula during windowing, the growth in code size and the loss of speed would be fatal. A lookup table is used instead: hamming(128) in MATLAB generates the 128-point window coefficients; because these are decimals less than 1, they are scaled up by 1024 and saved in a table (placed in far data space). At run time, each sample is multiplied by the table value and then shifted right by 10 bits, neatly avoiding any division.
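A combined sketch of both tricks; the table would be generated offline with MATLAB's hamming(128) scaled by 1024, and only the scheme is shown (names are illustrative):

typedef signed short int16;
typedef signed long  int32;

#define FRAME_LEN 128

/* Q10 Hamming window, generated offline as round(hamming(128)*1024)
 * and stored in far data space; the 128 values are elided here. */
extern const int16 hamming_q10[FRAME_LEN];

/* Pre-emphasis: y[n] = x[n] - 0.9375 * x[n-1],
 * with 0.9375 realized as (v * 120) >> 7, since 120/128 = 0.9375. */
void pre_emphasis(int16 *x, int n)
{
    int16 prev = 0, cur;
    int i;
    for (i = 0; i < n; i++) {
        cur  = x[i];
        x[i] = cur - (int16)(((int32)prev * 120) >> 7);
        prev = cur;
    }
}

/* Windowing: multiply by the Q10 table and shift right 10 bits,
 * avoiding both cos() calls and division at run time. */
void apply_window(int16 *x)
{
    int i;
    for (i = 0; i < FRAME_LEN; i++)
        x[i] = (int16)(((int32)x[i] * hamming_q10[i]) >> 10);
}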
VII. Speech Recognition on the 80251 Platform
1. Algorithm porting
Because the PC-side algorithm already accounted for the 80251 recording process and the constraints of the Keil C environment as much as possible, the port went relatively smoothly; the speech endpoint detection, in particular, was ported without any algorithm changes. The recording application of course needed adjustment: the main changes are UI control and display, plus a modification of the middleware recording flow, namely calling the wav_vad function before the original write operation so that endpoint detection runs first and only data from the confirmed start of speech is written into the flash recording file.
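A sketch of that modified write path; wav_vad is the function named above, but its signature and the fs_write call are assumptions for illustration:

/* Interfaces assumed for illustration: wav_vad is named in the text,
 * but its signature and fs_write are guesses, not the real API. */
extern int  wav_vad(const unsigned char *buf, int len);
extern void fs_write(int fd, const unsigned char *buf, int len);

/* Modified middleware write path: run endpoint detection on each
 * filled audio buffer before committing it to flash. */
void on_audio_buffer_full(int fd, const unsigned char *audio_buf)
{
    if (wav_vad(audio_buf, 512))        /* valid speech confirmed?  */
        fs_write(fd, audio_buf, 512);   /* keep this sector         */
    /* silent sectors before the speech start are simply dropped */
}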
Adjusting feature extraction, training, and recognition mainly meant allocating code runtime space and far data space. The intermediate variables of the DTW algorithm total about 1 KB and are allocated in far data space, because they are used only in this stage.
2. Comparison between VC and Keil C
Differences between the VC and Keil C compilers surfaced during debugging. Because these behaviors are relatively well hidden, discovering them took a lot of time. Several points are listed below:
1. int32 m = (int32)(val1 * val2), where val1 and val2 are int16: VC assigns the full 32-bit multiplication result to m, while Keil C clears the high 16 bits of the result before assigning it to m. This looks strange, but it is what happens; the result was correct only once the cast was removed. So a forced type conversion is not always a good thing (a portable idiom is sketched after this list).
2. Some buffers in VC happen to be initialized to 0 automatically, but not in Keil C, so the LPC results were biased. Because the recognition rates on the device and on the PC differed so much, we had to debug with the same file on both at once, tracking the result of every step, and finally found this problem. Strictly speaking, it reflects a poor programming habit: a buffer should generally be cleared before use.
3. In VC, a short int value is promoted to 32 bits when shifted left, while Keil C does not widen the value range: bits shifted past 16 bits are discarded, so the left shift does not achieve the intended multiplication.
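The portable idiom for point 1 (and for the left shift in point 3) is to widen an operand before the operation rather than casting the 16-bit result; a small sketch:

typedef signed short int16;
typedef signed long  int32;

int32 mul_shift_demo(int16 val1, int16 val2)
{
    /* On a 16-bit-int compiler the product below is formed in 16
     * bits, so the high bits are already gone before the cast: */
    int32 m_bad  = (int32)(val1 * val2);

    /* Widening one operand first forces a full 32-bit multiply on
     * both compilers: */
    int32 m_good = (int32)val1 * val2;

    /* Same idea for left shifts (point 3): widen before shifting. */
    int32 s_good = (int32)val1 << 10;

    return m_good + s_good - m_bad;   /* keep all three values live */
}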
VIII. 80251 Platform Speech Recognition Results and Analysis
The recognition rate of speech recognition on the 80251 platform can be kept above 93% (with a vocabulary of fewer than 50 words, based on experience); vocabularies above 50 words have not yet been tested.
Note: This is an actual project. If you want to reference it, please indicate the source. Thank you!