I. Overview
As the most natural form of human-computer interaction, speech is changing people's lives and enriching the applications of multimedia technology. Speech recognition is an important branch of speech signal processing and has been an active research field in recent years. With the rapid development of technology, speech recognition is now used not only on desktop PCs and large workstations but also in embedded systems, for example in smart homes, Apple's Siri, and in-vehicle speech recognition systems. In the near future, speech recognition technology is likely to reach into every corner of people's lives.
II. Classification of Speech Recognition Systems
Based on the speaker's manner of speaking, speech recognition can be divided into isolated word recognition, connected word recognition, and continuous speech recognition. In isolated word recognition, the speaker says only one word or phrase at a time; each word or phrase counts as one entry in the vocabulary, and the approach is generally used in voice dialing systems. Connected word recognition supports a small grammar network, implemented internally as a state machine, and can provide simple control of household appliances; more complex connected word systems can be used for telephone voice queries, airline ticket booking, and similar services. Continuous speech recognition handles speech produced in an everyday, natural manner and is generally used in dictation machines for voice input.
According to the type of speaker being recognized, speech recognition can be divided into speaker-dependent and speaker-independent recognition. Speaker-dependent recognition works for only one specific user, while speaker-independent recognition can be used by different users.
By vocabulary size, systems can be divided into small vocabulary (fewer than 100 words), medium vocabulary (100 to 500 words), and large vocabulary (more than 500 words).
Continuous speech recognition has been both the focus and the main difficulty of research in recent years. At present, most continuous speech recognition systems are built on the HMM (Hidden Markov Model) framework, with acoustic and linguistic knowledge introduced to improve it. The hardware platform is usually a powerful workstation or PC.
III. Principles of Speech Recognition
Speech recognition parses and interprets the speech signal captured by a microphone and converts it into the corresponding text or command.
A complete speech recognition system consists of three parts:
(1) Speech feature extraction (front-end processing): filters out interference components and extracts from the speech waveform a time-varying sequence of feature vectors that represents the speech content.
(2) Acoustic model and pattern matching (recognition algorithm): the acoustic model is usually generated by training on the extracted speech features, creating a pronunciation template for each pronunciation. During recognition, the input speech features are matched against the acoustic model to obtain the best recognition result.
(3) Semantic understanding (post-processing): the computer performs syntactic and semantic analysis on the recognition results to understand the meaning of the speech and respond accordingly, usually by means of a language model.
The principle of speech recognition is as follows:
The speech to be recognized is converted into an electrical signal by a microphone and fed to the input of the recognition system. After preprocessing, speech features are extracted, so that the original speech is represented by a small number of parameters that reflect the characteristics of the speech signal. Common speech features include linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), and Mel-frequency cepstral coefficients (MFCC).
Recognition proceeds in two phases: training and recognition. In the training phase, speech signals expressed as feature parameters are processed to obtain standard data describing the common characteristics of each basic unit. This data forms a reference template, and the reference templates of all recognizable basic units are combined into a reference pattern library. In the recognition phase, features are extracted from the speech signal to be recognized and matched one by one against the templates in the reference pattern library according to a chosen criterion; the pronunciation of the most similar reference template is the recognition result. Finally, post-processing of the result involves syntactic analysis, speech understanding, and semantic networks.
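The preprocessing that precedes feature extraction can be sketched roughly as follows. This is only an illustration under stated assumptions, not the front end of any particular system: the pre-emphasis coefficient of 0.97, the frame length and hop size, the Hamming window, and the function names are all choices made for this example, and log-energy stands in for a real feature such as LPC, LPCC, or MFCC, which would be computed from each windowed frame instead.

```python
import math


def pre_emphasize(samples, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out


def split_into_frames(samples, frame_len, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


def hamming(frame):
    """Apply a Hamming window to reduce spectral leakage at the frame edges."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def log_energy(frame):
    """Placeholder feature: log of the frame energy (small offset avoids log(0))."""
    return math.log(sum(x * x for x in frame) + 1e-10)


def extract_features(samples, frame_len=200, hop=80):
    """Return one feature value per frame; a real system would return a vector per frame."""
    emphasized = pre_emphasize(samples)
    frames = split_into_frames(emphasized, frame_len, hop)
    return [log_energy(hamming(f)) for f in frames]
```

With the assumed defaults, one second of audio sampled at 8 kHz (8000 samples, 25 ms frames with a 10 ms hop) yields 98 frames, so the waveform is reduced to a sequence of 98 feature values.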
During recognition, the distance between the unknown speech pattern and each template in the speech template library is calculated according to the pattern matching principle in order to find the best match. The pattern matching methods used in speech recognition mainly include dynamic time warping (DTW), hidden Markov models (HMM), and artificial neural networks (ANN).
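Of the three methods, DTW is the simplest to show in a few lines. The sketch below computes a DTW distance between two feature sequences and picks the closest template from a library. It assumes a Euclidean distance between frames, no path constraints, and a library holding one template per word; the names `dtw_distance`, `recognize`, and `template_library` are hypothetical and chosen for this example.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best time alignment between two feature sequences."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j] is the best accumulated cost aligning the first i frames of
    # seq_a with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(features, template_library):
    """Return the label of the template with the smallest DTW distance."""
    return min(template_library,
               key=lambda label: dtw_distance(features, template_library[label]))


# Toy one-dimensional features; real templates would hold LPC or MFCC vectors.
template_library = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[5.0], [5.0], [5.0]],
}
unknown = [[0.0], [0.0], [1.0], [2.0], [2.0]]  # "yes" spoken more slowly
result = recognize(unknown, template_library)
```

Because DTW allows frames to be repeated along the alignment path, the slower utterance still matches the "yes" template with zero distance, which is why the method suits isolated word recognition where speaking rate varies.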
IV. Difficulties
Recognition rate is an important indicator of the performance of a speech recognition system. In practical applications, the recognition rate is mainly affected by the following factors:
1. Dialects and accents. For Chinese speech recognition, dialects or accents reduce the recognition rate.
2. Background noise. Strong noise in public places has a large impact on recognition. Even in a laboratory environment, typing on a keyboard or moving the microphone can produce background noise.
3. The "spoken language" problem. It involves both natural language understanding and acoustics. The ultimate goal of speech recognition technology is to let users interact with machines as naturally as they talk with other people, but the non-standard grammar and irregular word order of spoken language make semantic analysis and understanding difficult.
In addition, the recognition rate is related to the speaker's gender and the duration of the speech.
Real-time performance is another indicator of the performance of a speech recognition system.
On a PC, the high-speed CPU and large-capacity storage can generally meet real-time requirements.
For embedded systems with limited resources, however, real-time performance is difficult to guarantee.
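One common way to quantify real-time performance is the real-time factor: processing time divided by audio duration, where a value below 1 means the recognizer keeps up with the incoming speech. The sketch below is an illustration only; the function name `real_time_factor` and the `recognize_fn` callback are hypothetical, and a real measurement would average over many utterances on the target hardware.

```python
import time


def real_time_factor(recognize_fn, samples, sample_rate):
    """Return processing time divided by audio duration for one utterance."""
    if not samples or sample_rate <= 0:
        raise ValueError("need a non-empty signal and a positive sample rate")
    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    recognize_fn(samples)  # the recognizer under test (hypothetical callback)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


# Stand-in recognizer: sums the samples instead of decoding speech.
one_second = [0.0] * 16000
rtf = real_time_factor(lambda s: sum(s), one_second, 16000)
```

The same measurement run on a PC and on an embedded board makes the gap described above concrete, since the audio duration is fixed and only the processing time changes.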
The next article will discuss the key technologies, such as endpoint detection, feature parameter extraction, and pattern recognition.
References: MATLAB extended Programming
Thursday, June 26, 2014