Speech recognition technology, also known as automatic speech recognition, aims to convert the lexical content in human speech into machine-readable input, such as keystrokes, binary encodings, or character sequences.
As an input mode, speech recognition is more efficient than key input or gesture input, and its learning cost is very low. Speaker-independent continuous speech recognition systems have reached a recognition rate of 98.73%, which meets practical requirements, so the technology has broad application prospects. Mobile phone applications include voice dialing, voice input, voice commands, voice search, and voice translation.
The technical principles of speech recognition are fairly complex, and are easiest to understand by following the voice interaction process:
1. Turn on the speech recognition function. Generally the user starts it manually by tapping a button. The phone cannot start it automatically, for example in response to a voice command or by judging from the volume level that recognition should begin.
2. Enter the speaking interface. The program interface visually reflects changes in volume.
3. When the user finishes speaking, the system begins its analysis. Input can end in two ways: automatically, usually once the utterance is finished, or manually, when the user stops it on the phone. The system's processing can be divided into the following steps:
a) Front-end processing. The main task of this module is to remove noise from the input signal and extract features for the acoustic model to process. Endpoint detection is performed before further signal processing: it separates the speech periods of the signal from the non-speech periods and accurately locates the starting point of the speech. After endpoint detection, subsequent processing only needs to be applied to the speech segments, which plays an important role in improving model accuracy and the recognition rate. The main task of speech enhancement is to eliminate the influence of ambient noise on speech. At present, a common method is the Wiener filter, which performs better than other filters when the noise is heavy.
b) Acoustic feature extraction. Extracting acoustic features is both a process of heavy information compression and a process of signal deconvolution, which allows the pattern classifier to separate classes better. For example, uploaded audio is encoded with a speech codec, which reduces the audio file size, the storage space, or the transmission bit rate.
c) Statistical acoustic model. The model processes the acoustic features of each frame, for example with context modeling. Because of how speech is produced, sounds can only change gradually: the preceding sound affects the following one, so the spectrum of a sound differs from its spectrum in other contexts. Modeling this context lets the model describe speech more accurately.
d) Pronunciation dictionary. The pronunciation dictionary contains the set of words the system can handle together with their pronunciations, similar to the lexicon of a pinyin input method. As with input methods, updating the dictionary with trending words and grouping the lexicon improve matching accuracy.
e) Language model. The language model models the language that the system targets, for example by analyzing the context of the speech.
Because file size is limited, only a small number of dictionaries can be stored locally, so complex speech has to be sent to a server for analysis. Google Voice Search only reports that it cannot connect to the network after the user has finished speaking; the network connection status should be checked before input starts.
4. The system outputs the analysis result. There are two approaches: one displays the result automatically, as in Bing search; the other offers options for the user to choose from, which depends on the probabilities of the candidate results. The user's selections influence the ranking of dictionary entries, improve the adaptivity and robustness of recognition, and help form personalized input.
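The endpoint detection described under front-end processing in step 3 can be illustrated with a minimal sketch. It assumes a simple short-time energy threshold, which is only one possible approach and not necessarily what any particular product uses. The function names, frame length, threshold, and synthetic test signal are all hypothetical choices made for illustration.

```python
import math


def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's mean energy."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_endpoints(samples, frame_len=160, threshold=0.01, min_speech_frames=3):
    """Return (start_sample, end_sample) of the first speech run, or None.

    A frame counts as speech when its mean energy exceeds `threshold`.
    A run must last at least `min_speech_frames` frames, so that short
    clicks are not mistaken for speech. All defaults are assumptions.
    """
    energies = frame_energies(samples, frame_len)
    run_start = None
    # The trailing 0.0 is a sentinel that closes a run reaching the end.
    for i, energy in enumerate(energies + [0.0]):
        if energy > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_speech_frames:
                return run_start * frame_len, i * frame_len
            run_start = None
    return None


# Synthetic 8 kHz signal: 0.1 s of silence, 0.2 s of a 440 Hz tone, 0.1 s of silence.
rate = 8000
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(1600)]
signal = silence + tone + silence

print(detect_endpoints(signal))  # (800, 2400)
```

Only the samples between the two returned positions would be passed on to feature extraction. A real front end would also need to cope with background noise, for instance by estimating the threshold from the signal instead of fixing it.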
The vocabulary a product can recognize varies. For specific voice commands, such as searching for a contact name, the user can only say words that match a command. An input method has a larger vocabulary, and sentence search needs not only a huge vocabulary but also the ability to distinguish the linking and tone of continuous speech, and it must output the most reasonable result based on context and trending words. The fewer the restrictions, the harder speech recognition becomes. Conversely, because a smaller dictionary avoids similar-sounding words to some extent, the less data the dictionary holds, the more accurately specific words are recognized.
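The effect of dictionary size on ambiguity can be shown with a toy sketch. It assumes a deliberately crude stand-in for acoustic similarity, in which words that differ only in their vowels are treated as sounding alike. The `sound_key` function and both word lists are hypothetical and are not taken from any real recognizer.

```python
def sound_key(word):
    """Crude, hypothetical stand-in for acoustic similarity:
    words that differ only in their vowels get the same key."""
    return "".join(ch for ch in word.lower() if ch not in "aeiou")


def candidates(heard, vocabulary):
    """Return every vocabulary word that is confusable with `heard`."""
    key = sound_key(heard)
    return [word for word in vocabulary if sound_key(word) == key]


# Illustrative word lists only.
command_vocabulary = ["call", "mail", "play", "stop"]
dictation_vocabulary = command_vocabulary + ["cell", "cull", "male", "mile"]

print(candidates("call", command_vocabulary))    # ['call']
print(candidates("call", dictation_vocabulary))  # ['call', 'cell', 'cull']
```

With the small command vocabulary the heard word has a single match, while the larger vocabulary yields several competing candidates that would have to be separated using context or usage frequency.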
Chinese speech input differs from English. In English, a word that does not match the configured dictionary cannot be recognized, whereas Chinese words are composed of characters, so Chinese can be recognized character by character.
The iOS 5 input method has added a voice function, and voice will gradually become a standard feature of mobile phone input. The accuracy of the final output and the fluency of operation are important measures of the quality of the interaction.
Author: Xiao Sheng
Article Source: daichuanqing.com/index.php/archives/2800