Speech recognition technology enables a machine to turn a speech signal into the corresponding text or command through a process of identification and understanding.
Embedded products based on speech recognition chips are becoming more and more common, such as Sensory's RSC series speech recognition chips and Infineon's Unispeech and Unilite voice chips; these chips have been widely used in embedded hardware development. On the software side, the more successful speech recognition packages are Nuance, IBM ViaVoice, Microsoft's SAPI, and the open-source HTK; these are speaker-independent, large-vocabulary continuous speech recognition systems.
Speech recognition is essentially a process of pattern recognition: the pattern of the unknown speech is compared with reference patterns of known speech, and the best-matching reference pattern is taken as the recognition result.
The purpose of speech recognition is to give the machine a kind of hearing: to understand what people say and take the corresponding action. At present, most speech recognition technology is based on statistical models, and from the perspective of the speech generation mechanism, speech recognition can be divided into two parts: the acoustic layer and the language layer.
Nowadays the mainstream algorithms in speech recognition include dynamic time warping (DTW), vector quantization (VQ, a nonparametric model), the hidden Markov model (HMM, a parametric model), artificial neural networks (ANN), and support vector machines (SVM).
Classification of speech recognition:
By the degree of dependence on the speaker:
(1) Speaker-dependent (SD) recognition: can only recognize the specific user's voice; train first, then use (training → use).
(2) Speaker-independent (SI) recognition: can recognize anyone's voice; no training is needed.
By the required speaking style:
(1) Isolated word recognition: only a single word can be recognized at a time.
(2) Continuous speech recognition: the user speaks at a normal speed and whole sentences can be recognized.
The model of a speech recognition system usually consists of two parts, an acoustic model and a language model, which correspond to computing the speech-to-syllable probability and the syllable-to-word probability respectively.
Sphinx is a large-vocabulary, speaker-independent, continuous English speech recognition system developed by Carnegie Mellon University in the United States. A continuous speech recognition system can be broadly divided into four parts: feature extraction, acoustic model training, language model training, and the decoder.
(1) Preprocessing module:
Processes the raw input speech signal, filtering out unimportant information and background noise, and performs endpoint detection (finding the beginning and end of the speech signal), framing (the speech signal can be considered short-time stationary within roughly 10-30 ms, so it is split into segments for analysis), pre-emphasis (boosting the high-frequency portion), and so on.
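To make the preprocessing steps concrete, here is a minimal NumPy sketch of pre-emphasis and framing. The 0.97 coefficient, 25 ms frame length, and 10 ms shift are common illustrative values, not Sphinx's exact front-end settings:

```python
import numpy as np

def preprocess(signal, sample_rate, frame_ms=25, shift_ms=10, alpha=0.97):
    """Pre-emphasize a speech signal and split it into overlapping short-time frames.

    Assumes the signal is at least one frame long.
    """
    # Pre-emphasis boosts the high-frequency portion: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)    # samples per frame
    frame_shift = int(sample_rate * shift_ms / 1000)  # samples between frame starts
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift

    # Each row is one short-time frame; frames overlap by frame_len - frame_shift samples
    frames = np.stack([emphasized[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(num_frames)])
    # A Hamming window smooths the frame edges before spectral analysis
    return frames * np.hamming(frame_len)
```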
(2) Feature extraction:
Removes the information in the speech signal that is redundant and useless for speech recognition, retains the information that reflects the essential characteristics of the speech, and expresses it in a certain form; that is, it extracts the key characteristic parameters reflecting the features of the speech signal to form a feature-vector sequence for subsequent processing.
At present there are many commonly used feature-extraction methods, but most are derived from the spectrum. Mel-frequency cepstral coefficient (MFCC) parameters are widely used because of their good noise resistance and robustness, and Sphinx also uses MFCC features. The MFCC computation first uses an FFT to transform the time-domain signal into the frequency domain, then filters the logarithmic energy spectrum with a triangular filter bank distributed on the Mel scale, and finally applies a discrete cosine transform (DCT) to the vector of filter outputs, keeping the first N coefficients.
In Sphinx, the speech waveform is split into frames of about 10 ms each, and 39 numbers are then extracted from each frame to represent that frame of speech; these 39 numbers are the frame's MFCC features, represented as a feature vector.
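Here is a sketch of how such a 39-dimensional vector is commonly assembled: 13 static MFCCs plus their first- and second-order differences (deltas). It uses the librosa library purely for illustration; Sphinx computes its own MFCC front end internally, and the file name is a placeholder:

```python
import librosa
import numpy as np

# Load audio at 16 kHz, a typical rate for speech recognition ("speech.wav" is a placeholder)
y, sr = librosa.load("speech.wav", sr=16000)

# 13 static MFCCs per frame: ~25 ms analysis window (400 samples), 10 ms hop (160 samples)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# First and second temporal derivatives (delta and delta-delta)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Stack into one 39-dimensional feature vector per frame
features = np.vstack([mfcc, delta, delta2])  # shape: (39, num_frames)
```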
(3) Acoustic model training:
The acoustic model parameters are trained on the characteristic parameters of the training speech corpus. During recognition, the characteristic parameters of the speech to be recognized are matched against the acoustic model to obtain the recognition result.
Current mainstream speech recognition systems use the hidden Markov model (HMM) for acoustic modeling. The modeling unit of an acoustic model can be a phoneme, a syllable, a word, and so on. For a speech recognition system with a small vocabulary, syllables can be modeled directly; for a recognition system with a large vocabulary, phonemes, i.e., consonants and vowels, are generally chosen. The larger the recognition vocabulary, the smaller the modeling unit selected. (There are many classic explanations of HMMs on the internet, such as "A best example for learning HMMs" and "An introduction to the hidden Markov model (HMM)"; have a look if you are unfamiliar with them.)
The HMM is a statistical model of the time-series structure of speech signals, mathematically regarded as a double stochastic process: one process is a Markov chain with a finite number of states that simulates the changes in the statistical characteristics of the speech signal (the internal states of the Markov model are not visible from outside); the other is a stochastic process associated with each state of the Markov chain, producing the externally visible observation sequence (usually the acoustic features computed from individual frames).
The human speech process is itself a double stochastic process: the speech signal is an observable time-varying sequence, a parameter stream of the phonemes (sounds) emitted by the brain according to grammatical knowledge and speech needs (the unobservable states). The HMM imitates this process reasonably and is therefore an ideal speech model. Two assumptions must be made to describe the speech signal with an HMM: first, that an internal state transition depends only on the previous state; and second, that the output value depends only on the current state (or the current state transition). These two assumptions greatly reduce the complexity of the model.
In speech recognition, HMMs usually use a left-to-right, unidirectional topology with self-loops and skip transitions to model the recognition primitives: a phoneme is an HMM of three to five states, a word is an HMM formed by concatenating the HMMs of the phonemes that compose it, and the whole model of continuous speech recognition is an HMM combining words and silence.
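For intuition, the transition matrix of a 3-state left-to-right HMM with self-loops and one skip transition might look like this (the probabilities here are made-up placeholders):

```python
import numpy as np

# Rows = current state, columns = next state.
# Each state may loop on itself, advance to the next state, or skip one state ahead.
A = np.array([
    [0.6, 0.3, 0.1],  # state 0: self-loop, advance, or skip to state 2
    [0.0, 0.7, 0.3],  # state 1: no going back (left-to-right topology)
    [0.0, 0.0, 1.0],  # state 2: final state (exit handled separately)
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a probability distribution
```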
(4) Language model training:
A language model is a probabilistic model used to compute the probability of a sentence. It is primarily used to decide which word sequence is more likely, or to predict the next word given several preceding words. In other words, the language model is used to constrain the word search: it defines which words can follow the previously recognized words (matching is a sequential process), so that impossible words can be excluded from the matching process.
Language modeling can effectively combine knowledge of Chinese grammar and semantics and describe the intrinsic relationships between words, thus improving the recognition rate and reducing the search space. The language model is divided into three levels: lexical knowledge, grammatical knowledge, and syntactic knowledge.
The grammar and semantics of a training text database are analyzed, and the language model is trained using statistical methods. There are two approaches to language modeling: rule-based models and statistical models. A statistical language model uses probability and statistics to reveal the inherent statistical regularities of language units; among such models the N-gram is simple, effective, and widely used. It contains the statistics of word sequences.
The N-gram model is based on the assumption that the occurrence of the nth word is related only to the previous N-1 words and is unrelated to any other word, so the probability of the whole sentence is the product of the probabilities of the individual words. These probabilities can be obtained directly from a corpus by counting how many times the n words occur together. The most commonly used are the bigram (2-gram) and the trigram (3-gram).
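Written out (a standard formulation, not quoted from the thesis), the N-gram assumption and the bigram count estimate are:

$$P(w_1 w_2 \cdots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}\, w_i)}{\mathrm{count}(w_{i-1})}$$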
Sphinx uses bigram and trigram statistical language models, i.e., the probability of the current word given the previous one or two words: P(w2|w1) and P(w3|w2,w1).
(5) Speech decoding and search algorithm:
Decoder: refers to the recognition process in speech technology. For the input speech signal, a recognition network is built from the trained HMM acoustic model, the language model, and the dictionary, and a search algorithm finds the best path through this network; this path is the word string that can output the speech signal with maximum probability, which determines the text contained in the speech sample. The decoding operation therefore refers to the search algorithm: the method by which the decoder finds the best word string through search techniques.
The search in continuous speech recognition is to find a sequence of word models that describes the input speech signal, from which the decoded word sequence is obtained. The search is based on the acoustic model score and the language model score in a combined scoring formula. In practice, a high weight is often set on the language model based on experience, and a long-word penalty score is added. Today's mainstream decoding technique is based on the Viterbi search algorithm, and Sphinx uses it too.
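The combined score mentioned above is commonly written in a form like the following (my notation, as an illustration: O is the acoustic observation sequence, W a candidate word string, λ the language-model weight, p the word-insertion penalty, and N(W) the number of words in W):

$$\hat{W} = \arg\max_{W}\left[\log P(O \mid W) + \lambda \log P(W) + N(W)\, p\right]$$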
The Viterbi algorithm uses dynamic programming: at each state point at each time step, it computes the posterior probability of the decoding state sequence given the observation sequence, keeps the path with the maximum probability, and records the corresponding state information at each node so that the word decoding sequence can finally be recovered by backtracking. The Viterbi algorithm is essentially a dynamic programming algorithm that traverses the HMM state network, retaining for each frame of speech the best path score of reaching each state.
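As a concrete illustration, here is a minimal Viterbi sketch over a generic discrete-observation HMM (a textbook implementation, not Sphinx's actual decoder code):

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Return the most likely state sequence and its log probability.

    pi: initial state probabilities, shape (S,)
    A:  state transition matrix, shape (S, S)
    B:  emission probabilities, shape (S, num_symbols)
    observations: sequence of observation symbol indices
    """
    S, T = len(pi), len(observations)
    # Work in log space to avoid underflow; log(0) = -inf is fine for impossible transitions
    log_delta = np.log(pi) + np.log(B[:, observations[0]])
    backpointer = np.zeros((T, S), dtype=int)

    for t in range(1, T):
        scores = log_delta[:, None] + np.log(A)  # scores[i, j]: best path into j via i
        backpointer[t] = scores.argmax(axis=0)   # best predecessor for each state
        log_delta = scores.max(axis=0) + np.log(B[:, observations[t]])

    # Backtrack from the best final state to recover the state path
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return path[::-1], float(log_delta.max())

# Tiny demo: 2 states, 2 observation symbols
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1, 1]))
```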
The recognition result of a continuous speech recognition system is a word sequence. Decoding is actually a repeated search over all the words in the vocabulary. The arrangement of the lexicon entries affects the speed of the search, and the way words are arranged is the representation of the dictionary. In the Sphinx system, phonemes are used as the acoustic training units, and the dictionary records which phonemes each word consists of, i.e., it annotates the pronunciation of each word, as in the sample below.
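For example, a pronunciation dictionary contains entries mapping each word to its phoneme sequence, shaped roughly like the CMU pronouncing dictionary (these particular lines are my own illustrative examples):

```
HELLO  HH AH L OW
WORLD  W ER L D
OPEN   OW P AH N
```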
N-best search and multi-pass search: to use multiple knowledge sources in the search, it is often done in several passes. The first pass uses low-cost knowledge sources (such as the acoustic model, a simple language model, and the phonetic dictionary) to produce a candidate list or word candidate lattice; on this basis, a second pass using high-cost knowledge sources (such as 4th- or 5th-order N-grams, or 4th-order and higher context-dependent models) finds the best path.
My personal understanding of the speech recognition process:
For example, I say to the computer: "Help me open My Computer." The computer then has to understand what I said and perform the operation of opening My Computer. So how is that done?
Some work has to be done in advance: the computer must first learn that this speech (actually a waveform) of "Help me open My Computer" represents the text string "Help me open My Computer". So how do we make it learn that?
Suppose we use the syllable (for Chinese, the pronunciation of one character) as the recognition primitive; then the computer learns word by word, e.g., the word "Help", the word "me", and so on. So how does it learn the word "Help"? In other words, when the computer receives the speech waveform of the word "Help", how does it analyze and understand it so as to know that it represents the word "Help"? First we need to build a mathematical model to represent this speech. Because speech is a continuous non-stationary signal that can be considered stationary over a short period, we split the speech signal into frames, say about 25 ms per frame; and to let adjacent frames transition smoothly, we let them overlap, say by 10 ms. In this way each frame of the signal is stationary, and we then extract from each frame the information that reflects the essential features of the speech (removing the information that is useless for speech recognition, which also achieves dimensionality reduction).

So which features best express each frame of speech? MFCC is one of the most widely used, as introduced above. After extracting the MFCC features of each frame we obtain a set of coefficients, around forty or fifty of them; in Sphinx it is 39 numbers, which form the feature vector. So we describe each frame with 39 numbers, and different frames will have different combinations of these 39 numbers. What mathematical model do we use to describe the distribution of these 39 numbers? We can use a Gaussian mixture model, which has two kinds of parameters: means and variances. So each frame of speech actually corresponds to such a set of mean and variance parameters. Quite wordy, I know.
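As a sketch of that "set of means and variances" idea, one could fit a Gaussian mixture to a batch of 39-dimensional frames with scikit-learn (an assumed library, used here only for illustration; real acoustic training ties the Gaussians to HMM states rather than fitting all frames in one pool):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for real MFCC frames: one 39-dimensional feature vector per row
frames = np.random.randn(1000, 39)

# Model the frame distribution as a mixture of 8 diagonal-covariance Gaussians
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(frames)

print(gmm.means_.shape)        # (8, 39): one mean vector per mixture component
print(gmm.covariances_.shape)  # (8, 39): one diagonal variance vector per component
```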
OK, so each frame of the speech waveform of the word "Help" corresponds to a set of means and variances (the observation sequence of the HMM); then we only need to determine which set of means and variances the word "Help" (the hidden sequence of the HMM) corresponds to. How is that correspondence established? That is the role of training. We know that describing an HMM requires three groups of parameters: the initial state probability distribution π, the transition matrix A of the hidden state sequence (i.e., the probability of transferring from one state to another), and the output observation probability distribution B for each hidden state (i.e., the probability of a given set of means and variances under that state). The acoustic model can be modeled with an HMM: for each speech unit being modeled, we need to find a set of HMM parameters (π, A, B) to represent that speech unit. How are these three groups of parameters determined? Training. We provide a speech database, indicating which words each utterance represents, and then let the computer learn, i.e., compute statistics over the database to obtain the three parameters (π, A, B).
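Here is a toy sketch of estimating (π, A, B) from data, using the hmmlearn library as an assumption (Sphinx's own trainer runs Baum-Welch over a full labeled speech corpus, not a call like this):

```python
import numpy as np
from hmmlearn import hmm

# Stand-in training data: feature frames from several utterances of one speech unit,
# concatenated row-wise, with the length of each utterance recorded
X = np.random.randn(300, 39)
lengths = [100, 100, 100]   # three utterances of 100 frames each

# 3 hidden states with Gaussian emissions; fit() runs Baum-Welch (EM) estimation
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
model.fit(X, lengths)

print(model.startprob_)    # pi: initial state distribution
print(model.transmat_)     # A:  state transition matrix
print(model.means_.shape)  # part of B: per-state emission means, (3, 39)
```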
Now the acoustic model of one word (modeling unit) is built. But Chinese has a great many words, so we have to build an acoustic model for each one. Suppose there are thousands of models, each with three or five HMM states; then if the sentence you say contains 10 words, we would have to search all possible model combinations to match your speech. That search space is enormous and very time-consuming, so we need a better search algorithm (here the Viterbi-beam algorithm): each time the search reaches a state point, it keeps only the highest-probability paths and discards the rest, which prunes away many search paths; but because the discarded paths are ignored, it can only obtain a locally optimal solution.
What if the following situation arises? For example, I say "it's a nice day", which from the acoustics alone might be recognized as "it sun niced a" or "it son ice day". The acoustic model alone cannot distinguish these results, because the differences come only from different word boundaries. This is where the language model comes in: it judges semantically which result has the highest probability, and that becomes the search result. The N-gram language model is based on the assumption that the occurrence of the nth word is related only to the previous N-1 words and to no other word, and the probability of the whole sentence is the product of the probabilities of the individual words; these probabilities are obtained from a corpus by counting how often the n words occur together. This constrains the search and improves the accuracy of recognition.
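A toy sketch of how bigram counts from a corpus can rank candidate word strings (illustrative data, no smoothing):

```python
from collections import Counter

corpus = "it is a nice day . it is a good day . what a nice day .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sentence):
    """P(sentence) as the product of P(w_i | w_{i-1}); zero for unseen bigrams."""
    words = sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0 or bigrams[(prev, cur)] == 0:
            return 0.0  # unseen word or bigram (no smoothing in this toy model)
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

# The language model prefers the word sequence actually seen in the corpus
print(bigram_prob("a nice day"))  # > 0
print(bigram_prob("a niced ay"))  # 0.0: unseen words/bigrams score zero
```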
For some of the basic concepts of speech, I translated part of the CMUSphinx wiki; see:
http://blog.csdn.net/zouxy09/article/details/7941055
Some additional concepts (organized from Baidu Encyclopedia):
Syllable:
A syllable is the most natural unit of speech perceptible by hearing; it is formed by one or several phonemes combined in a certain pattern.
Chinese syllables:
In Chinese, one Chinese character is one syllable, and each syllable consists of three parts: the initial, the final, and the tone. Ignoring tone distinctions, Mandarin Chinese has about 400 syllables. Pinyin spelling is the process of pronouncing a syllable: following the rules for forming Mandarin syllables, the initial, final, and tone are rapidly and continuously combined, with the tone added, into a syllable. For example: q-i-áng → qiáng (strong).
English syllables:
A syllable is the basic unit of pronunciation, and the pronunciation of any word can be decomposed into syllables to be read aloud. In English, a vowel phoneme can form a syllable by itself, and a vowel phoneme combined with one or several consonant phonemes can also form a syllable. English words may have one syllable, two syllables, or more: a word with one syllable is called monosyllabic, a word with two syllables disyllabic, and a word with three or more syllables polysyllabic. For example: take; ta'ble (table); po'ta'to (potato); po'pu'la'tion (population); con'gra'tu'la'tion (congratulation); te'le'com'mu'ni'ca'tion (telecommunication).
Vowels are the main body of a syllable, and consonants are the dividing lines between syllables. Each vowel phoneme can form a syllable, e.g., bed, bet. Two vowel letters together can also form one syllable, e.g., seat, beat, beast. When there is one consonant phoneme between two vowel phonemes, the consonant belongs to the following syllable, e.g., stu'dent (student), la'bour (labour). When there are two consonant phonemes, one belongs to the preceding syllable and one to the following syllable, e.g., win'ter (winter), fa'ther (father), tea'cher (teacher).
Phonemes:
A phoneme is the smallest unit of speech, divided according to the natural attributes of the sound. In terms of acoustic properties, a phoneme is the smallest speech unit divided from the point of view of sound quality. From a physiological point of view, one articulatory action forms one phoneme. For example, (ma) contains the two articulatory actions (m) and (a), i.e., two phonemes. Sounds produced by the same articulatory action are the same phoneme; sounds produced by different articulatory actions are different phonemes. For example, in (ma-mi) the two (m)s are produced by the same articulatory action and are the same phoneme, while (a) and (i) are produced by different actions and are different phonemes.
Chinese phonemes:
A syllable is only the most natural unit of speech, while the phoneme is the smallest unit of speech. Chinese has 10 vowels and 22 consonants, 32 phonemes in total. A syllable contains at least one and at most four phonemes. For example, "putonghua" (Mandarin) consists of three syllables (one per character) and can be analyzed into the eight phonemes p, u, t, o, ng, h, u, a.
English phonemes:
The symbols used to record English phonemes are called phonetic transcription. The English phonetic alphabet has 48 phonemes, including 20 vowel phonemes and 28 consonant phonemes. English consonants and vowels play roles in the language equivalent to the initials and finals in Chinese.
Corpus:
In statistical natural language processing it is virtually impossible to observe large-scale language instances of the real world directly. Therefore, people simply use text as a substitute, and the context within the text as a substitute for the context of real-world language. A collection of text is called a corpus; several such collections of text are called corpora. A corpus is usually language material collected for linguistic research, stored in electronic form, compiled from naturally occurring written or spoken samples, and used to represent a particular language or language variety.
A corpus is formed from real language materials that occur in real life, such as our everyday speech, sentences and passages from literary works, and paragraphs appearing in newspapers and magazines; data can be drawn or obtained from it for scientific research.
For example, if I want to write an article about the usage of the word "force", I can look up the frequency, usage, and so on of this word in a corpus.
Reference
Robyn Wang, Sphinx-based Chinese Continuous Speech Recognition, Master's thesis, Taiyuan University of Technology.