Introduction: At present, the best-performing speech recognition systems use bidirectional long short-term memory networks (LSTM, Long Short-Term Memory), but such systems suffer from high training complexity and high decoding latency, which makes them hard to apply in industrial real-time recognition systems. This year, Iflytek presented a new speech recognition framework, the deep fully convolutional neural network (DFCNN, Deep Fully Convolutional Neural Network), which is better suited to industrial applications. This article gives a detailed interpretation of how DFCNN is applied to speech transcription, covering colloquial-language and discourse-level language model processing, noise and far-field recognition, and real-time error correction and text post-processing.
Among applications of artificial intelligence, speech recognition has made remarkable progress this year: whether in English, Chinese, or other languages, the accuracy of machine speech recognition keeps rising. Voice dictation has developed the fastest and is now widely used in mature products such as voice input, voice search, and voice assistants. Another side of voice applications, speech transcription, still faces difficulties, however. When a recording is produced, the speaker does not anticipate that it will later be fed to a speech recognizer, so compared with dictation, transcription must cope with varied speaking styles, accents, recording quality, and many other challenges.
Typical scenarios for speech transcription include journalist interviews, TV programs, classroom lectures, and conversational meetings, and indeed any recording anyone produces in daily work and life. The market and imagination space for transcription are huge: if machines conquer speech transcription, TV programs could get live subtitles automatically, formal meetings could produce minutes automatically, and reporters' recordings could be turned into transcripts automatically. A person speaks far more words in a lifetime than they write; if software could record everything we say and manage it efficiently, how incredible the world would be.
Acoustic modeling technology based on DFCNN
Acoustic modeling in speech recognition models the relationship between speech signals and phonemes. As an acoustic modeling framework, Iflytek proposed the feed-forward sequential memory network (FSMN, Feed-forward Sequential Memory Network) on December 21 last year. This year it has again introduced a new speech recognition framework, the deep fully convolutional neural network (DFCNN, Deep Fully Convolutional Neural Network).
At present, the best speech recognition systems use the bidirectional long short-term memory network (LSTM, Long Short-Term Memory), which can model the long-range temporal dependencies of speech and thus improve recognition accuracy. But the bidirectional LSTM network suffers from high training complexity and high decoding latency, which makes it difficult to use in industrial real-time recognition systems. Iflytek therefore uses the deep fully convolutional neural network to overcome these defects of the bidirectional LSTM.
CNNs were used in speech recognition systems as early as 2012, but without a big breakthrough. The main reason is that they used fixed-length frame stitching as input and therefore could not see long enough speech context; another flaw is that the CNN was treated merely as a feature extractor, so very few convolutional layers were used and its expressive power was limited.
To address these problems, DFCNN uses a large number of convolutional layers to model the whole sentence directly. First, in terms of input, DFCNN takes the spectrogram directly as input, a natural advantage over speech recognition frameworks that use traditional hand-crafted features. Second, in model structure, it borrows network configurations from image recognition: each convolutional layer uses small kernels, with a pooling layer added after several convolutional layers, and by accumulating many such convolution-pooling pairs the model can see very long history and future context. These two points ensure that DFCNN can capture the long-range dependencies of speech; compared with RNN structures it is more robust, and it can achieve quasi-online decoding with short delay, making it usable in industrial systems.
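To make the structure concrete, here is a minimal sketch of a DFCNN-style acoustic model in PyTorch. The block count, channel widths, the 80-frequency-bin input, and the 218-way phone output are illustrative assumptions, not Iflytek's published configuration.

```python
# A minimal sketch of a DFCNN-style acoustic model: the spectrogram is
# treated as an image and passed through stacked small-kernel convolutions
# with pooling, so the receptive field grows to cover the whole utterance.
import torch
import torch.nn as nn

class DFCNNBlock(nn.Module):
    """Two 3x3 convolutions followed by 2x2 max pooling, VGG-style."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # halves both time and frequency resolution
        )

    def forward(self, x):
        return self.net(x)

class DFCNN(nn.Module):
    def __init__(self, n_phones=218):
        super().__init__()
        # Stacking convolution-pooling blocks enlarges the context window.
        self.blocks = nn.Sequential(
            DFCNNBlock(1, 32), DFCNNBlock(32, 64), DFCNNBlock(64, 128),
        )
        # Assumes 80 frequency bins at input -> 10 after three poolings.
        self.classifier = nn.Linear(128 * 10, n_phones)

    def forward(self, spec):
        # spec: (batch, 1, time, freq) spectrogram treated as an image
        h = self.blocks(spec)                 # (batch, 128, time/8, freq/8)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time/8, 128*freq/8)
        return self.classifier(h)             # per-frame phone scores

x = torch.randn(2, 1, 400, 80)  # 2 utterances, 400 frames, 80 frequency bins
print(DFCNN()(x).shape)          # torch.Size([2, 50, 218])
```

Note how pooling trades frame rate for context: each output frame summarizes eight input frames, which is what allows short-delay, quasi-online decoding.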
(Figure: DFCNN structure)
Colloquial and discourse-level language model processing technology
The language model in speech recognition models the correspondence between phonemes and words. Human speech is unorganized natural language: in free conversation, people often hesitate, repeat themselves, insert modal words, and exhibit other complex language phenomena, while most existing text is written language. The gap between the two makes language modeling for spoken language a great challenge.
At Iflytek, an idea borrowed from noise-robust speech recognition is used: "noise" phenomena such as repetition, inversion, and modal words are automatically injected into written language, so that massive spoken-language corpora can be generated automatically, solving the mismatch between spoken and written language. First, a set of paired spoken and written text corpora is collected; second, an encoder-decoder neural network framework is used to model the correspondence between written and spoken text, thus achieving automatic generation of spoken-style text.
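A minimal sketch of such an encoder-decoder rewriter is given below, assuming parallel (written, spoken) sentence pairs are available as token IDs. The GRU architecture, vocabulary size, and dimensions are illustrative stand-ins, not Iflytek's actual model.

```python
# A minimal encoder-decoder sketch for rewriting written text into
# spoken-style text, trained with teacher forcing on parallel pairs.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab=8000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, written_ids, spoken_ids):
        # Encode the written sentence into a single state vector.
        _, state = self.encoder(self.emb(written_ids))
        # Decode the spoken-style sentence conditioned on that state
        # (teacher forcing: gold spoken tokens are fed as decoder input).
        dec, _ = self.decoder(self.emb(spoken_ids), state)
        return self.out(dec)  # per-position token scores

model = Seq2Seq()
written = torch.randint(0, 8000, (4, 20))  # batch of written sentences
spoken = torch.randint(0, 8000, (4, 24))   # paired spoken-style sentences
print(model(written, spoken).shape)        # torch.Size([4, 24, 8000])
```

Once trained, sampling from the decoder turns clean written corpora into spoken-style corpora for language model training.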
In addition, contextual information greatly helps humans understand language, and the same holds for machine transcription. Iflytek therefore presented a discourse-level language model on December 21 last year: it automatically extracts key information from the decoding results of speech recognition, performs corpus retrieval and post-processing in real time, and builds an utterance-specific language model from the decoding results and the retrieved corpus, further improving transcription accuracy.
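The sketch below illustrates the idea with a deliberately simplified pipeline: key terms are taken from a first-pass transcript, related sentences are retrieved, and a small topic model is interpolated with the general language model. The unigram scoring, retrieval rule, and interpolation weight are assumptions for illustration only.

```python
# A simplified sketch of a discourse-level language model: retrieve text
# related to the first-pass transcript and mix it into the general LM.
from collections import Counter

def topic_lm(first_pass_words, corpus_sentences, top_k=5):
    # Treat the most frequent words of the first pass as key terms.
    keywords = {w for w, _ in Counter(first_pass_words).most_common(top_k)}
    # Retrieve corpus sentences that mention any key term.
    hits = [s for s in corpus_sentences if keywords & set(s)]
    counts = Counter(w for s in hits for w in s)
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def interpolated_prob(word, general_lm, topic, lam=0.3):
    # Second-pass score mixes the general model with the topic model.
    return (1 - lam) * general_lm.get(word, 1e-6) + lam * topic.get(word, 1e-6)

first_pass = "the court discussed the patent dispute over the patent filing".split()
corpus = [s.split() for s in ["the patent office rejected the filing",
                              "weather was sunny today"]]
topic = topic_lm(first_pass, corpus)
print(interpolated_prob("patent", {"patent": 1e-4}, topic))  # boosted score
```

Domain words like "patent" get much higher probability in the second pass, which is the effect the discourse-level model aims for.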
(Figure: discourse-level language model flowchart)
Noise and far-field recognition technology
Far-field pickup and noise interference have long been two major technical problems in speech recognition applications. For example, in a meeting recorded with a recording pen, a speaker far from the device produces far-field reverberant speech; reverberation makes non-synchronized speech overlap, producing a phoneme overlap-masking effect that seriously degrades recognition. Likewise, background noise in the recording environment contaminates the speech spectrum and causes recognition accuracy to drop sharply. For these problems, Iflytek applies noise reduction and dereverberation techniques under two hardware setups, single microphone and microphone array, so that transcription of noisy and far-field speech has also reached a practical threshold.
Single-microphone noise reduction and dereverberation
For the noisy speech collected, mixed training is combined with denoising and dereverberation based on a deep regression neural network. On the one hand, noise is added to clean speech and the model is trained on the mixture of clean and noisy data, improving its robustness to noisy speech; on the other hand, a deep regression neural network is used to reduce noise and reverberation, further improving recognition accuracy for noisy and far-field speech.
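A sketch of the noise-mixing step follows: clean speech is corrupted with noise at a chosen signal-to-noise ratio to create training pairs. Pure numpy; file I/O and the regression denoising network itself are omitted, and the sine wave stands in for real speech.

```python
# Mix clean speech with noise at a target SNR, producing (noisy, clean)
# training pairs for multi-condition training or a regression denoiser.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    noise = np.resize(noise, clean.shape)  # loop noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for speech
noise = rng.normal(size=8000)
noisy = mix_at_snr(clean, noise, snr_db=10)
print(noisy.shape)  # (16000,) -- train the regression net on (noisy, clean)
```

Sweeping `snr_db` over a range during training is what gives the acoustic model its robustness across recording conditions.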
Microphone-array noise reduction and dereverberation
Handling noise only during speech processing is a palliative; the crux of the problem is removing reverberation and noise at the source. Facing this challenge, Iflytek researchers add a microphone array to the recording device to reduce noise and reverberation. Specifically, multiple microphones capture multiple time-frequency signals, and a convolutional neural network is used to learn beamforming, which forms a pickup beam in the direction of the target signal while attenuating reflected sound from other directions. Combined with single-microphone noise reduction and dereverberation, this further improves recognition accuracy for noisy and far-field speech.
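To show what the learned beamformer is replacing, here is a sketch of classical delay-and-sum beamforming; the article's approach learns this behavior with a CNN, but the geometry below is the underlying principle. The array spacing, steering angle, and sample rate are assumptions.

```python
# Classical delay-and-sum beamforming: align each microphone to a plane
# wave arriving from a chosen angle, then average, so the on-axis target
# adds in phase while off-axis noise and reflections partially cancel.
import numpy as np

def delay_and_sum(signals, mic_positions, angle_deg, sr=16000, c=343.0):
    direction = np.array([np.cos(np.radians(angle_deg)),
                          np.sin(np.radians(angle_deg))])
    delays = mic_positions @ direction / c                 # seconds per mic
    shifts = np.round((delays - delays.min()) * sr).astype(int)
    aligned = [np.roll(sig, -s) for sig, s in zip(signals, shifts)]
    return np.mean(aligned, axis=0)

# Four mics spaced 5 cm apart on a line, steering toward 30 degrees.
mics = np.array([[0.00, 0], [0.05, 0], [0.10, 0], [0.15, 0]])
sigs = np.random.default_rng(1).normal(size=(4, 16000))
print(delay_and_sum(sigs, mics, angle_deg=30).shape)  # (16000,)
```

A learned CNN beamformer replaces the fixed delays and averaging with filters estimated from data, which handles reverberant reflections better than this fixed-geometry version.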
Text processing: real-time error correction + text post-processing
Everything above concerns processing the speech itself and transcribing it into text. But as mentioned, human speech is unorganized natural language: even when transcription accuracy is very high, the readability of the raw transcript remains a problem, which is where text post-processing comes in. Text post-processing means segmenting the colloquial text into sentences and paragraphs, smoothing its content, and even summarizing it, so that it is easier to read and edit.
Post-processing Ⅰ: sentence and paragraph segmentation
Sentence segmentation divides the unpunctuated transcription text into sentences and adds punctuation between them; paragraph segmentation divides the text into several semantic paragraphs, each describing a different sub-topic.
Sentences and paragraphs are segmented by extracting contextual semantic features combined with acoustic features. Considering that annotated speech data are difficult to obtain, Iflytek uses two cascaded bidirectional long short-term memory networks in practice, which better solves the sentence and paragraph segmentation problems; a sketch of this cascade follows.
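In the sketch below, the first bidirectional LSTM tags punctuation per word and the second consumes its features and decisions to tag paragraph boundaries. The dimensions, label sets, and the way the stages are wired together are illustrative assumptions.

```python
# Two cascaded bidirectional LSTMs: stage 1 predicts punctuation (sentence
# segmentation), stage 2 predicts paragraph boundaries on top of stage 1.
import torch
import torch.nn as nn

class CascadedSegmenter(nn.Module):
    def __init__(self, vocab=8000, dim=128, n_punct=4, n_para=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.sent_lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.punct = nn.Linear(2 * dim, n_punct)  # none/comma/period/question
        self.para_lstm = nn.LSTM(2 * dim + n_punct, dim,
                                 batch_first=True, bidirectional=True)
        self.para = nn.Linear(2 * dim, n_para)    # boundary / no boundary

    def forward(self, word_ids):
        h1, _ = self.sent_lstm(self.emb(word_ids))
        punct_logits = self.punct(h1)
        # Stage 2 sees both stage 1's features and its decisions.
        h2, _ = self.para_lstm(torch.cat([h1, punct_logits], dim=-1))
        return punct_logits, self.para(h2)

words = torch.randint(0, 8000, (2, 50))
punct, para = CascadedSegmenter()(words)
print(punct.shape, para.shape)  # torch.Size([2, 50, 4]) torch.Size([2, 50, 2])
```

In a full system, acoustic features such as pause durations would be concatenated with the word embeddings, since long pauses are strong boundary cues.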
Post-processing Ⅱ: content smoothing
Content smoothing, also known as disfluency detection, removes pause words, modal words, and repeated words from the transcription result, so that the smoothed text is easier to read.
Using generalization features combined with bidirectional LSTM modeling, the accuracy of content smoothing has reached a practical level.
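As an illustration of the target behavior only, here is a rule-based stand-in that drops filler words and collapses immediate repeats; the production approach described above tags each word with a bidirectional LSTM rather than using a fixed filler list.

```python
# An illustrative baseline for content smoothing: remove filler words and
# collapse immediate repetitions so the transcript reads fluently.
FILLERS = {"um", "uh", "er", "like"}

def smooth(tokens):
    out = []
    for tok in tokens:
        if tok.lower() in FILLERS:
            continue                            # drop filler / modal words
        if out and out[-1].lower() == tok.lower():
            continue                            # collapse immediate repeats
        out.append(tok)
    return out

print(smooth("so um we we should uh ship ship the the release".split()))
# ['so', 'we', 'should', 'ship', 'the', 'release']
```

A learned tagger generalizes beyond this: it can also catch restarts and corrections ("we should, I mean, we must") that no fixed word list or adjacency rule can handle.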