Skype launched a preview of real-time voice translation a few days ago, allowing users to communicate across language barriers. Today we'll look at how Microsoft does it.
Skype's translation system works in three main steps: first it converts your speech to text in real time, then it translates the text into another language, and finally it converts the translated text back to speech. Of these, recognizing live speech and converting it to text has long been the hardest part.
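The three steps above can be pictured as a simple chain of stages. The following is only a minimal sketch of that idea, not Skype's actual code: the class name `SpeechTranslationPipeline` and the three `fake_*` stand-in functions are hypothetical, and the stand-ins merely pass text around so the example runs end to end.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechTranslationPipeline:
    """Chains the three stages; each stage is a plain callable (hypothetical design)."""
    recognize: Callable[[bytes], str]    # audio -> source-language text
    translate: Callable[[str], str]      # source text -> target-language text
    synthesize: Callable[[str], bytes]   # target text -> audio

    def run(self, audio: bytes) -> bytes:
        source_text = self.recognize(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Toy stand-ins (assumptions for illustration only). Real stages would call
# speech recognition, machine translation, and text-to-speech engines.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are already text


def fake_translate(text: str) -> str:
    lexicon = {"hello": "hola", "world": "mundo"}
    return " ".join(lexicon.get(word, word) for word in text.lower().split())


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the text bytes are audio


pipeline = SpeechTranslationPipeline(fake_recognize, fake_translate, fake_synthesize)
result = pipeline.run(b"Hello world")
print(result)
```

Keeping each stage behind its own callable means any one of them, for example the recognizer, can be improved or replaced without touching the other two.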
Image processing and speech recognition are the two main directions in the development of deep learning. In recent years, thanks to advances in deep learning, speech recognition based on deep neural networks (DNNs) has also made considerable progress. Neural networks first appeared in the 1980s, but they only came into the spotlight in 2012, when Google had computers "teach themselves" the concept of a cat by learning from a large collection of videos.
Microsoft researcher John Platt told Wired in an interview that Microsoft began using neural networks long ago to improve the accuracy of handwriting recognition on tablet computers. The real breakthrough in Skype's real-time voice translation system is its ability to recognize speakers of different languages and with different accents.
The breakthrough came around Christmas 2009, when Microsoft sponsored a small workshop in British Columbia. There, Geoff Hinton of the University of Toronto presented a machine learning model that mimics the workings of the brain: it relies on multiple layers of artificial neurons, letting the machine gradually grasp increasingly complex concepts. After hearing the talk, Microsoft invested heavily so that Hinton's model could be tested on the latest graphics processing units (GPUs). The results were excellent: speech recognition accuracy improved by 25%.
Skype's machine learning prototype was trained on a large amount of data during the preview phase to optimize the speech recognition (SR) and automated machine translation (MT) tasks. This includes removing disfluencies from utterances (such as "ahs", "umms", and repeated words), segmenting the text into sentences, and adding punctuation and capitalization.
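To make these cleanup tasks concrete, here is a rough rule-based sketch. It is an illustration under stated assumptions, not how Skype does it: the real system learns these steps from data, whereas this toy version uses a fixed filler list and assumes the recognizer emits a hypothetical `<pause>` token at long silences to mark sentence boundaries.

```python
FILLERS = {"ah", "uh", "um", "umm", "er", "hmm"}  # assumed filler list
PAUSE = "<pause>"  # hypothetical marker for a long silence


def clean_transcript(raw: str) -> str:
    """Drop fillers and immediate repeats, split at pauses, then add
    capitalization and a closing period to each sentence."""
    sentences = []
    current = []
    for token in raw.lower().split():
        if token == PAUSE:
            if current:
                sentences.append(current)
                current = []
            continue
        if token in FILLERS:
            continue  # remove "um", "uh", and similar
        if current and current[-1] == token:
            continue  # remove an immediately repeated word
        current.append(token)
    if current:
        sentences.append(current)
    return " ".join(" ".join(words).capitalize() + "." for words in sentences)


print(clean_transcript("um i i want to to go uh home <pause> see you um tomorrow"))
```

A rule this blunt also deletes legitimate repetitions (for example "that that"), which is one reason a trained model is a better fit for the task than fixed rules.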
The training data for speech recognition and machine translation comes from several sources, including translated web pages, subtitled videos, and translated and transcribed one-on-one conversations. Voice conversations donated to Microsoft by many volunteers are another very important source of training data. The Skype translation system also records users' conversations and reuses them for data analysis and further learning.
Once the data enters the system, the machine learning software builds a statistical model of the words in the conversations. When you say something, the system looks for similar words in the statistical model and produces a translation based on similar translations made previously. Real-time speech translation is sensitive to the speaker's environment, and even slight background noise can reduce accuracy. Here, deep neural networks effectively reduce the recognition error rate and improve the robustness of the system, which lets real-time translation be used in a wider range of settings.
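The idea of looking up similar words and reusing earlier translations can be illustrated with a toy phrase table. This is a deliberately simplified sketch, not Skype's statistical model: the `PhraseTable` class is hypothetical, it counts whole phrases rather than modeling words in context, and it uses the standard library's `difflib.get_close_matches` as a stand-in for a real similarity measure.

```python
from collections import Counter, defaultdict
from difflib import get_close_matches
from typing import Optional


class PhraseTable:
    """Counts previously seen translations and reuses the most frequent one."""

    def __init__(self) -> None:
        self._counts = defaultdict(Counter)

    def observe(self, source: str, target: str) -> None:
        # Record one example of `source` being translated as `target`.
        self._counts[source.lower()][target] += 1

    def translate(self, source: str) -> Optional[str]:
        key = source.lower()
        if key not in self._counts:
            # Fall back to the most similar phrase seen so far, if any.
            matches = get_close_matches(key, list(self._counts), n=1, cutoff=0.6)
            if not matches:
                return None
            key = matches[0]
        return self._counts[key].most_common(1)[0][0]


table = PhraseTable()
table.observe("good morning", "guten Morgen")
table.observe("good morning", "guten Morgen")
table.observe("good morning", "Morgen")
table.observe("thank you", "danke")

print(table.translate("Good morning"))  # exact match after lowercasing
print(table.translate("good mornin"))   # near match, e.g. a misrecognized word
print(table.translate("xyzzy"))         # nothing similar has been seen
```

The fuzzy fallback hints at why more training data helps: the more phrases the table has seen, the more likely a noisy or unusual input is close to something it already knows how to translate.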
For translating text between languages, Skype uses the same engine technology as Bing Translator: a combination of syntax-based and statistical models, with special training for specific languages. Ordinary text translation usually expects correct written language, so the Skype translation system adds a layer for handling colloquial, spoken language on top of the Bing Translator engine.
In addition, Skype has built a custom framework that ties the whole process together and coordinates the different parts of the system. Running the entire system simply and efficiently is no small feat.
Skype's real-time voice translation system still faces many challenges. Language changes quickly and everyone speaks in their own way, both of which cause a lot of trouble for real-time translation. Vikram Dendi, a director at Microsoft Research, said that as of Monday 50,000 users had signed up for the Skype Translator preview, and a day later that number had doubled. More and more people are excited about a technology that could really change the way they communicate.