Introduction and comparison of several common voice interaction platforms


1. Overview

I recently worked on two speech-recognition-related projects. The main task of both was speech recognition, or more specifically keyword recognition, but they targeted different platforms: one ran on Windows, the other on Android, so each used a different speech recognition platform. The former was developed with Microsoft's Speech API, the latter with CMU Pocketsphinx. This article gives a brief introduction to and comparison of several common voice interaction platforms.

The voice interaction described here includes speech recognition (SR, also known as automatic speech recognition, ASR) and speech synthesis (SS, also known as text-to-speech, TTS); voiceprint recognition (VPR) technology is also mentioned.

Speech recognition is the technology by which a computer receives, recognizes, and understands speech signals and converts them into corresponding text or commands. It is a cross-disciplinary field involving phonetics, linguistics, signal processing, pattern recognition, probability and information theory, the mechanisms of sound production and hearing, and artificial intelligence. With the help of a speech recognition system, even a user who does not understand computers, or cannot use one, can operate a computer by voice.

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary text into standard, fluent speech in real time, in effect giving the machine an artificial mouth. It draws on acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a cutting-edge technology in the field of Chinese information processing. Its central problem is how to turn text information into audible sound information, that is, how to make the machine speak like a person.

Below, several common voice interaction platforms are introduced, grouped by whether they are open source. For the principles behind speech recognition and speech synthesis, please see my next article.

2. Commercial voice interaction platforms

1) Microsoft Speech API

Microsoft's Speech API (SAPI) is Microsoft's application programming interface (API) for speech recognition (SR) and speech synthesis (SS) engines, widely used on Windows. Microsoft has released multiple versions of SAPI (the latest is SAPI 5.4), distributed either as Speech SDK development packages or directly as part of the Windows operating system. SAPI supports recognition and synthesis of multiple languages, including English, Chinese, and Japanese. Its versions fall into two families: versions 1 through 4, which resemble one another and differ only in incremental features, and SAPI 5, a completely new series that differs from the first four.

The earliest SAPI, version 1.0, was released in 1995 and supported Windows 95 and Windows NT 3.51. It contained a low-level API for direct speech recognition and direct speech synthesis, through which applications could control the recognition or synthesis engine directly, plus a simplified API for higher-level voice commands and voice calls. SAPI 3.0, released in 1997, added dictation-style (non-continuous) speech recognition and some example applications. In 1998 Microsoft released SAPI 4.0, which included not only the core COM API but also C++ class wrappers that made programming in C++ easier, as well as ActiveX controls that could be dragged and dropped in VB. This version's SS engine shipped with Windows 2000, while the SR and SS engines were also released in SDK form.

SAPI 5.0 was released in 2000. This version strictly embodies the idea of decoupling the application from the engine: all calls go through dynamic invocation of sapi.dll, which makes the API more engine-independent and prevents applications from depending on an engine with particular features. The change is also intended to simplify application development by deferring some configuration and initialization code to runtime.

2) IBM ViaVoice

IBM was among the institutions that began speech recognition research earliest. In the late 1950s, IBM started studying speech recognition, aiming to detect specific language patterns and derive statistical correlations between sounds and their corresponding text. At the 1964 World's Fair, IBM presented its digital "Shoebox" speech recognizer to the world. In 1984, IBM's speech recognition system achieved a 95% recognition rate on a 5,000-word vocabulary.

In 1992, IBM introduced its first dictation system, the IBM Speech Server Series (ISSS). A new version released in 1996 as VoiceType 3.0 was the prototype of ViaVoice; it could recognize isolated words and sequential commands without training. VoiceType 3.0 supported Windows 95 and was integrated into the OS/2 Warp system. At the same time, IBM released the world's first continuous dictation system, MedSpeak Radiology. Finally, in time for the holiday shopping season, IBM released the popular and practical VoiceType Simply Speaking system, the world's first consumer dictation product.

In 1999, IBM released a free version of VoiceType. In 2003, IBM granted ScanSoft worldwide exclusive distribution of ViaVoice-based desktop products, even though ScanSoft had a competing product, Dragon NaturallySpeaking. Two years later, ScanSoft merged with Nuance, and the company was formally renamed Nuance Communications, Inc. Today the IBM ViaVoice SDK is hard to find; it has faded from view, replaced by Nuance.

3) Nuance

Nuance Communications is a multinational computer software technology company headquartered in Burlington, Massachusetts, offering speech and imaging solutions and applications. Its current business focuses on server-side and embedded speech recognition, telephone call steering systems, automated telephone directory services, medical transcription software and systems, optical character recognition software, and desktop imaging software.

Beyond speech recognition, Nuance's speech technology also extends to speech synthesis, voiceprint recognition, and other areas. The company holds more than 1,000 patents; more than 80% of speech recognition deployments use Nuance's recognition engine technology, and its voice products support more than 50 languages with more than 2 billion users worldwide. Apple's iPhone 4S is rumored to use Nuance's speech recognition service in Siri. In addition, Nuance announced that its automotive-grade Dragon Drive will provide a hands-free communication interface on the new Audi A3, enabling information retrieval and messaging.

The Nuance Voice Platform (NVP) is a voice internet platform launched by Nuance. It consists of three functional blocks: the Nuance Conversation Server, the Nuance Application Environment (NAE), and the Nuance Management Station. The Conversation Server includes a VoiceXML interpreter, a text-to-speech converter (TTS), and voiceprint authentication software integrated with the Nuance speech recognition module. The NAE includes a graphical development tool that makes designing voice applications as convenient as designing an application framework. The Management Station provides very powerful system management and analysis capabilities designed to meet the unique needs of voice services.

4) iFlytek

iFlytek is a familiar name. Its full name is Anhui iFlytek Information Technology Co., Ltd.; its predecessor, Anhui Zhongke Information Technology Co., Ltd., was founded in December 1999 and renamed Anhui iFlytek Information Technology Co., Ltd. in 2007. It is now an enterprise specializing in intelligent speech and language technology research, software and chip product development, and voice information services; it is the leader in China's speech technology industry and has considerable international influence.

As China's largest intelligent speech technology provider, iFlytek has long-term research and accumulation in intelligent speech, with internationally leading achievements in Chinese speech synthesis, speech recognition, oral language evaluation, and other technologies. In 2003, iFlytek won the only National Science and Technology Progress Award granted in China's speech industry, and in 2005 it received the highest honor for independent innovation in China's information industry, the Important Technology Invention Award of the Information Industry. From 2006 to 2011, it took first place in six consecutive Blizzard Challenge international English speech synthesis evaluations. In 2008 it won the NIST international speaker recognition evaluation (NIST 2008), and in 2009 it placed first in the high-difficulty confusable-dialect test and second in the general test of the NIST international language recognition evaluation (NIST 2009).

iFlytek offers a full range of voice interaction technology, including speech recognition, speech synthesis, and voiceprint recognition. With independent intellectual property in intelligent speech technology, iFlytek has launched products ranging from large-scale carrier-grade applications to small embedded applications, from telecommunications, finance, and other industries to enterprise and home users, and from PCs to mobile phones to MP3/MP4/PMP players and toys, meeting a wide variety of application environments. iFlytek holds more than a 60% share of the Chinese speech technology market, and more than 70% for speech synthesis products.

5) Other

Other influential commercial voice interaction platforms include Google Voice Search and the voice input methods of Baidu and Sogou. Compared with the four platforms above, their scope of application is relatively limited and their influence smaller, so they are not introduced in detail here.

3. Open-source voice interaction platforms

1) CMU Sphinx

CMU Sphinx, often referred to simply as Sphinx, is an open-source speech recognition system developed by Carnegie Mellon University (CMU). It includes a series of speech recognizers and acoustic model training tools.

There are multiple versions of Sphinx: Sphinx 1 through 3 are written in C, Sphinx-4 in Java, and Pocketsphinx is a streamlined, optimized version for embedded devices. Sphinx-I, developed by Kai-Fu Lee around 1987, used a fixed HMM model (with three codebooks of size 256) and is known as the first high-performance continuous speech recognition system (reaching 90%+ accuracy on the Resource Management database). Sphinx-II, developed by Xuedong Huang around 1992, used a semi-continuous HMM model with a five-state topology and an N-gram language model, with a fast lextree-based real-time decoder; its recognition rate on the WSJ dataset also reached 90%+.
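The HMM decoding at the heart of these recognizers can be sketched with a toy Viterbi decoder. The states, observations, and probabilities below are invented for illustration and are not taken from any Sphinx model; real decoders work the same way but over thousands of states and continuous acoustic features.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Find the most likely state sequence for an observation sequence.
    Works in log-space to avoid underflow, as real decoders do."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor state for s at time t
            best = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = V[t-1][best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs[t]])
            new_path[s] = path[best] + [s]
        path = new_path
    best_last = max(states, key=lambda s: V[-1][s])
    return path[best_last]

# Toy model: two phone-like states emitting coarse acoustic classes.
states = ("sil", "speech")
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"quiet": 0.9, "loud": 0.1},
          "speech": {"quiet": 0.2, "loud": 0.8}}
print(viterbi(("quiet", "loud", "loud", "quiet"), states, start_p, trans_p, emit_p))
# → ['sil', 'speech', 'speech', 'sil']
```

Sphinx-II's semi-continuous HMMs and Sphinx-III's fully continuous HMMs differ in how the emission probabilities are modeled, but the search over state sequences follows this same dynamic-programming pattern.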

Sphinx-III was developed mainly by Eric Thayer and Mosur Ravishankar around 1996. It uses a fully continuous (and semi-continuous) HMM model with flexible feature vectors and a flexible HMM topology, and includes two optional decoders: a slower flat search and a faster lextree search. Its WER (word error rate) on the 1998 Broadcast News (BN) evaluation set is 19%. The initial Sphinx-III also had many limitations, such as supporting only triphones and only N-gram language models (not CFG/FSA/SCFG), using the same HMM topology for every acoustic unit, and having a uniform acoustic model. The latest Sphinx-III is version 0.8, released in early 2009, which improves on many of these points.
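The WER figure quoted above is the standard accuracy metric for recognizers: the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch (the sentences are made-up examples):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])   # substitution (or match)
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") against a six-word reference: WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Evaluation campaigns such as the BN tests score systems with exactly this kind of alignment, typically via NIST's scoring tools.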

The latest Sphinx speech recognition system includes the following packages:
• Pocketsphinx: recognizer library written in C
• Sphinxbase: support library required by Pocketsphinx
• Sphinx4: adjustable, modifiable recognizer written in Java
• CMUCLMTK: language model tools
• SphinxTrain: acoustic model training tools
The executable files and source code of these packages can be downloaded for free from SourceForge.

2) HTK

HTK is the abbreviation of Hidden Markov Model Toolkit. HTK is used mainly for speech recognition research, and has since been applied to many other areas, including speech synthesis, character recognition, and DNA sequencing.

HTK was originally developed in 1989 by the Machine Intelligence Laboratory (formerly the Speech Vision and Robotics Group) of the Cambridge University Engineering Department (CUED) to build CUED's large-vocabulary speech recognition systems. In 1993, Entropic Research Laboratory Inc. acquired the right to sell HTK, and in 1995 the rights were transferred to the newly established Entropic Cambridge Research Laboratories Ltd. Entropic sold HTK until 1999, when Microsoft acquired Entropic; Microsoft then licensed HTK's copyright back to CUED and provided support, and CUED re-released HTK and now provides development support on the web.

The latest version of HTK, 3.4.1, was released in 2009. For HTK's implementation and the use of its various tools, see the HTK documentation, the HTKBook.

3) Julius

Julius is an open-source, high-performance, two-pass large-vocabulary continuous speech recognition (LVCSR) engine suitable for a wide range of researchers and developers. Using word 3-gram and context-dependent HMM models, it can perform real-time speech recognition on current PCs with vocabularies of up to 60k words.
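A word 3-gram (trigram) model of the kind Julius uses estimates the probability of each word given the two preceding words from corpus counts. A minimal maximum-likelihood sketch follows; the toy corpus is invented, and real systems add smoothing and back-off, which this sketch omits:

```python
from collections import defaultdict

def train_trigram(sentences):
    """Maximum-likelihood trigram model with <s>/</s> padding (no smoothing)."""
    tri = defaultdict(int)   # counts of (w1, w2, w3)
    bi = defaultdict(int)    # counts of the (w1, w2) history
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[(words[i-2], words[i-1], words[i])] += 1
            bi[(words[i-2], words[i-1])] += 1
    # P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
    return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

p = train_trigram(["turn the light on",
                   "turn the light off",
                   "turn the radio on"])
print(p("turn", "the", "light"))   # 2 of the 3 continuations after "turn the"
```

During decoding, such language-model probabilities are combined with the acoustic HMM scores to rank candidate word sequences.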

Julius integrates major search algorithms, and its high modularity keeps its structure and models independent. It supports various HMM types (such as shared-state triphones and tied-mixture models), multiple microphone channels, and combinations of multiple models and structures. It uses standard formats, which eases cross-use with other toolkits. Its main supported platforms are Linux and other Unix-like systems, and it also runs on Windows. It is open source under a BSD license.

Since 1997, Julius has been developed as part of a free software toolkit for Japanese LVCSR research, and since 2000 it has been maintained by Japan's Continuous Speech Recognition Consortium (CSRC). Starting with version 3.4, a grammar-based recognition parser named Julian was introduced; Julian is a modified Julius that uses a hand-crafted DFA as its language model and can be used to build small-vocabulary command recognition systems or voice dialog systems.

4) RWTH ASR

This toolkit, developed by the Human Language Technology and Pattern Recognition Group of RWTH Aachen University, contains the latest algorithms for automatic speech recognition.

The RWTH ASR toolkit includes acoustic model construction, a decoder, and other important components, among them speaker adaptation, speaker-adaptive training, unsupervised training, personalized training, and word-stem processing components. It supports Linux, Mac OS, and other operating systems, and its project site offers fairly comprehensive documentation and examples, as well as ready-made models for research use.

The toolkit is released under an open-source license derived from the QPL and is permitted for non-commercial use only.

5) Other

The open-source toolkits mentioned above are mainly for speech recognition; other open-source speech recognition projects include Kaldi, Simon, iATROS-speech, SHoUT, Zanzibar OpenIVR, and so on.

Common open-source toolkits for speech synthesis include MARY, SpeakRight, Festival, FreeTTS, Festvox, eSpeak, Flite, and so on.

Common open-source toolkits for voiceprint recognition include ALIZE, OpenVP, and so on.

4. Summary

This article has introduced several common voice interaction platforms, mainly speech recognition and speech synthesis software and toolkits, and has also touched on voiceprint recognition. A brief summary follows:

The table above summarizes these platforms; I hope it is useful to the reader!


Reprinted from http://ibillxia.github.io/blog/2012/11/24/several-plantforms-on-audio-and-speech-signal-processing/
