A brief introduction to SAPI
API Overview
The SAPI API provides a high-level interface between an application and a speech engine. SAPI implements all of the low-level details needed to control and manage the real-time operation of various speech engines.
The two basic types of SAPI engine are text-to-speech (TTS) systems and speech recognition systems. TTS systems synthesize text strings and files into spoken audio using synthetic voices. Speech recognition systems convert spoken human audio into readable text strings or files.
Text-to-Speech API
Applications control text-to-speech through the ISpVoice Component Object Model (COM) interface. Once an application has created an ISpVoice object (see the Text-to-Speech Guide), it simply calls ISpVoice::Speak to generate speech from the text data. In addition, the ISpVoice interface provides methods for changing voice and synthesis properties, such as the speaking rate with ISpVoice::SetRate, the output volume with ISpVoice::SetVolume, and the current voice with ISpVoice::SetVoice.
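For illustration, here is a minimal sketch of that call sequence, assuming sapi.h, ATL's CComPtr, and COM already initialized; error handling is abbreviated and the spoken text is only an example:
CComPtr<ISpVoice> cpVoice;                                        // smart pointer to the voice object
HRESULT hr = cpVoice.CoCreateInstance(CLSID_SpVoice);             // create the TTS voice
if (SUCCEEDED(hr))
{
    cpVoice->SetRate(0);                                          // default speaking rate
    cpVoice->SetVolume(100);                                      // full output volume
    hr = cpVoice->Speak(L"Hello from SAPI.", SPF_DEFAULT, NULL);  // speak synchronously
}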
An application can also embed SAPI XML tags in the input text to change the properties of the speech synthesizer in real time, such as the voice, pitch, emphasis, speaking rate, and volume. These synthesis tags are defined in sapi.xsd using standard XML, a simple but powerful way to customize TTS speech independently of the specific engine or the voice currently in use.
The ISpVoice::Speak method can operate synchronously (return only after the text has been completely spoken) or asynchronously (return immediately and speak in the background). When speaking asynchronously (SPF_ASYNC), real-time status information such as the speaking state and the current text position can be obtained through ISpVoice::GetStatus. Also while speaking asynchronously, new text can either interrupt the output currently being spoken or be automatically appended to the end of the text currently being spoken.
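Continuing the cpVoice object from the sketch above, the following hedged example speaks XML-tagged text asynchronously and then queries the status; the tag values and text are illustrative only:
// asynchronous speak with embedded SAPI XML tags
hr = cpVoice->Speak(L"<RATE SPEED=\"-3\">This part is slower.</RATE>"
                    L"<VOLUME LEVEL=\"60\">This part is quieter.</VOLUME>",
                    SPF_ASYNC | SPF_IS_XML, NULL);                // returns immediately
SPVOICESTATUS status;
hr = cpVoice->GetStatus(&status, NULL);                           // running state, current text position
hr = cpVoice->WaitUntilDone(INFINITE);                            // block until background speaking ends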
In addition to the ISpVoice interface, SAPI provides many other useful COM interfaces for advanced TTS applications.
Event
SAPI communicates with applications by sending events using standard callback mechanisms (window messages, callback functions, or Win32 events). For TTS, events are mostly used for synchronizing with the speech output. Applications can synchronize with real-time behaviors as they occur, such as word boundaries, phonemes, visemes (mouth movements), or application-defined bookmarks. Applications initialize and handle these real-time events with ISpNotifySource, ISpNotifySink, ISpNotifyTranslator, ISpEventSink, ISpEventSource, and ISpNotifyCallback.
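A hedged sketch of routing TTS events to a window message, again continuing the cpVoice object above; the message ID WM_TTSEVENT and the window handle hWnd are assumptions of this example, not SAPI definitions:
const UINT WM_TTSEVENT = WM_APP + 1;                              // application-defined message (assumption)
hr = cpVoice->SetNotifyWindowMessage(hWnd, WM_TTSEVENT, 0, 0);    // deliver events as a window message
hr = cpVoice->SetInterest(SPFEI(SPEI_WORD_BOUNDARY) | SPFEI(SPEI_VISEME),
                          SPFEI(SPEI_WORD_BOUNDARY) | SPFEI(SPEI_VISEME));
// in the window procedure, drain the queued events with the CSpEvent helper
CSpEvent ev;
while (ev.GetFrom(cpVoice) == S_OK)
{
    if (ev.eEventId == SPEI_WORD_BOUNDARY) { /* word boundary reached */ }
}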
Dictionary
Applications can provide custom word pronunciations for the TTS engine through the ISpContainerLexicon, ISpLexicon, and ISpPhoneConverter interfaces.
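The following is only a rough sketch of adding a custom pronunciation to the user lexicon, assuming English (LANGID 0x409) and the sphelper.h helpers; the word and phoneme string are made-up examples and the exact phoneme set depends on the engine:
CComPtr<ISpLexicon> cpLexicon;
CComPtr<ISpPhoneConverter> cpPhoneConv;
SPPHONEID phoneId[64] = {0};                                      // buffer for the converted phone IDs
hr = cpLexicon.CoCreateInstance(CLSID_SpUnCompressedLexicon);     // the user lexicon
hr = SpCreatePhoneConverter(0x409, NULL, NULL, &cpPhoneConv);     // phoneme-text converter
hr = cpPhoneConv->PhoneToId(L"h eh l ow", phoneId);               // phoneme string -> phone IDs
hr = cpLexicon->AddPronunciation(L"helo", 0x409, SPPS_Noun, phoneId);  // register the pronunciation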
Resources
Finding and selecting SAPI speech data, such as voice files and pronunciation lexicons, is controlled through the following COM interfaces: ISpDataKey, ISpRegDataKey, ISpObjectTokenInit, ISpObjectTokenCategory, ISpObjectToken, IEnumSpObjectTokens, ISpObjectWithToken, ISpResourceManager, and ISpTask.
Audio
Finally, there is a set of interfaces for directing audio output to particular targets, such as telephony or custom hardware: ISpAudio, ISpMMSysAudio, ISpStream, ISpStreamFormat, and ISpStreamFormatConverter.
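As one example of redirecting output, the hedged sketch below writes TTS output from the cpVoice object above to a WAV file using the SPBindToFile and CSpStreamFormat helpers from sphelper.h; the file name and wave format are illustrative:
CComPtr<ISpStream> cpStream;
CSpStreamFormat cAudioFmt;
hr = cAudioFmt.AssignFormat(SPSF_22kHz16BitMono);                 // pick a wave format
hr = SPBindToFile(L"output.wav", SPFM_CREATE_ALWAYS, &cpStream,
                  &cAudioFmt.FormatId(), cAudioFmt.WaveFormatExPtr());
hr = cpVoice->SetOutput(cpStream, TRUE);                          // voice now writes to the file stream
hr = cpVoice->Speak(L"Saved to a file.", SPF_DEFAULT, NULL);
cpStream->Close();                                                // finish writing the WAV file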
Speech recognition API
Just as ISpVoice is the main interface for speech synthesis, ISpRecoContext is the main interface for speech recognition. Like ISpVoice, it is an ISpEventSource, which means it is the vehicle through which the application receives notifications of the speech recognition events it has requested.
An application must choose between two different types of speech recognition engines (ISpRecognizer). A shared recognizer, which can be shared with other speech recognition applications, is recommended for most recognition applications. To create an ISpRecoContext for a shared ISpRecognizer, an application only needs to call COM's CoCreateInstance on the component CLSID_SpSharedRecoContext. In this case, SAPI creates the audio input stream itself and sets it to SAPI's default audio input. For large server applications that run alone on a system and for which performance is critical, an InProc speech recognition engine is more appropriate.
To create an ISpRecoContext for an InProc ISpRecognizer, the application must first call CoCreateInstance on CLSID_SpInprocRecoInstance to create its own InProc ISpRecognizer. It must then call ISpRecognizer::SetInput (see also ISpObjectToken) to set up the audio input stream. Finally, the application can call ISpRecognizer::CreateRecoContext to obtain an ISpRecoContext.
The next step is to set up notifications for the events the application is interested in. Because the ISpRecoContext is an ISpEventSource, which in turn is an ISpNotifySource, the application can call one of the ISpNotifySource methods on its ISpRecoContext to indicate where that ISpRecoContext's events should be reported. It should then call ISpEventSource::SetInterest to indicate which events it needs to be notified of. The most important event is SPEI_RECOGNITION, which indicates that the ISpRecognizer associated with this ISpRecoContext has recognized some speech. Details of the other available speech recognition events are given in SPEVENTENUM.
Finally, a speech application must create, load, and activate an ISpRecoGrammar, which essentially indicates which types of utterances to recognize, i.e., dictation or a command and control grammar. First, the application creates an ISpRecoGrammar using ISpRecoContext::CreateGrammar. Next, it loads the appropriate grammar by calling one of two methods: ISpRecoGrammar::LoadDictation for dictation, or one of the ISpRecoGrammar::LoadCmdxxx methods for command and control. Finally, to activate these grammars so that recognition can begin, the application calls ISpRecoGrammar::SetDictationState for dictation, or ISpRecoGrammar::SetRuleState or ISpRecoGrammar::SetRuleIdState for command and control.
When a recognition is returned to the application through the requested notification mechanism, the lParam member of the SPEVENT structure will be an ISpRecoResult, from which the application can determine what was recognized and which ISpRecoGrammar of the ISpRecoContext was used.
An ISpRecognizer, whether shared or InProc, can have multiple ISpRecoContexts associated with it, and each can be notified through its own event notification mechanism. Multiple ISpRecoGrammars can be created from a single ISpRecoContext in order to recognize different types of utterances.
The main steps for speech recognition development in MFC with the Microsoft Speech SDK 5.1, using Speech API 5.1 + VC6 as an example, are as follows:
1. Initialize COM
Typically, in the CWinApp-derived class, call the CoInitializeEx function to initialize COM, with code such as the following:
::CoInitializeEx(NULL, COINIT_APARTMENTTHREADED); // initialize COM
Note: to call this function, open the project settings, go to the C/C++ tab, select the Preprocessor category, and append ",_WIN32_DCOM" to the Preprocessor definitions text box; otherwise the code will not compile.
2. Create the recognition engine
The Microsoft Speech SDK 5.1 supports two modes: shared and exclusive (InProc). In general you can use the shared mode; large server applications should use InProc. As follows:
hr = m_cpRecoEngine.CoCreateInstance(CLSID_SpSharedRecognizer); // shared
hr = m_cpRecoEngine.CoCreateInstance(CLSID_SpInprocRecognizer); // InProc
If it is the shared mode, you can go straight to step 3; if it is InProc, you must set the audio input using ISpRecognizer::SetInput. As follows:
CComPtr<ISpObjectToken> cpAudioToken; // define a token
hr = SpGetDefaultTokenFromCategoryId(SPCAT_AUDIOIN, &cpAudioToken); // get the default audio input token
if (SUCCEEDED(hr)) { hr = m_cpRecoEngine->SetInput(cpAudioToken, TRUE); }
Or:
CComPtr<ISpAudio> cpAudio; // define an audio object
hr = SpCreateDefaultObjectFromCategoryId(SPCAT_AUDIOIN, &cpAudio); // create the default audio input object
hr = m_cpRecoEngine->SetInput(cpAudio, TRUE); // set the recognition engine's input source
3. Create the recognition context interface
Call ISpRecognizer::CreateRecoContext to create the recognition context interface (ISpRecoContext), as follows:
hr = m_cpRecoEngine->CreateRecoContext(&m_cpRecoCtxt);
4. Set the recognition message
Call SetNotifyWindowMessage to tell Windows which window message is our recognition message that needs to be processed. As follows:
hr = m_cpRecoCtxt->SetNotifyWindowMessage(m_hWnd, WM_RECOEVENT, 0, 0);
SetNotifyWindowMessage is defined in ISpNotifySource.
5. Set the events we are interested in
The most important event is SPEI_RECOGNITION; refer to SPEVENTENUM for the others. The code is as follows:
const ULONGLONG ullInterest = SPFEI(SPEI_SOUND_START) | SPFEI(SPEI_SOUND_END) | SPFEI(SPEI_RECOGNITION);
hr = m_cpRecoCtxt->SetInterest(ullInterest, ullInterest);
6. Create grammar rules
Grammar rules are the soul of recognition and must be set. There are two kinds: one is dictation, the other is command and control (C&C). First create the grammar object using ISpRecoContext::CreateGrammar, then load the different grammar rules, as follows:
// dictation
hr = m_cpRecoCtxt->CreateGrammar(GIDDICTATION, &m_cpDictationGrammar);
if (SUCCEEDED(hr))
{
    hr = m_cpDictationGrammar->LoadDictation(NULL, SPLO_STATIC); // load the dictation topic
}
// C&C
hr = m_cpRecoCtxt->CreateGrammar(GIDCMDCTRL, &m_cpCmdGrammar);
Then use ISpRecoGrammar::LoadCmdxxx to load the grammar, for example loading it from CmdCtrl.xml:
WCHAR wszXMLFile[256] = L"";
MultiByteToWideChar(CP_ACP, 0, (LPCSTR)"CmdCtrl.xml", -1, wszXMLFile, 256); // ANSI to Unicode
hr = m_cpCmdGrammar->LoadCmdFromFile(wszXMLFile, SPLO_DYNAMIC);
Note: for C&C, the grammar file is in XML format; see Designing Grammar Rules in the Speech SDK 5.1 documentation. A simple example:
<GRAMMAR LANGID="804">
    <DEFINE>
        <ID NAME="CMD" VAL="..."/>
    </DEFINE>
    <RULE NAME="COMMAND" ID="CMD" TOPLEVEL="ACTIVE">
        <L>
            <P>Yin Cheng</P>
            <P>Shandong University</P>
            <P>Chinese Academy of Sciences</P>
        </L>
    </RULE>
</GRAMMAR>
LANGID="804" indicates Simplified Chinese; each command phrase is added inside <P>...</P>.
7. Activate the grammars to begin recognition
hr = m_cpDictationGrammar->SetDictationState(SPRS_ACTIVE); // dictation
hr = m_cpCmdGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE); // C&C
8. Get the recognition message and process it
Intercept the recognition message (WM_RECOEVENT) and process it. The recognition result is placed in the ISpRecoResult held by the CSpEvent. As follows:
USES_CONVERSION;
CSpEvent event;
// retrieve the queued events from the recognition context
while (event.GetFrom(m_cpRecoCtxt) == S_OK)
{
    switch (event.eEventId)
    {
    case SPEI_RECOGNITION:
        {
            // a voice input has been recognized
            m_bGotReco = TRUE;
            static const WCHAR wszUnrecognized[] = L"<Unrecognized>";
            CSpDynamicString dstrText;
            // get the recognition result text
            if (FAILED(event.RecoResult()->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                                                   TRUE, &dstrText, NULL)))
            {
                dstrText = wszUnrecognized;
            }
            BSTR srOut;
            dstrText.CopyToBSTR(&srOut);
            CString recString;
            recString.Empty();
            recString = srOut;
            // further processing
            // ......
        }
        break;
    }
}
9. Release the created engine, recognition context objects, grammars, and so on, by calling the appropriate release functions, as sketched below.
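A minimal sketch of that cleanup, assuming the member names used in the steps above and ATL smart pointers:
// deactivate the grammars, then release all SAPI objects and uninitialize COM
if (m_cpDictationGrammar) m_cpDictationGrammar->SetDictationState(SPRS_INACTIVE);
if (m_cpCmdGrammar) m_cpCmdGrammar->SetRuleState(NULL, NULL, SPRS_INACTIVE);
m_cpDictationGrammar.Release();
m_cpCmdGrammar.Release();
m_cpRecoCtxt.Release();
m_cpRecoEngine.Release();
::CoUninitialize();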