A First Look at Windows Speech Programming

Source: Internet
Author: User
Tags: sapi

I. Introduction to SAPI

Speech technology in software covers two areas: speech recognition and speech synthesis. Both require the support of a speech engine. Microsoft's speech application programming interface (API) is not an industry standard, but it is widely used.

SAPI stands for the Microsoft Speech API. The related speech recognition (SR) and speech synthesis (SS) engines ship with the Speech SDK. The engines support recognition and reading aloud in several languages, including English, Chinese, and Japanese.

SAPI includes the following groups of component objects (interfaces):

(1) Voice Command API. Controls an application by voice and is generally used in speech recognition systems. Once a command is recognized, the relevant interface is called to carry out the corresponding function in the application. A program that wants to implement voice control must use this group of objects.
(2) Voice Dictation API. Dictation input, that is, the speech recognition interface.
(3) Voice Text API. Converts text to speech, that is, speech synthesis.
(4) Voice Telephony API. Integrates speech recognition and speech synthesis into a telephone system. This interface can be used to build a telephone answering system, or even to control a computer by telephone.
(5) Audio Objects API. Encapsulates the computer's audio system.

SAPI is built on the COM architecture. Microsoft also provides ActiveX controls, so SAPI can be used not only in ordinary Windows programs but also in web pages, VBA, and even Excel charts. If you are unfamiliar with COM, you can use Microsoft's C++ wrappers, which wrap the Speech SDK's COM objects in C++ classes.

II. Installing the SAPI SDK

First, download the development kit from this site: Http://

Microsoft Speech SDK 5.1 adds Automation support, so it can be used from languages that support Automation, such as VB and ECMAScript.

Version information:
Version: 5.1
Release date: 8/8/2001
Language: English
Download size: 2.0 MB - 288.8 MB

The SDK also includes a redistributable speech synthesis (TTS) engine for English and Chinese, and a speech recognition (SR) engine for English, Chinese, and Japanese.

The SDK requires Windows 98 or later. Compiling the sample programs in the development kit requires VC6 or later.

****** Download instructions ******:
(1) If you want the sample programs, the documentation, SAPI itself, and the US English engines for development, download SpeechSDK51.exe (about 68 MB).
(2) If you want to use the Simplified Chinese and Japanese engines, download SpeechSDK51LangPack.exe (about 82 MB).
(3) If you want to redistribute the speech engines with your own software, download SpeechSDK51MSM.exe (about 132 MB).
(I was unable to download it from this address.)
(4) If you want the Mike and Mary voices on XP, download Sp5TTIntXP.exe (about 3.5 MB).
(5) If you want the documentation for the development kit, download sapi.chm (about 2.3 MB). It is already included in SpeechSDK51.exe.

After the downloads finish, first install SpeechSDK51.exe, then install the Chinese language pack SpeechSDK51LangPack.exe, and then unpack msttss22L, which automatically installs the required DLLs into the system directory.

III. Configuring the VC Environment

To compile a speech project in VC 6.0, first configure the build environment. Assume that the SDK is installed in D:\Microsoft Speech SDK 5.1\. Open the Project Settings dialog box, select the Preprocessor category on the C/C++ tab, and enter the following as an additional include directory:
D:\Microsoft Speech SDK 5.1\Include
This tells VC where to find the SAPI header files the program needs at compile time.
Switch to the Link tab and, under the Input category, enter the following as an additional library path:
D:\Microsoft Speech SDK 5.1\Lib\i386
This lets VC find sapi.lib at link time.

IV. Speech Synthesis: Implementing TTS (Text to Speech) with SAPI

1. Initialize the voice interface, generally in one of two ways:
ISpVoice *pVoice = NULL;
::CoInitialize(NULL);
HRESULT hr = CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL, IID_ISpVoice,
    (void **)&pVoice);
You can then use this pointer to call SAPI functions. For example:
pVoice->SetVolume(50); // set the volume
pVoice->Speak(str.AllocSysString(), SPF_ASYNC, NULL); // str is assumed to be a CString; the caller owns the BSTR returned by AllocSysString()

You can also use the following method:
CComPtr<ISpVoice> m_cpVoice;
HRESULT hr = m_cpVoice.CoCreateInstance(CLSID_SpVoice);
The m_cpVoice variable is used in the examples that follow.

CLSID_SpVoice is defined in sapi.h.

2. Get/set the output frequency (audio format).

When SAPI reads text aloud, it can output audio in many formats, such as:
8 kHz 8-bit mono, 8 kHz 8-bit stereo, 44 kHz 16-bit mono, 44 kHz 16-bit stereo, and so on. They differ in sound quality. For details, see sapi.h.

You can use the following code to obtain the current configuration:
CComPtr<ISpStreamFormat> cpStream;
HRESULT hrOutputStream = m_cpVoice->GetOutputStream(&cpStream);
if (hrOutputStream == S_OK)
{
    CSpStreamFormat Fmt;
    HRESULT hr = Fmt.AssignFormat(cpStream);
    if (SUCCEEDED(hr))
    {
        SPSTREAMFORMAT eFmt = Fmt.ComputeFormatEnum();
    }
}
SPSTREAMFORMAT is an enum type defined in sapi.h. Each value corresponds to a different format setting. For example, SPSF_8kHz8BitStereo = 5.
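
The two enum values quoted in this article (5 for 8 kHz 8-bit stereo and 21 for 22 kHz 8-bit stereo) fit a regular layout in which the standard PCM formats start at 4 and each sample rate takes four consecutive values. The sketch below is plain C++ with no SAPI dependency and computes the numeric value under that assumption. StreamFormatValue is a hypothetical helper, not a SAPI function, and the assumed layout and rate list should be checked against sapi.h before relying on it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of SAPI): returns the numeric SPSTREAMFORMAT
// value for a standard PCM format. ASSUMPTION: the enum starts at
// SPSF_8kHz8BitMono = 4 and each sample rate contributes four consecutive
// values in the order 8-bit mono, 8-bit stereo, 16-bit mono, 16-bit stereo.
// Returns -1 if the combination is not one of the assumed standard formats.
int StreamFormatValue(int rateKHz, int bits, int channels)
{
    // Assumed order of the sample rates (in kHz) in the enum.
    static const int kRates[] = {8, 11, 12, 16, 22, 24, 32, 44, 48};
    const std::size_t kRateCount = sizeof(kRates) / sizeof(kRates[0]);

    if ((bits != 8 && bits != 16) || (channels != 1 && channels != 2))
        return -1;

    for (std::size_t i = 0; i < kRateCount; ++i)
    {
        if (kRates[i] == rateKHz)
        {
            const int value = 4 + static_cast<int>(i) * 4
                            + (bits == 16 ? 2 : 0)
                            + (channels == 2 ? 1 : 0);
            assert(value >= 4 && value <= 39); // range implied by the assumed layout
            return value;
        }
    }
    return -1; // unknown sample rate
}
```

In a real project it is safer to use the named constants from sapi.h, such as SPSF_8kHz8BitStereo, than computed numbers; the helper is only meant to make the structure of the enum visible.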

Use the following code to set the output format:
CComPtr<ISpAudio> m_cpOutAudio; // audio output interface
SpCreateDefaultObjectFromCategoryId(SPCAT_AUDIOOUT, &m_cpOutAudio); // create the interface

SPSTREAMFORMAT eFmt = SPSF_22kHz8BitStereo; // numeric value 21

CSpStreamFormat Fmt;
Fmt.AssignFormat(eFmt);
if (m_cpOutAudio)
    hr = m_cpOutAudio->SetFormat(Fmt.FormatId(), Fmt.WaveFormatExPtr());
else
    hr = E_FAIL;

if (SUCCEEDED(hr))
    m_cpVoice->SetOutput(m_cpOutAudio, FALSE);

3. Get/set the voice used for reading.

The voice data files used by the engine are generally stored as SPD or VCE files under SpeechEngines. After the SDK is installed, the available voices are recorded in the registry, for example male and female voices for English and a male voice for Simplified Chinese. Location:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens
On a Chinese operating system, the default voice for reading aloud is Simplified Chinese. One shortcoming of SAPI is that it does not support reading mixed Chinese and English text: the Chinese voice can only spell out English words letter by letter. The program therefore has to switch voices on its own.
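
One way to prepare text for the manual voice switching described above is to split it into runs that can each be handed to a single voice. The sketch below is plain C++ with no SAPI dependency. TextRun and SplitByScript are hypothetical names, and treating every ASCII character (including spaces, digits, and punctuation) as "English" is a simplifying assumption made only for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One run of text intended to be spoken by a single voice.
struct TextRun
{
    bool isAscii;       // true: candidate for an English voice; false: for a Chinese voice
    std::wstring text;
};

// Hypothetical helper (not part of SAPI): splits text into consecutive runs of
// ASCII and non-ASCII characters, preserving the original order.
std::vector<TextRun> SplitByScript(const std::wstring &text)
{
    std::vector<TextRun> runs;
    for (wchar_t ch : text)
    {
        const bool ascii = static_cast<unsigned long>(ch) < 0x80UL;
        // Start a new run whenever the character class changes.
        if (runs.empty() || runs.back().isAscii != ascii)
            runs.push_back(TextRun{ascii, std::wstring()});
        runs.back().text.push_back(ch);
    }
    assert(text.empty() == runs.empty());
    return runs;
}
```

Each run could then be spoken in turn after selecting the matching voice token. A production version would probably attach whitespace and punctuation to the neighboring run instead of always counting them as English.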

(1) You can use the following function to fill a combo box with the voices supported by the current SDK:
// SAPI 5 helper function in sphelper.h
HWND hWndCombo = GetDlgItem(hWnd, IDC_COMBO_VOICES); // combo box handle
HRESULT hr = SpInitTokenComboBox(hWndCombo, SPCAT_VOICES);
This function enumerates the currently available voices through the IEnumSpObjectTokens interface, adds each voice's description text to the combo box, and stores the interface pointer as the item's LPARAM.
Remember to release the interfaces stored in the combo box when the program exits:
SpDestroyTokenComboBox(hWndCombo);
This function works by fetching the LPARAM of each combo box item in turn, converting it to an IUnknown interface pointer, and calling Release on it.
(2) When the combo box selection changes, you can use the following function to obtain the voice the user selected:
ISpObjectToken *pToken = SpGetCurSelComboBoxToken(hWndCombo);

(3) Use the following function to obtain the voice currently in use:
CComPtr<ISpObjectToken> pOldToken;
HRESULT hr = m_cpVoice->GetVoice(&pOldToken);
(4) When the selected voice differs from the voice currently in use, change it as follows:
if (pOldToken != pToken)
{
    // First stop the current reading. This step is optional.
    HRESULT hr = m_cpVoice->Speak(NULL, SPF_PURGEBEFORESPEAK, 0);
    if (SUCCEEDED(hr))
        hr = m_cpVoice->SetVoice(pToken);
}
(5) You can also use the SpGetTokenFromId function to obtain the token pointer of a specific voice. For example:
CComPtr<ISpObjectToken> pChineseToken;
WCHAR pszTokenId[] = L"HKEY_LOCAL_MACHINE\\Software\\Microsoft\\Speech\\Voices\\Tokens\\MSSimplifiedChineseVoice";
SpGetTokenFromId(pszTokenId, &pChineseToken);

4. Start/pause/resume/stop the current reading

The text to be read must be a wide string. Assuming it is in szwTextString:
To start reading:
hr = m_cpVoice->Speak(szwTextString, SPF_ASYNC | SPF_IS_NOT_XML, 0);
To have the text interpreted as XML, use:
hr = m_cpVoice->Speak(szwTextString, SPF_ASYNC | SPF_IS_XML, 0);

To pause: m_cpVoice->Pause();
To resume: m_cpVoice->Resume();
To stop (as in the example above):
hr = m_cpVoice->Speak(NULL, SPF_PURGEBEFORESPEAK, 0);
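
When text is passed with SPF_IS_XML, characters such as < and & in ordinary text would presumably be treated as markup, so plain text that is embedded in XML should be escaped first. The sketch below is plain C++ with no SAPI dependency and applies the five standard XML entity escapes. EscapeForXml is a hypothetical helper name, not something provided by the SDK.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of SAPI): replaces the characters that have a
// special meaning in XML with the corresponding predefined entities.
std::wstring EscapeForXml(const std::wstring &text)
{
    std::wstring out;
    out.reserve(text.size());
    for (wchar_t ch : text)
    {
        switch (ch)
        {
        case L'&':  out += L"&amp;";  break;
        case L'<':  out += L"&lt;";   break;
        case L'>':  out += L"&gt;";   break;
        case L'"':  out += L"&quot;"; break;
        case L'\'': out += L"&apos;"; break;
        default:    out.push_back(ch); break;
        }
    }
    assert(out.size() >= text.size()); // escaping never shortens the text
    return out;
}
```

Text that contains no markup at all can simply be spoken with SPF_IS_NOT_XML, in which case no escaping is needed.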

5. Skip part of the text being read

While reading, you can skip part of the text and continue from there. The code is as follows:
ULONG ulGarbage = 0;
WCHAR szGarbage[] = L"Sentence";
hr = m_cpVoice->Skip(szGarbage, SkipNum, &ulGarbage);
SkipNum is the number of sentences to skip. The value can be positive or negative.
According to the SDK documentation, SAPI currently supports only the Sentence item type. SAPI uses punctuation marks to delimit sentences.

6. Play a WAV file. SAPI can play WAV files through the ISpStream interface:

USES_CONVERSION; // required by the T2W conversion macro
CComPtr<ISpStream> cpWavStream;
WCHAR szwWavFileName[NORM_SIZE] = L"";

wcscpy(szwWavFileName, T2W(szAFileName)); // convert the WAV file name from ANSI to a wide string

// Use the function provided by sphelper.h to open the WAV file and obtain an ISpStream pointer
hr = SpBindToFile(szwWavFileName, SPFM_OPEN_READONLY, &cpWavStream);
if (SUCCEEDED(hr))
    m_cpVoice->SpeakStream(cpWavStream, SPF_ASYNC, NULL); // play the WAV file

7. Save the spoken output to a WAV file.
USES_CONVERSION; // required by the T2W conversion macro
TCHAR szFileName[256]; // assumed to hold the path of the target file
WCHAR m_szWFileName[MAX_FILE_PATH];
wcscpy(m_szWFileName, T2W(szFileName)); // convert to a wide string

// Create an output stream and bind it to a WAV file
CSpStreamFormat OriginalFmt;
CComPtr<ISpStream> cpWavStream;
CComPtr<ISpStreamFormat> cpOldStream;
HRESULT hr = m_cpVoice->GetOutputStream(&cpOldStream);
if (hr == S_OK) hr = OriginalFmt.AssignFormat(cpOldStream);
else hr = E_FAIL;
// Use the function provided in sphelper.h to create the WAV file
if (SUCCEEDED(hr))
    hr = SpBindToFile(m_szWFileName, SPFM_CREATE_ALWAYS, &cpWavStream,
        &OriginalFmt.FormatId(), OriginalFmt.WaveFormatExPtr());
if (SUCCEEDED(hr))
{
    // Send the audio output to the WAV file instead of the speakers
    m_cpVoice->SetOutput(cpWavStream, TRUE);
    // Start reading aloud
    m_cpVoice->Speak(szwTextString, SPF_ASYNC | SPF_IS_NOT_XML, 0);

    // Wait until reading ends
    m_cpVoice->WaitUntilDone(INFINITE);
    cpWavStream.Release();

    // Redirect the output back to the original stream
    m_cpVoice->SetOutput(cpOldStream, FALSE);
}

8. Set the reading volume and speed
m_cpVoice->SetVolume((USHORT)hpos); // set the volume; the range is 0 to 100
m_cpVoice->SetRate(hpos); // set the speed; the range is -10 to 10

The hpos value is generally located
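
Because the volume and rate have fixed ranges (0 to 100 and -10 to 10, as stated above), it can help to clamp a value from the user interface before passing it on. The sketch below is plain C++ with no SAPI dependency. ClampVolume and ClampRate are hypothetical helper names, and the return types are chosen to match a USHORT volume and a long rate.

```cpp
#include <cassert>

// Ranges taken from the text above.
const int kMinVolume = 0;
const int kMaxVolume = 100;
const int kMinRate = -10;
const int kMaxRate = 10;

// Restricts value to the closed interval [lo, hi].
int ClampInt(int value, int lo, int hi)
{
    assert(lo <= hi);
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

// Hypothetical helper: returns a volume that is safe to pass as a USHORT.
unsigned short ClampVolume(int value)
{
    return static_cast<unsigned short>(ClampInt(value, kMinVolume, kMaxVolume));
}

// Hypothetical helper: returns a rate within the supported range.
long ClampRate(int value)
{
    return static_cast<long>(ClampInt(value, kMinRate, kMaxRate));
}
```

With these helpers, the calls above could be written as m_cpVoice->SetVolume(ClampVolume(hpos)) and m_cpVoice->SetRate(ClampRate(hpos)).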

9. Set up SAPI notification messages. While reading, SAPI sends a message to a specified window. After receiving the message, the window can retrieve the pending SAPI events. Depending on the event, the user can obtain information about SAPI's current state, such as the position of the word being read or the current read-aloud value used for display. This event is not provided for Chinese voices.

To receive SAPI notifications, register a message:
m_cpVoice->SetNotifyWindowMessage(hWnd, WM_TTSAPPCUSTOMEVENT, 0, 0);
This call is usually made when the main window is initialized; hWnd is the handle of the main window (or of whichever window receives the message). WM_TTSAPPCUSTOMEVENT is a user-defined message.

In the window's handler for the WM_TTSAPPCUSTOMEVENT message, use the following code to retrieve the SAPI notification events:

CSpEvent event; // this class is more convenient than using the SPEVENT structure directly

while (event.GetFrom(m_cpVoice) == S_OK)
{
    switch (event.eEventId)
    {
        // handle the event types of interest here
    }
}

There are many kinds of eEventId. For example, SPEI_START_INPUT_STREAM indicates that reading has started, and SPEI_END_INPUT_STREAM indicates that reading has ended.
Check for and handle the ones you need.

V. Conclusion

SAPI has many other features, such as speech recognition and grammar analysis. With limited time and resources, I have not tried them all. If you are interested, install the SDK and explore them yourself.
