Introduction
Text-to-speech (TTS) converts text into audible speech through a TTS engine. This article does not describe how to build a TTS engine; it briefly introduces how to use the Microsoft Speech SDK to build your own text-to-speech application.
Microsoft Speech SDK Overview
The Microsoft Speech SDK is a software development kit provided by Microsoft. Its Speech API (SAPI) covers two main areas:
- 1. API for text-to-speech
- 2. API for Speech Recognition
Among them, the API for text-to-speech is the interface to the Microsoft TTS engine; through it we can easily build powerful text-to-speech programs. Kingsoft's word-reading feature uses exactly this API, and currently almost all text-reading tools are developed with this SDK. The API for speech recognition is the counterpart of TTS. Speech recognition is an exciting technology, but because its accuracy and speed are not yet ideal, it has not seen wide application.
The Microsoft Speech SDK can be downloaded free of charge from Microsoft's website; the current version is 5.1. To support Chinese, you must also download the additional language pack (LangPack).
To use the SDK in VC, you must add the SDK's include and Lib directories to the project. To avoid adding the directories to every project, the best way is to add them once under Tools -> Options -> Directories.
A simple example
Let's take a look at an example of Getting Started:
#include <sapi.h>
#pragma comment(lib, "ole32.lib") // CoInitialize/CoCreateInstance need ole32.dll
#pragma comment(lib, "sapi.lib")  // sapi.lib must be found via the configured SDK lib directory

int main(int argc, char* argv[])
{
    ISpVoice* pVoice = NULL;

    // COM initialization:
    if (FAILED(::CoInitialize(NULL)))
        return FALSE;

    // Obtain the ISpVoice interface:
    HRESULT hr = CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                  IID_ISpVoice, (void**)&pVoice);
    if (SUCCEEDED(hr))
    {
        hr = pVoice->Speak(L"Hello World", 0, NULL);
        pVoice->Release();
        pVoice = NULL;
    }

    // Do not forget:
    ::CoUninitialize();
    return TRUE;
}
In just over 20 lines of code we get text-to-speech conversion. SAPI is encapsulated as COM; whether or not you are familiar with COM, you only need CoInitialize() and CoCreateInstance() to obtain the ISpVoice interface step by step. Note that after COM initialization, CoUninitialize() must be called to release resources before the program ends.
Main functions of the ISpVoice interface
The program above obtains the ISpVoice interface and then uses ISpVoice::Speak() to render the text as speech. Clearly, the core of the program is the ISpVoice interface. Besides Speak, ISpVoice has many other member functions; for details see the SDK documentation. The usage of the major ones follows:
HRESULT Speak(const WCHAR *pwcs, DWORD dwFlags, ULONG *pulStreamNumber);
Function: speaks the given text.
Parameters: pwcs is the input text string and must be Unicode; an ANSI string must first be converted to Unicode. dwFlags selects the Speak mode; SPF_IS_XML indicates that the input text contains XML tags, discussed below. pulStreamNumber is an output parameter that receives the position of this text in the waiting queue; it is only meaningful in asynchronous mode.
HRESULT Pause(void);
HRESULT Resume(void);
Function: self-explanatory: pause and resume playback.
HRESULT SetRate(long RateAdjust);
HRESULT GetRate(long *pRateAdjust);
Function: sets/gets the playback rate, range: -10 to 10.
HRESULT SetVolume(USHORT usVolume);
HRESULT GetVolume(USHORT *pusVolume);
Function: sets/gets the playback volume, range: 0 to 100.
HRESULT SetSyncSpeakTimeout(ULONG msTimeout);
HRESULT GetSyncSpeakTimeout(ULONG *pmsTimeout);
Function: sets/gets the synchronous-speak timeout. In synchronous mode, calling Speak blocks the program until Speak returns; to avoid blocking for too long, set a timeout, in milliseconds.
HRESULT SetOutput(IUnknown *pUnkOutput, BOOL fAllowFormatChanges);
Function: sets the output target. Using SetOutput to send Speak's output to a WAV file is described below.
All of these functions return an HRESULT: S_OK on success, and a specific error code on failure.
Use XML
Personally, I think the most powerful feature of this TTS API is its ability to parse XML tags: volume, pitch, emphasis, and pauses can all be set through XML tags, which gets the output close to natural speech. As mentioned above, when the Speak parameter dwFlags is set to SPF_IS_XML, the TTS engine parses the XML in the text. The input does not need to strictly follow W3C standards; it only needs to contain the XML tags. Here is an example:
……
pVoice->Speak(L"<VOICE REQUIRED=\"NAME=Microsoft Mary\"/>volume<VOLUME LEVEL=\"100\">turn up</VOLUME>", SPF_IS_XML, NULL);
……

The tag

<VOICE REQUIRED="NAME=Microsoft Mary"/>

sets the voice to Microsoft Mary. The English version of the SDK ships with three voices; the other two are Microsoft Sam and Microsoft Mike. The tag

<VOLUME LEVEL="100">

sets the volume to 100; the volume range is 0 to 100.
In addition, there is a tag for pitch (-10 to 10):
<PITCH MIDDLE="10">text</PITCH>
Note: in C/C++ source, the quotation marks inside the string must be escaped with \, otherwise an error occurs. A tag for speed (-10 to 10):
<RATE SPEED="-10">text</RATE>
Read letter by letter:
<SPELL>text</SPELL>
Emphasis:
<EMPH>text</EMPH>
Pause for 200 milliseconds (up to 65,536 milliseconds):
<SILENCE MSEC="200" />
Control pronunciation:
<PRON SYM="h eh - l ow 1"/>
This tag is powerful. The key point is that all speech is composed of basic phonemes. In Chinese, Pinyin is the most basic element of pronunciation: if you know the Pinyin of a word, you can pronounce it even without knowing how it is written. The TTS engine does not necessarily know every word, but if you give it the symbols (SYM) corresponding to the phonemes, it can read them out; English pronunciation can likewise be expressed in phonetic symbols. "h eh - l ow 1" is the phoneme sequence for the word hello. For the mapping between pronunciation and SYM values, see the phoneme table in the SDK documentation.
In addition, there is a set of rules for reading numbers, dates, and times; the SDK documentation describes them in detail, so they are not repeated here. One example:

<context ID="date_ymd">1999.12.21</context>

will be read as

"December twenty first nineteen ninety nine"
XML tags can be nested, but must comply with the XML standard. The tags are easy to use and the results are good. The drawback, in a word: tedious. Adding tags throughout a passage of text quickly becomes tiresome.
Outputting the speech to a WAV file
#include <sapi.h>
#include <sphelper.h>
#include <atlbase.h> // CComPtr (ATL)
#pragma comment(lib, "ole32.lib")
#pragma comment(lib, "sapi.lib")

int main(int argc, char* argv[])
{
    ISpVoice* pVoice = NULL;

    if (FAILED(::CoInitialize(NULL)))
        return FALSE;

    HRESULT hr = CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                  IID_ISpVoice, (void**)&pVoice);
    if (SUCCEEDED(hr))
    {
        CComPtr<ISpStream>       cpWavStream;
        CComPtr<ISpStreamFormat> cpOldStream;
        CSpStreamFormat          OriginalFmt;

        // Remember the format of the current output stream.
        pVoice->GetOutputStream(&cpOldStream);
        OriginalFmt.AssignFormat(cpOldStream);

        // Bind a WAV file to a stream with that format.
        hr = SPBindToFile(L"D:\\output.wav", SPFM_CREATE_ALWAYS,
                          &cpWavStream, &OriginalFmt.FormatId(),
                          OriginalFmt.WaveFormatExPtr());
        if (SUCCEEDED(hr))
        {
            // Redirect the voice's output to the file stream.
            pVoice->SetOutput(cpWavStream, TRUE);
            WCHAR WTX[] = L"<VOICE REQUIRED=\"NAME=Microsoft Mary\"/>text to wave";
            pVoice->Speak(WTX, SPF_IS_XML, NULL);
            pVoice->Release();
            pVoice = NULL;
        }
    }

    ::CoUninitialize();
    return TRUE;
}
SPBindToFile binds the file to an output stream, and SetOutput then directs the voice's output to that file-bound stream.
Final words
After reading this article, don't you think it is simple? Microsoft has encapsulated powerful functionality very well. In fact, the other API in the SDK, SR (Speech Recognition), is even more interesting; if you are interested, give it a try and you may be pleasantly surprised.