XML
In this section, let's take a look at the existing XML standardization efforts in the speech field.
The work done by the W3C consortium on voice browsing
As an authoritative standards organization, the consortium has had its eye on voice browsing (the "voice browser") for a long time. It set up a working group called "Voice Browser" in October 1998; bear in mind that XML itself had only been proposed at the end of 1996. The working group's goal is very clear: to set standards for voice browsing in preparation for the coming wave of voice-browsing applications. Its work falls mainly into the following areas, some of which we will look at more closely.
Markup language requirements for dialogs
To put it simply, for a voice browser to combine different input and output modes easily and to carry on a harmonious interactive "dialog" with the user, constraints must be placed on the markup language used to represent the dialog data. As you can imagine, HTML, designed to describe flat Web pages, is not up to the job for voice browsers: not only is it messy, its two-dimensional, page-oriented nature also makes it unable to represent an interactive conversation.
Requirements for reusable dialogs in the markup language
As in high-level programming languages, there are small modules that are reused over and over. Such modules are written as functions or procedures and are called each time they are needed. The markup language used by voice browsers needs a similar mechanism for the dialog fragments that come up again and again; you will see this concretely later in the article. Frequently reused fragments include simple confirmations and the collection of phone numbers, dates, amounts, credit card information, and addresses; a sketch of how such a reusable dialog might be invoked is shown below.
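To make this concrete, here is a purely illustrative sketch (VoiceXML itself is described in detail later in this article; the file name and field names below are made up):
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">
    <!-- hand control to a reusable dialog that collects a phone number -->
    <subdialog name="phone" src="collect_phone.vxml#get_phone">
      <filled>
        <!-- values collected by the subdialog are returned to the caller -->
        <prompt>Your number is <value expr="phone.number"/>.</prompt>
      </filled>
    </subdialog>
  </form>
</vxml>
The point of such a mechanism is that the phone-number dialog is written once and then invoked, like a function call, from any dialog that needs it.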
Requirements for describing speech recognition grammars in the markup language
We mentioned earlier that implementing a voice browser is inseparable from applying speech technology. To accept spoken input, the voice browser uses an existing speech recognition platform, and different kinds of recognition place different demands on that platform. If all that is needed is speaker-independent recognition of a small vocabulary, not much effort has to go into recognition; but as soon as the requirements are relaxed a little, recognition suddenly becomes very hard. For speaker-independent, small-vocabulary recognition of input that has a grammatical structure, the grammar of the expected speech input must be described to the recognizer at the interface to the speech recognition platform. This is the most common situation for voice browsers today, and the consortium defines a set of markup representations for such grammars, sketched below.
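The following sketch shows roughly what such a grammar description looks like. The element names follow the XML form of the W3C Speech Recognition Grammar Specification that grew out of this work, and the phrases themselves are invented for illustration:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en" root="confirm">
  <!-- the recognizer only has to tell a few fixed phrases apart -->
  <rule id="confirm">
    <one-of>
      <item>yes</item>
      <item>no</item>
      <item>say that again</item>
    </one-of>
  </rule>
</grammar>
A grammar like this tells the recognition platform exactly which utterances to expect, which is what makes small-vocabulary, speaker-independent recognition practical.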
Requirements for natural language processing in the markup language
This is really the speech understanding problem we mentioned earlier. Like the previous point, it is closely tied to speech technology: it describes the requirements for marking up natural language understanding in a voice browser implementation.
Requirements for speech synthesis in the markup language
To produce spoken output, speech synthesis has to be used. A simple piece of pre-recorded sound can also be viewed as a form of speech synthesis, but in practice TTS (Text to Speech) is used far more often. How should the statements to be synthesized be represented? Different voice platforms have different methods, and the consortium abstracts what these methods have in common, for example that a particular word in a sentence should be stressed, or that a sentence should be read in a male voice. With a markup language, a piece of text to be synthesized can be described in a uniform way, along the lines of the sketch below.
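As a small foretaste of the markup languages described below (the element names here follow the consortium's draft presented later in this section), stressing one word and asking for a male voice might be written as:
<speak>
  <voice gender="male">
    <sentence>I <emphasis level="strong">really</emphasis> mean it.</sentence>
  </voice>
</speak>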
Below we will look at some of the consortium's work in more detail.
Speech synthesis
We have already touched on some of the problems of speech synthesis. When a voice browser needs to turn character output into natural speech, the linguistic features of the text to be read must be marked up ahead of time and handed to the speech synthesizer. In voice browsers this is implemented with an XML-based markup language.
Figure 1
As Figure 1 above shows, the text data coming from the front end of the voice browser is turned into the markup language by a conversion mechanism; that conversion is really just the work of the application code. The marked-up text is then fed into a specific speech synthesizer, and the end user hears natural speech.
Figure 2
Note that the synthesis approach shown in Figure 1 is not the only one. There is also a simpler way: break the text data down into phrases that are reused often, fetch matching pre-recorded speech segments, and splice them together into speech; see Figure 2. The drawbacks of this approach are obvious, however. First, the text data is limited to a fixed vocabulary and fixed sentence patterns. Second, the cost of manual recording is high, and it multiplies as the range of text data grows. Finally, the speech produced this way is not very natural.
Since different companies have different voice application platforms, the markup languages used for speech synthesis have not been uniform. Fortunately, because the underlying synthesis technology is essentially the same, these different markup languages mark up similar phonetic features; they differ only in how some of the tags are written.
Let's take a look at some of the major speech synthesis markup languages.
1 JSML (Java Speech Markup Language)
JSML, as the name implies, is the markup language for text that is fed into a Java Speech API synthesizer. JSML carries the natural-speech attributes of the text to be synthesized, and because it uses the Unicode character set it works for almost every language.
Example:
<?xml version="1.0" encoding="gb2312"?>
<JSML>
  <PARA> You owe me
    <EMP>
      <SAYAS class="number">10000 yuan</SAYAS>
    </EMP>
  </PARA>
  <PARA>
    <EMP>
      <SAYAS class="literal">Too exaggerated</SAYAS>
    </EMP>
  </PARA>
</JSML>
We will not explain every JSML tag here, since the consortium's standard is introduced in more detail below. From the example you can see that this is a dialog. <PARA> marks a paragraph. <EMP> marks emphasis; its level attribute, not shown in the example above, can also set how heavy the stress is. <SAYAS> controls how the text is read out, with the details set by its attributes: class="number" above means the value is read as a number, and class="literal" means it is read out literally, character by character, in this case "Too exaggerated".
2 Sable
As above, let's use an example to understand it.
Example:
<DIV TYPE="paragraph"> This is the song that
<EMPH> you </EMPH>
picked for me
<PITCH BASE="high" RANGE="large">
<RATE SPEED="-20%"> last year </RATE>
</PITCH>
</DIV>
<AUDIO SRC="Easycome_easygo.wav"/>
Sable is broadly similar to JSML, so I will just pick out a few of the more distinctive tags to explain. <PITCH> sets requirements on pitch. <RATE> sets requirements on the speaking rate. <AUDIO> inserts a section of pre-recorded speech.
A typical speech synthesis system that supports Sable is Festival, a multilingual speech synthesis system developed by the Centre for Speech Technology Research (CSTR) at the University of Edinburgh in the UK. For anyone engaged in speech research, CSTR is worth studying, because in addition to a mature speech synthesis system such as Festival it also provides an open source speech recognition software package.
Now let's look at the consortium's standard for a speech synthesis markup language.
As with the two languages above, the consortium divides the elements and attributes it defines into three categories: one describes the structure of the text, one describes phonetic features, and the third supplements the first two, generally to leave room for expansion.
We will present these elements and attributes in tabular form. The first table lists the elements themselves:

Document structure, text processing and pronunciation
speak - the root element
xml:lang - an attribute of the root element that specifies the language
paragraph, sentence - structural elements representing a paragraph and a sentence
sayas - defines how the text is to be preprocessed; attribute: type
phoneme - defines the pronunciation of the text; attribute: ph

Prosody
voice - element that selects among different speaking voices
emphasis - element marking stressed pronunciation
break - empty element representing a pause
prosody - element controlling the prosodic features of the speech

Other
audio - element that inserts a pre-recorded speech segment
mark - a marker used in asynchronous synthesis
Let's take a closer look at the attribute values of some of the more complex elements.
The type attribute of the sayas element can take the following values:
Pronunciation
acronym - an abbreviation, read out letter by letter
sub (an attribute of the sayas element) - its value is spoken in place of the element content; it is not declared in the DTD listed later

Numbers
number - the content is read as a number; related values include ordinal, digits and so on

Time and measurement
date - date format
time - time format
duration - the length of a time interval
currency - monetary amount format
measure - the text is read as a measurement

Address and identity
name - the name of a person or company
net - a network address, either an e-mail address or a Web address
address - a postal address
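For example, a fragment along the following lines (a sketch using the draft elements just listed; the text is made up) tells the synthesizer how to read otherwise ambiguous pieces of text:
<sentence>
  Your order of <sayas type="currency">$25.40</sayas>
  will ship on <sayas type="date">2000-06-01</sayas>;
  questions go to <sayas type="net">support@example.com</sayas>.
</sentence>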
"Voice" Node property value:
Gender |
Sex of vocal Speaker |
Age |
Age range of vocal speakers |
Variant |
Audible to the different sounds of the speaker. (optional, platform-specific variables) |
Name |
Audible identification for the platform. (may be regarded as the name of the vocal speaker) |
Attributes of the break element:
size - sets the size of the pause
time - sets the length of the pause as an explicit duration
"Prosody" Property property value:
Pitch |
Base tone setting |
Contour |
Base Tone Profile Setting |
Range |
Pitch change range setting |
Rate |
Speed |
Duration |
Pronunciation Time length Set |
Volume |
Pronunciation with the size set |
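Putting the prosody-related elements together, a fragment along these lines (the attribute values are illustrative only) asks for an older male voice, inserts a pause, and then slows down and softens the final sentence:
<speak xml:lang="en">
  <paragraph>
    <voice gender="male" age="elder">
      Let me think about that.
      <break size="large"/>
      <prosody rate="-30%" volume="soft">All right, I agree.</prosody>
    </voice>
  </paragraph>
</speak>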
The following is the DTD for the speech synthesis markup language drawn up by the consortium:
<?xml version= "1.0" encoding= "Iso-8859-1"?>
<!--Speech synthesis Markup Language v0.5 20000504-->
<! ENTITY% allowed-within-sentence "#PCDATA | Sayas | Phoneme |voice | Emphasis | Break | prosody | Audio | Value | Mark ">
<! ENTITY% structure "paragraph | P | Sentence | S ">
<! ENTITY% duration "CDATA" >
<! ENTITY% integer "CDATA" >
<! ENTITY% uri "CDATA" >
<! ENTITY% phoneme-string "CDATA" >
<! ENTITY% Phoneme-alphabet "CDATA" >
<!--definitions of the structural elements. -->
<!--Currently, these elements support the Xml:lang attribute-->
<! ELEMENT speak (%allowed-within-sentence; |%structure;) *>
<! ELEMENT paragraph (%allowed-within-sentence; | sentence | s) *>
<! ELEMENT sentence (%allowed-within-sentence;) *>
<!--the flexible container elements can occur within paragraph-->
<!--and sentence but may also contain these structural elements. -->
<! ENTITY% voice-name "CDATA" >
<! ELEMENT voice (%allowed-within-sentence; |%structure;) *>
<! Attlist Voice Gender (male|female|neutral) #IMPLIED
Age (%integer;|child|teenager|adult|elder) #IMPLIED
Variant (%integer;|different) #IMPLIED
Name (%voice-name;|default) #IMPLIED >
<! ELEMENT prosody (%allowed-within-sentence; |%structure;) *>
<! Attlist prosody
Pitch CDATA #IMPLIED
Contour CDATA #IMPLIED
Range CDATA #IMPLIED
Rate CDATA #IMPLIED
Duration CDATA #IMPLIED
Volume CDATA #IMPLIED >
<! ELEMENT Audio (%allowed-within-sentence; |%structure;) *>
<! Attlist Audio
SRC%uri; #IMPLIED >
<!--these basic container elements can contain any of the-->
<!--within-sentence elements, but neither sentence or paragraph. -->
<! ELEMENT emphasis (%allowed-within-sentence;) *>
<! Attlist emphasis level (strong|moderate|none|reduced) ' moderate ' >
<!--these basic container elements can contain only data-->
<! ENTITY% sayas-types "(acronym|number|ordinal|digits|telephone|date|time| duration|currency|measure|name|net|address) ">
<! ELEMENT Sayas (#PCDATA) >
<! Attlist Sayas
Type%sayas-types; #REQUIRED >
<! ELEMENT phoneme (#PCDATA) >
<! attlist phoneme ph%phoneme-string; #REQUIRED
Alphabet%phoneme-alphabet; #IMPLIED >
<!--definitions of the basic empty elements-->
<! ELEMENT Break empty>
<! Attlist Break Size (large|medium|small|none) ' Medium '
Time%duration; #IMPLIED >
<! ELEMENT Mark Empty>
<! Attlist Mark name CDATA #REQUIRED >
Voice dialogs
A voice dialog, in this context, is the dialog control part of the voice browser; it is the main "food" on which the whole voice browser runs. VoiceXML is in fact a typical dialog control language for voice browsing. Writing VoiceXML is very much like writing a program, rather than a mere stack of information the way HTML is.
Because of the voice browser's special browsing mode, designing the markup language for the dialog control part is very difficult. The consortium therefore simply describes its requirements and then endorses ready-made proposals submitted by industry. Since we will cover the structure of VoiceXML in detail later, this section only gives a brief introduction to the consortium's requirements statement.
Broadly, the requirements statement sets out four kinds of requirements:
Form requirements: these mainly govern the form of the browser's input and output and of the data at intermediate stages of processing.
Functional requirements: these mainly define the dialog styles to be supported. For the dialog modes most commonly used in voice browsing, they spell out the dialog functions that a voice dialog markup language is required to provide.
Format requirements: there are also some requirements on how the markup language is written, for example support for reusable, commonly used units.
The remaining requirements are rather vague; they include the handling of triggered events, the user's voice browsing system, and voice-based user authentication.