XML
In this section, let's take a look at the existing XML standardization efforts in the speech field.
The work done by the W3C consortium on voice browsing
As an authoritative standards organization, the consortium has had its eye on voice browsing (the "voice browser") for a long time. It set up a working group called "Voice Browser" in October 1998; bear in mind that XML itself had only been proposed at the end of 1996. The working group's goal is very clear: to set standards for voice browsing in preparation for the coming wave of voice-browsing applications. Its work falls mainly into the following areas, some of which we will look at more closely.
Markup language requirements for dialogs
To put it simply, for a voice browser to combine different input and output modes easily and to carry on a harmonious interactive "dialog" with the user, constraints must be placed on the markup language used to represent the dialog data. As you can imagine, HTML, designed to describe flat Web pages, is not up to the job for voice browsers: not only is it messy, its two-dimensional, page-oriented nature also makes it unable to represent an interactive conversation.
Requirements for reusable dialogs in the markup language
As in high-level programming languages, there are small modules that are reused over and over. Such modules are written as functions or procedures and are called each time they are needed. The markup language used by voice browsers needs a similar mechanism for the dialog fragments that come up again and again; you will see this concretely later in the article. Frequently reused fragments include simple confirmations and the collection of phone numbers, dates, amounts, credit card information, and addresses; a sketch of how such a reusable dialog might be invoked is shown below.
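To make this concrete, here is a purely illustrative sketch (VoiceXML itself is described in detail later in this article; the file name and field names below are made up):
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">
    <!-- hand control to a reusable dialog that collects a phone number -->
    <subdialog name="phone" src="collect_phone.vxml#get_phone">
      <filled>
        <!-- values collected by the subdialog are returned to the caller -->
        <prompt>Your number is <value expr="phone.number"/>.</prompt>
      </filled>
    </subdialog>
  </form>
</vxml>
The point of such a mechanism is that the phone-number dialog is written once and then invoked, like a function call, from any dialog that needs it.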
Requirements for describing speech recognition grammars in the markup language
We mentioned earlier that implementing a voice browser is inseparable from applying speech technology. To accept spoken input, the voice browser uses an existing speech recognition platform, and different kinds of recognition place different demands on that platform. If all that is needed is speaker-independent recognition of a small vocabulary, not much effort has to go into recognition; but as soon as the requirements are relaxed a little, recognition suddenly becomes very hard. For speaker-independent, small-vocabulary recognition of input that has a grammatical structure, the grammar of the expected speech input must be described to the recognizer at the interface to the speech recognition platform. This is the most common situation for voice browsers today, and the consortium defines a set of markup representations for such grammars, sketched below.
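The following sketch shows roughly what such a grammar description looks like. The element names follow the XML form of the W3C Speech Recognition Grammar Specification that grew out of this work, and the phrases themselves are invented for illustration:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en" root="confirm">
  <!-- the recognizer only has to tell a few fixed phrases apart -->
  <rule id="confirm">
    <one-of>
      <item>yes</item>
      <item>no</item>
      <item>say that again</item>
    </one-of>
  </rule>
</grammar>
A grammar like this tells the recognition platform exactly which utterances to expect, which is what makes small-vocabulary, speaker-independent recognition practical.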
Requirements for natural language processing in the markup language
This is really the speech understanding problem we mentioned earlier. Like the previous point, it is closely tied to speech technology: it describes the requirements for marking up natural language understanding in a voice browser implementation.
Requirements for speech synthesis in the markup language
To produce spoken output, speech synthesis has to be used. A simple piece of pre-recorded sound can also be viewed as a form of speech synthesis, but in practice TTS (Text to Speech) is used far more often. How should the statements to be synthesized be represented? Different voice platforms have different methods, and the consortium abstracts what these methods have in common, for example that a particular word in a sentence should be stressed, or that a sentence should be read in a male voice. With a markup language, a piece of text to be synthesized can be described in a uniform way, along the lines of the sketch below.
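As a small foretaste of the markup languages described below (the element names here follow the consortium's draft presented later in this section), stressing one word and asking for a male voice might be written as:
<speak>
  <voice gender="male">
    <sentence>I <emphasis level="strong">really</emphasis> mean it.</sentence>
  </voice>
</speak>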
Below we will look at some of the consortium's work in more detail.
Speech synthesis
We have already touched on some of the problems of speech synthesis. When a voice browser needs to turn character output into natural speech, the linguistic features of the text to be read must be marked up ahead of time and handed to the speech synthesizer. In voice browsers this is implemented with an XML-based markup language.
Figure 1
As Figure 1 above shows, the text data coming from the front end of the voice browser is turned into the markup language by a conversion mechanism; that conversion is really just the work of the application code. The marked-up text is then fed into a specific speech synthesizer, and the end user hears natural speech.
Figure 2
Note that the synthesis approach shown in Figure 1 is not the only one. There is also a simpler way: break the text data down into phrases that are reused often, fetch matching pre-recorded speech segments, and splice them together into speech; see Figure 2. The drawbacks of this approach are obvious, however. First, the text data is limited to a fixed vocabulary and fixed sentence patterns. Second, the cost of manual recording is high, and it multiplies as the range of text data grows. Finally, the speech produced this way is not very natural.
Since different companies have different voice application platforms, the markup languages used for speech synthesis have not been uniform. Fortunately, because the underlying synthesis technology is essentially the same, these different markup languages mark up similar phonetic features; they differ only in how some of the tags are written.
Let's take a look at some of the major speech synthesis markup languages.
1 JSML (Java Speech Markup Language)
JSML, as the name implies, is the markup language for text that is fed into a Java Speech API synthesizer. JSML carries the natural-speech attributes of the text to be synthesized, and because it uses the Unicode character set it works for almost every language.
Example:
<?xml version="1.0" encoding="gb2312"?>
<JSML>
  <PARA> You owe me
    <EMP>
      <SAYAS class="number">10000 yuan</SAYAS>
    </EMP>
  </PARA>
  <PARA>
    <EMP>
      <SAYAS class="literal">Too exaggerated</SAYAS>
    </EMP>
  </PARA>
</JSML>
We will not explain every JSML tag here, since the consortium's standard is introduced in more detail below. From the example you can see that this is a dialog. <PARA> marks a paragraph. <EMP> marks emphasis; its level attribute, not shown in the example above, can also set how heavy the stress is. <SAYAS> controls how the text is read out, with the details set by its attributes: class="number" above means the value is read as a number, and class="literal" means it is read out literally, character by character, in this case "Too exaggerated".
2 Sable
As above, let's use an example to understand it.
Example:
<DIV TYPE="paragraph"> This is the song that
<EMPH> you </EMPH>
picked for me
<PITCH BASE="high" RANGE="large">
<RATE SPEED="-20%"> last year </RATE>
</PITCH>
</DIV>
<AUDIO SRC="Easycome_easygo.wav"/>
Sable is broadly similar to JSML, so I will just pick out a few of the more distinctive tags to explain. <PITCH> sets requirements on pitch. <RATE> sets requirements on the speaking rate. <AUDIO> inserts a section of pre-recorded speech.
A typical speech synthesis system that supports Sable is Festival, a multilingual speech synthesis system developed by the Centre for Speech Technology Research (CSTR) at the University of Edinburgh in the UK. For anyone engaged in speech research, CSTR is worth studying, because in addition to a mature speech synthesis system such as Festival it also provides an open source speech recognition software package.
Now let's look at the consortium's standard for a speech synthesis markup language.
As with the two languages above, the consortium divides the elements and attributes it defines into three categories: one describes the structure of the text, one describes phonetic features, and the third supplements the first two, generally to leave room for expansion.
We will present these elements and attributes in tabular form. The first table lists the elements themselves:

Document structure, text processing and pronunciation
speak - the root element
xml:lang - an attribute of the root element that specifies the language
paragraph, sentence - structural elements representing a paragraph and a sentence
sayas - defines how the text is to be preprocessed; attribute: type
phoneme - defines the pronunciation of the text; attribute: ph

Prosody
voice - element that selects among different speaking voices
emphasis - element marking stressed pronunciation
break - empty element representing a pause
prosody - element controlling the prosodic features of the speech

Other
audio - element that inserts a pre-recorded speech segment
mark - a marker used in asynchronous synthesis
Let's take a closer look at the attribute values of some of the more complex elements.
The type attribute of the sayas element can take the following values:
Pronunciation
acronym - an abbreviation, read out letter by letter
sub (an attribute of the sayas element) - its value is spoken in place of the element content; it is not declared in the DTD listed later

Numbers
number - the content is read as a number; related values include ordinal, digits and so on

Time and measurement
date - date format
time - time format
duration - the length of a time interval
currency - monetary amount format
measure - the text is read as a measurement

Address and identity
name - the name of a person or company
net - a network address, either an e-mail address or a Web address
address - a postal address
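For example, a fragment along the following lines (a sketch using the draft elements just listed; the text is made up) tells the synthesizer how to read otherwise ambiguous pieces of text:
<sentence>
  Your order of <sayas type="currency">$25.40</sayas>
  will ship on <sayas type="date">2000-06-01</sayas>;
  questions go to <sayas type="net">support@example.com</sayas>.
</sentence>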
"Voice" Node property value:
Gender |
Sex of vocal Speaker |
Age |
Age range of vocal speakers |
Variant |
Audible to the different sounds of the speaker. (optional, platform-specific variables) |
Name |
Audible identification for the platform. (may be regarded as the name of the vocal speaker) |
Attributes of the break element:
size - sets the size of the pause
time - sets the length of the pause as an explicit duration
"Prosody" Property property value:
Pitch |
Base tone setting |
Contour |
Base Tone Profile Setting |
Range |
Pitch change range setting |
Rate |
Speed |
Duration |
Pronunciation Time length Set |
Volume |
Pronunciation with the size set |
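Putting the prosody-related elements together, a fragment along these lines (the attribute values are illustrative only) asks for an older male voice, inserts a pause, and then slows down and softens the final sentence:
<speak xml:lang="en">
  <paragraph>
    <voice gender="male" age="elder">
      Let me think about that.
      <break size="large"/>
      <prosody rate="-30%" volume="soft">All right, I agree.</prosody>
    </voice>
  </paragraph>
</speak>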
The following is the DTD for the speech synthesis markup language drawn up by the consortium:
<?xml version= "1.0" encoding= "Iso-8859-1"?>
<!--Speech synthesis Markup Language v0.5 20000504-->
<! ENTITY% allowed-within-sentence "#PCDATA | Sayas | Phoneme |voice | Emphasis | Break | prosody | Audio | Value | Mark ">
<! ENTITY% structure "paragraph | P | Sentence | S ">
<! ENTITY% duration "CDATA" >
<! ENTITY% integer "CDATA" >
<! ENTITY% uri "CDATA" >
<! ENTITY% phoneme-string "CDATA" >
<! ENTITY% Phoneme-alphabet "CDATA" >
<!--definitions of the structural elements. -->
<!--Currently, these elements support the Xml:lang attribute-->
<! ELEMENT speak (%allowed-within-sentence; |%structure;) *>
<! ELEMENT paragraph (%allowed-within-sentence; | sentence | s) *>
<! ELEMENT sentence (%allowed-within-sentence;) *>
<!--the flexible container elements can occur within paragraph-->
<!--and sentence but may also contain these structural elements. -->
<! ENTITY% voice-name "CDATA" >
<! ELEMENT voice (%allowed-within-sentence; |%structure;) *>
<! Attlist Voice Gender (male|female|neutral) #IMPLIED
Age (%integer;|child|teenager|adult|elder) #IMPLIED
Variant (%integer;|different) #IMPLIED
Name (%voice-name;|default) #IMPLIED >
<! ELEMENT prosody (%allowed-within-sentence; |%structure;) *>
<! Attlist prosody
Pitch CDATA #IMPLIED
Contour CDATA #IMPLIED
Range CDATA #IMPLIED
Rate CDATA #IMPLIED
Duration CDATA #IMPLIED
Volume CDATA #IMPLIED >
<! ELEMENT Audio (%allowed-within-sentence; |%structure;) *>
<! Attlist Audio
SRC%uri; #IMPLIED >
<!--these basic container elements can contain any of the-->
<!--within-sentence elements, but neither sentence or paragraph. -->
<! ELEMENT emphasis (%allowed-within-sentence;) *>
<! Attlist emphasis level (strong|moderate|none|reduced) ' moderate ' >
<!--these basic container elements can contain only data-->
<! ENTITY% sayas-types "(acronym|number|ordinal|digits|telephone|date|time| duration|currency|measure|name|net|address) ">
<! ELEMENT Sayas (#PCDATA) >
<! Attlist Sayas
Type%sayas-types; #REQUIRED >
<! ELEMENT phoneme (#PCDATA) >
<! attlist phoneme ph%phoneme-string; #REQUIRED
Alphabet%phoneme-alphabet; #IMPLIED >
<!--definitions of the basic empty elements-->
<! ELEMENT Break empty>
<! Attlist Break Size (large|medium|small|none) ' Medium '
Time%duration; #IMPLIED >
<! ELEMENT Mark Empty>
<! Attlist Mark name CDATA #REQUIRED >
Voice dialogs
A voice dialog, in this context, is the dialog control part of the voice browser; it is the main "food" on which the whole voice browser runs. VoiceXML is in fact a typical dialog control language for voice browsing. Writing VoiceXML is very much like writing a program, rather than a mere stack of information the way HTML is.
Because of the voice browser's special browsing mode, designing the markup language for the dialog control part is very difficult. The consortium therefore simply describes its requirements and then endorses ready-made proposals submitted by industry. Since we will cover the structure of VoiceXML in detail later, this section only gives a brief introduction to the consortium's requirements statement.
Broadly, the requirements statement sets out four kinds of requirements:
Form requirements: these mainly govern the form of the browser's input and output and of the data at intermediate stages of processing.
Functional requirements: these mainly define the dialog styles to be supported. For the dialog modes most commonly used in voice browsing, they spell out the dialog functions that a voice dialog markup language is required to provide.
Format requirements: there are also some requirements on how the markup language is written, for example support for reusable, commonly used units.
The remaining requirements are rather vague; they include the handling of triggered events, the user's voice browsing system, and voice-based user authentication.