The application of XML in Speech (II.)


In this section, let's take a look at the existing XML standardization processes in speech.

The W3C's work on voice browsing

As an authoritative standards organization, the World Wide Web Consortium (W3C) has long had its eye on voice browsing (the "Voice Browser"). The W3C set up a working group called "Voice Browser" in October 1998; to put that in perspective, XML itself was first proposed only at the end of 1996. The working group's goal is very clear: to set standards for voice browsing in time for the coming wave of voice browsing. Its work mainly covers the following aspects, some of which we will examine more closely.

Requirements for a dialog markup language

To put it simply, for a voice browser to combine different input and output forms easily and carry on a harmonious, interactive "dialog" with the user, constraints must be placed on the markup language used to represent dialog data. As you can imagine, HTML, designed to represent flat Web pages, is not up to the job for voice browsers: not only is it messy, but its two-dimensional nature also makes it unable to represent an interactive conversation.

Requirements for reusable dialogs in the markup language

As in high-level programming languages, there are small modules that are reused frequently. These modules are designed as functions or procedures and are called each time they are needed. The markup language used by voice browsers needs a similar mechanism for parts of a dialog that recur often; we will experience this concretely in a later article, and a sketch follows below. Frequently reused parts include simple confirmations and the collection of telephone numbers, dates, amounts, credit card details, and addresses.
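To make the idea concrete, here is a minimal sketch of dialog reuse using the subdialog mechanism of VoiceXML (introduced later in this article). The file name phone_number.vxml and the returned field number are hypothetical, chosen only for illustration:

<!-- Sketch: a form that reuses a "collect a phone number" dialog.
     "phone_number.vxml" and the "number" return field are assumed,
     not part of any standard library. -->
<form id="order">
  <subdialog name="phone" src="phone_number.vxml">
    <filled>
      <!-- the subdialog's return value is available as phone.number -->
      <prompt>You entered <value expr="phone.number"/>. Thank you.</prompt>
    </filled>
  </subdialog>
</form>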

Requirements for a speech recognition grammar in the markup language

We mentioned earlier that implementing a voice browser is inseparable from applying speech technology. To recognize spoken input, a voice browser uses a ready-made speech recognition platform, and different recognition approaches place different demands on that platform. If all that is required is speaker-independent, small-vocabulary recognition, we may not need to spend much effort on the recognition itself. But once the requirements are relaxed a little, the difficulty of speech recognition rises sharply. For speaker-independent, small-vocabulary speech recognition with a grammatical structure, the grammar of the speech input must be described to the recognition platform through its interface. This is currently the most common situation for voice browsers. The W3C defines a set of markup representations for such grammars.
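As a taste of what such grammar markup looks like, here is a minimal sketch in the style of the XML grammar format the W3C went on to develop; the rule name and command words are illustrative only:

<!-- Sketch: the caller may say "check balance" or "transfer money". -->
<grammar version="1.0" root="command" xml:lang="en-US">
  <rule id="command">
    <one-of>
      <item>check balance</item>
      <item>transfer money</item>
    </one-of>
  </rule>
</grammar>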

Requirements for natural language processing in the markup language

This is actually the problem of speech understanding that we mentioned earlier. Like the previous point, it is closely tied to speech technology. It defines the requirements for marking up natural language understanding in a voice browser implementation.
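To illustrate the kind of markup this calls for, here is a hedged sketch of how the meaning extracted from an utterance might be represented; the element names are modeled loosely on the W3C's later natural language semantics drafts and should be read as an assumption, not the actual standard:

<!-- Sketch: a structured interpretation of the utterance
     "I want to fly to Beijing". All element names are illustrative. -->
<result>
  <interpretation confidence="0.85">
    <input>I want to fly to Beijing</input>
    <instance>
      <flight>
        <destination>Beijing</destination>
      </flight>
    </instance>
  </interpretation>
</result>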

Requirements for speech synthesis in the markup language

To produce speech output, speech synthesis must be used. A pre-recorded piece of sound can also be viewed as a form of speech synthesis, but in practice TTS (Text to Speech) is used far more often. How should the sentences to be synthesized be represented? Different voice platforms have different methods; the W3C abstracts the features these methods have in common, for example stressing a particular word in a sentence, or reading a sentence in a male voice. With the markup language, we can describe a piece of text to be synthesized in a uniform way.

Here we will zoom in on some of the W3C's work.

Speech synthesis

We have talked about some of the problems of speech synthesis earlier. When a voice browser needs to turn character output into natural speech, some of the language features of the text to be read must be marked up in advance and provided to the speech synthesizer. In voice browsers this is implemented with an XML-based markup language.

Figure 1

As Figure 1 above shows, the text data produced by the front end of the voice browser is converted by a transformation mechanism into the markup language of the speech synthesizer. The transformation mechanism here is, in practice, program code. The marked-up text is fed into a specific speech synthesizer, and the end user hears natural speech.
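For instance, the transformation step might turn a plain answer from a back-end service into marked-up text like the following (a hypothetical sketch using the W3C-style elements introduced later in this article):

<!-- Sketch: the plain text "Your balance is 128 yuan" after the
     transformation step, marked up for the synthesizer. -->
<speak>
  <sentence>Your balance is
    <sayas type="currency">128 yuan</sayas>
  </sentence>
</speak>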


Figure 2

Note that the speech synthesis method shown in Figure 1 is not the only one. There is also a simpler approach: break the text data down into phrases that recur often, then play back matching pre-recorded speech segments spliced together into natural speech. See Figure 2. But the drawbacks of this approach are obvious. First, the text data is limited to a certain effective vocabulary with a fixed sentence structure. Second, the overhead of manual recording is large, and it multiplies as the range of text data expands. Finally, speech synthesized this way does not sound very natural.

Since different companies have different voice application platforms, the markup languages used for speech synthesis so far are not uniform. Fortunately, because speech synthesis technology itself is consistent, these different markup languages all mark up similar phonetic features; they differ only in how some of the tags are written.

Let's take a look at some of the major speech synthesis markup languages.


1 JSML (Java Speech Markup Language)

JSML, as the name implies, is the markup language for text fed into a Java Speech API synthesizer. JSML describes the natural speech attributes of the synthesized text, and it uses the Unicode character set, so it works for almost every language.


Example:
<?xml version="1.0" encoding="gb2312"?>
<JSML>
<PARA>You owe me
<EMP>
<SAYAS class="number">10000 yuan</SAYAS>
</EMP>
</PARA>
<PARA>
<EMP>
<SAYAS class="literal">Too exaggerated</SAYAS>
</EMP>
</PARA>
</JSML>

We will not elaborate on the meaning of every JSML tag here, since we will introduce the W3C standard in more detail later. From the example we can see that this is a dialog. <PARA> indicates a paragraph. <EMP> indicates emphasis; its level attribute can also set the weight of the stress, which is not shown in the example above. <SAYAS> marks how a piece of text is to be read out, with the specific manner set by its attributes: class="number" above means the text is read as a numeric value, and class="literal" means it is read out character by character, as with "Too exaggerated".

2 Sable

As above, let's use an example to understand it.

Example:
<DIV TYPE="paragraph">This is the song
<EMPH>you</EMPH>
requested for me
<PITCH BASE="high" RANGE="large">
<RATE SPEED="-20%">last year</RATE>
</PITCH>
</DIV>
<AUDIO SRC="Easycome_easygo.wav"/>

Sable is broadly similar to JSML, so we will just pick out a few distinctive tags to explain. <PITCH> sets requirements on the pitch. <RATE> sets requirements on the speaking rate. <AUDIO> inserts a section of pre-recorded speech.

A typical speech synthesis system that supports Sable is Festival, a multilingual speech synthesis system developed by the Centre for Speech Technology Research (CSTR) at the University of Edinburgh, UK. For people engaged in speech research, CSTR is worth studying: besides the mature Festival synthesis system, it also offers an open-source speech recognition platform software package.

Now let's look at the W3C's draft standard for a speech synthesis markup language.

As with the two languages above, the W3C divides the tags and attributes it prescribes into three categories: one represents the structure of the language, one represents speech features, and the third supplements the other two, generally to leave room for expansion.

We will introduce these tags and attributes in tabular form. The following are the top-level tags:

Document structure

speak: the root node
xml:lang: an attribute of the root node that specifies the language
paragraph, sentence: structural nodes representing a paragraph and a sentence

Text processing

sayas: defines the preprocessing format of the text; attribute: type

Pronunciation phonemes

phoneme: defines the pronunciation of the text; attribute: ph

Pronunciation prosody

voice: node that selects among different speaking voices
emphasis: node for stressed pronunciation
break: an empty node representing a pause
prosody: node that controls the prosodic features of the speech

Other

audio: node that inserts a pre-recorded speech segment
mark: node that places a marker for asynchronous synthesis

For some nodes with more complex options, let's take a closer look.

The type attribute of the sayas node can take the following values:

Pronunciation

acronym: an abbreviation of a compound term, read out letter by letter
sub: (an attribute of the sayas node rather than a type value; its value replaces the node content, and it is not declared in the DTD listed later)

Numbers

number: the content is read as a numeric value; optional variants include ordinal, digits, etc.

Time and measurement

date: date format
time: time format
duration: length of a time interval
currency: format for monetary amounts
measure: the text is read as a measurement

Address and identity

name: the name of a person or company
net: a network address; it could be an e-mail or a Web address
address: a postal address
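A few illustrative uses of these type values follow; the value formats shown are assumptions, since how each platform reads them out is implementation-dependent:

<!-- Sketch: sayas with several of the type values listed above. -->
<sentence>
  The meeting is on <sayas type="date">2000-05-04</sayas>,
  it lasts <sayas type="duration">1:30</sayas>,
  the fee is <sayas type="currency">25 yuan</sayas>,
  and questions go to <sayas type="net">info@example.com</sayas>.
</sentence>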

"Voice" Node property value:

Gender

Sex of vocal Speaker

Age

Age range of vocal speakers

Variant

Audible to the different sounds of the speaker. (optional, platform-specific variables)

Name

Audible identification for the platform. (may be regarded as the name of the vocal speaker)

Attribute values of the break node:

size: sets the relative size of the pause
time: sets the exact duration of the pause

"Prosody" Property property value:

Pitch

Base tone setting

Contour

Base Tone Profile Setting

Range

Pitch change range setting

Rate

Speed

Duration

Pronunciation Time length Set

Volume

Pronunciation with the size set
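The following sketch combines the voice, break, and prosody nodes described above; the attribute values are illustrative, and how a given platform interprets relative values such as "-30%" is an assumption:

<!-- Sketch: an adult female voice, a pause, then slower, louder speech. -->
<voice gender="female" age="adult">
  Please listen carefully.
  <break size="medium"/>
  <prosody rate="-30%" volume="loud">This part is read more slowly and loudly.</prosody>
</voice>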

The following is the DTD of the speech synthesis markup language drawn up by the W3C:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Speech Synthesis Markup Language v0.5 20000504 -->
<!ENTITY % allowed-within-sentence "#PCDATA | sayas | phoneme | voice | emphasis | break | prosody | audio | value | mark">

<!ENTITY % structure "paragraph | p | sentence | s">
<!ENTITY % duration "CDATA">
<!ENTITY % integer "CDATA">
<!ENTITY % uri "CDATA">
<!ENTITY % phoneme-string "CDATA">
<!ENTITY % phoneme-alphabet "CDATA">

<!-- Definitions of the structural elements. -->
<!-- Currently, these elements support the xml:lang attribute. -->
<!ELEMENT speak (%allowed-within-sentence; | %structure;)*>
<!ELEMENT paragraph (%allowed-within-sentence; | sentence | s)*>
<!ELEMENT sentence (%allowed-within-sentence;)*>

<!-- The flexible container elements can occur within paragraph -->
<!-- and sentence but may also contain these structural elements. -->
<!ENTITY % voice-name "CDATA">
<!ELEMENT voice (%allowed-within-sentence; | %structure;)*>
<!ATTLIST voice gender (male|female|neutral) #IMPLIED
    age (%integer;|child|teenager|adult|elder) #IMPLIED
    variant (%integer;|different) #IMPLIED
    name (%voice-name;|default) #IMPLIED>

<!ELEMENT prosody (%allowed-within-sentence; | %structure;)*>
<!ATTLIST prosody
    pitch CDATA #IMPLIED
    contour CDATA #IMPLIED
    range CDATA #IMPLIED
    rate CDATA #IMPLIED
    duration CDATA #IMPLIED
    volume CDATA #IMPLIED>

<!ELEMENT audio (%allowed-within-sentence; | %structure;)*>
<!ATTLIST audio
    src %uri; #IMPLIED>

<!-- These basic container elements can contain any of the -->
<!-- within-sentence elements, but neither sentence nor paragraph. -->
<!ELEMENT emphasis (%allowed-within-sentence;)*>
<!ATTLIST emphasis level (strong|moderate|none|reduced) "moderate">

<!-- These basic container elements can contain only data. -->
<!ENTITY % sayas-types "(acronym|number|ordinal|digits|telephone|date|time|duration|currency|measure|name|net|address)">
<!ELEMENT sayas (#PCDATA)>
<!ATTLIST sayas
    type %sayas-types; #REQUIRED>
<!ELEMENT phoneme (#PCDATA)>
<!ATTLIST phoneme ph %phoneme-string; #REQUIRED
    alphabet %phoneme-alphabet; #IMPLIED>

<!-- Definitions of the basic empty elements. -->
<!ELEMENT break EMPTY>
<!ATTLIST break size (large|medium|small|none) "medium"
    time %duration; #IMPLIED>

<!ELEMENT mark EMPTY>
<!ATTLIST mark name CDATA #REQUIRED>
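Putting the pieces together, here is a small document that should be valid against this DTD; the content and the file name beep.wav are illustrative only:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Sketch: a document exercising speak, paragraph, break,
     emphasis, sayas, and audio from the DTD above. -->
<speak>
  <paragraph>
    Hello. <break size="small"/>
    You have <emphasis level="strong">three</emphasis> new messages.
  </paragraph>
  <paragraph>
    The first arrived at <sayas type="time">9:30</sayas>.
    <audio src="beep.wav"/>
  </paragraph>
</speak>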


Voice dialogs

The so-called voice dialog is the dialog control part of the voice browser; it is the staple "food" of the entire voice browser. VoiceXML is a typical dialog control language for voice browsing. Writing VoiceXML is very much like writing a program, rather than a mere stack of information as with HTML.

Because of the voice browser's special browsing mode, designing the markup language for the dialog control part is very difficult. The W3C therefore simply describes its requirements and then endorses ready-made versions submitted by industry. Since we will detail the structure of VoiceXML later, this section only gives a brief introduction to the W3C's requirements statement.

In general, the requirements statement sets out four broad kinds of requirements:

Morphological requirements: rules on the form of the browser's input and output data, as well as of data in intermediate stages.

Functional requirements: primarily a definition of the dialog forms to be supported. For the dialog modes most commonly used in voice browsing, the dialog markup language is required to implement a number of dialog functions.

Format requirements: requirements on how the markup language is written, for example support for reusable, commonly used units.

Other aspects are vaguer, and include event-triggering handles, requirements on the user's system for voice browsing, and voice-based user authentication.
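As a foretaste of the language we will detail later, here is a minimal VoiceXML-style sketch; the prompt text is invented, and the inline grammar syntax shown is platform-dependent, so treat it as an assumption rather than the exact standard form:

<?xml version="1.0"?>
<!-- Sketch: a single form that asks a yes/no question and echoes
     the answer. Prompt text and grammar content are illustrative. -->
<vxml version="1.0">
  <form id="ask">
    <field name="answer">
      <prompt>Do you want to hear a song?</prompt>
      <grammar>yes | no</grammar>
      <filled>
        <prompt>You said <value expr="answer"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>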


