XML is widely used across many fields in many industries, each of which derives an XML subset appropriate to its own domain. VoiceXML, the main subject of this series of articles, is one such subset. Its status is very similar to that of WML: WML is a specialized language for the wireless Internet, while VoiceXML is a specialized language for voice browsing. The wireless Internet (WAP) and voice browsing are two of the current hotspots of network development, and XML has found a use in both.
This series compares and introduces the important links in the application of XML to speech; besides VoiceXML, XML is used in other aspects of speech technology as well. This installment is mainly an introduction to the background knowledge.
The voice browser: the cradle of VoiceXML
The telephone was invented to ease communication; the computer, to compute. In the post-PC era, the words "computer" and "network" keep moving closer together, and many people now buy a computer precisely to get onto the Internet. The network has become the shortcut for exchanging and sharing information, and people have been well and truly caught in the "net", so putting the telephone onto the Internet is on the agenda too. The simpler a device is, the more it is favored by the broadest range of ordinary consumers; and the more widely consumers favor it, the more highly the industry praises it.
There is no doubt that there are more telephones than computers. Walk around the streets of any city and you will easily find a pay phone, but a ubiquitous "public computer kiosk" on the street corner, selling three minutes of Internet access for pocket change, is still hard to imagine. Outside the cities, too, the telephone is no rarity, whereas many people still have to grit their teeth to buy a computer. Moreover, the telephone's advantage over the computer keeps growing: personal mobile phones are in an unprecedented boom, and a phone in every hand is no longer a myth.
The telephone is also far more approachable than the computer. Many people operate a computer with a certain apprehension: they are less afraid that a changed interface cannot be restored than afraid of not knowing where to change the setting they want. And that is with the comparatively friendly windowed interfaces, to say nothing of a full-screen black Unix terminal. The telephone is different: it has only a handful of keys, and when you run into trouble you simply hang up the handset and pick it up again. Most important of all, the handset "speaks".
So we say that access to the Internet through the telephone is very promising. Add the current trend of bringing the various interface forms, such as voice, image, and touch, together on the phone to make it a veritable PDA (personal digital assistant), and we have reason to boast once more about the telephone's future: people will be able to connect to the network anytime, anywhere, in the most straightforward way, through a "phone" that is always at hand.
Voice browsing: a new way of browsing the network
Owning a telephone switchboard is no novelty even for a modestly sized company: callers reach different employees through a few keystrokes. Another popular switchboard feature lets callers navigate a layered directory of services freely through the numeric keypad. The voice browsing we discuss here is similar in form, except that users convey their requests mainly by voice, the service can be extended to any Web site that provides content, and the connecting network need not be a dedicated telephone network; it can also be the Internet.
What is a voice browser
A voice browser is a device that takes voice as its primary input channel (with other forms as auxiliary input channels), interprets and executes a voice markup language, and can produce speech output.
The above definition, dry as it tastes, originates from the W3C. It is like describing a person as "a mammal that walks upright on two legs and has upper limbs": sketching vague generalities is a lot of trouble. What the voice browser really wants to achieve is to deliver the information of the resource-rich network to the user by voice, through the most approachable medium of communication.
A discerning eye will see that designing a complete voice browser runs into the technical barriers of speech recognition, speech understanding, and speech synthesis. When a user issues a command by voice, the voice browser needs to know what the user said. The simpler case is command-style speech recognition: the system requires you to say either "mosquito" or "fly", and anything else is treated as illegal input. The complex case involves understanding a whole utterance: when the user asks "What day is today?" or "Which day of the week is it?", the system should know they mean the same thing. That is far more convenient for users, and it takes speech understanding technology to handle. Speech synthesis is equally essential, since the browsing results must constantly be sent to the user as voice; the voice can be pre-recorded speech of a natural person, or text converted to speech by a TTS (text-to-speech) system. Clearly, the technologies above require a great deal of accumulation and cannot be built in a day, so the vast majority of voice browser developers will quite naturally adopt technical product support from companies with deep expertise in speech, and concentrate mainly on integration.
The main features of the voice browser are very distinctive. In many ordinary situations, issuing commands through near-natural dialogue and collecting information linearly, instead of through HTML's two-dimensional browsing mode, is simply very tempting. In many cases the eyes cannot supervise a screen at all, as when driving a car, and the user can interact only in a non-visual way.
A Web browser provides jumps between different pages; a voice browser can jump from one dialog to another. A Web browser provides forms to fill in and submit; a page in a voice browser can likewise accept a user's purposeful voice input. In fact, voice browsers and HTML page browsers have a great deal in common. In the style of a GRE analogy: voice browser : VoiceXML :: HTML browser : HTML. In this section, let's take a look at the existing XML standardization work in speech.
The W3C's work on voice browsing
As an authoritative standards body, the W3C has long had its eye on the voice browser (Voice Browser). The W3C set up a working group called "Voice Browser" in October 1998; for perspective, recall that XML itself was first proposed only at the end of 1996. The working group's goal is very clear: to set standards for voice browsing in time for the coming voice browsing spree. Its work falls mainly under the following aspects, some of which we will examine more closely.
Requirements for representing dialogs in the markup language
To put it simply, for a voice browser to combine different input and output forms easily into a harmonious, interactive "dialogue" with the user, the markup language used to represent the dialog data must be constrained. As you can imagine, the HTML used to represent flat Web pages is not competent for voice browsers: not only is it messy, but its two-dimensional nature also makes it unable to represent an interactive conversation.
Requirements for reusable dialogs in the markup language
As in high-level programming languages, there are small modules that are reused often; they are designed as functions or procedures and called each time they are used. The markup language used by voice browsers needs a similar mechanism to represent frequently reused parts, as we will experience concretely in later articles. Frequently reused parts include simple confirmations and the collection of phone numbers, dates, amounts, credit card information, and addresses; the sketch below gives a taste.
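As a minimal sketch of such reuse, VoiceXML (introduced in detail later in this series) lets one dialog call another through the subdialog element. The document name confirm.vxml, the parameter message, and the returned field answer here are all invented for illustration:
<?xml version="1.0"?>
<vxml version="1.0">
<form id="order_tickets">
<!-- Call a reusable confirmation dialog kept in its own document. -->
<subdialog name="confirm_result" src="confirm.vxml">
<!-- Hand the question to the reusable dialog; "message" is a made-up parameter. -->
<param name="message" expr="'Do you really want three tickets?'"/>
<filled>
<if cond="confirm_result.answer == 'yes'">
<prompt> Your order is confirmed. </prompt>
</if>
</filled>
</subdialog>
</form>
</vxml>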
Requirements for describing speech recognition grammars in the markup language
We mentioned earlier that implementing a voice browser is inseparable from applying speech technology. To recognize spoken input, the voice browser will use a ready-made speech recognition platform, and different recognition tasks place different demands on that platform. If it is only speaker-independent, small-vocabulary recognition, we may not need to spend much effort on the recognition itself; but once the constraints are relaxed even a little, speech recognition suddenly becomes very difficult. For speaker-independent, small-vocabulary recognition with a grammatical structure, the grammar of the expected speech input must be described at the interface to the speech recognition platform. This is currently the most common case for voice browsers, and the W3C defines a markup representation for such grammars here.
Requirements for describing natural language processing in the markup language
This is in fact the speech understanding problem we mentioned earlier. Like the previous point, it is closely tied to speech technology; here the W3C defines the requirements for marking up natural language understanding in a voice browser implementation.
Requirements for describing speech synthesis in the markup language
To produce speech output, one has to use speech synthesis. A simple piece of pre-recorded sound can also be viewed as a form of speech synthesis, but what is actually used more often is TTS (text-to-speech). How should the sentences to be synthesized be represented? Different voice platforms have different methods, and the W3C abstracts the common denominator of their features: for example, that one word in a sentence should be stressed, or that a sentence should be spoken in a male voice. Through a markup language, we can describe a piece of text to be synthesized in a uniform way.
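Using the element names from the W3C draft introduced below, those two examples, a stressed word and a male voice, might be marked up like this sketch (the sentence itself is invented):
<speak>
<!-- The whole sentence is spoken in a male voice... -->
<voice gender="male">
The meeting starts at
<!-- ...and the time is stressed. -->
<emphasis level="strong"> nine o'clock </emphasis>.
</voice>
</speak>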
Below we will zoom in on some of this work by the W3C.
Speech Synthesis
We have talked about some of the problems of speech synthesis earlier. When a voice browser needs to turn character output into natural speech, we must mark up in advance the linguistic features the speech reader needs and provide them to the speech synthesizer. In the voice browser this is implemented with a markup language based on XML.
When text data arrives from the front end of the voice browser, a conversion mechanism turns it into the markup language; the conversion here is, in fact, the job of code. The marked-up text is then fed into a specific speech synthesizer, and the end user hears natural speech.
Note that this is not the only way to synthesize speech. There is also a simple approach: break the text data down into frequently reused phrases, fetch pre-recorded speech segments that match, and splice them into a natural-sounding utterance. But the drawbacks of this approach are very obvious. First, the text data is limited to a fixed vocabulary and fixed sentence structures. Second, the overhead of manual recording is large, and it multiplies as the range of the text data expands. Finally, the "natural speech" synthesized this way is not natural.
Since different companies have different voice application platforms, the markup languages previously used for speech synthesis have not been uniform. Fortunately, because speech synthesis technology itself is consistent, these different markup languages mark similar phonetic features; they differ only in how some of the tags are written.
Let's take a look at some of the major speech synthesis markup languages.
1 JSML (Java Speech Markup Language)
JSML, as the name implies, is the markup language for text fed into a Java Speech API synthesizer. JSML expresses the natural-speech attributes of the synthesized text, and because it uses the Unicode character set, it works for almost every language.
Example:
<?xml version="1.0" encoding="gb2312"?>
<JSML>
<PARA> You owe me
<EMP>
<SAYAS class="number"> 10000 yuan </SAYAS>
</EMP>
</PARA>
<PARA>
<EMP>
<SAYAS class="literal"> Too exaggerated </SAYAS>
</EMP>
</PARA>
</JSML>
We will not elaborate on the meaning of every JSML tag here, since we will be introducing the W3C standard in more detail. From the example we can see that this is a dialogue. <PARA> says this is a paragraph. <EMP> marks emphasis; its level attribute can also set the weight of the stress, which the example above does not show. <SAYAS> marks the way the text is to be read out, with the specific manner set by its attributes: class="number" above means the figure is read as a number, and class="literal" means reading character by character, here "Too exaggerated".
2 Sable
As with JSML, we will understand it through an example.
Example:
<DIV type="paragraph"> This is the song that, last year,
<EMPH> you </EMPH>
<PITCH base="high" range="large">
<RATE speed="-20%"> requested for me </RATE>
</PITCH>
</DIV>
<AUDIO src="easycome_easygo.wav"/>
Generally speaking, Sable is very similar to JSML, so I will just pick out a few of the more distinctive tags to explain. <PITCH> specifies the pitch, <RATE> specifies the speaking rate, and <AUDIO> inserts a section of recorded speech.
A typical speech synthesis system that supports Sable is Festival. Festival is a multilingual speech synthesis system developed by the Centre for Speech Technology Research (CSTR) at the University of Edinburgh, UK. For people engaged in speech research, CSTR is worth studying, because it offers not only a mature speech synthesis system like Festival but also an open-source speech recognition platform package.
Now let's take a look at the standard the W3C is drafting for a speech synthesis markup language.
As with the two languages above, the W3C divides the elements and attributes it prescribes into three categories: one represents the structure of the language, one represents phonetic features, and the third supplements the other two, generally leaving room for possible expansion.
We will introduce these elements and attributes in tabular form. The first table covers the top-level elements:
Document structure, text processing, pronunciation:
speak | The root node
xml:lang | An attribute on the root node that specifies the language
paragraph, sentence | Structural nodes that represent a paragraph and a sentence
sayas | Defines the preprocessing format of the text; attribute: type
phoneme | Defines the pronunciation of the text; attribute: ph

Prosody:
voice | Node that represents different speaking voices
emphasis | Node for stressed pronunciation
break | Empty node that represents a pause
prosody | Node that controls the intonation of the speech

Other:
audio | Node that inserts a pre-recorded speech segment
mark | Marker used in asynchronous synthesis
For the more complex of these nodes, let's look at the attribute options in closer detail. The type attribute of the sayas node can take the following values:
Pronunciation:
acronym | An acronym or compound abbreviation, read out letter by letter
sub (an attribute of the sayas node) | Its value is the replacement text for the node's content; it has no declaration in the DTD listed later

Numbers:
number | The content is read as a number; related type values include ordinal, digits, etc.

Time and measurement:
date | Date format
time | Time-of-day format
duration | The length of a time interval
currency | Monetary amount format
measure | Reads the text as a measurement

Address and identity:
name | The name of a person or company
net | A network address, either an e-mail or a Web address
address | A postal address

Attribute values of the voice node:
gender | The sex of the speaking voice
age | The age range of the speaking voice
variant | Selects among the platform's different voices (optional, platform-specific)
name | A platform-specific voice identifier (it may be thought of as the speaker's name)
Attribute values of the break node:
size | Sets the size of the pause
time | Sets the pause length quantitatively

Attribute values of the prosody node:
pitch | Baseline pitch setting
contour | Pitch contour setting
range | Pitch range setting
rate | Speaking rate setting
duration | Sets the length of time the speech should take
volume | Volume setting
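Putting several of these elements together, a document in this draft language might look like the following sketch (the sentence content is invented for illustration):
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak xml:lang="en-US">
<paragraph>
<voice gender="female">
Your flight leaves on
<sayas type="date"> 2000-05-04 </sayas>.
<break size="medium"/>
<!-- Slow down slightly for the important part. -->
<prosody rate="-20%"> Please arrive two hours early. </prosody>
</voice>
</paragraph>
</speak>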
The following is the DTD the W3C has drawn up for the speech synthesis markup language:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Speech Synthesis Markup Language v0.5 20000504 -->
<!ENTITY % allowed-within-sentence "#PCDATA | sayas | phoneme | voice | emphasis | break | prosody | audio | value | mark">
<!ENTITY % structure "paragraph | p | sentence | s">
<!ENTITY % duration "CDATA">
<!ENTITY % integer "CDATA">
<!ENTITY % uri "CDATA">
<!ENTITY % phoneme-string "CDATA">
<!ENTITY % phoneme-alphabet "CDATA">
<!-- Definitions of the structural elements. -->
<!-- Currently, these elements support the xml:lang attribute. -->
<!ELEMENT speak (%allowed-within-sentence; | %structure;)*>
<!ELEMENT paragraph (%allowed-within-sentence; | sentence | s)*>
<!ELEMENT sentence (%allowed-within-sentence;)*>
<!-- The flexible container elements can occur within paragraph -->
<!-- and sentence but may also contain these structural elements. -->
<!ENTITY % voice-name "CDATA">
<!ELEMENT voice (%allowed-within-sentence; | %structure;)*>
<!ATTLIST voice gender (male|female|neutral) #IMPLIED
                age (%integer;|child|teenager|adult|elder) #IMPLIED
                variant (%integer;|different) #IMPLIED
                name (%voice-name;|default) #IMPLIED>
<!ELEMENT prosody (%allowed-within-sentence; | %structure;)*>
<!ATTLIST prosody pitch CDATA #IMPLIED
                  contour CDATA #IMPLIED
                  range CDATA #IMPLIED
                  rate CDATA #IMPLIED
                  duration CDATA #IMPLIED
                  volume CDATA #IMPLIED>
<!ELEMENT audio (%allowed-within-sentence; | %structure;)*>
<!ATTLIST audio src %uri; #IMPLIED>
<!-- These basic container elements can contain any of the -->
<!-- within-sentence elements, but neither sentence nor paragraph. -->
<!ELEMENT emphasis (%allowed-within-sentence;)*>
<!ATTLIST emphasis level (strong|moderate|none|reduced) 'moderate'>
<!-- These basic container elements can contain only data. -->
<!ENTITY % sayas-types "(acronym|number|ordinal|digits|telephone|date|time|duration|currency|measure|name|net|address)">
<!ELEMENT sayas (#PCDATA)>
<!ATTLIST sayas type %sayas-types; #REQUIRED>
<!ELEMENT phoneme (#PCDATA)>
<!ATTLIST phoneme ph %phoneme-string; #REQUIRED
                  alphabet %phoneme-alphabet; #IMPLIED>
<!-- Definitions of the basic empty elements. -->
<!ELEMENT break EMPTY>
<!ATTLIST break size (large|medium|small|none) 'medium'
                time %duration; #IMPLIED>
<!ELEMENT mark EMPTY>
<!ATTLIST mark name CDATA #REQUIRED>
Voice dialogs
The so-called voice dialog is the dialog control part of the voice browser; it is the staple "food" of the entire voice browser. VoiceXML is in fact a typical dialog control language for voice browsing. Writing VoiceXML is very much like writing a program, rather than a mere stacking-up of information as in HTML.
Because of the voice browser's special browsing mode, designing the markup language for the dialog control part is very difficult. The W3C therefore simply describes its requirements and then certifies the ready-made versions submitted by industry. Since we will detail the structure of VoiceXML later, in this section we only briefly introduce the W3C's requirements statement.
In general, the requirements statement sets out four broad kinds of requirements:
Morphological requirements: these mainly govern the data forms of the browser's input and output and of the intermediate processing.
Functional requirements: these mainly define the supported dialog styles; for the dialog modes voice browsing will use most often, the voice dialog markup language is required to implement the corresponding dialog functions.
Format requirements: there are some requirements on how the markup language is written, for example support for reusable, commonly used units.
Other aspects are rather vague, including the handling of triggered events, the user's voice browsing system, and voice-based user authentication.
This section is mainly an introduction to VoiceXML; the full VoiceXML syntax is not listed here. For a detailed description, please refer to the W3C materials. Here we cover only the key links for your understanding; after all, I believe not every Web page author has studied the syntax of HTML in detail, and most build on existing pages.
VoiceXML basics
VoiceXML is a programming language, but not a general-purpose one: it must take its special voice application environment into account. The following are some of the features its implementations cover:
Recognizing spoken input
Recognizing telephone key input
Playing audio output
Synthesizing speech
Basic telephone connection features
The functions listed above are not performed by VoiceXML alone. What VoiceXML does is provide a mechanism for invoking the other components; the concrete operations are carried out by the voice browser. It is worth mentioning that VoiceXML implements all the basic features of a programming language, such as flow control and code reuse.
In addition, VoiceXML and HTML have a lot in common. In HTML a file is made up of tags, and a VoiceXML file is likewise made up of tags. HTML code is divided into "pages", while VoiceXML is divided into "documents". HTML has link tags for jumping, and in VoiceXML you can likewise jump to other documents.
Every VoiceXML document is a complete XML document. Therefore, the markup that frames a VoiceXML document, shown below, is essential:
<?xml version="1.0"?>
<vxml version="1.0">
............
</vxml>
As mentioned earlier, VoiceXML is divided into documents. A VoiceXML application is a collection of VoiceXML documents, and each application contains a "root document". This is a bit like the default.asp or index.asp of a dynamic Web site: the root document is always loaded when the VoiceXML application is called.
Every document contains a number of dialogs (dialog). At each particular moment, the human-computer interaction is in one particular dialog state, and each dialog can specify clearly which dialog comes next; the entire document is thus a "finite state machine" made up of these dialog states. Jumps between dialogs are specified by URI, which can name the document and the dialog exactly. If no document is specified, the current document is the default; if no dialog is specified, the first dialog of the target document is the default.
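Such jumps are written with the goto element; a minimal sketch (the document and dialog names are invented for illustration):
<!-- Jump to the dialog named "payment" in another document. -->
<goto next="http://example.com/billing.vxml#payment"/>
<!-- Jump to a dialog within the current (default) document. -->
<goto next="#payment"/>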
There are two kinds of dialogs: the form and the menu. A form is designed to collect information from the caller and assign it to specific variables. A form contains a number of fields (field); a field is the counterpart of our ordinary variables, and the form assigns the recognized value of the user's input (voice or keypress) to the field. A menu offers the user several options, and depending on the choice, the application flow jumps to a different dialog.
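As a foretaste, a minimal menu might look like the following sketch (the choices and target documents are invented):
<menu>
<prompt> Say news, weather, or sports. </prompt>
<choice next="news.vxml"> news </choice>
<choice next="weather.vxml"> weather </choice>
<choice next="sports.vxml"> sports </choice>
</menu>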
The details of form and menu are described in detail in the following sections.
form in VoiceXML
As the word form suggests, it is a table to be filled in. Just as you must fill in your details when applying for a loan, a voice application needs to interact with the user: the user fills in each item of the table over the telephone.
Let's look at a concrete example:
<?xml version="1.0" encoding="gb2312"?>
<vxml version="1.0">
<form id="select_movie">
<field name="movie">
<prompt> Which movie would you like to see? </prompt>
<grammar> Dahua West Tour | Domestic 007 | Deer Ding Kee </grammar>
<filled>
<if cond="movie == 'Domestic 007'">
<prompt> Domestic 007 only has the 10.1 evening show today. </prompt>
</if>
</filled>
</field>
</form>
</vxml>
In the example above there is one form, in which only one variable (field) needs to be filled. First the system asks the user, "Which movie would you like to see?"; the user may answer within the range limited by the grammar tag: "Dahua West Tour", "Domestic 007", or "Deer Ding Kee". The input is assigned to the field named movie. The filled tag holds the operations to perform once the field has been filled; here, the operation is to read out the sentence "Domestic 007 only has the 10.1 evening show today."
prompt in VoiceXML
The prompt tag is simpler: it indicates that the system should read out the sentence it contains for the user. We have in fact already seen it in the previous example. A more complete browser will usually use speech synthesis to produce the pronunciation.
grammar in VoiceXML
The grammar tag simply gives the user a limited range of input: the user may only say things within the range grammar lists. When the user's input is recognized as falling outside the grammar's range, the voice browser generates an error, and you can set up error handling in the VoiceXML document.
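A sketch of such error handling, using the nomatch and noinput event handlers from VoiceXML 1.0 (the prompt wording is invented):
<field name="movie">
<prompt> Which movie would you like to see? </prompt>
<grammar> Dahua West Tour | Domestic 007 | Deer Ding Kee </grammar>
<!-- The caller said something the grammar does not cover. -->
<nomatch>
<prompt> Sorry, that movie is not showing. Please choose again. </prompt>
</nomatch>
<!-- The caller said nothing within the timeout. -->
<noinput>
<prompt> I did not hear you. Which movie would you like to see? </prompt>
</noinput>
</field>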
Grammar can be divided into three categories: inline, external and built-in.
An inline grammar specifies the grammar explicitly inside the given field, as in the previous example. An external grammar specifies the grammar by reference to something outside the document; in that case, the grammar tag is used like this:
<grammar src="URI" type="mime-type"/>
The mime-type here may be left to the default, because it can be obtained from the protocol in the URI or from the suffix of the external grammar file; even if none of those methods works, it can be determined by the platform the system runs on. Some people will ask: what do you write an external grammar in? The most common choice is the Java Speech API Grammar Format (JSGF).
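As a hedged sketch, an external grammar file in JSGF for the movie example might look roughly like this (the file and grammar names are invented):
#JSGF V1.0;
grammar movies;
// The public rule that the voice browser matches caller input against.
public <movie> = Dahua West Tour | Domestic 007 | Deer Ding Kee;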
A built-in grammar is a grammar the browser platform provides itself. It is a personalized matter: different manufacturers may have different built-in grammars, supplied to make application development on their browsers easier.