On the application of XML in speech synthesis


The Internet and everything associated with it now seem to be everywhere. You may have received an evening call from a telemarketer, or an automated prescription notice from your local pharmacy. Now a new technology combines speech synthesis with XML to deliver voice information.


Delivering information by voice is nothing new; it is the communication method we have used for thousands of years. Receiving a phone call from a computer is not a new invention either. Many voice technologies are already in use, from fax machines and autodialers to interactive voice response (IVR) systems. The telephone is, of course, their most common channel.

Traditional voice systems use prerecorded samples, dictionaries, and tones to create the sound we hear. This prerecorded approach has many problems, and one of the most common is a lack of coherence and variation. If there is only one recording of each word or sound, it is difficult for the computer to produce a question, whose intonation differs from that of an ordinary statement. It is also difficult for the computer to know when to change its tone of voice or which intonation to use.

To help address these speech synthesis issues, the World Wide Web Consortium (W3C) has published a new working draft of the Speech Synthesis Markup Language (SSML). This new XML vocabulary lets voice browser developers control how a speech synthesizer renders speech. For example, a developer can include a volume instruction that the synthesizer applies when it generates the speech.

The SSML specification is based on earlier work by Sun Microsystems called the JSpeech Markup Language (JSML), which is in turn derived from the Java Speech API Markup Language. SSML is now being developed by the consortium's Voice Browser Working Group.

SSML is aimed primarily at text-to-speech (TTS) processors. A TTS engine takes a body of text and converts it to speech. Several TTS applications are already available, such as telephone voice response systems and more advanced systems designed for blind users. The inherent ambiguity in how a given piece of text should be pronounced is one of the major challenges facing existing TTS systems. Other common problems involve abbreviations and acronyms (such as HTML) and words whose pronunciation differs from their spelling (such as subpoena).

The basic elements of SSML describe the structure of the text. Like HTML, SSML provides a paragraph element, and it goes further by also providing a sentence element. By marking where each paragraph and sentence begins and ends, the author lets the TTS engine generate speech more accurately.
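As a rough illustration of this paragraph and sentence markup, the sketch below parses a small SSML document with Python's standard library and lists the sentences in each paragraph. The element names `p` and `s`, the `speak` root, and the namespace URI follow the later SSML 1.0 Recommendation and are assumptions relative to the draft discussed here; the sample text and the helper name `sentences_by_paragraph` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the SSML 1.0 Recommendation (assumed here).
NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}

# Illustrative document: two paragraphs, each split into sentences.
STRUCTURED = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <s>Your prescription is ready.</s>
    <s>Would you like it delivered?</s>
  </p>
  <p>
    <s>Thank you for calling.</s>
  </p>
</speak>
"""


def sentences_by_paragraph(ssml_text):
    """Return one list of sentence strings per <p> element."""
    root = ET.fromstring(ssml_text)
    return [
        [s.text for s in p.findall("ssml:s", NS)]
        for p in root.findall("ssml:p", NS)
    ]


for number, sentences in enumerate(sentences_by_paragraph(STRUCTURED), start=1):
    print(number, sentences)
```

Because the sentence boundaries are explicit, an engine does not have to guess whether a period ends a sentence or belongs to an abbreviation.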

In addition to this basic structure, SSML lets us specify how a particular word or group of words should be spoken. This feature is implemented by the "say-as" element, one of the most useful components of SSML. It lets us specify a template that describes how to pronounce a word or a group of words. With "say-as" we can specify how to pronounce an abbreviation, or a word whose pronunciation differs from its spelling. We can also distinguish between numbers and dates. The "say-as" element includes support for email addresses, currency, phone numbers, and so on.
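A minimal sketch of "say-as" markup follows, assuming the attribute names `interpret-as` and `format` from the SSML 1.0 Recommendation (the draft described here may name them differently). The values `date`, `mdy`, `characters`, and `telephone` are illustrative only; which values an engine accepts depends on the engine. The helper `say_as` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def say_as(text, interpret_as, fmt=None):
    """Wrap text in a say-as element and return it as an XML string."""
    attrs = {"interpret-as": interpret_as}
    if fmt is not None:
        attrs["format"] = fmt  # optional hint, e.g. the field order of a date
    element = ET.Element("say-as", attrs)
    element.text = text
    return ET.tostring(element, encoding="unicode")


# Read the digits as a date rather than as a fraction or a sum.
print(say_as("10/12/2001", "date", "mdy"))
# Spell the acronym letter by letter.
print(say_as("HTML", "characters"))
# Read the digits as a phone number rather than as one large integer.
print(say_as("555-0100", "telephone"))
```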

We can also provide a phonetic representation of the text. For example, we can use this to indicate the difference between the American English and British English pronunciations of the word "potato".
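The phonetic markup might look like the sketch below, which assumes the `phoneme` element with its `alphabet` and `ph` attributes as defined in the SSML 1.0 Recommendation. The two IPA transcriptions are approximate dictionary-style renderings chosen for illustration, and the helper name `phoneme` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Approximate IPA transcriptions, for illustration only.
POTATO_US = "pəˈteɪtoʊ"
POTATO_GB = "pəˈteɪtəʊ"


def phoneme(text, ipa):
    """Attach an IPA pronunciation to a word and return the XML string."""
    element = ET.Element("phoneme", {"alphabet": "ipa", "ph": ipa})
    element.text = text  # fallback text for engines that ignore the ph value
    return ET.tostring(element, encoding="unicode")


print(phoneme("potato", POTATO_US))
print(phoneme("potato", POTATO_GB))
```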

Several advanced features of SSML help make a TTS system sound more natural. We can use the "voice" element to specify a male, female, or neutral voice, and also the age of the speaker. With this element we can request anything from the voice of a 4-year-old boy to that of a 75-year-old woman.
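As a sketch of the voice selection described above, the helper below builds a "voice" element with `gender` and `age` attributes. Those attribute names and the three gender values follow the SSML 1.0 Recommendation and are assumptions relative to the draft; whether an engine actually has a matching voice is up to the engine. The function name `voice` and the sample sentences are made up.

```python
import xml.etree.ElementTree as ET

GENDERS = ("male", "female", "neutral")


def voice(text, gender, age):
    """Request a voice by gender and approximate speaker age in years."""
    if gender not in GENDERS:
        raise ValueError("gender must be one of " + ", ".join(GENDERS))
    if age < 0:
        raise ValueError("age must be a non-negative integer")
    element = ET.Element("voice", {"gender": gender, "age": str(age)})
    element.text = text
    return ET.tostring(element, encoding="unicode")


print(voice("Can I have a cookie?", "male", 4))
print(voice("Dinner is ready.", "female", 75))
```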

We can also wrap text in the "emphasis" element to mark it as emphasized or, conversely, as relatively unimportant. The "break" element tells the system where the voice should pause.
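The following sketch shows how emphasis and a pause could appear inside a sentence, and how a program might list them. The `level` attribute on "emphasis", the `time` attribute on "break", and their values are taken from the SSML 1.0 Recommendation and are assumptions relative to the draft; the sample sentence and the helper `describe` are hypothetical.

```python
import xml.etree.ElementTree as ET

SNIPPET = """\
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Please take <emphasis level="strong">two</emphasis> tablets
  <break time="500ms"/>
  with food.
</speak>
"""


def describe(ssml_text):
    """List (local tag name, attributes) for each child of the root."""
    root = ET.fromstring(ssml_text)
    # Tags parse as "{namespace}name"; keep only the local name.
    return [(child.tag.split("}")[-1], dict(child.attrib)) for child in root]


for name, attributes in describe(SNIPPET):
    print(name, attributes)
```

A `level` value such as `reduced` would mark the wrapped text as less important instead of more.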

One of the most powerful features of SSML is its "prosody" element. It lets us control how the speech for a given span of text is generated. We can specify the pitch, pitch range, and speaking rate (in words per minute). We can specify even finer detail with the "contour" attribute of the "prosody" element, which combines pitch targets with time positions within the text. By setting a contour for a span of text, we can define more precisely how the speech is generated.
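A minimal sketch of prosody markup follows. It assumes the attribute names `pitch`, `rate`, `volume`, and `contour` and the label values `high`, `slow`, and `loud` from the SSML 1.0 Recommendation, where a contour is written as whitespace-separated `(position%,pitch)` pairs; the draft described here may differ in detail. The helper names `contour` and `prosody` and the numbers are made up for illustration.

```python
import xml.etree.ElementTree as ET


def contour(points):
    """Format (percent position, pitch change in Hz) pairs as a contour value."""
    return " ".join(f"({position}%,{delta:+d}Hz)" for position, delta in points)


def prosody(text, **attributes):
    """Wrap text in a prosody element carrying the given attributes."""
    element = ET.Element("prosody", attributes)
    element.text = text
    return ET.tostring(element, encoding="unicode")


# Coarse control: named levels for pitch, speaking rate, and volume.
print(prosody("Please hold.", pitch="high", rate="slow", volume="loud"))

# Fine control: raise the pitch toward the middle of the text, then relax it.
rising = contour([(0, 20), (50, 40), (100, 10)])
print(rising)  # (0%,+20Hz) (50%,+40Hz) (100%,+10Hz)
print(prosody("Are you still there?", contour=rising))
```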


