How to use Python to convert a generic dictionary file into XML

Last Update:2017-02-27 Source: Internet

Author: User

Brief introduction

Many sophisticated software projects have used generic plain-text configuration and resource files for years without major problems. As a project expands and its complexity increases, however, so does the need for greater rigor and adaptability. With XML and XML applications that follow specific standards, you gain cross-project and cross-platform compatibility, robustness, and extensibility, including in areas such as Unicode support.

Common abbreviations

HTK: Hidden Markov Model Toolkit

PLS: Pronunciation Lexicon Specification

XML: Extensible Markup Language

You can also improve flexibility and reliability by converting plain text files to a relevant open standard. The dictionaries used in speech recognition work are a good example. Whether or not your open source project moves to XML-formatted resource files, you can use XML standards in your own work without losing functionality.

In this article, we'll learn how to convert easily between plain text and the Pronunciation Lexicon Specification (PLS) format. Several examples show how to store custom dictionaries in PLS format and how to extract the data back into whichever plain-text file you need.

Example: Dictionary

A dictionary is a list of words used by speech recognition tools. It records how each word is printed or displayed (its grapheme) and how it is pronounced, expressed as a sequence of phonemes. Dictionaries for the Hidden Markov Model Toolkit (HTK) are widely used in speech control projects. Listing 1 is an excerpt from a VoxForge HTK dictionary.

Listing 1. Excerpt from a VoxForge HTK dictionary

AGENCY  [AGENCY]        ey jh ih n s iy
AGENDA  [AGENDA]        ax jh eh n d ax
AGENT   [AGENT] ey jh ih n t
AGENTS  [AGENTS]        ey jh ih n t s
AGER    [AGER]  ey g er
AGES    [AGES]  ey jh ih z

The file in Listing 1 contains three tab-delimited fields:

A general label for the word

The word in square brackets, as it should be printed or displayed on the screen (the grapheme)

A series of space-delimited phonemes from the Arpabet set (see Resources) that describe how the word is pronounced
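The three-field layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name parse_htk_line is hypothetical, and it assumes that fields are separated by one or more tab characters and that the display form is always wrapped in square brackets, as in Listing 1.

```python
def parse_htk_line(line):
    """Split one HTK dictionary line into (label, grapheme, phonemes).

    Assumption: fields are separated by one or more tabs, and the
    second field is the display form wrapped in square brackets.
    """
    # Drop empty strings produced by consecutive tabs.
    fields = [f.strip() for f in line.strip().split("\t") if f.strip()]
    if len(fields) != 3:
        raise ValueError("expected 3 tab-delimited fields: %r" % line)
    label, display, phoneme_field = fields
    if not (display.startswith("[") and display.endswith("]")):
        raise ValueError("display form is not in square brackets: %r" % line)
    # Phonemes inside the third field are separated by spaces.
    return label, display[1:-1], phoneme_field.split()


if __name__ == "__main__":
    print(parse_htk_line("AGENCY\t[AGENCY]\tey jh ih n s iy"))
```

Returning the phonemes as a list rather than a single string keeps each phoneme available for later checks or conversion.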

In the example above, the English pronunciations fit almost entirely within the American Standard Code for Information Interchange (ASCII) character set.

The CMU Sphinx project stores its lexicons (or dictionaries) in a similar manner. Listing 2 gives an excerpt.

Listing 2. Excerpt from a CMU Sphinx dictionary

agency  EY JH AH N S IY
agenda  AH JH EH N D AH
agendas AH JH EH N D AH Z
agent   EY JH AH N T
agents  EY JH AH N T S
ager    EY JH ER

In Listing 2, there are only two fields: the word (grapheme) and its phonemes. The two dictionary examples differ in subtle ways:

Words and phonemes use opposite letter case.

The phoneme sets differ slightly.

Punctuation (commas, exclamation marks, and so on) is handled slightly differently.
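To compare entries across the two formats, it can help to read a Sphinx-style line and fold both word and phonemes to one case. The following is a sketch under assumptions, not part of either toolkit: the names parse_sphinx_line and normalize_entry are hypothetical, fields are assumed to be separated by whitespace, and case folding addresses only the first difference listed above (it does not map one phoneme set onto the other).

```python
def parse_sphinx_line(line):
    """Split one CMU Sphinx dictionary line into (word, phonemes).

    Assumption: the first whitespace-delimited token is the word and
    every remaining token is one phoneme.
    """
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("expected a word followed by phonemes: %r" % line)
    return tokens[0], tokens[1:]


def normalize_entry(word, phonemes):
    """Fold case so entries from either format can be compared.

    The choice of lowercase words and uppercase phonemes is arbitrary.
    """
    return word.lower(), [p.upper() for p in phonemes]


if __name__ == "__main__":
    sphinx = parse_sphinx_line("agent   EY JH AH N T")
    htk_style = ("AGENT", ["ey", "jh", "ih", "n", "t"])
    print(normalize_entry(*sphinx))
    print(normalize_entry(*htk_style))
```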

You can see the entire dictionary in the cmu07a.dic file included in the current PocketSphinx download.

Because the dictionary specifies how a particular word is pronounced, you might need to edit the file to suit a particular speaker or dialect. Over time, your custom dictionaries become an accumulated knowledge asset. A text editor makes plain files easy to edit, but it also makes errors easy to introduce: using a delimiter other than the one the format expects, inserting non-ASCII characters, placing fields in the wrong order, sorting entries improperly, omitting required square brackets, and so on.
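Some of these editing mistakes can be caught mechanically before the file reaches a speech tool. The sketch below checks a single HTK-style line for a few of the problems named above. The function name check_htk_line and the wording of the messages are hypothetical, it assumes tab-delimited fields as in Listing 1, and it does not check sort order.

```python
def check_htk_line(line):
    """Return a list of problems found in one HTK-style dictionary line.

    Assumption: a valid line has three tab-delimited fields and the
    second field is wrapped in square brackets. An empty list means
    none of the checked problems was found.
    """
    problems = []
    text = line.rstrip("\n")
    # str.isascii() requires Python 3.7 or later.
    if not text.isascii():
        problems.append("non-ASCII character")
    fields = [f for f in text.split("\t") if f]
    if len(fields) != 3:
        problems.append("expected 3 tab-delimited fields, found %d" % len(fields))
        return problems  # the remaining check needs all three fields
    display = fields[1].strip()
    if not (display.startswith("[") and display.endswith("]")):
        problems.append("second field is not in square brackets")
    return problems


if __name__ == "__main__":
    for sample in ["AGENT\t[AGENT]\tey jh ih n t",
                   "AGENT [AGENT] ey jh ih n t",
                   "AGENT\tAGENT\tey jh ih n t"]:
        print(repr(sample), check_htk_line(sample))
```

Collecting every problem in a list, rather than stopping at the first, lets a reviewer fix a line in one pass.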

Plain files have a further drawback. A custom file that you build is often incompatible with other speech projects. Once two projects both recognize a dictionary in a standard XML format such as PLS, their dictionaries become compatible with each other immediately.
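As a rough illustration of what a PLS version of such entries might look like, the sketch below builds a lexicon with Python's standard xml.etree.ElementTree module. The element names (lexicon, lexeme, grapheme, phoneme) and the namespace come from the W3C PLS 1.0 specification. The function name entries_to_pls, the default language "en-US", and the alphabet label "x-arpabet" are assumptions made for this example, not values taken from the article.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def entries_to_pls(entries, lang="en-US", alphabet="x-arpabet"):
    """Build a PLS document from (grapheme, phonemes) pairs.

    Assumption: "x-arpabet" is a label chosen for this sketch to mark
    the phoneme strings as Arpabet rather than IPA.
    """
    # Serialize the PLS namespace as the default one (no prefix).
    # Note: this changes module-wide ElementTree state.
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element("{%s}lexicon" % PLS_NS, {
        "version": "1.0",
        "alphabet": alphabet,
        "{%s}lang" % XML_NS: lang,  # written out as xml:lang
    })
    for grapheme, phonemes in entries:
        lexeme = ET.SubElement(lexicon, "{%s}lexeme" % PLS_NS)
        ET.SubElement(lexeme, "{%s}grapheme" % PLS_NS).text = grapheme
        ET.SubElement(lexeme, "{%s}phoneme" % PLS_NS).text = " ".join(phonemes)
    return ET.tostring(lexicon, encoding="unicode")


if __name__ == "__main__":
    sample = [("AGENCY", ["ey", "jh", "ih", "n", "s", "iy"]),
              ("AGENT", ["ey", "jh", "ih", "n", "t"])]
    print(entries_to_pls(sample))
```

Because the output is ordinary XML, any XML parser can read the entries back, which is the compatibility benefit described above.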
