How to use Python to convert a generic dictionary file into XML

Source: Internet
Author: User

Brief introduction

Many sophisticated software projects have used plain-text configuration and resource files for years without major problems. As a project grows and its complexity increases, so does the need for greater rigor and adaptability. With XML, and with XML applications that follow specific standards, you gain cross-project and cross-platform compatibility, robustness, and extensibility in areas such as Unicode support.

Common abbreviations

HTK: Hidden Markov Model Toolkit

PLS: Pronunciation Lexicon Specification

XML: Extensible Markup Language

You can also improve flexibility and reliability by converting plain-text files to the relevant open standards. The dictionaries used in speech recognition work are a good example. Even if your open source project has not moved to XML-formatted resource files, you can use XML standards in your own work without losing functionality.

In this article, you'll learn how to convert between plain text and the Pronunciation Lexicon Specification (PLS) format. Several examples show how to store custom dictionaries in PLS format and how to extract the data back into the plain-file format you need.

Example: Dictionary

A dictionary (or lexicon) is a list of words used by speech recognition tools. It records how each word is printed or displayed (its grapheme) and how it is pronounced, expressed as a sequence of phonemes. Dictionaries in the Hidden Markov Model Toolkit (HTK) format are widely used in voice-control projects. Listing 1 is an excerpt from a VoxForge HTK dictionary.

Listing 1. Excerpt from a VoxForge HTK dictionary

AGENCY  [AGENCY]        ey jh ih n s iy
AGENDA  [AGENDA]        ax jh eh n d ax
AGENT   [AGENT]         ey jh ih n t
AGENTS  [AGENTS]        ey jh ih n t s
AGER    [AGER]          ey g er
AGES    [AGES]          ey jh ih z

The file in Listing 1 contains three tab-delimited fields:

A general label for the word

The word in square brackets, as it should be printed or displayed on screen (the grapheme)

A series of individual, space-delimited phonemes from the Arpabet set (see Resources) that describe the pronunciation of the word
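The three fields described above can be pulled apart with a few lines of Python. This is a minimal sketch, not code from any particular toolkit: the function name parse_htk_line and the pattern HTK_LINE are hypothetical, and the pattern accepts any run of whitespace between fields rather than insisting on a single tab.

```python
import re

# Hypothetical pattern: label, [grapheme], then the phonemes.
# Any run of whitespace (tabs or spaces) is accepted between fields.
HTK_LINE = re.compile(r"^(\S+)\s+\[([^\]]+)\]\s+(.+)$")


def parse_htk_line(line):
    """Split one HTK dictionary line into (label, grapheme, phonemes)."""
    match = HTK_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an HTK dictionary line: {line!r}")
    label, grapheme, phonemes = match.groups()
    # The phonemes are space-delimited, so split them into a list.
    return label, grapheme, phonemes.split()


print(parse_htk_line("AGENCY\t[AGENCY]\tey jh ih n s iy"))
# → ('AGENCY', 'AGENCY', ['ey', 'jh', 'ih', 'n', 's', 'iy'])
```

A line without the bracketed second field raises ValueError, which makes malformed entries visible early.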

In the example above, the English pronunciations are written almost entirely in American Standard Code for Information Interchange (ASCII) characters.

The CMU Sphinx project stores its lexicons (or dictionaries) in a similar manner. Listing 2 gives an excerpt.

Listing 2. Excerpt from a CMU Sphinx dictionary

agency  EY JH AH N S IY
agenda  AH JH EH N D AH
agendas AH JH EH N D AH Z
agent   EY JH AH N T
agents  EY JH AH N T S
ager    EY JH ER

In Listing 2, there are only two fields: the word (grapheme) and its phonemes. The two dictionary examples differ in subtle ways:

The words and phonemes are in different letter case.

The phoneme sets differ slightly.

Punctuation (commas, exclamation marks, and so on) is treated slightly differently.
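Because the two formats differ in letter case, entries have to be normalized before they can be compared. The following is a minimal sketch under stated assumptions: parse_sphinx_line and fold_case are hypothetical helper names, and the sketch assumes that case carries no meaning in either dictionary. It does not attempt to reconcile the differences between the two phoneme sets.

```python
def parse_sphinx_line(line):
    """Split one CMU Sphinx dictionary line into (word, phonemes)."""
    parts = line.split()  # any run of whitespace separates tokens
    if len(parts) < 2:
        raise ValueError(f"expected a word and at least one phoneme: {line!r}")
    return parts[0], parts[1:]


def fold_case(word, phonemes):
    """Upper-case a word and its phonemes so entries can be compared."""
    # Assumption: letter case is not significant in either dictionary.
    return word.upper(), [p.upper() for p in phonemes]


word, phonemes = parse_sphinx_line("agent   EY JH AH N T")
print(fold_case(word, phonemes))
# → ('AGENT', ['EY', 'JH', 'AH', 'N', 'T'])
```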

You can see the entire dictionary in the cmu07a.dic file included in the current PocketSphinx download.

Because the dictionary gives the pronunciation of each word, you might need to edit the file to suit a particular speaker or dialect. Over time, your custom dictionaries become a knowledge asset in their own right. A text editor makes it easy to edit plain files, but it is just as easy to introduce errors: using a delimiter other than the one the format requires, inserting non-ASCII characters, placing fields in the wrong order, sorting entries incorrectly, omitting square brackets where they are required, and so on.
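Several of these mistakes can be caught mechanically before a hand-edited file is used. The sketch below is illustrative only: check_htk_line is a hypothetical helper, and it assumes the HTK layout of Listing 1 with exactly one tab between fields. It does not check the sort order of the file.

```python
def check_htk_line(line):
    """Return a list of problems found in one HTK dictionary line."""
    problems = []
    if not line.isascii():  # str.isascii() needs Python 3.7 or later
        problems.append("non-ASCII character")
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        problems.append(
            f"expected 3 tab-delimited fields, found {len(fields)}")
        return problems  # the remaining checks need all three fields
    label, grapheme, phonemes = fields
    if not (grapheme.startswith("[") and grapheme.endswith("]")):
        problems.append("second field is not wrapped in square brackets")
    if not phonemes.split():
        problems.append("no phonemes")
    return problems


print(check_htk_line("AGENT\t[AGENT]\tey jh ih n t"))  # → []
print(check_htk_line("AGENT\tAGENT\tey jh ih n t"))
# → ['second field is not wrapped in square brackets']
```

Returning a list of messages rather than raising on the first error lets a caller report every problem in a file in one pass.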

Plain files have another shortcoming. A custom file that you build for one speech project is not necessarily compatible with other speech projects. A dictionary in a standard XML format such as PLS, once two projects both recognize that format, is immediately usable by each of them.
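To make the target format concrete, here is a minimal sketch that writes a few entries as a PLS document with Python's standard xml.etree.ElementTree module. The element names (lexicon, lexeme, grapheme, phoneme) and the namespace come from the W3C PLS 1.0 specification. The function name entries_to_pls, the alphabet value "x-arpabet", and the default language "en-US" are assumptions chosen for illustration, not values taken from this article.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def entries_to_pls(entries, alphabet="x-arpabet", lang="en-US"):
    """Serialize (grapheme, phonemes) pairs as a PLS lexicon string."""
    # Emit PLS as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(f"{{{PLS_NS}}}lexicon", {
        "version": "1.0",
        "alphabet": alphabet,          # assumed vendor-specific name
        f"{{{XML_NS}}}lang": lang,     # serialized as xml:lang
    })
    for grapheme, phonemes in entries:
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = " ".join(phonemes)
    return ET.tostring(lexicon, encoding="unicode")


print(entries_to_pls([("AGENT", ["ey", "jh", "ih", "n", "t"])]))
```

Building the tree with a library rather than concatenating strings means that special characters in a word are escaped automatically.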
