"Sphinx" Chinese acoustic Model training

Source: Internet
Author: User

I. Using CMUSphinx to train an acoustic model

The CMUSphinx toolkit comes with several high-quality acoustic models, including American English, French, and Chinese models. These models are well optimized, and most command-and-control systems can use them directly, as can some large-vocabulary applications.

In addition, CMUSphinx can adapt an existing model, which covers most cases where higher accuracy is needed. Adaptation works well when the recording environment differs (for example a close-talking microphone, a distant microphone, or a telephone channel) or when the accent differs, such as British or Indian English instead of American English. Adaptation also helps when you need to support a new language in a very short time: you only need to map the phone set of an existing acoustic model to the target phone set through the dictionary.

In some cases, however, the existing models do not work, for example for handwriting recognition or for dictation in other languages. In these cases you need to train your own acoustic model. The following tutorial shows how to start training.

II. Start training

Before training, make sure you have enough data:

    • For a single-speaker command-and-control application: at least 1 hour of recordings
    • For a multi-speaker command-and-control application: 5 hours of recordings from 200 speakers
    • For single-speaker dictation: 10 hours of recordings
    • For multi-speaker dictation: 50 hours of recordings from 200 speakers
    • You also need knowledge of the phonetic structure of the language, and enough time (about one month) to train the model

If you do not have enough data, time, or experience, it is recommended that you adapt an existing model instead.

Data Preparation

The trainer needs to know which sound units it should learn parameters for, and at least the sequence in which they occur in each recording of your training set. This information is stored in the transcript file.

The trainer then looks up a dictionary in which each word is mapped to its sequence of sound units.

So, in addition to the speech data, you need a set of transcripts and two dictionaries: one that maps each word to its pronunciation, and one that lists non-speech sounds, called the filler dictionary.

Training begins

The following two directories need to be prepared before training

    • etc
      • your_db.dic - Phonetic dictionary
      • your_db.phone - Phone set file
      • your_db.lm.dmp - Language model
      • your_db.filler - List of fillers
      • your_db_train.fileids - List of files for training
      • your_db_train.transcription - Transcription for training
      • your_db_test.fileids - List of files for testing
      • your_db_test.transcription - Transcription for testing
    • wav
      • speaker_1
        • file_1.wav - Recording of a speech utterance
      • speaker_2
        • file_2.wav
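The layout above can be checked before training starts. The following is a minimal sketch, not part of SphinxTrain: the function name `missing_training_files` and the suffix list are assumptions made for illustration, built only from the file names listed above.

```python
from pathlib import Path

# File-name suffixes expected under etc/, taken from the listing above.
ETC_SUFFIXES = [
    ".dic", ".phone", ".lm.dmp", ".filler",
    "_train.fileids", "_train.transcription",
    "_test.fileids", "_test.transcription",
]


def missing_training_files(root, db_name):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    expected = [root / "etc" / f"{db_name}{suffix}" for suffix in ETC_SUFFIXES]
    missing = [path for path in expected if not path.is_file()]
    # The wav directory holds one sub-directory per speaker.
    if not (root / "wav").is_dir():
        missing.append(root / "wav")
    return missing
```

Calling `missing_training_files(".", "your_db")` from the training directory returns an empty list when every file from the listing is in place.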

The fileids files (your_db_train.fileids and your_db_test.fileids) list the file names of the recordings, one per line. For multi-speaker recordings you can include the speaker directory in the path. Note that the file names are written without the extension.

   speaker_1/file_1
   speaker_2/file_2

The text of each recording is listed in the transcription files (your_db_train.transcription and your_db_test.transcription). Wrap each sentence in <s> and </s> tags and append the file ID in parentheses at the end of the line.

   <s> Hello World </s> (file_1)
   <s> foo bar </s> (file_2)
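Because every transcription line ends with the ID of its recording, the pairing between a fileids file and a transcription file can be checked mechanically. The sketch below is an illustration only: `check_alignment` is a hypothetical helper, and it assumes that the ID in parentheses equals the last path component of the matching fileids line, as in the examples above.

```python
import re

# Matches "<s> words </s> (utterance_id)" as in the example above.
TRANSCRIPT_LINE = re.compile(r"^<s>\s*(.*?)\s*</s>\s*\((\S+)\)$")


def check_alignment(fileids_lines, transcription_lines):
    """Return a list of problems; an empty list means the files line up."""
    problems = []
    if len(fileids_lines) != len(transcription_lines):
        problems.append(
            f"line count differs: {len(fileids_lines)} fileids vs "
            f"{len(transcription_lines)} transcriptions"
        )
    pairs = zip(fileids_lines, transcription_lines)
    for number, (fileid, line) in enumerate(pairs, start=1):
        match = TRANSCRIPT_LINE.match(line.strip())
        if match is None:
            problems.append(f"line {number}: malformed transcription {line.strip()!r}")
            continue
        utterance_id = match.group(2)
        # Assumption: the ID is the file name without the speaker directory.
        base_name = fileid.strip().split("/")[-1]
        if base_name != utterance_id:
            problems.append(
                f"line {number}: fileid {fileid.strip()!r} does not match id {utterance_id!r}"
            )
    return problems
```

Reading both files with `open(...).read().splitlines()` and passing the two lists to `check_alignment` reports every line whose order or ID does not agree.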

Note that the lines in the transcription file must be in the same order as the lines in the fileids file. The following fileids file lists the second utterance first, which is wrong and will cause an error.

   speaker_2/file_2
   speaker_1/file_1   //error! Do not create a fileids file like this!

Recordings must be in MS WAV format: 16 kHz, 16 bit, mono for desktop applications, and 8 kHz, 16 bit, mono for telephony applications. Pay attention to this, because a wrong audio format is a frequent cause of errors.

Speech recordings (WAV files): recording files must be in MS WAV format with a specific sample rate: 16 kHz, 16 bit, mono for desktop applications, and 8 kHz, 16 bit, mono for telephone applications. Double-check this, because a wrong audio file format is the most common source of training issues. Audio files shouldn't be very long and shouldn't be very short. The optimal length is not less than 5 seconds and not more than 30 seconds. The amount of silence at the beginning and at the end of the utterance should not exceed 0.2 seconds.

It is critical to have audio files in a specific format. SphinxTrain does support some variety of sample rates, but by default it is configured to train from 16 kHz, 16 bit, mono files in MS WAV format. You need to make sure that your recordings are at a sampling rate of 16 kHz (or 8 kHz if you train a telephone model), in mono with a single channel.

If you train from 8 kHz data, you need to make sure you have configured feature extraction properly.

Please note that you cannot upsample your audio, which means you cannot train a 16 kHz model with 8 kHz data.

Audio format mismatch is the most common training problem.
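Since format mismatch is so common, it is worth checking every recording before training. The following is a minimal sketch using Python's standard `wave` module; `check_wav` is a hypothetical helper, and the default duration bounds (5 and 30 seconds) are assumptions that should be adjusted to your data. Pass `expected_rate=8000` for telephone recordings.

```python
import wave


def check_wav(path, expected_rate=16000, min_sec=5.0, max_sec=30.0):
    """Return a list of problems found in one WAV file; empty means it looks fine."""
    problems = []
    # wave.open raises wave.Error if the file is not an uncompressed PCM WAV file.
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        if rate != expected_rate:
            problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16 bit
            problems.append(f"sample width is {8 * wav.getsampwidth()} bit, expected 16 bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        duration = wav.getnframes() / rate
        if not min_sec <= duration <= max_sec:
            problems.append(f"duration is {duration:.1f} s, outside {min_sec}-{max_sec} s")
    return problems
```

Running `check_wav` over every path listed in the fileids files (with ".wav" appended) gives a quick report of recordings that would need to be re-recorded or converted. The sketch does not measure leading or trailing silence.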

Phonetic dictionary (your_db.dic) should have one line per word, with the word followed by its phonetic transcription:

HELLO HH AH L OW
WORLD W AO R L D

If you need to find a phonetic dictionary, read Wikipedia or a book on phonetics. If you are using an existing phonetic dictionary, do not use case-sensitive variants like "e" and "E". Instead, all your phones must be different even in case-insensitive variation. SphinxTrain doesn't support some special characters like '*' or '/' and supports most of the others like "+" or "-" or ":", but to be safe we recommend an alphanumeric-only phone set.

Replace special characters in the phone set, such as colons or dashes or tildes, with something alphanumeric. For example, replace 'a~' with 'AA' to make it alphanumeric only. Nowadays, even cell phones have gigabytes of memory on board. There is no sense in trying to save space with cryptic special characters.
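Both rules (alphanumeric names only, and no phones that differ only by case) can be checked automatically. This is a sketch under assumptions: `phone_set_problems`, `rename_phone`, and the `REPLACEMENTS` table are hypothetical names, and only the 'a~' to 'AA' pair comes from the text above.

```python
from collections import defaultdict


def phone_set_problems(phones):
    """Return (non_alphanumeric, case_collisions) for a list of phone names."""
    non_alphanumeric = [p for p in phones if not (p.isascii() and p.isalnum())]
    by_folded = defaultdict(set)
    for phone in phones:
        by_folded[phone.lower()].add(phone)
    # Groups with more than one member differ only by letter case.
    collisions = sorted(sorted(group) for group in by_folded.values() if len(group) > 1)
    return non_alphanumeric, collisions


# Hypothetical replacement table; extend it for your own phone set.
REPLACEMENTS = {"a~": "AA"}


def rename_phone(phone):
    """Return the alphanumeric replacement for a phone, or the phone unchanged."""
    return REPLACEMENTS.get(phone, phone)
```

For example, `phone_set_problems(["a~", "e", "E", "AH"])` flags "a~" as non-alphanumeric and reports "E" and "e" as a case collision.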

There is one very important thing. For a large vocabulary database, the phonetic representation is more or less known; it's simple phones described in any book. If you don't have a phonetic book, you can just use the word's spelling and it gives very good results:

ONE O N E
TWO T W O
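A spelling-based dictionary like this one can be generated from a word list. The sketch below is only an illustration: `spelling_dictionary` is a hypothetical helper, and writing the entries in upper case is an assumption that follows the examples in this tutorial.

```python
def spelling_dictionary(words):
    """Return dictionary lines that use each word's letters as its phones."""
    lines = []
    for word in sorted({w.upper() for w in words}):
        # Keep letters and digits only, so the phone set stays alphanumeric.
        letters = [ch for ch in word if ch.isalnum()]
        lines.append(f"{word} {' '.join(letters)}")
    return lines
```

The returned lines can be written directly to the dictionary file, one entry per line.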

For small vocabularies, CMUSphinx is different from other toolkits. It's often recommended to train word-based models for small vocabulary databases like digits. But that only makes sense if your HMMs can have variable length. CMUSphinx does not support word models. Instead, you need to use a word-dependent phone dictionary:

ONE W_ONE AH_ONE N_ONE
TWO T_TWO UH_TWO
NINE N_NINE AY_NINE N_END_NINE
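Word-dependent entries can be derived from an ordinary pronunciation by suffixing every phone with the word. The following sketch makes one assumption that differs from the hand-written example above: a phone that occurs twice in a word is disambiguated with a counter (N2_NINE) rather than a descriptive name (N_END_NINE). The function name `word_dependent_entry` is hypothetical.

```python
def word_dependent_entry(word, phones):
    """Return one dictionary line whose phones are specific to the given word."""
    seen = {}
    tagged = []
    for phone in phones:
        seen[phone] = seen.get(phone, 0) + 1
        # Assumption: repeated phones get a numeric suffix to stay distinct.
        name = phone if seen[phone] == 1 else f"{phone}{seen[phone]}"
        tagged.append(f"{name}_{word}")
    return f"{word} {' '.join(tagged)}"
```

For instance, `word_dependent_entry("ONE", ["W", "AH", "N"])` produces the first line of the example above.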

This is actually equivalent to word-based models and sometimes even gives better accuracy. Do not use word-based models with CMUSphinx.

Phoneset file (your_db.phone) should have one phone per line. The phones listed should match the phones used in the dictionary, plus the special SIL phone for silence:

AH
AX
DH
IX
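To keep the phoneset file consistent with the dictionary, it can be derived from the dictionary itself. This is a minimal sketch: `phoneset_from_dictionary` is a hypothetical helper, and it assumes each dictionary line is a word followed by whitespace-separated phones, as in the examples above.

```python
def phoneset_from_dictionary(dict_lines):
    """Return every phone used in the dictionary, plus SIL, in sorted order."""
    phones = {"SIL"}  # the special silence phone is always required
    for line in dict_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        phones.update(parts[1:])  # parts[0] is the word itself
    return sorted(phones)
```

Writing the result with `"\n".join(...)` gives a file with one phone per line.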

"Sphinx" Chinese acoustic Model training
