One. Training a short-phrase language model without word segmentation
Reference: http://cmusphinx.sourceforge.net/wiki/tutoriallm (the official Sphinx language-model tutorial)
1) Text Preparation
Prepare a text file containing one phrase per line. Each line is wrapped in <s> and </s> marks, as shown below, with a space between each mark and the phrase. The file is UTF-8 encoded and named test.txt.
<s> Sophie </s>
<s> Hundred Things </s>
<s> Nestle </s>
<s> P&G </s>
<s> Shell </s>
<s> Unified </s>
<s> Qualcomm </s>
<s> Kohler </s>
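If the phrase list is long, the wrapping can be scripted. A minimal shell sketch, assuming the raw phrases sit one per line in a hypothetical UTF-8 file phrases.txt:

# Wrap each phrase in <s> ... </s>, keeping a space on each side of the phrase.
while IFS= read -r phrase; do
    printf '<s> %s </s>\n' "$phrase"
done < phrases.txt > test.txt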
2) Upload the file to the server and generate the word-frequency vocabulary file
text2wfreq < test.txt | wfreq2vocab > test.vocab
The intermediate output is as follows:

text2wfreq : Reading text from standard input...
wfreq2vocab : Will generate a vocabulary containing the most frequent 20000 words. Reading wfreq stream from stdin...
text2wfreq : Done.
wfreq2vocab : Done.
The resulting file is test.vocab, with the following format:
## Vocab generated by v2 of the CMU-Cambridge Statistical
## Language Modeling toolkit.
##
## Includes 178 words ##
</s>
<s>
one shop best Shanghai Beach Silk tower Tiffany

(The actual file lists one word per line; only a few sample entries are reproduced here.)
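To double-check the word count claimed in the header, filter out the ## comment lines and count what remains; for the file above this should print 178:

grep -v '^##' test.vocab | wc -l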
3) Generate the ARPA file
text2idngram -vocab test.vocab -idngram test.idngram < test.txt
idngram2lm -vocab_type 0 -idngram test.idngram -vocab test.vocab -arpa test.lm
The intermediate output of the first command is:
text2idngram
Vocab                  : test.vocab
Output idngram         : test.idngram
N-gram buffer size     : 100
Hash table size        : 2000000
Temp directory         : cmuclmtk-mtadbf
Max open files         : 20
FOF size               : 10
n                      : 3
Initialising hash table...
Reading vocabulary...
Allocating memory for the n-gram buffer...
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-mtadbf/1
Merging 1 temporary files...

2-grams occurring:   N times   > N times   Sug. -spec_num value
        0                        351         364
        1            348           3          13
        2              2           1          11
        3              0           1          11
        4              0           1          11
        5              0           1          11
        6              0           1          11
        7              0           1          11
        8              0           1          11
        9              0           1          11
       10              0           1          11
3-grams occurring:   N times   > N times   Sug. -spec_num value
        0                        525         540
        1            522           3          13
        2              3           0          10
        3              0           0          10
        4              0           0          10
        5              0           0          10
        6              0           0          10
        7              0           0          10
        8              0           0          10
        9              0           0          10
       10              0           0          10
text2idngram : Done.
The result file is test.idngram, a binary file; viewed as text its content looks like this:
^@^@^@^a^@^@^@^b^@^@^@^c^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^d^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^e^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^f^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^g^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^h^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@@@
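Judging from the bytes above, each record appears to be three 4-byte word IDs followed by a 4-byte count (an observation from this dump, not documented behavior). A byte-level dump with od makes the file easier to inspect than opening it in an editor:

od -A d -t x1 test.idngram | head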
The second command's intermediate output consists of a large number of warnings, although it does finish with done; judging by these warnings, the language model produced here probably has a problem.
warning : P(2) = 0 (0/177) ncount = 1
warning : P(2) = 0 (0/177) ncount = 1
warning : P(2) = 0 (0/177) ncount = 1
warning : P(2) = 0 (0/177) ncount = 1
......
Writing out language model...
ARPA-style 3-gram will be written to test.lm
idngram2lm : Done.
The result file is test.lm; opening it shows this header:
This is a CLOSED-vocabulary model
(OOVs eliminated from training data and are forbidden in test data)

Good-Turing discounting was applied.
1-gram frequency of frequency : 174
2-gram frequency of frequency : 348 2 0 0 0 0 0
3-gram frequency of frequency : 522 3 0 0 0 0 0
1-gram discounting ratios : 0.99
2-gram discounting ratios : 0.00
3-gram discounting ratios : 0.00
This file is in the ARPA-standard format introduced by Doug Paul.
At first glance this header seems to say the model has only 1-grams and lacks 2-grams and 3-grams, but looking further down in the .lm file, the 2-grams and 3-grams are in fact listed, one entry per line (see the skeleton below).
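For reference, the body of an ARPA-format model is laid out like the skeleton below (illustrative structure only, not the actual contents of this test.lm): n-gram counts in a \data\ section, then one entry per line in the \1-grams:, \2-grams: and \3-grams: sections.

\data\
ngram 1=<count>
ngram 2=<count>
ngram 3=<count>

\1-grams:
<log10 probability> <word> <log10 backoff weight>

\2-grams:
<log10 probability> <word1 word2> <log10 backoff weight>

\3-grams:
<log10 probability> <word1 word2 word3>

\end\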
Two. Using the language model
Use the Chinese acoustic model and the Chinese dictionary that ship with Sphinx, together with the language model trained above, to recognize certain strings. The phrases here contain about 160 distinct characters; the dictionary gives the pronunciations of all 160, and the stock acoustic model covers them amply. So by this logic the recognition process should find each character and then, following the language model, combine the characters into words and recognize the correct phrase.
With pocketsphinx installed on Windows, run:
pocketsphinx_continuous.exe -inmic yes -lm test.lm -dict test.dic -hmm zh_broadcastnews_ptm256_8000
Here the model passed via -lm is the directly generated .lm file, whereas in the various guides the .lm model is first converted into a .DMP model before use; I don't know whether the problem lies here.
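If the format difference does matter, sphinxbase ships sphinx_lm_convert, which converts the ARPA .lm into the binary DMP form those guides use; a sketch (the output name test.lm.DMP is my choice):

sphinx_lm_convert -i test.lm -o test.lm.DMP
pocketsphinx_continuous.exe -inmic yes -lm test.lm.DMP -dict test.dic -hmm zh_broadcastnews_ptm256_8000

For repeatable tests it may also help to decode a recorded file instead of the microphone, replacing -inmic yes with -infile some.wav.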
Three. Next plan
1) Take all the strings, run them through word segmentation, train the language model on the segmented text, and then use it with the stock acoustic model.
Setting performance aside, online word-segmentation tools such as the following can be used directly:
PHP word-segmentation demo: http://www.phpbone.com/phpanalysis/demo.php?ac=done
SCWS Chinese word segmentation: http://www.xunsearch.com/scws/demo.php
NLPIR, from the NLP group at the Chinese Academy of Sciences Institute of Computing Technology: http://ictclas.nlpir.org/nlpir/ (I just want to say this is what I consider an interesting way to do NLP)
The output of these tools still needs further processing, so they are not very practical at the moment; a scriptable alternative is sketched below.
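As a scriptable alternative to these web demos, the jieba package has a command-line mode (an assumption on my part: jieba is not one of the tools above, and any scriptable segmenter would do). Assuming Python and jieba are installed and the unsegmented phrases are in a hypothetical raw.txt, this writes space-delimited words that the wrapping step from section one can then turn into training text:

# Segment raw.txt, separating words with single spaces.
python -m jieba -d ' ' < raw.txt > segmented.txt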
2) Record 300 sentences, train an acoustic model, and use it together with the corresponding language model.