[Sphinx] Chinese Language Model Training

Source: Internet
Author: User

Training a language model of short phrases, without word segmentation.

Reference: http://cmusphinx.sourceforge.net/wiki/tutoriallm (the official Sphinx LM tutorial)

1) Text Preparation

Generate a text file containing one phrase per line. Each line is wrapped in <s> and </s> markers, with a space before and after the phrase, as shown below. The file is UTF-8 encoded and named test.txt.

<s> Sophie </s>
<s> Hundred Things </s>
<s> Nestle </s>
<s> P & g </s>
<s> Shell </s>
<s> Unified </s>
<s> Qualcomm </s>
<s> Kohler </s>
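A corpus file in this shape can be generated with a short script; a minimal sketch (the phrase list and the file name test.txt are taken from the example above):

```python
# Write a CMUCLMTK training corpus: one phrase per line,
# wrapped in <s> ... </s> with spaces around the phrase.
phrases = ["Sophie", "Hundred Things", "Nestle", "P & g",
           "Shell", "Unified", "Qualcomm", "Kohler"]

with open("test.txt", "w", encoding="utf-8") as f:
    for p in phrases:
        f.write(f"<s> {p} </s>\n")
```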

2) Upload the file to the server and generate the word-frequency and vocabulary files

text2wfreq < test.txt | wfreq2vocab > test.vocab

The intermediate process is as follows:

text2wfreq : Reading text from standard input...
wfreq2vocab : Will generate a vocabulary containing the most frequent 20000 words.
Reading wfreq stream from stdin...
text2wfreq : Done.
wfreq2vocab : Done.

The resulting file is test.vocab, with the format:

## Vocab generated by v2 of the CMU-Cambridge Statistical
## Language Modeling toolkit.
##
## Includes 178 words ##
</s>
<s>
one
shop
best
Shanghai
Beach
Silk
tower
Tiffany
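What text2wfreq piped into wfreq2vocab computes can be approximated in a few lines; a rough Python sketch (not the toolkit itself), assuming whitespace tokenization and the same sorted-vocabulary output:

```python
from collections import Counter

def build_vocab(corpus_path, max_words=20000):
    """Count whitespace-separated tokens (including <s> and </s>)
    and return the most frequent ones, sorted alphabetically
    like the wfreq2vocab output above."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    top = [w for w, _ in counts.most_common(max_words)]
    return sorted(top)
```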

3) Generate ARPA file

text2idngram -vocab test.vocab -idngram test.idngram < test.txt
idngram2lm -vocab_type 0 -idngram test.idngram -vocab test.vocab -arpa test.lm

The intermediate output of the first command is:

text2idngram
Vocab : test.vocab
Output idngram : test.idngram
N-gram buffer size : 100
Hash table size : 2000000
Temp directory : cmuclmtk-mtadbf
Max open files : 20
FOF size : 10
n : 3
Initialising hash table...
Reading vocabulary...
Allocating memory for the n-gram buffer...
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-mtadbf/1
Merging 1 temporary files...

2-grams occurring   N times   > N times   Sug. -spec_num value
         0                       351         364
         1            348          3          13
         2              2          1          11
         3              0          1          11
         4              0          1          11
         5              0          1          11
         6              0          1          11
         7              0          1          11
         8              0          1          11
         9              0          1          11
        10              0          1          11

3-grams occurring   N times   > N times   Sug. -spec_num value
         0                       525         540
         1            522          3          13
         2              3          0          10
         3              0          0          10
         4              0          0          10
         5              0          0          10
         6              0          0          10
         7              0          0          10
         8              0          0          10
         9              0          0          10
        10              0          0          10

text2idngram : Done.

The result file is test.idngram; viewed as text, its binary content looks like:

^@^@^@^a^@^@^@^b^@^@^@^c^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^d^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^e^@^@^@^a^@^@^@^a^@^@^@ ^b^@^@^@^f^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^g^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^h^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@  ^ @^@^@^a^@^@^@^a^@^@^@^b^@^@^@@@
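The ^@ dump above is binary data. Assuming each record is the n word IDs followed by a count, each stored as a 4-byte big-endian integer (an assumption about this build of text2idngram, not stated anywhere above), it can be decoded like this:

```python
import struct

def read_idngram(data, n=3):
    """Decode id-n-gram records from raw bytes: n word IDs plus a
    count, each a 4-byte big-endian unsigned int (assumed layout)."""
    rec = struct.Struct(">" + "I" * (n + 1))
    for off in range(0, len(data) - rec.size + 1, rec.size):
        *ids, count = rec.unpack_from(data, off)
        yield tuple(ids), count

# The first record visible in the dump: word IDs 1, 2, 3 with count 1
# (^@ is 0x00, ^a is 0x01, and so on).
sample = bytes.fromhex("00000001000000020000000300000001")
```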

The second command produces many warnings in its intermediate output, but finally shows done; the language model is probably problematic here.

Warning : P(2) = 0 (0/177) ncount = 1
Warning : P(2) = 0 (0/177) ncount = 1
Warning : P(2) = 0 (0/177) ncount = 1
Warning : P(2) = 0 (0/177) ncount = 1
......

Writing out language model...
ARPA-style 3-gram will be written to test.lm
idngram2lm : Done.

The result file is test.lm; open it to view the contents:

This is a CLOSED-vocabulary model
(OOVs eliminated from training data and are forbidden in test data)
Good-Turing discounting was applied.
1-gram frequency of frequency : 174
2-gram frequency of frequency : 348 2 0 0 0 0 0
3-gram frequency of frequency : 522 3 0 0 0 0 0
1-gram discounting ratios : 0.99
2-gram discounting ratios : 0.00
3-gram discounting ratios : 0.00
This file is in the ARPA-standard format introduced by Doug Paul.

At first glance this seems to say there are only 1-grams and no 2-grams or 3-grams; in fact, looking further into the .lm file, the 2-grams and 3-grams are listed as well, with each section delimited line by line.
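The \1-grams:, \2-grams:, and \3-grams: sections of an ARPA file can be checked programmatically; a sketch that counts the entries in each section:

```python
import re

def arpa_section_counts(text):
    """Count entries per \\N-grams: section of an ARPA-format LM.
    Sections start with a line like '\\2-grams:' and run until the
    next backslash-prefixed line ('\\end\\' or the next section)."""
    counts = {}
    order = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"\\(\d+)-grams:", line)
        if m:
            order = int(m.group(1))
            counts[order] = 0
        elif line.startswith("\\"):
            order = None          # left the current n-gram section
        elif order is not None and line:
            counts[order] += 1    # one n-gram entry per line
    return counts
```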

II. Using the Language Model

Use Sphinx's own Chinese acoustic model and Chinese dictionary together with the language model trained here to recognize specific strings. There are 160 words here; the dictionary gives the pronunciation of these 160 words, and the acoustic model is large and rich enough to cover them. So, by this logic, the recognition process should find each matching word and then, following the language model, combine the words into phrases and recognize the correct phrase.

Pocketsphinx is installed on Windows and invoked as follows:

pocketsphinx_continuous.exe -inmic yes -lm test.lm -dict test.dic -hmm zh_broadcastnews_ptm256_8000

Here the model passed with -lm is the directly generated .lm file; in various tutorials the .lm model is first converted to a .DMP model. That conversion is skipped here, and it is unclear whether this is where the problem lies.

III. Next Plan

1) Take all the strings, run them through word segmentation, train the language model on the segmented text, and then use it with the stock acoustic model.

Online word-segmentation tools (performance aside) such as the following can be used directly:

PHP word-segmentation demo: http://www.phpbone.com/phpanalysis/demo.php?ac=done

SCWS Chinese word segmentation: http://www.xunsearch.com/scws/demo.php

NLPIR, the Chinese Academy of Sciences NLP toolkit: http://ictclas.nlpir.org/nlpir/ (I just want to say this is, to my mind, an interesting direction for NLP)

The segmentation results still need further processing, so this is not very practical at the moment.
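As an offline alternative to the online tools above, word segmentation can be roughly approximated with forward maximum matching against a fixed dictionary (a toy sketch, not one of the tools listed; the mini-dictionary is hypothetical and real segmenters are far more sophisticated):

```python
def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in dictionary:
                words.append(cand)
                i += size
                break
    return words

# Hypothetical mini-dictionary for illustration only.
vocab = {"语言", "模型", "训练", "中文"}
```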

2) record 300 sentences, train an acoustic model, and use it with the corresponding language model.

