One. Training a short-phrase language model without word segmentation
Reference: http://cmusphinx.sourceforge.net/wiki/tutoriallm (the official Sphinx language-model tutorial)
1) Text Preparation
Prepare a text file containing one phrase per line. Each line is wrapped in <s> and </s> marks, as shown below, with a space between each mark and the phrase. The file is UTF-8 encoded and named test.txt.
<s> Sophie </s>
<s> Hundred Things </s>
<s> Nestle </s>
<s> P&G </s>
<s> Shell </s>
<s> Unified </s>
<s> Qualcomm </s>
<s> Kohler </s>
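If the phrase list is long, the wrapping can be scripted. A minimal shell sketch, assuming the raw phrases sit one per line in a hypothetical UTF-8 file phrases.txt:

# Wrap each phrase in <s> ... </s>, keeping a space on each side of the phrase.
while IFS= read -r phrase; do
    printf '<s> %s </s>\n' "$phrase"
done < phrases.txt > test.txt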
2) Upload the file to the server and generate the word-frequency vocabulary file
text2wfreq < test.txt | wfreq2vocab > test.vocab
The intermediate output is as follows:

text2wfreq : Reading text from standard input...
wfreq2vocab : Will generate a vocabulary containing the most frequent 20000 words. Reading wfreq stream from stdin...
text2wfreq : Done.
wfreq2vocab : Done.
The resulting file is test.vocab, with the following format:
## Vocab generated by v2 of the CMU-Cambridge Statistical
## Language Modeling toolkit.
##
## Includes 178 words ##
</s>
<s>
one shop best Shanghai Beach Silk tower Tiffany

(The actual file lists one word per line; only a few sample entries are reproduced here.)
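To double-check the word count claimed in the header, filter out the ## comment lines and count what remains; for the file above this should print 178:

grep -v '^##' test.vocab | wc -l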
3) Generate the ARPA file
text2idngram -vocab test.vocab -idngram test.idngram < test.txt
idngram2lm -vocab_type 0 -idngram test.idngram -vocab test.vocab -arpa test.lm
The intermediate output of the first command is:
text2idngram
Vocab                  : test.vocab
Output idngram         : test.idngram
N-gram buffer size     : 100
Hash table size        : 2000000
Temp directory         : cmuclmtk-mtadbf
Max open files         : 20
FOF size               : 10
n                      : 3
Initialising hash table...
Reading vocabulary...
Allocating memory for the n-gram buffer...
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-mtadbf/1
Merging 1 temporary files...

2-grams occurring:   N times   > N times   Sug. -spec_num value
        0                        351         364
        1            348           3          13
        2              2           1          11
        3              0           1          11
        4              0           1          11
        5              0           1          11
        6              0           1          11
        7              0           1          11
        8              0           1          11
        9              0           1          11
       10              0           1          11
3-grams occurring:   N times   > N times   Sug. -spec_num value
        0                        525         540
        1            522           3          13
        2              3           0          10
        3              0           0          10
        4              0           0          10
        5              0           0          10
        6              0           0          10
        7              0           0          10
        8              0           0          10
        9              0           0          10
       10              0           0          10
text2idngram : Done.
The result file is test.idngram, a binary file; viewed as text its content looks like this:
^@^@^@^a^@^@^@^b^@^@^@^c^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^d^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^e^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^f^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^g^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^h^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@^@^@^@^a^@^@^@^a^@^@^@^b^@^@^@@@
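Judging from the bytes above, each record appears to be three 4-byte word IDs followed by a 4-byte count (an observation from this dump, not documented behavior). A byte-level dump with od makes the file easier to inspect than opening it in an editor:

od -A d -t x1 test.idngram | head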
The second command's intermediate output consists of a large number of warnings, although it does finish with done; judging by these warnings, the language model produced here probably has a problem.
warning : P(2) = 0 (0/177) ncount = 1
warning : P(2) = 0 (0/177) ncount = 1
warning : P(2) = 0 (0/177) ncount = 1
warning : P(2) = 0 (0/177) ncount = 1
......
Writing out language model...
ARPA-style 3-gram will be written to test.lm
idngram2lm : Done.
The result file is test.lm; opening it shows this header:
This is a CLOSED-vocabulary model
(OOVs eliminated from training data and are forbidden in test data)

Good-Turing discounting was applied.
1-gram frequency of frequency : 174
2-gram frequency of frequency : 348 2 0 0 0 0 0
3-gram frequency of frequency : 522 3 0 0 0 0 0
1-gram discounting ratios : 0.99
2-gram discounting ratios : 0.00
3-gram discounting ratios : 0.00
This file is in the ARPA-standard format introduced by Doug Paul.
At first glance this header seems to say the model has only 1-grams and lacks 2-grams and 3-grams, but looking further down in the .lm file, the 2-grams and 3-grams are in fact listed, one entry per line (see the skeleton below).
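For reference, the body of an ARPA-format model is laid out like the skeleton below (illustrative structure only, not the actual contents of this test.lm): n-gram counts in a \data\ section, then one entry per line in the \1-grams:, \2-grams: and \3-grams: sections.

\data\
ngram 1=<count>
ngram 2=<count>
ngram 3=<count>

\1-grams:
<log10 probability> <word> <log10 backoff weight>

\2-grams:
<log10 probability> <word1 word2> <log10 backoff weight>

\3-grams:
<log10 probability> <word1 word2 word3>

\end\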
Two. Using the language model
Use the Chinese acoustic model and the Chinese dictionary that ship with Sphinx, together with the language model trained above, to recognize certain strings. The phrases here contain about 160 distinct characters; the dictionary gives the pronunciations of all 160, and the stock acoustic model covers them amply. So by this logic the recognition process should find each character and then, following the language model, combine the characters into words and recognize the correct phrase.
With pocketsphinx installed on Windows, run:
pocketsphinx_continuous.exe -inmic yes -lm test.lm -dict test.dic -hmm zh_broadcastnews_ptm256_8000
Here the model passed via -lm is the directly generated .lm file, whereas in the various guides the .lm model is first converted into a .DMP model before use; I don't know whether the problem lies here.
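If the format difference does matter, sphinxbase ships sphinx_lm_convert, which converts the ARPA .lm into the binary DMP form those guides use; a sketch (the output name test.lm.DMP is my choice):

sphinx_lm_convert -i test.lm -o test.lm.DMP
pocketsphinx_continuous.exe -inmic yes -lm test.lm.DMP -dict test.dic -hmm zh_broadcastnews_ptm256_8000

For repeatable tests it may also help to decode a recorded file instead of the microphone, replacing -inmic yes with -infile some.wav.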
Three. Next plan
1) Take all the strings, run them through word segmentation, train the language model on the segmented text, and then use it with the stock acoustic model.
Setting performance aside, online word-segmentation tools such as the following can be used directly:
PHP word-segmentation demo: http://www.phpbone.com/phpanalysis/demo.php?ac=done
SCWS Chinese word segmentation: http://www.xunsearch.com/scws/demo.php
NLPIR, from the NLP group at the Chinese Academy of Sciences Institute of Computing Technology: http://ictclas.nlpir.org/nlpir/ (I just want to say this is what I consider an interesting way to do NLP)
The output of these tools still needs further processing, so they are not very practical at the moment; a scriptable alternative is sketched below.
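As a scriptable alternative to these web demos, the jieba package has a command-line mode (an assumption on my part: jieba is not one of the tools above, and any scriptable segmenter would do). Assuming Python and jieba are installed and the unsegmented phrases are in a hypothetical raw.txt, this writes space-delimited words that the wrapping step from section one can then turn into training text:

# Segment raw.txt, separating words with single spaces.
python -m jieba -d ' ' < raw.txt > segmented.txt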
2) Record 300 sentences, train an acoustic model, and use it together with the corresponding language model.