A Chinese pinyin input method in Python based on a hidden Markov model

I came across an introduction to hidden Markov models online and found them almost magical. I also found an expert's blog post on how to implement a Chinese pinyin input method with a hidden Markov model, but it did not include runnable code. Using the dictionary from the jieba word segmentation project, which can be found online, I trained a hidden Markov model myself and implemented a simple pinyin input method with the Viterbi algorithm. GitHub address: https://github.com/LiuRoy/Pinyin_Demo

Principle introduction

Hidden Markov model

Quoting a definition found online:

A hidden Markov model (HMM) is a statistical model used to describe a Markov process with hidden, unobserved states. The difficulty is to determine the hidden parameters of the process from the observable parameters, and then to use those parameters for further analysis.

For a pinyin input method, the observable parameters are the pinyin syllables, and the hidden parameters are the corresponding Chinese characters.

Viterbi algorithm

Refer to the https://zh.wikipedia.org/wiki/Viterbi algorithm article; the core idea is dynamic programming, and the code is simple enough that it is not repeated here.
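For reference, here is the standard log-space Viterbi recurrence (textbook formulation, not taken from the original post). With initial distribution $\pi$, transition probabilities $A$, emission probabilities $B$, and observed pinyin syllables $o_1, \dots, o_T$, for each candidate hidden character $s$:

$$V_1(s) = \log \pi(s) + \log B(s, o_1)$$
$$V_t(s) = \max_{s'}\left[V_{t-1}(s') + \log A(s', s)\right] + \log B(s, o_t)$$

Working in log space turns products of small probabilities into sums, which is why the model described below stores natural logarithms.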

Code explanation

Model definition

See the model/table.py file. For the three probability matrices of the hidden Markov model, three database tables were designed. The advantage is obvious: the character transition probability matrix is a very large sparse matrix, so storing it directly in a file would take a lot of space and it could only be loaded into memory all at once, which is both memory-hungry and slow. In addition, database join operations are very convenient for the probability calculations in the Viterbi algorithm.

The data tables are defined as follows:

from sqlalchemy import Column, Float, Integer, String

# BaseModel is the project's SQLAlchemy declarative base.

class Transition(BaseModel):
    __tablename__ = 'transition'

    id = Column(Integer, primary_key=True)
    previous = Column(String(1), nullable=False)
    behind = Column(String(1), nullable=False)
    probability = Column(Float, nullable=False)


class Emission(BaseModel):
    __tablename__ = 'emission'

    id = Column(Integer, primary_key=True)
    character = Column(String(1), nullable=False)
    pinyin = Column(String(7), nullable=False)
    probability = Column(Float, nullable=False)


class Starting(BaseModel):
    __tablename__ = 'starting'

    id = Column(Integer, primary_key=True)
    character = Column(String(1), nullable=False)
    probability = Column(Float, nullable=False)
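To illustrate why the joins are convenient, here is a hypothetical SQLAlchemy sketch (not the repository's actual code) of what the join_emission helper used later in viterbi.py might look like, given the tables above:

def join_emission(session, pinyin, character):
    """Most probable next character that follows `character` and reads as `pinyin`."""
    score = Transition.probability + Emission.probability  # log probabilities add
    return (
        session.query(Emission.character, score)
        .join(Transition, Transition.behind == Emission.character)
        .filter(Transition.previous == character,
                Emission.pinyin == pinyin)
        .order_by(score.desc())
        .first()  # None when no candidate exists
    )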

Model generation

See the train/main.py file. The functions init_starting, init_emission, and init_transition generate the initial probability matrix, the emission probability matrix, and the transition probability matrix of the hidden Markov model, and write the results to a SQLite file. The training data set is the dictionary from the jieba word segmentation project; because no long sentences were trained on, the final runs prove that the result is only suitable for short inputs.

Initial probability matrix

Building the initial probability matrix means finding every character that appears at the beginning of a word, counting how many words each one starts, and computing from those counts the probability of each character appearing word-initially. Characters that never appear at the start of a word have starting probability 0 and are not written to the database. One thing to note: to keep very small probabilities comparable on a computer, every probability is stored as its natural logarithm. (A screenshot of the resulting statistics appears in the original post.)
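A minimal sketch of this counting (hypothetical helper name, not the repository's train/main.py):

import math
from collections import Counter

def init_starting_sketch(words):
    """Count word-initial characters and return their log probabilities."""
    counter = Counter(word[0] for word in words if word)
    total = sum(counter.values())
    # Storing natural logs keeps tiny probabilities comparable and turns
    # the products in the Viterbi recurrence into sums.
    return {char: math.log(count / total) for char, count in counter.items()}

print(init_starting_sketch(['你好', '你们', '我们']))
# {'你': log(2/3) ≈ -0.405, '我': log(1/3) ≈ -1.099}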

Transition probability matrix

This uses the simplest first-order hidden Markov model: within a sentence, each character's appearance depends only on the single character immediately before it. Simple and crude, but it covers most situations. The statistics are gathered by finding, for every character in the dictionary, the set of characters that follow it, and computing the corresponding probabilities. Because this probability matrix is very large, writing the data to the database row by row is too slow; batch writes are a possible later optimization to improve training efficiency.

In the resulting statistics (shown as a screenshot in the original post), the 10 characters with the highest probability of following 的 are quite consistent with everyday usage.
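A minimal counting sketch for the transition statistics (hypothetical helper name, not the repository's train/main.py; the batch database writes are left out):

import math
from collections import Counter, defaultdict

def init_transition_sketch(words):
    """Count first-order character transitions within dictionary words
    and return log P(behind | previous)."""
    pair_counts = defaultdict(Counter)
    for word in words:
        for previous, behind in zip(word, word[1:]):
            pair_counts[previous][behind] += 1
    table = {}
    for previous, counter in pair_counts.items():
        total = sum(counter.values())
        for behind, count in counter.items():
            # Natural log, matching how the model stores probabilities.
            table[(previous, behind)] = math.log(count / total)
    return table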

Emission probability matrix

In plain terms, this means counting, for each Chinese character, its possible pinyin readings and how often each reading is used in practice. Take the character 暴 as an example: it has two readings, bao and pu, and the difficulty is finding the probability with which each reading occurs. The statistics here use the pypinyin module to convert the phrases in the dictionary to pinyin and then count the readings, but some of the conversions are not entirely correct, so the final input method occasionally produces characters that do not match the typed pinyin. (A screenshot of the statistics appears in the original post.)
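A minimal sketch of this counting, assuming pypinyin is installed (hypothetical helper name, not the repository's train/main.py):

import math
from collections import Counter, defaultdict

from pypinyin import lazy_pinyin  # pip install pypinyin

def init_emission_sketch(words):
    """Convert dictionary words to pinyin and count how often each
    character takes each reading, returning log P(pinyin | character)."""
    reading_counts = defaultdict(Counter)
    for word in words:
        # lazy_pinyin picks one context-dependent reading per character;
        # this is where the occasional wrong reading enters the model.
        for char, reading in zip(word, lazy_pinyin(word)):
            reading_counts[char][reading] += 1
    table = {}
    for char, counter in reading_counts.items():
        total = sum(counter.values())
        for reading, count in counter.items():
            table[(char, reading)] = math.log(count / total)
    return table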

Viterbi implementation

See the input_method/viterbi.py file. The implementation keeps at most 10 candidate solutions at each step. Note that these are 10 local optima rather than 10 global optima, but the best of the 10 is the global optimum. The code is as follows:

def viterbi(pinyin_list):
    """Viterbi algorithm implementation for the input method.

    Args:
        pinyin_list (list): list of pinyin syllables
    """
    # Candidate first characters for the first syllable, with their
    # combined starting and emission log probabilities.
    start_char = emission.join_starting(pinyin_list[0])
    V = {char: prob for char, prob in start_char}

    for i in range(1, len(pinyin_list)):
        pinyin = pinyin_list[i]

        prob_map = {}
        for phrase, prob in V.iteritems():  # Python 2; use V.items() on Python 3
            character = phrase[-1]
            # Best next character for this syllable given the previous character.
            result = transition.join_emission(pinyin, character)
            if not result:
                continue

            state, new_prob = result
            # Log probabilities add instead of multiply.
            prob_map[phrase + state] = new_prob + prob

        if prob_map:
            V = prob_map
        else:
            return V
    return V
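A hypothetical usage sketch (the syllables are illustrative, and the repository's emission and transition modules are assumed to be importable): viterbi returns a map from candidate phrases to log probabilities, so the best conversion is the key with the maximum value.

candidates = viterbi(['wo', 'men'])  # hypothetical input syllables
best = max(candidates, key=candidates.get)  # highest log probability wins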

Results

Run the input_method/viterbi.py file to try it; the original post shows screenshots of the conversion results.

Problem summary:

Generating the transition matrix from the dictionary is too slow because of the row-by-row database writes; a run takes nearly 10 minutes. The emission probability data is not accurate, so some output characters do not match the typed pinyin. The training set is too small, and the resulting input method does not handle long sentences.
