Python Implementation of Chinese Word Segmentation Based on the HMM Algorithm

Hidden Markov Model (HMM) Introduction

An HMM is defined by a five-tuple:

- StatusSet: the set of hidden state values
- ObservedSet: the set of observed values
- TransProbMatrix: the state transition probability matrix
- EmitProbMatrix: the emission probability matrix
- InitStatus: the initial state distribution

When applying the HMM to word segmentation, the problem to be solved is: given the parameters (ObservedSet, TransProbMatrix, EmitProbMatrix, InitStatus), find the most likely sequence of state values. The best-known way to solve this problem is the Viterbi algorithm.

Parameter introduction:

- StatusSet: the state values are (B, M, E, S): {B: begin, M: middle, E: end, S: single}. Each state describes the position of a character within a word: B means the character begins a word, M means it is in the middle of a word, E means it ends a word, and S means it forms a single-character word by itself.
- ObservedSet: the set of observed values is the set of all Chinese characters, including punctuation.
- TransProbMatrix: the state transition probability matrix gives the probability of moving from state X to state Y. It is a 4x4 matrix, i.e. {B, E, M, S} x {B, E, M, S}.
- EmitProbMatrix: each element of the emission probability matrix is a conditional probability P(observed[i] | status[j]).
- InitStatus: the initial state distribution gives the probability that the first character of a sentence belongs to each of the four states {B, E, M, S}.
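To make the five components concrete, here is a minimal sketch of how they might be represented as plain Python dictionaries for the {B, M, E, S} problem. Every number below is invented for illustration; real values come out of the training step described later:

```python
# Toy illustration of the five HMM components for B/M/E/S tagging.
# All probabilities are made up; training (Code 1) produces the real ones.
states = ['B', 'M', 'E', 'S']                            # StatusSet

init_status = {'B': 0.6, 'M': 0.0, 'E': 0.0, 'S': 0.4}   # InitStatus: a sentence
                                                         # can only start in B or S

trans_prob = {                                           # TransProbMatrix
    'B': {'B': 0.0, 'M': 0.3, 'E': 0.7, 'S': 0.0},       # after B: only M or E
    'M': {'B': 0.0, 'M': 0.4, 'E': 0.6, 'S': 0.0},       # after M: only M or E
    'E': {'B': 0.5, 'M': 0.0, 'E': 0.0, 'S': 0.5},       # a new word starts after E
    'S': {'B': 0.5, 'M': 0.0, 'E': 0.0, 'S': 0.5},       # a new word starts after S
}

emit_prob = {                                            # EmitProbMatrix:
    'B': {'中': 0.01, '新': 0.02},                        # P(character | state)
    'M': {'国': 0.01},
    'E': {'京': 0.01},
    'S': {'的': 0.05},
}
# ObservedSet is implicitly the set of all characters appearing in emit_prob.
```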

Viterbi Algorithm

The core idea of the Viterbi algorithm is dynamic programming over the highest-probability path. Following Michael Collins:
Define a dynamic programming table π(k, u, v), where
π(k, u, v) = the maximum probability of a tag sequence ending in tags u, v at position k.
For any k ∈ {1...n}: π(k, u, v) = max over w of ( π(k-1, w, u) × q(v | w, u) × e(x_k | v) )
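Note that Collins's table tracks two previous tags (u, v) because his formulation is for a trigram tagger. The segmenter in this article is a bigram (first-order) HMM, so the table only needs one tag per position, and the recurrence simplifies to:

V(0, y) = InitStatus(y) × EmitProb(y → obs[0])
V(t, y) = max over y' of ( V(t-1, y') × TransProbMatrix(y' → y) × EmitProb(y → obs[t]) )

This simpler recurrence is the one implemented in Code 2 below.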
Plenty of detailed material on the complete Viterbi algorithm is available online, so this article focuses on the implementation code.

Experiment

Code 1: Model Training

Training generates three files:
- prob_start.py: the initial state probabilities
- prob_trans.py: the state transition probabilities
- prob_emit.py: the emission probabilities
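For reference, each of these files simply contains the repr of a Python dictionary, which is why Code 2 below can reload them with eval. The numbers here are invented for illustration, but prob_start.py would look something like:

```python
{'B': 0.55, 'E': 0.0, 'M': 0.0, 'S': 0.45}
```

E and M are structurally zero because a sentence cannot start in the middle or at the end of a word.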

```python
# -*- coding: utf-8 -*-
# Bigram (first-order) hidden Markov model trainer.
# 'TrainCorpus.txt_utf8' is a manually segmented People's Daily corpus
# of more than 290,000 sentences.
import sys

A_dic = {}        # transition counts, normalized to probabilities at the end
B_dic = {}        # emission counts, normalized to probabilities at the end
Count_dic = {}    # occurrence count of each state
Pi_dic = {}       # initial-state counts, normalized to probabilities at the end
word_set = set()  # every character seen in the corpus
state_list = ['B', 'M', 'E', 'S']
line_num = -1

input_data = "TrainCorpus.txt_utf8"
prob_start = "trainhmm/prob_start.py"  # initial state probabilities
prob_emit = "trainhmm/prob_emit.py"    # emission probabilities
prob_trans = "trainhmm/prob_trans.py"  # transition probabilities


def init():
    # Initialize the count dictionaries.
    for state in state_list:
        A_dic[state] = {}
        for state1 in state_list:
            A_dic[state][state1] = 0.0
    for state in state_list:
        Pi_dic[state] = 0.0
        B_dic[state] = {}
        Count_dic[state] = 0


def getList(input_str):
    # Input a word, output its state sequence.
    outpout_str = []
    if len(input_str) == 1:
        outpout_str.append('S')
    elif len(input_str) == 2:
        outpout_str = ['B', 'E']
    else:
        M_num = len(input_str) - 2
        M_list = ['M'] * M_num
        outpout_str.append('B')
        outpout_str.extend(M_list)  # append the 'M's one by one
        outpout_str.append('E')
    return outpout_str


def output():
    # Write out the three model parameters:
    # initial + transition + emission probabilities.
    start_fp = open(prob_start, 'w', encoding="utf-8")
    emit_fp = open(prob_emit, 'w', encoding="utf-8")
    trans_fp = open(prob_trans, 'w', encoding="utf-8")
    print("len(word_set) = %s" % len(word_set))
    for key in Pi_dic:  # initial state probabilities
        Pi_dic[key] = Pi_dic[key] * 1.0 / line_num
    print(Pi_dic, file=start_fp)
    for key in A_dic:   # transition probabilities
        for key1 in A_dic[key]:
            A_dic[key][key1] = A_dic[key][key1] / Count_dic[key]
    print(A_dic, file=trans_fp)
    for key in B_dic:   # emission probabilities P(character | state)
        for word in B_dic[key]:
            B_dic[key][word] = B_dic[key][word] / Count_dic[key]
    print(B_dic, file=emit_fp)
    start_fp.close()
    emit_fp.close()
    trans_fp.close()


def main():
    global word_set, line_num
    ifp = open(input_data, encoding="utf-8", errors="ignore")  # ignore illegal characters
    init()
    for line in ifp:
        line_num += 1
        if line_num % 10000 == 0:
            print(line_num)
        line = line.strip()
        if not line:
            continue
        word_list = [c for c in line if c != " "]  # all characters of the line
        word_set = word_set | set(word_list)       # set of all characters in the corpus
        line_state = []
        for item in line.split(" "):
            line_state.extend(getList(item))       # one continuous state sequence per sentence
        if len(word_list) != len(line_state):
            print("[line_num = %d][line = %s]" % (line_num, line), file=sys.stderr)
        else:
            for i in range(len(line_state)):
                if i == 0:
                    Pi_dic[line_state[0]] += 1     # state of the first character, for initial probabilities
                    Count_dic[line_state[0]] += 1  # occurrence count of each state
                else:
                    A_dic[line_state[i - 1]][line_state[i]] += 1  # for transition probabilities
                    Count_dic[line_state[i]] += 1
                    # For emission probabilities (state -> character counts).
                    B_dic[line_state[i]][word_list[i]] = \
                        B_dic[line_state[i]].get(word_list[i], 0.0) + 1
    output()
    ifp.close()


if __name__ == "__main__":
    main()
```
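As a quick sanity check of the labeling scheme, getList maps each word of the pre-segmented corpus to its state sequence:

```python
>>> getList("中")       # single character -> S
['S']
>>> getList("中国")     # two characters -> B E
['B', 'E']
>>> getList("中国人")   # longer words -> B, some M's, E
['B', 'M', 'E']
```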
Code 2: Testing the Segmentation
```python
# -*- coding: utf-8 -*-

def load_model(f_name):
    with open(f_name, encoding="utf-8") as ifp:
        # eval's argument is a string that can be evaluated as a Python
        # expression: here, the dict literal written out by the trainer.
        return eval(ifp.read())

prob_start = load_model("trainhmm/prob_start.py")
prob_trans = load_model("trainhmm/prob_trans.py")
prob_emit = load_model("trainhmm/prob_emit.py")


def viterbi(obs, states, start_p, trans_p, emit_p):
    # The Viterbi algorithm (dynamic programming).
    V = [{}]   # V[t][y] = max probability of a state sequence ending in state y at position t
    path = {}
    for y in states:  # initialization at position 0
        V[0][y] = start_p[y] * emit_p[y].get(obs[0], 0)
        path[y] = [y]
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:  # recurrence from states y0 -> y
            (prob, state) = max([(V[t - 1][y0] * trans_p[y0].get(y, 0) * emit_p[y].get(obs[t], 0), y0)
                                 for y0 in states if V[t - 1][y0] > 0])
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath  # keep the best state sequence ending in each state
    # Maximum-probability state sequence ending at the last position.
    (prob, state) = max([(V[len(obs) - 1][y], y) for y in states])
    return (prob, path[state])  # return the probability and the state sequence


def cut(sentence):
    prob, pos_list = viterbi(sentence, ('B', 'M', 'E', 'S'),
                             prob_start, prob_trans, prob_emit)
    return (prob, pos_list)


if __name__ == "__main__":
    test_str = "新华网驻东京记者报道"  # "Xinhua correspondent in Tokyo reports"
    prob, pos_list = cut(test_str)
    print(test_str)
    print(pos_list)
```
Results
新华网驻东京记者报道 ("Xinhua correspondent in Tokyo reports")
['B', 'M', 'E', 'S', 'B', 'E', 'B', 'E', 'B', 'E']
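The output above is only the tag sequence. A small helper (not part of the original code; the function name here is my own) can recover the segmented words by cutting the sentence after every character tagged E or S:

```python
def tags_to_words(sentence, pos_list):
    # A word ends at every character tagged 'E' (end) or 'S' (single).
    words, buf = [], ""
    for char, tag in zip(sentence, pos_list):
        buf += char
        if tag in ('E', 'S'):
            words.append(buf)
            buf = ""
    if buf:  # flush a trailing partial word, just in case
        words.append(buf)
    return words

# For the run above, tags_to_words(test_str, pos_list) gives the words
# 新华网 / 驻 / 东京 / 记者 / 报道.
```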

The manually segmented training corpus (TrainCorpus.txt_utf8) can be downloaded from here.
