Chinese Word Segmentation: bigram word graph and the Viterbi algorithm (3)
1. First, create the dictionaries. "Dictionary" here should be understood as the data structure built by collecting word frequencies and related statistics from the training corpus, which is different from a dictionary in the "Xinhua Dictionary" sense. My implementation builds two dictionaries: the "word" dictionary counts how many times each word occurs, and the "word-pair" dictionary counts how many times each pair of adjacent words occurs (because a bigram language model is used). From these dictionaries, a first-level trie is then built, i.e. a dictionary indexed by the first word of each pair (a "first-word index").
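To make these data structures concrete, here is a tiny hand-written example; the words and numbers below are invented purely for illustration and are not real corpus statistics:

# Invented toy example of the three data structures (not real corpus statistics).
mysingleworddict = {'中国': 0.4, '人民': 0.35, '银行': 0.25}             # word -> relative frequency
mydoubleworddict = {'中国|S': 0.2, '中国|人民': 0.5, '人民|银行': 0.3}    # "w1|w2" -> relative frequency ('S' marks a sentence start)
mytrie = {'中国': {'中国|S': 0.2, '中国|人民': 0.5},                      # first word -> all pairs beginning with it
          '人民': {'人民|银行': 0.3}}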
The code to build them is as follows:
Create a word dictionary from the training corpus
# -*- coding: cp936 -*-
import re
import cPickle as mypickle

def datafile(name, sep='|'):
    '''Yield each line of the corpus, split into words by the separator.'''
    for line in file(name):
        yield line.split(sep)

candidates = datafile(r'C:\python26\bigramwordsegemtation\data\training.txt')
p1 = re.compile(r'(^\s+|\s+$)')   # leading/trailing whitespace
p2 = re.compile(r'\D')            # first character is not a digit
# p3 = re.compile(r'\s+')
mysingleworddict = {}
# mydoubleworddict = {}
for m in candidates:
    # singleline = []
    for e in m:
        e = p1.sub('', e)         # strip surrounding whitespace
        if p2.match(e):           # skip empty and digit-initial tokens
            # e = p3.sub('_', e)
            mysingleworddict[e] = float(mysingleworddict.get(e, 0) + 1)
            print 'Word: %s, Number: %s' % (e, mysingleworddict[e])
# Normalize raw counts into relative frequencies.
n = sum(mysingleworddict.itervalues())
for key in mysingleworddict.iterkeys():
    mysingleworddict[key] = mysingleworddict[key] / n
# for m in mysingleworddict.iteritems():
#     print m
fid = file('singleworddictionarycrossvalidation.dat', 'w')
mypickle.dump(mysingleworddict, fid)
fid.close()
print 'Finish'
print n
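A quick way to sanity-check the result is to load the pickle back and look up a word. The following is only a small usage sketch; the word used for the lookup is an arbitrary example:

# -*- coding: cp936 -*-
import cPickle as mypickle

# Load the pickled "word" dictionary written by the script above.
fid = file('singleworddictionarycrossvalidation.dat')
mysingleworddict = mypickle.load(fid)
fid.close()

word = '中国'  # arbitrary example word; any word from the corpus will do
print 'Relative frequency of %s: %s' % (word, mysingleworddict.get(word, 0.0))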
Create a "dual-word" dictionary from the training corpus
# -*- coding: cp936 -*-
import re
import cPickle as mypickle

delimiter = '|'

def datafile(name, sep='|'):
    '''Use a generator to create an iterable object.'''
    for line in file(name):
        yield line.split(sep)

candidates = datafile(r'C:\python26\bigramwordsegemtation\data\training.txt')
p1 = re.compile(r'(^\s+|\s+$)')   # leading/trailing whitespace
p2 = re.compile(r'\D')            # first character is not a digit
mydoubleworddict = {}
for m in candidates:
    singleline = []
    for e in m:
        e = p1.sub('', e)
        if p2.match(e):
            singleline.append(e)
    if len(singleline) >= 2:
        # Count the sentence-initial pair: the first word followed by the start marker 'S'.
        initial = singleline[0] + delimiter + 'S'
        mydoubleworddict[initial] = float(mydoubleworddict.get(initial, 0) + 1)
        print 'Word: %s, Number: %s' % (initial, mydoubleworddict[initial])
        # Count every pair of adjacent words in the line.
        for i in range(0, len(singleline) - 1):
            c = delimiter.join(singleline[i:i + 2])
            mydoubleworddict[c] = float(mydoubleworddict.get(c, 0) + 1)
            print 'Word: %s, Number: %s' % (c, mydoubleworddict[c])
# Normalize raw counts into relative frequencies.
n = sum(mydoubleworddict.itervalues())
for key in mydoubleworddict.iterkeys():
    mydoubleworddict[key] = mydoubleworddict[key] / n
# for m in mydoubleworddict.iteritems():
#     print m
# print n
fid = file('doubleworddictionarycrossvalidation2.dat', 'w')
mypickle.dump(mydoubleworddict, fid)
fid.close()
print 'Finish'
print n
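What the bigram model ultimately needs in the Viterbi step is an estimate of P(w2 | w1). The sketch below shows one common way to form such an estimate from the two pickled dictionaries; it is only illustrative, and the main algorithm module of this series may normalize and smooth differently:

# -*- coding: cp936 -*-
import cPickle as mypickle

mysingleworddict = mypickle.load(file('singleworddictionarycrossvalidation.dat'))
mydoubleworddict = mypickle.load(file('doubleworddictionarycrossvalidation2.dat'))

def bigram_prob(w1, w2, delimiter='|'):
    '''Rough estimate of P(w2 | w1) as the ratio of the stored relative frequencies.
    The two dictionaries are each normalized by their own totals, so this ratio is only
    proportional to count(w1, w2) / count(w1); there is no smoothing for unseen pairs.'''
    p_pair = mydoubleworddict.get(w1 + delimiter + w2, 0.0)
    p_w1 = mysingleworddict.get(w1, 0.0)
    if p_w1 == 0.0:
        return 0.0
    return p_pair / p_w1

print bigram_prob('中国', '人民')   # arbitrary example pair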
Create a first-level trie from the word-pair dictionary
# -*- coding: cp936 -*-
import re
import cPickle as mypickle

# Split a "first|second" key into its two words at the '|' delimiter.
p = re.compile(r'[^|]+')
mydict = mypickle.load(file('doubleworddictionarycrossvalidation.dat'))
mytrie = {}
# First-level trie: word pairs are grouped under their first word.
for key in mydict.iterkeys():
    tmp = p.findall(key)
    if mytrie.get(tmp[0]) is None:
        mytrie[tmp[0]] = {}
for (key, val) in mydict.iteritems():
    tmp = p.findall(key)
    mytrie[tmp[0]][key] = val
    print 'Level 1 key %s level 2 key %s value %s' % (tmp[0], key, val)
fid = file('mydoublewordtriecrosvalidation.dat', 'w')
mypickle.dump(mytrie, fid)
fid.close()
print 'finish'
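The point of the first-word index is that, while building the word graph, every pair beginning with a given word can be fetched with a single dictionary lookup. A small usage sketch (the lookup word is an arbitrary example):

# -*- coding: cp936 -*-
import cPickle as mypickle

mytrie = mypickle.load(file('mydoublewordtriecrosvalidation.dat'))

first = '中国'  # arbitrary example word
# One lookup on the first level returns every "first|second" entry for this word.
for key, val in mytrie.get(first, {}).iteritems():
    print '%s -> %s' % (key, val)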
The next part is the main algorithm module. In the main algorithm module, the data structures we load are the "word" dictionary and the first-level trie dictionary built above.