I recently needed to compute phonetic similarity between texts. I looked at how hankcs stores pinyin in HanLP using an AC automaton and wanted to port it to Python, so I started digging into the AC automaton.
The AC automaton is built on top of the trie and the KMP string-matching algorithm, so I started with the trie.
For the concept of a trie, this article is very good, and it also covers suffix trees: http://blog.csdn.net/v_july_v/article/details/6897097
All I need to do is match UTF-8 encoded Chinese words to their pinyin. UTF-8 typically encodes one Chinese character as 3 bytes, and each byte is usually written in hexadecimal. Since a byte can take 256 different values, the trie has to be 256-ary: every node may have up to 256 children.
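To make the byte view concrete, here is a small sketch (Python 3) that prints the UTF-8 bytes of two Chinese characters. The helper name utf8_bytes is made up for this illustration and is not part of HanLP or of the trie code in this post.

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of integers in 0..255."""
    return list(text.encode("utf-8"))


for ch in "中国":
    data = utf8_bytes(ch)
    # Each of these characters takes 3 bytes; show them in hexadecimal.
    print(ch, len(data), " ".join("%02X" % b for b in data))
# prints:
# 中 3 E4 B8 AD
# 国 3 E5 9B BD
```

Each byte value is what the trie uses as a key when stepping from a node to one of its children.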
After reading a few implementations and borrowing ideas from them, I wrote my own.
# coding: utf-8


class TrieNode(object):
    def __init__(self):
        self.one_byte = {}    # byte value (0-255) -> child TrieNode
        self.value = None     # payload stored where a word ends (the pinyin)
        self.is_word = False  # True if a word ends at this node


class Trie256(object):
    def __init__(self):
        self.root = TrieNode()

    def getutf8string(self, string):
        # Encode the text as UTF-8 and return its bytes as a bytearray.
        return bytearray(string.encode("utf-8"))

    def insert(self, bytes_array, value):
        node = self.root
        for byte in bytes_array:
            if node.one_byte.get(byte) is None:
                node.one_byte[byte] = TrieNode()
            node = node.one_byte[byte]
        node.is_word = True
        node.value = value

    def find(self, bytes_array):
        node = self.root
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                print("This word is not in this Trie.")
                return None
            node = child
        if not node.is_word:
            print("It is not a word.")
            return None
        return node.value

    def modify(self, bytes_array, value):
        node = self.root
        for byte in bytes_array:
            if node.one_byte.get(byte) is None:
                print("This word is not in this Trie, we will insert it.")
                node.one_byte[byte] = TrieNode()
            node = node.one_byte[byte]
        if not node.is_word:
            print("This word is not a word in this Trie, we will make it a word.")
            node.is_word = True
        else:
            print("Modify this word ...")
        node.value = value

    def delete(self, bytes_array):
        node = self.root
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                print("This word is not in this Trie.")
                return  # stop here so the prefix node reached so far is untouched
            node = child
        if not node.is_word:
            print("It is not a word.")
        else:
            # Unmark the word; its nodes stay in the trie (no pruning).
            node.is_word = False
            node.value = None

    def print_item(self, p, indent=0):
        if p:
            ind = '\t' * indent
            for key in p.one_byte.keys():
                label = "'%s' : " % key
                print(ind + label + '{')
                self.print_item(p.one_byte[key], indent + 1)


if __name__ == "__main__":
    trie = Trie256()
    # Each line of the dictionary has the form word=pinyin.
    with open("dictionary/pinyin.txt", "r", encoding="utf-8") as fd:
        for line in fd:
            line_split = line.split('=')
            if len(line_split) < 2:
                continue  # skip blank or malformed lines
            word = line_split[0]
            pinyin = line_split[1].strip()
            word_bytes = trie.getutf8string(word)
            sentence = ''
            for byte in word_bytes:
                sentence = sentence + 'x' + str(byte)
            print(sentence)
            trie.insert(word_bytes, pinyin)
    trie.print_item(trie.root)
    # Look up one word (一分钟, "one minute") and show its bytes.
    word_bytes = trie.getutf8string("一分钟")
    for byte in word_bytes:
        print(byte)
    print(trie.find(word_bytes))
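The program above needs the pinyin dictionary file to run. As a quick way to try the idea without it, here is a compact, self-contained sketch (Python 3) of the same byte-level trie using a tiny in-memory dictionary. The names Node, insert, find and SAMPLE are my own for this illustration, and the sample pinyin strings and their format are made up for the demo, not taken from the HanLP data.

```python
class Node:
    def __init__(self):
        self.children = {}    # byte value (0-255) -> child Node
        self.value = None     # payload stored where a word ends
        self.is_word = False  # True if a word ends at this node


def insert(root, word, value):
    """Walk the UTF-8 bytes of word, creating nodes as needed."""
    node = root
    for b in word.encode("utf-8"):
        node = node.children.setdefault(b, Node())
    node.is_word = True
    node.value = value


def find(root, word):
    """Return the stored value, or None if word is absent or only a prefix."""
    node = root
    for b in word.encode("utf-8"):
        node = node.children.get(b)
        if node is None:
            return None
    return node.value if node.is_word else None


# Hypothetical sample entries, for illustration only.
SAMPLE = {"一": "yi1", "一分钟": "yi1,fen1,zhong1", "中国": "zhong1,guo2"}

root = Node()
for w, p in SAMPLE.items():
    insert(root, w, p)

print(find(root, "一分钟"))  # the value stored for the full word
print(find(root, "一分"))    # None: a path exists, but no word ends there
print(find(root, "二"))      # None: not in the trie at all
```

Note that 一 and 中 both start with the byte 0xE4 in UTF-8, so all three sample words share the same first node under the root. That sharing of prefixes is the point of using a trie here.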