AC automaton, part 1: a trie for UTF-8-encoded text

Source: Internet
Author: User
Tags: readline

Recently I needed to compute the phonetic similarity of texts. I saw that hankcs stores pinyin in HanLP with an AC automaton, and I wanted to port that to Python. So I started working through the AC automaton.

The AC automaton is built on the trie and the KMP string-matching algorithm. I will start with the trie.

For the concept of the trie, this article is very good, and it also covers the suffix tree: http://blog.csdn.net/v_july_v/article/details/6897097

What I need to do is map UTF-8-encoded Chinese words to their pinyin. UTF-8 encodes a common Chinese character as 3 bytes, and each byte is usually written as two hexadecimal digits. A byte can take 256 different values, so the trie has to be 256-way: every node may have up to 256 children.
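To make the byte view concrete, here is a minimal Python 3 sketch that prints the UTF-8 bytes of a character in hexadecimal. The helper name `utf8_hex` is an assumption made for this illustration and is not part of the program below.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as a list of two-digit hex strings."""
    return [format(b, "02x") for b in text.encode("utf-8")]


# A common Chinese character takes three bytes, an ASCII letter takes one.
print(utf8_hex("中"))
print(utf8_hex("a"))

# Every byte is an integer in the range 0-255, hence the 256-way trie.
print(all(0 <= b <= 255 for b in "中".encode("utf-8")))
```

Because each step down the trie consumes one byte, a word of n Chinese characters usually ends about 3n levels below the root.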

After reading several existing implementations and borrowing ideas from them, I wrote my own.

# coding:utf-8


class TrieNode(object):
    def __init__(self):
        self.one_byte = {}    # child nodes, keyed by byte value (0-255)
        self.value = None     # payload stored where a word ends (the pinyin)
        self.is_word = False  # True if a word ends at this node


class Trie256(object):
    def __init__(self):
        self.root = TrieNode()

    def getutf8string(self, string):
        # Encode the text as UTF-8; iterating a bytearray yields integers.
        bytes_array = bytearray(string.encode("utf-8"))
        return bytes_array

    def insert(self, bytes_array, value):
        node = self.root
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                node.one_byte[byte] = TrieNode()
            node = node.one_byte[byte]
        node.is_word = True
        node.value = value

    def find(self, bytes_array):
        node = self.root
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                print("This word is not in this Trie.")
                return None
            node = child
        if not node.is_word:
            print("It is not a word.")
            return None
        return node.value

    def modify(self, bytes_array, value):
        node = self.root
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                print("This word is not in this Trie, we will insert it.")
                node.one_byte[byte] = TrieNode()
            node = node.one_byte[byte]
        if not node.is_word:
            print("This word is not a word in this Trie, we will make it a word.")
            node.is_word = True
        else:
            print("Modify this word ...")
        node.value = value

    def delete(self, bytes_array):
        node = self.root
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                print("This word is not in this Trie.")
                return
            node = child
        if not node.is_word:
            print("It is not a word.")
        else:
            # Only the end-of-word mark and the value are cleared;
            # the nodes along the path stay in the trie.
            node.is_word = False
            node.value = None

    def print_item(self, p, indent=0):
        if p:
            ind = "\t" * indent
            for key in p.one_byte.keys():
                label = "'%s' : " % key
                print(ind + label + "{")
                self.print_item(p.one_byte[key], indent + 1)
                # print(ind + " " * len(label) + "}")


if __name__ == "__main__":
    trie = Trie256()
    # Each line of the dictionary has the form word=pinyin.
    with open("dictionary/pinyin.txt", "r", encoding="utf-8") as fd:
        for line in fd:
            if "=" not in line:
                continue
            line_split = line.split("=")
            word = line_split[0]
            pinyin = line_split[1].strip()
            data = trie.getutf8string(word)
            sentence = ""
            for byte in data:
                sentence = sentence + "x" + str(byte)
            print(sentence)
            trie.insert(data, pinyin)
    trie.print_item(trie.root)
    data = trie.getutf8string("一分钟")
    for byte in data:
        print(byte)
    print(trie.find(data))
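The program above needs a dictionary file to run. As a complementary sketch, the following self-contained Python 3 example builds a small byte-keyed trie in memory and shows one possible way to prune nodes that are no longer needed after a deletion. The names `Node`, `ByteTrie` and `count_nodes`, and the two sample entries with their pinyin strings, are assumptions made for this illustration. They are not taken from the program above or from HanLP's dictionary.

```python
class Node:
    def __init__(self):
        self.children = {}    # byte value (0-255) -> Node
        self.value = None     # payload stored where a word ends
        self.is_word = False  # True if a word ends at this node


class ByteTrie:
    """Hypothetical minimal trie keyed by the UTF-8 bytes of each word."""

    def __init__(self):
        self.root = Node()

    def insert(self, word, value):
        node = self.root
        for b in word.encode("utf-8"):
            node = node.children.setdefault(b, Node())
        node.is_word = True
        node.value = value

    def find(self, word):
        node = self.root
        for b in word.encode("utf-8"):
            node = node.children.get(b)
            if node is None:
                return None
        return node.value if node.is_word else None

    def delete(self, word):
        """Remove word; return True if it was stored, False otherwise."""
        path = []  # (parent, byte) pairs from the root down to the word
        node = self.root
        for b in word.encode("utf-8"):
            child = node.children.get(b)
            if child is None:
                return False
            path.append((node, b))
            node = child
        if not node.is_word:
            return False
        node.is_word = False
        node.value = None
        # Walk back up and drop nodes that have no children and end no word.
        for parent, b in reversed(path):
            child = parent.children[b]
            if child.children or child.is_word:
                break
            del parent.children[b]
        return True


def count_nodes(node):
    """Count the nodes in the subtree rooted at node, including node."""
    return 1 + sum(count_nodes(c) for c in node.children.values())


# Sample entries; the pinyin strings are illustrative values only.
trie = ByteTrie()
trie.insert("中国", "zhong1guo2")
trie.insert("中文", "zhong1wen2")
print(trie.find("中国"), count_nodes(trie.root))
trie.delete("中文")
print(trie.find("中文"), count_nodes(trie.root))
```

The two words share the three bytes of their first character, so they share the first three levels of the trie. Deleting one of them only removes the nodes that belong to that word alone.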

