Chinese and English word segmentation in Python with a few dozen lines of code

Source: Internet
Author: User
Word segmentation is usually thought of as a very advanced technique, but here the author gets it done in just a few dozen lines of code. Python really is powerful (and so is the author). Keep in mind, though, that this is only dictionary-based forward maximum matching; it has no machine learning ability.

Note: Download the Sogou lexicon (word list) before use.
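The loader in the listing unpacks every line of the word-list file into two whitespace-separated fields, which it names word_len and word. The following is a minimal sketch of that layout, assuming a UTF-8 file with one "<length> <word>" entry per line. The sample entries and the helper name build_index are made up for illustration and are not taken from the Sogou file itself.

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical sample entries in "<length> <word>" form
# (an assumption based on how the loader unpacks each line).
SAMPLE = "2 花园\n2 美丽\n4 各种各样\n2 各种\n3 小动物\n"


def build_index(path):
    """Group words by their first character, longest words first."""
    index = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            _length, word = fields
            index[word[0]].append(word)
    for words in index.values():
        words.sort(key=len, reverse=True)
    return dict(index)


# Write the sample to a temporary words.dic and index it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words.dic")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    index = build_index(path)

print(index["各"])  # ['各种各样', '各种']
```

Sorting each list longest-first is what lets the matcher later return the longest word that fits at a given position.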

# -*- coding: utf-8 -*-
# A simple forward-maximum-matching mechanical word segmenter that supports Chinese.
# Nothing else needs explaining; it is only a few dozen lines of code.
# Sogou lexicon download address: http://vdisk.weibo.com/s/7RlE5
import string

__dict = {}


def load_dict(dict_file='words.dic'):
    # Load the lexicon into a dict: the key is a word's first character,
    # the value is the list of words that start with that character.
    __dict.clear()
    with open(dict_file, encoding='utf-8') as f:
        words = [line.split() for line in f if line.strip()]
    for word_len, word in words:
        __dict.setdefault(word[0], []).append(word)
    # Sort each list by word length, longest first.
    for candidates in __dict.values():
        candidates.sort(key=len, reverse=True)


def __match_ascii(i, text):
    # Return the run of consecutive ASCII letters starting at position i.
    result = ''
    for j in range(i, len(text)):
        if text[j] not in string.ascii_letters:
            break
        result += text[j]
    return result


def __match_word(first_char, i, text):
    # Segment at the current position: ASCII letters are read as one run,
    # Chinese characters are looked up in the lexicon.
    if first_char not in __dict:
        if first_char in string.ascii_letters:
            return __match_ascii(i, text)
        return first_char
    for word in __dict[first_char]:
        if text[i:i + len(word)] == word:
            return word
    return first_char


def tokenize(text):
    # Segment text, which must be a Unicode string (str).
    if not text:
        return []
    tokens = []
    i = 0
    while i < len(text):
        first_char = text[i]
        matched_word = __match_word(first_char, i, text)
        tokens.append(matched_word)
        i += len(matched_word)
    return tokens


if __name__ == '__main__':
    def get_test_text():
        from urllib.request import urlopen
        url = "http://news.baidu.com/n?cmd=4&class=rolling&pn=1&from=tab&sub=0"
        return urlopen(url).read().decode('gbk')

    def load_dict_test():
        load_dict()
        for first_char, words in __dict.items():
            print('%s:%s' % (first_char, ' '.join(words)))

    def tokenize_test(text):
        load_dict()
        for token in tokenize(text):
            print(token)

    # "The beautiful garden has all kinds of small animals"
    tokenize_test('美丽的花园里有各种各样的小动物')
    tokenize_test(get_test_text())
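To see what "just the best match" means in practice, here is a small self-contained sketch of forward maximum matching that runs against an in-memory word list instead of the downloaded file. The names index_words and fmm_tokenize and the tiny word list are hypothetical and chosen only for this illustration; the logic mirrors the approach described above but is not the author's exact code.

```python
import string


def index_words(words):
    """Map first character -> words starting with it, longest first."""
    index = {}
    for word in words:
        index.setdefault(word[0], []).append(word)
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return index


def fmm_tokenize(text, index):
    """Greedy left-to-right segmentation: take the longest word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        match = ch  # fallback: emit the single character
        if ch in index:
            for word in index[ch]:
                if text.startswith(word, i):
                    match = word
                    break
        elif ch in string.ascii_letters:
            # Read a run of ASCII letters as one token.
            j = i
            while j < len(text) and text[j] in string.ascii_letters:
                j += 1
            match = text[i:j]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy word list, not the Sogou data.
index = index_words(["研究", "研究生", "生命", "起源"])

print(fmm_tokenize("研究生命起源", index))  # ['研究生', '命', '起源']
print(fmm_tokenize("Python分词", index))    # ['Python', '分', '词']
```

The first result shows the limitation of a purely mechanical matcher: it greedily takes 研究生 ("graduate student") and strands 命, although 研究 / 生命 / 起源 ("study the origin of life") is the intended reading. Resolving that kind of ambiguity needs statistics or context, which this approach does not have.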