
Chinese and English word segmentation in Python with a few dozen lines of code

Last Update: 2016-10-18 Source: Internet

Author: User


When people hear "word segmentation", they usually assume it is a very advanced technology, but today the author gets it done in just a few dozen lines of code. Python really is powerful! (So is the author.) Note, however, that this is only forward maximum matching against a lexicon; it has no machine learning capability.

Note: Download the Sogou lexicon before use.
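The loader in this article reads `words.dic` and unpacks every line into two whitespace-separated fields, a word length and the word itself. The following is a minimal sketch of that layout and of the first-character index built from it. The sample entries, the UTF-8 assumption for the temporary file, and the helper name `build_index` are made up for illustration; the real Sogou file may differ in detail.

```python
import os
import tempfile


def build_index(path):
    """Map each first character to the words that start with it, longest first."""
    index = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.split()
            if len(fields) != 2:  # skip blank or malformed lines
                continue
            _length, word = fields
            index.setdefault(word[0], []).append(word)
    for words in index.values():
        words.sort(key=len, reverse=True)
    return index


# Hypothetical sample entries; assumed layout is "<length> <word>" per line.
sample = "2 花园\n2 美丽\n2 小心\n3 小动物\n"
with tempfile.NamedTemporaryFile('w', suffix='.dic', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)

index = build_index(tmp.name)
os.remove(tmp.name)
print(index['小'])  # ['小动物', '小心']
```

Sorting each list longest-first is what lets the matcher return the first hit and still get the longest word.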

# -*- coding: utf-8 -*-
# A simple mechanical word segmenter with Chinese support, based on forward
# maximum matching. Nothing else needs explaining; it is only a few dozen lines.
# Sogou lexicon download address: http://vdisk.weibo.com/s/7RlE5
import string

__dict = {}


def load_dict(dict_file='words.dic'):
    # Load the lexicon into a dict: the key is a word's first character,
    # the value is the list of words that start with that character.
    with open(dict_file, encoding='utf-8') as f:
        words = [line.split() for line in f]
    for word_len, word in words:
        first_char = word[0]
        __dict.setdefault(first_char, [])
        __dict[first_char].append(word)
    # Sort each list by word length, longest first.
    for first_char, words in __dict.items():
        __dict[first_char] = sorted(words, key=len, reverse=True)


def __match_ascii(i, input):
    # Return the run of consecutive ASCII letters starting at position i.
    result = ''
    for i in range(i, len(input)):
        if input[i] not in string.ascii_letters:
            break
        result += input[i]
    return result


def __match_word(first_char, i, input):
    # Segment at the current position: ASCII letters are read as a run,
    # Chinese characters are looked up in the lexicon.
    if first_char not in __dict:
        if first_char in string.ascii_letters:
            return __match_ascii(i, input)
        return first_char
    words = __dict[first_char]
    for word in words:
        if input[i:i + len(word)] == word:
            return word
    return first_char


def tokenize(input):
    # Segment input, which must be a Unicode string (str).
    if not input:
        return []
    tokens = []
    i = 0
    while i < len(input):
        first_char = input[i]
        matched_word = __match_word(first_char, i, input)
        tokens.append(matched_word)
        i += len(matched_word)
    return tokens


if __name__ == '__main__':
    def get_test_text():
        from urllib.request import urlopen
        url = "http://news.baidu.com/n?cmd=4&class=rolling&pn=1&from=tab&sub=0"
        text = urlopen(url).read()
        return text.decode('gbk')

    def load_dict_test():
        load_dict()
        for first_char, words in __dict.items():
            print('%s:%s' % (first_char, ' '.join(words)))

    def tokenize_test(text):
        load_dict()
        tokens = tokenize(text)
        for token in tokens:
            print(token)

    tokenize_test('beautiful garden with all kinds of small animals')
    tokenize_test(get_test_text())
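To see the matching behaviour without downloading the lexicon, here is a minimal, self-contained sketch of the same forward maximum matching idea on a mixed Chinese and English string. The in-memory `LEXICON`, its entries, and the function name `segment` are hypothetical and used only for illustration.

```python
import string

# Hypothetical in-memory lexicon: keyed by first character, longest word first.
LEXICON = {
    '美': ['美丽'],
    '花': ['花园里', '花园'],
    '小': ['小动物'],
}


def segment(text, lexicon):
    """Forward maximum matching: at each position take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in lexicon:
            # Lists are sorted longest-first, so the first hit is the longest.
            word = next((w for w in lexicon[ch] if text.startswith(w, i)), ch)
        elif ch in string.ascii_letters:
            # Read a run of consecutive ASCII letters as one token.
            j = i
            while j < len(text) and text[j] in string.ascii_letters:
                j += 1
            word = text[i:j]
        else:
            # Unknown character: emit it as a single-character token.
            word = ch
        tokens.append(word)
        i += len(word)
    return tokens


result = segment('美丽的花园里有Python小动物', LEXICON)
print(result)  # ['美丽', '的', '花园里', '有', 'Python', '小动物']
```

Because `花园里` is listed before `花园`, the longer word wins. Digits and punctuation come out as single-character tokens, since only `string.ascii_letters` is treated as a run, as in the listing above.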



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
