This article introduces three commonly used Chinese word segmentation tools for Python. It may be a useful reference for anyone who needs one.
The three tools are described in turn below.
1. jieba segmentation:
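# -*- coding: utf-8 -*-
import codecs

import jieba

# Sample sentence; in practice this is Chinese text.
sentence = 'Deng Chao, 1979 born in Nanchang, China, Chinese actor, film director, investment producer, Internet investors.'

# jieba.cut returns a generator; convert it to a list so it can be reused.
seg_list = list(jieba.cut(sentence))
print("/".join(seg_list))

# Write the tokens to a UTF-8 file, separated by spaces.
with codecs.open("D2w_ltp.txt", "w", encoding="utf-8") as f1:
    for i in seg_list:
        f1.write(i)
        f1.write(" ")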
Output:
Deng Chao/,/1979//born/Jiangxi/Nanchang/,/China/mainland/actor/,/Film/director/,/investment/producer///Internet/investor/.
The code above covers both segmenting with jieba and writing the result to a file.
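One detail worth illustrating: jieba.cut returns a generator, so the tokens can be iterated only once. Joining the result for printing and then looping over it again to write the file leaves nothing for the second pass. The sketch below shows the effect with a stand-in generator; fake_cut is a hypothetical helper used only for illustration and is not part of jieba.

```python
def fake_cut(text):
    # Hypothetical stand-in for jieba.cut: yields tokens lazily, one at a time.
    for token in text.split():
        yield token


tokens = fake_cut("Deng Chao was born in Nanchang")
first_pass = "/".join(tokens)   # consumes the generator
second_pass = list(tokens)      # already exhausted, so this is empty

# Materialize the tokens once with list() so they can be reused.
tokens = list(fake_cut("Deng Chao was born in Nanchang"))
printed = "/".join(tokens)
written = [t for t in tokens]   # still available for a second loop
```

jieba also provides jieba.lcut, which returns a list directly and avoids the issue.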
Note that jieba returns its tokens as Unicode strings, so they need to be encoded as UTF-8 when they are written to a file.
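The conversion from Unicode to UTF-8 can be done either explicitly with encode() or by opening the output file with an encoding. A minimal sketch of both, using only the standard library; the sample tokens and the file name segmented.txt are assumptions made for illustration.

```python
import io
import os
import tempfile

tokens = [u"邓超", u"，", u"1979年"]  # sample tokens (illustrative)

# Option 1: explicit conversion, Unicode text -> UTF-8 bytes.
encoded = [t.encode("utf-8") for t in tokens]

# Option 2: let the file object encode on write.
path = os.path.join(tempfile.mkdtemp(), "segmented.txt")  # hypothetical file name
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u" ".join(tokens))

# Read the file back to confirm the text survives the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read().split(u" ")
```

Opening the file with an explicit encoding keeps the write loop free of per-token encode() calls.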
2. NLPIR by Zhang Huaping
(https://github.com/NLPIR-team/NLPIR)
This is Zhang Huaping's GitHub repository. If you need the tool, go to the repository and obtain a license.
Licenses come in two kinds: 10-day and one-month.
I have also uploaded the detailed code and installation package to CSDN for anyone interested (the license there may need to be updated).
It is worth mentioning that most papers published in China use this segmenter, so it is considered relatively authoritative.
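import codecs

# seg() is the NLPIR segmentation function (its import is not shown here).
# It is assumed to yield (word, tag) pairs, so i[0] is the word.
sentence = 'Deng Chao, born in 1979 in Nanchang, Jiangxi Province, Chinese actor, film director, investment producer, internet investor.'

# Segment once and keep only the words.
list_senten = [i[0] for i in seg(sentence)]
print("/".join(list_senten))

# Write the words to a UTF-8 file, separated by spaces.
with codecs.open("D2w_ltp.txt", "w", encoding="utf-8") as f1:
    for word in list_senten:
        f1.write(word)
        f1.write(" ")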
Output:
Deng Chao/,/1979/born/In/Jiangxi/Nanchang/,/China/mainland/male/actor/,/Film/director/,/investment/production/people/,/Internet/investor/.
NLPIR also performs well at part-of-speech tagging, which marks named entities such as person names (nr) and place names (ns):
Deng Chao/nr  ,/wd  1979 year/t  born/vi  in/p  Jiangxi/ns  Nanchang/ns  ,/wd  China/ns  inland/s  male/b  actor/n  ,/wn  film/n  director/n  ,/wn  investment/n  produced/vi  people/n  ,/wn  Internet/n  investor/n  ./wj
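Tagged output like this can be filtered by tag to pull out the named entities. The sketch below assumes the result is available as space-separated word/tag items; that layout, the sample string, and the helper name parse_tagged are assumptions for illustration, not NLPIR's API.

```python
def parse_tagged(line):
    """Split space-separated 'word/tag' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        word, _, tag = item.rpartition("/")  # split on the last slash
        pairs.append((word, tag))
    return pairs


# Illustrative input in an assumed word/tag layout.
tagged = u"邓超/nr 出生/vi 于/p 江西/ns 南昌/ns"
pairs = parse_tagged(tagged)

# Keep only person names (nr) and place names (ns).
entities = [(w, t) for w, t in pairs if t in ("nr", "ns")]
```

Splitting on the last slash keeps a token intact even if the word itself contains a slash.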
3. LTP from Harbin Institute of Technology
# -*- coding: utf-8 -*-
import codecs

from pyltp import Segmentor


def segmentor(sentence):
    """Segment a sentence with LTP and return the words as a list."""
    model = Segmentor()                # initialize instance
    model.load('Ltp_data/cws.model')   # load model
    words = model.segment(sentence)    # segment
    words_list = list(words)
    model.release()                    # release model
    return words_list


sentence = 'Deng Chao, born in 1979 in Nanchang, Jiangxi Province, Chinese actor, film director, investment producer, internet investor.'

# Segment once so the model is loaded only one time.
words = segmentor(sentence)
print('/'.join(words))

# Write the words to a UTF-8 file, separated by spaces.
with codecs.open("D2w_ltp.txt", "w", encoding="utf-8") as f1:
    for i in words:
        f1.write(i)
        f1.write(" ")
Output:
Deng//,/1979 Year/born/In/Jiangxi/Nanchang/,/China/mainland/male/actor/,/Film/director/,/investment/producer/,/Internet/investor/.
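The three outputs above differ in a few places, for example in how "producer" is split. One rough way to locate such differences is to convert each token list into character offsets and compare the resulting spans. This is only a sketch: the helper to_spans and the two sample token lists are illustrative assumptions, not actual output from any of the tools.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


# Illustrative token lists for the same text (not actual tool output).
seg_a = [u"投资", u"出品", u"人"]
seg_b = [u"投资", u"出品人"]

shared = to_spans(seg_a) & to_spans(seg_b)   # spans both segmentations agree on
only_a = to_spans(seg_a) - to_spans(seg_b)   # spans found only in seg_a
only_b = to_spans(seg_b) - to_spans(seg_a)   # spans found only in seg_b
```

Comparing spans rather than token strings keeps repeated words at different positions from being confused with each other.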