While processing microblog text recently, I tried three word segmentation tools: jieba, NLPIR, and LTP.
Here is a quick look at each of them.
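1. jieba word segmentation:
# -*- coding: utf-8 -*-
# Python 2 code (print statement, explicit UTF-8 encoding)
import os
import codecs
import jieba

# jieba.cut returns a generator; turn it into a list so the tokens
# can be both printed and written to the file
seg_list = list(jieba.cut('Deng Chao, born in Nanchang, Jiangxi, 1979 Mainland Chinese actor, film director, investment producer, internet investor.'))
f1 = codecs.open("d2w_ltp.txt", "w")
print "/".join(seg_list)
for i in seg_list:
    f1.write(i.encode("utf-8"))  # tokens are unicode; encode to UTF-8 before writing
    f1.write(str(" "))  # separate tokens with a space
f1.close()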
Output:
Deng Chao/,/1979//born/Jiangxi/Nanchang/,/China/mainland/actor/, Film/director/,/investment/producer/,/Internet/investor/.
The code above segments the sentence with jieba and writes the result to a file.
Note that jieba returns its tokens as Unicode strings, so they have to be encoded as UTF-8 (Unicode -> UTF-8) before being written out.
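The Unicode-to-UTF-8 step can be tried without jieba installed. The following is only a minimal Python 3 sketch, not the code from this post: `write_tokens` and the sample token list are hypothetical names introduced for illustration, and the tokens merely stand in for what a segmenter would return.

```python
import io
import os
import tempfile


def write_tokens(tokens, path, sep=" "):
    """Write tokens to `path` as UTF-8, separated by `sep`."""
    # io.open encodes each Unicode string to UTF-8 on write,
    # so no explicit .encode("utf-8") call is needed here.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(sep.join(tokens))


# Hypothetical tokens standing in for a segmenter's output.
tokens = ["邓超", ",", "1979", "年", "出生"]
path = os.path.join(tempfile.mkdtemp(), "tokens.txt")
write_tokens(tokens, path)

# Read the raw bytes back and decode them to confirm the file is UTF-8.
with open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-8")
print(decoded)
```

Passing the encoding when the file is opened keeps the conversion in one place, instead of calling `.encode("utf-8")` on every token.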
2. NLPIR by Zhang Huaping (https://github.com/NLPIR-team/NLPIR)
The link above is Zhang Huaping's GitHub repository; anyone who needs the tool can get a license there. Licenses come in two kinds: 10 days and one month.
I have also uploaded the detailed code and the installation package to CSDN for anyone who is interested (or who needs to renew the license).
It is worth mentioning that most papers published in China use this segmentation tool, so it is regarded as fairly authoritative.
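import codecs
# seg() comes from the NLPIR Python wrapper; its import is not shown in the original post

r = open('text_no_seg.txt', 'r')  # input file handle (not used below)
list_senten = []
sentence = 'Deng Chao, 1979 born in Nanchang, Jiangxi province, mainland China actor, film director, investment producer, internet investor.'
# seg() yields (word, tag) pairs; keep only the word
for i in seg(sentence):
    list_senten.append(i[0])
print "/".join(list_senten)
f1 = codecs.open("d2w_ltp.txt", "w")
for i in seg(sentence):
    f1.write(i[0])
    f1.write(str(" "))  # separate tokens with a space
f1.close()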
Output:
Deng Chao/,/1979/born/In/Jiangxi/Nanchang/,/China/mainland/male/actress//Film/director/,/investment/production/person/,/Internet/investor/.
NLPIR also does well at named entity recognition, as its tagged output for the same sentence shows:
Deng Chao nr
, wd
1979 t
born vi
in p
Jiangxi ns
Nanchang ns
, wd
China ns
mainland s
male b
actor n
, wn
film n
director n
, wn
investment n
produced vi
people n
, wn
internet n
investor n
. wj
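The tagged output above can be filtered by tag to pull out the named entities. This is a minimal sketch under two assumptions: that the segmenter yields (word, tag) pairs, and that tags beginning with nr mark person names and tags beginning with ns mark place names, as in the ICTCLAS-style tag set. `extract_entities` and the sample list are hypothetical and written for Python 3.

```python
# (word, tag) pairs taken from the tagged output above, tags in lowercase.
tagged = [
    ("Deng Chao", "nr"), (",", "wd"), ("1979", "t"), ("born", "vi"),
    ("in", "p"), ("Jiangxi", "ns"), ("Nanchang", "ns"), (",", "wd"),
    ("China", "ns"), ("mainland", "s"), ("male", "b"), ("actor", "n"),
]


def extract_entities(pairs, prefixes=("nr", "ns")):
    """Keep the pairs whose tag starts with one of the given prefixes."""
    # str.startswith accepts a tuple, so one call checks every prefix.
    return [(word, tag) for word, tag in pairs if tag.startswith(prefixes)]


entities = extract_entities(tagged)
for word, tag in entities:
    print(word, tag)
```

Matching on a prefix rather than the exact tag also catches finer-grained subtags, if the tag set in use has them.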
3. LTP
# -*- coding: utf-8 -*-
import os
import codecs
from pyltp import Segmentor

# Word segmentation
def segmentor(sentence):
    segmentor = Segmentor()  # initialize instance
    segmentor.load('ltp_data/cws.model')  # load model
    words = segmentor.segment(sentence)  # segment the sentence
    words_list = list(words)
    segmentor.release()  # release model
    return words_list

f1 = codecs.open("d2w_ltp.txt", "w")
sentence = "Deng Chao, born in Nanchang, Jiangxi Province in 1979, China's mainland actor, film director, investment producer, Internet investor."
words = segmentor(sentence)  # segment once and reuse, so the model is loaded only once
print '/'.join(words)
for i in words:
    f1.write(i)
    f1.write(str(" "))  # separate tokens with a space
f1.close()
Output:
Deng//,/1979 Year/born/In/Jiangxi/Nanchang/,/China/mainland/male/actress/,/Film/director/,/investment/producer/,/Internet/investor/.
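The "/"-joined outputs of the tools can also be compared token by token to see where they disagree. The sketch below is only an illustration in Python 3 using the standard difflib module; `split_output` and `differing_spans` are hypothetical helpers, and the two lines are fragments adapted from the jieba and NLPIR outputs shown earlier.

```python
import difflib


def split_output(line):
    """Turn a '/'-joined segmentation line into a list of tokens."""
    # Empty strings (from a doubled '/') are dropped.
    return [tok for tok in line.split("/") if tok]


def differing_spans(a, b):
    """Return (tokens_a, tokens_b) pairs where two segmentations disagree."""
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Fragments adapted from the jieba and NLPIR outputs shown earlier.
jieba_line = "film/director/,/investment/producer/,/Internet/investor/."
nlpir_line = "film/director/,/investment/production/person/,/Internet/investor/."

diffs = differing_spans(split_output(jieba_line), split_output(nlpir_line))
for left, right in diffs:
    print(left, "vs", right)
```

On these fragments the only disagreement is that one tool keeps "producer" as a single token while the other splits it in two.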