Three common Python tools for Chinese word segmentation

Source: Internet
Author: User
This article covers three commonly used Chinese word segmentation tools for Python. It may be a useful reference for readers who need one.

Here is a quick look at each of the three.

1. jieba segmentation:

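    # -*- coding: utf-8 -*-
    import jieba

    # The input was originally a Chinese sentence; it appears here in English translation.
    sentence = 'Deng Chao, 1979 born in Nanchang, China, Chinese actor, film director, investment producer, Internet investors.'

    # jieba.cut returns a generator, so materialize it before using it twice.
    seg_list = list(jieba.cut(sentence))
    print("/".join(seg_list))

    # Write the tokens to a file as UTF-8.
    with open("D2w_ltp.txt", "w", encoding="utf-8") as f1:
        for word in seg_list:
            f1.write(word)
            f1.write(" ")  # separator assumed; the original string literal was lost

Output: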

Deng Chao/,/1979//born/Jiangxi/Nanchang/,/China/mainland/actor/,/Film/director/,/investment/producer///Internet/investor/.

The code above covers both segmenting with jieba and writing the result to a file.

Note that jieba returns its tokens as Unicode strings, so they have to be encoded as UTF-8 when written to a file.
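
The Unicode-to-UTF-8 step can be sketched with the standard library alone. The helper names `write_tokens` and `read_tokens`, the space separator, and the sample tokens are assumptions made for illustration; none of them are part of jieba.

```python
import os
import tempfile


def write_tokens(tokens, path, sep=" "):
    """Write Unicode tokens to `path` as UTF-8, separated by `sep`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(sep.join(tokens))


def read_tokens(path, sep=" "):
    """Read a UTF-8 file written by write_tokens back into a list."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read().split(sep)


# Hypothetical tokens standing in for a segmenter's output.
tokens = ["邓超", "出生", "于", "江西", "南昌"]

# Encoding by hand shows what "Unicode to UTF-8" means: text becomes bytes.
encoded = tokens[0].encode("utf-8")
assert encoded.decode("utf-8") == tokens[0]

path = os.path.join(tempfile.mkdtemp(), "tokens.txt")
write_tokens(tokens, path)
restored = read_tokens(path)
```

Passing `encoding="utf-8"` to `open` lets the file object do the conversion, so the tokens never need to be encoded one by one.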


2. NLPIR, by Zhang Huaping

(https://github.com/NLPIR-team/NLPIR)


This is the GitHub address of Zhang Huaping's project. To use NLPIR, go to that repository and obtain a license.

Two kinds of license are available: 10-day and one-month.

The detailed code and installation package have also been uploaded to CSDN for anyone interested (the license may need to be updated).

It is worth mentioning that most papers published in China use this segmenter, so it is considered relatively authoritative.

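    # -*- coding: utf-8 -*-
    # seg() is provided by the NLPIR Python wrapper; its import is not shown in the original.
    # The input was originally a Chinese sentence; it appears here in English translation.
    sentence = 'Deng Chao, born in 1979 in Nanchang, Jiangxi Province, Chinese actor, film director, investment producer, internet investor.'

    # Each item returned by seg() is indexable, with the word at position 0.
    list_senten = [i[0] for i in seg(sentence)]
    print("/".join(list_senten))

    # Write the tokens to a file as UTF-8.
    with open("D2w_ltp.txt", "w", encoding="utf-8") as f1:
        for word in list_senten:
            f1.write(word)
            f1.write(" ")  # separator assumed; the original string literal was lost

Output: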

Deng Chao/,/1979/born/In/Jiangxi/Nanchang/,/China/mainland/male/actor/,/Film/director/,/investment/production/people/,/Internet/investor/.

NLPIR also does well at named entity recognition. Its tagged output looks like this:

Deng Chao nr , wd 1979 year t born vi in p Jiangxi ns Nanchang ns , wd China ns inland s male b actor n , wn film n director n , wn investment n produced vi people n , wn Internet n investor n . wj
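
Given tagged output like the line above, picking out entities comes down to filtering on the tag. The sketch below assumes the tagger yields (word, tag) pairs and that `nr` marks person names and `ns` place names, as in the tags shown above. `extract_entities` and the sample pairs are hypothetical, and the code does not call NLPIR.

```python
def extract_entities(pairs, wanted=("nr", "ns")):
    """Return the (word, tag) pairs whose tag is in `wanted`."""
    return [(word, tag) for word, tag in pairs if tag in wanted]


# Hand-written sample in the same shape as the tagged output; not produced by NLPIR.
tagged = [
    ("邓超", "nr"),
    ("出生", "vi"),
    ("于", "p"),
    ("江西", "ns"),
    ("南昌", "ns"),
]

entities = extract_entities(tagged)
for word, tag in entities:
    print(word, tag)
```

Changing `wanted` selects other categories, for example `("t",)` for time expressions.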


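3. LTP, from Harbin Institute of Technology

    # -*- coding: utf-8 -*-
    from pyltp import Segmentor


    def segmentor(sentence):
        """Segment a sentence with the LTP word segmentation model."""
        segmentor = Segmentor()  # initialize instance
        segmentor.load('ltp_data/cws.model')  # load model
        words = segmentor.segment(sentence)  # segment
        words_list = list(words)
        segmentor.release()  # release model
        return words_list


    # The input was originally a Chinese sentence; it appears here in English translation.
    sentence = 'Deng Chao, born in 1979 in Nanchang, Jiangxi Province, Chinese actor, film director, investment producer, internet investor.'

    # Segment once and reuse the result, so the model is loaded only once.
    words = segmentor(sentence)
    print('/'.join(words))

    # Write the tokens to a file as UTF-8.
    with open("D2w_ltp.txt", "w", encoding="utf-8") as f1:
        for word in words:
            f1.write(word)
            f1.write(" ")  # separator assumed; the original string literal was lost

Output: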

Deng//,/1979 Year/born/In/Jiangxi/Nanchang/,/China/mainland/male/actor/,/Film/director/,/investment/producer/,/Internet/investor/.
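
The three outputs above do not split the sentence in exactly the same places. One rough way to compare two segmentations of the same text is to compare the character offsets at which their tokens end. This is only a sketch: `boundaries` and `compare` are hypothetical helpers, and the two token lists are written by hand rather than taken from any of the tools.

```python
def boundaries(tokens):
    """Return the set of character offsets at which each token ends."""
    offsets = set()
    position = 0
    for token in tokens:
        position += len(token)
        offsets.add(position)
    return offsets


def compare(tokens_a, tokens_b):
    """Return (shared, only_a, only_b) boundary sets for two segmentations."""
    a, b = boundaries(tokens_a), boundaries(tokens_b)
    return a & b, a - b, b - a


# Two hand-written segmentations of "1979年出生于江西南昌"; not real tool output.
seg_a = ["1979", "年", "出生", "于", "江西", "南昌"]
seg_b = ["1979年", "出生", "于", "江西", "南昌"]

shared, only_a, only_b = compare(seg_a, seg_b)
print(sorted(shared), sorted(only_a), sorted(only_b))
```

Boundaries that appear in only one set show where the two segmentations disagree. Here the only difference is whether the year and its suffix form one token.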
