Implementation of Chinese Information Processing in Python (1)

Last Update:2014-03-22 Source: Internet

Author: User


1. Install and test Chinese word segmentation tools in Python

See the earlier post "A simple test of four Python Chinese word segmentation systems".

Judging from the evaluation results there, two good Chinese word segmentation tools that can be used from Python are jieba and the Chinese Academy of Sciences word segmentation system (NLPIR/ICTCLAS).

This post tests both tools.

1. Install the jieba Chinese word segmentation tool

Install the latest jieba Chinese word segmentation tool on 32-bit Windows 7 with Python 2.7.

Procedure:

(1) Download link: https://github.com/fxsjy/jieba (the page also carries the installation instructions).

(2) Download the package and unpack it to a directory such as C:/jieba-master.

(3) Change to that directory and run the command python setup.py install to complete the installation.

(4) Test it and compare the result with the Java version of NLPIR/ICTCLAS2013:

# coding=utf-8
'''
Created on 2014-3-19
Test the jieba Chinese word segmentation tool
@author: liTC
'''
import io
import time
import jieba
import jieba.posseg as pseg
t1 = time.time()
# f = open("t_with_splitter.txt", "r")  # read the text from a file
# string = f.read().decode("utf-8")
# The original post used the Chinese test sentence; it is shown here in translation.
string = u'Ancestral home Wenzhou, Zhejiang; born in Wenzhou, Zhejiang on February 28, 1975; singer. In September 1987, admitted to the Qing County Xiao Baihua Yue Opera Troupe in Wenzhou, Zhejiang, singing xiaosheng roles in the troupe.'
words = pseg.cut(string)  # segment the text and tag parts of speech
result = u""  # accumulates the final result
for w in words:
    result += w.word + u"/" + w.flag  # append each word with its part-of-speech tag
# print(result)
with io.open("t_with_pos_tag.txt", "w", encoding="utf-8") as f:  # save the result to another file
    f.write(result)
t2 = time.time()
print("Word segmentation and part-of-speech tagging completed, time: " + str(t2 - t1) + " seconds.")  # report the elapsed time
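The script above appends each word/tag item directly to the previous one, so the items run together in the saved file. A minimal sketch of one way to keep them apart is shown here. The helper name format_tagged and the sample pairs are hypothetical, written by hand for illustration and not taken from either tool; with jieba, the pairs could be built from the word and flag attributes the script already uses.

```python
# -*- coding: utf-8 -*-

def format_tagged(pairs, sep=" "):
    """Join (word, tag) pairs as 'word/tag' items separated by sep.

    format_tagged is a hypothetical helper, not part of jieba or NLPIR.
    """
    return sep.join("%s/%s" % (word, tag) for word, tag in pairs)


# Hand-written sample pairs (an assumption for illustration only).
sample = [("浙江", "ns"), ("温州", "ns"), ("歌手", "n")]
line = format_tagged(sample)
print(line)
```

With jieba installed, something like format_tagged((w.word, w.flag) for w in pseg.cut(string)) would produce one space-separated line per input text.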

Test 1:

The test sentence is: "Ancestral home Wenzhou, Zhejiang; born in Wenzhou, Zhejiang on February 28, 1975; singer. In September 1987, admitted to the Qing County Xiao Baihua Yue Opera Troupe in Wenzhou, Zhejiang, singing xiaosheng roles in the troupe."

The result of NLPIR/ICTCLAS2013 is:

Ancestral Home/n Zhejiang/ns Wenzhou/ns,/wd 1975/t February/t 28/t born/vi at/p Zhejiang/ns Wenzhou/ns, /wd artist/n. /Wj in 1987/t test/v Zhejiang/ns Wenzhou/ns Qing/a county/n small/a Baihua/n Yue Opera Troupe/n, /wd in/p Group/n singing/v Xiaosheng/n _

The Chinese word segmentation result of jieba is:

Ancestral Home/n Zhejiang/ns Wenzhou/ns,/x1975/m/m2/m/m28/m Sunrise/v born/v Zhejiang/ns Wenzhou/ns, /x artist/n. /X1987/m/v Zhejiang/ns Wenzhou/ns Qing County/ns small/n Baihua/n Yue Opera Troupe/nt, /x in/p Group/n singing/v Xiaosheng/n

Word segmentation and part-of-speech tagging completed, time: 1.96300005913 seconds.

Test 2:

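The test sentence is: "Every month, when she passes through the subordinate departments, the female officer of the MIIT office personally gives instructions on the installation of technical devices such as 24-port switches." (In Chinese, the characters for "office" and "female" sit next to each other and can be mis-segmented as the word "virgin", which is what makes this sentence a hard case.)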

The result of NLPIR/ICTCLAS2013 is:

Work/n letter/n Virgin/n Officer/n monthly/r pass/p subordinate/v Department/n all/d to/v user/d account/v 24/m Port /q switch/n, etc./udeng technical/n device/n/ude1 installation/vn work/vn _

The Chinese word segmentation result of jieba is:

MIIT/n female officer/n monthly/r pass/p subordinate/v Department/n all/d/v user/n account/n24/m Port/n switch/n ETC/u technical/n device/n/uj installation/v Work/vn

Word segmentation and part-of-speech tagging completed, time: 1.93799996376 seconds.
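Two segmentations of the same sentence can also be compared mechanically by turning each token list into character spans and counting the spans both tools agree on. The sketch below does this for the opening of Test 2. The Chinese tokens are an assumed reconstruction from the translated outputs above, not copied from the tools, and the function names are hypothetical.

```python
# -*- coding: utf-8 -*-

def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for tok in tokens:
        end = start + len(tok)
        spans.add((start, end))
        start = end
    return spans


def agreement(tokens_a, tokens_b):
    """Return (shared spans, spans in a, spans in b) for two segmentations."""
    a, b = to_spans(tokens_a), to_spans(tokens_b)
    return len(a & b), len(a), len(b)


# Assumed reconstruction of the first tokens of Test 2 (not tool output).
jieba_like = ["工信处", "女干事", "每月"]
nlpir_like = ["工", "信", "处女", "干事", "每月"]
shared, n_a, n_b = agreement(jieba_like, nlpir_like)
print(shared, n_a, n_b)
```

Against a hand-checked reference segmentation, the same span counts give precision and recall for each tool, which is a firmer basis than comparing two sentences by eye.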

Judging from these two sentences, jieba's segmentation is slightly better than that of NLPIR/ICTCLAS2013, although these may simply be special cases. jieba also seems to prefer longer words, so it does better where a long word would otherwise be wrongly split into short ones (in Test 2 it keeps "MIIT" and "female officer" whole). Whether this preference helps under other test conditions cannot be judged from these examples.
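To illustrate what a preference for longer words means, here is a minimal sketch of forward maximum matching with a toy dictionary. This is not jieba's actual algorithm (jieba's documentation describes a dictionary-based word graph with a most-probable-path search); the function name, the toy vocabulary and the Chinese fragment are all assumptions made for illustration.

```python
# -*- coding: utf-8 -*-

def forward_max_match(text, vocab, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word.

    Falls back to a single character when no dictionary word matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary (an assumption, not either tool's real dictionary).
toy_vocab = {"工信处", "女干事", "处女", "干事"}
fragment = "工信处女干事"

long_first = forward_max_match(fragment, toy_vocab, max_len=4)
short_only = forward_max_match(fragment, toy_vocab, max_len=2)
print(long_first)
print(short_only)
```

With three-character words allowed, the fragment splits into the office name and "female officer". Restricting matches to two characters reproduces the kind of "virgin" mis-segmentation seen in the NLPIR output for Test 2.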

2. Install NLPIR/ICTCLAS2014 in Python

Install the latest NLPIR/ICTCLAS2014 on 32-bit Windows 7 with Python 2.7.

Procedure:

(1) Download link: http://ictclas.nlpir.org/downloads

(2) Follow the reference installation procedure:

(3) Copy the entire [Data] folder to [sample] -- [pythonsample].

(4) Copy the DLL for your platform from the [lib] folder to [pythonsample] -- [nlpir], replacing the original DLL, and rename it accordingly. For example, we copied NLPIR.dll from win32, renamed it to NLPIR32.dll, and put it in [pythonsample] -- [nlpir].

(5) Open nlpir.py in [pythonsample] and change the DLL in the statement libFile = './nlpir/NLPIR64.dll' to the one that matches your system. For a 32-bit DLL, change it to libFile = './nlpir/NLPIR32.dll'.

(6) Copy Data, nlpir, __init__.py and nlpir.py into your project and run nlpir.py to check that word segmentation works.

(7) Import nlpir in ICTCLAS2014Test.py for the actual test and compare the result with jieba:

# coding=utf-8
'''
Created on 2014-3-19
Test the NLPIR/ICTCLAS2014 word segmentation tool
@author: liTC
'''
import nlpir
import time
t1 = time.time()
# f = open("t_with_splitter.txt", "r")  # read the text from a file
# string = f.read().decode("utf-8")
# The original post used the Chinese test sentence; it is shown here in translation.
string = 'Every month, when she passes through the subordinate departments, the female officer of the MIIT office personally gives instructions on the installation of technical devices such as 24-port switches.'
words = nlpir.seg(string)  # segment the text and tag parts of speech
result = ""  # accumulates the final result
for w in words:
    result += w[0] + "/" + w[1]  # append each word with its part-of-speech tag
# print(result)
with open("t_with_pos_tag.txt", "w") as f:  # save the result to another file
    f.write(result)
t2 = time.time()
print("Word segmentation and part-of-speech tagging completed, time: " + str(t2 - t1) + " seconds.")  # report the elapsed time
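Step (5) hard-codes the DLL name by hand. Because a DLL has to match the bitness of the Python interpreter that loads it, the name could instead be derived at run time. The following is a sketch only: the function name nlpir_lib_file is hypothetical, and the path pattern simply follows the NLPIR32.dll and NLPIR64.dll names used in the steps above.

```python
import struct


def nlpir_lib_file(pointer_bits=None):
    """Return the DLL path matching the interpreter's bitness.

    pointer_bits defaults to the size of a C pointer in the running
    interpreter (32 or 64). The path pattern is taken from the steps
    above; the function itself is a hypothetical helper.
    """
    if pointer_bits is None:
        pointer_bits = struct.calcsize("P") * 8
    if pointer_bits not in (32, 64):
        raise ValueError("unsupported pointer size: %r" % pointer_bits)
    return "./nlpir/NLPIR%d.dll" % pointer_bits


lib_file = nlpir_lib_file()
print(lib_file)
```

Note that a 32-bit Python on 64-bit Windows still needs the 32-bit DLL, which is why the check looks at the interpreter and not at the operating system.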

Test 1:

The result of NLPIR/ICTCLAS2014 is:

Ancestral Home/n Zhejiang/ns Wenzhou/ns,/wd 1975/t February/t28/t was born/vi at/p Zhejiang/ns Wenzhou/ns,/wd singer/n. /Wj in 1987/t test/v Zhejiang/ns Wenzhou/ns Qing/a county/n small/a Baihua/n Yue Opera Troupe/n, /wd in the/p Group/n sing/v Xiaosheng/n. /Wj

Word segmentation and part-of-speech tagging completed, time: 0.00100016593933 seconds.

Test 2:

The result of NLPIR/ICTCLAS2014 is:

Work/n letter/n Virgin/n Officer/n monthly/r pass/p subordinate/v Department/n all/d to/v user/d account/v24/m Port/ q switch/n, etc./udeng technical/n device/n/ude1 installation/vn work/vn

Word segmentation and part-of-speech tagging completed, time: 0.00200009346008 seconds.

Judging from these two sentences, the results of NLPIR/ICTCLAS2014 are almost unchanged from those of NLPIR/ICTCLAS2013, and jieba's segmentation is still slightly better. However, NLPIR/ICTCLAS2014 is far faster: about 0.001 to 0.002 seconds per test against about 1.9 seconds for jieba, which is roughly 1,000 to 2,000 times faster in these runs. For research, jieba's speed may be tolerable, but for a product NLPIR/ICTCLAS2014 is the clear choice.
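The speed ratio can be checked from the timings printed in the tests. The sketch below computes it and also shows one way to get steadier numbers. Each figure above comes from a single run timed with time.time(), the NLPIR values of 1 to 2 ms are close to what that timer can resolve, and jieba's figure probably includes its one-time dictionary loading, which happens on first use. The helper average_seconds is a hypothetical name, it assumes Python 3 for time.perf_counter, and str.split stands in for a real segmenter so the example runs without either tool installed.

```python
import time

# Timings in seconds as reported above: (jieba, NLPIR/ICTCLAS2014).
reported = {
    "test 1": (1.96300005913, 0.00100016593933),
    "test 2": (1.93799996376, 0.00200009346008),
}
ratios = {name: jieba_s / nlpir_s for name, (jieba_s, nlpir_s) in reported.items()}
for name in sorted(ratios):
    print("%s: jieba took about %.0f times as long" % (name, ratios[name]))


def average_seconds(segment, text, repeats=100):
    """Average time per call of segment(text) over several repetitions."""
    segment(text)  # warm-up call so one-time setup is not counted
    start = time.perf_counter()
    for _ in range(repeats):
        segment(text)
    return (time.perf_counter() - start) / repeats


# str.split is only a stand-in segmenter for demonstration.
per_call = average_seconds(str.split, "a stand-in segmenter splits on spaces")
print(per_call)
```

Passing a real segmentation function in place of str.split would separate the steady-state cost per sentence from start-up cost, which matters more for a product than a single cold run.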

PS: I will upload both word segmentation tools to my resources page. Please try them.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
