Implementation of Chinese Information Processing in Python (1)

Source: Internet
Author: User

1. Install and test Chinese word segmentation tools in Python

 

According to the evaluation results in the post "A simple test of four Python Chinese word segmentation systems", the good Chinese word segmentation tools that can be used from Python are jieba and the word segmentation system of the Chinese Academy of Sciences (NLPIR/ICTCLAS).

This article tests these two tools.

1. Install the jieba Chinese word segmentation tool

Install the latest jieba Chinese word segmentation tool on 32-bit Windows 7 with Python 2.7.

Procedure:

(1) Download link and installation instructions: https://github.com/fxsjy/jieba

(2) Download the package and decompress it to a directory, such as C:/jieba-master.

(3) Go to that directory and run the command python setup.py install to complete the installation.

(4) Run a test and compare the results with NLPIR/ICTCLAS2013 in Java.

 

    # coding=utf-8
    '''
    Created on 2014-3-19
    Test the jieba Chinese word segmentation tool
    @author: liTC
    '''
    import jieba
    import jieba.posseg as pseg
    import time

    t1 = time.time()
    # f = open("t_with_splitter.txt", "r")  # Read text
    # string = f.read().decode("utf-8")
    string = 'originally from Wenzhou City, Zhejiang Province, born in Wenzhou, Zhejiang Province on February 28, 1975, as a singer. In September 1987, he was admitted to the qingxian Baihua opera troupe in Wenzhou, Zhejiang Province, and sang Xiaosheng in the group.'
    words = pseg.cut(string)  # Word segmentation with part-of-speech tagging
    result = ""  # Variable for recording the final result
    for w in words:
        result += str(w.word) + "/" + str(w.flag)
    # print result
    f = open("t_with_pos_tag.txt", "w")  # Save the result to another document
    f.write(result)
    f.close()
    t2 = time.time()
    print("Word segmentation and part-of-speech tagging completed, time: " + str(t2 - t1) + " seconds.")  # Feedback results
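The script above measures one whole run with time.time(). As a minimal sketch, the same measurement can be wrapped in a small helper that repeats the call and reports each run separately. The names time_segmenter and fake_segment are hypothetical and not part of jieba or NLPIR; fake_segment is only a whitespace-splitting stand-in so that the sketch runs without either tool installed. It is written for Python 3 (time.perf_counter is not available in Python 2.7).

```python
import time


def time_segmenter(segment, text, repeats=3):
    """Run segment(text) several times; return the last tokens and all timings."""
    timings = []
    tokens = []
    for _ in range(repeats):
        start = time.perf_counter()
        # list() forces lazy generators to do their work inside the timed region
        tokens = list(segment(text))
        timings.append(time.perf_counter() - start)
    return tokens, timings


def fake_segment(text):
    # Hypothetical stand-in segmenter: splits on whitespace only
    return text.split()


tokens, timings = time_segmenter(fake_segment, "born in Wenzhou Zhejiang")
print(tokens)
print(["%.6f" % t for t in timings])
```

With jieba installed, one would pass something like lambda s: list(pseg.cut(s)) as the segment argument. Since jieba loads its dictionary lazily on first use, the first timing may be much larger than the later ones, which is worth keeping in mind when reading single-run numbers.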

 

Test 1:

The test sentence is: "A singer was born in Wenzhou, Zhejiang Province on February 28, 1975. In September 1987, he was admitted to the qingxian Baihua opera troupe in Wenzhou, Zhejiang Province, and sang Xiaosheng in the group."

The result of NLPIR/ICTCLAS2013 is:

Ancestral Home/n Zhejiang/ns Wenzhou/ns ,/wd 1975/t February/t 28/t born/vi at/p Zhejiang/ns Wenzhou/ns ,/wd artist/n ./wj in 1987/t test/v Zhejiang/ns Wenzhou/ns Qing/a county/n small/a Baihua/n Yue Opera Troupe/n ,/wd in/p Group/n singing/v Xiaosheng/n

The Chinese word segmentation result of jieba is:

Ancestral Home/n Zhejiang/ns Wenzhou/ns,/x1975/m/m2/m/m28/m Sunrise/v born/v Zhejiang/ns Wenzhou/ns, /x artist/n. /X1987/m/v Zhejiang/ns Wenzhou/ns Qing County/ns small/n Baihua/n Yue Opera Troupe/nt, /x in/p Group/n singing/v Xiaosheng/n

(Word segmentation and part-of-speech tagging completed, time: 1.96300005913 seconds)

Test 2:

The test sentence is: "Every month, when passing by the subordinate departments, the female officer of the MIIT office personally explains the installation of technical devices such as 24-port switches."

The result of NLPIR/ICTCLAS2013 is:

Work/n letter/n Virgin/n Officer/n monthly/r pass/p subordinate/v Department/n all/d to/v user/d account/v 24/m Port/q switch/n, etc./udeng technical/n device/n/ude1 installation/vn work/vn

The Chinese word segmentation result of jieba is:

MIIT/n female officer/n monthly/r pass/p subordinate/v Department/n all/d/v user/n account/n24/m Port/n switch/n ETC/u technical/n device/n/uj installation/v Work/vn

(Word segmentation and part-of-speech tagging completed, time: 1.93799996376 seconds)

 

Judging from these two sentences, jieba's segmentation is slightly better than that of NLPIR/ICTCLAS2013. For example, jieba keeps "Qing County/ns" and "female officer/n" as single words, where NLPIR/ICTCLAS2013 outputs "Qing/a county/n" and "Virgin/n Officer/n". It cannot be ruled out, however, that these two sentences are special cases. jieba also appears to prefer longer words, which helps in cases where a long word would otherwise be wrongly split into short ones. Whether this strategy holds up under other experimental conditions cannot be judged from this test.
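The "prefer longer words" behaviour suspected above can be illustrated with forward maximum matching, a classic dictionary-based technique. The sketch below is only an illustration of that idea: the function forward_max_match and the tiny dictionary toy_dict are hypothetical, and nothing in the test confirms that jieba works this way internally.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation that always takes the longest dictionary word."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, not taken from jieba or NLPIR
toy_dict = {"小", "百花", "越剧", "越剧团"}
print(forward_max_match("小百花越剧团", toy_dict))
# ['小', '百花', '越剧团']

# Without the long entry, the same text falls apart into shorter words
print(forward_max_match("小百花越剧团", {"小", "百花", "越剧"}))
# ['小', '百花', '越剧', '团']
```

The second call shows the failure mode the paragraph describes: when the long word is missing or not preferred, it is split into short pieces.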

2. Install NLPIR/ICTCLAS2014 in Python

Install the latest NLPIR/ICTCLAS2014 on 32-bit Windows 7 with Python 2.7.

Procedure:

(1) Download link: http://ictclas.nlpir.org/downloads

(2) Refer to the following installation process:

(3) Copy the entire [Data] folder to [sample] -- [pythonsample].

(4) Copy the DLL for your platform from the [lib] folder to [pythonsample] -- [nlpir], replacing the original DLL, and rename it accordingly. For example, we copied NLPIR.dll from win32, renamed it to NLPIR32.dll, and put it in [pythonsample] -- [nlpir].

(5) Open nlpir.py in [pythonsample] and change the DLL in the statement libFile = './nlpir/NLPIR64.dll' to the one that matches your system version. For example, for the 32-bit DLL, change it to libFile = './nlpir/NLPIR32.dll'

(6) Copy Data, nlpir, __init__.py and nlpir.py into your project code and run nlpir.py to test whether word segmentation works.

(7) Import nlpir in ICTCLAS2014Test.py for the actual test and the comparison with jieba.
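Step (5) is easy to get wrong, because a DLL can only be loaded into a process of the same bitness, so what matters is whether the Python interpreter itself is 32-bit or 64-bit. As a hedged sketch, the path could be chosen automatically. The helper pick_nlpir_dll is hypothetical and not part of the NLPIR sample code; the two file names are the ones used in the steps above.

```python
import struct


def pick_nlpir_dll(base_dir="./nlpir"):
    """Return the DLL path that matches the bitness of the running interpreter."""
    # Size of a C pointer in bits: 32 on a 32-bit Python, 64 on a 64-bit Python
    bits = struct.calcsize("P") * 8
    return "%s/NLPIR%d.dll" % (base_dir, bits)


libFile = pick_nlpir_dll()
print(libFile)
```

Assigning the result to libFile in nlpir.py would replace the manual edit, assuming both renamed DLLs have been copied into the nlpir folder.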

 

    # coding=utf-8
    '''
    Created on 2014-3-19
    Test the NLPIR/ICTCLAS2014 word segmentation tool
    @author: liTC
    '''
    import nlpir
    import time

    t1 = time.time()
    # f = open(..., "r")  # Read text
    # string = f.read().decode("utf-8")
    string = 'Once every month, the MIIT virgin officer will personally inform the subordinate departments of the installation of 24-port switches and other technical devices.'
    words = nlpir.seg(string)  # Word segmentation
    result = ""  # Variable for recording the final result
    for w in words:
        result += w[0] + "/" + w[1]
    # print result
    f = open("t_with_pos_tag.txt", "w")  # Save the result to another document
    f.write(result)
    f.close()
    t2 = time.time()
    print("Word segmentation and part-of-speech tagging completed, time: " + str(t2 - t1) + " seconds.")  # Feedback results

 

Test 1:

The test sentence is: "A singer was born in Wenzhou, Zhejiang Province on February 28, 1975. In September 1987, he was admitted to the qingxian Baihua opera troupe in Wenzhou, Zhejiang Province, and sang Xiaosheng in the group."

The result of NLPIR/ICTCLAS2014 is:

Ancestral Home/n Zhejiang/ns Wenzhou/ns ,/wd 1975/t February/t 28/t was born/vi at/p Zhejiang/ns Wenzhou/ns ,/wd singer/n ./wj in 1987/t test/v Zhejiang/ns Wenzhou/ns Qing/a county/n small/a Baihua/n Yue Opera Troupe/n ,/wd in the/p Group/n sing/v Xiaosheng/n ./wj

(Word segmentation and part-of-speech tagging completed, time: 0.00100016593933 seconds)

Test 2:

The test sentence is: "Every month, when passing by the subordinate departments, the female officer of the MIIT office personally explains the installation of technical devices such as 24-port switches."

The result of NLPIR/ICTCLAS2014 is:

Work/n letter/n Virgin/n Officer/n monthly/r pass/p subordinate/v Department/n all/d to/v user/d account/v 24/m Port/q switch/n, etc./udeng technical/n device/n/ude1 installation/vn work/vn

(Word segmentation and part-of-speech tagging completed, time: 0.00200009346008 seconds)

Judging from these two sentences, the results of NLPIR/ICTCLAS2014 are almost unchanged from those of NLPIR/ICTCLAS2013, and jieba's segmentation is still slightly better than that of NLPIR/ICTCLAS2014. However, NLPIR/ICTCLAS2014 is roughly 1,000 to 2,000 times faster than jieba in these two tests (about 0.001 to 0.002 seconds versus about 1.9 seconds). For scientific research, jieba's speed may be tolerable, but for use in a product, NLPIR/ICTCLAS2014 is the clear choice.
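The size of the speed gap can be worked out directly from the timings reported in the tests. The short calculation below uses only those four measured values; the variable names are just for illustration.

```python
# Timings in seconds, copied from the test results reported above
jieba_times = {"test 1": 1.96300005913, "test 2": 1.93799996376}
nlpir_times = {"test 1": 0.00100016593933, "test 2": 0.00200009346008}

# How many times longer jieba took than NLPIR/ICTCLAS2014 on each sentence
ratios = {name: jieba_times[name] / nlpir_times[name] for name in jieba_times}

for name in sorted(ratios):
    print("%s: jieba took about %.0f times as long" % (name, ratios[name]))
```

Because the NLPIR timings are only one or two milliseconds, they are probably close to the resolution of the timer, so these ratios are best read as rough orders of magnitude rather than precise figures.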

PS: I will upload both word segmentation tools to my resources. Please give them a try.

 
