Chinese Information Processing in Python (1): Installing and Testing Chinese Word Segmentation Tools in Python
An earlier post, "A simple test of four Python Chinese word segmentation systems", evaluated several tools. Judging from its results, the better Chinese word segmentation tools that can be used from Python are jieba and the Chinese Academy of Sciences' segmentation system (NLPIR/ICTCLAS). This post tests those two tools.
1. Install the jieba Chinese word segmentation tool
Install the latest jieba Chinese word segmentation tool on 32-bit Windows 7 with Python 2.7.
Procedure:
(1) Download from https://github.com/fxsjy/jieba, which also has the installation instructions.
(2) Download and decompress the package to a directory, such as C:/jieba-master.
(3) Go to that directory and run the command python setup.py install to complete the installation.
(4) Run a test and compare the result with NLPIR/ICTCLAS2013 under Java:
# coding=utf-8
'''
Created on 2014-3-19
Test the jieba Chinese word segmentation tool
@author: liTC
'''
import jieba
import jieba.posseg as pseg
import time

t1 = time.time()
# f = open("t_with_splitter.txt", "r")  # Read text
# string = f.read().decode("utf-8")
# Test sentence (shown here in English translation)
string = 'Ancestral home Wenzhou, Zhejiang; born in Wenzhou, Zhejiang, on February 28, 1975; a singer. In September 1987, admitted to the Qing County Xiao Baihua Yue Opera Troupe in Wenzhou, Zhejiang, singing xiaosheng roles in the troupe.'
words = pseg.cut(string)  # Word segmentation with part-of-speech tagging
result = ""  # Variable for recording the final result
for w in words:
    result += str(w.word) + "/" + str(w.flag)  # Append each word with its tag
# print result
f = open("t_with_pos_tag.txt", "w")  # Save the result to another document
f.write(result)
f.close()
t2 = time.time()
print("Word segmentation and part-of-speech tagging completed, time: " + str(t2 - t1) + " seconds.")  # Feedback results
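The script above times everything between t1 and t2, so any one-off setup the library performs on first use is counted together with the segmentation itself. Below is a minimal sketch of measuring the two separately. It is written for Python 3, the helper name `time_segmentation` and the stand-in segmenter are hypothetical, and it only assumes that the segmenter is a callable taking a string.

```python
import time


def time_segmentation(segment, text, warmup=None):
    """Return (warmup_seconds, segment_seconds, tokens).

    `segment` is any callable that takes a string and returns an iterable
    of tokens; `warmup` is an optional zero-argument callable that performs
    one-off initialisation. Both names are hypothetical, not library API.
    """
    start = time.perf_counter()
    if warmup is not None:
        warmup()
    warmup_seconds = time.perf_counter() - start

    start = time.perf_counter()
    tokens = list(segment(text))  # list() forces lazy generators to run
    segment_seconds = time.perf_counter() - start
    return warmup_seconds, segment_seconds, tokens


if __name__ == "__main__":
    # Stand-in segmenter so the sketch runs without third-party packages.
    def whitespace_segment(text):
        return text.split()

    init_s, seg_s, toks = time_segmentation(whitespace_segment, "a b c")
    print(toks, init_s, seg_s)
```

With jieba installed, a call such as `time_segmentation(pseg.cut, string, warmup=jieba.initialize)` should report dictionary loading separately from segmentation, since jieba documents `jieba.initialize()` for loading its dictionary ahead of the first call.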
Test 1:
The test sentence is: "Ancestral home Wenzhou, Zhejiang; born in Wenzhou, Zhejiang, on February 28, 1975; a singer. In September 1987, admitted to the Qing County Xiao Baihua Yue Opera Troupe in Wenzhou, Zhejiang, singing xiaosheng roles in the troupe."
The result of NLPIR/ICTCLAS2013 is:
Ancestral Home/n Zhejiang/ns Wenzhou/ns,/wd 1975/t February/t 28/t born/vi at/p Zhejiang/ns Wenzhou/ns, /wd artist/n. /Wj in 1987/t test/v Zhejiang/ns Wenzhou/ns Qing/a county/n small/a Baihua/n Yue Opera Troupe/n, /wd in/p Group/n singing/v Xiaosheng/n _
The Chinese word segmentation result of jieba is:
Ancestral Home/n Zhejiang/ns Wenzhou/ns,/x1975/m/m2/m/m28/m Sunrise/v born/v Zhejiang/ns Wenzhou/ns, /x artist/n. /X1987/m/v Zhejiang/ns Wenzhou/ns Qing County/ns small/n Baihua/n Yue Opera Troupe/nt, /x in/p Group/n singing/v Xiaosheng/n
(Word segmentation and part-of-speech tagging completed, time: 1.96300005913 seconds.)
Test 2:
The test sentence is: "Every month, when passing the subordinate departments, the female officer of the MIIT office personally gives instructions on the installation of 24-port switches and other technical devices."
The result of NLPIR/ICTCLAS2013 is:
Work/n letter/n Virgin/n Officer/n monthly/r pass/p subordinate/v Department/n all/d to/v user/d account/v 24/m Port /q switch/n, etc./udeng technical/n device/n/ude1 installation/vn work/vn _
The Chinese word segmentation result of jieba is:
MIIT/n female officer/n monthly/r pass/p subordinate/v Department/n all/d/v user/n account/n24/m Port/n switch/n ETC/u technical/n device/n/uj installation/v Work/vn
(Word segmentation and part-of-speech tagging completed, time: 1.93799996376 seconds.)
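One way to see exactly where two segmenters disagree is to align their token sequences and keep only the spans that differ. Below is a minimal sketch using the standard library's difflib. The function name and the sample token lists are hypothetical, with Latin letters standing in for Chinese characters.

```python
import difflib


def diff_segmentations(tokens_a, tokens_b):
    """Return a list of (span_from_a, span_from_b) where the two differ."""
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tokens_a[i1:i2], tokens_b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical outputs: tool A keeps "AB" whole, tool B splits it.
    tool_a = ["AB", "C", "D"]
    tool_b = ["A", "B", "C", "D"]
    print(diff_segmentations(tool_a, tool_b))
```

Comparing word lists rather than raw strings keeps the alignment at token level, so a single differing split shows up as one span instead of many character edits.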
Judging from these two sentences, jieba's segmentation is slightly better than that of NLPIR/ICTCLAS2013, although it cannot be ruled out that both sentences are special cases. jieba also appears to give priority to longer words, so it does better in cases where a long word would otherwise be wrongly split into shorter ones. Whether that strategy holds up under other experimental conditions cannot be judged from this test.
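To make the "long words first" idea concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based strategy that always takes the longest dictionary word available at the current position. It only illustrates longest-match priority and is not claimed to be jieba's actual algorithm. The function name and toy dictionary are made up for the example, and the code is written for Python 3.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment `text` greedily, taking the longest dictionary word first."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        # A single character is always accepted so the loop makes progress.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # Toy dictionary (hypothetical); Latin letters stand in for characters.
    toy_dict = {"ab", "abc", "cd", "d"}
    print(forward_max_match("abcd", toy_dict))  # ['abc', 'd']
```

Here "abc" wins over "ab" because it is longer, which is the behaviour the paragraph above attributes to a long-word-first strategy. The same greediness is also why such a method can go wrong when the longest match is not the intended word.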
2. Install NLPIR/ICTCLAS2014 in Python
Install the latest NLPIR/ICTCLAS2014 on 32-bit Windows 7 with Python 2.7.
Procedure:
(1) Download link: http://ictclas.nlpir.org/downloads
(2) For reference, the installation process is as follows:
(3) Copy the entire [Data] folder to [sample]--[pythonsample].
(4) Copy the DLLs for each platform from the [lib] folder into [pythonsample]--[nlpir], replacing the original DLLs, and rename them accordingly. For example, we copied NLPIR.dll from win32, renamed it to NLPIR32.dll, and placed it in [pythonsample]--[nlpir].
(5) Open nlpir.py in [pythonsample] and change the DLL in the statement libFile = './nlpir/NLPIR64.dll' to the one matching your system. For example, for the 32-bit DLL, change it to libFile = './nlpir/NLPIR32.dll'.
(6) Copy Data, nlpir, __init__.py, and nlpir.py into your project code, then run nlpir.py to check that word segmentation works.
(7) Import nlpir in ICTCLAS2014Test.py to run the actual test and compare with jieba:
# coding=utf-8
'''
Created on 2014-3-19
Test the NLPIR/ICTCLAS2014 word segmentation tool
@author: liTC
'''
import nlpir
import time

t1 = time.time()
# f = open("t_with_splitter.txt", "r")  # Read text
# string = f.read().decode("utf-8")
# Test sentence (shown here in English translation)
string = 'Every month, when passing the subordinate departments, the female officer of the MIIT office personally gives instructions on the installation of 24-port switches and other technical devices.'
words = nlpir.seg(string)  # Word segmentation with part-of-speech tagging
result = ""  # Variable for recording the final result
for w in words:
    result += w[0] + "/" + w[1]  # w[0] is the word, w[1] is its tag
# print result
f = open("t_with_pos_tag.txt", "w")  # Save the result to another document
f.write(result)
f.close()
t2 = time.time()
print("Word segmentation and part-of-speech tagging completed, time: " + str(t2 - t1) + " seconds.")  # Feedback results
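Step (5) above edits libFile by hand. Below is a minimal sketch of choosing the file name from the bitness of the running Python interpreter instead, assuming the DLLs have been renamed NLPIR32.dll and NLPIR64.dll as in step (4). The helper name `nlpir_lib_path` is hypothetical and not part of the NLPIR sample code.

```python
import struct


def nlpir_lib_path(base_dir="./nlpir", pointer_bits=None):
    """Return the DLL path matching the interpreter's pointer size.

    `pointer_bits` can be passed explicitly (32 or 64), e.g. for testing;
    by default it is derived from the size of a C pointer in this process.
    """
    if pointer_bits is None:
        pointer_bits = struct.calcsize("P") * 8
    if pointer_bits not in (32, 64):
        raise ValueError("unsupported pointer size: %r" % pointer_bits)
    return "%s/NLPIR%d.dll" % (base_dir, pointer_bits)


if __name__ == "__main__":
    print(nlpir_lib_path())
```

The check uses the interpreter's pointer size rather than the Windows version, because a 32-bit Python on 64-bit Windows can only load a 32-bit DLL.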
Test 1:
The test sentence is: "Ancestral home Wenzhou, Zhejiang; born in Wenzhou, Zhejiang, on February 28, 1975; a singer. In September 1987, admitted to the Qing County Xiao Baihua Yue Opera Troupe in Wenzhou, Zhejiang, singing xiaosheng roles in the troupe."
The result of NLPIR/ICTCLAS2014 is:
Ancestral Home/n Zhejiang/ns Wenzhou/ns,/wd 1975/t February/t28/t was born/vi at/p Zhejiang/ns Wenzhou/ns,/wd singer/n. /Wj in 1987/t test/v Zhejiang/ns Wenzhou/ns Qing/a county/n small/a Baihua/n Yue Opera Troupe/n, /wd in the/p Group/n sing/v Xiaosheng/n. /Wj
(Word segmentation and part-of-speech tagging completed, time: 0.00100016593933 seconds.)
Test 2:
The test sentence is: "Every month, when passing the subordinate departments, the female officer of the MIIT office personally gives instructions on the installation of 24-port switches and other technical devices."
The result of NLPIR/ICTCLAS2014 is:
Work/n letter/n Virgin/n Officer/n monthly/r pass/p subordinate/v Department/n all/d to/v user/d account/v24/m Port/ q switch/n, etc./udeng technical/n device/n/ude1 installation/vn work/vn
(Word segmentation and part-of-speech tagging completed, time: 0.00200009346008 seconds.)
Judging from these two sentences, the results of NLPIR/ICTCLAS2014 are almost unchanged from those of NLPIR/ICTCLAS2013, and jieba's segmentation is slightly better than that of NLPIR/ICTCLAS2014. However, in these runs NLPIR/ICTCLAS2014 was on the order of 1000 times faster than jieba (about 1 to 2 milliseconds versus about 1.9 seconds). For scientific research, jieba's speed may be tolerable, but for a product NLPIR/ICTCLAS2014 is the clear choice.
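The speed comparison can be checked with simple arithmetic on the timings reported above. The sketch below only divides the numbers printed by the two scripts; the variable and function names are chosen for the example.

```python
# Timings in seconds, copied from the test output reported above.
timings = {
    "Test 1": {"jieba": 1.96300005913, "NLPIR/ICTCLAS2014": 0.00100016593933},
    "Test 2": {"jieba": 1.93799996376, "NLPIR/ICTCLAS2014": 0.00200009346008},
}


def speed_ratio(slow_seconds, fast_seconds):
    """How many times longer the slower run took."""
    return slow_seconds / fast_seconds


ratios = {
    name: speed_ratio(t["jieba"], t["NLPIR/ICTCLAS2014"])
    for name, t in timings.items()
}

for name in sorted(ratios):
    print("%s: jieba took about %.0f times as long" % (name, ratios[name]))
```

The ratio comes out a little under 2000 for Test 1 and a little under 1000 for Test 2. Each figure is from a single run, and timings of one or two milliseconds are close to what `time.time()` can resolve, so the result is best read as an order-of-magnitude estimate.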
PS: I will upload both word segmentation tools to my resources page. Feel free to try them.