Chinese text classification is not like English text classification, where the words can simply be separated one by one; for Chinese, the text must first be cut into words before those words can be composed into a vector. Therefore, word segmentation is required.
Here we use the popular open-source word segmentation tool jieba ("stutter" segmentation), which can effectively extract the words in a sentence one by one. The principle behind jieba is not repeated here; the key point is how to use it.
1. Installation
jieba is a Python tool library that is installed into a Python environment in the following ways:
(1) Under python2.x
Fully automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: download the package, unpack it, and run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory
Then import it with import jieba
(2) Under python3.x
The master branch currently supports only python2.x.
The python3.x branch (jieba3k) is also basically usable:
git clone https://github.com/fxsjy/jieba.git
git checkout jieba3k
python setup.py install
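After installation, a quick check that jieba works can look like the following minimal sketch (the example sentence is arbitrary; the exact segmentation depends on jieba's dictionary):

import jieba
# Segment a sample sentence; jieba.cut returns a generator of words
print("/".join(jieba.cut("我来到北京清华大学")))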
2. Use
To use it, first import the jieba library with import jieba. Since Chinese text may also contain symbols besides the text content, such as parentheses, equals signs, or arrows, these need to be matched and removed with regular expressions; because regular expressions are used, the related library must also be imported with import re.
The specific code is as follows:
def textparse(sentence):
    import jieba
    import re
    # The following two lines filter out symbols other than Chinese characters and text
    r = re.compile("[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+")
    sentence = r.sub("", sentence)
    seg_list = jieba.cut(sentence)
    # print("Default Mode: " + "/ ".join(seg_list))
    return [tok for tok in seg_list]
The textparse function receives a sentence as its parameter and returns the list of words in that sentence.
The most critical function in jieba is jieba.cut, which segments the received sentence into words and returns a generator that can be iterated over. The last line of the code converts that generator into a list.
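A hypothetical call to the textparse function defined above (the example sentence is arbitrary; the exact tokens depend on jieba's dictionary):

tokens = textparse("今天天气不错，我们一起去公园散步")
print(tokens)       # a plain Python list of words
print(len(tokens))  # the generator has already been converted to a list, so len() works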
3. Stop words
The term "inactive" refers to the presence of words or connectives in Chinese, which, if not kicked out, will affect the definite relationship between the core words and the classification. For example, ",", "and", "and" and so on, can also be used to increase the use of the classification of the scene of the stop word. The Chinese Stop glossary covers 1598 discontinued words. Can be obtained from GitHub.
Project improvements are as follows:
(1) Add a new stop-word file stopkey.txt to the project
All the Chinese stop words are entered into this text file.
(2) Add stop-word filtering to the Chinese word segmentation function (a minimal sketch follows).
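As a minimal sketch of the stop-word filtering step, assuming the stop words are stored one per line in the UTF-8 file stopkey.txt mentioned above (the example sentence is arbitrary):

import jieba

def load_stopwords(path="stopkey.txt"):
    # Read the stop-word file (one word per line, UTF-8) into a set
    with open(path, 'r', encoding='utf-8') as f:
        return set(line.strip() for line in f)

stopwords = load_stopwords()
# Keep only the tokens that are not stop words
words = [w for w in jieba.cut("我和你的看法是一样的") if w not in stopwords]
print(words)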
4. Custom Dictionaries
For a given classification scenario, some domain-specific terms need to be defined so that the segmenter treats each of them as a single word when it encounters them. For example, adding the database term "多对多" (many-to-many) to the dictionary prevents it from being segmented into the separate words "多", "对", "多". Which terms to define depends on the classifier's application scenario.
Project improvements are as follows:
(1) Add a custom dictionary file userdict.txt
(2) Load the custom dictionary in the Chinese word segmentation function (a minimal sketch follows).
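A minimal sketch of loading the custom dictionary, assuming userdict.txt is a UTF-8 file with one entry per line in jieba's "word [frequency] [part of speech]" format (frequency and part of speech are optional):

import jieba
# Load the custom dictionary, e.g. a line containing: 多对多 5 n
jieba.load_userdict("userdict.txt")
# With the entry loaded, "多对多" should be kept as a single token
print("/".join(jieba.cut("数据库中的多对多关系")))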
5. The improved Chinese word segmentation function
The code is as follows (with additional common symbols added to the filter):
# Chinese word segmentation
def textparse(sentence):
    import jieba
    import re
    # The following two lines filter out symbols other than Chinese characters and text
    r = re.compile("[\s+\.\!\/_\?【】\-(?:\))(?:\()(?:\[)(?:\])(\:):,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+")
    sentence = r.sub("", sentence)
    jieba.load_userdict("userdict.txt")  # load the custom dictionary
    # the stop-word file is UTF-8 encoded
    stoplist = {}.fromkeys([line.strip() for line in open("stopkey.txt", 'r', encoding='utf-8')])
    seg_list = jieba.cut(sentence)
    seg_list = [word for word in list(seg_list) if word not in stoplist]
    # print("Default Mode: " + "/ ".join(seg_list))
    return seg_list
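A hypothetical call to the improved function, assuming userdict.txt and stopkey.txt are in the working directory:

print(textparse("数据库中的多对多关系需要先分词再分类"))

Note that the function reloads the custom dictionary and the stop-word list on every call; when segmenting a large corpus it may be worth loading them once outside the function.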