As mentioned in the earlier article on Python and SEO, today I want to share some knowledge about Chinese word segmentation in Python.
If you do SEO for Google (that is, you work with English text), word segmentation in Python is easy: you can split on spaces, or use the nltk module.
Chinese word segmentation is harder because the text cannot be split on spaces, and the segmenter has to take meaning into account to decide where one word ends and the next begins.
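To make the difference concrete, here is a minimal sketch that uses only the built-in str.split. The two sample sentences are made up for illustration.

```python
# Splitting on whitespace works for English but not for Chinese.
english = "Python makes text processing easy"
chinese = "我来到北京清华大学"

english_tokens = english.split()
chinese_tokens = chinese.split()

print(english_tokens)  # ['Python', 'makes', 'text', 'processing', 'easy']
print(chinese_tokens)  # ['我来到北京清华大学']
```

The Chinese sentence comes back as a single token because it contains no spaces, which is why a dedicated segmenter is needed.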
Below are some of the better Chinese word segmentation tools. I mostly use jieba, so it is described in the most detail:
1 jieba: a Python Chinese word segmentation component (version 0.22 released)
Jieba supports three word segmentation modes:
Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis;
Full mode: scans out every word that can be formed from the sentence. It is very fast, but it cannot resolve ambiguity;
Search engine mode: based on accurate mode, it further splits long words to improve recall; suitable for search engine indexing.
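The three modes can be compared side by side with a short sketch like the one below. It assumes jieba is already installed; segment_three_ways is a hypothetical helper name, while jieba.cut (with its cut_all flag) and jieba.cut_for_search are the library's segmentation calls.

```python
def segment_three_ways(sentence):
    """Return the tokens that each jieba mode produces for one sentence."""
    # Imported here so the dictionary is only loaded when the helper is used.
    import jieba

    return {
        # Accurate mode (the default): one best segmentation of the sentence.
        "accurate": list(jieba.cut(sentence, cut_all=False)),
        # Full mode: every dictionary word found, overlaps included.
        "full": list(jieba.cut(sentence, cut_all=True)),
        # Search engine mode: accurate mode plus extra splits of long words.
        "search": list(jieba.cut_for_search(sentence)),
    }
```

Calling segment_three_ways("我来到北京清华大学") and printing "/".join(tokens) for each mode is a quick way to see how the results differ. The exact tokens depend on the dictionary shipped with your jieba version.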
Jieba also provides five features: 1 word segmentation, 2 adding a custom dictionary, 3 keyword extraction, 4 part-of-speech tagging, 5 parallel word segmentation
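As a rough sketch of how some of these features fit together, the helper below loads an optional custom dictionary, extracts keywords, and tags parts of speech. The function name keywords_and_tags and its parameters are hypothetical; jieba.load_userdict, jieba.analyse.extract_tags and jieba.posseg.cut are the jieba calls it relies on, and it assumes jieba is installed.

```python
def keywords_and_tags(text, top_k=5, user_dict_path=None):
    """Return (keywords, tagged_words) for a piece of Chinese text."""
    import jieba
    import jieba.analyse
    import jieba.posseg as pseg

    # Optional custom dictionary: a plain-text file, one entry per line.
    if user_dict_path is not None:
        jieba.load_userdict(user_dict_path)

    # Keyword extraction: the top_k highest-weighted words in the text.
    keywords = jieba.analyse.extract_tags(text, topK=top_k)

    # Part-of-speech tagging: each item exposes .word and .flag.
    tagged = [(pair.word, pair.flag) for pair in pseg.cut(text)]
    return keywords, tagged
```

Parallel segmentation (jieba.enable_parallel) is left out of the sketch to keep it short.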
Python 2.x installation
Automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: download the package from http://pypi.python.org/pypi/jieba/, unzip it, and run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory.
Use it with import jieba (the trie has to be built on the first import, which takes several seconds)
Python 3.x installation
Currently, the master branch only supports Python 2.x.
A Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k
git clone https://github.com/fxsjy/jieba.git
cd jieba
git checkout jieba3k
python setup.py install
2 pymmseg-cpp: a Python port of the rmmseg-cpp project. rmmseg-cpp is a C++ implementation of the MMSEG Chinese word segmentation algorithm with a Ruby interface.
3 loso: a Chinese word segmentation system written in Python.
It was initially developed to improve Plurk search; however, it is also applicable to Simplified Chinese.
4 smallseg: an open-source, lightweight Chinese word segmentation toolkit
Features: customizable dictionary, fast, and runs on Google App Engine.
5 judou: http://judou.org/
1. An open Chinese word segmentation project
2. A high-performance, high-availability word segmentation system
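Several of the tools above are dictionary-based, and MMSEG (the algorithm behind pymmseg-cpp) builds on maximum matching against a word list. The sketch below shows only plain forward maximum matching. It is not the full MMSEG algorithm, which adds rules for resolving ambiguity, and it is not the API of any library listed here. The function name, the toy dictionary and the max_word_len parameter are all made up for illustration.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, for illustration only.
toy_dictionary = {"我", "来到", "北京", "清华", "清华大学", "大学"}
print(forward_max_match("我来到北京清华大学", toy_dictionary))
# ['我', '来到', '北京', '清华大学']
```

Greedy matching like this is fast but can pick the wrong split when words overlap, which is the kind of ambiguity the semantic handling in real segmenters is meant to address.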