Python Chinese Word Segmentation

Source: Internet
Author: User

Earlier articles covered how Python can be used for SEO. Today I want to share some notes on Chinese word segmentation in Python.

If you work with English text (for example, when optimizing for Google), word segmentation in Python is easy: you can split on spaces, or use the NLTK module.
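As a minimal sketch of the space-based approach for English, the standard library alone is enough. The sample sentence is made up for illustration, and punctuation handling (which NLTK's tokenizers take care of) is deliberately left out:

```python
# Whitespace tokenization: a rough first pass for English text.
sentence = "Python makes word segmentation easy"

# str.split() with no arguments splits on any run of whitespace.
tokens = sentence.split()

print(tokens)
```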

Chinese word segmentation is harder: words are not separated by spaces, so a segmenter has to take meaning into account to decide where one word ends and the next begins.
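To make the difficulty concrete, the sketch below first shows that splitting on spaces leaves a Chinese sentence as a single token, then segments it with forward maximum matching against a toy dictionary. Forward maximum matching is a simple dictionary-based technique chosen here only for illustration; it is not how any of the libraries below necessarily work, and the function name, dictionary, and `max_len` parameter are all assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation against a set of known words."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary
        # contains it; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real segmenter ships a much larger one.
toy_dictionary = {"我", "来到", "北京", "清华", "大学", "清华大学"}
text = "我来到北京清华大学"

by_spaces = text.split()  # no spaces, so the sentence stays in one piece
by_dictionary = forward_max_match(text, toy_dictionary)

print(by_spaces)
print(by_dictionary)
```

A greedy matcher like this has no way to resolve ambiguous sentences, which is why practical segmenters add statistics or extra rules on top of the dictionary.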

Several good Chinese word segmentation tools are listed below. I mostly use jieba, so it is described in the most detail:

1 jieba: version 0.22 released, a Python Chinese word segmentation component

jieba supports three word segmentation modes:
Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis.
Full mode: scans out every string in the sentence that can form a word. It is very fast, but it cannot resolve ambiguity.
Search engine mode: builds on accurate mode and further splits long words to improve recall; suitable for search engine word segmentation.

It also has five main features: 1 word segmentation, 2 custom dictionaries, 3 keyword extraction, 4 part-of-speech tagging, 5 parallel word segmentation.

Python 2.x Installation

Automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: download the package from http://pypi.python.org/pypi/jieba/, unzip it, and run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory.
Then load it with import jieba (the trie has to be built on the first import, which takes a few seconds).

Python 3.x Installation

Currently, the master branch only supports Python 2.x.

A Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

git clone https://github.com/fxsjy/jieba.git
cd jieba
git checkout jieba3k
python setup.py install

2 pymmseg-cpp: a Python port of the rmmseg-cpp project. rmmseg-cpp is an implementation of the MMSEG Chinese word segmentation algorithm in C++ with a Ruby interface.

3 loso: Loso is a Chinese word segmentation system written in Python.
It was initially developed to improve Plurk search, but it also works for Simplified Chinese.

4 smallseg:

smallseg is an open-source, lightweight Chinese word segmentation toolkit.

Features: customizable dictionary, fast, and runs on Google App Engine.

5 judou: http://judou.org/

1. Open Chinese word segmentation project

2. High-performance, high-availability word segmentation system
