pymmseg installation method and a fix for garbled output
pymmseg-cpp is a Python port of the rmmseg-cpp project. rmmseg-cpp is an implementation of the MMSEG Chinese word segmentation algorithm, written in C++ with a Ruby interface.
Project page: http://code.google.com/p/pymmseg-cpp/
Windows users can download pymmseg-cpp-win32-1.0.1.tar.gz and install it as follows:
1. Decompress the package.
2. Install VS2008. The program is compiled from the VS2008 command line window; the tool is the Visual Studio 2008 Command Prompt.
3. In the command line window, change to the pymmseg/mmseg-cpp folder, type python build.py and press ENTER.
Write a test program as follows:

    # coding: utf-8
    from pymmseg import mmseg

    # Load the default dictionaries shipped with pymmseg
    mmseg.dict_load_defaults()

    text = 'Today I am so happy.'
    algor = mmseg.Algorithm(text)

    # Print each token with its start and end offsets
    for tok in algor:
        print '%s [%d..%d]' % (tok.text, tok.start, tok.end)
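Garbled characters appear when this program is run on Windows. This is because mmseg works with UTF-8, while the default local encoding on Simplified Chinese Windows is cp936, that is, GBK. The console therefore interprets the UTF-8 bytes of each token as GBK.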
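The mismatch can be reproduced without pymmseg. The following is only a minimal sketch using the standard codecs: the sample string and the variable names are made up for illustration, and the snippet simply decodes UTF-8 bytes as GBK, which is roughly what a cp936 console does with them.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: what happens when UTF-8 bytes are read as GBK.
# The sample text and all names here are hypothetical, not part of pymmseg.

original = u'今天'                       # sample Chinese text
utf8_bytes = original.encode('utf-8')   # the kind of bytes mmseg hands back

# A cp936 console effectively performs this decoding step.
# errors='replace' keeps the sketch from raising on invalid byte pairs.
garbled = utf8_bytes.decode('gbk', 'replace')

# Decoding with the correct codec recovers the original text.
recovered = utf8_bytes.decode('utf-8')

print(garbled != original)    # the GBK reading does not match
print(recovered == original)  # the UTF-8 reading does
```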
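Rewrite the code as follows, so that each token is decoded from UTF-8 and re-encoded as GBK before it is printed:

    # coding: utf-8
    from pymmseg import mmseg

    # Load the default dictionaries shipped with pymmseg
    mmseg.dict_load_defaults()

    text = 'Today I am so happy.'
    algor = mmseg.Algorithm(text)

    # Convert each token from UTF-8 to GBK so the Windows console shows it correctly
    for tok in algor:
        print '%s [%d..%d]' % (tok.text.decode('utf-8').encode('gbk'), tok.start, tok.end)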
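If the script may also run on consoles that are not GBK, the target encoding could be looked up instead of hard-coded. The sketch below is one possible way to do that, not part of pymmseg: the helper names console_encoding and to_console_bytes are hypothetical, and it assumes the token text arrives as UTF-8 bytes, as in the program above.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers (not part of pymmseg) for re-encoding UTF-8 token
# bytes into whatever encoding the current console reports.
import locale
import sys


def console_encoding():
    """Return the console's encoding, falling back to the locale default."""
    return getattr(sys.stdout, 'encoding', None) or locale.getpreferredencoding()


def to_console_bytes(utf8_bytes, encoding=None):
    """Decode UTF-8 bytes and re-encode them for the console.

    Characters the target encoding cannot represent are replaced
    rather than raising an error.
    """
    encoding = encoding or console_encoding()
    return utf8_bytes.decode('utf-8').encode(encoding, 'replace')


# Example with an explicit target encoding, matching the GBK case above.
sample = u'今天'.encode('utf-8')
gbk_bytes = to_console_bytes(sample, 'gbk')
```

In the loop above, tok.text.decode('utf-8').encode('gbk') would then become to_console_bytes(tok.text).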