1. Introduction to Chinese Word Segmentation
Currently, there are two main approaches to adding Chinese word segmentation to Nutch:
First, modify the source code. In this approach, you directly modify Nutch's word segmentation classes and call an existing segmentation component to perform the segmentation.
Second, build a word segmentation plugin. In this approach, you write a Chinese word segmentation plugin following the plugin rules defined by Nutch.
Both approaches are viable. Thanks to an active open-source community, many word segmentation components have emerged, and both the source-modification approach and the plugin approach depend on them. The main word segmentation components are listed below:
1. CJKAnalyzer
Lucene's built-in analyzer for Chinese, Japanese, and Korean text.
2. ChineseAnalyzer
Lucene's built-in Chinese analyzer.
3. IK_CAnalyzer (MIK_CAnalyzer)
A dictionary-based analyzer built on Lucene; relatively simple.
4. Paoding
The well-known "Paoding Jieniu" word segmentation component, offering high efficiency and high segmentation accuracy.
5. JE
A community-developed word segmentation component with good performance.
6. ICTCLAS
A family of word segmentation tools from the Chinese Academy of Sciences, with both open-source and commercial versions, based on HMM models. The main variants are as follows:
ICTCLAS_OpenSrc_C_Windows and ICTCLAS_OpenSrc_C_Linux were developed by Zhang Huaping and Liu Qun of the Institute of Computing Technology, Chinese Academy of Sciences.
SharpICTCLAS is ICTCLAS on the .NET platform. It was developed by Lu Zhenyu of the School of Economics and Management, Hebei University of Technology, based on the free version of ICTCLAS, with parts of the original code rewritten and adjusted.
ictclas4j is a Java open-source word segmentation project completed by sinboy, based on FreeICTCLAS developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences. It reduces the complexity of the original segmentation program and aims to give Chinese word segmentation enthusiasts a better learning opportunity.
Imdict-chinese-analyzer is the intelligent Chinese word segmentation module of the imdict smart dictionary, developed by Gao Xiaoping. Based on the Hidden Markov Model (HMM), it is a Java reimplementation of the ICTCLAS segmenter from the Institute of Computing Technology, Chinese Academy of Sciences, and can directly provide Chinese word segmentation support for the Lucene search engine.
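Most of the dictionary-based components above rely on some variant of maximum matching against a word list. A rough illustration of the idea, using a toy dictionary and the forward-maximum-matching algorithm only (this is not the actual code of any component listed):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatch {
    // Toy dictionary; real components ship dictionaries with hundreds of thousands of entries.
    static final Set<String> DICT = new HashSet<>(Arrays.asList("中文", "分词", "中文分词", "组件"));
    static final int MAX_WORD_LEN = 4;

    static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + MAX_WORD_LEN, text.length());
            // Try the longest candidate first, shrinking until a dictionary hit.
            while (end > pos + 1 && !DICT.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end)); // falls back to a single character
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("中文分词组件")); // [中文分词, 组件]
    }
}
```

Forward maximum matching is greedy and fast but cannot resolve genuine ambiguities; this is where the HMM-based tools such as ICTCLAS differ, choosing the segmentation with the highest statistical likelihood instead.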
2. Word Segmentation Structure Analysis
Before running the experiment, we need to analyze Nutch's word segmentation structure. This article examines Nutch's org.apache.nutch.analysis package; most of its classes are involved in parsing and tokenizing page text when Nutch crawls web pages.
At the bottom of Nutch's word segmentation sits Lucene's abstract Analyzer class, located in the org.apache.lucene.analysis package. Its tokenStream(String fieldName, Reader reader) method returns a TokenStream and is used to analyze text; in the subclasses, this method implements the strategy and algorithm for extracting index terms from the text. The returned TokenStream is itself an abstract class that enumerates the sequence of tokens in a text or query phrase; in Lucene, its concrete subclasses are Tokenizer and TokenFilter.
The NutchAnalyzer class is the extension point for text analysis in Nutch; every plugin that parses text must implement this extension point. A typical Analyzer first creates a Tokenizer (org.apache.lucene.analysis.Tokenizer), which decomposes the character stream read from a Reader into raw tokens (org.apache.lucene.analysis.Token); one or more TokenFilters are then applied to filter out the meaningless tokens among them.
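Conceptually, then, an Analyzer chains a Tokenizer to one or more TokenFilters. A stdlib-only Java sketch of that pipeline idea, with simplified stand-ins rather than Lucene's actual classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineSketch {
    // Stands in for a Tokenizer: breaks the character stream into raw tokens.
    static List<String> tokenize(String text) {
        return new ArrayList<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    // Stands in for a TokenFilter: drops meaningless tokens (stop words).
    static List<String> stopFilter(List<String> tokens) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a", "of"));
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stops.contains(t)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // Analyzer = Tokenizer followed by TokenFilter(s).
        System.out.println(stopFilter(tokenize("The anatomy of a search engine")));
        // [anatomy, search, engine]
    }
}
```

In Lucene the same chaining happens lazily over a token stream rather than over materialized lists, but the division of labor is identical: the Tokenizer defines token boundaries, the filters transform or discard tokens.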
The NutchDocumentAnalyzer class extends NutchAnalyzer. It contains three static private inner classes: ContentAnalyzer, AnchorFilter, and AnchorAnalyzer, which extend Analyzer (org.apache.lucene.analysis.Analyzer), TokenFilter (org.apache.lucene.analysis.TokenFilter), and Analyzer (org.apache.lucene.analysis.Analyzer) respectively. ContentAnalyzer calls the CommonGrams class (org.apache.nutch.analysis), which builds an n-grams tokenization scheme, since phrases must be considered in the index, and optimizes phrase queries under that scheme. In the n-grams scheme, a single token is also indexed; the scheme is used heavily during indexing and operates on Token (org.apache.lucene.analysis.Token) objects. The common terms it uses are configured through the analysis.common.terms.file property in nutch-default.xml.
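The idea behind an n-grams scheme like CommonGrams can be pictured with plain bigrams: adjacent tokens are joined so that frequent word pairs become single index terms, which makes phrase queries cheaper. A simplified stdlib-only sketch (CommonGrams itself joins only the configured common terms; this is not its actual code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BigramSketch {
    // Join each adjacent token pair into one index term; single tokens stay indexed too.
    static List<String> withBigrams(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens); // keep the unigrams
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + "-" + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(withBigrams(Arrays.asList("to", "be", "or")));
        // [to, be, or, to-be, be-or]
    }
}
```

The cost is a larger index (every pair adds a term), which is why the real scheme restricts pairing to common terms read from analysis.common.terms.file.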
(A UML diagram of these classes and interfaces accompanied the original article at this point.)
3. Experiment Process
This experiment uses the Paoding word segmentation component and imdict-chinese-analyzer as the experimental tools.
3.1 Paoding plugin
The following experiment uses the Paoding word segmentation component:
1. Test preparation
Download the Paoding word segmentation component from http://code.google.com/p/paoding/
Unzip the downloaded file paoding-analysis-2.0.4-beta.zip to obtain paoding-analysis.jar.
This experiment assumes that the Paoding configuration is already complete.
2. Under ./src/plugin, create a directory analysis-zh for the word segmentation plugin, and a directory lib-paoding-analyzers to store paoding-analysis.jar.
3. Create two files and one folder in the analysis-zh directory:
(1) build.xml

<project name="analysis-zh" default="jar-core">
  <import file="../build-plugin.xml"/>
  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-paoding-analyzers"/>
  </target>
  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-paoding-analyzers/*.jar"/>
    </fileset>
  </path>
</project>
(2) plugin.xml

<plugin
    id="analysis-zh"
    name="Paoding analysis plug-in"
    provider-name="net.paoding">
  <runtime>
    <library name="paoding-analysis.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
    <import plugin="lib-lucene-analyzers"/>
  </requires>
  <extension id="org.apache.nutch.analysis.zh"
      name="Paoding analyzer"
      point="org.apache.nutch.analysis.NutchAnalyzer">
    <implementation id="PaodingAnalyzer"
        class="org.apache.nutch.analysis.zh.PaodingAnalyzer">
      <parameter name="lang" value="zh"/>
    </implementation>
  </extension>
</plugin>
(3) Create a new package org/apache/nutch/analysis/zh under the src/java folder (create these folders layer by layer), and create the file PaodingAnalyzer.java in the package:

package org.apache.nutch.analysis.zh;

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

public class PaodingAnalyzer extends Analyzer {
  private final static Analyzer ANALYZER =
      new net.paoding.analysis.analyzer.PaodingAnalyzer();

  /** Creates a new instance of PaodingAnalyzer */
  public PaodingAnalyzer() {}

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return ANALYZER.tokenStream(fieldName, reader);
  }
}
4. Create the following files in the lib-paoding-analyzers folder:
(1) build.xml:

<?xml version="1.0"?>
<project name="lib-paoding-analyzers" default="jar">
  <import file="../build-plugin.xml"/>
  <!--
   ! Override the compile and jar targets,
   ! since there is nothing to compile here.
   ! -->
  <target name="compile" depends="init"/>
  <target name="jar" depends="compile">
    <copy todir="${build.dir}" verbose="true">
      <fileset dir="./lib" includes="**/*.jar"/>
    </copy>
  </target>
</project>
(2) plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
    id="lib-paoding-analyzers"
    name="Paoding analyzers"
    provider-name="net.paoding">
  <runtime>
    <library name="lib-paoding-analyzers.jar">
      <export name="*"/>
    </library>
  </runtime>
</plugin>
(3) Create a new lib folder and put paoding-analysis.jar in it.
5. Modify nutch/conf/nutch-site.xml and add the following content so that Nutch loads the analysis-zh plugin we are about to build.
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>
  </description>
</property>
6. Modify org.apache.nutch.analysis.NutchDocumentAnalyzer in the directory src/java/org/apache/nutch/analysis/.
(1) After the field private static Analyzer CONTENT_ANALYZER; add the field private static Analyzer PAODING_ANALYZER;
(2) Change the constructor:

public NutchDocumentAnalyzer(Configuration conf) {
  this.conf = conf;
  CONTENT_ANALYZER = new ContentAnalyzer(conf);
  ANCHOR_ANALYZER = new AnchorAnalyzer();
}

to:

public NutchDocumentAnalyzer(Configuration conf) {
  this.conf = conf;
  CONTENT_ANALYZER = new ContentAnalyzer(conf);
  ANCHOR_ANALYZER = new AnchorAnalyzer();
  PAODING_ANALYZER = new PaodingAnalyzer();
}

(3) Change the method:

public TokenStream tokenStream(String fieldName, Reader reader) {
  Analyzer analyzer;
  if ("anchor".equals(fieldName))
    analyzer = ANCHOR_ANALYZER;
  else
    analyzer = CONTENT_ANALYZER;
  return analyzer.tokenStream(fieldName, reader);
}

to:

public TokenStream tokenStream(String fieldName, Reader reader) {
  Analyzer analyzer;
  // if ("anchor".equals(fieldName))
  //   analyzer = ANCHOR_ANALYZER;
  // else
  //   analyzer = CONTENT_ANALYZER;
  analyzer = PAODING_ANALYZER;
  return analyzer.tokenStream(fieldName, reader);
}
(4) Add the following statement in the import section at the top of the file to import the Paoding package:

import net.paoding.analysis.analyzer.PaodingAnalyzer;
7. Modify NutchAnalysis.jj
On line 130, change | <SIGRAM: <CJK> > to | <SIGRAM: (<CJK>)+ > so that a run of consecutive CJK characters is matched as a single token.
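The effect of this grammar change can be illustrated outside JavaCC: under the default rule each CJK character becomes its own token, while the modified rule keeps a run of consecutive CJK characters together so the plugged-in analyzer can segment it properly. A stdlib-only sketch of the two behaviors (not the generated parser code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SigramDemo {
    // Default rule <CJK>: every CJK character is emitted as a single-character token.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    // Modified rule (<CJK>)+: consecutive CJK characters form one token.
    static List<String> cjkRuns(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[\\u4E00-\\u9FFF]+").matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    public static void main(String[] args) {
        String text = "nutch中文分词test";
        System.out.println(unigrams(text)); // [中, 文, 分, 词]
        System.out.println(cjkRuns(text));  // [中文分词]
    }
}
```

Without this change, the query parser would split Chinese query strings into unigrams before the Paoding analyzer ever sees them, defeating the dictionary-based segmentation.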
8. Modify org.apache.nutch.indexer.lucene.LuceneWriter.
Change the code:

public void write(NutchDocument doc) throws IOException {
  final Document luceneDoc = createLuceneDoc(doc);
  final NutchAnalyzer analyzer = analyzerFactory.get(luceneDoc.get("lang"));
  if (Indexer.LOG.isDebugEnabled()) {
    Indexer.LOG.debug("Indexing [" + luceneDoc.get("url")
        + "] with analyzer " + analyzer + " (" + luceneDoc.get("lang")
        + ")");
  }
  writer.addDocument(luceneDoc, analyzer);
}

to:

public void write(NutchDocument doc) throws IOException {
  final Document luceneDoc = createLuceneDoc(doc);
  /* final NutchAnalyzer analyzer = analyzerFactory.get(luceneDoc.get("lang"));
  if (Indexer.LOG.isDebugEnabled()) {
    Indexer.LOG.debug("Indexing [" + luceneDoc.get("url")
        + "] with analyzer " + analyzer + " (" + luceneDoc.get("lang")
        + ")");
  }
  writer.addDocument(luceneDoc, analyzer); */
  String lang = luceneDoc.get("lang");
  if (lang == null) {
    lang = "zh";
  }
  final NutchAnalyzer analyzer = analyzerFactory.get(lang);
  if (Indexer.LOG.isDebugEnabled()) {
    Indexer.LOG.debug("Indexing [" + luceneDoc.get("url")
        + "] with analyzer " + analyzer + " (" + luceneDoc.get("lang") + ")");
  }
  writer.addDocument(luceneDoc, analyzer);
}
The code above only sets the default language to zh by adding a few lines; no other code is modified.
9. Modify org.apache.nutch.analysis.NutchAnalysis and add the following method:

final public Query parseByLucene(Configuration conf) throws ParseException { // bupo changed
  Query query = new Query(conf);
  if (queryString.length() > 0) { // Lucene word segmentation
    org.apache.lucene.queryParser.QueryParser parserLucene =
        new org.apache.lucene.queryParser.QueryParser(
            org.apache.lucene.util.Version.LUCENE_CURRENT, "", analyzer); // bupo changed
    org.apache.lucene.search.Query q = null;
    try {
      q = parserLucene.parse(queryString);
    } catch (org.apache.lucene.queryParser.ParseException e) {
      e.printStackTrace();
    }
    String termStrings = q.toString();
    if (termStrings.indexOf("\"") > -1)
      termStrings = termStrings.substring(1, termStrings.length() - 1);
    String[] terms = termStrings.split(" ");
    for (int i = 0; i < terms.length; i++) {
      String[] tems = { terms[i] };
      query.addRequiredPhrase(tems, Clause.DEFAULT_FIELD);
    }
  }
  return query;
}
Change the public static Query parseQuery() method at line 55 from:

public static Query parseQuery(String queryString, Configuration conf) throws IOException {
  return parseQuery(queryString, null, conf);
}

to:

public static Query parseQuery(String queryString, Analyzer analyzer, Configuration conf)
    throws IOException, ParseException { // bupo changed
  NutchAnalysis parser = new NutchAnalysis(queryString,
      (analyzer != null) ? analyzer : new NutchDocumentAnalyzer(conf));
  parser.queryString = queryString;
  parser.queryFilters = new QueryFilters(conf);
  return parser.parseByLucene(conf);
}
10. Modify Nutch's build.xml. In the <target name="war" depends="jar,compile,generate-docs"> target, add <include name="paoding-analysis.jar"/> inside the <lib></lib> element.
At line 150, change <target name="job" depends="compile"> to <target name="job" depends="compile,war">. After compilation, the nutch-1.2.job, nutch-1.2.war, and nutch-1.2.jar files are generated automatically under the build folder.
11. Modify the build.xml file under nutch-1.0/src/plugin and add:

<target name="deploy">
  <ant dir="analysis-zh" target="deploy"/>
  <ant dir="lib-paoding-analyzers" target="deploy"/>
  .......
</target>

<target name="clean">
  <ant dir="analysis-zh" target="clean"/>
  <ant dir="lib-paoding-analyzers" target="clean"/>
  .......
</target>
References:
1. Nutch Chinese word segmentation with Paoding
http://blog.csdn.net/mutou12456/archive/2010/04/01/5439935.aspx
2. Adding a Chinese word segmentation plugin to Nutch 1.2
http://www.mikkoo.info/?p=93
3. Chinese word segmentation in Nutch with JE
http://blog.csdn.net/oprah_7/archive/2011/03/09/6234296.aspx
4. Adding Chinese word segmentation to Nutch 1.2
http://www.hadoopor.com/thread-2805-1-1.html
5. On the path of the dic dictionary in Nutch 1.2 with Paoding
http://hi.baidu.com/oliverwinner/blog/item/cc8a22f1b0666706b17ec53e.html