Chinese Word Segmentation in Nutch

Source: Internet
Author: User

1. Introduction to Chinese Word Segmentation
There are currently roughly two approaches to adding Chinese word segmentation to Nutch:
First, modify the source code. With this approach, you directly change the Nutch classes that handle word segmentation so that they call an existing segmentation component.
Second, write a word segmentation plug-in. With this approach, you add a Chinese segmentation plug-in that follows the plug-in mechanism defined by Nutch.
Either method works. Thanks to an active open-source community, many segmentation components are now available, and both the source-modification approach and the plug-in approach rely on them. The main segmentation components are listed below:
1. CJKAnalyzer
A Chinese/Japanese/Korean analyzer that ships with Lucene.
2. ChineseAnalyzer
A Chinese analyzer that ships with Lucene.
3. IK_CAnalyzer (MIK_CAnalyzer)
A relatively simple dictionary-based analyzer built on Lucene.
4. Paoding
The well-known Paoding ("庖丁解牛") segmentation component, which offers high efficiency and high segmentation accuracy.
5. JE
A segmentation component written by community members, with good performance.
6. ICTCLAS
A family of segmentation tools from the Chinese Academy of Sciences, available in both open-source and commercial versions and based on hidden Markov models (HMMs). The main variants are as follows:
ICTCLAS_OpenSrc_C_Windows and ICTCLAS_OpenSrc_C_Linux, developed by Zhang Huaping and Liu Qun of the Institute of Computing Technology, Chinese Academy of Sciences.
SharpICTCLAS, a port of ICTCLAS to the .NET platform. It was developed by Lu Zhenyu of the School of Economics and Management, Hebei University of Technology, based on the free version of ICTCLAS, with parts of the original code rewritten and adjusted.
ictclas4j, an open-source Java segmentation project completed by sinboy based on FreeICTCLAS (developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences). It simplifies the original segmentation program and aims to give Chinese-segmentation enthusiasts a better opportunity to learn from the code.
imdict-chinese-analyzer, the intelligent Chinese segmentation module of the imdict intelligent dictionary, developed by Gao Xiaoping. Based on the hidden Markov model, it is a Java re-implementation of the ICTCLAS segmenter from the Institute of Computing Technology, Chinese Academy of Sciences, and can directly provide Chinese segmentation support for the Lucene search engine.
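Most of the dictionary-based components above implement some variant of maximum matching. The following is a minimal sketch of forward maximum matching in plain Java; the class name, toy dictionary, and sample text are illustrative assumptions, not code taken from any of those projects.

```java
import java.util.*;

/** Toy forward-maximum-matching segmenter (illustration only). */
public class MaxMatchSegmenter {
    private final Set<String> dict;
    private final int maxWordLen;

    public MaxMatchSegmenter(Set<String> dict) {
        this.dict = dict;
        int max = 1;
        for (String w : dict) max = Math.max(max, w.length());
        this.maxWordLen = max;
    }

    /** At each position, greedily take the longest dictionary word. */
    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxWordLen, text.length());
            String match = null;
            for (int j = end; j > i; j--) {          // try longest candidate first
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            if (match == null) match = text.substring(i, i + 1); // unknown char
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("中文", "分词", "中文分词"));
        MaxMatchSegmenter seg = new MaxMatchSegmenter(dict);
        System.out.println(seg.segment("中文分词器")); // [中文分词, 器]
    }
}
```

Real components add a much larger dictionary, backward matching, ambiguity resolution, or statistical models (as ICTCLAS does with HMMs), but the greedy longest-match loop is the common core.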
2. Word Segmentation Structure Analysis
Before experimenting, you need to understand how word segmentation is structured in Nutch. This article examines Nutch's org.apache.nutch.analysis package; most of its classes are involved in parsing and tokenizing the text of the web pages that Nutch crawls.

At the bottom of Nutch's word segmentation is Lucene's abstract Analyzer class, located in the org.apache.lucene.analysis package. NutchAnalyzer extends Analyzer and implements the Configurable and Pluggable interfaces. The Analyzer abstract class defines a public abstract method, tokenStream(String fieldName, Reader reader), which returns a TokenStream and is used to analyze text; in its subclasses this method implements the strategy and algorithm for extracting indexable terms from text. The returned TokenStream is itself an abstract class that enumerates the sequence of tokens in a text or query phrase; in Lucene, its concrete subclasses descend from Tokenizer and TokenFilter.
NutchAnalyzer is the extension point in Nutch for analyzing text: every plug-in used to parse text must implement this extension point. A typical Analyzer extension first creates a Tokenizer (org.apache.lucene.analysis.Tokenizer), which breaks the stream read from the Reader into raw tokens (org.apache.lucene.analysis.Token); after the Tokenizer has split the stream, one or more TokenFilters remove meaningless tokens from the result.
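The tokenizer-plus-filter chain described above can be sketched in plain Java, without the real Lucene classes. The two static methods below stand in for a Tokenizer and a stop-word TokenFilter; all names and the stop-word list are illustrative assumptions.

```java
import java.util.*;

/** Plain-Java sketch of the Lucene tokenizer + token-filter pipeline. */
public class PipelineSketch {

    /** Tokenizer step: break the input text into raw tokens. */
    public static List<String> tokenize(String text) {
        return new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));
    }

    /** TokenFilter step: drop meaningless (stop-word) tokens. */
    public static List<String> stopFilter(List<String> tokens, Set<String> stop) {
        List<String> out = new ArrayList<>();
        for (String t : tokens)
            if (!stop.contains(t.toLowerCase())) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a"));
        // Chain the stages the way an Analyzer chains Tokenizer and TokenFilters.
        System.out.println(stopFilter(tokenize("the quick fox"), stop)); // [quick, fox]
    }
}
```

In real Lucene the stages are streaming (each filter wraps a TokenStream and pulls tokens lazily) rather than list-to-list, but the composition order is the same.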
NutchDocumentAnalyzer extends NutchAnalyzer. It contains three static private inner classes: ContentAnalyzer, which extends Analyzer (org.apache.lucene.analysis.Analyzer); AnchorFilter, which extends TokenFilter (org.apache.lucene.analysis.TokenFilter); and AnchorAnalyzer, which also extends Analyzer. The CommonGrams class (org.apache.nutch.analysis) constructs an n-grams segmentation scheme: because term frequency must be taken into account in the index, it implements an optimization for phrase queries over n-grams. Under this scheme, single terms are also indexed; during processing, Token (org.apache.lucene.analysis.Token) is used and the analysis.common.terms.file configuration property is consulted.
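The idea behind CommonGrams can be shown with a small sketch: alongside the single terms, adjacent token pairs involving a "common" (high-frequency) term are also indexed as joined bigram terms, which speeds up phrase queries. The joining convention, the common-terms set, and the class name here are illustrative assumptions, not the actual CommonGrams implementation.

```java
import java.util.*;

/** Sketch of the n-gram optimization behind CommonGrams (illustration only). */
public class BigramSketch {

    /** Return the original tokens plus joined bigrams for common-term pairs. */
    public static List<String> withBigrams(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>(tokens);   // single terms stay indexed
        for (int i = 0; i + 1 < tokens.size(); i++) {
            // Join a pair into one indexed term when either member is common.
            if (common.contains(tokens.get(i)) || common.contains(tokens.get(i + 1)))
                out.add(tokens.get(i) + "-" + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Collections.singleton("the"));
        System.out.println(withBigrams(Arrays.asList("the", "quick", "fox"), common));
        // [the, quick, fox, the-quick]
    }
}
```

A phrase query such as "the quick" can then match the single bigram term instead of intersecting two posting lists, one of which (for the common term) would be very long.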
(The original article includes a UML diagram of these classes and interfaces at this point; it is not reproduced here.)

Building on the analysis above, you can study the rest of Nutch's structure to work out how to add a Chinese word segmentation method. The analysis shows that Nutch's word segmentation makes extensive use of Lucene's basic abstract classes and interfaces, which is no surprise given that Doug Cutting leads both projects; Lucene's sound architecture also lays the foundation for applications to extend and build on it.

3. Experiment Process

3.1 JE Plug-in (assuming Nutch is already configured)
1. Preparation
Download the JE analysis jar, je-analysis-1.5.1.jar:

http://download.csdn.net/source/637846

2. Under src/plugin, create a directory analysis-zh to hold the segmentation plug-in, and create a directory lib-je-analysis to hold je-analysis-1.5.1.jar.
3. In the analysis-zh directory, create the build.xml and plugin.xml files and a src/java folder:
build.xml:

<project name="analysis-zh" default="jar-core">

  <import file="../build-plugin.xml"/>

  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-je-analysis"/>
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-je-analysis/*.jar"/>
    </fileset>
  </path>

</project>
plugin.xml:

<plugin
   id="analysis-zh"
   name="JE Analysis Plug-in"
   version="1.5.1"
   provider-name="je.analysis">

   <runtime>
      <library name="analysis-zh.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-je-analysis"/>
   </requires>

   <extension id="org.apache.nutch.analysis.zh"
              name="ChineseAnalyzer"
              point="org.apache.nutch.analysis.NutchAnalyzer">

      <implementation id="org.apache.nutch.analysis.zh.ChineseAnalyzer"
                      class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
         <parameter name="lang" value="zh"/>
      </implementation>

   </extension>

</plugin>
In the src/java folder, create a new package, org/apache/nutch/analysis/zh, and in it create the file ChineseAnalyzer.java:

package org.apache.nutch.analysis.zh;

// JDK imports
import java.io.Reader;

// Lucene imports
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

// Nutch imports
import org.apache.nutch.analysis.NutchAnalyzer;

public class ChineseAnalyzer extends NutchAnalyzer {

    private final static Analyzer ANALYZER =
        new jeasy.analysis.MMAnalyzer();

    /** Creates a new instance of ChineseAnalyzer */
    public ChineseAnalyzer() {}

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return ANALYZER.tokenStream(fieldName, reader);
    }
}

Create the following files in the lib-je-analysis folder:
build.xml:

<project name="lib-je-analysis" default="jar">

  <import file="../build-plugin.xml"/>

  <!--
   ! Override the compile and jar targets,
   ! since there is nothing to compile here.
   !-->
  <target name="compile" depends="init"/>

  <target name="jar" depends="compile">
    <copy todir="${build.dir}" verbose="true">
      <fileset dir="./lib" includes="**/*.jar"/>
    </copy>
  </target>

</project>
plugin.xml:

<plugin
   id="lib-je-analysis"
   name="JE Analysis"
   version="1.5.1"
   provider-name="je.analysis">

   <runtime>
      <library name="je-analysis-1.5.1.jar">
         <export name="*"/>
      </library>
   </runtime>

</plugin>

Create a new lib folder inside it, place je-analysis-1.5.1.jar there, and add the jar to the project's referenced libraries.

In addition, modify NutchDocumentAnalyzer as follows:

public class NutchDocumentAnalyzer extends NutchAnalyzer {

    /** Analyzer used to index textual content. */
    private static Analyzer CONTENT_ANALYZER;

    // Anchor analysis: like content analysis, but leaves a gap between
    // anchors to inhibit cross-anchor phrase matching.
    /** The number of unused term positions between anchors in the anchor field. */
    public static final int INTER_ANCHOR_GAP = 4;

    /** Analyzer used to analyze anchors. */
    private static Analyzer ANCHOR_ANALYZER;

    private static Analyzer JE_ANALYZER;

    /** @param conf */
    public NutchDocumentAnalyzer(Configuration conf) {
        this.conf = conf;
        CONTENT_ANALYZER = new ContentAnalyzer(conf);
        ANCHOR_ANALYZER = new AnchorAnalyzer();
        JE_ANALYZER = new org.apache.nutch.analysis.zh.ChineseAnalyzer();
    }

    /** Analyzer used to index textual content. */
    private static class ContentAnalyzer extends Analyzer {
        private CommonGrams commonGrams;

        public ContentAnalyzer(Configuration conf) {
            this.commonGrams = new CommonGrams(conf);
        }

        /** Constructs a {@link NutchDocumentTokenizer}. */
        public TokenStream tokenStream(String field, Reader reader) {
            return this.commonGrams.getFilter(
                new NutchDocumentTokenizer(reader), field);
        }
    }

    private static class AnchorFilter extends TokenFilter {
        private boolean first = true;

        public AnchorFilter(TokenStream input) {
            super(input);
        }

        public final Token next() throws IOException {
            Token result = input.next();
            if (result == null)
                return result;
            if (first) {
                result.setPositionIncrement(INTER_ANCHOR_GAP);
                first = false;
            }
            return result;
        }
    }

    private static class AnchorAnalyzer extends Analyzer {
        public final TokenStream tokenStream(String fieldName, Reader reader) {
            return new AnchorFilter(CONTENT_ANALYZER.tokenStream(fieldName, reader));
        }
    }

    /** Returns a new token stream for text from the named field. */
    public TokenStream tokenStream(String fieldName, Reader reader) {
        Analyzer analyzer;
        /*
        if ("anchor".equals(fieldName))
            analyzer = ANCHOR_ANALYZER;
        else
            analyzer = CONTENT_ANALYZER;
        */
        analyzer = JE_ANALYZER;

        return analyzer.tokenStream(fieldName, reader);
    }
}

4. Edit build.xml under src/plugin and add the following targets:

<ant dir="analysis-zh" target="deploy"/>
<ant dir="lib-je-analysis" target="deploy"/>
<ant dir="analysis-zh" target="clean"/>
<ant dir="lib-je-analysis" target="clean"/>
5. In NutchAnalysis.jj, change the query token rule to: | <SIGRAM: (<CJK>)+ >. Then recompile with JavaCC and replace the corresponding files. Running the JavaCC tool on the .jj file generates seven Java files in total: CharStream.java, NutchAnalysis.java, NutchAnalysisConstants.java, NutchAnalysisTokenManager.java, ParseException.java, Token.java, and TokenMgrError.java. Copy them into the src/java/org/apache/nutch/analysis folder to replace the original files. At this point compilation will still fail: you need to declare the thrown ParseException in the corresponding method in NutchAnalysis.java.
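The effect of the (<CJK>)+ rule is that the query parser emits an entire run of consecutive CJK characters as one SIGRAM token (which the Chinese analyzer then segments), instead of one token per character. A plain-Java sketch of that grouping, with an illustrative class name and a simplified CJK test covering only the main Unified Ideographs block:

```java
import java.util.*;

/** Groups consecutive CJK characters into single tokens, mimicking the
 *  effect of the <SIGRAM: (<CJK>)+> rule (illustration only). */
public class SigramSketch {

    /** Simplified CJK check: main CJK Unified Ideographs block only. */
    public static boolean isCjk(char c) {
        return Character.UnicodeBlock.of(c)
            == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static List<String> cjkRuns(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isCjk(c)) {
                run.append(c);                       // extend the current CJK run
            } else {
                if (run.length() > 0) {              // flush the finished run
                    tokens.add(run.toString());
                    run.setLength(0);
                }
                if (!Character.isWhitespace(c))      // non-CJK chars: one per token
                    tokens.add(String.valueOf(c));
            }
        }
        if (run.length() > 0) tokens.add(run.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(cjkRuns("中文分词")); // [中文分词] — one run, one token
    }
}
```

Without the +, each CJK character would surface as its own token and the analyzer would never see enough context to segment multi-character words.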

public static Query parseQuery(String queryString, Analyzer analyzer,
        Configuration conf) throws IOException, ParseException {
    NutchAnalysis parser = new NutchAnalysis(
        queryString,
        (analyzer != null) ? analyzer : new NutchDocumentAnalyzer(conf));
    parser.queryString = queryString;
    parser.queryFilters = new QueryFilters(conf);
    return parser.parse(conf);
}

 

How do you install and use JavaCC?

Download JavaCC from https://javacc.dev.java.net/, unpack it, and add JavaCC's main directory to your environment variables. Open a command line and type javacc; if the command is recognized and JavaCC prints its usage message, the installation succeeded.

Change to the directory containing NutchAnalysis.jj and run javacc NutchAnalysis.jj to generate the seven Java files.

6. Recompile with Ant.

 

(1) Copy ${nutch-1.0}/build/nutch-1.0.job over the file of the same name under ${nutch-1.0}.

(2) Package the ${nutch-1.0}/build/classes folder into a jar file. On the command line, change into the classes directory and run: jar cvf nutch-1.0.jar . Then use the result to replace the nutch-1.0.jar under ${nutch-1.0}.

(3) Copy nutch-1.0.jar, je-analysis-1.5.1.jar, and analysis-zh.jar into the web application's WEB-INF/lib directory in Tomcat.

 

7. Start the crawl, then use Luke (http://download.csdn.net/source/634007) to inspect the result: File -> Open Lucene Index, and open the ${nutch}/{crawl directory}/index directory. In the new index, single characters have been replaced by segmented phrases.
