Getting started with Nutch (4) -- adding Chinese word segmentation

Source: Internet
Author: User
Note: I am also a beginner. If you find any errors, please let me know. Thank you! Some of the material on javacc and NutchAnalysis.jj was found on the Internet; I apologize if this causes any offense.

Basic Information

This article describes how to add Chinese word segmentation to Nutch, using IKAnalyzer as the example.

 

Preparations

1. Go to http://nutch.apache.org/ and download nutch-1.0.tar.gz. After downloading, simply extract the archive.

2. JavaCC (the Java Compiler Compiler), which is needed to regenerate the parser from NutchAnalysis.jj.

3. Go to http://ant.apache.org/ and download Ant, extract it, and add the bin directory of the Ant installation to the PATH environment variable. Type ant in a Windows command prompt. If Windows does not report an unknown command, Ant is installed correctly.

4. IKAnalyzer: download IKAnalyzer from http://code.google.com/p/ik-analyzer/downloads/list. This example uses version 3.1.1 GA.

 

Add Chinese word segmentation for Nutch

1. Enable the Chinese word segmentation feature of Nutch.

Find the file src\java\org\apache\nutch\analysis\NutchAnalysis.jj in the Nutch directory and copy it to another directory, preferably one that contains no .java files, so that you can modify it freely. Modify the .jj file as follows:

Java code
  Line 130:
  | <SIGRAM: <CJK> >
  Change to:
  | <SIGRAM: (<CJK>)+ >
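
To illustrate what the change from <CJK> to (<CJK>)+ does: the original rule emits one token per CJK character, while the modified rule emits a whole run of consecutive CJK characters as a single token, presumably so that a segmenter such as IKAnalyzer can split it into words later. The sketch below mimics this with java.util.regex rather than javacc. The class name SigramDemo and the character range used for CJK are assumptions made for illustration; the real <CJK> definition in NutchAnalysis.jj covers more ranges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SigramDemo {
    // Assumption: a single range of CJK ideographs stands in for the
    // <CJK> character class defined in NutchAnalysis.jj.
    static final String CJK = "[\\u4e00-\\u9fff]";

    // Returns every match of the given regex in the text, in order.
    static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "Nutch" followed by four Chinese characters, written as
        // Unicode escapes so the source file encoding does not matter.
        String text = "Nutch\u4e2d\u6587\u5206\u8bcd";

        // Like <SIGRAM: <CJK> >: one token per CJK character (4 tokens).
        System.out.println(tokens(CJK, text).size());

        // Like <SIGRAM: (<CJK>)+ >: one token per run of CJK characters (1 token).
        System.out.println(tokens("(" + CJK + ")+", text).size());
    }
}
```

With the second pattern the four characters stay together as one token, which is the unit a word segmenter needs as input.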

In the Windows command prompt, change to that directory and run the following command:

Java code
  javacc NutchAnalysis.jj

Seven .java files are generated. Among them, line 58 of NutchAnalysis.java needs attention: the exception raised there must either be declared with throws or caught. Then use the seven generated files to replace the files of the same names under the src\java\org\apache\nutch\analysis directory.
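
Handling the exception at line 58 comes down to one of two standard Java options for a checked exception: declare it with throws, or catch it where the call is made. The following is a minimal sketch of both options, not the code from NutchAnalysis.java. The class CheckedExceptionDemo, the exception type DemoParseException, and all method names are hypothetical stand-ins.

```java
public class CheckedExceptionDemo {
    // Hypothetical stand-in for a checked exception declared by a
    // generated parser; this is not the real Nutch or javacc class.
    static class DemoParseException extends Exception {
        DemoParseException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for a generated parser method.
    static String parse(String query) throws DemoParseException {
        if (query.isEmpty()) {
            throw new DemoParseException("empty query");
        }
        return query.trim();
    }

    // Option 1: declare the exception and let the caller deal with it.
    static String parseDeclaring(String query) throws DemoParseException {
        return parse(query);
    }

    // Option 2: catch the exception at the call site.
    static String parseCatching(String query) {
        try {
            return parse(query);
        } catch (DemoParseException e) {
            // A real caller would log or rethrow; returning null keeps the sketch short.
            return null;
        }
    }

    public static void main(String[] args) throws DemoParseException {
        System.out.println(parseDeclaring(" nutch "));
        System.out.println(parseCatching(""));
    }
}
```

Declaring the exception changes the method signature, so every caller must then handle it as well; catching it keeps the change local to one method.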

 

2. Add IKAnalyzer.

Copy the downloaded IKAnalyzer3.1.1Stable.jar to the lib folder under the Nutch directory. Then, in the src\java\org\apache\nutch\analysis directory, edit the .java file that defines the tokenStream method: add the import statement and modify the tokenStream method. The code is as follows:

Java code
  import org.wltea.analyzer.lucene.IKAnalyzer;

 

Java code
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Delegate tokenization of every field to IKAnalyzer.
    Analyzer analyzer = new IKAnalyzer();
    return analyzer.tokenStream(fieldName, reader);
  }

 

3. Recompile the program.

Modify the build.xml file in the Nutch directory and add the following line after line 3:

Java code
  <include name="IKAnalyzer3.1.1Stable.jar"/>

In the command prompt, change to the Nutch directory and run the ant command. The following error occurs:

Java code
  Buildfile: build.xml
  init:
  BUILD FAILED
  D:\nuttch\nutch-1.0\build.xml:62: Specify at least one source--a file or resource collection.
  Total time: 0 seconds

There are two solutions:

(1) Download the missing config/*.template files from SVN.
(2) Modify build.xml and remove the lines up to line 64 that the error above points to, so that the template files are no longer needed.

 

Replace the nutch-1.0.job file in the Nutch directory with the nutch-1.0.job file that the build generated in the build folder.

Package the compiled classes under build\classes into nutch-1.0.jar and use it to replace the nutch-1.0.jar file in the Nutch directory. The packaging command, run from inside build\classes, is as follows:

Java code
  jar cvf nutch-1.0.jar org

Inside nutch-1.0.war, replace nutch-1.0.jar with the nutch-1.0.jar you just generated, and add IKAnalyzer3.1.1Stable.jar.

 

At this point, all the work is done. For more information about crawling and searching, see my other two articles:

Getting started with Nutch (1) -- Preparation and Intranet crawling

Introduction to using Nutch (2) -- Internet crawling

 

Comment

Third floor
Commanderhyk 2010-07-15
I just looked at the code generated by Nutch, and it seems that some attributes are missing. I tried adding them:
Analyzer analyzer;
analyzer = new IKAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
tokenStream.addAttribute(TypeAttribute.class); // added
tokenStream.addAttribute(FlagsAttribute.class); // added
tokenStream.addAttribute(PayloadAttribute.class); // added
tokenStream.addAttribute(PositionIncrementAttribute.class); // added
return tokenStream;

It passed my test and I can query the data, but I don't know why it works.

2nd floor
Commanderhyk 2010-07-15
Nutch recently published its 1.1 release. I got the basic configuration examples running and am now learning how to replace the tokenizer. I deployed successfully following the method above, but an exception is thrown at query time:
java.lang.IllegalArgumentException: This AttributeSource does not have the attribute 'org.apache.lucene.analysis.tokenattributes.TermAttribute'.
I started with 3.2.3 and later switched to 3.1.1. I do not know how to solve this problem and hope you can help. Thank you.
Someone on javaeye reported a similar error and said it could be solved, but gave no solution. Here is the link:

http://www.iteye.com/topic/476897?page=2

1st floor
Softkid 2010-05-31
Pay attention to the version of the Chinese word segmenter. It only worked for me with IKAnalyzer3.1.1Stable. I also tried 3.2 and the latest version, but neither worked.
