Out of the box, nutch splits Chinese text character by character rather than word by word. We can add a Chinese word segmentation component such as IKAnalyzer to fix this. I read many tutorials on the internet and none of them worked for me as written. In the end I combined the content of several tutorials and finally succeeded.
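To make the difference concrete, here is a minimal sketch that contrasts per-character splitting with a simple dictionary-based method (forward maximum matching) on the phrase 中文分词 ("Chinese word segmentation"). The class name, the tiny dictionary, and the matching strategy are assumptions for illustration only; this is not IKAnalyzer's actual algorithm. The Chinese strings are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SegmentationSketch {

    // Per-character splitting: every character becomes its own token.
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Forward maximum matching: at each position take the longest dictionary
    // word (up to maxLen characters); fall back to a single character.
    static List<String> forwardMaxMatch(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical two-word dictionary: "zhongwen" (Chinese), "fenci" (segmentation).
        Set<String> dict = new HashSet<>(Arrays.asList("\u4e2d\u6587", "\u5206\u8bcd"));
        String text = "\u4e2d\u6587\u5206\u8bcd";
        System.out.println(perCharacter(text));             // four single-character tokens
        System.out.println(forwardMaxMatch(text, dict, 4)); // two word tokens
    }
}
```

With per-character tokens, a search matches any page containing the individual characters; with word tokens, the index holds whole words, which is what the rest of this post sets up.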
First, a few notes to make the rest easier to follow:
(1) Currently there are roughly two ways to add Chinese word segmentation to nutch:
First, modify the source code. With this approach you directly change the classes nutch uses for tokenization and call an existing word segmentation component to do the work.
Second, write a word segmentation plug-in. With this approach you override the rules or add a Chinese word segmentation plug-in that follows the plug-in conventions defined by nutch.
Either method works. Thanks to an active open-source community, many word segmentation components are available, and both the source code approach and the plug-in approach depend on them. Examples are IK, JE, and Paoding.
(2) NutchAnalysis.jj is used at search time; NutchDocumentAnalyzer.java is used at indexing time.
(3) The javacc and ant tools are required. Javacc is used to compile NutchAnalysis.jj. It is best to copy the file to another directory for compilation and, after compiling, copy the seven generated files back to the original directory. If it is compiled in the original directory, only four files are generated.
(4) build.xml is the ant configuration file.
(5) Ant and javacc are installed in much the same way: unpack the archive, add its bin directory to the system PATH, and restart the computer. For details, see the installation and usage guides for ant and javacc.
The specific process is as follows:
I. Preparations:
Copy IKAnalyzer3.2.8.jar into the nutch/lib directory.
II. Code modification:
1. NutchAnalysis.jj
In the directory nutch/src/java/org/apache/nutch/analysis
Find | <SIGRAM: <CJK> >, which splits text into single characters, and change it to | <SIGRAM: (<CJK>)+ >
Use javacc to regenerate the source code from NutchAnalysis.jj, then overwrite the files in the src/java/org/apache/nutch/analysis package with all the generated Java sources (7 files).
Javacc usage: open a command prompt (cmd), switch to the directory that contains NutchAnalysis.jj (preferably copy it to another directory first, such as D:), and enter the command
javacc NutchAnalysis.jj
Seven files are generated.
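The effect of changing <CJK> to (<CJK>)+ can be illustrated with an analogy in Java regular expressions: a token that matches exactly one character versus a token that matches a run of one or more. This is only a sketch of the idea, not JavaCC itself, and it assumes the \p{IsHan} script class as a stand-in for the grammar's <CJK> definition (the real definition in NutchAnalysis.jj may cover other ranges). The sample text contains the word 搜索引擎 ("search engine") as Unicode escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SigramSketch {

    // Analogue of <SIGRAM: <CJK> >: each Han character is its own token.
    static final Pattern SINGLE = Pattern.compile("\\p{IsHan}");

    // Analogue of <SIGRAM: (<CJK>)+ >: a whole run of Han characters is one token.
    static final Pattern RUN = Pattern.compile("\\p{IsHan}+");

    // Collects every match of the pattern in the text, in order.
    static List<String> matches(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "nutch \u641c\u7d22\u5f15\u64ce 1.2";
        System.out.println(matches(SINGLE, query)); // four one-character tokens
        System.out.println(matches(RUN, query));    // one four-character token
    }
}
```

After the grammar change the query parser no longer breaks a Chinese run apart itself, so the run reaches the analyzer intact and the word segmentation component can split it into words.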
2. NutchAnalysis.java
In the directory nutch/src/java/org/apache/nutch/analysis
(1) Add the following line to the import section (I am not sure whether this step is required):
import org.wltea.analyzer.lucene.IKTokenizer;
(2) Add ParseException to the throws clause in two places; otherwise ant reports an error during the build. The code after the change:
    public static Query parseQuery(String queryString, Configuration conf) throws IOException, ParseException {
        return parseQuery(queryString, null, conf);
    }

    public static Query parseQuery(String queryString, Analyzer analyzer, Configuration conf)
            throws IOException, ParseException {
        NutchAnalysis parser = new NutchAnalysis(
                queryString, (analyzer != null) ? analyzer : new NutchDocumentAnalyzer(conf));
        parser.queryString = queryString;
        parser.queryFilters = new QueryFilters(conf);
        return parser.parse(conf);
    }
3. NutchDocumentAnalyzer.java
(1) Import the IK package:
import org.wltea.analyzer.lucene.IKAnalyzer; // tjt update
import org.apache.lucene.analysis.tokenattributes.*;
(2) Modify the function public TokenStream tokenStream(String fieldName, Reader reader) as follows:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Delegate all tokenization to IKAnalyzer instead of nutch's own analyzers.
        Analyzer analyzer = new org.wltea.analyzer.lucene.IKAnalyzer();
        return analyzer.tokenStream(fieldName, reader);
    }
4. nutch/build.xml
(1) Inside <target name="war" depends="jar,compile,generate-docs"> ... </target>, in the <lib> ... </lib> element, after <include name="log4j-*.jar"/>, add the following line so that the word segmentation jar is packaged when the war file is built. Make sure the version number matches your jar (here IKAnalyzer3.2.8.jar).
<include name="IKAnalyzer3.2.8.jar"/>
(2) Change <target name="job" depends="compile"> to <target name="job" depends="compile, war">. This way nutch-1.2.job, nutch-1.2.war, and nutch-1.2.jar are all generated automatically under the build folder after compilation. (Note: without this change, running ant war and ant jar also generates nutch-1.2.war and nutch-1.2.jar.)
5. Ant
Open a command prompt (cmd), switch to the nutch directory, and run the ant command to start the build. When it finishes, a build directory has been created under the nutch directory.
6. File replacement
(1) Replace the nutch-1.2.job in the nutch directory with build/nutch-1.2.job
(2) Replace the nutch-1.2.jar in the nutch directory with build/nutch-1.2.jar
(3) Replace the nutch-1.2.war in the nutch directory with build/nutch-1.2.war (this step is optional)
Note: if ant did not generate nutch-1.2.jar and nutch-1.2.war under the build directory, run ant jar and ant war on the command line to generate them.
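Before copying the files around, it can be worth confirming that the build really packaged the IKAnalyzer jar. The following is a minimal sketch that scans a zip-format archive (a .war or .job file) for an entry whose name ends with a given suffix. The class and method names are hypothetical, and the default path build/nutch-1.2.war is only an assumption based on the file names used in this post.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ArchiveCheck {

    // Returns true if any entry name in the zip-format archive ends with the suffix,
    // e.g. "WEB-INF/lib/IKAnalyzer3.2.8.jar" matches the suffix "IKAnalyzer3.2.8.jar".
    static boolean containsEntry(Path archive, String suffix) throws IOException {
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                if (entries.nextElement().getName().endsWith(suffix)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass another path as the first argument.
        Path archive = Paths.get(args.length > 0 ? args[0] : "build/nutch-1.2.war");
        if (!Files.isRegularFile(archive)) {
            System.out.println("not found: " + archive);
            return;
        }
        System.out.println(archive + " contains the IKAnalyzer jar: "
                + containsEntry(archive, "IKAnalyzer3.2.8.jar"));
    }
}
```

If the check prints false for the war file, the <include> line in build.xml is the first thing to re-check.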
7. Crawl and index again
Note that the environment must be configured first (conf/nutch-site.xml, crawl-urlfilter.txt, the urls directory, and so on). For details, see the environment configuration notes.
bin/nutch crawl urls -dir csdn -threads 4 -depth 2 -topN 30    (crawl)
bin/nutch org.apache.nutch.searcher.NutchBean csdn    (search)
8. View results
Use Luke to inspect the index. If the indexed terms are now words rather than single characters, segmentation worked. You can also use the NutchBean command that ships with nutch to search for a word contained in the crawled pages. If any results come back, indexing worked (as shown in the figure):
bin/nutch org.apache.nutch.searcher.NutchBean 'Engine'
Note: the following configuration is required in conf/nutch-site.xml, where E:/nutch/csdn is the directory that holds the crawl results to search:
<property>
<name>searcher.dir</name>
<value>E:/nutch/csdn</value>
<description></description>
</property>
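If searches return nothing, a quick sanity check is to read searcher.dir back out of the configuration file. Below is a minimal sketch using the JDK's DOM parser. The class and method names are hypothetical, and the <configuration> root element in the sample is an assumption about the surrounding file, which this post does not show. Note that XML element names are case-sensitive, so the lookup only finds lowercase property, name, and value elements.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SiteConfigCheck {

    // Returns the <value> of the <property> whose <name> equals the given name,
    // or null if no such property exists.
    static String propertyValue(InputStream xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            if (name.equals(childText(prop, "name"))) {
                return childText(prop, "value");
            }
        }
        return null;
    }

    // Trimmed text of the first descendant element with the given tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Sample content mirroring the snippet above; the root element is assumed.
        String sample = "<configuration><property><name>searcher.dir</name>"
                + "<value>E:/nutch/csdn</value><description></description>"
                + "</property></configuration>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(propertyValue(in, "searcher.dir"));
    }
}
```

To check a real file, pass a java.io.FileInputStream opened on conf/nutch-site.xml instead of the in-memory sample.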
9. Tomcat search
(1) Copy the newly generated nutch-1.2.war to the Tomcat 7.0 webapps directory. After the Tomcat service starts, it automatically unpacks the war into a folder under that directory.
(2) Copy the new nutch-1.2.jar and the word segmentation package (IKAnalyzer3.2.8.jar) into the web application's WEB-INF/lib directory under Tomcat.
(3) Open http://localhost:8080/nutch and run a test search.
Note: remember to configure tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml
Other notes:
(1) Online tutorials split the NutchDocumentAnalyzer modification into several steps. At first I did not know which version was correct. After checking them carefully I found they are all equivalent; some simply collapse the variable definitions into a single line. http://blog.csdn.net/laigood12345/archive/2010/12/12/6071046.aspx
(2) The steps above modify the source code directly. For the plug-in approach, see for example http://blog.csdn.net/oprah_7/archive/2011/03/09/6234296.aspx
Also refer to the following:
Making nutch support Chinese word segmentation, method: Nutch 1.0 + IK-Analyzer 3.1.6 http://trac.nchc.org.tw/cloud/wiki/waue/2010/0715
If the page is unreachable, you can view a copy in my Baidu Library.
Optional: add a custom dictionary.
Unpack nutch-*.job with a zip tool and put the following two files into nutch-*.job:
IKAnalyzer.cfg.xml
<properties>
<comment>IK Analyzer</comment>
<entry key="ext_dict">/mydic.dic</entry>
</properties>
mydic.dic
Guojia Expressway
National Expressway Network
National Expressway Network and Computing Center
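Instead of a zip tool, the two files can also be written into the .job archive from Java through the JDK's zip file system. The sketch below is one way to do it under stated assumptions: the class and method names are hypothetical, the configuration text simply mirrors the snippet above (a real IKAnalyzer.cfg.xml may need an XML header that this post does not show), and the dictionary is written as UTF-8 with one word per line.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JobDictionaryPatch {

    // Configuration text copied from the post; points IK at /mydic.dic in the archive root.
    static final String CFG =
            "<properties>\n"
            + "<comment>IK Analyzer</comment>\n"
            + "<entry key=\"ext_dict\">/mydic.dic</entry>\n"
            + "</properties>\n";

    // Writes IKAnalyzer.cfg.xml and mydic.dic into the root of the zip-format archive,
    // replacing them if present and creating the archive if it does not exist.
    static void addDictionary(Path jobFile, List<String> words) throws IOException {
        Map<String, String> env = new HashMap<>();
        env.put("create", "true");
        URI uri = URI.create("jar:" + jobFile.toUri());
        try (FileSystem zip = FileSystems.newFileSystem(uri, env)) {
            Files.write(zip.getPath("/IKAnalyzer.cfg.xml"), CFG.getBytes(StandardCharsets.UTF_8));
            Files.write(zip.getPath("/mydic.dic"), words, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: JobDictionaryPatch <job-file> <word> [<word> ...]");
            return;
        }
        // First argument is the archive, the remaining arguments are dictionary words.
        addDictionary(Paths.get(args[0]), Arrays.asList(args).subList(1, args.length));
    }
}
```

Work on a copy of the .job file first, since the archive is modified in place.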
Slightly cumbersome or incorrect information found online:
(1) Package build/classes into nutch-1.2.jar
cd $nutch-1.0/build/classes
jar cvf nutch-1.0-ika.jar .
Or (jar cvf E:/nutch/build/classes)
This is cumbersome. You can simply run the ant jar command, or follow the build.xml change in section II, step 4 (2).
(2) Put the repackaged nutch-1.0.jar, together with nutch-1.0.job and IKAnalyzer3.1.6, into your original nutch search web application. After restarting Tomcat you can enjoy word-segmented results directly. (Incorrect)
(3) Unpack the /org/wltea/analyzer/dic/ folder from IKAnalyzer3.1.6GA.jar and put the dictionary you want into it; you can model it on main.dic in the same folder. (Incorrect, does not work)
References:
1. Making nutch support Chinese word segmentation, method: Nutch 1.0 + IK-Analyzer 3.1.6
http://trac.nchc.org.tw/cloud/wiki/waue/2010/0715
2. Adding IKAnalyzer Chinese word segmentation to nutch 1.2
http://blog.csdn.net/laigood12345/archive/2010/12/12/6071046.aspx
3. Chinese word segmentation for nutch (JE word segmentation, plug-in mode)
http://blog.csdn.net/oprah_7/archive/2011/03/09/6234296.aspx
4. Chinese word segmentation for nutch with Paoding (plug-in mode)
http://blog.csdn.net/mutou12456/archive/2010/04/01/5439935.aspx
5. Adding JE Chinese word segmentation to nutch 1.0 (modifying the source code)
http://yjiezhao.blog.163.com/blog/static/1152322392009101983837179/