A. Install SVN and check out the latest source from Apache (http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.9/). The checked-out source can be compiled with ant; the directly downloaded package cannot be built with ant.
B. Install ant. Download the latest version of the build tool from http://ant.apache.org/
C. Install javacc from https://javacc.dev.java.net/
D. Add D:/javacc/bin and D:/ANT/bin to the PATH environment variable.
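Before building, it can help to confirm that both tools are actually visible on PATH. The following is only a minimal sketch: PathCheck and findOnPath are hypothetical names, not part of ant, javacc, or Nutch, and the launcher file names it looks for (ant.bat, javacc.bat, and so on) are assumptions about a typical install.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of ant, javacc, or Nutch.
final class PathCheck {

    // Returns the first regular file named like one of `names`
    // found in the directories listed in `pathValue`.
    static Optional<Path> findOnPath(String pathValue, String... names) {
        if (pathValue == null) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        // Launcher names are assumptions; adjust them to the actual install.
        System.out.println("ant:    " + findOnPath(path, "ant", "ant.bat", "ant.cmd"));
        System.out.println("javacc: " + findOnPath(path, "javacc", "javacc.bat"));
    }
}
```

If either line prints Optional.empty, the corresponding bin directory is probably missing from PATH.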
1. Use the ICTCLAS component. In testing, ICTCLAS works under cygwin as long as the ICTCLAS dll file is placed at /lib/native/ICTCLAS.dll. Otherwise, ICTCLAS.dll cannot be found.
2. Package ICTCLAS into a jar (ICTCLAS.jar) and put it in the /lib directory so that the word segmentation methods can be called.
3. Put the data directory from ICTCLAS into this directory as well; the dictionary it contains is loaded during word segmentation.
4. Modify the JavaCC grammar under /src/java/org/apache/nutch/analysis (NutchAnalysis.jj in the Nutch 0.9 source tree):
| <SIGRAM: (<CJK>)+ >
{
    System.out.println("");
}
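This enables the tokenizer to support Chinese word segmentation: with the +, a run of consecutive CJK characters is matched as a single SIGRAM token.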
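The effect of the + can be illustrated outside of JavaCC with ordinary regular expressions. This is only a sketch under two assumptions: that the rule without the + matches one CJK character at a time, and that "CJK" can be approximated by the CJK Unified Ideographs range (the real CJK definition in the grammar may cover more ranges). CjkRunDemo and tokens are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it does not use JavaCC or Nutch.
final class CjkRunDemo {

    // Simplifying assumption: only CJK Unified Ideographs count as CJK here.
    private static final String CJK = "[\\u4e00-\\u9fff]";

    // groupRuns = false mimics a rule that matches a single CJK character;
    // groupRuns = true mimics the (<CJK>)+ rule, which matches a whole run.
    static List<String> tokens(String text, boolean groupRuns) {
        Pattern pattern = Pattern.compile(groupRuns ? CJK + "+" : CJK);
        Matcher matcher = pattern.matcher(text);
        List<String> out = new ArrayList<>();
        while (matcher.find()) {
            out.add(matcher.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Four Chinese characters followed by a Latin word.
        String text = "\u4e2d\u6587\u5206\u8bcd test";
        System.out.println(tokens(text, false)); // one token per character
        System.out.println(tokens(text, true));  // one token for the whole run
    }
}
```

For the sample text, the first call yields four single-character tokens and the second yields one four-character token. Keeping the run together is what lets text that a segmenter has already split on spaces survive as whole words.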
5. Use javacc to compile the grammar and regenerate the code.
6. Modify the code in NutchDocumentTokenizer.java and add:
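// Needed at the top of the file, next to the existing java.io.Reader import.
import java.io.BufferedReader;
import java.io.StringReader;

// Static because process() runs before super() returns; shared by all instances, so not thread-safe.
private static Reader myreader = null;

public NutchDocumentTokenizer(Reader reader) {
    super(process(reader));
    tokenManager = new NutchAnalysisTokenManager(myreader);
}

// Reads the whole input, strips "/" characters, and segments the text with ICTCLAS.
public static Reader process(Reader reader) {
    BufferedReader in = new BufferedReader(reader);
    String line = "";
    String temp = null;
    try {
        while ((temp = in.readLine()) != null) {
            line += temp.replaceAll("/", "");
            System.out.println(line); // debug output
        }
    } catch (Exception e) {
        System.out.println(e);
    }
    // Default to the unsegmented text so a stale reader is never reused.
    myreader = new StringReader(line);
    try {
        if (line != null && !line.equals("")) {
            com.xjt.nlp.word.ICTCLAS ic = com.xjt.nlp.word.ICTCLAS.getInstance();
            line = ic.paragraphProcess(line);
            myreader = new StringReader(line);
        }
    } catch (Exception e) {
        // Segmentation failed; keep the unsegmented text.
    }
    return myreader;
}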
In this way, ICTCLAS is applied before tokenization, but some files cannot be processed, for example those containing "/", so this still needs to be improved.
Then run bin/nutch crawl urls -dir crawler -depth 3 -topN 50 to regenerate the index directory. Open it with the Luke tool to inspect the segmentation: the terms in the index are the words produced by ICTCLAS rather than single characters.
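One possible direction for improvement is to keep the preprocessing free of shared state. The step above stores its result in a static field, which every tokenizer instance shares. The sketch below is an assumption-laden variant, not the author's code: SegmentingReaders, process, and readAll are hypothetical names, and the segmenter parameter is a stand-in for the ICTCLAS call so that the example runs without the native library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.UnaryOperator;

// Hypothetical helper; `segmenter` stands in for the ICTCLAS call.
final class SegmentingReaders {

    // Reads all input, strips "/" as the original code does, segments the
    // text, and returns a fresh reader without touching any static state.
    static Reader process(Reader reader, UnaryOperator<String> segmenter) throws IOException {
        BufferedReader in = new BufferedReader(reader);
        StringBuilder text = new StringBuilder();
        String temp;
        while ((temp = in.readLine()) != null) {
            text.append(temp.replace("/", ""));
        }
        String line = text.toString();
        if (!line.isEmpty()) {
            try {
                line = segmenter.apply(line);
            } catch (RuntimeException e) {
                // Fall back to the unsegmented text if segmentation fails.
            }
        }
        return new StringReader(line);
    }

    // Small utility to drain a reader into a string, used for the demo.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in segmenter: puts a space between characters.
        // A real one would delegate to the ICTCLAS wrapper.
        UnaryOperator<String> fake = s -> String.join(" ", s.split(""));
        System.out.println(readAll(process(new StringReader("ab/c\nde"), fake)));
        // prints "a b c d e"
    }
}
```

Because Java requires the super(...) call to come first in a constructor, the tokenizer would still need some way to hand the same processed reader to both the superclass and the token manager. This sketch only covers the preprocessing itself.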
Reposted from: http://blog.csdn.net/xiajing12345/archive/2007/06/05/1638624.aspx