Nutch 0.9 (2): Adding ICTCLAS support for Chinese word segmentation

Source: Internet
Author: User


A. Install SVN and check out the latest source from Apache (http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.9/). This checkout can be compiled with ant; the directly downloaded release package cannot be built with ant.
B. Install ant: download the latest version of the build tool from http://ant.apache.org/
C. Install javacc from https://javacc.dev.java.net/
D. Add D:/javacc/bin and D:/ANT/bin to the PATH environment variable.

1. Use the ICTCLAS component. Testing shows that ICTCLAS works under Cygwin once its DLL is placed at /lib/native/ICTCLAS.dll; otherwise ICTCLAS.dll cannot be found.
2. Package ICTCLAS into a jar (ICTCLAS.jar) and put it in the /lib directory so that the word segmentation methods can be called.
3. Put the ICTCLAS data (dictionary) files into this directory; the dictionary is loaded from here during word segmentation.
4. Modify the JavaCC grammar file (NutchAnalysis.jj in Nutch 0.9) under /src/java/org/apache/nutch/analysis, changing the SIGRAM token rule to:
| <SIGRAM: (<CJK>)+ >
{
  System.out.println("");
}
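With (<CJK>)+ the rule matches a run of one or more CJK characters as a single token instead of one character at a time, which is what lets the tokenizer keep segmented Chinese words whole.
5. Run javacc on the grammar file to regenerate the Java code.
6. Modify NutchDocumentTokenizer.java and add the following: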
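// Also requires: import java.io.BufferedReader; import java.io.StringReader;

// Holds the segmented text. Note that this field is static, so it is shared by all tokenizer instances.
private static Reader myReader = null;

public NutchDocumentTokenizer(Reader reader) {
  super(process(reader));
  tokenManager = new NutchAnalysisTokenManager(myReader);
}

// Reads the whole input, runs it through ICTCLAS, and returns a reader over the segmented text.
public static Reader process(Reader reader) {
  BufferedReader in = new BufferedReader(reader);
  String line = "";
  String temp = null;
  try {
    while ((temp = in.readLine()) != null) {
      // Strip "/" before segmentation and join the lines without a separator.
      line += temp.replaceAll("/", "");
      System.out.println(line);
    }
  } catch (Exception e) {
    System.out.println(e);
  }
  try {
    if (line != null && !line.equals("")) {
      com.xjt.nlp.word.ICTCLAS ic = com.xjt.nlp.word.ICTCLAS.getInstance();
      line = ic.paragraphProcess(line);
      myReader = new StringReader(line);
    }
  } catch (Exception e) {
    // Segmentation errors are ignored; myReader keeps its previous value.
  }
  return myReader;
}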

This way, the text is run through ICTCLAS before Nutch tokenizes it. Some files still cannot be processed, for example those containing "/", so this part needs improvement.
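As one possible direction for that improvement, here is a minimal sketch of the same preprocessing step without the shared static field and with a fallback to the unsegmented text when segmentation fails. The class name SegmentingPreprocessor, the method preprocess, and the UnaryOperator<String> parameter are assumptions made for illustration; the operator stands in for the call to paragraphProcess so that the sketch runs without ICTCLAS. It assumes Java 8 or later and is not the original author's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.UnaryOperator;

public class SegmentingPreprocessor {

    /** Reads all text, strips "/" as the original code does, then segments it. */
    public static String preprocess(Reader reader, UnaryOperator<String> segmenter) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(reader)) {
            String temp;
            while ((temp = in.readLine()) != null) {
                // Lines are joined without a separator, mirroring the original loop.
                text.append(temp.replace("/", ""));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String line = text.toString();
        if (line.isEmpty()) {
            return line;
        }
        try {
            return segmenter.apply(line);
        } catch (RuntimeException e) {
            // Fall back to the unsegmented text instead of reusing an old reader.
            System.err.println("Segmentation failed: " + e);
            return line;
        }
    }

    /** Returns a fresh Reader on every call; nothing is shared between tokenizers. */
    public static Reader process(Reader reader, UnaryOperator<String> segmenter) {
        return new StringReader(preprocess(reader, segmenter));
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for ICTCLAS: inserts a space after every two characters.
        UnaryOperator<String> fakeSegmenter = s -> s.replaceAll("(..)", "$1 ").trim();
        // Input is two lines of Chinese text containing a "/", written as Unicode escapes.
        String input = "\u4e2d\u6587/\u5206\u8bcd\n\u6d4b\u8bd5";
        System.out.println(preprocess(new StringReader(input), fakeSegmenter));
    }
}
```

In the tokenizer itself, the stand-in would be replaced by a lambda that calls the real segmenter, and the constructor would pass the same returned reader to both super(...) and the token manager.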
Then run bin/nutch crawl urls -dir crawler -depth 3 -topN 50 to regenerate the index directory. Open the index with Luke to inspect the segmentation: the terms are now whole words as segmented by ICTCLAS rather than single characters.
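To see the difference between single-character terms and whole-word terms without building an index, the following sketch imitates the two forms of the SIGRAM rule with regular expressions. It is only an approximation: the class SigramDemo and its members are made up for this example, and \p{IsHan} (Java 7 or later) matches Han characters only, while the grammar's own <CJK> definition may cover other ranges. The input is assumed to be text in which a segmenter has already separated words with spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SigramDemo {
    // Approximates <SIGRAM: <CJK> >: every Han character is its own token.
    public static final Pattern SINGLE = Pattern.compile("\\p{IsHan}");
    // Approximates <SIGRAM: (<CJK>)+ >: a run of Han characters is one token.
    public static final Pattern RUN = Pattern.compile("\\p{IsHan}+");

    /** Collects every match of the pattern in the text, in order. */
    public static List<String> tokens(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Two two-character Chinese words separated by a space, as Unicode escapes.
        String segmented = "\u4e2d\u6587 \u5206\u8bcd";
        System.out.println("single characters: " + tokens(SINGLE, segmented));
        System.out.println("whole runs:        " + tokens(RUN, segmented));
    }
}
```

The first list contains four one-character tokens and the second contains two two-character tokens, which corresponds to what Luke shows before and after the change.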

 

Reposted from: http://blog.csdn.net/xiajing12345/archive/2007/06/05/1638624.aspx
