ICTCLAS is a Chinese word segmentation package developed by the Institute of Computing Technology of the Chinese Academy of Sciences. It has a good reputation and is widely used in China. In the past only a C++ version was available; C#, Delphi, and Java versions have since been released. The following uses a very small example to get you up and running with ICTCLAS within 10 minutes, so that you can start developing your own text classification and search engine.
It should be noted first that, unlike the JNI calls used with the earlier C++ version, this example uses the pure Java version of ICTCLAS, available at http://ictclas.org/down_opensrc.asp.
Suppose you have downloaded the Java version, ictclas4j. Decompress it, then copy the entire Data folder into your Eclipse project folder, copy the org folder under its bin directory into the bin directory of your Eclipse project, and copy the entire org folder under its src directory into the src directory of your Eclipse project. (This is the simplest and quickest way to use it. Alternatively, you can package it into a jar yourself, put it wherever you like, and import the jar through the build path.)
Now create a new class in your project with the following code:
import org.ictclas4j.bean.SegResult;
import org.ictclas4j.segment.SegTag;

public class OneMain {
    public static void main(String[] args) {
        System.out.println("This is OneMain");
        // Create the segmenter and split the test text into tagged words
        SegTag st = new SegTag(1);
        SegResult sr = st.split("a piece of diligent and beautiful money,/create an economic aircraft carrier. ABCD.#$% Hello World!\n another piece of text: 123 vehicles! 3.0");
        // Print the segmented, part-of-speech-tagged result
        System.out.println(sr.getFinalResult());
    }
}
Obviously, the string passed to split() is the text we use for testing. It contains Chinese characters, English letters, punctuation marks, messy symbols (ha), and Arabic numerals.
Run the program and check the output:
This is OneMain
One/s hard work/a place/u beautiful/a/u a/m/q Money/n ,/w //nx create/v Economy/n/u aircraft carrier/n ./w ABCD.#$%/nx Hello/nx World/nx !/w and/d/m/q text/n 123/m/q !/w 3.0/m
As you can see, the segmentation result is one long String. Words are separated by spaces, and each word is followed by a part-of-speech tag in English letters. Let's look at some interesting points.
The original text actually contains the phrase "a piece" twice. The first, in "a piece of diligent", is correctly identified here as an adverb; the second, in "a piece of money", is also correctly recognized as a quantity expression (a numeral plus a measure word).
Arabic numerals are correctly recognized as numerals, including the decimal "3.0". English words and messy symbols (including the invisible line break; did you spot it?) are all lumped into one category, /nx. (I don't know what ICTCLAS calls it: invalid characters, other characters, or something else. Can I name it myself?)
There are two exclamation marks in the test text, one half-width (English) and one full-width (Chinese). Both are correctly recognized as punctuation marks, but the English period "." is treated as /nx.
Spaces in the test text are ignored completely.
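Since the result is a single space-separated string of word/tag tokens, it can be split back into pairs for further processing. Below is a minimal sketch of that idea. The class TaggedTokens and its parse method are hypothetical helpers, not part of ictclas4j, and the sketch assumes that each token is a single word/tag unit with no spaces inside it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ictclas4j: splits a "word/tag word/tag ..."
// string into (word, tag) pairs.
class TaggedTokens {

    /** Returns one {word, tag} array per whitespace-separated token. */
    static List<String[]> parse(String result) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : result.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            // Use the LAST slash so that a token such as "//nx"
            // (the word "/" tagged nx) is split correctly.
            int slash = token.lastIndexOf('/');
            if (slash < 0) {
                pairs.add(new String[] {token, ""}); // no tag found
            } else {
                pairs.add(new String[] {
                    token.substring(0, slash), token.substring(slash + 1)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Sample tokens in the style of the segmenter output.
        String sample = "Hello/nx World/nx //nx 123/m 3.0/m";
        for (String[] p : parse(sample)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

Splitting on the last slash rather than the first is the one design choice that matters here, because the text being segmented may itself contain a "/" character.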
Okay. It's very simple, right? Go and have fun.
The above content is reposted from the original article. What follows is here mainly because I am not familiar with Java or the Eclipse tool. I am recording the problems I ran into, hoping to help others who, like me, are just getting started with Java.
1. First, I encountered the following error:
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
SegTag cannot be resolved to a type
SegTag cannot be resolved to a type
SegResult cannot be resolved to a type
This error occurs mainly because Eclipse has not picked up the copied code. (I naively thought that copying the code and the folders as described above would be enough to run it.) The error is easy to eliminate (easy to say now; it took me a long time to figure out): press F5 to refresh, or right-click the project and choose Refresh.
2. Now it should run. I happily clicked "Run As", but the following error occurred:
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
The import org.apache cannot be resolved
ReflectionToStringBuilder cannot be resolved
	at org.ictclas4j.bean.WordTable.<init>(WordTable.java:5)
	at org.ictclas4j.bean.Dictionary.init(Dictionary.java:46)
	at org.ictclas4j.bean.Dictionary.<init>(Dictionary.java:38)
	at org.ictclas4j.segment.SegTag.<init>(SegTag.java:28)
	at Test.main(Test.java:7)
Using Eclipse to locate the problem leads to this line:
import org.apache.commons.lang.builder.ReflectionToStringBuilder;
Clearly, we also need the library that contains the ReflectionToStringBuilder class. From the replies below the original article, I learned that we are still missing the commons-lang-2.4.jar package. Okay, once you know what you need, go find it...
Then came the embarrassing part: I did not know how to use this jar in the project, so I turned to Google for help once again. Finally, I found a workable method:
1. Create a folder named lib under the Eclipse project and copy commons-lang-2.4.jar into it.
2. In the project pane on the left, right-click the project -> Build Path -> Configure Build Path -> Libraries -> Add JARs, then find and select commons-lang-2.4.jar.
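To confirm that the jar really ended up on the build path, a small check like the one below can be run from inside the same project. It is only a sketch: the class ClasspathCheck and its method are hypothetical names, not part of ictclas4j or Eclipse. The class name being looked up is the one from the failing import shown earlier.

```java
// Hypothetical diagnostic, not part of ictclas4j or Eclipse.
class ClasspathCheck {

    /** Returns true if the named class can be found on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String needed = "org.apache.commons.lang.builder.ReflectionToStringBuilder";
        System.out.println(needed
                + (isOnClasspath(needed) ? " found" : " NOT found; check the build path"));
    }
}
```

If the program reports that the class was not found, the jar is most likely not listed under Libraries in the project's build path settings.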
After these steps, I finally saw the lovely output:
This is OneMain
One/s hard work/a place/u beautiful/a/u a/m/q Money/n ,/w //nx create/v Economy/n/u aircraft carrier/n ./w ABCD.#$%/nx Hello/nx World/nx !/w
and/d/m/q text/n 123/m/q !/w 3.0/m
This is my first contact with a Java project. Although the code now runs, I am not sure whether I truly found the cause and the right solution or simply got lucky (of course, I will start learning this properly). In short, we can now use the word segmentation system to do something.
PS: Some people may say, "Is that even a problem? Did it really take an hour?" Well, yes, that was the problem! Feel free to laugh.