Problem:
I tried the stemming and lemmatization mentioned in the article:
- Reducing a word to its root form, such as "cars" to "car". This operation is called stemming.
- Converting a word to its root form, such as "drove" to "drive". This operation is called lemmatization.
But my test failed.
The code is as follows:
public class TestNorms {
    public void createIndex() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
        IndexWriter writer = new IndexWriter(d, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
        Document doc = new Document();
        field.setValue("Hello students was drive");
        doc.add(field);
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }

    public void search() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
        IndexReader reader = IndexReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(new TermQuery(new Term("desc", "drove")), 10);
        System.out.println(docs.totalHits);
    }

    public static void main(String[] args) throws IOException {
        TestNorms test = new TestNorms();
        test.createIndex();
        test.search();
    }
}
Neither the singular/plural variation nor the verb-form change is matched.
I don't know why. Is it a problem with the analyzer?
Answer:
It is indeed an analyzer problem. StandardAnalyzer does not perform stemming or lemmatization, so it cannot match singular and plural forms, or different forms of the same word.
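To see this directly, you can print the tokens StandardAnalyzer actually produces for the indexed text. The following is a minimal sketch against the Lucene 3.0 API (the class name PrintTokens is mine; the field name and sample text are taken from the code above):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class PrintTokens {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream ts = analyzer.tokenStream("desc",
                new StringReader("Hello students was drive"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        // Prints: hello, students, drive. StandardAnalyzer only lowercases and
        // drops stop words ("was" is one of them); nothing is stemmed, so a
        // TermQuery for "drove" cannot match the indexed term "drive".
        while (ts.incrementToken()) {
            System.out.println(term.term());
        }
    }
}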
The article describes the basic principles of full-text search. It helps us understand Lucene better, but it does not mean that Lucene follows exactly this basic process.
(1) About stemming
A famous stemming algorithm is the Porter Stemming Algorithm. Its home page is http://tartarus.org/~martin/PorterStemmer/, and you can also read its paper at http://tartarus.org/~martin/PorterStemmer/def.txt.
You can perform a simple test on the following webpage: Porter's Stemming Algorithm Online [http://facweb.cs.depaul.edu/mobasher/classes/csc575/porter.html]
cars -> car
driving -> drive
tokenization -> token
However:
drove -> drove
As you can see, stemming reduces a word to its root form by applying rules, but it cannot recognize irregular changes of word form such as "drove".
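To make the rule-based nature concrete, here is a toy sketch (my own illustration, not the real Porter algorithm, which applies several ordered rule steps with extra conditions and, for example, restores the trailing "e" in "driving" -> "drive"):

// Toy illustration of rule-based stemming (NOT the full Porter algorithm):
// each rule rewrites a suffix, and a word matching no rule passes through.
public class ToyStemmer {
    public static String stem(String word) {
        if (word.endsWith("ies"))
            return word.substring(0, word.length() - 3) + "i";   // "ponies" -> "poni"
        if (word.endsWith("ing"))
            return word.substring(0, word.length() - 3);         // "driving" -> "driv"
        if (word.endsWith("s") && !word.endsWith("ss"))
            return word.substring(0, word.length() - 1);         // "cars" -> "car"
        return word; // no rule matches "drove", so it stays "drove"
    }

    public static void main(String[] args) {
        for (String w : new String[] {"cars", "driving", "drove"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}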
The latest Lucene 3.0 already provides the PorterStemFilter class, which implements this algorithm. Unfortunately, there is no matching Analyzer, but it doesn't matter; we can easily implement one ourselves:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class PorterStemAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}
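To check what the new analyzer emits, the same token-printing idiom as above can be reused (PrintStems is a hypothetical name; the sample sentence is the one indexed in the code below):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class PrintStems {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new PorterStemAnalyzer();
        TokenStream ts = analyzer.tokenStream("desc",
                new StringReader("Hello students was driving cars professionally"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        // Expected stems: hello, student, wa, drive, car, profession.
        // Note "was" -> "wa": the Porter rules blindly strip the final "s",
        // and PorterStemAnalyzer has no stop-word list to remove "was" first.
        while (ts.incrementToken()) {
            System.out.println(term.term());
        }
    }
}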
Using this analyzer in your program, both singular/plural variants and regular changes of word form can be matched:
public void createIndex() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
    IndexWriter writer = new IndexWriter(d, new PorterStemAnalyzer(), true,
            IndexWriter.MaxFieldLength.UNLIMITED);
    Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
    Document doc = new Document();
    field.setValue("Hello students was driving cars professionally");
    doc.add(field);
    writer.addDocument(doc);
    writer.optimize();
    writer.close();
}

public void search() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
    IndexReader reader = IndexReader.open(d);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs docs = searcher.search(new TermQuery(new Term("desc", "car")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "drive")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "profession")), 10);
    System.out.println(docs.totalHits);
}
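If everything works, each of the three searches should report one hit, since the Porter rules index "cars" as "car", "driving" as "drive", and "professionally" as "profession".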
(2) About lemmatization
As for lemmatization, a dictionary is generally required, so that, for example, "drove" can be mapped back to "drive".
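Conceptually, this is just a lookup table. A minimal sketch (ToyLemmatizer and its hand-filled map are mine; real lemmatizers load large dictionary files and may return several candidate lemmas per form):

import java.util.HashMap;
import java.util.Map;

public class ToyLemmatizer {
    // Hypothetical hand-filled dictionary; a real one covers the whole language.
    private static final Map<String, String> DICT = new HashMap<String, String>();
    static {
        DICT.put("drove", "drive");
        DICT.put("driving", "drive");
        DICT.put("was", "be");
    }

    public static String lemmatize(String word) {
        String lemma = DICT.get(word);
        return lemma != null ? lemma : word; // unknown words pass through unchanged
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("drove")); // drive
        System.out.println(lemmatize("was"));   // be
    }
}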
I searched the internet and found the European languages lemmatizer [http://lemmatizer.org/], which is currently available only for Linux.
Download, compile, and install it according to the instructions on the website:
LibMAFSA is the core of the lemmatizer. All other libraries depend on it. Download the last version from the following page, unpack it and compile:

# tar xzf libMAFSA-0.2.tar.gz
# cd libMAFSA-0.2/
# cmake .
# make
# sudo make install

After this you should install libturglem. You can download it at the same place.

# tar xzf libturglem-0.2.tar.gz
# cd libturglem-0.2
# cmake .
# make
# sudo make install

Next you should install the English dictionaries with some additional features to work.

# tar xzf turglem-english-0.2.tar.gz
# cd turglem-english-0.2
# cmake .
# make
# sudo make install
After installation:
- /usr/local/include/turglem contains the header files used to compile your own code.
- /usr/local/share/turglem/english contains the dictionary files; in lemmas.xml we can see the correspondence between "drove" and "drive", and between "was" and "be" (see the snippet after this list).
- libMAFSA.a, libturglem.a, libturglem-english.a, and libtxml.a in /usr/local/lib are the static libraries used to build applications.
<L id = "DRIVE" p = "6"/> <L id = "DROVE" p = "6"/> <L id = "DRIVING" p = "6"/> |
There is an example program, test_utf8.cpp, in the turglem-english-0.2 directory:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <turglem/lemmatizer.h>
#include <turglem/lemmatizer.hpp>
#include <turglem/english/charset_adapters.hpp>

int main(int argc, char **argv)
{
    char in_s_buf[1024];
    char *nl_ptr;

    tl::lemmatizer lem;

    if (argc != 4)
    {
        printf("Usage: %s words.dic predict.dic flexias.bin\n", argv[0]);
        return -1;
    }

    lem.load_lemmatizer(argv[1], argv[3], argv[2]);

    while (!feof(stdin))
    {
        // Read one word per line and strip trailing newline characters.
        fgets(in_s_buf, 1024, stdin);
        nl_ptr = strchr(in_s_buf, '\n');
        if (nl_ptr) *nl_ptr = 0;
        nl_ptr = strchr(in_s_buf, '\r');
        if (nl_ptr) *nl_ptr = 0;

        if (in_s_buf[0])
        {
            printf("processing %s\n", in_s_buf);
            tl::lem_result pars;
            // Look the word up in the dictionary; each paradigm is one possible analysis.
            size_t pcnt = lem.lemmatize<english_utf8_adapter>(in_s_buf, pars);
            printf("%d\n", (int) pcnt);
            for (size_t i = 0; i < pcnt; i++)
            {
                std::string s;
                u_int32_t src_form = lem.get_src_form(pars, i);
                s = lem.get_text<english_utf8_adapter>(pars, i, 0);
                printf("PARADIGM %d: normal form '%s'\n", (unsigned int) i, s.c_str());
                printf("\tpart of speech: %d\n",
                       lem.get_part_of_speech(pars, (unsigned int) i, src_form));
            }
        }
    }

    return 0;
}
Compile the file and link it against the static libraries. Pay attention to the link order, otherwise an error may occur:
g++ -g -o output test_utf8.cpp -L/usr/local/lib/ -lturglem-english -lturglem -lMAFSA -ltxml
Run the compiled program:
./output /usr/local/share/turglem/english/dict_english.auto /usr/local/share/turglem/english/prediction_english.auto /usr/local/share/turglem/english/paradigms_english.bin
Although I do not know much about its internal mechanism, the effect of lemmatization is clear:
drove
processing drove
3
PARADIGM 0: normal form 'drove'
        part of speech: 0
PARADIGM 1: normal form 'drove'
        part of speech: 2
PARADIGM 2: normal form 'drive'
        part of speech: 2
was
processing was
3
PARADIGM 0: normal form 'be'
        part of speech: 3
PARADIGM 1: normal form 'be'
        part of speech: 3
PARADIGM 2: normal form 'be'
        part of speech: 3