Lucene problems (2): stemming and lemmatization

Source: Internet
Author: User
Tags: createindex
Problem:

I tried stemming and lemmatization mentioned in the article.

  • Reducing a word to its root form, such as "cars" to "car", is called stemming.
  • Converting a word to its dictionary form, such as "drove" to "drive", is called lemmatization.

My test failed.

The code is as follows:

public class TestNorms {
    public void createIndex() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
        IndexWriter writer = new IndexWriter(d, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
        Document doc = new Document();
        field.setValue("Hello students was drive");
        doc.add(field);
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }

    public void search() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
        IndexReader reader = IndexReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(new TermQuery(new Term("desc", "drove")), 10);
        System.out.println(docs.totalHits);
    }

    public static void main(String[] args) throws IOException {
        TestNorms test = new TestNorms();
        test.createIndex();
        test.search();
    }
}

Neither the singular/plural reduction nor the irregular form change is reflected in the search results.

Is the problem caused by the analyzer?

Answer:

It is indeed an analyzer problem. StandardAnalyzer does not perform stemming or lemmatization, so it cannot match singular/plural variants or different word forms.
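To see why the query above returns zero hits, remember that a TermQuery is an exact string match against the indexed tokens. The following is a toy plain-Java sketch of an inverted index (not Lucene code) that mimics an analyzer which lowercases and splits but applies no stemming or lemmatization:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: term -> list of document ids containing it.
// Tokens are lowercased and split on whitespace, but never stemmed
// or lemmatized, just like StandardAnalyzer in the question.
public class ExactMatchDemo {
    static Map<String, List<Integer>> index = new HashMap<String, List<Integer>>();

    static void indexDoc(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            List<Integer> postings = index.get(token);
            if (postings == null) {
                postings = new ArrayList<Integer>();
                index.put(token, postings);
            }
            postings.add(docId);
        }
    }

    static int totalHits(String term) {
        List<Integer> postings = index.get(term);
        return postings == null ? 0 : postings.size();
    }

    public static void main(String[] args) {
        indexDoc(0, "Hello students was drive");
        System.out.println(totalHits("drive")); // 1: this exact token was indexed
        System.out.println(totalHits("drove")); // 0: "drove" never appears as a token
    }
}
```

Since only the literal token "drive" is in the index, searching for "drove" finds nothing; no amount of querying can fix this without normalizing word forms at index and query time.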

The article you mention describes the basic principles of full-text search. It helps us understand Lucene better, but it does not mean that Lucene follows exactly that basic process.

(1) About stemming

For stemming, a famous algorithm is the Porter Stemming Algorithm. Its home page is http://tartarus.org/~martin/PorterStemmer/, and its definition can be read at http://tartarus.org/~martin/PorterStemmer/def.txt.

You can try it on the following web page: Porter's Stemming Algorithm Online [http://facweb.cs.depaul.edu/mobasher/classes/csc575/porter.html]

cars -> car

driving -> drive

tokenization -> token

However:

drove -> drove

As you can see, stemming reduces words to their root form by applying suffix rules, but it cannot recognize irregular form changes.
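The rule-based nature of stemming can be illustrated with a miniature stemmer in the spirit of the Porter algorithm (a simplified sketch, not the real Porter implementation): a few suffix-stripping rules handle regular forms, while an irregular form like "drove" matches no rule and passes through unchanged.

```java
// Miniature rule-based stemmer sketch. Real Porter has five rule
// steps and a "measure" condition; this keeps just enough to show
// the idea: rules catch regular suffixes, irregular forms slip by.
public class ToyStemmer {
    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // true if the last three letters are consonant-vowel-consonant,
    // e.g. "driv" ends r-i-v; used to restore a dropped final 'e'
    static boolean endsCvc(String w) {
        int n = w.length();
        if (n < 3) return false;
        return !isVowel(w.charAt(n - 3)) && isVowel(w.charAt(n - 2)) && !isVowel(w.charAt(n - 1));
    }

    static String stem(String w) {
        if (w.endsWith("ies")) return w.substring(0, w.length() - 3) + "i";
        if (w.endsWith("s") && !w.endsWith("ss")) return w.substring(0, w.length() - 1);
        if (w.endsWith("ing") && w.length() > 4) {
            String s = w.substring(0, w.length() - 3);
            if (endsCvc(s)) s += "e"; // "driv" -> "drive"
            return s;
        }
        return w; // no rule applies: "drove" stays "drove"
    }

    public static void main(String[] args) {
        System.out.println(stem("cars"));    // car
        System.out.println(stem("driving")); // drive
        System.out.println(stem("drove"));   // drove: the rules cannot see it
    }
}
```

No suffix rule can map "drove" to "drive"; that requires a dictionary, which is exactly what lemmatization adds in section (2).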

In the latest Lucene 3.0, there is already a PorterStemFilter class that implements this algorithm. Unfortunately, there is no ready-made Analyzer that uses it, but that does not matter; we can easily implement one ourselves:

public class PorterStemAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}

Using this analyzer in your program, singular/plural variants and regular form changes can be recognized:

public void createIndex() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
    IndexWriter writer = new IndexWriter(d, new PorterStemAnalyzer(), true,
            IndexWriter.MaxFieldLength.UNLIMITED);

    Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
    Document doc = new Document();
    field.setValue("Hello students was driving cars professionally");
    doc.add(field);

    writer.addDocument(doc);
    writer.optimize();
    writer.close();
}

public void search() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
    IndexReader reader = IndexReader.open(d);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs docs = searcher.search(new TermQuery(new Term("desc", "car")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "drive")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "profession")), 10);
    System.out.println(docs.totalHits);
}

(2) About lemmatization

For lemmatization, a dictionary is generally required, through which "drove" can be mapped back to "drive".
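The core idea can be sketched in a few lines of plain Java: irregular forms are looked up in an exceptions table, and everything else falls back to the identity (a real system would fall back to rule-based stemming instead). The entries below are illustrative and not taken from any particular lemmatizer's data files.

```java
import java.util.HashMap;
import java.util.Map;

// Dictionary-based lemmatizer sketch: a lookup table maps inflected
// forms to their dictionary form; unknown words are returned as-is.
public class ToyLemmatizer {
    static final Map<String, String> DICT = new HashMap<String, String>();
    static {
        DICT.put("drove", "drive");
        DICT.put("driven", "drive");
        DICT.put("was", "be");
        DICT.put("were", "be");
    }

    static String lemmatize(String word) {
        String w = word.toLowerCase();
        String lemma = DICT.get(w);
        return lemma != null ? lemma : w;
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("drove")); // drive
        System.out.println(lemmatize("was"));   // be
        System.out.println(lemmatize("car"));   // car: not in the dictionary
    }
}
```

This is exactly the trade-off against stemming: a dictionary handles irregular forms perfectly but covers only the words it lists, while rules cover unseen words but miss irregularities.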

Searching the internet, I found a European languages lemmatizer [http://lemmatizer.org/], which is only available for Linux.

Download, compile, and install it according to the instructions on the website:

LibMAFSA is the core of the lemmatizer. All other libraries depend on it. Download the last version from the following page, unpack it and compile:

# tar xzf libMAFSA-0.2.tar.gz
# cd libMAFSA-0.2/
# cmake .
# make
# sudo make install

After this you should install libturglem. You can download it from the same place.

# tar xzf libturglem-0.2.tar.gz
# cd libturglem-0.2
# cmake .
# make
# sudo make install

Next you should install the English dictionaries needed for it to work.

# tar xzf turglem-english-0.2.tar.gz
# cd turglem-english-0.2
# cmake .
# make
# sudo make install

After installation:

  • /usr/local/include/turglem contains the header files used to compile your own code.
  • /usr/local/share/turglem/english contains the dictionary files; in the lemmas XML file we can see the correspondence between "drove" and "drive", and between "was" and "be":
  • libMAFSA.a, libturglem.a, libturglem-english.a, and libtxml.a in /usr/local/lib are the static libraries used to build applications.

<L id="DRIVE" p="6"/>

<L id="DROVE" p="6"/>

<L id="DRIVING" p="6"/>

There is an example program, test_utf8.cpp, in the turglem-english-0.2 directory:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <turglem/lemmatizer.h>
#include <turglem/lemmatizer.hpp>
#include <turglem/english/charset_adapters.hpp>

int main(int argc, char **argv)
{
    char in_s_buf[1024];
    char *nl_ptr;

    tl::lemmatizer lem;

    if (argc != 4)
    {
        printf("Usage: %s words.dic predict.dic flexias.bin\n", argv[0]);
        return -1;
    }

    lem.load_lemmatizer(argv[1], argv[3], argv[2]);

    while (!feof(stdin))
    {
        fgets(in_s_buf, 1024, stdin);
        nl_ptr = strchr(in_s_buf, '\n');
        if (nl_ptr) *nl_ptr = 0;
        nl_ptr = strchr(in_s_buf, '\r');
        if (nl_ptr) *nl_ptr = 0;

        if (in_s_buf[0])
        {
            printf("processing %s\n", in_s_buf);
            tl::lem_result pars;
            size_t pcnt = lem.lemmatize<english_utf8_adapter>(in_s_buf, pars);
            printf("%d\n", (int) pcnt);
            for (size_t i = 0; i < pcnt; i++)
            {
                std::string s;
                u_int32_t src_form = lem.get_src_form(pars, i);
                s = lem.get_text<english_utf8_adapter>(pars, i, 0);
                printf("PARADIGM %d: normal form '%s'\n", (unsigned int) i, s.c_str());
                printf("\tpart of speech: %d\n", lem.get_part_of_speech(pars, (unsigned int) i, src_form));
            }
        }
    }

    return 0;
}

Compile the file and link it against the static libraries. Note the link order, otherwise errors may occur:

g++ -g -o output test_utf8.cpp -L/usr/local/lib/ -lturglem-english -lturglem -lMAFSA -ltxml

Run the compiled program:

./output /usr/local/share/turglem/english/dict_english.auto /usr/local/share/turglem/english/prediction_english.auto /usr/local/share/turglem/english/paradigms_english.bin

Although I do not know much about its internal mechanism, the effect of lemmatization is clear:

drove
processing drove
3
PARADIGM 0: normal form 'drove'
        part of speech: 0
PARADIGM 1: normal form 'drove'
        part of speech: 2
PARADIGM 2: normal form 'drive'
        part of speech: 2

was
processing was
3
PARADIGM 0: normal form 'be'
        part of speech: 3
PARADIGM 1: normal form 'be'
        part of speech: 3
PARADIGM 2: normal form 'be'
        part of speech: 3
