Lucene problems (2): stemming and lemmatization

Source: Internet
Author: User
Tags: createindex
Problem:

I tried stemming and lemmatization mentioned in the article.

  • Reducing a word to its root form, such as "cars" to "car", is called stemming.
  • Converting a word to its dictionary form, such as "drove" to "drive", is called lemmatization.

My test failed.

The code is as follows:

public class TestNorms {
    public void createIndex() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
        IndexWriter writer = new IndexWriter(d, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
        Document doc = new Document();
        field.setValue("Hello students was drive");
        doc.add(field);
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }

    public void search() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
        IndexReader reader = IndexReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(new TermQuery(new Term("desc", "drove")), 10);
        System.out.println(docs.totalHits);
    }

    public static void main(String[] args) throws IOException {
        TestNorms test = new TestNorms();
        test.createIndex();
        test.search();
    }
}

Neither the singular/plural reduction nor the irregular form change is reflected in the search results.

Is the problem caused by the analyzer?

Answer:

It is indeed an analyzer problem. StandardAnalyzer does not perform stemming or lemmatization, so it cannot match singular/plural variants or different word forms.
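To see why the query above returns zero hits, remember that a TermQuery is an exact string match against the indexed tokens. The following is a toy plain-Java sketch of an inverted index (not Lucene code) that mimics an analyzer which lowercases and splits but applies no stemming or lemmatization:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: term -> list of document ids containing it.
// Tokens are lowercased and split on whitespace, but never stemmed
// or lemmatized, just like StandardAnalyzer in the question.
public class ExactMatchDemo {
    static Map<String, List<Integer>> index = new HashMap<String, List<Integer>>();

    static void indexDoc(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            List<Integer> postings = index.get(token);
            if (postings == null) {
                postings = new ArrayList<Integer>();
                index.put(token, postings);
            }
            postings.add(docId);
        }
    }

    static int totalHits(String term) {
        List<Integer> postings = index.get(term);
        return postings == null ? 0 : postings.size();
    }

    public static void main(String[] args) {
        indexDoc(0, "Hello students was drive");
        System.out.println(totalHits("drive")); // 1: this exact token was indexed
        System.out.println(totalHits("drove")); // 0: "drove" never appears as a token
    }
}
```

Since only the literal token "drive" is in the index, searching for "drove" finds nothing; no amount of querying can fix this without normalizing word forms at index and query time.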

The article you mention describes the basic principles of full-text search. It helps us understand Lucene better, but it does not mean that Lucene follows exactly that basic process.

(1) About stemming

For stemming, a famous algorithm is the Porter Stemming Algorithm. Its home page is http://tartarus.org/~martin/PorterStemmer/, and its definition can be read at http://tartarus.org/~martin/PorterStemmer/def.txt.

You can try it on the following web page: Porter's Stemming Algorithm Online [http://facweb.cs.depaul.edu/mobasher/classes/csc575/porter.html]

cars -> car

driving -> drive

tokenization -> token

However:

drove -> drove

As you can see, stemming reduces words to their root form by applying suffix rules, but it cannot recognize irregular form changes.
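The rule-based nature of stemming can be illustrated with a miniature stemmer in the spirit of the Porter algorithm (a simplified sketch, not the real Porter implementation): a few suffix-stripping rules handle regular forms, while an irregular form like "drove" matches no rule and passes through unchanged.

```java
// Miniature rule-based stemmer sketch. Real Porter has five rule
// steps and a "measure" condition; this keeps just enough to show
// the idea: rules catch regular suffixes, irregular forms slip by.
public class ToyStemmer {
    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // true if the last three letters are consonant-vowel-consonant,
    // e.g. "driv" ends r-i-v; used to restore a dropped final 'e'
    static boolean endsCvc(String w) {
        int n = w.length();
        if (n < 3) return false;
        return !isVowel(w.charAt(n - 3)) && isVowel(w.charAt(n - 2)) && !isVowel(w.charAt(n - 1));
    }

    static String stem(String w) {
        if (w.endsWith("ies")) return w.substring(0, w.length() - 3) + "i";
        if (w.endsWith("s") && !w.endsWith("ss")) return w.substring(0, w.length() - 1);
        if (w.endsWith("ing") && w.length() > 4) {
            String s = w.substring(0, w.length() - 3);
            if (endsCvc(s)) s += "e"; // "driv" -> "drive"
            return s;
        }
        return w; // no rule applies: "drove" stays "drove"
    }

    public static void main(String[] args) {
        System.out.println(stem("cars"));    // car
        System.out.println(stem("driving")); // drive
        System.out.println(stem("drove"));   // drove: the rules cannot see it
    }
}
```

No suffix rule can map "drove" to "drive"; that requires a dictionary, which is exactly what lemmatization adds in section (2).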

In the latest Lucene 3.0, there is already a PorterStemFilter class that implements this algorithm. Unfortunately, there is no ready-made Analyzer that uses it, but that does not matter; we can easily implement one ourselves:

public class PorterStemAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}

Using this analyzer in your program, singular/plural variants and regular form changes can be recognized:

public void createIndex() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
    IndexWriter writer = new IndexWriter(d, new PorterStemAnalyzer(), true,
            IndexWriter.MaxFieldLength.UNLIMITED);

    Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
    Document doc = new Document();
    field.setValue("Hello students was driving cars professionally");
    doc.add(field);

    writer.addDocument(doc);
    writer.optimize();
    writer.close();
}

public void search() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/descrie3/norms"));
    IndexReader reader = IndexReader.open(d);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs docs = searcher.search(new TermQuery(new Term("desc", "car")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "drive")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "profession")), 10);
    System.out.println(docs.totalHits);
}

(2) About lemmatization

For lemmatization, a dictionary is generally required, through which "drove" can be mapped back to "drive".
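The core idea can be sketched in a few lines of plain Java: irregular forms are looked up in an exceptions table, and everything else falls back to the identity (a real system would fall back to rule-based stemming instead). The entries below are illustrative and not taken from any particular lemmatizer's data files.

```java
import java.util.HashMap;
import java.util.Map;

// Dictionary-based lemmatizer sketch: a lookup table maps inflected
// forms to their dictionary form; unknown words are returned as-is.
public class ToyLemmatizer {
    static final Map<String, String> DICT = new HashMap<String, String>();
    static {
        DICT.put("drove", "drive");
        DICT.put("driven", "drive");
        DICT.put("was", "be");
        DICT.put("were", "be");
    }

    static String lemmatize(String word) {
        String w = word.toLowerCase();
        String lemma = DICT.get(w);
        return lemma != null ? lemma : w;
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("drove")); // drive
        System.out.println(lemmatize("was"));   // be
        System.out.println(lemmatize("car"));   // car: not in the dictionary
    }
}
```

This is exactly the trade-off against stemming: a dictionary handles irregular forms perfectly but covers only the words it lists, while rules cover unseen words but miss irregularities.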

Searching the internet, I found a European languages lemmatizer [http://lemmatizer.org/], which is only available for Linux.

Download, compile, and install it according to the instructions on the website:

LibMAFSA is the core of the lemmatizer. All other libraries depend on it. Download the last version from the following page, unpack it and compile:

# tar xzf libMAFSA-0.2.tar.gz
# cd libMAFSA-0.2/
# cmake .
# make
# sudo make install

After this you should install libturglem. You can download it from the same place.

# tar xzf libturglem-0.2.tar.gz
# cd libturglem-0.2
# cmake .
# make
# sudo make install

Next you should install the English dictionaries needed for it to work.

# tar xzf turglem-english-0.2.tar.gz
# cd turglem-english-0.2
# cmake .
# make
# sudo make install

After installation:

  • /usr/local/include/turglem contains the header files used to compile your own code.
  • /usr/local/share/turglem/english contains the dictionary files; in the lemmas XML file we can see the correspondence between "drove" and "drive", and between "was" and "be":
  • libMAFSA.a, libturglem.a, libturglem-english.a, and libtxml.a in /usr/local/lib are the static libraries used to build applications.

<L id="DRIVE" p="6"/>

<L id="DROVE" p="6"/>

<L id="DRIVING" p="6"/>

There is an example program, test_utf8.cpp, in the turglem-english-0.2 directory:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <turglem/lemmatizer.h>
#include <turglem/lemmatizer.hpp>
#include <turglem/english/charset_adapters.hpp>

int main(int argc, char **argv)
{
    char in_s_buf[1024];
    char *nl_ptr;

    tl::lemmatizer lem;

    if (argc != 4)
    {
        printf("Usage: %s words.dic predict.dic flexias.bin\n", argv[0]);
        return -1;
    }

    lem.load_lemmatizer(argv[1], argv[3], argv[2]);

    while (!feof(stdin))
    {
        fgets(in_s_buf, 1024, stdin);
        nl_ptr = strchr(in_s_buf, '\n');
        if (nl_ptr) *nl_ptr = 0;
        nl_ptr = strchr(in_s_buf, '\r');
        if (nl_ptr) *nl_ptr = 0;

        if (in_s_buf[0])
        {
            printf("processing %s\n", in_s_buf);
            tl::lem_result pars;
            size_t pcnt = lem.lemmatize<english_utf8_adapter>(in_s_buf, pars);
            printf("%d\n", (int) pcnt);
            for (size_t i = 0; i < pcnt; i++)
            {
                std::string s;
                u_int32_t src_form = lem.get_src_form(pars, i);
                s = lem.get_text<english_utf8_adapter>(pars, i, 0);
                printf("PARADIGM %d: normal form '%s'\n", (unsigned int) i, s.c_str());
                printf("\tpart of speech: %d\n", lem.get_part_of_speech(pars, (unsigned int) i, src_form));
            }
        }
    }

    return 0;
}

Compile the file and link it against the static libraries. Note the link order, otherwise errors may occur:

g++ -g -o output test_utf8.cpp -L/usr/local/lib/ -lturglem-english -lturglem -lMAFSA -ltxml

Run the compiled program:

./output /usr/local/share/turglem/english/dict_english.auto /usr/local/share/turglem/english/prediction_english.auto /usr/local/share/turglem/english/paradigms_english.bin

Although I do not know much about its internal mechanism, the effect of lemmatization is clear:

drove
processing drove
3
PARADIGM 0: normal form 'drove'
        part of speech: 0
PARADIGM 1: normal form 'drove'
        part of speech: 2
PARADIGM 2: normal form 'drive'
        part of speech: 2

was
processing was
3
PARADIGM 0: normal form 'be'
        part of speech: 3
PARADIGM 1: normal form 'be'
        part of speech: 3
PARADIGM 2: normal form 'be'
        part of speech: 3
