Lucene Basics (3): Chinese Word Segmentation and Highlighting

Last Update:2015-10-19 Source: Internet

Author: User



Lucene analyzers and highlighting

Analyzer

Lucene indexes a document according to how its analyzer (word breaker) segments the text, so different analyzers produce different index results. The previous example used the standard analyzer, which works well for English but poorly for Chinese: it splits Chinese text into individual characters and has no concept of words.
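
To make the difference concrete, here is a minimal sketch that contrasts character-by-character splitting with a simple dictionary-based method (forward maximum matching). It is only an illustration: the class name, the tiny dictionary, and the matching strategy are assumptions made for this example, and it is not the code of Lucene's standard analyzer or of any third-party Chinese analyzer. The sample text is 中华人民共和国 (People's Republic of China), written with Unicode escapes so the source stays ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Conceptual sketch only: neither method is Lucene or IK Analyzer code.
class SegmentationSketch
{
    // Character-by-character splitting: every character becomes its own token.
    static List<String> splitByCharacter(String text)
    {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++)
        {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Forward maximum matching: at each position take the longest dictionary
    // word that starts there, falling back to a single character.
    static List<String> splitByDictionary(String text, Set<String> dictionary, int maxWordLength)
    {
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length())
        {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end)))
            {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args)
    {
        // Hypothetical three-word dictionary: "Zhonghua", "renmin", "gongheguo"
        Set<String> dictionary = new HashSet<String>(
                Arrays.asList("\u4e2d\u534e", "\u4eba\u6c11", "\u5171\u548c\u56fd"));
        String text = "\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd";
        System.out.println(splitByCharacter(text));
        System.out.println(splitByDictionary(text, dictionary, 3));
    }
}
```

With this dictionary the first call yields seven single-character tokens, while the second yields three word tokens, so a search for the word 人民 can match a whole token instead of two unrelated characters.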

To use a different analyzer, simply instantiate the third-party analyzer wherever an Analyzer is required.

There are many Chinese analyzers; this article uses IKAnalyzer as the example. Its repository, https://git.oschina.net/wltea/IK-Analyzer-2012FF, includes a tutorial.

Highlight

To highlight the keyword in the query results by wrapping it in HTML tags, import lucene-highlighter-xxx.jar.

// Pre and post tags placed around each match; they are empty strings in this listing
SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("", "");
Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));

for (int i = 0; i < hits.length; i++) {
    Document doc = isearcher.doc(hits[i].doc);
    // Add highlighting to the stored content
    TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(doc.get("content")));
    String content = highlighter.getBestFragment(tokenStream, doc.get("content"));
    System.out.println(content);
}
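
As a rough picture of what highlighting produces, the sketch below wraps every literal occurrence of a term in a pair of tags. This is an assumption-laden simplification: the class `HighlightSketch`, its `highlight` method, and the `<b>`/`</b>` tags are invented for illustration, whereas Lucene's Highlighter works on the analyzer's token stream and uses the query scorer to pick the best fragment.

```java
// Conceptual sketch only: this is plain string matching, not Lucene's Highlighter.
class HighlightSketch
{
    // Wrap each literal occurrence of term in preTag and postTag.
    static String highlight(String text, String term, String preTag, String postTag)
    {
        if (term.isEmpty())
        {
            return text;
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(term, from)) >= 0)
        {
            // Copy the text before the match, then the tagged match itself
            out.append(text, from, hit).append(preTag).append(term).append(postTag);
            from = hit + term.length();
        }
        // Copy whatever follows the last match
        out.append(text, from, text.length());
        return out.toString();
    }

    public static void main(String[] args)
    {
        // The <b> tags are an illustrative choice, not taken from the article
        System.out.println(highlight("Lucene is a good tool for full-text search", "search", "<b>", "</b>"));
    }
}
```

The two tag arguments play the same role as the two strings passed to SimpleHTMLFormatter: whatever is supplied there is what surrounds each match in the returned fragment.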

Lucene Chinese analyzer

package lucene_demo04;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * Chinese word segmentation with IKAnalyzer, plus highlighting of the search results
 *
 * @author Yipfun
 */
public class LuceneDemo04
{
    private static final Version version = Version.LUCENE_4_9;
    private Directory directory = null;
    private DirectoryReader ireader = null;
    private IndexWriter iwriter = null;
    private IKAnalyzer analyzer;

    // Test data
    private String[] content = {"Hello, I am a Chinese Communist", "People's Republic of China", "Chinese people stand Up", "Lucene is a good tool for full-text search", "full-text Search chinese word"};

    /**
     * Constructor
     */
    public LuceneDemo04()
    {
        directory = new RAMDirectory();
    }

    private IKAnalyzer getAnalyzer()
    {
        // Create the analyzer on first use and reuse it afterwards
        if (analyzer == null)
        {
            analyzer = new IKAnalyzer();
        }
        return analyzer;
    }

    /**
     * Create the index
     */
    public void createIndex()
    {
        Document doc = null;
        try
        {
            IndexWriterConfig iwConfig = new IndexWriterConfig(version, getAnalyzer());
            iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
            iwriter = new IndexWriter(directory, iwConfig);
            for (String text : content)
            {
                doc = new Document();
                doc.add(new TextField("content", text, Field.Store.YES));
                iwriter.addDocument(doc);
            }
        } catch (IOException e)
        {
            e.printStackTrace();
        } finally
        {
            try
            {
                if (iwriter != null)
                    iwriter.close();
            } catch (IOException e)
            {
                e.printStackTrace();
            }
        }
    }

    public IndexSearcher getSearcher()
    {
        try
        {
            if (ireader == null)
            {
                ireader = DirectoryReader.open(directory);
            } else
            {
                // Reopen the reader only if the index has changed
                DirectoryReader tr = DirectoryReader.openIfChanged(ireader);
                if (tr != null)
                {
                    ireader.close();
                    ireader = tr;
                }
            }
            return new IndexSearcher(ireader);
        } catch (CorruptIndexException e)
        {
            e.printStackTrace();
        } catch (IOException e)
        {
            e.printStackTrace();
        }
        return null;
    }

    public void searchByTerm(String field, String keyword, int num) throws InvalidTokenOffsetsException
    {
        IndexSearcher isearcher = getSearcher();
        Analyzer analyzer = getAnalyzer();
        // Build the Query object with QueryParser and the analyzer
        QueryParser qp = new QueryParser(version, field, analyzer);
        // Combine the terms of the query with OR (this is already QueryParser's default)
        qp.setDefaultOperator(QueryParser.OR_OPERATOR);
        try
        {
            Query query = qp.parse(keyword);
            ScoreDoc[] hits;

            // Note that IndexSearcher offers several search(...) overloads
            hits = isearcher.search(query, null, num).scoreDocs;

            // Wrap each matched keyword in HTML tags; requires lucene-highlighter-xxx.jar.
            // The pre and post tags are empty strings in this listing.
            SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("", "");
            Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));

            for (int i = 0; i < hits.length; i++)
            {
                Document doc = isearcher.doc(hits[i].doc);
                // Add highlighting to the stored content
                TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(doc.get("content")));
                String content = highlighter.getBestFragment(tokenStream, doc.get("content"));
                System.out.println(content);
            }

        } catch (IOException e)
        {
            e.printStackTrace();
        } catch (ParseException e)
        {
            e.printStackTrace();
        }
    }

    /**
     * Search with a filter
     *
     * @param field
     * @param keyword
     * @param num
     * @throws InvalidTokenOffsetsException
     */
    public void searchByTermFilter(String field, String keyword, int num) throws InvalidTokenOffsetsException
    {
        IndexSearcher isearcher = getSearcher();
        Analyzer analyzer = getAnalyzer();
        // Build the Query object with QueryParser and the analyzer
        QueryParser qp = new QueryParser(version, field, analyzer);
        // Combine the terms of the query with OR (this is already QueryParser's default)
        qp.setDefaultOperator(QueryParser.OR_OPERATOR);
        try
        {
            Query query = qp.parse(keyword);
            Query q2 = qp.parse("Full text Search");
            ScoreDoc[] hits;

            // Restrict the results to documents that also match q2
            QueryWrapperFilter filter = new QueryWrapperFilter(q2);
            // Note that IndexSearcher offers several search(...) overloads
            hits = isearcher.search(query, filter, num).scoreDocs;

        } catch (IOException e)
        {
            e.printStackTrace();
        } catch (ParseException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws InvalidTokenOffsetsException
    {
        System.out.println("Start");
        LuceneDemo04 ld = new LuceneDemo04();
        ld.createIndex();
        long start = System.currentTimeMillis();
        // The field name must match the one used when indexing ("content")
        ld.searchByTerm("content", "people", 500);
        System.out.println("End search use " + (System.currentTimeMillis() - start) + "ms");
    }

}

Output:

Start
Load extension dictionary: ext.dic
Load extension stop dictionary: stopword.dic
Chinese People Republic
China People stand up from here
End search use 129ms


