Lucene Basics (3): Chinese Word Segmentation and Highlighting

Source: Internet
Author: User

Lucene analyzers and highlighting

Analyzers

Lucene indexes a document according to how its analyzer splits the text into tokens, so different analyzers produce different indexes. The previous example used the standard analyzer, which works very well for English but poorly for Chinese: it splits Chinese text into individual characters and has no concept of words.

To use a different analyzer, simply instantiate the Analyzer as the third-party implementation.

There are many Chinese analyzers; this article uses IKAnalyzer as the example. Its repository, https://git.oschina.net/wltea/IK-Analyzer-2012FF, includes a tutorial.
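To make the difference between per-character and word-based segmentation concrete, here is a minimal plain-Java sketch. It does not use Lucene or IKAnalyzer, and it is not how either library is implemented: the class name SegmentationSketch, the three-word dictionary, and the sample text are assumptions chosen for illustration, and forward maximum matching is just one simple dictionary-based technique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SegmentationSketch {

    // One token per character: roughly what a per-character analyzer yields for Chinese.
    // (Assumes characters in the Basic Multilingual Plane, one char each.)
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Forward maximum matching: at each position take the longest dictionary word,
    // falling back to a single character when no dictionary word matches.
    static List<String> forwardMaxMatch(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary; a real analyzer ships a much larger one.
        Set<String> dictionary = new HashSet<>(Arrays.asList("中华", "人民", "共和国"));
        String text = "中华人民共和国";
        System.out.println(perCharacter(text));                   // [中, 华, 人, 民, 共, 和, 国]
        System.out.println(forwardMaxMatch(text, dictionary, 3)); // [中华, 人民, 共和国]
    }
}
```

With per-character tokens, a search for a two-character word can only be answered by combining single-character matches, whereas dictionary-based tokens let the index store the word itself.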

Highlight

To highlight query results, import lucene-highlighter-xxx.jar. Matched keywords are then wrapped in HTML tags:

// Wrap matched keywords in HTML tags; requires lucene-highlighter-xxx.jar
SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<span style='color:red'>", "</span>");
Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));

for (int i = 0; i < hits.length; i++) {
    Document doc = isearcher.doc(hits[i].doc);
    // Add highlighting to the content field
    TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(doc.get("content")));
    String content = highlighter.getBestFragment(tokenStream, doc.get("content"));
    System.out.println(content);
}
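The idea behind the snippet above, wrapping each matched term in a pre-tag and a post-tag, can be sketched in plain Java. This is a simplification under stated assumptions: Lucene's Highlighter works on the analyzer's token offsets and selects the best-scoring fragment, while the hypothetical HighlightSketch class below uses plain substring matching and performs no HTML escaping.

```java
import java.util.Arrays;
import java.util.List;

public class HighlightSketch {

    // Wraps every occurrence of any term in preTag/postTag.
    // When several terms match at the same position, the longest one wins.
    static String highlight(String text, List<String> terms, String preTag, String postTag) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            String matched = null;
            for (String term : terms) {
                if (!term.isEmpty() && text.startsWith(term, pos)
                        && (matched == null || term.length() > matched.length())) {
                    matched = term;
                }
            }
            if (matched != null) {
                out.append(preTag).append(matched).append(postTag);
                pos += matched.length();
            } else {
                out.append(text.charAt(pos));
                pos++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = highlight("中华人民共和国", Arrays.asList("人民"),
                "<span style='color:red'>", "</span>");
        System.out.println(html); // 中华<span style='color:red'>人民</span>共和国
    }
}
```

In a real page the stored text should be HTML-escaped before the tags are added, otherwise markup inside the document content would be rendered as well.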

Complete example: Lucene with a Chinese analyzer

package lucene_demo04;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * Chinese word segmentation with IKAnalyzer, and highlighting of search results
 *
 * @author Yipfun
 */
public class LuceneDemo04
{
    private static final Version version = Version.LUCENE_4_9;
    private Directory directory = null;
    private DirectoryReader ireader = null;
    private IndexWriter iwriter = null;
    private IKAnalyzer analyzer;

    // Test data
    private String[] content = {"Hello, I am a Chinese Communist", "People's Republic of China", "Chinese people stand Up", "Lucene is a good tool for full-text search", "full-text Search chinese word"};

    /**
     * Constructor
     */
    public LuceneDemo04()
    {
        directory = new RAMDirectory();
    }

    private IKAnalyzer getAnalyzer()
    {
        // Create the analyzer once and reuse it on later calls
        if (analyzer == null)
        {
            analyzer = new IKAnalyzer();
        }
        return analyzer;
    }

    /**
     * Create the index
     */
    public void createIndex()
    {
        try
        {
            IndexWriterConfig iwConfig = new IndexWriterConfig(version, getAnalyzer());
            iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
            iwriter = new IndexWriter(directory, iwConfig);
            for (String text : content)
            {
                // One document per test sentence, stored so it can be highlighted later
                Document doc = new Document();
                doc.add(new TextField("content", text, Field.Store.YES));
                iwriter.addDocument(doc);
            }
        } catch (IOException e)
        {
            e.printStackTrace();
        } finally
        {
            try
            {
                if (iwriter != null)
                    iwriter.close();
            } catch (IOException e)
            {
                e.printStackTrace();
            }
        }
    }

    public IndexSearcher getSearcher()
    {
        try
        {
            if (ireader == null)
            {
                ireader = DirectoryReader.open(directory);
            } else
            {
                // Reopen the reader only if the index has changed
                DirectoryReader tr = DirectoryReader.openIfChanged(ireader);
                if (tr != null)
                {
                    ireader.close();
                    ireader = tr;
                }
            }
            return new IndexSearcher(ireader);
        } catch (CorruptIndexException e)
        {
            e.printStackTrace();
        } catch (IOException e)
        {
            e.printStackTrace();
        }
        return null;
    }

    public void searchByTerm(String field, String keyword, int num) throws InvalidTokenOffsetsException
    {
        IndexSearcher isearcher = getSearcher();
        Analyzer analyzer = getAnalyzer();
        // Use QueryParser with the analyzer to construct the Query object
        QueryParser qp = new QueryParser(version, field, analyzer);
        // Terms in the query string are combined with OR: matching any one term is enough.
        // OR is already QueryParser's default; this call only makes it explicit.
        qp.setDefaultOperator(QueryParser.OR_OPERATOR);
        try
        {
            Query query = qp.parse(keyword);
            ScoreDoc[] hits;

            // Note that IndexSearcher offers several search(...) overloads
            hits = isearcher.search(query, null, num).scoreDocs;

            // Wrap matched keywords in HTML tags; requires lucene-highlighter-xxx.jar
            SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<span style='color:red'>", "</span>");
            Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));

            for (int i = 0; i < hits.length; i++)
            {
                Document doc = isearcher.doc(hits[i].doc);
                // Add highlighting to the content field
                TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(doc.get("content")));
                String content = highlighter.getBestFragment(tokenStream, doc.get("content"));
                System.out.println(content);
            }

        } catch (IOException e)
        {
            e.printStackTrace();
        } catch (ParseException e)
        {
            e.printStackTrace();
        }
    }

    /**
     * Search using a filter
     *
     * @param field
     * @param keyword
     * @param num
     * @throws InvalidTokenOffsetsException
     */
    public void searchByTermFilter(String field, String keyword, int num) throws InvalidTokenOffsetsException
    {
        IndexSearcher isearcher = getSearcher();
        Analyzer analyzer = getAnalyzer();
        // Use QueryParser with the analyzer to construct the Query object
        QueryParser qp = new QueryParser(version, field, analyzer);
        // Terms in the query string are combined with OR: matching any one term is enough.
        // OR is already QueryParser's default; this call only makes it explicit.
        qp.setDefaultOperator(QueryParser.OR_OPERATOR);
        try
        {
            Query query = qp.parse(keyword);
            Query q2 = qp.parse("Full text Search");
            ScoreDoc[] hits;

            // Only documents that also match q2 pass the filter
            QueryWrapperFilter filter = new QueryWrapperFilter(q2);
            // Note that IndexSearcher offers several search(...) overloads
            hits = isearcher.search(query, filter, num).scoreDocs;

            // Wrap matched keywords in HTML tags; requires lucene-highlighter-xxx.jar
            SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<span style='color:red'>", "</span>");
            Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));

            for (int i = 0; i < hits.length; i++)
            {
                Document doc = isearcher.doc(hits[i].doc);
                // Add highlighting to the content field
                TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(doc.get("content")));
                String content = highlighter.getBestFragment(tokenStream, doc.get("content"));
                System.out.println(content);
            }

        } catch (IOException e)
        {
            e.printStackTrace();
        } catch (ParseException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws InvalidTokenOffsetsException
    {
        System.out.println("Start");
        LuceneDemo04 ld = new LuceneDemo04();
        ld.createIndex();
        long start = System.currentTimeMillis();
        ld.searchByTerm("content", "people", 500);
        System.out.println("End Search use " + (System.currentTimeMillis() - start) + "ms");
    }

}

Output:

Start
Load extension dictionary: ext.dic
Load extension stop-word dictionary: stopword.dic
Chinese <span style='color:red'>People</span> Republic
China <span style='color:red'>People</span> stand up from here
End Search use 129ms
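Two query-related choices in the listing, the OR default operator set through setDefaultOperator and the QueryWrapperFilter used in the filter search, can be illustrated with plain Java sets. This is a rough sketch of the matching semantics only, not of Lucene's implementation or scoring; the class QuerySemanticsSketch, its helper methods, and the token sets are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QuerySemanticsSketch {

    // OR semantics: a document matches if it contains at least one term.
    static boolean matchesAny(Set<String> docTokens, List<String> terms) {
        for (String term : terms) {
            if (docTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // AND semantics: a document matches only if it contains every term.
    static boolean matchesAll(Set<String> docTokens, List<String> terms) {
        return docTokens.containsAll(terms);
    }

    // Filtered search: a hit must satisfy the query and also pass the filter.
    // Both query and filter use OR semantics here, as in the listing.
    static List<Integer> searchWithFilter(List<Set<String>> docs,
                                          List<String> queryTerms, List<String> filterTerms) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (matchesAny(docs.get(i), filterTerms) && matchesAny(docs.get(i), queryTerms)) {
                hits.add(i);
            }
        }
        return hits;
    }

    static Set<String> tokens(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        // Hypothetical token sets standing in for analyzed documents.
        List<Set<String>> docs = Arrays.asList(
                tokens("lucene", "full-text", "search", "tool"),
                tokens("full-text", "search", "chinese", "word"),
                tokens("chinese", "people"));
        List<String> query = Arrays.asList("chinese", "search");
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("doc " + i + ": OR=" + matchesAny(docs.get(i), query)
                    + " AND=" + matchesAll(docs.get(i), query));
        }
        // Query "chinese" restricted to documents that contain "search"
        System.out.println(searchWithFilter(docs, Arrays.asList("chinese"), Arrays.asList("search"))); // [1]
    }
}
```

Under OR all three sample documents match the two-term query, under AND only the second does, and the filter narrows a "chinese" query to the one document that also contains "search".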
