"lucene" Apache lucene Full text search engine architecture Chinese word segmentation and highlighting

Last Update:2016-07-12 Source: Internet

Author: User

The previous post used Lucene's standard analyzer, which is designed for English. It does not help much with Chinese, because Chinese text is not split into words by spaces the way English is. Whenever the content is Chinese, use a Chinese analyzer. This post describes how to use the SmartCN Chinese analyzer (SmartChineseAnalyzer) and how to highlight search results.

1. Chinese word segmentation

To use Chinese word segmentation, first add the Chinese analyzer jar to the project.

<pre class="prettyprint"><code class="language-xml hljs "><dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>5.3.1</version>
</dependency></code></pre>

Then prepare some data and build the index with the Chinese analyzer, so that it can be searched later.

<pre class="prettyprint"><code class="language-java hljs ">import java.nio.file.Paths;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Indexer {

    private Directory dir; // location where the index is stored

    // Test data. The strings were Chinese in the original post and appear here in translation.
    private Integer ids[] = {1, 2, 3}; // identifies each document
    private String citys[] = {"shanghai", "nanjing", "qingdao"};
    private String descs[] = {
        "shanghai is a bustling city.",
        "nanjing is a city with a culture.",
        "qingdao is a beautiful city."
    };

    // Build the index
    public void index(String indexDir) throws Exception {
        dir = FSDirectory.open(Paths.get(indexDir));
        IndexWriter writer = getWriter();
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();
            doc.add(new IntField("id", ids[i], Field.Store.YES));
            doc.add(new StringField("city", citys[i], Field.Store.YES));
            doc.add(new TextField("desc", descs[i], Field.Store.YES));
            writer.addDocument(doc); // add a document
        }
        writer.close(); // the documents are only actually written on close
    }

    // Create the IndexWriter instance
    private IndexWriter getWriter() throws Exception {
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(); // use the Chinese analyzer
        IndexWriterConfig config = new IndexWriterConfig(analyzer); // pass the analyzer to the writer configuration
        IndexWriter writer = new IndexWriter(dir, config); // instantiate the index writer
        return writer;
    }

    public static void main(String[] args) throws Exception {
        new Indexer().index("d:\\lucene2");
    }
}</code></pre>

With the index built, the next step is to query it.

<pre class="prettyprint"><code class="language-java hljs ">import java.nio.file.Paths;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser; // from the lucene-queryparser artifact
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

    public static void search(String indexDir, String q) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexDir)); // the directory that holds the index
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(); // same Chinese analyzer as at index time
        QueryParser parser = new QueryParser("desc", analyzer); // query parser for the "desc" field
        Query query = parser.parse(q); // parse the query string into a Query object

        long startTime = System.currentTimeMillis(); // search start time
        TopDocs docs = searcher.search(query, 10); // fetch the top 10 hits
        long endTime = System.currentTimeMillis(); // search end time
        System.out.println("Matching " + q + " took " + (endTime - startTime) + " ms");
        System.out.println("Found " + docs.totalHits + " record(s)");

        for (ScoreDoc scoreDoc : docs.scoreDocs) { // walk through the hits
            Document doc = searcher.doc(scoreDoc.doc); // scoreDoc.doc is the doc ID used to load the document
            System.out.println(doc.get("city"));
            System.out.println(doc.get("desc"));
        }
        reader.close();
    }

    public static void main(String[] args) {
        String indexDir = "d:\\lucene2";
        String q = "shanghai bustling"; // the query string
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}</code></pre>

Here is the result of the query:

<blockquote>
Matching shanghai bustling took 15 ms
Found 1 record(s)
shanghai
shanghai is a bustling city.
</blockquote>

2. Highlighting

Search results are usually displayed with the matched terms highlighted. Baidu, for example, marks the matched words in red. Lucene can do the same. The first step is to add the highlighter jar.

<pre class="prettyprint"><code class="language-xml hljs "><dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>5.3.1</version>
</dependency></code></pre>

Then add the highlighting code to the search method shown above.

<pre class="prettyprint"><code class="language-java hljs ">// Additional imports needed:
// import java.io.StringReader;
// import org.apache.lucene.analysis.TokenStream;
// import org.apache.lucene.search.highlight.Fragmenter;
// import org.apache.lucene.search.highlight.Highlighter;
// import org.apache.lucene.search.highlight.QueryScorer;
// import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
// import org.apache.lucene.search.highlight.SimpleSpanFragmenter;

public class Searcher {

    public static void search(String indexDir, String q) throws Exception {
        // ... omitted: open the index, parse the query and run the search, as above ...
        System.out.println("Matching " + q + " took " + (endTime - startTime) + " ms");
        System.out.println("Found " + docs.totalHits + " record(s)");

        // Tags placed around each hit (bold red here). With no arguments the default is bold, i.e. <B>...</B>.
        SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<b><font color=red>", "</font></b>");
        QueryScorer scorer = new QueryScorer(query); // scores text fragments by how well they match the query
        Fragmenter fragmenter = new SimpleSpanFragmenter(scorer); // splits the text into fragments using the scorer
        Highlighter highlighter = new Highlighter(simpleHTMLFormatter, scorer);
        highlighter.setTextFragmenter(fragmenter); // set how the displayed fragment is cut

        for (ScoreDoc scoreDoc : docs.scoreDocs) { // walk through the hits
            Document doc = searcher.doc(scoreDoc.doc); // scoreDoc.doc is the doc ID used to load the document
            System.out.println(doc.get("city"));
            System.out.println(doc.get("desc"));
            String desc = doc.get("desc");

            // print the highlighted fragment
            if (desc != null) {
                TokenStream tokenStream = analyzer.tokenStream("desc", new StringReader(desc));
                String summary = highlighter.getBestFragment(tokenStream, desc);
                System.out.println(summary);
            }
        }
        reader.close();
    }

    public static void main(String[] args) {
        String indexDir = "d:\\lucene2";
        String q = "shanghai bustling"; // the query string
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}</code></pre>

Here is the result of the query. The last line is the highlighted fragment; when printed, each matched term in it is wrapped in the formatter's tags, which are left out of the listings in this post.

<blockquote>
Matching shanghai bustling took 15 ms
Found 1 record(s)
shanghai
shanghai is a bustling city.
shanghai is a bustling city.
</blockquote>

A brief explanation of the score in the program above: a search term may occur in several places in one text, so Lucene scores each place by how closely it matches what the user searched for, and then shows the fragment around the best-scoring place. The descriptions in the example above are a single sentence each, too short to show this, so I made the description of Nanjing much longer, as follows:

<blockquote>
Nanjing is a city with a culture. Nanjing, known for short as Ning, is the capital of Jiangsu Province, located in eastern China on the lower reaches of the Yangtze River, near the coast. The city has 11 districts under its jurisdiction and a total area of 6,597 square kilometers; in 2013 its built-up area was 752.83 square kilometers and its resident population 8.1878 million, of which the urban population was 6.591 million. [1-4] "A beautiful land south of the Yangtze, an imperial seat at Jinling": Nanjing has more than 6,000 years of civilization, nearly 2,600 years of history as a city and nearly 500 years as a capital. It is one of the four great ancient capitals of China, known as the "ancient capital of six dynasties" and the "metropolis of ten dynasties", and is an important birthplace of Chinese civilization. Several times in history it sheltered the orthodox line of Chinese culture, and for long periods it was the political, economic and cultural center of southern China, with a deep cultural heritage and rich historical relics. [5-7] Nanjing is an important national center of science and education. Since ancient times it has been a city that honors learning and education, with the reputation of being the "cultural hub of the world" and the "foremost seat of learning in the southeast". As of 2013, Nanjing had 75 institutions of higher learning, 8 of them Project 211 universities, second only to Beijing and Shanghai, along with 25 state key laboratories, 169 national key disciplines and 83 academicians of the two academies, each ranking third in China. [8-10]
</blockquote>

This is long enough. If I search for "nanjing culture", the returned fragment is:

<blockquote>
Nanjing is a city with a culture. Nanjing, known for short as Ning, is the capital of Jiangsu Province, located in eastern China on the lower reaches of the Yangtze River, near the coast. The city has 11 districts under its jurisdiction and a total area of 6,597 square kilometers; in 2013 its built-up area was 752.83 square kilometers and its resident population 8.1878 million, of which
</blockquote>

If I search for "nanjing civilization" instead, the returned fragment is:

<blockquote>
the urban population was 6.591 million. [1-4] "A beautiful land south of the Yangtze, an imperial seat at Jinling": Nanjing has more than 6,000 years of civilization, nearly 2,600 years of history as a city and nearly 500 years as a capital. It is one of the four great ancient capitals of China, known as the "ancient capital of six dynasties" and the "metropolis of ten dynasties", and is an important birthplace of Chinese civilization
</blockquote>

This is what the score means here: it selects the fragment that best matches the query. As these results show, Lucene handles Chinese search well. A professional search product would take further study, but for ordinary site search this is sufficient.

Willing to share and progress together! My blog home: http://blog.csdn.net/eson_15
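To make the idea of a "best fragment" from section 2 more concrete, here is a small stand-alone sketch in plain Java. It is not how Lucene's Highlighter, QueryScorer or SimpleSpanFragmenter work internally; those operate on analyzed tokens. The class `FragmentSketch`, its three methods, the fixed-size windows and the plain substring matching are all assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Toy model of "score each fragment, keep the best, mark the hits". Not Lucene code. */
public class FragmentSketch {

    /** Counts case-insensitive occurrences of the query terms in one fragment. */
    public static int score(String fragment, List<String> terms) {
        String lower = fragment.toLowerCase();
        int hits = 0;
        for (String term : terms) {
            String t = term.toLowerCase();
            if (t.isEmpty()) {
                continue; // an empty term would match everywhere
            }
            int from = lower.indexOf(t);
            while (from >= 0) {
                hits++;
                from = lower.indexOf(t, from + t.length());
            }
        }
        return hits;
    }

    /** Cuts the text into windows of size characters and returns the highest-scoring one. */
    public static String bestFragment(String text, List<String> terms, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        String best = "";
        int bestScore = -1;
        for (int start = 0; start < text.length(); start += size) {
            String fragment = text.substring(start, Math.min(text.length(), start + size));
            int s = score(fragment, terms);
            if (s > bestScore) { // on a tie the earliest window wins
                bestScore = s;
                best = fragment;
            }
        }
        return best;
    }

    /** Wraps every occurrence of any term in <b>...</b>, ignoring case. */
    public static String highlight(String fragment, List<String> terms) {
        String alternation = terms.stream()
                .filter(t -> !t.isEmpty())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            return fragment;
        }
        return Pattern.compile(alternation, Pattern.CASE_INSENSITIVE)
                .matcher(fragment)
                .replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        String text = "Nanjing is a city with a culture. It lies on the lower reaches of the Yangtze River. "
                + "Nanjing has more than 6,000 years of civilization and nearly 500 years as a capital.";
        List<String> terms = Arrays.asList("nanjing", "civilization");
        System.out.println(highlight(bestFragment(text, terms, 80), terms));
    }
}
```

The sketch cuts windows at fixed character offsets, so a hit that straddles a window boundary is missed, and it counts raw substring matches rather than analyzed tokens. It only illustrates why changing the query changes which part of a long description comes back.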

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
