[Lucene] Apache Lucene Full-Text Search Engine Architecture: Chinese Word Segmentation and Highlighting

Source: Internet
Author: User
<span id="Label3"><p>The previous post used Lucene's standard analyzer, which is designed for English. It is of little help with Chinese, because Chinese text is not divided into words the way English text is, so any project that handles Chinese needs a Chinese analyzer. This post shows how to use the SmartCN Chinese analyzer and how to highlight search results.</p>
<strong>1. Chinese word segmentation</strong>
<p>To use the Chinese analyzer, first add its jar as a dependency.</p>
<pre class="prettyprint"><code class="language-xml hljs ">&lt;!-- Lucene Chinese analyzer --&gt;
&lt;dependency&gt;
    &lt;groupId&gt;org.apache.lucene&lt;/groupId&gt;
    &lt;artifactId&gt;lucene-analyzers-smartcn&lt;/artifactId&gt;
    &lt;version&gt;5.3.1&lt;/version&gt;
&lt;/dependency&gt;</code></pre>
<p>Then prepare some data and build an index with the Chinese analyzer, so that it can be searched later.</p>
<pre class="prettyprint"><code class="language-java hljs ">import java.nio.file.Paths;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Indexer {

    private Directory dir; // location where the index is stored

    // test data
    private Integer[] ids = {1, 2, 3}; // identifies each document
    private String[] citys = {"Shanghai", "Nanjing", "Qingdao"};
    private String[] descs = {
        "Shanghai is a bustling city.",
        "Nanjing is a city with culture.",
        "Qingdao is a beautiful city."
    };

    // build the index
    public void index(String indexDir) throws Exception {
        dir = FSDirectory.open(Paths.get(indexDir));
        IndexWriter writer = getWriter();
        for (int i = 0; i &lt; ids.length; i++) {
            Document doc = new Document();
            doc.add(new IntField("id", ids[i], Field.Store.YES));
            doc.add(new StringField("city", citys[i], Field.Store.YES));
            doc.add(new TextField("desc", descs[i], Field.Store.YES));
            writer.addDocument(doc); // add a document
        }
        writer.close(); // the documents are only actually written on close
    }

    // get an IndexWriter instance
    private IndexWriter getWriter() throws Exception {
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
        IndexWriterConfig config = new IndexWriterConfig(analyzer); // pass the analyzer to the writer configuration
        IndexWriter writer = new IndexWriter(dir, config); // create the index writer
        return writer;
    }

    public static void main(String[] args) throws Exception {
        new Indexer().index("d:\\lucene2");
    }
}</code></pre>
<p>With the index built, the next step is to query it.</p>
<pre class="prettyprint"><code class="language-java hljs ">import java.nio.file.Paths;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

    public static void search(String indexDir, String q) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexDir)); // the directory where the index is stored
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
        QueryParser parser = new QueryParser("desc", analyzer); // query parser
        Query query = parser.parse(q); // parse the query string into a Query object

        long startTime = System.currentTimeMillis(); // search start time
        TopDocs docs = searcher.search(query, 10); // run the query and keep the top 10 hits in docs
        long endTime = System.currentTimeMillis(); // search end time
        System.out.println("Matching \"" + q + "\" took " + (endTime - startTime) + " ms");
        System.out.println("Found " + docs.totalHits + " record(s)");

        for (ScoreDoc scoreDoc : docs.scoreDocs) { // iterate over the hits
            Document doc = searcher.doc(scoreDoc.doc); // scoreDoc.doc is the docID used to fetch the document
            System.out.println(doc.get("city"));
            System.out.println(doc.get("desc"));
        }
        reader.close();
    }

    public static void main(String[] args) {
        String indexDir = "d:\\lucene2";
        String q = "shanghai bustling"; // the string to search for
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}</code></pre>
<p>Here is the result of the query:</p>
<blockquote> <p>Matching "shanghai bustling" took 15 ms<br>Found 1 record(s)<br>Shanghai<br>Shanghai is a bustling city.</p> </blockquote>
<strong>2. Highlighting</strong>
<p>Search results are usually displayed with the matched terms highlighted. Baidu, for example, marks the matched terms in its results in red, and Lucene can do the same. The first step is to add the highlighter jar as a dependency.</p>
<pre class="prettyprint"><code class="language-xml hljs ">&lt;!-- Lucene highlighter --&gt;
&lt;dependency&gt;
    &lt;groupId&gt;org.apache.lucene&lt;/groupId&gt;
    &lt;artifactId&gt;lucene-highlighter&lt;/artifactId&gt;
    &lt;version&gt;5.3.1&lt;/version&gt;
&lt;/dependency&gt;</code></pre>
<p>Then add the following highlighting code to the search code above.</p>
<pre class="prettyprint"><code class="language-java hljs ">import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;

public class Searcher {

    public static void search(String indexDir, String q) throws Exception {
        // omitted ...
        System.out.println("Matching \"" + q + "\" took " + (endTime - startTime) + " ms");
        System.out.println("Found " + docs.totalHits + " record(s)");

        // with no arguments, the formatter defaults to bold, i.e. &lt;B&gt;...&lt;/B&gt;
        SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("&lt;b&gt;&lt;font color=red&gt;", "&lt;/font&gt;&lt;/b&gt;");
        QueryScorer scorer = new QueryScorer(query); // scores fragments by how well they match the query
        Fragmenter fragmenter = new SimpleSpanFragmenter(scorer); // splits the text into fragments using the scorer
        Highlighter highlighter = new Highlighter(simpleHTMLFormatter, scorer);
        highlighter.setTextFragmenter(fragmenter); // set the fragmenter that decides which fragment is shown

        for (ScoreDoc scoreDoc : docs.scoreDocs) { // iterate over the hits
            Document doc = searcher.doc(scoreDoc.doc); // scoreDoc.doc is the docID used to fetch the document
            System.out.println(doc.get("city"));
            System.out.println(doc.get("desc"));
            String desc = doc.get("desc");

            // print the highlighted fragment
            if (desc != null) {
                TokenStream tokenStream = analyzer.tokenStream("desc", new StringReader(desc));
                String summary = highlighter.getBestFragment(tokenStream, desc);
                System.out.println(summary);
            }
        }
        reader.close();
    }

    public static void main(String[] args) {
        String indexDir = "d:\\lucene2";
        String q = "shanghai bustling"; // the string to search for
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}</code></pre>
<p>Here is the result of the query:</p>
<blockquote> <p>Matching "shanghai bustling" took 15 ms<br>Found 1 record(s)<br>Shanghai<br>Shanghai is a bustling city.<br><b>Shanghai</b> is a <b>bustling</b> city.</p> </blockquote>
<p>A brief note on the score used in the program above. A text may contain the search keywords in more than one place, so Lucene computes a score for each of those places, that is, how closely each one matches what the user searched for, and then shows the fragment around the best-scoring place. The descriptions in the example above are only one sentence long, which is too short to show this, so I made the description of Nanjing longer, as follows:</p>
<blockquote> <p>Nanjing is a cultural city. Nanjing, known as Ning for short, is the capital of Jiangsu Province, located in eastern China on the lower reaches of the Yangtze River, near the coast. The city has 11 districts under its jurisdiction and a total area of 6,597 square kilometers, with a built-up area of 752.83 square kilometers in 2013 and a resident population of 8.1878 million, of which the urban population is 6.591 million. [1-4] "Jiangnan, land of beauty; Jinling, seat of emperors": Nanjing has more than 6,000 years of civilization, nearly 2,600 years of history as a city, and nearly 500 years of history as a capital. It is one of the four great ancient capitals of China, known as the "ancient capital of six dynasties" and the "metropolis of ten dynasties", and is an important birthplace of Chinese civilization. Several times in history it sheltered the orthodox line of Chinese civilization, and it has long been the political, economic, and cultural center of southern China, with a deep cultural heritage and rich historical relics. [5-7] Nanjing is an important national center of science and education. Since ancient times it has been a city that values culture and education, with a reputation as the "cultural hub of the world" and the "foremost seat of learning in the Southeast". As of 2013, Nanjing has 75 institutions of higher learning, 8 of them Project 211 universities, second only to Beijing and Shanghai; it also has 25 state key laboratories, 169 national key disciplines, and 83 academicians of the two academies, each ranking third in China. [8-10]</p> </blockquote>
<p>This is long enough. If I search for "nanjing culture", the result is:</p>
<blockquote> <p><b>Nanjing</b> is a <b>cultural</b> city. <b>Nanjing</b>, known as Ning for short, is the capital of Jiangsu Province, located in eastern China on the lower reaches of the Yangtze River, near the coast. The city has 11 districts under its jurisdiction and a total area of 6,597 square kilometers, with a built-up area of 752.83 square kilometers in 2013 and a resident population of 8.1878 million, of which</p> </blockquote>
<p>If I search for "nanjing civilization", the result is:</p>
<blockquote> <p>the urban population is 6.591 million. [1-4] "Jiangnan, land of beauty; Jinling, seat of emperors": <b>Nanjing</b> has more than 6,000 years of <b>civilization</b>, nearly 2,600 years of history as a city, and nearly 500 years of history as a capital. It is one of the four great ancient capitals of China, known as the "ancient capital of six dynasties" and the "metropolis of ten dynasties", and is an important birthplace of Chinese <b>civilization</b></p> </blockquote>
<p>This is what the score means in Lucene's highlighter: it picks out the fragment that best matches the query. As you can see, Lucene handles Chinese search well. A specialized search product would of course need further study, but for site search in ordinary development this is enough.</p>
<p>Willing to share, and to make progress together!<br>My blog home page: http://blog.csdn.net/eson_15</p></span>
