"lucene" Apache lucene Full Text search engine architecture Introduction

Source: Internet
Author: User
<span id="Label3"></p> <blockquote> <blockquote> <p>Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation. It offers a simple yet powerful application programming interface for full-text indexing and searching. In the Java development environment, Lucene is a mature, free, open-source tool, and it has been the most popular free Java information retrieval library in recent years. -- Baidu Encyclopedia</p> </blockquote> </blockquote>
<p><p>This blog post covers two things: first the principle of full-text search in Lucene, then a sample program showing how to use Lucene. For the principle of full-text search, I read several articles online and drew on two of them while writing this part (their addresses are at the end of the post). Thanks to the original authors.</p></p>
<strong><strong>1. Full-text Search</strong></strong>
<p><p>What is full-text search? Suppose you want to find a string in a file. The most direct approach is to scan from the beginning until the string is found. That works well for small files, but it does not scale to large ones. The same holds when looking for the files that contain a given string: scanning a hard disk that holds tens of gigabytes this way is very slow.<br>The data in files is unstructured data, meaning it has no fixed structure. To solve the efficiency problem above, we first extract part of the information from the unstructured data and reorganize it so that it has some structure, then search that structured data, which makes the search relatively fast. This is full-text search: the process of first building an index and then searching the index.<br>So how does Lucene build its index? 
Suppose you have two documents with the following content:</p></p>
<blockquote> <blockquote> <p>Article 1: Tom lives in Guangzhou, I live in Guangzhou too.<br>Article 2: He once lived in Shanghai.</p> </blockquote> </blockquote>
<p><p>The first step is to pass the documents to the tokenizer, which splits each document into words and removes punctuation and stop words. Stop words are words that carry little meaning on their own, such as "a", "the", and "too" in English; in this example "in" and "once" are dropped as well. Tokenization yields tokens, as follows:</p></p>
<blockquote> <blockquote> <p>Article 1 after tokenization: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]<br>Article 2 after tokenization: [He] [lived] [Shanghai]</p> </blockquote> </blockquote>
<p><p>The tokens are then passed to the linguistic processor. For English, the linguistic processor generally converts letters to lowercase and reduces each word to its root form, for example by stemming ("lives" to "live") or by lemmatization ("drove" to "drive"). The output of this step is called a term. 
The terms for the two articles are:</p></p>
<blockquote> <blockquote> <p>Article 1 after processing: [tom] [live] [guangzhou] [i] [live] [guangzhou]<br>Article 2 after processing: [he] [live] [shanghai]</p> </blockquote> </blockquote>
<p><p>Finally, the resulting terms are passed to the indexer, which processes them into the following index structure:</p></p>
<table>
<thead> <tr> <th align="center">Keyword</th> <th align="center">Article number [frequency]</th> <th align="center">Positions</th> </tr> </thead>
<tbody>
<tr> <td align="center">guangzhou</td> <td align="center">1[2]</td> <td align="center">3,6</td> </tr>
<tr> <td align="center">he</td> <td align="center">2[1]</td> <td align="center">1</td> </tr>
<tr> <td align="center">i</td> <td align="center">1[1]</td> <td align="center">4</td> </tr>
<tr> <td align="center">live</td> <td align="center">1[2],2[1]</td> <td align="center">2,5,2</td> </tr>
<tr> <td align="center">shanghai</td> <td align="center">2[1]</td> <td align="center">3</td> </tr>
<tr> <td align="center">tom</td> <td align="center">1[1]</td> <td align="center">1</td> </tr>
</tbody>
</table>
<p><p>These three columns are the core of the Lucene index structure. The keywords are stored in alphabetical order, so Lucene can locate a keyword quickly with a binary search. In the implementation, Lucene stores the three columns as a dictionary file (term dictionary), a frequency file (frequencies), and a position file (positions). 
The dictionary file stores each keyword together with pointers into the frequency file and the position file, through which the keyword's frequency and position information can be found.<br>A search runs a binary search over the dictionary to find the term, follows the pointer into the frequency file to read all the article numbers, and returns the results; the positions then show where the term appears within a specific article. Building the index the first time may be slow, but it does not have to be rebuilt for every search, so searching is fast. Of course, this describes search over English text; the rules for Chinese are different, and I will look into them later.</p></p>
<strong><strong>2. Sample Code</strong></strong>
<p><p>As analyzed above, full-text search has two steps: first index, then search. To test this process I wrote two Java classes, one for indexing and one for searching. First create a Maven project with the following pom.xml:</p></p>
<pre class="prettyprint"><code class="language-xml hljs ">&lt;project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"&gt;
  &lt;modelVersion&gt;4.0.0&lt;/modelVersion&gt;
  &lt;groupId&gt;demo.lucene&lt;/groupId&gt;
  &lt;artifactId&gt;Lucene01&lt;/artifactId&gt;
  &lt;version&gt;0.0.1-SNAPSHOT&lt;/version&gt;
  &lt;build/&gt;

  &lt;dependencies&gt;
    &lt;!-- Lucene core package --&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;org.apache.lucene&lt;/groupId&gt;
      &lt;artifactId&gt;lucene-core&lt;/artifactId&gt;
      &lt;version&gt;6.1.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;!-- Lucene query parser package --&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;org.apache.lucene&lt;/groupId&gt;
      &lt;artifactId&gt;lucene-queryparser&lt;/artifactId&gt;
      &lt;version&gt;6.1.0&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;!-- Lucene analyzer package --&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;org.apache.lucene&lt;/groupId&gt;
      &lt;artifactId&gt;lucene-analyzers-common&lt;/artifactId&gt;
      &lt;version&gt;6.1.0&lt;/version&gt;
    &lt;/dependency&gt;
  &lt;/dependencies&gt;
&lt;/project&gt;</code></pre>
<p><p>Before writing the program we need some files to index. I picked a few English documents (Chinese will be covered in a later post) and put them in the D:\lucene\data\ directory, 
which serves as the sample data set.<br>The documents are all in English.<br>Next, write the indexing program in Java:</p></p>
<pre class="prettyprint"><code class="language-java hljs ">import java.io.File;
import java.io.FileReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Indexing class
 * @author Ni Shengwu
 */
public class Indexer {

    private IndexWriter writer; // index writer instance

    // Constructor: instantiate the IndexWriter
    public Indexer(String indexDir) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        Analyzer analyzer = new StandardAnalyzer(); // standard analyzer; drops whitespace and stop words such as "is", "a", "the"
        IndexWriterConfig config = new IndexWriterConfig(analyzer); // attach the analyzer to the writer configuration
        writer = new IndexWriter(dir, config); // instantiate the index writer
    }

    // Close the index writer
    public void close() throws Exception {
        writer.close();
    }

    // Index all files under the specified directory
    public int indexAll(String dataDir) throws Exception {
        File[] files = new File(dataDir).listFiles(); // get all files under this path
        for (File file : files) {
            indexFile(file); // index each file with the indexFile method below
        }
        return writer.numDocs(); // return the number of documents indexed
    }

    // Index the specified file
    private void indexFile(File file) throws Exception {
        System.out.println("Path of the file being indexed: " + file.getCanonicalPath());
        Document doc = getDocument(file); // build the Document for this file with the getDocument method below
        writer.addDocument(doc); // add the document to the index
    }

    // Build a Document and set its fields; a Document is similar to a row in a database table
    private Document getDocument(File file) throws Exception {
        Document doc = new Document();
        doc.add(new TextField("contents", new FileReader(file))); // file content
        doc.add(new TextField("fileName", file.getName(), Field.Store.YES)); // file name, stored in the index
        doc.add(new TextField("fullPath", file.getCanonicalPath(), Field.Store.YES)); // file path, stored in the index
        return doc;
    }

    public static void main(String[] args) {
        String indexDir = "D:\\lucene"; // directory where the index is saved
        String dataDir = "D:\\lucene\\data"; // directory holding the files to be indexed
        Indexer indexer = null;
        int indexedNum = 0;
        long startTime = System.currentTimeMillis(); // record the indexing start time
        try {
            indexer = new Indexer(indexDir);
            indexedNum = indexer.indexAll(dataDir);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (indexer != null) { // the constructor may have failed
                    indexer.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        long endTime = System.currentTimeMillis(); // record the indexing end time
        System.out.println("Indexing took " + (endTime - startTime) + " milliseconds");
        System.out.println("Indexed " + indexedNum + " files in total");
    }
}</code></pre>
<p><p>The program follows the indexing process described above, and the comments explain each step, so I will not repeat it here. Running the main method in my environment indexed 7 files in 649 milliseconds, which is quite fast, and the printed file paths were correct. Afterwards, D:\lucene\ contains some newly generated files, which are the index.<br>With the index in place, we can search for a string. I opened one of the files and picked the ugly-looking string "generate-maven-artifacts" as the search target. Here is the Java code for searching:</p></p>
<pre class="prettyprint"><code class="language-java hljs ">import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

    public static void search(String indexDir, String q) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexDir)); // open the directory where the index is located
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer(); // standard analyzer; drops whitespace and stop words such as "is", "a", "the"
        QueryParser parser = new QueryParser("contents", analyzer); // query parser for the "contents" field
        Query query = parser.parse(q); // parse the query string into a Query object

        long startTime = System.currentTimeMillis(); // record the search start time
        TopDocs docs = searcher.search(query, 10); // run the query and keep the top 10 hits in docs
        long endTime = System.currentTimeMillis(); // record the search end time
        System.out.println("Matching " + q + " took " + (endTime - startTime) + " milliseconds");
        System.out.println("Found " + docs.totalHits + " records");

        for (ScoreDoc scoreDoc : docs.scoreDocs) { // iterate over the hits
            Document doc = searcher.doc(scoreDoc.doc); // scoreDoc.doc is the document id; fetch the document by it
            System.out.println(doc.get("fullPath")); // fullPath is a field we defined when building the index
        }
        reader.close();
    }

    public static void main(String[] args) {
        String indexDir = "D:\\lucene";
        String q = "generate-maven-artifacts"; // search for this string
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}</code></pre>
<p><p>Running the main method shows that Lucene finds the string correctly. When I removed the "-" characters from the middle, Lucene still found it, but when I removed the leading characters and left only "rtifacts", nothing was found. This shows that the Lucene index is organized by word. This limitation can be worked around, which I will cover in a follow-up post.</p></p>
<p><p>Parts of this post draw on:<br>http://blog.csdn.net/forfuture1978/article/details/4711308<br>http://www.cnblogs.com/dewin/archive/2009/11/24/1609905.html</p></p>
<p><p>- Willing to share and make progress together!<br>-- My blog 
Home: http://blog.csdn.net/eson_15</p></p></span>
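
As a supplement to section 1, the three indexing steps (tokenizing, linguistic processing, indexing) can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions, not how Lucene is implemented: the class name InvertedIndexSketch, the stop-word list, and the lookup-table normalizer are all hypothetical and were chosen just to reproduce the two-article example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // Hypothetical stop-word list, chosen only to reproduce the article's example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "the", "too", "in", "once"));

    // Step 1 (tokenizer): split on non-letters, drop punctuation and stop words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^A-Za-z]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw.toLowerCase())) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Step 2 (linguistic processor): lowercase, then map a few word forms to
    // their root. A real stemmer is rule-based; this lookup is a stand-in.
    public static String normalize(String token) {
        String lower = token.toLowerCase();
        switch (lower) {
            case "lives":
            case "lived":
                return "live";
            default:
                return lower;
        }
    }

    // Step 3 (indexer): term -> (article number -> 1-based positions).
    // TreeMap keeps the terms in alphabetical order, like the table.
    public static Map<String, Map<Integer, List<Integer>>> buildIndex(String... articles) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (int a = 0; a < articles.length; a++) {
            List<String> tokens = tokenize(articles[a]);
            for (int p = 0; p < tokens.size(); p++) {
                index.computeIfAbsent(normalize(tokens.get(p)), k -> new TreeMap<>())
                     .computeIfAbsent(a + 1, k -> new ArrayList<>())
                     .add(p + 1);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, List<Integer>>> index = buildIndex(
                "Tom lives in Guangzhou, I live in Guangzhou too.",
                "He once lived in Shanghai.");
        index.forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

The frequency column of the table is simply the length of each position list, so the sketch does not store it separately. For the term "live", the map holds positions 2 and 5 for article 1 and position 2 for article 2, matching the table.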

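The lookup path from section 1 (binary search over the sorted dictionary, then following a pointer to the frequency data) can also be sketched with in-memory arrays. This is a minimal sketch, assuming the array index plays the role of the file pointer; the class name TermDictionarySketch and the array layout are hypothetical and do not reflect Lucene's actual file formats.

```java
import java.util.Arrays;

public class TermDictionarySketch {

    // Sorted term dictionary, taken from the table in section 1.
    private static final String[] TERMS =
            {"guangzhou", "he", "i", "live", "shanghai", "tom"};

    // Stand-in for the frequency file: POSTINGS[i] holds
    // {article number, frequency} pairs for TERMS[i].
    private static final int[][][] POSTINGS = {
            {{1, 2}},          // guangzhou
            {{2, 1}},          // he
            {{1, 1}},          // i
            {{1, 2}, {2, 1}},  // live
            {{2, 1}},          // shanghai
            {{1, 1}},          // tom
    };

    // Binary-search the dictionary; return the matching article numbers,
    // or an empty array when the term is not in the dictionary.
    public static int[] lookup(String term) {
        int slot = Arrays.binarySearch(TERMS, term);
        if (slot < 0) {
            return new int[0];
        }
        int[][] postings = POSTINGS[slot];
        int[] articles = new int[postings.length];
        for (int i = 0; i < postings.length; i++) {
            articles[i] = postings[i][0];
        }
        return articles;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(lookup("live")));  // prints [1, 2]
        System.out.println(Arrays.toString(lookup("ive")));   // prints []
    }
}
```

Only whole terms are stored in the dictionary, so a fragment of a word has no entry to find. This is consistent with the observation in section 2 that searching for "rtifacts" returns nothing.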