"lucene" Apache lucene Full Text search engine architecture Introduction

Last Update:2016-07-08 Source: Internet

Author: User

<blockquote> <blockquote> Lucene is a set of open source libraries for full-text indexing and search, supported and provided by the Apache Software Foundation. Lucene provides a simple yet powerful application interface for full-text indexing and searching. In the Java development environment, Lucene is a mature, free, open source tool. Lucene has been the most popular free Java information retrieval library in recent years. --"Baidu Encyclopedia" </blockquote> </blockquote>

This blog post covers two things: it first introduces the principle of full-text search in Lucene, and then uses a sample program to show how to use Lucene. For the principle of full-text search, I read several articles on the internet and drew on two of them when writing this part (the links are at the end of the article). Thanks to the original authors.

1. Full-text Search

What is full-text search? Suppose you want to find a string in a file. The most direct approach is to scan from the beginning until the string is found. This is perfectly practical for a small file, but it becomes slow for a large one. The same holds when looking for the files that contain a string: scanning a hard disk that holds dozens of gigabytes this way is very inefficient.

The data in such files is unstructured, that is, it has no fixed structure. To solve the efficiency problem, we first extract part of the information from the unstructured data and reorganize it so that it has a certain structure, and then search this structured data, which is relatively fast. This is called full-text search: the process of first building an index and then searching the index. So how does Lucene build its index?
Suppose you have two documents with the following content: <blockquote> <blockquote> Article 1: Tom lives in Guangzhou, I live in Guangzhou too.
Article 2: He once lived in Shanghai. </blockquote> </blockquote>

The first step is to pass the documents to the tokenizer, which splits each document into words and removes punctuation and stop words. Stop words are words that carry no special meaning, such as "a", "the", and "too" in English (in this example, "in" and "once" are dropped as well). The result of tokenization is a sequence of tokens: <blockquote> <blockquote> Article 1 after tokenization: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]
Article 2 after tokenization: [He] [lived] [Shanghai] </blockquote> </blockquote>

The tokens are then passed to the language processing component (linguistic processor). For English, this component generally converts letters to lowercase and reduces each word to its root form, for example "lives" to "live" and "drove" to "drive". The result of this step is called a term.
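The tokenizing and language-processing steps described above can be sketched in plain Java. This is only a toy model under stated assumptions, not Lucene's analyzer: the class name `ToyAnalyzer`, the stop-word list, and the hand-written root-form table are all hypothetical and were chosen so that the two sample articles produce the terms used in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy analysis pipeline (hypothetical; not part of Lucene).
class ToyAnalyzer {

    // Assumed stop-word list, picked to match the example articles.
    static final Set<String> STOP_WORDS = Set.of("a", "the", "too", "in", "once");

    // Hand-written table standing in for a real stemmer/lemmatizer.
    static final Map<String, String> ROOT_FORMS =
            Map.of("lives", "live", "lived", "live", "drove", "drive");

    // Step 1: split on anything that is not a letter or digit, then drop stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // Step 2: lowercase each token and map it to its root form if one is known.
    static List<String> toTerms(List<String> tokens) {
        List<String> terms = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            terms.add(ROOT_FORMS.getOrDefault(lower, lower));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(toTerms(tokenize("Tom lives in Guangzhou, I live in Guangzhou too.")));
        System.out.println(toTerms(tokenize("He once lived in Shanghai.")));
    }
}
```

A real analyzer uses stemming rules rather than a lookup table; the table here only keeps the sketch short.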
After language processing, the terms are: <blockquote> <blockquote> Article 1 after processing: [tom] [live] [guangzhou] [i] [live] [guangzhou]
Article 2 after processing: [he] [live] [shanghai] </blockquote> </blockquote>

Finally, the terms are passed to the index component (indexer), which produces the following index structure:
<table> <thead> <tr> <th align="center"> Keyword </th> <th align="center"> Article number [frequency] </th> <th align="center"> Positions </th> </tr> </thead> <tbody>
<tr> <td align="center">guangzhou </td> <td align="center">1[2] </td> <td align="center">3,6 </td> </tr>
<tr> <td align="center">he </td> <td align="center">2[1] </td> <td align="center">1 </td> </tr>
<tr> <td align="center">i </td> <td align="center">1[1] </td> <td align="center">4 </td> </tr>
<tr> <td align="center">live </td> <td align="center">1[2],2[1] </td> <td align="center">2,5; 2 </td> </tr>
<tr> <td align="center">shanghai </td> <td align="center">2[1] </td> <td align="center">3 </td> </tr>
<tr> <td align="center">tom </td> <td align="center">1[1] </td> <td align="center">1 </td> </tr>
</tbody> </table>

These three columns are the core of the Lucene index structure. The keywords are stored in alphabetical order, so Lucene can locate a keyword quickly with a binary search. In the implementation, Lucene stores the three columns as a dictionary file (term dictionary), a frequency file (frequencies), and a position file (positions). The dictionary file holds each keyword together with pointers into the frequency file and the position file, and through these pointers the frequency and position information of the keyword can be found. A search therefore runs a binary search over the dictionary to find the term, follows the pointer into the frequency file to read all the article numbers, and returns them as results. The position information then tells where the term occurs within a specific article.
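The index structure and lookup described above can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not Lucene's on-disk format: the class `ToyInvertedIndex` and its `Posting` type are hypothetical, the sorted array plays the role of the term dictionary, and each posting keeps the frequency and positions that Lucene stores in separate files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory inverted index (hypothetical; not Lucene's file format).
class ToyInvertedIndex {

    // One posting: the article a term occurs in and its 1-based positions there.
    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();
        Posting(int docId) { this.docId = docId; }
        int frequency() { return positions.size(); }
    }

    final String[] dictionary;          // terms in sorted order (the "term dictionary")
    final List<List<Posting>> postings; // postings.get(i) belongs to dictionary[i]

    ToyInvertedIndex(Map<Integer, List<String>> termsByDoc) {
        TreeMap<String, List<Posting>> sorted = new TreeMap<>();
        // Visit articles in ascending id order so postings stay ordered by article.
        for (Map.Entry<Integer, List<String>> e : new TreeMap<>(termsByDoc).entrySet()) {
            int docId = e.getKey();
            List<String> terms = e.getValue();
            for (int pos = 0; pos < terms.size(); pos++) {
                List<Posting> list = sorted.computeIfAbsent(terms.get(pos), t -> new ArrayList<>());
                if (list.isEmpty() || list.get(list.size() - 1).docId != docId) {
                    list.add(new Posting(docId));
                }
                list.get(list.size() - 1).positions.add(pos + 1);
            }
        }
        dictionary = sorted.keySet().toArray(new String[0]);
        postings = new ArrayList<>(sorted.values());
    }

    // Binary search over the sorted dictionary, then follow the "pointer" to the postings.
    List<Posting> lookup(String term) {
        int i = Arrays.binarySearch(dictionary, term);
        return i >= 0 ? postings.get(i) : Collections.<Posting>emptyList();
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex(Map.of(
                1, List.of("tom", "live", "guangzhou", "i", "live", "guangzhou"),
                2, List.of("he", "live", "shanghai")));
        for (String term : index.dictionary) {
            StringBuilder row = new StringBuilder(term);
            for (Posting p : index.lookup(term)) {
                row.append(' ').append(p.docId).append('[').append(p.frequency()).append("] ")
                   .append(p.positions);
            }
            System.out.println(row);
        }
    }
}
```

With the two sample articles, `lookup("live")` returns two postings: article 1 with positions 2 and 5, and article 2 with position 2, which corresponds to the "live" row of the table. A term that is not in the dictionary returns an empty list.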
Building the index for the first time may be slow, but the index does not have to be rebuilt for every search, so searching is fast. Of course, this describes search over English text. The rules for Chinese are different, and I will look into them later.

2. Sample Code

According to the analysis above, full-text search has two steps: first index, then search. To test this process I wrote two Java classes, one for indexing and one for searching. First set up a Maven project with the following pom.xml:<pre class="prettyprint"><code class="language-xml hljs "><project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>demo.lucene</groupId>
    <artifactId>Lucene01</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <build/>
    <dependencies>
        <!-- Lucene core package -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>6.1.0</version>
        </dependency>
        <!-- Lucene query parser package -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>6.1.0</version>
        </dependency>
        <!-- Lucene analyzer package -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>6.1.0</version>
        </dependency>
    </dependencies>
</project></code></pre>

Before writing the program we need some files to index. I picked some English documents at random (Chinese will be studied later) and put them in the D:\lucene\data\ directory. The documents are entirely in English.
Next, write the Java program that builds the index:<pre class="prettyprint"><code class="language-java hljs ">import java.io.File;
import java.io.FileReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Index-building class
 * @author Ni Shengwu
 */
public class Indexer {

    private IndexWriter writer; // instance that writes the index

    // Constructor: instantiate the IndexWriter
    public Indexer(String indexDir) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        Analyzer analyzer = new StandardAnalyzer(); // standard analyzer; drops whitespace and stop words such as "is", "a", "the"
        IndexWriterConfig config = new IndexWriterConfig(analyzer); // put the analyzer into the writer configuration
        writer = new IndexWriter(dir, config); // instantiate the index writer
    }

    // Close the index writer
    public void close() throws Exception {
        writer.close();
    }

    // Index all files under the specified directory
    public int indexAll(String dataDir) throws Exception {
        File[] files = new File(dataDir).listFiles(); // all files under this path
        if (files == null) {
            throw new IllegalArgumentException("Not a readable directory: " + dataDir);
        }
        for (File file : files) {
            indexFile(file); // index each file with the indexFile method below
        }
        return writer.numDocs(); // number of files indexed
    }

    // Index the specified file
    private void indexFile(File file) throws Exception {
        System.out.println("Indexing file: " + file.getCanonicalPath());
        Document doc = getDocument(file); // build the Document for this file with getDocument below
        writer.addDocument(doc); // add the Document to the index
    }

    // Build a Document and set its fields; a Document is similar to a row in a database table
    private Document getDocument(File file) throws Exception {
        Document doc = new Document();
        doc.add(new TextField("contents", new FileReader(file))); // file contents
        doc.add(new TextField("fileName", file.getName(), Field.Store.YES)); // file name, also stored in the index
        doc.add(new TextField("fullPath", file.getCanonicalPath(), Field.Store.YES)); // file path, also stored in the index
        return doc;
    }

    public static void main(String[] args) {
        String indexDir = "d:\\lucene"; // directory where the index is saved
        String dataDir = "d:\\lucene\\data"; // directory holding the files to index
        Indexer indexer = null;
        int indexedNum = 0;
        long startTime = System.currentTimeMillis(); // indexing start time
        try {
            indexer = new Indexer(indexDir);
            indexedNum = indexer.indexAll(dataDir);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (indexer != null) { // the constructor may have failed
                    indexer.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        long endTime = System.currentTimeMillis(); // indexing end time
        System.out.println("Indexing took " + (endTime - startTime) + " milliseconds");
        System.out.println("Indexed " + indexedNum + " files in total");
    }
}</code></pre>

I wrote the program following the indexing process described earlier. The comments explain each step, so I will not repeat them here. Running the main method indexed 7 files in 649 milliseconds, which is quite fast, and the printed paths of the indexed files were correct. D:\lucene\ now contains some generated files, which make up the index.

With the index in place, we can search for a string. I opened one of the files and picked an ugly string in it, "generate-maven-artifacts", as the search target.
Before searching, here is the Java code for the search:<pre class="prettyprint"><code class="language-java hljs ">import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

    public static void search(String indexDir, String q) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexDir)); // directory to query, which is where the index is located
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer(); // standard analyzer; drops whitespace and stop words such as "is", "a", "the"
        QueryParser parser = new QueryParser("contents", analyzer); // query parser for the "contents" field
        Query query = parser.parse(q); // parse the query string into a Query object

        long startTime = System.currentTimeMillis(); // search start time
        TopDocs docs = searcher.search(query, 10); // run the query and keep the top 10 results in docs
        long endTime = System.currentTimeMillis(); // search end time

        System.out.println("Matching " + q + " took " + (endTime - startTime) + " milliseconds");
        System.out.println("Found " + docs.totalHits + " records");
        for (ScoreDoc scoreDoc : docs.scoreDocs) { // go through each result
            Document doc = searcher.doc(scoreDoc.doc); // scoreDoc.doc is the document id; use it to fetch the document
            System.out.println(doc.get("fullPath")); // fullPath is a field defined when the index was built
        }
        reader.close();
    }

    public static void main(String[] args) {
        String indexDir = "d:\\lucene";
        String q = "generate-maven-artifacts"; // string to search for
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}</code></pre>

Running the main method shows that Lucene finds the string correctly. When I removed the "-" characters in the middle, it still found the file. But when I removed the leading characters and left only "rtifacts", nothing was found, which shows that the Lucene index is organized by words. This limitation can be worked around, and I will cover that in a follow-up article.

Parts of this article draw on:
http://blog.csdn.net/forfuture1978/article/details/4711308
http://www.cnblogs.com/dewin/archive/2009/11/24/1609905.html

--Willing to share and progress together!
--My blog home: http://blog.csdn.net/eson_15

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
