1 Lucene Introduction
1.1 What is Lucene
Lucene is a full-text search framework, not a finished application product. It is therefore not directly comparable to products such as www.baidu.com or Google Desktop; rather, it provides the tools with which you can implement such products.
1.2 What Can Lucene Do
To answer this question, we must first understand the nature of Lucene. At heart, Lucene's function is very simple: you hand it a number of strings, and it then provides a full-text search service over them, telling you where the keywords you search for appear. Once you grasp this essence, you can use your imagination on anything that fits the pattern. You can index all the news on a site and build your own archive. You can index a few fields of a database table and never again worry about locking the table with "like '%...%'" queries. You can even write your own search engine...
1.3 Why you should choose Lucene
Some test data is provided below; if you find the numbers acceptable, Lucene may be the right choice.
Test 1: 2.5 million records, about 800 MB of text, generating an index of about 300 MB, with millisecond-level average processing time under concurrent threads.
Test 2: 37,000 records, indexing two varchar fields of a database table, producing an index file of 2.6 MB, with an average processing time of 1.5 ms under 800 threads.
2. How Lucene works
The service Lucene provides consists of two parts: one "in" and one "out". "In" refers to writing the source you supply (essentially a string) into the index, or removing it from the index. "Out" refers to providing a full-text search service to users, letting them locate sources by keyword.
2.1 The write process
The source string is first processed by the analyzer, which segments it into words and (optionally) removes stop words.
The needed information from the source is added to the Fields of a Document; Fields that should be searchable are indexed, and Fields that should be retrievable are stored.
The index is written to storage, which can be memory or disk.
2.2 The read process
The user supplies search keywords, which are processed by the analyzer.
The index is searched with the processed keywords to find the matching Documents.
The user extracts the needed Fields from the Documents that were found.
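The in/out flow above can be sketched as a toy inverted index. This is an illustration of the idea only; the class and method names are invented for this example and are not Lucene's internals:

```java
import java.util.*;

// Toy inverted index illustrating the write ("in") and read ("out") flow.
// Hypothetical sketch, not Lucene's actual data structures.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Write: analyze the source string into words, record word -> doc id.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty())
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Read: run the keyword through the same analysis, then look it up.
    Set<Integer> search(String keyword) {
        return postings.getOrDefault(keyword.toLowerCase(), Collections.emptySet());
    }
}
```

Searching returns the ids of the documents a keyword appears in, which is exactly the "telling you where the keywords appear" service described in section 1.2.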
3. Concepts to be aware of
Lucene uses a handful of concepts; understanding what they mean is helpful for the explanations that follow.
3.1 Analyzer
An analyzer divides a string into words according to certain rules and removes invalid words. "Invalid words" here means words such as "of" and "the" in English, which appear throughout articles but carry no key information. Removing them shrinks the index file, improves efficiency, and increases the hit rate.
Word segmentation rules are endlessly varied, but the purpose is always the same: to divide text into words by semantics. This is easy in English, since the language is itself word-based and words are already separated by spaces, whereas Chinese sentences must be split into words by some method. The specific partitioning approaches are described in detail later; here you only need to understand the concept of an analyzer.
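As a rough sketch of what an analyzer does for English text (segmentation on whitespace and punctuation, lowercasing, stop-word removal) — the class name and the stop list below are invented for illustration, not Lucene's StandardAnalyzer:

```java
import java.util.*;
import java.util.stream.*;

// Minimal analyzer sketch: split on non-word characters, lowercase, drop stop words.
// Illustrative only; Lucene's analyzers are streaming TokenStream pipelines.
class SimpleAnalyzer {
    private static final Set<String> STOP_WORDS = Set.of("the", "of", "a", "an", "and");

    static List<String> analyze(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .collect(Collectors.toList());
    }
}
```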
3.2 Document
The source a user provides is a record; it can be a text file, a string, or a row of a database table. Once indexed, a record is stored in the index file in the form of a Document, and search results are returned in the form of Documents as well.
3.3 Field
A Document can contain multiple fields of information. An article, for example, might have fields such as "title", "body", and "last modified time", each stored in the Document as a Field.
A Field has two attributes: stored and indexed. The stored attribute controls whether the Field's value is stored; the indexed attribute controls whether it is indexed. This may sound like stating the obvious, but in fact the right combination of these two attributes matters a great deal. An example:
Take the article above. We want full-text search over the title and body, so both have the indexed attribute set to true. We also want to show the article's title directly in search results, so the title Field's stored attribute is set to true. The body Field is large, so to keep the index file small its stored attribute is set to false, and the file is read directly when the text is needed. We want the last modified time in the search results but do not need to search on it, so its stored attribute is true and its indexed attribute false. These three Fields cover three of the possible combinations of the two attributes; the fourth, both false, is unused — in fact Lucene does not allow it, since a Field that is neither stored nor indexed is meaningless.
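The combinations can be sketched with a toy field class (hypothetical, not Lucene's Field API) that, like Lucene, rejects the meaningless stored=false/indexed=false case:

```java
// Toy sketch of the stored/indexed combinations described above.
// Hypothetical class for illustration; not Lucene's API.
class ToyField {
    final String name;
    final String value;
    final boolean stored;
    final boolean indexed;

    ToyField(String name, String value, boolean stored, boolean indexed) {
        // A field that is neither stored nor indexed would be meaningless.
        if (!stored && !indexed)
            throw new IllegalArgumentException("a field must be stored, indexed, or both");
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
    }
}
```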
3.4 Term
A Term is the smallest unit of search. It represents a word from a Document and consists of two parts: the word's text and the field it appears in.
3.5 Token
A Token is one occurrence of a term. It contains the term's text, the start and end offsets of the occurrence, and a type string. The same word can appear multiple times in a sentence; all occurrences are represented by the same Term, but by different Tokens, each marking the place where the word appears.
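A sketch of the distinction, with made-up classes: the tokenizer below emits one Token per occurrence, each with its own start/end offsets, while the term text itself is shared:

```java
import java.util.*;

// One occurrence of a word, with its character offsets.
// Hypothetical classes for illustration, not Lucene's Token/Term.
class Token {
    final String text;
    final int start;
    final int end;
    Token(String text, int start, int end) { this.text = text; this.start = start; this.end = end; }
}

class Tokenizer {
    // Emit a Token for every maximal run of letters.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && !Character.isLetter(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && Character.isLetter(text.charAt(i))) i++;
            if (i > start) tokens.add(new Token(text.substring(start, i).toLowerCase(), start, i));
        }
        return tokens;
    }
}
```

Both occurrences of "to" in "To be or not to be" carry the same term text but have distinct offsets.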
3.6 Segment
When documents are added to an index, they are not all appended to the same index file immediately. They are first written to separate small files, which are later merged into one large index file. Each of these small files is a segment.
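The merge step can be sketched as an ordinary merge of sorted posting lists — a toy version, under the simplifying assumption that each segment is just a sorted list of document ids for one term:

```java
import java.util.*;

// Toy segment merge: combine two sorted posting lists into one sorted list,
// a simplified picture of merging small segments into a larger index file.
class Merge {
    static List<Integer> mergePostings(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i) <= b.get(j)) out.add(a.get(i++));
            else out.add(b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}
```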
4. Lucene's structure
Lucene consists of two parts: core and sandbox. Core is the stable heart of Lucene; sandbox contains additional features, such as highlighters and various analyzers.
The Lucene core has seven packages: analysis, document, index, queryParser, search, store, and util.
4.1 Analysis
Analysis contains Lucene's built-in analyzers, such as WhitespaceAnalyzer, which segments on whitespace; StopAnalyzer, which adds stop-word filtering; and the most commonly used one, StandardAnalyzer.
4.2 document
The document package contains the data structures of documents. For example, the Document class defines the structure of a stored document, and the Field class defines one field of a Document.
4.3 Index
Index contains the index read/write classes, such as IndexWriter, which writes, merges, and optimizes the index file's segments, and IndexReader, which reads and deletes from the index. Note that IndexReader's name should not mislead you into thinking it only reads index files; in fact, deleting from an index is also its job. IndexWriter only cares about how to write indexes into segments and how to merge and optimize them; IndexReader is concerned with the organization of the individual documents in the index file.
4.4 queryparser
QueryParser contains the classes that parse query statements. Lucene's query statements are a bit like SQL statements: there are reserved words and a grammar for composing all kinds of queries. Lucene has many Query subclasses, each inheriting from Query and performing one special kind of query; QueryParser's job is to parse a query statement and call on the various Query classes to produce the result.
4.5 search
Search contains the classes that retrieve results from the index, such as the Query classes just mentioned, including TermQuery and BooleanQuery.
4.6 store
Store contains the index storage classes. Directory defines the storage structure of an index file, FSDirectory is an index stored in files, RAMDirectory is an index stored in memory, and MMapDirectory is an index accessed through memory mapping.
4.7 util
Util contains some common utilities, such as tools for converting between times and strings.
5. How to create an index
5.1 The simplest code fragment for creating an index
IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "Lucene Introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "Lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();
Next we will analyze this code.
First we create a writer, specifying "/data/index" as the directory where the index is stored and StandardAnalyzer as the analyzer; the third parameter means that if an index file already exists in the index directory, we overwrite it.
Then we create an empty Document.
We add a Field named "title" to the Document, with the content "Lucene Introduction"; it is both stored and indexed.
We add another Field named "content", with the content "Lucene works well", also stored and indexed.
Then we add this Document to the index; if there are more documents, we repeat the steps above: create a Document, add Fields, add the Document.
After all the documents have been added, we optimize the index. Optimization mainly merges multiple segments into one, which helps improve search speed.
Finally, it is important to close the writer.
Yes, creating an index is that simple!
Of course, you may modify the above code to get a more personalized service.
5.2 Writing the index directly to memory
Create a RAMDirectory and pass it to the writer. The code is as follows:
Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "Lucene Introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "Lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();
5.3 Indexing text files
If you want to index plain text files without first reading them into strings, you can create the Field like this:
Field field = new Field("content", new FileReader(file));
Here file is the text file. This constructor actually reads the file's content and indexes it, but does not store it.
6. How to maintain an index
Indexes are maintained through the IndexReader class.
6.1 How to delete from an index
Lucene provides two methods for deleting documents from an index. The first is
void deleteDocument(int docNum)
This method deletes by the document's number within the index. Every document receives a unique number once it enters the index, so deleting by number is precise; but the number is part of the index's internal structure, and we generally do not know what number a given document has, so this method is of little use. The other is
void deleteDocuments(Term term)
This method actually performs a search with the given term first and then deletes all the search results in one batch. By supplying a sufficiently strict term as the condition, we can use it to delete a specific document.
The following is an example:
Directory dir = FSDirectory.getDirectory(path, false);
IndexReader reader = IndexReader.open(dir);
Term term = new Term(field, key);
reader.deleteDocuments(term);
reader.close();
6.2 How to update an index
Lucene does not provide a dedicated index update method. We first delete the corresponding documents and then add the new documents to the index. For example:
Directory dir = FSDirectory.getDirectory(path, false);
IndexReader reader = IndexReader.open(dir);
Term term = new Term("title", "Lucene Introduction");
reader.deleteDocuments(term);
reader.close();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
Document doc = new Document();
doc.add(new Field("title", "Lucene Introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "Lucene is funny", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();
7. How to search
Lucene's search capability is quite powerful. It provides many helper query classes, each inheriting from Query and performing one special kind of query, and you can combine them like building blocks to carry out complex searches. Lucene also provides the Sort class for sorting results and the Filter class for restricting the query conditions. You may find yourself comparing it to SQL: "Can Lucene do and, or, order by, where, like '%xx%'?" The answer is: "Of course, no problem!"
7.1 The various Query classes
Let's look at what query operations Lucene allows us to perform:
7.1.1 TermQuery
First, the most basic query. Suppose you want to find every document whose content field contains the term "lucene"; then you use a TermQuery:
Term t = new Term("content", "lucene");
Query query = new TermQuery(t);
7.1.2 BooleanQuery
If you want to find documents whose content field contains "java" or "perl", you can build two TermQuerys and connect them with a BooleanQuery:
TermQuery termQuery1 = new TermQuery(new Term("content", "java"));
TermQuery termQuery2 = new TermQuery(new Term("content", "perl"));
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(termQuery1, BooleanClause.Occur.SHOULD);
booleanQuery.add(termQuery2, BooleanClause.Occur.SHOULD);
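Conceptually, a SHOULD clause unions the result sets of its sub-queries, while a MUST clause intersects them. A toy set-based sketch (this ignores Lucene's scoring and is not its implementation):

```java
import java.util.*;

// Toy boolean combination of result sets: SHOULD ~ union, MUST ~ intersection.
class Bool {
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }
}
```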
7.1.3 WildcardQuery
If you want to search with wildcards in a word, you can use WildcardQuery. The wildcards are '?', which matches one arbitrary character, and '*', which matches zero or more arbitrary characters. For example, searching for "use*" might find "useful" or "useless":
Query query = new WildcardQuery(new Term("content", "use*"));
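Wildcard matching can be sketched by translating the pattern into a regular expression. This is purely illustrative (only '.', '?', and '*' are handled); Lucene matches wildcards against the term dictionary directly rather than through regexes:

```java
// Toy wildcard matcher: '?' matches one character, '*' matches zero or more.
class Wildcard {
    static boolean matches(String pattern, String word) {
        String regex = pattern.replace(".", "\\.")  // escape literal dots
                              .replace("?", ".")    // '?' -> any one char
                              .replace("*", ".*");  // '*' -> any run of chars
        return word.matches(regex);                 // full-string match
    }
}
```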
7.1.4 PhraseQuery
Suppose you are interested in Sino-Japanese relations and want articles in which "中" (China) and "日" (Japan) appear close together — within five words of each other — and you do not care about articles where they are farther apart. You can write:
PhraseQuery query = new PhraseQuery();
query.setSlop(5);
query.add(new Term("content", "中"));
query.add(new Term("content", "日"));
It will then match passages like "中日合作……" ("Sino-Japanese cooperation...") and "中方和日方……" ("the Chinese side and the Japanese side..."), but not sentences in which the two terms are more than five words apart.
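The slop check can be sketched by comparing term positions — a toy version; PhraseQuery actually uses the term positions already stored in the index rather than rescanning the text:

```java
import java.util.*;

// Toy slop check: do terms a and b occur within `slop` positions of each other?
class Slop {
    static boolean within(String[] words, String a, String b, int slop) {
        List<Integer> posA = new ArrayList<>(), posB = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(a)) posA.add(i);
            if (words[i].equals(b)) posB.add(i);
        }
        for (int i : posA)
            for (int j : posB)
                if (Math.abs(i - j) <= slop) return true;
        return false;
    }
}
```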
7.1.5 PrefixQuery
If you want to search for terms starting with "中", you can use PrefixQuery:
PrefixQuery query = new PrefixQuery(new Term("content", "中"));
7.1.6 fuzzyquery
Fuzzyquery is used to search for similar terms, and levenshtein is used.Algorithm. Suppose you want to search for words similar to 'wuzza', you can:
Query query = new fuzzyquery (new term ("content", "wuzza ");
You may get 'fuzzy 'and 'wuyun '.
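The Levenshtein (edit) distance behind FuzzyQuery counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one word into another. The standard dynamic-programming version:

```java
// Classic dynamic-programming Levenshtein distance.
class Levenshtein {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;  // delete all of a's prefix
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;  // insert all of b's prefix
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        return d[a.length()][b.length()];
    }
}
```

"wuzza" and "fuzzy" differ by two substitutions (w→f and a→y), so their distance is 2.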
7.1.7 RangeQuery
Another commonly used query is RangeQuery. Suppose you want documents whose time field falls in the range 20060101 to 20060130; you can use a RangeQuery:
RangeQuery query = new RangeQuery(new Term("time", "20060101"), new Term("time", "20060130"), true);
The final true means this is a closed interval.
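A term range like this is a lexicographic comparison on the term text, which is why dates should be stored as fixed-width strings such as yyyymmdd. A sketch of the closed-interval test:

```java
// Toy closed-interval range check on term text (lexicographic comparison).
class Range {
    static boolean inClosedRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }
}
```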
7.2 QueryParser
After seeing so many Query classes, you may ask: "Surely I don't have to combine all these queries by hand myself — that would be too much trouble!" Of course not. Lucene provides a query language similar to SQL statements; let's call it the Lucene statement. Through it you can express all kinds of queries in one sentence, and Lucene splits it into pieces and hands each to the corresponding Query class for execution. Here is each query type in that syntax:
TermQuery uses the "field:key" form, for example "content:lucene".
In a BooleanQuery, 'and' is written '+' and 'or' is written with a space, for example "content:java content:perl".
WildcardQuery still uses '?' and '*', for example "content:use*".
PhraseQuery uses '~', for example "content:\"中日\"~5".
PrefixQuery uses '*', for example "中*".
FuzzyQuery uses '~', for example "content:wuzza~".
RangeQuery uses '[]' or '{}'; the former is a closed interval, the latter an open one, for example "time:[20060101 TO 20060130]". Note that TO is case sensitive.
You can combine these query strings to perform complex operations. For example, "the title or body contains lucene, and the time is between 20060101 and 20060130" can be expressed as "+(title:lucene content:lucene) +time:[20060101 TO 20060130]". The code is as follows:
Directory dir = FSDirectory.getDirectory(path, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("+(title:lucene content:lucene) +time:[20060101 TO 20060130]");
Hits hits = is.search(query);
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}
is.close();
First we create an IndexSearcher on the given file directory.
Then we create a QueryParser that uses StandardAnalyzer as its analyzer, with content as the default search field.
We then use the QueryParser to parse the query string and generate a Query.
Then we search with this Query, and the result comes back in the form of a Hits object.
The Hits object contains a list of the matching documents, which we print out one by one.
7.3 Filter
The purpose of a filter is to restrict a query to a subset of the index. It works a bit like where in an SQL statement, but with a difference: it is not part of the regular query. It preprocesses the data source and then hands the result to the query statement. Note that it preprocesses rather than filtering the query results afterwards, so the cost of using a filter is high: it can make a single query take up to a hundred times longer.
The most commonly used filters are RangeFilter and QueryFilter. RangeFilter restricts the search to a given range of the index; QueryFilter restricts it to the results of a previous query.
Using a filter is easy: create a filter instance and pass it to the searcher.