Lucene Full-Text Search Framework


1 Lucene Introduction

1.1 What is Lucene
Lucene is a full-text search framework, not a finished application. It does not work out of the box the way www.baidu.com or Google Desktop does; it only provides the tools you need to build such products yourself.

1.2 What Lucene can do
To answer this question, first understand what Lucene essentially does, because its function is actually very simple: you hand it a number of strings, and it provides a full-text search service over them, telling you where the keywords you search for appear. Knowing this, you can imagine applying it to anything that fits the pattern. You can index all the news on your site and build a searchable archive; you can index a few fields of a database table and no longer worry about locking the table with "%like%" queries; you can even write your own search engine with it...

1.3 Should you choose Lucene?
Some test figures are given below; if you find them acceptable, Lucene is a reasonable choice.
Test one: 2.5 million records, about 300 MB of text, generated an index of roughly 380 MB; under 800 threads the average processing time was 300 ms.
Test two: 37,000 records, indexing two varchar fields of a database table, produced a 2.6 MB index file; under 800 threads the average processing time was 1.5 ms.


2 How Lucene works
The service Lucene provides consists of two parts: one in, one out. The "in" part is writing: the source you provide (essentially a string) is written to the index or deleted from it. The "out" part is reading: a full-text search service that lets users locate the source by keyword.

2.1 Write process
The source string is first processed by the analyzer, which splits it into words and (optionally) removes stop words.
The information in the source is then added to the individual fields of a document; the fields that need to be indexed are indexed, and the fields that need to be stored are stored.
Finally the index is written to storage, which can be either memory or disk.

2.2 Read process
The user supplies the search keywords, which are processed by the analyzer.
The index is searched for the processed keywords, and the matching documents are found.
The user then extracts the needed fields from the documents that were found.


3 Some concepts you need to know
Lucene uses a number of concepts; understanding what they mean will make the explanations that follow easier.

3.1 Analyzer
An Analyzer is a tokenizer. Its job is to split a string into words according to certain rules and to remove invalid words. "Invalid words" here means words such as "of" and "the" in English, or "的" and "地" in Chinese, which appear frequently in text but carry no key information. Removing them shrinks the index file, improves efficiency, and raises the hit ratio.
The tokenization rules vary, but the goal is always the same: to divide the text into words. This is easy in English, where words are already separated by spaces, whereas a Chinese sentence has to be split into words by some other means. The specific splitting methods are described in detail below; for now you only need the concept of the analyzer.
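
As a concrete illustration, here is a minimal sketch written against the same early Lucene API that the rest of this article uses (TokenStream.next() and Token.termText() were replaced in later versions); it prints the tokens StandardAnalyzer produces for a sentence:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // feed a sentence through the analyzer and walk the resulting token stream
        TokenStream stream = analyzer.tokenStream("content",
                new StringReader("The Lucene framework works well"));
        Token token;
        while ((token = stream.next()) != null) {
            // "The" is dropped as a stop word; the remaining words are lowercased
            System.out.println(token.termText());
        }
    }
}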

3.2 Document
A document represents one record provided by the user, which may be a text file, a string, a row of a database table, and so on. Once a record has been indexed, it is stored in the index file in the form of a document. Search results are likewise returned to the user as a list of documents.

3.3 Field
A document can contain several fields of information. For example, an article can carry fields such as the title, the body, and the last-modified time, each held in the document as a separate field.
A field has two properties you can choose: stored and indexed. The stored property controls whether the field's value is kept in the index file, and the indexed property controls whether the field can be searched. This may sound like a trivial distinction, but choosing the right combination of the two properties is important, as the following example shows.
Continuing with the article example: we need to search the title and the body, so both fields get the indexed property set to true. We also want to show the title directly in search results, so the title field's stored property is set to true. The body field, however, is too large; to keep the index file small we set its stored property to false and read the original file directly when the content is needed. Finally, we want to show the last-modified time in search results but never search on it, so its stored property is set to true and its indexed property to false. These three fields cover three of the four combinations of the two properties; the remaining one is never used, and in fact Lucene does not let you create it, because a field that is neither stored nor indexed is meaningless.
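
As a sketch of these three combinations (the field names are just illustrative), using the same Field constructor as the indexing examples later in this article:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldCombinations {
    public static Document makeDoc(String title, String body, String lastModified) {
        Document doc = new Document();
        // title: stored and indexed -- searchable, and shown directly in results
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        // body: indexed but not stored -- searchable; the original file is read when needed
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
        // lastModified: stored but not indexed -- shown in results, not searchable
        doc.add(new Field("lastModified", lastModified, Field.Store.YES, Field.Index.NO));
        return doc;
    }
}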

3.4 Term
A term is the smallest unit of search. It represents a word in a document and consists of two parts: the text of the word and the field in which the word appears.

3.5 Token
A token is a single occurrence of a term. It contains the term's text, the start and end offsets of the occurrence, and a type string. The same word can appear several times in a text; all the occurrences share the same term but have different tokens, each marking one place where the word appears.

3.6 Segment
When documents are added to the index, they are not all appended to the same index file immediately; they are first written to several small files, which are later merged into one large index file. Each of these small files is a segment.


4 Lucene's structure
Lucene consists of core and sandbox. Core is the stable central part of Lucene, while the sandbox contains additional features such as the highlighter and various extra analyzers.

Lucene core has seven packages: analysis, document, index, queryParser, search, store, util.



(Figure: Lucene system organization)

4.1 analysis
The analysis package contains the built-in analyzers, for example WhitespaceAnalyzer, which splits text on whitespace characters; StopAnalyzer, which adds a stop-word filter; and the most commonly used one, StandardAnalyzer.
4.2 document
The document package contains the data structures for documents: the Document class defines the structure in which a document is stored, and the Field class defines one field of a document.
4.3 index
The index package contains the classes that read and write the index, such as IndexWriter, which writes, merges, and optimizes the index segments, and IndexReader, which reads and deletes from the index. Do not be misled by the name IndexReader into thinking it only reads the index file: deleting from the index is also done by it. IndexWriter only cares about how to write segments into the index and how to merge and optimize them; IndexReader is concerned with the organization of the individual documents in the index file.
4.4 queryParser
The queryParser package contains the classes that parse query statements. Lucene's query statements are somewhat similar to SQL: they have various reserved words and can express all kinds of queries according to a certain syntax. Lucene has many query classes, all of which inherit from Query, and each performs one special kind of query. The job of QueryParser is to parse a query statement and call the various query classes to produce the result.
4.5 search
The search package contains the classes that search the index for results, including the query classes just mentioned, such as TermQuery and BooleanQuery.
4.6 store
The store package contains the index storage classes: Directory defines the storage structure of an index, FSDirectory is an index stored in files, RAMDirectory is an index held in memory, and MMapDirectory is an index accessed through memory mapping.
4.7 util
The util package contains common utility classes, such as conversion tools between time values and strings.
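
As an example of the time/string conversion, here is a small sketch using the DateTools helper (present in the Lucene versions whose API this article uses); with DAY resolution it yields strings such as "20060101", the format the time field uses in the query examples later in this article:

import java.util.Date;
import org.apache.lucene.document.DateTools;

public class DateToolsDemo {
    public static void main(String[] args) throws Exception {
        // convert a Date into an indexable, sortable string such as "20060101"
        String s = DateTools.dateToString(new Date(), DateTools.Resolution.DAY);
        System.out.println(s);
        // and convert it back into a Date
        Date d = DateTools.stringToDate(s);
        System.out.println(d);
    }
}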


5 How to build an index

5.1 The simplest piece of code to build an index

IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "Lucene Introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "Lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();

Let's examine this code.
First we create a writer, specifying "/data/index" as the directory that holds the index and StandardAnalyzer as the analyzer. The third parameter says that if index files already exist in the index directory, we overwrite them.
Then we create a new document.
We add a field named "title" to the document with the content "Lucene Introduction"; it is stored and indexed.
We add another field named "content" with the content "Lucene works well", which is also stored and indexed.
Then we add the document to the index; if there are several documents, repeat the steps above to create and add each one.
After all documents have been added, we optimize the index. Optimization mainly merges multiple segments into one, which helps to improve search speed.
Finally, do not forget to close the writer.

Yes, creating an index is as simple as that.
Of course, you can modify the code above to get more customized behavior.

5.2 Writing the index directly in memory
You need to create a RAMDirectory first and pass it to the writer, as in the following code:

Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "Lucene Introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "Lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();

5.3 Indexing text files
If you want to index plain text files without first reading them into a string, you can create the field with the following code:

Field field = new Field("content", new FileReader(file));

Here file is the text file. This constructor reads the file's contents and indexes them, but does not store them.
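
Putting this together, here is a minimal sketch (the paths are illustrative) that indexes one text file, storing its path so the original file can be located again from a search result:

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexTextFile {
    public static void main(String[] args) throws Exception {
        File file = new File("/data/docs/article.txt");
        IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true);
        Document doc = new Document();
        // store the path, untokenized, so the original file can be opened later
        doc.add(new Field("path", file.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        // the content is read from the file and indexed, but not stored
        doc.add(new Field("content", new FileReader(file)));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}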


6 How to maintain an index
Maintenance operations on the index are provided by the IndexReader class.

6.1 How to delete from the index
Lucene provides two ways to remove documents from the index. One is void deleteDocument(int docNum), which deletes by the document's number within the index. Every document added to the index gets a unique number, so deleting by number is an exact deletion; however, the number is part of the index's internal structure, and we generally do not know what number a given file ends up with, so this method is not very useful. The other is void deleteDocuments(Term term). This method in effect first performs a search with the given term and then deletes all the results in one batch. By supplying a strict query condition, we can use it to delete exactly the documents we intend to.

An example is given below:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexReader reader = IndexReader.open(dir);
Term term = new Term(field, key);
reader.deleteDocuments(term);
reader.close();

6.2 How to update the index
Lucene does not provide a dedicated index-update method; we have to delete the corresponding document first and then add the new document to the index. For example:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexReader reader = IndexReader.open(dir);
Term term = new Term("title", "lucene introduction");
reader.deleteDocuments(term);
reader.close();

IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false); // false: append to the existing index instead of recreating it
Document doc = new Document();
doc.add(new Field("title", "Lucene Introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "Lucene is funny", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();


7 How to search
Lucene's search is quite powerful: it provides many auxiliary query classes, each of which inherits from Query and performs one special kind of query, and you can combine them like building blocks to express complex queries. Lucene also provides a Sort class for sorting results and a Filter class for restricting the query conditions. You might unconsciously compare it to SQL: "Can Lucene do and, or, order by, where, like '%xx%' and so on?" The answer is: "Of course, no problem!"

7.1 Various queries
Let's look at what queries Lucene allows us to perform:

7.1.1 TermQuery
First, the most basic query. If you want to execute a query such as "documents that contain 'lucene' in the content field", you can use a TermQuery:

Term t = new Term("content", "lucene");
Query query = new TermQuery(t);

7.1.2 BooleanQuery
If you want to query "documents that contain java or perl in the content field", you can create two TermQuerys and combine them with a BooleanQuery:

TermQuery termQuery1 = new TermQuery(new Term("content", "java"));
TermQuery termQuery2 = new TermQuery(new Term("content", "perl"));
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(termQuery1, BooleanClause.Occur.SHOULD);
booleanQuery.add(termQuery2, BooleanClause.Occur.SHOULD);

7.1.3 WildcardQuery
If you want to run a wildcard query on a word, you can use a WildcardQuery. The wildcards include '?', which matches one arbitrary character, and '*', which matches zero or more arbitrary characters. For example, searching for 'use*' may find 'useful' or 'useless':

Query query = new WildcardQuery(new Term("content", "use*"));

7.1.4 PhraseQuery
You may be interested in Sino-Japanese relations and want to find articles in which '中' (China) and '日' (Japan) appear close together (within 5 words of each other), ignoring matches beyond that distance. You can do this:

PhraseQuery query = new PhraseQuery();
query.setSlop(5);
query.add(new Term("content", "中"));
query.add(new Term("content", "日"));

It may then match text such as "中日合作……" ("Sino-Japanese cooperation...") and "中方和日方……" ("the Chinese side and the Japanese side..."), but not a sentence in which the two characters are more than five words apart.

7.1.5 PrefixQuery
If you want to search for words that start with '中', you can use a PrefixQuery:

PrefixQuery query = new PrefixQuery(new Term("content", "中"));

7.1.6 FuzzyQuery
FuzzyQuery is used to search for similar terms, using the Levenshtein distance algorithm. Suppose you want to search for words similar to 'wuzza'; you can write:

Query query = new FuzzyQuery(new Term("content", "wuzza"));

You may get 'fuzzy' and 'wuzzy'.

7.1.7 RangeQuery
Another commonly used query is RangeQuery. You may want to search for documents whose time field lies between 20060101 and 20060130; you can use a RangeQuery:

RangeQuery query = new RangeQuery(new Term("time", "20060101"), new Term("time", "20060130"), true);

The final true means that a closed interval is used, i.e. both endpoints are included.

7.2 QueryParser
Having read about so many queries, you may ask: "You're not going to make me combine all these query objects by hand, are you? That's too much trouble." Of course not. Lucene provides a query language that resembles an SQL statement; let's call it the Lucene statement. Through it you can express all kinds of queries in a single string, and Lucene breaks the string into pieces and hands them to the corresponding query classes for execution. Each query's syntax is demonstrated below:
TermQuery uses the form "field:key", for example "content:lucene".
In BooleanQuery, 'and' is written '+' and 'or' is written ' ' (a space), for example "content:java content:perl".
WildcardQuery still uses '?' and '*', for example "content:use*".
PhraseQuery uses '~', for example "content:"中日"~5".
PrefixQuery uses '*', for example "中*".
FuzzyQuery uses '~', for example "content:wuzza~".
RangeQuery uses '[]' or '{}': '[]' denotes a closed interval and '{}' an open one, for example "time:[20060101 TO 20060130]"; note that TO is case sensitive and must be capitalized.
You can combine these query strings arbitrarily to express complex conditions. For example, "articles whose title or body contains lucene, with a time between 20060101 and 20060130" can be expressed as "+(title:lucene content:lucene) +time:[20060101 TO 20060130]".

The code is as follows:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("+(title:lucene content:lucene) +time:[20060101 TO 20060130]");
Hits hits = is.search(query);
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}
is.close();

First we create an IndexSearcher on the specified index directory.
Then we create a QueryParser that uses StandardAnalyzer as its analyzer and searches the content field by default.
Next we use the QueryParser to parse the query string and produce a Query.
Then the query is executed, and the results are returned in the form of Hits.
The Hits object contains a list of the matching documents, which we print one by one.

7.3 Filter
The role of a filter is to restrict a query to a subset of the index. It is a bit like where in an SQL statement, with one difference: it is not part of the formal query; instead it preprocesses the data source and then hands the result to the query statement. Note that it preprocesses the source rather than filtering the query results, so the cost of using a filter is significant: it can make a query take up to a hundred times longer.
The most commonly used filters are RangeFilter and QueryFilter. RangeFilter restricts the search to the part of the index within a given range; QueryFilter restricts the search to the results of a previous query.
Using a filter is very simple: you create a filter instance and pass it to the searcher. To continue the example above, the query "articles between 20060101 and 20060130" can, instead of being written as a restriction in the query string, be expressed with a RangeFilter:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("title:lucene content:lucene");
RangeFilter filter = new RangeFilter("time", "20060101", "20060130", true, true);
Hits hits = is.search(query, filter);
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}
is.close();
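
The Sort class mentioned at the start of this section is passed to the searcher in the same way. Here is a minimal sketch that sorts the same query by the time field instead of by relevance (a hedged example: it assumes the time field was indexed untokenized, so that its "yyyymmdd" values sort correctly):

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("title:lucene content:lucene");
RangeFilter filter = new RangeFilter("time", "20060101", "20060130", true, true);
// sort the results by the time field rather than by relevance score
Sort sort = new Sort("time");
Hits hits = is.search(query, filter, sort);
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}
is.close();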
