10 Summary of Lucene usage

Source: Internet
Author: User

1 About Lucene
1.1 What is Lucene
Lucene is a full-text search framework, not an application product. It does not work out of the box the way www.baidu.com or Google Desktop does; it only provides the tools you need to build such products.

1.2 What Lucene can do
To answer this question, first understand what Lucene essentially is. Its function is actually very simple: you give it a number of strings, and it provides a full-text search service over them, telling you where the keywords you search for appear. Once you know this, you can imagine using it for anything that fits that description. You can index the news on a site and build a searchable archive; you can index several fields of a database table, so that you no longer have to worry about "%like%" queries locking the table; you can also write your own search engine ...

1.3 Should you choose Lucene
Here is some test data; if you find these numbers acceptable, you can choose Lucene.
Test one: 2.5 million records, about 300 MB of text, generating an index of about 380 MB; average processing time of 300 ms with 800 threads.
Test two: 37,000 records, indexing two varchar fields of a database table, index file of 2.6 MB; average processing time of 1.5 ms with 800 threads.

2 How Lucene works
The services Lucene provides consist of two parts: one in, one out. "In" means writing: the source you provide (essentially a string) is written into the index or deleted from it. "Out" means reading: users are given a full-text search service so that they can locate the source by keyword.

2.1 Write Process
The source string is first processed by the analyzer, which splits it into words (tokenization) and, optionally, removes stop words.
The required information from the source is added to the fields of a Document; fields that need to be searchable are indexed, and fields that need to be retrievable are stored.
The index is written to storage, which can be either memory or disk.

2.2 Read process
The user provides search keywords, which are processed by the analyzer.
The index is searched with the processed keywords to find the matching Documents.
The user extracts the required fields from the Documents that were found.
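
The write and read processes above can be sketched with a toy in-memory index. This is only an illustration of the idea, using plain Java collections; the class name ToyIndex and its methods are invented for this example and are not part of Lucene, and the "analyzer" here is just lowercasing plus splitting on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy model of the write and read paths; not Lucene's real data structures.
class ToyIndex {
    // word -> ids of the documents that contain it (the indexed part)
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // document id -> original source string (the stored part)
    private final List<String> stored = new ArrayList<>();

    // Stand-in for the analyzer: lowercase, then split on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Write path: analyze the source, record each word, keep the source.
    int add(String source) {
        int docId = stored.size();
        stored.add(source);
        for (String word : analyze(source)) {
            postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Read path: analyze the keywords, look each one up, and return the
    // sources of all documents that contain at least one of them.
    List<String> search(String keywords) {
        TreeSet<Integer> matched = new TreeSet<>();
        for (String word : analyze(keywords)) {
            TreeSet<Integer> ids = postings.get(word);
            if (ids != null) {
                matched.addAll(ids);
            }
        }
        List<String> hits = new ArrayList<>();
        for (int docId : matched) {
            hits.add(stored.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("Lucene introduction");
        index.add("Lucene works well");
        System.out.println(index.search("works"));  // prints [Lucene works well]
        System.out.println(index.search("lucene")); // prints [Lucene introduction, Lucene works well]
    }
}
```

Note that the same analysis step runs on both paths: the keyword is processed the same way as the source, which is why a search for "lucene" finds text written as "Lucene".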

3 Some concepts to be aware of
Lucene uses a number of concepts; understanding what they mean helps with the explanations that follow.

3.1 Analyzer
An analyzer splits a string into words according to some rule and removes the invalid words. "Invalid words" here means words such as "of" and "the" in English, or "的" and "地" in Chinese: they appear in large numbers in text but carry no key information. Removing them helps shrink the index files, improve efficiency, and increase the hit rate.
The rules for word segmentation vary, but the goal is always the same: to split the text by meaning. This is easier in English, because English is already written in units of words separated by spaces, whereas a Chinese sentence must be split into words by some other method. The specific segmentation methods are described in detail later; for now you only need to understand the concept of an analyzer.
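
As a rough illustration of the two steps an analyzer performs, here is a minimal sketch in plain Java. ToyAnalyzer and its stop-word list are made up for this example and are not one of Lucene's analyzers; the splitting rule (any run of non-letters is a separator) only suits space-separated languages such as English.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: a sketch of the concept, not one of Lucene's analyzers.
class ToyAnalyzer {
    // Hypothetical stop-word list chosen for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the"));

    static List<String> analyze(String text) {
        List<String> words = new ArrayList<>();
        // Step 1: split into words; any run of non-letters is a separator.
        for (String raw : text.split("[^\\p{L}]+")) {
            String word = raw.toLowerCase(Locale.ROOT);
            // Step 2: drop empty pieces and stop words.
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The structure of the index in Lucene"));
        // prints [structure, index, lucene]
    }
}
```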

3.2 Document
The source a user supplies is a collection of records, each of which can be a text file, a string, a row of a database table, and so on. Once a record has been indexed, it is stored in the index file as a Document. Search results are likewise returned as a list of Documents.

3.3 Field
A Document can contain multiple fields of information. An article, for example, can contain fields such as "title", "body", and "last modified time", and each of these is stored in the Document as a Field.
A Field has two properties to choose from: stored and indexed. The stored property controls whether the field's value is stored, and the indexed property controls whether the field is indexed. This may sound obvious, but choosing the right combination of the two is important, as the following example shows.
Take the article above again. We need full-text search over the title and the body, so we set the indexed property of both to true. We also want to extract the article title directly from the search results, so we set the stored property of the title field to true. The body field is too large, however, so to keep the index file small we set its stored property to false and read the file directly when the body is needed. Finally, we only want to extract the last modified time from the search results and never search on it, so we set its stored property to true and its indexed property to false. These three fields cover three combinations of the two properties. The remaining combination, false for both, is not used; in fact Field does not allow it, because a field that is neither stored nor indexed is meaningless.
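
The three combinations from the article example can be modelled with a small plain-Java class. ToyField is a hypothetical stand-in written for this illustration, not Lucene's Field class, and the field values are placeholders.

```java
// Toy model of a field's two switches; the real class is Lucene's Field.
class ToyField {
    final String name;
    final String value;
    final boolean stored;   // can the value be read back from a search hit?
    final boolean indexed;  // can the value be searched?

    ToyField(String name, String value, boolean stored, boolean indexed) {
        if (!stored && !indexed) {
            // The fourth combination is meaningless, so it is rejected.
            throw new IllegalArgumentException(
                    "field '" + name + "' is neither stored nor indexed");
        }
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
    }

    String describe() {
        return name + ": stored=" + stored + ", indexed=" + indexed;
    }

    public static void main(String[] args) {
        // Placeholder values; only the flag combinations matter here.
        ToyField title = new ToyField("title", "some title", true, true);
        ToyField body = new ToyField("body", "some long body text", false, true);
        ToyField modified = new ToyField("lastModified", "some timestamp", true, false);
        System.out.println(title.describe());    // prints title: stored=true, indexed=true
        System.out.println(body.describe());     // prints body: stored=false, indexed=true
        System.out.println(modified.describe()); // prints lastModified: stored=true, indexed=false
    }
}
```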

3.4 Term
A term is the smallest unit of search. It represents a word in a document and consists of two parts: the text of the word and the field in which the word appears.

3.5 Token
A token is one occurrence of a term. It contains the term text, the corresponding start and end offsets, and a type string. The same word can appear several times in one sentence; all of those occurrences share the same term but are different tokens, each marking a place where the word appears.
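
To make the difference between a term and a token concrete, the following sketch records every occurrence of a word with its offsets and then groups the occurrences under a "field:word" key. The class ToyTokens, the nested Occurrence type, and the key format are assumptions made for this example; they are not Lucene's Term and Token classes, and the type string is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of term vs. token; not Lucene's Term and Token classes.
class ToyTokens {
    // One occurrence of a word: its text plus start/end offsets in the source.
    static final class Occurrence {
        final String text;
        final int start; // offset of the first character
        final int end;   // offset just past the last character

        Occurrence(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    // Emit one occurrence per word, remembering where it was found.
    static List<Occurrence> tokenize(String source) {
        List<Occurrence> out = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z]+").matcher(source);
        while (m.find()) {
            out.add(new Occurrence(m.group().toLowerCase(Locale.ROOT), m.start(), m.end()));
        }
        return out;
    }

    // Group the occurrences by term, written here as "field:word".
    static Map<String, List<Occurrence>> byTerm(String field, String source) {
        Map<String, List<Occurrence>> terms = new LinkedHashMap<>();
        for (Occurrence o : tokenize(source)) {
            terms.computeIfAbsent(field + ":" + o.text, k -> new ArrayList<>()).add(o);
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, List<Occurrence>> terms =
                byTerm("content", "lucene works well, lucene rocks");
        System.out.println(terms.get("content:lucene"));
        // prints [lucene[0,6), lucene[19,25)]
    }
}
```

The word "lucene" yields a single term in the "content" field but two tokens, one at offset 0 and one at offset 19.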

3.6 Segment
When documents are added to an index, they are not all appended to the same index file immediately. They are first written to different small files, which are later merged into one large index file; each of these small files is a segment.
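
The idea of writing small segments and merging them later can be sketched as follows. Here a "segment" is simply a sorted map from word to document ids held in memory; the class ToySegments and this representation are assumptions for illustration and say nothing about Lucene's actual file format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of segments: each one is a small, self-contained word -> ids map.
class ToySegments {
    // Build one small segment from a batch of documents.
    static Map<String, TreeSet<Integer>> buildSegment(int firstDocId, List<String> docs) {
        Map<String, TreeSet<Integer>> segment = new TreeMap<>();
        int docId = firstDocId;
        for (String doc : docs) {
            for (String word : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!word.isEmpty()) {
                    segment.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
                }
            }
            docId++;
        }
        return segment;
    }

    // Merge several small segments into one larger segment.
    static Map<String, TreeSet<Integer>> merge(List<Map<String, TreeSet<Integer>>> segments) {
        Map<String, TreeSet<Integer>> merged = new TreeMap<>();
        for (Map<String, TreeSet<Integer>> segment : segments) {
            for (Map.Entry<String, TreeSet<Integer>> e : segment.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> s1 = buildSegment(0, Arrays.asList("lucene introduction"));
        Map<String, TreeSet<Integer>> s2 = buildSegment(1, Arrays.asList("lucene works well"));
        System.out.println(merge(Arrays.asList(s1, s2)));
        // prints {introduction=[0], lucene=[0, 1], well=[1], works=[1]}
    }
}
```

After the merge, a search has to consult one structure instead of two, which is the motivation for merging segments.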

4 Structure of Lucene
Lucene consists of the core and the sandbox. The core is the stable, central part of Lucene; the sandbox contains additional features such as the highlighter and various analyzers.
The Lucene core has seven packages: analysis, document, index, queryParser, search, store, and util.
4.1 analysis
The analysis package contains built-in analyzers, such as WhitespaceAnalyzer, which splits on whitespace; StopAnalyzer, which adds stop-word filtering; and StandardAnalyzer, the most commonly used one.
4.2 document
The document package contains the data structures for documents: the Document class defines the structure in which a document is stored, and the Field class defines a field of a document.
4.3 index
The index package contains the classes that read and write the index, such as IndexWriter, which writes, merges, and optimizes the segments of the index file, and IndexReader, which reads from the index and deletes from it. Do not be misled by the name IndexReader into thinking it is only the class for reading the index file: deleting from the index is also done through it. IndexWriter is concerned only with how to write index segments and how to merge and optimize them; IndexReader is concerned with how the individual documents are organized in the index file.
4.4 queryParser
The queryParser package contains the classes that parse query statements. Lucene query statements are somewhat similar to SQL statements: there are various reserved words, which can be combined into all kinds of queries according to a certain syntax. Lucene has many query classes, all of which inherit from Query and each of which performs a particular kind of query. The job of QueryParser is to parse a query statement and invoke the appropriate query classes in turn to find the results.
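
As a rough sketch of what "parsing a query statement" means, the following plain-Java example handles a tiny made-up subset of a query language: "field:word" clauses joined by the reserved words AND and OR, with a default field for clauses that name none. ToyQueryParser, its Clause type, and the "+" marker in its output are inventions for this illustration; the example does not reproduce Lucene's QueryParser or its full syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Toy parser for "field:word" clauses joined by AND / OR.
// It only sketches the idea; it is not Lucene's QueryParser.
class ToyQueryParser {
    static final class Clause {
        final String field;
        final String word;
        boolean required; // true when the clause is joined with AND

        Clause(String field, String word) {
            this.field = field;
            this.word = word;
        }

        @Override
        public String toString() {
            return (required ? "+" : "") + field + ":" + word;
        }
    }

    static List<Clause> parse(String query, String defaultField) {
        List<Clause> clauses = new ArrayList<>();
        boolean nextRequired = false;
        for (String piece : query.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue;
            }
            if (piece.equals("AND")) {
                // AND makes both of its neighbours mandatory.
                if (!clauses.isEmpty()) {
                    clauses.get(clauses.size() - 1).required = true;
                }
                nextRequired = true;
                continue;
            }
            if (piece.equals("OR")) {
                nextRequired = false;
                continue;
            }
            int colon = piece.indexOf(':');
            // A clause without "field:" falls back to the default field.
            Clause clause = colon > 0
                    ? new Clause(piece.substring(0, colon), piece.substring(colon + 1))
                    : new Clause(defaultField, piece);
            clause.required = nextRequired;
            nextRequired = false;
            clauses.add(clause);
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("title:lucene AND works", "content"));
        // prints [+title:lucene, +content:works]
    }
}
```

In a real system each parsed clause would then be handed to a matching query class; here the clauses are only printed.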
4.5 search
The search package contains the classes that retrieve results from the index. The query classes just mentioned, such as TermQuery and BooleanQuery, are in this package.
4.6 store
The store package contains the storage classes for the index: Directory defines the storage structure of the index files, FSDirectory stores the index in files, RAMDirectory stores the index in memory, and MMapDirectory accesses the index through memory mapping.
4.7 util
The util package contains common utility classes, such as tools for converting between times and strings.
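
One reason a search library needs a time-to-string conversion is that indexed values are strings, so a time should be written in a form whose string order matches chronological order. The sketch below shows one way to do that with the standard java.time classes. The class name SortableTime, the pattern yyyyMMddHHmmss, and the choice of UTC are assumptions of this example, not a description of Lucene's own utility classes.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch of a time <-> string conversion whose strings sort chronologically.
class SortableTime {
    // Fixed-width, most-significant-field-first pattern (an assumption of this sketch).
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Time -> string, with second resolution, always in UTC.
    static String format(Instant time) {
        return LocalDateTime.ofInstant(time, ZoneOffset.UTC).format(FORMAT);
    }

    // String -> time, the inverse of format().
    static Instant parse(String text) {
        return LocalDateTime.parse(text, FORMAT).toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        Instant epoch = Instant.ofEpochSecond(0);
        String text = format(epoch);
        System.out.println(text);                      // prints 19700101000000
        System.out.println(parse(text).equals(epoch)); // prints true
    }
}
```

Because every field has a fixed width and the most significant field comes first, comparing two such strings character by character gives the same result as comparing the two times.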

5 How to build an index
5.1 The simplest code snippet for building an index

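// Uses the older Lucene API (around 2.x) that this article describes.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true); // true: overwrite an existing index
Document doc = new Document();
doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize(); // merge the segments into one
writer.close();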

Let's examine this code.
First we create a writer, specifying that the index directory is "/data/index" and that the analyzer is StandardAnalyzer. The third parameter means that if index files already exist in the index directory, they will be overwritten.
Then we create a new Document.
We add a field named "title" with the content "lucene introduction" to the Document; it is both stored and indexed.
We add another field named "content" with the content "lucene works well"; it is also stored and indexed.
Then we add the Document to the index. If there are multiple documents, repeat the steps above: create each Document and add it.
After all Documents have been added, we optimize the index. Optimization mainly merges multiple segments into one, which helps to improve search speed.
Finally, it is important to close the writer.

Yes, creating an index is that easy!
Of course, you can modify the code above to suit your own needs.

