Lucene Core Concepts

Source: Internet
Author: User

Summary of key points: how Lucene retrieval differs from database retrieval

Retrieval performance. Database: a fuzzy match such as LIKE typically scans the whole table, so performance is low. Lucene: the data is indexed first, and lookups then go through the index that was built. Creating the index is an extra step, but it is done once and used many times.

Relevance sorting. Database: results are returned in a fixed order, for example ORDER BY id. Lucene: each query scores every result, and the higher the score, the higher the rank. The score can be adjusted by setting weight values.

Matching accuracy. Database: matching goes through a query statement such as LIKE, whose drawback is the low performance of a full table scan. Lucene: the content is first split into words by the analyzer (word breaker) and an index is built from those words. Lookups then run against that index.
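The performance difference can be illustrated with a small sketch. It is not Lucene code: the class `ScanVersusIndex` and its methods are hypothetical names, the "table" is just a list of strings, and the word splitting (lowercase, split on non-word characters) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is not Lucene's API.
class ScanVersusIndex {

    // Database-style LIKE '%term%': every row is examined on every query.
    static List<Integer> fullScan(List<String> rows, String term) {
        List<Integer> hits = new ArrayList<>();
        String needle = term.toLowerCase(Locale.ROOT);
        for (int i = 0; i < rows.size(); i++) {
            if (rows.get(i).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Index-style: split every row into words once and remember where each word occurs.
    static Map<String, List<Integer>> buildIndex(List<String> rows) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            for (String word : rows.get(i).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(word, k -> new ArrayList<>());
                // Do not list the same row twice when a word repeats within it.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != i) {
                    postings.add(i);
                }
            }
        }
        return index;
    }

    // A query is a single map lookup; the rows themselves are not rescanned.
    static List<Integer> lookup(Map<String, List<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }
}
```

`buildIndex` is called once and `lookup` many times, which is the "created once, used many times" trade-off. The two approaches also match differently: `fullScan` finds substrings, while `lookup` only finds whole words that were produced when the index was built.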

The workflow of a full-text retrieval program

If an information retrieval system went out to the Internet to find an answer only after the user sent a retrieval request, it could not return a result in a reasonable amount of time. So the resources to be searched are collected locally and stored in a particular structure. That structure is called an index, and the stored collection is called an index library. Because the structure of the index library is designed specifically for quick queries, queries are fast, and every search runs against the local index library.

Searching is not the only task: the index library must also be kept consistent with the data set. Developing a full-text search feature therefore involves two things: managing the index library (maintaining the data in it) and searching the index library. Lucene is the tool that operates on the index library.

Using the Lucene API to manipulate the index library

The index library is a directory containing binary files, much as a database ultimately keeps all its data in the file system. We do not manipulate these binary files directly. We use the API provided by Lucene, just as we operate on a database through SQL statements.

Operations on the index library fall into two kinds: management and query. IndexWriter manages the index library, and IndexSearcher queries it. Lucene's data structures are Document and Field. A Document represents one piece of data, and a Field represents one property of that data. A Document contains multiple Fields, and Field values are strings, because Lucene only processes text.

Once the objects in our program are converted into Documents, we can hand them to Lucene to manage. The list of data in a search result is likewise a collection of Documents.

Index library structure: the inverted index

To speed up information retrieval, documents are preprocessed into a data structure that is easy to search. That structure is the index, and one of the most widely used indexing methods is the inverted index. Using an inverted index is like looking something up in a dictionary: first consult the table of contents to get the page number, then turn directly to that page. Instead of scanning each article for a word, we look the word up in a catalogue that points to the articles.

This requires a glossary (the table of contents) in the index library. Each record in the glossary has a structure like "word --> list of numbers of the documents containing the word", recording which documents the word appears in. A search consults the glossary first, gets the document numbers, and then fetches the corresponding documents directly.

Converting data into the specified format and saving it in the index library is called indexing. When a document is indexed, the glossary is updated after the data is saved to the index library. A search starts by looking in the glossary and then finds the corresponding documents. If the query contains only one keyword, the word is found in the glossary and its documents are fetched. If the query contains multiple keywords, the records retrieved for each word are merged first, and then the corresponding documents are fetched.

Retrieving and maintaining the index: an update is a delete followed by a create

There are three operations for maintaining an inverted index: adding, deleting, and updating documents. Updating is the expensive one. A change to a document, even a small one, may shift the positions of many keywords in it, which requires reading and modifying many records. Therefore, instead of a true in-place update, an update is carried out as "delete first, then create".

The execution process of indexing

An index entry is created by saving the document to the index library and updating the glossary.
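The glossary structure and the "delete first, then create" update described above can be sketched in plain Java. This is a minimal in-memory model, not Lucene's implementation: `InvertedIndexSketch` and its methods are hypothetical names, the word splitting is a simplification, and merging multiple keywords is assumed here to mean intersection (every keyword must appear).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of an index library; not Lucene's implementation.
class InvertedIndexSketch {
    // Stored documents, keyed by internal number.
    private final Map<Integer, String> documents = new HashMap<>();
    // Glossary: word --> numbers of the documents containing that word.
    private final Map<String, TreeSet<Integer>> glossary = new TreeMap<>();
    private int nextNumber = 0;

    private static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Indexing: save the document, then update the glossary.
    int add(String text) {
        int number = nextNumber++;
        documents.put(number, text);
        for (String w : words(text)) {
            glossary.computeIfAbsent(w, k -> new TreeSet<>()).add(number);
        }
        return number;
    }

    // Deleting: remove the document and every glossary reference to it.
    void delete(int number) {
        String text = documents.remove(number);
        if (text == null) {
            return;
        }
        for (String w : words(text)) {
            TreeSet<Integer> postings = glossary.get(w);
            if (postings != null) {
                postings.remove(number);
                if (postings.isEmpty()) {
                    glossary.remove(w);
                }
            }
        }
    }

    // Updating is "delete first, then create"; the document gets a new number.
    int update(int number, String newText) {
        delete(number);
        return add(newText);
    }

    // Multi-keyword search: intersect the document lists of all query words.
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String w : words(query)) {
            TreeSet<Integer> postings = glossary.getOrDefault(w, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? List.of() : new ArrayList<>(result);
    }

    // Fetch a stored document by its internal number (null if it was deleted).
    String get(int number) {
        return documents.get(number);
    }
}
```

In this sketch `update` returns a different number than the one passed in, which shows why an internal document number is unsafe to keep as an external reference.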

1. Our part: convert the data object into a corresponding Document, with its properties converted to Fields.
2. Our part: call addDocument(doc) on IndexWriter to add the Document to the index library.
3. Lucene's part: save the Document in the index library and automatically assign it an internal number that uniquely identifies the data. The internal number is similar to the address of the data. It may change when the data inside the index library is reorganized, and the numbers referenced in the glossary are changed accordingly so that they stay correct. If we hold on to this number externally, however, two lookups at different times may not return the same document. The internal number is therefore best used only internally.
4. Lucene's part: update the glossary. Lucene finds the words in the text and puts them in the glossary, establishing the correspondence between each word and the document.

Which words go into the glossary, that is, which words does the text contain? Deciding this is the job of a tool called the analyzer (word breaker). Its role is to extract, according to its rules, all the words contained in a piece of text. It corresponds to the Analyzer class, which is abstract. The specific segmentation rules are implemented by subclasses, so different languages (rules) call for different analyzers.

When an object's property is converted to a Field, the code looks like this: doc.add(new Field("title", article.getTitle(), Store.YES, Index.ANALYZED)). The third and fourth parameters mean:

| Enum type | Enumeration constant | Description |
|-----------|----------------------|-------------|
| Store     | NO                   | The value of the property is not stored |
| Store     | YES                  | The value of the property is stored |
| Index     | NO                   | No index is built |
| Index     | ANALYZED             | Indexed after word segmentation |
| Index     | NOT_ANALYZED         | No word segmentation; the whole content is indexed as a single word |
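The effect of the Store and Index options listed above can be modeled with a small sketch. The enum names mirror the table, but `FieldOptionsSketch` and its methods are hypothetical and not part of Lucene. The toy analyzer (lowercase, split on non-word characters) is an assumption standing in for a real Analyzer subclass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical model of the Store and Index options; not Lucene code.
class FieldOptionsSketch {
    enum Store { NO, YES }
    enum Index { NO, ANALYZED, NOT_ANALYZED }

    // Terms that would be written to the glossary for one field value.
    static List<String> indexedTerms(String value, Index index) {
        List<String> terms = new ArrayList<>();
        switch (index) {
            case NO:
                // Nothing is indexed, so the field cannot be searched.
                break;
            case NOT_ANALYZED:
                // The whole value is indexed as a single term.
                terms.add(value);
                break;
            case ANALYZED:
                // The value is split into words and each word is indexed.
                for (String w : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (!w.isEmpty()) {
                        terms.add(w);
                    }
                }
                break;
        }
        return terms;
    }

    // Value returned with a search hit, or null when the field is not stored.
    static String storedValue(String value, Store store) {
        return store == Store.YES ? value : null;
    }
}
```

The two options are independent: a field can be searchable without being returned in results (Index.ANALYZED with Store.NO), or returned without being searchable (Index.NO with Store.YES).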

Note: Store determines whether the original content of the property is included in search results. Index determines whether the property cannot be queried at all (NO), can be queried by the individual words it contains (ANALYZED), or can only be queried with the entire content as a single word (NOT_ANALYZED).

The execution process of searching the index library

A search first looks in the glossary and obtains the list of document numbers that match the criteria. The actual data is then fetched by document number.

1. Convert the query string into a Query object. This is like Hibernate, where session.createQuery(hql) converts an HQL string into a Hibernate Query object. The conversion is done with QueryParser, or with MultiFieldQueryParser. The query string is also run through an analyzer (word breaker) first. The analyzer used when searching must be the same as the analyzer used when building the index. Otherwise the correct results may not be found.
2. Call IndexSearcher.search() to run the query and get the results. The method returns a TopDocs object, which holds several pieces of information about the result: totalHits, the total number of matching records, and an array of ScoreDoc. A ScoreDoc is an object that carries the relevance score of one result together with its document number.
3. Fetch the list of data to be used. Call IndexSearcher.doc(scoreDoc.doc) to fetch the Document that corresponds to the given number. This is useful when paging: fetch only one page of documents at a time.
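The search flow above (analyze the query, collect scored document numbers, then fetch only one page of documents) can be sketched without Lucene. `SearchFlowSketch`, `Hit`, and `Results` are hypothetical stand-ins for the roles played by IndexSearcher, ScoreDoc, and TopDocs. The scoring rule (count of matching words) is an assumption for illustration and is not how Lucene computes relevance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins for the search flow; the real Lucene classes behave differently in detail.
class SearchFlowSketch {
    // Plays the role of ScoreDoc: a document number plus a relevance score.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Plays the role of TopDocs: total match count plus the best hits.
    static final class Results {
        final int totalHits;
        final List<Hit> hits;
        Results(int totalHits, List<Hit> hits) { this.totalHits = totalHits; this.hits = hits; }
    }

    // Toy analyzer; the same one is applied to documents and to the query string.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Score each document by how many of its words match the query, keep the top n.
    static Results search(List<String> docs, String query, int n) {
        List<String> queryTerms = analyze(query);
        List<Hit> all = new ArrayList<>();
        for (int doc = 0; doc < docs.size(); doc++) {
            float score = 0f;
            for (String term : analyze(docs.get(doc))) {
                if (queryTerms.contains(term)) {
                    score += 1f;
                }
            }
            if (score > 0f) {
                all.add(new Hit(doc, score));
            }
        }
        // Highest score first; ties are broken by document number.
        all.sort(Comparator.comparingDouble((Hit h) -> -h.score).thenComparingInt(h -> h.doc));
        return new Results(all.size(), new ArrayList<>(all.subList(0, Math.min(n, all.size()))));
    }

    // Paging: fetch the document data for one page of hits only.
    static List<String> page(List<String> docs, Results results, int pageIndex, int pageSize) {
        List<String> out = new ArrayList<>();
        int from = pageIndex * pageSize;
        int to = Math.min(from + pageSize, results.hits.size());
        for (int i = from; i < to; i++) {
            out.add(docs.get(results.hits.get(i).doc));
        }
        return out;
    }
}
```

Because the query "Lucene" passes through the same `analyze` method as the documents, it matches the lowercase word "lucene". Using a different analyzer on the query side is how the mismatch warned about above would arise.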
