Java Search Engine: Lucene Study Notes 3

Last Update: 2018-12-07 Source: Internet

Author: User

Article Directory

Paging Processing
Lucene Scoring Algorithm

The Lucene search API consists mainly of four classes: IndexSearcher, Query (including its subclasses), QueryParser, and Hits.
IndexSearcher is the entry point for searching. Its search method performs the search.

Query has many subclasses. Each subclass represents a different kind of query condition.

QueryParser is a very commonly used helper class. Its job is to convert the text a user types into a built-in Query object (most web search engines provide a single input box for entering query conditions). QueryParser supports a rich built-in query syntax for expressing advanced conditions. For example, "hello AND world" is parsed into a BooleanQuery with an AND relationship that contains two TermQuery objects (hello and world). Although this syntax is powerful, it was designed with English in mind. For Chinese search we do not need to know much about the query types; a few simple syntaxes are generally enough. QueryParser is used as follows:

QueryParser.parse(String query, String field, Analyzer analyzer) throws ParseException

Here, query is the text entered by the user, field is the default field to search (other fields must be named explicitly in the query), and analyzer is used to analyze the user's input (word segmentation). In general, the analyzer used here should be the same one that was used when the index was built.

Alternatively, we can construct a QueryParser ourselves with new QueryParser(String field, Analyzer a) (the parameters have the same meaning as above). The advantage of doing so is that we can then set and adjust some of its parameters ourselves.

Processing search results: the Hits object

The Hits object is the collection of search results. Its main methods are:

length() returns the number of results found (the results themselves are loaded lazily)
doc(n) returns the nth record
id(n) returns the document ID of the nth record
score(n) returns the relevance score of the nth record

Because result sets are generally large, the Hits object does not actually retrieve all the results, for performance reasons. By default only the first 100 records are kept (for a general search engine, 100 records are sufficient).

Paging Processing

Even 100 records are too many to show at once. We usually display 20 records per page and spread the results over several pages. There are two ways to implement paging:

Keep the IndexReader and Hits objects in the session and pull the content from them when the user turns the page.
Do not use the session, and simply run the query again for every page.

Lucene recommends the second method, that is, re-querying each time. It is simple and convenient, there is no session state to manage, and Lucene's queries are efficient enough to keep the query time short. Unless there is a real performance problem, you do not need to consider the first method.
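
The re-query approach can be sketched in plain Java without any Lucene dependency. Pager and pageOf are hypothetical names introduced here for illustration; the list stands in for the ranked results of a fresh search, and the page size of 20 comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: slices one page out of a ranked result list.
class Pager {
    // Returns the items on the given 1-based page, or an empty list if the page is past the end.
    static <T> List<T> pageOf(List<T> ranked, int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        long start = (long) (page - 1) * pageSize;   // index of the first item on the page
        if (start >= ranked.size()) {
            return new ArrayList<>();
        }
        int from = (int) start;
        int to = (int) Math.min((long) from + pageSize, ranked.size());
        return new ArrayList<>(ranked.subList(from, to));
    }
}
```

With 45 results and a page size of 20, page 3 would hold the last 5 records and page 4 would be empty. Because every request recomputes the slice from a new search, nothing has to be stored in the session.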

Cache: using RAMDirectory

The RAMDirectory object is very useful. With it, we can read an ordinary index entirely into memory. The usage is as follows:
RAMDirectory ramDir = new RAMDirectory(dir);
Searching this ramDir is naturally much faster than searching an index on the real file system.

Lucene Scoring Algorithm

Records returned by Lucene are sorted by relevance by default, and the relevance is the score. The scoring algorithm is complex, and knowing its details is of little practical help to those of us who merely use Lucene. (First, a word on the term "term": as I understand it, a term is an independent query word. After the user's input has gone through word segmentation, case normalization, and stopword removal, the resulting terms are the basic units of the query.) The key parameters to pay attention to are:

Frequency of the term in the document
Number of documents containing the term
Boost parameter on the field
Term length
Number of terms in the document

In general, we cannot adjust these parameters. If you want to learn more, IndexSearcher also provides an explain method. By passing in a query and a document ID, you get an Explanation object, which is a simple wrapper around the internal scoring details. Its toString() gives the detailed description.
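
To give a feel for how the listed parameters pull a score up or down, here is a simplified TF-IDF style sketch in plain Java. It is not Lucene's actual formula: the class SimpleScore and the exact functions chosen (square root of the term frequency, a logarithmic inverse document frequency, and a length norm of one over the square root of the term count) are assumptions made for illustration, and term length is not modeled at all.

```java
// Illustrative only: a simplified TF-IDF style score, not Lucene's real scoring code.
class SimpleScore {
    // More occurrences of the term in the document raise the score, with diminishing returns.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // Terms that appear in fewer documents are treated as more significant.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // A match in a short document counts for more than the same match in a long one.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Combines the factors for a single term in a single document.
    static double score(int termFreq, int docFreq, int numDocs, int numTerms, double boost) {
        return tf(termFreq) * idf(docFreq, numDocs) * boost * lengthNorm(numTerms);
    }
}
```

Under these assumptions, doubling the boost doubles the score, while a rarer term or a shorter document raises it more gradually. For the real numbers behind a particular hit, the explain method mentioned above is the authoritative source.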

Query creation: the various Query classes

TermQuery, the most common query
TermQuery is the most commonly used query and is constructed like this: Term t = new Term("contents", "cap"); new TermQuery(t)
TermQuery treats the query condition as a key that must exactly match the indexed content. For example, TermQuery is suitable for fields of type Field.Keyword.

RangeQuery
RangeQuery represents a search over a range: RangeQuery query = new RangeQuery(begin, end, included);
The final boolean indicates whether the boundary values themselves are included. In query text this is written "[begin TO end]" (inclusive) or "{begin TO end}" (exclusive).
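
The effect of the boundary flag can be sketched in plain Java. TermRange and inRange are hypothetical names, and the sketch assumes terms are compared as plain strings with String.compareTo, which is why the date values in the usage note below are written as fixed-width strings.

```java
// Illustrative only: models inclusive vs. exclusive range bounds on term text.
class TermRange {
    // Returns true if term lies between begin and end in String.compareTo order.
    // With inclusive == true the bounds themselves match ("[begin TO end]");
    // with inclusive == false they do not ("{begin TO end}").
    static boolean inRange(String term, String begin, String end, boolean inclusive) {
        int lower = term.compareTo(begin);
        int upper = term.compareTo(end);
        return inclusive ? (lower >= 0 && upper <= 0) : (lower > 0 && upper < 0);
    }
}
```

For example, with begin "20180101" and end "20181231", the term "20180101" is inside the inclusive range but outside the exclusive one, while "20180615" is inside both.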

PrefixQuery
As the name implies, it matches terms that start with a given prefix. In query text it is written "something*".

BooleanQuery
This is a combined query. You can add various queries to it and specify their logical relationships with the method

public void add(Query query, boolean required, boolean prohibited)

The two boolean parameters express the AND, OR, and NOT relationships; in query text these are written "AND OR NOT" or "+ -". Many queries can be added to one BooleanQuery, but if the number of clauses exceeds the value of setMaxClauseCount(int) (1024 by default), a TooManyClauses error is thrown.
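
How the two flags combine can be sketched with a small stand-in class in plain Java. SimpleBooleanQuery is hypothetical and matches single terms against a set of document terms. It assumes that every required clause must match, no prohibited clause may match, and that when there are no required clauses at least one optional clause must match; setting both flags on one clause is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for the required/prohibited flags; not Lucene code.
class SimpleBooleanQuery {
    private static final class Clause {
        final String term;
        final boolean required;
        final boolean prohibited;

        Clause(String term, boolean required, boolean prohibited) {
            this.term = term;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    private final List<Clause> clauses = new ArrayList<>();

    // required=true acts like AND (+), prohibited=true like NOT (-), both false like OR.
    SimpleBooleanQuery add(String term, boolean required, boolean prohibited) {
        if (required && prohibited) {
            throw new IllegalArgumentException("a clause cannot be both required and prohibited");
        }
        clauses.add(new Clause(term, required, prohibited));
        return this;
    }

    boolean matches(Set<String> docTerms) {
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (Clause c : clauses) {
            boolean hit = docTerms.contains(c.term);
            if (c.prohibited && hit) return false;        // NOT clause matched
            if (c.required) {
                anyRequired = true;
                if (!hit) return false;                   // AND clause missing
            } else if (!c.prohibited && hit) {
                anyOptionalHit = true;                    // OR clause matched
            }
        }
        return anyRequired || anyOptionalHit;
    }
}
```

In this model, "hello AND world" corresponds to two clauses added with (true, false), "hello OR world" to two clauses with (false, false), and "hello NOT world" to one (true, false) clause plus one (false, true) clause.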

PhraseQuery
PhraseQuery supports non-strict phrase matching, for example letting "red pig" match "red fat pig" and "red fat big pig". PhraseQuery provides a setSlop() parameter. During the query, Lucene tries to shift the positions of the words, and slop is the number of such moves that is acceptable: if the actual content can be adjusted into an exact match within that many moves, it counts as a match. The default slop is 0, so non-strict matching is off by default. By setting slop (for example, "red pig" matching "red fat pig" requires a slop of 1, to move "pig" one position later), Lucene can perform fuzzy phrase searches. Note that PhraseQuery does not enforce word order. In the example above, "pig red" needs a slop of 2 to match "red pig", so if slop is 2 or greater, "pig red" is also considered a match.
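
The slop values quoted above can be reproduced with a simplified model in plain Java. This is not Lucene's implementation. SlopSketch and requiredSlop are hypothetical names, and the model assumes that each word's document position is shifted back by its position in the phrase and that the required slop is the spread between the largest and smallest shifted position. Repeated words in the phrase get no special handling.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of phrase slop; it reproduces the examples in the text only.
class SlopSketch {
    // Smallest slop at which the phrase would match the document, or -1 if a word is missing.
    static int requiredSlop(String[] phrase, String[] doc) {
        if (phrase.length == 0) return 0;
        List<List<Integer>> positions = new ArrayList<>();
        for (String word : phrase) {
            List<Integer> found = new ArrayList<>();
            for (int i = 0; i < doc.length; i++) {
                if (doc[i].equals(word)) found.add(i);
            }
            if (found.isEmpty()) return -1;
            positions.add(found);
        }
        return search(positions, 0, Integer.MAX_VALUE, Integer.MIN_VALUE);
    }

    // Tries every combination of occurrences and keeps the smallest spread.
    private static int search(List<List<Integer>> positions, int i, int min, int max) {
        if (i == positions.size()) return max - min;
        int best = Integer.MAX_VALUE;
        for (int p : positions.get(i)) {
            int shifted = p - i;   // document position minus position in the phrase
            best = Math.min(best,
                    search(positions, i + 1, Math.min(min, shifted), Math.max(max, shifted)));
        }
        return best;
    }
}
```

In this model, "red pig" against "red fat pig" needs a slop of 1, and "pig red" against "red pig" needs a slop of 2, in line with the description above.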

WildcardQuery
Use ? to stand for exactly one character and * for any number of characters; for example, wil* matches wild, wila, wilxaaaa, and so on. Note that with wildcards every matching record has the same relevance. For example, wilxaaaa and wild are equally relevant to wil*.
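
The wildcard semantics can be sketched by translating a pattern into a regular expression in plain Java. WildcardSketch is a hypothetical helper, not the way Lucene evaluates wildcards internally; it assumes ? matches exactly one character and * matches zero or more.

```java
import java.util.regex.Pattern;

// Illustrative only: converts a ?/* wildcard pattern into an equivalent regex.
class WildcardSketch {
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");                               // any run of characters, possibly empty
            } else if (c == '?') {
                sb.append('.');                                // exactly one character
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));   // literal character
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // The whole term must match the pattern, not just a part of it.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }
}
```

Here wil* accepts wild, wila, and wilxaaaa, while wil? accepts wild but rejects wilxaaaa. The result is a plain yes or no, which fits the remark that all wildcard matches share the same relevance.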

FuzzyQuery
This query is of little use for Chinese, but it can fuzzy-match English words (for phrases, use the PhraseQuery described earlier). For example, fuzzy and wuzzy are considered similar. FuzzyQuery is useful for English tense changes and plural forms, and unlike wildcard matches, the matching results have different relevance. In query text it is written "fuzzy~".
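
Fuzzy matching of this kind is based on edit distance, the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below is a standard Levenshtein distance in plain Java. EditDistance is a hypothetical class name, and how a distance is turned into a match decision or a score is left out.

```java
// Illustrative only: classic Levenshtein edit distance using two rolling rows.
class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                          // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                          // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The words fuzzy and wuzzy differ by a single substitution, so their distance is 1. Because different words sit at different distances from the query word, it is natural for fuzzy matches to carry different relevance.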

QueryParser usage

With a search engine, users often type all their query conditions into a single input box (as on Google). This is where QueryParser comes in: its role is to convert whatever the user types into a Query or a group of queries. It turns the textual query forms mentioned above (the output of Query.toString()) into actual Query objects; for example, "wuzzy~" is converted into a FuzzyQuery. However, because QueryParser runs the input through an analyzer, the result of parsing a query and calling toString() on it may not be identical to the original text. The additional query syntax is:

Grouping
For example, "(a AND b) OR c" uses parentheses for grouping, which is easy to understand.

Field selection
QueryParser conditions apply to the default field, which is specified when the query is parsed. To target another field within the query, use the syntax fieldname:fielda. For several values, use fieldname:(fielda fieldb fieldc).

The * sign
By default QueryParser does not allow * at the start of a term. This is mainly to prevent users from accidentally entering a leading * and causing serious performance problems (every record would have to be read).

Boosting
Writing hello^2.0 boosts the term "hello". (I doubt any ordinary user would actually do this.)

QueryParser is a ready-made helper class that works out of the box, but it still offers many parameters that the program can adjust. To do so, first construct a new QueryParser and then customize its parameters.

This article is an English translation of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
