Lucene syntax (detailed Lucene query syntax)

Lucene provides a variety of APIs for combining and customizing the queries you need. You can also build a query with the query syntax supported by the query parser. This article describes Lucene's query syntax in detail. The query parser interprets a query string and turns it into a Lucene Query object, using a grammar generated with JavaCC. Before using the query parser, consider the following:

If your program assembles a query syntax string and then hands it to the query parser, we strongly recommend building the query directly with the query API instead. In other words, the query parser is designed for advanced queries typed in by a person, not for syntax strings concatenated by a program. Fields that are not tokenized are best added to the query through the API rather than through the query parser. The analyzer used by the query parser exists to convert human-entered text into terms. If a field's values are generated by a program (a date field or a keyword field, for example), the query clauses for that field should be generated by the program too, in the same format.

In a query form, fields that hold general text typed by the user are the ones that should go through the query parser. Everything else, such as date ranges and keywords, is better added by calling the query API directly. If a field has only a limited set of enumerated values, offer the user a drop-down list and add the selection to the query as a TermQuery, rather than concatenating it into a query string and running that through the query parser.

Terms
A query is broken up into terms and operators. There are two types of terms: single terms and phrases. A single term is the smallest unit produced by the analyzer's tokenization: a single word such as "test" or "hello". A phrase is a group of words enclosed in double quotation marks, such as "hello dolly". Multiple terms can be combined with Boolean operators to form a more complex query.
Note: In general, the analyzer used to build the index should be the same as the analyzer used at query time (there are special cases, such as indexing single characters and querying with segmented words). It is therefore important to choose an analyzer that does not interfere with the terms used in the query.

Fields
Lucene supports fielded data. You can specify a field to search or rely on the default field. To search a specific field, type the field name, a colon ":", and then the term. For example, assume a Lucene index contains two fields, title and text, and text is the default field. To find documents whose title is "The Right Way" and whose text contains "go", you can enter:
title:"The Right Way" AND text:go
Or:
title:"The Right Way" AND go
Because text is the default field, it does not need to be named explicitly in the query. Note that a field name applies only to the term that directly follows it, which can produce unexpected results:
title:do it right
This query searches for "do" in the title field only; "it" and "right" are searched in the text field, because text is the default field. To match the whole phrase in the title, enclose it in quotation marks.
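The scoping rule can be illustrated with a small sketch. It is not Lucene's parser: the function name assign_fields and the default field name "text" are assumptions made for this example, and it handles only bare terms, quoted phrases, and an optional field prefix.

```python
import re

def assign_fields(query, default_field="text"):
    """Pair each term or quoted phrase with the field it is searched in.

    Deliberately simplified: no operators, grouping, or escaping.
    """
    # Optional "field:" prefix, then either a quoted phrase or a bare term.
    pattern = re.compile(r'(?:(\w+):)?("[^"]*"|\S+)')
    pairs = []
    for field, text in pattern.findall(query):
        # A term without a prefix falls back to the default field.
        pairs.append((field or default_field, text.strip('"')))
    return pairs

print(assign_fields('title:do it right'))
print(assign_fields('title:"do it right"'))
```

In the first call only "do" is paired with title, while "it" and "right" fall back to the default field. In the second call the quoted phrase stays together under title.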

Term Modifiers
Lucene lets you modify query terms to provide a wide range of search options.

Wildcard searches
Lucene supports single-character and multiple-character wildcard searches. Use the symbol "?" to match a single character and the symbol "*" to match multiple characters.
The "?" wildcard finds terms that match once the "?" is replaced by exactly one character. For example, to search for "test" or "text", you can use:
te?t
The "*" wildcard finds terms that match once the "*" is replaced by zero or more characters. For example, to search for test, tests, or tester, you can use:
test*
You can also place "*" in the middle of a term:
te*t
Note: You cannot use "*" or "?" as the first character of a search term. (Lucene does not support this for performance reasons.)
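The wildcard rules can be sketched by translating a wildcard term into a regular expression. This is only an illustration of the matching semantics, not how Lucene evaluates wildcard queries; the function names and the sample term list are assumptions made for this example.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard term into an anchored regular expression."""
    if pattern[:1] in ("*", "?"):
        raise ValueError("a wildcard cannot be the first character")
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")       # exactly one character
        elif ch == "*":
            parts.append(".*")      # zero or more characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z")

def matching_terms(pattern, terms):
    """Return the terms that the wildcard pattern matches in full."""
    regex = wildcard_to_regex(pattern)
    return [t for t in terms if regex.match(t)]

terms = ["test", "text", "tests", "tester", "toast"]
print(matching_terms("te?t", terms))
print(matching_terms("test*", terms))
```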

Fuzzy searches
Lucene supports fuzzy searches based on the edit distance algorithm. To run a fuzzy search, put the tilde "~" at the end of a single-word term. For example, to search for terms spelled similarly to "roam", you can use:
roam~
This query finds terms such as "foam" and "roams". It is also known as a similarity query.
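To see why "foam" and "roams" count as close to "roam", here is a minimal sketch of edit distance, assuming the standard Levenshtein definition (single-character insertions, deletions, and substitutions). How Lucene converts a distance into a match threshold is not covered here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

for candidate in ("foam", "roams", "lucene"):
    print(candidate, edit_distance("roam", candidate))
```

Both "foam" (one substitution) and "roams" (one insertion) are a single edit away from "roam".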

Proximity searches
Lucene supports finding words that are within a specified distance of each other. To run a proximity search, put the tilde "~" and a number at the end of a phrase. For example, to search for "apache" and "jakarta" within 10 words of each other, you can use:
"jakarta apache"~10
This syntax also makes it possible to index text one character at a time and still query with segmented words: after segmentation, each word is searched as a phrase whose characters must be adjacent, that is, with a distance of 0. This can guarantee a 100% recall rate, but the index becomes bloated and queries slow down to some extent. In general, performance drops noticeably once the data volume reaches millions of articles.
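As a rough illustration of a proximity check, the sketch below tests whether two terms occur within a given number of token positions of each other. It is a simplification: Lucene's actual distance calculation for phrases is more involved (term order matters, for instance), and the function name and sample sentence are assumptions made for this example.

```python
def within_distance(tokens, first, second, max_distance):
    """True if `first` and `second` occur at most `max_distance` positions apart."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)

tokens = "the apache software foundation hosts the jakarta project".split()
print(within_distance(tokens, "jakarta", "apache", 10))
print(within_distance(tokens, "jakarta", "apache", 2))
```

Here "apache" and "jakarta" are five positions apart, so the first check succeeds and the second does not.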

Range searches
Range queries let you specify a lower and an upper bound for a field and match all documents whose values fall between them. A range query can be inclusive or exclusive of the bounds. Values are compared in lexicographic order.
mod_date:[20020101 TO 20030101]
This finds all documents whose mod_date field is greater than or equal to 20020101 and less than or equal to 20030101. Note: Range queries are not specific to date fields; you can also use them on non-date fields.
title:{Aida TO Carmen}
This finds all documents whose title falls between Aida and Carmen, not including Aida and Carmen themselves. Square brackets denote an inclusive range query, which contains the upper and lower bounds; curly brackets denote an exclusive range query, which leaves them out.
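Because the bounds are compared as strings, the format of the values matters. The sketch below mimics an inclusive and an exclusive range check with plain string comparison; the function name in_range and the lowercase sample titles are assumptions made for this example.

```python
def in_range(term, lower, upper, inclusive=True):
    """Compare terms as strings, the way a lexicographic range does."""
    if inclusive:
        return lower <= term <= upper
    return lower < term < upper

# Fixed-width dates (yyyymmdd) sort correctly as strings.
print(in_range("20020615", "20020101", "20030101"))
# An exclusive range leaves out the bounds themselves.
print(in_range("aida", "aida", "carmen", inclusive=False))
# Unpadded numbers do not sort numerically: "50" is greater than "100" as a string.
print(in_range("50", "1", "100"))
# Zero-padding to a fixed width restores the expected order.
print(in_range("050", "001", "100"))
```

This is one reason program-generated values such as dates are best written in a fixed-width format, both at index time and at query time.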

Boosting a term
Lucene supports giving different weights to different query terms. To boost a term, place the caret "^" at the end of the term, followed by the boost factor. The higher the boost factor, the more important the term. Boosting lets you influence the relevance of documents. Suppose you are searching for:
jakarta apache
If you consider "jakarta" more important in this query, you can use the following syntax:
jakarta^4 apache
This makes documents containing "jakarta" more relevant. You can also boost a phrase, as follows:
"jakarta apache"^4 "jakarta lucene"
By default, the boost factor is 1. The boost factor must be positive, but it can be less than 1 (for example, 0.2).

Boolean operators
Boolean operators combine multiple terms into a complex logical query. Lucene supports AND, "+", OR, NOT, and "-" as Boolean operators. Note that the word operators must be written in uppercase.

OR
OR is the default conjunction operator: when no operator is given between two terms, OR is used. A document matches as long as it contains either of the terms. The symbol || can be used in place of the word OR. To search for documents that contain "jakarta apache" or just "jakarta", you can use either of the following:
"jakarta apache" jakarta
Or:
"jakarta apache" OR jakarta

AND
The AND operator matches only documents in which all of the terms appear. The symbol && can be used in place of the word AND. To search for documents that contain both "jakarta apache" and "jakarta lucene", you can use the following syntax:
"jakarta apache" AND "jakarta lucene"

+
The "+" operator requires that the term following it appear in the document; it marks the term as required. For example, to search for documents that must contain "jakarta" and may contain "apache", you can use the following syntax:
+jakarta apache

NOT
The NOT operator excludes documents that contain the term following NOT. The symbol ! can be used in place of the word NOT. To search for documents that contain "jakarta apache" but not "jakarta lucene", you can use the following query:
"jakarta apache" NOT "jakarta lucene"
Note: The NOT operator cannot be used with only one term. For example, the following query returns no results:
NOT "jakarta apache"

-
The "-" operator excludes documents that contain the term following it, much like NOT. To search for documents that contain "jakarta apache" but not "jakarta lucene", you can use the following syntax:
"jakarta apache" -"jakarta lucene"

Grouping
Lucene supports grouping clauses with parentheses to form subqueries, which is useful for controlling the Boolean logic of a query. For example, to search for documents that contain "website" and either "jakarta" or "apache", you can use the following syntax:
(jakarta OR apache) AND website
Grouping removes ambiguity and makes sure the query is interpreted as intended: here "website" must be present, together with at least one of "jakarta" and "apache".
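The Boolean operators above can be summarized with a simplified model of required, optional, and prohibited terms. This is a sketch under stated assumptions, not Lucene's implementation: it works on single terms only (no phrases, no scoring), and the function name and parameter names must, should, and must_not are chosen for this example.

```python
def matches(doc_terms, must=(), should=(), must_not=()):
    """Decide whether a document matches a flat set of Boolean clauses."""
    terms = set(doc_terms)
    if any(t in terms for t in must_not):
        return False                         # a prohibited term is present
    if not all(t in terms for t in must):
        return False                         # a required term is missing
    if must:
        return True                          # optional terms are not needed to match
    return any(t in terms for t in should)   # otherwise one optional term must match

doc = {"jakarta", "website"}

# +jakarta apache
print(matches(doc, must=["jakarta"], should=["apache"]))
# jakarta AND apache
print(matches(doc, must=["jakarta", "apache"]))
# (jakarta OR apache) AND website, evaluated group by group
print(matches(doc, should=["jakarta", "apache"]) and matches(doc, must=["website"]))
# NOT lucene, on its own
print(matches(doc, must_not=["lucene"]))
```

The last call returns False even though the document does not contain "lucene": with nothing positive to match, a query made only of prohibited terms selects no documents, which is consistent with the note that NOT cannot be used on its own.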

Field grouping
Lucene supports grouping several clauses under a single field with parentheses. To search for titles that contain both the word "return" and the phrase "pink panther", you can use the following syntax:
title:(+return +"pink panther")

Escaping special characters
Lucene supports escaping special characters that are part of the query syntax. The special characters are:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
To escape a special character, put a backslash "\" before it. For example, to search for (1+1):2, you can use the following syntax:
\(1\+1\)\:2
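When user-entered text has to be passed through the query parser literally, the escaping rule can be applied mechanically. The sketch below is a hypothetical helper written for this article, not part of Lucene; as an assumption, it escapes every "&" and "|" individually, which is stricter than escaping only the "&&" and "||" pairs.

```python
# Characters taken from the list of special characters above.
SPECIAL_CHARACTERS = set('+-&|!(){}[]^"~*?:\\')

def escape_query_text(text):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch for ch in text)

print(escape_query_text("(1+1):2"))
```

In Java code, prefer the escaping helper that ships with the query parser, where one is available, over a hand-written function like this one.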


From: http://hi.baidu.com/expertsearch/blog/item/8d4f7d355a2e413c5ab5f547.html

This translation is original. Please indicate the source when reprinting. For the original English version, see:
http://lucene.apache.org/java/1_4_3/queryparsersyntax.html
