Elasticsearch's AutoComplete

Source: Internet
Author: User

For a search engine, suggesting completions while the user is still typing a search term is an important feature, and Elasticsearch (ES) supports it.

Does this feature look a bit like a prefix query? A prefix query, however, returns the documents that match, whereas autocomplete has to return the terms that match, so the two should not be confused. This is why the suggest module exists. Here we focus on the completion suggester.
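To make the difference concrete, here is a rough sketch of the two request bodies written as Python dictionaries. It follows the syntax of the older ES 1.x releases that this article appears to describe. The field names `name` and `suggest` and the label `hotel-suggest` are assumptions made up for illustration.

```python
import json

# A prefix query: the response lists matching documents (hits).
prefix_query = {
    "query": {
        "prefix": {"name": "mer"}  # "name" is a hypothetical text field
    }
}

# A completion suggest request: the response lists matching suggestions.
# "hotel-suggest" is an arbitrary label; "suggest" is a hypothetical
# field assumed to be mapped with type "completion".
suggest_request = {
    "hotel-suggest": {
        "text": "mer",
        "completion": {"field": "suggest"}
    }
}

print(json.dumps(prefix_query, indent=2))
print(json.dumps(suggest_request, indent=2))
```

In 1.x the first body would be sent to the `_search` endpoint and the second to the `_suggest` endpoint. Later releases changed the suggest API, so check the documentation for the version in use.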

1: Why is a separate suggester needed?

Speed first: For suggestions, an ordinary query has to look through too many terms, which is not fast enough. Autocomplete must be extremely fast because the user will not wait, so ES stores the terms in an in-memory data structure called an FST (finite state transducer) to make lookups fast enough.
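The idea of a fast, in-memory, left-to-right prefix lookup can be sketched as follows. This is not an FST: it is a sorted list searched with binary search, used here only as a simple stand-in that shares the prefix-matching behavior. The function name and the sample terms are hypothetical.

```python
from bisect import bisect_left

def prefix_lookup(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix`.

    `sorted_terms` must be sorted. Binary search jumps to the first
    candidate, then we walk forward while the prefix still matches.
    """
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for i in range(start, len(sorted_terms)):
        term = sorted_terms[i]
        if not term.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(term)
    return matches

terms = sorted(["mercure hotel munich", "meridian", "marriott", "munich city"])
print(prefix_lookup(terms, "mer"))
```

The lookup cost depends on the length of the list only logarithmically plus the number of matches returned, which is the property that matters for a suggestion box.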

Real-time: ES is known for being near real-time. Building one FST in memory at query time would be expensive, since all the data would have to be loaded, and the FST would have to be rebuilt every time the data changed, which is clearly unreasonable. To stay real-time, ES moves the construction of the FST from the query stage to the indexing stage, when a segment is generated: every new segment is written together with its own FST file, and loading such a file into memory is very fast. Because there are multiple FST files, the final suggestion list is a merge of the results from all of them.
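The per-segment merge can be sketched like this, under simplifying assumptions: each segment is modeled as a plain dictionary of term to weight instead of an FST file, and a term that occurs in several segments keeps its highest weight. All names and numbers are hypothetical.

```python
import heapq

def suggest_in_segment(segment, prefix):
    """Look up one segment's own structure (here just a dict of term -> weight)."""
    return [(term, weight) for term, weight in segment.items()
            if term.startswith(prefix)]

def suggest(segments, prefix, size=3):
    """Query every segment and merge the partial results into one top list."""
    best = {}
    for segment in segments:
        for term, weight in suggest_in_segment(segment, prefix):
            # The same term may appear in several segments; keep its best weight.
            best[term] = max(weight, best.get(term, weight))
    top = heapq.nlargest(size, best.items(), key=lambda item: item[1])
    return [term for term, _ in top]

segments = [
    {"munich city": 5, "mercure hotel munich": 8},  # older segment
    {"marriott munich": 9, "munich city": 7},       # newer segment
]
print(suggest(segments, "m"))
```

Adding a new segment only means adding one more small structure to the list; nothing that already exists has to be rebuilt.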

Readability: Users may type many different things. Suppose one suggestion is "Courtyard by Marriott, Munich City". Whether the user types "Courtyard Munich" or "Munich hotel", they expect to be shown the specific entry that these phrases refer to. A suggestion should be as clear and unambiguous as possible, so for any of the inputs above we should return the readable suggestion "Courtyard by Marriott, Munich City".

Custom ordering of suggestions: The results of an ordinary query are ranked by factors such as TF/IDF. For suggestions, the user may need full control over the order, for example to rank discounted hotels first.

For these reasons, ES provides a separate suggester to handle autocomplete.

2: Some settings for the completion suggester

Multiple inputs: The FST matches from left to right, which limits what a single stored string can match. To make suggestions more flexible, you can set several inputs that share the same logical meaning, so that whichever of them the user types, the correct suggestion is returned. Set them in the input attribute of the suggest field.

Unified output: The output attribute specifies what is shown to the user as the suggestion, rather than the term stored in the FST. For example, with the input "Mercure Hotel Munich", typing "M" would normally suggest "Mercure Hotel Munich"; if output is set to "Hotel Mercure", the suggestion shown is the output text instead.
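How several inputs lead to one displayed output can be illustrated with a small in-memory model. This is only a sketch of the idea, not how ES stores the data, and the helper names `build_index` and `complete` are hypothetical.

```python
def build_index(entries):
    """Map every lowercased input to the single output it should display."""
    index = {}
    for entry in entries:
        for text in entry["input"]:
            index[text.lower()] = entry["output"]
    return index

def complete(index, typed):
    """Return the distinct outputs whose inputs start with what was typed."""
    typed = typed.lower()
    outputs = []
    for text in sorted(index):
        if text.startswith(typed) and index[text] not in outputs:
            outputs.append(index[text])
    return outputs

entries = [
    {
        "input": ["Courtyard by Marriott, Munich City",
                  "Courtyard Munich",
                  "Munich hotel"],
        "output": "Courtyard by Marriott, Munich City",
    },
    {
        "input": ["Mercure Hotel Munich"],
        "output": "Hotel Mercure",
    },
]
index = build_index(entries)
print(complete(index, "munich h"))
print(complete(index, "m"))
```

Matching is done against the inputs, but the user only ever sees the outputs, so two inputs of the same entry never produce two separate suggestions.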

Weight: Set the weight attribute. Different documents can be given different weights, which the suggester uses for ordering: suggestions with a higher weight come first. If no weight is set, ES determines the order from the term frequency (TF) at query time.

User-defined information: payload accepts a JSON structure that carries user-defined information. For example, it can hold the document ID, so that the ID is returned together with the suggestion and the user can go straight to the document.
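Put together, a document that uses input, output, weight and payload might look like the sketch below. It follows the ES 1.x document format as an assumption; the field names `name` and `suggest`, the weight and the `doc_id` value are invented for illustration.

```python
import json

# Hypothetical hotel document; "name" and "suggest" are assumed field names,
# and "suggest" is assumed to be mapped with type "completion".
hotel_doc = {
    "name": "Mercure Hotel Munich",
    "suggest": {
        "input": ["Mercure Hotel Munich", "Mercure Munich"],  # what users may type
        "output": "Hotel Mercure",                            # what is shown
        "weight": 10,                                         # higher ranks first
        "payload": {"doc_id": 42},                            # returned with the suggestion
    },
}

body = json.dumps(hotel_doc)
print(body)
```

In 1.x, payloads also have to be enabled in the field mapping with `"payloads": true`. Later major releases changed this format (for instance, `output` and `payload` were removed), so treat the sketch as version-specific.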

Synonyms: As with queries, synonyms are supported; configure the corresponding token filter in the analyzer.

Ignoring stop words: Note that the default index_analyzer in the completion suggest mapping is the simple analyzer rather than the standard analyzer. Why? Again because of stop words. Take "Charles Hotel": when the user types "Charles", the right suggestion should come back, but with a stop token filter it does not. The filter removes the stop word but leaves a blank position in the FST, so the stored suggestion starts with a blank and the typed prefix no longer matches. To get rid of the blank, set preserve_position_increments and preserve_separators to false.
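The effect of the blank position can be modeled roughly as follows. This is a simplified simulation, not the real analysis chain: a removed stop word leaves an empty token, and tokens are joined with a separator character. The example assumes the indexed name carries a leading stop word ("The Charles Hotel"), which is an assumption added here for illustration. The mapping fragment at the end is a hypothetical ES 1.x style example of the two settings.

```python
STOP_WORDS = {"the", "a", "an"}  # tiny illustrative stop list
SEP = "\x1f"                     # stand-in for the separator between tokens

def stored_form(text, preserve_position_increments=True, preserve_separators=True):
    """Simplified model of the string kept for a text after a stop filter."""
    tokens = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            # A removed stop word leaves an empty position behind when
            # position increments are preserved.
            if preserve_position_increments:
                tokens.append("")
        else:
            tokens.append(word)
    return (SEP if preserve_separators else "").join(tokens)

def matches(indexed_text, typed, **flags):
    """Analyze both sides the same way, then do a left-to-right prefix match."""
    return stored_form(indexed_text, **flags).startswith(stored_form(typed, **flags))

# Hypothetical mapping fragment for a field of type "completion".
suggest_mapping = {
    "type": "completion",
    "index_analyzer": "stop",
    "search_analyzer": "stop",
    "preserve_position_increments": False,
    "preserve_separators": False,
}

print(matches("The Charles Hotel", "Charles"))
print(matches("The Charles Hotel", "Charles",
              preserve_position_increments=False, preserve_separators=False))
```

With the defaults, the stored form starts with the separator left behind by the removed word, so the prefix "charles" cannot match; with both settings turned off, the stored form starts directly with "charles".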

Whether to ignore stop words at all is also a tricky question. Take "Simon the Sorcerer": if stop words are ignored, then once the user has typed "Simon T" nothing is suggested, because "t" by itself is not a stop word. So to get good suggestions, we have to make full use of multiple inputs and list all the ways the user might type the name (which is not that complicated).
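One possible way to generate such a list of inputs is sketched below. The variant rules (the full title, the title without stop words, and every tail that starts with a non-stop word) are an assumption chosen for illustration, not a rule prescribed by ES, and the helper names are hypothetical.

```python
STOP_WORDS = {"the", "a", "an", "of"}  # tiny illustrative stop list

def build_inputs(title):
    """Generate input variants for one suggestion title."""
    words = title.split()
    variants = [title]
    # Variant with all stop words dropped, e.g. "Simon Sorcerer".
    variants.append(" ".join(w for w in words if w.lower() not in STOP_WORDS))
    # Every tail of the title that starts with a non-stop word, e.g. "Sorcerer".
    for i in range(1, len(words)):
        if words[i].lower() not in STOP_WORDS:
            variants.append(" ".join(words[i:]))
    # Drop duplicates and empty strings while keeping the original order.
    seen, inputs = set(), []
    for variant in variants:
        if variant and variant not in seen:
            seen.add(variant)
            inputs.append(variant)
    return inputs

def is_suggested(inputs, typed):
    """True if the typed text is a prefix of at least one input."""
    typed = typed.lower()
    return any(text.lower().startswith(typed) for text in inputs)

inputs = build_inputs("Simon the Sorcerer")
print(inputs)
print(is_suggested(inputs, "Simon T"))
```

Because the full title is kept as one of the inputs, "Simon T" still finds the entry, while "Simon S" and "Sorc" are covered by the other variants.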

Plans for the future:

ES will keep working on this area, and fuzzy matching will also be supported.





Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
