Elasticsearch Mapping and analysis

Source: Internet
Author: User
Tags: elasticsearch, mapping

Analysis and Analyzers

Analysis is the process of:

    • First, tokenizing a block of text into the individual terms that are suitable for use in the inverted index
    • Then, normalizing those terms into a standard form to improve their "searchability" or "recall"

This job is performed by an analyzer. An analyzer is simply a wrapper that combines three functions into a single package:

Character filters

First, the string is passed through any character filters, which have a chance to tidy up the string before tokenization. A character filter could, for example, strip out HTML tags or convert "&" characters to "and".

Tokenizer

Next, the string is split into individual terms by a tokenizer. A simple tokenizer might split the text on whitespace or punctuation. (Translator's note: this approach does not apply to Chinese.)

Token filters

Finally, each term is passed through any token filters, which can change terms (for example, lowercasing "Quick"), remove terms (stopwords such as "a", "and", and "the"), or add terms (synonyms such as "jump" and "leap").

Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suited to different requirements. We will discuss this in more detail in the "Custom Analyzers" section.
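The three-stage pipeline above can be sketched in Python. This is a toy illustration only, not Elasticsearch's actual implementation; the function names and the regular expressions are made up for the example:

```python
import re

def char_filter(text):
    # Character filter: strip HTML tags and expand "&" to "and"
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", "and")

def tokenizer(text):
    # Tokenizer: split on anything that is not a letter, digit, or underscore
    return [t for t in re.split(r"[^\w]+", text) if t]

def token_filters(tokens, stopwords={"a", "and", "the"}):
    # Token filters: lowercase each term, then drop stopwords
    return [t.lower() for t in tokens if t.lower() not in stopwords]

def analyze(text):
    # An analyzer is just the three functions applied in order
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<p>Quick & brown foxes</p>"))
# ['quick', 'brown', 'foxes']
```

Note how each stage only sees the output of the previous one: the token filters never see the original HTML, and the tokenizer never sees the raw "&".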

Built-in analyzers

However, Elasticsearch also ships with a number of built-in analyzers that you can use directly. Below we list the most important ones and show how each would tokenize the following string:

"Set the shape to semi-transparent by calling set_trans(5)"
Standard analyzer

The standard analyzer is the analyzer that Elasticsearch uses by default. For text analysis, it is the best general-purpose choice for any language. (Translator's note: unless you have special requirements, this analyzer is good enough for the language of any country.) It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It would produce:

set, the, shape, to, semi, transparent, by, calling, set_trans, 5
Simple analyzer

The simple analyzer splits the text on anything that is not a letter, and lowercases each term. It would produce:

set, the, shape, to, semi, transparent, by, calling, set, trans
Whitespace analyzer

The whitespace analyzer splits the text on whitespace. It does not lowercase. It would produce:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
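The differences between these three analyzers can be approximated in Python. This is a rough simulation for illustration only; the real analyzers (especially the standard analyzer's Unicode word-boundary rules) are more sophisticated:

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def standard_like(text):
    # Approximate the standard analyzer: keep runs of letters, digits,
    # and underscores (so "set_trans" survives as one term), lowercase
    return [t.lower() for t in re.findall(r"\w+", text)]

def simple(text):
    # Simple analyzer: split on anything that is not a letter, lowercase
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

def whitespace(text):
    # Whitespace analyzer: split on whitespace only, no lowercasing
    return text.split()

print(standard_like(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set_trans', '5']
print(simple(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
print(whitespace(TEXT))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
```

Comparing the three outputs makes the trade-off visible: the simple analyzer destroys "set_trans", while the whitespace analyzer preserves punctuation and case that would hurt recall.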
Language analyzers

Language-specific analyzers are available for many languages. They can take the characteristics of a particular language into account. For example, the english analyzer comes with a list of English stopwords, common words such as "and" or "the" that carry little meaning by themselves; thanks to grammar, the main meaning of an English sentence can still be understood after these words are removed. The english analyzer also stems English words, reducing them to their root form. (Translator's note: think of an English sentence as a plant; stopwords are the foliage, whose presence or absence does not affect understanding, while the trunk, the stems of the words, remains.)

The english analyzer would produce:

set, shape, semi, transpar, call, set_tran, 5

Note how "transparent", "calling", and "set_trans" have been reduced to their stems.

When analyzers are used

When we index a document, its full-text fields are analyzed into individual terms in order to create the inverted index. However, when we search within a full-text field, the query string must be passed through the same analysis process, to ensure that we search for terms in the same form as those stored in the index.

Full-text queries, which we will discuss later, understand how each field is defined, and so they can do the right thing:

    • When you query a full-text field, the query applies the same analyzer to the query string to produce the correct list of terms to search for.
    • When you query an exact-value field, the query string is not analyzed; instead, the exact value you have specified is searched for.

Now you can see why the queries at the beginning of Mapping and Analysis produce the results they do:

    • The date field contains an exact value: the single term "2014-09-15".
    • The _all field is a full-text field, so the analysis process converted the date into three terms: "2014", "09", and "15".

When we query the _all field for 2014, it matches all 12 tweets, because they all contain the term 2014:

GET /_search?q=2014              # 12 results

When we query the _all field for 2014-09-15, the query string is first analyzed, producing a query that matches any of the terms 2014, 09, or 15. It, too, matches all 12 tweets, because they all contain the term 2014:

GET /_search?q=2014-09-15        # 12 results !

When we query the date field for 2014-09-15, it looks for that exact date and finds one tweet only:

GET /_search?q=date:2014-09-15   # 1  result

When we query the date field for 2014, no documents are found, because no document contains that exact date:

GET /_search?q=date:2014         # 0  results !
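The behavior of these four queries can be mimicked with a toy inverted index in Python. This is purely illustrative; the field names follow the tweet example above, and the analyze function is only a rough stand-in for the standard analyzer:

```python
import re

# One tweet document with an exact-value date field and an analyzed _all field
doc = {"date": "2014-09-15", "_all": "2014-09-15 some tweet text"}

def analyze(text):
    # Roughly what the standard analyzer does: split into terms, lowercase
    return [t.lower() for t in re.findall(r"\w+", text)]

def search(field, query):
    if field == "_all":
        # Full-text field: analyze the query too, match if ANY term is indexed
        indexed = set(analyze(doc["_all"]))
        return any(term in indexed for term in analyze(query))
    # Exact-value field: compare the whole string, unanalyzed
    return doc[field] == query

print(search("_all", "2014"))        # True  - the term "2014" is in the index
print(search("_all", "2014-09-15"))  # True  - matches "2014", "09", or "15"
print(search("date", "2014-09-15"))  # True  - exact value matches
print(search("date", "2014"))        # False - not the exact value
```

The key point is that the _all query is analyzed into terms before matching, while the date query is compared as a whole, which is exactly why 2014 matches one field and not the other.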
Testing analyzers

Especially when you are new to Elasticsearch, it can be difficult to understand how text is tokenized and stored in the index. To get a better grasp of what is going on, you can use the analyze API to see how text is analyzed. Specify the analyzer to use as a query-string parameter, and pass the text to analyze in the request body:

GET /_analyze?analyzer=standard
Text to analyze

Each element in the result represents a term:

{
   "tokens": [
      { "token": "text",    "start_offset": 0, "end_offset": 4,  "type": "<ALPHANUM>", "position": 1 },
      { "token": "to",      "start_offset": 5, "end_offset": 7,  "type": "<ALPHANUM>", "position": 2 },
      { "token": "analyze", "start_offset": 8, "end_offset": 15, "type": "<ALPHANUM>", "position": 3 }
   ]
}

token is the actual term that will be stored in the index. position indicates the order in which the term appeared in the original text. start_offset and end_offset mark the character positions the term occupied in the original string.
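A minimal sketch of how such token metadata could be produced (an approximation for illustration, not the analyze API itself):

```python
import re

def analyze_with_offsets(text):
    # Emit token, character offsets, and 1-based position, roughly
    # mirroring the shape of the analyze API's response
    return [
        {
            "token": m.group().lower(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "position": i,
        }
        for i, m in enumerate(re.finditer(r"\w+", text), start=1)
    ]

for tok in analyze_with_offsets("Text to analyze"):
    print(tok)
# {'token': 'text', 'start_offset': 0, 'end_offset': 4, 'position': 1}
# {'token': 'to', 'start_offset': 5, 'end_offset': 7, 'position': 2}
# {'token': 'analyze', 'start_offset': 8, 'end_offset': 15, 'position': 3}
```

The offsets are what highlighting uses to locate each term in the original string, while the positions support phrase and proximity matching.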

The analyze API is a very useful tool for understanding the internals of an Elasticsearch index, and we will return to it as we progress.

Specifying analyzers

When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer.

This is not always what you want. Perhaps you want to apply a language analyzer better suited to the data. Or perhaps you just want to treat the string field as a plain exact-value field, with no analysis at all, such as a string user ID, an internal status field, or a tag.

To achieve this, we must configure these fields manually via the mapping.
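For example, a mapping along these lines would index a tag field as an exact value while applying the english analyzer to a tweet field. The field names here are illustrative, and the "string" / "not_analyzed" syntax matches the older Elasticsearch versions this article describes:

```json
{
  "mappings": {
    "properties": {
      "tag":   { "type": "string", "index": "not_analyzed" },
      "tweet": { "type": "string", "analyzer": "english" }
    }
  }
}
```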

Copyright notice: This is the blogger's original article. Do not reproduce without the blogger's permission.
