Analysis and Analyzers
Analysis is a process that:
- First, tokenizes a block of text into the individual terms that go into the inverted index
- Then normalizes those terms into a standard form, improving their "searchability", or recall

This job is performed by an analyzer. An analyzer is really just a wrapper that combines three functions into a single package:
Character filters
First, the string is passed through any character filters, whose job is to tidy up the string before tokenization. A character filter could, for example, strip out HTML tags, or convert "&" characters to the word "and".
Tokenizer
Next, the string is split into individual terms by a tokenizer. A simple tokenizer might split the text on whitespace or punctuation. (Translator's note: this does not apply to Chinese.)
Token filters
Finally, each term passes through any token filters, which can change terms (for example, lowercasing "Quick"), remove terms (such as the stopwords "a", "and", "the", and so on), or add terms (such as the synonyms "jump" and "leap").
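The three stages can be sketched in miniature. This is a toy illustration in Python, not Elasticsearch's implementation; the filter rules are just the examples from the text above:

```python
import re

def char_filter(text):
    # Character filter: replace "&" with "and" before tokenization
    return text.replace("&", " and ")

def tokenizer(text):
    # Tokenizer: split on anything that is not a letter, digit, or underscore
    return [t for t in re.split(r"\W+", text) if t]

def token_filters(tokens):
    # Token filters: lowercase every term, then drop common stopwords
    stopwords = {"a", "an", "and", "the"}
    return [t.lower() for t in tokens if t.lower() not in stopwords]

def analyze(text):
    # Full pipeline: character filters -> tokenizer -> token filters
    return token_filters(tokenizer(char_filter(text)))

print(analyze("Quick & dirty"))  # ['quick', 'dirty']
```

Note how "&" survives as a searchable term ("and" would have, had it not also been a stopword here), while case and punctuation disappear.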
Elasticsearch ships with many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suited to different requirements. We will discuss this in more detail in the "Custom Analyzers" section.
Built-in analyzers
Elasticsearch also comes with a number of pre-packaged analyzers that you can use directly. Below we list the most important ones and, to demonstrate their differences, show how each one tokenizes this string:
"Set the shape to semi-transparent by calling set_trans(5)"
Standard analyzer
The standard analyzer is the analyzer that Elasticsearch uses by default. It is a reasonable choice for text in almost any language. (Translator's note: unless you have special requirements, this analyzer is sufficient for any single language.) It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It produces:
set, the, shape, to, semi, transparent, by, calling, set_trans, 5
Simple analyzer
The simple analyzer splits the text on anything that isn't a letter, and then lowercases each term. It produces:
set, the, shape, to, semi, transparent, by, calling, set, trans
Whitespace analyzer
The whitespace analyzer splits the text on whitespace. It does not lowercase. It produces:
Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
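The behavior of the simple and whitespace analyzers is easy to reproduce in miniature (a Python sketch, not Elasticsearch's actual implementation):

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def simple_analyzer(text):
    # Simple analyzer: split on anything that is not a letter, then lowercase
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

def whitespace_analyzer(text):
    # Whitespace analyzer: split on whitespace only; no lowercasing
    return text.split()

print(simple_analyzer(TEXT))
print(whitespace_analyzer(TEXT))
```

Notice how the simple analyzer breaks "semi-transparent" and "set_trans(5)" apart at the non-letter characters, while the whitespace analyzer keeps them intact, punctuation and all.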
Language analyzers
Language-specific analyzers are available for many languages. They can take the particularities of the specified language into account. For example, the english analyzer comes with a set of English stopwords, common words like "and" or "the" that carry little semantic weight, which it removes. Because it understands the rules of English grammar, it is also able to stem English words. (Translator's note: stemming reduces a word to its root form, like pruning a plant back to its trunk; what is removed, like stopwords, does not affect understanding of the sentence.)

The english analyzer produces the following result:
set, shape, semi, transpar, call, set_tran, 5
Note how "transparent", "calling", and "set_trans" have been reduced to their stems.
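Stemming itself can be illustrated with a deliberately naive suffix-stripper. This is only a sketch: the real english analyzer uses a full Porter-style stemming algorithm, not this toy rule set, but it happens to reproduce the three stems above:

```python
def toy_stem(term):
    # Toy stemmer: strip a few common English suffixes.
    # The real `english` analyzer uses a Porter-style algorithm instead.
    for suffix in ("ing", "ent", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

for word in ("transparent", "calling", "set_trans"):
    print(word, "->", toy_stem(word))
```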
When analyzers are used
When we index a document, its full-text fields are analyzed into individual terms, which are used to build the inverted index. However, when we search within a full-text field, we want the query string to go through the same analysis process, to ensure that we search for terms in the same form as those stored in the index.
Full-text queries, which we will discuss later, understand how each field is defined, so they can do the right thing:
- When you query a full-text field, the query applies the same analyzer to the query string, producing the correct list of terms to search for.
- When you query an exact-value field, the query does not analyze the query string; instead, it searches for the exact value you have specified.
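Why sharing one analyzer between index time and query time matters can be seen in a toy inverted index (a Python sketch with made-up documents, not how Elasticsearch is implemented):

```python
import re
from collections import defaultdict

def analyze(text):
    # Toy analyzer used for BOTH indexing and querying:
    # split on non-word characters, then lowercase.
    return [t.lower() for t in re.findall(r"\w+", text)]

# Build a tiny inverted index: term -> set of document ids
docs = {1: "Quick brown fox", 2: "The QUICK rabbit"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)

def search(query):
    # The query string goes through the SAME analyzer, so "QUICK"
    # becomes "quick" and matches both documents.
    terms = analyze(query)
    return set().union(*(index[t] for t in terms)) if terms else set()

print(search("QUICK"))  # {1, 2}
```

Had the query string been looked up verbatim, "QUICK" would have matched nothing, because only the lowercased form exists in the index.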
Now you can see why the queries at the beginning of this chapter on mapping and analysis produce the results they do:

The date field contains an exact value: the single term "2014-09-15".

The _all field is a full-text field, so the analysis process converts the date into the three terms "2014", "09", and "15".
When we query the _all field for 2014, it matches all 12 tweets, because they all contain the term 2014:
GET /_search?q=2014 # 12 results
When we query the _all field for 2014-09-15, the query string is first analyzed, producing a query that matches any of the terms 2014, 09, or 15. It also matches all 12 tweets, because they all contain the term 2014:
GET /_search?q=2014-09-15 # 12 results !
When we query the date field for 2014-09-15, it looks for that exact date, and finds only one tweet:
GET /_search?q=date:2014-09-15 # 1 result
When we query the date field for 2014, it finds no documents, because no document contains that exact date:
GET /_search?q=date:2014 # 0 results !
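All four results can be reproduced with a toy model (a Python sketch; it uses three made-up tweets rather than the chapter's 12, but the matching logic is the same):

```python
import re

# Three sample tweet dates (the chapter's index has 12; three suffice here)
dates = {1: "2014-09-13", 2: "2014-09-14", 3: "2014-09-15"}

def terms(value):
    # _all is full text: "2014-09-15" is analyzed into {"2014", "09", "15"}
    return set(re.findall(r"\w+", value))

def full_text_match(query, value):
    # Both query string and field are analyzed; any shared term matches (OR)
    return bool(terms(query) & terms(value))

def exact_match(query, value):
    # The date field holds a single exact term; no analysis is applied
    return query == value

print(sum(full_text_match("2014-09-15", d) for d in dates.values()))  # 3: all share "2014"
print(sum(exact_match("2014-09-15", d) for d in dates.values()))      # 1
print(sum(exact_match("2014", d) for d in dates.values()))            # 0
```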
Testing analyzers
Especially when you are new to Elasticsearch, it can be difficult to understand how text is tokenized and stored in the index. To better understand what is going on, you can use the analyze API to see how text is analyzed. Specify which analyzer to use in the query-string parameters, and pass the text to analyze in the request body:
GET /_analyze?analyzer=standard
Text to analyze
Each element in the result represents a single term:
{
   "tokens": [
      { "token": "text",    "start_offset": 0, "end_offset": 4,  "type": "<ALPHANUM>", "position": 1 },
      { "token": "to",      "start_offset": 5, "end_offset": 7,  "type": "<ALPHANUM>", "position": 2 },
      { "token": "analyze", "start_offset": 8, "end_offset": 15, "type": "<ALPHANUM>", "position": 3 }
   ]
}
The token is the actual term that will be stored in the index. position indicates the order in which the terms appeared in the original text. start_offset and end_offset are the character positions that the term occupied in the original string.
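The meaning of these offsets can be checked with a small sketch (a toy Python tokenizer that mimics the shape of the _analyze response, not the real API):

```python
import re

def analyze_with_offsets(text):
    # Produce token dicts shaped like the _analyze response:
    # the stored term, its character offsets, and its position.
    return [
        {
            "token": m.group().lower(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "position": i + 1,
        }
        for i, m in enumerate(re.finditer(r"\w+", text))
    ]

for tok in analyze_with_offsets("Text to analyze"):
    print(tok)
```

Running it on "Text to analyze" reproduces the offsets shown above: "text" occupies characters 0-4, "to" 5-7, and "analyze" 8-15.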
The analyze API is a very useful tool for understanding what is happening inside Elasticsearch indices, and we will return to it as the content progresses.
Specifying analyzers
When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer.
This may not always be what you want. Perhaps you want to apply a language analyzer that is better suited to the data. Or perhaps you want the string field to be treated as a plain exact-value field, with no analysis at all, such as a string user ID or an internal status field or tag.
To achieve this, we must configure these fields manually by specifying a mapping.
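As a preview, a mapping along these lines would do both at once. This is only a sketch: the index name and the tweet and user_id field names are hypothetical, and the string / not_analyzed syntax belongs to the older Elasticsearch versions this guide covers:

```json
PUT /my_index
{
  "mappings": {
    "tweet": {
      "properties": {
        "tweet":   { "type": "string", "analyzer": "english" },
        "user_id": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
```

Here the tweet field is analyzed with the english analyzer, while user_id is stored as a single exact value and never analyzed.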