Forwarded from: http://blog.csdn.net/hzrandd/article/details/47128895
Analysis and Analyzers
Analysis is the process of:
- First, tokenizing a block of text into the individual terms that are used in the inverted index
- Then normalizing those terms into a standard form, to improve their "searchability" or recall
This work is performed by analyzers. An analyzer is really just a wrapper that combines three functions into a single package:
Character filters
First, the string is passed through any character filters, which tidy up the string before tokenization. A character filter could, for example, strip out HTML tags or convert "&" characters to "and".
Tokenizer
Next, the string is split into individual terms by a tokenizer. A simple tokenizer might split the text on spaces or commas (translator's note: this approach does not work for Chinese).
Token filters
Finally, each term is passed in turn through any token filters, which can change terms (for example, lowercasing "Quick"), remove terms (for example, stopwords such as "a", "and", and "the"), or add terms (for example, synonyms such as "jump" and "leap").
Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suited to different requirements. We will discuss custom analyzers in more detail in the "Custom Analyzers" section.
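As a preview, here is a minimal sketch of how the three functions are wired together in a custom analyzer definition. The index name my_index and analyzer name my_analyzer are hypothetical; the html_strip character filter, the standard tokenizer, and the lowercase and stop token filters are all built into Elasticsearch:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "stop" ]
                }
            }
        }
    }
}

Text indexed into a field that uses my_analyzer would pass through the character filter, then the tokenizer, then the token filters, in that order.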
Built-in analyzers
Elasticsearch also ships with a number of pre-packaged analyzers that you can use directly. Below we list the most important ones and show what each would produce for this string:
"Set the shape to semi-transparent by calling set_trans(5)"
Standard analyzer
The standard analyzer is the analyzer that Elasticsearch uses by default. It is the best general-purpose choice for text in any language (translator's note: unless you have special requirements, this analyzer is sufficient for any single language). It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It would produce:
set, the, shape, to, semi, transparent, by, calling, set_trans, 5
Simple analyzer
The simple analyzer splits the text on anything that isn't a letter, and lowercases each term. It would produce:
set, the, shape, to, semi, transparent, by, calling, set, trans
Whitespace analyzer
The whitespace analyzer splits the text on whitespace. It does not lowercase. It would produce:
Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
Language analyzers
Language-specific analyzers are available for many languages. They can take the peculiarities of a particular language into account. For example, the english analyzer comes with a set of English stopwords (common words such as "and" or "the" that carry little meaning on their own), which it removes; thanks to the grammatical rules of English, the main meaning of a sentence can still be understood without them. It is also able to stem English words. (Translator's note: "stem English words" was hard to translate. The idea is that an English sentence is like a plant: strip away the useless foliage and the trunk still stands. Stopwords are like that foliage; their presence or absence does not affect the understanding of the sentence.)
The english analyzer would produce:
set, shape, semi, transpar, call, set_tran, 5
Attention "transparent"
, "calling"
and "set_trans"
how it turns into stemming.
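You can reproduce this output yourself with the analyze API, which is covered in "Testing analyzers" below. For example:

GET /_analyze?analyzer=english
Set the shape to semi-transparent by calling set_trans(5)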
When analyzers are used
When we index a document, its full-text fields are analyzed into separate terms, which are used to build the inverted index. However, when we search within a full-text field, the query string must be passed through the same analysis process, to ensure that we search for terms in the same form as those stored in the index.
Full-text queries, which we discuss later, understand how each field is defined, and so they can do the right thing:
- When you query a full-text field, the query applies the same analyzer to the query string, producing the correct list of terms to search for.
- When you query an exact-value field, the query does not analyze the query string; it searches for the exact value you have specified.
Now you can see why the queries at the beginning of this chapter produced the results they did:
- The date field contains an exact value: the single term "2014-09-15".
- The _all field is a full-text field, so the analysis process converted the date into three terms: "2014", "09", and "15".
When we query the _all field for 2014, it matches all 12 tweets, because they all contain the term 2014:
GET /_search?q=2014 # 12 results
When we query the _all field for 2014-09-15, the query string is first analyzed, producing a query that matches any of the terms 2014, 09, or 15. It still matches all 12 tweets, because they all contain the term 2014:
GET /_search?q=2014-09-15 # 12 results !
When we query the date field for 2014-09-15, it looks for that exact date, and finds one tweet only:
GET /_search?q=date:2014-09-15 # 1 result
When we query the date field for 2014, no documents are found, because none contain that exact date:
GET /_search?q=date:2014 # 0 results !
Testing analyzers
Especially when you are new to Elasticsearch, it can be difficult to understand what is actually tokenized and stored in the index. To get a better understanding of what is going on, you can use the analyze API to see how text is analyzed. Specify which analyzer to use in the query-string parameters, and pass the text to analyze in the request body:
GET /_analyze?analyzer=standard
Text to analyze
Each element in the result represents a single term:
{"Tokens": [{"Token":"Text","Start_offset":0,"End_offset":4,"Type": "position": 1} , { "token": "to", "Start_offset": Span class= "Hljs-number" >5, "End_offset": 7, " type ": " <ALPHANUM> "" position ": 2}, { "token": "analyze", " Start_offset ": 8, " End_offset ": 15, " type ": " <ALPHANUM> ", " position ": 3}]}
The token is the actual term that will be stored in the index. position indicates the order in which the term appeared in the original text. start_offset and end_offset give the character positions that the term occupied in the original string.
The analyze API is a very useful tool for understanding what is happening inside Elasticsearch indices, and we will return to it as we progress.
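The analyze API can also combine a tokenizer and token filters ad hoc, without referencing a named analyzer. A sketch, using the Elasticsearch 1.x query-string style this article follows (the tokenizer and filters parameters name built-in components):

GET /_analyze?tokenizer=whitespace&filters=lowercase
Text to analyze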
Specifying analyzers
When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text field and analyzes it with the standard analyzer.
You don't always want this. Perhaps you want to use a language analyzer better suited to the data. Or perhaps you just want a string field to be treated as a plain exact-value field, with no analysis at all, such as a string user ID, an internal status field, or a tag.
To achieve this, we have to configure these fields manually through the mapping.
Mapping
As we know, each document in an index has a type, and every type has its own mapping, or schema definition. A mapping defines the fields within a type, the datatype of each field, and how Elasticsearch should handle the field. A mapping is also used to configure metadata associated with the type.
Core simple field types
Elasticsearch supports the following simple field types:

| Type           | Datatypes represented      |
|----------------|----------------------------|
| String         | string                     |
| Whole number   | byte, short, integer, long |
| Floating point | float, double              |
| Boolean        | boolean                    |
| Date           | date                       |
When you index a document that contains a new field, one that has not been seen before, Elasticsearch uses dynamic mapping to guess the field type from the basic JSON datatypes, using the following rules:
| JSON type                        | Field type |
|----------------------------------|------------|
| Boolean: true or false           | "boolean"  |
| Whole number: 123                | "long"     |
| Floating point: 123.45           | "double"   |
| String, valid date: "2014-09-15" | "date"     |
| String: "foo bar"                | "string"   |
Note
This means that if you index a number in quotes, such as "123", it will be mapped as type "string", not type "long". However, if the field is already mapped as type "long", Elasticsearch will try to convert the string into a long, and will throw an exception if the conversion fails.
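A quick way to watch dynamic mapping at work is to index a document with a quoted number into a scratch index and then inspect the mapping that Elasticsearch guessed. The index my_index, type my_type, and field count here are hypothetical:

PUT /my_index/my_type/1
{ "count": "123" }

GET /my_index/_mapping/my_type

Because the value is quoted, count will have been mapped as "string", even though it looks numeric.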
Viewing the mapping
We can view the mapping that Elasticsearch holds using the _mapping endpoint. At the beginning of this chapter we already retrieved the mapping for type tweet in index gb:
GET /gb/_mapping/tweet
This shows us the mapping for the fields, called properties, that Elasticsearch generated dynamically when it created the index:
{ "GB": { "mappings": { "tweet": { "Properties": {" date ": {" type ": " date ", "format": "Dateoptionaltime"}, " type ": " string "}, " tweet ": { "type": "string"}, "user_id": { Span class= "hljs-string" > "type": "Long"}}}}}
Tip
An incorrect mapping, such as an age field mapped as type string instead of integer, can produce confusing query results.
Check the mapping instead of assuming that it is correct!
Customizing field mappings
The most important attribute of a field is type. For fields other than string fields, you will seldom need to map anything other than type:
{ "number_of_clicks": { "type": "integer" }}
Fields of type string are, by default, assumed to contain full text. Their value is passed through an analyzer before being indexed, and a full-text query on the field passes the query string through the analyzer before searching.
For string fields, the two most important mapping attributes are index and analyzer.
index
The index attribute controls how the string will be indexed. It can contain one of three values:
| Value        | Explanation                                                                                        |
|--------------|----------------------------------------------------------------------------------------------------|
| analyzed     | First analyze the string, then index it. In other words, index this field as full text.             |
| not_analyzed | Index this field so that it is searchable, but index the value exactly as given. Do not analyze it. |
| no           | Don't index this field at all. This field will not be searchable.                                   |
The default value of index for string fields is analyzed. If we want to map the field as an exact value, we need to set it to not_analyzed:
{ "tag": { "type": "string", "index": "not_analyzed" }}
The other simple types (long, double, date, and so forth) also accept the index attribute, but the only relevant values are no and not_analyzed, because their values are never analyzed.
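For instance, here is a minimal sketch of a numeric field that is kept in the document but excluded from the index; the field name views is hypothetical:

{
    "views": {
        "type":  "long",
        "index": "no"
    }
}

With "index": "no", the value remains in the stored document but cannot be searched.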
analyzer
For analyzed string fields, use the analyzer attribute to specify which analyzer to apply both at search time and at index time. By default, Elasticsearch uses the standard analyzer, but you can change this by specifying one of the built-in analyzers, such as whitespace, simple, or english:
{ "tweet": { "type": "string", "analyzer": "english" }}
In the "Custom Analyzer" section we will show you how to define and use a custom parser.
Updating a mapping
You can specify the mapping for a type when you first create an index. Alternatively, you can add a mapping for a new type (or update the mapping for an existing type) later.
Important
Although you can add to an existing mapping, you can't change it. If a field already exists in the mapping, this probably means that data from that field has already been indexed. If you were to change the field mapping, the data that has already been indexed would be wrong and could not be searched correctly.
We can update a mapping to add a new field, but we can't change an existing field's mapping from analyzed to not_analyzed.
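To illustrate, if we tried to switch the existing tweet field from analyzed to not_analyzed, Elasticsearch would reject the request with a merge-mapping conflict error rather than apply the change. A sketch of such a rejected request:

PUT /gb/_mapping/tweet
{
    "properties": {
        "tweet": {
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}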
To demonstrate both ways of specifying a mapping, let's first delete the index gb:
DELETE /gb
Then create a new index, specifying that the tweet field should use the english analyzer:
PUT /gb <1>
{
    "mappings": {
        "tweet": {
            "properties": {
                "tweet": {
                    "type":     "string",
                    "analyzer": "english"
                },
                "date": {
                    "type": "date"
                },
                "name": {
                    "type": "string"
                },
                "user_id": {
                    "type": "long"
                }
            }
        }
    }
}
<1> This creates the index, with the mappings specified in the request body.
Later on, we decide to add a new not_analyzed text field called tag to the tweet mapping, using the _mapping endpoint:
PUT /gb/_mapping/tweet
{
    "properties": {
        "tag": {
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}
Note that we didn't need to list the fields that already exist, since we can't change them anyway. Our new field has been merged into the existing mapping.
Testing the mapping
You can use the analyze API to test the mapping for string fields by field name. Compare the output of these two requests:
GET /gb/_analyze?field=tweet
Black-cats <1>

GET /gb/_analyze?field=tag
Black-cats <1>
<1> The text to analyze is passed in the request body.
The tweet field produces the two terms "black" and "cat", while the tag field produces the single term "Black-cats". In other words, our mapping is working correctly.