Analysis and Analyzers
Analysis is the process of tokenizing a block of text into the individual terms that go into the inverted index, and then normalizing those terms into a standard form to improve their "searchability" or recall.
This job is performed by an analyzer. An analyzer is really just a wrapper that combines three functions into a single package:

Character filters
First, the string is passed through any character filters, whose job is to tidy up the string before it is tokenized. A character filter could strip out HTML tags, or convert "&" characters to the word "and".

Tokenizer
Next, the string is split into individual terms by a tokenizer. A simple tokenizer might split the text on whitespace or punctuation (an approach that does not work for Chinese).

Token filters
Finally, each term is passed through any token filters in turn, which can change terms (for example, lowercasing "Quick"), remove terms (stopwords such as "a", "and", and "the"), or add terms (synonyms such as "jump" and "leap").
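To make the three stages concrete, here is a minimal sketch of a custom analyzer definition that combines all of them. The index name my_index and analyzer name my_analyzer are made up for illustration; html_strip, standard, lowercase, and stop are built-in components:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "stop" ]
                }
            }
        }
    }
}

A string indexed with this analyzer would first have its HTML stripped, then be split into words by the standard tokenizer, and finally be lowercased and have stopwords removed.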
Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes. We discuss those in detail in the "Custom Analyzer" section.

Built-in analyzers
However, Elasticsearch also ships with a number of pre-packaged analyzers that you can use directly. We list the most important ones below and, to demonstrate the difference in behavior, show the terms each of them produces from this string:
"Set the shape to semi-transparent by calling Set_trans (5)"
Standard analyzer
The standard analyzer is the analyzer that Elasticsearch uses by default. It is the best general choice for analyzing text that may be in any language. It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It would produce:
set, the, shape, to, semi, transparent, by, calling, set_trans, 5
Simple analyzer
The simple analyzer splits the text on anything that isn't a letter, and lowercases each term. It would produce:
set, the, shape, to, semi, transparent, by, calling, set, trans
Whitespace analyzer
The whitespace analyzer splits the text on whitespace. It doesn't lowercase anything. It would produce:
Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
Language analyzers
Language-specific analyzers are available for many languages. They are able to take the peculiarities of the specified language into account. For example, the english analyzer comes with a set of English stopwords (common, semantically weak words like and or the), which it removes; because of the rules of English grammar, the main meaning of a sentence can still be understood after they are gone. The english analyzer is also able to stem English words, reducing each word to its root form, because it understands the rules of English grammar.
The english analyzer would produce:
set, shape, semi, transpar, call, set_tran, 5
Note how "transparent", "calling", and "set_trans" have been stemmed to their root forms.
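You can reproduce any of these term lists yourself with the analyze API (described in "Testing analyzers" below) by naming the analyzer in the query string, for example:

GET /_analyze?analyzer=english
Set the shape to semi-transparent by calling set_trans(5)

Substituting standard, simple, or whitespace for english produces the other lists shown above.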
When analyzers are used

When we index a document, its full-text fields are analyzed into terms, which are used to create the inverted index. However, when we search in a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those stored in the index.
Full-text queries, which we discuss later, understand how each field is defined, and so they can do the right thing: when you query a full-text field, the query applies the same analyzer to the query string to produce the correct list of terms to search for. When you query an exact-value field, the query string is not analyzed; instead, it searches for the exact value that you have specified.
Now you can see why the queries at the beginning of "Mapping and Analysis" return what they do: the date field contains an exact value: the single term "2014-09-15". The _all field is a full-text field, so the analysis process has converted the date into the three terms "2014", "09", and "15".
When we query for 2014 in the _all field, it matches 12 tweets, because they all contain the term 2014:
GET /_search?q=2014 # 12 results
When we query for 2014-09-15 in the _all field, the query string is first analyzed, producing a query that matches any of the terms 2014, 09, or 15. It, too, matches all 12 tweets, because they all contain the term 2014:
GET /_search?q=2014-09-15 # 12 results!
When we query for 2014-09-15 in the date field, it looks for that exact date and finds only one tweet:
GET /_search?q=date:2014-09-15 # 1 result
When we query for 2014 in the date field, no documents are found, because none contains that exact date:
GET /_search?q=date:2014 # 0 results!
Testing analyzers
Especially when you are new to Elasticsearch, it can sometimes be difficult to understand what is actually being tokenized and stored in the index. To see how text is analyzed, you can use the analyze API. Specify the analyzer to use in the query-string parameters, and pass the text to analyze in the request body:
GET /_analyze?analyzer=standard
Text to analyze
Each element in the result represents a term:
{"
tokens": [
{
"token": "text",
"Start_offset": 0,
"end_offset": 4, "
type": "<ALPHANUM>",
"position": 1
},
{
"token": "to",
"Start_offset": 5,
"End_offset": 7,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": " analyze",
"Start_offset": 8,
"End_offset": "
Type": "< Alphanum> ",
" position ": 3
}
]
}
token is the actual term that will be stored in the index. position indicates the order in which the terms appeared in the original text. start_offset and end_offset give the character positions that the word occupied in the original string; for example, "analyze" spans characters 8 to 15 of "Text to analyze".
The analyze API is a useful tool for understanding what is happening inside Elasticsearch indices, and we will talk about it more as we progress.

Specifying analyzers
When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer.
You don't always want this. Perhaps you want to apply a different analyzer that suits the language your data is in. Or perhaps you want a string field to be just a plain string, stored exactly as is without any analysis, such as a string user ID or an internal status field or tag.
To achieve this, we must configure these fields manually by specifying their mapping.
Mapping
As we know, each document in an index belongs to a type. Every type has its own mapping, or schema definition. A mapping defines the fields within a type, the datatype of each field, and how the field should be handled by Elasticsearch. A mapping is also used to configure metadata associated with the type.

Core simple field types
Elasticsearch supports the following simple field types:

Data type           Types used
String              string
Whole number        byte, short, integer, long
Floating point      float, double
Boolean             boolean
Date                date
When you index a document that contains a new field, one previously unseen, Elasticsearch uses dynamic mapping to guess the field type from the basic datatypes available in JSON, using the following rules:

JSON type                               Field type
Boolean: true or false                  boolean
Whole number: 123                       long
Floating point: 123.45                  double
String, valid date: "2014-09-15"        date
String: "foo bar"                       string
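As a quick sketch of these rules in action (the index, type, and field names here are made up for illustration), indexing this document into a fresh index:

PUT /my_index/my_type/1
{
    "active":  true,
    "count":   123,
    "price":   123.45,
    "created": "2014-09-15",
    "title":   "foo bar"
}

and then fetching GET /my_index/_mapping/my_type should show active mapped as boolean, count as long, price as double, created as date, and title as string.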
Note

This means that if you index a number in quotes ("123"), it will be mapped as type string, not type long. However, if the field has already been mapped as type long, Elasticsearch will try to convert the string into a long, and will throw an exception if it can't.
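Continuing the sketch above, where count has already been mapped as long, the following requests illustrate the conversion behavior (again, the names are illustrative, and the exact error message varies between versions):

PUT /my_index/my_type/2
{ "count": "456" } <1>

PUT /my_index/my_type/3
{ "count": "hello" } <2>

<1> The string "456" is successfully converted to the long 456.
<2> This fails, because "hello" cannot be parsed as a long.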
Viewing the mapping

We can view the mapping that Elasticsearch has for one or more types using the _mapping endpoint. At the beginning of this chapter, we already retrieved the mapping for type tweet in index gb:
GET /gb/_mapping/tweet
This shows us the mapping for the fields (called properties) that were generated dynamically by Elasticsearch when the documents were indexed:
{"GB": {"mappings": {"tweet": {"Properties": {"
Date"
: {"]:" Dateoptionaltime "
},
" name ": {
" type ":" String "
},
" tweet ": {
" type ":" String "
},
' user_id ': {
' type ': ' Long '
}
}}}
}
Tip
Incorrect mappings, such as an age field mapped as type string instead of integer, can produce confusing results for your queries.

Instead of assuming that your mapping is correct, check it!

Customizing field mappings
The most important attribute of a field is type. For fields of a type other than string, you will seldom need to map anything other than type:
{
    "number_of_clicks": {
        "type": "integer"
    }
}
Fields of type string are, by default, considered to contain full text. Their values are passed through an analyzer before being indexed, and a full-text query on the field passes the query string through an analyzer before searching.
For string fields, the two most important mapping attributes are index and analyzer.

index
The index attribute controls how the string will be indexed. It can contain one of three values:

analyzed: First analyze the string, then index it. In other words, index this field as full text.
not_analyzed: Index this field, so it is searchable, but index the value exactly as specified. Don't analyze it.
no: Don't index this field at all. The field will not be searchable.
String fields default to analyzed. If we want to map the field as an exact value, we need to set it to not_analyzed:
{
"tag": {
"type": "string",
"index": "not_analyzed"
}
}
The other simple types (long, double, date, and so forth) also accept the index attribute, but the only relevant values are no and not_analyzed, as their values are never analyzed.
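For instance (the field name here is made up for illustration), a date field that should be stored but never searched could be mapped as follows:

{
    "last_crawled": {
        "type":  "date",
        "index": "no"
    }
}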
analyzer

For analyzed string fields, use the analyzer attribute to specify which analyzer to apply both at search time and at index time. By default, Elasticsearch uses the standard analyzer, but you can change this by specifying one of the built-in analyzers, such as whitespace, simple, or english:
{"
tweet": {
"type": "string",
"Analyzer": "中文版"
}
}
In the "Custom Analyzer" section we will show you how to define and use a custom parser. Update mappings
You can specify the mapping for a type when you first create an index. Alternatively, you can add a mapping for a new type (or update the mapping for an existing type) later.

Important
You can add to an existing mapping, but you can't modify existing field mappings. If a field already exists in the mapping, data from that field has probably already been indexed. If you were to change its mapping, the already-indexed data would be wrong and could not be searched correctly.
We can update a mapping to add a new field, but we can't change an existing field from analyzed to not_analyzed.
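For example (a sketch; the exact error returned differs between Elasticsearch versions), attempting to switch the existing tweet field to not_analyzed:

PUT /gb/_mapping/tweet
{
    "properties": {
        "tweet": {
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}

would be rejected, because the field is already indexed as analyzed full text.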
To demonstrate both ways of specifying mappings, let's first delete the index gb:
DELETE /gb
Then create a new index, specifying that the analyzer for the tweet field should be english:
PUT /gb <1>
{
    "mappings": {
        "tweet": {
            "properties": {
                "tweet": {
                    "type":     "string",
                    "analyzer": "english"
                },
                "date": {
                    "type": "date"
                },
                "name": {
                    "type": "string"
                },
                "user_id": {
                    "type": "long"
                }
            }
        }
    }
}
<1> This creates the index, with the mappings specified in the request body.
Later on, we decide to add a new not_analyzed text field called tag to the tweet mapping, using the _mapping endpoint:
PUT /gb/_mapping/tweet
{
    "properties": {
        "tag": {
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}
Note that we didn't need to list all of the existing fields again, as we can't change them anyway. Our new field has been merged into the existing mapping.
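You can confirm the merge by fetching the mapping again:

GET /gb/_mapping/tweet

The response should now include the tag field alongside the original date, name, tweet, and user_id fields.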
Testing the mapping

You can use the analyze API to test the mapping for string fields by field name. Compare the output of these two requests:
GET /gb/_analyze?field=tweet
black-cats <1>

GET /gb/_analyze?field=tag
black-cats <1>
<1> The text we want to analyze is passed in the request body.
The tweet field produces the two terms "black" and "cat", while the tag field produces the single term "black-cats". In other words, our mapping is working correctly.
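For reference, the second request should return a response along these lines (a sketch; the token type reported for a not_analyzed field may vary by version):

{
    "tokens": [
        {
            "token":        "black-cats",
            "start_offset": 0,
            "end_offset":   10,
            "type":         "word",
            "position":     1
        }
    ]
}

A single, unmodified token confirms that the tag field is not being analyzed.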