Mapping and analysis of Elasticsearch

Forwarded from: http://blog.csdn.net/hzrandd/article/details/47128895

Analysis and Analyzers

Analysis is the following process:

    • First, tokenize a block of text into the individual terms that are suitable for use in an inverted index
    • Then, normalize those terms into a standard form to improve their "searchability" or "recall"

This job is performed by an analyzer. An analyzer is really just a wrapper that combines three functions into a single package:

Character filters

First, the string is passed through any character filters, which tidy up the string before tokenization. A character filter can be used, for example, to strip out HTML tags, or to convert "&" characters to the word "and".

Tokenizer

Next, the string is split into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation such as commas. (Translator's note: this approach does not work for Chinese.)

Token filters

Finally, each term is passed through any token filters, which can change terms (for example, lowercasing "Quick"), remove terms (for example, stopwords such as "a", "and", and "the"), or add terms (for example, synonyms such as "jump" and "leap").

Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suited to different requirements, as the sketch below previews. We will discuss this in more detail in the "Custom Analyzers" section.
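As a quick preview, here is a minimal sketch of a custom analyzer that combines all three functions in the index settings. The index name my_index, the char filter name and_char_filter, and the analyzer name my_analyzer are purely illustrative; html_strip, the mapping char filter, the standard tokenizer, and the lowercase and stop token filters are built in:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "and_char_filter": {
                    "type":     "mapping",
                    "mappings": [ "&=> and " ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip", "and_char_filter" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "stop" ]
                }
            }
        }
    }
}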

Built-in analyzers

However, Elasticsearch also ships with a number of pre-packaged analyzers that you can use directly. We list the most important ones below and, to demonstrate the difference in their output, show how each one would analyze this string:

"Set the shape to semi-transparent by calling set_trans(5)"
Standard analyzer

The standard analyzer is the analyzer that Elasticsearch uses by default. It is the best general choice for analyzing text that may be in any language. (Translator's note: unless you have special requirements, this analyzer is good enough for any language.) It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It would produce:

set, the, shape, to, semi, transparent, by, calling, set_trans, 5
Simple analyzer

The simple analyzer splits the text on anything that isn't a letter, and lowercases each term. It would produce:

set, the, shape, to, semi, transparent, by, calling, set, trans
Whitespace analyzer

The whitespace analyzer splits the text on whitespace. It does not lowercase the terms. It would produce:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
Language analyzers

Language-specific analyzers are available for many languages. They are able to take the peculiarities of the specified language into account. For example, the english analyzer comes with a set of English stopwords, common words such as "and" or "the" that carry little meaning on their own; after these are removed, the sentence's main meaning can still be understood. The analyzer also stems English words. (Translator's note: stemming reduces a word to its root, rather like stripping a plant of its foliage while leaving the trunk; stopwords are like the foliage, and their presence or absence does not affect understanding of the sentence.)

The english analyzer would produce:

set, shape, semi, transpar, call, set_tran, 5
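You can verify these outputs yourself with the analyze API (described under "Testing analyzers" below) by substituting each analyzer's name, for example:

GET /_analyze?analyzer=english
Set the shape to semi-transparent by calling set_trans(5)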

Attention "transparent" , "calling" and "set_trans" how it turns into stemming.

When analyzers are used

When we index a document, its full-text fields are analyzed into separate terms in order to create the inverted index. However, when we search within a full-text field, we need the query string to go through the same analysis process, to ensure that we search for terms in the same form as those that exist in the index.

Full-text queries, which we discuss later, understand how each field is defined, and so they can do the right thing:

    • When you query a full-text field, the query applies the same analyzer to the query string to produce the correct list of terms to search for.
    • When you query an exact-value field, the query does not analyze the query string; it searches for the exact value that you have specified.

Now you can see why the queries at the beginning of "Mapping and Analysis" produce the results they do:

    • The date field contains an exact value: the single term "2014-09-15".
    • The _all field is a full-text field, so the analysis process converted the date into the three terms "2014", "09", and "15".

When we query the _all field for 2014, it matches all 12 tweets, because they all contain the term 2014:

GET /_search?q=2014              # 12 results

When we query the _all field for 2014-09-15, the query string is first analyzed, producing a query that matches any of the terms 2014, 09, or 15. It, too, matches all 12 tweets, because they all contain the term 2014:

GET /_search?q=2014-09-15        # 12 results !

When we query the date field for 2014-09-15, the query looks for that exact date, and finds one tweet only:

GET /_search?q=date:2014-09-15   # 1  result

When we query the date field for 2014, no documents are found, because none contains that exact date:

GET /_search?q=date:2014         # 0  results !
Testing analyzers

Especially when you are new to Elasticsearch, it can be difficult to understand how text is tokenized and stored in the index. To get a better idea of what is actually happening, you can use the analyze API to see how text is analyzed. Specify which analyzer to use in the query-string parameters, and pass the text to analyze in the request body:

GET /_analyze?analyzer=standard
Text to analyze

Each element in the result represents a single term:

{"Tokens": [{"Token":"Text","Start_offset":0,"End_offset":4,"Type": "position": 1} , { "token":  "to",  "Start_offset": Span class= "Hljs-number" >5,  "End_offset": 7, " type ": " <ALPHANUM> "" position ": 2}, { "token":  "analyze", " Start_offset ": 8, " End_offset ": 15, " type ": " <ALPHANUM> ", " position ": 3}]}         

token is the actual term that will be stored in the index. position indicates the order in which the term appeared in the original text. start_offset and end_offset mark the character positions that the term occupied in the original string.

The analyze API is a very useful tool for understanding what is happening inside Elasticsearch indices, and we will return to it as we progress.

Specifying analyzers

When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer.

That is not always what you want. Perhaps you want to apply a language-specific analyzer that is better suited to your data. Or perhaps you want to treat a string field as a plain exact-value field, with no analysis at all, such as a string user ID, an internal status field, or a tag.

To achieve this, we have to configure these fields manually by specifying their mapping.

Mapping

We know that each document in an index has a type, and each type has its own mapping, or schema definition. A mapping defines the fields within a type, the data type of each field, and how the field should be handled by Elasticsearch. A mapping is also used to configure metadata associated with the type.

Core simple field types

Elasticsearch supports the following simple field types:

Type               Field types
String             string
Whole number       byte, short, integer, long
Floating point     float, double
Boolean            boolean
Date               date

When you index a document that contains a new field, one that has not been seen before, Elasticsearch will use dynamic mapping to guess the field type from the basic JSON data types, using the following rules:

JSON type                           Field type
Boolean: true or false              "boolean"
Whole number: 123                   "long"
Floating point: 123.45              "double"
String, valid date: "2014-09-15"    "date"
String: "foo bar"                   "string"
Note

This means that if you index a number in quotes, such as "123", it will be mapped as type "string", not type "long". However, if the field is already mapped as type "long", Elasticsearch will try to convert the string to a long, and will throw an exception if the conversion fails.
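To see dynamic mapping in action, you could index a small document into a throwaway index and then inspect the mapping that Elasticsearch generates; the index name my_test and the field names below are purely illustrative:

PUT /my_test/doc/1
{ "count": "123", "created": "2014-09-15" }

GET /my_test/_mapping/doc

count should come back mapped as "string" (because of the quotes), while created should be detected as "date".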

View Mappings

We can retrieve the mapping from Elasticsearch using the _mapping endpoint. At the beginning of this chapter we already retrieved the mapping for type tweet in index gb:

GET /gb/_mapping/tweet

This shows us the mapping for the fields (called properties) that were generated dynamically by Elasticsearch when the index was created:

 { "GB": { "mappings": {  "tweet": { "Properties": {" date ": {" type ": " date ",  "format":  "Dateoptionaltime"}, " type ": " string "}, " tweet ": { "type":  "string"},  "user_id": { Span class= "hljs-string" > "type":  "Long"}}}}}      
Tip

An incorrect mapping, such as an age field mapped as type string instead of type integer, can produce confusing query results.

Check the mapping rather than assuming that it is correct!

Customizing field mappings

The most important attribute in a field mapping is type. For fields other than string fields, you will seldom need to map anything more than type:

{    "number_of_clicks": {        "type": "integer"    }}

Fields of type string are, by default, assumed to contain full text. Their values are passed through an analyzer before being indexed, and a full-text query on the field analyzes the query string before searching.

For string fields, the two most important mapping attributes are index and analyzer.

index

The index attribute controls how the string will be indexed. It can contain one of three values:

Value           Explanation
analyzed        First analyze the string, then index it. In other words, index this field as full text.
not_analyzed    Index this field so that it is searchable, but index the value exactly as specified. Do not analyze it.
no              Don't index this field at all. This field will not be searchable.

The default value of index for a string field is analyzed. If we want to map the field as an exact value, we need to set it to not_analyzed:

{    "tag": {        "type":     "string",        "index":    "not_analyzed" }}

The other simple types, such as long, double, and date, also accept the index attribute, but the only relevant values are no and not_analyzed, as their values are never analyzed.
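For example, a date field that should be stored in the document but never searched could be mapped like this (the field name last_login is illustrative):

{
    "last_login": {
        "type":  "date",
        "index": "no"
    }
}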

analyzer

For analyzed string fields, use the analyzer attribute to specify which analyzer to apply at both search time and index time. By default, Elasticsearch uses the standard analyzer, but you can change this by specifying one of the built-in analyzers, such as whitespace, simple, or english:

{    "tweet": {        "type":     "string",        "analyzer": "english" }}

In the "Custom Analyzer" section we will show you how to define and use a custom parser.

Update mappings

You can specify the mapping for a type when you first create an index. Alternatively, you can add a mapping for a new type (or update the mapping for an existing type) later.

Important

You can add a field to an existing mapping, but you can't modify it. If a field already exists in the mapping, the data from that field has probably already been indexed. If you were to change the field mapping, the already-indexed data would be wrong and could not be searched correctly.

We can update a mapping to add a new field, but we can't change an existing field from analyzed to not_analyzed.

To demonstrate both ways of specifying mappings, let's first delete the index gb:

DELETE /gb

Then we create a new index, specifying that the tweet field should use the english analyzer:

PUT /gb <1>
{
  "mappings": {
    "tweet" : {
      "properties" : {
        "tweet" : {
          "type" :    "string",
          "analyzer": "english"
        },
        "date" : {
          "type" : "date"
        },
        "name" : {
          "type" : "string"
        },
        "user_id" : {
          "type" : "long"
        }
      }
    }
  }
}

<1> This creates the index with the mappings specified in the request body.

Later on, we decide to add a new not_analyzed text field called tag to the tweet mapping, using the _mapping endpoint:

PUT /gb/_mapping/tweet
{
  "properties" : {
    "tag" : {
      "type" : "string",
      "index": "not_analyzed"
    }
  }
}

Note that we didn't need to list all of the existing fields again, as we can't change them anyway. Our new field has been merged into the existing mapping.

Testing the mapping

You can use the analyze API to test the mapping for string fields by field name. Compare the output of these two requests:

GET /gb/_analyze?field=tweet
Black-cats <1>

GET /gb/_analyze?field=tag
Black-cats <1>

<1> The text we want to analyze is passed in the request body.

The tweet field produces the two terms "black" and "cat", while the tag field produces the single term "Black-cats". In other words, our mapping is working correctly.
