Elasticsearch's Doc Values and Fielddata

What Are Doc Values?

The simple explanation is that doc values are a data storage structure that maps each document ID to the field's term values (docid → term value).
The strength of an inverted index is finding the documents that contain a given term, that is, looking up the matching doc IDs by term.

The inverted index for the field term:

Term      doc_1   doc_2   doc_3
Brown       X       X
Dog                           X

The inverted index for the field term2:

Term2     doc_1   doc_2   doc_3
Brown2      X       X
Dog2                          X

So we can quickly find that the documents containing Brown are doc_1 and doc_2.

However, the reverse direction is not efficient: given a doc ID, what is the value of a particular field (term2) in that document? Yet this is exactly the access pattern required by aggregations, sorting, and retrieving field values.
Walking the inverted index to answer it is slow and scales badly: as the number of terms and documents grows, so does the execution time. Doc values solve this problem by transposing the relationship between terms and documents.
Doc values for the field term:

Doc      Term
doc_1    Brown
doc_2    Brown
doc_3    Dog

Doc values for the field term2:

Doc      Term2
doc_1    Brown2
doc_2    Brown2
doc_3    Dog2

To illustrate, consider an aggregate query similar to this SQL:

SELECT term2, COUNT(1) FROM table WHERE term = 'Brown' GROUP BY term2

To execute it:
1. Locate the matching documents. Using the inverted index, look up term = 'Brown' and retrieve the doc IDs doc_1 and doc_2.
2. Perform the aggregation. Using doc values, look up term2 for doc_1 and doc_2; both are Brown2, so COUNT(1) for Brown2 = 2.
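A rough Elasticsearch equivalent of that SQL (the index and field names simply follow the illustration above and are not real): a term query narrows the documents via the inverted index, and a terms aggregation on term2 reads doc values:

GET /table/_search
{
  "size": 0,
  "query": {
    "term": { "term": "Brown" }
  },
  "aggs": {
    "by_term2": {
      "terms": { "field": "term2" }
    }
  }
}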

Search and aggregation complement each other: search uses the inverted index to find matching documents, and aggregation collects and summarizes the data held in doc values.

How Doc Values Work

The official documentation describes doc values as fast, efficient, and memory-friendly.
1. Doc values are generated at index time, at the same time as the inverted index, and are immutable. Like the inverted index, they are serialized to disk in Lucene files.
Access to Lucene files relies on the operating system's file-system cache rather than holding the data on the JVM heap. This is why enough memory must be left to the OS when running ES, so that file access stays fast; for detailed configuration recommendations, refer to the Elasticsearch production deployment documentation.

2. Doc values use columnar compression.
Modern CPUs process data far faster than disks can deliver it, so reducing the amount of data that must be read from disk is almost always a win, even though extra CPU work is needed to decompress it.

Doc values apply a number of compression techniques. For numeric values they check, in order, for the following patterns:
- If all values are identical (or missing), set a flag and record the value.
- If there are fewer than 256 distinct values, a simple table encoding is used.
- If there are more than 256 distinct values, check whether they share a greatest common divisor. For example, the values 100, 200, and 300 share a divisor of 100 and can be stored as 1, 2, 3 plus that divisor.
- If there is no common divisor, encode every value as an offset from the smallest value.
These compression modes are not traditional general-purpose algorithms such as DEFLATE or LZ4. Because the column-stride structure is strict and well-defined, these specialized encodings achieve better compression than a general-purpose algorithm like LZ4 would.

String fields are encoded similarly, using an ordinal table: the distinct string values are de-duplicated and stored in the ordinal table, each assigned an ID, and the per-document doc values then store those numeric IDs. In other words, string fields end up with the same compression characteristics as numeric fields.
The ordinal table itself also benefits from several compression techniques, such as fixed-length, variable-length, or prefix encoding of the strings.

3. Doc values can be disabled.
They are enabled by default; if they are not needed for a field, they can be turned off by setting doc_values: false in the mapping, as sketched below.
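A minimal sketch of disabling doc values in a mapping (the index and field names are made up; in recent versions the field type is keyword, while older versions would use a not_analyzed string field):

PUT /my_index
{
  "mappings": {
    "properties": {
      "session_id": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}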

What Doc Values Do Not Support

As noted above, doc values do not support analyzed string fields. Imagine an analyzed field whose value is "the first": during analysis it is split into two terms, the and first (the analyze API sketch below shows this split), and each term gets its own doc value entry. An aggregation on that field then produces

Term     Count
the        1
first      1

instead of

Term          Count
the first       1
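The token split can be checked with the analyze API; a quick sketch (the choice of the standard analyzer here is just an assumption):

POST /_analyze
{
  "analyzer": "standard",
  "text": "the first"
}

The response lists two separate tokens, the and first.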

So how do we get the result we actually want? That is where fielddata comes in.

Fielddata

Doc values are not generated for analyzed strings, yet these fields can still be aggregated, because the fielddata data structure is used instead. Unlike doc values, fielddata is built and held entirely in memory, on the JVM heap. Fielddata is the default setting for these fields, so pay close attention to memory usage.

Some characteristics:
1. Fielddata is loaded lazily. If you never aggregate on an analyzed string, its fielddata is never loaded into memory; it is built at query time.
2. Fielddata is loaded per field; only the fields you actively use add to the fielddata footprint.
3. Fielddata loads the values of all documents in the index (for that field), regardless of whether they match the query. The logic is: if the query touches documents X, Y, and Z now, other documents are likely to be needed by the next query.
4. If there is not enough space, the least recently used (LRU) entries are evicted from fielddata.
Fielddata must therefore be used sensibly within the JVM heap, or it will hurt ES performance.
We can limit fielddata memory usage with indices.fielddata.cache.size, either as an absolute size (such as 2gb) or as a percentage of the heap (such as 20%), as in the sketch below.
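A minimal sketch of the cache limit, set in elasticsearch.yml (the 20% value is only an example):

indices.fielddata.cache.size: 20%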
You can also monitor fielddata usage with the following command.

GET /_stats/fielddata
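The stats APIs also accept a fields parameter to break usage down per field or per node (the wildcard here is just an example):

GET /_stats/fielddata?fields=*
GET /_nodes/stats/indices/fielddata?fields=*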

Finally, what happens if loading a single field in one go exceeds the available memory? The node falls over. To prevent this, ES uses a circuit breaker (fusing mechanism).
The breaker estimates the memory a query will need from internal checks (field type, cardinality, size, and so on), then checks whether loading the required fielddata would push the total fielddata beyond the configured fraction of the heap. If the estimated size exceeds the limit, the breaker trips, the query is aborted, and an exception is returned.

indices.breaker.fielddata.limit: the fielddata breaker limit, defaults to 60% of the heap.
indices.breaker.request.limit: the request-level breaker limit, defaults to 40% of the heap.
indices.breaker.total.limit: caps the combination of the two breakers above, defaults to 70% of the heap.
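These limits can be changed dynamically through the cluster settings API; a small sketch (the 40% value is only an example):

PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.fielddata.limit": "40%"
  }
}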

Fielddata Filtering

You can save memory by loading only part of a field's fielddata, for example with a frequency filter:

"frequency": {
  "min": 0.01,
  "min_segment_size": 500
}

Load only terms that appear in at least 1% of the documents in a segment.
Ignore any segment that holds fewer than 500 documents.
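For context, this frequency block sits under the field's fielddata filter in the mapping; a minimal sketch, assuming a pre-5.x analyzed string field and made-up index, type, and field names:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "tag": {
      "type": "string",
      "fielddata": {
        "filter": {
          "frequency": {
            "min": 0.01,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}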
See the official documentation for more detail.

Fielddata Pre-loading

By default, fielddata is loaded lazily: the first time Elasticsearch queries a field, it loads the inverted index for that field from every segment into memory, so that later queries perform better.
For small segments this takes a negligible amount of time, but if the index runs to many gigabytes the process can take several seconds. Users accustomed to sub-second responses will find a pause of several seconds hard to accept.
There are three ways to address this latency spike:
- Eagerly load fielddata, so it is prepared ahead of time rather than at query time.
- Pre-load global ordinals. This is a load optimization that reduces memory consumption, similar to a global dictionary: each string value is assigned a globally unique integer, so only the integers need to be loaded, and the corresponding string is looked up in the dictionary when needed (see the sketch below).
- Cache warming with index warmers, which has been deprecated.
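As a sketch of that pre-loading option, in recent versions global ordinals can be built eagerly per field in the mapping (the index and field names are made up; older versions expressed this as "fielddata": { "loading": "eager_global_ordinals" } instead):

PUT /my_index
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "keyword",
        "eager_global_ordinals": true
      }
    }
  }
}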
