Elasticsearch's Doc Values and Fielddata

What Are Doc Values?

The simple explanation is that doc values are a data storage structure that maps each document ID to the field's term values (docid → term value).
The strength of an inverted index is finding the documents that contain a given term, that is, looking up the matching doc IDs by term.

The inverted index for the field term:

Term      doc_1   doc_2   doc_3
Brown       X       X
Dog                           X

The inverted index for the field term2:

Term2     doc_1   doc_2   doc_3
Brown2      X       X
Dog2                          X

So we can quickly find that the documents containing Brown are doc_1 and doc_2.

However, the reverse direction is not efficient: given a doc ID, what is the value of a particular field (term2) in that document? Yet this is exactly the access pattern required by aggregations, sorting, and retrieving field values.
Walking the inverted index to answer it is slow and scales badly: as the number of terms and documents grows, so does the execution time. Doc values solve this problem by transposing the relationship between terms and documents.
Doc values for the field term:

Doc      Term
doc_1    Brown
doc_2    Brown
doc_3    Dog

Doc values for the field term2:

Doc      Term2
doc_1    Brown2
doc_2    Brown2
doc_3    Dog2

To illustrate, consider an aggregate query similar to this SQL:

SELECT term2, COUNT(1) FROM table WHERE term = 'Brown' GROUP BY term2

To execute it:
1. Locate the matching documents. Using the inverted index, look up term = 'Brown' and retrieve the doc IDs doc_1 and doc_2.
2. Perform the aggregation. Using doc values, look up term2 for doc_1 and doc_2; both are Brown2, so COUNT(1) for Brown2 = 2.
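A rough Elasticsearch equivalent of that SQL (the index and field names simply follow the illustration above and are not real): a term query narrows the documents via the inverted index, and a terms aggregation on term2 reads doc values:

GET /table/_search
{
  "size": 0,
  "query": {
    "term": { "term": "Brown" }
  },
  "aggs": {
    "by_term2": {
      "terms": { "field": "term2" }
    }
  }
}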

Search and aggregation complement each other: search uses the inverted index to find matching documents, and aggregation collects and summarizes the data held in doc values.

How Doc Values Work

The official documentation describes doc values as fast, efficient, and memory-friendly.
1. Doc values are generated at index time, at the same time as the inverted index, and are immutable. Like the inverted index, they are serialized to disk in Lucene files.
Access to Lucene files relies on the operating system's file-system cache rather than holding the data on the JVM heap. This is why enough memory must be left to the OS when running ES, so that file access stays fast; for detailed configuration recommendations, refer to the Elasticsearch production deployment documentation.

2. Doc values use columnar compression.
Modern CPUs process data far faster than disks can deliver it, so reducing the amount of data that must be read from disk is almost always a win, even though extra CPU work is needed to decompress it.

Doc values apply a number of compression techniques. For numeric values they check, in order, for the following patterns:
- If all values are identical (or missing), set a flag and record the value.
- If there are fewer than 256 distinct values, a simple table encoding is used.
- If there are more than 256 distinct values, check whether they share a greatest common divisor. For example, the values 100, 200, and 300 share a divisor of 100 and can be stored as 1, 2, 3 plus that divisor.
- If there is no common divisor, encode every value as an offset from the smallest value.
These compression modes are not traditional general-purpose algorithms such as DEFLATE or LZ4. Because the column-stride structure is strict and well-defined, these specialized encodings achieve better compression than a general-purpose algorithm like LZ4 would.

String fields are encoded similarly, using an ordinal table: the distinct string values are de-duplicated and stored in the ordinal table, each assigned an ID, and the per-document doc values then store those numeric IDs. In other words, string fields end up with the same compression characteristics as numeric fields.
The ordinal table itself also benefits from several compression techniques, such as fixed-length, variable-length, or prefix encoding of the strings.

3. Doc values can be disabled.
They are enabled by default; if they are not needed for a field, they can be turned off by setting doc_values: false in the mapping, as sketched below.
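A minimal sketch of disabling doc values in a mapping (the index and field names are made up; in recent versions the field type is keyword, while older versions would use a not_analyzed string field):

PUT /my_index
{
  "mappings": {
    "properties": {
      "session_id": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}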

What Doc Values Do Not Support

As noted above, doc values do not support analyzed string fields. Imagine an analyzed field whose value is "the first": during analysis it is split into two terms, the and first (the analyze API sketch below shows this split), and each term gets its own doc value entry. An aggregation on that field then produces

Term     Count
the        1
first      1

instead of

Term          Count
the first       1
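The token split can be checked with the analyze API; a quick sketch (the choice of the standard analyzer here is just an assumption):

POST /_analyze
{
  "analyzer": "standard",
  "text": "the first"
}

The response lists two separate tokens, the and first.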

So how do we get the result we actually want? That is where fielddata comes in.

Fielddata

Doc values are not generated for analyzed strings, yet these fields can still be aggregated, because the fielddata data structure is used instead. Unlike doc values, fielddata is built and held entirely in memory, on the JVM heap. Fielddata is the default setting for these fields, so pay close attention to memory usage.

Some characteristics:
1. Fielddata is loaded lazily. If you never aggregate on an analyzed string, its fielddata is never loaded into memory; it is built at query time.
2. Fielddata is loaded per field; only the fields you actively use add to the fielddata footprint.
3. Fielddata loads the values of all documents in the index (for that field), regardless of whether they match the query. The logic is: if the query touches documents X, Y, and Z now, other documents are likely to be needed by the next query.
4. If there is not enough space, the least recently used (LRU) entries are evicted from fielddata.
Fielddata must therefore be used sensibly within the JVM heap, or it will hurt ES performance.
We can limit fielddata memory usage with indices.fielddata.cache.size, either as an absolute size (such as 2gb) or as a percentage of the heap (such as 20%), as in the sketch below.
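A minimal sketch of the cache limit, set in elasticsearch.yml (the 20% value is only an example):

indices.fielddata.cache.size: 20%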
You can also monitor fielddata usage with the following command.

GET /_stats/fielddata
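The stats APIs also accept a fields parameter to break usage down per field or per node (the wildcard here is just an example):

GET /_stats/fielddata?fields=*
GET /_nodes/stats/indices/fielddata?fields=*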

Finally, what happens if loading a single field in one go exceeds the available memory? The node falls over. To prevent this, ES uses a circuit breaker (fusing mechanism).
The breaker estimates the memory a query will need from internal checks (field type, cardinality, size, and so on), then checks whether loading the required fielddata would push the total fielddata beyond the configured fraction of the heap. If the estimated size exceeds the limit, the breaker trips, the query is aborted, and an exception is returned.

indices.breaker.fielddata.limit: the fielddata breaker limit, defaults to 60% of the heap.
indices.breaker.request.limit: the request-level breaker limit, defaults to 40% of the heap.
indices.breaker.total.limit: caps the combination of the two breakers above, defaults to 70% of the heap.
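These limits can be changed dynamically through the cluster settings API; a small sketch (the 40% value is only an example):

PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.fielddata.limit": "40%"
  }
}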

Fielddata Filtering

You can save memory by loading only part of a field's fielddata, for example with a frequency filter:

"frequency": {
  "min": 0.01,
  "min_segment_size": 500
}

Load only terms that appear in at least 1% of the documents in a segment.
Ignore any segment that holds fewer than 500 documents.
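For context, this frequency block sits under the field's fielddata filter in the mapping; a minimal sketch, assuming a pre-5.x analyzed string field and made-up index, type, and field names:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "tag": {
      "type": "string",
      "fielddata": {
        "filter": {
          "frequency": {
            "min": 0.01,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}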
See the official documentation for more detail.

Fielddata Pre-loading

By default, fielddata is loaded lazily: the first time Elasticsearch queries a field, it loads the inverted index for that field from every segment into memory, so that later queries perform better.
For small segments this takes a negligible amount of time, but if the index runs to many gigabytes the process can take several seconds. Users accustomed to sub-second responses will find a pause of several seconds hard to accept.
There are three ways to address this latency spike:
- Eagerly load fielddata, so it is prepared ahead of time rather than at query time.
- Pre-load global ordinals. This is a load optimization that reduces memory consumption, similar to a global dictionary: each string value is assigned a globally unique integer, so only the integers need to be loaded, and the corresponding string is looked up in the dictionary when needed (see the sketch below).
- Cache warming with index warmers, which has been deprecated.
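As a sketch of that pre-loading option, in recent versions global ordinals can be built eagerly per field in the mapping (the index and field names are made up; older versions expressed this as "fielddata": { "loading": "eager_global_ordinals" } instead):

PUT /my_index
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "keyword",
        "eager_global_ordinals": true
      }
    }
  }
}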
