Elasticsearch 2.x Doc Values


Reference documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/docvalues-intro.html

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/docvalues.html

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_deep_dive_on_doc_values.html#_deep_dive_on_doc_values

Introduction to doc_values

Doc values are an important topic that we come back to again and again. Have you noticed the pattern? When searching, we need a mapping from a "term" to a list of "documents". When sorting, we need a mapping from a "document" to a list of "terms". In other words, on top of the inverted index we need a "forward index". In other systems, such as relational databases, this forward structure is often called a "column store". Essentially, it stores all the values of one field together in a single column of data, which makes operations such as sorting very efficient.

In Elasticsearch this column store is known as "doc values", and it is enabled by default. Doc values are created at index time: when a field is indexed, Elasticsearch adds its terms to the inverted index and also adds them to doc values, the column store, which is stored on disk.

Doc values are typically used for aggregations on a field, for certain filters on a field (for example, geolocation filters), and for scripts that refer to one or more fields. Because doc values are serialized to disk at index time, we can rely on the operating system to access them quickly. How doc values are managed on disk is described later.

Most fields are indexed by default, which makes them searchable. The inverted index lets a query look up a term in a sorted list of terms and immediately get the list of documents that contain it. Sorting, aggregating, and accessing field values in a script require a different access pattern: instead of looking up a term and finding documents, we need to look up a document and find the terms it has in a field. The inverted index does not support this, so we need a structure that holds the document-to-term mapping. Doc values are that structure: an on-disk data structure built at index time.

Doc values are supported on most field types, with the exception of analyzed string fields. Every field that supports doc values has them enabled by default. If you are sure you will never need to sort or aggregate on a field, or access its value from a script, you can disable doc values for it:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": {
          "type": "string",
          "index": "not_analyzed"
        },
        "session_id": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": false
        }
      }
    }
  }
}

The status_code field has doc_values enabled by default. The session_id field has doc_values disabled, but it can still be queried.

Tip: the doc_values setting may differ between fields of the same name in the same index. It can also be disabled (set to false) on an existing field with the PUT mapping API.
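For a mapping with many fields, the request body can be generated rather than written by hand. The following Python sketch only builds the JSON body for the PUT request shown above and does not talk to a cluster. The helper name build_mapping and its parameters are made up for this illustration; they are not part of any Elasticsearch client.

```python
import json


def build_mapping(type_name, fields, doc_values_disabled=()):
    """Build an Elasticsearch 2.x mapping body for not_analyzed string fields.

    Fields listed in doc_values_disabled get "doc_values": false.
    All other fields keep the default, which is doc values enabled.
    (Hypothetical helper, for illustration only.)
    """
    properties = {}
    for name in fields:
        field = {"type": "string", "index": "not_analyzed"}
        if name in doc_values_disabled:
            field["doc_values"] = False
        properties[name] = field
    return {"mappings": {type_name: {"properties": properties}}}


body = build_mapping(
    "my_type",
    ["status_code", "session_id"],
    doc_values_disabled={"session_id"},
)
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the body of the PUT my_index request: status_code carries no doc_values key, and session_id carries "doc_values": false.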
Consider the following inverted index:

Term      doc_1   doc_2   doc_3
------------------------------------
brown   |   X   |   X   |
dog     |   X   |       |   X
dogs    |       |   X   |   X
fox     |   X   |       |   X
foxes   |       |   X   |
in      |       |   X   |
jumped  |   X   |       |   X
lazy    |   X   |   X   |
leap    |       |   X   |
over    |   X   |   X   |   X
quick   |   X   |   X   |   X
summer  |       |   X   |
the     |   X   |       |   X
------------------------------------

If we want to compile a complete list of the terms in every document that contains "brown", we might use the following query:

GET /my_index/_search
{
  "query": {
    "match": {
      "body": "brown"
    }
  },
  "aggs": {
    "popular_terms": {
      "terms": {
        "field": "body"
      }
    }
  }
}

Look at the query part first. The inverted index is sorted by term, so we first find "brown" in the term list and then scan across the columns to find every document that contains it. Here, those are doc_1 and doc_2.

For the aggregation part, we then need every term in doc_1 and doc_2. Doing this with the inverted index is very expensive: we would have to iterate over every term in the index and check whether doc_1 or doc_2 appears in its document list. This is slow and scales poorly: as the number of terms and documents grows, so does the execution time of the aggregation. Now look at the following structure:

Doc      Terms
-----------------------------------------------------------------
doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
doc_3 | dog, dogs, fox, jumped, over, quick, the
-----------------------------------------------------------------

With this structure it is easy to get the terms contained in doc_1 and doc_2: we go to the rows for those two documents and take the union of the two sets of terms.
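The difference between the two access paths can be sketched in a few lines of Python. This is a toy model built from the two tables above, not how Lucene stores either structure; the function names are invented for the illustration.

```python
from collections import Counter

# Forward index (document -> terms), copied from the table above.
forward_index = {
    "doc_1": {"brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"},
    "doc_2": {"brown", "dogs", "foxes", "in", "lazy", "leap", "over", "quick", "summer"},
    "doc_3": {"dog", "dogs", "fox", "jumped", "over", "quick", "the"},
}

# Inverted index (term -> documents), derived by inverting the forward index.
inverted_index = {}
for doc_id, terms in forward_index.items():
    for term in terms:
        inverted_index.setdefault(term, set()).add(doc_id)


def search(term):
    """Query phase: a single lookup in the inverted index."""
    return inverted_index.get(term, set())


def aggregate_with_inverted_index(doc_ids):
    """Aggregation without a forward index: every term in the index is checked."""
    counts = Counter()
    for term, docs in inverted_index.items():
        hits = len(docs & doc_ids)
        if hits:
            counts[term] = hits
    return counts


def aggregate_with_forward_index(doc_ids):
    """Aggregation with a forward index: only the matching documents are visited."""
    counts = Counter()
    for doc_id in doc_ids:
        counts.update(forward_index[doc_id])
    return counts


matching = search("brown")
slow = aggregate_with_inverted_index(matching)
fast = aggregate_with_forward_index(matching)
print(sorted(matching))
print(sorted(fast.items()))
```

Both aggregation functions return the same counts. The first one does work proportional to the number of terms in the whole index, while the second does work proportional to the number of matching documents and the terms they contain.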
Search and aggregation are therefore closely intertwined: search uses the inverted index to find documents, and aggregation uses the forward index (doc_values) to collect the terms in those documents.

Note: doc values are used not only for aggregations but also for sorting, scripting, and parent-child document relationships (not described here).

Deep dive on doc values

The description above leaves a few impressions of doc values: fast to access, efficient, and disk based. Now let's look at how doc values actually work.

Doc values are generated at index time, alongside the inverted index. That means doc values are generated per index segment and are immutable, just like the inverted index. Also like the inverted index, doc values are serialized to disk, which is important for performance and scalability.

By serializing a data structure to disk, we can rely on the operating system's file system cache instead of the JVM heap. When the "working set" is smaller than the memory available to the OS, the operating system naturally keeps the doc values resident in memory, and performance is the same as if they were held on the JVM heap.

When the working set is larger than the available memory, the operating system pages doc values in and out on demand. That is noticeably slower than having everything resident in memory, but it lets us work with far more data than the server's memory could hold. If doc values had to be loaded entirely onto the heap instead, Elasticsearch would fail with an OutOfMemoryError.

Note: because doc values are not managed by the JVM heap, Elasticsearch can be configured with a smaller heap, leaving more memory to the operating system for caching doc values. A smaller heap also lets the JVM garbage collector run faster and more efficiently.
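The idea of serializing a column to disk and letting the operating system cache it can be illustrated with a memory-mapped file. The Python sketch below is only an analogy: the fixed-width file layout and the file name status_code.col are assumptions made for this example, and Lucene's real doc values format is different.

```python
import mmap
import os
import struct
import tempfile

VALUE_FORMAT = "<q"  # one little-endian 64-bit integer per document
VALUE_SIZE = struct.calcsize(VALUE_FORMAT)


def write_column(path, values):
    """Serialize one field's values to disk, in document order."""
    with open(path, "wb") as f:
        for value in values:
            f.write(struct.pack(VALUE_FORMAT, value))


def read_value(column, doc_id):
    """Read the value for doc_id directly from the memory-mapped file."""
    return struct.unpack_from(VALUE_FORMAT, column, doc_id * VALUE_SIZE)[0]


# Hypothetical column file holding a status code for four documents.
path = os.path.join(tempfile.mkdtemp(), "status_code.col")
write_column(path, [200, 404, 200, 500])

with open(path, "rb") as f:
    # The mapped pages live in the OS page cache, not on the process heap.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as column:
        values = [read_value(column, doc_id) for doc_id in range(4)]

print(values)
```

Because every value has the same width, the value for a document is found by multiplying its position by the value size. The operating system decides which pages of the file stay in memory, which is the behavior the text describes for doc values.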
As a general rule, the JVM heap is given about half (50%) of the machine's memory. With doc values in the picture, a smaller heap is worth considering: on a 64 GB server, for example, a heap of 4 to 16 GB can be more efficient than a 32 GB heap.

Column-store compression

In essence, doc values are a serialized column store. We have already discussed how column stores excel at certain query operations. They are also better at compressing data, especially numbers, which matters both for saving disk space and for fast access. To see how the compression works, look at the following simple doc values structure:

Doc      Terms
-----------------------------------------------------------------
doc_1 | 100
doc_2 | 1000
doc_3 | 1500
doc_4 | 1200
doc_5 | 300
doc_6 | 1900
doc_7 | 4200
-----------------------------------------------------------------

With one value per row like this, we get a contiguous block of numbers: [100,1000,1500,1200,300,1900,4200]. Because we know they are all numbers, they can be packed tightly together at uniform offsets.

Going further, several compression tricks can be applied to these numbers. You may have noticed that every number above is a multiple of 100. If all the numbers in an index segment share a greatest common divisor, doc values can use that divisor to compress the data. Dividing the numbers above by 100 gives [1,10,15,12,3,19,42]. The numbers are now smaller and need fewer bits to store.

Doc values use several schemes to compress numbers, checked in this order:

1. If all values are identical (or missing), set a flag and record the value.
2. If there are fewer than 256 values, use a simple table encoding.
3. If there are more than 256 values, check whether they share a greatest common divisor and, if so, compress with it.
4. If there is no common divisor, encode every value as an offset from the smallest value.

You might be thinking: "That works well for numeric fields, but what about strings?" Strings are compressed in the same way, with the help of an ordinal table. The strings are de-duplicated, sorted into a table, and each assigned an ID. These IDs are numbers, so the schemes above apply to them. The ordinal table itself is also stored in compressed form.
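The greatest-common-divisor, offset, and ordinal ideas can be tried out in a few lines of Python. This is a simplified sketch of the arithmetic only, assuming non-negative integers; it is not Lucene's encoder, and the function names and the sample strings are invented for the illustration.

```python
from functools import reduce
from math import gcd


def bits_needed(values):
    """Bits per value needed to store the largest value in the list."""
    return max(1, max(values).bit_length())


def gcd_encode(values):
    """Divide every value by the greatest common divisor they all share."""
    divisor = reduce(gcd, values) or 1  # fall back to 1 if every value is 0
    return divisor, [v // divisor for v in values]


def offset_encode(values):
    """Store each value as an offset from the smallest value."""
    base = min(values)
    return base, [v - base for v in values]


def ordinal_encode(strings):
    """De-duplicate and sort the strings, then replace each by its position."""
    table = sorted(set(strings))
    ordinal_of = {s: i for i, s in enumerate(table)}
    return table, [ordinal_of[s] for s in strings]


column = [100, 1000, 1500, 1200, 300, 1900, 4200]
divisor, packed = gcd_encode(column)
print(divisor, packed)
print(bits_needed(column), "->", bits_needed(packed))

base, offsets = offset_encode([1003, 1001, 1007])
print(base, offsets)

table, ordinals = ordinal_encode(["GET", "POST", "GET", "PUT", "GET"])
print(table, ordinals)
```

For the column from the table above, the shared divisor is 100, and the largest value drops from 4200 (13 bits) to 42 (6 bits). The ordinal example turns five strings into a three-entry table plus a list of small integers, which can then be compressed with the numeric schemes.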

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
