Understanding Elasticsearch 5.0 disk space saving strategies

Source: Internet
Author: User

This article grew out of a discussion among QQ group members about how to optimize disk space. I searched for similar articles and combined them with the official documentation to write this summary.

Reference article 1

Reference Article 2

If you have questions, you can contact me to join the discussion, or view the original post.

Note: Saving disk space is a trade-off. Each saving affects certain features; if you do not need the affected features, you can adopt the corresponding disk-saving strategy.

TIP: Be cautious when saving disk space, and be sure to check the impact of each strategy.

I. Factors of influence

Replication

Replicas exist for high availability. They act as a backup of the data, so that no data is lost when some nodes go down. In principle, a replica is no different from the primary data itself, so the total index size is multiplied by the number of copies: with one replica it doubles, with two replicas it triples.

Command to change the number of replicas:

curl -XPUT 'localhost:9200/my_index/_settings' -d '
{
    "index": {
        "number_of_replicas": 0
    }
}'

The default number of replicas is 1. Versions up to 2.3 supported changing it by adding index.number_of_replicas: 0 to elasticsearch.yml. Version 5.0 no longer supports configuring index-level settings in elasticsearch.yml.

Impact: Setting the number of replicas to 0 saves half of the disk space compared with the default of one replica, but the ES cluster is no longer highly available: if a node goes down, the data on it becomes unavailable and may be lost.
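The relationship between replica count and disk usage can be sketched as simple arithmetic. This is only an illustration: the function name and the 100 GiB primary size are made up for the example, and real usage also depends on segment merging and other overhead that the sketch ignores.

```python
def total_index_size(primary_size_bytes: int, number_of_replicas: int) -> int:
    """Approximate disk used cluster-wide: one primary copy plus each replica copy."""
    if number_of_replicas < 0:
        raise ValueError("number_of_replicas must be >= 0")
    return primary_size_bytes * (1 + number_of_replicas)


GIB = 1024 ** 3
primary = 100 * GIB  # hypothetical size of the primary shards

for replicas in (0, 1, 2):
    total = total_index_size(primary, replicas)
    print(f"number_of_replicas={replicas}: about {total // GIB} GiB on disk")
```

Going from one replica to zero halves the total, which is where the "half of the disk space" figure comes from.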

_source

Elasticsearch keeps a copy of the original JSON for each incoming document. This _source field is useful for reconstructing the original data and for highlighting search results, but it also takes up disk space. It can be disabled to save disk space.

Note that this copy of the original JSON is not the same thing as the replicas above; it is the content inside the {} below.

PUT my_index/user/1
{
  "first_name":    "John",
  "last_name":     "Smith",
  "date_of_birth": "1970-10-24"
}

Impact: See my separate article on understanding the _source field and its impact.
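As a rough sketch of what disabling _source looks like, the following builds the body of an index-creation request for Elasticsearch 5.x. The helper name is hypothetical, the index and type names are taken from the example above, and the code only prints the JSON rather than sending it. To my knowledge this setting is meant to be chosen when the index is created.

```python
import json


def source_disabled_mapping(type_name: str) -> dict:
    """Build an index-creation body whose mapping turns off _source for one type."""
    return {"mappings": {type_name: {"_source": {"enabled": False}}}}


body = source_disabled_mapping("user")

# The printed JSON would be the body of a request such as: PUT my_index
print(json.dumps(body, indent=2))
```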

Whether a single field is stored can also be chosen, and this also affects disk space. I think the relationship between single-field store and _source is that of the individual to the whole, so I cover them in the same section. I am not fully sure about this point.

Single Field Store
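To make the individual-versus-whole relationship concrete, here is a minimal sketch of a mapping that drops _source as a whole but stores selected fields individually through the "store" parameter. The helper name is hypothetical, and mapping every field as keyword is an assumption made only to keep the example short.

```python
import json


def stored_fields_mapping(type_name: str, stored: list, unstored: list) -> dict:
    """Disable _source for the type and mark only the chosen fields as stored."""
    properties = {}
    for name in stored:
        properties[name] = {"type": "keyword", "store": True}
    for name in unstored:
        properties[name] = {"type": "keyword"}  # "store" defaults to false
    return {
        "mappings": {
            type_name: {
                "_source": {"enabled": False},
                "properties": properties,
            }
        }
    }


body = stored_fields_mapping("user", stored=["last_name"], unstored=["first_name"])
print(json.dumps(body, indent=2))
```

With such a mapping only last_name could be returned from a search as a stored field; first_name would remain searchable but not retrievable.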

_all

The _all field maintains one large string built from the values of all other fields. It lets us search for a term without knowing which field contains it. This large string also consumes disk space, and the field can be disabled.

Impact: See my separate article on understanding the _all field and its impact.
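A minimal sketch of disabling _all in an Elasticsearch 5.x mapping follows. The helper name and the field types are assumptions for illustration; the code only prints the request body.

```python
import json


def all_disabled_mapping(type_name: str, properties: dict) -> dict:
    """Build an index-creation body that turns off the _all field for one type."""
    return {
        "mappings": {
            type_name: {
                "_all": {"enabled": False},
                "properties": properties,
            }
        }
    }


body = all_disabled_mapping(
    "user",
    {
        "first_name": {"type": "text"},
        "last_name": {"type": "text"},
    },
)
print(json.dumps(body, indent=2))
```

Once _all is disabled, searches have to name the field they target, for example first_name, instead of relying on the catch-all field.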

doc_values

Doc values are a mechanism Elasticsearch uses to reduce heap memory usage. They save heap when sorting and aggregating, but they consume disk space themselves.

Impact: See my separate article on understanding doc_values and their impact.
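Doc values are switched off per field with "doc_values": false. The sketch below assumes two hypothetical keyword fields from a web log, clientip and verb, and a hypothetical type name; a field without doc values can still be searched but should not be used for sorting or aggregations.

```python
import json


def keyword_field(doc_values: bool = True) -> dict:
    """Return a keyword field definition, disabling doc values only when asked."""
    field = {"type": "keyword"}
    if not doc_values:
        field["doc_values"] = False
    return field


body = {
    "mappings": {
        "logs": {  # hypothetical type name
            "properties": {
                "clientip": keyword_field(doc_values=False),  # never sorted or aggregated
                "verb": keyword_field(),  # keeps doc values for aggregations
            }
        }
    }
}
print(json.dumps(body, indent=2))
```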

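Analysis (tokenization)

Impact: Whether a string field is analyzed (tokenized) also affects disk space. In the test data below, indexing strings as not_analyzed only takes less space than indexing them as both analyzed and not_analyzed.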
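In Elasticsearch 5.0 the analyzed and not_analyzed string variants correspond to the text and keyword field types. The following sketch maps the older wording onto 5.0 field definitions; the helper name and the field name agent are assumptions for illustration.

```python
import json


def string_field(analyzed: bool, not_analyzed: bool) -> dict:
    """Translate the analyzed / not_analyzed choice into a 5.0 field definition."""
    if analyzed and not_analyzed:
        # Indexed twice: tokenized as text, plus an exact-value keyword sub-field.
        return {"type": "text", "fields": {"keyword": {"type": "keyword"}}}
    if analyzed:
        return {"type": "text"}
    if not_analyzed:
        return {"type": "keyword"}
    raise ValueError("the field must be analyzed, not_analyzed, or both")


for analyzed, not_analyzed in ((True, True), (True, False), (False, True)):
    definition = string_field(analyzed, not_analyzed)
    print(json.dumps({"agent": definition}))
```

Indexing both variants stores the same value twice in different forms, which is consistent with it taking more space than a single variant.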

Official site: introduction to analyzers

II. Degree of influence

The data in this section comes from Reference Article 2.

The test data is a 67644119-byte log file with entries like the following:

71.212.224.97 - - [28/May/2014:16:27:35 -0500] "GET /images/web/2009/banner.png HTTP/1.1" 52315 "http://www.semicomplete.com/projects/xdotool/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"

Number  String field               _all      Doc values  Index size (bytes)  Expansion ratio (index size / raw size)
0       x                          x         x           67644119            1
1       analyzed and not_analyzed  enabled   enabled     94633818            1.399
2       analyzed and not_analyzed  disabled  enabled     75648416            1.118
3       not_analyzed               disabled  enabled     63079805            0.933
4       analyzed and not_analyzed  enabled   disabled    80608354            1.192
5       analyzed and not_analyzed  disabled  disabled    61680474            0.912
6       not_analyzed               disabled  disabled    48432487            0.716
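The expansion ratio in the last column is simply the index size divided by the raw file size. The short sketch below recomputes it from the sizes listed above; the function name and the configuration labels are mine, and the numbers are copied from the table.

```python
RAW_SIZE = 67644119  # size of the raw log file in bytes


def expansion_ratio(index_size: int, raw_size: int = RAW_SIZE) -> float:
    """Index size relative to the raw data, rounded to three decimals."""
    return round(index_size / raw_size, 3)


# (string field, _all, doc values) -> index size in bytes, in table order
results = [
    (("analyzed and not_analyzed", "enabled", "enabled"), 94633818),
    (("analyzed and not_analyzed", "disabled", "enabled"), 75648416),
    (("not_analyzed", "disabled", "enabled"), 63079805),
    (("analyzed and not_analyzed", "enabled", "disabled"), 80608354),
    (("analyzed and not_analyzed", "disabled", "disabled"), 61680474),
    (("not_analyzed", "disabled", "disabled"), 48432487),
]

for config, size in results:
    print(", ".join(config), "->", expansion_ratio(size))
```

A ratio below 1 means the index ended up smaller than the raw log file, which in this data set happens only once _all is disabled and at least one more feature is given up.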
