This article grew out of a discussion among QQ group members about how to optimize disk space. I searched for similar articles and combined them with the official documentation to write this summary.
Reference article 1
Reference article 2
If you have questions, you can contact me to join the discussion, or view the original article.
Note: Saving disk space is a trade-off. Every saving affects certain features. If you do not need the affected features, you can adopt the corresponding disk-saving strategy.
TIP: Be cautious when saving disk space, and be sure to check the impact of each strategy.
I. Influencing factors
Replication
Replicas exist for high availability. They act as a backup of the data, so that no data is lost when some nodes go down. In principle there is no difference between replica data and the primary data, so each replica adds another full copy: the total index size is the primary size multiplied by (1 + number of replicas).
Command to change the number of replicas:
curl -XPUT 'localhost:9200/my_index/_settings' -d '
{
  "index": {
    "number_of_replicas": 0
  }
}'
The default number of replicas is 1. Earlier versions such as 2.3 also allow changing it by adding index.number_of_replicas: 0 to elasticsearch.yml. Configuring index-level settings in elasticsearch.yml is no longer supported as of version 5.0.
Impact: Setting the number of replicas to 0 saves half of the disk space compared with the default of 1, but the ES cluster is no longer highly available: if a node goes down, its data is lost.
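The relationship between replica count and disk usage can be sketched with a little shell arithmetic. This is only an estimate, assuming every replica is a full copy of the primary data; the function name total_index_size and the primary size of 1000000 bytes are hypothetical, chosen for illustration.

```shell
# Estimate the total on-disk size of an index.
# Assumption: each replica is a full copy of the primary data.
# $1 = size of the primary data in bytes, $2 = number of replicas
total_index_size() {
  echo $(( $1 * (1 + $2) ))
}

total_index_size 1000000 1   # default of one replica
total_index_size 1000000 0   # replicas disabled
```

With one replica the estimate is twice the primary size, which is why dropping to zero replicas halves the disk usage.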
_source
Elasticsearch keeps a copy of the original JSON of each incoming document in the _source field. This field is useful for reconstructing the original data and for highlighting search results, but it also takes up disk space. It can be disabled to save disk space.
Note that this copy of the original JSON is not the same thing as the replicas above. It is the content inside the {} below:
PUT my_index/user/1
{
  "first_name": "John",
  "last_name": "Smith",
  "date_of_birth": "1970-10-24"
}
Impact: See my separate article on understanding the _source field and its impact.
A single field can also be configured to be stored or not, which affects disk space as well. I see the single-field store setting and _source as part and whole, so I cover them in the same section. I am still unsure about this point.
Single-field store
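As a rough sketch of what disabling _source looks like, the function below prints a mapping body for the user type from the example above. It assumes the 2.x-style mapping syntax (mapping types and the string field type); the helper names source_disabled_mapping and create_index_without_source are hypothetical, and storing first_name separately is only an illustration of the single-field store setting.

```shell
# Print a mapping body that disables _source for the "user" type.
# Assumption: 2.x-style mapping syntax (mapping types, "string" fields).
source_disabled_mapping() {
  cat <<'EOF'
{
  "mappings": {
    "user": {
      "_source": { "enabled": false },
      "properties": {
        "first_name": { "type": "string", "store": true }
      }
    }
  }
}
EOF
}

# Hypothetical helper: create index $1 on a local node with this mapping.
# Usage: create_index_without_source my_index
create_index_without_source() {
  curl -XPUT "localhost:9200/$1" -d "$(source_disabled_mapping)"
}
```

With _source disabled, features that need the original document, such as the update API, reindexing from the index itself, and on-the-fly highlighting, stop working, so this is worth testing on a throwaway index first.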
_all
The _all field maintains one large string that contains the values of all other fields. It makes it convenient to search for a term without knowing which field it is in. This large string also consumes disk space, and the field can be disabled.
Impact: See my separate article on understanding the _all field and its impact.
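Disabling _all can be sketched the same way as other mapping changes. The function below only prints the mapping body; it assumes the 2.x to 5.x mapping syntax, where _all still exists and mapping types such as user are allowed, and the name all_disabled_mapping is hypothetical.

```shell
# Print a mapping body that disables the _all field for the "user" type.
# Assumption: 2.x to 5.x mapping syntax, where _all is still available.
all_disabled_mapping() {
  cat <<'EOF'
{
  "mappings": {
    "user": {
      "_all": { "enabled": false }
    }
  }
}
EOF
}

# Usage (sends the body to a local node when creating the index):
#   curl -XPUT 'localhost:9200/my_index' -d "$(all_disabled_mapping)"
```

After this change, queries that do not name a field no longer search across all fields by default, so searches need to target specific fields.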
doc_values
Doc values are a mechanism Elasticsearch uses to reduce heap memory usage. They save heap when sorting and aggregating, but they consume disk space themselves.
Impact: See my separate article on understanding doc_values and their impact.
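Doc values are switched off per field rather than per index. The sketch below prints a mapping body that disables them on one not_analyzed string field; it assumes the 2.x-style string type, reuses the last_name field from the earlier example, and the function name doc_values_disabled_mapping is hypothetical.

```shell
# Print a mapping body that disables doc values on a single field.
# Assumption: 2.x-style mapping, where not_analyzed strings have
# doc values enabled by default.
doc_values_disabled_mapping() {
  cat <<'EOF'
{
  "mappings": {
    "user": {
      "properties": {
        "last_name": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": false
        }
      }
    }
  }
}
EOF
}

# Usage (sends the body to a local node when creating the index):
#   curl -XPUT 'localhost:9200/my_index' -d "$(doc_values_disabled_mapping)"
```

This only makes sense for fields that are never sorted or aggregated on, since that is the work doc values exist to support.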
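Analysis (tokenization)
Impact: How a string field is analyzed also affects disk space. In the test results in section II, indexing a string as not_analyzed only takes less space than indexing it as both analyzed and not_analyzed.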
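The two string setups compared in the test results can be sketched as field mappings. Both functions only print a JSON fragment for one field; they assume the 2.x-style string type, and the field name message and the sub-field name raw are hypothetical, chosen for illustration.

```shell
# Field indexed twice: analyzed (the default) plus a not_analyzed sub-field.
# Assumption: 2.x-style "string" type with multi-fields.
analyzed_and_not_analyzed_field() {
  cat <<'EOF'
"message": {
  "type": "string",
  "fields": {
    "raw": { "type": "string", "index": "not_analyzed" }
  }
}
EOF
}

# Field indexed once, as a single not_analyzed term.
not_analyzed_only_field() {
  cat <<'EOF'
"message": { "type": "string", "index": "not_analyzed" }
EOF
}
```

The second form gives up full-text search on the field, which is the feature traded away for the smaller index.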
Official introduction to analyzers
II. Degree of influence
The data in this section comes from reference article 2.
The test data is a log file of 67644119 bytes. A sample line:
71.212.224.97 - - [28/May/2014:16:27:35 -0500] "GET /images/web/2009/banner.png HTTP/1.1" 52315 "http://www.semicomplete.com/projects/xdotool/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
Raw test data size: 67644119 bytes. Test 0 is the raw file itself.
Test number | String field | _all | Doc values | Index size (bytes) | Expansion ratio (index size / raw size)
0 | x | x | x | 67644119 | 1
1 | analyzed and not_analyzed | enabled | enabled | 94633818 | 1.399
2 | analyzed and not_analyzed | disabled | enabled | 75648416 | 1.118
3 | not_analyzed | disabled | enabled | 63079805 | 0.933
4 | analyzed and not_analyzed | enabled | disabled | 80608354 | 1.192
5 | analyzed and not_analyzed | disabled | disabled | 61680474 | 0.912
6 | not_analyzed | disabled | disabled | 48432487 | 0.716
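The expansion ratio in the last column is simply index size divided by raw size. A minimal sketch of the calculation, using awk for the floating-point division; the function name expansion_ratio is hypothetical, and the inputs are the byte counts from the test data above.

```shell
# Compute the expansion ratio (index size / raw size) to three decimals.
# $1 = index size in bytes, $2 = raw data size in bytes
expansion_ratio() {
  awk -v index_size="$1" -v raw_size="$2" \
    'BEGIN { printf "%.3f\n", index_size / raw_size }'
}

expansion_ratio 94633818 67644119   # everything enabled
expansion_ratio 48432487 67644119   # not_analyzed, _all and doc values disabled
```

The same function can be pointed at your own index size and raw data size to see where a given mapping falls relative to these test results.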