Elasticsearch Tuning: How To
Elasticsearch provides a very good out-of-the-box experience: full-text search, result highlighting, aggregations, and indexing all work without any configuration changes.
To get high performance out of Elasticsearch in a real project, however, there are several optimizations worth applying.
This article walks through them.
General recommendations

Do not return too many search results at a time
Elasticsearch is designed as a search engine and is very good at returning the top matching results for a query. It is not designed to return every matching document the way a database does, so avoid requesting the entire result set in a single search. If you really do need to retrieve a large number of documents, use the scroll API instead.
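The idea behind scrolling can be sketched as follows. This is a minimal, self-contained illustration of scroll-style pagination; the `search` and `scroll_next` stubs are hypothetical stand-ins for real calls to the `_search?scroll=...` and `_search/scroll` endpoints:

```python
def scroll_all(search, scroll_next):
    """Iterate over every hit of a query one page at a time, mimicking
    the Elasticsearch scroll API: the initial search returns a scroll id
    plus the first page, and each follow-up call returns the next page
    until no hits remain."""
    scroll_id, hits = search()
    while hits:
        for hit in hits:
            yield hit
        scroll_id, hits = scroll_next(scroll_id)

# Stub "cluster" holding 5 documents, served in pages of 2.
docs = [{"_id": i} for i in range(5)]

def search():
    return 2, docs[:2]                      # scroll_id here is just the next offset

def scroll_next(offset):
    return offset + 2, docs[offset:offset + 2]

print(len(list(scroll_all(search, scroll_next))))  # all 5 hits, fetched page by page
```

The point is that the client holds only one page in memory at a time, no matter how large the full result set is.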
Avoid sparsity
Elasticsearch indexes and stores documents with Lucene, which works best with dense data, that is, when all documents have the same fields. This is especially true for fields with norms enabled (on by default for text fields) or doc_values enabled (on by default for numeric, date, ip, and keyword fields).
The reason is that Lucene internally identifies each document by a doc_id, an integer from 0 up to the total number of documents in the index. The doc_id is the currency of Lucene's internal APIs: for example, a match query on a keyword field produces doc_ids, and those doc_ids are then used to look up the norms values needed to compute the relevance score. The current norms implementation reserves one byte per document, so a document's norms value can be read directly by its doc_id. The advantage of this layout is that Lucene can access any document's value quickly; the drawback is that every document costs one extra byte of storage.
In practice, this means that if an index contains M documents, the norms of each field cost M bytes of storage, even if the field appears in only a small fraction of the documents. doc_values storage for more complex field types behaves similarly and also wastes space on sparse data. (Since Elasticsearch 2.0, doc_values replaced in-memory fielddata by default; fielddata had the same problem, only in memory rather than on disk.)
It is important to note that sparsity hurts not only storage but also indexing and search speed: the extra bytes reserved for absent fields still have to be processed at index and search time.
Of course, a few sparse fields in an index are acceptable, but if sparsity is widespread, it degrades the efficiency of the whole index.
This section focuses on norms and doc_values because they are the two features most affected by sparsity. Sparsity also affects, to a lesser extent, the inverted index (used to index text and keyword fields) and dimensional points (used to index geo_point and numeric fields).
Here are a few recommendations to avoid sparsity:
Avoid putting unrelated data on the same index
Do not put documents with completely different structures in the same index. It is better to split them into separate indices; those indices can be smaller, with fewer shards each.
Note that this recommendation does not apply to documents with a parent/child relationship, which must be placed in the same index.
Standardize document structure
If you must put documents of different structures in the same index, you can still reduce sparsity. For example, if every document carries some kind of timestamp, but under different names such as "timestamp" or "creation_date", renaming them so that all documents use the same field name keeps that field dense.
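As an illustration, a small normalization step applied before indexing (the alias names here are hypothetical) can fold per-type date fields into one shared timestamp field:

```python
# Hypothetical ingest-side normalization: documents arrive with different
# date field names; rename them to a single "timestamp" field so that
# the field stays dense across the whole index.
DATE_ALIASES = ("creation_date", "created_at", "date")

def normalize(doc):
    for alias in DATE_ALIASES:
        if alias in doc:
            doc["timestamp"] = doc.pop(alias)
    return doc

log = normalize({"msg": "boot", "created_at": "2024-01-01T00:00:00Z"})
order = normalize({"sku": "A1", "creation_date": "2024-01-02T00:00:00Z"})
assert "timestamp" in log and "timestamp" in order
```

In a real pipeline the same effect can be achieved with an ingest pipeline `rename` processor instead of client-side code.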
Avoid placing different types in the same index
Placing multiple types in a single index looks convenient, but Elasticsearch does not store documents separately by type: different types in the same index share the same underlying storage, so dissimilar types hurt efficiency. If two types do not have very similar mappings, consider moving them to separate indices.
Disable norms and doc_values on fields that do not need them
Beyond the recommendations above, also check whether your fields really need norms and doc_values enabled. A field used only for filtering, never for scoring, can have norms disabled; a field never used for sorting or aggregations can have doc_values disabled. Note that if you change these settings on an existing index, you must reindex it for the change to take effect.
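For example, a mapping along these lines (the index and field names are made up) disables norms on a filter-only text field and doc_values on a keyword field that is never sorted or aggregated on:

```
PUT /my-index
{
  "mappings": {
    "properties": {
      "body":   { "type": "text",    "norms": false },
      "status": { "type": "keyword", "doc_values": false }
    }
  }
}
```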
Tuning indexing speed

Use bulk requests
Bulk requests perform much better than single-document index requests. To find the optimal bulk size, run a benchmark against a single node with a single shard: first index 100 documents at a time, then 200, then 400, and so on, doubling the batch size in each benchmark run. The batch size at which indexing speed peaks is the optimal bulk size. That said, bigger is not always better: under concurrent load, overly large bulk requests put memory pressure on the cluster, so it is advisable to stay under a few tens of megabytes per request even if larger batches appear to perform better.
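The payload that such a benchmark would send to the `_bulk` endpoint is newline-delimited JSON: an action line followed by a source line for each document, with a trailing newline. A minimal builder, using a hypothetical index name:

```python
import json

def bulk_body(index, docs):
    """Build the newline-delimited JSON payload expected by the
    Elasticsearch _bulk endpoint: for each document, one action line
    ({"index": ...}) followed by the document source, and a final
    trailing newline, which the endpoint requires."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = bulk_body("my-index", [{"user": "kim"}, {"user": "lee"}])
assert body.endswith("\n")
assert body.count("\n") == 4  # 2 docs -> 4 lines -> 4 newlines
```

The body is sent as-is with `Content-Type: application/x-ndjson`; the benchmark then simply varies how many documents go into each call.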
Use multi-process/multithreading to send data to Elasticsearch
A single thread sending bulk requests is unlikely to saturate a cluster's indexing capacity. To make full use of the cluster's resources, send data from multiple threads or processes; this also helps amortize the cost of each fsync.
Be sure to watch for TOO_MANY_REQUESTS (429) responses (the Java client surfaces these as EsRejectedExecutionException); they mean Elasticsearch cannot keep up with the current indexing rate. When this happens, pause indexing briefly before retrying, ideally backing off exponentially with some random jitter.
As with sizing bulk requests, the optimal number of threads can only be found by testing: keep increasing the thread count until I/O or CPU saturates on the cluster machines.
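A back-off loop for 429 responses might look like the following sketch; `send` is a hypothetical stand-in for whatever transport actually issues the bulk request:

```python
import random

def send_with_backoff(send, max_retries=5, base_delay=1.0, sleep=None):
    """Retry a bulk request whenever the cluster answers 429
    (TOO_MANY_REQUESTS), waiting exponentially longer, plus a little
    random jitter, between attempts."""
    sleep = sleep or (lambda seconds: None)   # injectable so tests don't wait
    for attempt in range(max_retries):
        status = send()
        if status != 429:
            return status
        sleep(base_delay * (2 ** attempt) + random.random())
    raise RuntimeError("cluster still overloaded after retries")

# Stub transport: rejects twice, then accepts.
responses = iter([429, 429, 200])
assert send_with_backoff(lambda: next(responses)) == 200
```

In production the `sleep` argument would be `time.sleep`, and the pause gives the cluster's write queues time to drain before the next attempt.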
Increase the refresh interval (refresh_interval)
The default value of index.refresh_interval is 1s, which forces Elasticsearch to create a new segment (roughly, a Lucene index file) every second. Increasing this value, for example to 30s, lets larger segments be written and reduces subsequent segment-merge pressure.
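For example, on a hypothetical index named my-index:

```
PUT /my-index/_settings
{
  "index": { "refresh_interval": "30s" }
}
```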
Disable refresh and replicas for the initial load of an index
If you need to load a large amount of data into an index all at once, first disable refresh by setting index.refresh_interval to -1, and set index.number_of_replicas to 0. Temporarily running without replicas means data is lost if the index is corrupted during the load, but it dramatically speeds up indexing. Once the initial load completes, set index.refresh_interval and index.number_of_replicas back to their original values.
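The two phases might look like this (the index name is made up, and the restore step assumes the original values were 1s and 1):

```
# before the initial load
PUT /my-index/_settings
{ "index": { "refresh_interval": "-1", "number_of_replicas": 0 } }

# after the load completes
PUT /my-index/_settings
{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }
```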
Disable swapping
Turn off the operating system's virtual memory swapping, or at least make it a last resort, by setting vm.swappiness = 1 in the sysctl configuration.
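For example:

```
# /etc/sysctl.conf — only swap under extreme memory pressure
vm.swappiness = 1
```

Run `sysctl -p` (or reboot) for the change to take effect.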
Ensure that there is free memory for the file system cache
The file system cache buffers disk I/O. At a minimum, make sure half of the machine's memory is left to the operating system for this cache rather than letting the JVM consume all of it.
Use faster hardware
This goes without saying: SSDs are best. If you have multiple SSDs, you can configure them as a RAID 0 array for better I/O throughput, but remember that in RAID 0 the failure of any single drive destroys the index. The usual trade-off is to optimize single-shard storage performance this way and rely on replicas on other nodes, plus the snapshot and restore feature, to protect the data.
Index buffer size
If a node is doing heavy indexing, make sure indices.memory.index_buffer_size is large enough to give each actively indexing shard up to 512 MB of indexing buffer; beyond that, indexing performance does not usually improve.
Elasticsearch takes this setting, expressed either as a percentage of the Java heap or as an absolute byte size, and shares the resulting buffer among all active shards; more active shards naturally consume more of it.
The default is 10%: for example, with a 10 GB JVM heap, 1 GB of indexing buffer is shared across the heavily indexing shards.
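For example, in elasticsearch.yml (this is a static, node-level setting):

```
# either a percentage of the heap …
indices.memory.index_buffer_size: 10%
# … or an absolute value
# indices.memory.index_buffer_size: 2gb
```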
(This article is based on the official Elasticsearch "How to" tuning documentation.)