Cloud computing platform (Search): Elasticsearch index optimization

Source: Internet
Author: User

Elasticsearch index optimization tackles the problem from two sides: the indexing process and the retrieval process.

In previous articles I described how to create indexes and import data, but you may find that indexing the data is slow. Once you understand how indexing works, you can optimize it in a targeted way. Compared with the Lucene indexing process, the Elasticsearch indexing process adds distribution of the data, and Elasticsearch uses the translog (transaction log) to keep data in balance between nodes. The first optimization can therefore be made in the index settings:

"index.translog.flush_threshold_ops": "100000",

"index.refresh_interval": "-1"

The first parameter is the number of operations the translog accumulates before it is flushed. The default value is 5000, and flushing that often wastes time and resources during a large import. You can therefore raise the value or set it to -1, and then flush the translog manually. The second parameter is the refresh interval. The default value is 1s, which means that the index is refreshed at that interval throughout its lifetime. A refresh plays a role similar to a commit in Lucene: after a document is added with addDocument, it cannot be retrieved until the index has been committed. You can therefore disable automatic refresh with -1, refresh manually once the initial indexing is complete, and then set index.refresh_interval in the index settings to whatever value you need. This improves the efficiency of the indexing process.
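The settings change described above can be sketched in Python as follows. This is a minimal sketch, not a definitive client: it only builds the requests for the index settings endpoint and does not send them. The index name twitter is borrowed from the curl commands later in this article, the host and the restored value "1s" are assumptions, and the setting names are the ones given in the text, so whether they are accepted depends on the Elasticsearch version.

```python
import json
import urllib.request

# Settings named in the text for the initial bulk import.
# Assumption: the running Elasticsearch version still accepts these names.
BULK_LOAD_SETTINGS = {
    "index.translog.flush_threshold_ops": "100000",
    "index.refresh_interval": "-1",
}

# Illustrative value to restore once the import is complete.
AFTER_LOAD_SETTINGS = {
    "index.refresh_interval": "1s",
}


def settings_request(index, settings, host="http://localhost:9200"):
    """Build (but do not send) a PUT request to the index settings endpoint."""
    return urllib.request.Request(
        url=f"{host}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


before_load = settings_request("twitter", BULK_LOAD_SETTINGS)
after_load = settings_request("twitter", AFTER_LOAD_SETTINGS)

print(before_load.get_method(), before_load.full_url)
print(json.loads(before_load.data))
print(json.loads(after_load.data))
```

To apply a request against a running cluster, pass it to urllib.request.urlopen; the manual refresh after the import is a separate call and is not shown here.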

In addition, if a replica exists while Elasticsearch is indexing, the data is synchronized to the replica immediately. I recommend setting the number of replicas to 0 during the indexing process. After indexing is complete, you can change the number of replicas as needed. This also improves indexing efficiency.

"number_of_replicas": 0

Having covered the indexing process, let's turn to slow retrieval. Retrieval speed is closely tied to index quality, and index quality depends on many factors.

I. Number of shards

The number of shards is strongly related to retrieval speed. Both too many and too few shards make retrieval slow. With too many shards, more files have to be opened during retrieval, and more communication between servers may be required. With too few shards, retrieval is slow because the index in each shard is too large.

Before deciding on the number of shards, test a single server with a single index and a single shard. For example, I created an index with only one shard on an IBM 3650 machine and measured the retrieval speed at different data volumes. In the end I settled on 20 GB of data per shard.

So the number of index shards = total data volume / data volume of a single shard
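The formula above can be worked through in a few lines of Python. This is only a sketch: the 20 GB default is the per-shard size the author arrived at, while the total volumes in the example are hypothetical figures chosen for illustration. The result is rounded up so that no shard exceeds the target size.

```python
import math


def shard_count(total_gb, per_shard_gb=20.0):
    """Number of shards = total data volume / data volume of a single shard, rounded up."""
    if total_gb <= 0 or per_shard_gb <= 0:
        raise ValueError("data volumes must be positive")
    return math.ceil(total_gb / per_shard_gb)


# Hypothetical totals, for illustration only.
print(shard_count(500))  # 500 / 20 = 25 shards
print(shard_count(510))  # 510 / 20 = 25.5, rounded up to 26 shards
```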

Currently, we have more than 400 million data records and the index size is about [...] TB. Because it is document data, each record is under 8 KB. The retrieval time is now kept below 100 ms. In special cases and in concurrency tests at levels of 200, 400, 800 and above, the worst case is less than [...] ms.

II. Number of replicas

The number of replicas has a strong bearing on index stability. If Elasticsearch fails abnormally, shards are often lost. To protect the integrity of the data, you can use replicas. We recommend that, once the index has been built, you run optimize and then adjust the number of replicas immediately afterwards.

A common misconception is that the more replicas there are, the faster retrieval is. This is not the case. In my own tests, retrieval speed decreased slightly as the number of replicas increased, so you need to find a balance when you set the number of replicas. In addition, once replicas are in place, two identical searches may return different values. This may be because the translog has not yet been balanced, or because the requests were routed to different shard copies. Adding ?preference=_primary to the request makes the search run on the primary shards.
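As a small illustration of the preference parameter mentioned above, the following Python sketch builds a search URL with and without it. The host, the index name twitter (borrowed from the curl commands in this article), and the query string are assumptions for illustration, and the _primary value is the one given in the text, which not every Elasticsearch version accepts.

```python
from urllib.parse import urlencode


def search_url(index, query, primary_only=False, host="http://localhost:9200"):
    """Build a URI search URL, optionally pinned to the primary shards."""
    params = {"q": query}
    if primary_only:
        # Value taken from the text; assumption: the cluster version supports it.
        params["preference"] = "_primary"
    return f"{host}/{index}/_search?{urlencode(params)}"


print(search_url("twitter", "elasticsearch"))
print(search_url("twitter", "elasticsearch", primary_only=True))
```

Pinning searches to the primaries gives repeatable results for the comparison described above, at the cost of not spreading the search load over the replicas.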

III. Word segmentation

The impact of word segmentation on the index can be large or small. It is tempting to think that the more words the dictionary contains, the better the segmentation and the better the index quality, but this is not so. Word segmentation involves many algorithms, and most of them are dictionary-based. In other words, the size of the dictionary determines the size of the index, so word segmentation is directly linked to the index inflation rate. The dictionary should not be large. It should contain only words that are strongly characteristic of the documents. For example, when indexing academic papers, a dictionary that closely matches the vocabulary of the papers, and is as small as possible, can greatly reduce the index size while still preserving query completeness and accuracy. A smaller index in turn means faster retrieval.

IV. Index segments

An index segment is the segment concept from Lucene. The Elasticsearch indexing process uses refresh and the translog, which means that indexing leaves the index with more than one segment. The number of segments is directly related to search speed: the more segments there are, the slower the search. Where possible, keep the number of segments at 1, which can improve search speed by nearly half.

$ curl -XPOST 'http://localhost:9200/twitter/_optimize?max_num_segments=1'
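The same call can be prepared from Python. The sketch below only builds the POST request for the _optimize endpoint used in the curl command above and does not send it. The helper name optimize_request and the host are assumptions for illustration, and the endpoint is the one named in this article, which may not exist under that name in every Elasticsearch version.

```python
import urllib.request
from urllib.parse import urlencode


def optimize_request(index, host="http://localhost:9200", **params):
    """Build (but do not send) a POST request to the _optimize endpoint."""
    query = urlencode(params)
    url = f"{host}/{index}/_optimize" + (f"?{query}" if query else "")
    return urllib.request.Request(url, method="POST")


req = optimize_request("twitter", max_num_segments=1)
print(req.get_method(), req.full_url)
```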

V. Deleted documents

Deleting a document in Lucene does not immediately remove the data from disk. Instead, a .del file is generated in the Lucene index, and the deleted data still takes part in the retrieval process. During retrieval, Lucene checks whether each matching document has been deleted and filters it out if so, which also reduces retrieval efficiency. You can therefore expunge the deleted documents:

$ curl -XPOST 'http://localhost:9200/twitter/_optimize?only_expunge_deletes=true'
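Why expunging helps can be seen in a toy model. The Python sketch below is not Lucene's actual data structure. It is a simplified stand-in in which a segment keeps its documents together with a set of deleted ids, with all class and method names invented for illustration. A search still scans the deleted documents and filters them out, and only expunging removes them.

```python
class Segment:
    """Toy stand-in for a segment: stored documents plus a set of deleted ids."""

    def __init__(self, docs):
        self.docs = dict(docs)  # doc_id -> text
        self.deleted = set()    # plays the role of the .del file

    def delete(self, doc_id):
        # Deleting only marks the document; its data stays in the segment.
        if doc_id in self.docs:
            self.deleted.add(doc_id)

    def search(self, term):
        # Every stored document is scanned; deleted ones are filtered from the hits.
        scanned = 0
        hits = []
        for doc_id, text in self.docs.items():
            scanned += 1
            if term in text and doc_id not in self.deleted:
                hits.append(doc_id)
        return hits, scanned

    def expunge_deletes(self):
        # Rewrite the segment without the documents that were marked as deleted.
        self.docs = {i: t for i, t in self.docs.items() if i not in self.deleted}
        self.deleted.clear()


seg = Segment({1: "fast search", 2: "slow search", 3: "index"})
seg.delete(2)
print(seg.search("search"))  # ([1], 3): three documents scanned, one hit
seg.expunge_deletes()
print(seg.search("search"))  # ([1], 2): same hit, less work
```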
