Elastic Stack series XV: Thoughts on a performance problem caused by the query cache

Source: Internet
Author: User

Problem description

An online cluster executes the same Query DSL, only with different parameters. Statistics show that 98%~99% of the queries respond very quickly, in only 4~6ms, but about 1% of the queries take between 100ms and 200ms. The cluster hardware is well provisioned: SSD disks are used, the available system memory is more than twice the total size of the indexes, and the cluster has been running for a while, so there is no question of the data not being warmed up.

Diagnostic process

First, all the key metrics of the cluster were reviewed through the monitoring system, and no performance bottleneck was found that could explain the slow queries. The initial suspicion was therefore that the query itself was slow for some reason. A query that took 150ms was pulled from the log system (only the key content is shown here; non-essential parts have been removed):

POST /xxxindex/xxxdb/_search?routing=mxxxxxxx
{
  "from": 0,
  "size": +,
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "must": [
              {
                "bool": {
                  "must": [
                    {
                      "bool": {
                        "should": [
                          {
                            "match_phrase": {
                              "ord_orders_uid": {
                                "query": "mxxxxxxx",
                                "slop": 0,
                                "boost": 1
                              }
                            }
                          }
                        ],
                        "disable_coord": false,
                        "adjust_pure_negative": true,
                        "boost": 1
                      }
                    },
                    {
                      "range": {
                        "ord_orders_orderdate": {
                          "from": "1405032032",
                          "to": "1504014193",
                          "include_lower": true,
                          "include_upper": true,
                          "boost": 1
                        }
                      }
                    },
                    {
                      "term": {
                        "ord_orders_ispackageorder": {
                          "value": 0,
                          "boost": 1
                        }
                      }
                    },
                    {
                      "bool": {
                        "must_not": [
                          {
                            "exists": {
                              "field": "ord_hideorder_orderid",
                              "boost": 1
                            }
                          }
                        ],
                        "disable_coord": false,
                        "adjust_pure_negative": true,
                        "boost": 1
                      }
                    }
                  ],
                  "disable_coord": false,
                  "adjust_pure_negative": true,
                  "boost": 1
                }
              }
            ],
            "disable_coord": false,
            "adjust_pure_negative": true,
            "boost": 1
          }
        }
      ],
      "disable_coord": false,
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}

After getting the query, I executed it manually: 0 hits, total time 1ms. It must have hit the query cache to be that fast.

Then I cleared the query cache with the indices clear cache API and re-ran the query several times.
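A minimal sketch of the cache-clearing request, assuming the same anonymized index name as in the logged query above (the query=true parameter restricts the clear to the query cache):

POST /xxxindex/_cache/clear?query=true

The findings from the repeated executions are summarized as follows: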

1. The first two queries took around 36ms each. This is expected: with no cache, the inverted index has to be accessed. There are two such queries because the index has a replica, and the two queries landed on the primary shard and the replica shard respectively.

2. The next two queries took around 150ms each, which was puzzling at first.

3. After that, no matter how many times the query was executed, it took only 1~5ms, because the cache was being hit.

At this point it was roughly clear that the high latencies recorded in the log correspond to step 2. So what operation takes that long? Based on past experience, my main suspicion was the generation of the cache entry for the range filter, i.e. building the bitmap of the matching documents, which is then stored in the query cache.
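Whether the query cache is actually being populated can be confirmed from its statistics; a sketch using the indices stats API (index name anonymized as above), which reports the query cache's memory size, hit/miss counts, cache counts and evictions:

GET /xxxindex/_stats/query_cache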

The cluster runs Elasticsearch 5.5.1, and starting with Elasticsearch 5.1.1 the caching of term filters was removed, because a term filter executes fast enough that caching it is usually not worthwhile and would only waste memory. So I focused on the only range filter in the query.

Executing this range filter alone matched documents on the order of tens of millions. Why does this range filter hit so many documents? It turns out the users mainly query data from the past year up to the current time, i.e. a filter like [now-1y TO now]. The preliminary conclusion: because the range filter matches so many documents, building the bitmap for this filter in the query cache takes noticeably longer, and that should be where the extra ~100 milliseconds come from.
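In other words, although the application sends precomputed epoch-second bounds (as seen in the logged query), every request is effectively filtering on something like the following, with the bounds shifting every second (field name taken from the logged query; this rewrite is only for illustration):

{
  "range": {
    "ord_orders_orderdate": {
      "gte": "now-1y",
      "lte": "now"
    }
  }
}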

But one more question remained to be explained: why do these high-latency queries account for such a noticeable share (about 1%) of all requests?

Thinking about it a bit more, it makes sense:

The search concurrency of this cluster is quite high, on the order of 400 queries per second, and the precision of the time field is seconds. So at the start of each second, the first couple of queries have no cache and take around 36ms; then a couple of queries need to build the cache for the new range filter and their latency rises to around 200ms; the remaining queries within that second all hit the cache and take only a few milliseconds, until the next second begins and the cycle repeats. Since every second produces a couple of queries that have to rebuild the cache, these slow requests, measured against roughly 400 queries per second, add up to a share on the order of the ~1% observed.

Problem fix

For the large volume of range queries of the form [now-xxx TO now], the official documentation actually has an acceleration tip: Search rounded dates. That is, round the upper and lower bounds of the query time to the whole minute or whole hour, so that the range filter stays identical for a longer period and the cache is not rebuilt so frequently.

{"   range": {"       my_date": {       "GTE": "now-1y/h",        "LTE": "now-1y/h"    }}}

The range filter in the original query was rewritten into the form above, and manual testing confirmed the improvement. With the bounds rounded to the hour, the cached range filter stays valid for an hour at a time, so its cache only has to be rebuilt a couple of times per hour, and the problem is solved.

Summary

1. Building more cache entries is not always better: generating and evicting cache entries carries extra overhead, and for filters with very large result sets the cost of caching can be high relative to the cost of the query itself (see the note after this list).

2. Elasticsearch 5.1.1 stopped caching terms filters because a terms filter executes very quickly; in most cases, dropping the cache actually improves performance.

3. When using range filters of the form [now-xxx TO now], the rounded-dates technique can be used to extend the effective lifetime of the cache entry and mitigate the performance problems caused by frequently rebuilding the cache.
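Related to point 1: if the query cache itself turns out to cost more than it saves for a given workload, Elasticsearch also allows it to be bounded or disabled. A sketch, assuming default settings otherwise (the node-level cap defaults to 10% of the heap; the per-index switch is a static setting applied at index creation):

# elasticsearch.yml: cap the node-level query cache size
indices.queries.cache.size: 5%

# Disable the query cache for a specific index when creating it
PUT /xxxindex
{
  "settings": {
    "index.queries.cache.enabled": false
  }
}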
