"The number of Redis client connections has been down" problem solving

Source: Internet
Author: User
Tags: cassandra, redis, server

[Production issue] Solving the "Redis client connection count won't come down" problem

Some time ago, the new Redis cache service went live, intended to replace Memcached.

Why did we want to replace Memcached?

The reason is that the business data is a compressed list, and the cache holds the latest 3,000 entries. Appending a new entry has to be broken down into five steps: get + unzip + append + zip + set. When the list length is on the order of 1,000 entries, this takes at least 50 ms. In a concurrent environment there is also a "data update overwrite" problem, because the append operation is not atomic. (This problem really did occur in production.)
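To make the problem concrete, here is a rough sketch of that Memcached-style append, written against a hypothetical ByteCache interface with placeholder zip/unzip helpers (none of these names come from the original code):

import java.util.ArrayList;
import java.util.List;

// Hypothetical client interface standing in for the Memcached client used on
// line; get/set are the only operations assumed, mirroring the 5-step flow above.
interface ByteCache {
    byte[] get(String key);
    void set(String key, byte[] value);
}

public class NonAtomicAppend {
    private static final int MAX_ITEMS = 3000;       // the cache keeps only the latest 3,000 entries

    // The whole read-modify-write sequence is NOT atomic: two threads that
    // interleave between get() and set() will overwrite each other's append,
    // which is the "data update overwrite" problem described above.
    static void append(ByteCache cache, String key, String newItem) {
        byte[] compressed = cache.get(key);           // 1. get
        List<String> items = unzip(compressed);       // 2. unzip
        items.add(0, newItem);                        // 3. append (newest first)
        if (items.size() > MAX_ITEMS) {
            items = items.subList(0, MAX_ITEMS);      // keep only the latest 3,000
        }
        byte[] recompressed = zip(items);             // 4. zip
        cache.set(key, recompressed);                 // 5. set
    }

    // Placeholder (de)compression helpers; the real code would use e.g. GZIP.
    static List<String> unzip(byte[] data) { return new ArrayList<>(); }
    static byte[] zip(List<String> items) { return new byte[0]; }
}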

For the " append operation is not an atomic operation " issue, we started to investigate which distributed cache solutions that could solve the problem while satisfying the business data type .

Currently, some of the Key-value distributed cache systems commonly used in the industry are as follows:

    • Redis
    • Memcached
    • Cassandra
    • Tokyo Tyrant (Tokyo Cabinet)

Reference from:

    • Technical Architecture Recommendations for 2010 – Tim Yang
    • From distributed caches to in-memory data grids
    • Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs OrientDB vs Aerospike vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison

After comparison and analysis, we finally chose Redis, for several reasons:

    • Redis is a key-value cache and storage (store) system (for now we only use it as a cache, not as a DB; the data itself is persisted in Cassandra).
    • It supports rich data structures. The list type is purpose-built for storing lists and is ordered by operation time by default; Sorted Set orders elements by score, a generalized concept that can be a timestamp or an actual score. These rich data structures also leave plenty of room for future extension.
    • All of the operations it provides are atomic, which is a natural safeguard under concurrency (a sketch of the atomic append follows this list).
    • Extremely fast; see the official benchmark "How fast is Redis?".
    • It has a relatively mature Java client, Jedis, which is what companies such as Sina Weibo use. (It is one of the officially recommended clients.)
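As promised above, here is a minimal sketch of what the append looks like once it is moved onto a Redis list with Jedis. The key name and host/port are illustrative, and the LPUSH/LTRIM pair could additionally be wrapped in MULTI/EXEC if the pair itself had to be atomic:

import redis.clients.jedis.Jedis;

public class RedisListAppend {
    private static final int MAX_ITEMS = 3000;   // the cache keeps only the latest 3,000 entries

    // LPUSH and LTRIM each execute atomically on the Redis server, so concurrent
    // appends no longer overwrite each other the way the get/set sequence did.
    public static void append(Jedis jedis, String key, String newItem) {
        jedis.lpush(key, newItem);                // prepend the new entry
        jedis.ltrim(key, 0, MAX_ITEMS - 1);       // trim the list back to 3,000 entries
    }

    public static void main(String[] args) {
        Jedis jedis = new Jedis("127.0.0.1", 6379);   // illustrative host/port
        try {
            append(jedis, "latest:data", "hello");
            System.out.println(jedis.lrange("latest:data", 0, 9));
        } finally {
            jedis.disconnect();
        }
    }
}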

Enough digressing; let's get back to the point.

As soon as the Redis service went live, we kept a close eye on several key Redis monitoring metrics (clients: number of client connections; memory; stats: number of commands processed by the server per second; commandstats: execution statistics for key commands; redis.error.log: the exception log). (See "Redis Monitoring Scheme".)
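These metrics come straight out of the Redis INFO command. The following minimal sketch, assuming an illustrative host and port, shows how the "client connections" figure (connected_clients) can be pulled programmatically with Jedis:

import redis.clients.jedis.Jedis;

// Reads the INFO output and prints the connected_clients line, i.e. the
// "number of client connections" metric discussed in this article.
public class ConnectionCountCheck {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("127.0.0.1", 6379);    // illustrative host/port
        try {
            String info = jedis.info();                // full INFO output
            for (String line : info.split("\r?\n")) {
                if (line.startsWith("connected_clients:")) {
                    System.out.println(line);          // e.g. connected_clients:2000
                }
            }
        } finally {
            jedis.disconnect();                        // close the monitoring connection
        }
    }
}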

At around 5 p.m. we noticed that the number of "client connections" kept growing and had passed 2,000 (see the monitoring chart), with no sign of dropping. Yet the application's QPS is only about 10, and there are no more than 10 application servers online. By rights the servers could never legitimately produce such a high connection count, so something had to be wrong with how the client was being used.

All we could do was reason backwards from the symptom toward the cause:

  • The "Number of client connections" monitored by the Redis server indicates that all clients should have so many, so first confirm the number of connections to each application.
  • Via "sudo netstat-antp | grep 6379 | Wc-l "confirms that there are more than 1000 connections for a Redis application, while the other is around 400 and the others are 60 or so . (60 up and Down is normal)
  • The first question: Why do different machines deploy the same application and behave differently?
  • The second problem: the number of connections more than 1000, the request volume (140) is lower than other machines (200+) (because it is configured in Nginx low weight), then its number of connections is so high? What the hell is going on?
  • For the second question, we know what happened through the Redis Exception log (Redis.error.log) for each application. The highest application has an unusually large number of exceptions, a total of 130 + exceptions, and there is a " connection leak when the cluster link is closed " issue; a similar situation exists in another high application, while other normal applications do not exceed 2 exceptions, and there is no "connection leak" issue. In this way, the "second question" is a clear one. (" connection leak " problem how to fix See "[FAQ] Jedis use the process of those who stepped on the pits")
  • At this point, the feeling problem seems to have been solved, but actually did not. By observing for a few days, it was horrible to find that the highest time, its number of connections even exceeded the 3000+. (then leader and I said, do not restart the application)
  • Even if the application's QPS is 20/s, and there is a "connection leak" issue, the number of connections will not exceed 1000 +. But now the number of connections altogether reached the 3000+, it does not work, only one may be incorrect use of Jedis.
  • This is the time to continue pushing back, and the number of Redis connections reflects the number of pool objects in the Jedis object pool . 2 Redis servers are deployed online as a cluster, indicating that this application holds (3000/2=1500) pooled objects. (because Jedis is implemented based on the genericobjectpool of Apache Commons Pool)
  • The third problem: depending on the application's QPS, there will be no more than 20 active pool objects per second, and the remaining 1480 are "free pool objects". Why are so many "free pool objects" not released?
  • Think about it now: those configuration properties of Jedis are related to the object pool management of "Idle pool objects", how do you manage "free pool objects" behindgenericobjectpool ?
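The connection-leak fix referenced above boils down to always handing the Jedis instance back to the pool, even when a command throws. A minimal sketch, assuming the classic Jedis 2.x pool API (getResource / returnResource / returnBrokenResource):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.exceptions.JedisConnectionException;

// If an exception escapes after getResource() and the instance is never
// returned, that pooled connection is leaked and the server-side connection
// count creeps upward, which is exactly the symptom described above.
public class LeakFreeUsage {
    public static String safeGet(JedisPool pool, String key) {
        Jedis jedis = null;
        boolean broken = false;
        try {
            jedis = pool.getResource();
            return jedis.get(key);
        } catch (JedisConnectionException e) {
            broken = true;                             // the underlying socket may be unusable
            throw e;
        } finally {
            if (jedis != null) {
                if (broken) {
                    pool.returnBrokenResource(jedis);  // discard the broken connection
                } else {
                    pool.returnResource(jedis);        // hand the connection back to the pool
                }
            }
        }
    }
}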

Because Jedis is built on it, we ended up digging into Apache Commons Pool underneath. For the last two questions, the following Jedis configuration properties are the ones related to the management of "idle pool objects":

redis.max.idle.num=32768
redis.min.idle.num=30
redis.pool.behaviour=FIFO
redis.time.between.eviction.runs.seconds=1
redis.num.tests.per.eviction.run=10
redis.min.evictable.idle.time.minutes=5
redis.max.evictable.idle.time.minutes=1440
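For readers who have not used these properties, here is a rough sketch of how they would map onto the pool configuration, assuming a Jedis version built on Apache Commons Pool 2 (where JedisPoolConfig extends GenericObjectPoolConfig); redis.max.evictable.idle.time.minutes has no direct Commons Pool counterpart and is treated as application-specific here:

import redis.clients.jedis.JedisPoolConfig;

// Approximate translation of the redis.* properties above into pool settings.
public class PoolConfigFromProperties {
    public static JedisPoolConfig build() {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxIdle(32768);                              // redis.max.idle.num
        config.setMinIdle(30);                                 // redis.min.idle.num
        config.setLifo(false);                                 // redis.pool.behaviour=FIFO (false = FIFO queue, true = LIFO stack)
        config.setTimeBetweenEvictionRunsMillis(1000L);        // redis.time.between.eviction.runs.seconds=1
        config.setNumTestsPerEvictionRun(10);                  // redis.num.tests.per.eviction.run=10
        config.setMinEvictableIdleTimeMillis(5 * 60 * 1000L);  // redis.min.evictable.idle.time.minutes=5
        return config;
    }
}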

The reason we said earlier that "around 60 Jedis connections per application is normal" is that 2 Redis servers are deployed online, and the minimum number of idle pool objects configured for Jedis is redis.min.idle.num=30 (2 × 30 = 60).

GenericObjectPool manages "idle pool objects" with an eviction thread (Evictor); see the article "Apache Commons Pool's idle-object eviction detection mechanism". The last five configuration items above all relate to the Evictor thread. They say that the idle queue of the object pool behaves as a FIFO (first in, first out) queue; that an eviction-detection run is performed every second (1) and examines 10 idle pool objects per run; that an idle pool object only becomes eligible for eviction after it has been idle for more than 5 minutes; and that it is forcibly evicted once it has been idle for more than one day (1440 minutes).

The " evicted thread Evictor" iterates over the " Pool object Idle queue " indefinitely, iterating over the detection. There are two ways to behave in the idlequeue: LIFO "LIFO" stack mode,FIFO"First in, out" queue mode, and LIFOby default. Here are two pictures to show how these two methods work in practice:

1. LIFO (last in, first out) stack mode

2. FIFO (first in, first out) queue mode

As the two diagrams show, LIFO stack mode concentrates usage on the "hot" pool objects at the top of the idle queue; when traffic drops, the objects at the bottom stay unused long enough to become eligible for eviction and are eventually removed. FIFO queue mode, by contrast, rotates through every pool object in the idle queue over time, so requests look nicely spread out, but no object ever stays idle long enough to be released. This is one of the root causes of the "client connection count won't come down" problem.
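The difference is easy to reproduce with Commons Pool 2 directly. The following self-contained sketch (illustrative only, not the production code) pre-fills a pool with 10 objects and shows that, under FIFO, even a light borrow/return load touches every idle object in turn and keeps resetting its idle time:

import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;

public class IdleQueueDemo {

    // Trivial factory that pools numbered counters so the demo is self-contained.
    static class CounterFactory extends BasePooledObjectFactory<int[]> {
        private int created = 0;
        @Override public int[] create() { return new int[] { created++ }; }
        @Override public PooledObject<int[]> wrap(int[] obj) { return new DefaultPooledObject<>(obj); }
    }

    public static void main(String[] args) throws Exception {
        GenericObjectPool<int[]> pool = new GenericObjectPool<>(new CounterFactory());
        try {
            pool.setMaxTotal(10);
            pool.setMaxIdle(10);
            pool.setLifo(false);                       // FIFO: borrow always takes the oldest idle object
            for (int i = 0; i < 10; i++) {
                pool.addObject();                      // pre-fill the idle queue with 10 objects
            }

            // Simulate light traffic: one borrow/return at a time. Under FIFO the
            // output cycles through objects 0..9 twice, i.e. every idle object gets
            // touched and its idle time is reset; under LIFO (setLifo(true)) the same
            // most-recently-returned object would be reused every time and the others
            // would age toward eviction.
            for (int i = 0; i < 20; i++) {
                int[] obj = pool.borrowObject();
                System.out.println("borrowed object #" + obj[0]);
                pool.returnObject(obj);
            }
        } finally {
            pool.close();
        }
    }
}

Under the article's configuration the same thing plays out with real connections: at roughly 10 requests per second, every pooled Redis connection gets borrowed within a few minutes, so none of them ever crosses the 5-minute eviction threshold.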

redis.pool.behaviour=FIFO
redis.time.between.eviction.runs.seconds=1
redis.num.tests.per.eviction.run=10
redis.min.evictable.idle.time.minutes=5

From the configuration above we can work out how many idle pool objects get cycled through within a 5-minute window.
At the application's QPS of about 10/s, roughly 10 * 5 * 60 = 3,000 idle pool objects are used within 5 minutes, which matches the observed connection count of 3,000+. The numbers add up, and the whole problem is finally explained. (The monitoring chart also shows that after the configuration was corrected and the service restarted at around 6 p.m. on the 21st, the connection count stabilized.)

One last question: why was FIFO queue mode chosen for the idle queue in the first place?

Because we built a feature on top of Jedis that automatically removes failed nodes and re-adds them once they recover. We originally chose FIFO queue mode so that, when a node fails, the object pool would quickly refresh its view of the whole cluster; we did not expect the choice to backfire.

The corrected Jedis configuration is as follows:

redis.max.idle.num=32768
redis.min.idle.num=30
redis.pool.behaviour=LIFO
redis.time.between.eviction.runs.seconds=1
redis.num.tests.per.eviction.run=10
redis.min.evictable.idle.time.minutes=5
redis.max.evictable.idle.time.minutes=

To sum up, this problem had two causes:

      1. The idle-queue behavior of the object pool was used incorrectly (it should be LIFO stack mode rather than FIFO)
      2. The "connection leak when the cluster link is closed by an exception" issue

Source: http://www.myexception.cn/internet/1849994.html

"The number of Redis client connections has been down" problem solving

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.