Symptom: the background subscription thread exited unexpectedly in the early morning, leaving statistics and monitoring dead for roughly 5 hours.
Log:
2015-04-13 05:00:00.256 ERROR [Message SubScribe Monitor][SubScribeManager.java:127] - 訂閱線程無故退出
com.lingyu.common.core.ServiceException: redis.clients.jedis.exceptions.JedisConnectionException: Unexpected end of stream.
	at com.lingyu.common.db.Redis.subscribe(Redis.java:1439) ~[Redis.class:?]
	at com.lingyu.common.db.SubScribeManager.run(SubScribeManager.java:125) ~[SubScribeManager.class:?]
	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_65]
Caused by: redis.clients.jedis.exceptions.JedisConnectionException: Unexpected end of stream.
	at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:198) ~[jedis-2.6.2.jar:?]
	at redis.clients.util.RedisInputStream.read(RedisInputStream.java:180) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.Protocol.processBulkReply(Protocol.java:158) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.Protocol.process(Protocol.java:132) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.Protocol.processMultiBulkReply(Protocol.java:183) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.Protocol.process(Protocol.java:134) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.Protocol.read(Protocol.java:192) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:282) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.Connection.getRawObjectMultiBulkReply(Connection.java:227) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.JedisPubSub.process(JedisPubSub.java:108) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.JedisPubSub.proceed(JedisPubSub.java:102) ~[jedis-2.6.2.jar:?]
	at redis.clients.jedis.Jedis.subscribe(Jedis.java:2496) ~[jedis-2.6.2.jar:?]
	at com.lingyu.common.db.Redis.subscribe(Redis.java:1435) ~[Redis.class:?]
	... 2 more
The subscribe call is wrapped in try{} catch(Exception e){}, yet the thread still exits, which was puzzling at first (the catch block evidently just logs the error, after which run() returns and the thread ends).
Found this issue:
https://github.com/xetorthio/jedis/issues/932
client-output-buffer-limit was the cause. redis-server closed the connections, leading to the exceptions.
client-output-buffer-limit
Client output buffer control: every connection between a client and the server has an associated buffer that queues replies waiting to be read by the client. If the client cannot consume replies fast enough, the buffer keeps growing and puts memory pressure on the server; once the buffered data reaches the configured threshold, the connection is closed and the buffer is freed.
There are three limit classes: normal for ordinary connections, slave for connections to replicas, and pubsub for pub/sub connections. The pubsub class is where this problem usually shows up, because the publisher can push messages much faster than a subscriber consumes them.
Syntax: client-output-buffer-limit <class> <hard> <soft> <seconds>. hard is the hard limit: as soon as the buffer reaches it, the connection is closed immediately. soft is a tolerated limit that works together with seconds: if the buffer stays above soft for seconds seconds, the connection is closed; if it drops back below soft before then, the connection is kept. Setting both hard and soft to 0 disables the limit for that class; normally hard is larger than soft.
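For example, the default pubsub limit (the same values CONFIG GET reports in bytes below) reads as: close the connection immediately if 32 MB of replies are pending, or if more than 8 MB stay pending for 60 seconds:
client-output-buffer-limit pubsub 32mb 8mb 60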
Adjusting the parameter in production (apply it both to the running instance and to the config file):
127.0.0.1:6380> CONFIG GET client-output-buffer-limit
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
127.0.0.1:6380> config set client-output-buffer-limit 'normal 0 0 0 slave 268435456 67108864 60 pubsub 0 0 0'
In redis.conf:
client-output-buffer-limit pubsub 0 0 0
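Separately from raising the server-side limit, the subscriber thread itself should survive a dropped connection instead of logging once and dying. Below is a minimal sketch of a re-subscribing loop; the class, channel handling, and host/port are made up for illustration and are not the actual SubScribeManager code.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;
import redis.clients.jedis.exceptions.JedisConnectionException;

// Hypothetical example: keep re-subscribing instead of letting the thread die
// when redis-server drops the connection (e.g. due to client-output-buffer-limit).
public class ResilientSubscriber implements Runnable {
    private final String host;
    private final int port;
    private final String channel;

    public ResilientSubscriber(String host, int port, String channel) {
        this.host = host;
        this.port = port;
        this.channel = channel;
    }

    @Override
    public void run() {
        JedisPubSub listener = new JedisPubSub() {
            @Override public void onMessage(String channel, String message) { /* update stats / monitoring */ }
            @Override public void onPMessage(String pattern, String channel, String message) { }
            @Override public void onSubscribe(String channel, int subscribedChannels) { }
            @Override public void onUnsubscribe(String channel, int subscribedChannels) { }
            @Override public void onPSubscribe(String pattern, int subscribedChannels) { }
            @Override public void onPUnsubscribe(String pattern, int subscribedChannels) { }
        };
        while (!Thread.currentThread().isInterrupted()) {
            Jedis jedis = new Jedis(host, port);
            try {
                jedis.subscribe(listener, channel);   // blocks until the connection breaks
            } catch (JedisConnectionException e) {
                // connection killed by the server: log and fall through to retry
            } finally {
                jedis.disconnect();
            }
            try {
                Thread.sleep(1000);                   // small back-off before reconnecting
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}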
===========================================================
Yesterday we ran into another issue: back-office event uploads for the Russia region were failing. These events are pushed out via redis publish, and the publish itself was failing: publishing small payloads by hand worked, large payloads failed, and INFO could not be viewed either.
It was confirmed that the internal LAN limited packet sizes: packets above a certain size could not be transmitted, and the connection went dead as a result. Once that restriction was lifted, the problem was resolved.
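To pin down roughly what payload size the network path tolerates, one can publish progressively larger messages and watch where the connection breaks. A small sketch follows; the host, port, and channel name are placeholders, not our real configuration.

import java.util.Arrays;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisConnectionException;

// Hypothetical probe: publish payloads of increasing size until publish fails,
// to estimate the packet size the network path will carry.
public class PublishSizeProbe {
    public static void main(String[] args) {
        String channel = "activity.update";               // placeholder channel name
        for (int size = 1024; size <= 4 * 1024 * 1024; size *= 2) {
            Jedis jedis = new Jedis("127.0.0.1", 6379);   // placeholder host/port
            try {
                char[] payload = new char[size];
                Arrays.fill(payload, 'x');                // ASCII, so chars == bytes on the wire
                Long receivers = jedis.publish(channel, new String(payload));
                System.out.println(size + " bytes ok, receivers=" + receivers);
            } catch (JedisConnectionException e) {
                System.out.println(size + " bytes failed: " + e.getMessage());
                break;
            } finally {
                jedis.disconnect();
            }
        }
    }
}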
=============================================================
6,470,672 users.
On average per user: 887 bytes of memory, 608 bytes of AOF disk usage, 344 bytes of RDB disk usage.
The OS file cache needs roughly an extra half of the memory footprint on top, so the total works out to somewhere above 10 GB.
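As a rough cross-check of those figures: 6,470,672 × 887 bytes ≈ 5.7 GB of memory, 6,470,672 × 608 bytes ≈ 3.9 GB of AOF, and 6,470,672 × 344 bytes ≈ 2.2 GB of RDB; adding about half the memory footprint again for file cache puts the combined footprint in the 10-to-15 GB range, consistent with the estimate above.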
=============================================================
Fetching slow queries
slowlog get
127.0.0.1:6379> slowlog get
1) 1) (integer) 27
   2) (integer) 1417531320
   3) (integer) 24623
   4) 1) "info"
The four fields of each entry are:
1) A unique progressive identifier for every slow log entry (its sequence number).
2) The unix timestamp at which the logged command was processed.
3) The amount of time needed for its execution, in microseconds (note: microseconds, not milliseconds).
4) The array composing the arguments of the command.
slowlog len: returns the total number of entries in the slow log.
slowlog get <number>: fetches that many of the most recent entries.
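Which commands end up in the slow log is controlled by two parameters; for example (the values below are illustrative, not our production settings):
127.0.0.1:6379> config set slowlog-log-slower-than 10000
127.0.0.1:6379> config set slowlog-max-len 128
127.0.0.1:6379> slowlog reset
slowlog-log-slower-than is in microseconds (10000 = 10 ms), slowlog-max-len caps how many entries are kept, and slowlog reset empties the log.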
=======================================
The commandstats section records execution statistics for each command type: how many times it was called (calls), the total CPU time spent on it (usec, in microseconds), and the average CPU time per call (usec_per_call, also in microseconds). Each command type gets one line of the form:
cmdstat_XXX:calls=XXX,usec=XXX,usec_per_call=XXX
10.104.5.98:6379> info commandstats
# Commandstats
cmdstat_get:calls=180608685,usec=470928529,usec_per_call=2.61
cmdstat_set:calls=147550519,usec=562225572,usec_per_call=3.81
cmdstat_del:calls=177224,usec=1643815,usec_per_call=9.28
cmdstat_exists:calls=14130110,usec=31402378,usec_per_call=2.22
cmdstat_incr:calls=1017,usec=3261,usec_per_call=3.21
cmdstat_mget:calls=666034,usec=18069595,usec_per_call=27.13
cmdstat_lpush:calls=103077132,usec=181583996,usec_per_call=1.76
cmdstat_lrange:calls=38777511,usec=138617427,usec_per_call=3.57
cmdstat_ltrim:calls=2056,usec=7622,usec_per_call=3.71
cmdstat_lrem:calls=103075076,usec=579401111,usec_per_call=5.62
cmdstat_zadd:calls=15900133,usec=56515414,usec_per_call=3.55
cmdstat_zincrby:calls=11747959,usec=196212310,usec_per_call=16.70
cmdstat_zrem:calls=257783,usec=1053833,usec_per_call=4.09
cmdstat_zrange:calls=7141527,usec=41950470,usec_per_call=5.87
cmdstat_zrevrangebyscore:calls=10,usec=51489,usec_per_call=5148.90
cmdstat_zcount:calls=16104028,usec=112221789,usec_per_call=6.97
cmdstat_zrevrange:calls=27497771,usec=582807534,usec_per_call=21.19
cmdstat_zscore:calls=8663683,usec=44001575,usec_per_call=5.08
cmdstat_zrank:calls=3,usec=43,usec_per_call=14.33
cmdstat_zrevrank:calls=15906400,usec=68891802,usec_per_call=4.33
cmdstat_hset:calls=10236125,usec=37507245,usec_per_call=3.66
cmdstat_hget:calls=1618802100,usec=2755577270,usec_per_call=1.70
cmdstat_hmset:calls=369619411,usec=4843444966,usec_per_call=13.10
cmdstat_hmget:calls=56015,usec=344231,usec_per_call=6.15
cmdstat_hincrby:calls=170633471,usec=884820311,usec_per_call=5.19
cmdstat_hdel:calls=44233,usec=201881,usec_per_call=4.56
cmdstat_hlen:calls=21724,usec=39834,usec_per_call=1.83
cmdstat_hgetall:calls=311374011,usec=3269118749,usec_per_call=10.50
cmdstat_hexists:calls=70864759,usec=285319509,usec_per_call=4.03
cmdstat_incrby:calls=2942269,usec=42251052,usec_per_call=14.36
cmdstat_decrby:calls=2050,usec=3616,usec_per_call=1.76
cmdstat_rename:calls=6472,usec=33326,usec_per_call=5.15
cmdstat_keys:calls=3636,usec=1974535725,usec_per_call=543051.62
cmdstat_dbsize:calls=9,usec=15,usec_per_call=1.67
cmdstat_ping:calls=46747,usec=61691,usec_per_call=1.32
cmdstat_type:calls=1,usec=3,usec_per_call=3.00
cmdstat_psync:calls=1,usec=3164,usec_per_call=3164.00
cmdstat_replconf:calls=21643928,usec=25568830,usec_per_call=1.18
cmdstat_info:calls=4,usec=3669,usec_per_call=917.25
cmdstat_config:calls=2,usec=37,usec_per_call=18.50
cmdstat_subscribe:calls=45505,usec=476993,usec_per_call=10.48
cmdstat_publish:calls=34572782,usec=262298295,usec_per_call=7.59
cmdstat_client:calls=3,usec=47628,usec_per_call=15876.00
cmdstat_eval:calls=2050,usec=76432,usec_per_call=37.28
cmdstat_slowlog:calls=1,usec=30,usec_per_call=30.00
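These counters accumulate from server start, so outliers stand out quickly (cmdstat_keys above averages over 0.5 s per call). To take a clean sample after making changes, the statistics can be cleared with:
10.104.5.98:6379> config resetstat
OK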
Deploying Redis 2.8.23 produces two warnings:
[32555] 09 Nov 12:06:37.804 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[32555] 09 Nov 12:06:37.804 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
Just follow the hints:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo 511 > /proc/sys/net/core/somaxconn
and add both lines to /etc/rc.local so they persist across reboots.
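Alternatively, the somaxconn value can be persisted through sysctl instead of rc.local (assuming a standard /etc/sysctl.conf setup):
echo "net.core.somaxconn = 511" >> /etc/sysctl.conf
sysctl -p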
==============================================
Redis "MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk"
1. vm.overcommit_memory=1 is not configured
2. The disk is out of space
3. The RDB file/directory was removed; the fix is to point Redis back at a valid directory:
CONFIG SET dir /data/redis/
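For cause 1, the kernel setting can be applied on the spot and persisted (a sketch; adjust paths to your environment):
sysctl -w vm.overcommit_memory=1
echo "vm.overcommit_memory = 1" >> /etc/sysctl.conf
After fixing whichever cause applies, running BGSAVE once confirms that snapshots succeed again; as a last-resort temporary workaround, CONFIG SET stop-writes-on-bgsave-error no lets writes through again, at the cost of running without working persistence.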