A Case Study of Intermittent Redis Connection Failures


【Authors】

Zhang Yanjun: Senior DBA at Ctrip's Technical Assurance Center, with a strong interest in database architecture and in analyzing and troubleshooting difficult problems.
Shou Xiangchen: Senior DBA at Ctrip's Technical Assurance Center, mainly responsible for operating Ctrip's Redis and database services. He has extensive hands-on experience in operations automation, process standardization, and monitoring and troubleshooting, and enjoys digging into problems to improve the team's operational efficiency.

【Problem Description】

A Redis instance in production occasionally reported connection failures. There was no obvious pattern in the timing of the errors or in the client IPs involved, and after a short while the errors would clear up on their own.
Here is the client-side error message:

CRedis.Client.RExceptions.ExcuteCommandException: Unable to Connect redis server: ---> CRedis.Third.Redis.RedisException: Unable to Connect redis server:
   at CRedis.Third.Redis.RedisNativeClient.CreateConnectionError()
   at CRedis.Third.Redis.RedisNativeClient.SendExpectData(Byte[][] cmdWithBinaryArgs)
   at CRedis.Client.Entities.RedisServer.<>c__DisplayClassd`1.

Judging from the error message, the client could not connect to Redis. The Redis version was 2.8.19; although somewhat old, it had been running stably.
Only this cluster in the production environment reported these occasional errors. One notable characteristic of this cluster is its large number of client servers: over a hundred.

【Problem Analysis】

The error message tells us the client could not reach the server. The common causes are:

  • One common cause is port exhaustion. We examined the network connections and found that at the time of the incidents the number of TCP connections was far below the point of port exhaustion, so this was not the root cause of the failures.
  • Another common scenario is a slow query on the server side blocking the Redis service. We captured every command on the Redis server that took longer than 10 ms to run and found none (see the sketch after this list).
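
For reference, one common way to run this kind of check is Redis's SLOWLOG facility (slowlog-log-slower-than is expressed in microseconds, so 10 ms is 10000). The article does not say which tool was actually used; the sketch below is our own illustration using hiredis, with the host and port as placeholders.

#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void) {
    /* Connect to the Redis instance under investigation (placeholder address). */
    redisContext *ctx = redisConnect("127.0.0.1", 6379);
    if (ctx == NULL || ctx->err) {
        fprintf(stderr, "connect failed: %s\n", ctx ? ctx->errstr : "out of memory");
        return 1;
    }

    /* Log every command that takes longer than 10 ms (10000 microseconds). */
    redisReply *reply = redisCommand(ctx, "CONFIG SET slowlog-log-slower-than 10000");
    if (reply) freeReplyObject(reply);

    /* Fetch the most recent slow-log entries; an empty array means nothing
     * slower than the threshold was captured. */
    reply = redisCommand(ctx, "SLOWLOG GET 10");
    if (reply && reply->type == REDIS_REPLY_ARRAY)
        printf("%zu slow commands captured\n", reply->elements);
    if (reply) freeReplyObject(reply);

    redisFree(ctx);
    return 0;
}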

The server-side monitoring showed that at the time of the problem the connection count spiked suddenly, from about 3,500 connections to about 4,100.

At the same time, the server-side statistics showed the Redis server dropping packets: 345539 - 344683 = 856 packets within that second.

Sat Apr  7 10:41:40 CST 2018
    1699 outgoing packets dropped
    92 dropped because of missing route
    344683 SYNs to LISTEN sockets dropped
    344683 times the listen queue of a socket overflowed
Sat Apr  7 10:41:41 CST 2018
    1699 outgoing packets dropped
    92 dropped because of missing route
    345539 SYNs to LISTEN sockets dropped
    345539 times the listen queue of a socket overflowed

This basically confirmed the cause of the client errors: connections were being established so fast that the server's backlog queue overflowed, and the server reset the connections.

【About backlog overflow】

In high-concurrency, short-connection services this is a very common type of TCP error. A normal TCP connection is established as follows:

1. The client sends a SYN to the server.

2. The server replies with a SYN,ACK.

3. The client replies with an ACK.

At this point the three-way handshake is complete. From the client's perspective the connection has been established and it can go on sending data packets to the server, but the server may not actually be ready yet:

In the TCP implementation of BSD-derived kernels, the server side uses two queues during connection establishment: a SYN queue and an accept queue. The former is the half-open (half-connection) queue; an entry is added to it when the server receives the client's SYN. (A common network attack is to keep sending SYNs without ever sending the final ACK, overflowing the server's half-open queue until it denies service.) The latter is the full-connection queue. After the server returns SYN,ACK and then receives the client's ACK (at which point the client considers the connection established and may start sending PSH packets), the server moves the connection from the SYN queue to the accept queue, provided the accept queue is not full. If the accept queue has overflowed, the server's behavior depends on configuration: if tcp_abort_on_overflow is 0 (the default), the server simply drops the client's PSH packet, the client goes into retransmission, and after a while the server resends the SYN,ACK, restarting from step 2 of the handshake; if tcp_abort_on_overflow is 1, the server sends a RST as soon as it finds the accept queue full.
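
To make the role of the backlog concrete, here is a minimal sketch (our own illustration, not taken from the article) of a plain POSIX TCP server. The backlog argument to listen() caps the accept queue of that listening socket (the kernel further caps it at net.core.somaxconn); connections that complete the handshake while that queue is full are the ones counted as "listen queue overflows" in the netstat output above.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(6380);              /* arbitrary port for the sketch */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }

    /* 511 mirrors Redis's default tcp-backlog. The effective accept-queue limit
     * is min(backlog, net.core.somaxconn); while that queue is full, fully
     * handshaked connections are dropped (or reset, depending on
     * tcp_abort_on_overflow). */
    if (listen(fd, 511) < 0) { perror("listen"); return 1; }

    for (;;) {
        /* Each accept() removes one established connection from the accept
         * queue. If the process stalls and stops calling accept(), the queue
         * fills up no matter how large the backlog is. */
        int cfd = accept(fd, NULL, NULL);
        if (cfd < 0) { perror("accept"); continue; }
        close(cfd);
    }
}

The key point is that the backlog only buffers established connections waiting for accept(); if the application stops draining it for even a second or two, a sufficiently high connection rate will overflow any reasonable backlog size.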

With Wireshark we found more than 2,000 connection attempts to the Redis server within a single second. We tried increasing the TCP backlog from 511 to 2048, but the problem was not solved, so this kind of tuning alone could not fix the issue.

【Network Packet Analysis】

We used Wireshark to pin down the exact time and cause of the network congestion. Since we already had the exact time of the errors, we first used editcap to cut the huge capture file into 30-second slices, and then used Wireshark's I/O Graph with a 100 ms interval to locate the precise moment the traffic stalled:

The I/O graph clearly shows a block in the back-and-forth TCP packet flow.

We then examined the packets immediately before and after the block in detail. The traffic looked like this:

Time              Source        Dest     Description
12:01:54.6536050  Redis-Server  Clients  TCP:Flags=...AP...
12:01:54.6538580  Redis-Server  Clients  TCP:Flags=...AP...
12:01:54.6539770  Redis-Server  Clients  TCP:Flags=...AP...
12:01:54.6720580  Redis-Server  Clients  TCP:Flags=...A..S..
12:01:54.6727200  Redis-Server  Clients  TCP:Flags=...A....
12:01:54.6808480  Redis-Server  Clients  TCP:Flags=...AP...
12:01:54.6910840  Redis-Server  Clients  TCP:Flags=...A..S..
12:01:54.6911950  Redis-Server  Clients  TCP:Flags=...A....
...               ...           ...      ...
12:01:56.1181350  Redis-Server  Clients  TCP:Flags=...AP...

At 12:01:54.6808480 the Redis server sent a PSH packet to a client, i.e. the result of a query. Every packet after that is connection handling: ACKs, ACK confirmations, and RST resets. The next PSH packet was not sent until 12:01:56.1181350, a gap of 1.4372870 seconds. In other words, during those 1.4372870 seconds the Redis server served a single query and otherwise did nothing but establish or refuse connections.

The sequence behind the client errors was now clear: redis-server stalled for 1.43 seconds, the clients' connection pools were exhausted, the clients frantically opened new connections, the server's accept queue filled up, the server refused the connections, and the clients reported errors. We first suspected that a client had sent some special command, so we needed to see what the last few client commands were. We located the packet just before redis-server froze, installed a Redis dissector plugin for Wireshark, and saw that the last few commands were simple GETs with small keys and values, nothing that should take 1.43 seconds to complete. The server had no slow log either, so the investigation hit a dead end again.

【Further Analysis】

To find out what the Redis server was doing during those 1.43 seconds, we used pstack to capture stack traces. pstack is essentially gdb attach, and sampling at a high frequency hurts Redis throughput, so we simply looped it, taking one sample every 0.5 seconds. When redis-server froze, we captured the following stacks (irrelevant frames filtered out):

Thu May 31 11:29:18 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0  0x000000000048cec4 in ?? ()
#1  0x00000000004914a4 in je_arena_ralloc ()
#2  0x00000000004836a1 in je_realloc ()
#3  0x0000000000422cc5 in zrealloc ()
#4  0x00000000004213d7 in sdsRemoveFreeSpace ()
#5  0x000000000041ef3c in clientsCronResizeQueryBuffer ()
#6  0x00000000004205de in clientsCron ()
#7  0x0000000000420784 in serverCron ()
#8  0x0000000000418542 in aeProcessEvents ()
#9  0x000000000041873b in aeMain ()
#10 0x0000000000420fce in main ()

Thu May 31 11:29:19 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0  0x0000003729ee5407 in madvise () from /lib64/libc.so.6
#1  0x0000000000493a4e in je_pages_purge ()
#2  0x000000000048cf70 in ?? ()
#3  0x00000000004914a4 in je_arena_ralloc ()
#4  0x00000000004836a1 in je_realloc ()
#5  0x0000000000422cc5 in zrealloc ()
#6  0x00000000004213d7 in sdsRemoveFreeSpace ()
#7  0x000000000041ef3c in clientsCronResizeQueryBuffer ()
#8  0x00000000004205de in clientsCron ()
#9  0x0000000000420784 in serverCron ()
#10 0x0000000000418542 in aeProcessEvents ()
#11 0x000000000041873b in aeMain ()
#12 0x0000000000420fce in main ()

Thu May 31 11:29:19 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0  0x000000000048108c in je_malloc_usable_size ()
#1  0x0000000000422be6 in zmalloc ()
#2  0x00000000004220bc in sdsnewlen ()
#3  0x000000000042c409 in createStringObject ()
#4  0x000000000042918e in processMultibulkBuffer ()
#5  0x0000000000429662 in processInputBuffer ()
#6  0x0000000000429762 in readQueryFromClient ()
#7  0x000000000041847c in aeProcessEvents ()
#8  0x000000000041873b in aeMain ()
#9  0x0000000000420fce in main ()

Thu May 31 11:29:20 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0  0x000000372a60e7cd in write () from /lib64/libpthread.so.0
#1  0x0000000000428833 in sendReplyToClient ()
#2  0x0000000000418435 in aeProcessEvents ()
#3  0x000000000041873b in aeMain ()
#4  0x0000000000420fce in main ()

After repeating the capture several times, one suspicious frame kept showing up: clientsCronResizeQueryBuffer, called from serverCron(). This is an internal periodic task of redis-server, not code running on behalf of a user command, which explains why no slow query appeared while the server was stuck.

We checked the Redis source code to confirm what exactly redis-server was doing:

clientsCron (server.c):

#define CLIENTS_CRON_MIN_ITERATIONS 5
void clientsCron(void) {
    /* Make sure to process at least numclients/server.hz of clients
     * per call. Since this function is called server.hz times per second
     * we are sure that in the worst case we process all the clients in 1
     * second. */
    int numclients = listLength(server.clients);
    int iterations = numclients/server.hz;
    mstime_t now = mstime();

    /* Process at least a few clients while we are at it, even if we need
     * to process less than CLIENTS_CRON_MIN_ITERATIONS to meet our contract
     * of processing each client once per second. */
    if (iterations < CLIENTS_CRON_MIN_ITERATIONS)
        iterations = (numclients < CLIENTS_CRON_MIN_ITERATIONS) ?
                     numclients : CLIENTS_CRON_MIN_ITERATIONS;

    while(listLength(server.clients) && iterations--) {
        client *c;
        listNode *head;

        /* Rotate the list, take the current head, process.
         * This way if the client must be removed from the list it's the
         * first element and we don't incur into O(N) computation. */
        listRotate(server.clients);
        head = listFirst(server.clients);
        c = listNodeValue(head);

        /* The following functions do different service checks on the client.
         * The protocol is that they return non-zero if the client was
         * terminated. */
        if (clientsCronHandleTimeout(c,now)) continue;
        if (clientsCronResizeQueryBuffer(c)) continue;
    }
}

clientsCron first looks at the current number of clients to decide how many connections to process in one call. Our production instance had just under 5,000 connections per instance, which works out to roughly 50 connections handled per call.
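
For reference, the per-call batch size is simply numclients/server.hz (with a floor of CLIENTS_CRON_MIN_ITERATIONS). Below is a small arithmetic sketch of our own, not Redis code; note that ~50 connections per call with ~5,000 clients corresponds to server.hz being set to around 100 on that instance (an assumption on our part; at the default hz of 10 the batch would be roughly 500).

#include <stdio.h>

#define CLIENTS_CRON_MIN_ITERATIONS 5

/* Mirrors the batching arithmetic at the top of clientsCron(). */
static int clients_per_cron_call(int numclients, int hz) {
    int iterations = numclients / hz;
    if (iterations < CLIENTS_CRON_MIN_ITERATIONS)
        iterations = (numclients < CLIENTS_CRON_MIN_ITERATIONS) ?
                     numclients : CLIENTS_CRON_MIN_ITERATIONS;
    return iterations;
}

int main(void) {
    int numclients = 5000;            /* roughly the affected instance's client count */
    int hz_values[] = {10, 100};      /* default hz vs. a raised hz (assumption) */
    for (int i = 0; i < 2; i++)
        printf("hz=%-3d -> about %d clients handled per clientsCron() call\n",
               hz_values[i], clients_per_cron_call(numclients, hz_values[i]));
    return 0;
}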

clientsCronResizeQueryBuffer (server.c):

/* The client query buffer is an sds.c string that can end with a lot of
 * free space not used, this function reclaims space if needed.
 *
 * The function always returns 0 as it never terminates the client. */
int clientsCronResizeQueryBuffer(client *c) {
    size_t querybuf_size = sdsAllocSize(c->querybuf);
    time_t idletime = server.unixtime - c->lastinteraction;

    /* The query buffer is resized only in the following two cases:
     * 1) Query buffer > BIG_ARG (defined in server.h as
     *    #define PROTO_MBULK_BIG_ARG (1024*32))
     *    and the buffer is more than twice the client's recent peak usage.
     * 2) The client has been idle for more than 2 seconds and the buffer
     *    size is greater than 1k. */
    if (((querybuf_size > PROTO_MBULK_BIG_ARG) &&
         (querybuf_size/(c->querybuf_peak+1)) > 2) ||
         (querybuf_size > 1024 && idletime > 2))
    {
        /* Only resize the query buffer if it is actually wasting space. */
        if (sdsavail(c->querybuf) > 1024) {
            c->querybuf = sdsRemoveFreeSpace(c->querybuf);
        }
    }
    /* Reset the peak again to capture the peak memory usage in the next
     * cycle. */
    c->querybuf_peak = 0;
    return 0;
}

If a redisClient object's query buffer meets the conditions, it is resized on the spot. Two kinds of connections qualify: those whose buffer is genuinely large, more than twice the peak the client has used recently, and those that have been idle for more than 2 seconds; in both cases the free part of the buffer must also exceed 1 KB. So the reason redis-server stalled is that it happened to loop over about 50 connections that were either large or idle, each with more than 1 KB of free buffer space, and resized them all in one go. Since Redis is a single-threaded program, this blocked all clients. The way to fix it is therefore clear: make the resize happen less often, or make the resize itself faster.

Since the problem lies in the query buffer, let's first look at where it gets modified:

createClient (networking.c):

redisClient *createClient(int fd) {
    redisClient *c = zmalloc(sizeof(redisClient));

    /* passing -1 as fd it is possible to create a non connected client.
     * This is useful since all the Redis commands needs to be executed
     * in the context of a client. When commands are executed in other
     * contexts (for instance a Lua script) we need a non connected client. */
    if (fd != -1) {
        anetNonBlock(NULL,fd);
        anetEnableTcpNoDelay(NULL,fd);
        if (server.tcpkeepalive)
            anetKeepAlive(NULL,fd,server.tcpkeepalive);
        if (aeCreateFileEvent(server.el,fd,AE_READABLE,
            readQueryFromClient, c) == AE_ERR)
        {
            close(fd);
            zfree(c);
            return NULL;
        }
    }

    selectDb(c,0);
    c->id = server.next_client_id++;
    c->fd = fd;
    c->name = NULL;
    c->bufpos = 0;
    c->querybuf = sdsempty();        /* initialized empty: size 0 */

readQueryFromClient (networking.c):

void readQueryFromClient(aeEventLoop *el, int fd, void *privdata, int mask) {
    redisClient *c = (redisClient*) privdata;
    int nread, readlen;
    size_t qblen;
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(mask);

    server.current_client = c;
    readlen = REDIS_IOBUF_LEN;
    /* If this is a multi bulk request, and we are processing a bulk reply
     * that is large enough, try to maximize the probability that the query
     * buffer contains exactly the SDS string representing the object, even
     * at the risk of requiring more read(2) calls. This way the function
     * processMultiBulkBuffer() can avoid copying buffers to create the
     * Redis Object representing the argument. */
    if (c->reqtype == REDIS_REQ_MULTIBULK && c->multibulklen && c->bulklen != -1
        && c->bulklen >= REDIS_MBULK_BIG_ARG)
    {
        int remaining = (unsigned)(c->bulklen+2)-sdslen(c->querybuf);

        if (remaining < readlen) readlen = remaining;
    }

    qblen = sdslen(c->querybuf);
    if (c->querybuf_peak < qblen) c->querybuf_peak = qblen;
    c->querybuf = sdsMakeRoomFor(c->querybuf, readlen);    /* the buffer is grown here */

So c->querybuf is grown to at least 1024*32 bytes as soon as the connection reads its first command (readQueryFromClient requests REDIS_IOBUF_LEN, i.e. 16 KB, and sdsMakeRoomFor pre-allocates double that amount for buffers below SDS_MAX_PREALLOC). Looking back at the cleanup logic, the problem is obvious: every query buffer that has actually been used is at least 1024*32 bytes, but the cleanup threshold is >1024 bytes, so every connection that has been used and then sits idle for more than 2 seconds gets resized, only to be re-grown to 1024*32 the next time a request arrives. This is unnecessary, and on a frequently accessed cluster that memory is constantly reclaimed and re-allocated. We therefore changed the cleanup condition as follows, which avoids most of the unnecessary resize operations:

if (((querybuf_size > REDIS_MBULK_BIG_ARG) &&
     (querybuf_size/(c->querybuf_peak+1)) > 2) ||
     (querybuf_size > 1024*32 && idletime > 2))
{
    /* Only resize the query buffer if it is actually wasting space. */
    if (sdsavail(c->querybuf) > 1024*32) {
        c->querybuf = sdsRemoveFreeSpace(c->querybuf);
    }
}

The side effect of this change is memory overhead: at 5,000 connections per instance, 5000 * 1024 * 32 = roughly 160 MB, which is perfectly acceptable on servers with hundreds of GB of memory.

【Problem Recurrence】

After deploying the Redis server built from the modified source, the problem still recurred: clients reported the same kind of error, and the server's memory still jittered when the errors occurred. The captured stacks looked like this:

Thu Jun 14 21:56:54 CST 2018
#3  0x0000003729ee893d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f2dc108d720 (LWP 27851)):
#0  0x0000003729ee5400 in madvise () from /lib64/libc.so.6
#1  0x0000000000493a1e in je_pages_purge ()
#2  0x000000000048cf40 in arena_purge ()
#3  0x00000000004a7dad in je_tcache_bin_flush_large ()
#4  0x00000000004a85e9 in je_tcache_event_hard ()
#5  0x000000000042c0b5 in decrRefCount ()
#6  0x000000000042744d in resetClient ()
#7  0x000000000042963b in processInputBuffer ()
#8  0x0000000000429762 in readQueryFromClient ()
#9  0x000000000041847c in aeProcessEvents ()
#10 0x000000000041873b in aeMain ()
#11 0x0000000000420fce in main ()

Thu Jun 14 21:56:54 CST 2018
Thread 1 (Thread 0x7f2dc108d720 (LWP 27851)):
#0  0x0000003729ee5400 in madvise () from /lib64/libc.so.6
#1  0x0000000000493a1e in je_pages_purge ()
#2  0x000000000048cf40 in arena_purge ()
#3  0x00000000004a7dad in je_tcache_bin_flush_large ()
#4  0x00000000004a85e9 in je_tcache_event_hard ()
#5  0x000000000042c0b5 in decrRefCount ()
#6  0x000000000042744d in resetClient ()
#7  0x000000000042963b in processInputBuffer ()
#8  0x0000000000429762 in readQueryFromClient ()
#9  0x000000000041847c in aeProcessEvents ()
#10 0x000000000041873b in aeMain ()
#11 0x0000000000420fce in main ()

Clearly the frequent query-buffer resizing had been optimized away, yet the client errors still occurred, and we were at an impasse once more. Was there some other factor making the query buffer resize slow? We captured pstack output again, and this time jemalloc caught our attention. Recall Redis's memory allocation scheme: to avoid the severe fragmentation caused by libc not releasing memory, Redis uses jemalloc as its default memory allocator. In the stacks captured during these errors, every trace was inside je_pages_purge(), i.e. Redis calling into jemalloc to reclaim dirty pages. Let's look at what jemalloc does there:

arena_purge (arena.c):

static void
arena_purge(arena_t *arena, bool all)
{
    arena_chunk_t *chunk;
    size_t npurgatory;
    if (config_debug) {
        size_t ndirty = 0;

        arena_chunk_dirty_iter(&arena->chunks_dirty, NULL,
            chunks_dirty_iter_cb, (void *)&ndirty);
        assert(ndirty == arena->ndirty);
    }
    assert(arena->ndirty > arena->npurgatory || all);
    assert((arena->nactive >> opt_lg_dirty_mult) < (arena->ndirty -
        arena->npurgatory) || all);

    if (config_stats)
        arena->stats.npurge++;

    npurgatory = arena_compute_npurgatory(arena, all);
    arena->npurgatory += npurgatory;

    while (npurgatory > 0) {
        size_t npurgeable, npurged, nunpurged;

        /* Get next chunk with dirty pages. */
        chunk = arena_chunk_dirty_first(&arena->chunks_dirty);
        if (chunk == NULL) {
            arena->npurgatory -= npurgatory;
            return;
        }
        npurgeable = chunk->ndirty;
        assert(npurgeable != 0);

        if (npurgeable > npurgatory && chunk->nruns_adjac == 0) {
            arena->npurgatory += npurgeable - npurgatory;
            npurgatory = npurgeable;
        }

        arena->npurgatory -= npurgeable;
        npurgatory -= npurgeable;
        npurged = arena_chunk_purge(arena, chunk, all);
        nunpurged = npurgeable - npurged;
        arena->npurgatory += nunpurged;
        npurgatory += nunpurged;
    }
}

On every purge, jemalloc walks and counts all the chunks that actually need cleaning, which is an expensive operation for a system with tight latency requirements, so we considered upgrading jemalloc to improve purge performance. Redis 4.0 brought significant performance improvements and added a command for reclaiming memory, and we were already preparing to upgrade to it in production; the jemalloc version shipped with 4.0 is 4.1. In jemalloc 4.0 and later, arena_purge() was heavily optimized: the counter call was removed, much of the branching logic was simplified, an arena_stash_dirty() function was added to merge the previous counting and decision logic, a purge_runs_sentinel was introduced, dirty runs are now kept in a per-arena LRU instead of in the dirty-run-containing chunks of the arena tree (greatly reducing the volume of dirty pages purged), and memory blocks are no longer moved during reclamation. The code is as follows:

arena_purge (arena.c):

static void
arena_purge(arena_t *arena, bool all)
{
    chunk_hooks_t chunk_hooks = chunk_hooks_get(arena);
    size_t npurge, npurgeable, npurged;
    arena_runs_dirty_link_t purge_runs_sentinel;
    extent_node_t purge_chunks_sentinel;

    arena->purging = true;

    /*
     * Calls to arena_dirty_count() are disabled even for debug builds
     * because overhead grows nonlinearly as memory usage increases.
     */
    if (false && config_debug) {
        size_t ndirty = arena_dirty_count(arena);
        assert(ndirty == arena->ndirty);
    }
    assert((arena->nactive >> arena->lg_dirty_mult) < arena->ndirty || all);

    if (config_stats)
        arena->stats.npurge++;

    npurge = arena_compute_npurge(arena, all);
    qr_new(&purge_runs_sentinel, rd_link);
    extent_node_dirty_linkage_init(&purge_chunks_sentinel);

    npurgeable = arena_stash_dirty(arena, &chunk_hooks, all, npurge,
        &purge_runs_sentinel, &purge_chunks_sentinel);
    assert(npurgeable >= npurge);
    npurged = arena_purge_stashed(arena, &chunk_hooks, &purge_runs_sentinel,
        &purge_chunks_sentinel);
    assert(npurged == npurgeable);
    arena_unstash_purged(arena, &chunk_hooks, &purge_runs_sentinel,
        &purge_chunks_sentinel);

    arena->purging = false;
}

【Solving the Problem】

We actually had several options: replace jemalloc with Google's tcmalloc, upgrade jemalloc, and so on. Based on the analysis above, we chose to upgrade jemalloc, in practice by upgrading Redis itself. After upgrading Redis to 4.0.9 and observing for a while, the thorny client connection timeout problem in production was gone.
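
As an aside, the memory-reclaim command mentioned above for Redis 4.0 is MEMORY PURGE, which asks the jemalloc allocator to release dirty pages on demand. The article does not show its use; the sketch below is our own illustration with hiredis, and the host and port are placeholders.

#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void) {
    /* Connect to a local Redis 4.0+ instance (placeholder address). */
    redisContext *ctx = redisConnect("127.0.0.1", 6379);
    if (ctx == NULL || ctx->err) {
        fprintf(stderr, "connect failed: %s\n", ctx ? ctx->errstr : "out of memory");
        return 1;
    }

    /* MEMORY PURGE (Redis >= 4.0, jemalloc builds only) asks the allocator to
     * give unused dirty pages back to the OS outside the normal purge path. */
    redisReply *reply = redisCommand(ctx, "MEMORY PURGE");
    if (reply == NULL) {
        fprintf(stderr, "command failed: %s\n", ctx->errstr);
        redisFree(ctx);
        return 1;
    }
    printf("MEMORY PURGE -> %s\n",
           reply->type == REDIS_REPLY_STATUS ? reply->str : "unexpected reply");

    freeReplyObject(reply);
    redisFree(ctx);
    return 0;
}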

【Summary】

Redis is widely used in production because of its support for high concurrency, its fast response times, and its ease of operation. For operations engineers, its latency requirements give rise to all sorts of problems, and connection timeouts are among the more typical. From discovering the problem (client connection timeouts), to locating it through client and server packet captures and memory stack traces, we were misled by a few red herrings along the way, and finally solved it by upgrading jemalloc (via Redis). The most valuable takeaway from this case is the overall approach to the analysis.
