A Small Tragedy Caused by Big Data (Part 1)

Last updated: 2018-12-05 Source: Internet

Uploaded by: User


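A few days ago, a feature of MonitorServer was reported as not working at a customer site, so I started tracing it right away.

The feature is supposed to receive configuration parameters from the upper layer (the BS system) and, depending on the system's running state, forward them to the designated lower-layer systems (some go to embedded devices, others to other programs).

In local testing everything had been OK. So why did it fail at the customer site?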

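So I ran two tests:

(1) Retested with local data: the result was normal.

(2) Imported the relevant data from the customer site and tested with it: sure enough, it did not work.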

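Tracing showed that the problem occurred when MonitorServer forwarded the parameters to the lower layer (another program, monitord): it never received a response from monitord, so the forwarding failed.

So I went to look at monitord and found that it had actually crashed, which of course meant it would never send a response to MonitorServer.

The question was: why did monitord crash? All the earlier tests had passed.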

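After analyzing the data, I noticed one thing: the data forwarded at the customer site was far larger than the data used in local testing.

I then went through the sending code in MonitorServer and the receiving code in monitord, shown below: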

    // MonitorServer send code

    esmonitor_cfg_t m_monitord_prms;

    ...

    tmp = sock.CreateSock(pConf->m_monitordServerList[i].c_str(), ES_MONITOR_PORT, IPPROTO_TCP, CLIENT);

    ...

    if (sock.SendData((__int8*)&m_monitord_prms, sizeof(m_monitord_prms)) < 0)
    {
        printf("setAlarmPrmTask::Send2Monitord().  SendData failed.\n");
        sock.closeHandle();
        nRet = -1;
        continue;
    }

    esmonitor_cfg_resp_t resp;
    if (sock.ReceiveData((__int8*)&resp, sizeof(resp)) < 0)
    {
        printf("setAlarmPrmTask::Send2Monitord().  ReceiveData failed.\n");
        sock.closeHandle();
        nRet = -1;
        continue;
    }


    // Definition of the structure being sent

    #define MAX_CFG_NUM 1024
    typedef struct _esmonitor_cfg_t
    {
        _esmonitor_cfg_t()
        {
            memset(&header, 0, sizeof(header_t));
            i_cfg_num = 0;
            memset(&threshold, 0, sizeof(threshold_t)*MAX_CFG_NUM);

            header.i_sync = htonl(0x12345678);
            header.i_vession = htonl(0x1);
            header.i_type = htonl(ES_MONITOR_CFG_ADD);
        }

        header_t    header;
        uint32_t    i_cfg_num;
        threshold_t    threshold[MAX_CFG_NUM];
    }esmonitor_cfg_t;

The monitord receive code:

    static uint8_t p_recv_buf[1500];
    while (1)
    {
        sock_accept = (SOCKET)accept(sock, (SOCKADDR*)&addr_from, &i_len);
        if (sock_accept > 0)
        {
            i_recv_size = recv(sock_accept, &p_recv_buf, sizeof(p_recv_buf), 0);
            if (i_recv_size > 0)
            {
                p_header = p_recv_buf;
                p_header->i_type = ntohl(p_header->i_type);

                ...
            }
        }
    }

Spot the problem?

monitord's receive buffer is only 1500 bytes, while the structure MonitorServer sends is far larger! Measured, sizeof (m_monitord_prms) is more than 6000 bytes!
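To make the mismatch concrete, here is a minimal C sketch. The layouts of header_t and threshold_t are assumptions: the article never shows them, and only the field names i_sync, i_vession, i_type, i_alarm_id and i_alarm_delay appear in the code. The resulting size is therefore illustrative and will differ from the one measured on the real structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts: the real header_t and threshold_t are not shown. */
typedef struct { uint32_t i_sync; uint32_t i_vession; uint32_t i_type; } header_t;
typedef struct { uint32_t i_alarm_id; uint32_t i_alarm_delay; } threshold_t;

#define MAX_CFG_NUM   1024
#define RECV_BUF_SIZE 1500   /* size of p_recv_buf in monitord */

typedef struct {
    header_t    header;
    uint32_t    i_cfg_num;
    threshold_t threshold[MAX_CFG_NUM];
} esmonitor_cfg_t;

/* Compile-time statement of the bug: one message is larger than the buffer. */
static_assert(sizeof(esmonitor_cfg_t) > RECV_BUF_SIZE,
              "one message does not fit in monitord's receive buffer");

/* Number of bytes of each message that can never fit in the buffer. */
static size_t bytes_that_do_not_fit(void)
{
    return sizeof(esmonitor_cfg_t) > RECV_BUF_SIZE
         ? sizeof(esmonitor_cfg_t) - RECV_BUF_SIZE
         : 0;
}
```

With this assumed layout and no padding, sizeof(esmonitor_cfg_t) comes to 8208 bytes, so 6708 bytes of every message have nowhere to go.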

Then why was there no crash with small amounts of data, and only with large amounts?

Let's analyze.

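First, in the structure esmonitor_cfg_t defined by the sender, the first few fields have a fixed size, followed by an array of 1024 elements (each element holds one set of configuration parameters); the field i_cfg_num specifies how many elements are actually valid. So every send transmits sizeof (m_monitord_prms) bytes, roughly 6000 bytes (let's assume 6000), no matter how many elements are in use.

Next, the receive buffer defined by the receiver is uint8_t p_recv_buf[1500], i.e. 1500 bytes.

As a result, the receiver's single recv call picks up at most the first 1500 of the 6000 bytes the sender transmits.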
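The truncation can be reproduced in isolation. The sketch below is not the article's code: it uses a POSIX socketpair as a local stand-in for the TCP connection (an assumption, since the original code uses Winsock-style types), and single_recv_demo is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sends msg_len bytes over a local stream socket pair, then performs ONE
 * recv() into a buffer of buf_len bytes and returns what that call reports.
 * Returns -1 if anything fails along the way.
 */
static ssize_t single_recv_demo(size_t msg_len, size_t buf_len)
{
    int fds[2];
    ssize_t n = -1;
    char *msg;
    char *buf;

    assert(msg_len > 0 && buf_len > 0);
    msg = calloc(1, msg_len);   /* zero-filled payload */
    buf = malloc(buf_len);

    if (msg != NULL && buf != NULL &&
        socketpair(AF_UNIX, SOCK_STREAM, 0, fds) == 0) {
        if (send(fds[0], msg, msg_len, 0) == (ssize_t)msg_len)
            n = recv(fds[1], buf, buf_len, 0);  /* never exceeds buf_len */
        close(fds[0]);
        close(fds[1]);
    }
    free(msg);
    free(buf);
    return n;
}
```

Calling single_recv_demo(6000, 1500) never reports more than 1500 bytes: the rest of the message stays queued in the socket and is simply never read by code that calls recv only once.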

After receiving these 1500 bytes, monitord processes them as follows:

    for (i = 0; i < p_cfg->i_cfg_num; i++)
    {
        p_threshold = p_cfg->threshold + i;
        p_threshold->i_alarm_delay = ntohl(p_threshold->i_alarm_delay);
        p_threshold->i_alarm_id = ntohl(p_threshold->i_alarm_id);
        ...
    }

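When the i_cfg_num specified by the sender is small, the receiver has only part of the data, but monitord never touches the missing part.

But once i_cfg_num points to entries that lie beyond the 1500 bytes received, p_threshold runs past the end of the array and becomes a dangerous "wild pointer": the loop reads and writes memory outside p_recv_buf, and the program crashes.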
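The threshold between "works" and "crashes" can be computed, and the same computation can serve as a guard before the loop runs. This is a sketch under the same assumed layouts for header_t and threshold_t as before (the article does not show them); max_entries_in and cfg_num_is_safe are hypothetical helper names, not part of monitord.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CFG_NUM 1024

/* Hypothetical layouts: the real header_t and threshold_t are not shown. */
typedef struct { uint32_t i_sync; uint32_t i_vession; uint32_t i_type; } header_t;
typedef struct { uint32_t i_alarm_id; uint32_t i_alarm_delay; } threshold_t;

typedef struct {
    header_t    header;
    uint32_t    i_cfg_num;
    threshold_t threshold[MAX_CFG_NUM];
} esmonitor_cfg_t;

static_assert(sizeof(threshold_t) == 8, "assumed layout has no padding");

/* Number of complete threshold entries contained in `received` bytes. */
static size_t max_entries_in(size_t received)
{
    size_t fixed = offsetof(esmonitor_cfg_t, threshold);

    if (received < fixed)
        return 0;
    return (received - fixed) / sizeof(threshold_t);
}

/* Nonzero if looping over i_cfg_num entries stays inside the received data. */
static int cfg_num_is_safe(uint32_t i_cfg_num, size_t received)
{
    return i_cfg_num <= MAX_CFG_NUM && i_cfg_num <= max_entries_in(received);
}
```

With the assumed layout, 1500 received bytes hold 185 complete entries, so a message with i_cfg_num of 185 or less stays in bounds and anything larger walks off the end of the buffer. Rejecting a message when cfg_num_is_safe returns 0 turns the crash into a reportable error.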

With the cause identified, the problem is easy to fix: enlarge monitord's receive buffer so that it is no smaller than the sender's structure.
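One caveat when applying this fix: TCP is a byte stream, so even with a large enough buffer a single recv call may return fewer bytes than the sender wrote. A common complement is to keep reading until the whole fixed-size message has arrived. The sketch below assumes POSIX sockets, and recv_exact is a hypothetical helper, not something from monitord.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Reads exactly `len` bytes from `fd` into `buf`, calling recv() as many
 * times as needed. Returns 0 on success, -1 if the peer closes the
 * connection early or a real error occurs.
 */
static int recv_exact(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t got = 0;

    assert(buf != NULL);
    while (got < len) {
        ssize_t n = recv(fd, p + got, len - got, 0);

        if (n > 0)
            got += (size_t)n;      /* partial read: keep going */
        else if (n == 0)
            return -1;             /* connection closed before `len` bytes */
        else if (errno != EINTR)
            return -1;             /* real error; EINTR is simply retried */
    }
    return 0;
}
```

On the receiving side this would be used with a buffer of at least sizeof(esmonitor_cfg_t) bytes, for example recv_exact(sock_accept, p_recv_buf, sizeof(esmonitor_cfg_t)), before any field of the message is interpreted.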

----------------------------------------------------------------------------------

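P.S. MonitorServer and monitord are maintained by different people, and there was no coordination beforehand, which is how a problem like this came about.

I couldn't help but let out a yell: what a trap...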

