Troubleshooting processing latency in a real-time computing engine

Author: those things | This article may be reproduced. Please credit the original source and author with a hyperlink.

Web: http://www.cnblogs.com/panfeng412/archive/2012/03/26/real-time-computing-engine-processing-delay-troubleshoot.html

Recommended reading: Debug Hacks

When processing real-time data, a real-time computing engine must keep up with newly arriving data. For example, if website access logs arrive as one log file per minute, the engine must finish processing each file within one minute; otherwise log files accumulate and cannot be processed in time. A few days ago, the quantum back-end team investigated a processing-latency fault in a real-time computing engine. The ltrace and strace tools were used along the way, so I would like to share the experience here.

1. Fault description

Because a large amount of unexpected data arrived, the speed at which the real-time computing engine processed the per-minute log files dropped sharply and significant latency built up: each one-minute log file took two minutes or more to process on average, and up to five minutes in the worst case. As a result, log files accumulated and could not be processed in time.
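
To make the accumulation concrete, here is a minimal sketch of the arithmetic. It assumes one file arrives per interval and files are processed strictly one after another; the function name `completion_lags` and the simplified arrival model are illustrative assumptions, not part of the engine.

```python
from typing import List


def completion_lags(processing_minutes: List[float],
                    arrival_interval: float = 1.0) -> List[float]:
    """Return, for each file, how long after its arrival it finished processing.

    Assumption: file i arrives at i * arrival_interval minutes and the engine
    handles files sequentially, one at a time.
    """
    lags = []
    finished_at = 0.0
    for i, cost in enumerate(processing_minutes):
        arrived_at = i * arrival_interval
        # A file cannot start before it arrives or before the previous one ends.
        started_at = max(arrived_at, finished_at)
        finished_at = started_at + cost
        lags.append(finished_at - arrived_at)
    return lags


# Two minutes of work per one-minute file: the lag grows by a minute per file.
print(completion_lags([2.0] * 5))  # [2.0, 3.0, 4.0, 5.0, 6.0]
```

When each file costs less than the arrival interval, the lag stays constant instead of growing, which is the healthy case described in the introduction.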

2. Fault analysis

When processing log files, the real-time computing engine reads each per-minute log file into memory in sequence for computation (the compute process). After each file is processed, the engine rotates and switches logs (the rotate process).

The engine computes entirely in memory. Both processing the per-minute log files and rotating involve frequent memory operations, namely allocation and release (memory management is handled by the glibc library). The rotate process in particular releases a considerable portion of the memory.

Statistics on the compute time and rotate time for each per-minute log file on the day of the fault showed the following:

(1) Processing time (compute time) per log file: logs generated in the first half of each hour (for example, starting at 10:00) usually took more than one minute each to process, while logs generated in the second half of the hour usually took between 10 seconds and one minute.

(2) Rotation time (rotate time) after each log file: the rotate at the top of each hour took much longer than at other times, and the rotate at the end of the day took the longest.

These statistics indicate that the processing latency occurs in the first half of each hour, after the on-the-hour rotate, and that processing returns to normal in the second half of the hour, when each log file takes less than one minute. The latency is therefore related to, and triggered by, the on-the-hour rotate. Determining the specific cause requires tracing the engine further to see what it is doing when the latency occurs.
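
The kind of breakdown described above can be sketched as follows. This is only an illustration: the record format (a timestamp plus a compute time in seconds), the function name `mean_by_half_hour`, and the sample values are all hypothetical, not the team's actual data or tooling.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_by_half_hour(
    samples: Iterable[Tuple[datetime, float]]
) -> Dict[str, float]:
    """Average compute time, split by which half of the hour a log belongs to."""
    buckets = defaultdict(list)
    for timestamp, seconds in samples:
        half = "first half" if timestamp.minute < 30 else "second half"
        buckets[half].append(seconds)
    return {half: mean(values) for half, values in buckets.items()}


# Hypothetical samples: (log timestamp, compute time in seconds).
SAMPLES = [
    (datetime(2012, 3, 20, 10, 5), 90.0),
    (datetime(2012, 3, 20, 10, 20), 110.0),
    (datetime(2012, 3, 20, 10, 40), 20.0),
    (datetime(2012, 3, 20, 10, 55), 40.0),
]
print(mean_by_half_hour(SAMPLES))
```

A large gap between the two averages is what points the investigation at the event that starts each hour, here the on-the-hour rotate.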

3. Troubleshooting

To locate the root cause of the processing delay, the engine program's operations at the time of the fault (such as library function calls and system calls) have to be observed in real time. After discussion, we chose the strace and ltrace tools for this. strace traces system calls and the time they consume, and ltrace traces library function calls and the time they consume.

3.1 Troubleshooting process

On a test machine, the online version of the real-time computing engine program was used to replay the log data that triggered the fault. Once the on-the-hour rotate triggered the processing delay, we used ltrace and strace to trace the engine's running state and to collect statistics on the time it spent in system calls and library functions.

    1. Trace the time spent in library function calls: ltrace -fp pid -t -c
    2. Trace the time spent in system calls: strace -fp pid -t -c
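
The `-c` option makes ltrace print a summary table of time and call counts when tracing ends. A rough sketch of ranking that table by time is shown below. It assumes the five-column layout (`% time`, `seconds`, `usecs/call`, `calls`, `function`) that ltrace prints; the helper name `parse_summary` is hypothetical, and the numbers in `SAMPLE` are invented for illustration and are not the measurements from this investigation. strace's summary has an extra `errors` column, so it would need a slightly different parser.

```python
from typing import List, NamedTuple


class CallStat(NamedTuple):
    percent: float
    seconds: float
    calls: int
    name: str


def parse_summary(text: str) -> List[CallStat]:
    """Parse an `ltrace -c` style table and sort rows by total seconds."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly: % time, seconds, usecs/call, calls, function.
        if len(fields) != 5:
            continue
        try:
            rows.append(CallStat(float(fields[0]), float(fields[1]),
                                 int(fields[3]), fields[4]))
        except ValueError:
            continue  # separator line made of dashes
    return sorted(rows, key=lambda row: row.seconds, reverse=True)


# Invented sample in the assumed layout; not real output from the engine.
SAMPLE = """\
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
  3.00    0.150000          50      3000 memcpy
 96.00    4.800000     2400000         2 malloc
  1.00    0.050000          25      2000 free
------ ----------- ----------- --------- --------------------
100.00    5.000000                  5002 total
"""

for stat in parse_summary(SAMPLE):
    print(f"{stat.name:10s} {stat.seconds:10.6f}s over {stat.calls} calls")
```

Sorting by total seconds rather than call count is deliberate: a handful of very slow calls matters more here than many cheap ones.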

3.2 Troubleshooting results

The following is an analysis of the results of the troubleshooting process:

    1. Testing showed that during rotate the CPU was 100% busy in user space. When log processing reached the top of the hour, the system's free memory was exhausted; all memory was in use by the page cache or by processes.
    2. The ltrace output showed that malloc calls for large memory blocks (for example, more than 1000 bytes) took a long time, about 2 to 4 seconds each. This was the most time-consuming library call in the ltrace statistics; the engine spent most of its time in malloc.
    3. The strace output showed that no system calls into kernel space were made during these malloc calls. Together with the 100% user-space CPU usage, this largely confirms that the malloc time was spent inside glibc's memory management module, probably because of memory fragmentation: when the program requests a large block, the allocator has to consolidate memory fragments into a block large enough to hand to the computing engine.
    4. To further verify this conclusion, we tested with Google's tcmalloc (export LD_PRELOAD="./libtcmalloc.so"). With tcmalloc, the real-time computing engine returned to normal: the processing time of each log file dropped from several minutes to several seconds, and the rotate time dropped from 10 minutes to 30 to 40 seconds.
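
The allocator swap in step 4 can also be scripted, which helps when the same replay is run with and without tcmalloc. The sketch below only sets `LD_PRELOAD` for a child process; the helper names `preload_env` and `run_with_allocator` and the example command are hypothetical, and the library path is the one quoted above.

```python
import os
import subprocess
from typing import Dict, List, Optional


def preload_env(lib_path: str,
                base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD accepts a colon-separated list; keep any libraries already set.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def run_with_allocator(cmd: List[str],
                       lib_path: str) -> subprocess.CompletedProcess:
    """Run cmd with the given allocator library preloaded."""
    return subprocess.run(cmd, env=preload_env(lib_path), check=False)


# Example (hypothetical engine binary and log path):
# run_with_allocator(["./engine", "--replay", "fault-day.log"],
#                    "./libtcmalloc.so")
```

Building a fresh environment per run keeps the baseline glibc run untouched, so the two timings can be compared directly.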

4. Solution

Since the processing latency was finally traced to glibc's memory management, a simple solution is to replace the glibc allocator with tcmalloc. In the long run, whether we need to implement our own memory management module depends on available staffing and on how tcmalloc develops.

5. Lessons learned

    1. Avoid relying only on experience when troubleshooting; analyze the code's implementation logic and the environment at the fault scene as well.
    2. Data is the most authentic and reliable evidence. Data collected on site and historical data are both an important basis for analyzing and locating faults.
    3. No tool is universal. Select debugging and tracing tools that suit the actual situation, and accumulate experience with them.

Finally, I would like to thank my colleagues for troubleshooting the problem: Peng, Cheng Cang, Min Zhan, and YUAN Hong. You are all awesome!
