Troubleshooting Severe CPU Fluctuation on an Erlang Node

Source: Internet
Author: User
Tags: nxdomain

After the new service was launched, CPU usage was observed fluctuating sharply between 10% and 70%, even though the per-second service counters showed a steady, average processing rate.

The troubleshooting process follows:

1. dstat -Tam

  

Roughly every 10 s, network traffic dropped, then suddenly surged, and CPU usage rose with it.

The changes in network traffic did not match the performance counters. The service has complex dependencies, so the first step was to find out which of them accounted for the network traffic.

 

2. iftop

Find the destination IP addresses with the most traffic. Within each cycle, the traffic to one of them dropped to 0 and then surged.

The IP address turned out to belong to an external HTTP interface. Because that interface is called asynchronously, the performance counter records the start of each call rather than its completion.

In other words, all asynchronous interface calls were being blocked for a period of time.

 

3. tcpdump

Since the calls were blocked, the first suspicion was that the dependent service was unstable and timing out intermittently.

A TCP packet capture was used to check this.

However, the capture showed no TCP connection failures or retransmissions. While the traffic statistics read 0, the client was not issuing any HTTP requests at all.

So the problem was on the caller side, in the HTTP client.

  

4. lhttpc

A careful read of the lhttpc source code suggested only one possible cause: lhttpc limits the maximum number of concurrent connections to 50.

Checking the lhttpc deployed in production showed that it was not the latest version. The older, deployed version has no limit on the number of concurrent connections; its code is relatively simple, and no other problem turned up there.

 

5. Find the blocking code path

Locate the call entry point and execute it manually. The blocking reproduces.

Capture the process stack:

io:format("~s~n", [element(2, process_info(Pid, backtrace))]).

The backtrace consistently showed lhttpc blocked for several seconds while calling the other interface.

 

6. fprof

fprof:start().

fprof:apply(M, F, Args).

fprof:profile().

fprof:analyse().

This produced the following result (excerpt):

{[{{lhttpc_client,request,9},                     1, 4993.730,    0.018}],
 { {lhttpc_client,execute,9},                     1, 4993.730,    0.018},     %
 [{{lhttpc_client,send_request,1},                1, 4993.486,    0.004},
  {{lhttpc_lib,format_request,7},                 1,    0.130,    0.008},
  {{gen_server,call,3},                           1,    0.046,    0.003},
  {{lhttpc_lib,normalize_method,1},               1,    0.038,    0.017},
  {{proplists,get_value,3},                       6,    0.006,    0.006},
  {{proplists,is_defined,2},                      2,    0.003,    0.003},
  {{proplists,get_value,2},                       1,    0.003,    0.002}]}.

{[{{lhttpc_client,execute,9},                     1, 4993.486,    0.004},
  {{lhttpc_client,send_request,1},                1,    0.000,    0.003}],
 { {lhttpc_client,send_request,1},                2, 4993.486,    0.007},     %
 [{{lhttpc_sock,connect,5},                       1, 4932.214,    0.006},
  {{lhttpc_client,read_response,1},               1,   61.202,    0.004},
  {{lhttpc_sock,send,3},                          1,    0.061,    0.001},
  {{erlang,setelement,3},                         1,    0.002,    0.002},
  {{lhttpc_client,send_request,1},                1,    0.000,    0.003}]}.

{[{{inet_gethost_native,getit,2},                 1, 4929.280,    0.000},
  {{prim_inet,recv0,3},                           1,   59.932,    0.000},
  {{prim_inet,connect0,4},                        1,    2.082,    0.000},
  {{lhttpc_client,request,9},                     1,    0.017,    0.000},
  {{gen,do_call,4},                               1,    0.017,    0.000}],
 { suspend,                                       5, 4991.328,    0.000},     %
 [ ]}.

{[{{gen_tcp,connect,4},                           1, 4932.188,    0.007}],
 { {gen_tcp,connect1,4},                          1, 4932.188,    0.007},     %
 [{{inet_tcp,getaddrs,2},                         1, 4929.488,    0.013},
  {{gen_tcp,try_connect,6},                       1,    2.672,    0.020},
  {{gen_tcp,mod,2},                               1,    0.019,    0.002},
  {{inet_tcp,getserv,1},                          1,    0.002,    0.002}]}.

{[{{gen_tcp,connect1,4},                          1, 4929.488,    0.013}],
 { {inet_tcp,getaddrs,2},                         1, 4929.488,    0.013},     %
 [{{inet,getaddrs_tm,3},                          1, 4929.475,    0.008}]}.

{[{{inet_tcp,getaddrs,2},                         1, 4929.475,    0.008}],
 { {inet,getaddrs_tm,3},                          1, 4929.475,    0.008},     %
 [{{inet,gethostbyname_tm,3},                     1, 4929.445,    0.005},
  {{inet_parse,visible_string,1},                 1,    0.022,    0.001}]}.

{[{{inet,getaddrs_tm,3},                          1, 4929.445,    0.005}],
 { {inet,gethostbyname_tm,3},                     1, 4929.445,    0.005},     %
 [{{inet,gethostbyname_tm,4},                     1, 4929.430,    0.002},
  {{inet_db,res_option,1},                        1,    0.009,    0.002},
  {{lists,member,2},                              1,    0.001,    0.001}]}.

{[{{inet,gethostbyname_tm,3},                     1, 4929.430,    0.002}],
 { {inet,gethostbyname_tm,4},                     1, 4929.430,    0.002},     %
 [{{inet,gethostbyname_tm_native,4},              1, 4929.428,    0.004}]}.

{[{{inet,gethostbyname_tm,4},                     1, 4929.428,    0.004}],
 { {inet,gethostbyname_tm_native,4},              1, 4929.428,    0.004},     %
 [{{inet_gethost_native,gethostbyname,2},         1, 4929.423,    0.002},
  {{inet,gethostbyname_tm,5},                     1,    0.001,    0.001}]}.

From the output above, almost all of the roughly 4.9 s is spent in inet_tcp:getaddrs. Reading the source code shows that it ends up calling the native (system) DNS resolver.
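A lighter way to confirm that name resolution is the slow step is to time the lookup on its own. The following is a minimal sketch, not part of the original investigation: the module name dns_timing_sketch and both function names are hypothetical, and it relies only on timer:tc/3 and inet:gethostbyname/1 from the standard library.

```erlang
-module(dns_timing_sketch).
-export([time_lookup/1, slow_lookups/3]).

%% Time one name lookup through the node's configured lookup methods.
%% Returns {Microseconds, ResultOfGethostbyname}.
time_lookup(Host) ->
    timer:tc(inet, gethostbyname, [Host]).

%% Run Count lookups one after another and return the durations
%% (in microseconds) of those that took at least ThresholdMs milliseconds.
slow_lookups(Host, Count, ThresholdMs) ->
    Durations = [element(1, time_lookup(Host)) || _ <- lists:seq(1, Count)],
    [D || D <- Durations, D >= ThresholdMs * 1000].
```

Calling dns_timing_sketch:slow_lookups(Host, 1000, 1000) with the affected host name would list any lookups that stalled for a second or more. If the profile above is representative, occasional values close to 5,000,000 microseconds would be expected.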

  

7. Erlang DNS

Querying the domain name with dig showed no problem: the DNS server is on the intranet, it responded within 1 ms, and a long-running test showed no latency.

This pointed to a problem in how Erlang performs DNS lookups.

http://erlang.org/doc/apps/erts/inet_cfg.html

> inet_db:get_rc().
[{nameservers,{10,13,8,25}},
 {nameservers,{172,16,105,248}},
 {resolv_conf,"/etc/resolv.conf"},
 {hosts_file,"/etc/hosts"},
 {lookup,[native]}]
> ets:lookup(inet_db, cache_size).
[{cache_size,100}]

{lookup, Methods}.

Methods = [atom()]

Specify lookup methods and in which order to try them. The valid methods are: native (use system calls), file (use host data retrieved from system configuration files and/or the user configuration file) or dns (use the Erlang DNS client inet_res for nameserver queries).

As described above, with the default configuration every DNS lookup Erlang makes goes straight to a system call. Erlang's internal DNS cache is used only when the lookup method is dns.
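To check which lookup methods a running node actually uses, the setting can be read back at runtime. This is a small sketch under assumptions: the module and function names are hypothetical, and inet_db:res_option/1 is an internal kernel function (it appears in the fprof output earlier in this post), so its behavior may differ between OTP releases.

```erlang
-module(dns_lookup_sketch).
-export([lookup_methods/0, uses_erlang_resolver/0]).

%% Lookup methods configured on this node, in the order they are tried,
%% for example [native].
lookup_methods() ->
    inet_db:res_option(lookup).

%% True when the Erlang DNS client (inet_res), and with it the node's own
%% DNS cache, is among the configured lookup methods.
uses_erlang_resolver() ->
    lists:member(dns, lookup_methods()).
```

On the node examined in this post, uses_erlang_resolver() would be expected to return false, since get_rc reported {lookup,[native]}.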

  

8. System DNS

Since Erlang was not caching DNS results, the next step was to capture and analyze the system's DNS packets:

tcpdump -i any udp port 53

21:34:58.739732 IP 10.77.128.49.49003 > 10.13.8.25.domain: 26954 +? i.api.xxx.cn. (32)
21:34:58.739941 IP 10.13.8.25.domain > 10.77.128.49.49003: 26954 1/4/4 A 172.16.105.20 (193)
21:34:58.740546 IP 10.77.128.49.40072 > 10.13.8.25.domain: 39440 +? i.api.xxx.cn. (32)
21:35:02.205299 IP 10.77.128.49.6060 > 10.13.8.25.domain: 48139 +? bx49. (22)
21:35:02.207506 IP 10.13.8.25.domain > 10.77.128.49.6060: 48139 NXDomain 0/1/0 (97)
21:35:03.479277 IP 10.77.128.49.51277 > 10.13.8.25.domain: 11779 +? bx49. (22)
21:35:03.479501 IP 10.13.8.25.domain > 10.77.128.49.51277: 11779 NXDomain 0/1/0 (97)
21:35:03.745591 IP 10.77.128.49.24134 > 172.16.105.248.domain: 39440 +? i.api.xxx.cn. (32)
21:35:03.747254 IP 172.16.105.248.domain > 10.77.128.49.24134: 39440 1/4/4 A 172.16.105.20 (193)

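The capture shows a large number of frequent queries for the i.api.xxx.cn domain name. The query sent at 21:34:58.740546 (ID 39440, highlighted in yellow in the original post) received no response. About 5 seconds later, at 21:35:03.745591, the same query was resent to the second name server, 172.16.105.248, and only then was it answered.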

 

Summary:

The i.api.xxx.cn interface does not support keep-alive, so every request needs a new connection, and Erlang resolves the name through a system call each time.

The server was not running the nscd service, so there was no system-level DNS cache.

DNS uses UDP, which occasionally loses packets even on an intranet.

Erlang merges concurrent queries for the same name into a single DNS request.

The timeout on Erlang's system-call lookup is too long (5 s) and the query is not retried promptly, so requests accumulate while it waits. When the DNS reply finally arrives, the accumulated work all starts processing at once, which causes the CPU fluctuations.
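The pile-up described in the summary can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the actual inet_gethost_native code: the module dns_stall_sketch, its functions, and the fixed 127.0.0.1 answer are all hypothetical. A single resolver process stands in for the merged DNS request, and StallMs stands in for the 5 s wait after a lost packet.

```erlang
-module(dns_stall_sketch).
-export([run/2]).

%% Simulate N requests that all wait on one shared lookup which stalls for
%% StallMs milliseconds. Returns {Earliest, Latest}: the times, in ms since
%% the start, at which the first and the last request resumed.
run(N, StallMs) when N > 0 ->
    Parent = self(),
    Start = erlang:monotonic_time(millisecond),
    Resolver = spawn_link(fun() -> resolver([]) end),
    Workers = [spawn_link(fun() -> worker(Parent, Resolver, Start) end)
               || _ <- lists:seq(1, N)],
    timer:sleep(StallMs),   % the query was lost: no answer comes back yet
    Resolver ! reply,       % the retried query finally succeeds
    Times = [receive {resumed, Pid, T} -> T end || Pid <- Workers],
    Resolver ! stop,
    {lists:min(Times), lists:max(Times)}.

%% One request: ask for the address, block until it arrives, report when.
worker(Parent, Resolver, Start) ->
    Resolver ! {lookup, self()},
    receive {resolved, _Addr} -> ok end,
    Parent ! {resumed, self(), erlang:monotonic_time(millisecond) - Start}.

%% Collect waiting callers until the single shared answer arrives,
%% then release all of them at once.
resolver(Waiting) ->
    receive
        {lookup, Pid} -> resolver([Pid | Waiting]);
        reply ->
            [Pid ! {resolved, {127,0,0,1}} || Pid <- Waiting],
            answered()
    end.

%% After the answer is known, late callers are served immediately.
answered() ->
    receive
        {lookup, Pid} -> Pid ! {resolved, {127,0,0,1}}, answered();
        stop -> ok
    end.
```

With a call such as dns_stall_sketch:run(1000, 5000), no request resumes before the stall ends, and all of them resume within a short window afterwards. That burst is the simulated counterpart of the CPU spike.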

Solution:

1. Start nscd.

2. Configure inet to use the Erlang DNS client (the dns lookup method), which caches results in memory.
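The inet option could be expressed as an inet configuration file. The following is a sketch under assumptions, not the configuration that was actually deployed: the file name erl_inetrc and the timeout and retry values are illustrative, while the option names are those described in the inet configuration documentation linked earlier.

```erlang
%% erl_inetrc (hypothetical file name)

%% Try the hosts file first, then the Erlang DNS client (inet_res),
%% which keeps answers in the node's own cache.
{lookup, [file, dns]}.

%% Maximum number of cached DNS records (100 is the documented default).
{cache_size, 100}.

%% Illustrative values, not from the original post: wait 1000 ms before
%% retrying a query, and try up to 3 times.
{timeout, 1000}.
{retry, 3}.
```

A node should pick the file up when started with erl -kernel inetrc '"./erl_inetrc"'. Afterwards, inet_db:get_rc() should report {lookup,[file,dns]} instead of {lookup,[native]}.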

  
