Erlang Memory Leak Analysis

As projects depend more and more on Erlang, the problems they run into multiply. A while ago one of our online systems hit a high memory consumption problem; this post records the troubleshooting and analysis process. The online system runs Erlang R16B02.
Problem description
Several online systems had been running for a while when their memory usage soared. The system model is simple: whenever a network connection comes in, a process is picked from a pool to handle it. top showed that memory had been eaten by the Erlang VM process, while netstat showed only a few thousand network connections. It looked like an Erlang memory leak.
Analysis method
One advantage of Erlang is that you can go straight into a running system and analyze the problem right at the production site. Our system is managed with rebar, and there are several ways to get into the live node.
Local login
You can log in to the online machine directly and attach to the running Erlang system, for example as sketched below.
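For a rebar-generated release that was started with its start script, attaching usually looks something like this (a minimal sketch; the deploy path and the release name "myapp" are placeholders):

$ cd /opt/myapp              %% placeholder: wherever the rebar release is deployed
$ ./bin/myapp attach         %% works when the node was started with './bin/myapp start'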
Through the remote shell
Get the Erlang system cookie
$ ps -ef | grep beam    %% look for the -setcookie argument
Open a new shell and start a node with the same cookie but a different node name
$ erl -setcookie cookiename -name [email protected]
Then enter the production node by starting a remote shell (press ^G to get the user switch command)
Erlang R16B02 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.3  (abort with ^G)
([email protected])1> net_adm:ping('[email protected]').
pong
([email protected])2> nodes().
['[email protected]']
([email protected])3>
User switch command
 --> h
  c [nn]            - connect to job
  i [nn]            - interrupt job
  k [nn]            - kill job
  j                 - list all jobs
  s [shell]         - start local shell
  r [node [shell]]  - start remote shell
  q                 - quit erlang
  ? | h             - this message
 --> r '[email protected]'
 --> j
   1  {shell,start,[init]}
   2* {'[email protected]',shell,start,[]}
 --> c 2
Analysis process
Erlang has many tools for analyzing a running system, such as appmon and webtool. But with memory already critically low there was no way to start them; fortunately, there is always the Erlang shell.
The Erlang shell has many useful built-in commands, which can be listed with help()
> help().
Erlang system memory consumption
top already pointed at memory, so the first step is to look at Erlang's own memory statistics
> erlang:memory().
erlang:memory() shows the memory dynamically allocated by the Erlang emulator: the total, the memory used by atoms, the memory used by processes, and so on.
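As a quick sketch (using only standard functions, not a step from the original session), the result can be sorted by size to see which category dominates; the total entry naturally sorts first:

> lists:reverse(lists:keysort(2, erlang:memory())).   %% largest categories first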
Number of Erlang processes created
On the live node the main memory consumer turned out to be processes, so the next question was whether individual processes were leaking memory or simply too many processes had been created.
> erlang:system_info(process_limit).   %% maximum number of processes the system may create
> erlang:system_info(process_count).   %% number of processes currently alive
erlang:system_info/1 returns information about the running system, such as the number of processes and ports. Running the commands above was a surprise: with only two or three thousand network connections, the node already had more than 100,000 processes. Processes were being created but, because of the code or some other reason, never released.
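To see what kind of process is piling up, one approach is to group live processes by their initial call. This is a minimal sketch of my own (R16B02 predates maps, so it uses dict):

> lists:reverse(lists:keysort(2, dict:to_list(lists:foldl(
      fun(P, D) ->
              case erlang:process_info(P, initial_call) of
                  {initial_call, MFA} -> dict:update_counter(MFA, 1, D);
                  undefined           -> D     %% process died while scanning
              end
      end, dict:new(), erlang:processes())))).

Note that processes started through proc_lib (gen_server, gen_fsm, etc.) all report proc_lib:init_p/5 as their initial call, so grouping by registered_name or current_function can be more telling in practice.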
View information for a single process
Since processes were piling up for some reason, the answer had to be found inside the processes themselves.
First, get the pid of one of the piled-up processes
> i().            %% prints information about every process in the system
> i(0,61,886).    %% (0,61,886) are the three components of the pid
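With 100,000+ processes, i() produces a huge amount of output; a minimal sketch (again my own, using only standard BIFs) for pulling out the ten processes that use the most memory:

> lists:sublist(lists:reverse(lists:keysort(2,
      [{P, Mem} || P <- erlang:processes(),
                   {memory, Mem} <- [erlang:process_info(P, memory)]])),
      10).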
The listing showed a great many processes hanging around; looking at individual pids revealed message queues with unprocessed messages. The powerful erlang:process_info/2 shown next can retrieve quite rich information about a process.
> erlang:process_info(pid(0,61,886), current_stacktrace).
> rp(erlang:process_info(pid(0,61,886), backtrace)).
Looking at the backtrace of one such process turned up the following
0x00007fbd6f18dbf8 Return addr 0x00007fbff201aa00 (gen_event:rpc/2 + 96)
y(0)     #Ref<0.0.2014.142287>
y(1)     infinity
y(2)     {sync_notify,{log,{lager_msg,[], ..........}}
y(3)     <0.61.886>
y(4)     <0.89.0>
y(5)     []
The process was stuck inside lager, the third-party Erlang logging library.
Cause of the problem
Looking through the lager documentation turned up the following
Prior to lager 2.0, the gen_event at the core of lager operated purely in synchronous mode. Asynchronous mode is faster, but has no protection against message queue overload. In lager 2.0, the gen_event takes a hybrid approach. It polls its own mailbox size and toggles the messaging between synchronous and asynchronous depending on mailbox size.

{async_threshold, 20},
{async_threshold_window, 5}

This will use async messaging until the mailbox exceeds 20 messages, at which point synchronous messaging will be used, and switch back to asynchronous when the size reduces to 20 - 5 = 15.

If you wish to disable this behaviour, simply set it to 'undefined'. It defaults to a low number to prevent the mailbox growing rapidly beyond the limit and causing problems. In general, lager should process messages as fast as they come in, so getting behind should be relatively exceptional anyway.
So lager has a configuration item that bounds the number of unhandled messages: once more than that many messages pile up in the mailbox, logging switches to synchronous handling!
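For reference, these settings live in the lager application environment. The sketch below is an illustrative sys.config entry; the values simply restate the defaults quoted above, and the console backend is only an example handler:

[{lager, [
    {async_threshold, 20},          %% switch to sync logging above 20 queued messages
    {async_threshold_window, 5},    %% switch back to async below 20 - 5 = 15
    {handlers, [
        {lager_console_backend, info}
    ]}
]}].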
Our system had debug logging turned on, and the flood of log messages swamped it.
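One immediate mitigation (my suggestion, assuming the standard lager 2.x API rather than a step from the original troubleshooting) is to raise the log level of the noisy backend at runtime from the remote shell:

> lager:set_loglevel(lager_console_backend, warning).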
Others have run into similar problems; the discussion in this thread was a great help to our analysis. Thanks!
Summary
Erlang provides a wealth of tools for going into a live system and analyzing problems on the spot, which helps locate issues quickly and efficiently. At the same time, the power of Erlang/OTP gives the system a stronger guarantee of stability. We will keep digging into Erlang and look forward to sharing more of our experience.
About the author
Weibo: @liaolinbo, chief engineer at Cloud Ba. He previously worked at Oracle.