Investigating message blocking when sending to a process on a remote Erlang node

Source: Internet
Author: User

Problem description: A performance problem occurred in a production environment. Two Erlang nodes, a and b, run on two servers, A and B respectively. The nodes are connected, and node a continuously sends messages to node b. When the machine hosting node b went down, messages piled up in the mailbox of the sending process on node a.
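A backlog like this can be inspected from a shell on the affected node. The following is a minimal sketch, not the author's tooling: the module name mailbox_probe and the argument Name (the registered name of the sending process) are assumptions made for illustration.

```erlang
-module(mailbox_probe).
-export([inspect/1]).

%% Look up a locally registered process and report where it is currently
%% executing and how many messages are waiting in its mailbox.
%% Name is a hypothetical registered name, not one taken from the article.
inspect(Name) when is_atom(Name) ->
    case erlang:whereis(Name) of
        undefined ->
            %% No process is registered under Name on this node.
            undefined;
        Pid ->
            erlang:process_info(Pid, [current_function, message_queue_len])
    end.
```

A message_queue_len that keeps growing while current_function stays the same suggests the process is stuck in one call rather than draining its mailbox.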

Tracing process: erlang:process_info(erlang:whereis(Name)), where Name is the registered name of the sending process, showed that current_function stayed at gen:do_call/4 and that the message queue had grown to hundreds of thousands of messages.

Source analysis: The code sends messages to the remote end by calling erlang:send(Pid, Msg), where Pid is the receiving process on the remote node. To test this function in isolation, the test environment used two machines, 192.168.8.206 and 192.168.8.207, each running an Erlang node. A receiving process runs on B, and the test measures how long sends take when, in turn, the process, the node, and the machine on B are down. The server-side code is as follows:
    -module(recv).
    -export([start/0]).

    start() ->
        erlang:spawn(fun() -> erlang:register(recv, self()), loop() end).

    loop() ->
        receive
            Data ->
                io:format("~p~n", [Data]),
                loop()
        end.
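The source shows only the server side of the test. The following is a minimal sketch of what the sending side might look like. The module name send_bench, the message shape {ping, N}, and the use of timer:tc/1 are assumptions, not the author's actual test code.

```erlang
-module(send_bench).
-export([run/2]).

%% Send N messages to the process registered as recv on Node and
%% return the total elapsed time in milliseconds (as a float).
run(Node, N) when is_atom(Node), is_integer(N), N >= 0 ->
    {Micros, ok} = timer:tc(fun() -> send_n({recv, Node}, N) end),
    Micros / 1000.

%% Plain erlang:send/2, the call whose latency is being measured.
send_n(_Dest, 0) ->
    ok;
send_n(Dest, N) ->
    erlang:send(Dest, {ping, N}),
    send_n(Dest, N - 1).
```

From a shell on node a, a call such as send_bench:run('b@192.168.8.207', 1000) would time 1000 sends. The node name here is a guess based on the IP addresses above.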
Measurement results:

    State of process, node, and machine on B     Time for 1000 sends (ms)
    Process, node, and machine all alive         8.333
    Process down; node and machine alive         7.6
    Process and node down; machine alive         1108
    Machine down                                 2866366.6

Conclusion: A downed machine has a very large effect on how long remote operations such as rpc:call and send take to time out. At 2866366.6 ms for 1000 sends, each send to an offline machine takes close to 3 seconds to time out. The specific cause is related to Erlang's trap mechanism. This confirms that the downed machine is the key factor behind the message backlog in behavior_server.

Tracing the source:

    Eterm erl_send(Process *p, Eterm to, Eterm msg)
    {
        Sint result = do_send(p, to, msg, !0);

        if (result > 0) {
            ERTS_VBUMP_REDS(p, result);
            BIF_RET(msg);
        } else switch (result) {
        case 0:
            BIF_RET(msg);
            break;
        case SEND_TRAP:
            BIF_TRAP2(dsend2_trap, p, to, msg);
        ....
erlang:send/2 eventually enters a trap. For the role of traps, see Yu Feng's blog: http://mryufeng.iteye.com/blog/334744. Why does it trap? Because do_send calls remote_send:
    static Sint remote_send(Process *p, DistEntry *dep,
                            Eterm to, Eterm full_to, Eterm msg, int suspend)
    {
        Sint res;
        int code;
        ErtsDSigData dsd;

        ASSERT(is_atom(to) || is_external_pid(to));

        code = erts_dsig_prepare(&dsd, dep, p, ERTS_DSP_NO_LOCK, !suspend);
        switch (code) {
        case ERTS_DSIG_PREP_NOT_ALIVE:
        case ERTS_DSIG_PREP_NOT_CONNECTED:
            res = SEND_TRAP;
        ....

    ERTS_GLB_INLINE int
    erts_dsig_prepare(ErtsDSigData *dsdp,
                      DistEntry *dep,
                      Process *proc,
                      ErtsDSigPrepLock dspl,
                      int no_suspend)
    {
        int failure;
        if (!erts_is_alive)
            return ERTS_DSIG_PREP_NOT_ALIVE;
        if (!dep)
            return ERTS_DSIG_PREP_NOT_CONNECTED;
        ....

Here the send operation relies on the second use of traps, deferred execution. When erlang:send is called and the connection between the nodes has not been established, the send cannot proceed. The next time the process is scheduled, it first performs the node connection, and the send continues only after the connection is established. The function executed by the trap here is Erlang's dsend/2:

    dsend(Pid, Msg) when is_pid(Pid) ->
        case net_kernel:connect(node(Pid)) of
            true -> erlang:send(Pid, Msg);
            false -> Msg
        end;

Here net_kernel:connect/1 actually goes through gen:do_call to establish a TCP connection. Once the remote machine is down, the TCP connection attempt gets no reply, so the call has to wait for the timeout before it returns. As a result, net_kernel:connect/1 is a blocking call.
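The blocking time of the connection attempt can be measured directly. The sketch below makes two assumptions: the module name connect_probe is hypothetical, and it uses the documented net_kernel:connect_node/1 rather than the internal net_kernel:connect/1 shown above, on the assumption that both go through the same connection setup.

```erlang
-module(connect_probe).
-export([probe/1]).

%% Time one connection attempt to Node.
%% Returns {ElapsedMs, Result}, where Result is what
%% net_kernel:connect_node/1 returned: true, false, or ignored
%% (ignored means the local node is not running distributed).
probe(Node) when is_atom(Node) ->
    {Micros, Result} = timer:tc(net_kernel, connect_node, [Node]),
    {Micros / 1000, Result}.
```

Calling this against a node whose machine is down should show how long a single attempt blocks the caller before returning false.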

Solution: When the remote machine may be unreliable, use erlang:send(Pid, Msg, [noconnect]) instead of erlang:send(Pid, Msg). Be careful with issuing large numbers of send or call operations toward a particular remote node. If that machine stays down for a long time, every attempt by this node to reach the peer node can block.

The reason lies in how the noconnect option is handled:

    BIF_RETTYPE send_3(BIF_ALIST_3)
    {
        ...
        result = do_send(p, to, msg, suspend);
        if (result > 0) {
            ERTS_VBUMP_REDS(p, result);
            BIF_RET(am_ok);
        } else switch (result) {
        case 0:
            BIF_RET(am_ok);
            break;
        case SEND_TRAP:
            if (connect) {
                BIF_TRAP3(dsend3_trap, p, to, msg, opts);
            } else {
                BIF_RET(am_noconnect);
            }
        ....

With noconnect set, erlang:send does not wait for a connection to the remote node to be established. If the remote node is not connected or does not exist, the call returns noconnect immediately instead of trapping.
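In application code, the noconnect return value has to be handled by the caller. The following is a minimal sketch of such a wrapper. The module name safe_send and the {error, noconnect} return shape are assumptions made for illustration, not part of the original code.

```erlang
-module(safe_send).
-export([send/2]).

%% Send Msg to Dest without ever triggering a blocking connection setup.
%% Returns ok if the message was handed off, or {error, noconnect} if
%% the destination node is not currently connected.
send(Dest, Msg) ->
    case erlang:send(Dest, Msg, [noconnect]) of
        ok ->
            ok;
        noconnect ->
            %% The message was NOT sent. What to do next (drop it,
            %% buffer it, or retry later) is left to the caller.
            {error, noconnect}
    end.
```

The trade-off is that a send to a disconnected node no longer reconnects automatically, so some other part of the system has to re-establish the connection once the remote machine is back.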
