Summary of "surprise cluster" problem in Linux network programming


1. Preface

I have been doing Linux network development for nearly four years, yet I still run into problems I cannot explain, and that sometimes makes conversations with other developers awkward. Machines are multicore now and network programming frameworks keep getting richer; the three models I know in common use are multi-process, multi-threaded, and asynchronous event-driven, the classic example being the master-worker multi-process asynchronous model used in Nginx. Today I want to discuss the "thundering herd" phenomenon encountered in network development. I had only ever heard of it and picked up the basic concept from material online; I had never hit it in real work. So this weekend, combining my own understanding with what I could find online, I set out to understand the "thundering herd" thoroughly. A few questions need to be answered:

(1) What is the "thundering herd", and what problems does it cause?

(2) How can the "thundering herd" be reproduced in code?

(3) How is the "thundering herd" problem solved, and how should the herd be handled when it does occur?

2. What is the thundering herd?

Network programs today commonly use a multi-process or multi-threaded model. The rough idea is: the parent process creates a socket, binds it, and listens, then forks a number of child processes; each child inherits the parent's socket and calls accept() to wait for incoming connections. At that point multiple processes are waiting on the same connection event, and when it occurs they are all woken up at once. That is the "thundering herd". What does it cost? Waking a process means the kernel has to reschedule it, so every process responds to the event at the same time, yet only one of them can actually handle it; the others fail and go back to sleep (or back to other work). The network model, in outline: one listening socket shared by N worker processes, all blocked on the same event.

In short, a thundering herd happens when multiple processes or threads block waiting on the same event: when the event fires, all of them are woken, but only one process or thread gets to handle it, and the rest go back to sleep having done nothing but waste a wakeup. That wasted work is the thundering herd.

3. Simulating the thundering herd in code

Now that we know what the thundering herd is, let's implement the model above and observe the effect. I use a multi-process design: the parent process binds a port and listens on a socket, then forks several child processes, each of which loops calling accept() on that socket. The test code is as follows:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <assert.h>
#include <sys/wait.h>
#include <string.h>
#include <strings.h>
#include <errno.h>

#define IP     "127.0.0.1"
#define PORT   8888
#define WORKER 4

int worker(int listenfd, int i)
{
    while (1) {
        printf("I am worker %d, begin to accept connection.\n", i);
        struct sockaddr_in client_addr;
        socklen_t client_addrlen = sizeof(client_addr);
        /* All workers block here on the same inherited listening socket. */
        int connfd = accept(listenfd, (struct sockaddr *)&client_addr, &client_addrlen);
        if (connfd != -1) {
            printf("worker %d accept a connection success.\t", i);
            printf("ip: %s\t", inet_ntoa(client_addr.sin_addr));
            printf("port: %d\n", ntohs(client_addr.sin_port));
            close(connfd);
        } else {
            printf("worker %d accept a connection failed, error: %s\n", i, strerror(errno));
        }
    }
    return 0;
}

int main()
{
    int i;
    struct sockaddr_in address;
    bzero(&address, sizeof(address));
    address.sin_family = AF_INET;
    inet_pton(AF_INET, IP, &address.sin_addr);
    address.sin_port = htons(PORT);

    int listenfd = socket(PF_INET, SOCK_STREAM, 0);
    assert(listenfd >= 0);

    int ret = bind(listenfd, (struct sockaddr *)&address, sizeof(address));
    assert(ret != -1);

    ret = listen(listenfd, 5);
    assert(ret != -1);

    for (i = 0; i < WORKER; i++) {
        printf("Create worker %d\n", i + 1);
        pid_t pid = fork();
        if (pid == 0) {         /* child process: loop on accept() */
            worker(listenfd, i);
        }
        if (pid < 0) {
            printf("fork error");
        }
    }

    /* wait for the child processes */
    int status;
    wait(&status);
    return 0;
}

Compile and run it, then test from the same machine with telnet 127.0.0.1 8888. The results:

If the thundering herd occurred, we would expect all four child processes to return from accept(), one succeeding and the other three failing. What actually happens: the parent creates four children, each child blocks in accept(), and when telnet connects, only the worker 2 process accepts the request; the other three are not woken at all.

Why? Is the thundering herd a myth? Time to hit Google and find out where the herd actually comes from.

It turns out that since Linux 2.6, the kernel has already solved the thundering herd for accept(): when a client connection arrives, the kernel wakes only the first process or thread on the wait queue. So a server that blocks in accept() has no thundering herd problem on any recent Linux.

However, real-world server programs mostly use the select, poll, or epoll mechanism; the server then blocks not in accept() but in select, poll, or epoll_wait, and in that case the thundering herd still has to be considered. Let's take epoll as the example and analyze it.

The non-blocking epoll implementation is as follows:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netdb.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/wait.h>
#include <arpa/inet.h>

#define IP          "127.0.0.1"
#define PORT        8888
#define PROCESS_NUM 4
#define MAXEVENTS   64

static int create_and_bind()
{
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    struct sockaddr_in serveraddr;
    memset(&serveraddr, 0, sizeof(serveraddr));
    serveraddr.sin_family = AF_INET;
    inet_pton(AF_INET, IP, &serveraddr.sin_addr);
    serveraddr.sin_port = htons(PORT);
    bind(fd, (struct sockaddr *)&serveraddr, sizeof(serveraddr));
    return fd;
}

static int make_socket_non_blocking(int sfd)
{
    int flags, s;
    flags = fcntl(sfd, F_GETFL, 0);
    if (flags == -1) {
        perror("fcntl");
        return -1;
    }
    flags |= O_NONBLOCK;
    s = fcntl(sfd, F_SETFL, flags);
    if (s == -1) {
        perror("fcntl");
        return -1;
    }
    return 0;
}

void worker(int sfd, int efd, struct epoll_event *events, int k)
{
    /* The event loop */
    while (1) {
        int n, i;
        n = epoll_wait(efd, events, MAXEVENTS, -1);
        printf("worker %d return from epoll_wait!\n", k);
        for (i = 0; i < n; i++) {
            if ((events[i].events & EPOLLERR) || (events[i].events & EPOLLHUP) ||
                (!(events[i].events & EPOLLIN))) {
                /* An error occurred on this fd, or the socket is not ready
                   for reading (why were we notified then?) */
                fprintf(stderr, "epoll error\n");
                close(events[i].data.fd);
                continue;
            } else if (sfd == events[i].data.fd) {
                /* A notification on the listening socket: one or more
                   incoming connections. */
                struct sockaddr in_addr;
                socklen_t in_len = sizeof(in_addr);
                int infd = accept(sfd, &in_addr, &in_len);
                if (infd == -1) {
                    printf("worker %d accept failed!\n", k);
                    break;
                }
                printf("worker %d accept succeeded!\n", k);
                /* A real server would make infd non-blocking and monitor it;
                   for this test we just close it. */
                close(infd);
            }
        }
    }
}

int main(int argc, char *argv[])
{
    int sfd, s, efd;
    struct epoll_event event;
    struct epoll_event *events;

    sfd = create_and_bind();
    if (sfd == -1) {
        abort();
    }

    s = make_socket_non_blocking(sfd);
    if (s == -1) {
        abort();
    }

    s = listen(sfd, SOMAXCONN);
    if (s == -1) {
        perror("listen");
        abort();
    }

    efd = epoll_create(MAXEVENTS);
    if (efd == -1) {
        perror("epoll_create");
        abort();
    }

    event.data.fd = sfd;
    event.events = EPOLLIN;
    s = epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &event);
    if (s == -1) {
        perror("epoll_ctl");
        abort();
    }

    /* Buffer where events are returned */
    events = calloc(MAXEVENTS, sizeof(event));

    int k;
    for (k = 0; k < PROCESS_NUM; k++) {
        printf("Create worker %d\n", k + 1);
        int pid = fork();
        if (pid == 0) {
            worker(sfd, efd, events, k);
        }
    }

    int status;
    wait(&status);
    free(events);
    close(sfd);
    return EXIT_SUCCESS;
}

The parent process creates the socket, sets it to non-blocking, and starts listening. It then forks four child processes, each of which calls epoll_wait inside worker() and starts accepting connections. Testing with telnet gives:

The result is the same as before: only one process receives the connection and the other three are never woken, so no thundering herd occurred. Why is that?

In earlier Linux versions, the kernel woke every process blocked in epoll_wait, so epoll had the same thundering-herd problem as accept() once did. Newer versions wake only the first process or thread on the wait queue, so newer Linux partially solves the epoll thundering herd. Partially, meaning: in certain special scenarios epoll shows no thundering herd, but in most scenarios the herd is still there.
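A side note, not part of the original analysis: on kernels 4.5 and later, epoll also offers a direct fix in the form of the EPOLLEXCLUSIVE flag, intended for the layout where each worker creates its own epoll instance after fork() and adds the same shared listening socket to it. A minimal sketch (the helper name is my own):

#include <sys/epoll.h>

/* Hypothetical helper: register a shared listening socket on this
   worker's own epoll instance so that a new connection does not wake
   every worker (requires Linux 4.5+). */
static int add_listener_exclusive(int efd, int listenfd)
{
    struct epoll_event event;
    event.data.fd = listenfd;
#ifdef EPOLLEXCLUSIVE
    event.events = EPOLLIN | EPOLLEXCLUSIVE;
#else
    event.events = EPOLLIN;  /* pre-4.5 headers: plain EPOLLIN, herd still possible */
#endif
    return epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &event);
}

With the flag set, the kernel wakes one or more, rather than all, of the waiters, which eliminates the herd in practice.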

The scenario in which the epoll herd does appear is when a worker is still busy at the moment it is woken, which we can simulate by calling sleep() once right after epoll_wait returns. Rewrite the worker function as follows:

void worker(int sfd, int efd, struct epoll_event *events, int k)
{
    /* The event loop */
    while (1) {
        int n, i;
        n = epoll_wait(efd, events, MAXEVENTS, -1);
        /* keep the worker "busy" after waking, before it can accept */
        sleep(2);
        printf("worker %d return from epoll_wait!\n", k);
        for (i = 0; i < n; i++) {
            if ((events[i].events & EPOLLERR) || (events[i].events & EPOLLHUP) ||
                (!(events[i].events & EPOLLIN))) {
                /* An error occurred on this fd, or the socket is not ready
                   for reading (why were we notified then?) */
                fprintf(stderr, "epoll error\n");
                close(events[i].data.fd);
                continue;
            } else if (sfd == events[i].data.fd) {
                /* A notification on the listening socket: one or more
                   incoming connections. */
                struct sockaddr in_addr;
                socklen_t in_len = sizeof(in_addr);
                int infd = accept(sfd, &in_addr, &in_len);
                if (infd == -1) {
                    printf("worker %d accept failed, error: %s\n", k, strerror(errno));
                    break;
                }
                printf("worker %d accept succeeded!\n", k);
                close(infd);
            }
        }
    }
}

Run the same telnet test again.

This time the thundering herd finally appears: multiple workers return from epoll_wait for the same single connection, and only one accept succeeds.

4. Solving the thundering herd problem

Nginx solves the problem with a mutex (the accept mutex). Concretely: there is one global mutex, and each worker tries to acquire it before calling epoll_wait(); the worker that gets the lock goes on to wait for and process events, while the ones that fail do not. On top of this, Nginx adds a load-balancing rule: once a worker's task count reaches 7/8 of its configured total, it stops trying to acquire the lock, which evens out the load across workers. I will dig into Nginx's handling of the thundering herd in more detail later.
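To make the accept-mutex idea concrete, here is a minimal sketch under assumptions of my own (an illustration, not Nginx's actual implementation): the lock is a process-shared pthread mutex placed in anonymous shared memory before fork(), each worker creates its own epoll instance, and a worker keeps the listening socket in its epoll set only while it holds the lock, so a new connection can wake at most one worker.

#include <pthread.h>
#include <sys/epoll.h>
#include <sys/mman.h>
#include <unistd.h>

static pthread_mutex_t *accept_mutex;   /* lives in shared memory */

/* Call once in the parent, before fork(). */
static int accept_mutex_init(void)
{
    pthread_mutexattr_t attr;
    accept_mutex = mmap(NULL, sizeof(pthread_mutex_t),
                        PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (accept_mutex == MAP_FAILED)
        return -1;
    pthread_mutexattr_init(&attr);
    /* PTHREAD_PROCESS_SHARED makes the mutex usable across fork()ed processes. */
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    return pthread_mutex_init(accept_mutex, &attr);
}

/* Each worker's event loop: the listening socket is registered only
   while this worker holds the lock. */
static void worker_loop(int efd, int listenfd,
                        struct epoll_event *events, int maxevents)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listenfd };
    int have_lock = 0;

    while (1) {
        if (!have_lock && pthread_mutex_trylock(accept_mutex) == 0) {
            have_lock = 1;
            epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &ev);
        }
        /* A short timeout lets workers without the lock retry soon. */
        int n = epoll_wait(efd, events, maxevents, 100);
        /* ... handle the n events; accept() on listenfd if readable ... */
        (void)n;
        if (have_lock) {
            epoll_ctl(efd, EPOLL_CTL_DEL, listenfd, NULL);
            pthread_mutex_unlock(accept_mutex);
            have_lock = 0;
        }
    }
}

Releasing the lock after each round and re-competing for it on the next iteration is roughly how accepts get spread across workers; the 7/8 threshold described above would simply make a busy worker skip the trylock.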

5. References

http://blog.csdn.net/russell_tao/article/details/7204260

http://pureage.info/2015/12/22/thundering-herd.html

http://blog.chinaunix.net/uid-20671208-id-4935141.html

Summary of "surprise cluster" problem in Linux network programming

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.