First, a description of the environment: an upstream load balancer forwards requests to Nginx, and Nginx then forwards them to the backend application servers.
The Nginx configuration file is as follows:
upstream ads {
    server ap1:8888 max_fails=1 fail_timeout=60s;
    server ap2:8888 max_fails=1 fail_timeout=60s;
}
The symptoms are as follows:
A log entry resembling *379803415 no live upstreams while connecting to upstream is recorded roughly every two minutes.
In addition, there are a large number of "upstream prematurely closed connection while reading response headers from upstream" entries.
Let's look at the "No Live upstreams" issue first.
Look at the literal meaning is nginx found no surviving back end, but it is very strange thing is that the time has been the visit is normal, and with Wireshark see also has come in, also have returned.
Now only from the perspective of Nginx source.
Because it is upstream about the error, so in the ngx_http_upstream.c to find "no live upstreams" keyword, you can find the following code (in fact, you will find that if you find in Nginx Global code, There is only this keyword in this file):
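(The excerpt below is the relevant piece of ngx_http_upstream_connect(), abridged from the nginx 1.x source; exact line numbers vary between versions.)

    rc = ngx_event_connect_peer(&u->peer);

    ngx_log_debug1(NGX_LOG_DEBUG_HTTP, r->connection->log, 0,
                   "http upstream connect: %i", rc);

    if (rc == NGX_ERROR) {
        ngx_http_upstream_finalize_request(r, u,
                                           NGX_HTTP_INTERNAL_SERVER_ERROR);
        return;
    }

    u->state->peer = u->peer.name;

    if (rc == NGX_BUSY) {
        ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, "no live upstreams");
        ngx_http_upstream_next(r, u, NGX_HTTP_UPSTREAM_FT_NOLIVE);
        return;
    }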
As can be seen here, when rc equals NGX_BUSY, the "no live upstreams" error is logged.
Looking a little way up, at line 1328, you can see that rc is the value returned by the ngx_event_connect_peer function.
ngx_event_connect_peer is implemented in event/ngx_event_connect.c. In that function, only one place returns NGX_BUSY; every other path returns NGX_OK, NGX_ERROR, or NGX_AGAIN:
    rc = pc->get(pc, pc->data);

    if (rc != NGX_OK) {
        return rc;
    }
The pc here is a pointer to an ngx_peer_connection_t structure, and get is an ngx_event_get_peer_pt function pointer; at this point it is not yet obvious which function it points to.
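For context, the relevant fields of this structure, abridged from event/ngx_event_connect.h, look roughly like this:

typedef ngx_int_t (*ngx_event_get_peer_pt)(ngx_peer_connection_t *pc,
    void *data);
typedef void (*ngx_event_free_peer_pt)(ngx_peer_connection_t *pc, void *data,
    ngx_uint_t state);

struct ngx_peer_connection_s {
    ngx_connection_t        *connection;

    struct sockaddr         *sockaddr;
    socklen_t                socklen;
    ngx_str_t               *name;

    ngx_uint_t               tries;

    ngx_event_get_peer_pt    get;    /* filled in by a load-balancing module */
    ngx_event_free_peer_pt   free;
    void                    *data;

    /* ... */
};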
To find out where get points, go back to ngx_http_upstream.c. In ngx_http_upstream_init_main_conf we can see the following code:
    uscfp = umcf->upstreams.elts;

    for (i = 0; i < umcf->upstreams.nelts; i++) {

        init = uscfp[i]->peer.init_upstream ? uscfp[i]->peer.init_upstream:
                                            ngx_http_upstream_init_round_robin;

        if (init(cf, uscfp[i]) != NGX_OK) {
            return NGX_CONF_ERROR;
        }
    }
As you can see here, the default is round robin (in fact the load-balancing modules form a chain that is walked in turn, and from the configuration file above we know that no other balancer module is configured ahead of round robin), and each upstream is initialized with ngx_http_upstream_init_round_robin.
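For illustration, a rough sketch of how another balancer takes over: the ip_hash directive handler sets init_upstream on the upstream configuration, so the ternary in the loop above picks it instead of the round-robin default (heavily abridged; see ngx_http_upstream_ip_hash_module.c for the real code):

static char *
ngx_http_upstream_ip_hash(ngx_conf_t *cf, ngx_command_t *cmd, void *conf)
{
    ngx_http_upstream_srv_conf_t  *uscf;

    uscf = ngx_http_conf_get_module_srv_conf(cf, ngx_http_upstream_module);

    /* once this is set, init_main_conf no longer falls back to round robin */
    uscf->peer.init_upstream = ngx_http_upstream_init_ip_hash;

    /* ... upstream flags omitted ... */

    return NGX_CONF_OK;
}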
Then look at the ngx_http_upstream_init_round_robin function, or more precisely the per-request peer init function it registers, ngx_http_upstream_init_round_robin_peer, which contains the following line:
    r->upstream->peer.get = ngx_http_upstream_get_round_robin_peer;
This points the get pointer at ngx_http_upstream_get_round_robin_peer.
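A condensed sketch of how that wiring happens, paraphrasing ngx_http_upstream_round_robin.c with the function bodies heavily abridged:

/* configuration time: runs once per upstream{} block */
ngx_int_t
ngx_http_upstream_init_round_robin(ngx_conf_t *cf,
    ngx_http_upstream_srv_conf_t *us)
{
    us->peer.init = ngx_http_upstream_init_round_robin_peer;
    /* ... build the peer array from the server directives ... */
    return NGX_OK;
}

/* request time: wires up the callbacks used by ngx_event_connect_peer */
ngx_int_t
ngx_http_upstream_init_round_robin_peer(ngx_http_request_t *r,
    ngx_http_upstream_srv_conf_t *us)
{
    /* ... allocate rrp and point rrp->peers at the peer list ... */
    r->upstream->peer.get = ngx_http_upstream_get_round_robin_peer;
    r->upstream->peer.free = ngx_http_upstream_free_round_robin_peer;
    /* ... */
    return NGX_OK;
}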
In ngx_http_upstream_get_round_robin_peer, you can see:
    if (peers->single) {
        peer = &peers->peer[0];

        if (peer->down) {
            goto failed;
        }

    } else {

        /* there are several peers */

        peer = ngx_http_upstream_get_peer(rrp);

        if (peer == NULL) {
            goto failed;
        }
    }
And look at the failed part:
failed:

    if (peers->next) {

        /* ngx_unlock_mutex(peers->mutex); */

        ngx_log_debug0(NGX_LOG_DEBUG_HTTP, pc->log, 0, "backup servers");

        rrp->peers = peers->next;

        n = (rrp->peers->number + (8 * sizeof(uintptr_t) - 1))
                / (8 * sizeof(uintptr_t));

        for (i = 0; i < n; i++) {
            rrp->tried[i] = 0;
        }

        rc = ngx_http_upstream_get_round_robin_peer(pc, rrp);

        if (rc != NGX_BUSY) {
            return rc;
        }

        /* ngx_lock_mutex(peers->mutex); */
    }

    /* all peers failed, mark them as live for quick recovery */

    for (i = 0; i < peers->number; i++) {
        peers->peer[i].fails = 0;
    }

    /* ngx_unlock_mutex(peers->mutex); */

    pc->name = peers->name;

    return NGX_BUSY;
So the logic is: if connecting to one peer fails, try the next one; if all of them have failed, the "quick recovery" step resets every peer's fails counter to 0, the function returns NGX_BUSY, Nginx logs "no live upstreams", and everything is back in the initial state, after which forwarding resumes.
This explains why "no live upstreams" shows up even though the service remains accessible.
Looking at the configuration file again: with max_fails=1, a single failure on one backend makes Nginx consider it dead and shift all traffic to the other one; as soon as that one also has a failure, both are considered dead, quick recovery kicks in, and the log line is printed.
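Why a single failure is enough can be seen in the peer-selection loop of ngx_http_upstream_get_peer() (abridged), which skips any peer whose fails count has reached max_fails within fail_timeout:

    for (i = 0; i < rrp->peers->number; i++) {

        /* ... skip peers already tried for this request ... */

        peer = &rrp->peers->peer[i];

        if (peer->down) {
            continue;
        }

        /* with max_fails=1, one failure inside fail_timeout removes the peer */
        if (peer->max_fails
            && peer->fails >= peer->max_fails
            && now - peer->checked <= peer->fail_timeout)
        {
            continue;
        }

        /* ... weighted round-robin selection among the remaining peers ... */
    }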
Another problem with this: if several Nginx instances decide at the same time that one backend is dead, the traffic becomes unbalanced, which can also be seen in the Zabbix monitoring graphs:
A preliminary solution:
Change max_fails from 1 to 5 (i.e. server ap1:8888 max_fails=5 fail_timeout=60s;). The effect was obvious: "no live upstreams" occurred far less often, though it did not disappear completely.
That still left a large number of "upstream prematurely closed connection while reading response headers from upstream" entries in the log.
Back to the source: this error is reported from the ngx_http_upstream_process_header function, but whether the cause is the network or something else is not obvious from the code alone, so the next step is to capture packets with tcpdump.
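For reference, the check that produces the first half of that message sits in the read loop of ngx_http_upstream_process_header(), abridged below; the "while reading response headers from upstream" part is the log action that the error-log machinery appends:

    n = c->recv(c, u->buffer.last, u->buffer.end - u->buffer.last);

    if (n == NGX_AGAIN) {
        /* ... re-arm the read event and wait for more data ... */
        return;
    }

    if (n == 0) {
        /* recv() returned 0: the upstream closed the connection before
           a complete response header arrived */
        ngx_log_error(NGX_LOG_ERR, c->log, 0,
                      "upstream prematurely closed connection");
    }

    if (n == NGX_ERROR || n == 0) {
        ngx_http_upstream_next(r, u, NGX_HTTP_UPSTREAM_FT_ERROR);
        return;
    }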
In the captures, 54 is the load balancer in front of Nginx, 171 is the Nginx host, 32 is ap1, and 201 is ap2.
As shown below:
The request is load-balanced to Nginx. Nginx first replies to the load balancer with an ACK, then completes a three-way handshake with ap1 and sends it a packet of length 614. However, it then receives an ACK and a FIN+ACK; from ack=615 it is clear that these two packets are the response to the 614-byte packet, meaning the backend app simply closed the connection.
Nginx then responds to the backend app with an ACK and a FIN+ACK; from ack=2 it is clear this is the response to the FIN+ACK.
Next, Nginx sends a SYN packet to ap2 and receives the first ACK in return.
The second capture:
Here it can be seen that after Nginx completes the three-way handshake with ap2 and sends the request packet, the connection is likewise closed immediately.
Nginx then returns a 502 to the load balancer.
The packet captures thus corroborate the code analysis above from another angle.
The problem was then handed back to the colleague responsible for the backend application.