Overview
The healthcheck function is essentially a timer: it periodically checks the state of the specified upstream group by sending the configured HTTP request and parsing the response code, determines whether each peer in the upstream is alive, and then, combined with the history of previous probes, decides and flags each peer's state. If a state changes, the corresponding record in shared memory is updated, and every worker process picks up the latest peer state the next time it runs.
The discussion below assumes that the upstream group we want to monitor is named ats_node_backend, defined by the following block in nginx.conf:
upstream ats_node_backend {
    server 127.0.0.2;
    server 127.0.0.3 backup;
    server 127.0.0.4 fail_timeout=23 weight=7 max_fails=200 backup;
}
Configuration parameters explained
hc.spawn_checker(options)
options is a table passed in when this interface is called; it contains the following fields:
type: must be present; currently only "http" is supported
http_req: must be present; the raw HTTP request string used for the health probe
timeout: default 1000, in ms
interval: health check interval in ms, default 1000; 2000 is recommended
valid_statuses: a table of valid response codes, e.g. {200, 302}
concurrency: concurrency level, default 1
fall: default 5; a peer that is up is marked down after this many consecutive failures
rise: default 2; a peer that is down is marked up after this many consecutive successes
shm: must be configured; the name of the shared memory zone used for the health check, accessed as ngx.shared[shm]
upstream: the name of the upstream group to health-check; must be present
version: default 0
primary_peers: the primary peer group
backup_peers: the backup peer group
statuses: the table holding the valid response codes, built from the valid_statuses option via ipairs()
Based on these options, a ctx table is constructed to hold all of the configuration data; it is passed as the third argument to ngx.timer.at().
The contents of ctx are as follows:
upstream: the name of the upstream group being checked
primary_peers: the primary peer group
backup_peers: the backup peer group
http_req: the raw HTTP request used for the health check
timeout: timeout, in seconds (note: not ms)
interval: health check interval, in seconds (note: not ms)
dict: the shared memory zone used to store the statistics
fall: the number of consecutive failures before a peer is marked down, default 5
rise: the number of consecutive successes before a peer is marked up, default 2
statuses: the table of valid HTTP status codes, e.g. {200, 302}
version: starts at 0; incremented by 1 on each scheduled run in which any peer's status changes
concurrency: this many lightweight threads are created to send the health probe requests concurrently
These configuration items are saved as a context and referenced repeatedly at different stages.
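To make the relationship between the options and ctx concrete, here is a rough sketch of how a checker could assemble ctx and schedule itself. This is a simplified illustration, not the module's actual code; the check callback body is omitted, and the ms-to-seconds conversion simply follows the units described above.

-- Simplified sketch (not the module's actual code): build ctx from the opts
-- described above and schedule the first run, passing ctx to the timer.
local ctx = {
    upstream      = opts.upstream,
    primary_peers = primary_peers,                 -- from the upstream module
    backup_peers  = backup_peers,
    http_req      = opts.http_req,
    timeout       = (opts.timeout or 1000) / 1000, -- ms in opts, seconds in ctx
    interval      = (opts.interval or 1000) / 1000,
    dict          = ngx.shared[opts.shm],
    fall          = opts.fall or 5,
    rise          = opts.rise or 2,
    statuses      = statuses,                      -- set built from opts.valid_statuses
    version       = 0,
    concurrency   = opts.concurrency or 1,
}
local ok, err = ngx.timer.at(0, check, ctx)        -- check() re-arms itself every interval
if not ok then
    return nil, "failed to create timer: " .. err
end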
How to use
In the init_worker_by_lua_block phase, put the following code in nginx.conf:
local hc = require "resty.upstream.healthcheck"
local ok, err = hc.spawn_checker{
shm = "healthcheck", -- defined by "lua_shared_dict"
upstream = "foo.com", -- defined by "upstream"
type = "http",
http_req = "GET /status HTTP/1.0\r\nHost: foo.com\r\n\r\n",
-- raw HTTP request for checking
interval = 2000, -- run the check cycle every 2 sec
timeout = 1000, -- 1 sec is the timeout for network operations
fall = 3, -- # of successive failures before turning a peer down
rise = 2, -- # of successive successes before turning a peer up
valid_statuses = {200, 302}, -- a list of valid HTTP status codes
concurrency = 10, -- concurrency level for test requests
}
if not ok then
ngx.log(ngx.ERR, "failed to spawn health checker: ", err)
return
end
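Note that the shm name above has to match a lua_shared_dict zone, and the call must sit inside init_worker_by_lua_block. A minimal sketch of the surrounding nginx.conf, following the pattern in the module's README (the 1m zone size is an arbitrary example):

http {
    # shared memory zone used by the checker; name must match "shm" above
    lua_shared_dict healthcheck 1m;

    # avoid flooding the error log with expected probe failures
    lua_socket_log_errors off;

    init_worker_by_lua_block {
        -- the hc.spawn_checker{...} code shown above goes here
    }

    ...
}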
Here the host can be 127.0.0.1 or another IP. If you want to view the state of the probed upstream groups, also configure the following in nginx.conf:
server {
...
# status page for all the peers:
location = /status {
access_log off;
allow 127.0.0.1;
deny all;
default_type text/plain;
content_by_lua_block {
local hc = require "resty.upstream.healthcheck"
ngx.say("Nginx Worker PID: ", ngx.worker.pid())
ngx.print(hc.status_page())
}
}
}
Then, from outside, run:
wget 'http://127.0.0.1/status'
You can see the peer status of all upstreams, in particular the ats_node_backend group we are probing; upstream groups without a checker configured are marked with "(NO checkers)".
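The output looks roughly like the sample below (adapted from the module's README; the PID, addresses, and the second upstream are placeholders):

Nginx Worker PID: 4453
Upstream ats_node_backend
    Primary Peers
        127.0.0.2:80 up
    Backup Peers
        127.0.0.3:80 up
        127.0.0.4:80 DOWN
Upstream some_other_backend (NO checkers)
    Primary Peers
        10.10.101.10:18980 up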
Code structure
The module itself is essentially a timer and provides two external interfaces: spawn_checker() and status_page().
The timer does the following jobs, walked through below in question-and-answer form.
1. How is a health probe performed on a peer?
A common health probe either simply tries to connect to the peer or sends a full HTTP request. The latter is a more accurate check of whether the service is healthy, because with port-level probing alone the port may still accept connections even when the service behind it is effectively dead.
The healthcheck module uses Lua lightweight threads to send the configured HTTP request to each service being checked and to receive the response status code; based on that status code it judges whether the service is healthy, i.e. up or down.
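Roughly, one probe is a plain cosocket round trip. The sketch below is a simplified illustration, not the module's exact code: peer.name is assumed to be of the form "host:port", ctx.statuses is assumed to be a set keyed by status code, and error handling is abbreviated.

-- Simplified sketch of probing one peer over a cosocket.
local function probe_peer(ctx, peer)
    local host, port = string.match(peer.name, "^(.+):(%d+)$")

    local sock = ngx.socket.tcp()
    sock:settimeout(ctx.timeout * 1000)            -- cosocket timeouts are in ms

    local ok, err = sock:connect(host, tonumber(port))
    if not ok then
        return false, "connect failed: " .. (err or "")
    end

    local bytes, err = sock:send(ctx.http_req)     -- the raw HTTP request string
    if not bytes then
        sock:close()
        return false, "send failed: " .. (err or "")
    end

    -- read the status line, e.g. "HTTP/1.1 200 OK"
    local line, err = sock:receive()
    sock:close()
    if not line then
        return false, "receive failed: " .. (err or "")
    end

    local status = tonumber(string.match(line, "HTTP/%d+%.%d+ (%d%d%d)"))
    if not status then
        return false, "bad status line: " .. line
    end

    -- healthy only if the code is one of the configured valid statuses
    if ctx.statuses and not ctx.statuses[status] then
        return false, "bad status code: " .. status
    end
    return true
end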
2. How is each result of a peer health probe saved?
A probe result can only go one of two ways: success or failure, so two kinds of records are kept. For example, the key ok:ats_node_backend:p9 records the accumulated number of successes for the peer with ID 9 among the primary peers, and likewise nok:ats_node_backend:b1 records the accumulated number of failures for the peer with ID 1 among the backup peers.
If the current probe succeeds, the accumulated success count is read from shared memory, incremented by 1 on top of the original value, and written back; if this is the first success after a run of failures (the count was previously 0 or absent), the corresponding failure record is cleared from shared memory. Failed probes are handled symmetrically with the nok: counter.
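A simplified sketch of that bookkeeping for the success path; gen_peer_key is a hypothetical helper shown here only to make the key layout above explicit:

-- hypothetical helper: builds keys like "ok:ats_node_backend:p9" / ":b1"
local function gen_peer_key(prefix, upstream, is_backup, id)
    return prefix .. upstream .. (is_backup and ":b" or ":p") .. id
end

-- Simplified sketch: record one successful probe for a peer.
local function record_success(ctx, is_backup, id)
    local dict = ctx.dict
    local ok_key  = gen_peer_key("ok:",  ctx.upstream, is_backup, id)
    local nok_key = gen_peer_key("nok:", ctx.upstream, is_backup, id)

    local succ = dict:get(ok_key)
    if not succ then
        succ = 1
        dict:set(ok_key, 1)
    else
        succ = succ + 1
        dict:incr(ok_key, 1)
    end

    -- first success after a run of failures: reset the failure counter
    if succ == 1 then
        dict:set(nok_key, 0)
    end

    return succ            -- compared against ctx.rise by the caller
end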
3. How is it decided whether the current peer's status should change?
Three factors are considered, labeled (1), (2), (3) below:
(1) The current status of the peer, peer.down, obtained via the upstream module interface upstream.get_primary_peers() (or get_backup_peers()); note that this peer.down value reflects only the current worker's view, so it has to be handled with care.
(2) The current number of consecutive successes or failures.
(3) The success and failure thresholds in the healthcheck configuration, which can be read from the timer's ctx argument as ctx.rise and ctx.fall.
The following example shows the decision for taking a peer offline:
The peer is currently online (peer.down is nil or false) and the number of consecutive failures has exceeded the configured failure threshold, so it is identified as offline (down).
The further action is to call the set_peer_down_globally() function and set peer.down = true, as sketched below.
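Putting the three factors together, the decision for one peer looks roughly like the sketch below. This is a simplified illustration; set_peer_down_globally is the module function named above and is assumed to be in scope, and fails/succs are the counters read back from shared memory.

-- Simplified sketch: decide whether a peer's state should flip after a probe.
local function update_peer_state(ctx, peer, is_backup, id, fails, succs)
    if not peer.down then
        -- factor (1): currently up; factors (2)+(3): too many consecutive failures
        if fails >= ctx.fall then
            ngx.log(ngx.WARN, "peer ", peer.name, " is turned down after ",
                    fails, " failure(s)")
            peer.down = true
            set_peer_down_globally(ctx, is_backup, id, true)
        end
    else
        -- currently down; enough consecutive successes bring it back up
        if succs >= ctx.rise then
            ngx.log(ngx.WARN, "peer ", peer.name, " is turned up after ",
                    succs, " success(es)")
            peer.down = nil
            set_peer_down_globally(ctx, is_backup, id, nil)
        end
    end
end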
4. What is the relationship between set_peer_down and peer.down?
They describe the same state: set_peer_down() is the interface used to change it, and peer.down is the flag in which it is reflected.
5. How is the health detection of multiple peers carried out?
Nginx lightweight threads are used to run multiple check tasks concurrently; after the asynchronous work completes, the callback handler runs and the thread exits once its processing is done.
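A simplified sketch of that pattern with ngx.thread.spawn/ngx.thread.wait; check_peer stands in for the per-peer probe described above and is passed in to keep the sketch self-contained, and the real module additionally splits the peers into batches according to ctx.concurrency.

-- Simplified sketch: probe a list of peers concurrently with light threads.
local function check_peers_concurrently(ctx, peers, is_backup, check_peer)
    local threads = {}
    for i, peer in ipairs(peers) do
        -- each light thread probes one peer
        threads[i] = ngx.thread.spawn(check_peer, ctx, peer, i, is_backup)
    end

    -- wait for every probe to finish before this check round ends
    for i = 1, #threads do
        local ok, err = ngx.thread.wait(threads[i])
        if not ok then
            ngx.log(ngx.ERR, "failed to wait on a check thread: ", err)
        end
    end
end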
6. When a worker discovers during a health check that a peer has gone down, how is this handled, and how is the peer's status propagated so that all other workers learn about it? In this script it is done in three steps (refer to the diagram above):
(1) In the set_peer_down_globally() function, set_peer_down is called according to the probe result, and two things are done to drive the following steps:
ctx.new_version = true is set;
and a record keyed like d:ats_node_backend:p9 is stored in shared memory, marking whether that peer is in the up or down state.
(2) In the do_check() function, if there is a new version, the value of the v:ats_node_backend record in shared memory is incremented by 1, ctx.version is updated to it, and ctx.new_version is cleared.
The code that handles this is quite clever:
dict:add(key, 0)
local new_ver, err = dict:incr(key, 1)
dict:add() is used so that an existing value is never overwritten: if the key already exists it simply returns nil with err = "exists", and if it does not exist the key is created with the value 0.
The following incr then adds 1 on top of whatever is there, so it works correctly in both cases.
(3) On the next scheduled run, every worker process first looks up the record with key v:ats_node_backend in shared memory, i.e. obtains the authoritative value of the version. It compares its own ctx.version against that value; if ctx.version is smaller than the value in shared memory, the worker needs to update its peers (see the sketch after this step).
Note that the worker process IDs in the corresponding log lines differ, showing that each worker updates its own peer version.
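A simplified sketch of that comparison at the start of a check round; refresh_peers_from_shm is a hypothetical name for the per-peer refresh shown under question 7 below.

-- Simplified sketch: has another worker published a newer peer-state version?
local function sync_version(ctx)
    local dict = ctx.dict
    local key = "v:" .. ctx.upstream

    local ver, err = dict:get(key)
    if not ver then
        if err then
            ngx.log(ngx.ERR, "failed to get ", key, ": ", err)
        end
        return
    end

    if ctx.version < ver then
        -- shared memory is ahead of this worker: refresh the local peer view
        refresh_peers_from_shm(ctx)    -- hypothetical helper, see question 7
        ctx.version = ver
    end
end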
7. How does another worker bring its peers up to the new version? It looks up the record with key d:ats_node_backend:p9 in shared memory: if the record says down, the peer is down, otherwise the peer is up. Only when peer.down differs from the value found does the worker call set_peer_down and then set peer.down to that value.
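A simplified sketch of that per-peer refresh (the hypothetical refresh_peers_from_shm referenced above, shown for one peer group); set_peer_down comes from lua-upstream-nginx-module, and gen_peer_key is the hypothetical key helper from earlier.

-- Simplified sketch: align this worker's view of one peer group with the
-- d:* records published in shared memory.
local upstream = require "ngx.upstream"

local function refresh_peer_group(ctx, peers, is_backup)
    local dict = ctx.dict
    for _, peer in ipairs(peers) do
        local key = gen_peer_key("d:", ctx.upstream, is_backup, peer.id)
        local down = dict:get(key)     -- truthy if the peer was marked down

        -- only act when the shared state differs from this worker's view
        if (not peer.down) ~= (not down) then
            local ok, err = upstream.set_peer_down(ctx.upstream, is_backup,
                                                   peer.id, down and true or false)
            if not ok then
                ngx.log(ngx.ERR, "failed to set peer down: ", err)
            else
                peer.down = down and true or nil
            end
        end
    end
end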
8. Why is the version number not passed around via ctx.version, but through a dedicated record in shared memory?
This comes down to how Nginx worker processes execute and keep their state in sync. Normally every piece of code is executed by all workers, but for the peers' health check only one worker needs to actually run the probes; once it has run, it is enough for the other workers to synchronize its result. In do_check() we can see that only the code protected by get_lock() is executed by a single worker, while all the remaining code is executed by every worker. The scheduled task is grabbed by whichever worker gets the timer first, so the executing worker is not fixed and a single worker's view of ctx.version is generally discontinuous; going through shared memory guarantees that every worker obtains the latest peer information on every run and that the peer version increases monotonically.
Within one worker's run, however, all the fields in ctx can simply be passed between functions.
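For reference, the one-worker guarantee comes from a shared-memory lock taken at the start of each round. A minimal sketch of such a lock, modeled on the get_lock idea mentioned above (the key name and exact expiry handling are assumptions here):

-- Simplified sketch: only the worker that manages to add the lock key runs
-- the probes this round; the key expires by itself, so no unlock is needed.
local function get_lock(ctx)
    local dict = ctx.dict
    local key = "l:" .. ctx.upstream

    -- hold the lock slightly less than one interval to avoid racing the
    -- next timer event
    local ok, err = dict:add(key, true, ctx.interval - 0.001)
    if not ok then
        if err == "exists" then
            return nil                 -- another worker owns this round
        end
        ngx.log(ngx.ERR, "failed to add key \"", key, "\": ", err)
        return nil
    end
    return true
end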
9. Why must all worker processes execute the set_peer_down function?
https://github.com/openresty/lua-upstream-nginx-module#set_peer_down
The official documentation above emphasizes that this function only takes effect at the server level when it is executed by all workers; executed in a single worker, it only takes effect within that worker.
10. When calling this in nginx.conf, can it be restricted to a single process, for example with an if statement such as ngx.worker.id() == 0?
This is a mistake I have made myself, and the answer is of course no, for two reasons: first, for the upstream.set_peer_down interface to take effect at the server level, it must be called by all worker processes; second, the healthcheck module already ensures internally that only a single worker performs the actual health probing.
11. How can the status of the peers in the upstreams be obtained from the outside? The module provides a status query interface, _M.status_page().
The details of its handling are as follows:
status_page() creates a local array-typed table to collect all the status fragments, and finally the elements of that array are concatenated into a single string.
local bits = new_tab(n * 20, 0)
Here n = #us, where us holds the upstream group names; each upstream group uses up to 20 array slots.
table.concat() joins the elements of the table, optionally with a specified separator.
The contents of the bits array look like this:
bits[1] = "Upstream "
bits[2] = "ats_node_backend"
bits[3] = " (NO checkers)"
bits[4] = "\n    Primary Peers\n"
bits[5] = "        "
bits[6] = "10.10.101.10:18980"
bits[7] = " DOWN\n"
......
bits[n] = "    Backup Peers\n"
bits[n+1] = "        "
bits[n+2] = "10.10.101.12:18980"
bits[n+3] = " up\n"
......
bits[n+m] = "\n"
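To tie the listing above together, here is a simplified sketch of how such an array could be filled and concatenated; it is not the module's exact code, the "(NO checkers)" handling is omitted, and get_upstreams/get_primary_peers/get_backup_peers come from lua-upstream-nginx-module.

-- Simplified sketch of assembling a status page like the listing above.
local upstream = require "ngx.upstream"

local function simple_status_page()
    local us = upstream.get_upstreams()
    local bits, idx = {}, 0

    local function put(s)
        idx = idx + 1
        bits[idx] = s
    end

    local function put_peers(peers)
        for _, peer in ipairs(peers) do
            put("        ")
            put(peer.name)
            put(peer.down and " DOWN\n" or " up\n")
        end
    end

    for i, u in ipairs(us) do
        if i > 1 then put("\n") end
        put("Upstream ")
        put(u)
        put("\n    Primary Peers\n")
        put_peers(upstream.get_primary_peers(u))
        put("    Backup Peers\n")
        put_peers(upstream.get_backup_peers(u))
    end

    return table.concat(bits)
end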
References
[1] https://github.com/openresty/lua-resty-upstream-healthcheck
[2] https://github.com/openresty/lua-upstream-nginx-module#set_peer_down