Istio Source Analysis: How pilot-agent Manages the Envoy Life Cycle
Notes
- The source code analyzed is version 0.7.1
- The environment is Kubernetes (K8s)
- Since I have no C++ background, the analysis stops where the code crosses into C++ (Envoy itself); even so, I learned a lot along the way
What is Pilot-agent?
When we run

```
kubectl apply -f <(istioctl kube-inject -f sleep.yaml)
```

K8s creates 3 containers for us.
```
[root@izwz9cffi0prthtem44cp9z ~]# docker ps | grep sleep
8e0de7294922 istio/proxy
ccddc800b2a2 registry.cn-shenzhen.aliyuncs.com/jukylin/sleep
990868aa4a42 registry-vpc.cn-shenzhen.aliyuncs.com/acs/pause-amd64:3.0
```
Of these 3 containers, the one we care about is istio/proxy. It runs 2 services, pilot-agent and envoy. The rest of this article looks at how pilot-agent manages the life cycle of envoy.
```
[root@izwz9cffi0prthtem44cp9z ~]# docker exec -it 8e0de7294922 ps -ef
UID   PID PPID C STIME TTY TIME     CMD
1337    1    0 0 May09 ?   00:00:49 /usr/local/bin/pilot-agent proxy
1337  567    1 1 09:18 ?   00:04:42 /usr/local/bin/envoy -c /etc/ist
```
Why is pilot-agent needed?
Envoy does not interact directly with platforms such as K8s, Consul, or Eureka, so it needs another service to integrate with them and manage its configuration. pilot-agent is one piece of that "control plane".
Start envoy
Load configuration
Before starting envoy, pilot-agent generates a configuration file, /etc/istio/proxy/envoy-rev0.json:
```go
// istio.io/istio/pilot/pkg/proxy/envoy/v1/config.go #88
func BuildConfig(config meshconfig.ProxyConfig, pilotSAN []string) *Config {
    ......
    return out
}
```
The actual contents of the file can be viewed directly inside the container:

```
docker exec -it 8e0de7294922 cat /etc/istio/proxy/envoy-rev0.json
```

The meaning of each configuration item is explained in the official documentation.
Startup parameters
Starting a binary almost always requires some arguments, and envoy is no exception.
```go
// istio.io/istio/pilot/pkg/proxy/envoy/v1/watcher.go #274
func (proxy envoy) args(fname string, epoch int) []string {
    ......
    return startupArgs
}
```
The envoy startup parameters can be viewed with

```
docker logs 8e0de7294922
```

Below are the parameters captured from the terminal; see the official documentation for what each one means.
```
-c /etc/istio/proxy/envoy-rev0.json
--restart-epoch 0
--drain-time-s 45
--parent-shutdown-time-s 60
--service-cluster sleep
--service-node sidecar~172.00.00.000~sleep-55b5877479-rwcct.default~default.svc.cluster.local
--max-obj-name-len 189
-l info
--v2-config-only
```
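To make the shape of args() concrete, here is a minimal sketch of how such a flag list can be assembled. The proxyConfig struct and its field names are illustrative stand-ins, not the real istio 0.7.1 types:

```go
package main

import (
    "fmt"
    "strconv"
)

// illustrative stand-in for istio's proxy configuration; not the real struct
type proxyConfig struct {
    DrainSeconds          int
    ParentShutdownSeconds int
    ServiceCluster        string
    ServiceNode           string
}

// buildArgs mirrors what args(fname, epoch) produces for the envoy command line
func buildArgs(cfg proxyConfig, fname string, epoch int) []string {
    return []string{
        "-c", fname,
        "--restart-epoch", strconv.Itoa(epoch),
        "--drain-time-s", strconv.Itoa(cfg.DrainSeconds),
        "--parent-shutdown-time-s", strconv.Itoa(cfg.ParentShutdownSeconds),
        "--service-cluster", cfg.ServiceCluster,
        "--service-node", cfg.ServiceNode,
    }
}

func main() {
    cfg := proxyConfig{45, 60, "sleep", "sidecar~..."}
    fmt.Println(buildArgs(cfg, "/etc/istio/proxy/envoy-rev0.json", 0))
}
```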
Starting the envoy process
pilot-agent uses exec.Command to start envoy and watches envoy's running state: if envoy exits abnormally, the status returned by Wait is non-nil and pilot-agent applies its restart policy (covered below). proxy.config.BinaryPath is the path to the envoy binary, /usr/local/bin/envoy, and args holds the startup parameters described above.
```go
// istio.io/istio/pilot/pkg/proxy/envoy/v1/watcher.go #353
func (proxy envoy) Run(config interface{}, epoch int, abort <-chan error) error {
    ......
    /* #nosec */
    cmd := exec.Command(proxy.config.BinaryPath, args...)
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Start(); err != nil {
        return err
    }
    ......
    done := make(chan error, 1)
    go func() {
        done <- cmd.Wait()
    }()

    select {
    case err := <-abort:
        ......
    case err := <-done:
        return err
    }
}
```
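To see the pattern in isolation, here is a minimal, self-contained sketch of the same supervise-and-abort logic; the binary path and arguments are placeholders rather than the real envoy invocation:

```go
package main

import (
    "fmt"
    "os"
    "os/exec"
)

// runProxy starts a child process and returns when it exits on its own
// or when an abort is requested, mirroring envoy.Run above.
func runProxy(binary string, args []string, abort <-chan error) error {
    cmd := exec.Command(binary, args...)
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Start(); err != nil {
        return err
    }

    done := make(chan error, 1)
    go func() { done <- cmd.Wait() }()

    select {
    case err := <-abort:
        // caller asked us to abort: kill the child and report why
        _ = cmd.Process.Kill()
        return err
    case err := <-done:
        // child exited by itself; err is non-nil on an abnormal exit
        return err
    }
}

func main() {
    abort := make(chan error, 1)
    // placeholder command standing in for the envoy binary
    if err := runProxy("/bin/sleep", []string{"1"}, abort); err != nil {
        fmt.Println("proxy exited abnormally:", err)
    }
}
```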
Hot restart of envoy
Here we only discuss how pilot-agent carries out a hot restart of envoy; what triggers this step will be described in the next article.
Envoy's hot restart strategy
To learn more about Envoy's hot restart strategy, see the official blog post "Envoy Hot Restart".
A brief outline of the hot restart steps:
- Start a second envoy process, envoy2 (the secondary process)
- envoy2 notifies envoy1 (the primary process) to release the ports it manages, and envoy2 takes them over
- envoy1 hands its available listen sockets to envoy2 over a Unix domain socket (UDS)
- Once envoy2 has initialized successfully, it notifies envoy1 to gracefully drain in-flight requests within a time window (--drain-time-s)
- When the window (--parent-shutdown-time-s) expires, envoy2 notifies envoy1 to shut itself down
- envoy2 is promoted to the primary process, taking envoy1's place
As the steps above show, pilot-agent is only responsible for starting the new envoy process; everything else is handled by envoy itself.
When does a hot restart happen?
When pilot-agent starts, it watches the files under the /etc/certs/ directory. If the files in that directory are modified or deleted, pilot-agent notifies envoy to perform a hot restart. How those files come to be modified or deleted will be introduced in the next article.
```go
// istio.io/istio/pilot/pkg/proxy/envoy/v1/watcher.go #177
func watchCerts(ctx context.Context, certsDirs []string, watchFileEventsFn watchFileEventsFn,
    minDelay time.Duration, updateFunc func()) {
    fw, err := fsnotify.NewWatcher()
    if err != nil {
        log.Warnf("failed to create a watcher for certificate files: %v", err)
        return
    }
    defer func() {
        if err := fw.Close(); err != nil {
            log.Warnf("closing watcher encounters an error %v", err)
        }
    }()

    // watch all directories
    for _, d := range certsDirs {
        if err := fw.Watch(d); err != nil {
            log.Warnf("watching %s encounters an error %v", d, err)
            return
        }
    }
    watchFileEventsFn(ctx, fw.Event, minDelay, updateFunc)
}
```
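The interesting parameter here is minDelay: one certificate rotation produces a burst of file events (see the log captured later in this section), but should trigger only a single hot restart. Here is a dependency-free sketch of that coalescing logic, modeled on the watchFileEvents function that watchCerts hands its events to:

```go
package main

import (
    "fmt"
    "time"
)

// debounce reads events from ch and calls update at most once per burst:
// the first event arms a quiet-period timer, and events arriving before
// it fires are coalesced into the same update.
func debounce(ch <-chan string, minDelay time.Duration, update func()) {
    var timer <-chan time.Time // nil channel blocks forever in select
    for {
        select {
        case ev := <-ch:
            fmt.Println("event:", ev)
            if timer == nil {
                timer = time.After(minDelay)
            }
        case <-timer:
            timer = nil
            update()
        }
    }
}

func main() {
    ch := make(chan string)
    go debounce(ch, 100*time.Millisecond, func() { fmt.Println("trigger hot restart") })
    // simulate the burst of cert-rotation events seen in the log
    for _, ev := range []string{"CREATE", "MODIFY", "RENAME", "DELETE"} {
        ch <- ev
    }
    time.Sleep(300 * time.Millisecond) // give the quiet-period timer time to fire
}
```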
Hot restart startup parameters
```
-c /etc/istio/proxy/envoy-rev1.json
--restart-epoch 1
--drain-time-s 45
--parent-shutdown-time-s 60
--service-cluster sleep
--service-node sidecar~172.00.00.000~sleep-898b65f84-pnsxr.default~default.svc.cluster.local
--max-obj-name-len 189
-l info
--v2-config-only
```
The hot-restart startup parameters differ from the first start only in -c and --restart-epoch. For -c, only the configuration file name differs; the contents are the same. --restart-epoch is incremented by 1 on every hot restart, and envoy uses it to decide whether it is performing a hot restart or starting for the first time. See the official documentation for details.
```go
// istio.io/istio/pilot/pkg/proxy/agent.go #258
func (a *agent) reconcile() {
    ......
    // discover and increment the latest running epoch
    epoch := a.latestEpoch() + 1
    // buffer aborts to prevent blocking on failing proxy
    abortCh := make(chan error, MaxAborts)
    a.epochs[epoch] = a.desiredConfig
    a.abortCh[epoch] = abortCh
    a.currentConfig = a.desiredConfig
    go a.waitForExit(a.desiredConfig, epoch, abortCh)
}
```
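Note that reconcile() picks epoch = latestEpoch() + 1, and the configuration rendered for that epoch is written to a file carrying the epoch in its name, which is why the hot-restarted envoy reads envoy-rev1.json. A sketch of that naming scheme; the helper below is my own illustration, grounded in the rev0/rev1 file names seen above:

```go
package main

import "fmt"

// configFile derives the per-epoch bootstrap file name: epoch 0 yields
// envoy-rev0.json, the first hot restart yields envoy-rev1.json, and so on.
func configFile(dir string, epoch int) string {
    return fmt.Sprintf("%s/envoy-rev%d.json", dir, epoch)
}

func main() {
    fmt.Println(configFile("/etc/istio/proxy", 0)) // first start
    fmt.Println(configFile("/etc/istio/proxy", 1)) // after one hot restart
}
```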
Hot restart log captured from the terminal:
```
2018-04-24T13:59:35.513160Z info watchFileEvents: "/etc/certs//..2018_04_24_13_59_35.824521609": CREATE
2018-04-24T13:59:35.513228Z info watchFileEvents: "/etc/certs//..2018_04_24_13_59_35.824521609": MODIFY|ATTRIB
2018-04-24T13:59:35.513283Z info watchFileEvents: "/etc/certs//..data_tmp": RENAME
2018-04-24T13:59:35.513347Z info watchFileEvents: "/etc/certs//..data": CREATE
2018-04-24T13:59:35.513372Z info watchFileEvents: "/etc/certs//..2018_04_24_04_30_11.964751916": DELETE
```
Rescuing envoy
envoy is a service, and no service can guarantee 100% availability. If envoy is unlucky enough to go down, how does pilot-agent rescue it and keep it highly available?
Getting the exit status
As mentioned above, when pilot-agent starts envoy it watches envoy's exit status; if it detects an abnormal exit, it steps in to rescue envoy.
```go
func (proxy envoy) Run(config interface{}, epoch int, abort <-chan error) error {
    ......
    // Set if the caller is monitoring envoy, for example in tests or if envoy runs in same
    // container with the app.
    if proxy.errChan != nil {
        // Caller passed a channel, will wait itself for termination
        go func() {
            proxy.errChan <- cmd.Wait()
        }()
        return nil
    }

    done := make(chan error, 1)
    go func() {
        done <- cmd.Wait()
    }()
    ......
}
```
Rescuing envoy
You can simulate an abnormal envoy exit with kill -9. When an abnormal exit occurs, pilot-agent's rescue mechanism kicks in. If the first rescue succeeds, all is well; if it fails, pilot-agent keeps trying, up to 10 times, with the interval doubling each time (2^n × 200 ms). If envoy still has not been rescued after 10 attempts, pilot-agent gives up, declares envoy dead, and exits istio/proxy, letting K8s start a fresh container.
```go
// istio.io/istio/pilot/pkg/proxy/agent.go #164
func (a *agent) Run(ctx context.Context) {
    ......
    for {
        ......
        select {
        case status := <-a.statusCh:
            ......
            if status.err == errAbort {
                // pilot-agent asked this epoch to exit, or envoy exited abnormally
                log.Infof("Epoch %d aborted", status.epoch)
            } else if status.err != nil {
                // envoy exited abnormally
                log.Warnf("Epoch %d terminated with an error: %v", status.epoch, status.err)
                ......
                a.abortAll()
            } else {
                // normal exit
                log.Infof("Epoch %d exited normally", status.epoch)
            }
            ......
            if status.err != nil {
                // skip retrying twice by checking retry restart delay
                if a.retry.restart == nil {
                    if a.retry.budget > 0 {
                        delayDuration := a.retry.InitialInterval * (1 << uint(a.retry.MaxRetries-a.retry.budget))
                        restart := time.Now().Add(delayDuration)
                        a.retry.restart = &restart
                        a.retry.budget = a.retry.budget - 1
                        log.Infof("Epoch %d: set retry delay to %v, budget to %d", status.epoch, delayDuration, a.retry.budget)
                    } else {
                        // declare death, exit istio/proxy
                        log.Error("Permanent error: budget exhausted trying to fulfill the desired configuration")
                        a.proxy.Panic(a.desiredConfig)
                        return
                    }
                } else {
                    log.Debugf("Epoch %d: restart already scheduled", status.epoch)
                }
            }
        case <-time.After(delay):
            ......
        case _, more := <-ctx.Done():
            ......
        }
    }
}
```
```go
// istio.io/istio/pilot/pkg/proxy/agent.go #72
var (
    errAbort = errors.New("epoch aborted")

    // DefaultRetry configuration for proxies
    DefaultRetry = Retry{
        MaxRetries:      10,
        InitialInterval: 200 * time.Millisecond,
    }
)
```
Rescue log
```
Epoch 6: set retry delay to 200ms, budget to 9
Epoch 6: set retry delay to 400ms, budget to 8
Epoch 6: set retry delay to 800ms, budget to 7
```
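These delays follow directly from the formula in Run() above: delay = InitialInterval × 2^(MaxRetries − budget). A few lines of Go reproduce the logged sequence:

```go
package main

import (
    "fmt"
    "time"
)

const (
    maxRetries      = 10                     // matches DefaultRetry.MaxRetries
    initialInterval = 200 * time.Millisecond // matches DefaultRetry.InitialInterval
)

func main() {
    budget := maxRetries
    // reproduce the first few lines of the rescue log above
    for i := 0; i < 3; i++ {
        delay := initialInterval * (1 << uint(maxRetries-budget))
        budget--
        fmt.Printf("Epoch 6: set retry delay to %v, budget to %d\n", delay, budget)
    }
}
```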
Gracefully closing envoy
When a service goes offline or is upgraded, we want the transition to be gentle enough that users never notice. That requires the service, on receiving an exit notification, to finish the tasks it is currently handling before shutting down, rather than dying on the spot. Does envoy support a graceful shutdown? K8s and pilot-agent have to support it as well, because of the chain of management: K8s manages pilot-agent, and pilot-agent manages envoy.
How K8s lets a service exit gracefully
There are blog posts online summarizing how K8s gracefully shuts down pods; briefly, the process is:
- K8s sends SIGTERM to process 1 of every container in the pod
- On receiving the signal, the service finishes its in-flight work and exits
- After a grace period (30s by default), if the service still has not exited, K8s sends SIGKILL to force the container to quit
How pilot-agent lets envoy exit gracefully
- pilot-agent receives the signal from K8s
pilot-agent listens for syscall.SIGINT and syscall.SIGTERM; either of these two signals starts the graceful shutdown of envoy.
```go
// istio.io/istio/pkg/cmd/cmd.go #29
func WaitSignal(stop chan struct{}) {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
    <-sigs
    close(stop)
    _ = log.Sync()
}
```
- Notify the child services to shut down envoy
Go ships a context-management package, context, which lets the main service notify every sub-service to perform its shutdown via a single broadcast (cancelling a shared context).
```go
// istio.io/istio/pilot/cmd/pilot-agent/main.go #242
ctx, cancel := context.WithCancel(context.Background())
go watcher.Run(ctx)

stop := make(chan struct{})
cmd.WaitSignal(stop)
<-stop
// notify the sub-services
cancel()

// istio.io/istio/pilot/pkg/proxy/agent.go
func (a *agent) Run(ctx context.Context) {
    ......
    for {
        ......
        // receive the notification from the main service and tell envoy to exit
        case _, more := <-ctx.Done():
            if !more {
                a.terminate()
                return
            }
        }
    }
}

// istio.io/istio/pilot/pkg/proxy/envoy/v1/watcher.go #297
func (proxy envoy) Run(config interface{}, epoch int, abort <-chan error) error {
    ......
    select {
    case err := <-abort:
        log.Warnf("Aborting epoch %d", epoch)
        // send the kill signal to envoy
        if errKill := cmd.Process.Kill(); errKill != nil {
            log.Warnf("killing epoch %d caused an error %v", epoch, errKill)
        }
        return err
    ......
    }
}
```
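Stripped of the istio details, the broadcast works like this: cancel() closes the channel returned by ctx.Done(), and every sub-service blocked on that channel wakes up at once. A minimal sketch:

```go
package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// subService blocks on the shared Done channel until the main service cancels.
func subService(ctx context.Context, name string, wg *sync.WaitGroup) {
    defer wg.Done()
    <-ctx.Done()
    fmt.Println(name, "shutting down:", ctx.Err())
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    var wg sync.WaitGroup
    for _, name := range []string{"watcher", "agent", "proxy"} {
        wg.Add(1)
        go subService(ctx, name, &wg)
    }
    time.Sleep(50 * time.Millisecond) // stand-in for cmd.WaitSignal(stop)
    cancel()                          // one call broadcasts to all sub-services
    wg.Wait()
}
```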
The above traces the flow from pilot-agent receiving the signal from K8s to notifying envoy to shut down, which shows that pilot-agent itself supports graceful shutdown. In the end, though, envoy is not shut down gracefully: pilot-agent simply sends it a kill signal, because envoy itself does not support graceful shutdown.
Envoy and graceful shutdown
Unfortunately, Envoy cannot shut down gracefully. Envoy handles 4 signals: SIGTERM, SIGHUP, SIGCHLD, and SIGUSR1, but none of them has anything to do with graceful shutdown (the official documentation describes what each one does). The maintainers have noticed this problem too; see GitHub issues 2920 and 3307.
In fact, the goal of a graceful shutdown is to upgrade a service smoothly and minimize the impact on users, and we can reach that goal with a canary deployment instead; it does not have to be implemented inside envoy. The rough process:
- Define the old version of the service (v1) and the new version (v2)
- Deploy the new version
- Gradually shift traffic over to v2
- Once the migration is complete and v2 has run for a while without problems, shut down v1
Gracefully shutting down an HTTP service in Go
While we are at it, let's look at how Go does a graceful shutdown of an HTTP service, which has been supported since Go 1.8.
```go
// net/http/server.go #2487
func (srv *Server) Shutdown(ctx context.Context) error {
    atomic.AddInt32(&srv.inShutdown, 1)
    defer atomic.AddInt32(&srv.inShutdown, -1)

    srv.mu.Lock()
    // close the listeners
    lnerr := srv.closeListenersLocked()
    srv.closeDoneChanLocked()
    // run any developer-registered shutdown hooks
    for _, f := range srv.onShutdown {
        go f()
    }
    srv.mu.Unlock()

    // poll periodically for connections that have not closed yet
    ticker := time.NewTicker(shutdownPollInterval)
    defer ticker.Stop()
    for {
        if srv.closeIdleConns() {
            return lnerr
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}
```
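For completeness, here is the typical way Shutdown is wired up to a signal; this is a sketch of standard library usage, not istio code:

```go
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8080"}

    go func() {
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
        <-sigs
        // give in-flight requests up to 30s to finish
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("forced shutdown: %v", err)
        }
    }()

    // ErrServerClosed is returned after Shutdown, i.e. a clean exit
    if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
        log.Fatalf("listen: %v", err)
    }
}
```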
In fact, Go's shutdown mechanism is very similar to the graceful shutdown mechanism discussed for envoy on GitHub.
Go's mechanism:
- Close the listener (the ln in ln, err := net.Listen("tcp", addr)) so that no new connections are accepted
- Periodically check whether any connections remain open
- Once all connections have closed, the service exits
Envoy's mechanism:
- Ingress listeners stop accepting new connections (clients see TCP connection refused) or continue to service existing connections; egress listeners are completely unaffected
- A configurable delay allows the workload to finish servicing existing connections
- envoy (and the workload) both terminate