Jaeger Source Analysis--Service Discovery and Registration
Statement
The official Jaeger documentation does not describe how the project uses service registration and service discovery. While analyzing the source code I found that its mechanism closely resembles classic service registration and service discovery, so this post summarizes it together with my own knowledge of the topic. Please point out any mistakes.
TChannel Service registration and service discovery
Jaeger can implement service registration and service discovery without the help of third-party tools; this capability is provided by TChannel, the RPC framework on which it relies.
Third-party registration--manual registration
go run cmd/agent/main.go --collector.host-port=192.168.0.10:14267,192.168.0.11:14267
When starting the agent, you can configure multiple static collector addresses; together they form the registry.
Registry
github.com/uber/tchannel-go/peer.go #59
type PeerList struct {
    sync.RWMutex

    parent *RootPeerList
    // the registry, keyed by hostPort
    peersByHostPort map[string]*peerScore
    // the load-balancing implementation
    peerHeap        *peerHeap
    scoreCalculator ScoreCalculator
    lastSelected    uint64
}
github.com/jaegertracing/jaeger/pkg/discovery/peerlistmgr/peer_list_mgr.go #150
func (m *PeerListManager) ensureConnections() {
    peers := m.peers.Copy()
    minPeers := m.getMinPeers(peers)
    numConnected, notConnected := m.findConnected(peers)
    // if enough peers (3 by default) are already connected, skip the health check
    if numConnected >= minPeers {
        return
    }
    ...
    for i := range notConnected {
        // swap current peer with random from the remaining positions
        r := i + m.rnd.Intn(len(notConnected)-i)
        notConnected[i], notConnected[r] = notConnected[r], notConnected[i]
        // try to connect to the current peer (swapped)
        peer := notConnected[i]
        m.logger.Info("Trying to connect to peer", zap.String("host:port", peer.HostPort()))
        // for controlling timeout
        ctx, cancel := context.WithTimeout(context.Background(), m.connCheckTimeout)
        conn, err := peer.GetConnection(ctx)
        cancel()
        if err != nil {
            m.logger.Error("Unable to connect",
                zap.String("host:port", peer.HostPort()),
                zap.Duration("connCheckTimeout", m.connCheckTimeout),
                zap.Error(err))
            continue
        }
        ...
    }
}
TChannel performs a health check against the addresses in the registry once per second; a peer that cannot be connected within 0.25 seconds is treated as unavailable. If the connection succeeds, the service instance is retained for the agent to use when submitting data.
github.com/uber/tchannel-go/connection.go #228
func (ch *Channel) newOutboundConnection(timeout time.Duration, hostPort string, events connectionEvents) (*Connection, error) {
    conn, err := net.DialTimeout("tcp", hostPort, timeout)
    if err != nil {
        if ne, ok := err.(net.Error); ok && ne.Timeout() {
            ch.log.WithFields(LogField{"hostPort", hostPort}, LogField{"timeout", timeout}).Infof("Outbound net.Dial timed out")
            err = ErrTimeout
        }
        return nil, err
    }
    return ch.newConnection(conn, hostPort, connectionWaitingToSendInitReq, events), nil
}
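Putting the two snippets together: below is a minimal sketch of the check loop, assuming a fixed peer list and the intervals described above (the wiring is illustrative, not Jaeger's exact code):

package main

import (
    "log"
    "net"
    "time"
)

func main() {
    peers := []string{"192.168.0.10:14267", "192.168.0.11:14267"}
    ticker := time.NewTicker(1 * time.Second) // check once per second
    defer ticker.Stop()
    for range ticker.C {
        for _, hostPort := range peers {
            // a peer that cannot be dialed within 250ms is treated as unavailable
            conn, err := net.DialTimeout("tcp", hostPort, 250*time.Millisecond)
            if err != nil {
                log.Printf("peer %s unavailable: %v", hostPort, err)
                continue
            }
            conn.Close()
            log.Printf("peer %s healthy", hostPort)
        }
    }
}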
Client-side service discovery
github.com/uber/tchannel-go/peer.go #149
func (l *PeerList) choosePeer(prevSelected map[string]struct{}, avoidHost bool) *Peer {
    var psPopList []*peerScore
    var ps *peerScore
    ...
    size := l.peerHeap.Len()
    for i := 0; i < size; i++ {
        // pop the peer at the head of the heap
        popped := l.peerHeap.popPeer()
        if canChoosePeer(popped.HostPort()) {
            ps = popped
            break
        }
        psPopList = append(psPopList, popped)
    }
    // push the unsuitable peers back onto the tail of the heap
    for _, p := range psPopList {
        heap.Push(l.peerHeap, p)
    }
    if ps == nil {
        return nil
    }
    // re-score the chosen peer and push it back onto the tail of the heap
    l.peerHeap.pushPeer(ps)
    ps.chosenCount.Inc()
    return ps.Peer
}
When the agent needs to submit data, it obtains a peer (the service information) from TChannel's load balancer; when there are multiple peers, TChannel selects them in round-robin order. Implementation: the registry keeps all peers in the peerHeap, pops a peer from the head, and after use pushes it back to the tail, which yields round-robin load balancing.
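The same pop-then-push idea can be reproduced with Go's container/heap; here is a minimal sketch with hypothetical names (peer, peerHeap, choosePeer), not TChannel's actual types:

package main

import (
    "container/heap"
    "fmt"
)

// peer carries a score; the peer with the lowest score sits at the head of the heap.
type peer struct {
    hostPort string
    score    uint64
}

type peerHeap []*peer

func (h peerHeap) Len() int           { return len(h) }
func (h peerHeap) Less(i, j int) bool { return h[i].score < h[j].score }
func (h peerHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *peerHeap) Push(x interface{}) { *h = append(*h, x.(*peer)) }
func (h *peerHeap) Pop() interface{} {
    old := *h
    p := old[len(old)-1]
    *h = old[:len(old)-1]
    return p
}

// choosePeer pops the head, bumps its score, and pushes it back,
// which moves it to the tail and yields round-robin selection.
func choosePeer(h *peerHeap, counter *uint64) *peer {
    p := heap.Pop(h).(*peer)
    *counter++
    p.score = *counter
    heap.Push(h, p)
    return p
}

func main() {
    h := &peerHeap{{hostPort: "10.0.0.1:14267"}, {hostPort: "10.0.0.2:14267"}}
    heap.Init(h)
    var counter uint64
    for i := 0; i < 4; i++ {
        fmt.Println(choosePeer(h, &counter).hostPort) // alternates between the two peers
    }
}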
github.com/uber/tchannel-go/retry.go #212
func (ch *Channel) RunWithRetry(runCtx context.Context, f RetriableFunc) error {
    var err error
    opts := getRetryOptions(runCtx)
    rs := ch.getRequestState(opts)
    defer requestStatePool.Put(rs)
    // retry up to MaxAttempts times (5 by default)
    for i := 0; i < opts.MaxAttempts; i++ {
        rs.Attempt++
        if opts.TimeoutPerAttempt == 0 {
            err = f(runCtx, rs)
        } else {
            attemptCtx, cancel := context.WithTimeout(runCtx, opts.TimeoutPerAttempt)
            err = f(attemptCtx, rs)
            cancel()
        }
        if err == nil {
            return nil
        }
        if !opts.RetryOn.CanRetry(err) {
            if ch.log.Enabled(LogLevelInfo) {
                ch.log.WithFields(ErrField(err)).Info("Failed after non-retriable error.")
            }
            return err
        }
        ...
    }
    // too many retries, return the last error
    return err
}
Network communication cannot entirely avoid network anomalies, so retrying is one way to improve availability. When a peer taken from the load balancer fails to submit data to the collector, another peer is fetched from the load balancer and the submission is retried, up to 5 times; if all 5 attempts fail, the submission is discarded.
Consul + Docker service registration and service discovery
Implementing service registration and service discovery with Consul is straightforward; many features are available out of the box.
Preparatory work
- Start Consul (ip: 172.18.0.2)
docker run -itd --network=backend \
    -p 8400:8400 -p 8500:8500 -p 8600:53/udp \
    -h node1 progrium/consul -server -bootstrap -ui-dir /ui
- Start jaeger-agent

docker run -itd --network=backend \
    --name=jaeger-agent \
    -p 5775:5775/udp -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778/tcp \
    --dns-search="service.consul" --dns=172.18.0.2 \
    jaegertracing/jaeger-agent \
    /go/bin/agent-linux --collector.host-port=jaeger-collector:14267
- Start two jaeger-collector instances

#node1
docker run -itd --network=backend \
    --name=jaeger-collector-node1 \
    -p :14267 \
    --dns-search="service.consul" --dns=172.18.0.2 \
    jaegertracing/jaeger-collector \
    /go/bin/collector-linux \
    --span-storage.type=cassandra \
    --cassandra.keyspace=jaeger_v1_dc \
    --cassandra.servers=cassandra:9042

#node2
docker run -itd --network=backend \
    --name=jaeger-collector-node2 \
    -p :14267 \
    --dns-search="service.consul" --dns=172.18.0.2 \
    jaegertracing/jaeger-collector \
    /go/bin/collector-linux \
    --span-storage.type=cassandra \
    --cassandra.keyspace=jaeger_v1_dc \
    --cassandra.servers=cassandra:9042
Service registration--automatic registration
docker run -itd --net=backend --name=registrator \
    --volume=/var/run/docker.sock:/tmp/docker.sock \
    gliderlabs/registrator:latest \
    consul://172.18.0.2:8500
With Consul + Docker in this form, a service is registered in Consul automatically as soon as it is deployed. Very simple.
Registry
- View registry information: http://localhost:8500/ui/#/dc1/nodes/node1
You can see that the two collector services that were started have the IPs 172.18.0.5 and 172.18.0.8.
Consul offers a variety of health checks: HTTP, TCP, Docker, Shell, and TTL. Details can be found on the official website.
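For instance, a check comparable to TChannel's connection probe could be attached to the collector's registration with a JSON service definition like the following (a sketch only; the health endpoint on port 14269 is an assumption about the collector's admin port, not something this setup registers automatically):

{
  "service": {
    "name": "jaeger-collector",
    "port": 14267,
    "check": {
      "http": "http://172.18.0.5:14269/",
      "interval": "10s"
    }
  }
}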
Server-side service discovery
Relative to the agent and collector, Consul is a remote service, so there are two ways to discover services: HTTP and DNS. DNS is the primary one because it is simple and lightweight.
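Both interfaces are easy to try against the setup above (using the port mappings from the Consul container; jaeger-collector is the service name registrator derives from the image name):

# HTTP: list the registered collector instances
curl http://localhost:8500/v1/catalog/service/jaeger-collector

# DNS: resolve the same service through Consul's DNS interface
dig @localhost -p 8600 jaeger-collector.service.consul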
- DNS and soft load balancing
When the agent resolves the service through DNS and multiple IPs are registered, Consul returns them in random order, which gives the agent simple soft load balancing.
Because of DNS caching, an unhealthy service may still be resolved from stale records. For this reason Consul sets no cache time by default (TTL is 0). But to keep uncached lookups from putting too much pressure on Consul, the TTL is exposed as configuration, letting us decide how long DNS results are cached.
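On the client side nothing Consul-specific is needed; a plain DNS lookup is enough. A minimal sketch in Go, assuming the containers above are running and the host's resolver points at Consul (which is what the --dns flags arrange for the agent):

package main

import (
    "fmt"
    "net"
)

func main() {
    // Consul only returns healthy instances, in random order,
    // so simply taking the first address already spreads the load
    addrs, err := net.LookupHost("jaeger-collector.service.consul")
    if err != nil {
        panic(err)
    }
    for _, addr := range addrs {
        fmt.Println(addr)
    }
}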
Summary
Both TChannel and Consul + Docker implement service registration and service discovery, each with its own pros and cons:
Service Registration
TChannel's service registration suits basic infrastructure services such as Jaeger, which are rarely changed once deployed.
Registering services with Consul is much easier in today's popular Docker environments. Docker container IP addresses are dynamic, which makes this approach a good fit for business scenarios, where the business keeps changing and services vary constantly.
Health Check
Both TChannel and Consul provide health checks, but both only test whether the service is running; they cannot tell whether a request will actually be processed correctly.
Service discovery
TChannel uses client-side service discovery. Its advantage over Consul's server-side discovery is that there is no remote network overhead and no single point of failure. The disadvantage is that every language has to implement the registry, load balancing, and so on by itself.
Consul uses server-side service discovery, which other services can use directly without caring about the registry, load balancing, and so on. It also offers solutions for the network overhead and single-point problems.