Gig: High-availability RPC solution with load balancing and downgrade capabilities


Yunqi Guide: Gig, based on negative-feedback control of latency, implements bad-node shielding, service warm-up, heterogeneous-cluster load balancing, and automatic downgrade, greatly improving the stability of Alibaba's online search services.


In an online query system, the business logic divides the service into a tree structure. Each node scales its service capacity horizontally, eventually forming the topology shown in the diagram below:



When a query enters the system through the portal, each service is queried top-down, and each service has multiple nodes to choose from. The simplest load-balancing strategies are round-robin or consistent hashing, in which every node receives the same amount of traffic. With such a strategy, however, a bad node in the cluster causes some user queries to return no results or time out, which can escalate into a failure in critical cases.


Complex systems detect bad nodes through service discovery, such as Ali's cm2 and VIPServer. Take cm2 as an example: cm2 checks the health of the registered nodes through HTTP heartbeat detection. If a node responds to cm2's heartbeat, it is considered healthy and cm2 publishes it for the online system to query. When a node becomes abnormal and heartbeat detection fails, cm2 notifies the online system that the node should be removed; the node's traffic drops to zero, preventing bad nodes from disturbing traffic, as shown in the figure below:



Detecting bad nodes through heartbeats has two main problems:


The network path of the heartbeat differs from that of actual queries. A successful heartbeat only shows that the network between the name service and the target service is normal, whereas the online system often suffers from high latency or unreachable network paths on the query side; external heartbeat detection cannot solve such problems.


External heartbeat detection cannot discover business-level problems inside the service itself; external probing can only catch process- or system-level problems (process core dumps, system crashes, etc.). For example, in a search scenario an upper-level service may itself be healthy while a service hanging below it has a poor result rate; external probing sees the upper service as healthy and keeps directing traffic to the problematic subsystem. This is common in gray releases, per-datacenter releases, and similar scenarios.


The root causes of these problems are the same: the online system has weak decision-making ability of its own and relies entirely on external systems to decide the query path. To solve these problems, we developed gig.


The core idea of gig (Graph Irrigator) is that the online system decides traffic flow in real time by itself, rather than relying on the ineffective external-probing approach. Gig is embedded into online applications as a highly available RPC library that proxies the application's access to underlying services. It keeps real-time statistics of each node's latency and error rate, shields high-latency and high-error-rate nodes, and proactively throttles traffic before downstream latency becomes unacceptable, so that traffic flows along the paths that the node currently sees as healthy.


From the perspective of the figure above, external probing is node-oriented: each node has a health state. Gig is edge-oriented: each edge has a health degree. Since there are far more edges than nodes, the system's state is characterized much more comprehensively. Being embedded in the online system also brings a natural advantage: gig can perceive business-level errors and can shield whole clusters of services that behave abnormally.


1. The principle of gig


One of gig's core functions is handling bad nodes, including network anomalies. Different systems define anomalies differently; gig chooses a single core indicator, query latency, as the criterion of node quality. Anomalies such as node queue congestion, network timeouts, and high operating-system load all show up in latency. Likewise, system load is also reflected in latency, so gig's downgrade (throttling) and load-balancing strategies are also based on latency.
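As a rough illustration of this idea (a sketch only, not gig's actual implementation; all names here are hypothetical), the quality of each edge could be tracked with an exponentially weighted moving average of latency and error rate:

```python
class EdgeStats:
    """Hypothetical per-edge quality tracker: EWMA of latency and error rate."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha          # smoothing factor for the moving averages
        self.avg_latency_ms = None  # EWMA of observed query latency
        self.error_rate = 0.0       # EWMA of the error indicator (0 or 1)

    def record(self, latency_ms, ok):
        # Queue congestion, network timeouts and high system load all surface
        # as higher latency, so latency is the primary quality signal.
        if self.avg_latency_ms is None:
            self.avg_latency_ms = float(latency_ms)
        else:
            self.avg_latency_ms += self.alpha * (latency_ms - self.avg_latency_ms)
        self.error_rate += self.alpha * ((0.0 if ok else 1.0) - self.error_rate)
```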


If the online tree-structured system is imagined as a large irrigation system (the origin of the name gig), the goal of irrigation is to guide water toward the parts of the system with the least resistance: the inlet must not be congested, and no branch may dry up. For gig, water is traffic and resistance can be approximated by latency; congestion is a drop in system throughput and drought is load imbalance. Gig therefore measures the average latency on each edge of the graph and, in turn, controls the latency of each edge by controlling how much traffic flows along it.


1.1 Latency negative-feedback control based on a PID controller


In most online scenarios, query latency is positively correlated with traffic: the larger the traffic, the higher the latency, similar to the curve below:



The purpose of flow control is to stabilize the latency of each node: the latency differences between nodes are kept within a reasonable range, and nodes with excessively high latency are shielded. To stabilize each node's latency, we designed the negative-feedback controller shown in the figure below:



The PID controller in the figure actually uses only the I (integral) part, i.e. an integrator with anti-windup, and the whole loop is an over-damped system. The reason for this design is that the latency-QPS relation varies greatly across online systems. Although the integrator alone responds slowly, the online system itself responds quickly (latency rises almost instantly when traffic rises), so the integrator stops integrating as soon as latency reaches the setpoint, and there is no overshoot that would saturate the downstream node's CPU.


Once the system stabilizes, the throttle valve diverts a fixed proportion of the inlet traffic to the downstream node, and the downstream node's average latency converges to the input latency. Gig applies this controller to every downstream node, so the remaining question is how the setpoint, the input latency, is chosen.
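A minimal sketch of such an integral-only controller with anti-windup (hypothetical names and gains, not gig's API) could look like this:

```python
class LatencyController:
    """Integral-only (I) controller with anti-windup: adjusts the share of the
    inlet traffic sent to one downstream node so that the node's average
    latency converges to the target (input) latency. Illustrative sketch."""

    def __init__(self, ki=0.001, min_ratio=0.0, max_ratio=1.0):
        self.ki = ki                # integral gain (assumed value)
        self.ratio = max_ratio      # fraction of inlet traffic sent to this node
        self.min_ratio = min_ratio
        self.max_ratio = max_ratio

    def update(self, target_latency_ms, measured_latency_ms):
        # Node faster than the target -> give it more traffic; slower -> less.
        error = target_latency_ms - measured_latency_ms
        self.ratio += self.ki * error
        # Anti-windup: clamp so the integrator stops accumulating at the limits,
        # which is what prevents overshoot once latency reaches the setpoint.
        self.ratio = max(self.min_ratio, min(self.max_ratio, self.ratio))
        return self.ratio
```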


There are two options: user-specified or automatically derived by the system. User specification is actually quite difficult: most of the time latency changes with traffic and business logic, and it is almost impossible to specify it in advance.


Gig uses automatic derivation, and the rule is very simple: the minimum of all nodes' average latencies is taken as the input latency for every node. This matches the intent of flow control and load balancing, since traffic tends to flow toward low-latency nodes. The resulting structure is shown in the figure below:
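Building on the two sketches above, the derivation and the per-node control loop could be combined roughly as follows (again only an illustration of the rule described in the text):

```python
def derive_input_latency(edge_stats):
    """Take the minimum of all nodes' average latencies as the shared target."""
    averages = [s.avg_latency_ms for s in edge_stats if s.avg_latency_ms is not None]
    return min(averages) if averages else None

def rebalance(controllers, edge_stats):
    # Every node is driven toward the same (lowest) target latency, so traffic
    # shifts toward the currently fastest nodes and away from slow or bad ones.
    target = derive_input_latency(edge_stats)
    if target is None:
        return
    for ctrl, stats in zip(controllers, edge_stats):
        if stats.avg_latency_ms is not None:
            ctrl.update(target, stats.avg_latency_ms)
```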



The whole system is driven by the ingress traffic; it keeps the latency of every node within a small interval, and nodes whose latency falls outside this interval have their traffic reduced to zero.


1.2 The downgrade function


With average latency available, the downgrade feature is also easy to implement. Gig uses average latency as the indicator of system load and lets the user set the maximum latency at which the system should run (actually two values: beginDegradeLatency, at which downgrading starts, and fullDegradeLatency, at which traffic is 100% degraded). When the minimum of all nodes' average latencies exceeds beginDegradeLatency, gig throttles traffic in proportion to how far that latency lies within the interval [beginDegradeLatency, fullDegradeLatency], up to fully throttling it. The application's latency therefore cannot exceed fullDegradeLatency.
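The proportional throttling can be illustrated with a small sketch; the names follow the two configuration values mentioned above, and linear interpolation is an assumption:

```python
def degrade_ratio(min_avg_latency_ms, begin_degrade_ms, full_degrade_ms):
    """Fraction of queries to throttle, interpolated over
    [beginDegradeLatency, fullDegradeLatency]. Illustrative sketch only."""
    if min_avg_latency_ms <= begin_degrade_ms:
        return 0.0    # system load acceptable: no downgrade
    if min_avg_latency_ms >= full_degrade_ms:
        return 1.0    # 100% degraded: throttle everything
    return (min_avg_latency_ms - begin_degrade_ms) / (full_degrade_ms - begin_degrade_ms)
```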


2. Application scenarios


With precise control over latency, problems in many scenarios can be solved:


2.1 Some nodes experience network timeouts


When a node's network times out, its latency rises and the controller reduces its traffic. If the timeout is caused by a genuine network problem, latency still cannot recover after traffic drops to zero, and keeping the node at zero traffic is equivalent to shielding it. When the network recovers, gig's probing discovers that the node's latency has fallen, and the node's traffic recovers automatically. For gig's probing mechanism, please refer to the gig user documentation linked at the end of this article.
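One possible shape of such probing (a sketch under the assumption that probing sends a tiny fraction of real queries to shielded nodes; the documentation describes gig's actual mechanism) is:

```python
import random

PROBE_RATIO = 0.001  # assumed: fraction of picks used to probe shielded nodes

def pick_node(nodes, ratios):
    """Weighted node choice that still probes shielded (ratio == 0) nodes
    occasionally, so their latency can be re-measured and their traffic can
    recover automatically once the network is healthy again."""
    shielded = [n for n, r in zip(nodes, ratios) if r == 0.0]
    if shielded and random.random() < PROBE_RATIO:
        return random.choice(shielded)        # probe a currently shielded node
    total = sum(ratios)
    if total <= 0.0:
        return random.choice(nodes)           # everything shielded: probe anyway
    return random.choices(nodes, weights=ratios, k=1)[0]
```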


2.2 Some nodes' service capacity drops


When a node's service capacity is disturbed by system load or by other applications, its latency rises and its traffic falls. If the node's capacity drops to zero, its traffic drops to zero; if its capacity is halved, the system stabilizes with its traffic halved. The result is that each node receives as much traffic as it can actually handle.


2.3 Newly added nodes need warm-up


Many services need to be warmed up after starting, such as JIT-based applications (Java, TensorFlow XLA) and disk-I/O-heavy applications. For these, service capacity grows as traffic flows over time. When a node has just started, its latency is high, its traffic drops to zero, and gig's probing is triggered; as the node's latency gradually decreases, its traffic recovers automatically at a rate (multiple) configured by the user.


2.4 Heterogeneous clusters


In a heterogeneous computing cluster of CPUs and GPUs, nodes built on different hardware provide the same service but with different capacities. Gig's latency-equalization mechanism balances the traffic proportion between the two kinds of clusters and keeps their latencies within a bounded difference, avoiding an unreasonable traffic allocation that would overload a single cluster and cause timeouts.


In the active-downgrade scenario, the latency-QPS curve is no longer positively correlated, so latency control would become positive feedback; gig defines a dedicated error code internally to prevent this from happening.


3. Gig optimizations for multi-column applications


In big-data scenarios a single machine cannot hold all the data, so many applications partition their data, forming a multi-row, multi-column topology; we call these multi-row multi-column applications. When querying, the application logic needs to query a complete row, that is, each column exactly once. Gig abstracts this kind of application as multiple columns, each column providing a homogeneous service, with the flow control described above applied among the machines within a column. Gig also adds three optimizations for the characteristics of multi-column applications: early termination (early terminate, ET), retry, and missing-column downgrade.


3.1 Early Terminate


ET means that when a column has not returned a result, the result of that column is discarded and the query returns early. The application layer gets less data, which is acceptable for scenarios such as search: partial results are better than the whole query timing out.


3.2 Retry


Retry means that when some columns have not returned results, gig re-queries another replica of those columns; if the other replica returns in time, the query result remains complete.


The purpose of ET and retry is to prevent single-machine jitter from causing query timeouts or failures. The core assumption is that data is evenly distributed, so each column needs a similar amount of time to process the same query, and whether one machine is abnormal can be judged by comparison with another column. For example, if the first column returns in 10 ms, the second column is likely to return within 5-15 ms; if the second column has still not returned after 50 ms, it is very likely suffering a momentary jitter, at which point gig performs retry or ET (or both), depending on the user's configuration.
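A sketch of such a comparison-based decision (the threshold multiple is an assumption, not a gig configuration value) could be:

```python
JITTER_FACTOR = 3.0  # assumed threshold multiple for judging a column abnormal

def handle_slow_column(elapsed_ms, fastest_returned_ms, allow_retry, allow_et):
    """Decide what to do with a column that has not returned yet, by comparing
    its elapsed time with the fastest column that already returned."""
    if elapsed_ms <= JITTER_FACTOR * fastest_returned_ms:
        return "wait"        # still within the expected range for this query
    if allow_retry:
        return "retry"       # re-issue the query to another replica of the column
    if allow_et:
        return "terminate"   # give up on this column and return partial results
    return "wait"
```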


3.3 Missing-column downgrade


In the multi-column case, gig's active throttling no longer drops the entire query; instead, it drops the query only on some of the columns, maximizing service quality. For example, for a 10-column application that triggers a 10% downgrade, 10% of the queries will each be missing the result of one column. As the degree of downgrade increases, both the fraction of queries missing columns and the number of missing columns per query grow, until the whole query is throttled.
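A hypothetical mapping that is consistent with the example above (gig's actual mapping may differ) is sketched below: with 10 columns and a 10% downgrade, 10% of queries each drop one column, and at 100% downgrade every query drops every column.

```python
import math
import random

def plan_missing_columns(num_columns, degrade_ratio):
    """Return how many columns this particular query should drop (sketch only)."""
    if random.random() >= degrade_ratio:
        return 0                                    # this query is untouched
    # An affected query drops at least one column; as the downgrade deepens it
    # drops more, until all columns are dropped (equivalent to full throttling).
    return max(1, math.ceil(degrade_ratio * num_columns))
```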


For the detailed mechanism and configuration of ET and retry, refer to the gig user documentation; missing-column downgrade is enabled automatically as long as downgrading is configured for a multi-column application.


HA3 uses the ET and retry features so that a core dump on a single searcher is invisible to front-end users.


4. Gig networking: a multi-level cascade system


Gig itself is designed for one service invoking another, but gigs can also be composed into a cascade system as shown in the figure below:



Consider a middle-tier gig node that triggers gig downgrade because of its own load. After the downgrade, its latency does not change much, so traffic keeps flowing to it, and the cluster would permanently contain a degraded node, hurting service quality. Gig therefore defines a dedicated downgrade error code; when an upper-level gig receives this error code, it lowers that node's traffic.
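Building on the per-edge statistics sketch earlier, one way to picture this (the error code name and penalty are assumptions, not gig's wire format) is to treat a downgraded reply as extra latency so the upper controller moves traffic away:

```python
ERR_DEGRADED = "DEGRADED"   # assumed marker returned by a downgraded node
DEGRADE_PENALTY_MS = 100.0  # assumed latency penalty per degraded reply

def on_response(stats, latency_ms, error_code):
    """Feed a downstream reply into the edge statistics, penalizing downgraded
    replies so the upper gig shifts traffic to other replicas."""
    penalty = DEGRADE_PENALTY_MS if error_code == ERR_DEGRADED else 0.0
    stats.record(latency_ms + penalty, ok=(error_code is None))
```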


After cascading, all nodes except the red ones in the figure can be shielded when an exception occurs.


5. Production traffic copy


When a new version is brought online, stress-tested, or optimized for performance, it is often desirable to obtain traffic identical to that of the production system, in both volume and composition. Gig supports copying production traffic to a completely separate cluster. The copy feature adds network bandwidth overhead to the online system, but the results from the copy cluster never serve online traffic, so the copy cluster itself does not interfere with the operation of the online system.
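A minimal sketch of the fire-and-forget copy (assuming a generic send callable; the shadow response is simply discarded) might look like this:

```python
import threading

def query_with_copy(send, request, primary_node, shadow_node=None):
    """Send the production request to the primary node and, if configured, copy
    it to a shadow cluster whose result is never served online. Sketch only."""
    if shadow_node is not None:
        threading.Thread(target=send, args=(shadow_node, request), daemon=True).start()
    return send(primary_node, request)   # only the primary result is returned
```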


6. Multi-protocol support


Gig currently supports three protocols: ARPC (Alibaba's internal RPC protocol), HTTP, and a custom TCP protocol; users can choose according to the characteristics of their applications.


7. Use during Double 11


During this year's Double 11, the search services running gig included HA3 for main search and Tmall search, RankService, SP, iGraph, and DII, with peak gig call volume in the tens of thousands. Some applications used or triggered warm-up, downgrade, and related features; the RankService cluster also used gig's warm-up and heterogeneous (CPU and GPU) balancing features, and the HA3 and SP engines form a two-level gig cascade system.


8. Summary and outlook


Gig improves the robustness of online search applications and significantly reduces the dependence of service quality on the external environment. We will continue to optimize gig's flow-control strategy, including improving latency statistics, adding error-rate statistics, and supporting Java applications. With service mesh currently drawing intense attention, gig already has many features that match the service-mesh concept, and future versions of gig will also take cloud and micro-service scenarios into account.

