Analysis of several soft load balancing strategies

Source: Internet
Author: User

Last year my company bought an F5. It works well, but the cost is too high, so recently I have been studying soft load balancing. Fortunately, Google open-sourced Seesaw early this year, which spared me many detours. What follows is a summary of the load balancing strategies I have studied. -sunface

In distributed systems, load balancing is the important work of distributing requests across one or more nodes in the network. It is usually divided into hardware load balancing and software load balancing. Hardware load balancing, as the name implies, places dedicated hardware in front of the server nodes to do the balancing; F5 is among the best known. Software load balancing distributes requests through load balancing software installed on a server, or through a load balancer module.

Generally, there are several common load balancing strategies:

One. Polling (round robin). A classic strategy, widely used in early systems. The principle is simple: number each request in sequence and hand requests to the server nodes in turn. It suits clusters where every node can serve any request and is stateless. Its drawback is equally obvious: it treats all nodes as equals, which rarely matches a real, heterogeneous environment. Weighted polling improves on this by giving each node a weight, but since the weights are hard to adjust as conditions change, shortcomings remain.

Two. Random. Similar to polling, except there is no need to number each request; a node is simply picked at random each time. Likewise, this policy treats every backend node as equivalent. There is also an improved weighted-random variant, which we will not repeat here.

Three. Minimum response time. Record the time each request takes, obtain each node's average response time, and choose the node with the smallest average. This reflects the servers' state better, but because the average lags behind real time, it cannot meet requirements for rapid reaction. Improved versions exist on top of it, such as a strategy that averages only the last few requests.

Four. Minimum concurrency. Requests can differ greatly in how long they take to serve. With a simple polling or random balancing algorithm, the number of in-flight connections on each server can diverge widely, so true load balance is not achieved. The minimum concurrency strategy records, at the current moment, how many transactions each candidate node is processing, and selects the node with the fewest. It reflects the servers' current state quickly and distributes the load more evenly, making it suitable for scenarios that are sensitive to the current system load.

Five. Hash. When the backend nodes are stateful, hashing must be used for load balancing. That situation is more complicated and is not discussed in this article.

Other load balancing strategies exist as well and are not listed here; interested readers can look them up on their own.
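The first four strategies above can be sketched in a few lines. This is a minimal illustration, not a production balancer; the node names, the static weights, and the `pending` counter are all hypothetical:

```python
import itertools
import random

nodes = ["b1", "b2", "b3", "b4", "b5"]  # hypothetical backend nodes
pending = {n: 0 for n in nodes}         # in-flight requests per node

# Polling (round robin): hand out nodes in a fixed cycle.
rr = itertools.cycle(nodes)
def pick_round_robin():
    return next(rr)

# Random: no request numbering needed, just draw a node.
def pick_random():
    return random.choice(nodes)

# Weighted random: bias the draw by static per-node weights.
weights = {"b1": 5, "b2": 1, "b3": 1, "b4": 1, "b5": 1}
def pick_weighted_random():
    return random.choices(nodes, weights=[weights[n] for n in nodes])[0]

# Minimum concurrency: choose the node with the fewest in-flight requests.
def pick_least_concurrent():
    return min(nodes, key=lambda n: pending[n])
```

A caller would increment `pending[node]` when dispatching a request and decrement it on completion; that bookkeeping is what lets the minimum concurrency picker track the servers' current state.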

A distributed system faces a far more complicated environment than a single-machine system: varying networks, runtime platforms, machine configurations, and so on. In such an environment errors are unavoidable, so how to tolerate faults and reduce the cost of an error to a minimum is a question every distributed system must consider. The choice of load balancing strategy makes a large difference here.

Consider the following situation. Completing a request requires calls through four clusters, A, B, C, and D. Assume that completing the call requires calling cluster B three times, and that cluster B has 5 servers in total.

When one server in cluster B can no longer provide service, and no other fault-tolerance mechanism in the cluster has yet taken effect, then in the best case 4/5 of the requests are unaffected.

If a polling or random load balancing strategy is used, the probability of a single dispatch reaching a healthy node is 4/5, so the probability that the whole request succeeds is (4/5) × (4/5) × (4/5) = 64/125 ≈ 51.2%, far below the ideal 4/5.

In this case, if only such a strategy is used, the scope of the failure spreads, which is not what we expect.

With the minimum concurrency load balancing strategy, suppose a normal request takes 10 ms and the timeout is set to 1 s. Then, per unit time, the faulty node can serve 1 request while each healthy node can serve 100, so under the minimum concurrency policy the probability of dispatching to the faulty node is 1/(4 × 100 + 1) = 1/401, and the probability that the whole request succeeds is (400/401)^3 ≈ 99.25%, higher than 4/5.
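Both figures are easy to verify numerically; this is just a quick check of the arithmetic above, not part of any balancer:

```python
# Polling / random: each of the 3 calls must independently land on
# one of the 4 healthy servers out of 5.
p_poll = (4 / 5) ** 3
print(f"{p_poll:.4f}")  # 64/125 = 0.5120

# Minimum concurrency: the faulty node serves 1 request/s (1 s timeout),
# each of the 4 healthy nodes serves 100 requests/s (10 ms per request),
# so a single dispatch hits a healthy node with probability 400/401.
p_minconc = (400 / 401) ** 3
print(f"{p_minconc:.4f}")  # ≈ 0.9925
```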

More generally, let p be the ratio of failed machines in the cluster, and consider the expected probability of a successful invocation.

If the whole request makes K calls, then under a polling or random load balancing strategy the probability of a single dispatch reaching a healthy node is

1 − p

and the success rate of the request falls to

F(p) = (1 − p)^K

When K = 3, the relationship between the success rate F(p) = (1 − p)^3 and p can be plotted: [figure: F(p) versus p for K = 3]

As the figure shows, as p increases, the request success rate F(p) declines markedly, so a distributed system with high reliability requirements should not readily adopt such a strategy. Under the minimum concurrency strategy, let the total number of servers in the cluster be m, and suppose that in the abnormal case a node's service capability drops to 1/q of normal. Taking a healthy node's capability as 1, the total number of services the cluster can provide per unit time is

m(1 − p) + mp/q

Then the probability of a single dispatch reaching a healthy node is

m(1 − p) / (m(1 − p) + mp/q) = q(1 − p) / (q(1 − p) + p)

and the success rate of the request is the K-th power of this value, i.e.

F(p) = [q(1 − p) / (q(1 − p) + p)]^K
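Both formulas can be written out as short functions. The function names are mine, and a healthy node's capability is normalized to 1 (the cluster size m cancels out of the dispatch probability):

```python
def success_polling(p, k):
    """Polling / random: each of k calls independently hits a
    healthy node with probability 1 - p."""
    return (1 - p) ** k

def success_min_concurrency(p, q, k):
    """Minimum concurrency: a failed node's capability drops to 1/q
    of normal, so a single dispatch hits a healthy node with
    probability q(1 - p) / (q(1 - p) + p)."""
    single = q * (1 - p) / (q * (1 - p) + p)
    return single ** k

# The worked example earlier: p = 1/5, q = 100, K = 3.
print(round(success_polling(0.2, 3), 4))               # 0.512
print(round(success_min_concurrency(0.2, 100, 3), 4))  # 0.9925
```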

When q = 10 and K = 3, we get the plot of the request success rate F(p): [figure: F(p) versus p for q = 10, K = 3]

As the figure shows, when p varies within a smaller interval (e.g. (0, 0.4]), F(p) does not fall significantly as p grows; provided each remaining node can withstand the added service pressure, this strategy handles failures of multiple nodes well.

Looking at the equation from another angle: hold p constant, i.e. suppose a fixed fraction of machines in the cluster has failed, and treat the success rate as a function of q.

When p = 0.1 and K = 3, we get the plot of the success rate F(q): [figure: F(q) versus q for p = 0.1, K = 3]

As the figure shows, the service timeout does not need to be set very large; in general, setting it to about 10 times the normal service time is sufficient.
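That suggestion can be sanity-checked by tabulating the minimum concurrency success rate for p = 0.1 and K = 3 at a few values of q. The formula from the derivation above is restated inline; this is a rough check, not a precise tuning guide:

```python
def success_rate(p, q, k):
    # Minimum concurrency: a single dispatch hits a healthy node
    # with probability q(1 - p) / (q(1 - p) + p).
    return (q * (1 - p) / (q * (1 - p) + p)) ** k

for q in (2, 5, 10, 50, 100):
    print(q, round(success_rate(0.1, q, 3), 4))
```

This prints roughly 0.8503, 0.9362, 0.9674, 0.9934, 0.9967 for q = 2, 5, 10, 50, 100: the curve flattens past q ≈ 10, so shrinking the timeout further (raising q) buys relatively little.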

In addition, if the abnormal node returns its failure response very quickly (for example, a misconfigured backend node that the client detects and fails immediately), then q < 1. Assuming q = 0.1 and K = 3, we get the following relationship: [figure: F(p) versus p for q = 0.1, K = 3]

In this case failures increase sharply: even if only a small fraction of the cluster is abnormal, a large number of requests fail, and this kind of anomaly must be detected by other means.

Given this, and considering network fluctuations and other transient states, a protection mechanism that removes abnormal nodes can be added: when a backend node fails more than a certain number of times, it is taken out of rotation. However, this strategy can also remove a healthy node, because user input errors or other incidental factors can make a node return failures. Consider this situation: suppose the probability of such an abnormal return is 1%, a node is removed after 9 failures, q = 0.1, and there are 5 available nodes. By the minimum concurrency formula, the probability of dispatching to this node is 2/7, so the probability of dispatching to it 9 times in a row is (2/7)^9 ≈ 0.001%, and this anomaly can be ignored.
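That last claim can be checked directly, using the article's figures of a 2/7 dispatch probability and removal after 9 consecutive failures:

```python
# Probability that 9 consecutive dispatches all land on the same node
# when each dispatch hits it with probability 2/7.
p_remove = (2 / 7) ** 9
print(f"{p_remove:.7f}")  # 0.0000127, i.e. about 0.001%
```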

In practice, moreover, the concurrency on a single client may stay at a low level. A client's concurrency does not represent the server's concurrency, which leads to situations where each client's concurrency is small yet the actual load on the servers is unbalanced.

Therefore the minimum concurrency load balancing policy does not suit the case where balancing is done on clients that each carry little load; there, randomization mitigates the imbalance. Of course, in a real distributed system, one node's failure increases the pressure on the other nodes and may degrade their performance in turn, so the relationships among them are hard to describe with the simple equations above.
