A Redis-based flow control system for a PHP RPC framework

We have carried out a degree of micro-service transformation on our project. Previously all modules lived in one project (a single large folder) and were deployed together, and the drawbacks of that are obvious. We therefore split the project into sub-modules by business function and let the sub-modules call each other through an RPC framework. Each sub-module now has its own online machine cluster and its own MySQL and Redis storage resources, so a problem in one sub-module no longer affects the others, and maintainability and scalability are much better.

In practice, however, the service capacity of each sub-module differs. As the architecture diagram after the split shows, suppose the QPS arriving at module A is 100 and A depends on B, so the requests from A to B also arrive at 100 QPS, but B can serve at most 50 QPS. Without any traffic limit, B becomes unavailable under the overload and the whole system goes down with it. Our dynamic flow control system exists to find the best service capacity of each sub-module: it limits the traffic from A to B to 50 QPS, so that at least part of the requests proceed normally instead of one hung sub-service dragging down the entire system.

Our RPC framework is implemented in PHP and primarily supports access over HTTP. For a front-end module A that depends on a back-end module B, module B is declared as a service in the configuration and is then referenced by its service name. A typical service configuration looks like this:

[Module-b]                      ; service name
protocol = "http"               ; interaction protocol
lb_alg = "random"               ; load balancing algorithm
conn_timeout_ms = 1000          ; connection timeout, used by all protocols, in ms
read_timeout_ms = 3000          ; read timeout
write_timeout_ms = 3000         ; write timeout
exe_timeout_ms = 3000           ; execution timeout
host.default[] = "127.0.0.1"    ; IP or domain name
host.default[] = "127.0.0.2"    ; IP or domain name
host.default[] = "127.0.0.3"    ; IP or domain name
port = 80                       ; port
domain = "api.abc.com"          ; domain name; not actually resolved, sent to the backend as the Host header
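As a concrete illustration, the sketch below shows how such a configuration might be loaded and used for one request. It is a minimal sketch, not the framework's real code: the INI file path (config/services.ini) and the request path are placeholders.

<?php
// Minimal sketch: load the service section and build an HTTP target from it.
// parse_ini_file() keeps "host.default" as one flat key, so the configured
// hosts come back as an array.
$services = parse_ini_file('config/services.ini', true);   // hypothetical path
$conf     = $services['Module-b'];

$hosts = $conf['host.default'];
$host  = $hosts[array_rand($hosts)];                        // random load balancing

// The configured domain is not resolved; it is only sent as the Host header.
$url     = sprintf('http://%s:%d/some/path', $host, $conf['port']);
$headers = ['Host: ' . $conf['domain']];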

The service module being accessed is usually deployed as a cluster, so we configure all of the cluster's machine IPs; if an internal DNS service is available, the cluster's domain name can be configured instead.

The basic functions of an RPC framework are load balancing, health checking, and degradation & rate limiting. Our flow control belongs to the degradation & rate limiting function. Before introducing it in detail, we first explain how load balancing and health checking are implemented, since they are the foundation of the flow control.

For load balancing we implement the random and round-robin algorithms. The random algorithm simply picks one of the configured IPs at random and is easy to implement. The round-robin algorithm works per machine: we record the index of the last selected IP in local memory with the APCu extension, so the next IP in sequence is easy to find.
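A minimal sketch of the two selection algorithms, assuming the healthy IP list has already been prepared; the APCu key name is illustrative rather than the framework's real one.

<?php
// Random selection: just pick any configured IP.
function select_ip_random(array $ips): string
{
    return $ips[array_rand($ips)];
}

// Per-machine round-robin: the index of the last selected IP is kept in local
// memory via APCu, so the next request on this machine takes the next IP.
function select_ip_round_robin(string $service, array $ips): string
{
    $key = "rpc_rr_idx_{$service}";      // illustrative key name
    apcu_add($key, 0);                   // initialize the counter once
    $idx = apcu_inc($key);               // atomically advance it
    if ($idx === false) {
        $idx = 0;                        // fall back if APCu is unavailable
    }
    return $ips[$idx % count($ips)];
}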

The machines being accessed may fail. We record the IPs of failed requests in Redis and analyze the failure logs to decide whether a machine IP needs to be removed, that is, whether the machine behind that IP is down and can no longer serve requests. This is the health check function. The related service configuration items below describe its specific behaviour:

ip_fail_sample_ratio = 1      ; sampling ratio for recording failed IPs; failures are logged in Redis, and to avoid sending too many requests to Redis a sampling ratio can be configured
ip_fail_cnt_threshold = 10    ; number of IP failures
ip_fail_delay_time_s = 2      ; time window
ip_fail_client_cnt = 3        ; number of distinct failed clients

A single failed request is not enough to remove an IP from the healthy list. Only when the failure count for an IP reaches ip_fail_cnt_threshold within the ip_fail_delay_time_s window, and the number of distinct failed clients reaches ip_fail_client_cnt, is it considered an unhealthy IP. Why add a configuration like ip_fail_client_cnt? If only one client machine fails against a backend IP, the problem is not necessarily that service IP; it may be the accessing client itself. Only when most clients have failure records do we treat it as a backend IP problem. The failure log is stored in a Redis list together with a timestamp, which makes it easy to count the failures within the time window.

ip_retry_delay_time_s = 30    ; interval for checking whether a failed IP has recovered; a failed IP may recover after some time, so every ip_retry_delay_time_s we probe it, and if the request succeeds the IP is removed from the failed IP list
ip_retry_fail_cnt = 10        ; failure weight recorded when the recovery probe itself fails
ip_log_ttl_s = 60000          ; validity time of the logs; generally only recent failure logs are meaningful, so historical logs are deleted automatically
ip_log_max_cnt = 10000        ; maximum number of logs kept; since we record the failure log in Redis and its capacity is limited, surplus logs are deleted automatically
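To make the logging side concrete, here is a minimal sketch of recording one failed request into the Redis list, with the sampling, client identity and timestamp described above; the key names and the use of the phpredis client are assumptions.

<?php
// Minimal sketch, assuming a connected phpredis client and the configuration
// items shown above; key names are illustrative.
function log_ip_failure(Redis $redis, array $conf, string $service, string $ip, string $clientId): void
{
    // Sampling: skip some records so heavy failure bursts do not flood Redis.
    if (mt_rand() / mt_getrandmax() > $conf['ip_fail_sample_ratio']) {
        return;
    }
    $entry = json_encode([
        'ip'     => $ip,
        'client' => $clientId,   // lets the analyzer count distinct failed clients
        'ts'     => time(),      // lets the analyzer count failures per time window
    ]);
    $key = "rpc_fail_log_{$service}";
    $redis->rPush($key, $entry);
    $redis->lTrim($key, -$conf['ip_log_max_cnt'], -1);   // cap the log size
    $redis->expire($key, $conf['ip_log_ttl_s']);         // drop stale history
}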

In our code implementation, besides the normal service IP configuration we also maintain a list of failed IPs, so that the selection algorithm first removes the failed IPs before picking one. The failed IPs are recorded in a file, and we use the APCu memory cache to speed up access to it, so all of these operations are essentially memory-based and there is no performance problem.
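The sketch below shows how the failed-IP list might be consulted on the request path; the file location, the APCu key and the short cache TTL are illustrative assumptions.

<?php
// Minimal sketch: read the failed-IP list (written by the analysis step below)
// from a local file, cached in APCu so the hot path stays in memory.
function get_failed_ips(string $service): array
{
    $cacheKey = "rpc_failed_ips_{$service}";            // illustrative key
    $cached   = apcu_fetch($cacheKey);
    if ($cached !== false) {
        return $cached;                                 // memory hit, no file I/O
    }
    $file   = "/var/run/rpc/{$service}.failed";         // illustrative path
    $failed = is_file($file)
        ? file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : [];
    apcu_store($cacheKey, $failed, 1);                  // cache briefly so removals propagate
    return $failed;
}

// Healthy candidates = configured hosts minus the failed ones.
$candidates = array_values(array_diff($hosts, get_failed_ips('Module-b')));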

We only write to Redis when a request fails. So when are failed IPs actually discovered? That requires reading all failure logs from the Redis list and counting the failures, which is a relatively expensive operation. Our implementation lets multiple PHP processes compete for a lock: whichever process grabs it performs the analysis and writes the failed IPs to the file. Because only one process performs the analysis, normal requests are not affected. Moreover, the lock is only contended when failures occur; under normal circumstances there is no interaction with Redis at all and no performance loss.
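A minimal sketch of the lock preemption using Redis SET NX EX, so that only one PHP process performs the analysis; the key names, the 10-second lock TTL and the aggregation shown are assumptions about how such an analyzer could look.

<?php
// Minimal sketch: whoever wins the lock parses the failure log; everyone else
// returns immediately and keeps serving requests as usual.
function try_analyze_failures(Redis $redis, string $service): void
{
    $lockKey = "rpc_fail_analyze_lock_{$service}";
    if (!$redis->set($lockKey, getmypid(), ['nx', 'ex' => 10])) {
        return;   // another process is already analyzing
    }

    $byIp = [];
    foreach ($redis->lRange("rpc_fail_log_{$service}", 0, -1) as $raw) {
        $e = json_decode($raw, true);
        $byIp[$e['ip']]['cnt']                   = ($byIp[$e['ip']]['cnt'] ?? 0) + 1;
        $byIp[$e['ip']]['clients'][$e['client']] = true;
    }
    // An IP whose failure count and distinct-client count cross the configured
    // thresholds (ip_fail_cnt_threshold, ip_fail_client_cnt) within the window
    // would be appended to the local failed-IP file consulted above.
}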

Our health check relies on a centralized Redis service; what if it goes down? If the Redis service itself is judged to be dead, the RPC framework automatically shuts down the health check and stops interacting with Redis, so at least the normal RPC functionality is not affected.

On top of the health check we can implement flow control. When we find that most or all of the IPs have failed, we can infer that the backend service cannot respond because the traffic is too large, so we should limit the flow according to some policy. The common approach is to cut the traffic off entirely, which is rather crude. Our implementation reduces traffic gradually until the ratio of failed IPs falls to a certain level, then tries to increase traffic gradually again. Increasing and decreasing may cycle, which is what makes the flow control dynamic, and eventually we converge on an optimal traffic value. The related configuration items introduce the flow control functions:

degrade_ip_fail_ratio = 1     ; failed-IP ratio at which degradation starts: when the proportion of failed IPs reaches this value, we start reducing traffic
degrade_dec_step = 0.1        ; proportion of traffic removed at each reduction step
degrade_stop_ip_ratio = 0.5   ; when the failed-IP ratio has fallen back to this value, stop reducing traffic and try to increase it
degrade_stop_ttl_s = 10       ; how long to wait after stopping before trying to increase traffic
degrade_step_ttl_s = 10       ; how long each traffic level is held after an increase or decrease; the next decision is made from the failed-IP ratio at that time rather than immediately
degrade_add_step = 0.1        ; proportion of traffic added at each increase step
degrade_return = false        ; return value while degraded: for degraded requests we do not call the backend but return this configured value directly to the caller

The state transitions of the flow control are shown in the following state diagram:
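Since the diagram itself is not reproduced here, the following is a minimal sketch of the decision cycle it describes, driven by the configuration above; the function shape and the way the caller stores the current ratio are assumptions rather than the framework's real implementation.

<?php
// Minimal sketch: given the current traffic ratio and the current failed-IP
// ratio, decide the next traffic level. The caller re-evaluates this only
// after holding each level for degrade_step_ttl_s / degrade_stop_ttl_s.
function next_traffic_ratio(float $ratio, float $failRatio, array $conf): float
{
    if ($failRatio >= $conf['degrade_ip_fail_ratio']) {
        // Too many failed IPs: keep stepping traffic down.
        return max(0.0, $ratio - $conf['degrade_dec_step']);
    }
    if ($failRatio <= $conf['degrade_stop_ip_ratio']) {
        // Failures have receded: try stepping traffic back up, capped at 100%.
        return min(1.0, $ratio + $conf['degrade_add_step']);
    }
    return $ratio;   // otherwise hold the current traffic level
}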

How is traffic controlled to a given proportion? By random selection: for example, draw a random number and check whether it falls within the allowed range. By limiting the flow to an optimal value we keep the majority of requests working with the least impact on users. Flow control is also paired with monitoring and alerting: if a module's flow control ratio is found to be below 1, that module is the bottleneck of the system, and the next step should be to add hardware resources or optimize our program's performance.
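A minimal sketch of the random-number check, assuming the current traffic ratio is available to the caller; degrade_return stands in for the configured fallback value.

<?php
// Minimal sketch: let a request through with probability equal to the current
// traffic ratio; degraded requests never reach the backend.
function pass_request(float $trafficRatio): bool
{
    return mt_rand() / mt_getrandmax() < $trafficRatio;
}

if (!pass_request($currentRatio)) {
    return $conf['degrade_return'];   // answer the caller directly, skip the backend
}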
