A Redis-based flow control system in PHP

Source: Internet
Author: User
Tags apcu ip number


We have been moving our project toward microservices. Previously all modules lived in one project (one large folder) and were deployed online together, which has obvious drawbacks. We then split the project into sub-modules by business function, and the sub-modules call each other through an RPC framework. Each sub-module has its own online machine cluster and its own MySQL and Redis storage, so a problem in one sub-module does not affect the others, and maintainability and scalability improve.

In practice, however, the service capacity of each sub-module differs. Take the architecture after the split as an example: suppose requests reach module A at 100 QPS, A depends on B, and every request to A produces one request to B, so B also receives 100 QPS. If B can serve at most 50 QPS and traffic is not limited, B becomes overloaded and unavailable, and the whole system becomes unavailable with it. Our dynamic flow control system tries to find the best service capacity of each sub-module. Here that means limiting the traffic from A to B to 50 QPS, so that at least part of the requests proceed normally instead of the whole system being dragged down because one sub-service hangs.

Our RPC framework is implemented in PHP and mainly supports access over HTTP. When a front-end module A depends on a back-end module B, module B is first configured as a service and is then accessed by referencing its service name. A typical service configuration looks like this:

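    [MODULE-B]  ; service name
    protocol = "http"  ; protocol used to talk to the service
    lb_alg = "random" ; load balancing algorithm
    conn_timeout_ms = 1000 ; connect timeout, used by all protocols, in ms
    read_timeout_ms = 3000 ; read timeout
    write_timeout_ms = 3000 ; write timeout
    exe_timeout_ms = 3000 ; execution timeout
    host.default[] = "127.0.0.1" ; IP or domain name
    host.default[] = "127.0.0.2" ; IP or domain name
    host.default[] = "127.0.0.3" ; IP or domain name
    port = 80 ; port
    domain = "api.abc.com" ; domain setting; not actually resolved, passed to the backend as the Host header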
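As an illustration of how such a section could be read in PHP, here is a minimal sketch built on the standard parse_ini_string() function. The article does not show how the framework loads its configuration, so the loadService() helper is hypothetical.

```php
<?php
// Hypothetical helper: returns one service section from INI text.
// The real framework's config loader is not shown in the article.
function loadService(string $ini, string $name): array
{
    // true => keep [SECTION] names as top-level keys;
    // INI_SCANNER_TYPED => unquoted numbers such as port come back as int.
    $all = parse_ini_string($ini, true, INI_SCANNER_TYPED);
    if ($all === false || !isset($all[$name])) {
        throw new InvalidArgumentException("unknown service: $name");
    }
    return $all[$name];
}

$ini = <<<'INI'
[MODULE-B]
protocol = "http"
lb_alg = "random"
conn_timeout_ms = 1000
host.default[] = "127.0.0.1"
host.default[] = "127.0.0.2"
host.default[] = "127.0.0.3"
port = 80
domain = "api.abc.com"
INI;

$service = loadService($ini, 'MODULE-B');
// Repeated "host.default[]" lines are collected into one array.
$hosts = $service['host.default'];
```

The caller would then pick one entry from $hosts according to lb_alg and send domain as the Host header.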

A service module is usually deployed as a cluster, so we configure the IPs of all machines in the cluster. If an internal DNS service is available, the cluster's domain name can be configured instead.

The basic functions of an RPC framework are load balancing, health checks, and degradation and rate limiting. Our flow control belongs to the degradation and rate limiting function. Before describing it in detail, we first explain how load balancing and health checks are implemented, because flow control builds on them.

For load balancing we implement a random algorithm and a round-robin algorithm. The random algorithm picks one IP at random from all configured IPs and is easy to implement. The round-robin algorithm works per machine: the index of the last selected IP is recorded in local memory with the APCu extension, so the next IP in the sequence is easy to find.
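The two selection strategies can be sketched roughly as follows. This is not the framework's code: the function names and the "rr:" key prefix are made up for illustration, and the sketch stores an ever-increasing counter rather than the last index itself. APCu is used only when the extension is loaded and enabled; otherwise a per-process static stands in so the example still runs.

```php
<?php
// Random: pick any configured IP with equal probability.
function pickRandom(array $ips): string
{
    return $ips[array_rand($ips)];
}

// Advance a per-machine counter and return its new value (0, 1, 2, ...).
function nextCounter(string $key): int
{
    if (function_exists('apcu_enabled') && apcu_enabled()) {
        apcu_add($key, -1);        // create the counter once; no-op if it exists
        $value = apcu_inc($key);   // atomic increment shared by local PHP processes
        if ($value !== false) {
            return $value;
        }
    }
    static $local = [];            // fallback: visible to this process only
    $local[$key] = ($local[$key] ?? -1) + 1;
    return $local[$key];
}

// Round-robin: the counter modulo the list length gives the next index.
function pickRoundRobin(array $ips, string $service): string
{
    $ips = array_values($ips);
    if ($ips === []) {
        throw new InvalidArgumentException('no IPs configured');
    }
    return $ips[nextCounter("rr:$service") % count($ips)];
}

$ips = ['127.0.0.1', '127.0.0.2', '127.0.0.3'];
$picked = [];
for ($i = 0; $i < 4; $i++) {
    $picked[] = pickRoundRobin($ips, 'MODULE-B');
}
```

After four calls the selection has wrapped around, so the fourth pick is the same IP as the first.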

A machine being accessed may fail. We record the IPs of failed requests in Redis and analyze the failure log to decide whether an IP should be removed, that is, whether the machine is down and can no longer serve normally. This is the health check. Its behavior is described below through the related service configuration items:

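    ip_fail_sample_ratio = 1 ; sampling ratio

Sampling ratio for recording failed IPs. Failed requests are recorded in Redis; to avoid sending too many requests to Redis, a failure sampling ratio can be configured.

    ip_fail_cnt_threshold = 10 ; IP failure count
    ip_fail_delay_time_s = 2 ; time window
    ip_fail_client_cnt = 3 ; number of failing clients

An IP is not removed from the healthy IP list after a single failure. It is considered unhealthy only when requests to it have failed ip_fail_cnt_threshold times within the ip_fail_delay_time_s window and the number of failing clients has reached ip_fail_client_cnt. Why add ip_fail_client_cnt? If only one machine fails to reach a given backend IP, the problem is not necessarily the service IP; it may be the calling client. Only when most clients have failure records do we treat it as a problem with the backend IP. We store the failure log in a Redis list with a timestamp on each entry, which makes it easy to count the failures inside the time window.

    ip_retry_delay_time_s = 30 ; interval for checking whether a failed IP has recovered

A failed IP may recover after some time. We check it every ip_retry_delay_time_s seconds, and if the request succeeds, the IP is removed from the failed IP list.

    ip_retry_fail_cnt = 10 ; failure weight recorded when the check of a failed IP fails again
    ip_log_ttl_s = 60000 ; log validity period

In general only recent failure logs are meaningful, so older logs are deleted automatically.

    ip_log_max_cnt = 10000 ; maximum number of log entries

The failure log lives in Redis, whose capacity is limited, so we set a maximum number of log entries; entries beyond it are deleted automatically.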
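The rule combining ip_fail_cnt_threshold, ip_fail_delay_time_s and ip_fail_client_cnt can be sketched as a pure function over the failure log. The entry layout (ip, client, ts) and the function name are assumptions made for this example; the article only says that each entry is stored in a Redis list with a timestamp.

```php
<?php
// Each log entry is assumed to look like:
// ['ip' => backend IP, 'client' => caller IP, 'ts' => unix timestamp]
function findUnhealthyIps(array $log, int $now, array $cfg): array
{
    $count = [];     // ip => failures inside the window
    $clients = [];   // ip => set of distinct failing clients
    foreach ($log as $entry) {
        if ($now - $entry['ts'] > $cfg['ip_fail_delay_time_s']) {
            continue; // too old to count
        }
        $ip = $entry['ip'];
        $count[$ip] = ($count[$ip] ?? 0) + 1;
        $clients[$ip][$entry['client']] = true;
    }
    $unhealthy = [];
    foreach ($count as $ip => $n) {
        if ($n >= $cfg['ip_fail_cnt_threshold']
            && count($clients[$ip]) >= $cfg['ip_fail_client_cnt']) {
            $unhealthy[] = $ip;
        }
    }
    return $unhealthy;
}

$cfg = [
    'ip_fail_cnt_threshold' => 10,
    'ip_fail_delay_time_s'  => 2,
    'ip_fail_client_cnt'    => 3,
];
$now = 1000;
$log = [];
for ($i = 0; $i < 12; $i++) {
    // 10.0.0.2 fails for three different callers, 10.0.0.3 for only one.
    $log[] = ['ip' => '10.0.0.2', 'client' => 'c' . ($i % 3), 'ts' => $now - 1];
    $log[] = ['ip' => '10.0.0.3', 'client' => 'c0', 'ts' => $now - 1];
}
$unhealthy = findUnhealthyIps($log, $now, $cfg);
```

Both IPs exceed the failure count, but only 10.0.0.2 is reported because 10.0.0.3 fails for a single client, which may be a client-side problem.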

In our implementation, besides the normal service IP configuration, we also maintain a list of failed IPs. When the algorithm selects an IP, the failed IPs are excluded first. The failed IPs are recorded in a file and cached in memory with APCu to speed up access, so almost all of our operations are memory accesses and there is no performance problem.

We write to Redis only when a request fails. When, then, is a failed IP detected? Detection requires reading all failure logs in the Redis list and counting the failures, which is a relatively expensive operation. In our implementation, multiple PHP processes compete for a lock; the process that grabs it performs the analysis and writes the failed IPs to the file. Because only one process performs the analysis, normal requests are not affected. Processes also compete for the lock only when a failure occurs; under normal conditions there is no interaction with Redis and no performance loss.
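The article does not say how the lock is taken. One common way to do it with Redis is a SET with the NX and EX options, and the sketch below assumes that approach. To stay runnable without a Redis server it hides the call behind a small interface with an in-memory implementation; the interface, the class, the lock key and the 5-second TTL are all hypothetical.

```php
<?php
// Minimal stand-in for the one Redis operation the lock needs.
// With the phpredis extension the equivalent call would be:
//   $redis->set($key, $token, ['nx', 'ex' => $ttl])
interface LockStore
{
    public function setIfAbsent(string $key, string $value, int $ttl): bool;
}

// In-memory store so the sketch runs without a Redis server.
final class ArrayLockStore implements LockStore
{
    private array $expiresAt = [];

    public function setIfAbsent(string $key, string $value, int $ttl): bool
    {
        $now = time();
        if (isset($this->expiresAt[$key]) && $this->expiresAt[$key] > $now) {
            return false; // another process still holds the lock
        }
        $this->expiresAt[$key] = $now + $ttl;
        return true;
    }
}

// Only the process that wins the lock runs the expensive log analysis.
function analyzeIfLockWon(LockStore $store, callable $analyze): bool
{
    $token = bin2hex(random_bytes(8)); // identifies this holder
    if (!$store->setIfAbsent('healthcheck:lock', $token, 5)) {
        return false;
    }
    $analyze();
    return true;
}

$store = new ArrayLockStore();
$runs = 0;
$first = analyzeIfLockWon($store, function () use (&$runs) { $runs++; });
$second = analyzeIfLockWon($store, function () use (&$runs) { $runs++; });
```

The first caller wins the lock and runs the analysis; the second returns immediately and goes on serving its request. The TTL makes the lock expire on its own if the holder dies.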

Our health check depends on a centralized Redis service. What if it goes down? If the Redis service itself is judged to be down, the RPC framework automatically turns off the health check and stops interacting with Redis, so at least the normal RPC functionality is not affected.

Flow control builds on the health check. When most or all IPs are failing, we can infer that the traffic is too large for the backend service to respond, which is why requests fail, and traffic should then be limited by some policy. The usual approach is to cut the traffic off entirely, which is rather crude. Our approach is to reduce traffic gradually until the ratio of failed IPs drops to a certain value, and then to try increasing traffic gradually. Increases and decreases may alternate in a cycle, which is what makes the flow control dynamic, and eventually an optimal traffic value is found. The flow control functions are introduced below through the related configuration:

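    degrade_ip_fail_ratio = 1 ; ratio of failed IPs at which degradation starts

The ratio of failed IPs at which the service starts to degrade, that is, starts to reduce traffic.

    degrade_dec_step = 0.1 ; how much the limit grows at each step

The proportion of traffic removed at each step.

    degrade_stop_ip_ratio = 0.5 ; ratio of failed IPs at which traffic reduction stops

Once the ratio of failed IPs has dropped to this value, stop reducing traffic and try to increase it.

    degrade_stop_ttl_s = 10 ; how long to wait after stopping before trying to increase traffic
    degrade_step_ttl_s = 10 ; how long to wait between traffic increases or decreases

After each increase or decrease, the next step is decided by the ratio of failed IPs at that moment, and the current traffic value is held for a while instead of a decision being made immediately.

    degrade_add_step = 0.1 ; proportion of traffic added at each increase
    degrade_return = false ; value returned while degraded

While degraded, a limited request does not reach the backend service; a configured value is returned directly to the caller.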
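One possible reading of these settings as a small state machine is sketched below. The state names and the transition out of the increasing state are interpretations made for this example, not taken from the original state diagram. The caller is assumed to invoke nextStep() once per degrade_step_ttl_s, and to wait degrade_stop_ttl_s before leaving the holding state.

```php
<?php
// One adjustment step. $state is 'normal', 'decreasing', 'holding' or
// 'increasing' (names chosen for this sketch); $pass is the fraction of
// traffic let through; $failRatio is the current fraction of failed IPs.
function nextStep(string $state, float $pass, float $failRatio, array $cfg): array
{
    // round() keeps repeated 0.1 steps from accumulating float noise.
    $dec = fn(float $p): float => max(0.0, round($p - $cfg['degrade_dec_step'], 4));
    $inc = fn(float $p): float => min(1.0, round($p + $cfg['degrade_add_step'], 4));

    switch ($state) {
        case 'normal':
            return $failRatio >= $cfg['degrade_ip_fail_ratio']
                ? ['decreasing', $dec($pass)]
                : ['normal', 1.0];
        case 'decreasing':
            return $failRatio > $cfg['degrade_stop_ip_ratio']
                ? ['decreasing', $dec($pass)]
                : ['holding', $pass];
        case 'holding':
            return ['increasing', $inc($pass)];
        case 'increasing':
            // Assumption: the same threshold that starts degradation
            // also sends an increasing service back to decreasing.
            if ($failRatio >= $cfg['degrade_ip_fail_ratio']) {
                return ['decreasing', $dec($pass)];
            }
            $next = $inc($pass);
            return [$next >= 1.0 ? 'normal' : 'increasing', $next];
    }
    throw new InvalidArgumentException("unknown state: $state");
}

$cfg = [
    'degrade_ip_fail_ratio' => 1,
    'degrade_dec_step'      => 0.1,
    'degrade_stop_ip_ratio' => 0.5,
    'degrade_add_step'      => 0.1,
];
$s1 = nextStep('normal', 1.0, 1.0, $cfg);      // all IPs failing: start reducing
$s2 = nextStep($s1[0], $s1[1], 0.8, $cfg);     // still above 0.5: reduce again
$s3 = nextStep($s2[0], $s2[1], 0.4, $cfg);     // at or below 0.5: hold
$s4 = nextStep($s3[0], $s3[1], 0.4, $cfg);     // after the hold: try increasing
```

In this trace the pass ratio goes from 1.0 to 0.9 to 0.8, is held at 0.8, and then rises to 0.9.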

[Figure: state diagram of the flow control]

How is traffic limited to a given proportion? By random selection: draw a random number and check whether it falls within a given range. Limiting traffic at the optimal value lets most requests work normally with the least impact on users. Flow control should also be paired with monitoring and alerting: if a module's flow control ratio is found to be 1 or less, that module is a bottleneck of the system, and the next step should be to add hardware resources or to optimize the program's performance.
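The random selection described above can be sketched in a few lines. The function names are illustrative, and $degradeReturn stands in for the configured degrade_return value.

```php
<?php
// Returns true when the request may go to the backend.
// $passRatio is the fraction of traffic to let through (0.0 to 1.0).
function allowRequest(float $passRatio): bool
{
    // mt_rand(0, 9999) / 10000 is uniform over 0.0000 .. 0.9999,
    // so a ratio of 1.0 always passes and 0.0 never does.
    return mt_rand(0, 9999) / 10000 < $passRatio;
}

// Hypothetical wrapper: $call performs the real RPC; a limited request
// gets the configured fallback value without touching the backend.
function callWithFlowControl(float $passRatio, callable $call, $degradeReturn = false)
{
    return allowRequest($passRatio) ? $call() : $degradeReturn;
}

$passed = 0;
for ($i = 0; $i < 10000; $i++) {
    if (allowRequest(0.5)) {
        $passed++;
    }
}
// With a pass ratio of 0.5, roughly half of the 10000 requests get through.
```

Because each request is decided independently, the achieved proportion only approximates the target over many requests, which is usually acceptable for this kind of throttling.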
