With New Year's Day 2014 having passed smoothly on the Weibo platform, the 2013 availability figure for the platform's core service interfaces was finalized at 99.991%.
Service availability was a key goal for the Weibo platform technology team in 2013. To that end, the platform established a dedicated SLA index system, which includes an availability metric for the platform's core service interfaces (mainly the feed-related interfaces): the annual average proportion of interface requests completed in under 100ms must exceed 99.99%, i.e. a "four nines" availability target.
Where are our challenges?
When it comes to the feed service, everyone knows it is the core of Weibo and its most valuable service, so it is also where product managers invest the most effort. One product feature after another is realized in the feed service: feed pinning, keyword blocking, hot recommendations, and so on. The price is a steady growth in service dependencies. Although the Weibo platform sets SLA requirements for internal dependent resources and service modules, even if every dependency reasonably delivers 99.99% availability, and assuming the feed service relies on 9 service modules, the feed service's theoretical availability can only reach 99.99%^9 ≈ 99.91%, meaning only three nines can be guaranteed. Worse, the feed service actually depends on more than 9 resources or service modules, and some of those dependencies do not even reach the four-nines mark due to various constraints. All of this poses a serious threat to service availability, and a great challenge to achieving our goal.
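Assuming the nine dependencies fail independently and every feed request needs all of them, the arithmetic behind that figure is:

```latex
A_{\text{feed}} \;=\; \prod_{i=1}^{9} A_i \;=\; 0.9999^{9} \;\approx\; 0.99910 \;\approx\; 99.91\%
```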
What have we done?
To achieve the four-nines availability goal, on one hand we established a standard SLA index system within the Weibo platform. Both resources and services have defined SLA metrics (chiefly a performance standard and an availability target), and a service-grading strategy is enforced: ordinarily important (weakly depended-on) resources or services are held to the standard SLA, while critical (strongly depended-on) resources or services are held to a stricter SLA, with sufficient resources and manpower invested to optimize the SLA metrics of those key dependencies. On the other hand, based on the characteristics of the feed service itself and the measured SLA data of each resource or service, we developed appropriate fault-tolerance and protection strategies, improving the service's robustness and availability through changes at the architecture level.
For a service's SLA metrics, the essential part is to define a performance standard together with the proportion of invocations that must meet it (i.e. the availability). Take a MySQL resource as an example: single-request latency < 50ms, with more than 99.99% of requests meeting that standard. (Of course, the concrete figures depend on business characteristics, SQL complexity, and other specifics, and cannot be generalized.)
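As an illustration, here is a minimal sketch in Java of how such an SLA ratio could be tracked; the class name, thresholds, and sample values are assumptions for this example, not the platform's actual implementation.

```java
import java.util.concurrent.atomic.LongAdder;

/** Minimal sketch: tracks what fraction of calls meet a latency SLA. */
public class SlaTracker {
    private final long thresholdMs;          // e.g. 50ms for a MySQL resource
    private final LongAdder total = new LongAdder();
    private final LongAdder withinSla = new LongAdder();

    public SlaTracker(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /** Record one call's observed latency. */
    public void record(long elapsedMs) {
        total.increment();
        if (elapsedMs < thresholdMs) {
            withinSla.increment();
        }
    }

    /** Proportion of calls meeting the standard; the target here would be > 0.9999. */
    public double availability() {
        long t = total.sum();
        return t == 0 ? 1.0 : (double) withinSla.sum() / t;
    }

    public static void main(String[] args) {
        SlaTracker mysqlSla = new SlaTracker(50);
        mysqlSla.record(12);   // within SLA
        mysqlSla.record(73);   // violates SLA
        System.out.printf("availability = %.4f%n", mysqlSla.availability());
    }
}
```

With SLA metrics like this defined for each dependency, we can protect overall service quality at the architecture level through a number of strategies, rather than placing unconditional trust in any particular resource or service. Specifically: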
Timeout control:
Timeout and handling policies are refined all the way from the connect timeout down to the socket timeout. The feed service depends on more than 10 resources or service modules; occasionally one of them has a problem, or network jitter causes requests to time out, and if the feed service simply waited on every dependency indiscriminately, the result would be a nightmare. The feed service therefore grades these dependencies according to business characteristics, derives SLA metrics from how strong or weak each dependency is, and finally sets an exception-handling threshold for each dependent request.

Take the feed-pinning feature's resource dependency as an example. The pinned weibo ID is stored in memcached, and feed aggregation must fetch it from memcached. The platform's SLA requirement for this resource is < 50ms at > 99.99%, which satisfies overall service performance under normal conditions. But under special conditions such as network jitter, many requests to fetch the pinned weibo ID will exceed the threshold, and if these timeouts are not effectively controlled, the entire feed request is affected. So in this case the feed service imposes explicit timeout control on this resource dependency at the architecture level: once the resource request takes more than 80ms (in general this threshold is set slightly looser than the dependency's confirmed SLA standard, so that the timeout-control policy does not accidentally hurt normal business functions), the request is cut off to avoid dragging down the whole feed request. This strategy is implemented through the resource's connect timeout and socket timeout. When this resource has problems, the pinned weibo may occasionally be missing, but the main function of the feed is unaffected. As you can see, dependency isolation is essential: no dependency deserves unconditional trust.
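A minimal sketch of this pattern follows, using the spymemcached client's asynchronous get with a hard deadline; the cache address, key format, and fallback behavior are illustrative assumptions, not the platform's actual code.

```java
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class PinnedFeedLookup {
    private final MemcachedClient cache;

    public PinnedFeedLookup(MemcachedClient cache) {
        this.cache = cache;
    }

    /**
     * Fetch the pinned weibo ID with a hard 80ms bound; on timeout,
     * cancel the request and degrade (the feed renders without the pin).
     */
    public String getPinnedId(long userId) {
        Future<Object> f = cache.asyncGet("pinned:" + userId); // key format is illustrative
        try {
            return (String) f.get(80, TimeUnit.MILLISECONDS);  // looser than the 50ms SLA
        } catch (TimeoutException e) {
            f.cancel(false);  // give up; do not let this stall the whole feed request
            return null;      // degrade: no pinned weibo this time
        } catch (Exception e) {
            return null;      // any other failure also degrades rather than propagates
        }
    }

    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(AddrUtil.getAddresses("127.0.0.1:11211")); // illustrative address
        System.out.println(new PinnedFeedLookup(client).getPinnedId(42L));
        client.shutdown();
    }
}
```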
Buffering and fault tolerance:
Necessary dependencies must be tolerated to a certain degree (bounded by the SLA). Mechanisms such as request queues and automatic fault-tolerant degradation absorb short-term fluctuations in a dependent service, and an automatic-recovery strategy is also required so that once the dependent service recovers, the main service quickly recovers on its own.
The most typical such strategy in the feed service is the feed publishing module. When you publish a weibo, the content is not written to MySQL directly; it is placed on a message queue, and a dedicated message-processing module then consumes these messages and updates the storage (see the sketch below). This reduces the feed service's roughly three MySQL dependencies to a single lightweight message-queue dependency, raising reliability by a clear notch. A number of other scenarios currently rely on these buffering and fault-tolerance policies for degradation or isolation.
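Here is a minimal in-process sketch of that write-behind idea; the real system uses an external message queue, and writeToMysql below is a hypothetical placeholder for the storage-update module.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PublishPipeline {
    // In production this would be an external message queue;
    // an in-process BlockingQueue stands in for it here.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    /** Publish path: enqueue and return immediately; no direct MySQL write. */
    public boolean publish(String weiboContent) {
        return queue.offer(weiboContent);   // fails fast if the buffer is full
    }

    /** Consumer path: a dedicated worker drains the queue and updates storage. */
    public void startWorker() {
        Thread worker = new Thread(() -> {
            while (true) {
                try {
                    String content = queue.take();  // blocks until a message arrives
                    writeToMysql(content);          // hypothetical storage update
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void writeToMysql(String content) {
        // Placeholder: the real module updates several MySQL tables and indices.
        System.out.println("stored: " + content);
    }

    public static void main(String[] args) throws InterruptedException {
        PublishPipeline p = new PublishPipeline();
        p.startWorker();
        p.publish("hello weibo");
        Thread.sleep(100);  // let the worker drain the queue before exit
    }
}
```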
Manual switch:
In some special cases, however, a dependent resource or service may stay unavailable for a long period, affecting the main service. The protection policies mentioned earlier, such as timeout control, still spend the full timeout budget on every request: they handle short-lived fluctuations very well, but achieve little against longer-lasting problems. In such cases we need to manually disconnect the dependency, and re-enable it only after the underlying problem has been resolved.
One might argue: why not build an automatic switching mechanism on top of the timeout-control strategy? In fact, for the most critical resource and service dependencies, degradation is generally not permitted at all, and an automatic degradation policy would carry enormous risk, because the boundary at which to trigger it is very hard to judge. For these cases, manual degradation is the best approach. (By the time such a switch is thrown, the problem is generally already very serious and service quality has already suffered; the aim is simply to contain the impact, much like cutting off a broken arm to save the body.)
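A minimal sketch of such a manual switch: a flag that only an operator flips, checked at the call site. All names here are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Minimal sketch of a manual degradation switch; all names are hypothetical. */
public class DependencySwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    /** Flipped by an operator (e.g. via an admin command), never by automation. */
    public void turnOff() { enabled.set(false); }
    public void turnOn()  { enabled.set(true); }
    public boolean isEnabled() { return enabled.get(); }

    public static void main(String[] args) {
        DependencySwitch hotRecommend = new DependencySwitch();
        hotRecommend.turnOff();  // operator decides the dependency is broken long-term

        // Call site in the feed path: skip the dependency while the switch is off.
        if (hotRecommend.isEnabled()) {
            System.out.println("call hot-recommendation service");
        } else {
            System.out.println("degrade: render feed without hot recommendations");
        }
    }
}
```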