Design a fault-tolerant micro-service architecture


A microservices architecture lets you isolate failures through clearly defined service boundaries. But as in every distributed system, network, hardware, and application-level errors are common. Because of service dependencies, any component can become temporarily unavailable to its consumers. To minimize the impact of partial outages, we need to build fault-tolerant services that handle these outages gracefully.

This article describes the most common techniques and architecture patterns for building and operating a highly available microservices system, based on RisingStack's Node.js consulting and development experience.

If you're not familiar with the patterns in this article, that doesn't necessarily mean you're doing it wrong. Creating a reliable system always brings additional costs.

The risk of a microservices architecture

A microservices architecture moves application logic into services and uses the network layer to communicate between them. Communicating over a network instead of through in-memory calls adds latency and complexity to a system that now requires cooperation between multiple physical and logical components. The increased complexity of the distributed system also leads to a higher chance of network failures.

One of the biggest advantages of a microservices architecture is that teams can design, develop, and deploy their services independently. They have full ownership of their service's life cycle. It also means that teams have no control over the services they depend on, since those are most likely managed by a different team. With a microservices architecture, we need to keep in mind that provider services can become temporarily unavailable because of broken releases, configuration changes, and other changes made by someone else.

Graceful service degradation

One of the greatest benefits of a microservices architecture is that you can isolate failures and degrade gracefully when individual components fail. For example, during an outage, customers of a photo sharing app may not be able to upload new pictures, but they can still browse, edit, and share their existing photos.

Figure: Fault isolation in microservices

In most cases, this kind of graceful degradation is hard to achieve because applications in a distributed system depend on each other, and you need to apply several kinds of failover logic (some of which are covered later in this article) to prepare for temporary glitches and outages.

Figure: Services depend on each other and fail together without failover logic.

Change Management

Google's Site Reliability team has found that roughly 70% of outages are caused by changes to a live system. When you change something in your service (deploying a new version of your code or changing some configuration), there is always a chance of causing a failure or introducing a new bug.

In a microservices architecture, services depend on each other. That is why you should minimize failures and limit their negative effect. To deal with issues caused by changes, you can implement change management strategies and automatic rollback mechanisms.

For example, when you deploy new code or change some configuration, you should apply the change to a small subset of your instances first and roll it out gradually. During this time, monitor the instances, and if you see the change hurting your key metrics, roll it back immediately. This is called a canary deployment.
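The decision step of a canary rollout can be sketched as follows. This is only an illustration under stated assumptions: the `Metrics` shape, the `canary_decision` name, and the thresholds (a 1.5x error-rate tolerance and a minimum of 100 requests) are hypothetical and not taken from any particular deployment tool.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Request and error counts observed for a group of instances."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat "no traffic yet" as a zero error rate.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: Metrics, canary: Metrics,
                    tolerance: float = 1.5, min_requests: int = 100) -> str:
    """Return 'wait', 'promote', or 'rollback' for the canary group.

    tolerance and min_requests are illustrative values, not recommendations.
    """
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge the new version
    if canary.error_rate > baseline.error_rate * tolerance:
        return "rollback"  # the key metric degraded: revert immediately
    return "promote"  # safe to replace the next batch of instances
```

In practice you would probably compare several key metrics (latency, saturation, business metrics) rather than the error rate alone.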

Figure: Change management and rolling back a deployment

Another solution is to run two production environments. You always deploy to only one of them, and you point the load balancer at it only after you have verified that the new version works as expected. This is called blue-green or red-black deployment.

Rolling back code is not a bad thing. You should not leave broken code in production while you work out what went wrong. If a rollback is necessary, the sooner you do it the better.

Health checks and load balancing

Instances continuously start, restart, and stop because of failures, deployments, or autoscaling. This makes them temporarily or permanently unavailable. To avoid issues, your load balancer should skip unhealthy instances when routing, because they cannot serve your customers' or subsystems' needs.

Application instance health can be determined through external observation. You can do this by repeatedly calling a GET /health endpoint or through self-reporting. Modern service discovery solutions continuously collect health information from instances and configure the load balancer to route traffic only to healthy components.
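As a rough sketch of the routing side, a load balancer can filter its instance list through a health probe before picking a target. The `healthy_instances` helper and the probe callback are hypothetical names for this example. In a real system the probe would issue the GET /health request with a short timeout.

```python
from typing import Callable, Iterable, List


def healthy_instances(instances: Iterable[str],
                      check: Callable[[str], bool]) -> List[str]:
    """Return only the instances whose health probe succeeds."""
    healthy = []
    for instance in instances:
        try:
            if check(instance):
                healthy.append(instance)
        except Exception:
            # A probe that raises (timeout, refused connection) counts as unhealthy.
            continue
    return healthy
```

The probe is passed in as a callback so that the routing logic can be exercised without a network.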


Self-healing

Self-healing can help an application recover from errors. We can call an application self-healing when it can take the necessary steps to recover from a broken state. In most cases, this is implemented by an external system that watches instance health and restarts instances that have been in a broken state for a longer period. Self-healing is very useful in most cases. However, in certain situations, continuously restarting the application causes trouble. Frequent restarts may not be appropriate when your application cannot report a healthy status because it is overloaded or its database connection timed out.

Implementing an advanced self-healing solution that is prepared for delicate situations, such as a lost database connection, can be tricky. In this case, you need to add extra logic to your application to handle the edge cases and let the external system know that the instance does not need to be restarted immediately.

Failover caching

Services often fail because of network problems and changes in our systems. However, thanks to self-healing and load balancing, most of these outages are temporary, and we should find a solution that keeps our services working during these glitches. This is where failover caching helps: it provides the necessary data to our application while the service is failing.

Failover caches usually use two different expiration times: a shorter one that says how long the cache can be used under normal conditions, and a longer one that says how long the cached data can still be used during a failure.

Figure: Failover caching

It is important to mention that you should only use failover caching when serving outdated data is better than serving nothing.

To set up caching and failover caching, you can use standard HTTP response headers.
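The two expiration times can be illustrated with a small in-memory cache. This is a minimal sketch, not a production cache: the `FailoverCache` class, its parameter names, and the injected clock are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class FailoverCache:
    """Cache with a short 'fresh' TTL and a longer 'stale on failure' TTL."""

    def __init__(self, fresh_ttl: float, stale_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fresh_ttl = fresh_ttl
        self.stale_ttl = stale_ttl
        self.clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (stored at, value)

    def get(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] <= self.fresh_ttl:
            return entry[1]  # still fresh: no need to call the service
        try:
            value = fetch()
        except Exception:
            if entry is not None and now - entry[0] <= self.stale_ttl:
                return entry[1]  # service failed: fall back to stale data
            raise  # nothing usable in the cache, surface the failure
        self._entries[key] = (now, value)
        return value
```

Note that the stale entry is only served when the fetch actually fails. Under normal conditions an expired entry is always refreshed.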

For example, with the max-age directive you can specify the maximum amount of time a resource is considered fresh. With the stale-if-error directive, you can specify how long the resource may still be served from a cache when a failure occurs.

Modern CDNs and load balancers provide various caching and failover behaviors, but you can also create a shared library for your company that contains standard reliability solutions.
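Both directives belong to the Cache-Control response header. The helper below is a hypothetical convenience function, and the durations (10 and 20 minutes) are made-up values for illustration. Support for stale-if-error varies between caches, so check the one you use.

```python
def cache_control(max_age: int, stale_if_error: int) -> str:
    """Build a Cache-Control header value; both arguments are in seconds."""
    if max_age < 0 or stale_if_error < 0:
        raise ValueError("durations must be non-negative")
    return f"max-age={max_age}, stale-if-error={stale_if_error}"


# Fresh for 10 minutes; usable for another 20 minutes if the origin is failing.
headers = {"Cache-Control": cache_control(max_age=600, stale_if_error=1200)}
```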

Retry logic

In some cases we cannot cache our data, or we want to change it, but our operations eventually fail. In these cases, we can retry the operation, since we can expect the resource to recover after some time or our load balancer to send the request to a healthy instance.

You should be careful about adding retry logic to your applications and clients, because a large number of retries can make things worse or even prevent the application from recovering, for example when the service is already overloaded.

In a distributed system, a retry in one microservice can trigger multiple other requests or retries and start a cascading effect. To minimize the impact of retries, you should limit their number and use an exponential backoff algorithm to keep increasing the delay between retries until you reach the maximum limit.

Because a retry is initiated by the client (a browser, another microservice, and so on), and the client does not know whether the operation failed before or after the request was handled, you should prepare your application to handle requests idempotently. For example, when you retry a purchase operation, you should not charge the customer twice. Using a unique idempotency key for each transaction helps to handle retries.
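A bounded retry with exponential backoff might look like the sketch below. The `retry` function, its parameter names, and its default values are assumptions for illustration. The `sleep` function is injectable so the example can run without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], max_attempts: int = 5,
          base_delay: float = 0.1, max_delay: float = 5.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            if attempt >= max_attempts:
                raise  # retry budget exhausted: surface the last error
            # Double the delay after every failure, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

This sketch retries on any exception. A real client would usually retry only on errors that are likely to be transient.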
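The purchase example can be sketched with an idempotency key that the client generates once and resends on every retry. The `PaymentService` class and the receipt format are hypothetical. A real implementation would keep the keys in a durable store and insert them atomically, rather than in an in-memory dictionary.

```python
from typing import Dict


class PaymentService:
    """Toy payment handler that charges at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}  # idempotency key -> stored receipt
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key in self._results:
            # A retry of a request we already handled: replay the stored result.
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}-{amount_cents}"
        self._results[idempotency_key] = receipt
        return receipt
```

With this in place, retrying `charge("order-42", 1999)` returns the original receipt instead of charging the customer again.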

Rate limiters and load shedders

Rate limiting is the technique of defining how many requests a particular customer or application can send or have processed within a timeframe. With rate limiting you can, for example, filter out the customers and services responsible for traffic spikes, or ensure that your application isn't overloaded until autoscaling catches up.

You can also block lower-priority traffic and provide sufficient resources for critical transactions.
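One simple way to implement a per-client limit is a fixed-window counter, sketched below. The class name, the window scheme, and the injected clock are choices made for this example. Token buckets and sliding windows are common alternatives with smoother behavior at window boundaries.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each time window."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows: Dict[str, Tuple[int, int]] = {}  # client -> (window index, count)

    def allow(self, client_id: str) -> bool:
        window = int(self.clock() // self.window_seconds)
        current, count = self._windows.get(client_id, (window, 0))
        if current != window:
            count = 0  # a new window started: reset the counter
        if count >= self.limit:
            return False  # over the limit: reject (for example with HTTP 429)
        self._windows[client_id] = (window, count + 1)
        return True
```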

Figure: A rate limiter holding back a traffic spike

A different type of rate limiter is the concurrent request limiter. It is useful when you have an expensive endpoint that should not be called more than a specified number of times at once, but you still want to keep serving traffic.

A fleet usage load shedder can ensure that there are always enough resources available to serve critical transactions. It keeps some resources for high-priority requests and doesn't allow low-priority transactions to use all of them. A load shedder makes its decisions based on the overall state of the system rather than on the size of a single user's request volume. Load shedders help your system recover, because they keep the core functionality working during an ongoing incident (for example, a sudden surge caused by a trending event).

To learn more about rate limiters and load shedders, I recommend reading Stripe's article on the topic.
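A concurrent request limiter can be sketched with a non-blocking semaphore: requests beyond the limit are rejected immediately instead of queuing. The `ConcurrencyLimiter` and `TooManyRequests` names are hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class TooManyRequests(Exception):
    """Raised when the endpoint is already at its concurrency limit."""


class ConcurrencyLimiter:
    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # Do not wait for a free slot: fail fast so callers are not left hanging.
        if not self._slots.acquire(blocking=False):
            raise TooManyRequests("endpoint is at its concurrency limit")
        try:
            yield
        finally:
            self._slots.release()  # always free the slot, even on errors
```

A request handler would wrap the expensive work in `with limiter.slot():` and translate `TooManyRequests` into an appropriate error response.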
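The admission decision of such a load shedder can be sketched as a pure function over the system-wide load. The function name, the two priority labels, and the numbers in the usage note are assumptions for illustration only.

```python
def should_shed(priority: str, in_flight: int, capacity: int,
                reserved_for_critical: int) -> bool:
    """Return True if the request should be rejected to protect critical traffic.

    in_flight is the system-wide number of requests currently being served,
    not a per-user figure.
    """
    if in_flight >= capacity:
        return True  # completely full: nothing else can be admitted
    if priority == "critical":
        return False  # critical requests may use the reserved share
    # Non-critical traffic may only use the unreserved share of the capacity.
    return in_flight >= capacity - reserved_for_critical
```

For example, with a capacity of 100 and 20 slots reserved, a low-priority request is rejected once 80 requests are in flight, while critical requests are still admitted until the system is completely full.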

Fail fast and independently

In a microservices architecture, we want our services to fail fast and independently. To isolate issues at the service level, we can use the bulkhead pattern. You can read more about bulkheads later in this article.

We also want our components to fail fast, because we don't want to wait for a broken instance until the request times out. Nothing is more disappointing than a hanging request and an unresponsive UI. It wastes resources and hurts the user experience. Our services call each other in a chain, so we should pay special attention to preventing hanging operations before these delays add up.

The first idea that comes to mind is to apply fine-grained timeouts to each service call. The problem with this approach is that you can't really know what a good timeout value is: network glitches and other issues sometimes affect only one or two operations. In that case, you probably don't want to reject those requests just because a few of them time out.

We can say that achieving fail-fast behavior in microservices through timeouts is an anti-pattern, and you should avoid it. Instead of timeouts, you can apply the circuit breaker pattern, which relies on the success and failure statistics of operations.

Bulkhead pattern

In shipbuilding, bulkheads partition a ship into sections so that each section can be sealed off if the hull is breached.

The concept of bulkheads can be applied in software development to segregate resources.

By applying the bulkhead pattern, we can protect limited resources from being exhausted. For example, if two kinds of operations talk to the same database instance, which has a limited number of connections, we can use two connection pools instead of one shared pool. Thanks to this separation between client and resource, an operation that times out or overuses its pool will not bring down the other operations.

One of the main reasons the Titanic sank was that its bulkheads had a design flaw: water could pour over the top of the bulkheads via the deck above and flood the entire hull.
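The two-pool idea can be sketched with one semaphore per kind of operation, standing in for separate connection pools. The `BulkheadPools` class and the operation names in the usage note are hypothetical. A real service would hand out actual database connections rather than bare permits.

```python
import threading
from typing import Dict


class BulkheadPools:
    """One bounded pool of permits per operation, so pools cannot starve each other."""

    def __init__(self, pool_sizes: Dict[str, int]) -> None:
        self._pools = {name: threading.BoundedSemaphore(size)
                       for name, size in pool_sizes.items()}

    def try_acquire(self, operation: str) -> bool:
        # Non-blocking: an exhausted pool rejects instead of waiting.
        return self._pools[operation].acquire(blocking=False)

    def release(self, operation: str) -> None:
        self._pools[operation].release()
```

With `BulkheadPools({"reports": 1, "checkout": 2})`, a slow report query can exhaust only its own pool, and checkout requests can still obtain a permit.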

Figure: Bulkhead design of the Titanic (a failed design)

Circuit Breaker

We can limit the duration of operations with timeouts. Timeouts prevent hanging operations and keep the system responsive. However, using static, fine-tuned timeouts in microservices communication is an anti-pattern, because we are in a highly dynamic environment where it is almost impossible to come up with time limits that work well in every case.

Instead of static timeouts, we can use circuit breakers to deal with errors. Circuit breakers are named after the real-world electrical component because they behave the same way. With a circuit breaker you can protect resources and help them recover. Circuit breakers are very useful in distributed systems, where repeated failures can cause a snowball effect and bring the whole system down.

A circuit breaker opens when a particular type of error occurs multiple times within a short period. An open circuit breaker prevents further requests from being made, just as a tripped breaker stops current from flowing. Circuit breakers usually close after a certain amount of time, which gives the underlying service enough room to recover.

Keep in mind that not all errors should trigger a circuit breaker. For example, you probably want to ignore client-side issues, such as requests with 4xx response codes, but count 5xx server-side failures. Some circuit breakers also have a half-open state. In this state, the breaker lets a first request through to check whether the system is available, while failing the other requests. If this first request succeeds, the breaker returns to the closed state and lets traffic flow. Otherwise, it remains open.
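The closed, open, and half-open states can be sketched as below. This is a deliberately simplified illustration: the class and parameter names are assumptions, it counts consecutive failures instead of failures within a time window, it treats every exception as a server-side failure, and it is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the operation."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: half-open, let this request probe the service.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

To skip client-side errors as described above, a fuller version would inspect the failure (for example the response status) before counting it.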

Figure: Circuit breaker

Testing for failures

You should continually test your system against common issues to make sure that your services can survive various failures. Test for failures frequently so that your team stays prepared to handle incidents.

For testing, you can use an external service that identifies groups of instances and randomly terminates one instance in a group. This prepares you for a single instance failure, but you can even shut down an entire region to simulate a cloud provider outage.

One of the most popular testing solutions is Netflix's Chaos Monkey resiliency tool.
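The core of such a test can be sketched in a few lines. The function below is a hypothetical illustration and not the API of any existing tool: the `terminate` callback stands in for whatever your platform uses to stop an instance.

```python
import random
from typing import Callable, Optional, Sequence


def terminate_random_instance(group: Sequence[str],
                              terminate: Callable[[str], None],
                              rng: Optional[random.Random] = None) -> Optional[str]:
    """Terminate one randomly chosen instance of the group and return its name."""
    if not group:
        return None  # nothing to terminate
    victim = (rng or random).choice(group)
    terminate(victim)
    return victim
```

Passing a seeded `random.Random` makes a test run reproducible, which helps when you want to replay a failure scenario.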


Implementing and running a reliable service is not easy. It takes a lot of effort on your side and also costs your company money.

There are many levels and aspects of reliability, so it is important to find the solution that best suits your team. You should make reliability a factor in your business decision-making process and allocate enough budget and time for it.

Key takeaways

    • Dynamic environments and distributed systems, such as microservices, lead to a higher chance of failures;
    • Services should fail separately and degrade gracefully to improve the user experience;
    • About 70% of outages are caused by changes, and rolling back code is not a bad thing;
    • Fail fast and independently, because teams have no control over the services they depend on;
    • Architectural patterns and techniques such as caching, bulkheads, circuit breakers, and rate limiters help to build reliable microservices.