Hystrix: A Latency and Fault Tolerance Framework for Distributed Systems

Source: Internet
Author: User

    • Introduction
      In medium and large distributed systems, an application usually depends on many other services (over HTTP, Hessian, Netty, Dubbo, and so on). Under highly concurrent access, the stability of those dependencies has a large effect on the system, yet the dependencies are exposed to many problems outside its control,
      such as slow network connections, busy resources, temporary unavailability, and services going offline.
      Under normal circumstances, every dependency responds normally.
      When one dependency, call it I, becomes busy while the other dependencies remain normal,
      calls to I block, and most of the server's thread pool ends up blocked as well, which affects the stability of the entire online service. Applications with complex distributed architectures have many dependencies, and each will inevitably fail at some point. When a dependency fails under high concurrency and there are no isolation measures, the application itself risks being dragged down.
      Hystrix is designed to provide greater tolerance of latency and failure by controlling the points at which an application accesses remote systems, services, and third-party libraries.
      Hystrix grew out of the resilience engineering work that the Netflix API team started in 2011. At Netflix it currently handles tens of billions of thread-isolated calls and hundreds of millions of semaphore-isolated calls per day. Hystrix is an open source library released under the Apache License 2.0 and hosted on GitHub.
    • What Hystrix can do

        1) Hystrix uses the command pattern: a HystrixCommand wraps the logic that calls a dependency. Each command executes under its own thread pool or semaphore authorization, and each command instance can be executed only once.
        2) It provides a circuit breaker (fuse) that can trip automatically or be invoked manually and that stops calls to the dependency for a period of time (10 seconds). The default error rate threshold is 50%; above it, the circuit breaker trips automatically.
        3) The timeout for a dependency call is configurable and is generally set slightly above the 99.5th percentile of response time. When a call times out, it returns immediately or executes the fallback logic.
        4) It provides a small thread pool (or semaphore) for each dependency. If the thread pool is full, the call is rejected immediately, with no queuing by default, which shortens the time needed to detect a failure.
        5) A dependency call ends in one of these outcomes: success, failure (an exception is thrown), timeout, thread pool rejection, or short circuit. The fallback (degradation) logic is executed when the request fails (exception, rejection, timeout, or short circuit).
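      The once-only execution, the timeout, and the outcome classification described above can be illustrated with the JDK alone. The following is a minimal sketch, not the Hystrix implementation: OneShotCommand, Outcome, and the per-call worker thread are all assumptions made to keep the example short.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Illustrative only: every name here is hypothetical and none of it is the Hystrix API.
class OneShotCommandSketch {

    enum Outcome { SUCCESS, FAILURE, TIMEOUT }

    /** Wraps a single dependency call; an instance may be executed only once. */
    static final class OneShotCommand<T> {
        private final Supplier<T> call;
        private final T fallback;
        private final long timeoutMillis;
        private final AtomicBoolean started = new AtomicBoolean(false);
        private volatile Outcome outcome;

        OneShotCommand(Supplier<T> call, T fallback, long timeoutMillis) {
            this.call = call;
            this.fallback = fallback;
            this.timeoutMillis = timeoutMillis;
        }

        T execute() {
            if (!started.compareAndSet(false, true)) {
                throw new IllegalStateException("this command instance was already executed");
            }
            // A dedicated worker keeps the sketch short; a real setup would reuse a pool.
            ExecutorService worker = Executors.newSingleThreadExecutor();
            Callable<T> task = call::get;
            Future<T> future = worker.submit(task);
            try {
                T value = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                outcome = Outcome.SUCCESS;
                return value;
            } catch (TimeoutException e) {
                future.cancel(true);             // interrupt the slow call
                outcome = Outcome.TIMEOUT;
                return fallback;
            } catch (ExecutionException e) {     // the call itself threw
                outcome = Outcome.FAILURE;
                return fallback;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                outcome = Outcome.FAILURE;
                return fallback;
            } finally {
                worker.shutdownNow();
            }
        }

        Outcome outcome() {
            return outcome;
        }
    }

    public static void main(String[] args) {
        OneShotCommand<String> ok = new OneShotCommand<>(() -> "value", "fallback", 1_000);
        System.out.println(ok.execute() + " " + ok.outcome());

        OneShotCommand<String> failing = new OneShotCommand<>(
                () -> { throw new IllegalStateException("dependency down"); }, "fallback", 1_000);
        System.out.println(failing.execute() + " " + failing.outcome());
    }
}
```

      Thread pool rejection and short circuit, the two remaining outcomes, are left out here because they depend on the isolation and circuit breaker mechanisms.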


  • Isolation Strategies

    1. Thread Isolation



    Thread isolation separates the thread that executes the dependency code from the request thread (for example, a Jetty thread), so the request thread is free to decide when to stop waiting (the call is asynchronous). The thread pool size caps concurrency: when the pool is saturated, calls are rejected early, which prevents a problem in one dependency from spreading.

    It is recommended that thread pools should not be set too large, otherwise a large number of blocked threads may slow down the server.

    Advantages:

    [1]: Threads completely isolate third-party code, and the request thread can be released quickly.

    [2]: When a failed dependency becomes available again, the thread pool clears and is usable again immediately, rather than after a long recovery period.

    [3]: Asynchronous calls can be fully emulated, which makes asynchronous programming easier.

    Disadvantages:

    [1]: The main disadvantage of the thread pool is that it increases CPU scheduling and context switching.

    [2]: It adds complexity to code that relies on thread state, such as ThreadLocal: the state must be passed to the worker thread and cleaned up manually.

    Note: Netflix considers the overhead of thread isolation small enough not to cause significant cost or performance impact. The Netflix API executes about 10 billion HystrixCommand instances per day with thread isolation, with approximately 40 thread pools per application and approximately 5-20 threads per thread pool.
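    To make the early-rejection behaviour concrete, here is a minimal sketch of thread isolation built only on java.util.concurrent. IsolatedPool and its methods are hypothetical names, not Hystrix classes; the sketch assumes a fixed-size pool with a SynchronousQueue, so a call arriving while every worker is busy is rejected instead of queued.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustrative only: hypothetical names, not the Hystrix API.
class ThreadIsolationSketch {

    /** One small pool per dependency; a full pool rejects instead of queuing. */
    static final class IsolatedPool {
        private final ThreadPoolExecutor pool;

        IsolatedPool(int size) {
            // SynchronousQueue holds no tasks, and the default AbortPolicy throws
            // RejectedExecutionException as soon as all threads are busy.
            pool = new ThreadPoolExecutor(size, size, 0L, TimeUnit.MILLISECONDS,
                    new SynchronousQueue<>());
        }

        <T> T call(Supplier<T> dependency, T fallback, long timeoutMillis) {
            Future<T> future;
            try {
                Callable<T> task = dependency::get;
                future = pool.submit(task);
            } catch (RejectedExecutionException e) {
                return fallback;                 // pool saturated: fail fast
            }
            try {
                // The request thread decides how long it is willing to wait.
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                future.cancel(true);
                return fallback;
            } catch (ExecutionException | TimeoutException e) {
                future.cancel(true);
                return fallback;
            }
        }

        void shutdown() {
            pool.shutdownNow();
        }
    }

    /** Occupies the only worker, then shows that a second call is rejected. */
    static String rejectedWhileBusy() {
        IsolatedPool pool = new IsolatedPool(1);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread firstCaller = new Thread(() -> pool.call(() -> {
            started.countDown();
            awaitQuietly(release);
            return "slow";
        }, "fallback", 5_000));
        firstCaller.start();
        awaitQuietly(started);                   // the single worker is now busy
        String second = pool.call(() -> "fast", "rejected", 1_000);
        release.countDown();
        pool.shutdown();
        return second;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectedWhileBusy());
    }
}
```

    The design choice to reject rather than queue is what keeps a slow dependency from accumulating waiting requests; the cost is that a burst slightly larger than the pool size is turned away even when the dependency is healthy.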

    2. Semaphore Isolation

    Semaphore isolation can also limit concurrent access and keep blocking from spreading. Unlike thread isolation, the dependency code still executes on the request thread, which must first acquire a semaphore permit. If the client is trusted and returns quickly, semaphore isolation can replace thread isolation to reduce overhead.

    The semaphore size can be adjusted dynamically, whereas the thread pool size cannot.

    Advantages:

    [1]: Semaphores avoid the extra CPU cost of thread context switching, so the overhead is much smaller than that of thread isolation.

    [2]: It limits the number of concurrent requests.

    Disadvantages:

    [1]: A dependency call cannot be cut off: it runs on the request thread, so a slow call keeps blocking that thread until it returns.
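    A minimal sketch of semaphore isolation using java.util.concurrent.Semaphore follows. SemaphoreGuard is a hypothetical name, not a Hystrix class; the point is only that the call runs on the caller's thread and is refused without waiting when no permit is free.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative only: hypothetical names, not the Hystrix API.
class SemaphoreIsolationSketch {

    static final class SemaphoreGuard {
        private final Semaphore permits;

        SemaphoreGuard(int maxConcurrentRequests) {
            permits = new Semaphore(maxConcurrentRequests);
        }

        <T> T call(Supplier<T> dependency, T fallback) {
            if (!permits.tryAcquire()) {
                return fallback;                 // at the limit: reject without waiting
            }
            try {
                return dependency.get();         // runs on the request thread itself
            } catch (RuntimeException e) {
                return fallback;
            } finally {
                permits.release();
            }
        }
    }

    public static void main(String[] args) {
        SemaphoreGuard guard = new SemaphoreGuard(1);
        // The inner call is attempted while the outer call still holds the only permit.
        String nested = guard.call(() -> guard.call(() -> "inner", "rejected"), "outer fallback");
        System.out.println(nested);
        System.out.println(guard.call(() -> "ok", "rejected"));
    }
}
```

    Because there is no second thread, nothing in this sketch can abandon a call that hangs, which is the trade-off listed under the disadvantages.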

    3. How to Choose



    The default and recommended isolation strategy is thread isolation: commands executed on separate threads get an extra layer of protection beyond what network timeouts alone provide. Use semaphore isolation only when the call volume is so high that the overhead of thread isolation becomes too large (typically for calls that do not go over the network).

  • Using Fallback to Provide a Degradation Strategy



    When the run() method of a HystrixCommand fails with any error or exception:

    No fallback provided: a HystrixRuntimeException is thrown with a message ending in "failed and no fallback available."
    Fallback provided: execution goes straight to the fallback logic.

    Note: The fallback logic itself should not contain anything that can raise an exception or error; where possible, return static content.
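    The two cases can be sketched in plain Java as follows. This is an illustration under assumptions, not Hystrix code: FallbackSketch, its execute methods, and CommandFailedException are hypothetical stand-ins (the last one for HystrixRuntimeException).

```java
import java.util.function.Supplier;

// Illustrative only: hypothetical names, not the Hystrix API.
class FallbackSketch {

    /** Stand-in for the wrapper exception raised when no fallback exists. */
    static final class CommandFailedException extends RuntimeException {
        CommandFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** No fallback: the failure is wrapped and propagated to the caller. */
    static <T> T execute(Supplier<T> run) {
        try {
            return run.get();
        } catch (RuntimeException failure) {
            throw new CommandFailedException("failed and no fallback available", failure);
        }
    }

    /** With a fallback: any failure goes straight to the fallback logic. */
    static <T> T execute(Supplier<T> run, Supplier<T> fallback) {
        try {
            return run.get();
        } catch (RuntimeException failure) {
            return fallback.get();               // keep this static and exception-free
        }
    }

    public static void main(String[] args) {
        String degraded = FallbackSketch.<String>execute(
                () -> { throw new IllegalStateException("dependency down"); },
                () -> "static default");
        System.out.println(degraded);
    }
}
```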

  • Circuit Breaker (Fuse)
    1. How the circuit breaker evaluates requests
    Requests are counted in a lock-free circular queue. By default each circuit breaker maintains 10 buckets, one bucket per second, and each bucket records how many requests succeeded, failed, timed out, or were rejected. By default the circuit trips and intercepts requests when the error rate exceeds 50% and more than 20 requests have been made within 10 seconds.
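    The bucket arithmetic can be sketched as below. This is a simplified, single-threaded illustration, not the lock-free structure Hystrix uses: RollingCounter is a hypothetical name, it tracks only success and failure, and the caller passes in timestamps (assumed non-negative and non-decreasing) so the behaviour is easy to follow.

```java
import java.util.Arrays;

// Illustrative only: hypothetical names, not thread-safe, not the Hystrix API.
class RollingWindowSketch {

    static final class RollingCounter {
        private final int bucketCount;
        private final long bucketMillis;
        private final long[] bucketStart;        // start time of the bucket held in each slot
        private final int[] successes;
        private final int[] failures;

        RollingCounter(int bucketCount, long bucketMillis) {
            this.bucketCount = bucketCount;
            this.bucketMillis = bucketMillis;
            this.bucketStart = new long[bucketCount];
            this.successes = new int[bucketCount];
            this.failures = new int[bucketCount];
            Arrays.fill(bucketStart, Long.MIN_VALUE);   // marks an unused slot
        }

        void record(long nowMillis, boolean success) {
            long start = nowMillis - nowMillis % bucketMillis;
            int slot = (int) ((nowMillis / bucketMillis) % bucketCount);
            if (bucketStart[slot] != start) {    // slot still holds an expired bucket: reuse it
                bucketStart[slot] = start;
                successes[slot] = 0;
                failures[slot] = 0;
            }
            if (success) {
                successes[slot]++;
            } else {
                failures[slot]++;
            }
        }

        int total(long nowMillis) {
            return sum(successes, nowMillis) + sum(failures, nowMillis);
        }

        int errorPercentage(long nowMillis) {
            int total = total(nowMillis);
            return total == 0 ? 0 : sum(failures, nowMillis) * 100 / total;
        }

        /** Adds up only the buckets that still fall inside the window ending now. */
        private int sum(int[] counts, long nowMillis) {
            long newest = nowMillis - nowMillis % bucketMillis;
            long oldest = newest - (bucketCount - 1) * bucketMillis;
            int sum = 0;
            for (int i = 0; i < bucketCount; i++) {
                if (bucketStart[i] >= oldest && bucketStart[i] <= newest) {
                    sum += counts[i];
                }
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        RollingCounter counter = new RollingCounter(10, 1_000);
        counter.record(500, false);
        counter.record(500, false);
        counter.record(500, false);
        counter.record(1_500, true);
        System.out.println(counter.total(1_500) + " requests, "
                + counter.errorPercentage(1_500) + "% errors");
    }
}
```

    With 10 buckets of 1000 ms, the three failures recorded at 500 ms stop counting once the current bucket starts at 10000 ms, because their bucket has then slid out of the 10-second window.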
    2. Circuit breaker recovery
    While the circuit is open, one request is allowed through every 5 seconds; if that request succeeds, the circuit closes and normal request handling resumes.
    3. The three circuit breaker states

    OPEN, HALF-OPEN, CLOSED

    The precise way that the circuit opens and closes is as follows:
    1. Assuming the volume across a circuit meets a certain threshold (HystrixCommandProperties.circuitBreakerRequestVolumeThreshold()) ...
    2. And assuming that the error percentage exceeds the threshold error percentage (HystrixCommandProperties.circuitBreakerErrorThresholdPercentage()) ...
    3. Then the circuit breaker transitions from CLOSED to OPEN.
    4. While it is open, it short-circuits all requests made against that circuit breaker.
    5. After some amount of time (HystrixCommandProperties.circuitBreakerSleepWindowInMilliseconds()), the next single request is let through (this is the HALF-OPEN state). If the request fails, the circuit breaker returns to the OPEN state for the duration of the sleep window. If the request succeeds, the circuit breaker transitions to CLOSED and the logic in 1. takes over again.
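    The five steps above can be sketched as a small state machine. This is a simplified illustration rather than Hystrix's implementation: Breaker and its methods are hypothetical, it counts requests since the last reset instead of using a rolling window, it is not thread-safe, and treating both thresholds as "greater than or equal" is an assumption of the sketch.

```java
// Illustrative only: hypothetical names, not thread-safe, not the Hystrix API.
class CircuitBreakerSketch {

    enum State { CLOSED, OPEN, HALF_OPEN }

    static final class Breaker {
        private final int requestVolumeThreshold;
        private final int errorThresholdPercentage;
        private final long sleepWindowMillis;
        private State state = State.CLOSED;
        private long openedAtMillis;
        private int total;
        private int failed;

        Breaker(int requestVolumeThreshold, int errorThresholdPercentage, long sleepWindowMillis) {
            this.requestVolumeThreshold = requestVolumeThreshold;
            this.errorThresholdPercentage = errorThresholdPercentage;
            this.sleepWindowMillis = sleepWindowMillis;
        }

        /** Returns true if a request may proceed at the given time. */
        boolean allowRequest(long nowMillis) {
            if (state == State.CLOSED) {
                return true;
            }
            if (state == State.OPEN && nowMillis - openedAtMillis >= sleepWindowMillis) {
                state = State.HALF_OPEN;         // step 5: let exactly one trial request through
                return true;
            }
            return false;                        // step 4: short-circuit
        }

        void onSuccess() {
            if (state == State.HALF_OPEN) {      // trial succeeded: close and start counting afresh
                state = State.CLOSED;
                total = 0;
                failed = 0;
            } else if (state == State.CLOSED) {
                total++;
            }
        }

        void onFailure(long nowMillis) {
            if (state == State.HALF_OPEN) {      // trial failed: open for another sleep window
                trip(nowMillis);
            } else if (state == State.CLOSED) {
                total++;
                failed++;
                boolean enoughVolume = total >= requestVolumeThreshold;              // step 1
                boolean tooManyErrors = failed * 100 / total >= errorThresholdPercentage; // step 2
                if (enoughVolume && tooManyErrors) {
                    trip(nowMillis);             // step 3
                }
            }
        }

        State state() {
            return state;
        }

        private void trip(long nowMillis) {
            state = State.OPEN;
            openedAtMillis = nowMillis;
        }
    }

    public static void main(String[] args) {
        Breaker breaker = new Breaker(20, 50, 5_000);
        for (int i = 0; i < 20; i++) {
            breaker.onFailure(0);
        }
        System.out.println(breaker.state());
        System.out.println(breaker.allowRequest(1_000));
        System.out.println(breaker.allowRequest(5_000));
    }
}
```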
    4. Scope of a circuit breaker
    Circuit breakers are distinguished by CommandKey, so the granularity of your command keys determines the granularity of circuit breaking and should be chosen deliberately.


  • Configuration
    1. Rolling statistical window duration, default 10000 ms (ten seconds)
    withMetricsRollingStatisticalWindowInMilliseconds(10000)
    2. Number of buckets in the rolling window, default 10
    withMetricsRollingStatisticalWindowBuckets(10)
    3. Health snapshot sampling interval, default 500 ms (this example sets 1)
    withMetricsHealthSnapshotIntervalInMilliseconds(1)
    4. Minimum number of requests in the statistical window before the circuit breaker can trip, default 20. That is, the circuit breaker only takes effect after at least 20 requests within 10 seconds.
    withCircuitBreakerRequestVolumeThreshold(20)
    5. Error percentage threshold, default 50. The circuit breaker trips when the error rate exceeds 50% (this example sets 30).
    withCircuitBreakerErrorThresholdPercentage(30)
    6. Sleep window, default 5 seconds (this example sets 1000 ms). After tripping, the circuit breaker rejects requests for 5 seconds and then lets a retry through; if that request still fails, the circuit stays open for another 5 seconds, and so on.
    withCircuitBreakerSleepWindowInMilliseconds(1000)
    7. Isolation strategy
    withExecutionIsolationStrategy(ExecutionIsolationStrategy.SEMAPHORE)
    8. Maximum number of concurrent requests under semaphore isolation
    withExecutionIsolationSemaphoreMaxConcurrentRequests(2)
    9. Command group name: the group a command belongs to, which helps organize commands.
    withGroupKey(HystrixCommandGroupKey.Factory.asKey("hellogroup"))
    10. Command name: each CommandKey represents one abstracted dependency, and calls to the same dependency should use the same CommandKey name. Dependency isolation fundamentally means isolating the dependency behind each CommandKey.
    andCommandKey(HystrixCommandKey.Factory.asKey("Hello"))
    11. Thread pool name: commands configured with the same name share a thread pool. If it is not configured, the group key is used as the thread pool name by default.
    andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("hellothreadpool"))
    12. Command properties: settings include the circuit breaker configuration, the isolation strategy, fallback settings, and some monitoring metrics.
    13. Thread pool properties: settings include the thread pool size, the size of the waiting queue, and so on.
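    The settings above are chained on a Setter that is passed to the HystrixCommand constructor. The fragment below is a sketch of how they fit together and needs the hystrix-core library on the classpath. HelloCommand and its greeting are hypothetical; the property values are the defaults quoted in this section, and the thread pool values (core size 10, no queue) are illustrative.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

// HelloCommand is a hypothetical example command, not part of Hystrix.
public class HelloCommand extends HystrixCommand<String> {

    private final String name;

    public HelloCommand(String name) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("hellogroup"))
                .andCommandKey(HystrixCommandKey.Factory.asKey("Hello"))
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("hellothreadpool"))
                // Item 12: command properties (metrics, circuit breaker, isolation).
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withMetricsRollingStatisticalWindowInMilliseconds(10000)
                        .withMetricsRollingStatisticalWindowBuckets(10)
                        .withCircuitBreakerRequestVolumeThreshold(20)
                        .withCircuitBreakerErrorThresholdPercentage(50)
                        .withCircuitBreakerSleepWindowInMilliseconds(5000)
                        .withExecutionIsolationStrategy(
                                HystrixCommandProperties.ExecutionIsolationStrategy.THREAD))
                // Item 13: thread pool properties (illustrative values).
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(10)
                        .withMaxQueueSize(-1)));
        this.name = name;
    }

    @Override
    protected String run() {
        // The dependency call would go here.
        return "Hello " + name;
    }

    @Override
    protected String getFallback() {
        // Static content, as recommended for fallback logic.
        return "Hello (fallback)";
    }
}
```

    A caller would create a new instance per request, for example new HelloCommand("World").execute(), since a command instance can be executed only once.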
     
  • Other features

    1. Request caching.

    2. Batched request execution (request collapsing).
