Netflix's Path to Microservice Evolution

Source: Internet
Author: User
Tags: cassandra, jfrog, jfrog artifactory

Background


Netflix is one of the world's leading video streaming sites. Its catalog includes Hollywood productions, independent films, local movies, and well-known TV dramas such as "House of Cards". It has more than 80 million subscribers across 190 countries (China is not among them) and supports more than 1,000 types of devices.

Netflix is a heavy user of AWS and runs tens of thousands of VMs there. In the DevOps world Netflix is an industry pioneer, and it has contributed a lot of good open source software to the Spring Cloud Netflix community, such as Eureka, Zuul, Turbine, Hystrix, and more.

The challenges encountered


Netflix regards its previous application architecture as a typical monolith. Although the application tier ran active-active, it still relied on a single huge code base and a single huge database, so if the database went down, the entire system was paralyzed.

What is a microservices architecture? Recall Martin Fowler's definition:


The definition highlights several key ideas: multiple small services, independent processes, and lightweight communication mechanisms (usually HTTP).


Netflix believes a microservice must have the following capabilities:

Separation of concerns:

A single service should not handle both user information and order information. Services must be modular, encapsulating their internals and exposing only a service interface to the outside.

Horizontal scalability:

Can the service be scaled out smoothly? How long does it take to scale out a service? How is traffic shifted to the new nodes once the service has been scaled out?

Virtualization and elastic compute:

Operations must be automated, and compute environments must be created on demand.

1. Netflix microservice architecture


The architecture diagram shows the call relationships between Netflix services.

On the left is the edge service layer, which contains:

    • ELB (Elastic Load Balancing): distributes incoming client requests across instances.

    • Zuul: Netflix's open source gateway component, which provides dynamic routing, monitoring, resiliency, security, and other services.

    • API: Netflix's unified interface layer for calling the backend services.

On the right is the middleware service layer:


The services offered include:

    • Products

    1. A/B testing of products

    2. Subscription services

    3. Recommendation services

    • Platform

    1. Routing

    2. Service configuration

    3. Encryption


A typical microservice should have a cache layer, a service layer, and a data layer.

2. Netflix's pain points

    • Inter-service call failures


Calls between services can fail because of network latency, service outages, errors in the calling logic, or a failure to scale out.

    • Avalanche (cascading failure) caused by a service outage


When a core service fails, the availability of the whole system suffers. Requests to the failed service keep waiting for a response and are never released, until the server's resources are exhausted.

Solutions

Optimization: circuit breakers and FIT


Hystrix, a circuit breaker, is an open source component contributed by Netflix. Netflix believes that a service that has gone down should be detected immediately. Rather than continuing to call the unavailable service and waiting for a timeout, the system should immediately call a fallback method to handle the error.
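The fail-fast idea can be illustrated with a minimal sketch in Python. This is not Hystrix's API or implementation. The class name `CircuitBreaker`, its parameters, and the two example functions are all hypothetical and exist only for illustration.

```python
import time


class CircuitBreaker:
    """Toy breaker: opens after `max_failures` consecutive failures."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open before a retry
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        # After the cool-down, let one trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, func, fallback):
        if self.is_open():
            return fallback()  # fail fast: the broken service is not called
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


calls = {"n": 0}


def flaky_recommendations():
    # Hypothetical downstream call that always times out.
    calls["n"] += 1
    raise TimeoutError("recommendation service is down")


def default_recommendations():
    # Hypothetical fallback: a static list instead of personalized results.
    return ["popular-title-1", "popular-title-2"]


breaker = CircuitBreaker(max_failures=3, reset_after=60.0)
results = [
    breaker.call(flaky_recommendations, default_recommendations)
    for _ in range(10)
]
print(calls["n"])  # the failing service was only invoked 3 times
```

Only the first three of the ten requests reach the failing service. The remaining seven return the fallback at once, so callers do not pile up waiting for timeouts.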


Netflix's traffic is very high at peak and low off-peak, many applications are released to production, and no test environment can fully simulate production to verify an application's high availability. To solve these problems, Netflix built the Fault Injection Testing (FIT) framework for fault tolerance testing. It provides three main capabilities:

    • Simulate production traffic.

    • Ramp real production traffic up to 100% as a load test, to observe how the system really responds under high concurrency.

    • Netflix divides its microservices into two groups: core services, which let users load the application and watch video, and non-core services. FIT can create a scenario that stops all non-core services and checks whether users can still use Netflix's core services.
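The last scenario can be sketched as a small test harness. This is not the FIT framework itself. The `FaultInjector` class, the service names, and the `load_home_page` function are hypothetical stand-ins used to show the shape of the check.

```python
class FaultInjector:
    """Holds the names of services that should fail during this test run."""

    def __init__(self):
        self.failed = set()

    def fail(self, *names):
        self.failed.update(names)

    def check(self, name):
        if name in self.failed:
            raise ConnectionError(f"injected fault: {name}")


def call_service(injector, name, handler):
    injector.check(name)  # the injected failure fires before the real call
    return handler()


# Hypothetical non-core services and their normal responses.
NON_CORE = {
    "recommendations": lambda: ["title-a", "title-b"],
    "ratings": lambda: {"title-a": 4.5},
}


def load_home_page(injector):
    # Core path: playback must work, so its errors are allowed to propagate.
    page = {"playback": call_service(injector, "playback", lambda: "stream-url")}
    # Non-core paths: degrade to an empty value when the service is down.
    for name, handler in NON_CORE.items():
        try:
            page[name] = call_service(injector, name, handler)
        except ConnectionError:
            page[name] = None
    return page


injector = FaultInjector()
injector.fail(*NON_CORE)  # stop every non-core service
page = load_home_page(injector)
print(page)
```

With every non-core service failed, the page still carries a playback entry, which is the property such a test is meant to confirm.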

Optimization: distributed data consistency


Netflix's data needs to be stored in different AWS availability zones. Because writing one piece of data to databases in multiple availability zones incurs storage latency, some of the writes may fail, so Netflix chose eventual consistency, in CAP theorem terms, to deal with this. Netflix uses Cassandra as its distributed database: when you write data in availability zone B, Cassandra automatically replicates it to the other availability zones. You can use quorum settings for a flexible write strategy. For example, you can treat a write as successful as soon as one node has written it and let Cassandra synchronize the rest, or you can require the write to succeed on the entire cluster before the data counts as written.

Optimization: stateless cache service
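The trade-off between these write strategies comes down to how many replica acknowledgements a write waits for. The following is a small simulation in Python, not the Cassandra driver API. The level names mirror Cassandra's ONE, QUORUM, and ALL consistency levels, and the helper functions are hypothetical.

```python
def required_acks(level, replication_factor):
    """Replica acknowledgements needed before a write counts as successful."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def write_succeeds(level, replica_up):
    """`replica_up` holds one boolean per replica: True if it is reachable."""
    acks = sum(replica_up)  # only reachable replicas acknowledge the write
    return acks >= required_acks(level, len(replica_up))


# Replication factor 3, with only one replica reachable.
replicas = [True, False, False]
outcome = {
    level: write_succeeds(level, replicas)
    for level in ("ONE", "QUORUM", "ALL")
}
print(outcome)  # {'ONE': True, 'QUORUM': False, 'ALL': False}
```

With two of three replicas unreachable, only the weakest level accepts the write. That is the availability gained by accepting eventual consistency, at the cost of replicas that are briefly out of date.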

With a traditional caching service such as Squid, you can shard by user ID to spread highly concurrent requests, so that each user hits their own cache shard. But each shard is still a single point of failure: when one Squid service goes down, it can still trigger an avalanche. Netflix initially used Squid as a sharded cache, and when one shard went down, Netflix suffered 3 hours of downtime.


So Netflix adopted EVCache, a multi-write distributed cache that wraps Memcached. Each EVCache client writes cached data to cache servers in multiple availability zones, which avoids a single point of failure in the cache tier. When a client reads the cache, it reads only from its local availability zone.
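The write-everywhere, read-locally pattern can be sketched as follows. This is not the EVCache client API. `ZoneCache`, `MultiZoneCacheClient`, and the zone names are hypothetical, and plain dictionaries stand in for Memcached servers.

```python
class ZoneCache:
    """Stand-in for the cache servers in one availability zone."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True


class MultiZoneCacheClient:
    def __init__(self, local_zone, zones):
        self.zones = zones
        self.local = zones[local_zone]

    def set(self, key, value):
        # Write to every reachable zone so no single zone holds the only copy.
        written = 0
        for zone in self.zones.values():
            if zone.up:
                zone.data[key] = value
                written += 1
        return written

    def get(self, key):
        # Read only from the local zone, as described in the text.
        if not self.local.up:
            return None
        return self.local.data.get(key)


zones = {name: ZoneCache(name) for name in ("zone-a", "zone-b", "zone-c")}
client_a = MultiZoneCacheClient("zone-a", zones)
client_b = MultiZoneCacheClient("zone-b", zones)

copies = client_a.set("user:42", "profile-data")  # lands in all three zones
zones["zone-a"].up = False                        # simulate losing zone A

print(copies)                   # 3
print(client_b.get("user:42"))  # profile-data
```

Losing zone A costs only the clients located there. A client in zone B still finds the entry in its own zone, unlike the sharded setup, where one lost shard made its keys unavailable to everyone.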


With EVCache clusters, Netflix served 30 million requests per second. Even so, the cache service itself becomes a bottleneck when every application hits the online EVCache. Many background (batch) jobs, such as the recommendation service, access the online cache frequently, so Netflix separated the online and offline caches. The benefit is that background jobs no longer affect the online caching service.

Optimization: pre-release checklist


Through its ongoing operations, Netflix has distilled a number of best practices into a checklist to complete before going live. It covers alerting, automated canary release analysis, auto scaling, ELB configuration, load testing, blue-green deployment, rollback on failure, and so on.


Netflix uses Nebula for builds, Jenkins for CI, and JFrog Artifactory to manage artifacts (.jar, .deb, Docker images), and it has open sourced its continuous deployment platform, Spinnaker. Spinnaker automates canary releases: it connects to Jenkins on one side and deploys to AWS or Kubernetes clusters on the other. It can automatically score each stage of a canary release. The score draws on service status, user feedback, system anomalies, and other information to decide whether the release should move on to the next canary stage.

One important reason Netflix can deliver quickly is that the fault tolerance of its online services is very mature, even though code committed by its programmers can also cause online services to fail.
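The scoring step can be pictured with a toy example. This is not Spinnaker's scoring algorithm. The metric names, the 10% tolerance, and the pass mark of 70 are all assumptions chosen for illustration.

```python
def canary_score(baseline, canary, tolerance=0.10):
    """Percentage of metrics where the canary stays within `tolerance`.

    Both arguments map a metric name to a value where lower is better,
    for example an error rate or a latency.
    """
    passed = 0
    for name, base_value in baseline.items():
        allowed = base_value * (1 + tolerance)
        if canary[name] <= allowed:
            passed += 1
    return 100.0 * passed / len(baseline)


def next_stage(score, pass_mark=70.0):
    # Hypothetical gate: promote to the next canary stage or roll back.
    return "promote" if score >= pass_mark else "rollback"


baseline = {
    "error_rate": 0.010,
    "p99_latency_ms": 200.0,
    "exceptions_per_min": 4.0,
    "cpu_util": 0.50,
}
canary = {
    "error_rate": 0.0105,
    "p99_latency_ms": 260.0,  # regressed well beyond the tolerance
    "exceptions_per_min": 4.0,
    "cpu_util": 0.52,
}

score = canary_score(baseline, canary)
decision = next_stage(score)
print(score, decision)  # 75.0 promote
```

Three of the four metrics stay within tolerance, so the canary scores 75 and passes this gate. A stricter pass mark would roll it back on the latency regression alone.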


Netflix's service failure time on weekdays is very high, but because service degradation and isolation are automated, programmers can commit code and fix problems with full confidence.

Summary

Netflix's operations team provides a strong infrastructure for the business development teams. This gives them the ability to release rapidly, dramatically shortens the time it takes to turn an idea into an online service, and greatly raises the value of the operations team within the company.

Resources:

https://www.youtube.com/watch?v=CZ3wIuvmHeM

https://github.com/Netflix/zuul

Wang Qing

Currently chief architect at JFrog China, with previous R&D and architecture roles at IBM, HPE, iQiyi, Sina, VIPKID, and other companies. An Internet veteran with more than 10 years of development experience, focusing on software lifecycle management, microservice architecture, cloud-native applications, containerization, and related fields.

Reprinting is welcome, but please credit the author and the source. Thank you!
