Netflix announces personalization and recommendation system architecture


Netflix's recommendation and personalization features have long been known for their accuracy, and the company recently disclosed the system architecture behind them.

On March 27, Netflix engineers Xavier Amatriain and Justin Basilico published a post on the company's official blog introducing their personalization and recommendation system architecture. At the beginning of the article, they point out:

It is not easy to develop a software architecture that can handle massive volumes of existing data, respond to user interactions, and still make it easy to experiment with new recommendation approaches.

Next, the article presents a diagram of their system architecture, whose main components include a variety of machine learning algorithms.

They explain the components and processes as follows:

The simplest way to handle data is to store it for later offline processing, which is the part of the architecture we use to manage offline jobs. Computation can be done online, nearline, or offline.

Online computation can respond more quickly to recent events and user interactions, but it must complete in real time. This limits the complexity of the algorithms that can be used and the amount of data that can be processed. Offline computation has fewer limits on data volume and algorithmic complexity, because it runs in batch mode with no strict time requirements. However, it can easily become stale because the latest data is not incorporated in time. A key issue for the personalization architecture is how to seamlessly combine and manage the online and offline computation processes. Nearline computation resembles online computation, but does not have to finish in real time.

Model training is another kind of computation that uses existing data to produce a model which is later used to compute actual results. Another piece of the architecture is how different kinds of data and events are handled by the event and data distribution system. A related question is how to combine the different signals and models that span the offline, nearline, and online worlds. Finally, we need to figure out how to combine the recommendation results in a way that makes sense for the user.

Next, the article analyzes online, nearline, and offline computation.

Online computation must meet SLAs for availability and response time, and a purely online computation may fail to meet the SLA in some scenarios, so a fast fallback mechanism is important, such as returning precomputed results. Online computation also requires fast access to its data sources; keeping those sources available online requires additional infrastructure.
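The fallback idea described above can be sketched as follows. This is an illustrative outline only, not Netflix's actual implementation; `compute_online`, `recommend`, and the `PRECOMPUTED` table are hypothetical names.

```python
import concurrent.futures

# Hypothetical precomputed results, e.g. loaded from an offline batch job.
PRECOMPUTED = {"user42": ["Movie A", "Movie B", "Movie C"]}

def compute_online(user_id):
    """Stand-in for an expensive online ranking computation."""
    return ["Movie C", "Movie A", "Movie B"]

def recommend(user_id, timeout_s=0.1):
    """Try the online computation; fall back to precomputed results
    if it cannot finish within the SLA time budget."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(compute_online, user_id)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return PRECOMPUTED.get(user_id, [])
```

The essential point is that the caller always gets an answer within its budget: a fresh one when the online path is fast enough, a stale-but-valid one otherwise.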

Offline computation is relatively flexible in terms of algorithms, and its engineering requirements are simpler. Client-side SLA requirements on response time are also relaxed, so there is little need for performance tuning when deploying a new algorithm into production. Netflix uses this flexibility for rapid experimentation: if a new experimental algorithm runs slowly, they deploy more Amazon EC2 instances to reach the throughput target rather than spend valuable engineering time optimizing performance, since the business value of that optimization may be low.

Nearline computation resembles online computation, but its results are not served immediately; instead they are stored temporarily, making the process asynchronous. Nearline computation is done in response to user events, so the system can respond faster between requests, which makes it possible to do more complex processing per event. Incremental learning algorithms are well suited to nearline computation.
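As a minimal illustration of the kind of incremental learning that fits the nearline layer, the sketch below maintains a per-user running mean rating that can be updated one event at a time, with no batch retraining. The class and its fields are hypothetical, chosen only to show the event-at-a-time update pattern.

```python
class IncrementalMeanRating:
    """Tiny incremental learner: a per-user running mean rating,
    updated one event at a time, as a nearline job might do."""

    def __init__(self):
        self.count = {}   # user_id -> number of ratings seen
        self.mean = {}    # user_id -> running mean rating

    def update(self, user_id, rating):
        # Welford-style running mean: no need to store past events.
        n = self.count.get(user_id, 0) + 1
        m = self.mean.get(user_id, 0.0)
        self.count[user_id] = n
        self.mean[user_id] = m + (rating - m) / n
```

Because each update touches only one user's state, such a model can consume an event stream asynchronously and keep its results fresh between requests.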

In any case, choosing online, nearline, or offline is not an either-or decision. All of these approaches can and should be used together. ... Even the modeling part can be done in a hybrid online/offline fashion. This may not suit traditional supervised classification applications, because classifiers must be trained in batch from labeled data and can only be applied online to classify new inputs. However, approaches such as matrix factorization are well suited to hybrid offline/online modeling: some factors can be precomputed offline while others are updated in real time to produce fresher results. Unsupervised approaches such as clustering can likewise compute cluster centers offline and assign items to clusters online. These examples show that model training can be decomposed into a large-scale, complex global training phase and a lighter-weight, user-specific training or update phase that runs online.
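The matrix-factorization split mentioned above can be sketched as follows, assuming the item-factor matrix was produced by an offline batch job (random values stand in for it here) and only the user's factor vector is re-solved online. Function names are illustrative, not Netflix's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline phase: item factors V (n_items x k), trained in batch.
# Random values stand in for a real offline training job.
V = rng.normal(size=(5, 2))

def update_user_factor(ratings, reg=0.1):
    """Online phase: given (item_index, rating) pairs for one user,
    solve a small ridge regression for that user's factor vector,
    keeping the offline item factors fixed."""
    idx = [i for i, _ in ratings]
    y = np.array([r for _, r in ratings])
    X = V[idx]                                   # rows for rated items
    k = V.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(k), X.T @ y)

def predict(user_factor, item_index):
    """Predicted rating: dot product of user and item factors."""
    return float(V[item_index] @ user_factor)
```

The expensive part (factorizing the full rating matrix) stays offline, while the per-user solve is a tiny k-by-k system that is cheap enough to run on every fresh event.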

Offline jobs are primarily used to run personalized machine learning algorithms. These jobs execute periodically and do not need to be synchronized with the request for, or presentation of, results. There are two main kinds of task: model training and batch computation of intermediate or final results. (Some of their learning algorithms, however, run in an online, incremental fashion.)

Both kinds of task need refined data to process, which is usually produced by a database query. Because these queries operate on large volumes of data and are easy to run in a distributed way, Hadoop, via Hive or Pig jobs, is a natural fit. Once a query completes, some mechanism is needed to publish the resulting data. For such a mechanism, Netflix has the following requirements:

    • Subscribers should be notified when query results are ready.
    • Different storage repositories should be supported (HDFS, S3, Cassandra, etc.).
    • Errors should be handled transparently, with monitoring and alerting.
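A minimal sketch of a publish/subscribe mechanism meeting these three requirements might look like the following. This is purely illustrative; Hermes is internal to Netflix and its actual API is not public, so all names here are hypothetical.

```python
class ResultBus:
    """Illustrative publish/subscribe mechanism for batch-job results:
    notifies subscribers on completion, passes along a backend-agnostic
    storage location, and routes errors to a monitoring hook."""

    def __init__(self, on_error=print):
        self.subscribers = []
        self.on_error = on_error   # hook for monitoring/alerting

    def subscribe(self, callback):
        """Register a callback taking (dataset_name, location)."""
        self.subscribers.append(callback)

    def publish(self, dataset_name, location):
        """Announce that a dataset is ready at some storage location
        (e.g. an hdfs://, s3://, or cassandra: URI)."""
        for cb in self.subscribers:
            try:
                cb(dataset_name, location)
            except Exception as exc:
                # A failing subscriber must not block the others.
                self.on_error(f"{dataset_name}: {exc}")
```

Decoupling the notification from the storage backend is what lets the same mechanism serve HDFS, S3, and Cassandra alike: subscribers receive a location and decide how to read it.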

Netflix uses an internal tool called Hermes to deliver data to subscribers in near real time. It is close to Apache Kafka in some respects, but it is not a message/event queuing system.

Both offline and online computation need to process three kinds of input: models, data, and signals. A model is a parameter file produced by offline training; data is previously processed information stored in some kind of database; and in Netflix's terminology, a signal is fresh information fed into the algorithm. Signals come from live services and can consist of user-related data.

For event and data distribution, Netflix collects as many user events as possible from its many devices and applications and funnels them to a central place to provide base data for the algorithms. They distinguish between data and events. Events are time-sensitive pieces of information that need to be processed as soon as possible; an event is routed to trigger follow-up actions or processing. Data, by contrast, is processed and stored for later use; latency matters less, while the quality and quantity of the information matter more. Some user events are also treated as data and processed accordingly.
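The event-versus-data split above amounts to a routing decision at ingestion time, which can be sketched as below. The message fields and handler registry are hypothetical, chosen only to make the distinction concrete.

```python
def route(message, event_handlers, data_store):
    """Illustrative split between time-sensitive events (dispatched to
    handlers immediately) and bulk data (appended to a store for later
    batch processing). A message may be both: events can also be kept
    as data, as the article notes."""
    if message.get("time_sensitive"):
        for handler in event_handlers.get(message["type"], []):
            handler(message)          # trigger follow-up processing now
    else:
        data_store.append(message)    # latency-tolerant, batched later
```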

Netflix processes near-real-time event streams with an internal framework called Manhattan, a distributed computing system that sits at the center of its recommendation algorithm architecture. It is similar to Twitter's Storm, but serves different purposes and answers different internal requirements. Data flows mainly through Chukwa into Hadoop for the initial stages of processing; from there, Hermes serves as the publish/subscribe mechanism.

Netflix stores offline and intermediate results in Cassandra, EVCache, and MySQL, each with its own pros and cons. MySQL stores structured relational data, but faces scalability problems in a distributed environment. When a large number of write operations is required, EVCache is a better fit. The key question is how to satisfy conflicting requirements, query complexity, read/write latency, transactional consistency, and so on, and reach a good trade-off for each situation.

In the summary, they point out:

We need the ability to use sophisticated machine learning algorithms that can cope with high complexity and handle large amounts of data. We also need an architecture that allows flexible, agile innovation, one on which new approaches can be easily developed and plugged in. And we need our recommendations to be fresh enough to respond quickly to new data and user behavior. Finding the right balance between these requirements is not easy: it requires thoughtful requirements analysis, careful technology selection, and a strategic decomposition of the recommendation algorithms, all in order to achieve the best results for our customers.

(Source: http://www.infoq.com/cn/news/2013/04/netflix-ml-architecture/)

