How to build a real-time and personalized recommendation system with Kiji


Recommendations are now everywhere on the Internet. Mainstream e-commerce sites such as Amazon recommend products to users in various forms across their pages. Financial planning sites such as Mint.com offer users many recommendations, such as a credit card they might want to apply for or a bank that may offer better rates. Google tunes search results using a user's search history to surface more relevant results.

These well-known companies use recommendations to provide a contextual, relevant user experience that improves conversion and user satisfaction. Such recommendations have typically been produced by batch jobs that compute new recommendations nightly, weekly, or monthly.

However, for some types of recommendations, the required response time is far shorter than the time a batch job takes, for example when providing location-based recommendations to consumers. Consider a movie recommendation system: if a user has mostly watched action movies but is now looking for a comedy, batch-computed recommendations will likely offer more action movies rather than the most relevant comedies. This article explains how to use Kiji, an open source framework for building big data applications and real-time recommendation systems.

Kiji, entity-centric data, and the 360-degree view

To build a real-time recommendation system, you first need a system that can store a 360-degree view of your customers. You also need to retrieve the data for a specific user quickly, so that recommendations can be made while that user is interacting with your website or mobile application. Kiji is a modular open source framework for building real-time applications that collect, store, and analyze this kind of data.

In general, the data required for a 360-degree view can be referred to as entity-centric data. An entity can be any number of things, such as customers, users, accounts, or more abstract things like POS systems or mobile devices.

An entity-centric storage system should be able to store all information about a particular entity in a single row. This is a challenge for traditional relational databases, because an entity can have both state data (such as a name or e-mail address) and event streams (such as clicks). Traditional systems need to store these data in multiple tables and join them at processing time, which makes real-time use difficult. To solve this problem, Kiji uses Apache HBase, which stores data in four dimensions: row, column family, column qualifier, and timestamp. By using the timestamp dimension and HBase's ability to store multiple versions of a cell, Kiji can store event-stream data alongside more slowly changing state data.
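
The four-dimensional layout described above can be illustrated with a minimal in-memory sketch. This is not the Kiji or HBase API: the `EntityTable` class, its method names, and the sample row are all hypothetical and exist only to show how one row can hold both state data and an event stream when each cell keeps timestamped versions.

```python
from collections import defaultdict


class EntityTable:
    """Toy in-memory table addressed by (row, family, qualifier, timestamp)."""

    def __init__(self):
        # cells[row][(family, qualifier)] -> {timestamp: value}
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, timestamp, value):
        self._cells[row][(family, qualifier)][timestamp] = value

    def get_latest(self, row, family, qualifier):
        """Return the newest version of a cell, or None if the cell is empty."""
        versions = self._cells[row][(family, qualifier)]
        if not versions:
            return None
        return versions[max(versions)]

    def get_versions(self, row, family, qualifier, limit=None):
        """Return (timestamp, value) pairs, newest first."""
        versions = self._cells[row][(family, qualifier)]
        ordered = sorted(versions.items(), reverse=True)
        return ordered if limit is None else ordered[:limit]


table = EntityTable()
# State data: only the latest version of the cell usually matters.
table.put("user-42", "info", "email", 1000, "old@example.com")
table.put("user-42", "info", "email", 2000, "new@example.com")
# Event stream: every version of the cell is a separate event.
table.put("user-42", "events", "click", 1500, "/movies/action/17")
table.put("user-42", "events", "click", 1800, "/movies/comedy/3")

latest_email = table.get_latest("user-42", "info", "email")
recent_clicks = table.get_versions("user-42", "events", "click")
```

Everything about `user-42` lives under one row key, so a single lookup returns both the current e-mail address and the click history without a join.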

HBase is a key-value store from the Apache Hadoop ecosystem, built on top of HDFS, which provides the scalability needed for big data solutions. The big challenge in developing applications on HBase is that it requires all data going in and out of the system to be byte arrays. To solve this problem, the final core component of Kiji is Apache Avro, which Kiji uses to store data types that are easy to work with, such as standard strings and integers, as well as more complex user-defined types. When reading and writing data, Kiji handles the necessary serialization and deserialization for the application.
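
To make the byte-array constraint concrete, here is a small sketch of a schema-driven codec that turns a typed record into bytes and back. It is only an illustration of the idea: it uses Python's standard `struct` module, it is not Avro's wire format or API, and the schema layout and the `encode` and `decode` names are assumptions made for this example.

```python
import struct

# Hypothetical record schema: ordered (field name, type) pairs.
PURCHASE_SCHEMA = [("product", "string"), ("quantity", "int")]


def encode(record, schema):
    """Serialize a dict to bytes, field by field, in schema order."""
    out = bytearray()
    for name, kind in schema:
        if kind == "int":
            out += struct.pack(">q", record[name])  # 8-byte big-endian integer
        elif kind == "string":
            raw = record[name].encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw  # length prefix, then bytes
        else:
            raise ValueError(f"unsupported type: {kind}")
    return bytes(out)


def decode(data, schema):
    """Rebuild the dict from bytes by walking the same schema."""
    record, offset = {}, 0
    for name, kind in schema:
        if kind == "int":
            (record[name],) = struct.unpack_from(">q", data, offset)
            offset += 8
        elif kind == "string":
            (length,) = struct.unpack_from(">I", data, offset)
            offset += 4
            record[name] = data[offset:offset + length].decode("utf-8")
            offset += length
        else:
            raise ValueError(f"unsupported type: {kind}")
    return record


purchase = {"product": "headphones", "quantity": 2}
cell_bytes = encode(purchase, PURCHASE_SCHEMA)      # what the store would hold
round_tripped = decode(cell_bytes, PURCHASE_SCHEMA)  # what the application sees
```

The application only ever handles `purchase`-style records; the byte array exists solely at the storage boundary, which is the role the article assigns to Avro inside Kiji.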

Developing models used in real time

Kiji provides two APIs for developing models, one in Java and one in Scala, and both support batch and real-time components. The purpose of this split is to divide model execution into distinct phases. The batch phase is the training phase, a typical learning process in which the complete dataset is used to train the model. The output of this phase may be the parameters of a linear classifier, the cluster centers of a clustering algorithm, or a similarity matrix relating items to one another in a collaborative filtering system. The real-time phase is called the scoring phase: it takes the trained model and combines it with an entity's data to produce derived information. The key point is that this derived data is treated as a first-class citizen, meaning it can be stored back in the entity's row, used to serve recommendations, or used as input to later computations.
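
The two phases can be sketched with a deliberately simple item-to-item model. This is plain Python, not KijiMR or KijiExpress code; the co-occurrence count stands in for whatever similarity measure a real collaborative filtering system would use, and the function names and sample purchase histories are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def train_item_similarity(purchase_histories):
    """Batch phase: count how often two items appear in the same history."""
    similarity = defaultdict(lambda: defaultdict(int))
    for items in purchase_histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            similarity[a][b] += 1
            similarity[b][a] += 1
    return similarity


def score_entity(similarity, entity_items, top_n=3):
    """Real-time phase: combine the trained model with one entity's data."""
    owned = set(entity_items)
    scores = defaultdict(int)
    for item in owned:
        for other, count in similarity.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


histories = {
    "alice": ["camera", "tripod", "sd-card"],
    "bob": ["camera", "sd-card"],
    "carol": ["camera", "tripod"],
    "dave": ["laptop", "mouse"],
}
model = train_item_similarity(histories)          # run occasionally, over everyone
recommendations = score_entity(model, ["camera"])  # run per request, for one user
```

Training touches every history, while scoring reads only the model and a single user's items, which is why the second step is cheap enough to run at request time.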

The Java API is called KijiMR, and the Scala API forms the core of the KijiExpress tool. KijiExpress uses the Scalding library to provide APIs for building complex MapReduce workflows while avoiding a large amount of Java boilerplate, as well as the job scheduling and coordination otherwise needed to chain MapReduce jobs together.

The individual and the general

Batch training and real-time scoring are split into two phases because of an observation behind Kiji: trends across the whole population change slowly, while an individual's trends change quickly.

For example, consider a dataset covering the whole user population that contains thousands of purchase records. One more purchase is unlikely to have a significant impact on the overall trend in likes and dislikes. But for a particular user with a record of only 10 purchases, the 11th purchase will have a significant impact on the system's judgment of that user's interests. Given this, an application needs to retrain its model only once it has collected enough data to affect the overall trend. For a particular user, however, we can improve the relevance of recommendations by responding to that user's behavior in real time.
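
The size of this effect is easy to check with a little arithmetic. The figures below are made up for illustration (5,000 population purchases and a 60/40 action-to-comedy split are assumptions, not numbers from Kiji); the point is only how much one extra purchase moves a category's share at each scale.

```python
def category_share(purchases, category):
    """Fraction of purchases that fall in the given category."""
    return purchases.count(category) / len(purchases)


# Hypothetical data: 5,000 purchases across all users, 10 for one user.
population = ["action"] * 3000 + ["comedy"] * 2000
individual = ["action"] * 8 + ["comedy"] * 2

# Add a single comedy purchase to each and measure the change in comedy's share.
population_shift = (
    category_share(population + ["comedy"], "comedy")
    - category_share(population, "comedy")
)
individual_shift = (
    category_share(individual + ["comedy"], "comedy")
    - category_share(individual, "comedy")
)
```

In this example the population's comedy share moves by roughly 0.01 percentage points, while the individual's moves by about 7 percentage points, so the same event is noise for the trained model and a strong signal for the user's own recommendations.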

Scoring the model in real time

To achieve real-time scoring, the KijiScoring module provides a lazy computation system that lets applications generate up-to-date recommendations only for active users who interact with them frequently. With lazy computation, Kiji applications do not have to generate recommendations for users who visit rarely or never return. An additional benefit is that Kiji can take context, such as the location of a mobile device, into account when making recommendations.

The main component of KijiScoring is called the Freshener. A Freshener is really a combination of two other Kiji components: a ScoringFunction and a FreshnessPolicy. As mentioned earlier, a model has a training phase and a scoring phase. A ScoringFunction is a piece of code that describes how to combine a trained model with a single entity's data to produce a score or recommendation. A FreshnessPolicy defines when data becomes stale. For example, a simple FreshnessPolicy might state that data is stale once it is more than one hour old. A more complex policy might mark data as stale after the entity has experienced a certain number of events, such as clicks or product views. Finally, the ScoringFunction and FreshnessPolicy are attached to a specific column in a Kiji table and are triggered when necessary to refresh its data.
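
The two kinds of freshness policy mentioned above, one based on age and one based on event count, can be sketched as follows. These classes are hypothetical stand-ins written for this example and do not reproduce KijiScoring's actual interfaces; `CachedScore`, `MaxAgePolicy`, and `MaxEventsPolicy` are invented names.

```python
from dataclasses import dataclass


@dataclass
class CachedScore:
    value: object
    computed_at: float      # seconds since some reference time
    events_at_compute: int  # entity's event count when the score was computed


class MaxAgePolicy:
    """Stale once the score is older than max_age_seconds."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds

    def is_fresh(self, cached, now, event_count):
        return now - cached.computed_at <= self.max_age_seconds


class MaxEventsPolicy:
    """Stale once the entity has logged max_new_events events since scoring."""

    def __init__(self, max_new_events):
        self.max_new_events = max_new_events

    def is_fresh(self, cached, now, event_count):
        return event_count - cached.events_at_compute < self.max_new_events


cached = CachedScore(value=["comedy-3"], computed_at=0.0, events_at_compute=10)
one_hour = MaxAgePolicy(max_age_seconds=3600)
five_clicks = MaxEventsPolicy(max_new_events=5)

fresh_after_30_min = one_hour.is_fresh(cached, now=1800.0, event_count=10)
fresh_after_2_hours = one_hour.is_fresh(cached, now=7200.0, event_count=10)
fresh_after_4_clicks = five_clicks.is_fresh(cached, now=60.0, event_count=14)
fresh_after_5_clicks = five_clicks.is_fresh(cached, now=60.0, event_count=15)
```

Because both policies share the same `is_fresh` signature, whatever code serves the column can swap one for the other without changing how the score itself is computed.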

A real-time scoring application contains a server layer called the KijiScoring server, which is the execution layer responsible for refreshing stale data. When a user interacts with the application, the request is passed to the KijiScoring server layer, which communicates directly with the HBase cluster. The KijiScoring server requests the data and, once it has been fetched, uses the FreshnessPolicy to check whether it is up to date. If the data is fresh, it is returned directly to the client. If the data is stale, the KijiScoring server runs the specified ScoringFunction for the user who made the request. The key point is that it refreshes the data or recommendation only for the user who made the request, instead of running a batch operation that refreshes every user's data. This lets Kiji do only the work that is necessary. Once the data is refreshed, it is returned to the user and written back to HBase for later use.
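
The request flow just described (read, check freshness, rescore if stale, write back) can be summarized in a short sketch. A plain dictionary stands in for HBase, and the `ScoringServer` class, its callables, and the sample data are all hypothetical; this is not the KijiScoring server's real interface.

```python
class ScoringServer:
    """Toy request flow: read, check freshness, rescore if stale, write back."""

    def __init__(self, store, is_fresh, score_function):
        self.store = store                    # dict standing in for HBase
        self.is_fresh = is_fresh              # callable(cached_entry, now) -> bool
        self.score_function = score_function  # callable(entity_id) -> recommendation
        self.rescored = []                    # entity ids that triggered a recompute

    def get_recommendation(self, entity_id, now):
        cached = self.store.get(entity_id)
        if cached is not None and self.is_fresh(cached, now):
            return cached["value"]
        # Stale or missing: recompute for this one entity only, then write back.
        value = self.score_function(entity_id)
        self.store[entity_id] = {"value": value, "computed_at": now}
        self.rescored.append(entity_id)
        return value


store = {
    "alice": {"value": ["action-1"], "computed_at": 0},
    "bob": {"value": ["drama-9"], "computed_at": 0},
}
server = ScoringServer(
    store,
    is_fresh=lambda cached, now: now - cached["computed_at"] <= 3600,
    score_function=lambda entity_id: [f"fresh-pick-for-{entity_id}"],
)

first = server.get_recommendation("alice", now=1800)   # still fresh, served as-is
second = server.get_recommendation("alice", now=7200)  # stale, rescored
third = server.get_recommendation("alice", now=7300)   # served from the write-back
```

Bob's entry is just as old as Alice's, but since Bob never makes a request his score is never recomputed, which is the saving that lazy computation is meant to provide.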

A typical Kiji application includes a number of KijiScoring servers, which are stateless Java processes that can be scaled out and that run a ScoringFunction using a single entity's data as input. The Kiji application routes client requests through a KijiScoring server, which determines whether the data is up to date. If necessary, it runs the ScoringFunction before sending the recommendations back to the client, and it writes the refreshed data back to HBase for later use.
