Percentage Point recommendation engine: from requirements to architecture (reposted from InfoQ)

Last Update:2018-12-05 Source: Internet

Author: User

Tags redis cluster

Document directory

Storage layer
Algorithm Layer
Business Layer
Management Layer

The Percentage Point recommendation engine is a leading recommendation technology platform in China. It focuses on providing SaaS-based personalized recommendation services for e-commerce and information websites, improving their overall conversion rate and user stickiness. This article introduces the architecture design and construction of the Percentage Point recommendation engine, starting from the actual needs of e-commerce websites.

Requirements

At present, the trend toward personalization is unstoppable, and the industry generally recognizes that recommendation is a basic service of a website. However, people do not know much about recommendation technology itself. We often hear the following question: "Isn't 'users who bought this product also bought these products' just one SQL statement?" In fact, recommendation technology is far from that simple. Broadly speaking, it belongs to the fields of data mining and machine learning, which means that good recommendation services depend on scientific recommendation algorithms and a large amount of training data. For e-commerce and information websites, it is very difficult to develop advanced recommendation algorithms in-house and apply them to massive data. It is in this context that the Percentage Point recommendation engine came into being.

During the development of the Percentage Point recommendation engine, we held in-depth discussions with the technical departments of well-known e-commerce websites such as wheat bags, red kids, catwalk networks, and Yao Dian 100, and of well-known media and information websites such as Skynet and yibang power, and gained a lot of help and inspiration from them. While talking with these industry pioneers, we found that some requirements are common across the industry, such as recommendation timeliness and high availability. Other requirements are industry-specific. For example, the repeat purchase rate of baby products is relatively high, while that of bags is not. For a mother raising a child, we can repeatedly recommend the same milk powder and diapers that suit her preferences, but repeatedly recommending the same bag to a fashionable young woman may not be appropriate.

After extensive market research and exchanges, we require the Percentage Point recommendation engine to support customers' marketing strategies in every respect. In summary, this mainly includes:

Scientific and efficient recommendation algorithms, with the best algorithms and strategies chosen according to each website's characteristics;
Analysis of users' potential preferences based on their behavior across the whole network, to help the website carry out precise marketing outside the site;
Analysis of the relevance between items of content based on product and information data across the whole network, to help the website optimize the way it brings in traffic from outside the site.

The Percentage Point recommendation engine deals with product information and user behavior across the whole network. How to use this data scientifically and effectively, so as to provide a wide range of recommendation services for e-commerce and information websites and meet their marketing objectives, has become our biggest technical challenge. We therefore set the following technical requirements for the Percentage Point recommendation engine:

Support for various recommendation algorithms and scientific metrics. Researchers have proposed hundreds of recommendation algorithms, together with standard datasets and metrics for measuring recommendation performance. The Percentage Point recommendation engine must be flexible enough to support these algorithms. In addition, we need to know how each algorithm performs on each dataset so that we can select an appropriate algorithm for a specific need.
Big data processing. With resources and user behavior coming from across the whole network, storing and analyzing this data securely and reliably is critical. Our minimum requirement is to process 0.1 billion data inputs and recommendation requests every day while ensuring absolute data security. Clearly, a distributed, cloud-based design is our only choice.
High availability and real-time performance. For a web service provider, stable, reliable, and low-latency service is a basic requirement. From the perspective of user experience, each recommendation request must be processed within 2 ms.
Scalability. This is a common demand for all computer systems. We require that new recommendation logic and new recommendation services can be added to the Percentage Point recommendation engine easily. In addition, when the whole system needs to be upgraded or expanded, labor and hardware costs should grow linearly and remain controllable.
Easy to manage. Operations and maintenance (O&M) is a major part of running web services. We require that every component (or logical unit) of the recommendation engine be independently detachable and replaceable, and that each component have a complete disaster recovery mechanism. In this way, management of the whole system is broken down into smaller parts, which helps division of labor and collaboration.

Architecture Design

According to the requirements in the previous section, we designed the Percentage Point recommendation engine as an organic combination of a group of cloud services. Overall, the engine is divided into four components: the storage layer, the algorithm layer, the business layer, and the management layer. Each component can be subdivided into smaller units or service modules that provide basic storage or computing services. Units should be decoupled from one another as much as possible and collaborate only through API protocols, so that the impact of upgrading any one unit is controllable. Each unit must itself be reliable and available. Next, we introduce the four major functional components of the Percentage Point recommendation engine.

Storage layer

The storage layer provides basic data access services, along with backup and disaster recovery, to ensure data security and reliability. Based on different application requirements, the storage layer is subdivided into four categories: Redis cluster, Membase cluster, MySQL cluster, and Hadoop/HDFS.

Redis cluster. The Percentage Point recommendation engine uses Redis as a cache for hot data, including the ID, name, link, image, category, and brand of each resource (a product or an information item). The volume of this information is not very large, but it is used very frequently: each recommendation typically needs the information of dozens or even hundreds of products. We chose Redis for its speed, persistence, and master-slave mechanism. Currently, we run Redis as one master with several slaves for read/write splitting: the master handles only writes, and the slaves handle only reads. The slaves have a serialization mechanism, and there must be two slave instances on different machines to eliminate single point of failure (SPOF) risks.
Membase cluster. Membase plays the primary storage role in the Percentage Point recommendation engine and mainly supports its computation. At present, the engine contains more than 10 online and offline computing modules. During computation, these modules use a lot of data and generate a large number of intermediate results, including each user's behavior history on each website and the relationships between resources. This data does not need a schema but is accessed very frequently. We use Membase mainly because it scales out easily and supports a wide range of client APIs.
MySQL cluster. In the early days of the Percentage Point recommendation engine, the main task we assigned to MySQL was to store the raw data of all customers (including user behavior, recommendation requests, and recommendation results) for backup and for later statistics on the recommendation results. However, we soon discovered that the MySQL database had grown so large that we needed to compress, back up, and truncate it every week, and the O&M workload was too heavy. We have since moved data backup and after-the-fact statistics to the Hadoop/HDFS platform, and MySQL now stores only final statistics and other small-scale data such as customer configuration information. Because the load on MySQL is not heavy, we only run it as a dual-host hot backup, to avoid a service failure caused by a single host crashing.
Hadoop/HDFS. As mentioned above, we currently use Hadoop/HDFS to store customers' raw data and perform some statistical processing on it. In addition, we plan to move some offline algorithms and data to the Hadoop platform to make fuller use of Hadoop's potential. The Hadoop namenode carries a single point of failure (SPOF) risk, so we have set up a backup namenode and switch the service to the backup server when the master server encounters a problem.
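
The Redis read/write splitting described above can be pictured as a thin routing layer in front of the master and its slaves. The following is only a minimal sketch under stated assumptions, not the engine's actual code: `FakeRedisNode` is an in-memory stand-in for a real Redis connection, and the `ReadWriteSplitClient` class, the key format, and the explicit copy to the slaves are hypothetical (a real Redis master replicates to its slaves by itself).

```python
import itertools


class FakeRedisNode:
    """In-memory stand-in for one Redis instance (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)

    def set(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


class ReadWriteSplitClient:
    """Routes writes to the master and spreads reads across the slaves."""

    def __init__(self, master, slaves):
        if len(slaves) < 2:
            # Mirrors the rule that two slaves on different machines must exist.
            raise ValueError("need two slaves to avoid a single point of failure")
        self.master = master
        self.slaves = list(slaves)
        self._next = itertools.cycle(range(len(self.slaves)))

    def set(self, key, value):
        self.master.set(key, value)
        # Stand-in for replication, which a real Redis master performs itself.
        for slave in self.slaves:
            if slave.alive:
                slave.set(key, value)

    def get(self, key):
        # Try each slave at most once, starting from the next one in rotation.
        for _ in range(len(self.slaves)):
            slave = self.slaves[next(self._next)]
            try:
                return slave.get(key)
            except ConnectionError:
                continue
        raise ConnectionError("no slave available for reads")


master = FakeRedisNode("master")
slaves = [FakeRedisNode("slave-a"), FakeRedisNode("slave-b")]
client = ReadWriteSplitClient(master, slaves)
client.set("item:1001", {"name": "milk powder", "category": "baby"})
slaves[0].alive = False  # simulate one slave machine failing
item = client.get("item:1001")  # still served by the remaining slave
```

Because reads never touch the master, losing one slave only reduces read capacity; the second slave on another machine keeps serving the hot product data.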

Algorithm Layer

This is the core and most challenging part of the recommendation engine. We design this layer as a set of abstract algorithms. Building on a thorough study of academic work on user-behavior-based recommendation algorithms, content-based recommendation algorithms, and association rules, we have independently developed more than 10 online and offline recommendation algorithms for big data processing. Currently, our online algorithms include user-based and item-based collaborative filtering (CF), content-based recommendation, heat diffusion, and behavior pattern analysis. Offline algorithms include KNN clustering, association rule mining based on FP-tree, association rule mining based on contextual statistics, sequential pattern algorithms, and document modeling algorithms.

The algorithm layer is not concerned with specific business logic; it is only responsible for processing data and returning results. Take the heat diffusion algorithm as an example: it computes the relationships between users and resources and between resources, and we can then ask it which resources a given user is most interested in, or which resources are most relevant to a given resource.
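
One way to picture such a business-agnostic algorithm module is as a small interface with one input call and two query calls. The sketch below is an assumption made for illustration: the `RecommendationAlgorithm` interface and its method names are hypothetical, and `CooccurrenceAlgorithm` is a deliberately simple co-occurrence counter standing in for a real algorithm. It is not heat diffusion and not the engine's implementation.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


def _top(scores, n):
    """Sort (resource, score) pairs by descending score, then by name."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


class RecommendationAlgorithm(ABC):
    """Hypothetical business-agnostic interface for an algorithm module."""

    @abstractmethod
    def feed(self, user, resource, weight):
        """Accept one (user, resource, weight) triple."""

    @abstractmethod
    def resources_for_user(self, user, n):
        """Return up to n (resource, score) pairs the user may like."""

    @abstractmethod
    def related_resources(self, resource, n):
        """Return up to n (resource, score) pairs related to the resource."""


class CooccurrenceAlgorithm(RecommendationAlgorithm):
    """Toy stand-in: relates resources that share users. Not heat diffusion."""

    def __init__(self):
        self.by_user = defaultdict(dict)  # user -> {resource: weight}

    def feed(self, user, resource, weight):
        self.by_user[user][resource] = weight

    def _related_scores(self, resource):
        scores = defaultdict(float)
        for owned in self.by_user.values():
            if resource in owned:
                for other, weight in owned.items():
                    if other != resource:
                        scores[other] += owned[resource] * weight
        return scores

    def related_resources(self, resource, n):
        return _top(self._related_scores(resource), n)

    def resources_for_user(self, user, n):
        owned = self.by_user.get(user, {})
        scores = defaultdict(float)
        for resource, weight in owned.items():
            for other, score in self._related_scores(resource).items():
                if other not in owned:  # skip what the user already has
                    scores[other] += weight * score
        return _top(scores, n)


algo = CooccurrenceAlgorithm()
for user, resource in [("u1", "bag"), ("u1", "wallet"), ("u2", "bag"),
                       ("u2", "wallet"), ("u2", "scarf"), ("u3", "bag")]:
    algo.feed(user, resource, 1.0)

related_to_bag = algo.related_resources("bag", 2)
for_u1 = algo.resources_for_user("u1", 5)
```

Nothing in the interface mentions purchases, page views, or customers, so any algorithm that implements these three calls could be swapped in without the caller changing.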

This design decouples the business logic from the recommendation algorithms themselves, which makes the algorithms as general as possible. It also ensures that a front-end recommendation function module can combine multiple algorithms according to its logic requirements. Taking personalized recommendation based on browsing history as an example, the recommendation engine uses two algorithms: heat diffusion and content-based recommendation.

Because storage and algorithms are separated, the algorithm layer does not need to consider data backup and disaster tolerance. If an algorithm module fails because of a server failure, we can quickly start a backup of the same algorithm module on another server to replace it, without any data migration. Availability is thereby maximized.

Business Layer

This is the part of the recommendation engine that directly faces customers, that is, our HTTP web service. It is mainly responsible for two things. First, it collects the data submitted by customers, converts it into the input required by each recommendation algorithm, and submits it to the algorithms for computation. Second, based on the recommendation request submitted by the customer, it requests data from one or more recommendation algorithms and converts the results into the format the customer requires. The business layer therefore acts as a bridge between concrete requirements and recommendation algorithms, that is, between the real world and the computer world.

Taking "users who bought this product also bought which products" as an example, we describe how this recommendation function module connects customer needs with recommendation algorithms. At present, we mainly use the heat diffusion algorithm to implement this module. First, when the customer submits purchase data, the Percentage Point recommendation engine converts the event, according to certain business logic, into a triple that the algorithm accepts. For example, if user u buys product k, we may send the input (u, k, 1.0) to the algorithm. Second, when the customer requests the products that users who bought k have also bought, we use k as the parameter to ask the algorithm for the resources closest to k. In addition, if the customer also submits a user ID, we ask the algorithm for the products that this user may be interested in. Finally, we combine the two results and select the products that have the highest weights and meet the customer's additional requirements (such as filtering out the user's purchase history, or filtering by product category or price).
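
The three steps above can be sketched as a small piece of business-layer code. Everything named here is an assumption made for illustration: `StubAlgorithm` returns canned scores instead of running a real algorithm, `purchase_to_triple` and `also_bought` are hypothetical names, and keeping the higher weight when both results contain the same product is just one plausible way to combine them.

```python
class StubAlgorithm:
    """Canned stand-in for a recommendation algorithm (hypothetical)."""

    def __init__(self):
        self.triples = []

    def feed(self, user, resource, weight):
        self.triples.append((user, resource, weight))

    def related_resources(self, resource):
        return {"wallet": 0.9, "scarf": 0.4, "belt": 0.3}

    def resources_for_user(self, user):
        return {"scarf": 0.7, "bag": 0.8, "hat": 0.2}


def purchase_to_triple(user_id, product_id):
    """Business rule assumed here: every purchase counts with weight 1.0."""
    return (user_id, product_id, 1.0)


def also_bought(algorithm, product_id, user_id=None, purchased=(), limit=3):
    """Answer 'users who bought this product also bought ...'."""
    # Step 2a: resources closest to the product itself.
    merged = dict(algorithm.related_resources(product_id))
    # Step 2b: if a user ID was submitted, add the user's likely interests.
    if user_id is not None:
        for product, score in algorithm.resources_for_user(user_id).items():
            # Assumption: when both results contain a product, keep the higher weight.
            merged[product] = max(score, merged.get(product, 0.0))
    # Step 3: apply the customer's extra filters, then pick the top weights.
    excluded = set(purchased) | {product_id}
    candidates = [(p, s) for p, s in merged.items() if p not in excluded]
    candidates.sort(key=lambda ps: (-ps[1], ps[0]))
    return [p for p, _ in candidates[:limit]]


algorithm = StubAlgorithm()
algorithm.feed(*purchase_to_triple("u", "bag"))  # step 1: event -> triple
anonymous = also_bought(algorithm, "bag")
personalized = also_bought(algorithm, "bag", user_id="u", purchased=["bag", "belt"])
```

With the canned scores, the anonymous request is answered from the product's neighbors alone, while the personalized request also draws on the user's interests and drops products the user has already purchased.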

It can be seen that the business layer uses the recommendation algorithm as a black box, so that the business layer can focus on meeting the diverse needs of customers. In addition, like the algorithm layer, the business layer does not need to care about data storage backup and disaster tolerance.

Management Layer

In the percentage recommendation engine, the management layer is responsible for internal DNS, configuration management, service deployment, service monitoring, and automatic emergency response.

Internal DNS. Internal DNS is an important part of achieving high availability. Each component of the recommendation engine accesses other services through internal domain names, and the primary and secondary DNS of all servers are set to the internal DNS. In this way, when a key server such as the Hadoop namenode fails, we can change the IP address that the domain name points to and keep the service uninterrupted.
Configuration management. The main function of this module is to automate configuration updates and notifications. We once considered using ZooKeeper for this, but later decided that ZooKeeper was too heavy, so we developed a configuration management service based on our own needs. An internal service of the recommendation engine can register itself under a configuration item. When the configuration changes, the configuration management module notifies the service to fetch the latest configuration.
Service monitoring. This module monitors the health of the servers and whether each process can serve requests normally, and it sends SMS alarms and triggers the automatic emergency response when exceptions occur. Our methods include:
- Use top, ps, free, and other basic tools to check the system load, whether processes are alive, and the usage of CPU, memory, and other resources. Use redis-cli, memstats, and other specific tools to view the running status of Redis and Membase.
- Every self-developed program must provide a test call. This call runs through the main service flow and reports whether exceptions occurred during execution, such as configuration errors or execution timeouts.
- We analyze the logs output by the various services to find exceptions, for example a large number of exceptions or errors in a short period, request processing times that are too long, or a large number of recommendation requests that receive no results.
- Once an exception is detected, the monitoring module sends an SMS to our O&M personnel and notifies the automatic emergency response module to try to fix the exception.
Automatic emergency response. In the automatic emergency response module, we have implemented functions such as modifying the DNS configuration and starting and stopping business-layer service programs and recommendation algorithms. For example, when the MySQL master server goes down, the automatic emergency response module receives a notification from the monitoring module and then tries to change the MySQL server domain name in the primary and secondary DNS to the IP address of the MySQL slave server. Similarly, if the module is notified that a service process in the business layer has been using 100% of the CPU for one consecutive minute, it kills and restarts the process, because the process is probably in an abnormal state.
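
The two emergency responses described above (repointing an internal domain name, and restarting a process stuck at 100% CPU for a minute) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the engine's code: `InternalDNS` is a plain in-memory table instead of a real DNS server, the domain name and IP addresses are made up, and the responder only records the restart it would perform instead of killing a real process.

```python
class InternalDNS:
    """Toy internal DNS table: domain name -> IP address (hypothetical)."""

    def __init__(self, records):
        self.records = dict(records)

    def resolve(self, name):
        return self.records[name]

    def repoint(self, name, ip):
        self.records[name] = ip


def pegged_for(samples, threshold=100.0, window=60):
    """True if the latest samples stay at or above threshold for `window` seconds.

    samples: list of (timestamp_seconds, cpu_percent), oldest first.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    start = None
    for ts, cpu in reversed(samples):
        if cpu < threshold:
            break
        start = ts  # earliest sample of the current high-CPU streak so far
    return start is not None and latest - start >= window


class EmergencyResponder:
    """Reacts to notifications from the monitoring module."""

    def __init__(self, dns, standby_ips):
        self.dns = dns
        self.standby_ips = standby_ips  # domain name -> IP of the standby server
        self.actions = []  # log of actions taken, for the operators

    def on_server_down(self, name):
        standby = self.standby_ips[name]
        self.dns.repoint(name, standby)
        self.actions.append(("repoint", name, standby))

    def on_cpu_samples(self, process, samples):
        if pegged_for(samples):
            # A real implementation would kill and restart the process here.
            self.actions.append(("restart", process))


dns = InternalDNS({"mysql.internal": "10.0.0.1"})
responder = EmergencyResponder(dns, {"mysql.internal": "10.0.0.2"})
responder.on_server_down("mysql.internal")
responder.on_cpu_samples("worker-1", [(0, 100.0), (30, 100.0), (60, 100.0)])
responder.on_cpu_samples("worker-2", [(0, 100.0), (30, 50.0), (60, 100.0)])
```

Because every component looks up `mysql.internal` rather than a fixed IP address, repointing the name is enough to move traffic to the standby server; `worker-2` is left alone because its CPU usage dropped within the one-minute window.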

Summary

This article has described the overall architecture and functional division of the Percentage Point recommendation engine in detail. Throughout the architecture design, we have adhered to the principles of modularization, low coupling, and elimination of single points of failure, striving to build a recommendation technology platform with excellent scalability and reliability. After nearly two years of practical use on many large and medium-sized partner e-commerce websites, this architecture fully satisfies the requirements we set out at the beginning, and in the foreseeable future it can also support the strategic plans for the Percentage Point recommendation engine. In many respects, the architecture reflects the steady yet flexible approach of the recommendation engine team. The team continues to work hard on recommendation technologies and services, with the goal of leading in both.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
