The "Discover the Secrets of the Recommendation Engine" series leads readers step by step, from the simple to the complex, through how recommendation engines work, including some of the basic optimization methods involved, such as applications of clustering and classification. Alongside the theory, it uses Apache Mahout to show how to implement various recommendation strategies on large-scale data, tune those strategies, and build an efficient recommendation engine. As the first article in the series, this article gives an in-depth introduction to how a recommendation engine works, the recommendation mechanisms involved, and their respective strengths, weaknesses, and application scenarios, to help readers understand recommendation engines clearly and quickly build one that suits their needs.
Information discovery
We have now entered an era of data explosion. With the development of Web 2.0, the Web has become a platform for sharing data, and it is becoming harder and harder for people to find the information they need in such a huge volume of data.
In this situation, search engines (Google, Bing, Baidu, and so on) have become the best way to find target information quickly. When users know fairly clearly what they need, a keyword search finds it quickly and conveniently. But search engines cannot fully satisfy the need for information discovery, because in many cases users do not actually know exactly what they need, or their need is hard to express in a few keywords, or they want results that better match their personal tastes and preferences. This is where the recommendation system comes in; by analogy with the search engine, it is also commonly called a recommendation engine.
With the advent of the recommendation engine, the way users obtain information has shifted from simple, goal-directed searching to a more advanced form of information discovery that better fits people's habits.
Today, as recommendation technology has developed, recommendation engines have been very successful in e-commerce (for example Amazon and Dangdang) and on some social sites (including music, film, and book sharing sites such as Douban and Mtime). This further shows that, faced with the massive data of the Web 2.0 environment, users need this kind of information discovery mechanism, one that is more intelligent and better understands their needs, tastes, and preferences.
Recommendation engine
The previous section described the importance of the recommendation engine to today's Web 2.0 sites; in this section we discuss how a recommendation engine actually works. A recommendation engine uses specialized information filtering techniques to recommend different items or content to the users who may be interested in them.
Figure 1. How a recommendation engine works
Figure 1 shows how a recommendation engine works. Here the recommendation engine is treated as a black box whose input is the recommendation data source. In general, the data sources a recommendation engine needs include:
Metadata of the items or content to be recommended, such as keywords and gene descriptions.
Basic information about the system's users, such as gender and age.
Users' preferences for items or information. Depending on the application, these may include a user's ratings of items, records of the items a user has viewed, a user's purchase records, and so on.
These user preferences can in fact be divided into two categories:
Explicit user feedback: feedback that the user provides explicitly, over and above naturally browsing or using the site, such as rating an item or commenting on it.
Implicit user feedback: data generated as the user uses the site that implicitly reflects the user's preferences for items, such as purchasing an item or viewing an item's information.
Explicit user feedback accurately reflects users' real preferences for items, but it requires extra effort from the user. Implicit user behavior, after some analysis and processing, can also reflect user preferences, but the data is less precise and the analysis of some behaviors carries a lot of noise. Still, as long as the right behavioral features are chosen, implicit feedback can also produce very good results. Which features to choose can differ widely between applications; on an e-commerce site, for example, a purchase is implicit feedback that expresses user preference very well.
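As a rough illustration of how implicit feedback might be turned into preference data, the sketch below sums weighted behavior events per user and item. The behavior names and the weights in `BEHAVIOR_WEIGHTS` are assumptions made up for this example, not values from any real system; choosing them is exactly the application-specific feature selection described above.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each kind of implicit behavior is
# assumed to signal a preference. A purchase counts for more than a view.
BEHAVIOR_WEIGHTS = {"view": 1.0, "purchase": 5.0}


def implicit_preferences(events, weights=BEHAVIOR_WEIGHTS):
    """Aggregate (user, item, behavior) events into user -> item -> score."""
    prefs = defaultdict(lambda: defaultdict(float))
    for user, item, behavior in events:
        if behavior not in weights:
            continue  # behaviors without a weight are treated as noise
        prefs[user][item] += weights[behavior]
    return {user: dict(items) for user, items in prefs.items()}


events = [
    ("alice", "book", "view"),
    ("alice", "book", "view"),
    ("alice", "camera", "purchase"),
    ("bob", "book", "scroll"),  # unknown behavior, skipped
]
prefs = implicit_preferences(events)
```

Here two views of the book give Alice a score of 2.0 for it, while a single purchase of the camera gives 5.0, reflecting the assumption that purchases are the stronger signal.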
Depending on its recommendation mechanism, a recommendation engine may use only part of these data sources. From this data it either derives certain rules or directly predicts the user's preference for other items. In this way, when a user enters the system, the recommendation engine can recommend items that the user may be interested in.
Classification of recommendation engines
Recommendation engines can be classified by several criteria, which we introduce one by one below.
Whether the recommendation engine recommends different data to different users
By this criterion, recommendation engines can be divided into engines based on popular behavior and personalized recommendation engines.
Recommendation engine based on popular behavior: gives every user the same recommendations. These can be set statically by the system administrator, or computed from the feedback statistics of all users as the currently most popular items.
Personalized recommendation engine: gives different users more accurate recommendations according to their tastes and preferences. To do so, the system needs to understand the characteristics of both the recommended content and the users, or, based on a social network, find users who share the current user's preferences and recommend accordingly.
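The popularity-based case is simple enough to sketch in a few lines: count feedback events per item across all users and return the most frequent items. The function name and the sample data are hypothetical.

```python
from collections import Counter


def most_popular(feedback, n=2):
    """Return the n items with the most feedback events across all users."""
    counts = Counter(item for _user, item in feedback)
    return [item for item, _count in counts.most_common(n)]


feedback = [
    ("u1", "item A"), ("u2", "item A"), ("u3", "item A"),
    ("u1", "item B"), ("u2", "item B"),
    ("u3", "item C"),
]
top_items = most_popular(feedback)  # the same list is shown to every user
```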
This is the most basic classification of recommendation engines. In fact, when people discuss recommendation engines they usually mean personalized ones, because fundamentally only a personalized recommendation engine is a more intelligent information discovery process.
By the data source of the recommendation engine
This criterion is really about how to discover the relevance of the data, because most recommendation engines work by recommending from a set of similar items or similar users. Referring to the schematic in Figure 1, the methods for discovering data relevance can be divided by data source as follows:
Demographic-based recommendation: discovers how related users are from the basic information about the system's users.
Content-based recommendation: discovers how related items or content are from the metadata of the recommended items or content.
Collaborative filtering-based recommendation: discovers the relevance of the items or content themselves, or the relevance of users, from users' preferences for items or information.
By how the recommendation model is built
As you can imagine, in a system with a very large number of items and users, the amount of computation a recommendation engine must do is considerable. To make recommendations in real time, a recommendation model has to be built. Approaches to building the model fall into the following categories:
Based on the items and users themselves: this kind of recommendation engine treats each user and each item as an independent entity and predicts each user's preference for each item, usually described with a two-dimensional matrix. Because the items a user is interested in are far fewer than the total number of items, such a model leaves a great deal of data missing; that is, the two-dimensional matrix is usually very large and sparse. To reduce the amount of computation, we can also cluster the items and the users, and then record and compute the preference of a class of users for a class of items, but such a model loses some recommendation accuracy.
Association rule-based recommendation (rule-based recommendation): association rule mining is a classic data mining problem that looks for dependencies in the data; the typical scenario is the "shopping basket problem". By mining association rules we can find which items are often purchased together, or which other items users usually buy after purchasing certain items. Once these rules have been mined, we can recommend to users based on them.
Model-based recommendation: this is a typical machine learning problem. Existing user preference information is used as training samples to train a model that predicts user preferences; when a user later enters the system, recommendations can be computed from the model. The difficulty with this approach is how to feed users' real-time or recent preference information back into the trained model so as to improve recommendation accuracy.
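The shopping basket scenario mentioned above can be sketched with plain counting. The example below is a minimal illustration restricted to rules between pairs of items, using the standard support and confidence measures; the baskets, the `min_support` threshold, and the function name are assumptions for illustration, and a production system would use a proper frequent-itemset algorithm instead.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return {(a, b): confidence of "bought a -> also bought b"}.

    Only pairs that appear together in at least min_support of all
    baskets are kept.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        if count / n >= min_support:
            rules[(a, b)] = count / item_counts[a]  # confidence of a -> b
            rules[(b, a)] = count / item_counts[b]  # confidence of b -> a
    return rules


baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "butter"],
]
rules = pair_rules(baskets)
```

In this sample every basket that contains eggs also contains milk, so the rule eggs -> milk has confidence 1.0, while bread and butter appear together too rarely to pass the support threshold.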
In fact, few of today's recommendation systems use only one recommendation strategy. They generally apply different strategies in different scenarios to achieve the best results. Amazon, for example, makes recommendations based on the user's own purchase history, based on the item the user is currently viewing, and based on currently popular items that reflect general preferences, each shown to the user in a different area, so that users can find items they are really interested in from a full range of recommendations.
Recommendation mechanisms in depth
This section details how each recommendation mechanism works, its strengths and weaknesses, and its application scenarios.
Recommendations based on demographic statistics
Demographic-based recommendation is one of the easiest recommendation methods to implement. It simply discovers how related users are from the basic information about the system's users, and then recommends to the current user the other items that similar users like. Figure 2 shows how this kind of recommendation works.
Figure 2. How the demographic-based recommendation mechanism works
As the figure shows, the system first has a profile model for each user, containing basic information such as age and gender. The system then computes the similarity of users from their profiles; here the profiles of user A and user C are the same, so the system considers users A and C to be similar users, which in a recommendation engine can be called "neighbors". Finally, based on the preferences of the "neighbor" group, the system recommends some items to the current user; in the figure, item A, which user A likes, is recommended to user C.
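The three steps above (profile model, profile similarity, neighbor-based recommendation) can be sketched as follows. The profile attributes, their values, and the rule "similarity is the fraction of matching attributes" are assumptions chosen for illustration; the figure itself does not specify them.

```python
def profile_similarity(p, q):
    """Fraction of shared profile attributes on which two users agree."""
    keys = p.keys() & q.keys()
    if not keys:
        return 0.0
    return sum(p[k] == q[k] for k in keys) / len(keys)


def recommend_by_demographics(target, profiles, likes, threshold=1.0):
    """Recommend items liked by users whose profile is similar enough."""
    recommended = set()
    for other, profile in profiles.items():
        if other == target:
            continue
        if profile_similarity(profiles[target], profile) >= threshold:
            recommended |= likes.get(other, set())  # other is a "neighbor"
    return sorted(recommended - likes.get(target, set()))


# Hypothetical profiles: users A and C match, user B does not.
profiles = {
    "user A": {"age_group": "25-30", "gender": "F"},
    "user B": {"age_group": "40-45", "gender": "M"},
    "user C": {"age_group": "25-30", "gender": "F"},
}
likes = {"user A": {"item A"}, "user B": {"item B"}, "user C": set()}
suggestions = recommend_by_demographics("user C", profiles, likes)
```

Note that user C receives a recommendation without having any preference history, which is why this mechanism avoids the new-user "cold start" problem.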
This demographic-based mechanism has two benefits. Because it does not use the current user's historical preference data for items, there is no "cold start" problem for new users. And because it does not depend on data about the items themselves, it can be used across different item domains; it is domain-independent.
So what are its drawbacks? Classifying users by basic information alone is too coarse, especially in domains where taste matters a great deal, such as books, movies, and music, and it cannot produce very good recommendations there. On some e-commerce sites it may be able to give a few simple recommendations. Another limitation is that the method may involve sensitive information unrelated to the information discovery problem itself, such as the user's age, and such user information is not easy to obtain.
Content-based recommendations
Content-based recommendation was the most widely used mechanism when recommendation engines first appeared. Its core idea is to discover how related items or content are from their metadata, and then recommend similar items to the user based on the user's past preferences. Figure 3 shows the basic principle of content-based recommendation.
Figure 3. Basic principle of the content-based recommendation mechanism
Figure 3 shows a typical example of content-based recommendation, a movie recommendation system. First we need to model the movies' metadata; here only the movie's genre is described. Then we use the metadata to find the similarity between movies: because movies A and C both have the genre "love, romance", they are considered similar (of course, genre alone is not enough; for better recommendations we can also consider the director, the actors, and so on). Finally comes the recommendation: user A likes movie A, so the system can recommend the similar movie C to user A.
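A minimal sketch of this movie example follows, using Jaccard similarity between genre sets as the item similarity. The choice of Jaccard, the threshold, and the genres given to movie B are assumptions for illustration; the article only states that movies A and C share the genre "love, romance".

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_similar(liked, genres, min_similarity=0.5):
    """Recommend movies whose genre set is similar to any liked movie."""
    scores = {}
    for candidate, tags in genres.items():
        if candidate in liked:
            continue
        best = max((jaccard(genres[m], tags) for m in liked), default=0.0)
        if best >= min_similarity:
            scores[candidate] = best
    # Highest similarity first; ties broken by title for a stable order.
    return sorted(scores, key=lambda m: (-scores[m], m))


genres = {
    "movie A": {"love", "romance"},
    "movie B": {"horror", "thriller"},  # hypothetical genres
    "movie C": {"love", "romance"},
}
for_user_a = recommend_similar({"movie A"}, genres)
```

Adding director or actor tags to each set would refine the similarity in the way the text suggests, without changing the code.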
The benefit of content-based recommendation is that it can model the user's tastes well and provide more accurate recommendations. But it also has several problems:
Items must be analyzed and modeled, and recommendation quality depends on how complete and comprehensive the item model is. In current applications, keywords and tags are regarded as a simple and effective way to describe item metadata.
The analysis of item similarity relies only on the features of the items themselves and does not consider people's attitudes toward the items.
Because recommendations are made from the user's past preference history, there is a "cold start" problem for new users.
Although this method has many shortcomings, it has still been applied successfully on some movie, music, and book social sites. Some sites even employ professionals to encode the items. Pandora, for example, said in a report that in its recommendation engine each song has more than 100 metadata features, including the song's style, year, performer, and so on.
Recommendations based on collaborative filtering
With the development of Web 2.0, Web sites increasingly encourage user participation and user contribution, and the recommendation mechanism based on collaborative filtering was born from this. Its principle is simple: from users' preferences for items or information, discover the relevance of the items or content themselves, or the relevance of users, and then recommend based on that relevance. Collaborative filtering-based recommendation can be divided into three subcategories: user-based recommendation, item-based recommendation, and model-based recommendation. Each of the three is described in detail below.
User-based collaborative filtering recommendations
The basic principle of user-based collaborative filtering is to find, from all users' preferences for items or information, the group of "neighbor" users whose tastes and preferences are similar to the current user's; in typical applications this is computed with a "k-nearest neighbors" algorithm. Recommendations are then made to the current user based on the historical preferences of these k neighbors. Figure 4 shows the principle.
Figure 4. Basic principle of the user-based collaborative filtering recommendation mechanism
The figure above illustrates the basic principle of user-based collaborative filtering. Suppose user A likes item A and item C, user B likes item B, and user C likes item A, item C, and item D. From these users' historical preferences we can see that the tastes and preferences of user A and user C are similar, and user C also likes item D, so we can infer that user A may like item D too and recommend item D to user A.
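The example with users A, B, and C can be reproduced in a short sketch. Jaccard similarity over the sets of liked items stands in for the user similarity measure, and `k=1` picks a single nearest neighbor; both are assumptions made for illustration rather than choices stated in the article.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def user_based_recommend(target, likes, k=1):
    """Recommend items liked by the k users most similar to target."""
    others = [u for u in likes if u != target]
    # Most similar users first; ties broken by name for a stable order.
    neighbors = sorted(
        others, key=lambda u: (-jaccard(likes[target], likes[u]), u)
    )[:k]
    scores = {}
    for neighbor in neighbors:
        similarity = jaccard(likes[target], likes[neighbor])
        if similarity == 0:
            continue  # a user with nothing in common is not a real neighbor
        for item in likes[neighbor] - likes[target]:
            scores[item] = scores.get(item, 0.0) + similarity
    return sorted(scores, key=lambda i: (-scores[i], i))


likes = {
    "user A": {"item A", "item C"},
    "user B": {"item B"},
    "user C": {"item A", "item C", "item D"},
}
for_user_a = user_based_recommend("user A", likes)
```

User C shares two of three items with user A and is therefore the nearest neighbor, so item D, the only item user C likes that user A has not seen, is recommended.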
Both user-based collaborative filtering and demographic-based recommendation compute the similarity of users and make recommendations from a group of "neighbor" users. They differ in how user similarity is computed: the demographic-based mechanism considers only the users' own characteristics, whereas user-based collaborative filtering computes similarity from the users' historical preference data. Its basic assumption is that users who like similar items are likely to have the same or similar tastes and preferences.
Item-based collaborative filtering recommendations
The principle of item-based collaborative filtering is similar, except that it uses all users' preferences for items or information to discover the similarity between items, and then recommends similar items to a user based on that user's historical preferences. Figure 5 illustrates the principle.
Suppose user A likes item A and item C, user B likes item A, item B, and item C, and user C likes item A. From these users' historical preferences we can see that item A and item C are fairly similar: people who like item A tend to like item C as well. From this we can infer that user C is very likely to like item C too, so the system recommends item C to user C.
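The same scenario can be worked through in code. In this sketch two items are considered similar when largely the same users like them, measured with Jaccard similarity over each item's set of fans; that measure and the `top_n` cutoff are assumptions for illustration.

```python
from collections import defaultdict


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def item_based_recommend(target, likes, top_n=1):
    """Recommend the items most similar to those the target already likes."""
    # Invert user -> items into item -> set of users who like it.
    fans = defaultdict(set)
    for user, items in likes.items():
        for item in items:
            fans[item].add(user)
    fans = dict(fans)
    scores = defaultdict(float)
    for liked in likes[target]:
        for candidate in fans:
            if candidate not in likes[target]:
                scores[candidate] += jaccard(fans[liked], fans[candidate])
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top_n]


likes = {
    "user A": {"item A", "item C"},
    "user B": {"item A", "item B", "item C"},
    "user C": {"item A"},
}
for_user_c = item_based_recommend("user C", likes)
```

Item C shares two of item A's three fans, while item B shares only one, so item C ranks first for user C.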
Similarly, item-based collaborative filtering and content-based recommendation both make predictions from item similarity, but they compute similarity differently: the former infers it from users' historical preferences, while the latter derives it from the attribute features of the items themselves.
Figure 5. Basic principle of the item-based collaborative filtering recommendation mechanism
Both are collaborative filtering, so how should we choose between the user-based and item-based strategies? Item-based collaborative filtering is in fact a strategy that Amazon developed as an improvement on the user-based mechanism. On most Web sites the number of items is far smaller than the number of users, and the number of items and the similarities between them are relatively stable; the item-based mechanism also performs somewhat better in real time than the user-based one. But this is not true in every scenario. In a news recommendation system, for example, the number of items (news stories) may exceed the number of users, and news is updated very quickly, so item similarity remains unstable. The choice of recommendation strategy therefore depends heavily on the specific application scenario.
Model-based collaborative filtering recommendations
Model-based collaborative filtering trains a recommendation model from sample user preference information, and then predicts and computes recommendations from users' real-time preference information.
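To make the train-then-predict split concrete, here is a deliberately simple sketch. The model is a bias model (global mean plus a per-user and a per-item offset), chosen only because it is easy to follow; it is an assumption standing in for whatever model a real system would train, and the class name and ratings are hypothetical.

```python
from collections import defaultdict


class BiasModel:
    """Predicts a rating as global mean + user offset + item offset."""

    def fit(self, ratings):
        """Train on (user, item, rating) samples."""
        self.mu = sum(r for _, _, r in ratings) / len(ratings)
        by_user = defaultdict(list)
        for user, _item, r in ratings:
            by_user[user].append(r - self.mu)
        self.user_bias = {u: sum(v) / len(v) for u, v in by_user.items()}
        by_item = defaultdict(list)
        for user, item, r in ratings:
            by_item[item].append(r - self.mu - self.user_bias[user])
        self.item_bias = {i: sum(v) / len(v) for i, v in by_item.items()}
        return self

    def predict(self, user, item):
        """Unknown users or items fall back to an offset of zero."""
        return (
            self.mu
            + self.user_bias.get(user, 0.0)
            + self.item_bias.get(item, 0.0)
        )


ratings = [
    ("u1", "i1", 5), ("u1", "i2", 3),
    ("u2", "i1", 4), ("u2", "i3", 2),
]
model = BiasModel().fit(ratings)
predicted = model.predict("u1", "i3")  # a pair that was never rated
```

Training happens once, offline, in `fit`; `predict` is then cheap enough to call at request time, which is the property that makes model-based approaches attractive for real-time recommendation.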
Collaborative filtering is the most widely used recommendation mechanism today, and it has the following notable advantages:
It does not require rigorous modeling of items or users, and it does not require item descriptions to be machine-understandable, so the method is also domain-independent.
The recommendations it computes are open: they can draw on other people's experience, which supports users well in discovering latent interests.
But it also has the following problems:
The core of the method is historical data, so there is a "cold start" problem for both new items and new users.
The quality of recommendations depends on the amount and accuracy of users' historical preference data.
In most implementations, users' historical preferences are stored in sparse matrices, and computing on sparse matrices has some obvious problems, including that the erroneous preferences of a few people may greatly affect recommendation accuracy.
It cannot give good recommendations to users with unusual tastes.
Because it is based on historical data, once user preferences have been captured and modeled they are hard to modify or evolve as the user's usage changes, which makes the method inflexible.
A hybrid recommendation mechanism
Recommendations on today's Web sites are often not based on a single recommendation mechanism or strategy; several methods are usually mixed to achieve better results. Some of the more popular ways of combining recommendation mechanisms are:
Weighted hybridization: combines the results of several recommendation mechanisms with a linear formula according to certain weights. The specific weights have to be tuned by repeated experiments on a test data set to achieve the best recommendation effect.
Switching hybridization: as noted earlier, in different situations (data volume, system operating conditions, numbers of users and items, and so on) the most suitable recommendation strategy may differ greatly. Switching hybridization selects the most appropriate recommendation mechanism for the situation at hand.
Mixed hybridization (partitioned): uses several recommendation mechanisms and shows their different results to the user in different areas. Amazon, Dangdang, and many other e-commerce sites use this approach, so users get very comprehensive recommendations and find what they want more easily.
Meta-level hybridization (layered): uses several recommendation mechanisms and feeds the results of one mechanism into another as input, combining the strengths and weaknesses of each mechanism to obtain more accurate recommendations.
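Of the four approaches, weighted hybridization is the easiest to show in code. The sketch below combines per-item scores from two recommenders with a linear formula; the scores and the weights 0.4 and 0.6 are made-up values, whereas in practice, as noted above, the weights would be tuned on a test data set.

```python
def weighted_hybrid(score_sets, weights):
    """Combine several {item: score} dicts as a weighted sum per item.

    Returns (ranking, combined), where ranking lists items from the
    highest to the lowest combined score.
    """
    combined = {}
    for scores, weight in zip(score_sets, weights):
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + weight * score
    ranking = sorted(combined, key=lambda i: (-combined[i], i))
    return ranking, combined


# Hypothetical scores from a content-based and a collaborative recommender.
content_scores = {"item A": 0.9, "item B": 0.2}
collaborative_scores = {"item B": 0.8, "item C": 0.5}

ranking, combined = weighted_hybrid(
    [content_scores, collaborative_scores], weights=[0.4, 0.6]
)
```

Item B ranks first because both recommenders contribute to its score, even though neither rates it highest on its own.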
Application of the recommendation engine
Having introduced the basic principles of the recommendation engine and the basic recommendation mechanisms, we now briefly analyze several representative applications of recommendation engines. Two domains are chosen here: Amazon as a representative of e-commerce, and Douban as a representative of social networks.
Recommendation in e-commerce – Amazon
Amazon, a pioneer of the recommendation engine, has worked recommendation into every corner of its application. The core of Amazon's recommendation is to use data mining algorithms to compare a user's consumption preferences with those of other users, and so predict what the user may be interested in. In terms of the mechanisms described above, Amazon uses partitioned (mixed) hybridization and shows different recommendation results to the user in different areas. Figures 6 and 7 show the recommendations a user can get on Amazon.
Figure 6. Amazon's recommendation mechanism: home page
Figure 7. Amazon's recommendation mechanism: browsing an item
Amazon uses all of the user behavior that can be recorded on the site, processes it according to the characteristics of the different data, and pushes recommendations to the user in different areas:
Today's Recommendations for You: usually based on the user's recent purchase or viewing history, combined with currently popular items, to give a balanced recommendation.
New for You: uses a content-based recommendation mechanism to recommend new items to the user. Because new items do not yet have much user preference information, content-based recommendation is a good answer to this "cold start" problem.
Frequently Bought Together: uses data mining to analyze users' purchase behavior, finds sets of items that are often bought together or by the same person, and sells them as a bundle. This is a typical item-based collaborative filtering recommendation mechanism.
Customers Who Bought/Viewed This Item Also Bought/Viewed: also a typical application of item-based collaborative filtering; through this social mechanism users can find items they are interested in more quickly and easily.
It is worth mentioning that Amazon's design and user experience around recommendations are also distinctive:
Amazon uses its wealth of historical data to quantify the reasons for its recommendations.
For recommendations based on social mechanisms, Amazon gives factual data that users find convincing, such as what percentage of the users who bought this item also bought the recommended one.
For recommendations based on the item itself, Amazon lists the reason for the recommendation, for example: because your shopping cart contains ***, or because you bought ***, a similar *** is recommended to you.
In addition, many of Amazon's recommendations are computed from the user's profile. The profile records the user's behavior on Amazon, including which items were viewed, which were bought, and which are in the favorites and wish list, as well as ratings and other feedback. Amazon also lets users manage their own profile, so that they can tell the recommendation engine more clearly about their tastes and intentions.
Recommendation in social networking sites – Douban
Douban is a relatively successful social networking site in China. Centered on books, movies, music, and local events, it has formed a diverse social network platform in which recommendation is naturally essential. Let's look at how Douban makes recommendations.
Figure 8. Douban's recommendation mechanism: Douban Movies
When you add movies you have seen or are interested in to your "watched" and "want to watch" lists on Douban Movies and rate them, Douban's recommendation engine obtains some of your preference information and then shows you movie recommendations like those in Figure 8.
Figure 9. Douban's recommendation mechanism: based on the user's taste
Douban's recommendations are presented through "Douban Guess". To let users understand where these recommendations come from, Douban also gives a brief introduction to "Douban Guess":
"Your personal recommendations are generated automatically from your collections and ratings, and everyone's recommendation list is different. The more you collect and rate, the more accurate and richer Douban's recommendations for you will be.
Each day's recommendations may change. As Douban grows, the recommendations for you will become more and more accurate."
From this we can clearly see that Douban must be using social collaborative filtering, so the more users there are and the more feedback they give, the more accurate its recommendations become.