Original address: http://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html
The "Discover the secrets of the recommendation engine" series leads readers step by step through the mechanisms behind recommendation engines, including some related basic optimization methods such as applications of clustering and classification. Alongside the theory, the series uses Apache Mahout to show how to implement various recommendation strategies on large-scale data, how to tune those strategies, and how to build an efficient recommendation engine. As the first article in the series, this article gives an in-depth introduction to how recommendation engines work, the recommendation mechanisms involved, and their respective strengths, weaknesses, and application scenarios, to help readers understand recommendation engines clearly and quickly build one that suits their needs.
Information discovery
We have entered an era of data explosion. With the development of Web 2.0, the Web has become a platform for sharing data, and it is becoming harder and harder for people to find the information they need in such a huge volume of data.
In this situation, search engines (Google, Bing, Baidu, and so on) have become the best way to find target information quickly. When users are relatively clear about what they need, a keyword search finds it quickly and conveniently. But search engines do not fully satisfy the demand for information discovery, because in many cases users do not know exactly what they need, or their needs are hard to express in a few simple keywords, or they want results that better match their personal tastes and preferences. This is where recommendation systems come in; by analogy with search engines, they are commonly called recommendation engines.
With the advent of recommendation engines, the way users obtain information has shifted from simple, goal-directed search toward a more advanced form of information discovery that better fits how people actually behave.
Today, as recommendation technology has developed, recommendation engines have been very successful in e-commerce (for example, Amazon and Dangdang) and on social sites built around music, films, and books (for example, Douban and Mtime). This further shows that, faced with the massive data of the Web 2.0 environment, users need an information discovery mechanism that is more intelligent and better understands their needs, tastes, and preferences.
The recommendation engine
The previous section described why recommendation engines matter to today's Web 2.0 sites; this section explains how they work. A recommendation engine uses specialized information filtering techniques to recommend different items or content to the users who are likely to be interested in them.
Figure 1. How a recommendation engine works
Figure 1 shows how a recommendation engine works. Here the engine is treated as a black box whose input is the data source for recommendation. In general, the data sources a recommendation engine needs include:
- Metadata for the items or content to be recommended, such as keywords and gene descriptions;
- Basic information about the system's users, such as gender and age;
- Users' preferences for items or information. Depending on the application, these may include users' ratings of items, their viewing records, their purchase records, and so on. This preference information falls into two categories:
- Explicit user feedback: information the user deliberately provides in addition to naturally browsing or using the site, such as ratings of items or comments on items.
- Implicit user feedback: data generated as the user uses the site that implicitly reflects the user's preferences, such as purchasing an item or viewing an item's details.
Explicit user feedback reflects the user's real preferences accurately, but it costs the user extra effort. Implicit user behavior, after some analysis and processing, also reflects the user's preferences, but the data is less precise and some behaviors carry a lot of noise. Still, with the right behavioral features, implicit feedback can give very good results. Which features to choose may differ greatly between applications; on an e-commerce site, for example, a purchase is an implicit signal that expresses the user's preference well.
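One way to picture the "analysis and processing" of implicit feedback is to give each kind of behavior a weight and sum the weights per user and item. The following is only a minimal sketch: the event names and the weight values are hypothetical and would have to be tuned for a real application.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each implicit event signals a preference.
EVENT_WEIGHTS = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}

def implicit_preferences(events, weights=EVENT_WEIGHTS):
    """Aggregate (user, item, event) records into user -> item -> score."""
    prefs = defaultdict(lambda: defaultdict(float))
    for user, item, event in events:
        # Unknown event types contribute nothing instead of raising an error.
        prefs[user][item] += weights.get(event, 0.0)
    return {user: dict(items) for user, items in prefs.items()}

events = [
    ("alice", "book", "view"),
    ("alice", "book", "purchase"),
    ("alice", "lamp", "view"),
    ("bob", "lamp", "add_to_cart"),
]
prefs = implicit_preferences(events)
print(prefs)
```

The resulting scores can then be treated like ratings by the mechanisms described later, with the caveat from the text that such derived preferences are noisier than explicit ratings.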
Depending on its recommendation mechanism, a recommendation engine may use only part of these data sources. From this data it either derives certain rules or directly predicts the user's preference for other items, so that when a user arrives, the engine can recommend items that user may be interested in.
Classification of recommendation engines
Recommendation engines can be classified along several dimensions. Here we introduce three:
- Does the recommendation engine recommend different data for different users?
By this criterion, recommendation engines divide into engines based on popular behavior and personalized engines.
- A recommendation engine based on popular behavior gives every user the same recommendations. These can be set statically by a system administrator or computed from feedback statistics across all of the system's users.
- A personalized recommendation engine gives each user more precise recommendations according to that user's tastes and preferences. To do so, the system must understand the characteristics of both the recommended content and the users, or it must work from a social network, finding users who share the current user's preferences and recommending on that basis.
This is the most basic classification. In practice, when people discuss recommendation engines they mostly mean personalized ones, because only a personalized recommendation engine provides a genuinely more intelligent information discovery process.
- By the data source the recommendation engine uses
This dimension is really about how to discover correlations in the data, because most recommendation engines work from a set of similar items or similar users. Referring to the schematic in Figure 1, the methods for finding correlations divide by data source as follows:
- Discovering the relevance between users from their basic information is called demographic-based recommendation.
- Discovering the relevance between items or content from their metadata is called content-based recommendation.
- Discovering the relevance between items or content, or between users, from users' preferences for items or information is called collaborative filtering-based recommendation.
- By how the recommendation model is built
In a system with a large number of items and users, the recommendation engine has a considerable amount of computation to do, so real-time recommendation requires building a recommendation model. Models can be built in the following ways:
- Based on the items and users themselves: this kind of engine treats each user and each item as an independent entity and predicts how much each user likes each item, which is usually described with a two-dimensional matrix. Because the items a user is interested in are far fewer than the total number of items, such a model has many missing values; that is, the matrix is usually very large and sparse. To reduce computation, items and users can be clustered so that the system records and computes one class of users' preference for one class of items, at the cost of some recommendation accuracy.
- Association rule-based recommendation (rule-based recommendation): mining association rules is a classic data mining problem whose main goal is to uncover dependencies in data. The typical scenario is the "shopping basket problem". Mining association rules reveals which items are often purchased together, or which items users usually purchase after buying certain others, and once these rules are mined, recommendations can be made from them.
- Model-based recommendation: this is a typical machine learning problem. Existing user preference information serves as training samples for a model that predicts user preferences, so that when a user later enters the system, recommendations can be computed from the model. The difficulty with this approach is how to feed a user's real-time or recent preference information back into the trained model to improve recommendation accuracy.
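To make the "shopping basket problem" concrete, the sketch below mines only the simplest kind of association rule, one item implying another, by counting how often pairs of items appear in the same basket. The function name, the sample baskets, and the two thresholds are assumptions for illustration; real association rule mining (for example, with the Apriori algorithm) also handles larger item sets.

```python
from collections import Counter
from itertools import permutations

def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Mine one-to-one rules (x -> y) from a list of baskets (sets of items)."""
    n = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = set(basket)
        item_count.update(items)
        # Count each ordered pair (x, y) once per basket.
        pair_count.update(permutations(sorted(items), 2))
    rules = {}
    for (x, y), both in pair_count.items():
        support = both / n                 # share of baskets containing x and y
        confidence = both / item_count[x]  # share of x-baskets that also contain y
        if support >= min_support and confidence >= min_confidence:
            rules[(x, y)] = (support, confidence)
    return rules

baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]
rules = pair_rules(baskets)
for (x, y), (support, confidence) in sorted(rules.items()):
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as "bread -> milk" can then drive a recommendation: a user who has bread in the basket is shown milk.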
In practice, very few recommendation systems today use only one recommendation mechanism or strategy. They usually apply different strategies in different scenarios to get the best results. Amazon, for example, presents recommendations based on the user's own purchase history, recommendations based on the item the user is currently viewing, and currently popular items based on overall preferences, each in a different area, so that users can find the items they are really interested in from a comprehensive set of recommendations.
Recommendation mechanisms in depth
This section describes in detail how each recommendation mechanism works, its advantages and disadvantages, and the scenarios it suits.
Demographic-based recommendation
Demographic-based recommendation is the easiest mechanism to implement. It simply finds the relevance between users from their basic information and then recommends to the current user the items that similar users like. Figure 2 shows how this mechanism works.
Figure 2. How demographic-based recommendation works
As the figure shows, the system first builds a profile for each user containing basic information such as age and gender. It then computes the similarity between users from these profiles. Here the profiles of user A and user C are alike, so the system treats users A and C as similar users, which a recommendation engine calls "neighbors". Finally, the system recommends items to the current user based on the preferences of that user's neighbors; in the figure, item A, which user A likes, is recommended to user C.
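The three steps just described (build profiles, compute similarity, recommend from neighbors) can be sketched as follows. The similarity formula, the ages and genders, and the function names are all hypothetical; the article does not specify how profile similarity is computed.

```python
def profile_similarity(p, q, max_age_gap=10):
    """Hypothetical similarity in [0, 1]: same gender and close age score higher."""
    gender = 1.0 if p["gender"] == q["gender"] else 0.0
    age = max(0.0, 1.0 - abs(p["age"] - q["age"]) / max_age_gap)
    return (gender + age) / 2

def recommend_by_demographics(target, profiles, likes, k=1):
    """Recommend items liked by the k most similar users that target has not liked."""
    others = [u for u in profiles if u != target]
    others.sort(key=lambda u: profile_similarity(profiles[target], profiles[u]),
                reverse=True)
    seen = likes.get(target, set())
    recs = []
    for neighbor in others[:k]:
        for item in sorted(likes.get(neighbor, set()) - seen):
            if item not in recs:
                recs.append(item)
    return recs

# Assumed sample data: users A and C have similar profiles, user B does not.
profiles = {
    "A": {"age": 27, "gender": "F"},
    "B": {"age": 45, "gender": "M"},
    "C": {"age": 28, "gender": "F"},
}
likes = {"A": {"item A"}, "B": {"item B"}, "C": set()}
recs = recommend_by_demographics("C", profiles, likes)
print(recs)
```

Note that user C receives a recommendation without having expressed any preference, which is the property the advantages below rely on.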
The advantages of demographic-based recommendation are:
- It does not use the current user's historical preference data, so there is no "cold start" problem for new users.
- It does not depend on data about the items themselves, so it can be used across different item domains; it is domain-independent.
What are its drawbacks? Classifying users by basic information alone is too coarse, and in domains where taste matters a great deal, such as books, movies, and music, it cannot produce good recommendations. On some e-commerce sites it may be able to provide some simple recommendations. Another limitation is that the approach may rely on sensitive information unrelated to the information discovery problem itself, such as a user's age, and such information is not easy to obtain.
Content-based recommendation
Content-based recommendation was the most widely used mechanism when recommendation engines first appeared. Its core idea is to discover the relevance between items or content from their metadata and then recommend similar items to a user based on the user's past preferences. Figure 3 shows the basic principle.
Figure 3. Basic principle of content-based recommendation
Figure 3 shows a typical content-based example, a movie recommendation system. First, the movie metadata must be modeled; here only the movie's genre is described. Next, the metadata is used to find the similarity between movies: because movies A and C both have the genre "love, romance", they are considered similar. (Genre alone is of course not enough; better recommendations would also consider the director, the actors, and so on.) Finally, the recommendation is made: user A likes movie A, so the system can recommend the similar movie C to user A.
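The movie example can be sketched with genre sets and Jaccard similarity (shared genres divided by all genres). Jaccard is one possible similarity measure chosen here for illustration, not one the article prescribes, and the genres given for movie B are assumed.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_similar(liked, metadata, min_similarity=0.5):
    """Recommend movies whose genre set is similar to any movie the user liked."""
    scores = {}
    for candidate, genres in metadata.items():
        if candidate in liked:
            continue
        # Best match between the candidate and any liked movie.
        best = max((jaccard(genres, metadata[m]) for m in liked), default=0.0)
        if best >= min_similarity:
            scores[candidate] = best
    return sorted(scores, key=lambda m: (-scores[m], m))

metadata = {
    "movie A": {"love", "romance"},
    "movie B": {"horror", "thriller"},  # assumed genres for illustration
    "movie C": {"love", "romance"},
}
recs = recommend_similar({"movie A"}, metadata)
print(recs)
```

Adding the director or actors, as the text suggests, would amount to enlarging each movie's feature set or combining several such similarities.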
The advantage of content-based recommendation is that it models the user's tastes well and can provide more precise recommendations. But it also has the following problems:
- Items must be analyzed and modeled, and recommendation quality depends on how complete and comprehensive the item model is. In current applications, keywords and tags are seen as a simple and effective way to describe item metadata.
- Item similarity is analyzed only from the features of the items themselves; people's attitudes toward the items are not taken into account.
- Recommendations depend on the user's past preference history, so there is a "cold start" problem for new users.
Despite these shortcomings, content-based recommendation has been applied successfully on some movie, music, and book social sites. Some sites even employ professionals to encode the items' features. Pandora is one example: according to one report, each song in Pandora's recommendation engine has more than 100 metadata features, including the song's style, year, and performer.
Collaborative filtering-based recommendation
With the development of Web 2.0, Web sites increasingly encourage user participation and user contribution, and recommendation based on collaborative filtering emerged as a result. Its principle is simple: from users' preferences for items or information, discover the relevance between items or content, or between users, and then recommend on the basis of that relevance. Collaborative filtering-based recommendation divides into three subcategories: user-based recommendation, item-based recommendation, and model-based recommendation. Each is described in detail below.
User-based collaborative filtering recommendation
The basic principle of user-based collaborative filtering is to use all users' preferences for items or information to find the group of "neighbor" users whose tastes and preferences resemble the current user's; in practice this is usually computed with a "k-nearest neighbors" algorithm. Recommendations for the current user are then made from the historical preferences of those k neighbors. Figure 4 shows the principle.
Figure 4. Basic principle of user-based collaborative filtering
Suppose user A likes items A and C, user B likes item B, and user C likes items A, C, and D. From these historical preferences we can see that the tastes of user A and user C are similar, and user C also likes item D, so we can infer that user A may like item D too and recommend item D to user A.
User-based collaborative filtering and demographic-based recommendation both compute the similarity between users and both base recommendations on a group of "neighbor" users. They differ in how the similarity is computed: the demographic-based mechanism considers only the users' own characteristics, whereas user-based collaborative filtering computes similarity from the users' historical preference data. Its basic assumption is that users who like similar items are likely to have the same or similar tastes and preferences.
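The example with users A, B, and C can be reproduced in a few lines. This is a minimal sketch that assumes binary "likes" and uses Jaccard similarity between users' liked-item sets; a real system would typically use ratings and a measure such as cosine similarity or Pearson correlation.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def user_based_recommend(target, likes, k=1):
    """Score unseen items by the similarity of the k nearest neighbors who like them."""
    sims = {u: jaccard(likes[target], items)
            for u, items in likes.items() if u != target}
    # Keep only the k most similar users that share at least one liked item.
    ranked = sorted(sims, key=lambda u: (-sims[u], u))
    neighbors = [u for u in ranked if sims[u] > 0][:k]
    scores = {}
    for neighbor in neighbors:
        for item in likes[neighbor] - likes[target]:
            scores[item] = scores.get(item, 0.0) + sims[neighbor]
    return sorted(scores, key=lambda i: (-scores[i], i))

likes = {
    "user A": {"item A", "item C"},
    "user B": {"item B"},
    "user C": {"item A", "item C", "item D"},
}
recs = user_based_recommend("user A", likes)
print(recs)
```

User C is user A's only neighbor with overlapping tastes, so the item that user C likes and user A has not yet seen, item D, is recommended.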
Item-based collaborative filtering recommendation
The principle of item-based collaborative filtering is similar, except that it uses all users' preferences for items or information to discover the similarity between items, and then recommends similar items to a user based on that user's historical preferences. Figure 5 illustrates the principle.
Suppose user A likes items A and C, user B likes items A, B, and C, and user C likes item A. From these historical preferences, items A and C appear similar: the other users who like item A also like item C. From this we can infer that user C is very likely to like item C as well, so the system recommends item C to user C.
As with the previous pair of mechanisms, item-based collaborative filtering and content-based recommendation both predict from item similarity but compute that similarity differently: the former infers it from users' historical preferences, while the latter relies on the attributes of the items themselves.
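The same example can be sketched in code by describing each item through the set of users who like it and comparing those sets. Cosine similarity over binary "likes" is an assumption made for this sketch; the article does not fix a particular measure.

```python
from math import sqrt

def item_similarity(likes):
    """Cosine similarity between items, computed from the users who like them."""
    fans = {}
    for user, items in likes.items():
        for item in items:
            fans.setdefault(item, set()).add(user)
    sim = {}
    for i in fans:
        for j in fans:
            if i != j:
                overlap = len(fans[i] & fans[j])
                sim[(i, j)] = overlap / sqrt(len(fans[i]) * len(fans[j]))
    return sim

def item_based_recommend(target, likes, top_n=1):
    """Recommend the unseen items most similar to the items the target likes."""
    sim = item_similarity(likes)
    scores = {}
    for liked in likes[target]:
        for (i, j), s in sim.items():
            if i == liked and j not in likes[target] and s > 0:
                scores[j] = scores.get(j, 0.0) + s
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

likes = {
    "user A": {"item A", "item C"},
    "user B": {"item A", "item B", "item C"},
    "user C": {"item A"},
}
sim = item_similarity(likes)
recs = item_based_recommend("user C", likes)
print(recs)
```

Because the item-to-item similarities depend only on the accumulated preferences, they can be computed ahead of time and reused across requests, which is relevant to the comparison of the two strategies that follows.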
Figure 5. Basic principle of item-based collaborative filtering
How should one choose between the user-based and item-based strategies? Item-based collaborative filtering is in fact a strategy Amazon introduced as an improvement on the user-based mechanism. On most Web sites the number of items is far smaller than the number of users, and both the set of items and the similarities between them are relatively stable, so the item-based mechanism also performs somewhat better in real time than the user-based one. But this does not hold everywhere. In some news recommendation systems, for example, the number of items (news stories) may exceed the number of users, and news is updated very quickly, so item similarity remains unstable. The choice of recommendation strategy therefore depends heavily on the specific application scenario.
Model-based collaborative filtering recommendation
Model-based collaborative filtering trains a recommendation model on a sample of user preference information and then computes recommendations by making predictions from users' real-time preference information.
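The article does not name a particular model. As a deliberately simple stand-in, the sketch below "trains" a baseline model in which a predicted rating is the global average plus an item bias plus a user bias, each estimated by averaging. The ratings are made-up sample data; production systems generally use richer models, such as matrix factorization.

```python
def train_baseline(ratings):
    """Fit rating ~ mu + b_user + b_item by averaging residuals."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    item_res = {}
    for _, item, r in ratings:
        item_res.setdefault(item, []).append(r - mu)
    b_item = {i: sum(v) / len(v) for i, v in item_res.items()}
    user_res = {}
    for user, item, r in ratings:
        user_res.setdefault(user, []).append(r - mu - b_item[item])
    b_user = {u: sum(v) / len(v) for u, v in user_res.items()}
    return mu, b_user, b_item

def predict(model, user, item):
    """Predict a rating; unknown users or items fall back to a zero bias."""
    mu, b_user, b_item = model
    return mu + b_user.get(user, 0.0) + b_item.get(item, 0.0)

# Made-up (user, item, rating) training samples.
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 3),
    ("u2", "i1", 4), ("u2", "i2", 2),
    ("u3", "i1", 4),
]
model = train_baseline(ratings)
print(round(predict(model, "u3", "i2"), 2))
```

The split between `train_baseline` and `predict` mirrors the two phases in the text: the model is built offline from historical samples, and predictions are computed cheaply when a user arrives.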
Collaborative filtering is the most widely used recommendation mechanism today, and it has two notable advantages:
- It does not require strict modeling of items or users, and it does not require item descriptions to be machine-understandable, so it is also domain-independent.
- The recommendations it computes are open: they draw on other people's experience, which helps users discover potential interests.
It also has the following problems:
- The approach is fundamentally based on historical data, so there is a "cold start" problem for both new items and new users.
- Recommendation quality depends on the quantity and accuracy of users' historical preference data.
- Most implementations store users' historical preferences in a sparse matrix, and computing on sparse matrices has some obvious problems; for example, erroneous preferences from a small number of users can have a large effect on recommendation accuracy.
- It cannot give good recommendations to users with unusual tastes.
- Because it is based on historical data, once user preferences have been captured and modeled they are hard to revise or evolve as the user's behavior changes, which makes the method somewhat inflexible.
Hybrid recommendation mechanisms
Recommendations on today's Web sites rarely rely on a single mechanism or strategy; sites usually mix several methods to get better results. Several popular ways of combining recommendation mechanisms are described below.
- Weighted hybridization: the results of several recommendation mechanisms are combined with a linear formula according to certain weights. The specific weights must be tuned by repeated experiments on a test data set to achieve the best recommendation effect.
- Switching hybridization: as noted earlier, the most suitable recommendation strategy can vary greatly with the situation (data volume, system operating conditions, number of users and items, and so on). Switching hybridization allows the system to select the most appropriate mechanism for the current situation and compute recommendations with it.
- Mixed hybridization (partitioned presentation): several recommendation mechanisms are used, and their results are shown to the user in different areas. Amazon, Dangdang, and many other e-commerce sites take this approach, so users receive very comprehensive recommendations and can more easily find what they want.
- Meta-level hybridization (layered): several recommendation mechanisms are used, with the result of one mechanism serving as the input to another, combining the strengths and weaknesses of the mechanisms to obtain more accurate recommendations.
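Of the four combinations, weighted hybridization is the easiest to express in code: the final score of an item is the weighted sum of the scores each mechanism assigns to it. In this sketch the recommender names, scores, and weights are hypothetical, and the scores are assumed to be normalized to a common range already.

```python
def weighted_hybrid(score_lists, weights):
    """Blend per-recommender scores: final(item) = sum of weight * score."""
    blended = {}
    for name, scores in score_lists.items():
        for item, score in scores.items():
            # An item missing from one recommender simply gets no contribution from it.
            blended[item] = blended.get(item, 0.0) + weights[name] * score
    ranking = sorted(blended, key=lambda i: (-blended[i], i))
    return ranking, blended

# Hypothetical normalized scores from two recommendation mechanisms.
score_lists = {
    "content": {"item A": 0.9, "item B": 0.2},
    "collaborative": {"item B": 0.8, "item C": 0.5},
}
weights = {"content": 0.4, "collaborative": 0.6}
ranking, blended = weighted_hybrid(score_lists, weights)
print(ranking)
```

Tuning then means trying different `weights` against a test data set and keeping the combination that recommends best, as the first bullet describes.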
Applications of the recommendation engine
Having introduced the basic principles and mechanisms of recommendation engines, we now briefly analyze how a few representative recommendation engines are applied. Two domains are chosen: Amazon as the representative of e-commerce and Douban as the representative of social networks.
Recommendation in e-commerce: Amazon
Amazon, a pioneer of the recommendation engine, has worked the idea of recommendation into every corner of its application. The core of Amazon's recommendation is to use data mining algorithms to compare a user's consumption preferences with those of other users and thereby predict what the user may be interested in. In terms of the mechanisms described above, Amazon uses mixed hybridization, displaying different recommendation results in different areas. Figures 6 and 7 show the recommendations a user can receive on Amazon.
Figure 6. Amazon's recommendation mechanism: home page
Figure 7. Amazon's recommendation mechanism: browsing an item
Amazon uses every user behavior the site can record, processes it according to the characteristics of each kind of data, and pushes recommendations to users in separate areas:
- Today's recommendations for you: usually based on the user's recent purchase or viewing history, blended with currently popular items to give a balanced recommendation.
- New product recommendations (New for You): uses content-based recommendation to recommend newly arrived items to the user. Because new items have little user preference information, content-based recommendation handles this "cold start" problem well.
- Bundled sales (Frequently Bought Together): uses data mining to analyze users' purchasing behavior and find sets of items that are often bought together or by the same person, then sells them as a bundle. This is a typical item-based collaborative filtering recommendation.
- Items other people bought or viewed (Customers Who Bought/Viewed This Item Also Bought/Viewed): also a typical application of item-based collaborative filtering, which uses a social mechanism to help users find items of interest more quickly and easily.
Amazon's design and user experience around recommendations are also distinctive:
Amazon takes advantage of its large volume of historical data to quantify the reasons for its recommendations.
- For socially based recommendations, Amazon provides factual data to convince users, such as the percentage of users who bought this item and also bought that item;
- For recommendations based on the items themselves, Amazon also lists the reason for the recommendation, for example because your shopping cart contains * * *, or because you bought * * *, a similar * * * is recommended to you.
In addition, many of Amazon's recommendations are computed from the user's profile, which records the user's behavior on Amazon, including which items the user has viewed, which items the user has bought, and the items in the user's favorites and wish list. Amazon also incorporates ratings and other user feedback, which are part of the profile as well. Amazon lets users manage their own profiles, so that users can tell the recommendation engine more clearly about their tastes and intentions.
Recommendation on a social networking site: Douban
Douban is one of the more successful social networking sites in China. Centered on books, movies, music, and city events, it has formed a diversified social network platform, for which recommendation is naturally essential. Let us look at how Douban makes recommendations.
Figure 8. Douban's recommendation mechanism: Douban Movies
When you add movies you have seen or are interested in to your "seen" and "want to see" lists on Douban Movies and rate them, Douban's recommendation engine obtains some of your preference information and shows you movie recommendations like those in Figure 8.
Figure 9. Douban's recommendation mechanism: based on the user's taste
Douban delivers its recommendations through "Douban Guess". To let users know where these recommendations come from, Douban also provides a brief introduction to "Douban Guess":
"Your personal recommendations are generated automatically from your collections and ratings, and everyone's recommendation list is different. The more you collect and rate, the more accurate and richer Douban's recommendations to you will be.
The recommended content may change daily. As Douban grows, the content recommended to you will become more and more accurate."
From this we can see clearly that Douban's recommendations are necessarily based on social collaborative filtering: the more users there are and the more feedback they give, the more accurate the recommendations become.
Compared with Amazon's user behavior model, the Douban Movies model is simpler, consisting only of "seen" and "want to see". This lets its recommendations focus more on the user's taste; after all, the motives for buying things and for watching movies are quite different.
Douban also offers recommendations based on the item itself. When you view a movie's details, it shows you "People who like this movie also like", as in Figure 10. This is an application of collaborative filtering.
Figure 10. Douban's recommendation mechanism: recommendations based on the movie itself
Summary
In an era of exploding network data, helping users find the data they want faster and discover their latent interests and needs is critical for both e-commerce and social network applications. The emergence of recommendation engines has drawn more and more attention to this problem. Yet for most people it may still seem amazing that a recommendation engine always guesses what they want. The magic lies in the fact that you do not know what the engine is recording and inferring behind each recommendation.
As this overview has shown, a recommendation engine simply records and observes your every move in silence, analyzes the massive data generated by all users to discover patterns, gradually comes to understand you, your needs, and your habits, and then quietly helps you solve your problem and find what you are looking for.
Looking back, there are many times when the recommendation engine knows you better than you know yourself.
This first article should have given you a clear first impression of recommendation engines. The next article in the series will explore recommendation strategies based on collaborative filtering in depth. Among today's recommendation techniques and algorithms, collaborative filtering is the most widely recognized and adopted. Its simple model, low data dependence, convenient data collection, and good recommendation results have made it the "best" recommendation algorithm in the public eye. The next article will take you deep into the secrets of collaborative filtering and present an efficient implementation of a collaborative filtering algorithm based on Apache Mahout. Apache Mahout is a relatively new open source project from the ASF that originated in Lucene and is built on top of Hadoop, focusing on efficient implementations of classic machine learning algorithms on massive data.
Thank you for your interest in and support of this series.
Statement
The content I publish is personal and does not represent IBM's position, strategy, and perspective.
Discover the secrets of the recommendation engine, Part 1: Recommendation engines