[This article is reproduced. Original article address: http://www.ibm.com/?works/cn/web/1103_zhaoct_recommstudy1/index.html]
Introduction: With the development of web technology, creating and sharing content has become easier. A large number of images, blog posts, and videos are published on the Internet every day. This explosion of information makes it increasingly difficult for people to find the information they need. Traditional search technology is a relatively simple and widely used tool for finding information. However, search engines cannot fully meet users' needs for information discovery: first, users often find it hard to describe their needs with appropriate keywords; second, keyword-based information retrieval is insufficient in many cases. The recommendation engine moves users from simple, goal-specific data search to more advanced, context-aware information discovery that better matches how people behave.
Release date: March 16, 2011
The "Exploring the secrets inside the recommendation engine" series leads readers through the mechanisms and implementation of recommendation engines from the ground up, and also covers some basic optimization methods, such as applying clustering and classification. Building on the theory, the series also shows, with Apache Mahout, how to implement various recommendation policies on large-scale data, how to optimize them, and how to build an efficient recommendation engine. As the first article in the series, this article introduces how a recommendation engine works, the recommendation mechanisms it involves, and their respective advantages, disadvantages, and applicable scenarios, to help readers understand recommendation engines clearly and quickly build one that suits them.
Information discovery
We have entered an era of data explosion. With the development of Web 2.0, the Web has become a platform for sharing data, and it is increasingly difficult for people to find the information they need in the massive volume of data.
In this situation, search engines (Google, Bing, Baidu, and so on) have become the best way to find target information quickly. When users know exactly what they need, they can use keywords in a search engine to find it quickly. However, search engines do not fully meet users' needs for information discovery, because in many cases users are not clear about their own needs, their needs are hard to express with simple keywords, or they want results that better match their personal tastes and preferences. The recommendation system, also called the recommendation engine, emerged as the counterpart to the search engine.
With the advent of the recommendation engine, users move from simple, goal-specific data search to more advanced information discovery that better matches how people behave.
Today, as recommendation technology continues to develop, recommendation engines have achieved great success in e-commerce (such as Amazon and Dangdang) and on social websites built around sharing music, movies, and books (such as Douban and Mtime). This further shows that, in the Web 2.0 environment and faced with massive data, users need an information discovery mechanism that is more intelligent and better understands their needs, tastes, and preferences.
Recommendation Engine
We have discussed the significance of the recommendation engine for today's websites. This section describes how it works. A recommendation engine uses specialized information filtering techniques to recommend different items or content to the users who may be interested in them.
Figure 1. Working principle of the recommendation engine
Figure 1 shows the working principle of the recommendation engine. Here we treat the recommendation engine as a black box whose input is the recommendation data source. In general, the data sources a recommendation engine needs include:
- Metadata of the items or content to be recommended, such as keywords and gene-style feature descriptions;
- Basic information about the system's users, such as gender and age;
- Users' preferences for items or information. Depending on the application, these may include users' ratings of items, records of items they viewed, and purchase records. These user preferences can be divided into two types:
- Explicit user feedback: information the user provides explicitly, outside of naturally browsing or using the website, such as rating an item or commenting on it.
- Implicit user feedback: data generated as users use the website, which implicitly reflects their preferences for items, for example, purchasing an item or viewing an item's information.
Explicit user feedback accurately reflects users' true preferences for items, but it costs the user extra effort. Implicit user behavior, once analyzed and processed, also reflects users' preferences, but the data is less precise and some behavior is very noisy. With the right behavior features, implicit feedback can still give good results, although the right features can differ greatly between applications. On an e-commerce website, for example, purchasing is an implicit feedback signal that indicates user preference well.
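As a rough illustration of turning implicit behavior into a preference signal, the sketch below sums a per-event weight for each (user, item) pair. The event names and weight values are assumptions made up for this example, not figures from any real system; as noted above, the right behavior features differ between applications.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each implicit event suggests a preference.
EVENT_WEIGHTS = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}

def implicit_preferences(events):
    """Aggregate (user, item, event) records into per-user preference scores."""
    prefs = defaultdict(lambda: defaultdict(float))
    for user, item, event in events:
        weight = EVENT_WEIGHTS.get(event)
        if weight is None:
            continue  # unknown event types are treated as noise and skipped
        prefs[user][item] += weight
    return {user: dict(items) for user, items in prefs.items()}

events = [
    ("alice", "book", "view"),
    ("alice", "book", "purchase"),
    ("alice", "lamp", "view"),
    ("bob", "lamp", "scroll"),  # not a recognized event
]
prefs = implicit_preferences(events)
print(prefs)
```

Explicit ratings, where available, could be stored in the same user-to-item mapping, so later stages do not need to know which kind of feedback produced a score.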
Depending on its recommendation mechanism, a recommendation engine may use only part of these data sources. From this data it either derives certain rules or directly predicts users' preferences for other items, so that it can recommend items the user may be interested in when the user enters the system.
Classification of recommendation engines
Recommendation engines can be classified by several criteria, which we introduce one by one:
- Does the recommendation engine recommend different data to different users?
By this criterion, recommendation engines can be divided into engines based on public behavior and personalized engines.
- An engine based on public behavior gives every user the same recommendations. These can be static and set manually by the system administrator, or computed from the feedback of all users in the system as the currently most popular items.
- A personalized recommendation engine gives different users more precise recommendations based on their tastes and preferences. To do so, the system needs to understand the characteristics of both the content to be recommended and the users, or, based on social relationships, find users with the same preferences as the current user and recommend from them.
This is the most basic classification. In practice, most discussion of recommendation engines concerns personalized engines, because strictly speaking only a personalized recommendation engine provides a more intelligent information discovery process.
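For contrast with the personalized mechanisms discussed later, an engine based on public behavior can be as simple as counting feedback across all users. The sketch below is a minimal illustration with made-up data; the function name `most_popular` is hypothetical.

```python
from collections import Counter

def most_popular(likes, top_n=2):
    """Return the items liked by the most users; every user gets the same list."""
    counts = Counter(item for items in likes.values() for item in items)
    # Sort by descending count, breaking ties alphabetically for stable output.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ranked[:top_n]

likes = {
    "A": {"item A", "item C"},
    "B": {"item A", "item B"},
    "C": {"item A", "item C"},
}
popular = most_popular(likes)
print(popular)
```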
- By the data source the recommendation engine uses
This criterion is really about how to discover the relevance of data, because most recommendation engines work from sets of similar items or similar users. Referring to the recommendation system diagram in Figure 1, data relevance can be discovered from the different data sources in the following ways:
- Discover the relevance of users based on the basic information of the system's users. This is called demographic-based recommendation.
- Discover the relevance of items or content based on the metadata of the items or content to be recommended. This is called content-based recommendation.
- Discover the relevance of items or content, or the relevance of users, based on users' preferences for items or information. This is called collaborative filtering-based recommendation.
- By how the recommendation model is built
As you can imagine, in a system with massive numbers of items and users, a recommendation engine requires a considerable amount of computation. To deliver recommendations in real time, a recommendation model must be built. It can be built in the following ways:
- Based on the items and users themselves: this kind of recommendation engine treats each user and each item as an independent entity and predicts each user's preference for each item. This information is usually described with a two-dimensional matrix. Because the number of items a user is interested in is far smaller than the total number of items, most of the entries are empty; that is, the matrix is usually large and sparse. To reduce the amount of computation, items and users can be clustered, and the preferences of a class of users for a class of items recorded and computed instead, although such a model loses some recommendation accuracy.
- Based on association rules (rule-based recommendation): mining association rules is a classic data mining problem. It mainly concerns discovering dependencies in data, and a typical scenario is the shopping basket problem. By mining association rules, we can find which items are often purchased together, or which other items users usually buy after buying certain items. Once these rules are discovered, we can make recommendations to users based on them.
- Model-based recommendation: this is a typical machine learning problem. Existing user preference information is used as training samples to train a model that predicts user preferences; when a user later enters the system, recommendations are computed from this model. The difficulty with this method is how to feed users' real-time or recent preference information back into the trained model to improve recommendation accuracy.
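The shopping basket scenario can be sketched with plain co-occurrence counting. The code below counts how often pairs of items appear in the same basket and keeps rules `a -> b` whose support and confidence reach given thresholds. The baskets and threshold values are invented for illustration, and a real system would more likely use a dedicated algorithm such as Apriori or FP-Growth.

```python
from collections import Counter
from itertools import permutations

def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return sorted rules (a, b, support, confidence) for ordered item pairs."""
    n = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = set(basket)
        item_count.update(items)
        # Ordered pairs, so that a -> b and b -> a are scored separately.
        pair_count.update(permutations(sorted(items), 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n                 # share of baskets containing both
        confidence = count / item_count[a]  # share of a's baskets that contain b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)

baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["bread", "eggs"],
    ["milk"],
]
rules = pair_rules(baskets)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as `eggs -> bread` can then drive a recommendation: when a user's basket contains the antecedent, suggest the consequent.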
In fact, few recommendation systems today use only one recommendation policy. Different policies are generally used in different scenarios to achieve the best effect. Amazon, for example, recommends based on the user's own purchase history, based on the item the user is currently viewing, and based on currently popular items reflecting public preferences, and displays these in different areas of the page so that users can find items they are truly interested in across all of them.
Recommendation mechanisms in depth
This section describes in detail how each recommendation mechanism works, its advantages and disadvantages, and the scenarios it suits.
Demographic-based recommendation
Demographic-based recommendation is the easiest recommendation method to implement. It simply discovers the relevance of users from their basic information and then recommends to the current user the items that similar users like. Figure 2 shows how this mechanism works.
Figure 2. Working principle of the demographic-based recommendation mechanism
As the figure shows, the system first builds a profile for each user containing basic information such as age and gender. It then computes the similarity between users from these profiles. Here the profile of user A is the same as that of user C, so the system treats users A and C as similar users; in a recommendation engine, they can be called "neighbors". Finally, items are recommended to the current user based on the preferences of the "neighbor" group. In the figure, item A, which user A likes, is recommended to user C.
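The flow just described can be sketched in a few lines. This is a minimal illustration, assuming that "similar" means an exactly matching profile; the profile fields and values are invented for the example, and a real system would likely use a graded similarity measure instead.

```python
def demographic_recommend(profiles, likes, user):
    """Recommend items liked by users whose profile matches `user` exactly."""
    target = profiles[user]
    # "Neighbors" are the other users with an identical profile.
    neighbors = [u for u, p in profiles.items() if u != user and p == target]
    seen = likes.get(user, set())
    recommended = set()
    for neighbor in neighbors:
        recommended |= likes.get(neighbor, set()) - seen
    return sorted(recommended)

# Hypothetical profiles: users A and C share a profile, as in the example above.
profiles = {
    "A": {"age_band": "25-30", "gender": "F"},
    "B": {"age_band": "40-45", "gender": "M"},
    "C": {"age_band": "25-30", "gender": "F"},
}
likes = {"A": {"item A"}, "B": {"item B"}, "C": set()}
result = demographic_recommend(profiles, likes, "C")
print(result)
```

Note that user C receives a recommendation despite having no recorded preferences, which is why this mechanism avoids the new-user cold start.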
The benefits of this demographic-based recommendation mechanism are:
- Because it does not use the current user's preferences for items, there is no "cold start" problem for new users.
- It does not depend on data about the items themselves, so it can be used across different item domains; that is, it is domain-independent.
So what are the shortcomings of this method? Classifying users by their basic information is too coarse, and it cannot produce good recommendations in domains where taste matters a great deal, such as books, movies, and music. It may be adequate for some simple recommendations on e-commerce websites. Another limitation is that it may require information that is sensitive and unrelated to the information discovery problem itself, such as the user's age, and such information is not easy to obtain.
Content-based recommendation
Content-based recommendation was the most widely used mechanism when recommendation engines first appeared. Its core idea is to discover the relevance of items or content from their metadata and then, based on the user's past preferences, recommend similar items. Figure 3 shows the basic principle of content-based recommendation.
Figure 3. Basic principles of content-based recommendation
Figure 3 shows a typical example of content-based recommendation, a movie recommendation system. First, the movies' metadata must be modeled; here we describe only each movie's genre. Next, the similarity between movies is discovered from the metadata: because movies A and C both have the genre "love, romantic", they are considered similar. (Of course, genre alone is not enough; for better recommendations we could also consider directors, actors, and so on.) Finally, the recommendation is made: user A likes movie A, so the system can recommend the similar movie C to user A.
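The movie example can be sketched as follows. The choice of Jaccard similarity over genre tags is an assumption made for this illustration (the article does not name a similarity measure), and movie B's tags are invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def content_recommend(item_tags, liked, top_n=1):
    """Rank unseen items by their best tag similarity to any item the user liked."""
    if not liked:
        return []  # cold start: no past preferences to compare against
    scores = {}
    for candidate, tags in item_tags.items():
        if candidate in liked:
            continue
        scores[candidate] = max(jaccard(tags, item_tags[i]) for i in liked)
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return [item for item in ranked[:top_n] if scores[item] > 0]

item_tags = {
    "movie A": {"love", "romantic"},
    "movie B": {"horror", "thriller"},  # assumed tags for illustration
    "movie C": {"love", "romantic"},
}
result = content_recommend(item_tags, liked={"movie A"})
print(result)
```

Adding directors or actors to each tag set would refine the similarity without changing the code.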
The benefit of the content-based recommendation mechanism is that it models users' tastes well and can provide more precise recommendations. However, it also has the following problems:
- Items must be analyzed and modeled, and recommendation quality depends on how complete and comprehensive the item model is. In current applications, keywords and tags are regarded as a simple and effective way to describe item metadata.
- Item similarity is analyzed only from the characteristics of the items themselves; people's attitudes toward the items are not considered.
- Because recommendations are based on the user's past preferences, there is a "cold start" problem for new users.
Although this method has many shortcomings, it has been applied successfully on some movie, music, and book social websites. Some sites also employ professionals to encode the "genes" of items. Pandora, for example, stated in a report that in its recommendation engine each song has more than 100 metadata features, including the song's style, year, and performer.
Collaborative Filtering-based recommendation
As the Web developed, websites began to encourage user participation and user contribution, and the collaborative filtering-based recommendation mechanism was born from this. Its principle is simple: based on users' preferences for items or information, discover the relevance of the items or content themselves, or the relevance of users, and then recommend based on that relevance. Collaborative filtering-based recommendation can be divided into three subcategories: user-based recommendation, item-based recommendation, and model-based recommendation. We introduce each of the three in detail below.
User-based collaborative filtering recommendation
The basic principle of user-based collaborative filtering recommendation is to use all users' preferences for items or information to find the group of "neighbor" users whose tastes and preferences are similar to the current user's. In practice this is usually done with a "K-nearest neighbors" algorithm. Recommendations for the current user are then made from the historical preferences of those K neighbors. Figure 4 shows a schematic.
Figure 4. Basic principles of the user-based collaborative filtering recommendation mechanism
Figure 4 illustrates the basic principle of the user-based collaborative filtering recommendation mechanism. Suppose user A likes items A and C, user B likes item B, and user C likes items A, C, and D. From these historical preferences we can see that users A and C have similar tastes, and user C also likes item D, so we can infer that user A may like item D too and recommend item D to user A.
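The example can be reproduced with a small K-nearest-neighbors sketch. Jaccard similarity over liked-item sets is an assumption chosen for simplicity; with numeric ratings, cosine or Pearson similarity would be more usual. The function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of liked items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def user_based_recommend(likes, user, k=1):
    """Recommend items liked by the k users most similar to `user`."""
    mine = likes[user]
    sims = {u: jaccard(mine, items) for u, items in likes.items() if u != user}
    neighbors = sorted(sims, key=lambda u: (-sims[u], u))[:k]
    scores = {}
    for neighbor in neighbors:
        for item in likes[neighbor] - mine:
            # Weight each candidate by the similarity of the neighbor who likes it.
            scores[item] = scores.get(item, 0.0) + sims[neighbor]
    candidates = [item for item in scores if scores[item] > 0]
    return sorted(candidates, key=lambda item: (-scores[item], item))

likes = {
    "A": {"item A", "item C"},
    "B": {"item B"},
    "C": {"item A", "item C", "item D"},
}
result = user_based_recommend(likes, "A")
print(result)
```

User B shares no items with anyone, so no neighbor has a positive similarity and the function returns an empty list for B, a small instance of the data dependence this mechanism has.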
User-based collaborative filtering and demographic-based recommendation both compute user similarity and both recommend from a "neighbor" user group. They differ in how user similarity is computed: the demographic-based mechanism considers only the characteristics of the users themselves, whereas user-based collaborative filtering computes similarity from users' historical preference data. Its basic assumption is that users who like similar items are likely to have the same or similar tastes and preferences.
Item-based collaborative filtering recommendation
The basic principle of item-based collaborative filtering recommendation is similar, except that it uses all users' preferences for items or information to discover the similarity between items, and then recommends similar items to a user based on that user's historical preferences. Figure 5 illustrates the basic principle.
Suppose user A likes items A and C, user B likes items A, B, and C, and user C likes item A. From these historical preferences we can see that items A and C are similar: people who like item A tend to like item C as well. From this we can infer that user C is likely to like item C too, so the system recommends item C to user C.
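The same data can be run through an item-based sketch. Here two items are considered similar when the sets of users who like them overlap; the use of Jaccard similarity is again an assumption for illustration rather than the measure any particular system uses.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity of two sets of users."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def item_based_recommend(likes, user, top_n=1):
    """Recommend unseen items most similar to the items `user` already likes."""
    # Invert user -> items into item -> set of users who like it.
    fans = defaultdict(set)
    for u, items in likes.items():
        for item in items:
            fans[item].add(u)
    mine = likes[user]
    scores = {}
    for candidate in fans:
        if candidate in mine:
            continue
        # Sum the candidate's similarity to every item the user already likes.
        scores[candidate] = sum(jaccard(fans[candidate], fans[i]) for i in mine)
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]

likes = {
    "A": {"item A", "item C"},
    "B": {"item A", "item B", "item C"},
    "C": {"item A"},
}
result = item_based_recommend(likes, "C")
print(result)
```

Because the item-to-item similarities depend only on the preference data as a whole, they could be precomputed offline and reused for every user.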
In the same way, item-based collaborative filtering and content-based recommendation both make predictions from item similarity, but they compute the similarity differently: the former infers it from users' historical preferences, while the latter uses the attributes of the items themselves.
Figure 5. Basic principles of item-based collaborative filtering recommendation
How, then, should we choose between user-based and item-based collaborative filtering? Item-based collaborative filtering is in fact a strategy that Amazon developed as an improvement on the user-based mechanism. On most websites the number of items is far smaller than the number of users, and both the set of items and the similarities between them are relatively stable; the item-based mechanism also gives somewhat better real-time performance than the user-based one. This is not true in every scenario, however. In some news recommendation systems the number of items, that is, news stories, may exceed the number of users, and news is updated so quickly that item similarity is not stable. The choice of recommendation policy is therefore closely tied to the specific application scenario.
Model-based collaborative filtering recommendation
Model-based collaborative filtering recommendation trains a recommendation model on a sample of user preference information and then computes recommendations by prediction from users' real-time preference information.
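To make the train-then-predict split concrete, the sketch below fits a deliberately simple model: a global mean rating plus a bias per item and per user. This baseline is a stand-in chosen for brevity, not the model any particular system uses; practical systems often use richer models such as matrix factorization. The ratings are invented.

```python
def train_bias_model(ratings):
    """Fit a global mean, item biases, and user biases from (user, item, rating)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    item_dev, user_dev = {}, {}
    for _, item, r in ratings:
        item_dev.setdefault(item, []).append(r - mu)
    item_bias = {i: sum(d) / len(d) for i, d in item_dev.items()}
    for user, item, r in ratings:
        # What remains after removing the global mean and the item's bias.
        user_dev.setdefault(user, []).append(r - mu - item_bias[item])
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    return mu, user_bias, item_bias

def predict(model, user, item):
    """Predict a rating; unknown users or items fall back to a zero bias."""
    mu, user_bias, item_bias = model
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)

ratings = [
    ("A", "x", 5.0), ("A", "y", 3.0),
    ("B", "x", 4.0), ("B", "y", 2.0),
    ("C", "x", 3.0),
]
model = train_bias_model(ratings)   # offline training step
estimate = predict(model, "C", "y")  # fast prediction at request time
print(round(estimate, 2))
```

The trained model is just three small lookup structures, so prediction is cheap; the cost is that new ratings only take effect after retraining, which is the feedback problem mentioned earlier for model-based approaches.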
The recommendation mechanism based on collaborative filtering is the most widely used recommendation mechanism today. It has the following significant advantages:
- It does not require strict modeling of items or users, and does not require item descriptions to be machine-understandable, so this method is also domain-independent.
- The recommendations it computes are open-ended: they draw on the experience of other users, which helps users discover latent interests and preferences.
It also has the following problems:
- The core of the method is based on historical data, so there is a "Cold Start" problem for new items and new users.
- The recommendation results depend on the quantity and accuracy of user historical preference data.
- In most implementations, users' historical preferences are stored in a sparse matrix, and computing on a sparse matrix has some well-known problems; for example, the preferences of a small number of people can strongly affect recommendation accuracy.
- Good recommendations cannot be made for users with unusual tastes.
- Because the method is built on historical data, once a user's preferences have been captured and modeled it is hard to revise them or let them evolve with the user's behavior, which makes the method inflexible.
Hybrid recommendation mechanisms
Recommendations on today's websites rarely rely on a single recommendation mechanism or policy; several methods are usually combined to achieve better results. The following are several popular ways of combining recommendation mechanisms.
- Weighted hybridization: combines the results of several recommenders with a linear formula and a set of weights. The weight values must be tuned by repeated experiments on a test dataset to achieve the best recommendation results.
- Switching hybridization: the most suitable recommendation policy can differ greatly from one situation to another, so switching hybridization selects the most suitable recommendation mechanism for the situation at hand.
- Mixed hybridization: uses several recommendation mechanisms and displays their results in different areas of the page. Amazon, Dangdang, and many other e-commerce websites use this approach, which gives users comprehensive recommendations and makes it easier to find what they want.
- Meta-level hybridization: uses several recommendation mechanisms and feeds the result of one mechanism into another as input, combining the strengths and offsetting the weaknesses of each to obtain more accurate recommendations.
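Of these, weighted hybridization is the easiest to sketch. The example below assumes each recommender already produces item scores on a comparable scale (for instance, normalized to the range 0 to 1); the recommender names, scores, and weights are all invented, and in practice the weights would be tuned on a test dataset as described above.

```python
def weighted_hybrid(score_lists, weights):
    """Linearly combine per-recommender item scores and rank the items."""
    combined = {}
    for name, scores in score_lists.items():
        for item, score in scores.items():
            # An item missing from a recommender simply contributes nothing.
            combined[item] = combined.get(item, 0.0) + weights[name] * score
    return sorted(combined, key=lambda item: (-combined[item], item))

scores = {
    "content": {"item A": 0.9, "item B": 0.2},
    "collaborative": {"item B": 0.8, "item C": 0.5},
}
weights = {"content": 0.4, "collaborative": 0.6}
ranking = weighted_hybrid(scores, weights)
print(ranking)
```

Item B ranks first because both recommenders support it, even though neither ranks it highest on its own.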
Recommendation Engine Applications
Having introduced the basic principles and basic mechanisms of the recommendation engine, we now briefly analyze several representative applications. We choose two fields: Amazon as the representative of e-commerce and Douban as the representative of social networking.
Recommendation applications in e-commerce: Amazon
Amazon, a pioneer of the recommendation engine, has worked the idea of recommendation into every corner of its application. The core of Amazon's recommendation is to use data mining algorithms to compare one user's purchasing preferences with those of other users, and so predict the products the user may be interested in. In terms of the mechanisms introduced above, Amazon uses mixed hybridization and displays different recommendation results in different areas of the page. Figures 6 and 7 show the recommendations a user receives on Amazon.
Figure 6. Amazon recommendation mechanism-Homepage
Figure 7. Amazon recommendation mechanism-item browsing
Amazon draws on all of the behavior users record on the site, processes it according to the characteristics of each kind of data, and pushes recommendations to users in separate areas:
- Today's Recommendations For You: usually based on the user's recent purchase or viewing history, combined with currently popular items, to give a balanced recommendation.
- New For You: uses a content-based mechanism to recommend newly arrived items. New items do not yet have many user preferences attached, so content-based recommendation handles this "cold start" problem well.
- Frequently Bought Together: uses data mining to analyze users' purchasing behavior, finds sets of items that are often bought together or by the same person, and sells them as a bundle. This is a typical item-based collaborative filtering recommendation mechanism.
- Customers Who Bought/Viewed This Item Also Bought/Viewed: also a typical application of item-based collaborative filtering; through this social mechanism users can find items they are interested in more quickly and conveniently.
It is also worth noting that Amazon's design and user experience for recommendations are distinctive:
Amazon takes advantage of its large volume of historical data to quantify the reasons for its recommendations.
- For socially based recommendations, Amazon gives you hard data to convince you, for example, what percentage of users who bought this item also bought another item;
- For item-based recommendations, Amazon also states the reason, for example, because your shopping cart contains ***, or because you bought ***, we recommend the similar ***.
In addition, many of Amazon's recommendations are computed from the user's profile, which records the user's behavior on Amazon: which items the user viewed, which items the user bought, and the items in favorites and wish lists. Amazon also incorporates ratings and other forms of user feedback as part of the profile. Amazon also lets users manage their own profiles, so that they can tell the recommendation engine more directly what their tastes and intentions are.
Recommendation applications on social networking websites: Douban
Douban is a successful social networking website in China. It is built around books, movies, music, and local events and forms a diverse social networking platform, so recommendation features are naturally essential to it. Let's look at how Douban makes recommendations.
Figure 8. Douban recommendation mechanism-Douban Film
When you add movies you have watched or are interested in to your "watched" and "want to watch" lists and rate them, Douban's recommendation engine obtains some of your preference information and shows you 8 movie recommendations.
Figure 9. Douban recommendation mechanism-recommendation based on user taste
Douban delivers its recommendations through "Douban Guess". To let users know where these recommendations come from, Douban also gives a brief introduction to "Douban Guess":
"Your personal recommendations are generated automatically from your favorites and comments, so everyone's recommendation list is different. The more favorites and comments you have, the more accurate and varied Douban's recommendations for you will be.
The recommended content may change every day. As Douban grows, the recommendations for you will become more and more accurate."
From this it is clear that Douban's recommendations must be based on social collaborative filtering: the more users there are and the more feedback they give, the more accurate the recommendations become.
Compared with Amazon's user behavior model, Douban Movie's model is simpler: just "watched" and "want to watch". This lets its recommendations focus more on user taste; after all, buying things and watching movies are quite different activities.
In addition, Douban also provides item-based recommendations. When you view the details of a movie, it recommends movies that people who like this movie also like, as shown in Figure 10. This is an application based on collaborative filtering.
Figure 10. Douban recommendation mechanism-movie-based recommendation
Summary
In an age of exploding network data, letting users find the data they want more quickly, and letting them discover their own latent interests and needs, is vital to both e-commerce and social networking applications. Since its emergence, the recommendation engine has drawn more and more attention to this problem. Yet most people may still wonder why it can always guess what they want. The magic of the recommendation engine is that you do not know what it records and infers behind the scenes.
This overview article should make clear that the recommendation engine simply records and observes your every move in silence, then analyzes the massive data generated by all users to discover patterns, and so comes to understand you, your needs, and your habits, and helps you solve your problems and find what you want.
Looking back, in many cases the recommendation engine understands you better than you do yourself.
After this first article, readers should have a clear first impression of the recommendation engine. The next article in this series will look in depth at recommendation strategies based on collaborative filtering. Among today's recommendation techniques and algorithms, collaborative filtering is the most widely recognized and adopted. Its simple model, low data dependence, convenient data collection, and good recommendation quality make it the "No. 1" recommendation algorithm in the public eye. That article will explore the secrets of collaborative filtering and present an efficient implementation of collaborative filtering algorithms based on Apache Mahout. Apache Mahout is a new open-source project of the ASF. It originated from Lucene, is built on Hadoop, and focuses on efficient implementations of classic machine learning algorithms on massive data.
Thank you for your attention and support for this series.
Author Profile
Zhao Chenting works in the Web 2.0 development team of the IBM China Software Development Center and has extensive experience developing SOA, J2EE, and Web 2.0 applications. His main interests are data processing, data search, recommendation algorithms, and recommendation system design.
Ma Chune works in the IBM CSDL Web 2.0 team and has taken part in the development of Project Zero and Lotus Mashup Center. Main interests are data modeling in the Web 2.0 field, data processing, data visualization, data semantics in the Web 2.0 field, and data association.