Exploring the Internals of the Recommendation Engine, Part 2

Source: Internet
Author: User

From: http://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy2/index.html

For innovative companies, perhaps the most important revolutionary idea of 2005 was the "Long Tail" theory proposed in 2004 by Chris Anderson, editor-in-chief of Wired magazine. The theory holds that the Internet frees commerce from depending on a handful of hit products: in the era of niche markets, even the least popular things will find someone who likes them.

This trend on the Internet is also a general trend in society as a whole: people face more and more choices. In the past the whole country watched the same TV series; now China produces fifteen thousand episodes of television a year, many of which never even get aired. In 1994 a total of about 500,000 different kinds of goods were sold in the United States; now there are more than 2.4 million kinds of goods on the Amazon website alone. The Long Tail and Web 2.0 are both expressions of this trend toward ever more choice. Anderson proposed three rules for the long tail: first, make everything available; second, make it cheap enough to buy; third, help me find it. The first two have arguably been achieved, and achieved well. The key now is the third: how to help users make choices. This is the role of the recommendation engine.

According to the market analysis firm Forrester, one third of users who see product recommendations on e-commerce websites go on to buy items based on those recommendations. No advertisement can achieve that. The recommendation engine is therefore not only a core technology of Web 2.0 but arguably the ultimate form of advertising. Imagine a person facing hundreds of thousands of products on a shopping website: how likely is he to buy any one of them? The most likely reason he leaves empty-handed is that the item he would certainly have bought was never surfaced to him.

Collective wisdom and collaborative filtering

Collective intelligence is not unique to the web era, but only in the web era has it become possible for everyone to use collective intelligence to build more interesting applications or deliver a better user experience. Collective intelligence means gathering answers from the behavior and data of a large group in order to draw statistical conclusions about the whole group. These conclusions cannot be obtained from any single individual; they usually represent a trend, or something the population has in common.

Collaborative filtering is a typical application of collective intelligence. To understand what collaborative filtering (CF) is, first consider a simple question: if you want to watch a movie but don't know which one, what do you do? Most people ask their friends what movies they have liked recently, and we generally prefer recommendations from friends whose tastes resemble our own. This is the core idea of collaborative filtering.

Core of collaborative filtering

Implementing collaborative filtering requires several steps:

  • 1. Collect user preferences
  • 2. Find similar users or items
  • 3. Compute recommendations

Collect user preferences

We need to discover patterns in users' behavior and preferences and make recommendations based on them, so how user preferences are collected becomes the most fundamental factor determining the quality of the system's recommendations. Users can provide their preference information to the system in many ways, and these ways can differ greatly between applications. Some examples:

| User behavior | Type | Features | Function |
| --- | --- | --- | --- |
| Rating | Explicit | Integer-quantized preference; possible values are [0, N], where N is generally 5 or 10 | Ratings allow the system to obtain a user's preference for an item precisely |
| Vote | Explicit | Boolean-quantized preference; value 0 or 1 | Votes allow the system to obtain a user's preference precisely |
| Forward | Explicit | Boolean-quantized preference; value 0 or 1 | Forwarding indicates the user's preference precisely; within the same site, the recipient's preference can also be inferred, though imprecisely |
| Save bookmark | Explicit | Boolean-quantized preference; value 0 or 1 | Bookmarking an item indicates the user's preference precisely |
| Tag (TAG) | Explicit | A few words; analysis is needed to extract preference | Analyzing a user's tags reveals how the user understands the item and whether the sentiment is like or dislike |
| Comment | Explicit | A piece of text; text analysis is needed to extract preference | Analyzing a user's comments reveals the user's sentiment: like or dislike |
| Click stream (view) | Implicit | A set of clicked items; the items must be analyzed to infer preference | Clicks reflect a user's attention to a certain extent, and therefore also preference to a certain extent |
| Page dwell time | Implicit | A set of noisy time data; denoising and analysis are needed to infer preference | Dwell time reflects attention and preference to some extent, but the noise level is high and it is hard to use |
| Purchase | Implicit | Boolean-quantized preference; value 0 or 1 | A purchase clearly indicates the user's interest in the item |

The user behaviors listed above are quite common ones. Recommendation engine designers can add special user behaviors according to the characteristics of their own application and use them to express users' preferences for items.

In a typical application we collect more than one kind of user behavior. There are basically two ways to combine these different behaviors:

  • 1. Group different behaviors. They are commonly divided into "view" and "purchase", and different user/item similarities are then computed per behavior, similar to Dangdang's or Amazon's "customers who bought this book also bought ..." and "customers who viewed this book also viewed ...".
  • 2. Weight different behaviors according to how strongly they reflect preference, yielding a single overall preference per user-item pair. In general, explicit feedback carries more weight than implicit feedback, but it is sparse; after all, only a minority of users provide explicit feedback. The "purchase" behavior usually reflects preference most strongly, though this varies by application.

After collecting user behavior data, we also need to pre-process the data. The core task is noise reduction and normalization.

  • Noise reduction: user behavior data is generated while the user is using the application, so it contains a lot of noise and accidental operations. Classic data mining algorithms can filter the noise out of the behavior data, making the analysis more accurate.
  • Normalization: as mentioned earlier, different behaviors may need to be weighted when computing a user's preference for an item, but the raw values of different behaviors can differ by orders of magnitude; for example, view counts are necessarily much larger than purchase counts. To make the weighted sum meaningful, each behavior's data must be mapped into the same value range. The simplest normalization divides each class of data by the maximum value in that class, guaranteeing that the normalized values fall in [0, 1].

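The normalization and weighting steps above can be sketched as follows. The behavior counts and the 0.3/0.7 weights here are illustrative assumptions, not values from the article:

```python
def normalize(values):
    """Divide every value in a behavior class by the class maximum,
    mapping the data into [0, 1] as described above."""
    m = max(values)
    if m == 0:
        return [0.0 for _ in values]
    return [v / m for v in values]

# Hypothetical per-item view counts and purchase counts.
views = [120, 40, 80, 0]
purchases = [3, 1, 2, 0]

norm_views = normalize(views)
norm_purchases = normalize(purchases)

# Weighted sum of the two behaviors into one overall preference score
# (purchases weighted higher, since they reflect preference more strongly).
w_view, w_purchase = 0.3, 0.7
prefs = [w_view * v + w_purchase * p
         for v, p in zip(norm_views, norm_purchases)]
```

Because both behaviors are normalized into [0, 1] before weighting, the large raw view counts no longer drown out the small purchase counts.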
After preprocessing, grouping or weighting can be applied according to the behavior analysis needs of the application. The result is a two-dimensional user-item preference matrix: one dimension is the user list, the other is the item list, and each value is the user's preference for the item, typically a floating-point value in [0, 1] or [-1, 1].

Find similar users or items

After analyzing user behavior to obtain user preferences, we can compute similar users and items based on those preferences, and then make recommendations based on the similar users or items. These are the two most typical branches of CF: user-based CF and item-based CF. Next, let's look at several basic similarity calculation methods.

The existing basic similarity calculation methods are all vector-based: in effect they compute the distance between two vectors, and the closer the distance, the higher the similarity. In the recommendation scenario, within the two-dimensional user-item preference matrix, we can treat one user's preferences over all items as a vector to compute the similarity between users, or treat all users' preferences for one item as a vector to compute the similarity between items. Several common similarity measures are described below:
· Euclidean distance

Euclidean distance was originally used to compute the distance between two points in Euclidean space. If x and y are two points in n-dimensional space, the Euclidean distance between them is

    d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² )

When n is 2, this is simply the distance between two points in the plane. To use Euclidean distance as a similarity, the following conversion is generally applied, so that the smaller the distance, the larger the similarity:

    sim(x, y) = 1 / (1 + d(x, y))
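The conversion from distance to similarity can be sketched as follows (the function name is ours; the formula is the standard one given above):

```python
import math

def euclidean_similarity(x, y):
    """sim(x, y) = 1 / (1 + d(x, y)), where d is the Euclidean distance."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)

# Identical vectors get the maximum similarity of 1.0;
# similarity shrinks as distance grows.
print(euclidean_similarity([1, 2], [1, 2]))  # 1.0
print(euclidean_similarity([0, 0], [3, 4]))  # 1 / (1 + 5) ≈ 0.1667
```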

· Pearson correlation coefficient

The Pearson correlation coefficient is generally used to measure the closeness of the relationship between two interval-scaled variables. Its value lies in [-1, +1]:

    p(x, y) = ( Σ xi·yi − n·x̄·ȳ ) / ( (n − 1)·Sx·Sy )

where Sx and Sy are the sample standard deviations of x and y.
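The coefficient can be computed directly from this formula; a small sketch (function name and test vectors are ours):

```python
import math

def pearson(x, y):
    """p(x, y) = (sum(xi*yi) - n*mean_x*mean_y) / ((n-1)*Sx*Sy),
    with Sx, Sy the sample standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum(a * b for a, b in zip(x, y)) - n * mean_x * mean_y
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if sx == 0 or sy == 0:
        return 0.0  # a constant vector has no defined correlation
    return num / ((n - 1) * sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0 (perfectly correlated)
print(pearson([1, 2, 3], [6, 4, 2]))  # -1.0 (perfectly anti-correlated)
```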

· Cosine similarity

Cosine similarity is widely used to compute the similarity of document data:

    sim(x, y) = cos θ = (x · y) / (‖x‖ · ‖y‖)
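A minimal sketch of the cosine formula (function name and vectors are ours):

```python
import math

def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0
    return dot / (nx * ny)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))  # ≈ 1.0 (same direction)
```

Note that cosine similarity ignores vector magnitude: [1, 2] and [2, 4] point in the same direction and so are maximally similar.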

· Tanimoto coefficient

The Tanimoto coefficient, also known as the Jaccard coefficient, is an extension of cosine similarity and is likewise used to compute the similarity of document data:

    T(x, y) = (x · y) / (‖x‖² + ‖y‖² − (x · y))
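A sketch of the Tanimoto formula (the vectors are ours, chosen to show the binary case):

```python
def tanimoto(x, y):
    """T(x, y) = (x . y) / (||x||^2 + ||y||^2 - (x . y))."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

# On 0/1 vectors this reduces to the Jaccard coefficient
# |intersection| / |union|: here 1 shared item out of 3 total.
print(tanimoto([1, 1, 0], [1, 0, 1]))  # 1/3 ≈ 0.3333
```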

Calculation of similar neighbors

After introducing similarity calculation, let's look at how to find a user's or item's neighbors based on similarity. The common principles for choosing neighbors fall into two categories:

A fixed number of neighbors: K-neighborhoods (fixed-size neighborhoods)
Regardless of how "far away" the neighbors are, only the nearest K are taken as neighbors. Suppose we want the 5-neighborhood of point 1: based on the distances between points, we take the 5 nearest points, namely points 2, 3, 4, 7 and 5. Clearly, this method handles isolated points poorly: because a fixed number of neighbors is required, when there are few genuinely similar points nearby it is forced to take some not-very-similar points as neighbors, which degrades the similarity of the neighborhood. Point 5, for example, is not actually very similar to point 1.

Neighbors based on a similarity threshold: threshold-based neighborhoods
Unlike the fixed-count principle, threshold-based neighbor calculation limits the maximum distance of a neighbor: within the region of radius K centered on the current point, every point is a neighbor of the current point. The number of neighbors found this way is not fixed, but the similarity error is bounded. Starting from point 1 and taking all points whose similarity falls within K as neighbors, we obtain points 2, 3, 4, and 7. The neighbors computed this way are of better quality than with the previous method, especially for isolated points.
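The two neighbor-selection principles can be sketched side by side. The similarity values for the numbered points are hypothetical, chosen to mirror the example above:

```python
def k_nearest(similarities, k):
    """Fixed-size neighborhood: take the K most similar points,
    no matter how weak the weakest similarity actually is."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return ranked[:k]

def threshold_neighbors(similarities, threshold):
    """Threshold-based neighborhood: keep every point whose similarity
    clears the threshold; the neighborhood size is not fixed."""
    return [p for p, s in similarities.items() if s >= threshold]

# Hypothetical similarities of the other points to point 1.
sims = {"p2": 0.9, "p3": 0.8, "p4": 0.75, "p7": 0.7, "p5": 0.2}

print(k_nearest(sims, 5))              # forced to include the weak p5
print(threshold_neighbors(sims, 0.5))  # p5 is excluded
```

This illustrates the trade-off in the text: the fixed-K method must include the barely similar point 5, while the threshold method drops it at the cost of an unpredictable neighborhood size.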

Computing recommendation

After the calculation we have the neighboring users and items. The following describes how to recommend based on this information. As the previous article in this series briefly introduced, collaborative filtering recommendation algorithms divide into user-based CF and item-based CF. Below we go deeper into the calculation method, use cases, and pros and cons of each.

User-based CF (user CF)

The basic idea of user-based CF is quite simple: based on users' preferences for items, find neighboring users, and then recommend items those neighbors like to the current user. Computationally, a user's preferences over all items form a vector, and similarities between users are computed over these vectors. After finding the K nearest neighbors, the items the current user has not expressed a preference for are scored by the neighbors' similarity-weighted preferences, and the resulting ranked item list is the recommendation. In the example, for user A, only one neighbor, user C, is found from A's historical preferences, and item D, which user C likes, is recommended to user A.
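A minimal, runnable sketch of the user-based CF procedure just described. The `user_cf_recommend` helper, the rating values, and the user/item names are illustrative assumptions, not from the article:

```python
import math

def user_cf_recommend(prefs, target, k=2):
    """Rank items unseen by `target` using its K most similar users.
    `prefs` maps user -> {item: rating}."""

    def cosine(u, v):
        common = set(u) & set(v)
        if not common:
            return 0.0
        dot = sum(u[i] * v[i] for i in common)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv)

    # Similarity of every other user to the target user.
    sims = {u: cosine(prefs[target], prefs[u]) for u in prefs if u != target}
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]

    # Score unseen items by similarity-weighted neighbor preferences.
    seen, scores = set(prefs[target]), {}
    for n in neighbors:
        for item, rating in prefs[n].items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sims[n] * rating
    return sorted(scores, key=scores.get, reverse=True)

ratings = {
    "A": {"i1": 5, "i2": 3},
    "B": {"i1": 1, "i3": 4},
    "C": {"i1": 5, "i2": 4, "i4": 5},  # C is A's closest neighbor
}
print(user_cf_recommend(ratings, "A", k=1))  # ['i4']
```

With k=1 the only neighbor is C, so C's unseen favorite i4 is recommended, mirroring the user A / user C / item D example in the text.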

Item-based CF (item CF)

The principle of item-based CF is similar to user-based CF, except that neighbors are computed between items rather than users: based on all users' preferences, we find items similar to those the user liked, and then recommend those similar items according to the user's historical preferences. Computationally, all users' preferences for one item form a vector, and similarities between items are computed over these vectors. After finding the items similar to a given item, the items the current user has not expressed a preference for are scored from the user's historical preferences, and the resulting ranked item list is the recommendation. In the example, for item A, the users who like item A all like item C, so item A is judged similar to item C; since user C likes item A, we can infer that user C may also like item C.
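The item-based counterpart can be sketched the same way. Again the `item_cf_recommend` helper, the rating values, and the names are illustrative assumptions:

```python
import math

def item_cf_recommend(prefs, target):
    """Rank items unseen by `target` using item-item similarity.
    `prefs` maps user -> {item: rating}."""

    # Invert to item -> {user: rating}: each item becomes a user vector.
    items = {}
    for user, ratings in prefs.items():
        for item, r in ratings.items():
            items.setdefault(item, {})[user] = r

    def cosine(u, v):
        common = set(u) & set(v)
        if not common:
            return 0.0
        dot = sum(u[x] * v[x] for x in common)
        nu = math.sqrt(sum(a * a for a in u.values()))
        nv = math.sqrt(sum(a * a for a in v.values()))
        return dot / (nu * nv)

    # Score each unseen item by the user's history, weighted by similarity.
    seen = prefs[target]
    scores = {}
    for cand in items:
        if cand in seen:
            continue
        scores[cand] = sum(cosine(items[cand], items[s]) * r
                           for s, r in seen.items())
    return sorted(scores, key=scores.get, reverse=True)

ratings = {
    "u1": {"A": 5, "C": 5},
    "u2": {"A": 4, "C": 4, "B": 1},
    "u3": {"A": 5},  # target: likes A, and A strongly co-occurs with C
}
print(item_cf_recommend(ratings, "u3"))  # ['C', 'B']
```

Because users who like item A consistently also like item C, the two item vectors are similar and C ranks first for u3, matching the item A / item C reasoning in the text.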

We have introduced the basic principles of user CF and item CF. The following compares their pros, cons, and applicable scenarios from several different perspectives:
Computing complexity
Item CF and user CF are the two basic collaborative filtering recommendation algorithms. User CF was proposed long ago, while item CF became popular after Amazon's papers and patents were published (around 2001). The common view is that item CF outperforms user CF in performance and complexity. One main reason is that for an online site the number of users often far exceeds the number of items, and the item data is relatively stable, so the item similarity computation is both cheaper and in less need of frequent updating. But this overlooks the fact that the assumption holds only for e-commerce sites selling goods. For news, blog, or micro-content recommendation systems the situation is usually the opposite: the items are massive in number and frequently updated. From the complexity perspective, then, each algorithm has its advantages in different systems, and the recommendation engine designer needs to choose the more suitable algorithm based on the characteristics of the application.
Applicable scenarios
On non-social-network websites, the internal relationship between content items is an important recommendation principle, often more effective than recommendations based on similar users. For example, when you are reading a book on a book-shopping site, the recommendation engine will recommend related books to you, and this recommendation matters far more than the site's generic homepage recommendations. Clearly, in this case item CF has become an important means of guiding the user's browsing. Item CF recommendations are also easy to explain. On a non-social-network site, telling a user that a book is recommended because someone with similar interests read it is unconvincing, since the user may not know that person at all; but explaining that the book is similar to one the user read before feels reasonable, and the user is more likely to accept the recommendation. Conversely, on today's popular social network sites, user CF is the better choice: adding social network information to user CF can increase users' confidence in the explanation of the recommendations.
Recommendation diversity and Accuracy
Scholars studying recommendation engines have computed recommendations with user CF and item CF on the same dataset and found that the two recommendation lists were only 50% the same, with the other 50% completely different. Yet the two algorithms achieved similar precision, so they can be said to be highly complementary.
There are two measurement methods for recommendation diversity:
The first measures from the perspective of a single user: given a user, is the recommendation list the system provides diverse? That is, compare the similarity between items within the recommendation list. It is not hard to see that item CF's diversity is clearly worse than user CF's, because item CF recommends the items most similar to what the user has already seen.
The second considers the diversity of the system as a whole, also known as coverage: can the recommendation system provide a wide range of choices across all its users? By this measure, item CF's diversity is far better than user CF's, because user CF always tends to recommend popular items. Put another way, item CF recommendations have good novelty: the method is good at recommending items in the long tail. So although item CF's precision is slightly lower than user CF's in most cases, once diversity is taken into account, item CF is much better than user CF.
If you are still confused about recommendation diversity, another example shows the difference between user CF and item CF. Assume every user's interests are broad, spanning several fields, but each user also has one main field he cares about more than the others. Given a user who likes fields A, B, and C, with A as his main field, consider what each algorithm tends to recommend: user CF will recommend the popular items in fields A, B, and C, while item CF will essentially recommend only items from field A. So user CF recommends only what is hot and lacks the ability to recommend long-tail items, whereas item CF recommends only field A, but its limited recommendation list may contain a fair number of non-popular, long-tail items. For this user, item CF's recommendations are clearly not diverse; but for the system as a whole, because different users have different main interests, the coverage will be better.
From the above analysis, both recommendation strategies have their rationale, but neither is the best choice, and their accuracy suffers accordingly. In fact, the best choice for such a system, if it recommends 30 items to this user, is neither to pick the 10 most popular items from each field nor to recommend 30 items all from field A, but rather to recommend, say, 15 items from field A and select the remaining 15 from B and C. Combining user CF and item CF is therefore the best choice. The basic principle of the combination is: when item CF causes an individual's recommendations to lack diversity, add user CF to increase personal diversity and thereby improve accuracy; when user CF causes the overall diversity of the system to be insufficient, add item CF to increase overall diversity and likewise improve recommendation accuracy.
User's adaptability to recommendation algorithms
Most of the discussion above judges which algorithm is better from the recommendation engine's perspective, but we should think more from the standpoint of the recommendation engine's end users: how well does a given user fit each recommendation algorithm?
For user CF, the recommendation principle assumes a user will like the things liked by users who share his preferences. If a user has few users with preferences in common, the user CF algorithm performs poorly; so a user's fit with the user CF algorithm is proportional to how many users share his preferences.
The item CF algorithm also has a basic assumption: a user will like things similar to what he liked before. We can therefore compute the self-similarity of a user's liked items. High self-similarity means the items he likes resemble one another, i.e. the user matches item CF's basic assumption, and his fit with item CF is naturally better. If the self-similarity is low, the user's preference habits do not match item CF's assumption, and the chance of making good recommendations for him with item CF is very low.
Summary
A core idea of Web 2.0 is "collective intelligence." The basic idea of collaborative filtering is to provide personalized recommendations to each user based on the behavior of the masses, helping users find the information they need quickly and accurately. From the application perspective, today's more successful recommendation engines, such as those of Amazon, Douban, and Dangdang, all adopt collaborative filtering. The method requires no strict modeling of items or users and does not require item descriptions to be machine-understandable; it is a domain-independent recommendation method. Its results are open, sharing the experience of others, which is very helpful for users in discovering potential interests and preferences.

For more information, see the original article: Efficient collaborative filtering recommendation based on Apache Mahout.
