Exploring the Secrets of the Recommendation Engine, Part 2: In-Depth Recommendation Algorithms: Collaborative Filtering


Part 2: In-Depth Recommendation Algorithms: Collaborative Filtering

The first article in this series gave an overview of recommendation engines; the remaining articles introduce the engines' underlying algorithms in depth and help readers implement them efficiently. Among today's recommendation technologies and algorithms, the most widely recognized and adopted is recommendation based on collaborative filtering. Its advantages, a simple model, low data dependence, and easy data collection combined with good recommendation quality, have made it the publicly acknowledged star among recommendation algorithms. This article takes you deep into collaborative filtering and presents an efficient implementation of collaborative filtering algorithms based on Apache Mahout. Apache Mahout is a young ASF open-source project that originated from Lucene and is built on top of Hadoop, focusing on efficient implementations of classic machine learning algorithms over massive data.

Collective Intelligence and Collaborative Filtering

What is collective intelligence

Collective intelligence is not unique to the Web 2.0 era, but in the Web 2.0 era we use it to build more interesting applications and better user experiences in web applications. Collective intelligence means gathering answers from the behavior and data of large numbers of people to reach statistical conclusions about the whole population that could never be obtained from any single individual; these conclusions usually reflect a trend in, or something common to, the population.

Wikipedia and Google are two typical Web 2.0 applications that harness collective intelligence:

    • Wikipedia is an encyclopedia built on knowledge management. Compared with traditional encyclopedias edited by domain experts, Wikipedia lets end users contribute knowledge, and as the number of participants grows it becomes an unmatched, comprehensive knowledge base covering every field. Some may question its authority, but looked at from another angle the concern largely resolves itself. When a book is published, the author, however authoritative, inevitably makes some errors, which are then corrected revision after revision until the content grows ever more complete. On Wikipedia, such revision and correction become something everyone can do: anyone who spots an error or a gap can contribute their ideas, and even if some information is wrong, it is quickly corrected by others. Macroscopically, the whole system improves continuously along a virtuous circle, and that is the charm of collective intelligence.
    • Google: currently the most popular search engine. Unlike Wikipedia, it does not require users to make explicit contributions, but consider the core idea of Google's PageRank: it exploits the relationships between web pages, using the number of other pages that link to the current page as part of the measure of that page's importance. If that is hard to picture, think of it as an election in which every web page is both a candidate and a voter; after a number of iterations, PageRank converges to a relatively stable importance score for each page. Google thus takes advantage of the collective intelligence embedded in the links among all the web pages on the Internet to find which pages are important.
What is collaborative filtering

Collaborative filtering is a typical application of collective intelligence. To understand collaborative filtering (CF), first consider a simple question: if you want to watch a movie now but don't know which one, what do you do? Most people ask the friends around them for recent recommendations of good movies, and we generally prefer recommendations from friends whose tastes are similar to our own. This is the core idea of collaborative filtering.

Collaborative filtering generally finds, within a large crowd of users, the small fraction whose tastes are similar to yours; in collaborative filtering these users are called your neighbors. The other things they like are then organized into a ranked list and recommended to you. This, of course, raises two core questions:

    • How do you determine whether a user's tastes are similar to yours?
    • How do you organize the neighbors' preferences into a ranked list?

Compared with collective intelligence in general, collaborative filtering preserves the individual's characteristics, namely your taste preferences, to a certain extent, so it serves well as the algorithmic basis for personalized recommendation. As you can imagine, this recommendation strategy matters in Web 2.0's long tail: simply recommending popular items to the people in the long tail does not produce good results, which brings us back to one of the core problems of recommender systems: understand your users, then give better recommendations.

The core of collaborative filtering

With collective intelligence and collaborative filtering introduced as background, this section analyzes the principles of collaborative filtering, then presents the recommendation mechanisms based on it, their advantages and disadvantages, and their practical scenarios.

Implementing collaborative filtering requires the following steps:

    • Collect user preferences
    • Find similar users or items
    • Calculate recommendations
Collect User Preferences

To discover rules in users' behavior and preferences and make recommendations from them, how the system collects user preference information becomes the most fundamental determinant of recommendation quality. Users can provide their preferences to a system in many ways, and these vary greatly between applications. The following examples illustrate this:

Table 1. User behavior and user preferences

    • Rating (explicit): an integer-quantized preference with possible values [0, N], where N is usually 5 or 10. Ratings capture a user's preference for an item accurately.
    • Vote (explicit): a Boolean preference, 0 or 1. Votes capture a user's preference fairly accurately.
    • Forward (explicit): a Boolean preference, 0 or 1. Forwarding captures the forwarder's preference accurately; if the forwarding happens inside the site, the recipient's preference can also be inferred, though imprecisely.
    • Save bookmark (explicit): a Boolean preference, 0 or 1. Bookmarking an item clearly signals the user's interest in it.
    • Tag (explicit): a set of words that must be analyzed to extract a preference. A user's tags reveal how they understand an item and can be mined for sentiment: like or dislike.
    • Comment (explicit): a piece of text that requires text analysis to extract a preference. Comments can be mined for the user's sentiment: like or dislike.
    • Click stream / view (implicit): a set of clicks on items of interest that must be analyzed to derive preferences. Clicks reflect a user's attention, and therefore their preferences, to some extent.
    • Page dwell time (implicit): a set of timing data that is noisy and must be de-noised before analysis. Dwell time reflects attention and preference to some degree, but the noise makes it hard to use.
    • Purchase (implicit): a Boolean preference, 0 or 1. A purchase very clearly indicates interest in the item.

The user behaviors listed above are fairly general; recommendation engine designers can add behaviors specific to their own applications and use them to express users' preferences for items.

In most applications we extract more than one kind of user behavior. There are basically two ways to combine these different behaviors:

    • Group the different behaviors: typically into groups such as "view" and "buy," then compute separate user/item similarities per behavior, as in Dangdang's or Amazon's "people who bought this book also bought ..." and "people who viewed this book also viewed ..."
    • Weight the behaviors by how strongly each reflects user preference, and take the weighted sum as the user's overall preference for an item. In general, explicit feedback is weighted more heavily than implicit feedback, but it is also sparser, since only a small number of users give explicit feedback; and a purchase reflects preference more strongly than a "view," though this too varies by application.

Having collected the user behavior data, we also need to preprocess it. The two core steps are noise reduction and normalization.

    • Noise reduction: behavior data is generated as users use the application, so it can contain a great deal of noise and accidental operations. Classic data mining techniques can filter the noise out of the behavior data and make our analysis more accurate.
    • Normalization: as mentioned above, computing a user's preference for an item may require weighting different behavior data. But the value ranges of the different behaviors can differ enormously; for example, view counts are necessarily much larger than purchase counts. To make the weighted sum meaningful as an overall preference, the data for each behavior must be brought into the same value range, which is what normalization does. The simplest normalization divides each class of data by the maximum value in that class, which guarantees the normalized values fall in the [0, 1] range.

After preprocessing, and depending on the application's chosen analysis method (grouping or weighting), we obtain a two-dimensional user preference matrix: one dimension is the list of users, the other is the list of items, and each value is the user's preference for the item, typically a floating-point value in [0, 1] or [-1, 1].
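As a concrete illustration of the weighting and normalization steps above, here is a minimal, self-contained Java sketch. The behavior names and the weights (view = 0.3, buy = 0.7) are illustrative assumptions, not values from the text:

```java
import java.util.HashMap;
import java.util.Map;

public class PreferenceBuilder {

    // Normalize each behavior's raw counts into [0, 1] by dividing by the
    // per-behavior maximum, then combine them with per-behavior weights
    // into a single overall preference per item.
    public static Map<String, Double> combine(Map<String, Double> views,
                                              Map<String, Double> buys,
                                              double viewWeight, double buyWeight) {
        Map<String, Double> normViews = normalize(views);
        Map<String, Double> normBuys = normalize(buys);
        Map<String, Double> pref = new HashMap<>();
        for (Map.Entry<String, Double> e : normViews.entrySet())
            pref.merge(e.getKey(), viewWeight * e.getValue(), Double::sum);
        for (Map.Entry<String, Double> e : normBuys.entrySet())
            pref.merge(e.getKey(), buyWeight * e.getValue(), Double::sum);
        return pref;
    }

    // Simplest normalization: divide every value by the class maximum.
    static Map<String, Double> normalize(Map<String, Double> raw) {
        double max = raw.values().stream()
                .mapToDouble(Double::doubleValue).max().orElse(1.0);
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet())
            out.put(e.getKey(), e.getValue() / max);
        return out;
    }
}
```

One row of the resulting user-item matrix is simply this per-item preference map for one user.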

Finding similar users or items

Once the behavior analysis has produced user preferences, we can compute similar users and similar items from them and then recommend based on either, which gives the two most typical branches of CF: user-based CF and item-based CF. Both require computing similarity, so let's first look at the most basic similarity measures.

Calculating similarity

Existing approaches to computing similarity are basically vector-based: they compute the distance between two vectors, with a smaller distance meaning a greater similarity. In a recommendation scenario, within the two-dimensional user-item preference matrix, we can treat one user's preferences over all items as a vector and compute similarities between users, or treat all users' preferences for one item as a vector and compute similarities between items. Several commonly used similarity measures are described in detail below:

    • Euclidean distance (Euclidean Distance)

Euclidean distance was originally used to measure the distance between two points in Euclidean space. Suppose x and y are two points in an n-dimensional space; the Euclidean distance between them is:

d(x, y) = √( Σ (x_i − y_i)² )

As you can see, when n = 2, the Euclidean distance is simply the distance between two points in the plane.

When Euclidean distance is used to express similarity, it is usually converted with the following formula, so that a smaller distance yields a greater similarity:

sim(x, y) = 1 / (1 + d(x, y))

    • Pearson correlation coefficient (Pearson Correlation coefficient)

The Pearson correlation coefficient is generally used to measure how closely two interval-scaled variables are linearly related; its value lies in [-1, +1]:

p(x, y) = Σ (x_i − x̄)(y_i − ȳ) / ((n − 1) · s_x · s_y)

where s_x and s_y are the sample standard deviations of x and y.

    • Cosine similarity (cosine similarity)

Cosine similarity is widely used to measure the similarity of document data:

t(x, y) = (x · y) / (‖x‖ · ‖y‖)

    • Tanimoto coefficient (Tanimoto coefficient)

The Tanimoto coefficient, also known as the Jaccard coefficient, is an extension of cosine similarity and is likewise used to compute the similarity of document data:

T(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y)
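A minimal, self-contained Java sketch may make the four measures concrete. It assumes dense double[] vectors of equal length with no missing values; real preference vectors are sparse and only partially overlapping, which a production implementation must handle:

```java
public class Similarity {

    // Euclidean distance converted to a similarity in (0, 1]: 1 / (1 + d)
    public static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    // Pearson correlation coefficient, in [-1, +1]
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double cov = 0, sx = 0, sy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            sx  += (x[i] - mx) * (x[i] - mx);
            sy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(sx * sy);
    }

    // Cosine similarity: dot(x, y) / (|x| * |y|)
    public static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx  += x[i] * x[i];
            ny  += y[i] * y[i];
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    // Tanimoto coefficient: dot(x, y) / (|x|^2 + |y|^2 - dot(x, y))
    public static double tanimoto(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx  += x[i] * x[i];
            ny  += y[i] * y[i];
        }
        return dot / (nx + ny - dot);
    }
}
```

Note that Pearson is just cosine similarity applied to mean-centered vectors, which is why the two often behave similarly on rating data.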

Computing similar neighbors

Having introduced the similarity measures, let's look at how to find user or item neighbors based on similarity. The common principles for selecting neighbors fall into two categories; Figure 1 shows a set of points on a two-dimensional plane.

    • Fixed number of neighbors: K-neighborhoods or fix-size neighborhoods

Take only the nearest K points as neighbors, regardless of how "near or far" they actually are. In Figure 1 A, suppose we compute the 5-neighborhood of point 1; by the distances between points, we take the nearest 5: point 2, point 3, point 4, point 7, and point 5. Clearly, this method handles outliers poorly: because a fixed number of neighbors must be taken, when there are not enough sufficiently similar points nearby, less similar points are forced in as neighbors, which dilutes the similarity of the neighborhood. In Figure 1, for instance, point 1 and point 5 are not really very similar.

    • Neighbor based on similarity threshold: threshold-based neighborhoods

Unlike the fixed-number approach, threshold-based neighbor computation places a limit on how far away a neighbor may be: centered at the current point, every point falling within a distance K is taken as a neighbor. The number of neighbors this method yields is indeterminate, but the similarity error is bounded. In Figure 1 B, starting from point 1 and taking the points within distance K, we get point 2, point 3, point 4, and point 7. This method gives better neighbor similarity than the previous one, especially in its handling of outliers.

Figure 1: Similar neighbor calculations
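The two neighbor-selection principles can be sketched in a few lines of self-contained Java. Here a similarity score stands in for the geometric distance of Figure 1 (higher score = closer), and the map-based data structures are illustrative assumptions:

```java
import java.util.*;
import java.util.stream.Collectors;

public class Neighborhoods {
    // similarities: candidate user ID -> similarity to the target user

    // Fixed-size neighborhood: always take the K most similar users,
    // however dissimilar the K-th one actually is.
    public static List<Long> kNearest(Map<Long, Double> similarities, int k) {
        return similarities.entrySet().stream()
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Threshold-based neighborhood: keep every user whose similarity
    // reaches the threshold; the neighborhood size is indeterminate.
    public static List<Long> overThreshold(Map<Long, Double> similarities,
                                           double threshold) {
        List<Long> neighbors = new ArrayList<>();
        for (Map.Entry<Long, Double> e : similarities.entrySet())
            if (e.getValue() >= threshold) neighbors.add(e.getKey());
        return neighbors;
    }
}
```

With a threshold, an outlier like point 1 in Figure 1 simply ends up with a small (possibly empty) neighborhood instead of being forced to adopt dissimilar neighbors.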

Calculating recommendations

Having computed neighboring users and neighboring items, we now describe how to generate recommendations from this information. The earlier overview article in this series briefly introduced that collaborative filtering recommendation algorithms divide into user-based CF and item-based CF; below we examine the two methods' computation, usage scenarios, and advantages and disadvantages in depth.

User-based CF (User CF)

The basic idea of user-based CF is quite simple: based on users' preferences for items, find the neighbor users, then recommend the things those neighbors like to the current user. Computationally, one user's preferences for all items form a vector used to compute similarities between users; after finding the K neighbors, the neighbors' similarities serve as weights on their item preferences to predict the current user's preference for items he has not yet expressed an opinion on, and a sorted list of items is computed as the recommendation. Figure 2 shows an example: for user A, based on users' historical preferences, only one neighbor, user C, is found, so item D, which user C likes, is recommended to user A.

Figure 2: Fundamentals of the user-based CF
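The prediction step just described, weighting each neighbor's preference by that neighbor's similarity, can be sketched as a minimal, self-contained Java method. The map-based structures are illustrative assumptions, not Mahout's API:

```java
import java.util.Map;

public class UserBasedCF {

    /**
     * Predict the current user's preference for an item as the
     * similarity-weighted average of the neighbors' preferences for it.
     *
     * neighborSims:  neighbor user ID -> similarity to the current user
     * neighborPrefs: neighbor user ID -> (item ID -> that neighbor's preference)
     */
    public static double predict(long item,
                                 Map<Long, Double> neighborSims,
                                 Map<Long, Map<Long, Double>> neighborPrefs) {
        double weighted = 0, simSum = 0;
        for (Map.Entry<Long, Double> e : neighborSims.entrySet()) {
            Map<Long, Double> prefs = neighborPrefs.get(e.getKey());
            if (prefs != null && prefs.containsKey(item)) {
                weighted += e.getValue() * prefs.get(item);
                simSum   += e.getValue();
            }
        }
        // No neighbor has rated the item: no basis for a prediction.
        return simSum == 0 ? 0 : weighted / simSum;
    }
}
```

Running this predictor over every item the current user has not yet rated and sorting by the predicted value yields the recommendation list.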

Item-based CF (item CF)

The principle of item-based CF is similar to user-based CF, except that the neighborhoods are computed among the items themselves rather than from the user's point of view: based on users' preferences for items, find similar items, then recommend items similar to what a user liked in the past. Computationally, all users' preferences for one item form a vector used to compute similarities between items; given the similar items, a user's historical preferences predict his preference for items he has not yet expressed an opinion on, and a sorted list of items is computed as the recommendation. Figure 3 shows an example: for item A, based on all users' historical preferences, the users who like item A also like item C, so item A and item C are similar; since user C likes item A, we can infer that user C probably also likes item C.

Figure 3: Fundamentals of item-based CF
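Symmetrically to the User CF sketch, the item-based prediction step can be written as follows, assuming the item-item similarities have already been computed. The data structures are illustrative, not Mahout's API:

```java
import java.util.Map;

public class ItemBasedCF {

    /**
     * Predict a user's preference for a target item from the user's own
     * ratings of items similar to it, weighted by item-item similarity.
     *
     * userPrefs: item ID -> this user's historical preference
     * itemSims:  item ID -> similarity between that item and the target item
     */
    public static double predict(Map<Long, Double> userPrefs,
                                 Map<Long, Double> itemSims) {
        double weighted = 0, simSum = 0;
        for (Map.Entry<Long, Double> e : itemSims.entrySet()) {
            Double rating = userPrefs.get(e.getKey());
            if (rating != null) {
                weighted += e.getValue() * rating;
                simSum   += e.getValue();
            }
        }
        return simSum == 0 ? 0 : weighted / simSum;
    }
}
```

The structural difference from User CF is visible here: only the user's own history and the (relatively stable) item-item similarities are needed, no neighborhood of other users.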

User CF vs. Item CF

The basic principles of User CF and Item CF were described above; now let's compare their pros and cons, and the scenarios each suits, from several angles:

    • Computational complexity

Item CF and User CF are the two most basic collaborative filtering algorithms. User CF was introduced long ago, while Item CF became popular after Amazon's papers and patents were published (around 2001), and the conventional wisdom held that Item CF beats User CF in both performance and complexity. One main reason is that on a typical online retail site the number of users far exceeds the number of items, and the item catalog is relatively stable, so computing item similarities is not only cheaper but also rarely needs updating. But this reasoning only holds for e-commerce sites selling goods; for news, blog, or micro-content recommendation systems the situation is often the opposite: the number of items is huge and updated frequently. So from the standpoint of complexity alone, each algorithm has the advantage in different systems, and the designer of a recommendation engine should choose the more suitable algorithm based on the characteristics of the application.

    • Applicable scenarios

On non-social-networking sites, the internal connections among content items are an important recommendation principle, often more effective than recommendations based on similar users. For example, on a book-selling site, when you browse a book, the recommendation engine shows you books related to it, and this recommendation matters far more to you than the site-wide recommendations on the home page. In this situation, Item CF recommendations become an important means of guiding users through the catalog. Item CF also makes recommendations easy to explain: on a non-social site, telling a user that someone with similar interests has read a certain book is hard to find convincing, since the user probably does not know that person at all; but if the explanation is that the book is similar to one he has read before, the user may well find it reasonable and accept the recommendation.

Conversely, on today's popular social-networking sites, User CF is the better choice: User CF combined with social network information can increase users' confidence in the explanations behind recommendations.

    • Recommended diversity and Accuracy

Researchers studying recommendation engines have run User CF and Item CF on the same data set and compared the results: only 50% of the recommended items appear in both lists, while the other 50% are completely different. Yet the two algorithms achieve similar precision, so they can be said to be highly complementary.

There are two ways to measure the diversity of recommendations:

The first measure looks at diversity from a single user's perspective: given a user, examine whether the system's recommendation list for him is itself diverse, that is, compare the pairwise similarity of the items on the list. It is not hard to see that by this measure Item CF's diversity is clearly worse than User CF's, because Item CF recommends precisely the items most similar to what the user has already seen.

The second measure considers diversity at the system level, also known as coverage: can the recommender provide rich choices across all of its users? On this metric, Item CF's diversity is much better than User CF's, because User CF always tends to recommend popular items; seen from the other side, Item CF's recommendations have very good novelty and excel at recommending long-tail items. So although Item CF's accuracy is slightly lower than User CF's in most cases, once diversity is taken into account Item CF comes out much better than User CF.

If you are still unsure about recommendation diversity, another example may show how User CF and Item CF differ. Suppose every user's interests are broad, spanning several domains, but each user also has one primary domain that receives more attention than the others. Given a user who likes domains A, B, and C, with A as his primary domain, consider what each algorithm tends to recommend. If you use User CF, it will recommend the more popular items across A, B, and C; if you use Item CF, it will recommend almost exclusively items from domain A. So User CF, by recommending only what is popular, is weak at recommending long-tail items, while Item CF restricts itself to a single domain per user, so although its limited list may contain a fair number of unpopular long-tail items, the recommendations for that individual user clearly lack diversity. For the system as a whole, however, because different users have different primary domains, the overall coverage will be better.

The analysis above makes clear that both approaches are reasonable, but neither is the best choice on its own, so each loses some accuracy. In fact, the best choice for such a system is this: if the system recommends 30 items to a user, neither pick the 10 hottest items from each domain nor recommend 30 items from a single domain, but instead recommend, say, 15 items from domain A and the remaining 15 from B and C. Combining User CF and Item CF is therefore the best choice. The basic principle: when Item CF leaves an individual's recommendations insufficiently diverse, add User CF to increase per-user diversity and thereby improve accuracy; and when User CF leaves the system's overall diversity insufficient, add Item CF to increase overall diversity, likewise improving the recommendation accuracy.

    • User's adaptability to the recommended algorithm

Most of us judge which algorithm is better from the recommendation engine's point of view, but we should also consider the end users of the engine, the application's users, and how well each of them fits the recommendation algorithm's assumptions.

User CF rests on the assumption that a user will like what users with the same preferences like. If a user has no "friends" with the same preferences, the User CF algorithm performs very poorly for him; a user's fitness for User CF is therefore proportional to how many other users share his preferences.

Item CF likewise has a basic assumption: a user will like things similar to what he liked before. We can measure how well a user matches this assumption by computing the self-similarity of the set of items he likes. If the self-similarity is high, the things he likes resemble one another: he matches Item CF's basic assumption well, and his fit with Item CF is naturally better. Conversely, if the self-similarity is low, his preferences do not meet the method's basic assumption, and the chance of producing good recommendations for him with Item CF is very low.

With the introduction above, you should now have a thorough understanding of the various collaborative filtering methods, their principles, characteristics, and applicable scenarios. Next we enter the hands-on phase and focus on implementing a collaborative filtering recommendation algorithm based on Apache Mahout.


Efficient collaborative filtering recommendations based on Apache Mahout

Apache Mahout is an open-source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, designed to help developers create intelligent applications more quickly and easily. Mahout also supports Apache Hadoop, enabling these algorithms to run efficiently in cloud environments.

For installing and configuring Apache Mahout, refer to "Building a social recommendation engine based on Apache Mahout," a developerWorks article published in 2009 on implementing a recommendation engine with Mahout; it details the installation steps and gives a simple movie recommendation engine as an example.

The collaborative filtering implementation provided in Apache Mahout is a scalable, efficient recommendation engine written in Java. Figure 4 shows the component diagram of Mahout's collaborative filtering implementation, which we will walk through step by step.

Figure 4. Component Diagram

Data representation:

Preference

The input to a collaborative filtering recommendation engine is the users' historical preference data. In Mahout this is modeled by the Preference interface: a Preference is a simple triple of <user ID, item ID, preference value>, and its implementation class is GenericPreference. You can create a GenericPreference with the following statement:

GenericPreference preference = new GenericPreference(123, 456, 3.0f);

Here 123 is the user ID (long), 456 is the item ID (long), and 3.0f is the preference value (float). From this example you can see that a single GenericPreference object carries noticeable per-object overhead, so simply loading the preference data into an array of such objects consumes a large amount of memory. Mahout optimizes this with PreferenceArray, an interface representing a collection of preference data. For performance, Mahout provides two implementations, GenericUserPreferenceArray and GenericItemPreferenceArray, which group the preferences by user or by item respectively, so the repeated user IDs or item IDs can be compressed away. The code in Listing 1 below shows how to create and use a PreferenceArray, using GenericUserPreferenceArray as the example.

Listing 1. Creating and using a PreferenceArray

PreferenceArray userPref = new GenericUserPreferenceArray(2);  // size = 2
userPref.setUserID(0, 1L);
userPref.setItemID(0, 101L);   // <1L, 101L, 2.0f>
userPref.setValue(0, 2.0f);
userPref.setItemID(1, 102L);   // <1L, 102L, 4.0f>
userPref.setValue(1, 4.0f);
Preference pref = userPref.get(1);   // <1L, 102L, 4.0f>

To improve performance, Mahout also provides its own hash map and set implementations, FastByIDMap and FastIDSet; interested readers can refer to the official Mahout documentation.

DataModel

Mahout's recommendation engine actually takes a DataModel as its input, a compressed representation of the user preference data. The statement that creates an in-memory DataModel makes this clear:

DataModel model = new GenericDataModel(FastByIDMap<PreferenceArray> map);

Internally, it stores PreferenceArrays hashed by user ID or item ID, and each PreferenceArray holds all the preference data for that user ID or item ID.

DataModel is an abstract interface over user preference data; its implementations can extract user preferences from any kind of data source, including the in-memory GenericDataModel, FileDataModel for reading from files, and JDBCDataModel for reading from databases. Let's look at how to create the various DataModels.

Listing 2. Creating various DataModels

// In-memory DataModel - GenericDataModel
FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();
PreferenceArray prefsForUser1 = new GenericUserPreferenceArray(10);
prefsForUser1.setUserID(0, 1L);
prefsForUser1.setItemID(0, 101L);
prefsForUser1.setValue(0, 3.0f);
prefsForUser1.setItemID(1, 102L);
prefsForUser1.setValue(1, 4.5f);
// ... (8 more)
preferences.put(1L, prefsForUser1);  // use the user ID as the key
// ... (more users)
DataModel model = new GenericDataModel(preferences);

// File-based DataModel - FileDataModel
DataModel dataModel = new FileDataModel(new File("preferences.csv"));

// Database-based DataModel - MySQLJDBCDataModel
MysqlDataSource dataSource = new MysqlDataSource();
dataSource.setServerName("my_database_host");
dataSource.setUser("my_user");
dataSource.setPassword("my_password");
DataModel dataModel = new MySQLJDBCDataModel(dataSource, "my_prefs_table",
    "my_user_column", "my_item_column", "my_pref_value_column");

FileDataModel, the file-based model, places few requirements on the file format; the contents only need to satisfy the following:

    • Each line contains a user ID, an item ID, and a preference value
    • Fields are separated by commas or tabs
    • *.zip and *.gz files are decompressed automatically (Mahout recommends using compressed storage when the data volume is large)
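For instance, a minimal preference file that satisfies these rules could look like the following (the IDs and values are made up for illustration):

```csv
1,101,3.0
1,102,4.5
2,101,2.0
2,103,5.0
```

Each line is one <user ID, item ID, preference> triple, the same triple that Preference models in memory.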

JDBCDataModel, for reading from a database: Mahout provides a default MySQL implementation, which places these simple requirements on the table storing the preference data:

    • The user ID column must be BIGINT and non-null
    • The item ID column must be BIGINT and non-null
    • The user preference column must be FLOAT

It is recommended to build indexes on the user ID and item ID columns.

Implementing recommendations: Recommender

Having introduced the data representation model, we now describe the collaborative filtering recommendation strategies Mahout provides. We pick the three most classic: User CF, Item CF, and Slope One.

User CF

The principles of User CF were described in detail earlier; here we focus on how to implement the User CF strategy with Mahout, starting from an example:

Listing 3. Implementing User CF with Mahout

DataModel model = new FileDataModel(new File("preferences.dat"));
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// the first argument is the neighborhood size N; 2 here is just an example
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    1. Build the DataModel from a file. We use the FileDataModel described above, assuming the user preference data is stored in the file preferences.dat.
    2. Compute user similarity from the preference data; the listing uses PearsonCorrelationSimilarity. The earlier sections described the various similarity measures in detail; Mahout provides their basic implementations, all realizing the UserSimilarity interface for user similarity computation, including these commonly used classes:
    • PearsonCorrelationSimilarity: similarity based on the Pearson correlation coefficient
    • EuclideanDistanceSimilarity: similarity based on Euclidean distance
    • TanimotoCoefficientSimilarity: similarity based on the Tanimoto coefficient
    • UncenteredCosineSimilarity: cosine similarity

The ItemSimilarity implementations are analogous.

    3. Find neighboring users with the chosen similarity measure. As introduced earlier, there are two ways to define a neighborhood, "fixed number of neighbors" and "similarity-threshold neighbors", and Mahout provides an implementation of each:
      • NearestNUserNeighborhood: for each user, take the N nearest neighbors
      • ThresholdUserNeighborhood: for each user, take as neighbors all users whose similarity falls above a given threshold
    4. Build a GenericUserBasedRecommender from the DataModel, UserNeighborhood, and UserSimilarity to implement the User CF recommendation strategy.
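Conceptually, the fixed-size variant simply keeps the N candidates most similar to the target user. A self-contained sketch of that selection step (illustrative only, not Mahout internals; names are our own):

```java
import java.util.Arrays;

// Illustrative sketch of fixed-N neighbor selection.
public class NeighborhoodSketch {
    // Return the indices of the n users with the highest similarity scores.
    static int[] nearestN(double[] similarities, int n) {
        Integer[] idx = new Integer[similarities.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort candidate indices by similarity, descending
        Arrays.sort(idx, (x, y) -> Double.compare(similarities[y], similarities[x]));
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = idx[i];
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical similarities of four candidate users to the target user
        double[] sims = {0.1, 0.9, 0.5, 0.7};
        System.out.println(Arrays.toString(nearestN(sims, 2))); // [1, 3]
    }
}
```

The threshold variant would instead keep every index whose similarity exceeds a cutoff, trading a fixed neighborhood size for a fixed quality bar.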

Item CF

Once you understand User CF, Mahout's Item CF implementation is easy to follow: it is built analogously, but on ItemSimilarity. The code example below is even simpler than User CF, because Item CF does not need the concept of a neighborhood:

Listing 4. Implementation of Item CF based on Mahout
DataModel model = new FileDataModel(new File("preferences.dat"));
ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
Recommender recommender = new GenericItemBasedRecommender(model, similarity);

Slope One

As described earlier, User CF and Item CF are the two most intuitive CF recommendation strategies, but on large data sets their computational cost can become prohibitive, hurting recommendation efficiency. Mahout therefore also offers a more lightweight CF recommendation strategy: Slope One.

Slope One is a rating-based collaborative filtering approach proposed in 2005 by Daniel Lemire and Anna MacLachlan. Here we briefly introduce its basic idea.

Figure 5 shows an example. Suppose the system's average ratings for item A, item B, and item C are 3, 4, and 4 respectively. The Slope One method derives the following rules:

    • User's rating of item B = User's rating of item A + 1
    • User's rating of item B = User's rating of item C

Based on these rules, we can predict the missing ratings of user A and user B:

    • For user A, who rated item A as 4, we can infer a rating of 5 for item B, and hence 5 for item C.
    • For user B, who rated item A as 2 and item C as 4: by the first rule we infer a rating of 3 for item B, and by the second rule a rating of 4. When the rules conflict, we average the inferences, so the final prediction is 3.5.
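The worked example above can be sketched in a few lines of self-contained Java (illustrative only, not Mahout's implementation; the names are our own):

```java
// Minimal sketch of the Slope One prediction step.
// Each known rating pairs with a precomputed average difference
// diffs[i] = avg(rating of target item) - avg(rating of item i).
public class SlopeOneSketch {
    static double predict(double[] knownRatings, double[] diffs) {
        double sum = 0;
        for (int i = 0; i < knownRatings.length; i++) {
            sum += knownRatings[i] + diffs[i]; // each rule's estimate for the target item
        }
        return sum / knownRatings.length;      // average the (possibly conflicting) rules
    }

    public static void main(String[] args) {
        // User B rated item A = 2 and item C = 4; diffs: B-A = 1, B-C = 0
        System.out.println(predict(new double[]{2, 4}, new double[]{1, 0})); // 3.5
    }
}
```

In practice the diffs are averaged over all users who co-rated each pair of items, which is exactly the table a DiffStorage maintains.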

This is the rationale behind the Slope One recommendation: it treats the relationship between users' ratings of different items as a simple linear relationship:

y = mx + b

Slope One is the special case m = 1, which is exactly the example we just walked through.

Figure 5.Slope One recommended policy example

The core advantage of Slope One is that it maintains good computational speed and recommendation quality even on large-scale data. Mahout provides a basic implementation of the Slope One method; the code is very simple, see Listing 5.

Listing 5. Implementation of Slope one based on Mahout
// In-memory recommender
DiffStorage diffStorage = new MemoryDiffStorage(model, Weighting.UNWEIGHTED, false, Long.MAX_VALUE);
Recommender recommender = new SlopeOneRecommender(model, Weighting.UNWEIGHTED, Weighting.UNWEIGHTED, diffStorage);

// Database-based recommender
AbstractJDBCDataModel model = new MySQLJDBCDataModel();
DiffStorage diffStorage = new MySQLJDBCDiffStorage(model);
Recommender recommender = new SlopeOneRecommender(model, Weighting.WEIGHTED, Weighting.WEIGHTED, diffStorage);

1. Create a DiffStorage that stores the linear relationships (average rating differences) between items, either in memory or backed by a database.

2. Build a SlopeOneRecommender from the DataModel and the DiffStorage to implement the Slope One recommendation strategy.

Summarize

One of the core ideas of Web 2.0 is "collective intelligence". The basic idea of recommendation strategies based on collaborative filtering is to draw on the behavior of the crowd to provide each user with personalized recommendations, helping users find the information they need more quickly and accurately. Looking at successful recommendation engines in practice, such as Amazon and Douban, collaborative filtering requires neither rigorous modeling of products or users nor machine-understandable descriptions of items; it is a domain-independent recommendation method. At the same time, the recommendations it computes are open, sharing other users' experience, and it supports users well in discovering latent preferences. Collaborative filtering itself has different branches, each with its own applicable scenarios and characteristics; you can choose a suitable method, or combine several, according to the actual situation of your application to get better recommendations.

In addition, this article has shown how to implement collaborative filtering recommendation algorithms efficiently with Apache Mahout. Apache Mahout focuses on efficient implementations of classic machine learning algorithms on massive data and provides good support for collaborative-filtering-based recommendation; with Mahout you can easily experience the appeal of efficient recommendation.

As the first article going deep into recommendation engine algorithms, this paper introduced collaborative filtering in detail and gave examples of implementing it efficiently with Apache Mahout. However, we also found that running collaborative filtering and other high-complexity recommendation strategies efficiently on massive data remains very challenging. Many methods have been proposed to reduce the amount of computation, and clustering is undoubtedly one of the best choices. The next article in this series will therefore detail various clustering algorithms, their principles, strengths and weaknesses, and applicable scenarios; give efficient implementations based on Apache Mahout; and analyze how, when implementing a recommendation engine, clustering can be introduced to tame the massive computation brought by large data volumes and thus provide efficient recommendations.

Finally, thank you for your interest in and support of this series.

(Source: http://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy2/index.html)
