Efficient collaborative filtering recommendations based on Apache Mahout
Apache Mahout is an open-source project under the Apache Software Foundation (ASF) that provides extensible implementations of classic machine learning algorithms, designed to help developers build smart applications more quickly and easily. Mahout also adds support for Apache Hadoop so that these algorithms can run more efficiently in a cloud environment.
For the installation and configuration of Apache Mahout, refer to "Building a social recommendation engine based on Apache Mahout", a developerWorks article published in 2009 on building a recommendation engine with Mahout, which details the Mahout installation steps and gives an example of a simple movie recommendation engine.
The collaborative filtering implementation provided in Apache Mahout is a scalable, efficient, Java-based recommendation engine. Figure 4 shows a component diagram of the collaborative filtering recommendation implementation in Apache Mahout, which we will now step through in more detail.
Figure 4. Component Diagram
Data representation:
Preference
The input to a collaborative filtering recommendation engine is the users' historical preference data. In Mahout this is modeled by the Preference interface: a Preference is a simple triple of <user ID, item ID, preference value>, and its implementation class is GenericPreference. You can create a GenericPreference with the following statement.
GenericPreference preference = new GenericPreference(123, 456, 3.0f);
Here 123 is the user ID (long), 456 is the item ID (long), and 3.0f is the preference value (float). A single GenericPreference thus carries about 20 bytes of raw data (two longs and a float) plus Java object overhead, so simply loading the preference data into a plain array of GenericPreference objects consumes a large amount of memory. Mahout optimizes this with PreferenceArray, an interface that holds a set of preferences; for performance it provides two implementation classes, GenericUserPreferenceArray and GenericItemPreferenceArray, which group preferences by user or by item respectively, so that the repeated user ID or item ID does not need to be stored with every entry. The code in Listing 1 below shows how to create and use a PreferenceArray, using GenericUserPreferenceArray as the example.
Listing 1. Creating and using a PreferenceArray
PreferenceArray userPref = new GenericUserPreferenceArray(2);  // size = 2
userPref.setUserID(0, 1L);
userPref.setItemID(0, 101L);   // <1L, 101L, 2.0f>
userPref.setValue(0, 2.0f);
userPref.setItemID(1, 102L);   // <1L, 102L, 4.0f>
userPref.setValue(1, 4.0f);
Preference pref = userPref.get(1);   // <1L, 102L, 4.0f>
To improve performance further, Mahout also provides its own map and set implementations keyed by primitive long IDs, FastByIDMap and FastIDSet, in place of the standard HashMap and Set; interested readers can refer to the official Mahout documentation.
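As a quick, illustrative sketch (not from the original article), these collections from org.apache.mahout.cf.taste.impl.common work directly with long IDs and avoid boxing to Long; here userPref is the PreferenceArray from Listing 1:

FastByIDMap<PreferenceArray> userData = new FastByIDMap<PreferenceArray>();
userData.put(1L, userPref);                 // keyed directly by the long user ID
PreferenceArray prefs = userData.get(1L);

FastIDSet itemIDs = new FastIDSet();
itemIDs.add(101L);
boolean seen = itemIDs.contains(101L);      // true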
DataModel
What Mahout's recommendation engine actually accepts as input is a DataModel, a compressed representation of the user preference data. We can see this from the statement that creates the in-memory version of DataModel:
DataModel model = new GenericDataModel(FastByIDMap<PreferenceArray> map);
Internally it stores the data as PreferenceArrays hashed by user ID or item ID, where each PreferenceArray holds all the preference information for that user ID or item ID.
DataModel is the abstract interface for user preference data, and its implementations support extracting preferences from various kinds of data sources, including the in-memory GenericDataModel, FileDataModel for reading from files, and JDBCDataModel for reading from databases. Let's look at how to create the various DataModels.
Listing 2. Creating various DataModels
// In-memory DataModel - GenericDataModel
FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();
PreferenceArray prefsForUser1 = new GenericUserPreferenceArray(10);
prefsForUser1.setUserID(0, 1L);
prefsForUser1.setItemID(0, 101L);
prefsForUser1.setValue(0, 3.0f);
prefsForUser1.setItemID(1, 102L);
prefsForUser1.setValue(1, 4.5f);
// ... (8 more)
preferences.put(1L, prefsForUser1);   // use the user ID as the key
// ... (more users)
DataModel model = new GenericDataModel(preferences);

// File-based DataModel - FileDataModel
DataModel dataModel = new FileDataModel(new File("preferences.csv"));

// Database-based DataModel - MySQLJDBCDataModel
MysqlDataSource dataSource = new MysqlDataSource();
dataSource.setServerName("my_database_host");
dataSource.setUser("my_user");
dataSource.setPassword("my_password");
JDBCDataModel dataModel = new MySQLJDBCDataModel(dataSource, "my_prefs_table",
    "my_user_column", "my_item_column", "my_pref_value_column");
FileDataModel, which Mahout provides for reading preferences from a file, places few requirements on the file format; the contents only need to satisfy the following (a short sample follows this list):
Each line includes user ID, item ID, user preference
Comma-separated or Tab-separated
*.zip and *.gz files are decompressed automatically (Mahout recommends using compressed storage when the data volume is large)
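For example, a small preferences.csv satisfying this format (the file name and values here are purely illustrative) could look like:

1,101,3.0
1,102,4.5
2,101,2.0
2,103,5.0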
For reading from a database, Mahout provides JDBCDataModel with a default MySQL implementation, which places only a few simple requirements on the table storing the user preference data:
The user ID column must be BIGINT and non-null
The item ID column must be BIGINT and non-null
The preference value column must be FLOAT
It is recommended that you index the user ID and item ID.
Implementing the recommender: Recommender
Having introduced the data representation model, we now describe the collaborative filtering recommendation strategies provided by Mahout; here we pick the three most classic ones: User CF, Item CF, and Slope One.
User CF
The principle of User CF was described in detail earlier, so here we focus on how to implement the User CF recommendation strategy with Mahout. Let's start with an example:
Listing 3. Implementing User CF based on Mahout
DataModel model = new FileDataModel(new File("preferences.dat"));
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(100, similarity, model);   // neighborhood size, e.g. 100
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
Build the DataModel from a file: we use the FileDataModel described above, assuming the user preference data is stored in the preferences.dat file.
Compute user similarity from the preference data: the listing uses PearsonCorrelationSimilarity. The earlier sections described the various similarity measures in detail; Mahout provides basic implementations of them, all realizing the UserSimilarity interface, including the following commonly used ones:
PearsonCorrelationSimilarity: similarity based on the Pearson correlation coefficient
EuclideanDistanceSimilarity: similarity based on Euclidean distance
TanimotoCoefficientSimilarity: similarity based on the Tanimoto coefficient
UncenteredCosineSimilarity: cosine similarity
The ItemSimilarity implementations are similar.
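For instance, as a minimal sketch (the choice of similarity class here is arbitrary), the same classes can be used wherever an ItemSimilarity is required:

ItemSimilarity itemSimilarity = new TanimotoCoefficientSimilarity(model);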
Find neighbor users according to the chosen similarity measure. As introduced earlier, there are two ways to define a neighborhood, a fixed number of neighbors and a similarity threshold, and Mahout provides a corresponding implementation for each:
NearestNUserNeighborhood: for each user, take the N most similar users as neighbors
ThresholdUserNeighborhood: for each user, take all users whose similarity exceeds a given threshold as neighbors (see the short sketch after this list)
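A minimal sketch of the two neighborhood styles, continuing from Listing 3 (the size 100 and the 0.7 threshold are illustrative values, not from the original article):

UserNeighborhood fixedSize = new NearestNUserNeighborhood(100, similarity, model);      // the 100 most similar users
UserNeighborhood thresholded = new ThresholdUserNeighborhood(0.7, similarity, model);   // all users with similarity >= 0.7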
Build a GenericUserBasedRecommender from the DataModel, UserNeighborhood and UserSimilarity; this implements the User CF recommendation strategy.
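Once the recommender is built, asking it for recommendations is a one-liner through the Recommender interface; a minimal sketch (the user ID 1 and the count 10 are illustrative):

List<RecommendedItem> recommendations = recommender.recommend(1L, 10);   // top 10 items for user 1
for (RecommendedItem item : recommendations) {
    System.out.println(item.getItemID() + " : " + item.getValue());      // item ID and estimated preference
}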
Item CF
Once you understand User CF, Mahout's Item CF implementation is easy to follow: it is similar to User CF but based on ItemSimilarity. The code example below is even simpler than User CF, because Item CF does not need the concept of a neighborhood:
Listing 4. Implementation of Item CF based on Mahout
DataModel model = new FileDataModel(new File("preferences.dat"));
ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
Recommender recommender = new GenericItemBasedRecommender(model, similarity);
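Besides recommend(), GenericItemBasedRecommender can also return the items most similar to a given item, which is handy for "people who liked this also liked" style features; a minimal sketch (item ID 101 and the count 5 are illustrative):

GenericItemBasedRecommender itemRecommender = new GenericItemBasedRecommender(model, similarity);
List<RecommendedItem> similarItems = itemRecommender.mostSimilarItems(101L, 5);   // 5 items most similar to item 101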
Slope One
As described earlier, User CF and Item CF are the two most intuitive CF recommendation strategies, but on large data volumes their computational cost is high, which hurts recommendation efficiency. Mahout therefore also offers a more lightweight CF recommendation strategy: Slope One.
Slope One is a rating-based collaborative filtering approach proposed by Daniel Lemire and Anna Maclachlan in 2005; here we briefly introduce its basic idea.
Figure 5 shows an example. Suppose the system's average rating for item A, item B and item C is 3, 4 and 4, respectively. The Slope One method derives the following rules:
A user's rating of item B tends to be 1 point higher than the same user's rating of item A
A user's rating of item B tends to equal the rating of item C
A user's rating of item C tends to be 1 point higher than the rating of item A
Based on these rules, we can predict the ratings of user A and user B:
For user A, who rated item A 4, we can predict a rating of 5 for item B and 5 for item C.
For user B, who rated item A 2 and item C 4: by the first rule we would infer a rating of 3 for item B, and by the second rule a rating of 4. When the rules conflict we can average the values inferred from the different rules, so the final prediction is 3.5.
This is the basic idea behind Slope One: it treats the relationship between users' ratings of different items as a simple linear relationship:
y = mx + b
When m = 1 this is Slope One, which is exactly the example we just walked through.
Figure 5. Slope One recommendation strategy example
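As a tiny, Mahout-independent sketch of the arithmetic above (all values are taken from the example):

double avgItemA = 3.0, avgItemB = 4.0;                  // average ratings from the example
double diffBA = avgItemB - avgItemA;                    // item B is rated 1 point higher on average
double userARatingOfA = 4.0;                            // user A's known rating for item A
double predictedBForUserA = userARatingOfA + diffBA;    // 4 + 1 = 5, matching the text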
The core advantage of Slope One is that it maintains good computational speed and recommendation quality even on large-scale data. Mahout provides a basic implementation of the Slope One recommender; the code is very simple, see Listing 5.
Listing 5. Implementation of Slope One based on Mahout
// In-memory recommender
DiffStorage diffStorage = new MemoryDiffStorage(model, Weighting.UNWEIGHTED, false, Long.MAX_VALUE);
Recommender recommender = new SlopeOneRecommender(model, Weighting.UNWEIGHTED, Weighting.UNWEIGHTED, diffStorage);

// Database-based recommender
AbstractJDBCDataModel model = new MySQLJDBCDataModel();
DiffStorage diffStorage = new MySQLJDBCDiffStorage(model);
Recommender recommender = new SlopeOneRecommender(model, Weighting.WEIGHTED, Weighting.WEIGHTED, diffStorage);
1. Create DiffStorage, the model of the linear relationships (rating differences) between items, based on the DataModel; it can be kept in memory or in a database.
2. Create a SlopeOneRecommender from the DataModel and the DiffStorage to implement the Slope One recommendation strategy.
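As with the other recommenders, the Slope One recommender is then queried through the common Recommender interface; a minimal sketch (the IDs and the count are illustrative):

float estimate = recommender.estimatePreference(1L, 102L);              // predicted rating of item 102 for user 1
List<RecommendedItem> recommendations = recommender.recommend(1L, 10);  // top 10 recommendations for user 1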
Summary
One of the core ideas of Web 2.0 is "collective intelligence": the basic idea of collaborative filtering is to use the behavior of the crowd to provide each user with personalized recommendations, helping users find the information they need faster and more accurately. Looking at successful recommendation engines such as Amazon and Douban, collaborative filtering does not require rigorous modeling of products or users, nor does it require item descriptions to be machine-understandable; it is a domain-independent recommendation method. At the same time, the recommendations it computes are open in the sense that they share other users' experience, which helps users discover latent interests. Collaborative filtering also has several branches with different applicable scenarios and recommendation effects, and you can choose the appropriate method, or a combination of methods, according to the actual situation of your application to get better recommendations.
In addition, this article described how to implement collaborative filtering recommendation algorithms efficiently with Apache Mahout. Apache Mahout focuses on efficient implementations of classic machine learning algorithms on massive data and provides good support for collaborative filtering based recommendation; with Mahout you can easily experience the magic of effective recommendations.
As the first article to go deep into recommendation-engine algorithms, this paper introduced collaborative filtering in depth and showed with examples how to implement collaborative filtering recommenders efficiently with Apache Mahout. However, we also find that running collaborative filtering and other computationally expensive recommendation strategies efficiently on massive data remains very challenging. Among the many methods proposed to reduce the amount of computation, clustering is undoubtedly one of the best choices. The next article in this series will therefore describe various clustering algorithms in detail, including their principles, advantages, disadvantages and applicable scenarios, give efficient implementations of clustering based on Apache Mahout, and analyze how, when building a recommendation engine, introducing clustering can mitigate the huge amount of computation caused by large data volumes and thus provide efficient recommendations.
Finally, thank you for your interest and support in this series.
Resources
Collective Intelligence in Action: explains in detail how to build smart applications using collective intelligence.
Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions: a 2005 paper by Adomavicius and Tuzhilin that gives a detailed summary of the development of recommender systems and their open problems.
Collaborative filtering: Wikipedia's introduction to collaborative filtering, with links to related papers.
Item-Based Collaborative Filtering Recommendation Algorithms: an early paper proposing the Item CF recommendation strategy.
Correlation and dependence: Wikipedia's introduction to correlation and similarity calculations.
Tanimoto coefficient: Wikipedia's introduction to calculating the Tanimoto coefficient.
Cosine similarity: Wikipedia's introduction to cosine similarity.
Coverage of recommender systems: an introduction to how recommendation coverage is calculated.
Slope One: Wikipedia's introduction to the Slope One recommendation method.
Slope One Predictors for Online Rating-Based Collaborative Filtering: the paper that proposed Slope One, with a comprehensive and in-depth introduction to the Slope One prediction approach.
xlvector – Recommender System: an insightful blog covering recommender systems from many angles.
Introducing Apache Mahout: Mahout co-founder Grant Ingersoll introduces the basic concepts of machine learning and demonstrates how to use Mahout to cluster documents, make recommendations, and organize content.
Apache Mahout: the Apache Mahout project home page, with all content related to Mahout.
Apache Mahout recommender algorithms: the framework and installation guide for the recommendation components in Mahout.
Building a social recommendation engine based on Apache Mahout: a developerWorks article published in 2009 on building a recommendation engine with Mahout, detailing the Mahout installation steps and giving an example of a simple movie recommendation engine.
Machine learning: Wikipedia's page on machine learning, a good starting point for learning more about the field.
DeveloperWorks Web Development Zone: Extend your skills in web development with articles and tutorials dedicated to web technology.
DeveloperWorks Ajax Resource Center: This is a one-stop center for information about AJAX programming models, including many documents, tutorials, forums, blogs, wikis, and news. Any new Ajax information can be found here.
The DeveloperWorks Web 2.0 Resource Center, a one-stop Center for Web 2.0-related information, includes a large number of Web 2.0 technical articles, tutorials, downloads, and related technical resources. You can also quickly learn about the concepts of Web 2.0 through the Web 2.0 starter section.
Check out the HTML5 topic for more information and trends related to HTML5.
Explore the secrets inside the recommendation engine, Part 2: An in-depth look at recommendation algorithms - collaborative filtering (II)