"Gandalf." Distributed ITEMCF recommendation algorithm based on mahout0.9+cdh5.2 operation


Environment:
hadoop-2.5.0-cdh5.2.0
mahout-0.9-cdh5.2.0

Introduction

Although Mahout has announced that it will no longer develop its MapReduce-based algorithms and will migrate to Spark, the reality is that our company's cluster does not have enough memory to feed Spark, that beast that eats nothing but memory. Add project deadlines and the developers' current skill set, and I had to keep using Mahout for a while longer. Today I record the command-line procedure for ItemCF on Hadoop.

History

I have read several earlier articles about Mahout ItemCF on Hadoop programming, describing how to implement ItemCF on Hadoop with Mahout. Since I had no time to research it myself, I always followed the predecessors' practice, for example with code like the following, which appears frequently on the major blogs:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class ItemCFHadoop {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ItemCFHadoop.class);
        GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
        String[] remainingArgs = optionParser.getRemainingArgs();
        if (remainingArgs.length != 5) {
            System.out.println("args length: " + remainingArgs.length);
            System.err.println("Usage: hadoop jar <jarname> <package>.ItemCFHadoop "
                    + "<inputpath> <outputpath> <tmppath> <booleanData> <similarityClassname>");
            System.exit(2);
        }
        System.out.println("input: " + remainingArgs[0]);
        System.out.println("output: " + remainingArgs[1]);
        System.out.println("tempDir: " + remainingArgs[2]);
        System.out.println("booleanData: " + remainingArgs[3]);
        System.out.println("similarityClassname: " + remainingArgs[4]);

        // Rebuild a RecommenderJob command line from the positional arguments
        StringBuilder sb = new StringBuilder();
        sb.append("--input ").append(remainingArgs[0]);
        sb.append(" --output ").append(remainingArgs[1]);
        sb.append(" --tempDir ").append(remainingArgs[2]);
        sb.append(" --booleanData ").append(remainingArgs[3]);
        sb.append(" --similarityClassname ").append(remainingArgs[4]);

        conf.setJobName("ItemCFHadoop");
        RecommenderJob job = new RecommenderJob();
        job.setConf(conf);
        job.run(sb.toString().split(" "));
    }
}
```
The above code works: as long as the correct parameters are passed on the command line, it completes the ItemCF on Hadoop task. But if all the Java code does is rebuild a command line, why not run the job from the command line directly?

Official information

The predecessors have shown the way: the ItemCF on Hadoop task is implemented by the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob class. The official Javadoc (https://builds.apache.org/job/Mahout-Quality/javadoc/) describes this class as follows:
Runs a completely distributed recommender job as a series of mapreduces. Preferences in the input file should look like userID, itemID[, preferencevalue]. Preference value is optional to accommodate applications that have no notion of a preference value (that is, the user simply expresses a preference for an item, but no degree of preference). The preference value is assumed to be parseable as a double. The user IDs and item IDs are parsed as longs. Command line arguments specific to this class are:

--input (path): Directory containing one or more text files with the preference data
--output (path): Output path where recommender output should go
--tempDir (path): Specifies a directory where the job may place temp files (default "temp")
--similarityClassname (classname): Name of vector similarity class to instantiate or a predefined similarity from VectorSimilarityMeasure
--usersFile (path): Compute recommendations for user IDs contained in this file (optional)
--itemsFile (path): Only include item IDs from this file in the recommendations (optional)
--filterFile (path): File containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user (optional)
--numRecommendations (integer): Number of recommendations to compute per user (10)
--booleanData (boolean): Treat input data as having no pref values (false)
--maxPrefsPerUser (integer): Maximum number of preferences considered per user in final recommendation phase (10)
--maxSimilaritiesPerItem (integer): Maximum number of similarities considered per item (100)
--minPrefsPerUser (integer): Ignore users with less preferences than this in the similarity computation (1)
--maxPrefsPerUserInItemSimilarity (integer): Max number of preferences to consider per user in the item similarity computation phase; users with more preferences will be sampled down (1000)
--threshold (double): Discard item pairs with a similarity value below this
The original text is kept above for readers comfortable with English; a translation follows:

Runs a fully distributed recommendation job, implemented as a series of MapReduce tasks. The preference data in the input file has the format userID,itemID[,preferenceValue], where preferenceValue is optional. userID and itemID are parsed as long, and preferenceValue as double. The command-line arguments the class accepts are as follows:
  • --input (path): directory containing one or more text files that store the user preference data;
  • --output (path): output directory for the recommendation results;
  • --tempDir (path): directory where temporary files are stored;
  • --similarityClassname (classname): vector similarity class. The available similarity measures include CityBlockSimilarity, CooccurrenceCountSimilarity, CosineSimilarity, CountbasedMeasure, EuclideanDistanceSimilarity, LoglikelihoodSimilarity, PearsonCorrelationSimilarity, and TanimotoCoefficientSimilarity. Note that the fully qualified class name (with package) must be given;
  • --usersFile (path): path to one or more files of user IDs; recommendations are computed only for the user IDs contained in the files under that path (optional);
  • --itemsFile (path): path to one or more files of item IDs; only the item IDs contained in the files under that path are included in the recommendations (optional);
  • --filterFile (path): path to files containing comma-separated userID,itemID pairs; each listed pair is excluded from that user's recommendations (optional);
  • --numRecommendations (integer): number of items to recommend per user; default 10;
  • --booleanData (boolean): set this to true if the input data contains no preference values; default false;
  • --maxPrefsPerUser (integer): maximum number of preferences per user considered in the final recommendation phase; default 10;
  • --maxSimilaritiesPerItem (integer): maximum number of similarities considered per item; default 100;
  • --minPrefsPerUser (integer): in the similarity computation, ignore users with fewer preferences than this value; default 1;
  • --maxPrefsPerUserInItemSimilarity (integer): maximum number of preferences per user considered in the item similarity computation phase; users with more preferences are sampled down; default 1000;
  • --threshold (double): discard item pairs whose similarity is below this value.
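To give a feel for what --similarityClassname selects: the log-likelihood measure scores a pair of items purely from their boolean co-occurrence counts across users. The sketch below is my own plain-Java rendering of the idea (assumed to mirror Mahout's org.apache.mahout.math.stats.LogLikelihood and the 1 - 1/(1 + LLR) mapping used by LoglikelihoodSimilarity; check the Mahout source for the authoritative version):

```java
// Sketch of log-likelihood-ratio item similarity (not Mahout code).
// Contingency counts for items A and B over all users:
//   k11 = users who rated both, k12 = only A, k21 = only B, k22 = neither.
public class LogLikelihoodSketch {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts (Mahout's convention):
    // xLogX(sum) - sum of xLogX(count)
    private static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long c : counts) {
            sumXLogX += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - sumXLogX;
    }

    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0; // guard against round-off producing a negative ratio
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    // Map the ratio into [0, 1): higher LLR means more surprising co-occurrence
    public static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }
}
```

For example, with the test data below, items 102 and 105 give k11=1 (user 5), k12=2 (users 1, 2), k21=1 (user 3), k22=1 (user 4), yielding a small but non-zero similarity; an item rated by every user contributes no information and scores 0 against anything.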
Command-line execution

User preference data for testing, in the format userID,itemID,preferenceValue:

1,101,2
1,102,5
1,103,1
2,101,1
2,102,3
2,103,2
2,104,6
3,101,1
3,104,1
3,105,1
3,107,2
4,101,2
4,103,2
4,104,5
4,106,3
5,101,3
5,102,5
5,103,6
5,104,8
5,105,1
5,106,1
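Each input line follows the userID,itemID[,preferenceValue] format described above (IDs as long, the optional value as double). A minimal sketch of a parser for that format (my own illustration, not Mahout code; the 1.0 default for a missing value is an assumption matching the boolean-data case):

```java
// Parses one line of "userID,itemID[,preferenceValue]" preference data.
public class PreferenceLine {
    public final long userId;
    public final long itemId;
    public final double value; // defaults to 1.0 when the value column is absent

    public PreferenceLine(long userId, long itemId, double value) {
        this.userId = userId;
        this.itemId = itemId;
        this.value = value;
    }

    public static PreferenceLine parse(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length < 2 || parts.length > 3) {
            throw new IllegalArgumentException(
                    "Expected userID,itemID[,preferenceValue]: " + line);
        }
        long userId = Long.parseLong(parts[0].trim());  // user IDs parse as long
        long itemId = Long.parseLong(parts[1].trim());  // item IDs parse as long
        double value = parts.length == 3
                ? Double.parseDouble(parts[2].trim())   // preference parses as double
                : 1.0;                                  // assumed default, no pref value
        return new PreferenceLine(userId, itemId, value);
    }
}
```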
Once the underlying environment is configured, execute the following command to launch the ItemCF on Hadoop computation:

```shell
hadoop jar $MAHOUT_HOME/mahout-core-0.9-cdh5.2.0-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input /userpreference \
  --output /cfoutput \
  --tempDir /tmp \
  --similarityClassname org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.LoglikelihoodSimilarity
```

Note: only the most important parameters are used here; tune the remaining parameters against your actual project.
Calculation results"UserID    [Itemid1:score1,itemid2:score2 ...]":1 [104:3.4706533,106:1.7326527,105:1.5989419]2 [106:3.8991857,105:3.691359]3 [106:1.0,103:1.0,102:1.0]4 [105:3.2909648,102:3.2909648]5 [107:3.2898135]

"Gandalf." Distributed ITEMCF recommendation algorithm based on mahout0.9+cdh5.2 operation
