Mahout algorithm source code analysis: Item-based Collaborative Filtering (1) PreparePref

Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.

This article analyzes the source code of RecommenderJob. Like the other job classes, RecommenderJob extends AbstractJob and overrides its run() method, which begins by declaring the basic parameters and reading their values (or defaults). Before the first job is launched there is a call to shouldRunNextPhase(); click through to it and you see the following source:

[java]
protected static boolean shouldRunNextPhase(Map<String, List<String>> args, AtomicInteger currentPhase) {
  int phase = currentPhase.getAndIncrement();
  String startPhase = getOption(args, "--startPhase");
  String endPhase = getOption(args, "--endPhase");
  boolean phaseSkipped = (startPhase != null && phase < Integer.parseInt(startPhase))
      || (endPhase != null && phase > Integer.parseInt(endPhase));
  if (phaseSkipped) {
    log.info("Skipping phase {}", phase);
  }
  return !phaseSkipped;
}

phase holds the current phase value (for more details, see the meaning of phase in Mahout). The function compares phase against startPhase and endPhase and returns true or false accordingly. In practice the defaults are used (neither startPhase nor endPhase is set), so in RecommenderJob this function returns true.

Here is the call that launches the first job:

[java]
if (shouldRunNextPhase(parsedArgs, currentPhase)) {
  ToolRunner.run(getConf(), new PreparePreferenceMatrixJob(), new String[]{
      "--input", getInputPath().toString(),
      "--output", prepPath.toString(),
      "--maxPrefsPerUser", String.valueOf(maxPrefsPerUserInItemSimilarity),
      "--minPrefsPerUser", String.valueOf(minPrefsPerUser),
      "--booleanData", String.valueOf(booleanData),
      "--tempDir", getTempPath().toString()});
  numberOfUsers = HadoopUtil.readInt(new Path(prepPath, PreparePreferenceMatrixJob.NUM_USERS), getConf());
}

The main class of this job is PreparePreferenceMatrixJob, and its input parameters are input, output, maxPrefsPerUser, minPrefsPerUser, booleanData, and tempDir. Open PreparePreferenceMatrixJob and take a look: it also extends AbstractJob, so we can go straight to its run() method. Among the parameters set up in run() there is a ratingShift, which is not supplied in this call, so it keeps its default of 0.0. A quick scan shows three prepareJob() calls, so this main class launches three jobs. Let's look at them in turn.

(1) Convert items to an internal index:

[java]
// convert items to an internal index
Job itemIDIndex = prepareJob(getInputPath(), getOutputPath(ITEMID_INDEX), TextInputFormat.class,
    ItemIDIndexMapper.class, VarIntWritable.class, VarLongWritable.class,
    ItemIDIndexReducer.class, VarIntWritable.class, VarLongWritable.class,
    SequenceFileOutputFormat.class);

Input format: userID,itemID,value. First, the mapper:

[java]
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
  String[] tokens = TasteHadoopUtils.splitPrefTokens(value.toString());
  long itemID = Long.parseLong(tokens[transpose ? 0 : 1]);
  int index = TasteHadoopUtils.idToIndex(itemID);
  context.write(new VarIntWritable(index), new VarLongWritable(itemID));
}

The map() first obtains the itemID: tokens[1] holds the itemID, and tokens[0] would be chosen instead only if transpose were set to true. Since that parameter is not supplied in this call, transpose defaults to false, so tokens[1] is used. The conversion from itemID to index is done by TasteHadoopUtils.idToIndex(), whose body is return 0x7FFFFFFF & Longs.hashCode(id); so whenever the id fits in the non-negative int range (at most 2147483647), the function returns the id itself. For example, for item 101 in the practice data, the returned index is also 101.
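To make that concrete, here is a small standalone sketch (my own illustration, not Mahout source) that reproduces the idToIndex() arithmetic; Guava's Longs.hashCode(long) is (int) (value ^ (value >>> 32)), and the mask clears the sign bit:

[java]
// Standalone sketch (not Mahout source) reproducing TasteHadoopUtils.idToIndex().
public class IdToIndexDemo {

  // Guava's Longs.hashCode(long) is (int) (value ^ (value >>> 32));
  // masking with 0x7FFFFFFF clears the sign bit so the index is non-negative.
  static int idToIndex(long id) {
    return 0x7FFFFFFF & (int) (id ^ (id >>> 32));
  }

  public static void main(String[] args) {
    System.out.println(idToIndex(101L));        // 101 -- small ids map to themselves
    System.out.println(idToIndex(3000000000L)); // a large id gets hashed into int range
  }
}

This also shows why two different long ids can land on the same int index, which is what the reducer below has to deal with.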
Now the reducer (ItemIDIndexReducer):

[java]
protected void reduce(VarIntWritable index, Iterable<VarLongWritable> possibleItemIDs, Context context)
    throws IOException, InterruptedException {
  long minimumItemID = Long.MAX_VALUE;
  for (VarLongWritable varLongWritable : possibleItemIDs) {
    long itemID = varLongWritable.get();
    if (itemID < minimumItemID) {
      minimumItemID = itemID;
    }
  }
  if (minimumItemID != Long.MAX_VALUE) {
    context.write(index, new VarLongWritable(minimumItemID));
  }
}

At first this looks unnecessary: for item 101 the reducer just emits 101 --> 101 again. The point is that idToIndex() is a hash, so two different long itemIDs can collide on the same int index; when that happens, the reducer resolves the collision deterministically by keeping the smallest itemID for that index. The output file is ITEMID_INDEX, with format <key, value>: VarIntWritable --> VarLongWritable. That completes the analysis of this job.

(2) Convert user preferences into a vector per user:

[java]
// convert user preferences into a vector per user
Job toUserVectors = prepareJob(getInputPath(), getOutputPath(USER_VECTORS), TextInputFormat.class,
    ToItemPrefsMapper.class, VarLongWritable.class,
    booleanData ? VarLongWritable.class : EntityPrefWritable.class,
    ToUserVectorsReducer.class, VarLongWritable.class, VectorWritable.class,
    SequenceFileOutputFormat.class);

Input format: userID,itemID,value. The mapper is ToItemPrefsMapper, which extends ToEntityPrefsMapper and adds nothing of its own, so the real logic is in ToEntityPrefsMapper:

[java]
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
  String[] tokens = DELIMITER.split(value.toString());
  long userID = Long.parseLong(tokens[0]);
  long itemID = Long.parseLong(tokens[1]);
  if (itemKey ^ transpose) {
    // If using items as keys, and not transposing items and users, then users are items!
    // Or if not using items as keys (users are, as usual), but transposing items and users,
    // then users are items! Confused?
    long temp = userID;
    userID = itemID;
    itemID = temp;
  }
  if (booleanData) {
    context.write(new VarLongWritable(userID), new VarLongWritable(itemID));
  } else {
    float prefValue = tokens.length > 2 ? Float.parseFloat(tokens[2]) + ratingShift : 1.0f;
    context.write(new VarLongWritable(userID), new EntityPrefWritable(itemID, prefValue));
  }
}

The most important code is the last two write() calls. The non-boolean branch computes the preference value: if the input line has a third column, that rating is used, plus ratingShift; since ratingShift is 0.0 here, adding it changes nothing (it would only matter if a caller set it). If there is no rating column, the value defaults to 1.0f. The final output is userID --> [itemID, prefValue].
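As for the itemKey ^ transpose test that the source comment itself jokes about ("Confused?"), a tiny standalone truth-table sketch (my own illustration, treating both flags as plain booleans) makes the swap rule explicit:

[java]
// Standalone sketch (not Mahout source): when does ToEntityPrefsMapper swap userID and itemID?
public class SwapRuleDemo {
  public static void main(String[] args) {
    boolean[] flags = {false, true};
    for (boolean itemKey : flags) {
      for (boolean transpose : flags) {
        // XOR: swap exactly when one, and only one, of the two flags is set.
        System.out.println("itemKey=" + itemKey + ", transpose=" + transpose
            + " -> swap=" + (itemKey ^ transpose));
      }
    }
  }
}

In this phase transpose is left at its default (false), and since the output here is keyed on users rather than items, no swap happens: tokens[0] and tokens[1] keep their usual userID and itemID meanings.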
Now the reducer (ToUserVectorsReducer):

[java]
protected void reduce(VarLongWritable userID, Iterable<VarLongWritable> itemPrefs, Context context)
    throws IOException, InterruptedException {
  Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
  for (VarLongWritable itemPref : itemPrefs) {
    int index = TasteHadoopUtils.idToIndex(itemPref.get());
    float value = itemPref instanceof EntityPrefWritable
        ? ((EntityPrefWritable) itemPref).getPrefValue() : 1.0f;
    userVector.set(index, value);
  }
  if (userVector.getNumNondefaultElements() >= minPreferences) {
    VectorWritable vw = new VectorWritable(userVector);
    vw.setWritesLaxPrecision(true);
    context.getCounter(Counters.USERS).increment(1);
    context.write(userID, vw);
  }
}

First, note that the values emitted by the mapper are EntityPrefWritable, yet the Iterable here receives them as VarLongWritable; that works because the former extends the latter. The reducer then writes all of a user's ratings into one vector, using the item index as the vector subscript and prefValue as the value. Finally comes the filter: the vector is written out only if it contains at least minPreferences items (which is exactly what the minPrefsPerUser parameter means); otherwise it is dropped. A Counters.USERS counter is also incremented to count the users. The output of this job is USER_VECTORS, in the format <key, value>: userID --> vector[itemID:prefValue, itemID:prefValue, ...].

The code then retrieves the number of users from that counter:

[java]
int numberOfUsers = (int) toUserVectors.getCounters()
    .findCounter(ToUserVectorsReducer.Counters.USERS).getValue();
HadoopUtil.writeInt(numberOfUsers, getOutputPath(NUM_USERS), getConf());

(3) Build the rating matrix:

[java]
// build the rating matrix
Job toItemVectors = prepareJob(getOutputPath(USER_VECTORS), getOutputPath(RATING_MATRIX),
    ToItemVectorsMapper.class, IntWritable.class, VectorWritable.class,
    ToItemVectorsReducer.class, IntWritable.class, VectorWritable.class);

Its input is the output of the second job, in the format <key, value>: userID --> vector[itemID:prefValue, itemID:prefValue, ...]. First the mapper:

[java]
protected void map(VarLongWritable rowIndex, VectorWritable vectorWritable, Context ctx)
    throws IOException, InterruptedException {
  Vector userRatings = vectorWritable.get();
  int numElementsBeforeSampling = userRatings.getNumNondefaultElements();
  userRatings = Vectors.maybeSample(userRatings, sampleSize);
  int numElementsAfterSampling = userRatings.getNumNondefaultElements();
  int column = TasteHadoopUtils.idToIndex(rowIndex.get());
  VectorWritable itemVector = new VectorWritable(new RandomAccessSparseVector(Integer.MAX_VALUE, 1));
  itemVector.setWritesLaxPrecision(true);
  Iterator<Vector.Element> iterator = userRatings.iterateNonZero();
  while (iterator.hasNext()) {
    Vector.Element elem = iterator.next();
    itemVector.get().setQuick(column, elem.get());
    ctx.write(new IntWritable(elem.index()), itemVector);
  }
  ctx.getCounter(Elements.USER_RATINGS_USED).increment(numElementsAfterSampling);
  ctx.getCounter(Elements.USER_RATINGS_NEGLECTED).increment(numElementsBeforeSampling - numElementsAfterSampling);
}

Consider the call userRatings = Vectors.maybeSample(userRatings, sampleSize). If sampleSize were left unset it would default to the maximum Integer value, and since the number of non-default elements in a vector is always smaller than that, maybeSample() would return the vector unchanged (we will see below that sampleSize actually does get a value here):

[java]
public static Vector maybeSample(Vector original, int sampleSize) {
  if (original.getNumNondefaultElements() <= sampleSize) {
    return original;
  }
  Vector sample = original.like();
  Iterator<Vector.Element> sampledElements =
      new FixedSizeSamplingIterator<Vector.Element>(sampleSize, original.iterateNonZero());
  while (sampledElements.hasNext()) {
    Vector.Element elem = sampledElements.next();
    sample.setQuick(elem.index(), elem.get());
  }
  return sample;
}

In the map() function, column is the user index, the output key elem.index() is the item index, and itemVector.get().setQuick(column, elem.get()) puts itemVector into [userID:prefValue] form. The mapper output is therefore itemID --> vector[userID:prefValue].
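Before moving on to the reducer, a note on FixedSizeSamplingIterator: it keeps a uniform fixed-size sample of a stream whose length is not known up front. A minimal reservoir-sampling sketch in that spirit (my own illustration, not Mahout's actual implementation) looks like this:

[java]
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir-sampling sketch (not Mahout source) in the spirit of FixedSizeSamplingIterator:
// keep at most sampleSize elements, each with equal probability, in a single pass.
public class ReservoirDemo {
  static List<Integer> sample(List<Integer> source, int sampleSize, Random rng) {
    List<Integer> reservoir = new ArrayList<Integer>(sampleSize);
    int seen = 0;
    for (int elem : source) {
      seen++;
      if (reservoir.size() < sampleSize) {
        reservoir.add(elem);            // fill the reservoir first
      } else {
        int slot = rng.nextInt(seen);   // uniform in [0, seen)
        if (slot < sampleSize) {
          reservoir.set(slot, elem);    // keep elem with probability sampleSize/seen
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> prefs = new ArrayList<Integer>();
    for (int i = 0; i < 1000; i++) {
      prefs.add(i);
    }
    // With maxPrefsPerUser = 100, a user with 1000 prefs keeps only 100 of them;
    // with only 7 items in the practice data, maybeSample() returns the vector untouched.
    System.out.println(sample(prefs, 100, new Random()).size()); // 100
  }
}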
There are also two counters; because numElementsBeforeSampling - numElementsAfterSampling = 0 on the practice data, the Elements.USER_RATINGS_NEGLECTED counter stays at zero. Now the reducer (ToItemVectorsReducer):

[java]
protected void reduce(IntWritable row, Iterable<VectorWritable> vectors, Context ctx)
    throws IOException, InterruptedException {
  VectorWritable vectorWritable = VectorWritable.merge(vectors.iterator());
  vectorWritable.setWritesLaxPrecision(true);
  ctx.write(row, vectorWritable);
}

The merge() call combines the mapper's single-entry outputs into the format itemID --> vector[userID:prefValue, userID:prefValue, ...]. So the output of this job is RATING_MATRIX, with format <key, value>: itemID --> vector[userID:prefValue, userID:prefValue, ...].

One correction: sampleSize does have a value rather than the default Integer maximum, because maxPrefsPerUser is passed through to the mapper's configuration:

[java]
if (hasOption("maxPrefsPerUser")) {
  int samplingSize = Integer.parseInt(getOption("maxPrefsPerUser"));
  toItemVectors.getConfiguration().setInt(ToItemVectorsMapper.SAMPLE_SIZE, samplingSize);
}

So now we know what the maxPrefsPerUser value is for: it caps how many preferences per user are kept when building the rating matrix. Its default is 100, however, and the practice data contains only 7 items in total, so numElementsBeforeSampling - numElementsAfterSampling = 0 still holds. With that, this job has been fully analyzed.
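Putting job (3) together, here is a plain-Java sketch (made-up data, no Hadoop or Mahout types) of the transpose that the mapper/reducer pair performs:

[java]
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch (not Mahout source) of job (3)'s transpose:
// userID -> {itemIndex: pref} rows become itemIndex -> {userIndex: pref} rows.
public class TransposeDemo {
  public static void main(String[] args) {
    // USER_VECTORS-style input: user 1 rated items 101 and 102, user 2 rated item 101.
    Map<Long, Map<Integer, Double>> userVectors = new HashMap<Long, Map<Integer, Double>>();
    Map<Integer, Double> u1 = new HashMap<Integer, Double>();
    u1.put(101, 5.0);
    u1.put(102, 3.0);
    userVectors.put(1L, u1);
    Map<Integer, Double> u2 = new HashMap<Integer, Double>();
    u2.put(101, 4.0);
    userVectors.put(2L, u2);

    // "Map" phase: emit (itemIndex, {userIndex: pref}); "reduce" phase: merge per item.
    Map<Integer, Map<Integer, Double>> itemVectors = new HashMap<Integer, Map<Integer, Double>>();
    for (Map.Entry<Long, Map<Integer, Double>> user : userVectors.entrySet()) {
      long userID = user.getKey();
      int column = 0x7FFFFFFF & (int) (userID ^ (userID >>> 32)); // idToIndex(userID)
      for (Map.Entry<Integer, Double> pref : user.getValue().entrySet()) {
        if (!itemVectors.containsKey(pref.getKey())) {
          itemVectors.put(pref.getKey(), new HashMap<Integer, Double>());
        }
        itemVectors.get(pref.getKey()).put(column, pref.getValue());
      }
    }
    // Expected content: {101={1=5.0, 2=4.0}, 102={1=3.0}} (map ordering may vary).
    System.out.println(itemVectors);
  }
}

The real job keys its reduce phase on elem.index() and merges the single-entry vectors with VectorWritable.merge(), but the shape of the data is exactly this.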
