Mahout algorithm source code analysis: Item-based Collaborative Filtering (1) PreparePref

Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.

This article analyzes the source code of RecommenderJob. Like the other job classes, RecommenderJob extends AbstractJob and overrides its run() method, which begins by declaring the basic parameters and reading their values (or defaults). Before the first job is launched there is a call to shouldRunNextPhase(); click through to it and you see the following source:

[java]
protected static boolean shouldRunNextPhase(Map<String, List<String>> args, AtomicInteger currentPhase) {
  int phase = currentPhase.getAndIncrement();
  String startPhase = getOption(args, "--startPhase");
  String endPhase = getOption(args, "--endPhase");
  boolean phaseSkipped = (startPhase != null && phase < Integer.parseInt(startPhase))
      || (endPhase != null && phase > Integer.parseInt(endPhase));
  if (phaseSkipped) {
    log.info("Skipping phase {}", phase);
  }
  return !phaseSkipped;
}

phase holds the current phase value (for more details, see the meaning of phase in Mahout). The function compares phase against startPhase and endPhase and returns true or false accordingly. In practice the defaults are used (neither startPhase nor endPhase is set), so in RecommenderJob this function returns true.

Here is the call that launches the first job:

[java]
if (shouldRunNextPhase(parsedArgs, currentPhase)) {
  ToolRunner.run(getConf(), new PreparePreferenceMatrixJob(), new String[]{
      "--input", getInputPath().toString(),
      "--output", prepPath.toString(),
      "--maxPrefsPerUser", String.valueOf(maxPrefsPerUserInItemSimilarity),
      "--minPrefsPerUser", String.valueOf(minPrefsPerUser),
      "--booleanData", String.valueOf(booleanData),
      "--tempDir", getTempPath().toString()});
  numberOfUsers = HadoopUtil.readInt(new Path(prepPath, PreparePreferenceMatrixJob.NUM_USERS), getConf());
}

The main class of this job is PreparePreferenceMatrixJob, and its input parameters are input, output, maxPrefsPerUser, minPrefsPerUser, booleanData, and tempDir. Open PreparePreferenceMatrixJob and take a look: it also extends AbstractJob, so we can go straight to its run() method. Among the parameters set up in run() there is a ratingShift, which is not supplied in this call, so it keeps its default of 0.0. A quick scan shows three prepareJob() calls, so this main class launches three jobs. Let's look at them in turn.

(1) Convert items to an internal index:

[java]
// convert items to an internal index
Job itemIDIndex = prepareJob(getInputPath(), getOutputPath(ITEMID_INDEX), TextInputFormat.class,
    ItemIDIndexMapper.class, VarIntWritable.class, VarLongWritable.class,
    ItemIDIndexReducer.class, VarIntWritable.class, VarLongWritable.class,
    SequenceFileOutputFormat.class);

Input format: userID,itemID,value. First, the mapper:

[java]
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
  String[] tokens = TasteHadoopUtils.splitPrefTokens(value.toString());
  long itemID = Long.parseLong(tokens[transpose ? 0 : 1]);
  int index = TasteHadoopUtils.idToIndex(itemID);
  context.write(new VarIntWritable(index), new VarLongWritable(itemID));
}

The map() first obtains the itemID: tokens[1] holds the itemID, and tokens[0] would be chosen instead only if transpose were set to true. Since that parameter is not supplied in this call, transpose defaults to false, so tokens[1] is used. The conversion from itemID to index is done by TasteHadoopUtils.idToIndex(), whose body is return 0x7FFFFFFF & Longs.hashCode(id); so whenever the id fits in the non-negative int range (at most 2147483647), the function returns the id itself. For example, for item 101 in the practice data, the returned index is also 101.
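To make that concrete, here is a small standalone sketch (my own illustration, not Mahout source) that reproduces the idToIndex() arithmetic; Guava's Longs.hashCode(long) is (int) (value ^ (value >>> 32)), and the mask clears the sign bit:

[java]
// Standalone sketch (not Mahout source) reproducing TasteHadoopUtils.idToIndex().
public class IdToIndexDemo {

  // Guava's Longs.hashCode(long) is (int) (value ^ (value >>> 32));
  // masking with 0x7FFFFFFF clears the sign bit so the index is non-negative.
  static int idToIndex(long id) {
    return 0x7FFFFFFF & (int) (id ^ (id >>> 32));
  }

  public static void main(String[] args) {
    System.out.println(idToIndex(101L));        // 101 -- small ids map to themselves
    System.out.println(idToIndex(3000000000L)); // a large id gets hashed into int range
  }
}

This also shows why two different long ids can land on the same int index, which is what the reducer below has to deal with.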
Now the reducer (ItemIDIndexReducer):

[java]
protected void reduce(VarIntWritable index, Iterable<VarLongWritable> possibleItemIDs, Context context)
    throws IOException, InterruptedException {
  long minimumItemID = Long.MAX_VALUE;
  for (VarLongWritable varLongWritable : possibleItemIDs) {
    long itemID = varLongWritable.get();
    if (itemID < minimumItemID) {
      minimumItemID = itemID;
    }
  }
  if (minimumItemID != Long.MAX_VALUE) {
    context.write(index, new VarLongWritable(minimumItemID));
  }
}

At first this looks unnecessary: for item 101 the reducer just emits 101 --> 101 again. The point is that idToIndex() is a hash, so two different long itemIDs can collide on the same int index; when that happens, the reducer resolves the collision deterministically by keeping the smallest itemID for that index. The output file is ITEMID_INDEX, with format <key, value>: VarIntWritable --> VarLongWritable. That completes the analysis of this job.

(2) Convert user preferences into a vector per user:

[java]
// convert user preferences into a vector per user
Job toUserVectors = prepareJob(getInputPath(), getOutputPath(USER_VECTORS), TextInputFormat.class,
    ToItemPrefsMapper.class, VarLongWritable.class,
    booleanData ? VarLongWritable.class : EntityPrefWritable.class,
    ToUserVectorsReducer.class, VarLongWritable.class, VectorWritable.class,
    SequenceFileOutputFormat.class);

Input format: userID,itemID,value. The mapper is ToItemPrefsMapper, which extends ToEntityPrefsMapper and adds nothing of its own, so the real logic is in ToEntityPrefsMapper:

[java]
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
  String[] tokens = DELIMITER.split(value.toString());
  long userID = Long.parseLong(tokens[0]);
  long itemID = Long.parseLong(tokens[1]);
  if (itemKey ^ transpose) {
    // If using items as keys, and not transposing items and users, then users are items!
    // Or if not using items as keys (users are, as usual), but transposing items and users,
    // then users are items! Confused?
    long temp = userID;
    userID = itemID;
    itemID = temp;
  }
  if (booleanData) {
    context.write(new VarLongWritable(userID), new VarLongWritable(itemID));
  } else {
    float prefValue = tokens.length > 2 ? Float.parseFloat(tokens[2]) + ratingShift : 1.0f;
    context.write(new VarLongWritable(userID), new EntityPrefWritable(itemID, prefValue));
  }
}

The most important code is the last two write() calls. The non-boolean branch computes the preference value: if the input line has a third column, that rating is used, plus ratingShift; since ratingShift is 0.0 here, adding it changes nothing (it would only matter if a caller set it). If there is no rating column, the value defaults to 1.0f. The final output is userID --> [itemID, prefValue].
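As for the itemKey ^ transpose test that the source comment itself jokes about ("Confused?"), a tiny standalone truth-table sketch (my own illustration, treating both flags as plain booleans) makes the swap rule explicit:

[java]
// Standalone sketch (not Mahout source): when does ToEntityPrefsMapper swap userID and itemID?
public class SwapRuleDemo {
  public static void main(String[] args) {
    boolean[] flags = {false, true};
    for (boolean itemKey : flags) {
      for (boolean transpose : flags) {
        // XOR: swap exactly when one, and only one, of the two flags is set.
        System.out.println("itemKey=" + itemKey + ", transpose=" + transpose
            + " -> swap=" + (itemKey ^ transpose));
      }
    }
  }
}

In this phase transpose is left at its default (false), and since the output here is keyed on users rather than items, no swap happens: tokens[0] and tokens[1] keep their usual userID and itemID meanings.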
Now the reducer (ToUserVectorsReducer):

[java]
protected void reduce(VarLongWritable userID, Iterable<VarLongWritable> itemPrefs, Context context)
    throws IOException, InterruptedException {
  Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
  for (VarLongWritable itemPref : itemPrefs) {
    int index = TasteHadoopUtils.idToIndex(itemPref.get());
    float value = itemPref instanceof EntityPrefWritable
        ? ((EntityPrefWritable) itemPref).getPrefValue() : 1.0f;
    userVector.set(index, value);
  }
  if (userVector.getNumNondefaultElements() >= minPreferences) {
    VectorWritable vw = new VectorWritable(userVector);
    vw.setWritesLaxPrecision(true);
    context.getCounter(Counters.USERS).increment(1);
    context.write(userID, vw);
  }
}

First, note that the values emitted by the mapper are EntityPrefWritable, yet the Iterable here receives them as VarLongWritable; that works because the former extends the latter. The reducer then writes all of a user's ratings into one vector, using the item index as the vector subscript and prefValue as the value. Finally comes the filter: the vector is written out only if it contains at least minPreferences items (which is exactly what the minPrefsPerUser parameter means); otherwise it is dropped. A Counters.USERS counter is also incremented to count the users. The output of this job is USER_VECTORS, in the format <key, value>: userID --> vector[itemID:prefValue, itemID:prefValue, ...].

The code then retrieves the number of users from that counter:

[java]
int numberOfUsers = (int) toUserVectors.getCounters()
    .findCounter(ToUserVectorsReducer.Counters.USERS).getValue();
HadoopUtil.writeInt(numberOfUsers, getOutputPath(NUM_USERS), getConf());

(3) Build the rating matrix:

[java]
// build the rating matrix
Job toItemVectors = prepareJob(getOutputPath(USER_VECTORS), getOutputPath(RATING_MATRIX),
    ToItemVectorsMapper.class, IntWritable.class, VectorWritable.class,
    ToItemVectorsReducer.class, IntWritable.class, VectorWritable.class);

Its input is the output of the second job, in the format <key, value>: userID --> vector[itemID:prefValue, itemID:prefValue, ...]. First the mapper:

[java]
protected void map(VarLongWritable rowIndex, VectorWritable vectorWritable, Context ctx)
    throws IOException, InterruptedException {
  Vector userRatings = vectorWritable.get();
  int numElementsBeforeSampling = userRatings.getNumNondefaultElements();
  userRatings = Vectors.maybeSample(userRatings, sampleSize);
  int numElementsAfterSampling = userRatings.getNumNondefaultElements();
  int column = TasteHadoopUtils.idToIndex(rowIndex.get());
  VectorWritable itemVector = new VectorWritable(new RandomAccessSparseVector(Integer.MAX_VALUE, 1));
  itemVector.setWritesLaxPrecision(true);
  Iterator<Vector.Element> iterator = userRatings.iterateNonZero();
  while (iterator.hasNext()) {
    Vector.Element elem = iterator.next();
    itemVector.get().setQuick(column, elem.get());
    ctx.write(new IntWritable(elem.index()), itemVector);
  }
  ctx.getCounter(Elements.USER_RATINGS_USED).increment(numElementsAfterSampling);
  ctx.getCounter(Elements.USER_RATINGS_NEGLECTED).increment(numElementsBeforeSampling - numElementsAfterSampling);
}

Consider the call userRatings = Vectors.maybeSample(userRatings, sampleSize). If sampleSize were left unset it would default to the maximum Integer value, and since the number of non-default elements in a vector is always smaller than that, maybeSample() would return the vector unchanged (we will see below that sampleSize actually does get a value here):

[java]
public static Vector maybeSample(Vector original, int sampleSize) {
  if (original.getNumNondefaultElements() <= sampleSize) {
    return original;
  }
  Vector sample = original.like();
  Iterator<Vector.Element> sampledElements =
      new FixedSizeSamplingIterator<Vector.Element>(sampleSize, original.iterateNonZero());
  while (sampledElements.hasNext()) {
    Vector.Element elem = sampledElements.next();
    sample.setQuick(elem.index(), elem.get());
  }
  return sample;
}

In the map() function, column is the user index, the output key elem.index() is the item index, and itemVector.get().setQuick(column, elem.get()) puts itemVector into [userID:prefValue] form. The mapper output is therefore itemID --> vector[userID:prefValue].
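Before moving on to the reducer, a note on FixedSizeSamplingIterator: it keeps a uniform fixed-size sample of a stream whose length is not known up front. A minimal reservoir-sampling sketch in that spirit (my own illustration, not Mahout's actual implementation) looks like this:

[java]
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir-sampling sketch (not Mahout source) in the spirit of FixedSizeSamplingIterator:
// keep at most sampleSize elements, each with equal probability, in a single pass.
public class ReservoirDemo {
  static List<Integer> sample(List<Integer> source, int sampleSize, Random rng) {
    List<Integer> reservoir = new ArrayList<Integer>(sampleSize);
    int seen = 0;
    for (int elem : source) {
      seen++;
      if (reservoir.size() < sampleSize) {
        reservoir.add(elem);            // fill the reservoir first
      } else {
        int slot = rng.nextInt(seen);   // uniform in [0, seen)
        if (slot < sampleSize) {
          reservoir.set(slot, elem);    // keep elem with probability sampleSize/seen
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> prefs = new ArrayList<Integer>();
    for (int i = 0; i < 1000; i++) {
      prefs.add(i);
    }
    // With maxPrefsPerUser = 100, a user with 1000 prefs keeps only 100 of them;
    // with only 7 items in the practice data, maybeSample() returns the vector untouched.
    System.out.println(sample(prefs, 100, new Random()).size()); // 100
  }
}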
There are also two counters; because numElementsBeforeSampling - numElementsAfterSampling = 0 on the practice data, the Elements.USER_RATINGS_NEGLECTED counter stays at zero. Now the reducer (ToItemVectorsReducer):

[java]
protected void reduce(IntWritable row, Iterable<VectorWritable> vectors, Context ctx)
    throws IOException, InterruptedException {
  VectorWritable vectorWritable = VectorWritable.merge(vectors.iterator());
  vectorWritable.setWritesLaxPrecision(true);
  ctx.write(row, vectorWritable);
}

The merge() call combines the mapper's single-entry outputs into the format itemID --> vector[userID:prefValue, userID:prefValue, ...]. So the output of this job is RATING_MATRIX, with format <key, value>: itemID --> vector[userID:prefValue, userID:prefValue, ...].

One correction: sampleSize does have a value rather than the default Integer maximum, because maxPrefsPerUser is passed through to the mapper's configuration:

[java]
if (hasOption("maxPrefsPerUser")) {
  int samplingSize = Integer.parseInt(getOption("maxPrefsPerUser"));
  toItemVectors.getConfiguration().setInt(ToItemVectorsMapper.SAMPLE_SIZE, samplingSize);
}

So now we know what the maxPrefsPerUser value is for: it caps how many preferences per user are kept when building the rating matrix. Its default is 100, however, and the practice data contains only 7 items in total, so numElementsBeforeSampling - numElementsAfterSampling = 0 still holds. With that, this job has been fully analyzed.
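Putting job (3) together, here is a plain-Java sketch (made-up data, no Hadoop or Mahout types) of the transpose that the mapper/reducer pair performs:

[java]
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch (not Mahout source) of job (3)'s transpose:
// userID -> {itemIndex: pref} rows become itemIndex -> {userIndex: pref} rows.
public class TransposeDemo {
  public static void main(String[] args) {
    // USER_VECTORS-style input: user 1 rated items 101 and 102, user 2 rated item 101.
    Map<Long, Map<Integer, Double>> userVectors = new HashMap<Long, Map<Integer, Double>>();
    Map<Integer, Double> u1 = new HashMap<Integer, Double>();
    u1.put(101, 5.0);
    u1.put(102, 3.0);
    userVectors.put(1L, u1);
    Map<Integer, Double> u2 = new HashMap<Integer, Double>();
    u2.put(101, 4.0);
    userVectors.put(2L, u2);

    // "Map" phase: emit (itemIndex, {userIndex: pref}); "reduce" phase: merge per item.
    Map<Integer, Map<Integer, Double>> itemVectors = new HashMap<Integer, Map<Integer, Double>>();
    for (Map.Entry<Long, Map<Integer, Double>> user : userVectors.entrySet()) {
      long userID = user.getKey();
      int column = 0x7FFFFFFF & (int) (userID ^ (userID >>> 32)); // idToIndex(userID)
      for (Map.Entry<Integer, Double> pref : user.getValue().entrySet()) {
        if (!itemVectors.containsKey(pref.getKey())) {
          itemVectors.put(pref.getKey(), new HashMap<Integer, Double>());
        }
        itemVectors.get(pref.getKey()).put(column, pref.getValue());
      }
    }
    // Expected content: {101={1=5.0, 2=4.0}, 102={1=3.0}} (map ordering may vary).
    System.out.println(itemVectors);
  }
}

The real job keys its reduce phase on elem.index() and merges the single-entry vectors with VectorWritable.merge(), but the shape of the data is exactly this.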
