Mahout Hadoop-based CF Code Analysis (RPM)


From: http://www.codesky.net/article/201206/171862.html

Mahout's Taste framework is an implementation of collaborative filtering. It supports several DataModel backends, such as files, databases, and NoSQL stores, and it also supports Hadoop MapReduce. The analysis here focuses on the MapReduce-based implementation.

The main flow of the MapReduce-based CF lives in the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob class (note that Mahout has two RecommenderJob classes, so check carefully which package you are using). The run method of this class contains all the steps. From top to bottom there are really 10 phases (the intermediate item-similarity calculation is actually split into 3 jobs, but we count it as one phase), so if all necessary parameters are specified, running the item-based CF algorithm executes 12 jobs. Some of the steps can be skipped, as described below. Here is a detailed analysis of each step:

Phase1: itemIDIndex

This step converts each itemID into an int index. The design is actually a little problematic: the index is an int, so if the number of items is very large (for example, beyond the range of int), collisions can occur. Using a long would be more appropriate.

Input: the user-rating file (this is also our most primitive input), generally in the format: userID \t itemID \t score. Note that the input must be a TextFile. Perhaps for the convenience of testing, many Mahout jobs output in TextFile format by default.

Map: (index, itemID)

Reduce: (index, itemID)
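The collision risk is easy to demonstrate. Below is a minimal sketch (an illustration, not Mahout's actual code) of hashing a 64-bit itemID down to a non-negative 31-bit int, Java-style; two IDs that differ only across the two 32-bit halves collide:

```python
def id_to_index(item_id: int) -> int:
    """Illustrative hash of a 64-bit itemID down to a non-negative 31-bit int
    (Java-style: XOR the high and low 32 bits, then mask off the sign bit)."""
    h = (item_id >> 32) ^ (item_id & 0xFFFFFFFF)
    return h & 0x7FFFFFFF

# Two distinct itemIDs that differ only across the 32-bit halves collide:
collision = id_to_index(5) == id_to_index((1 << 32) | 4)
```

Once more than ~2^31 distinct itemIDs exist, such collisions are guaranteed by the pigeonhole principle, which is why a long index would be safer.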

Phase2: toUserVector

Input: the user-rating file

Param: --userBooleanData. If this parameter is true, the rating column is ignored; this is sometimes needed for boolean data such as bought / not bought.

Map: (userID, itemID, pref)

Reduce: (userID, VectorWritable<itemID, pref>), i.e. with the user as key, the ratings are output in vector form
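The grouping this phase performs can be sketched in a few lines (an in-memory stand-in for the map/reduce pair, with hypothetical names, not Mahout's code):

```python
from collections import defaultdict

def to_user_vectors(lines, boolean_data=False):
    """Group 'userID<TAB>itemID<TAB>score' lines into one vector per user:
    the map emits by userID, the reduce collects the ratings into a vector."""
    vectors = defaultdict(dict)                       # userID -> {itemID: pref}
    for line in lines:
        user, item, score = line.split("\t")
        pref = 1.0 if boolean_data else float(score)  # boolean mode drops the score
        vectors[int(user)][int(item)] = pref
    return dict(vectors)

ratings = ["1\t101\t4.0", "1\t102\t3.0", "2\t101\t5.0"]
user_vectors = to_user_vectors(ratings)
```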

Phase3: countUsers, calculates the number of users

Map: (userID)

Reduce: outputs the total number of users, count

Phase4: maybePruneAndTranspose

Input: the userVector output of Phase2

Param: --maxCooccurrences

Map: (userID, Vector<itemID, pref>) ==> (itemID, DistributedRowMatrix<userID, pref>). Note that if the --maxCooccurrences parameter is specified, clipping occurs: each userID keeps at most maxCooccurrences itemID ratings.

This produces the DistributedRowMatrix, a distributed row matrix with row: itemID and column: userID.

Reduce: (itemID, VectorWritable<userID, pref>)
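The prune-and-transpose step can be condensed as follows (a sketch with a hypothetical helper; for illustration it keeps each user's highest-rated items, whereas the real job's pruning strategy may differ):

```python
from collections import defaultdict

def prune_and_transpose(user_vectors, max_cooccurrences=None):
    """Transpose user->item vectors into item->user rows. If max_cooccurrences
    is set, each user keeps at most that many ratings (here the highest-rated
    ones; illustrative only)."""
    item_rows = defaultdict(dict)
    for user, vec in user_vectors.items():
        kept = sorted(vec.items(), key=lambda kv: -kv[1])
        if max_cooccurrences is not None:
            kept = kept[:max_cooccurrences]
        for item, pref in kept:
            item_rows[item][user] = pref          # row: itemID, column: userID
    return dict(item_rows)

rows = prune_and_transpose({1: {101: 4.0, 102: 3.0, 103: 5.0}, 2: {101: 2.0}},
                           max_cooccurrences=2)
```

With max_cooccurrences=2, user 1's lowest-rated item (102) is dropped before the transpose.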

Phase5: RowSimilarityJob

This step is the critical one: it calculates the item similarities, and it is split into three jobs.

Param: --numberOfColumns, --similarityClassname, --maxSimilaritiesPerRow (default: 100)

Job1: weight

Input: the output of Phase4

Map: (itemID, VectorWritable<userID, pref>) ==> (userID, WeightedOccurrence<itemID, pref, weight>)

Here the weight is Double.NaN, i.e. unused, for Euclidean distance, Pearson correlation, and so on. The weight value is only used by LoglikelihoodVectorSimilarity.

Reduce: (userID, WeightedOccurrenceArray<itemID, pref, weight>)

Job2: pairwiseSimilarity *item similarity calculation*

Map: for all the item ratings of the same user, output the relationship between every pair of items ==> (WeightedRowPair<itemA, itemB, weightA, weightB>, Cooccurrence<userID, valueA, valueB>) (as above, the weights weightA and weightB can be ignored for Euclidean distance, etc.)

Reduce: here the <itemA, itemB> key aggregates all the users coming from the different maps, and finally the symmetric similarity of itemA and itemB is output (i.e. once with itemA as key and once with itemB as key) ==> (SimilarityMatrixEntryKey<itemA, similarity>, MatrixEntryWritable<WeightedRowPair<itemA, itemB, weightA, weightB>>) and (SimilarityMatrixEntryKey<itemB, similarity>, MatrixEntryWritable<WeightedRowPair<itemB, itemA, weightB, weightA>>)

Job3: entries2vectors *summarizes the similarities per item*

Param: --maxSimilaritiesPerRow

Map: (itemA, itemB, similarity) and (itemB, itemA, similarity). Within each group, the entries are sorted by descending similarity; if the --maxSimilaritiesPerRow parameter is given, the list is clipped.

Reduce: (itemA, VectorWritable<item, similarity>)

At this point, the item similarity calculation is complete.
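The shape of the three jobs can be condensed into a single in-memory sketch. Assuming cosine similarity as a stand-in for one possible --similarityClassname (not Mahout's actual code), the pairwise emission and aggregation look like:

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def pairwise_similarity(user_vectors):
    """Condensed sketch of the three jobs: per user, emit every co-rated item
    pair (the pairwise job), then aggregate the pairs into cosine similarity."""
    dot = defaultdict(float)       # (itemA, itemB) -> sum of pref products
    norm = defaultdict(float)      # item -> sum of squared prefs
    for vec in user_vectors.values():
        for item, pref in vec.items():
            norm[item] += pref * pref
        for (a, pa), (b, pb) in combinations(sorted(vec.items()), 2):
            dot[(a, b)] += pa * pb
    sims = {}
    for (a, b), d in dot.items():
        s = d / (sqrt(norm[a]) * sqrt(norm[b]))
        sims[(a, b)] = sims[(b, a)] = s        # symmetric: both items as key
    return sims

sims = pairwise_similarity({1: {101: 1.0, 102: 1.0}, 2: {101: 1.0, 102: 1.0}})
```

The quadratic pair emission per user is also why this phase dominates the runtime, as noted at the end of the article.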

Phase6: prePartialMultiply1

Input: the final output of Phase5 (that is, the item similarities)

Map: directly outputs each item's similar items, wrapped in a VectorOrPrefWritable, which indicates that the value may be either a similarity vector or an item rating; for the item itself, the similarity is set to Double.NaN so that it filters itself out. ==> (itemID, VectorOrPrefWritable<item, similarity>)

Reduce: IdentityReducer

Phase7: prePartialMultiply2

Input: the userVectors output of Phase2

Map: outputs (itemID, VectorOrPrefWritable<userID, pref>)

By default, at most 10 ratings per user are considered; this can be adjusted with the maxPrefsPerUserConsidered parameter.

If --usersFile is specified, all of the userIDs in it are read into memory during setup, for filtering: if a userID in the map input is not in the usersFile, it is ignored. Note that this is a design flaw in Mahout: for larger datasets it is likely to cause an OOM (in fact, an OOM has already appeared in my tests). The same flaw shows up again below. The output is the user's ratings, wrapped in the same VectorOrPrefWritable as in Phase6.

Reduce: IdentityReducer

Phase8: partialMultiply

Input: the outputs of Phases 6 and 7: prePartialMultiply1 and prePartialMultiply2

Map: identity. Since the outputs of both 6 and 7 are keyed by itemID, each item's similar items and the corresponding user ratings are aggregated together on the reduce side.

Reduce: (itemID, VectorAndPrefsWritable<similarityMatrix, List<userID>, List<pref>>). No special processing is done; the similarity vector and all the userIDs with their ratings for the item are simply output together.
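The join itself is simple because both sides share the itemID key. A minimal sketch (hypothetical helper, not Mahout's code):

```python
def partial_multiply(similarity_rows, item_prefs):
    """Sketch of the Phase8 join: both inputs are keyed by itemID, so the
    'reduce' simply pairs each item's similarity vector with the list of
    (userID, pref) ratings observed for that item."""
    return {item: (sim_vector, item_prefs.get(item, []))
            for item, sim_vector in similarity_rows.items()}

joined = partial_multiply({101: {102: 0.5}}, {101: [(1, 4.0)]})
```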

Phase9: itemFiltering

Outputs the filter file as <userID, itemID> pairs. If the --filterFile parameter is specified, the corresponding items are filtered out of each userID's final aggregated recommendations. In practice this step can usually be skipped; simply do not specify the parameter.

Phase10: aggregateAndRecommend

Map: for each user, outputs the rating of the current item together with all of its similar items ==> (userID, PrefAndSimilarityColumnWritable<pref, Vector<item, similarity>>)

Reduce: aggregates all of this user's rating history and the similar items, and computes the recommendation results for that user ==> (userID, List<itemID>).

Note that in the setup of this reduce, the entire itemID-to-index mapping generated by Phase1 is read into memory; as soon as the item dataset gets even slightly large, it will OOM. This is a more serious design flaw.

In fact, if the itemIDs are ordinary integers rather than GUIDs, Phase1 and this read-into-memory step could be omitted entirely. With that change it can be used on enterprise-scale datasets (my test dataset was 1.5 billion+ user-item ratings with 1.5 million+ users; apart from this last step, all the phases ran successfully).
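The aggregation in this final reduce is a similarity-weighted average of the user's ratings. A minimal sketch (hypothetical helper, not Mahout's code), taking per-item (similarityVector, ratings) pairs of the shape produced in Phase8:

```python
from collections import defaultdict

def aggregate_and_recommend(partial, num_recs=10):
    """Sketch of the final reduce: estimate each unrated item's score for a
    user as the similarity-weighted average of that user's ratings, then
    keep the top-N items."""
    num = defaultdict(float)       # (user, item) -> sum(similarity * pref)
    den = defaultdict(float)       # (user, item) -> sum(|similarity|)
    rated = defaultdict(set)
    for item, (sim_vector, prefs) in partial.items():
        for user, pref in prefs:
            rated[user].add(item)
            for other, sim in sim_vector.items():
                num[(user, other)] += sim * pref
                den[(user, other)] += abs(sim)
    recs = defaultdict(list)
    for (user, item), n in num.items():
        if item not in rated[user] and den[(user, item)] > 0:
            recs[user].append((item, n / den[(user, item)]))
    return {u: [i for i, _ in sorted(r, key=lambda x: -x[1])[:num_recs]]
            for u, r in recs.items()}

# user 1 rated item 101; items 102 and 103 are similar to 101:
recs = aggregate_and_recommend({101: ({102: 0.8, 103: 0.2}, [(1, 4.0)])})
```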

At this point, the recommendation results have been produced and the CF run is complete.

Of all the steps above, Phase5's item similarity calculation is the slowest (which is quite intuitive).

