My most recent task was to write a MapReduce program with mrjob that reads its data from MongoDB. My approach is simple and easy to follow: since mrjob can read its input from sys.stdin, I use a separate Python program to read the data out of MongoDB, and then let the mrjob script take that as input, process it, and write the output.
The details:
readinmongodb.py:
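#coding: UTF-8
'''
Created on May 28, 2014
@author: hao
'''
import pymongo

# host must be set to the address of your MongoDB server.
# pymongo.Connection is the old client class; newer pymongo releases use MongoClient.
pyconn = pymongo.Connection(host, port=27017)
# As written in the source. A database name normally sits between the
# connection and the collection. The source also chained .batch_size(...) here,
# but its argument did not survive.
pycursor = pyconn.userid_cid_score.find()
for i in pycursor:
    userId = i['userId']
    cid = i['cid']
    score = i['score']
    # Emit one "userId,cid,score" line per document for the mrjob script to read.
    print(str(userId) + ',' + str(cid) + ',' + str(score))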
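For readers on a current pymongo, here is a minimal sketch of the same reader split into small functions. It is not the author's script: the helper names (`format_record`, `dump_records`, `open_cursor`) and the database name `mydb` are hypothetical, and the field names `userId`, `cid`, and `score` are assumed from the script above.

```python
import sys


def format_record(doc):
    """Turn one MongoDB document into a 'userId,cid,score' line."""
    return "{},{},{}".format(doc["userId"], doc["cid"], doc["score"])


def dump_records(docs, out=sys.stdout):
    """Write one line per document and return how many were written."""
    count = 0
    for doc in docs:
        out.write(format_record(doc) + "\n")
        count += 1
    return count


def open_cursor(host="localhost", port=27017, db_name="mydb",
                coll_name="userid_cid_score"):
    """Open a cursor over the ratings collection.

    The host, port, and database name are placeholders (assumptions).
    pymongo is imported here so the helpers above work without it.
    """
    from pymongo import MongoClient

    client = MongoClient(host, port)
    collection = client[db_name][coll_name]
    # The projection keeps only the three fields the job needs.
    return collection.find({}, {"_id": 0, "userId": 1, "cid": 1, "score": 1})
```

Calling `dump_records(open_cursor())` from a script would print the same comma-separated lines that the mrjob step expects on stdin.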
step1.py:
#coding: utf-8
'''
Created on May 27, 2014
@author: hao
'''
from mrjob.job import MRJob


class Step(MRJob):

    def parseMatrix(self, _, line):
        '''
        input: one line from stdin, produced by the pymongo reader
        output: contentId, (userId, rating)
        '''
        line = str(line).split(',')
        userId = line[0]
        cid = line[1]
        score = float(line[2])
        yield cid, (userId, score)

    def scoreCombine(self, cid, userRating):
        '''
        put all (user, rating) pairs for the same content into one list
        '''
        userRatings = list()
        for i in userRating:
            userRatings.append(i)
        yield cid, userRatings

    def userBehavior(self, cid, userRatings):
        '''
        for each content item, emit the ratings of every pair of distinct users
        '''
        scoreList = list()
        for doc in userRatings:
            # each doc is one list produced by scoreCombine
            for i in doc:
                scoreList.append(i)
        for user1 in scoreList:
            for user2 in scoreList:
                if user1[0] == user2[0]:
                    continue
                yield (user1[0], user2[0]), (user1[1], user2[1])

    def steps(self):
        # self.mr is the old mrjob API; newer releases use MRStep from mrjob.step.
        return [self.mr(mapper=self.parseMatrix, reducer=self.scoreCombine),
                self.mr(reducer=self.userBehavior)]


if __name__ == '__main__':
    fp = open('a.txt', 'w')
    fp.write('[')
    Step.run()
    fp.write(']')
    fp.close()
Then run the two scripts as a pipeline: python readinmongodb.py | python step1.py >> out.txt
This works well locally, without any problems (apart from mrjob's speed, which matters little in this application).
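To see what the two steps compute without installing mrjob or MongoDB, the logic can be sketched in plain Python. This is only an illustration of the data flow under the assumption that input lines look like `userId,cid,score`. The function names are hypothetical, and the grouping that mrjob does between mapper and reducer is replaced here by a dictionary.

```python
from collections import defaultdict


def parse_line(line):
    """Mapper logic: 'userId,cid,score' -> (cid, (userId, score))."""
    user_id, cid, score = line.strip().split(",")
    return cid, (user_id, float(score))


def group_by_content(lines):
    """Stand-in for the shuffle plus first reducer: collect ratings per cid."""
    grouped = defaultdict(list)
    for line in lines:
        cid, rating = parse_line(line)
        grouped[cid].append(rating)
    return dict(grouped)


def user_pairs(ratings):
    """Second reducer logic: ratings of every ordered pair of distinct users."""
    for user1, score1 in ratings:
        for user2, score2 in ratings:
            if user1 == user2:
                continue
            yield (user1, user2), (score1, score2)


def simulate(lines):
    """Run both stages in memory and return the emitted pairs as a list."""
    results = []
    for _cid, ratings in sorted(group_by_content(lines).items()):
        results.extend(user_pairs(ratings))
    return results
```

For example, `simulate(["u1,c1,5", "u2,c1,3", "u1,c2,4"])` yields pairs only for `c1`, because `c2` was rated by a single user. Each pair appears in both orders, as in the job above.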
Original: http://blog.csdn.net/whzhcahzxh/article/details/29587059
Using a MongoDB data source with mrjob (repost)