From Udacity's Intro to Hadoop course, final project.
The task: from the structured forum post data, find the hour of the day at which each user posts most often. For example, if user A posts most often at 8 o'clock, and that hour is denoted X, the reducer should output "A\tX" (here, "A\t8").
Train of thought: the mapper outputs (author_id, hour) pairs; the reducer builds an {hour: count} tally for each user and finally emits the hour with the highest count.
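For intuition, here is a tiny standalone sketch of the per-user tally the reducer keeps (the post hours in the list are made up):

hours = {h: 0 for h in range(24)}
for h in [8, 8, 8, 14, 22]:       # hypothetical post hours for one author
    hours[h] += 1
print(max(hours, key=hours.get))  # prints 8, the hour this author posts most often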
#mapper

#!/usr/bin/python
import sys
import csv

reader = csv.reader(sys.stdin, delimiter="\t")
next(reader, None)  # skip the header row
for line in reader:
    # keep only complete records; 19 is the column count of the forum_node data (assumed here)
    if len(line) == 19:
        author_id = line[3]
        # line[8] is the added_at timestamp "YYYY-MM-DD HH:MM:SS..."; characters 11:13 are the hour
        hour = line[8][11:13]
        print("{0}\t{1}".format(author_id, hour))
# reducer

#!/usr/bin/python
import sys

oldid = None
hours = {}
for i in range(24):          # one counter per hour of the day
    hours[i] = 0

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) == 2:
        thisauthorid, hour = data
        if oldid and oldid != thisauthorid:
            # new author: emit the busiest hour of the previous author, then reset the counters
            print("{0}\t{1}".format(oldid, max(hours, key=hours.get)))
            for i in range(24):
                hours[i] = 0
        oldid = thisauthorid
        hours[int(hour)] += 1

# emit the last author
if oldid is not None:
    print("{0}\t{1}".format(oldid, max(hours, key=hours.get)))