From Udacity's Intro to Hadoop course, final project.
The task is to find out whether the length of a post is related to the length of its answers.
Approach: the mapper keeps only questions and answers and outputs id, type (question or answer), and body length. The reducer then computes, for each id, the question length and the average answer length.
# Mapper
#!/usr/bin/python
import sys
import csv

reader = csv.reader(sys.stdin, delimiter="\t")
next(reader, None)  # skip the first line (the header row)

for line in reader:
    if len(line) == 19:
        node_id = line[0]
        body = line[4]
        node_type = line[5]
        parent_id = line[6]
        if node_type == "question":
            # a question is keyed by its own id
            print("{0}\t{1}\t{2}".format(node_id, node_type, len(body)))
        elif node_type == "answer":
            # an answer is keyed by the id of the question it belongs to
            print("{0}\t{1}\t{2}".format(parent_id, node_type, len(body)))

# Reducer
#!/usr/bin/python
import sys

questionlen = 0
answerlen = 0
answercount = 0
oldkey = None

for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 3:
        continue
    thiskey, node_type, length = data

    # key changed: emit the result for the previous id and reset the counters
    if oldkey and oldkey != thiskey:
        if answercount == 0:
            print("{0}\t{1}\t{2}".format(oldkey, questionlen, 0))
        else:
            print("{0}\t{1}\t{2}".format(oldkey, questionlen, answerlen / answercount))
        questionlen = 0
        answerlen = 0
        answercount = 0

    oldkey = thiskey
    if node_type == "question":
        questionlen = int(length)
    elif node_type == "answer":
        answercount += 1
        answerlen += float(length)

# emit the result for the last id
if oldkey != None:
    if answercount == 0:
        print("{0}\t{1}\t{2}".format(oldkey, questionlen, 0))
    else:
        print("{0}\t{1}\t{2}".format(oldkey, questionlen, answerlen / answercount))
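To check the logic without a Hadoop cluster, the map, sort, and reduce steps can be simulated locally. The following is only a rough sketch under a few assumptions: the function names (map_row, reduce_records, make_row) and the sample rows are made up for illustration, rows are assumed to have 19 fields with the same column positions the mapper uses, and sorting by key stands in for Hadoop's shuffle phase.

```python
from itertools import groupby


def map_row(row):
    """Return (key, node_type, body_length) for questions and answers, else None."""
    if len(row) != 19:
        return None
    node_id, body, node_type, parent_id = row[0], row[4], row[5], row[6]
    if node_type == "question":
        return (node_id, node_type, len(body))
    if node_type == "answer":
        return (parent_id, node_type, len(body))
    return None


def reduce_records(records):
    """Return {key: (question_length, average_answer_length)}."""
    results = {}
    # Sorting by key stands in for Hadoop's shuffle-and-sort phase.
    ordered = sorted(records, key=lambda r: r[0])
    for key, group in groupby(ordered, key=lambda r: r[0]):
        question_len = 0
        answer_lens = []
        for _, node_type, length in group:
            if node_type == "question":
                question_len = length
            else:
                answer_lens.append(length)
        # No answers: report 0, as the reducer does.
        avg = sum(answer_lens) / float(len(answer_lens)) if answer_lens else 0
        results[key] = (question_len, avg)
    return results


def make_row(node_id, body, node_type, parent_id):
    """Build a hypothetical 19-field row with only the used columns filled in."""
    row = [""] * 19
    row[0], row[4], row[5], row[6] = node_id, body, node_type, parent_id
    return row


# Made-up sample data: one question, two answers, one comment (ignored).
sample_rows = [
    make_row("1", "How do I sort?", "question", ""),
    make_row("2", "Use sorted().", "answer", "1"),
    make_row("3", "Or list.sort() in place.", "answer", "1"),
    make_row("4", "Nice question", "comment", "1"),
]

records = [r for r in map(map_row, sample_rows) if r is not None]
summary = reduce_records(records)
print(summary)
```

Grouping with itertools.groupby replaces the manual oldkey bookkeeping, which is convenient for a local test but only works because all records fit in memory here.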