I've been taking the Web Intelligence and Big Data course on Coursera. Last Friday the instructor posted a homework assignment: write a MapReduce program, implemented in Python.
The full description is as follows:
Programming Assignment for HW3
Homework 3 (Programming Assignment A)
Download data files bundled as a .zip file from hw3data.zip
Each file in this archive contains entries that look like:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.
that represent bibliographic information about publications, formatted as follows:
paper-id:::author1::author2::…. ::authorN:::title
Your task is to compute how many times every term occurs across titles, for each author.
For example, for the author Alberto Pettorossi, the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2.
Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that ‘terms’ must exclude common stop-words, such as prepositions etc. For
the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In
addition, single letter words, such as "a" can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only alphabets and numbers can be part of a title term: Thus, “program” and “program.”
should both be counted as the term ‘program’, and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms.
The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python.
These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively.
I strongly recommend mincemeat.py, which is much faster than octo.py, even though the latter was covered first in the lecture video as an example. Both are very similar.
Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3 where you will be asked questions that can be simply answered using your computed output, such
as the top terms that occur for some particular author.
Note: There is no need to submit the code; I assume you will experiment using octo.py to learn how to program using map-reduce. Of course, you can always write a serial program for the task
at hand, but then you won’t learn anything about map-reduce.
Lastly, please note that octo.py is a rather inefficient implementation of map-reduce. Some of you might want to delve into the code to figure out exactly why. At the same time, this inefficiency
is likely to amplify any errors you make in formulating the map and reduce functions for the task at hand. So if your code starts taking too long, say more than an hour to run, there is probably something wrong.
Clearly, a week or so would not be nearly enough if we had to write this from scratch. The instructor strongly recommends the mincemeat.py library, and with it all we really have to change are three things: datasource, mapfn, and reducefn. In short, the assignment tests our ability to adapt existing code, which is why he says there is no need to submit it.
Here is how I implemented it:
1. Download and inspect the data files
hw3data.zip
Unzipping it reveals a large number of files, and every file contains entries exactly as described in the assignment:
books/bc/tanselCGSS93/Tuzhilin93:::Alexander Tuzhilin:::Applications of temporal Databases to Knowledge-based Simulations.
books/aw/kimL89/DiederichM89:::Jim Diederich::Jack Milton:::Objects, Messages, and Rules in Database Design.
.................................................
.................................................
.................................................
The leading paper-id field can be ignored; we only need to extract the author field and the title that follows it. Note that, as the instructor points out, there may be more than one author, so each line must first be split on ":::" and the second field then split again on "::". The title is not a single unit either: what we want is every individual word inside it. The required output pairs look like this:
author: word-number, where number is how many times word occurs across the titles of that author's papers. With the requirement pinned down, the next step is to modify the code; the parsing itself boils down to the two splits sketched below.
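To make the two splits concrete, here is a minimal sketch (Python 2, like the rest of the code in this post) run on the sample record from the assignment:

line = "journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2."
paper_id, authors, title = line.split(':::')
print authors.split('::')  # ['Michele Di Santo', 'Libero Nigro', 'Wilma Russo']
print title                # 'Programmer-Defined Control Abstractions in Modula-2.'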
2. Modify example.py
The mincemeat download includes an example.py file, and all we need to do is follow its structure.
Here is example.py:
#!/usr/bin/env python
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

# The data source can be any dictionary-like object
datasource = dict(enumerate(data))

# need change
def mapfn(k, v):
    for w in v.split():
        yield w, 1

# need change
def reducefn(k, vs):
    result = sum(vs)
    return result

s = mincemeat.Server()
s.datasource = datasource  # need change
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print results
Only mapfn(), reducefn(), and datasource need to change, where datasource is a dict. My first idea was to store author:word pairs in it directly, but then I realized that would leave MapReduce with nothing to do: I would already have extracted every author:word pair before mapfn() ever ran, which is surely not what the instructor intended. So what should actually be passed in is the file contents; mapfn() then reads the lines and emits author:word pairs, and reducefn() counts how many times each pair occurs. Once that was clear, the rewrite became easy. I borrowed from zdw12242's code here, for which I am grateful; my earlier attempt was rather a mess.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob
import mincemeat
import operator

text_files = glob.glob('E:\\Web\\hw3data\\*')

def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()

# datasource: map each file name to that file's contents
source = dict((file_name, file_contents(file_name)) for file_name in text_files)

# setup map and reduce functions
def mapfn(key, value):
    # the stop-word list lives inside mapfn because mincemeat sends the
    # function itself to the clients
    stop_words = [
        'all', 'herself', 'should', 'to', 'only', 'under', 'do', 'weve',
        'very', 'cannot', 'werent', 'yourselves', 'him', 'did', 'these',
        'she', 'havent', 'where', 'whens', 'up', 'are', 'further', 'what',
        'heres', 'above', 'between', 'youll', 'we', 'here', 'hers', 'both',
        'my', 'ill', 'against', 'arent', 'thats', 'from', 'would', 'been',
        'whos', 'whom', 'themselves', 'until', 'more', 'an', 'those', 'me',
        'myself', 'theyve', 'this', 'while', 'theirs', 'didnt', 'theres',
        'ive', 'is', 'it', 'cant', 'itself', 'im', 'in', 'id', 'if', 'same',
        'how', 'shouldnt', 'after', 'such', 'wheres', 'hows', 'off', 'i',
        'youre', 'well', 'so', 'the', 'yours', 'being', 'over', 'isnt',
        'through', 'during', 'hell', 'its', 'before', 'wed', 'had', 'lets',
        'has', 'ought', 'then', 'them', 'they', 'not', 'nor', 'wont',
        'theyre', 'each', 'shed', 'because', 'doing', 'some', 'shes', 'our',
        'ourselves', 'out', 'for', 'does', 'be', 'by', 'on', 'about',
        'wouldnt', 'of', 'could', 'youve', 'or', 'own', 'whats', 'dont',
        'into', 'youd', 'yourself', 'down', 'doesnt', 'theyd', 'couldnt',
        'your', 'her', 'hes', 'there', 'hed', 'their', 'too', 'was',
        'himself', 'that', 'but', 'hadnt', 'shant', 'with', 'than', 'he',
        'whys', 'below', 'were', 'and', 'his', 'wasnt', 'am', 'few',
        'mustnt', 'as', 'shell', 'at', 'have', 'any', 'again', 'hasnt',
        'theyll', 'no', 'when', 'other', 'which', 'you', 'who', 'most',
        'ours', 'why', 'having', 'once', 'a', '-', '.', ',']
    for line in value.splitlines():
        fields = line.split(':::')
        authors = fields[1].split('::')
        title = fields[2]
        for author in authors:
            for term in title.split():
                if term not in stop_words:
                    if term.isalnum():
                        yield author, term.lower()
                    elif len(term) > 1:
                        # keep letters and digits; turn hyphens into spaces
                        temp = ''
                        for ichar in term:
                            if ichar.isalpha() or ichar.isdigit():
                                temp += ichar
                            elif ichar == '-':
                                temp += ' '
                        yield author, temp.lower()

def reducefn(key, value):
    # count how many times each term occurs for this author
    result = {}
    for term in value:
        if term in result:
            result[term] += 1
        else:
            result[term] = 1
    return result

# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")

# write the results to a file
result_file = open('hw3_result.txt', 'w')
# note: this call has no effect -- sorted() returns a new list, which is
# discarded here (see the postscript)
sorted(results.iteritems(), key=operator.itemgetter(1))
for author in results:
    result_file.write(author + ' : ')
    for term in results[author]:
        result_file.write(term + ':' + str(results[author][term]) + '#')
    result_file.write('\r\n')
result_file.close()
3. Run
Open two cmd windows: run your own script in one and mincemeat.py in the other, so that one acts as the server and the other as the client. Once the server has finished processing, the results appear. The commands look something like this:
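Assuming the script above is saved as hw3.py next to mincemeat.py (the file name is just my arrangement), the two windows would run:

python hw3.py
python mincemeat.py -p changeme localhost

Here -p passes the shared password and localhost is the server's address; point it at the server machine's IP instead to add worker clients from other machines.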
As for the underlying principle, it works by emulating the MapReduce model; this article is worth a read:
Introduction to MapReduce
The output file looks like this:
José Cristóbal Riquelme Santos : evolutionary:1#numeric:2#selection:1#reglas:1#clasificacin:1#efficient:1#feature:1#induccin:1#mediante:1#discovering:1#soap:1#rules:1#de:3#evolutivo:1#oblicuas:1#association:1#algoritmo:1#algorithm:1#via:1#un:1#mtodo:1#attributes:1#
Larry L. Kinney : control:1#microprogrammed:1#testing:1#number:1#detection:1#registers:1#feedback:1#group:1#strategy:1#intrainverted:1#units:1#evaluation:1#method:1#linear:1#concurrent:1#probing:1#relating:1#chips:1#a:1#cyclic:1#shift:1#large:1#behavior:1#error:1#
Gianfranco Bilardi : operations:1#logp:1#characterization:1#fat trees:1#computation:1#its:1#functions:1#for:1#locality:1#temporal:1#memory:1#hierarchies:1#across:1#versus:1#monotone:1#bsp:1#broadcast:1#a:1#lower:1#crew pram:1#portability:1#of:1#bounds:1#time:1#associative:1#
Joseph C. Culberson : binary:1#search:1#extended:1#polygons:1#polygon:1#orthogonal:1#simple:1#abstract:1#uncertainty:1#searching:1#trees:1#number:1#minimum:1#orthogonally:1#updates:1#convex:1#effect:1#covering:1#the:1#
.....................................
.....................................
.....................................
4. Postscript
One small regret: my attempt to sort results by word frequency had no effect. I think the reason is that the word:count pairs are stored together as a single value in the dict, and at the time I did not know how to sort on one component inside a dict value, so while answering the week-3 quiz I had to strain my eyes hunting for the two largest counts.
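In hindsight, each author's counts can be sorted after run_server returns, on the server side where the operator import is available. A minimal sketch (the author name is just a hypothetical example):

import operator

author = 'Alberto Pettorossi'  # hypothetical example
# sort this author's (term, count) pairs by count, largest first
top = sorted(results[author].iteritems(), key=operator.itemgetter(1), reverse=True)
print top[:2]  # the two most frequent terms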
Finally, I have to say that lately I keep running into things that call for Python; it seems I cannot escape its grip.
5. Addendum
Thanks to zdw12242 for this fix.
To sort each author's term frequencies, mapfn stays unchanged; the modified reducefn and main program are as follows:
def reducefn(key, value):
    # count how many times each term occurs for this author
    counts = {}
    for term in value:
        if term in counts:
            counts[term] += 1
        else:
            counts[term] = 1
    items = counts.items()
    # sort the counts: swap each pair to [count, term], sort descending,
    # then swap back to [term, count]
    reverse_items = [[v[1], v[0]] for v in items]
    reverse_items.sort(reverse=True)
    result = []
    for i in reverse_items:
        result.append([i[1], i[0]])
    return result

# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")

# save results
result_file = open('hw3_result_sorted', 'w')
for author in results:
    result_file.write(author + ' : ')
    for term in results[author]:
        result_file.write(term[0] + ':' + str(term[1]) + ',')
    result_file.write('\n')
result_file.close()
Because Python dicts are stored as hash tables, iterating over one yields its entries in arbitrary order; so the dict is converted to a list, sorted, and then written out.
First, convert the dict into a list:
[['creating', 2], ['lifelong', 1],['quality', 3],['learners', 5], ['assurance', 1]]
Then swap the two values in each sublist:
[[2, 'creating'], [1, 'lifelong'],[3, 'quality'],[5, 'learners'], [1, 'assurance']]
Sort by each sublist's first value, from largest to smallest:
[[5, 'learners'], [3, 'quality'], [2, 'creating'], [1, 'lifelong'], [1, 'assurance']]
Finally, swap the values back into place:
[['learners', 5], ['quality', 3], ['creating', 2], ['lifelong', 1], ['assurance', 1]]
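Incidentally, the swap-sort-swap dance can be collapsed into a single sorted() call with a key function. Below is a sketch of an equivalent reducefn: it returns (term, count) tuples instead of two-element lists, which the writing loop above handles just the same, though the ordering of tied counts may differ slightly. The key is written as a lambda rather than operator.itemgetter because mincemeat appears to transfer only the function's code object to the clients, without module-level imports (the same reason stop_words lives inside mapfn):

def reducefn(key, value):
    counts = {}
    for term in value:
        counts[term] = counts.get(term, 0) + 1
    # sort (term, count) pairs by count, descending, in one step
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)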