Recently, I have been taking the course Web Intelligence and Big Data on Coursera. Last Friday, the Indian instructor assigned a homework task asking us to write a MapReduce program and implement it in Python.
The detailed description is as follows:
Homework 3 (programming assignment)
Download the data files bundled as a .zip file from hw3data.zip
Each file in this archive contains entries that look like:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-defined control abstractions in Modula-2.
that represent bibliographic information about publications, formatted as follows:
paper-id:::author1::author2::...::authorN:::title
Your task is to compute how many times each term occurs across titles, for each author.
For example, for the author Alberto Pettorossi, the following terms occur in titles with the indicated cumulative frequencies (i.e., across all his papers): program: 3, transformation: 2, transforming: 2, using: 2, programs: 2, and logic: 2.
Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that 'terms' must exclude common stop-words, such as prepositions etc.
For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single-letter words, such as 'a', can be ignored; hyphens can also be ignored (i.e., deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only letters and numbers can be part of a title term. Thus, "program" and "program." should both be counted as the term 'program', and "map-reduce" should be taken as 'map reduce'. Note: you do not need to do stemming, i.e., "algorithm" and "algorithms" can be treated as separate terms.
The assignment is to write a parallel map-reduce program for the above task using either octo.py or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python.
These are available from http://code.google.com/p/octopy/ and the mincemeat.py zipfile respectively.
I strongly recommend mincemeat.py, which is much faster than octo.py, even though the latter was covered first in the lecture video as an example. Both are very similar.
Once you have computed the output, i.e., the term frequencies per author, go attempt Homework 3, where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author.
Note: There is no need to submit the code; I assume you will experiment using octo.py to learn how to program using map-reduce. Of course, you can always write a serial program for the task at hand, but then you won't learn anything about map-reduce.
Lastly, please note that octo.py is a rather inefficient implementation of map-reduce. Some of you might want to delve into the code to figure out exactly why. At the same time, this inefficiency is likely to amplify any errors you make in formulating the map and reduce functions for the task at hand. So if your code starts taking too long, say more than an hour, to run, there is probably something wrong.
Obviously, we cannot write all of this from scratch in the week or so we have, and the teacher strongly recommends that we use the mincemeat.py library for the homework. With this library, we really only need to supply three parameters: datasource, mapfn, and reducefn. In short, the teacher is testing our ability to modify existing code, which is why he said submitting it is not required.
The implementation process is as follows:
1. Download and analyze files
hw3data.zip
After unzipping it, you will find many files. The content of each file is exactly as the assignment describes:
books/bc/tanselCGSS93/Tuzhilin93:::Alexander Tuzhilin:::Applications of temporal Databases to Knowledge-based Simulations.
books/aw/kimL89/DiederichM89:::Jim Diederich::Jack Milton:::Objects, Messages, and Rules in Database Design.
.........................................
In each record, the paper ID part can be ignored; we only need the authors and the title that follows them. Note that, as the assignment says, there may be more than one author, so first split each line on ':::', and then split the resulting author field on '::'. The title must not be treated as a whole either: the teacher asks us to extract every word from it. The final output requirement is as follows:
author : word:count, where count is the number of times the word appears across that author's titles. With the requirements clear, it is time to modify the code.
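As a minimal sketch of that parsing, using the two-author sample line from above (the output formatting here is just for illustration):

# split one record into paper-id, authors, and title
line = "books/aw/kimL89/DiederichM89:::Jim Diederich::Jack Milton:::Objects, Messages, and Rules in Database Design."
parts = line.split(':::')
authors = parts[1].split('::')   # ['Jim Diederich', 'Jack Milton']
title = parts[2]
for author in authors:
    for word in title.split():
        print(author + ' -> ' + word)   # stop words and punctuation are handled later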
2. Modify example.py
When we download the mincemeat package, we get an example.py file, which is all we need to modify. The example.py file is as follows:
#!/usr/bin/env python
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

# The data source can be any dictionary-like object
datasource = dict(enumerate(data))

# need change
def mapfn(k, v):
    for w in v.split():
        yield w, 1

# need change
def reducefn(k, vs):
    result = sum(vs)
    return result

s = mincemeat.Server()
s.datasource = datasource  # need change
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print results
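To see what the framework actually does with these two functions, here is a serial simulation I find helpful (my own illustration, not mincemeat code): the map phase emits a (word, 1) pair per word, the framework groups the emitted values by key, and the reduce phase sums each group.

def mapfn(k, v):
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    return sum(vs)

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall"]

# map phase: collect the emitted (word, 1) pairs, grouped by word
intermediate = {}
for k, v in enumerate(data):
    for key, val in mapfn(k, v):
        intermediate.setdefault(key, []).append(val)

# reduce phase: sum each word's list of 1s
results = dict((key, reducefn(key, vals)) for key, vals in intermediate.items())
print(results)   # {'Humpty': 2, 'Dumpty': 2, 'sat': 1, ...}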
We only need to modify mapfn(), reducefn(), and datasource, where datasource is of dict type. My first idea was to build the author:word pairs in the datasource itself, but then I realized that would leave MapReduce with nothing to do: the author:word pairs would already be extracted before mapfn() ever ran, which is definitely not what the teacher expected. Instead, the actual input should be the file contents: the datasource maps file names to file text, mapfn() reads each file's value and produces author:word pairs, and reducefn() counts the occurrences of each pair per author. Once this is figured out, the rewrite becomes easy. Here I referred to the code of fellow student zdw12242, to whom I express my gratitude; the code I wrote myself was a bit miserable.
# -*- coding: utf-8 -*-
#!/usr/bin/env python
import glob
import mincemeat
import operator

text_files = glob.glob('E:\\Web\\hw3data\\*')

def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()

# map each file name to its full text
source = dict((file_name, file_contents(file_name)) for file_name in text_files)

# setup map and reduce functions
def mapfn(key, value):
    # mincemeat sends the code of mapfn to the clients, so the stop-word
    # list is embedded here instead of being imported from stopwords.py
    stop_words = ['all', 'herself', 'should', 'to', 'only', 'under', 'do', 'weve', 'very', 'cannot',
                  'werent', 'yourselves', 'him', 'did', 'these', 'she', 'havent', 'where', 'whens', 'up',
                  'are', 'further', 'what', 'heres', 'above', 'between', 'youll', 'we', 'here', 'hers',
                  'both', 'my', 'ill', 'against', 'arent', 'thats', 'from', 'would', 'been', 'whos',
                  'whom', 'themselves', 'until', 'more', 'an', 'those', 'me', 'myself', 'theyve', 'this',
                  'while', 'theirs', 'didnt', 'theres', 'ive', 'is', 'it', 'cant', 'itself', 'im',
                  'in', 'id', 'if', 'same', 'how', 'shouldnt', 'after', 'such', 'wheres', 'hows',
                  'off', 'i', 'youre', 'well', 'so', 'the', 'yours', 'being', 'over', 'isnt',
                  'through', 'during', 'hell', 'its', 'before', 'wed', 'had', 'lets', 'has', 'ought',
                  'then', 'them', 'they', 'not', 'nor', 'wont', 'theyre', 'each', 'shed', 'because',
                  'doing', 'some', 'shes', 'our', 'ourselves', 'out', 'for', 'does', 'be', 'by',
                  'on', 'about', 'wouldnt', 'of', 'could', 'youve', 'or', 'own', 'whats', 'dont',
                  'into', 'youd', 'yourself', 'down', 'doesnt', 'theyd', 'couldnt', 'your', 'her', 'hes',
                  'there', 'hed', 'their', 'too', 'was', 'himself', 'that', 'but', 'hadnt', 'shant',
                  'with', 'than', 'he', 'whys', 'below', 'were', 'and', 'his', 'wasnt', 'am',
                  'few', 'mustnt', 'as', 'shell', 'at', 'have', 'any', 'again', 'hasnt', 'theyll',
                  'no', 'when', 'other', 'which', 'you', 'who', 'most', 'ours', 'why', 'having',
                  'once', 'a', '-', '.', ',']
    for line in value.splitlines():
        word = line.split(':::')
        authors = word[1].split('::')
        title = word[2]
        for author in authors:
            for term in title.split():
                if term not in stop_words:
                    if term.isalnum():
                        yield author, term.lower()
                    elif len(term) > 1:
                        # keep letters and digits, turn hyphens into spaces,
                        # and drop all other punctuation
                        temp = ''
                        for ichar in term:
                            if ichar.isalpha() or ichar.isdigit():
                                temp += ichar
                            elif ichar == '-':
                                temp += ' '
                        yield author, temp.lower()

def reducefn(key, value):
    # count how many times each term was emitted for this author
    terms = value
    result = {}
    for term in terms:
        if term in result:
            result[term] += 1
        else:
            result[term] = 1
    return result

# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")
#print results

# save the results
result_file = open('hw3_result.txt', 'w')
# note: this sorted() call discards its return value, so it has no effect
# (hence the sorting problem described in the postscript below)
sorted(results.iteritems(), key=operator.itemgetter(1))
for result in results:
    result_file.write(result + ' : ')
    for term in results[result]:
        result_file.write(term + ':' + str(results[result][term]) + '#')
    result_file.write('\r\n')
result_file.close()
3. Run
Open two cmd windows: one runs your own program and the other runs mincemeat.py, which act as the server and a worker client respectively. After the server finishes processing, you can see the results.
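For reference, the two commands look something like this, where hw3.py is a placeholder for whatever you named the script above, and the password must match the one passed to run_server():

python hw3.py
python mincemeat.py -p changeme localhost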
As for the principle, it works by simulating MapReduce. For background, take a look at this article:
Introduction to MapReduce
The output file content is as follows:
José Cristóbal Riquelme Santos : evolutionary: 1 # numeric: 2 # selection: 1 # reglas: 1 # clasificacin: 1 # efficient: 1 # feature: 1 # induccin: 1 # mediante: 1 # discovering: 1 # soap: 1 # rules: 1 # de: 3 # evolutivo: 1 # oblicuas: 1 # association: 1 # algoritmo: 1 # algorithm: 1 # via: 1 # un: 1 # mtodo: 1 # attributes: 1 #
Larry L. Kinney : control: 1 # microprogrammed: 1 # testing: 1 # number: 1 # detection: 1 # registers: 1 # feedback: 1 # group: 1 # strategy: 1 # intrainverted: 1 # units: 1 # evaluation: 1 # method: 1 # linear: 1 # concurrent: 1 # probing: 1 # relating: 1 # chips: 1 # a: 1 # cyclic: 1 # shift: 1 # large: 1 # behavior: 1 # error: 1 #
Gianfranco Bilardi : operations: 1 # logp: 1 # characterization: 1 # fat trees: 1 # computation: 1 # its: 1 # functions: 1 # for: 1 # locality: 1 # temporal: 1 # memory: 1 # hierarchies: 1 # across: 1 # versus: 1 # monotone: 1 # bsp: 1 # broadcast: 1 # a: 1 # lower: 1 # crew pram: 1 # portability: 1 # of: 1 # bounds: 1 # time: 1 # associative: 1 #
Joseph C. Culberson : binary: 1 # search: 1 # extended: 1 # polygons: 1 # polygon: 1 # orthogonal: 1 # simple: 1 # abstract: 1 # uncertainty: 1 # searching: 1 # trees: 1 # number: 1 # minimum: 1 # orthogonally: 1 # updates: 1 # convex: 1 # effect: 1 # covering: 1 # the: 1 #
.....................................
4. Postscript
It is a pity that sorting the results by word frequency did not work. I think the reason is that the word:count mapping is itself stored as the value of a dictionary, and I could not figure out how to sort a dict by its values. So, when completing the week-3 quiz, I had to pick out the two largest counts by eye.
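In hindsight, the usual trick is to sort the (word, count) pairs rather than the dict itself; a minimal sketch, using sorted() with a key function:

import operator

counts = {'learners': 5, 'quality': 3, 'creating': 2, 'lifelong': 1}
# sort the (term, count) pairs by count, largest first
for term, n in sorted(counts.items(), key=operator.itemgetter(1), reverse=True):
    print(term + ': ' + str(n))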
Finally, I have found myself reaching for Python several times recently; it seems I cannot escape it.
5. Supplement:
Thanks to zdw12242 for this modification.
It sorts each author's terms by frequency. mapfn remains unchanged; reducefn and the main program change as follows:
def reducefn(key, value):
    terms = value
    counts = {}
    for term in terms:
        if term in counts:
            counts[term] += 1
        else:
            counts[term] = 1
    # sort the counts
    items = counts.items()
    reverse_items = [[v[1], v[0]] for v in items]
    reverse_items.sort(reverse=True)
    result = []
    for i in reverse_items:
        result.append([i[1], i[0]])
    return result

# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")

# save results
result_file = open('hw3_result_sorted', 'w')
for result in results:
    result_file.write(result + ' : ')
    for term in results[result]:
        result_file.write(term[0] + ':' + str(term[1]) + ',')
    result_file.write('\n')
result_file.close()
Because the dictionary type is stored as a hash table in Python, its contents come out unordered, so the dictionary is converted into a list and sorted before output.
First, convert the dictionary into a list:
[['creating', 2], ['lifelong', 1], ['quality', 3], ['learners', 5], ['assurance', 1]]
Then, swap the two elements of each sublist:
[[2, 'creating'], [1, 'lifelong'], [3, 'quality'], [5, 'learners'], [1, 'assurance']]
Sort the sublists by their first element, from largest to smallest:
[[5, 'learners'], [3, 'quality'], [2, 'creating'], [1, 'lifelong'], [1, 'assurance']]
Finally, swap the elements back and return the list:
[['learners', 5], ['quality', 3], ['creating', 2], ['lifelong', 1], ['assurance', 1]]