The first parameter, iter, is an iterator over the key-value pairs generated by the map instances. Each word is consistently assigned to a reduce instance (by partitioning on the key), so all occurrences of the same word are processed by the same reduce instance, which ensures the final totals are correct. The second parameter, params, plays the same role as in the map function. Here we simply use disco.util.kvgroup() to group the counts for each word, sum them, and yield the result.
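To make the grouping step concrete, here is a minimal sketch of what kvgroup does with a sorted iterator of (key, value) pairs. This is an illustration built on itertools.groupby, not Disco's actual implementation of disco.util.kvgroup:

```python
from itertools import groupby
from operator import itemgetter

def kvgroup(kviter):
    # Yield (key, value-iterator) pairs from an iterator of (key, value)
    # pairs that is already sorted by key; each value-iterator must be
    # consumed before advancing to the next key.
    for key, group in groupby(kviter, key=itemgetter(0)):
        yield key, (value for _, value in group)

pairs = sorted([("dog", 1), ("cat", 1), ("dog", 1), ("cat", 1), ("cat", 1)])
counts = {word: sum(values) for word, values in kvgroup(pairs)}
print(counts)  # -> {'cat': 3, 'dog': 2}
```

This is why the reduce function sorts iter before calling kvgroup: grouping only works if equal keys are adjacent.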
Run Job
The code below starts the job. A job can be customized with many parameters, but for simple tasks typically only three are used. Besides starting the job, we also need to output its results. First we wait for the job to complete by calling wait(), which blocks until the job finishes and then returns the results. For convenience, wait() and other related methods are called through the job object.
The result_iterator() function takes the list of result file addresses returned by wait() and iterates over the key-value pairs in all of the results.
from disco.core import Job, result_iterator

def map(line, params):
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
If everything is set up properly, you can watch the job execute: it reads its input and finally prints the word counts. While the job is running, you can open the Disco master's web interface (at the master's port) to follow the job's progress in real time.
python count_words.py
You can also follow the job's events in the console by setting the DISCO_EVENTS environment variable:
DISCO_EVENTS=1 python count_words.py
As you can see, creating a new Disco job is fairly straightforward, and you can extend this simple example in many ways. For example, you could filter out a list of stop words by passing it through the params object.
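As a sketch of that idea, the map function below skips stop words carried in params. The Params class here is a plain stand-in for disco.core.Params, and the stop_words attribute name is my own choice, not part of Disco's API:

```python
class Params:
    # Stand-in for disco.core.Params: a simple attribute bag that Disco
    # passes unchanged to every map and reduce call.
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

def map(line, params):
    # Skip any word found in the (hypothetical) stop_words set.
    for word in line.split():
        if word.lower() not in params.stop_words:
            yield word, 1

params = Params(stop_words={"the", "a", "and"})
result = list(map("the cat and the dog", params))
print(result)  # -> [('cat', 1), ('dog', 1)]
```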
If you have stored the data in the Disco Distributed Filesystem (DDFS), you can try changing the input to tag://data:bigtxt and adding map_reader=disco.worker.task_io.chain_reader.
You can try using sum_combiner() to make the job more efficient.
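A combiner pre-aggregates counts inside each map task, so far less data crosses the network to the reduces. The sketch below shows the idea; the (key, value, buffer, done, params) signature is my assumption based on Disco's classic worker interface, and this is not Disco's own sum_combiner implementation:

```python
def sum_combiner(key, value, buf, done, params):
    # Accumulate partial sums per key inside a map task; when done is
    # True, flush the buffer as the task's output instead of emitting
    # one (word, 1) pair per occurrence.
    if done:
        return buf.items()
    buf[key] = buf.get(key, 0) + value

buf = {}
for k, v in [("dog", 1), ("cat", 1), ("dog", 1)]:
    sum_combiner(k, v, buf, False, None)
combined = dict(sum_combiner(None, None, buf, True, None))
print(combined)  # -> {'dog': 2, 'cat': 1}
```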
You can also try writing custom partition and reader functions, which are defined in the same way as the map and reduce functions. You can then chain jobs together, so that the output of one job becomes the input of the next.
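A minimal sketch of a custom partition function follows. The (key, nr_partitions, params) signature is my assumption based on Disco's classic worker interface; the key point is that hashing the key routes every occurrence of a word to the same reduce instance:

```python
def partition(key, nr_partitions, params):
    # Route a key to one of nr_partitions reduce instances. Hashing the
    # key guarantees that equal keys always land in the same partition,
    # which is what makes per-word totals correct.
    return hash(str(key)) % nr_partitions

p = partition("word", 4, None)
assert 0 <= p < 4
# Within one process, the same key always maps to the same partition.
same = partition("word", 4, None) == partition("word", 4, None)
```

Note that Python 3 randomizes str hashes per process, so in a real distributed setting you would use a stable hash (e.g. from hashlib) rather than the built-in hash().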
Disco is designed to be as simple as possible so that you can focus on your own problems, not the framework.
The road to mathematics-distributed computing-disco (4)