The road to mathematics-distributed computing-disco (4)


The first parameter, iter, is an iterator over the key-value pairs generated by the map function that have been assigned to this reduce instance.

Words are distributed across the reduce instances, but a given word is always routed to the same reduce instance. Because that one reduce sees every count for that word, the final total for each word is correct.
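Conceptually, this routing works like hashing the key modulo the number of reduce partitions. The sketch below is an illustration of the idea, not Disco's actual partition function:

```python
# Sketch (assumption, not Disco's implementation): route each key to one of
# nr_partitions reduce instances. The same key always maps to the same
# partition within a run, so all counts for a word reach the same reduce.
def partition(key, nr_partitions):
    return hash(key) % nr_partitions

# The same word always lands on the same reduce instance.
same = partition("cat", 4) == partition("cat", 4)
```

Note that Python's built-in `hash` is randomized per process for strings, so a real partitioner would use a stable hash; the point here is only the key-to-partition mapping.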

The second parameter, params, is the same as in the map function. Here the reduce simply uses disco.util.kvgroup() to group each word's counts, sums them, and yields the result.
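To see what kvgroup() does, here is a minimal pure-Python equivalent (a sketch, not the library's code): given an iterator of (key, value) pairs sorted by key, it yields each key once together with an iterator over that key's values.

```python
# A minimal sketch of disco.util.kvgroup's behavior: group runs of equal
# keys in a sorted (key, value) stream.
from itertools import groupby
from operator import itemgetter

def kvgroup(sorted_kv_iter):
    for key, group in groupby(sorted_kv_iter, key=itemgetter(0)):
        yield key, (value for _, value in group)

pairs = sorted([("a", 1), ("b", 1), ("a", 1)])
result = {word: sum(counts) for word, counts in kvgroup(pairs)}
# result == {"a": 2, "b": 1}
```

This is why the real reduce sorts iter before grouping: kvgroup only merges *adjacent* equal keys.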

Run Job

Next we start the job. A job can be customized with many parameters, but for simple tasks typically only three are used. Besides starting the job, we also need to output its results: first we wait for the job to complete by calling wait(), which returns the result locations once the job finishes. For convenience, wait() and other related methods are called on the job object.

The result_iterator() function takes the list of result file addresses returned by wait() and iterates over the key-value pairs in all of the results.

from disco.core import Job, result_iterator

def map(line, params):
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)

All content on this blog is original; if you reproduce it, please credit the source: http://blog.csdn.net/myhaspl/

If everything is set up properly, you can watch the job execute: the input is read and the final word counts are printed at the end. While the job runs, you can open the Disco master's web interface to follow its progress in real time.

Python count_words.py

You can also follow the job's events on the console like this:

DISCO_EVENTS=1 python count_words.py

As you can see, creating a new Disco job is fairly straightforward. You can extend this simple example in any number of ways. For example, you could filter out stop words by passing a list of them through the params object.
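A stop-word-aware map function might look like the sketch below. The params object is simulated here with SimpleNamespace so the example is self-contained; in a real Disco job you would pass the stop-word set via the job's params argument (check your Disco version's Params API for the exact form):

```python
# Sketch: a map function that skips stop words supplied via params.
# The stopwords attribute is an assumption for illustration.
from types import SimpleNamespace

def map(line, params):
    for word in line.split():
        if word.lower() not in params.stopwords:
            yield word, 1

params = SimpleNamespace(stopwords={"the", "and", "a"})
counts = list(map("the cat and the hat", params))
# counts == [("cat", 1), ("hat", 1)]
```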

If you have stored the data in the Disco Distributed File System, you can try changing the input to tag://data:bigtxt and adding map_reader=disco.worker.task_io.chain_reader.

You can try using sum_combiner() to make the job more efficient.
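A combiner pre-aggregates map output on each node before the shuffle, so far fewer (word, count) pairs cross the network. The sketch below shows that effect in plain Python; it illustrates the idea only and does not use Disco's combiner signature:

```python
# Sketch of what a combiner buys you: partial sums computed locally on a
# map node before results are sent to the reduce instances.
def combine(pairs):
    buf = {}
    for word, count in pairs:
        buf[word] = buf.get(word, 0) + count
    return list(buf.items())

mapped = [("cat", 1), ("hat", 1), ("cat", 1)]
combined = combine(mapped)
# Three pairs are collapsed into two: cat -> 2, hat -> 1.
```

The reduce step is unchanged; it simply receives partial sums instead of unit counts, and summing them yields the same totals.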

You can also try writing custom partition and reader functions, defined in the same way as the map and reduce functions. You can then try chaining jobs together, so that the output of one job becomes the input of the next.

Disco is designed to be as unobtrusive as possible, so that you can focus on your own problem rather than on the framework.
