Install and use mrjob

Source: Internet
Author: User

1. Install mrjob

pip install mrjob

For pip installation, see the previous article.

2. Code Testing

Once mrjob is installed, you can use it right away. If Hadoop is already configured, no additional configuration is required (the environment variable HADOOP_HOME must be set), and mrjob-based programs can run directly on the Hadoop platform.

mrjob provides several ways to run code: 1) local testing, which runs the code directly on the local machine; 2) simulating Hadoop locally; 3) running the code on a Hadoop cluster. Let's start with running locally.
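As a quick sanity check before submitting to a cluster, the environment variable can be inspected from Python. This is only a minimal sketch: the helper name hadoop_home_configured is hypothetical and not part of mrjob, and it only checks that the variable is set, not that it points at a working Hadoop installation.

```python
import os


def hadoop_home_configured(env=None):
    """Return True if HADOOP_HOME is set to a non-empty value.

    `env` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching os.environ.
    """
    env = os.environ if env is None else env
    return bool(env.get("HADOOP_HOME"))


if __name__ == "__main__":
    if hadoop_home_configured():
        print("HADOOP_HOME =", os.environ["HADOOP_HOME"])
    else:
        print("HADOOP_HOME is not set")
```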

A piece of code from the official website:

    from mrjob.job import MRJob


    class MRWordCounter(MRJob):
        def mapper(self, key, line):
            # Emit (word, 1) for every word on the input line.
            for word in line.split():
                yield word, 1

        def reducer(self, word, occurrences):
            # Add up all the 1s emitted for this word.
            yield word, sum(occurrences)


    if __name__ == '__main__':
        MRWordCounter.run()

Run locally: python MRWordCounter.py -r inline < input > output

This writes the result to the file output.

Another usage: python MRWordCounter.py -r inline input1 prints the result directly to the screen. Multiple inputs can be given this way, for example python MRWordCounter.py -r inline input1 input2 input3.

Use python MRWordCounter.py -r inline input1 input2 input3 > out to write the results of processing multiple files to out.

Simulate Hadoop locally: python MRWordCounter.py -r local < input > output

This writes the result to the file output, which must be specified.

Run on the Hadoop cluster: python MRWordCounter.py -r hadoop < input > output
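The three invocations differ only in the value passed to -r. The sketch below builds the command line for a given runner and, optionally, runs it with the output sent to a file. The helper names build_command and run_to_file are hypothetical, and the script and input file names are placeholders; only the -r values inline, local, and hadoop come from the text above.

```python
import subprocess

RUNNERS = ("inline", "local", "hadoop")


def build_command(script, runner, inputs):
    """Return the argv list for running an mrjob script with a given runner."""
    if runner not in RUNNERS:
        raise ValueError("unknown runner: %r" % (runner,))
    return ["python", script, "-r", runner, *inputs]


def run_to_file(script, runner, inputs, out_path):
    """Run the job and write its standard output to out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(build_command(script, runner, inputs),
                       stdout=out, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so no job script is needed.
    print(" ".join(build_command("MRWordCounter.py", "inline",
                                 ["input1", "input2", "input3"])))
```

Calling run_to_file("MRWordCounter.py", "inline", ["input1", "input2"], "out") would be the programmatic counterpart of redirecting with > out on the shell.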

3. mrjob usage

The official documentation covers the usage of mrjob comprehensively. Only the most basic points are covered here.

First, let's analyze the code above.

The simplest way to write a map-reduce task is to override the mapper, combiner, and reducer methods of MRJob. In the default configuration, the key passed to mapper is None and the value is one line of input. The mapper yields (word, 1) pairs, which are passed between tasks as JSON, so your Python installation must support JSON. The combiner and reducer receive the key emitted by the mapper. Note the value argument: according to the official documentation, it is an iterator over the numbers emitted for that key, which is why using the sum function here makes sense. The final reducer output is a key-value pair separated by a tab character.

mrjob also lets you define multiple steps by overriding the steps() method. The following code shows how:
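To make that data flow concrete, here is a plain-Python sketch of the word-count job that uses only the standard library. It is an illustration of the idea, not how mrjob is implemented: the function run_word_count is hypothetical, and the sort-then-group shuffle is an assumption about the general map-reduce pattern.

```python
import json
from itertools import groupby


def mapper(_, line):
    # The incoming key is ignored (None by default); emit (word, 1) per word.
    for word in line.split():
        yield word, 1


def reducer(word, occurrences):
    # `occurrences` is an iterator over the values emitted for this word.
    yield word, sum(occurrences)


def run_word_count(lines):
    # Map phase: encode each pair as "<JSON key>\t<JSON value>".
    rows = []
    for line in lines:
        for key, value in mapper(None, line):
            rows.append(json.dumps(key) + "\t" + json.dumps(value))

    # Shuffle phase: sorting brings rows with the same key together.
    rows.sort()
    pairs = (tuple(json.loads(part) for part in row.split("\t", 1))
             for row in rows)

    # Reduce phase: hand each key and an iterator of its values to reducer.
    output = []
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        values = (value for _, value in group)
        for out_key, out_value in reducer(key, values):
            output.append(json.dumps(out_key) + "\t" + json.dumps(out_value))
    return output


if __name__ == "__main__":
    for row in run_word_count(["a b a", "b c"]):
        print(row)
```

Each output row is a JSON-encoded key and a JSON-encoded value separated by a tab, which matches the tab-separated key-value output described above.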

    from mrjob.job import MRJob
    from mrjob.step import MRStep


    class MRDoubleWordFreqCount(MRJob):
        """Word frequency count job with an extra step to double all the
        values"""

        def get_words(self, _, line):
            for word in line.split():
                yield word.lower(), 1

        def sum_words(self, word, counts):
            yield word, sum(counts)

        def double_counts(self, word, counts):
            # As a mapper in the second step, this receives a single total.
            yield word, counts * 2

        def steps(self):
            # MRStep replaces the self.mr() helper used by older mrjob releases.
            return [MRStep(mapper=self.get_words,
                           combiner=self.sum_words,
                           reducer=self.sum_words),
                    MRStep(mapper=self.double_counts)]


    if __name__ == '__main__':
        MRDoubleWordFreqCount.run()
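The way the output of one step feeds the next can be sketched in plain Python without mrjob. The helpers run_step and run_job below are hypothetical and only approximate the idea; the combiner is left out because it is an optimization that does not change the result in this example.

```python
from itertools import groupby


def get_words(_, line):
    for word in line.split():
        yield word.lower(), 1


def sum_words(word, counts):
    yield word, sum(counts)


def double_counts(word, count):
    # In the second step the value is a single total, not an iterator.
    yield word, count * 2


def run_step(pairs, mapper=None, reducer=None):
    """Apply one step (optional mapper, optional reducer) to (key, value) pairs."""
    if mapper is not None:
        pairs = [out for key, value in pairs for out in mapper(key, value)]
    if reducer is not None:
        # Sort so that groupby sees all values of a key next to each other.
        pairs = sorted(pairs, key=lambda kv: kv[0])
        pairs = [out
                 for key, group in groupby(pairs, key=lambda kv: kv[0])
                 for out in reducer(key, (v for _, v in group))]
    return list(pairs)


def run_job(lines):
    # The first mapper sees (None, line); later steps see the previous output.
    pairs = [(None, line) for line in lines]
    pairs = run_step(pairs, mapper=get_words, reducer=sum_words)
    pairs = run_step(pairs, mapper=double_counts)
    return pairs


if __name__ == "__main__":
    print(run_job(["Hello world", "hello"]))
```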

mrjob is easy to use, and it makes it easy to test code and develop quickly.
