1. Install mrjob
pip install mrjob
For pip installation, see the previous article.
2. Code Testing
After mrjob is installed, it can be used directly. If Hadoop is already configured (the HADOOP_HOME environment variable must be set), no additional configuration is required, and mrjob-based programs can run directly on the Hadoop platform.
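As a quick sanity check before using the hadoop runner, you can verify from Python that HADOOP_HOME is visible; the snippet below is only an illustrative check, not part of mrjob itself:

# Illustrative check that HADOOP_HOME is set (not part of mrjob).
import os

hadoop_home = os.environ.get("HADOOP_HOME")
if hadoop_home:
    print("HADOOP_HOME =", hadoop_home)
else:
    print("HADOOP_HOME is not set")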
mrjob provides several ways to run your code: 1) running the code directly on the local machine for testing; 2) simulating Hadoop locally; 3) running the code on a Hadoop cluster. Next, let's look at running locally.
A piece of code from the official website:
from mrjob.job import MRJob

class MRWordCounter(MRJob):

    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
Run locally: python MRWordCounter.py -r inline < input > output
This writes the results to the file output.
Another usage: python MRWordCounter.py -r inline input1 prints the results directly to the screen, and several input files can be given this way, for example python MRWordCounter.py -r inline input1 input2 input3.
Use python MRWordCounter.py -r inline input1 input2 input3 > out to write the results of processing multiple files to out.
Simulate Hadoop locally: python MRWordCounter.py -r local < input > output
This writes the results to output; the output must be specified.
Run on the Hadoop cluster: python MRWordCounter.py -r hadoop < input > output
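Besides the command line, a job can also be started from Python through mrjob's runner interface. Below is a minimal sketch using the make_runner() pattern documented for older mrjob releases (newer releases replace stream_output()/parse_output_line() with cat_output()/parse_output()); the module name MRWordCounter and the file input.txt are assumptions for the example:

# Minimal sketch: run MRWordCounter programmatically with the inline runner.
# 'input.txt' is a placeholder input file.
from MRWordCounter import MRWordCounter  # assumes the job lives in MRWordCounter.py

mr_job = MRWordCounter(args=['-r', 'inline', 'input.txt'])
with mr_job.make_runner() as runner:
    runner.run()                            # execute all map-reduce steps
    for line in runner.stream_output():     # raw tab-separated output lines
        word, count = mr_job.parse_output_line(line)
        print(word, count)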
3. mrjob usage
The usage of mrjob is covered comprehensively in its official documentation; only the most basic parts are written up here.
First, let's analyze the code above.
The simplest way to write a map-reduce job is to override the mapper, combiner, and reducer methods of MRJob. In the default configuration, the key passed to mapper is None and the value is a line of input; the mapper yields (word, 1) pairs, which are passed between tasks encoded as JSON, so your Python must have JSON support. Likewise, the key received by the combiner and reducer is the word; pay attention to the value part: according to the official documentation, it is an iterator over the numbers, which is why using the sum function here is reasonable. The final reducer output is key-value pairs separated by a tab character.
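To make the value iterator and the tab-separated output concrete, here is a small plain-Python illustration (no mrjob involved); the grouped values for the word "hello" are made up for the example:

# Plain-Python illustration of the reducer semantics described above.
import json

def reducer(word, occurrences):
    # occurrences behaves like an iterator of numbers, so sum() works
    yield word, sum(occurrences)

# hypothetical grouped values: "hello" appeared three times in the input
for key, value in reducer("hello", iter([1, 1, 1])):
    # the key and value are written as JSON (mrjob's default output
    # protocol), separated by a tab
    print(json.dumps(key) + "\t" + json.dumps(value))  # prints: "hello"	3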
mrjob also lets you define multiple steps by overriding the steps() function. The following code shows the process:
from mrjob.job import MRJob

class MRDoubleWordFreqCount(MRJob):
    """Word frequency count job with an extra step to double all the values"""

    def get_words(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def sum_words(self, word, counts):
        yield word, sum(counts)

    def double_counts(self, word, counts):
        yield word, counts * 2

    def steps(self):
        return [self.mr(mapper=self.get_words,
                        combiner=self.sum_words,
                        reducer=self.sum_words),
                self.mr(mapper=self.double_counts)]

if __name__ == '__main__':
    MRDoubleWordFreqCount.run()
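The self.mr() helper above comes from older mrjob releases; newer versions build the steps with MRStep from mrjob.step instead. A sketch of the same job under that assumption:

# Sketch of the same two-step job using MRStep (newer mrjob API)
# instead of the older self.mr() helper.
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRDoubleWordFreqCount(MRJob):

    def get_words(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def sum_words(self, word, counts):
        yield word, sum(counts)

    def double_counts(self, word, counts):
        yield word, counts * 2

    def steps(self):
        return [MRStep(mapper=self.get_words,
                       combiner=self.sum_words,
                       reducer=self.sum_words),
                MRStep(mapper=self.double_counts)]

if __name__ == '__main__':
    MRDoubleWordFreqCount.run()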
mrjob is easy to use; it makes it simple to test code locally and to develop quickly.