Recently, I joined Cloudera. Before that, I had been working in computational biology/genomics for almost 10 years. My analysis work is based mainly on the Python language and its great scientific computing stack. However, the Apache Hadoop ecosystem is mostly implemented in, and designed for, Java, which annoyed me. So my top priority was to find Hadoop frameworks that can be used from Python.
In this article, I will write down some of my personal, unscientific views on these frameworks. They include:
- Hadoop Streaming
- Mrjob
- Dumbo
- Hadoopy
- Pydoop
- Others
Ultimately, in my opinion, Hadoop Streaming is the fastest and most transparent option, and the one best suited for text processing. Mrjob is best suited for getting up and running quickly on Amazon EMR, but with a significant performance cost. Dumbo is convenient for more complex jobs (objects as keys), but it is still slower than Streaming. Read on for implementation details and a comparison of performance and features.
An interesting problem
To test the different frameworks, we will not run the usual word-count experiment; instead we will transform the Google Books N-gram data. An n-gram is a tuple of n words. The n-gram dataset provides counts of all 1-, 2-, 3-, 4-, and 5-grams observed in the Google Books corpus. Each row in the n-gram dataset consists of three fields: the n-gram, the year, and the number of observations. (You can obtain the data at http://books.google.com/ngrams .)
We want to summarize the data to count how often any pair of nearby words appears together, grouped by year. The results would let us determine whether a word combination occurred more frequently than normal in a given year. We define two words as "near" each other if they appear within four words of each other, that is, if they occur together in any 2-, 3-, 4-, or 5-gram record. The final product of the experiment will therefore be 2-grams, each with a year and a count.
There is a subtle point that must be emphasized. The count of each n-gram in the dataset is computed over the entire Google Books corpus. In principle, given the 5-gram dataset, I could compute the 4-gram, 3-gram, and 2-gram datasets simply by aggregating over the correct n-grams. For example, if the 5-gram dataset contains
```
(the, cat, in, the, hat)       1999     20
(the, cat, is, on, youtube)    1999     13
(how, are, you, doing, today)  1986   5000
```
we could aggregate these into the following 2-gram record:
```
(the, cat)  1999  33      // that is, 20 + 13
```
However, in practice, only n-grams that appear more than 40 times across the entire corpus are counted. So, if a particular 5-gram falls short of the 40-occurrence threshold, Google also provides the counts of the 2-grams that compose it, some of which may meet the threshold. For this reason, we use the 2-gram data for adjacent words, the 3-gram data for words that are one word apart, the 4-gram data for words two apart, and so on. In other words, compared with a given 2-gram, the relevant 3-grams only add a word on the outside. Besides being more robust against possibly sparse n-gram data, using only the outermost words of an n-gram also helps avoid double counting. In total, we will compute over the combined 2-, 3-, 4-, and 5-gram datasets.
The MapReduce pseudo-code to implement this solution looks like this:
```
def map(record):
    (ngram, year, count) = unpack(record)
    # ensure word1 comes before word2 alphabetically
    (word1, word2) = sorted(ngram[first], ngram[last])
    key = (word1, word2, year)
    emit(key, count)

def reduce(key, values):
    emit(key, sum(values))
```
Hardware
These MapReduce jobs were executed on a random subset of the data, roughly 20 GB in size. The complete dataset spans 1,500 files; we used a script to select the random subset (a sketch of such a selection script is shown after this paragraph). It is important to keep the file names intact, because the file name determines the value of n for the n-grams in that block of data. The Hadoop cluster consists of five virtual nodes running CentOS 6.2 x64, each with 4 CPUs, 10 GB of RAM, 100 GB of disk capacity, and running CDH4. The cluster can execute 20 tasks in parallel at a time, and each job was run with 10 reducers.
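The selection script itself is not reproduced here; the following is only a rough sketch of how such a subset could be drawn while preserving the file names (the paths and sampling fraction are invented for illustration):

```python
#!/usr/bin/env python
# Hypothetical sketch: pick a random subset of n-gram files while keeping
# the original file names, since the name encodes the value of n.
import os
import random
import shutil

SRC = "/data/ngrams-full"      # assumed location of the full dataset
DST = "/data/ngrams-subset"    # assumed destination for the ~20 GB sample
FRACTION = 0.1                 # assumed sampling fraction

files = sorted(os.listdir(SRC))
subset = random.sample(files, int(len(files) * FRACTION))

os.makedirs(DST)
for name in subset:
    # copy under the same name so the n-gram size can still be parsed from it
    shutil.copy(os.path.join(SRC, name), os.path.join(DST, name))
```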
The software versions running on the cluster were as follows:
- Hadoop: 2.0.0-cdh4.1.2
- Python: 2.6.6
- Mrjob: 0.4-dev
- Dumbo: 0.21.36
- Hadoopy: 0.6.0
- Pydoop: 0.7, the latest version available on PyPI
- Java: 1.6
Implementation
Most of the Python frameworks wrap Hadoop Streaming, some wrap Hadoop Pipes, and some are based on their own implementations. Below, I share some of my experience using the various Python tools to write Hadoop jobs, along with a comparison of performance and features. I am interested in ease of use and getting things running; I will not try to optimize the performance of any single package.
Every dataset contains some corrupted records. For each record, we check for errors and identify the kind of error, including missing fields and an incorrect n-gram size. In the latter case, we must know the file name of the record in order to determine the expected n-gram size. All code is available on GitHub.
Hadoop Streaming
Hadoop Streaming lets you use any executable program as the mapper or reducer, including standard Unix tools and Python scripts. The program must read from standard input and write its results to standard output using the specified semantics. One disadvantage of using Streaming directly is that, although reducer input is grouped by key, it is still iterated over line by line, and the user has to detect the boundaries between keys.
Below is the mapper code:
```python
#!/usr/bin/env python

import os
import re
import sys

# determine value of n in the current block of ngrams by parsing the filename
input_file = os.environ['map_input_file']
expected_tokens = int(re.findall(r'([\d]+)gram', os.path.basename(input_file))[0])

for line in sys.stdin:
    data = line.split('\t')

    # perform some error checking
    if len(data) < 3:
        continue

    # unpack data
    ngram = data[0].split()
    year = data[1]
    count = data[2]

    # more error checking
    if len(ngram) != expected_tokens:
        continue

    # build key and emit
    pair = sorted([ngram[0], ngram[expected_tokens - 1]])
    print >>sys.stdout, "%s\t%s\t%s\t%s" % (pair[0], pair[1], year, count)
```
The following is the reducer:
```python
#!/usr/bin/env python

import sys

total = 0
prev_key = False

for line in sys.stdin:
    data = line.split('\t')
    curr_key = '\t'.join(data[:3])
    count = int(data[3])

    # found a boundary; emit current sum
    if prev_key and curr_key != prev_key:
        print >>sys.stdout, "%s\t%i" % (prev_key, total)
        prev_key = curr_key
        total = count
    # same key; accumulate sum
    else:
        prev_key = curr_key
        total += count

# emit last key
if prev_key:
    print >>sys.stdout, "%s\t%i" % (prev_key, total)
```
By default, Hadoop Streaming uses a tab character to separate the key from the value. Because we also use tab characters to separate our fields, we must pass the following options to tell Hadoop that the first three fields of the data constitute the key:
```
-jobconf stream.num.map.output.key.fields=3
-jobconf stream.num.reduce.output.key.fields=3
```
The command to execute the Hadoop job is:
```bash
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
        -input /ngrams \
        -output /output-streaming \
        -mapper mapper.py \
        -combiner reducer.py \
        -reducer reducer.py \
        -jobconf stream.num.map.output.key.fields=3 \
        -jobconf stream.num.reduce.output.key.fields=3 \
        -jobconf mapred.reduce.tasks=10 \
        -file mapper.py \
        -file reducer.py
```
Note that mapper.py and reducer.py appear twice in the command. The first occurrence tells Hadoop which executables to run; the second tells Hadoop to distribute the two files to all nodes in the cluster.
The underlying mechanics of Hadoop Streaming are simple and transparent. By contrast, the Python frameworks perform their own serialization/deserialization in a relatively opaque way, which consumes additional resources. Moreover, if a working Hadoop installation is already available, Streaming runs without any extra software having to be configured, not to mention the ability to pass Unix commands or Java class names as mappers/reducers.
The disadvantage of Streaming is that everything has to be done by hand. You must decide for yourself how to encode objects as keys and values (for example, as JSON objects), and binary data is not supported. And, as mentioned above, the reducer must keep track of key boundaries manually, which is error prone.
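To make that manual encoding concrete, here is a tiny sketch of how a richer key might be emitted as JSON text from a Streaming mapper (the field names are invented for illustration):

```python
import json
import sys

# hypothetical richer key, encoded as text so Streaming can shuffle it
key = {"word1": "cat", "word2": "the", "year": 1999}
value = 20

# sort_keys makes the encoding deterministic, so equal keys serialize to
# identical strings and therefore group together in the reducer
sys.stdout.write("%s\t%d\n" % (json.dumps(key, sort_keys=True), value))
```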
Mrjob
Mrjob is an open-source Python framework that wraps Hadoop Streaming and is actively developed by Yelp. Since Yelp operates entirely on Amazon Web Services, mrjob's integration with EMR is incredibly smooth and easy (using the boto package).
Mrjob provides a Python API on top of Hadoop Streaming and allows users to work with arbitrary objects as keys and values. By default, these objects are serialized internally as JSON, but objects that can be pickled are also supported. No other out-of-the-box binary I/O formats are available, but there is a mechanism for implementing custom serialization.
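Switching serialization is just a matter of setting class attributes. Below is a minimal sketch (the class and the trivial job are illustrative only, not part of the benchmark code):

```python
from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol, PickleProtocol

class PickledInternals(MRJob):
    # JSON is the default; using pickle internally lets arbitrary Python
    # objects flow between the mapper, combiner, and reducer
    INTERNAL_PROTOCOL = PickleProtocol
    OUTPUT_PROTOCOL = JSONProtocol

    def mapper(self, _, line):
        # a tuple key, serialized transparently by the chosen protocol
        yield ("length", len(line)), 1

    def reducer(self, key, counts):
        yield key, sum(counts)

if __name__ == '__main__':
    PickledInternals.run()
```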
It is worth noting that mrjob appears to be developed very actively and has good documentation.
Like all of the Python frameworks, the implementation reads like pseudo-code:
```python
#!/usr/bin/env python

import os
import re

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol, ReprProtocol

class NgramNeighbors(MRJob):

    # mrjob allows you to specify input/intermediate/output serialization
    # default output protocol is JSON; here we set it to text
    OUTPUT_PROTOCOL = RawProtocol

    def mapper_init(self):
        # determine value of n in the current block of ngrams by parsing filename
        input_file = os.environ['map_input_file']
        self.expected_tokens = int(re.findall(r'([\d]+)gram', os.path.basename(input_file))[0])

    def mapper(self, key, line):
        data = line.split('\t')

        # error checking
        if len(data) < 3:
            return

        # unpack data
        ngram = data[0].split()
        year = data[1]
        count = int(data[2])

        # more error checking
        if len(ngram) != self.expected_tokens:
            return

        # generate key
        pair = sorted([ngram[0], ngram[self.expected_tokens - 1]])
        k = pair + [year]

        # note that the key is an object (a list in this case)
        # that mrjob will serialize as JSON text
        yield (k, count)

    def combiner(self, key, counts):
        # the combiner must be separate from the reducer because the input
        # and output must both be JSON
        yield (key, sum(counts))

    def reducer(self, key, counts):
        # the final output is encoded as text
        yield "%s\t%s\t%s" % tuple(key), str(sum(counts))

if __name__ == '__main__':
    # sets up a runner, based on command line options
    NgramNeighbors.run()
```
Mrjob only needs to be installed on the client machine, which is where the job is submitted. The command to run it is:
```bash
export HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce"
./ngrams.py -r hadoop --hadoop-bin /usr/bin/hadoop --jobconf mapred.reduce.tasks=10 -o hdfs:///output-mrjob hdfs:///ngrams
```
Writing MapReduce code with mrjob is very intuitive and simple. However, the internal serialization scheme carries a significant cost; a binary scheme would most likely have to be implemented by the user (for example, to support typedbytes). There are also some built-in utilities for log file parsing. Finally, mrjob allows users to write multi-step MapReduce workflows, where the intermediate output of one MapReduce job is automatically used as the input to the next.
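As a rough illustration of what such a multi-step workflow looks like, here is a sketch using the MRStep class from more recent mrjob releases (the steps API may differ slightly in the 0.4-dev version benchmarked here, and the task itself is invented for illustration):

```python
from mrjob.job import MRJob
from mrjob.step import MRStep

class TwoStepJob(MRJob):
    # the output of the first step's reducer feeds the second step's reducer
    def steps(self):
        return [
            MRStep(mapper=self.mapper_extract, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_top),
        ]

    def mapper_extract(self, _, line):
        fields = line.split('\t')
        if len(fields) >= 2:
            yield fields[0], int(fields[-1])

    def reducer_sum(self, key, counts):
        # funnel everything to one key so the next step sees all totals
        yield None, (sum(counts), key)

    def reducer_top(self, _, totals):
        # emit only the key with the largest total
        total, key = max(totals)
        yield key, total

if __name__ == '__main__':
    TwoStepJob.run()
```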
(Note: the implementations for the other frameworks are all extremely similar; apart from package-specific details, they can all be found here.)
Dumbo
Dumbo is another framework that wraps Hadoop Streaming. It has been around longer and presumably has accumulated users, but the lack of documentation makes it harder to develop with; in this respect it falls short of mrjob.
Dumbo performs serialization with typedbytes, which allows for more compact data transfer, and it can read SequenceFiles or other file formats by specifying a JavaInputFormat. Dumbo can also execute Python eggs and Java JAR files.
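For reference, a minimal dumbo job has roughly the following shape (a sketch based on dumbo's mapper/reducer/dumbo.run() convention, not the exact code used in these benchmarks):

```python
# Sketch of a dumbo job; with text input, each value is one line of the file.
def mapper(key, value):
    fields = value.split('\t')
    if len(fields) >= 3:
        ngram = fields[0].split()
        year = fields[1]
        count = int(fields[2])
        pair = sorted([ngram[0], ngram[-1]])
        # a tuple can be used directly as the key thanks to typedbytes
        yield (pair[0], pair[1], year), count

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
```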
In my experience, I had to install dumbo manually on every node, and it only ran once typedbytes and dumbo had been built as Python eggs. It would also terminate with MemoryErrors, and it failed when combiners were used.
The command to run the dumbo job is:
```bash
dumbo start ngrams.py \
        -hadoop /usr \
        -hadooplib /usr/lib/hadoop-0.20-mapreduce/contrib/streaming \
        -numreducetasks 10 \
        -input hdfs:///ngrams \
        -output hdfs:///output-dumbo \
        -outputformat text \
        -inputformat text
```
Hadoopy
Hadoopy is another Streaming wrapper that is compatible with dumbo. It likewise uses typedbytes for serialization, and it can write typedbytes data directly to HDFS.
It has a nice debugging feature that lets it write messages directly to standard output without interfering with the Streaming process. It is similar to dumbo, but its documentation is much better; the documentation also covers integration with Apache HBase.
With hadoopy, there are two ways to launch a job:
- launch requires Python/hadoopy to already be installed on every node, but the launch overhead is then small.
- launch_frozen does not require Python to be installed on the nodes; it is installed at runtime, but this adds roughly 15 seconds of overhead (supposedly this can be shortened with some optimization and caching tricks).
A hadoopy job must be launched from within a Python program; there is no built-in command-line tool.
I wrote a script that starts hadoopy via launch_frozen:
python launch_hadoopy.py
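The contents of launch_hadoopy.py are not reproduced here; a minimal sketch of such a launch script, based on hadoopy's documented launch_frozen call (the paths and the jobconf list are assumptions), might look like this:

```python
# Hypothetical sketch of a launch script; launch_frozen freezes ngrams.py
# together with its Python dependencies and ships it to the cluster nodes.
import hadoopy

if __name__ == '__main__':
    hadoopy.launch_frozen('/ngrams', '/output-hadoopy', 'ngrams.py',
                          jobconfs=['mapred.reduce.tasks=10'])
```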
After running it with launch_frozen, I installed hadoopy on every node and ran the job again with the launch method; the performance was significantly better.
Pydoop
In contrast to the other frameworks, pydoop wraps Hadoop Pipes, the C++ API of Hadoop. Because of this, the project claims to offer richer Hadoop and HDFS interfaces as well as equally good performance; I did not verify this. One advantage, however, is that a Partitioner, RecordReader, and RecordWriter can all be implemented in Python. All input and output must be strings.
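I was not able to run pydoop myself (more on that below), but based on the Pipes-style API that pydoop documents for this era of releases, a mapper/reducer pair would have roughly this shape (a sketch only, not the benchmark code):

```python
# Rough sketch based on pydoop's documented Pipes API; not verified on this
# cluster, since the build did not succeed for me.
import pydoop.pipes as pp

class Mapper(pp.Mapper):
    def map(self, context):
        fields = context.getInputValue().split('\t')
        if len(fields) >= 3:
            ngram = fields[0].split()
            pair = sorted([ngram[0], ngram[-1]])
            # all keys and values must be strings
            context.emit('\t'.join(pair + [fields[1]]), fields[2])

class Reducer(pp.Reducer):
    def reduce(self, context):
        total = 0
        while context.nextValue():
            total += int(context.getInputValue())
        context.emit(context.getInputKey(), str(total))

if __name__ == "__main__":
    pp.runTask(pp.Factory(Mapper, Reducer))
```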
Most importantly, I could not successfully build pydoop, either via pip or from source.
Others
- Happy is a framework for writing Hadoop jobs with Jython, but it appears to be dead.
- Disco is a mature, non-Hadoop implementation of MapReduce. Its core is written in Erlang and it provides a Python API. It is developed by Nokia and is not as widely used as Hadoop.
- Octopy is a pure Python MapReduce implementation. It has only one source file and is not suitable for "real" computing.
- Mortar is another Python option that was released fairly recently. It lets users submit Apache Pig or Python jobs through a web application to process data stored on Amazon S3.
- There are also some higher-level interfaces in the Hadoop ecosystem, such as Apache Hive and Pig. Pig lets you write custom functions in Python, which are run through Jython. Hive also has a Python wrapper called hipy.
- (Added Jan. 7 2013) Luigi is a Python framework for managing multi-step job flows. It is somewhat similar to Apache Oozie, but it also ships with a lightweight built-in wrapper around Hadoop Streaming. A very nice feature of Luigi is that it surfaces the Python stack trace when a job fails, and its command-line interface is also very good. Its README contains a lot of material, but detailed reference documentation is lacking. Luigi is developed by Spotify and is used widely there; a small sketch of a Luigi task flow follows this list.
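The sketch below shows roughly how two dependent Luigi tasks are wired together (task names, file paths, and the placeholder work are invented for illustration; this is not Spotify's or the article's code):

```python
# Hypothetical two-step Luigi flow; Luigi tracks dependencies between tasks
# and re-runs only the steps whose outputs are missing.
import luigi

class ExtractPairs(luigi.Task):
    def output(self):
        return luigi.LocalTarget('pairs.tsv')

    def run(self):
        with self.output().open('w') as out:
            out.write('the\tcat\t1999\t33\n')  # placeholder work

class SummarizePairs(luigi.Task):
    def requires(self):
        return ExtractPairs()

    def output(self):
        return luigi.LocalTarget('summary.tsv')

    def run(self):
        out = self.output().open('w')
        for line in self.input().open():
            out.write(line)  # placeholder aggregation
        out.close()

if __name__ == '__main__':
    # run with: python flow.py SummarizePairs --local-scheduler
    luigi.run()
```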
A special note on counters
In my initial implementation of the MR jobs, I used counters to track and count bad records. In Streaming, this requires writing the information to stderr. It turns out this incurs a non-negligible overhead: the Streaming job took 3.4 times as long as the native Java job. The frameworks have the same problem.
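In Streaming, a counter update is just a specially formatted line written to stderr. A minimal sketch (the group and counter names are made up):

```python
import sys

def increment_counter(group, counter, amount=1):
    # Hadoop Streaming treats stderr lines of this form as counter updates
    sys.stderr.write("reporter:counter:%s,%s,%i\n" % (group, counter, amount))

# e.g., called from the mapper whenever a record fails validation
increment_counter("ngrams", "malformed_records")
```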
Performance Comparison
A MapReduce job implemented in Java serves as the performance baseline; the values for the Python frameworks are reported as ratios relative to that Java baseline.
Java is clearly the fastest, Streaming takes about half again as long, and the Python frameworks take considerably more time still. Profiling the mrjob mapper shows that a great deal of time is spent on serialization/deserialization. Dumbo and hadoopy do better in this respect, and dumbo could be even faster if a combiner were used.
Feature comparison
Most of this information comes from the documentation and code repositories of the respective packages.
Conclusion
Streaming is the fastest Python option, and there is no magic anywhere in it. But special care is needed when implementing the reducer logic and when working with more complex objects.
All of the Python frameworks read like pseudo-code, which is great.
Mrjob is updated quickly, mature, and easy to use. It makes it easy to organize multi-step MapReduce workflows and to work conveniently with complex objects. It also allows seamless use of EMR. But it is also the slowest performer.
There are a few other, less popular Python frameworks whose main advantage is built-in support for binary formats, but if needed this can be implemented entirely in user code. For now:
- Hadoop Streaming is the best choice in most cases; it is easy to use as long as you are careful with the reducer.
- If you can live with the computational overhead, choose mrjob; it works best with Amazon EMR.
- If the application is more complex, involves composite keys, or combines multiple steps, dumbo is the most suitable; it is slower than Streaming but faster than mrjob.
If you have gained your own insights in practice, or if you find an error in this article, please leave a reply.
http://www.oschina.net/translate/a-guide-to-python-frameworks-for-hadoop
Original article: A Guide to Python Frameworks for Hadoop