Guidelines for Using Python Frameworks with Hadoop


I recently joined Cloudera, and before that I spent almost 10 years working in computational biology/genomics. My analytical work is done mainly in Python, together with its great scientific computing stack. Most of the Apache Hadoop ecosystem, however, is implemented in Java and intended for Java, which was very annoying. So my first priority became finding Hadoop frameworks that I could use from Python.

In this article, I will write down some of my personal, unscientific views on these frameworks, covering:

    • Hadoop Streaming
    • mrjob
    • Dumbo
    • Hadoopy
    • Pydoop
    • Others

Ultimately, in my opinion, Hadoop Streaming is the fastest and most transparent option, and it is best suited to text processing. mrjob is best for getting up and running quickly on Amazon EMR, but it carries a significant performance penalty. Dumbo is convenient for more complex jobs (with objects as keys), but it is still slower than Streaming.

Read on to learn more about implementation details, performance, and feature comparisons.


An interesting problem

To test the different frameworks, we will not run the usual "word count" experiment; instead, we will transform the Google Books N-gram data. An n-gram is a tuple of n words. The n-gram dataset provides counts of every 1-, 2-, 3-, 4-, and 5-gram observed in the Google Books corpus, grouped by year. Each row in the n-gram dataset consists of three fields: the n-gram, the year, and the number of observations. (You can get the data at http://books.google.com/ngrams.)

We want to aggregate the data to count how often any pair of words occurs near each other, grouped by year. This lets us determine whether a pair of words occurs more frequently in a given year than usual. We define two words as "near" each other if they appear within four words of each other, or, equivalently, if they appear together in a 2-, 3-, 4-, or 5-gram record. The final output of the experiment therefore consists of (2-gram, year, count) records.


There is a subtle point that must be emphasized. Each count in the n-gram dataset is computed over the entire Google Books corpus. In principle, given the 5-gram dataset, I could compute the 4-, 3-, and 2-gram datasets simply by aggregating over the appropriate n-grams. For example, if the 5-gram dataset contains the records

(the, cat, in, the, hat)       1999     20
(the, cat, is, on, youtube)    1999     13
(how, was, you, doing, today)  1986     5000

then we can aggregate the first two into the following 2-gram record:

(the, cat)     1999     33     // i.e., 20 + 13

However, in practice, only n-grams that appear more than 40 times across the entire corpus are counted. So, if a 5-gram falls below this threshold of 40 occurrences, Google also provides the 2-grams that make up that 5-gram, some of which may reach the threshold. For this reason, we use the 2-grams of adjacent words, 3-grams of words separated by one other word, 4-grams of words separated by two, and so on. In other words, compared with a given 2-gram, a 3-gram contributes only its outermost words. Besides being more robust to possibly sparse n-gram data, using only the outermost words of an n-gram also helps avoid double counting; for example, the 3-gram (the, cat, is) contributes only to the pair (the, is), while the pair (the, cat) is already counted from the 2-gram dataset. In total, we run the computation over the 2-, 3-, 4-, and 5-gram datasets.


The MapReduce pseudocode to implement this solution looks like this:



def map(record):
    (ngram, year, count) = unpack(record)
    # ensure word1 is the lexicographically first word
    (word1, word2) = sorted(ngram[first], ngram[last])
    key = (word1, word2, year)
    emit(key, count)

def reduce(key, values):
    emit(key, sum(values))


Hardware

These MapReduce jobs were executed on a random subset of the data totalling about 20 GB. The full dataset covers 1500 files; we used this script to select the random subset. The file names are kept intact, which is important because the file name determines the value of n for the n-grams in that block of data.
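The selection script itself is not reproduced here (it is in the GitHub repository); a minimal sketch of the idea, with hypothetical local paths and sample size, might look like this:

import os
import random
import shutil

SRC = '/data/ngrams/full'      # hypothetical location of the 1500 input files
DST = '/data/ngrams/subset'    # hypothetical destination for the ~20 GB sample

files = sorted(os.listdir(SRC))
for name in random.sample(files, 120):        # sample size is an assumption
    # keep the original file name: it encodes the value of n for the block
    shutil.copy(os.path.join(SRC, name), os.path.join(DST, name))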

The Hadoop cluster consists of 5 virtual nodes running CentOS 6.2 x64, each with 4 CPUs, 10 GB of RAM, 100 GB of disk, and running CDH4. The cluster can execute 20 map tasks in parallel at a time, and each job was run with 10 reducers.

The software versions running on the cluster are as follows:

  hadoop: 2.0.0-cdh4.1.2
  python: 2.6.6
  mrjob: 0.4-dev
  dumbo: 0.21.36
  hadoopy: 0.6.0
  pydoop: 0.7 (the latest version in PyPI)
  java: 1.6


Implementations

Most of the Python frameworks wrap Hadoop Streaming, others wrap Hadoop Pipes, and some are based on their own implementations. Below I share my experience writing the Hadoop job with a variety of Python tools, along with a comparison of performance and features. I am mostly interested in ease of getting up and running; I did not try to optimize the performance of any individual package.

Every dataset contains some corrupted records. For each record, we check for errors, including missing fields and the wrong n-gram size. In the latter case, we must know the record's file name in order to determine the n-gram size.

All code can be obtained from GitHub.


Hadoop Streaming

Hadoop Streaming provides a way to use any executable program as the mapper or reducer of a Hadoop job, including standard Unix tools and Python scripts. The program must read data from standard input according to the prescribed semantics and write its results to standard output. One drawback of using Streaming directly is that, although the input to the reducer is grouped by key, it is still delivered one line at a time, and the user has to identify the boundaries between keys.

Here is the code for the mapper:

#!/usr/bin/env python

import os
import re
import sys

# determine value of N in the current block of ngrams by parsing the file name
input_file = os.environ['map_input_file']
expected_tokens = int(re.findall(r'([\d]+)gram', os.path.basename(input_file))[0])

for line in sys.stdin:
    data = line.rstrip('\n').split('\t')

    # perform some error checking
    if len(data) < 3:
        continue

    # unpack data
    ngram = data[0].split()
    year = data[1]
    count = data[2]

    # more error checking
    if len(ngram) != expected_tokens:
        continue

    # build key and emit
    pair = sorted([ngram[0], ngram[expected_tokens - 1]])
    print >>sys.stdout, "%s\t%s\t%s\t%s" % (pair[0], pair[1], year, count)

And here is the reducer:

#!/usr/bin/env python

import sys

total = 0
prev_key = False

for line in sys.stdin:
    data = line.split('\t')
    curr_key = '\t'.join(data[:3])
    count = int(data[3])

    # found a boundary; emit current sum
    if prev_key and curr_key != prev_key:
        print >>sys.stdout, "%s\t%i" % (prev_key, total)
        prev_key = curr_key
        total = count
    # same key; accumulate sum
    else:
        prev_key = curr_key
        total += count

# emit last key
if prev_key:
    print >>sys.stdout, "%s\t%i" % (prev_key, total)

Hadoop Streaming separates the key and the value with a tab character by default. Because we also split our fields with tab characters, we have to tell Hadoop that our key consists of the first three fields by passing these options:

-jobconf stream.num.map.output.key.fields=3
-jobconf stream.num.reduce.output.key.fields=3

The command to execute the Hadoop job is:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
    -input /ngrams \
    -output /output-streaming \
    -mapper mapper.py \
    -combiner reducer.py \
    -reducer reducer.py \
    -jobconf stream.num.map.output.key.fields=3 \
    -jobconf stream.num.reduce.output.key.fields=3 \
    -jobconf mapred.reduce.tasks=10 \
    -file mapper.py \
    -file reducer.py

Note that mapper.py and reducer.py each appear twice in the command: the first occurrence tells Hadoop which executables to run, and the second (the -file options) tells Hadoop to distribute those files to every node in the cluster.

The underlying mechanics of Hadoop Streaming are simple and transparent. By contrast, the Python frameworks perform their own serialization/deserialization in opaque ways, which consumes extra resources. Also, if a Hadoop installation already exists, Streaming works without any additional software having to be configured on it. Not to mention that Unix commands or Java class names can also be passed as mappers/reducers.

The disadvantage of Streaming is that everything has to be done by hand. The user must decide how to encode objects as key-value pairs (for example, as JSON). Support for binary data is also poor. And, as mentioned above, the reducer has to track key boundaries manually, which is error-prone.
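To illustrate the first point, one common workaround is to JSON-encode a composite key into a single field; a minimal sketch of the idea (this is not taken from the article's code, and the field layout is an assumption):

import json
import sys

# mapper side: serialize the composite key as one opaque JSON string
def emit_pair(word1, word2, year, count):
    key = json.dumps([word1, word2, year])
    sys.stdout.write('%s\t%s\n' % (key, count))

# reducer side: decode the key before using it
def parse_line(line):
    key_json, count = line.rstrip('\n').split('\t')
    word1, word2, year = json.loads(key_json)
    return (word1, word2, year), int(count)

Keeping the whole key in a single tab-separated field would also make the stream.num.*.output.key.fields options above unnecessary.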

mrjob

mrjob is an open-source Python framework that wraps Hadoop Streaming and is actively developed by Yelp. Since Yelp operates entirely on Amazon Web Services, mrjob's integration with EMR is incredibly smooth and easy (it uses the boto package).

mrjob provides a Python API on top of Hadoop Streaming and lets the user work with arbitrary objects as keys and values. By default these objects are serialized internally as JSON, but pickle is also supported. There are no other out-of-the-box binary I/O formats, but there is a mechanism for implementing a custom serializer.
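As a concrete illustration of that protocol mechanism, here is a minimal sketch based on mrjob's documented protocol classes; the job itself is hypothetical and not the n-gram task:

from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol, PickleProtocol, RawValueProtocol

class ProtocolDemo(MRJob):
    # how raw input lines are handed to the first mapper
    INPUT_PROTOCOL = RawValueProtocol
    # how key/value pairs travel between mapper and reducer (JSON is the default)
    INTERNAL_PROTOCOL = PickleProtocol
    # how the final reducer output is written
    OUTPUT_PROTOCOL = JSONProtocol

    def mapper(self, _, line):
        # keys can be arbitrary (picklable) Python objects
        yield (line.split('\t')[0], len(line)), 1

    def reducer(self, key, counts):
        yield key, sum(counts)

if __name__ == '__main__':
    ProtocolDemo.run()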

It is worth noting that mrjob seems to be developed very actively and has good documentation.

Like all of the Python framework implementations, the mrjob version looks a lot like the pseudocode above:

#!/usr/bin/env python

import os
import re

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol, ReprProtocol

class NgramNeighbors(MRJob):

    # mrjob allows you to specify input/intermediate/output serialization
    # default output protocol is JSON; here we set it to text
    OUTPUT_PROTOCOL = RawProtocol

    def mapper_init(self):
        # determine value of N in the current block of ngrams by parsing filename
        input_file = os.environ['map_input_file']
        self.expected_tokens = int(re.findall(r'([\d]+)gram', os.path.basename(input_file))[0])

    def mapper(self, key, line):
        data = line.split('\t')

        # error checking
        if len(data) < 3:
            return

        # unpack data
        ngram = data[0].split()
        year = data[1]
        count = int(data[2])

        # more error checking
        if len(ngram) != self.expected_tokens:
            return

        # generate key
        pair = sorted([ngram[0], ngram[self.expected_tokens - 1]])
        k = pair + [year]

        # note: the key is an object (a list in this case)
        # that mrjob will serialize as JSON text
        yield (k, count)

    def combiner(self, key, counts):
        # the combiner must be separate from the reducer because the input
        # and output must both be JSON
        yield (key, sum(counts))

    def reducer(self, key, counts):
        # the final output is encoded as text
        yield "%s\t%s\t%s" % tuple(key), str(sum(counts))

if __name__ == '__main__':
    # sets up a runner, based on command line options
    NgramNeighbors.run()

mrjob only needs to be installed on the client machine from which the job is submitted. Here is the command to run it:

export HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce"
./ngrams.py -r hadoop --hadoop-bin /usr/bin/hadoop --jobconf mapred.reduce.tasks=10 -o hdfs:///output-mrjob hdfs:///ngrams


Writing MapReduce jobs is very intuitive and straightforward. However, there is a significant cost incurred by the internal serialization scheme. A binary scheme would most likely have to be implemented by the user (for example, to support typedbytes). There are also some built-in utilities for parsing log files. Finally, mrjob lets the user write multi-step MapReduce workflows, where the intermediate output of one MapReduce job is automatically fed as input to the next MapReduce job.
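A minimal sketch of such a multi-step flow, using the MRStep mechanism in current mrjob releases (the 0.4-dev version used here exposed an equivalent self.mr() helper); the two steps below are hypothetical and not part of the n-gram job:

from mrjob.job import MRJob
from mrjob.step import MRStep

class TwoStepJob(MRJob):

    def steps(self):
        # output of the first step automatically becomes input of the second
        return [
            MRStep(mapper=self.mapper_count, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_max),
        ]

    def mapper_count(self, _, line):
        pair, year, count = line.split('\t')[:3]
        yield (pair, year), int(count)

    def reducer_sum(self, key, counts):
        pair, year = key
        # re-key by year so the next step sees all pairs for a year together
        yield year, (sum(counts), pair)

    def reducer_max(self, year, totals):
        # emit the most frequent pair for each year
        yield year, max(totals)

if __name__ == '__main__':
    TwoStepJob.run()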

(Note: the remaining implementations are all very similar, apart from package-specific details. They can all be found here.)

Dumbo

Dumbo is another framework that wraps Hadoop Streaming. Dumbo appeared quite early and should have been used by many people, but its lack of documentation makes it hard to develop with. In that respect it is not as good as mrjob.

Dumbo serializes via typedbytes, which allows more compact data transfer, and it can read SequenceFiles or other file formats more naturally by specifying a JavaInputFormat. Dumbo can also execute Python eggs and Java JAR files.
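For reference, a Dumbo job is written as a pair of generator functions wired together with dumbo.run(); a simplified sketch along the lines of the n-gram task (the file-name handling that determines n is omitted here):

import dumbo

def mapper(key, value):
    # value is one line of an ngram file: "ngram<TAB>year<TAB>count"
    data = value.split('\t')
    if len(data) < 3:
        return
    ngram = data[0].split()
    pair = sorted([ngram[0], ngram[-1]])
    yield (pair[0], pair[1], data[1]), int(data[2])

def reducer(key, values):
    yield key, sum(values)

if __name__ == '__main__':
    dumbo.run(mapper, reducer)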


As far as I recall, I had to install Dumbo manually on every node, and it only worked when typedbytes and Dumbo were built as eggs. Just as it would stop because of MemoryErrors, it would also stop when using a combiner.

The command to run the Dumbo job is:

dumbo start ngrams.py \
    -hadoop /usr \
    -hadooplib /usr/lib/hadoop-0.20-mapreduce/contrib/streaming \
    -numreducetasks 10 \
    -input hdfs:///ngrams \
    -output hdfs:///output-dumbo \
    -outputformat text \
    -inputformat text


Hadoopy

Hadoopy is another Streaming wrapper that is compatible with Dumbo. Similarly, it uses typedbytes to serialize data, and it can write typedbytes data directly to HDFS.

It has a nice debugging feature, whereby it can write messages to standard output without interfering with the Streaming process. It is very similar to Dumbo, but its documentation is much better. The documentation also covers integration with Apache HBase.

When using Hadoopy, there are two ways to launch jobs:

    • launch requires Python/hadoopy to be installed on every node, but the overhead after that is small.
    • launch_frozen does not require Python to be installed on the nodes; it is deployed at runtime, but this adds roughly 15 seconds of overhead (which can reportedly be reduced with some optimizations and caching tricks).

A hadoopy job must be launched from inside a Python program; there is no built-in command-line tool.

I wrote a script that launches hadoopy via launch_frozen:

python launch_hadoopy.py
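The actual launcher script is in the GitHub repository; a minimal sketch of what such a script might look like, assuming hadoopy's launch_frozen(input, output, script) call and a hypothetical job file named ngrams_hadoopy.py:

#!/usr/bin/env python
# hypothetical launcher; the real script lives in the article's repository
import hadoopy

INPUT_PATH = 'hdfs:///ngrams'            # HDFS locations assumed to match the
OUTPUT_PATH = 'hdfs:///output-hadoopy'   # other examples in this article

if __name__ == '__main__':
    # launch_frozen freezes the job (plus a Python runtime) and ships it to the
    # nodes; hadoopy.launch() can be used instead once hadoopy is installed
    # cluster-wide
    hadoopy.launch_frozen(INPUT_PATH, OUTPUT_PATH, 'ngrams_hadoopy.py')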

After running it with launch_frozen, I installed hadoopy on every node and ran it again with the launch method; performance was noticeably better.


Pydoop

In contrast to the other frameworks, Pydoop wraps Hadoop Pipes, the C++ API to Hadoop. Because of this, the project claims to offer a richer interface to Hadoop and HDFS, as well as better performance, though I did not verify this. One benefit, however, is that it is possible to implement a Partitioner, RecordReader, and RecordWriter in Python. All input and output must be strings.
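Since I could not build the package (as noted below), no working Pydoop code appears in this article. For orientation only, here is a minimal sketch in the style of the word-count example shipped with the 0.x Pipes API; the pydoop.pipes class and method names are written from memory and should be treated as assumptions:

import pydoop.pipes as pp

class Mapper(pp.Mapper):
    def map(self, context):
        # all values arrive as strings
        for word in context.getInputValue().split():
            context.emit(word, '1')

class Reducer(pp.Reducer):
    def reduce(self, context):
        total = 0
        while context.nextValue():
            total += int(context.getInputValue())
        # output must also be strings
        context.emit(context.getInputKey(), str(total))

if __name__ == '__main__':
    pp.runTask(pp.Factory(Mapper, Reducer))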

Most importantly, I could not successfully build Pydoop, either from pip or from source.

Others

    • Happy is a framework for writing Hadoop jobs with Jython, but it appears to be dead.
    • Disco is a mature, non-Hadoop MapReduce implementation. Its core is written in Erlang and it provides a Python API; it is developed by Nokia and is not as widely used as Hadoop.
    • Octopy is a pure-Python implementation of MapReduce in a single source file; it is not suitable for "real" computation.
    • Mortar is another recently released Python option; it lets users submit Apache Pig or Python jobs through a web application against data stored on Amazon S3.
    • There are interfaces to higher-level components of the Hadoop ecosystem, such as Apache Hive and Pig. Pig lets users write custom functions in Python, which it runs through Jython (see the sketch after this list). Hive also has a Python wrapper called hipy.
    • (Added Jan. 7, 2013) Luigi is a Python framework for managing multi-step job flows. It is a bit like Apache Oozie, but it has a built-in lightweight wrapper around Hadoop Streaming. A very nice feature of Luigi is that it surfaces the Python traceback when a job fails, and its command-line interface is great. Its README is extensive, but it lacks detailed reference documentation. Luigi is developed by Spotify and is used extensively internally there.
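To illustrate the Pig integration mentioned above: a Python UDF is an ordinary function in a .py file that Pig registers and executes through Jython. A minimal, hypothetical sketch mirroring the word-pair logic of this article (Pig supplies the outputSchema decorator when the file is registered with REGISTER 'udfs.py' USING jython AS myfuncs):

# udfs.py -- hypothetical Python UDF file for Pig, executed via Jython
try:
    outputSchema            # injected by Pig when the script is registered
except NameError:
    def outputSchema(schema):           # no-op fallback so the module also
        def wrap(func):                 # imports outside of Pig
            return func
        return wrap

@outputSchema('pair:chararray')
def near_pair(word1, word2):
    # order the two words so (a, b) and (b, a) collapse to one key,
    # mirroring the sorted() step in the MapReduce implementations above
    return '%s %s' % tuple(sorted([word1, word2]))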


Native Java

Finally, I also implemented the MapReduce job using the new Hadoop Java API. After compiling it, I ran it like this:

hadoop jar /root/ngrams/native/target/NgramsComparison-0.0.1-SNAPSHOT.jar NgramsDriver hdfs:///ngrams hdfs:///output-native


Special Notes on Counters

In my initial implementations of the MapReduce jobs, I used counters to keep track of bad records. In Streaming, doing so requires writing the counter information to stderr. This turned out to add a cost too large to ignore: the Streaming job took 3.4 times as long as the native Java job. The frameworks had the same problem.
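Hadoop Streaming's counter mechanism works by writing specially formatted lines to stderr; a minimal sketch of what the bad-record tracking looks like in a Streaming mapper (the counter group and names are hypothetical):

import sys

def count_bad_record(kind):
    # Hadoop Streaming picks up counter updates from stderr lines of the form
    #   reporter:counter:<group>,<counter>,<amount>
    sys.stderr.write('reporter:counter:ngrams,bad_%s,1\n' % kind)

# usage inside the mapper loop, e.g.:
#   if len(data) < 3:
#       count_bad_record('missing_field')
#       continue

Every bad record triggers one extra write to stderr, which is where the overhead measured above comes from.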


Performance comparison

The MapReduce job implemented in Java serves as the performance baseline. The values for the Python frameworks are the ratios of their runtimes relative to the Java runtime.

Java is obviously the fastest; Streaming takes roughly 50% more time, and the Python frameworks take more time still. Profiling the mrjob mapper shows that it spends a great deal of time on serialization/deserialization. Dumbo and Hadoopy do a little better in this regard. Dumbo can be faster if a combiner is used.

Features comparison

The feature comparison is drawn mostly from the documentation and code repositories of the respective packages.

Conclusion

Streaming is the fastest Python solution, and there is no magic in it. But be very careful when implementing the reducer logic with it, and when working with many complex objects.

All of the Python frameworks look like pseudocode, which is great.

mrjob is quick to get going with, mature, and easy to use; it makes it easy to organize multi-step MapReduce workflows and to work with complex objects. It also works seamlessly with EMR. But it is also the slowest performer.

The other Python frameworks are less popular. Their main selling point is built-in support for binary formats, but if needed this can also be implemented in user code.

In summary:

    • Hadoop Streaming is the best choice in general; it is easy to use as long as you are careful with the reducer.
    • If you are willing to accept the computational overhead, choose mrjob, because it integrates best with Amazon EMR.
    • If the application is more complex, with composite keys and multi-step workflows, Dumbo is the most appropriate. It is slower than Streaming but faster than mrjob.

If you have practical experience of your own, or if you find errors in this article, please point them out in the comments.
