A Guide to Using Python Frameworks with Hadoop


I recently joined Cloudera, and before that I had worked in computational biology/genomics for almost 10 years. My analytical work is done mainly in Python, together with its great scientific computing stack. But it annoyed me that most of the Apache Hadoop ecosystem is implemented in Java and aimed at Java users. So my top priority was to find Hadoop frameworks that Python can use.

In this article, I write down my personal, unscientific impressions of several of these frameworks, which include:

    • Hadoop Streaming
    • Mrjob
    • Dumbo
    • Hadoopy
    • Pydoop
    • Others

In the end, in my opinion, Hadoop Streaming is the fastest and most transparent option, and the best suited for text processing. Mrjob is best suited for getting up and running quickly on Amazon EMR, but it suffers a significant performance penalty. Dumbo is convenient for more complex jobs (with objects as keys), but is still slower than Streaming.

Read on to see implementation details, performance, and functionality comparisons.


An interesting problem

To test the different frameworks, we did not run the usual "word count" experiment; instead, we transformed the Google Books N-gram data. An n-gram is a tuple of n words. The n-gram dataset provides counts for all 1-, 2-, 3-, 4-, and 5-grams in the Google Books corpus, grouped by year. Each row of the n-gram dataset consists of three fields: the n-gram, the year, and the number of observations. (You can get the data at http://books.google.com/ngrams.)

We want to aggregate the data to count how often any pair of words occurring near each other is observed, grouped by year. The result would let us determine whether a word pair occurs more or less frequently than usual in a given year. Two words are defined as "near" if they occur within four words of each other; equivalently, two words are "near" if they appear together in a 2-, 3-, 4-, or 5-gram record. So the final output of the experiment contains a 2-gram, a year, and a count.


There is one subtlety that must be emphasized. The count of each n-gram in the dataset is computed over the entire Google Books corpus. In principle, given the 5-gram dataset, I could compute the 4-, 3-, and 2-gram datasets simply by aggregating over the correct n-grams. For example, if the 5-gram dataset contains

(the, cat, in, the, hat)        1999    20
(the, cat, is, on, youtube)     1999    13
(how, are, you, doing, today)   1986  5000


then we can aggregate it into a 2-gram dataset, producing the following record:

(the, cat)   1999    33    // i.e., 20 + 13

However, in practice, only n-grams that appear more than 40 times across the whole corpus are included. So, if a particular 5-gram falls below the 40-count threshold, Google also provides counts for the 2-grams that compose it, some of which may make it above the threshold. For this reason, we use only the two outermost words of each n-gram: adjacent pairs from the 2-gram data, pairs separated by one word from the 3-gram data, pairs separated by two words from the 4-gram data, and so on. In other words, compared with a given 2-gram, a 3-gram just has one extra word between its outermost words. Besides being less sensitive to possible sparsity in the n-gram data, using only the outermost words of an n-gram also helps avoid double counting. Overall, we compute over the 2-, 3-, 4-, and 5-gram datasets.
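To make the outermost-word rule concrete, here is a tiny illustrative snippet (not part of the original pipeline; the n-grams are made up) showing which pair each n-gram contributes:

# Illustration only: the contributed pair is always formed from the first
# and last words of the n-gram, sorted lexicographically.
def outer_pair(ngram):
    return tuple(sorted([ngram[0], ngram[-1]]))

print(outer_pair(("the", "cat")))                      # ('cat', 'the')  adjacent words
print(outer_pair(("the", "black", "cat")))             # ('cat', 'the')  separated by one word
print(outer_pair(("the", "cat", "in", "the", "hat")))  # ('hat', 'the')  separated by three words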


The MapReduce pseudocode to implement this solution looks like this:



def map(record):
    (ngram, year, count) = unpack(record)
    # ensure word1 is the lexicographically first word
    (word1, word2) = sorted(ngram[first], ngram[last])
    key = (word1, word2, year)
    emit(key, count)

def reduce(key, values):
    emit(key, sum(values))
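To make the pseudocode concrete, here is a minimal pure-Python sketch (illustration only, no Hadoop involved; the sample records are made up) that applies the same map and reduce logic in memory:

from collections import defaultdict

# Made-up intermediate records: (ngram, year, count)
records = [
    (("cat", "sat", "on", "a", "hat"), 1999, 10),   # 5-gram -> pair (cat, hat)
    (("cat", "wore", "hat"), 1999, 7),              # 3-gram -> pair (cat, hat)
    (("hat", "cat"), 1999, 5),                      # 2-gram -> pair (cat, hat)
]

sums = defaultdict(int)                              # stands in for shuffle + reduce
for ngram, year, count in records:
    word1, word2 = sorted([ngram[0], ngram[-1]])     # map: build the key
    sums[(word1, word2, year)] += count              # reduce: sum the counts

for (word1, word2, year), total in sorted(sums.items()):
    print("%s\t%s\t%s\t%d" % (word1, word2, year, total))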


Hardware

These MapReduce jobs were executed on a random subset of the data totaling roughly 20 GB. The complete dataset spans 1500 files; we used a script to select a random subset. The file names are kept intact, which is important because the file name determines the value of n for the n-grams in that block of data.
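The actual selection script is the one linked above; purely as an illustration of the idea (the paths and sampling fraction here are hypothetical), picking a random subset while keeping file names intact could look like this:

import os
import random
import shutil

# Hypothetical paths and sampling fraction; the real script may differ.
SOURCE_DIR = "/data/ngrams-full"
DEST_DIR = "/data/ngrams-subset"
FRACTION = 0.05

random.seed(42)                                   # reproducible subset
files = sorted(os.listdir(SOURCE_DIR))
subset = random.sample(files, max(1, int(len(files) * FRACTION)))

for name in subset:
    # keep the original file name: it encodes n (e.g. the "5gram" files)
    shutil.copy(os.path.join(SOURCE_DIR, name), os.path.join(DEST_DIR, name))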

The Hadoop cluster consists of 5 virtual nodes running CentOS 6.2 x64, each with 4 CPUs, 10 GB of RAM, and 100 GB of disk, running CDH4. The cluster can execute 20 map tasks in parallel at a time, and each job was set to run 10 reducers.

The software version running on the cluster is as follows:

  hadoop: 2.0.0-cdh4.1.2
  python: 2.6.6
  mrjob: 0.4-dev
  dumbo: 0.21.36
  hadoopy: 0.6.0
  pydoop: 0.7 (the latest version on PyPI)
  java: 1.6


Implementation

Most of the Python frameworks wrap Hadoop Streaming, some wrap Hadoop Pipes, and some are based on their own implementations. Below I share my experience of writing Hadoop jobs with the various Python tools, together with a performance and feature comparison. I am mostly interested in features that make it easy to get up and running; I did not try to optimize the performance of any individual package.

Each dataset contains a number of corrupted records that have to be dealt with during processing. For each record, we check for errors and identify the kind of error, including missing fields and an n-gram of the wrong size. For the latter case, we must know the file name of the record in order to determine the expected n-gram size.

All code can be obtained from GitHub.


Hadoop Streaming

Hadoop Streaming provides a way to use any executable program as Hadoop's mapper or reducer, including standard Unix tools and Python scripts. The program must read its data from standard input using the agreed-upon conventions and write its results to standard output. One drawback of using Streaming directly is that although the reducer's input is grouped by key, it is still iterated line by line, and the user must identify the boundaries between keys.

Here is the mapper code:

#!/usr/bin/env python

import os
import re
import sys

# determine value of N in the current block of ngrams by parsing the filename
input_file = os.environ['map_input_file']
expected_tokens = int(re.findall(r'([\d]+)gram', os.path.basename(input_file))[0])

for line in sys.stdin:
    data = line.split('\t')

    # perform some error checking
    if len(data) < 3:
        continue

    # unpack data
    ngram = data[0].split()
    year = data[1]
    count = data[2]

    # more error checking
    if len(ngram) != expected_tokens:
        continue

    # build key and emit
    pair = sorted([ngram[0], ngram[expected_tokens - 1]])
    print >>sys.stdout, "%s\t%s\t%s\t%s" % (pair[0], pair[1], year, count)

Here is the reducer:

#!/usr/bin/env python

import sys

total = 0
prev_key = False

for line in sys.stdin:
    data = line.split('\t')
    curr_key = '\t'.join(data[:3])
    count = int(data[3])

    # found a boundary; emit current sum
    if prev_key and curr_key != prev_key:
        print >>sys.stdout, "%s\t%i" % (prev_key, total)
        prev_key = curr_key
        total = count
    # same key; accumulate sum
    else:
        prev_key = curr_key
        total += count

# emit last key
if prev_key:
    print >>sys.stdout, "%s\t%i" % (prev_key, total)

Hadoop Streaming separates the key and the value with a tab character by default. Because we also separate our fields with tab characters, we have to tell Hadoop that the key of our data consists of the first three fields, by passing these options to Hadoop:

-jobconf stream.num.map.output.key.fields=3
-jobconf stream.num.reduce.output.key.fields=3
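Just to illustrate what these options mean (this snippet is not part of the job): with four tab-separated fields per line and the key set to the first three of them, Hadoop sorts and groups on the (word1, word2, year) triple and treats the count as the value:

# Illustration only: how a mapper output line is split when the key
# spans the first three tab-separated fields.
line = "cat\that\t1999\t12"
num_key_fields = 3

fields = line.split("\t")
key = "\t".join(fields[:num_key_fields])    # "cat\that\t1999" -> used for sorting/grouping
value = "\t".join(fields[num_key_fields:])  # "12"
print("key=%r value=%r" % (key, value))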

The command to execute the Hadoop job is:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
    -input /ngrams \
    -output /output-streaming \
    -mapper mapper.py \
    -combiner reducer.py \
    -reducer reducer.py \
    -jobconf stream.num.map.output.key.fields=3 \
    -jobconf stream.num.reduce.output.key.fields=3 \
    -jobconf mapred.reduce.tasks=10 \
    -file mapper.py \
    -file reducer.py

Note that mapper.py and reducer.py appear in the command twice: the first time to tell Hadoop which executables to run, and the second to tell Hadoop to distribute the two files to every node in the cluster.

The underlying mechanics of Hadoop Streaming are simple and transparent. By contrast, the Python frameworks perform their own serialization/deserialization in opaque ways, which consumes extra resources. Also, if there is already a working Hadoop installation, Streaming runs without any other software having to be configured on it. Not to mention the ability to pass Unix commands or Java class names as mappers/reducers.

The drawback of Streaming is that everything must be done by hand. Users must decide for themselves how to encode objects as key-value pairs (for example, as JSON objects). Support for binary data is also poor. And, as mentioned above, manually keeping track of key boundaries in the reducer is very error-prone.
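As an aside (this is not from the original code), one way to avoid hand-rolled boundary tracking in a Streaming reducer is to lean on itertools.groupby, which detects the key changes for you:

#!/usr/bin/env python
# Alternative reducer sketch (illustration only): itertools.groupby
# detects key boundaries instead of tracking prev_key by hand.
import sys
from itertools import groupby

def parse(stdin):
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[:3]), int(fields[3])

for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    total = sum(count for _, count in group)
    sys.stdout.write("%s\t%i\n" % (key, total))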

Mrjob

Mrjob is an open-source Python framework that wraps Hadoop Streaming and is actively developed by Yelp. Since Yelp operates entirely on Amazon Web Services, Mrjob's integration with EMR is incredibly smooth and easy (it uses the boto package).

Mrjob provides a Python API on top of Hadoop Streaming and lets the user work with arbitrary objects as keys and values. By default these objects are serialized internally as JSON, but pickle is also supported. There are no other out-of-the-box binary I/O formats, but there is a mechanism for implementing custom serialization.
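For example, switching the intermediate serialization from JSON to pickle is just a class attribute. The sketch below is mine, not from the article, and assumes a recent Mrjob version where these protocol classes exist:

from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol, RawProtocol

class PickledJob(MRJob):
    # intermediate (mapper -> reducer) records serialized with pickle
    # instead of the default JSON; final output written as raw text
    INTERNAL_PROTOCOL = PickleProtocol
    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, _, line):
        yield line.split("\t")[0], 1

    def reducer(self, key, counts):
        yield key, str(sum(counts))

if __name__ == "__main__":
    PickledJob.run()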

It is worth noting that Mrjob appears to be developed very rapidly and has good documentation.

In all of the Python frameworks, the implementation looks much like the pseudocode above:

#!/usr/bin/env python

import os
import re

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol, ReprProtocol

class NgramNeighbors(MRJob):

    # mrjob allows you to specify input/intermediate/output serialization
    # default output protocol is JSON; here we set it to text
    OUTPUT_PROTOCOL = RawProtocol

    def mapper_init(self):
        # determine value of N in the current block of ngrams by parsing filename
        input_file = os.environ['map_input_file']
        self.expected_tokens = int(re.findall(r'([\d]+)gram', os.path.basename(input_file))[0])

    def mapper(self, key, line):
        data = line.split('\t')

        # error checking
        if len(data) < 3:
            return

        # unpack data
        ngram = data[0].split()
        year = data[1]
        count = int(data[2])

        # more error checking
        if len(ngram) != self.expected_tokens:
            return

        # generate key
        pair = sorted([ngram[0], ngram[self.expected_tokens - 1]])
        k = pair + [year]

        # note: the key is an object (a list in this case)
        # that mrjob will serialize as JSON text
        yield (k, count)

    def combiner(self, key, counts):
        # the combiner must be separate from the reducer because the input
        # and output must both be JSON
        yield (key, sum(counts))

    def reducer(self, key, counts):
        # the final output is encoded as text
        yield "%s\t%s\t%s" % tuple(key), str(sum(counts))

if __name__ == '__main__':
    # sets up a runner, based on command line options
    NgramNeighbors.run()

Mrjob only needs to be installed on the client machine, from which the job is submitted. Here is the command to run it:

export HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce"
./ngrams.py -r hadoop --hadoop-bin /usr/bin/hadoop --jobconf mapred.reduce.tasks=10 -o hdfs:///output-mrjob hdfs:///ngrams


Writing the MapReduce job is very intuitive and simple. However, the internal serialization scheme incurs a significant cost. A binary scheme would most likely have to be implemented by the user (e.g., to support typedbytes). There are also some built-in utilities for log file parsing. Finally, Mrjob allows the user to write multi-step MapReduce workflows, where the intermediate output from one MapReduce job is automatically used as the input to the next.
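As a sketch of what a multi-step workflow looks like (my example, not from the article; it assumes a recent Mrjob version where mrjob.step.MRStep is available, while older versions use self.mr() instead):

from mrjob.job import MRJob
from mrjob.step import MRStep

class TwoStepJob(MRJob):
    # the output of step 1 is automatically fed into step 2
    def steps(self):
        return [
            MRStep(mapper=self.mapper_count, reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_top),
        ]

    def mapper_count(self, _, line):
        yield line.split("\t")[0], 1

    def reducer_sum(self, key, counts):
        # funnel everything to a single key so step 2 sees all totals together
        yield None, (sum(counts), key)

    def reducer_top(self, _, pairs):
        # keep only the most frequent key
        count, key = max(pairs)
        yield key, count

if __name__ == "__main__":
    TwoStepJob.run()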

(Note: the rest of the implementations are all very similar, apart from package-specific details; they can all be found here.)

Dumbo

Dumbo is another framework that wraps Hadoop Streaming. Dumbo appeared fairly early and should have been used by many people, but due to its lack of documentation it is difficult to develop with. This is one point where it falls behind Mrjob.

Dumbo performs serialization via typedbytes, which allows for more compact data transfer, and it can read SequenceFiles or other file formats more naturally by specifying a Java InputFormat. Dumbo can also execute Python eggs and Java JAR files.
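The Dumbo source for this experiment is not reproduced here, but for orientation, a minimal Dumbo job follows the word-count pattern from Dumbo's own documentation (shown as a sketch; details may vary between versions):

# Minimal Dumbo-style job (sketch; word count rather than the n-gram job)
def mapper(key, value):
    # value is one line of input text
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    # the reducer can double as a combiner here
    dumbo.run(mapper, reducer, combiner=reducer)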


In my experience, I had to manually install Dumbo on every node, and it only ran when typedbytes and Dumbo were built as Python eggs. It would also fail to finish when combiners were used, terminating with MemoryErrors.

The command to run the Dumbo job is:

dumbo start ngrams.py \
    -hadoop /usr \
    -hadooplib /usr/lib/hadoop-0.20-mapreduce/contrib/streaming \
    -numreducetasks 10 \
    -input hdfs:///ngrams \
    -output hdfs:///output-dumbo \
    -outputformat text \
    -inputformat text


Hadoopy

Hadoopy is another Streaming wrapper that is compatible with Dumbo. Likewise, it uses typedbytes to serialize data and writes typedbytes data directly to HDFS.

It has a nice debugging mechanism, under which it can write messages directly to standard output without interfering with the Streaming process. It is very similar to Dumbo, but its documentation is much better. The documentation also covers integration with Apache HBase.

With Hadoopy there are two ways to launch jobs:

    • launch requires Python/Hadoopy to be installed on every node, but the overhead after that is small.
    • launch_frozen does not require Python to be installed on the nodes; it is installed at runtime, but this adds roughly 15 seconds of overhead (reportedly, some optimizations and caching tricks can shorten this time).

A Hadoopy job must be started from within a Python program; there is no built-in command-line tool.

I wrote a script that launches Hadoopy via launch_frozen:

python launch_hadoopy.py
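The launcher script itself is only a few lines. Conceptually it looks like the sketch below (mine, not the article's; it assumes Hadoopy's documented launch_frozen(hdfs_input, hdfs_output, script) signature, and the job script name is hypothetical):

# launch_hadoopy.py -- sketch of a Hadoopy launcher (illustration only)
import hadoopy

if __name__ == "__main__":
    # launch_frozen freezes the job script together with its Python
    # dependencies and ships them to the cluster, so nothing needs to be
    # preinstalled on the worker nodes
    hadoopy.launch_frozen("/ngrams", "/output-hadoopy", "ngrams_hadoopy.py")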

After running it with launch_frozen, I installed Hadoopy on every node and ran it again with the launch method; the performance was noticeably better.


Pydoop

In contrast to the other frameworks, Pydoop wraps Hadoop Pipes, the C++ API to Hadoop. Because of this, the project claims to provide a richer interface to Hadoop and HDFS as well as better performance. I did not verify this. However, one advantage is that Python can be used to implement a Partitioner, RecordReader, and RecordWriter. All input and output must be strings.

Most importantly, I could not successfully build Pydoop either from pip or from source.

Others

    • Happy is a framework for writing Hadoop jobs with Jython, but it seems to be dead.
    • Disco is a mature, non-Hadoop MapReduce implementation. Its core is written in Erlang, with a Python API. It is developed by Nokia and is not as widely used as Hadoop.
    • Octopy is a pure-Python MapReduce implementation with only a single source file; it is not suitable for "real" computation.
    • Mortar is another Python option released not long ago; it lets users submit Apache Pig or Python jobs through a web application to process data stored on Amazon S3.
    • There are some higher-level interfaces to the Hadoop ecosystem, like Apache Hive and Pig. Pig allows users to write custom functions in Python, which are run through Jython. Hive also has a Python wrapper called hipy.
    • (Added 2013) Luigi is a Python framework for managing multi-step job flows. It is a bit like Apache Oozie, but it has built-in, lightweight wrapping of Hadoop Streaming. A nice feature of Luigi is that it surfaces the Python error traceback when a job fails, and its command-line interface is great. Its README contains a lot of material but lacks a detailed reference document. Luigi is developed by Spotify and is widely used internally there.


Native Java

Finally, I implemented the MapReduce job using the new Hadoop Java API and ran it after compiling:

hadoop jar /root/ngrams/native/target/ngramscomparison-0.0.1-snapshot.jar ngramsdriver hdfs:///ngrams hdfs:///output-native


Special notes on counters

In my first implementation of the MapReduce jobs, I used counters to track and monitor bad records. In Streaming, that information has to be written to stderr. This turned out to add significant overhead: the Streaming job took 3.4 times as long as the native Java job. The frameworks have the same problem.
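For reference, counter updates in a Streaming job are plain lines written to stderr in the reporter format; a small helper (my sketch, using the standard Streaming counter syntax) looks like this:

import sys

def increment_counter(group, counter, amount=1):
    # Hadoop Streaming picks up counter updates written to stderr in the
    # form: reporter:counter:<group>,<counter>,<amount>
    sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))

# e.g. inside the mapper's error-checking branches:
increment_counter("ngrams", "bad_records")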


Performance comparison

The MapReduce job implemented in Java is used as the performance baseline. The values reported for the Python frameworks are the ratios of their runtimes relative to Java's.

Java is obviously the fastest; Streaming takes roughly 50% more time, and the Python frameworks take even longer. Profiling the Mrjob mapper shows that it spends a great deal of time on serialization/deserialization. Dumbo and Hadoopy do better in this respect. Dumbo could be faster if combiners were used.

Feature comparison

Most of this information comes from the documentation and code repositories of the respective packages.

Conclusion

Streaming is the fastest Python solution, with no magic under the hood. But be particularly careful when implementing the reducer logic with it, and when working with more complex objects.

All of the Python frameworks look like pseudocode, which is great.

Mrjob is updated quickly, mature, and easy to use. It makes it easy to organize multi-step MapReduce workflows and to work with complex objects. It also allows seamless use of EMR. But it is also the slowest performer.

There are a few other, less popular Python frameworks whose main advantage is built-in support for binary formats, but if necessary this can be implemented in user code alone.

My current recommendations:

    • Hadoop Streaming is the best choice in most cases; it is easy to use, as long as you are careful with the reducer.
    • If the computational cost is acceptable, choose Mrjob, because it integrates best with Amazon EMR.
    • If the application is more complex, with composite keys and multi-step workflows, Dumbo is the most appropriate. It is slower than Streaming, but faster than Mrjob.

If you have your own experience in practice, or if you find a mistake in this article, please leave a reply.
