How to choose the best Elastic MapReduce framework for Hadoop
Python frameworks for Hadoop are useful when you develop EMR jobs. The mrjob, Dumbo, and Pydoop frameworks can all drive Elastic MapReduce and help users avoid unnecessary, cumbersome Java development. When you need more access to Hadoop internals, consider Dumbo or Pydoop. This article comes from TechTarget.
The following is the original text:
Amazon Web Services' Elastic MapReduce is a Hadoop-based implementation that lets you run large preprocessing jobs such as format conversion and data aggregation. Although you can choose from many programming languages to code these jobs, developers short on time need a programming framework that minimizes coding overhead. mrjob, Dumbo, and Pydoop are three Python-based Elastic MapReduce frameworks that meet this requirement.
So why aren't popular tools such as Java or Apache Pig up to this task? Amazon Elastic MapReduce (EMR) jobs are typically written in Java, but even simple applications may require more lines of code than the equivalent script in Python. Pig is a high-level data processing language designed for loading and transforming data, but it is not a general-purpose programming language.
Developers who prefer a higher-level language than Java, but need more flexibility than Pig's data management features provide, should give Python a try. Currently there are three Python-based EMR frameworks to choose from: mrjob, Dumbo, and Pydoop.
The mrjob open source package
mrjob is an open source package that runs jobs on Amazon EMR or on your local machine. An Elastic MapReduce job is defined in a single Python class containing methods for the mapper, reducer, and combiner, as the sketch below shows.
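For illustration, here is a minimal sketch of such a class, following the standard word-count pattern from mrjob's documentation (the class name and file layout are arbitrary):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    """Count how often each word appears in the input text."""

    def mapper(self, _, line):
        # emit (word, 1) for every word on an input line
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        # pre-aggregate on the map side to cut shuffle traffic
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```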
mrjob's abstractions hide most of Hadoop's lower-level details, which is usually a benefit: the simplified model lets developers concentrate on the logical design of their map-reduce functionality. The trade-off is that you are limited to a subset of the Hadoop API. If you need access to more of it, Dumbo or Pydoop may be a better choice.
An important advantage of mrjob is that it does not require a Hadoop installation. Developers can write, test, and debug Elastic MapReduce programs on a single machine with nothing more than Python and mrjob. Once a program is ready, it can be migrated to EMR, and the same code runs correctly on the Hadoop cluster without modification. mrjob is used in production at Yelp, the review site that hosts 57 million reviews and serves more than 130 million visitors a month, so it can meet the needs of many Hadoop users.
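As a usage sketch (the bucket name is hypothetical, and mrjob must be configured with your AWS credentials): if the word-count class above is saved as word_count.py, it can be run locally with `python word_count.py input.txt`, then submitted to Amazon by switching to mrjob's EMR runner with `python word_count.py -r emr s3://mybucket/input/`, with no change to the job code itself.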
Working with Dumbo
Dumbo is another Python framework that supports EMR. Like mrjob, it lets you write mapper and reducer functions to perform Elastic MapReduce jobs. Beyond the basic functionality found in mrjob, Dumbo offers additional job-handling options: its job class allows developers to define multiple sets of map-reduce operations that run with a single command, which is useful when a dataset needs several passes of processing; the sketch below shows the pattern.
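A minimal sketch of this multi-iteration pattern, following the word-count style from Dumbo's tutorial (the second iteration is purely illustrative):

```python
def mapper(key, value):
    # first pass: emit (word, 1) for every word on an input line
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def swap_mapper(key, value):
    # second pass: re-key each word by its count
    yield value, key

def identity_reducer(key, values):
    for value in values:
        yield key, value

def runner(job):
    # two map-reduce iterations, launched by a single dumbo command
    job.additer(mapper, reducer, combiner=reducer)
    job.additer(swap_mapper, identity_reducer)

if __name__ == "__main__":
    import dumbo
    dumbo.main(runner)
```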
Dumbo supports text and sequence file formats, and it also supports user-defined formats through user-specified Java classes. On the downside, Dumbo has relatively little documentation, especially compared with mrjob's.
Research package access using Pydoop
Python developers who need access to third-party libraries and packages may want to consider Pydoop. The framework is developed and maintained by CRS4, an interdisciplinary research center, and one advantage of that pedigree is ready access to popular Python scientific packages such as SciPy.
Pydoop follows the MapReduce paradigm, so developing core components in this framework is similar to development in mrjob and Dumbo. It also lets you implement combiners, which are like reducers except that they run locally on the map side; they can reduce the amount of data transferred between the map and reduce operations.
With Pydoop, developers can also control Hadoop parameters at the command line when a job is launched. Hadoop uses a plain-text file format by default, but users can handle other formats by specifying a custom RecordReader class, which includes methods such as initialize, next, close, and getProgress.
Pydoop also supports the Hadoop file system API, which lets you connect to an HDFS installation, read and write files, and retrieve metadata on files, directories, and the file system; the sketch below shows this. When you need lower-level access to the file system, Pydoop can do the job because its feature set mirrors the HDFS API.
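As a minimal sketch of that file system access (the paths are hypothetical, and a reachable HDFS installation with Pydoop installed is assumed), the pydoop.hdfs module reads and writes HDFS much like a local file system:

```python
import pydoop.hdfs as hdfs

# create a directory and write a small file into HDFS
hdfs.mkdir("/user/demo")
hdfs.dump("hello hadoop\n", "/user/demo/sample.txt")

# read the file back, then list the directory contents
print(hdfs.load("/user/demo/sample.txt"))
print(hdfs.ls("/user/demo"))
```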
While the mrjob, Dumbo, and Pydoop frameworks all offer real benefits, they add runtime overhead, so your job may take longer than it would in Java or through Hadoop Streaming. If keeping EMR costs low is a key consideration, comparing the same MapReduce job developed with plain Python streaming and with each framework will tell you how much extra run time a framework adds.
Python frameworks for Hadoop are useful when you develop EMR jobs. These three frameworks can all drive Elastic MapReduce and help users avoid unnecessary, cumbersome Java development. When you need more access to Hadoop internals, consider Dumbo or Pydoop.