The traditional MapReduce framework was proposed by Google in 2004 in the paper "MapReduce: Simplified Data Processing on Large Clusters". The framework reduces data processing for data-intensive applications to two phases, map and reduce: users design distributed programs by implementing the two functions map() and reduce() (a word-count sketch of this two-phase model follows the list below), while details such as data partitioning, task scheduling, machine fault tolerance, and inter-machine communication are left to the MapReduce framework. As the technology has evolved, a number of MapReduce frameworks for specialized applications have been built on top of the traditional framework, mainly the following:
(1) Iterative MapReduce: Twister and HaLoop (see my blog post introducing iterative MapReduce frameworks).
(2) Multi-stage streaming computation: Sector/Sphere (see my blog post on the streaming MapReduce implementation Sector/Sphere).
(3) DAG (directed acyclic graph) computation: Dryad and Cascading (see the articles on Dryad and Cascading, and Cascading's homepage).
(4) Products that combine MapReduce with databases: HadoopDB and Greenplum.
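To make the two-phase model concrete, here is a minimal, framework-agnostic word-count sketch in Python: map() emits a (word, 1) pair for every word, and reduce() sums the counts for each word. The run_local() driver below is purely illustrative; it stands in for the shuffle and scheduling work that a real MapReduce framework does for you.

from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: sum all counts emitted for the same word.
    return word, sum(counts)

def run_local(lines):
    # Stand-in for the framework's shuffle step: sort the intermediate
    # pairs by key, group them, and hand each group to reduce_fn.
    intermediate = [kv for line in lines for kv in map_fn(None, line)]
    intermediate.sort(key=itemgetter(0))
    return [reduce_fn(k, (v for _, v in group))
            for k, group in groupby(intermediate, key=itemgetter(0))]

if __name__ == "__main__":
    print(run_local(["a rose is a rose", "is a rose"]))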
This article focuses on the best-known open-source implementations of the traditional MapReduce model. There are now many open-source MapReduce frameworks; the most famous is Hadoop, the Java implementation. Hadoop is a heavyweight implementation (a large amount of code), and understanding its details or improving it takes a lot of work. To overcome the drawbacks of heavyweight implementations, several lightweight versions have emerged, such as the Erlang implementation Disco, the Python implementation Mincemeat, and the Bash implementation bashreduce.
This article mainly introduces Disco, with brief descriptions of Mincemeat and bashreduce.
Disco: a traditional MapReduce implementation
1. Overview
Disco is a lightweight MapReduce framework. Its core modules are implemented in Erlang, and it exposes a Python interface for easy programming. Like Hadoop, it has its own distributed file system, DDFS, but DDFS is tightly coupled to the computation framework. Disco was developed by the Nokia Research Center to process large-scale data in real applications.
2. Disco's overall design architecture
Disco consists of the distributed storage system DDFS (Disco Distributed File System) and a MapReduce computation framework. This section introduces the overall design architecture of Disco; the next section describes DDFS.
Disco also uses a master/slave architecture:
The Disco master receives jobs from clients and adds them to the job queue for scheduling.
Clients are Python programs that submit jobs to the master by calling the function disco.job() (see the sketch after this list).
A worker supervisor is started by the master on each node to monitor the Python workers running on that node.
Python workers execute the jobs submitted by users.
Input files are fetched over HTTP, except that files already available locally are read directly from disk. To allow data to be fetched from remote nodes, an httpd background process runs on each node.
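As a concrete illustration of this client/master interaction, below is a word-count sketch in the style of the example in Disco's documentation. It is a sketch under assumptions: it presumes a running Disco master, and the Job/result_iterator API shown follows newer Disco releases, so module paths and signatures may differ from the disco.job() call mentioned above in older versions.

# Word-count client sketch in the style of Disco's documented example.
# Assumes a running Disco master; API details may vary between versions.
from disco.core import Job, result_iterator

def map(line, params):
    # Runs inside a Python worker: emit (word, 1) for each word.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # Runs inside a Python worker: sum the counts per word.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    # The client submits the job to the master, which schedules it
    # onto the Python workers via the per-node worker supervisors.
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)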
3. The DDFS structure
DDFS is embedded in Disco and has only one master node (a single point of failure). Each storage node consists of a set of volumes or disks (vol0 ... volN), mounted respectively at $DDFS_ROOT/vol0 ... $DDFS_ROOT/volN. Under each volume are two directories, tag and blob, which store the tags (a tag is roughly a key) and the blobs (the values the tags point to). DDFS monitors disk usage on each node and rebalances the load at intervals.
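For a feel of how data is tagged and stored, here is a short sketch of using DDFS from Python. The disco.ddfs module and the DDFS.push()/DDFS.blobs() calls shown here are an assumption based on Disco's documentation and may differ between versions; treat this as illustrative rather than definitive.

# Hedged sketch of storing and listing data in DDFS from Python.
from disco.ddfs import DDFS

ddfs = DDFS()                               # talks to the (single) DDFS master
ddfs.push("data:words", ["./words.txt"])    # store a blob under the tag data:words
for blob in ddfs.blobs("data:words"):       # list replica URLs for each blob under the tag
    print(blob)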
4. Distributed Index Discodex
Discodex is a distributed indexing system designed specifically for Disco.
Discodex is essentially a distributed key/value store: given a key, it can retrieve all of the values associated with that key. Discodex exposes a set of REST APIs through which users can retrieve data.
Running Discodex actually means running an HTTP server that maps RESTful URLs onto Disco jobs. Discodex stores the keys and values on DDFS; each such distributed file is called an index and is made up of chunks (ichunks).
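To illustrate the RESTful style of access described above, here is a hypothetical lookup over HTTP. The host, port, and URL pattern below are assumptions used only to show the shape of such a query; consult the Discodex documentation for the real paths.

# Hypothetical illustration of querying Discodex over its REST API.
import json
from urllib.request import urlopen

DISCODEX_URL = "http://localhost:8080"   # assumed Discodex HTTP server address
index, key = "myindex", "some-key"       # hypothetical index and key names

with urlopen("%s/indices/%s/values/%s" % (DISCODEX_URL, index, key)) as resp:
    # Each lookup is served by mapping the URL onto a Disco job over DDFS.
    print(json.load(resp))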
5. Reference materials
(1) Official website: http://discoproject.org/
(2) Installation method: http://blog.csdn.net/socrates/archive/2009/05/26/4217641.aspx
Mincemeat: a traditional MapReduce implementation
1. Introduction
Mincemeat is a Python implementation of MapReduce. The entire implementation is a single Python file (under 13 KB) that depends only on the Python standard library, is easy to deploy, and supports the following features (a usage sketch follows this list):
(1) Fault tolerance: any slave can join or leave the cluster at any time without affecting the other slaves.
(2) Security: Mincemeat authenticates each connection to prevent unauthorized code from being executed.
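Below is a minimal word-count sketch in the style of the example on the Mincemeat homepage: the server holds the data source and the map/reduce functions, and any machine can join as a slave by running mincemeat.py with the shared password. The API shown follows the published example but should be checked against the version you install.

# Word-count sketch in the style of the Mincemeat homepage example.
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall"]

def mapfn(key, value):
    # Executed on the slaves: emit (word, 1) for every word.
    for word in value.split():
        yield word, 1

def reducefn(key, values):
    # Executed on the slaves: sum the counts for one word.
    return sum(values)

server = mincemeat.Server()
server.datasource = dict(enumerate(data))   # any dict-like object works
server.mapfn = mapfn
server.reducefn = reducefn

# Slaves authenticate with this password before any code is sent to them,
# then may join or leave at any time; start one with:
#   python mincemeat.py -p changeme <server-host>
results = server.run_server(password="changeme")
print(results)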
2. Reference materials
Official website: http://remembersaurus.com/mincemeatpy/
bashreduce: a traditional MapReduce implementation
1. Introduction
bashreduce is written in the Bash scripting language and combines common shell commands such as sort, awk, ssh, and netcat. So far it has been tested only on Ubuntu/Debian systems.
2. Reference materials
(1) https://github.com/erikfrey/bashreduce
In addition, there are a Perl implementation, RobotArmy, and a Ruby implementation, Skynet.
Resources:
(1) RobotArmy official homepage: http://bulletsweetp.github.com/robotarmy/
(2) RobotArmy paper: "RobotArmy: A Casual Framework for Massive Distributed Processing"
(3) Skynet official homepage: http://skynet.rubyforge.org/
This is an original article; when reposting, please credit: reprinted from Dong's Blog.
Link to this article: http://dongxicheng.org/mapreduce/traditional-mapreduce-framework/