Data allocation method for Hadoop cluster in cloud computing environment


Introduction

Cloud computing is a new business computing model. It distributes computing tasks over a resource pool composed of a large number of computers, enabling various application systems to obtain computing power, storage space and software services on demand. Cloud computing is the product of the development and integration of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization and load balancing.

Its data storage is implemented with distributed storage, which ensures high reliability, high availability and economy. The high reliability of data storage is guaranteed by redundant storage, using reliable software to compensate for unreliable hardware, thus providing low-cost, reliable mass storage and computing services. In addition, the data storage technology must offer high throughput and high transfer rates so that the cloud computing system can serve a large number of users in parallel. The most notable cloud data storage systems are Google's non-open-source GFS (Google File System) and the open-source HDFS (Hadoop Distributed File System) developed by the Hadoop team.

1. MapReduce Programming Model

Parallel computing is the core technology of cloud computing and one of its most challenging technologies. MapReduce is Google's core computing model; its name derives from two core operations in the functional programming model: map and reduce. The map operation is applied independently to each element and has no side effects. The reduce operation aggregates the results of the n map operations, i.e., map[1, ..., n] is the parameter of the reduce operation. In an imperative language the order of evaluation is fixed, and each function may change or depend on external state, so functions must be executed in order. In the MapReduce programming model, as long as no function modifies or depends on global variables, the n map operations can execute in any order, which makes the MapReduce model well suited to parallel processing of large-scale data.
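
As a minimal illustration of this property (plain Java, not the Hadoop API; the word-counting task is only an example), the sketch below applies a side-effect-free map to each element in parallel and then combines the results with an associative reduce, so the order in which the map calls execute does not affect the result:

```java
import java.util.List;

public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("a b a", "b c", "a");

        // map: applied independently to each element, with no side effects
        // reduce: an associative aggregation, so the evaluation order of the
        // map results does not change the final value
        long totalWords = lines.parallelStream()
                .map(line -> (long) line.split("\\s+").length) // map phase
                .reduce(0L, Long::sum);                        // reduce phase

        System.out.println("total words = " + totalWords);     // prints 6
    }
}
```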

In the MapReduce computing model there are two key phases: the mapping phase (map) and the aggregation phase (reduce). The user is therefore required to provide two key functions, a map function and a reduce function, which compute a set of intermediate key/value pairs from a set of input key/value pairs and produce another set of output key/value pairs, namely:

Map: (in_key, in_value) → {(key_j, value_j) | j = 1, ..., k}

Reduce: (key, [value_1, ..., value_m]) → (key, final_value)
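
For example (a standard word-count illustration, not taken from the original text), the map function could take a document name and its contents and emit the pair (word, 1) for every word, while the reduce function sums the values grouped under each word:

Map: (document_name, document_contents) → {(word_j, 1) | j = 1, ..., k}

Reduce: (word, [1, 1, ..., 1]) → (word, number_of_occurrences)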

In different applications the input and output parameters of map and reduce differ. The map input parameters in_key and in_value specify what data the map function processes. Each map call computes and outputs a set of key/value pairs, which are the intermediate results returned when the map task completes. Before the reduce tasks run, the system examines the intermediate results returned by the map tasks, groups them by key, merges the values corresponding to the same key and sends them to the same reduce task; this is why the reduce input parameter is (key, [value_1, ..., value_m]). The reduce task processes the list of values associated with one key and outputs (key, final_value) when it completes. Each key corresponds to one reduce task, and the results of all reduce executions are combined to form the final output.
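
The word-count computation makes these parameters concrete. The following is a minimal sketch of a map function and a reduce function written against the Hadoop MapReduce Java API (class and package names as in current Hadoop releases; the word-count task itself is an illustration, not taken from the original text): the map emits an intermediate (word, 1) pair for every word, and the reduce sums the values grouped under each key.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: (in_key, in_value) -> {(key_j, value_j)}
    // here: (byte offset, line of text) -> {(word, 1)}
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit intermediate (word, 1)
            }
        }
    }

    // reduce: (key, [value_1, ..., value_m]) -> (key, final_value)
    // here: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();            // merge all values for this key
            }
            result.set(sum);
            context.write(key, result);      // emit (word, final_value)
        }
    }
}
```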

The typical MapReduce calculation process is shown in Figure 1.

Figure 1 MapReduce Workflow

(4) The MapReduce library passes all intermediate values that share the same intermediate key I to the reduce function;

(5) The user-defined reduce function accepts an intermediate key I and its set of related values, and merges these values into a possibly smaller collection of values. Normally each reduce call produces only 0 or 1 output value. The intermediate values are supplied to the reduce function through an iterator, which makes it possible to process collections of values too large to fit entirely in memory.
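
To show where the user-defined functions from steps (4) and (5) plug in, the driver below (again a sketch with the Hadoop Java API; TokenizerMapper and IntSumReducer refer to the word-count example above, and the input/output paths are hypothetical) registers the map and reduce functions with a job and lets the framework perform the grouping of intermediate values by key:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);  // user-defined map
        job.setReducerClass(WordCount.IntSumReducer.class);   // user-defined reduce
        job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local pre-aggregation

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```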

2. Working mechanism of the Hadoop framework

Hadoop is an open-source project of the Apache Software Foundation that provides reliable, scalable software for distributed computing environments. The Hadoop platform has its own Distributed File System (HDFS) and implements computation with the MapReduce model. Hadoop stores data as file backups, keeping several copies of each piece of data, which gives it high security and reliability. As an open-source distributed platform, Hadoop has the advantages of rapid updates and wide application, as well as advantages shared with other distributed cloud computing frameworks: high scalability, economy, high efficiency and high reliability.

HDFS uses a master/slave architecture: an HDFS cluster consists of a named node (NameNode) and a set of data nodes (DataNodes). The named node is the central server responsible for managing the file system namespace (namespace) and client access to files. A data node usually runs on each node of the cluster; it manages the data stored on that node, handles read and write requests from file system clients, and creates, deletes and replicates data blocks under the unified scheduling of the named node. Hadoop also implements Google's MapReduce distributed computing model: MapReduce decomposes an application into many subtasks, and each subtask can be processed in parallel on any cluster node (a data node, which usually also serves as a compute node). HDFS creates multiple copies (replicas) of each data block (block) to ensure the reliability of the computation on each subtask node. Because it uses a distributed file system and the MapReduce model, the Hadoop framework has high fault tolerance and high data read/write throughput, and can automatically handle failed nodes. Figure 2 is a schematic diagram of the Hadoop cluster system architecture.
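
As a brief illustration of block replication (a sketch using the HDFS Java client API; the file path, replication factor and block size are arbitrary example values), a client can specify how many replicas of a file's blocks HDFS should keep when the file is created:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);           // client talks to the named node

        Path file = new Path("/user/demo/sample.txt");  // hypothetical path
        short replication = 3;                          // number of block replicas to keep
        long blockSize = 128L * 1024 * 1024;            // 128 MB blocks
        int bufferSize = 4096;

        try (FSDataOutputStream out =
                     fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("hello hdfs");                 // data is written to data nodes
        }
        fs.close();
    }
}
```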

As shown in Figure 2, HDFS consists of one named node and multiple data nodes. The named node stores the file system metadata. It acts as the manager of the file system, maintaining the file system namespace, regulating client access to files and providing operations on files and directories. The data nodes store the actual data; they manage the storage space of their nodes and serve read and write requests from clients. Data nodes also execute block creation, deletion and replication commands from the named node.
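
This division of labor can also be observed from a client: the named node answers metadata queries such as where the blocks of a file are located, while the block data itself is read from the data nodes. The sketch below uses the HDFS Java API with a hypothetical file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));

        // The named node returns the metadata: block offsets, lengths and the
        // data nodes holding each replica; block contents are then read
        // directly from those data nodes.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```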
