Using MapReduce and load balancing in the cloud

Source: Internet
Author: User
Keywords: MapReduce

Cloud computing is designed to provide on-demand resources or services over the Internet, usually at the scale and with the reliability of a data center. MapReduce is a programming model designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. It is a style of parallel programming supported by some on-demand clouds (such as Google's BigTable, Hadoop, and Sector).

In this article, you will use a load balancing algorithm that follows the randomized hydrodynamic load balancing technique, which is described in more detail below. Virtualization is used to reduce the number of physical servers and the associated cost; more importantly, it is used to achieve efficient CPU utilization on the physical machines.

To get the most out of this article, you should have a general idea of cloud computing concepts, the randomized hydrodynamic load balancing technique, and the Hadoop MapReduce programming model. A basic understanding of concurrent programming is best, and programming knowledge of Java™ or another object-oriented language is helpful.

To implement the MapReduce algorithm in this article, the system should be equipped with the following software:

1. Hadoop 0.20.1.

2. Eclipse IDE 3.0 or later (or Rational Application Developer 7.1).

3. Ubuntu 8.2 or later.

Before we dive into the MapReduce algorithm, we'll cover the basics of cloud architecture, load balancing, MapReduce, and parallel programming, at least as far as this article needs them.

Cloud Architecture: Basic Content

Figure 1 shows the complete system in detail, including the platforms, the software, and how they are used to achieve the goal of this article.

Figure 1. Cloud architecture

As you can see, we use Ubuntu 9.04 and 8.2 as the operating systems; the platform consists of Hadoop 0.20.1, Eclipse 3.3.1, and Sun Java 6; the programming language is Java; and HTML, JSP, and XML are used as scripting languages.

The cloud architecture has a master node and several slave nodes. In this implementation, a main server is maintained that receives client requests and handles each one according to its type.

As you can see in Figure 2, a search request is forwarded to the Hadoop NameNode. The NameNode then takes care of the search and indexing operations by starting a large number of Map and Reduce processes. Once the MapReduce operation for a particular search keyword has completed, the NameNode returns the output to the server, which delivers it to the client.

Figure 2. The Map and Reduce functions perform search and indexing

If the request is for a particular piece of software, verification steps are carried out based on the customer's tenant ID, the fee paid, the eligibility to use that software, and the lease period of the software. The server then serves the request and allows the user to use the requested combination of software.

This provides the multitenancy feature of SaaS, in which a single instance of the software serves multiple tenants. In this way, different instances are generated from the same set of software images, depending on the tenant ID.
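The verification step described above might look roughly like the following Python sketch. The `Lease` record, its fields, and the `may_use` function are hypothetical names introduced only for illustration; the article does not specify how the server stores or checks this information.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Lease:
    """Hypothetical record of one tenant's right to use one software package."""
    tenant_id: str
    software: str
    fee_paid: bool
    start: date
    end: date


def may_use(leases, tenant_id, software, today):
    """Return True if the tenant holds a paid lease that is valid on `today`."""
    for lease in leases:
        if (lease.tenant_id == tenant_id
                and lease.software == software
                and lease.fee_paid
                and lease.start <= today <= lease.end):
            return True
    return False


# Illustrative data only.
leases = [
    Lease("tenant-a", "editor", True, date(2010, 1, 1), date(2010, 12, 31)),
    Lease("tenant-b", "editor", False, date(2010, 1, 1), date(2010, 12, 31)),
]

print(may_use(leases, "tenant-a", "editor", date(2010, 6, 1)))
print(may_use(leases, "tenant-b", "editor", date(2010, 6, 1)))
```

In this sketch, a request is served only when every check passes: tenant ID, payment, eligibility for the software, and lease period.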

These services mean that the client uses the platform (Hadoop, Eclipse, and the operating system) when searching for files or using a particular piece of software. In addition, to store its data (databases or files), the client has to take up some storage space in the cloud's data center (IaaS). All of this is transparent to end users.

Randomized Hydrodynamic Load Balancing: Basic Content

Load balancing is used to make sure that none of your existing resources sit idle while others are in use. To balance the load distribution, you can migrate load from source nodes (which have surplus workload) to comparatively lightly loaded target nodes.

When load balancing is applied at run time, it is called dynamic load balancing. Depending on how the target node is selected, it can be implemented either directly or iteratively:

1. The iterative approach determines the final target node through several iterative steps.

2. The direct method selects the final target node in one step.

This article uses the randomized hydrodynamic load balancing method, a hybrid method that takes advantage of both the direct and the iterative methods.
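To make the distinction concrete, here is a minimal Python sketch of the two selection styles. It is not the randomized hydrodynamic algorithm itself. The function names, the chain topology, and the load numbers are assumptions chosen for illustration: the direct variant assumes global load information is available, and the iterative variant assumes each step only sees the current node's neighbors.

```python
def direct_select(loads):
    """Direct method: choose the final target in one step from global load data."""
    return min(loads, key=loads.get)


def iterative_select(loads, neighbors, start):
    """Iterative method: hop to the lightest neighbor until none is lighter."""
    current = start
    while True:
        best = min(neighbors[current], key=loads.get, default=current)
        if loads[best] >= loads[current]:
            return current  # no lighter neighbor, so this is the final target
        current = best


# Illustrative loads (percent) and a simple chain n1 - n2 - n3 - n4.
loads = {"n1": 90, "n2": 40, "n3": 70, "n4": 10}
neighbors = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2", "n4"], "n4": ["n3"]}

print(direct_select(loads))
print(iterative_select(loads, neighbors, "n1"))
```

With these numbers the direct method picks `n4`, the globally lightest node, while the iterative method starting at `n1` stops at `n2`, because it only ever sees local neighbors. A hybrid method tries to combine the cheap local view of the second with the wider reach of the first.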

MapReduce: Basic Content

MapReduce programs are designed to compute large volumes of data in parallel. This requires the workload to be divided among a large number of machines. Hadoop provides a systematic way to implement this programming paradigm.

The computation takes a set of input key/value pairs and produces a set of output key/value pairs. It involves two basic operations: Map and Reduce.

The user-written Map operation takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key. It merges these values to form a possibly smaller set of values. Typically, each Reduce invocation produces just zero or one output value. The intermediate values are supplied to the user's Reduce function through an iterator (an object that lets the programmer traverse all the elements of a collection regardless of its specific implementation). This makes it possible to handle lists of values that are too large to fit in memory.

Take the WordCount problem as an example: counting the number of occurrences of each word in a large set of files. The Mapper and Reducer functions are shown in Listing 1.

Listing 1. Map and Reduce for solving the WordCount problem

    mapper (filename, file-contents):
        for each word in file-contents:
            emit (word, 1)

    reducer (word, values):
        sum = 0
        for each value in values:
            sum = sum + value
        emit (word, sum)

The Map function emits each word together with an associated count of occurrences (1 in this example). The Reduce function sums all the counts emitted for a particular word. This basic functionality, when built over a cluster, can easily turn into a high-speed parallel processing system.
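The pseudocode in Listing 1 can be tried out on a single machine with the following Python sketch. It is only a local simulation under stated assumptions: the `run_job` driver and the in-memory grouping stand in for what the MapReduce library does across a cluster, and none of these names come from Hadoop's API.

```python
from collections import defaultdict


def mapper(filename, contents):
    """Emit (word, 1) for every word in the file contents."""
    for word in contents.split():
        yield word, 1


def reducer(word, values):
    """Sum the counts for one word; `values` is consumed as an iterator."""
    total = 0
    for value in values:
        total += value
    yield word, total


def run_job(files):
    """Hypothetical driver: map every file, group by key, then reduce each key."""
    groups = defaultdict(list)
    for name, contents in files.items():
        for key, value in mapper(name, contents):
            groups[key].append(value)
    result = {}
    for key in sorted(groups):
        for word, total in reducer(key, iter(groups[key])):
            result[word] = total
    return result


files = {"a.txt": "the cat sat", "b.txt": "the dog sat down"}
counts = run_job(files)
print(counts)
```

The reducer only ever iterates over its values once, which mirrors the iterator-based interface described earlier and is what lets a real framework stream value lists that do not fit in memory.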

Computation on large volumes of data has been done before, usually in a distributed setting. What sets Hadoop apart is its simple programming model, which lets users quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, which in turn exploits the underlying parallelism of the CPU cores.

Let's recap. As discussed earlier, a Hadoop cluster contains the following nodes:

1. NameNode (the cloud master node).

2. DataNodes (the slave nodes).

The nodes in the cluster have preloaded local input files. When the MapReduce process is started, the NameNode uses the JobTracker process to assign the tasks that the DataNodes must carry out through their TaskTracker processes. Several Map processes run on each DataNode, and their intermediate results are handed to the combiner process, which produces, for example, the word counts for the files on one machine (in the WordCount problem). The values are then shuffled and sent to the Reduce processes, which generate the final output for the problem at hand.
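The flow just described (map and combine on each DataNode, then shuffle, then reduce) can be sketched in Python as follows. This is a single-process simulation, not Hadoop code: the function names, the byte-sum partitioner, and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def map_and_combine(local_files):
    """Runs per DataNode: map each local file, then combine into per-node counts."""
    combined = Counter()
    for contents in local_files:
        for word in contents.split():
            combined[word] += 1
    return combined


def shuffle(per_node_counts, num_reducers):
    """Route each word's partial counts to one reducer partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for counts in per_node_counts:
        for word, count in counts.items():
            # Simple deterministic partitioner (an assumption, not Hadoop's).
            index = sum(word.encode("utf-8")) % num_reducers
            partitions[index][word].append(count)
    return partitions


def reduce_partition(partition):
    """Sum the partial counts each reducer received."""
    return {word: sum(counts) for word, counts in partition.items()}


# Illustrative cluster: each DataNode already holds its local input files.
cluster = {
    "datanode1": ["the cat sat", "the cat"],
    "datanode2": ["the dog sat"],
}
per_node = [map_and_combine(files) for files in cluster.values()]
partitions = shuffle(per_node, num_reducers=2)

final = {}
for partition in partitions:
    final.update(reduce_partition(partition))
print(final)
```

The combiner's benefit is visible in `per_node`: the first node sends a single pair `("the", 2)` across the network instead of two separate `("the", 1)` pairs.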

How to use load balancing

Load balancing helps spread the load evenly across the free nodes when a node is loaded above its threshold level. Although load balancing is not very significant in the execution of a MapReduce algorithm itself, it becomes essential when large files are being processed and when hardware resource utilization is critical. Its notable effect is to increase hardware utilization and improve performance when resources are constrained.

A module was implemented to balance disk space usage on a Hadoop Distributed File System (HDFS) cluster when some DataNodes become full or when new, empty nodes join the cluster. The balancer (the Balancer class tool) is started with a threshold value; this parameter is a fraction between 0 and 100%, with a default of 10%. It sets the target for whether the cluster counts as balanced: the smaller the threshold, the more balanced the cluster will be, and the longer the balancer takes to run. (Note: The threshold can be so small that the cluster can never reach a balanced state, because applications may be writing and deleting files at the same time.)

A cluster is considered balanced if, for each DataNode, the ratio of used space on the node to the node's total capacity (the node utilization) differs from the ratio of used space in the cluster to the cluster's total capacity (the cluster utilization) by no more than the threshold value.
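The balance criterion above translates directly into a short check. The Python sketch below assumes each node is given as a `(used, capacity)` pair in the same unit and that the threshold is expressed as a fraction (0.10 for the 10% default); the function name and sample figures are illustrative only.

```python
def is_balanced(nodes, threshold=0.10):
    """Return True if every node's utilization is within `threshold`
    of the cluster-wide utilization.

    nodes: list of (used, capacity) pairs, one per DataNode.
    """
    cluster_used = sum(used for used, _ in nodes)
    cluster_capacity = sum(capacity for _, capacity in nodes)
    cluster_utilization = cluster_used / cluster_capacity
    return all(
        abs(used / capacity - cluster_utilization) <= threshold
        for used, capacity in nodes
    )


# Illustrative figures (for example, gigabytes used and total per node).
skewed = [(80, 100), (50, 100), (20, 100)]
even = [(55, 100), (50, 100), (45, 100)]

print(is_balanced(skewed))
print(is_balanced(even))
```

Both sample clusters have a cluster utilization of 50%, but in the first one two nodes deviate by 30 percentage points, so only the second counts as balanced at the default threshold.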

The module moves data blocks from highly utilized DataNodes to poorly utilized ones iteratively. In each iteration, a node moves or receives no more than the threshold fraction of its capacity, and each iteration runs for no more than 20 minutes.

In this implementation, nodes are classified as highly utilized, average utilized, or underutilized. Depending on the utilization of each node, load is transferred between nodes to balance the cluster. The module works as follows:

1. First, it acquires the neighborhood details:

   1. When the load on a DataNode rises to the threshold level, the DataNode sends a request to the NameNode.

   2. The NameNode obtains the load levels of that DataNode's nearest neighbors.

   3. The NameNode compares the loads and then sends details of the most idle neighboring nodes to that DataNode.

2. Next, the DataNodes go to work:

   1. Each DataNode compares its own load with the sum of the loads of its nearest neighbors.

   2. If the DataNode's load is greater than that of its neighbors, target nodes for the load (direct neighbors as well as other nodes) are chosen at random.

   3. Load requests are then sent to the target nodes.

3. Finally, the requests are received:

   1. Every node maintains a buffer for incoming load requests.

   2. A message passing interface (MPI) manages this buffer.

   3. A main thread listens on the buffered queue and serves the requests it receives.

   4. The nodes enter the load balancing execution phase.
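The steps above can be sketched as a small single-process simulation in Python. Several things here are assumptions rather than details from the article: the classification cut-offs (30 and 70), the number of targets, the fixed amount of load per request, and the use of an in-memory `deque` in place of the MPI-managed buffer. The `loads` dictionary plays the role of the load information held by the NameNode.

```python
import random
from collections import deque


def classify(load, low=30, high=70):
    """Label a node by utilization (percent); the cut-offs are assumed values."""
    if load > high:
        return "high"
    if load < low:
        return "under"
    return "average"


def pick_targets(node, loads, neighbors, rng, k=2):
    """Step 2: if the node is heavier than its neighbors combined, pick
    random targets among direct neighbors and other nodes."""
    neighbor_sum = sum(loads[n] for n in neighbors[node])
    if loads[node] <= neighbor_sum:
        return []
    candidates = [n for n in loads if n != node]
    return rng.sample(candidates, min(k, len(candidates)))


def send_requests(node, targets, amount, inboxes):
    """Step 2.3: queue a load request in each target's buffer."""
    for target in targets:
        inboxes[target].append((node, amount))


def serve_requests(node, loads, inboxes):
    """Step 3: drain this node's buffer and take over the requested load."""
    while inboxes[node]:
        source, amount = inboxes[node].popleft()
        loads[source] -= amount
        loads[node] += amount


# Illustrative cluster state.
loads = {"dn1": 90, "dn2": 20, "dn3": 30, "dn4": 10}
neighbors = {"dn1": ["dn2", "dn3"], "dn2": ["dn1"],
             "dn3": ["dn1", "dn4"], "dn4": ["dn3"]}
inboxes = {name: deque() for name in loads}
rng = random.Random(0)  # seeded so the run is repeatable

targets = pick_targets("dn1", loads, neighbors, rng)
send_requests("dn1", targets, 10, inboxes)
for name in loads:
    serve_requests(name, loads, inboxes)
print(classify(loads["dn1"]), loads)
```

In this run `dn1` starts out heavier than its two neighbors combined, so it hands 10 units of load to each of two randomly chosen nodes. The total load in the cluster stays the same; only its distribution changes.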

Evaluating performance

Different sets of input files, each of a different size, were provided, and the MapReduce tasks were executed on both a single-node and a two-node cluster. From the measured execution times, we can conclude that running MapReduce on a cluster is by far the more efficient way to handle a large volume of input files.

Figure 3 illustrates the performance results from our runs on each node configuration.

Figure 3. MapReduce with load balancing is more effective in clusters

Conclusion

Our experiments with Hadoop MapReduce and load balancing lead to two conclusions:

1. In a cloud environment, the MapReduce structure increases the efficiency of throughput for large data sets. In contrast, you will not necessarily see such an increase in throughput in a non-cloud system.

2. When the dataset is small, MapReduce and load balancing do not have a significant impact on the increase in cloud system throughput.

Therefore, when planning to process large amounts of data on a cloud system, consider the combination of MapReduce-style parallel processing and load balancing.