Understanding Hadoop clusters and the network

Reproduced from http://www.csdn.net/article/2012-08-30/2809380-understanding-hadoop-clusters-network, for backup only.


Summary: This article focuses on the architecture and methodology of a Hadoop cluster and how it relates to the network and server infrastructure, starting with the basics of how a Hadoop cluster operates.

Introduction: The network side of cloud computing and Hadoop is a relatively little-discussed area. This article was written by Dell enterprise technologist Brad Hedlund, who worked at Cisco for years specializing in data centers and cloud networks. The material is based on the author's own research and experiments, and on Cloudera training material.


Server roles in Hadoop

Hadoop's machines are deployed in three roles: client machines, master nodes, and slave nodes. The master nodes oversee the two key functional modules, HDFS and MapReduce: the JobTracker monitors and schedules the parallel processing of data with MapReduce, while the NameNode monitors and coordinates the HDFS storage. The slave nodes make up the vast majority of the machines and do all the dirty work of storing the data and running the computations. Each slave runs both a DataNode daemon and a TaskTracker daemon that communicate with the master nodes: the TaskTracker is subordinate to the JobTracker, and the DataNode to the NameNode.

The client machine has Hadoop installed with all of the cluster settings, but it is neither a master nor a slave. Instead, the client's role is to load data into the cluster, submit MapReduce jobs describing how that data should be processed, and retrieve or view the results when the jobs finish. In a small cluster (around 40 nodes), a single physical server may play multiple roles, such as running both the JobTracker and the NameNode. In medium and large clusters, each role generally gets its own dedicated server.

In real production clusters there is no server virtualization and no hypervisor layer, so no performance is wasted on overhead. Hadoop runs best on Linux, working directly with the underlying hardware. That said, Hadoop does run in a virtual machine, which is an unbeatable way to learn it cheaply, easily, and quickly.

Hadoop cluster

Above is the layout of a typical Hadoop cluster: a series of racks of rack servers (not blade servers) connected to top-of-rack switches, typically with 1GbE or 2GbE links. 10GbE is uncommon, but it becomes attractive as the density of CPU cores and disk drives per machine keeps rising. The rack switches uplink at uniform bandwidth to another tier of switches that connect all the racks together and form the cluster. Most of the servers will be slave nodes, with lots of local disk storage plus moderate CPU and DRAM. Some machines will be master nodes, with little local storage but faster CPUs and more DRAM.

Now let's look at how this all works:

Workflow of Hadoop

Why does Hadoop survive in such a fiercely competitive computer industry, and what problem does it actually solve? In short, businesses and governments have huge amounts of data that need to be analyzed and processed quickly. Chop that data into small pieces, spread the pieces across a large number of machines, and let the machines process the data in parallel: that is exactly what Hadoop does.

In the simple example below, we have a large data file containing emails sent to the customer service department, and we want a quick count of how many times the word "Refund" appears. This is a simple word-count exercise. The client loads the data into the cluster (File.txt), submits a job describing how to analyze the data (word count), the cluster stores the result in a new file (Results.txt), and the client then reads the results file.

Writing files to HDFS

A Hadoop cluster has no work to do until data is loaded into it, so let's start by loading our huge File.txt into the cluster. The primary goal, of course, is fast parallel processing of the data, and for that we need as many machines as possible working on it at once. To that end, the client breaks the file into smaller blocks and spreads those blocks across different machines throughout the cluster. The smaller the blocks, the more machines can work on the data in parallel. But these machines may also fail, so to avoid data loss a given block must live on more than one machine. Every block is therefore replicated as it is loaded into the cluster. Hadoop's default is to keep 3 copies of each block; the value is set by the dfs.replication parameter in the hdfs-site.xml file.
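For reference, that setting looks like this in hdfs-site.xml (3 is already the default, so it only needs to appear if you want a different value):

    <!-- hdfs-site.xml: number of copies HDFS keeps of each block -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>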

The client splits File.txt into 3 blocks. For each block, the client consults the NameNode (usually over TCP port 9000) and gets back a list of 3 DataNodes that should hold copies of that block. The client then writes the block directly to a DataNode (usually over TCP port 50010). The receiving DataNode replicates the block to the other DataNodes, and the cycle repeats for each remaining block. The NameNode is not in the data path; it only supplies the map of where data lives and where it should go in the cluster (the file system metadata).
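As a rough sketch of the client side in Java: the NameNode lookup, block splitting, and DataNode pipeline described above all happen inside Hadoop's FileSystem client, so application code only sees a simple copy call. This assumes a Hadoop 1.x-era cluster, and the host name "namenode" is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hadoop 1.x-era property; the NameNode usually listens on TCP 9000
            conf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);
            // Load the local File.txt into the cluster; HDFS splits it into
            // blocks and replicates each block (3 copies by default)
            fs.copyFromLocalFile(new Path("File.txt"), new Path("/File.txt"));
            fs.close();
        }
    }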

Rack awareness in Hadoop

Hadoop has the concept of "rack awareness". As a Hadoop administrator, you can manually define the rack each slave node in the cluster belongs to. Why go to the trouble? There are two key reasons: preventing data loss and improving network performance. Remember that each block is replicated to multiple machines precisely to prevent data loss. If all copies of a block sat on machines in the same rack, and that rack happened to fail, the data would certainly be lost. To prevent that, someone has to know where the DataNodes sit and place the copies sensibly across the cluster. That someone is the NameNode.

Two machines in the same rack generally enjoy more bandwidth and lower latency between them than two machines in different racks. This is true in most cases: a rack switch's uplink bandwidth is usually lower than its downlink bandwidth, and in-rack latency is usually lower than cross-rack latency (though not always). So if Hadoop could act on this notion of "rack awareness", cluster performance would clearly improve. And indeed it does. That's great, right?

Here comes the disappointing part: you have to define the rack mapping manually the first time, and keep maintaining it so the information stays accurate. It would be perfect if the rack switches could automatically feed the NameNode their list of attached DataNodes, or, conversely, if the DataNodes could tell the NameNode which rack switch they are connected to.
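In Hadoop of this era, that manual definition is usually supplied as an administrator-written topology script that maps each DataNode's address to a rack ID such as /rack1, wired in through core-site.xml. A minimal sketch, with a hypothetical script path:

    <!-- core-site.xml: point Hadoop at a script that maps a DataNode's
         IP address or host name to a rack ID like /rack1 -->
    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>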

In the emerging world of software-defined networking, it would be even more exciting if the NameNode could simply query an OpenFlow controller for the location of its nodes.

Preparing HDFS writes

Now that the client has split File.txt and is ready to load it into the cluster, it starts with Block A. The client sends a write request for File.txt to the NameNode, receives permission, and gets back a list of target DataNodes for each block. The NameNode uses its rack awareness data to shape these lists. The core rule is that of the 3 copies of each block, two sit in one rack and the third must be placed in a different rack, and every list handed to the client follows this rule.

Before the client writes "Block A" of File.txt to the cluster, it wants to know that all of the target DataNodes are ready. It takes the first DataNode in Block A's list, opens a connection on TCP port 50010, and tells it: get ready to receive a block, and here are the other two targets, DataNode 5 and DataNode 6; make sure they are ready too. DataNode 1 then relays the readiness check to DataNode 5, and DataNode 5 relays it to DataNode 6.

The acknowledgments flow back along the same TCP connections until DataNode 1 sends a "ready" message back to the client. Only at this point is the client actually ready to load the block into the cluster.

The HDFS write pipeline

As a block is written into the cluster, a replication pipeline forms among the 3 DataNodes (or however many the replication setting above specifies). As each DataNode receives data for the block, it simultaneously pushes a copy of that data to the next DataNode in the pipeline.

Here, too, rack awareness data improves cluster performance. Note that the second and third DataNodes sit in the same rack, so the final hop of the pipeline enjoys in-rack bandwidth and low latency. Only when a block has been successfully written to all 3 nodes does the next block begin.

A successful HDFS pipeline write

When all 3 nodes have successfully received the block, they each send a "block received" report to the NameNode. They also pass a "success" message back up the pipeline and close the TCP sessions. The client, on receiving the success message, tells the NameNode that the block was written. The NameNode updates the block's location information in its metadata. The client then opens the pipeline for the next block, and the process repeats until every block of the file has been written to DataNodes.

Hadoop consumes a lot of network bandwidth and storage. We will typically be handling files in the terabyte range, and with Hadoop's default configuration each file is replicated 3 times. That is, a 1TB file consumes 3TB of network traffic to load and 3TB of disk space to store.

Client writes span the cluster

Once the replication pipeline has completed for every block, the file has been successfully written to the cluster. As intended, the file is scattered across the machines of the cluster, each machine holding a relatively small piece of the data. The more blocks in a file, the more machines the data can spread across, and more CPU cores and disk drives touching the data means more parallel processing power and faster results. This is the motivation behind building large, wide clusters: to process more data, faster. As the number of machines grows and the cluster widens, the network has to scale appropriately as well.

Another way to scale a cluster is to go deep: put more disk drives and more CPU cores in each machine rather than adding machines. In scaling deep, you focus on meeting greater network I/O demands with fewer machines, and in this model how your Hadoop cluster's machines make the transition to 10-Gigabit Ethernet becomes an important factor to consider.

The NameNode

The NameNode holds all of the cluster's file system metadata, oversees the health of the DataNodes, and coordinates access to data. It is the central controller of HDFS. It holds no cluster data itself; it only knows which blocks make up each file and where in the cluster those blocks live.

Every 3 seconds, each DataNode sends a heartbeat to the NameNode over a TCP handshake, using the same port defined for the NameNode daemon, usually TCP 9000. Every tenth heartbeat is a block report, in which the DataNode tells the NameNode about all of the blocks it holds. Block reports let the NameNode build its metadata and verify that 3 copies of each block exist, on different nodes and in different racks.
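The 3-second interval is itself configurable; assuming a Hadoop 1.x-era deployment, it is the dfs.heartbeat.interval property in hdfs-site.xml, in seconds, and 3 is the default:

    <!-- hdfs-site.xml: how often each DataNode heartbeats the NameNode -->
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>
    </property>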

The NameNode is the pivotal component of the Hadoop Distributed File System (HDFS). Without it, clients cannot write files to or read files from HDFS, and MapReduce jobs cannot be scheduled or executed. For this reason, it is a good idea to put the NameNode on a highly redundant enterprise-class server: dual power supplies, hot-swappable fans, redundant NIC connections, and so on.

Re-replicating missing blocks

If the NameNode stops receiving heartbeats from a DataNode, it presumes the node dead and any data it held gone. From the block reports it previously received from the dead node, the NameNode knows which block copies died with it and can decide to re-replicate those blocks to other DataNodes. It again consults the rack awareness data, so the two-copies-in-one-rack, one-in-another rule is preserved.

Consider the scenario in which an entire rack of servers falls off the network, perhaps because of a rack switch failure or a power failure. The NameNode would begin instructing the remaining nodes in the cluster to re-replicate all of the blocks lost in that rack. If each server in that rack held 12TB of data, that could be hundreds of terabytes of data that need to start traversing the network.

The Secondary NameNode

Hadoop has a server role called the Secondary NameNode. A common misconception is that this role provides a high-availability backup for the NameNode; that is not the case.

The Secondary NameNode connects to the NameNode from time to time and grabs a copy of the NameNode's in-memory metadata, along with the files used to store that metadata. The Secondary NameNode combines this information into a fresh set of files and delivers them back to the NameNode, keeping a copy for itself.
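How often "from time to time" is can be tuned; assuming a Hadoop 1.x-era deployment, the checkpoint interval is the fs.checkpoint.period property, in seconds, with one hour as the default:

    <!-- core-site.xml: seconds between Secondary NameNode checkpoints -->
    <property>
      <name>fs.checkpoint.period</name>
      <value>3600</value>
    </property>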

Should the NameNode die, the files retained by the Secondary NameNode can be used to help recover it.

Client reads from HDFS

When a client wants to read a file from HDFS, it again consults the NameNode and asks for the locations of the file's blocks.

The client picks one DataNode from each block's list and reads the block over TCP port 50010. It does not move on to the next block until the current block is complete.
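A matching sketch of the read side in Java, again assuming the placeholder host name "namenode": the NameNode lookup happens inside open(), and the stream then pulls blocks from the chosen DataNodes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);
            // open() asks the NameNode for block locations; reads then
            // stream block by block from DataNodes over port 50010
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/Results.txt"))));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // print the word-count output
            }
            in.close();
            fs.close();
        }
    }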

DataNode reads from HDFS

In some cases a DataNode daemon itself needs to read blocks from HDFS. One such scenario is when a DataNode is asked to process data it does not hold locally, so it must retrieve the data from another DataNode over the network before it can begin processing.

This is another important case in which the NameNode's rack awareness provides optimal network behavior. When the DataNode asks the NameNode for a block's location, the NameNode checks whether another DataNode in the same rack already has the data. If so, the NameNode hands back an in-rack location to retrieve the data from, and the flow never has to traverse two or more switches and congested links to reach a copy in another rack. With the data retrieved faster within the rack, processing can begin earlier and the job finishes sooner.

Map Task

Now that File.txt is spread across my cluster of machines, I have the opportunity to process it with extremely fast and efficient parallelism. Hadoop's parallel processing framework is called MapReduce, named after the two steps of the model: Map and Reduce.

The first step is the Map phase. This is where we ask each machine to run a computation over its local blocks of data. In our case, we ask the machines to count the number of times the word "Refund" appears in their blocks of File.txt.

To begin this process, the client machine submits the MapReduce job to the JobTracker, asking "how many times does Refund appear in File.txt?" (paraphrasing the Java code). The JobTracker queries the NameNode to learn which DataNodes hold blocks of File.txt. The JobTracker then tells the TaskTrackers running on those nodes to execute the Map computation, with its Java code, against their local data. Each TaskTracker starts a map task and monitors its progress, sending heartbeats and task status back to the JobTracker.
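A minimal sketch of what such a map task might look like in Java, using the standard org.apache.hadoop.mapreduce API: it emits the pair ("Refund", 1) for every occurrence of the word in its local block, while scheduling, heartbeats, and data locality are all handled by the framework.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RefundMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private static final Text REFUND = new Text("Refund");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Scan one line of the local block and emit an intermediate
            // (key, value) pair for each hit
            for (String word : line.toString().split("\\s+")) {
                if (word.equals("Refund")) {
                    context.write(REFUND, ONE);
                }
            }
        }
    }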

As each map task completes, its node stores the result of the local computation in temporary local storage. This is called the "intermediate data". The next step is to send this intermediate data over the network to a node running the Reduce task, for the final computation.

Non-local map tasks

Although the JobTracker always tries to pick a node with local data for a map task, it may not always be able to. One reason might be that every node holding the data is already running too many other tasks and cannot accept more.

In this case, the JobTracker consults the NameNode's rack awareness knowledge, and the NameNode recommends another node in the same rack as the data. The JobTracker hands the task to that node, and when the node goes looking for its data, the NameNode directs it to a node in its own rack to fetch from.

The Reduce task processes the data received from the map tasks

The second phase of the MapReduce framework is called Reduce. The map tasks on the machines have completed and generated their intermediate data. Now we need to collect all of this intermediate data, combining and distilling it into one final result.

The JobTracker starts a reduce task on any node in the cluster and instructs it to fetch the intermediate data from all of the completed map tasks. The map tasks may all respond to the reducer at almost the same moment, suddenly leaving a large number of nodes sending TCP data to a single node. This traffic condition is often called "incast" or "fan-in". For a network handling many incast conditions, it is important that the switches have well-engineered internal traffic management and adequate buffers (neither too large nor too small).

The reducer task has now collected all of the intermediate data from the map tasks and can begin the final computation phase. In our case, it simply adds up the occurrences of the word "Refund" and writes the result to a file called Results.txt.
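The matching reduce side, as a sketch in the same API: it sums the 1s emitted by every map task for the key "Refund", and the total lands in Results.txt on HDFS.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class RefundReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get(); // add up the per-block counts
            }
            context.write(word, new IntWritable(sum));
        }
    }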

Results.txt is written to HDFS by the very process we have already covered: the file is split into blocks and the blocks are written through the pipeline. When it is done, the client can read Results.txt from HDFS and the job is considered complete.

Our simple word-count job did not push a large amount of intermediate data across the network. Other jobs, however, may produce a great deal of intermediate data, such as sorting terabytes of data.

If you are a diligent network administrator, you will learn about MapReduce and the types of jobs your cluster will run, and how those job types affect your network traffic. If you are a Hadoop networking rock star, you may even be able to suggest better ways to code MapReduce jobs so as to optimize network performance and, in turn, speed up job completion times.

Unbalanced Hadoop cluster

Hadoop can deliver real success for your organization, letting it unlock business value it never knew was in its data. When the business people understand this, you can be sure more money will quickly follow for more racks of servers and network for your Hadoop cluster.

When you add new racks of servers and network to an existing Hadoop cluster, the cluster ends up unbalanced. In this example, racks 1 and 2 are my existing racks, holding File.txt and running my MapReduce jobs on it. When I add two new racks to the cluster, my File.txt data does not automatically start spreading over to the new racks.

The new servers sit idle until I begin loading new data into the cluster. Furthermore, if the servers in racks 1 and 2 are very busy, the JobTracker may have no choice but to assign map tasks on File.txt to the new servers, which hold none of its data locally, so the new servers must fetch the data over the network. As a result, you may see more network traffic and longer job completion times.

The Hadoop cluster balancer

To compensate for cluster imbalance, Hadoop includes a balancer.

The balancer looks at the difference in available storage between nodes and attempts to keep it within a certain threshold. When it detects nodes with lots of free space, it identifies the nodes with little space left and begins copying block data off them to the roomier nodes. The balancer only runs while the command is active in a terminal, and it shuts down when the command is cancelled or the terminal is closed.

The balancer is allowed only a trickle of network bandwidth, by default just 1MB/s. That bandwidth is set by the dfs.balance.bandwidthPerSec parameter in the hdfs-site.xml file.
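For example, to raise the limit to 10MB/s (the value is in bytes per second, per DataNode):

    <!-- hdfs-site.xml: bytes per second each DataNode may spend balancing -->
    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <value>10485760</value>
    </property>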

The balancer is good housekeeping for your cluster. It should definitely be run whenever new machines are added, and perhaps even once a week for good measure. Given its low default bandwidth, the balancer can take a long time to finish its work.

I think it would be interesting if the balancer were a core part of Hadoop rather than just a utility.
