Deep understanding of Hadoop clusters and networks

Last Update:2014-12-25 Source: Internet

Author: User

Introduction: The network is a relatively little-discussed topic in cloud computing and Hadoop. This article was written by Brad Hedlund, a technical expert at Dell who previously worked at Cisco for years, specializing in data centers, cloud networking, and related areas. The material is based on the author's own research, experiments, and Cloudera training material.

This article will focus on the architecture and methodology of the Hadoop cluster and how it relates to the network and server infrastructure. Let's start with the basics of how the Hadoop cluster works.

Server roles in Hadoop

A Hadoop deployment has three main categories of machine roles: client machines, master nodes, and slave nodes. The master nodes oversee the two key functional pieces of Hadoop: HDFS (data storage) and MapReduce (parallel computation on that data). The Name Node oversees and coordinates HDFS, while the Job Tracker oversees and schedules the parallel processing of data using MapReduce. Slave nodes make up the vast majority of machines and do all the drudgery of storing the data and running the computations. Each slave node runs both a Data Node daemon and a Task Tracker daemon, which communicate with and receive instructions from their master nodes. The Task Tracker daemon is subordinate to the Job Tracker, and the Data Node daemon is subordinate to the Name Node.

Client machines have Hadoop installed with all the cluster settings, but they are neither master nodes nor slave nodes. Instead, the role of the client machine is to load data into the cluster, submit MapReduce jobs describing how that data should be processed, and retrieve or view the results when the job finishes. In a small cluster (approximately 40 nodes), a single physical server may play multiple roles, such as both Job Tracker and Name Node. In medium to large clusters, each role commonly runs on its own dedicated server.

Real production clusters have no server virtualization and no hypervisor layer, so no performance is lost to unnecessary overhead. Hadoop runs best on Linux, working directly with the underlying hardware. That said, Hadoop does also run in virtual machines, which is a cheap, easy, and fast way to learn it.

Hadoop cluster

A typical Hadoop cluster is built as follows. Rack servers (not blade servers) are installed in racks, and each server is connected to a top-of-rack switch, usually over 1 or 2 Gigabit Ethernet links. 10 Gigabit Ethernet nodes are uncommon, but interest in them is growing as machines become denser in CPU cores and disk drives. Each rack switch has uplinks to another tier of switches that connects all the racks with uniform bandwidth and forms the cluster. The majority of the servers are slave nodes, each with its own local disk storage, CPU, and DRAM. Some machines are master nodes, which may be configured differently: less local disk storage, but faster CPUs and more DRAM.

Now let's look at how the application works:

Hadoop Workflow

Why does Hadoop exist, and what problem does it solve? In short, businesses and governments have huge amounts of data that need to be analyzed and processed quickly. Cut that huge chunk of data into small pieces, spread the pieces over a large number of machines, and let those machines process their pieces in parallel: that, in a nutshell, is what Hadoop does.

In this simple example, we have a large data file containing emails sent to the customer service department. I want a quick count of how many times the word "Refund" appears in those messages. This is a simple word count exercise. The client loads the data into the cluster (File.txt) and submits a job describing how to analyze the data (word count); the cluster stores the results in a new file (Results.txt), and the client reads the results file.
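The word count exercise can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce idea on a single machine, not Hadoop's own API: the function names `map_phase` and `reduce_phase`, and the three sample strings standing in for File.txt, are assumptions made for this sketch.

```python
import re
from collections import defaultdict


def map_phase(chunk, target="Refund"):
    """Map step: emit a (target, 1) pair for every whole-word match in one chunk."""
    pattern = re.compile(r"\b" + re.escape(target) + r"\b")
    return [(target, 1) for _ in pattern.finditer(chunk)]


def reduce_phase(pairs):
    """Reduce step: sum the counts emitted by all map calls, per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Hypothetical stand-in for File.txt, already split into three chunks
chunks = [
    "Please process my Refund today.",
    "No complaints here, thanks.",
    "Refund requested twice. Where is my Refund?",
]

# In a cluster each chunk would be mapped on a different machine;
# here the map calls simply run one after another.
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(pairs)
print(result)  # {'Refund': 3}
```

Because each call to `map_phase` looks at only one chunk, the calls are independent of each other, which is what lets a cluster run them in parallel.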

Writing a file to HDFS

A Hadoop cluster is useless until it holds data, so let's start by loading the huge File.txt into the cluster. The goal is fast parallel processing of the data, and to achieve it we need as many machines as possible working on the data at the same time. To that end, the client splits the data file into smaller blocks and places those blocks on different machines throughout the cluster. The more blocks there are, the more machines can process the data in parallel. At the same time, these machines may fail, so to avoid data loss each block must be stored on several machines at once. Each block is therefore replicated in the cluster as it is loaded. By default, Hadoop keeps 3 copies of each block. This can be changed with the dfs.replication parameter in the hdfs-site.xml file.
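To make the cost of replication concrete, here is a minimal arithmetic sketch. The 64 MB block size and the 192 MB file size are assumed values chosen for illustration (the text above gives neither), and `plan_storage` is a hypothetical helper, not part of Hadoop.

```python
def plan_storage(file_size, block_size, replication=3):
    """Return (number of blocks, block copies in the cluster, raw bytes stored)."""
    num_blocks = -(-file_size // block_size)  # ceiling division on integers
    return num_blocks, num_blocks * replication, file_size * replication


MB = 1024 * 1024

# Assumed sizes: a 192 MB file and 64 MB pieces, with the default of 3 copies
blocks, copies, raw = plan_storage(192 * MB, 64 * MB)
print(blocks, copies, raw // MB)  # 3 9 576
```

With these assumptions, the file is split into 3 pieces, the cluster holds 9 copies in total, and the raw disk space consumed is three times the file size.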

The client divides File.txt into 3 blocks. For each block, the client consults the Name Node (usually on TCP port 9000) and receives a list of 3 Data Nodes that should hold a copy of that block. The client then writes the block directly to the Data Node (usually on TCP port 50010). The receiving Data Node replicates the block to the other Data Nodes, and the cycle repeats for the remaining blocks. The Name Node only provides the map of where data is located and where it should go in the cluster (file system metadata).
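The write path just described can be modeled with a small in-memory simulation. This is a sketch under stated assumptions, not HDFS code: the classes `NameNode` and `DataNode`, the function `write_file`, the node names, and the round-robin placement are all hypothetical (real HDFS placement also takes rack location into account), and no network traffic is involved.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    blocks: dict = field(default_factory=dict)

    def receive(self, block_id, data, pipeline):
        """Store the piece, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if pipeline:
            pipeline[0].receive(block_id, data, pipeline[1:])


class NameNode:
    """Holds only metadata: which data nodes hold which piece of a file."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map = {}
        self._next = 0

    def choose_targets(self, block_id):
        # Hypothetical round-robin placement, used here only for illustration
        n = len(self.data_nodes)
        targets = [self.data_nodes[(self._next + i) % n]
                   for i in range(self.replication)]
        self._next = (self._next + 1) % n
        self.block_map[block_id] = [t.name for t in targets]
        return targets


def write_file(name_node, filename, blocks):
    """Client side: ask the name node where each piece goes, then write it."""
    for index, data in enumerate(blocks):
        block_id = f"{filename}#blk{index}"
        targets = name_node.choose_targets(block_id)
        # The client writes to the first data node only; that node forwards
        # the piece to the second, which forwards it to the third.
        targets[0].receive(block_id, data, targets[1:])


nodes = [DataNode(f"dn{i}") for i in range(1, 6)]
nn = NameNode(nodes)
write_file(nn, "File.txt", [b"part-A", b"part-B", b"part-C"])
for block_id, holders in nn.block_map.items():
    print(block_id, "->", holders)
```

Note that the file data never passes through the `NameNode` object; it only records where each piece lives, which mirrors its metadata-only role in the description above.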

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
