Deep understanding of Hadoop clusters and networks
Introduction: The network is a relatively small area of discussion in cloud computing and Hadoop. This article was written by Brad Hedlund, a technical expert at Dell who worked at Cisco for many years, specializing in data centers and cloud networking. The material is based on the author's own research, experiments, and Cloudera training material.
This article will focus on the architecture and methodology of the Hadoop cluster and how it relates to the network and server infrastructure. Let's start with the basics of how the Hadoop cluster works.
Server roles in Hadoop
A Hadoop deployment has three kinds of machines: client machines, master nodes, and slave nodes. The master nodes oversee Hadoop's two key functional modules: HDFS and MapReduce. The JobTracker oversees and coordinates the parallel processing of data with MapReduce, while the NameNode oversees and coordinates data storage in HDFS. Slave nodes make up the vast majority of the machines and do the drudge work of storing all the data and running the computations. Each slave node runs both a DataNode and a TaskTracker daemon that communicate with the master nodes: the TaskTracker daemon is subordinate to the JobTracker, and the DataNode daemon to the NameNode.
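As a rough illustration of how a client finds those two masters, the sketch below uses the Hadoop 1.x Java API; the property names (fs.default.name, mapred.job.tracker) are the ones from that era, and the hostnames in the fallback values are placeholders, not part of the original article.

import org.apache.hadoop.conf.Configuration;

public class ClusterAddresses {
    public static void main(String[] args) {
        // Reads core-site.xml and mapred-site.xml from the classpath on the client machine.
        Configuration conf = new Configuration();
        // fs.default.name points HDFS clients at the NameNode (Hadoop 1.x property name).
        System.out.println("NameNode:   " + conf.get("fs.default.name", "hdfs://namenode:9000"));
        // mapred.job.tracker points MapReduce clients at the JobTracker.
        System.out.println("JobTracker: " + conf.get("mapred.job.tracker", "jobtracker:9001"));
    }
}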
Client machines have Hadoop installed with all the cluster settings, but they are neither master nor slave nodes. Instead, the role of the client machine is to load data into the cluster, submit MapReduce jobs describing how that data should be processed, and retrieve or view the results when the job is finished. In a small cluster (around 40 nodes), a single physical server may take on several roles, for example running the JobTracker and the NameNode at the same time. In medium and large clusters, each role usually gets its own dedicated server.
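As a small sketch of the last of those client duties, the code below reads a finished job's output back from HDFS; the output path and the part-r-00000 file name are assumptions for this example (part-r-00000 is merely the conventional name of the first reducer's output file).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadResults {
    public static void main(String[] args) throws Exception {
        // Runs on the client machine, which is not part of the cluster; it only talks to it.
        FileSystem fs = FileSystem.get(new Configuration());
        Path results = new Path("/user/client/results/part-r-00000"); // placeholder path
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(results)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // e.g. "Refund <count>"
            }
        }
        fs.close();
    }
}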
In a real production cluster there is no server virtualization and no hypervisor layer, since that would only add overhead and cost performance. Hadoop runs best on Linux, working directly with the underlying hardware. That said, Hadoop does work in a virtual machine, which is an unbeatably cheap, easy, and fast way to learn and experiment.
Hadoop cluster
Above is the layout of a typical Hadoop cluster. A series of racks are populated with rack servers (not blade servers), each connected to a top-of-rack switch, usually with 1GbE or bonded 2GbE links. 10GbE connections are uncommon, but they are gaining interest as machines keep getting denser in CPU cores and disk drives. The top-of-rack switches uplink, at uniform bandwidth, to another tier of switches that connects all the racks together and forms the cluster. The majority of the servers, each with its own disk storage, CPU, and DRAM, will become slave nodes. Some of the machines will become master nodes; these have less disk storage but faster CPUs and more DRAM.
Now let's look at how the application works:
Hadoop workflow
Why does Hadoop thrive in such a competitive computer industry? In short, businesses and governments have enormous amounts of data that need to be analyzed and processed quickly. Chop that data into many pieces, spread the pieces across a large number of machines, and let those machines process their portions in parallel: that is exactly what Hadoop does.
In the simple example below, we have a large data file containing emails sent to the customer service department, and we want a quick count of how many times the word "Refund" appears in those messages. This is a simple word count exercise. The client loads the data into the cluster (File.txt), submits a job describing how to analyze the data (a word count), the cluster stores the results in a new file (Results.txt), and the client reads the results file.
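The job itself is the canonical word count, narrowed to a single word. Below is a minimal sketch using the Hadoop 1.x MapReduce API; the class names and the input and output paths are placeholders invented for this example, not something prescribed by the article.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RefundCount {

    // Mapper: emit ("Refund", 1) for every occurrence of the word in a line of File.txt.
    public static class RefundMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text REFUND = new Text("Refund");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.equals("Refund")) {
                    context.write(REFUND, ONE);
                }
            }
        }
    }

    // Reducer (also used as a combiner): sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "refund count"); // Hadoop 1.x style job construction
        job.setJarByClass(RefundCount.class);
        job.setMapperClass(RefundMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/client/File.txt"));  // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/user/client/results")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would be submitted from the client machine with the hadoop jar command, which hands the job off to the JobTracker for scheduling across the slave nodes.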
Writing files to HDFS
The Hadoop cluster has no work to do until data is loaded into it, so let's begin by loading the huge File.txt into the cluster. The primary goal, of course, is fast parallel processing of the data, and for that we need as many machines as possible working on the data at the same time. To this end, the client divides the file into smaller blocks and spreads those blocks across different machines throughout the cluster. The smaller the blocks, the more machines can process the data in parallel. At the same time, any of these machines may fail, so to avoid data loss each block must live on more than one machine. Every block is therefore replicated as it is loaded into the cluster. Hadoop's default setting is to keep three copies of each block; this can be changed with the dfs.replication parameter in the hdfs-site.xml file.
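A client can perform this load with the HDFS Java API. The sketch below uses placeholder paths; setting dfs.replication in code simply mirrors the hdfs-site.xml setting mentioned above (three copies per block is already the default).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Mirrors <name>dfs.replication</name><value>3</value> in hdfs-site.xml;
        // three copies per block is already the Hadoop default.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        // The client writes the blocks out to DataNodes; the NameNode only hands back
        // the list of DataNodes each block should go to.
        fs.copyFromLocalFile(new Path("/local/data/File.txt"),   // placeholder local path
                             new Path("/user/client/File.txt")); // placeholder HDFS path
        fs.close();
    }
}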
The client divides File.txt into three blocks. For each block, the client consults the NameNode (usually over TCP port 9000) and gets back a list of the three DataNodes that should hold a copy of that block. The client then writes the block directly to the first DataNode (usually over TCP port 50010). The receiving DataNode replicates the block to the next DataNode in the list, and this repeats until all of the DataNodes have finished copying. The NameNode is only responsible for providing a map of where data is, and should be, located in the cluster (the file system metadata).
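That file system metadata can be queried from any client. As an illustration, the sketch below asks the NameNode which DataNodes hold each block of File.txt; the file path is again a placeholder.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/client/File.txt"); // placeholder path
        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode for the file system metadata: which DataNodes hold each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}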