How to configure the appropriate hardware for the Hadoop cluster

Last Update:2014-12-22 Source: Internet

Author: User

With the arrival of the big data era, Hadoop has become a familiar concept. In practice, choosing the right hardware for a Hadoop cluster is one of the first key questions many people face when they start using it.

In the past, big data processing relied mainly on standardized blade servers and storage area networks (SANs) to handle grid and processing-intensive workloads. However, as data volumes and user numbers have grown dramatically, infrastructure needs have changed, and hardware vendors have had to build new kinds of systems to meet big data requirements, including storage blades, SAS (Serial Attached SCSI) switches, external SATA arrays, and larger-capacity rack units. The aim is to find new ways to store and process complex data. Hadoop distributes data evenly across the cluster and uses replicas to ensure reliability and fault tolerance. Because both the data and the processing are distributed across the servers, processing instructions can be sent directly to the machines that store the data. Every server in such a cluster both stores and processes data, so each node in a Hadoop cluster must be configured to meet both storage and processing requirements.
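The idea of sending work to the machine that already holds the data can be illustrated with a small Python sketch. It is a toy model, not Hadoop's actual placement or scheduling code: the round-robin placement, the function names `place_blocks` and `pick_node`, and the node names are all hypothetical, and the replication factor of 3 is assumed because it is the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplification: real HDFS placement is rack-aware; this only shows
    that every block ends up with several replicas on different nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


def pick_node(block_id, placement, nodes, busy):
    """Return (node, is_data_local) for a task that reads `block_id`.

    A free node holding a replica is preferred, so the task reads from
    local disk. Otherwise any free node is used and the block must be
    fetched over the network. Returns (None, False) if all nodes are busy.
    """
    for node in placement[block_id]:
        if node not in busy:
            return node, True
    for node in nodes:
        if node not in busy:
            return node, False
    return None, False


nodes = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical node names
placement = place_blocks(4, nodes)
print(pick_node(0, placement, nodes, busy={"n1"}))
```

In this model, losing or overloading one node still leaves two other replicas on which the same block can be processed locally.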

The core of the Hadoop framework consists of HDFS, which stores massive amounts of data, and MapReduce, which performs computation on that data. MapReduce operations mainly involve either reading data from disk or from the network, which is I/O-intensive work, or computing on the data, which is CPU-intensive work. The overall performance of a Hadoop cluster therefore depends on the balance between CPU, memory, network, and storage, and the operations team should choose hardware suited to each type of node when selecting machine configurations. A basic Hadoop cluster has four main node types: the NameNode coordinates data storage in the cluster, DataNodes store the blocks that files are split into, the JobTracker coordinates computing tasks, and the SecondaryNameNode helps the NameNode collect state information about the running file system.
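To make the notion of balance concrete, the following Python sketch compares a node's aggregate disk throughput with its network throughput. The function name `node_bottleneck` and the figure of roughly 100 MB/s of sequential throughput per disk are assumptions for illustration, not measurements from this article.

```python
def node_bottleneck(disks, disk_mb_s, nic_gbit_s):
    """Compare aggregate disk and network throughput for one node.

    disks      -- number of data disks in the node
    disk_mb_s  -- assumed sequential throughput per disk, in MB/s
    nic_gbit_s -- network interface speed, in Gbit/s
    """
    disk_total = disks * disk_mb_s
    # Convert Gbit/s to MB/s using decimal units (1 Gbit/s = 125 MB/s).
    network_total = nic_gbit_s * 1000 / 8
    limit = "network" if network_total < disk_total else "disk"
    return {
        "disk_mb_s": disk_total,
        "network_mb_s": network_total,
        "limit": limit,
    }


# Four disks at an assumed 100 MB/s each, on Gigabit Ethernet.
print(node_bottleneck(disks=4, disk_mb_s=100, nic_gbit_s=1))
```

Under these assumed numbers the Gigabit link is the slower resource, which is one reason reading data from local disks rather than across the network matters for I/O-intensive jobs.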

Most machines in the cluster serve as both DataNode and TaskTracker. The following hardware specification can be used for DataNode/TaskTracker nodes:

4 disk drives (1-2 TB each), with JBOD support

2 quad-core CPUs, at least 2-2.5 GHz

16-24 GB of memory

Gigabit Ethernet
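As a rough sizing aid, the specification above can be turned into a usable-capacity estimate. The Python sketch below is only an approximation under stated assumptions: the replication factor of 3 is the HDFS default, the 25% reserve for intermediate MapReduce output and other non-HDFS use is an assumed figure, and the function names are hypothetical.

```python
import math


def usable_capacity_tb(nodes, disks_per_node, disk_tb,
                       replication=3, reserve=0.25):
    """Estimate the HDFS capacity available for unique data, in TB.

    reserve     -- assumed fraction of raw disk kept for temporary files
    replication -- number of copies HDFS keeps of each block
    """
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserve) / replication


def nodes_needed(data_tb, disks_per_node, disk_tb,
                 replication=3, reserve=0.25):
    """Estimate how many DataNodes are needed to hold `data_tb` of data."""
    per_node_tb = usable_capacity_tb(1, disks_per_node, disk_tb,
                                     replication, reserve)
    return math.ceil(data_tb / per_node_tb)


# Ten nodes with 4 x 2 TB disks each, as in the specification above.
print(usable_capacity_tb(nodes=10, disks_per_node=4, disk_tb=2))
print(nodes_needed(data_tb=100, disks_per_node=4, disk_tb=2))
```

The point of the estimate is that raw disk space overstates what a cluster can hold: with three replicas and some reserved space, each node with 8 TB of raw disk contributes about 2 TB of unique data in this model.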

The NameNode provides namespace management, block management, and all related services for the entire HDFS file system, so it needs more RAM, in proportion to the number of blocks in the cluster. Memory channel bandwidth should also be optimized by using dual-channel, triple-channel, or wider memory configurations. The following hardware specification can be used:

8-12 disk drives (1-2 TB each)

2 quad-core or 8-core CPUs

16-72 GB of memory

Gigabit or 10 Gigabit Ethernet
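Because NameNode memory grows with the number of blocks, a rough heap estimate can be sketched in Python. The figure of about 150 bytes of heap per namespace object is a commonly quoted rule of thumb rather than a number from this article, and the 64 MB block size is assumed as the Hadoop 1.x default. Both are parameters, the function names are hypothetical, and directories are ignored for simplicity.

```python
import math


def estimate_block_count(data_tb, block_mb=64):
    """Estimate the number of HDFS blocks needed for `data_tb` of data.

    Assumes files are large enough to fill whole blocks; many small
    files would produce far more blocks than this.
    """
    return math.ceil(data_tb * 1024 * 1024 / block_mb)


def namenode_heap_gb(num_files, num_blocks, bytes_per_object=150):
    """Estimate NameNode heap for file and block metadata, in GB.

    bytes_per_object is an assumed rule-of-thumb value; JVM overhead
    and working headroom are not included.
    """
    return (num_files + num_blocks) * bytes_per_object / 1024 ** 3


blocks = estimate_block_count(data_tb=100)
print(blocks)
print(round(namenode_heap_gb(num_files=blocks, num_blocks=blocks), 2))
```

An estimate like this gives only a lower bound; in practice extra headroom is needed, which is consistent with the wide 16-72 GB memory range given above.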

The SecondaryNameNode can share a machine with the NameNode in a small cluster; in larger clusters it can run on a separate machine with the same hardware as the NameNode. Because these critical nodes need to be fault tolerant, it is recommended to run the NameNode and JobTracker on hardened servers with redundant power supplies and enterprise-class RAID disks. It is best to keep a standby machine that can take over if the NameNode or JobTracker suddenly fails.

Hardware platforms currently on the market can meet the configuration needs of DataNode/TaskTracker nodes. Li Hua Technology, which has worked on network security hardware platforms for many years, has reportedly targeted the Hadoop market and introduced a device designed specifically for the NameNode: the FX-3411, with dual Xeon processors and 12 hard drives, which combines computing and storage in one system. Its four-channel memory supports a maximum capacity of 256 GB, meeting the NameNode's need for a large memory model and heavy caching of reference data.

On the network side, the FX-3411 supports two PCIe x8 network expansion slots with network throughput of up to 80 Gbps, which satisfies a node's need for Gigabit or 10 Gigabit Ethernet. For DataNode/TaskTracker and other node types, Li Hua Technology offers the standard FX-3210, with a single Xeon E3 8-core processor and 4 hard drives, and can also provide fully customized solutions to meet different customer needs.

Hadoop clusters often run dozens, hundreds, or even thousands of nodes. Building hardware that matches the cluster's workloads can save an operations team considerable cost, so it requires careful planning and selection.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.