How to configure appropriate hardware for a Hadoop cluster

Source: Internet
Author: User
Keywords: Hadoop, hardware

With the advent of the big data age, Hadoop has become a familiar concept, and in practice, choosing the right hardware for a Hadoop cluster is a key question for many teams starting out with Hadoop.

In the past, large-scale data processing was mainly based on standardized blade servers and storage area networks (SANs) to meet grid and processing-intensive workloads. However, as data volumes and user counts have grown dramatically, infrastructure needs have changed, and hardware vendors have had to build innovative systems to meet big data requirements, including storage blades, SAS (Serial Attached SCSI) switches, external SATA arrays, and larger-capacity rack units. The goal is to find new ways to store and process complex data. Data in Hadoop is distributed evenly across the cluster, and replicas are used to ensure reliability and fault tolerance. Because both the data and the processing work are distributed across the servers, processing instructions can be sent directly to the machines that store the data. Each server in such a cluster both stores and processes data, so every node in a Hadoop cluster must be configured to meet both storage and processing requirements.
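As a rough illustration of how replication affects capacity planning, the sketch below estimates usable HDFS capacity from raw disk capacity. The replication factor of 3 is HDFS's default; the 25% overhead reserve and the example cluster size are assumptions for illustration only.

```python
def usable_capacity_tb(nodes, disks_per_node, tb_per_disk,
                       replication=3, overhead=0.25):
    """Estimate usable HDFS capacity from raw disk capacity.

    `overhead` reserves a fraction of raw space for intermediate
    MapReduce output and the OS (0.25 is an assumed rule of thumb).
    """
    raw = nodes * disks_per_node * tb_per_disk
    return raw * (1 - overhead) / replication

# A hypothetical 20-node cluster with 4 x 2 TB disks per node:
print(usable_capacity_tb(20, 4, 2))  # 40.0 TB usable out of 160 TB raw
```

With triple replication and overhead reserved, only about a quarter of raw disk capacity is available for user data, which is why per-node disk counts matter so much in the specifications below.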

The most central designs in the Hadoop framework are HDFS, which stores massive amounts of data, and MapReduce, which computes over it. MapReduce jobs mainly either read data from disk or the network (IO-intensive work) or compute over data (CPU-intensive work). The overall performance of a Hadoop cluster depends on the balance between CPU, memory, network, and storage, so the operations team should choose the appropriate hardware type for each kind of worker node when selecting machine configurations. The main node types in a basic Hadoop cluster are: the NameNode, responsible for coordinating data storage in the cluster; the DataNode, which stores the split data blocks; the JobTracker, which coordinates data computation tasks; and finally the SecondaryNameNode, which helps the NameNode collect status information about the running file system.
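The split between IO-bound reads and CPU-bound computation shows up in MapReduce's two phases. The sketch below mimics them in plain Python with a word count, a simplified illustration of the programming model, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: read input records (IO-intensive on a real cluster,
    # where each mapper reads a block from local disk) and emit
    # (key, value) pairs.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: group by key and aggregate. For heavier computations
    # than this simple sum, this is the CPU-intensive phase.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

counts = reduce_phase(map_phase(["big data", "big cluster"]))
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```

Because the map phase runs where the data lives, disk and network throughput dominate its cost, while the reduce phase stresses CPU and memory; this is the performance balance the node specifications below try to strike.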

In the cluster, most machines work as DataNodes and TaskTrackers. The following specification can be used for DataNode/TaskTracker hardware:

4 disk drives (1-2 TB per disk), configured as JBOD

2 quad-core CPUs, at least 2-2.5 GHz

16-24 GB of memory

Gigabit Ethernet
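Given the DataNode specification above, one can make a back-of-the-envelope estimate of how many such nodes a target dataset needs. The replication factor and overhead fraction below are assumptions, and the target dataset size is hypothetical:

```python
import math

def datanodes_needed(dataset_tb, disks_per_node=4, tb_per_disk=2,
                     replication=3, overhead=0.25):
    """Nodes required to store `dataset_tb` of user data on the
    4-disk DataNode profile described above."""
    per_node_usable = (disks_per_node * tb_per_disk
                       * (1 - overhead) / replication)
    return math.ceil(dataset_tb / per_node_usable)

# A hypothetical 100 TB dataset on the 4 x 2 TB node profile:
print(datanodes_needed(100))  # 50 nodes
```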

The NameNode provides all namespace management, block management, and related services for the entire HDFS file system, so it needs more RAM, in proportion to the number of blocks in the cluster, and benefits from optimized memory channel bandwidth, using dual-channel or triple-channel (or wider) memory. The following hardware specification can be used:

8-12 disk drives (1-2 TB per disk)

2 quad-core/8-core CPUs

16-72 GB of memory

Gigabit/10 Gigabit Ethernet
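The relation between NameNode RAM and block count can be sketched with the commonly cited rule of thumb of roughly 150 bytes of heap per namespace object (file, directory, or block). The 150-byte figure is an approximation rather than an exact Hadoop guarantee, and the example namespace sizes are hypothetical:

```python
def namenode_heap_gb(files, directories, blocks, bytes_per_object=150):
    """Rough NameNode heap estimate: each file, directory, and
    block is held in memory as an object of ~150 bytes."""
    objects = files + directories + blocks
    return objects * bytes_per_object / 1024**3

# A hypothetical namespace: 10M files, 1M directories, 30M blocks.
print(round(namenode_heap_gb(10_000_000, 1_000_000, 30_000_000), 2))
# roughly 5.73 GB of heap for the namespace alone
```

Since the whole namespace must fit in memory, the 16-72 GB range above leaves room for namespaces with hundreds of millions of objects plus JVM and OS overhead.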

The SecondaryNameNode can share a single machine with the NameNode in a small cluster, while larger clusters can use the same hardware as the NameNode. Given the need for fault tolerance on critical nodes, it is recommended to purchase hardened servers to run NameNodes and JobTrackers, with redundant power supplies and enterprise-class RAID disks. It is best to have a standby machine that can substitute when the NameNode or JobTracker suddenly fails.

Hardware platforms on the market can already meet DataNode/TaskTracker configuration needs. Li Hua Technology, drawing on years of experience with network security hardware platforms and anticipating Hadoop's development prospects, introduced the FX-3411, a device aimed specifically at the NameNode: dual Xeon processors with 12 hard drives, fusing computation and storage, with four-channel memory reaching a maximum capacity of 256 GB, fully meeting the NameNode's requirements for a large-memory model with heavy reference-data caching.

On the network side, the FX-3411 supports two PCIe x8 network expansion slots, with network throughput reaching 80 Gbps, satisfying a node's demand for Gigabit or 10 Gigabit Ethernet. In addition, for DataNode/TaskTracker and other node configurations, Li Hua Technology has launched the standard FX-3210, with a single 8-core Xeon E3 processor and 4 hard disks, and can also provide comprehensive customized solutions to meet different customer needs.

Hadoop clusters often run dozens, hundreds, or thousands of nodes. Building hardware that matches the workload can save an operations team considerable cost, and therefore requires careful planning and careful choice.
