As Apache Hadoop adoption grows, one of the first questions facing new users is how to select the right hardware for their new Hadoop cluster.
Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as handing over a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)
In this blog post you will learn some principles of workload evaluation and the critical role it plays in hardware selection. Along the way, you will also learn the various factors a Hadoop administrator should consider in this process.

Combining Storage and Compute
Over the past ten years, IT organizations have standardized on blade servers and storage area networks (SANs) to satisfy networking- and processing-intensive workloads. While this model makes sense for a number of standard applications, such as web servers, application servers, small structured databases, and data movement, the infrastructure requirements have changed as data volumes and user counts have grown. Web servers now have a caching tier; databases need massive local disks to parallelize; and the volume of data being migrated exceeds what can be processed locally.
Hardware vendors have created innovative systems to address these needs, including storage blades, SAS (serial-attached SCSI) switches, external SATA disk arrays, and larger-capacity rack units. However, Hadoop is based on a new approach to storing and processing complex data, in which data movement is minimized. Instead of relying on a SAN for massive storage and reliability, Hadoop handles big data and reliability at the software level.
Hadoop distributes data across clusters of balanced nodes and uses synchronous replication to ensure data availability and fault tolerance. Because data is distributed to nodes with compute capability, processing can be sent directly to the nodes where the data is stored. Since each node in a Hadoop cluster both stores and processes data, those nodes need to be configured to satisfy both data storage and compute requirements.

Is the Workload Important?
In almost all cases, a MapReduce job either hits a bottleneck reading data from disk or from the network (an IO-bound job) or hits a bottleneck processing data (a CPU-bound job). Sorting is an example of an IO-bound job: it requires very little CPU processing (just simple comparisons), but a great deal of reading and writing to disk. Pattern classification is a CPU-bound example, in which complex processing is used to determine an ontology.
Here are some more examples of IO-bound workloads:
Data import and export
Data movement and conversion
Here are some examples of CPU-bound workloads:
Clustering / classification
Complex text mining
Natural language processing
Cloudera customers need a complete understanding of their workloads in order to choose optimal Hadoop hardware, which can seem like a chicken-and-egg problem. Most teams build a Hadoop cluster before they have had a chance to thoroughly profile their workloads, and the workloads Hadoop runs often change considerably as the team's proficiency grows. Moreover, some workloads may be bound in unexpected ways. For example, a theoretically IO-bound workload may end up CPU-bound because the user selected a different compression codec, or because a different implementation of an algorithm changed how the MapReduce job is constrained. For these reasons, when a team is not yet familiar with the types of jobs it will run, deep analysis is the most sensible first step toward building a balanced Hadoop cluster.
The next step is to run MapReduce benchmark jobs on the cluster and analyze how they are bound. The most straightforward way to accomplish this is to add monitors at the right places in a running workload to detect bottlenecks. We recommend installing Cloudera Manager on the Hadoop cluster; it provides real-time statistics on CPU, disk, and network load. (Cloudera Manager is a component of Cloudera Standard and Enterprise Edition; Enterprise Edition additionally supports rolling upgrades.) With Cloudera Manager installed, administrators can run MapReduce jobs and watch the Cloudera Manager dashboards to monitor how each machine is performing.
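As a toy illustration of this kind of bottleneck analysis (this is not a Cloudera Manager feature; the 80% threshold and the sample readings are invented for the sketch), you could label a job from utilization samples gathered while it ran:

```python
# Hypothetical sketch: label a MapReduce job as IO-bound or CPU-bound
# from utilization samples collected while the job ran (e.g., read off
# a monitoring dashboard). The threshold is illustrative only.

def classify_bottleneck(cpu_busy_pct, disk_busy_pct, threshold=80.0):
    """Return which resource saturated, based on average busy percent."""
    avg_cpu = sum(cpu_busy_pct) / len(cpu_busy_pct)
    avg_disk = sum(disk_busy_pct) / len(disk_busy_pct)
    if avg_disk >= threshold and avg_disk >= avg_cpu:
        return "IO-bound"
    if avg_cpu >= threshold:
        return "CPU-bound"
    return "no clear bottleneck"

# A sort-like job: disks saturated, CPU mostly idle.
print(classify_bottleneck([20, 25, 30], [95, 92, 97]))   # IO-bound
# A text-mining job: CPU pegged, disks quiet.
print(classify_bottleneck([96, 99, 94], [15, 10, 12]))   # CPU-bound
```

In practice you would feed in real samples from your monitoring tool rather than hand-typed numbers, but the decision logic is the same one described above.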
In addition to building the right cluster for the workload, we advise customers to work with their hardware vendors to understand the power and cooling budget. Since Hadoop runs on tens, hundreds, or thousands of nodes, a team can save a great deal of money by choosing hardware with a high performance-per-watt ratio. Hardware vendors often provide tools and advice for monitoring power consumption and cooling.

Choosing Hardware for Your CDH (Cloudera Distribution for Hadoop) Cluster
The first step in choosing a machine configuration is to understand the type of hardware your operations team already manages. When purchasing new hardware, operations teams often have opinions or mandated requirements, and they prefer to work with platform types they already know. Hadoop is not the only system that benefits from economies of scale. Again, as a more general recommendation, if the cluster is newly established or you cannot accurately predict your ultimate workload, we advise choosing balanced hardware.
A Hadoop cluster has four basic node roles: NameNode (including the Standby NameNode), JobTracker, TaskTracker, and DataNode. (A node is a machine performing a particular function.) Most machines in your cluster will perform two of these roles, acting as both DataNode (for data storage) and TaskTracker (for data processing).
Here are the recommended specifications for DataNode/TaskTracker machines in a balanced Hadoop cluster:
12-24 1-4TB hard drives in a JBOD configuration
2 quad-, hex-, or octo-core CPUs, running at 2-2.5GHz
64-512GB of memory
Bonded Gigabit Ethernet or 10 Gigabit Ethernet (the greater the storage density, the higher the network throughput needed)
The NameNode role coordinates data storage on the cluster, and the JobTracker coordinates data processing. (The Standby NameNode should not be co-located with the NameNode, but should run on identical hardware.) Cloudera recommends that customers purchase commercial-grade machines for the NameNode and JobTracker, with sufficient power and enterprise-class disks in a RAID 1 or RAID 10 configuration.
The NameNode also needs RAM in direct proportion to the number of data blocks in the cluster. A good, though imprecise, rule of thumb is to allocate 1GB of NameNode memory for every one million blocks stored in the distributed file system. With 100 DataNodes in a cluster, 64GB of RAM on the NameNode provides plenty of room for the cluster to grow. We also recommend configuring HA for both the NameNode and the JobTracker.
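The rule of thumb above is easy to turn into a back-of-the-envelope estimate. The sketch below is illustrative only; the default block size and the assumption that block count can be derived from raw data volume are simplifications you would replace with your cluster's actual figures (a cluster full of small files will have far more blocks than this predicts):

```python
# Rough NameNode heap estimate: ~1GB of heap per million HDFS blocks
# (the imprecise rule of thumb quoted above). Block count is
# approximated from logical data size; small files inflate it badly.

def namenode_heap_gb(data_tb, block_size_mb=128):
    """Estimate NameNode heap in GB for a given logical data size."""
    blocks = (data_tb * 1024 * 1024) / block_size_mb  # logical blocks
    return blocks / 1_000_000  # 1GB of heap per million blocks

# ~1PB of data in 128MB blocks -> ~8.4M blocks -> ~8-9GB of heap,
# leaving enormous headroom on a 64GB NameNode.
print(round(namenode_heap_gb(1024), 1))
```

This also shows why 64GB is comfortable for the 100-DataNode example above: even a petabyte of well-formed files consumes only a small fraction of that heap.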
Here are the recommended specifications for NameNode/JobTracker/Standby NameNode machines. The number of drives, more or less, will depend on the amount of redundancy required.
4-6 1TB hard drives in a JBOD configuration (1 for the OS, 2 for the file system image [RAID 1], 1 for Apache ZooKeeper, and 1 for the Journal node)
2 quad-, hex-, or octo-core CPUs, running at least 2-2.5GHz
64-128GB of RAM
Bonded Gigabit Ethernet or 10 Gigabit Ethernet
Remember, the Hadoop ecosystem is designed with a parallelized environment in mind.
If you expect your Hadoop cluster to grow beyond 20 machines, we recommend that the initial cluster be configured as if it were to span two racks, with each rack having a 10GbE top-of-rack switch. As the cluster grows to span multiple racks, you will need to add a core switch, using 40GbE, to connect the top-of-rack switches. Having two logically separate racks gives the operations team a better understanding of the network requirements for intra-rack and inter-rack communication.
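One way to see why the top-of-rack/core split matters is to compute the oversubscription ratio of a rack's uplink: the node-facing bandwidth inside the rack divided by the bandwidth out to the core. This is a generic back-of-the-envelope sketch; the port counts and speeds below are hypothetical, not a Cloudera recommendation:

```python
# Back-of-the-envelope oversubscription for a top-of-rack switch:
# total node-facing bandwidth divided by core-facing uplink bandwidth.
# All numbers here are hypothetical examples.

def oversubscription(nodes, node_gbps, uplinks, uplink_gbps):
    """Ratio of rack-internal bandwidth to core-facing bandwidth."""
    return (nodes * node_gbps) / (uplinks * uplink_gbps)

# 20 nodes at 10Gbps each behind a single 40Gbps uplink: 5:1.
print(oversubscription(20, 10, 1, 40))
# A second 40Gbps uplink halves that to 2.5:1.
print(oversubscription(20, 10, 2, 40))
```

The closer this ratio is to 1:1, the less inter-rack shuffle traffic will queue at the uplink; profiling your workload (as described above) tells you how much oversubscription you can tolerate.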
Once the Hadoop cluster is installed, the operations team can begin profiling workloads and benchmarking them to identify hardware bottlenecks. After a period of benchmarking and monitoring, the team will understand how additional machines should be configured. Heterogeneous Hadoop clusters are common, especially as the capacity and number of machines in a cluster keep growing, so a set of machines that turns out "badly" configured for your workload is not a wasted investment. Cloudera Manager provides templates for grouped management of different hardware configurations, making heterogeneous clusters easy to manage.
The following is a list of hardware configurations for different workloads, including our original "balanced" configuration:
Light processing configuration (1U machine): two 16-core CPUs, 24-64GB of memory, and eight hard drives (1TB or 2TB each).
Balanced configuration (1U machine): two 16-core CPUs, 48-128GB of memory, and 12-16 hard drives (1TB or 2TB each) directly attached to the motherboard controller. These are often available as twins, with two motherboards and 24 drives in a single 2U chassis.
Storage-heavy configuration (2U machine): two 16-core CPUs, 48-96GB of memory, and 16-26 hard drives (2TB-4TB each). This configuration generates large amounts of network traffic when multiple nodes or racks fail.
Compute-intensive configuration (2U machine): two 16-core CPUs, 64-512GB of memory, and 4-8 hard drives (1TB or 2TB each).
(Note that Cloudera expects these configurations to also be available with 2×8-, 2×10-, and 2×12-core CPUs.)
The figure below shows you how to configure a machine based on workload:
Other things to consider
It is important to remember that the Hadoop ecosystem is designed with a parallelized environment in mind. When buying processors, we do not recommend the highest-clock-speed (GHz) chips, which carry high power draw (130 watts or more). Doing so brings two problems: higher power consumption and greater heat output. CPUs in the middle of the range for clock speed, price, and core count offer the best value for money.
When we encounter applications that generate a lot of intermediate data (that is, output roughly equal in volume to the data read in), we recommend enabling two ports on a single Ethernet card, or bonding two Ethernet cards, to give each machine 2Gbps of bandwidth. A node bonded to 2Gbps can accommodate up to 12TB of data. Once you store more than 12TB per node, you will need 4Gbps (4×1Gbps bonded). In addition, for users who have already moved to 10Gb Ethernet or InfiniBand, such solutions can be used to distribute workloads in line with network bandwidth. If you are considering switching to 10Gb Ethernet, verify that the operating system and BIOS are compatible with it.
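The 12TB-per-2Gbps guideline above can be expressed as a tiny sizing helper. This is only a sketch of the article's rule of thumb (scaling in 2Gbps steps of bonded 1GbE links), not a vendor formula:

```python
import math

# Sketch of the rule of thumb above: roughly 2Gbps of bonded bandwidth
# per 12TB of data stored on a node, scaling in 2Gbps (2x1GbE) steps.

def bonded_gbps_needed(storage_tb, tb_per_2gbps=12):
    """Suggested bonded bandwidth (Gbps) for a node's storage volume."""
    steps = max(1, math.ceil(storage_tb / tb_per_2gbps))
    return 2 * steps

print(bonded_gbps_needed(12))  # 12TB -> 2Gbps (2x1GbE)
print(bonded_gbps_needed(24))  # 24TB -> 4Gbps (4x1GbE)
```

Past the 4×1Gbps point it is usually simpler to move to 10GbE than to keep adding bonded gigabit links.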
When calculating how much memory is needed, remember that Java itself uses up to 10% of memory for managing the virtual machine. We recommend configuring Hadoop to use strict heap sizes so that swapping between memory and disk is avoided. Swapping greatly reduces MapReduce job performance, and it can be avoided by configuring machines with more memory and by setting appropriate kernel parameters on most Linux distributions.
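As a rough illustration of budgeting heap so a worker never swaps (the ~10% JVM overhead figure comes from the paragraph above; the 4GB OS reserve is an assumption of this sketch, not a stated recommendation):

```python
# Hypothetical sketch: budget total Hadoop heap on a worker so that
# heap + JVM overhead (~10%, per the rule above) + an OS reserve
# never exceeds physical RAM, which would trigger swapping.

def usable_heap_gb(ram_gb, os_reserve_gb=4, jvm_overhead=0.10):
    """RAM that can safely be promised to Hadoop heaps on one node."""
    available = ram_gb - os_reserve_gb
    # heap * (1 + overhead) must fit within `available`:
    return available / (1 + jvm_overhead)

# On a 64GB node, roughly 54.5GB can safely be committed to heaps.
print(round(usable_heap_gb(64), 1))
```

The total heap across all map and reduce slots on the node should stay under this figure; anything beyond it risks the swapping the paragraph above warns about.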
It is also important to optimize memory channel width. For example, when using dual-channel memory, each machine should be configured with pairs of DIMMs. When using triple-channel memory, each machine should have DIMMs in groups of three. Similarly, quad-channel DIMMs should be grouped in sets of four.

Beyond MapReduce
Hadoop is far more than HDFS and MapReduce; it is an all-encompassing data platform. For that reason, CDH includes many different ecosystem products (and in fact is only rarely used solely for MapReduce). Additional software components to consider when sizing your cluster include Apache HBase, Cloudera Impala, and Cloudera Search. They should all run on DataNodes to maintain data locality.
HBase is a reliable, column-oriented data store that provides consistent, low-latency random reads and writes. Cloudera Search solves the need for full-text search over content stored in CDH, simplifying access for new types of users while opening the door to new types of data storage in Hadoop. Cloudera Search is based on Apache Lucene/Solr Cloud and Apache Tika, and extends valuable functionality and flexibility for search through its wider integration with CDH. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries against data stored in HDFS and HBase without requiring data movement or transformation.
Because of garbage collector (GC) timeouts, HBase users should pay attention to heap-size limits; other JVM-based column stores face the same issue. We therefore recommend a maximum of 16GB of heap per Region Server. HBase does not require many additional resources to run on top of Hadoop, but to maintain real-time SLAs you should use schedulers, such as the fair and capacity schedulers, in conjunction with Linux Cgroups.
Impala uses memory for most of its functionality, and by default will use up to 80% of available RAM, so we recommend at least 96GB of RAM per node. Users running Impala alongside MapReduce can consult our recommendations in "Impala and MapReduce for Multi-tenant Performance." It is also possible to specify a memory limit for Impala per process or per query.
Search is the most interesting component to size. The recommended sizing exercise is to purchase one node, install Solr and Lucene, and load your documents. Once the documents are indexed and searchable in the desired manner, scalability comes into play: keep loading documents until indexing and query latency exceed acceptable values for the project. At that point, you have a baseline for the maximum number of documents each node can handle given the available resources, and for the total number of nodes needed in the cluster, not counting any desired replication factor.

Conclusion

Purchasing the right hardware for a Hadoop cluster requires benchmarking and careful planning to fully understand the workload. However, Hadoop clusters are commonly evolving systems, and Cloudera recommends initially deploying hardware with the balanced specifications described above. It is important to remember that when running multiple system components, resource usage will vary, and a focus on resource management will be key to your success.
We encourage you to share your experiences configuring Hadoop production cluster servers in the comments!
Kevin O'Dell is a systems engineer at Cloudera.

English original: How-to: Select the Right Hardware for Your New Hadoop Cluster
Translation: http://www.oschina.net/translate/how-to-select-the-right-hardware-for-your-new-hadoop-cluster