When getting started with Apache Hadoop, one of the first challenges customers face is how to select the right hardware for their new Hadoop cluster.
Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as simple as handing over a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with I/O-intensive workloads will invest in more spindles per core.)
In this blog post, you will learn the principles of workload evaluation and the crucial role it plays in hardware selection. Along the way, you will also learn about the other factors a Hadoop administrator should consider in this process.
Combining storage and compute
Over the past decade, IT organizations have standardized on blade servers and storage area networks (SANs) to meet their network- and processing-intensive workloads. Although this model makes good sense for a number of standard applications, such as web servers, application servers, small structured databases, and simple data movement, infrastructure requirements have changed as data volumes and the number of users have grown. Web servers now have caching tiers, databases need local disks to support massive parallelism, and the volume of data being moved exceeds what can be handled locally.
Most teams start building their Hadoop clusters before they have figured out their actual workload requirements.
Hardware vendors have created innovative systems to address these needs, including storage blades, serial attached SCSI (SAS) switches, external SATA disk arrays, and larger-capacity rack units. However, Hadoop is based on a new approach to storing and processing complex data, one that minimizes data movement. Instead of relying on a SAN for massive storage and reliability and then moving data to a collection of blades for processing, Hadoop handles large data volumes and reliability in the software tier.
Hadoop distributes data across a cluster of balanced nodes and uses replication to ensure data availability and fault tolerance. Because data is distributed to nodes that also have compute power, processing can be sent directly to the nodes storing the data. And because every node in a Hadoop cluster both stores and processes data, each node must be configured to satisfy both the data storage and the processing requirements.
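To make the data-locality idea concrete, here is a minimal sketch using Hadoop's Java FileSystem API. It assumes a reachable HDFS cluster whose settings are on the classpath, and the file path `/data/sample.txt` is purely hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml/hdfs-site.xml from the classpath; assumes
        // fs.defaultFS points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used purely for illustration.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Replication factor: " + status.getReplication());

        // Each block reports the DataNodes holding a replica; MapReduce
        // uses this information to schedule tasks on (or near) those nodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

Running this against a replicated file shows each block reported on several DataNodes, which is exactly the information the scheduler uses to place map tasks near the data.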
Why does workload matter?
In nearly all cases, a MapReduce job will hit a bottleneck either while reading data from disk or the network (an I/O-bound job) or while processing data (a CPU-bound job). Sorting is an example of an I/O-bound job: it requires very little CPU work (just simple comparisons) but a large amount of reading from and writing to disk. Pattern classification is an example of a CPU-bound job: it processes data in complex ways to determine an ontology. (A minimal sort-job sketch follows the two lists below.)
The following are examples of I/O-bound workloads:
- Indexing
- Grouping
- Data import and export
- Data movement and transformation
The following are examples of CPU-bound workloads:
- Clustering/Classification
- Complex Text Mining
- Natural Language Processing
- Feature Extraction
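As an illustration of the I/O-bound case, here is a minimal sketch of a sort-style MapReduce job (class names and HDFS paths are hypothetical). The map and reduce functions do almost no computation; the job's runtime is dominated by disk and network I/O during the shuffle and sort:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SimpleSort {

    // Emits each input line as the key; the framework sorts keys during the shuffle.
    public static class LineMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }

    // Writes keys back out in sorted order; no real computation happens here.
    public static class LineReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "simple-sort");
        job.setJarByClass(SimpleSort.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Hypothetical HDFS paths; replace with real input/output locations.
        FileInputFormat.addInputPath(job, new Path("/benchmarks/sort/input"));
        FileOutputFormat.setOutputPath(job, new Path("/benchmarks/sort/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that with the default partitioner the output is totally ordered only when a single reducer is used; a fully parallel sort would add a TotalOrderPartitioner, but either way the job remains dominated by I/O rather than CPU.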
Cloudera's customers need to understand their workloads thoroughly in order to select the best Hadoop hardware, which can seem like a chicken-and-egg problem. Most teams build their Hadoop clusters before they have thoroughly analyzed their workloads, and the jobs an organization runs on Hadoop often change completely as its proficiency increases. In addition, some workloads may be bound in unexpected ways. For example, a workload that is I/O-bound in theory can end up CPU-bound because a different compression algorithm was chosen, or because a different implementation of an algorithm changed how the MapReduce job is constrained. For these reasons, when a team is unfamiliar with the types of jobs it will run, deep analysis is the most reasonable thing to do before building a balanced Hadoop cluster.
The next step is to run MapReduce benchmark jobs on the cluster and analyze how they are bound. The most straightforward way to do this is to add monitors at the right places in a running workload to detect bottlenecks. We recommend installing Cloudera Manager on the Hadoop cluster, which provides real-time statistics on CPU, disk, and network load. (Cloudera Manager is a component of Cloudera Standard and Enterprise editions; the Enterprise edition also supports rolling upgrades.) With Cloudera Manager installed, Hadoop administrators can run MapReduce jobs and watch the Cloudera Manager dashboard to monitor how each machine is performing.
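As a complement to dashboard monitoring, a job's own counters give a rough first indication of whether it is CPU- or I/O-bound. The sketch below uses standard Hadoop counters (TaskCounter.CPU_MILLISECONDS and the per-filesystem BYTES_READ/BYTES_WRITTEN counters); the `JobProfile` class and `report` method are hypothetical names introduced here for illustration:

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.FileSystemCounter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobProfile {
    /**
     * Prints a rough CPU-vs-I/O profile for a finished job.
     * High CPU time relative to bytes read and written suggests a CPU-bound
     * job; the reverse suggests an I/O-bound one.
     */
    public static void report(Job job) throws Exception {
        Counters counters = job.getCounters();

        long cpuMillis = counters.findCounter(TaskCounter.CPU_MILLISECONDS).getValue();
        long hdfsBytesRead = counters.findCounter("HDFS", FileSystemCounter.BYTES_READ).getValue();
        long hdfsBytesWritten = counters.findCounter("HDFS", FileSystemCounter.BYTES_WRITTEN).getValue();

        System.out.printf("Total task CPU time: %.1f s%n", cpuMillis / 1000.0);
        System.out.printf("HDFS bytes read:     %,d%n", hdfsBytesRead);
        System.out.printf("HDFS bytes written:  %,d%n", hdfsBytesWritten);
    }
}
```

Calling `JobProfile.report(job)` right after `job.waitForCompletion(true)` in the sort sketch above would show a large byte count with comparatively little CPU time. These counters are aggregated across all tasks, so cluster-wide disk and network utilization is still best read from the Cloudera Manager dashboard.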
The first step in choosing a machine configuration is to understand the hardware your operations team already manages.
In addition to building a cluster suited to the workload, we recommend that customers work with their hardware vendor to understand the power and cooling budget. Because Hadoop runs on tens, hundreds, or even thousands of nodes, an operations team can save a significant amount of money by choosing hardware with a good performance-per-watt ratio. Hardware vendors generally provide tools and recommendations for monitoring power consumption and cooling.