Performance comparison and tuning experience for Hadoop virtualization

Last Update:2015-03-17 Source: Internet

Author: User

Keywords Virtualization comparing


Virtualization has brought new vitality to Hadoop. From the perspective of IT production management, the benefits are as follows:

• Deploying Hadoop in a shared data center alongside other applications that consume different types of resources increases overall resource utilization;

• Flexible virtual machine operations let users dynamically create and expand their own Hadoop clusters from data center resources, or shrink existing clusters and release resources to other applications when needed;

• Integrating the HA and FT features of the virtualization architecture avoids the single points of failure found in traditional Hadoop clusters and, combined with the data reliability of Hadoop itself, provides a dependable foundation for big data applications in the enterprise.

For these reasons, vSphere Big Data Extensions (BDE) helps users flexibly deploy and manage Hadoop clusters in virtualized environments. Beyond these advantages, does virtualization hurt Hadoop's runtime performance? To answer this, we compared and tuned the performance of Hadoop clusters of the same scale in virtualized and physical deployments. The experiments show that a virtualized Hadoop cluster can support a production environment well.

Performance comparisons between virtualized and physical environments

Figure 1 shows the deployment used for the performance comparison test: only one virtual machine is deployed on each physical server, and the TaskTracker and DataNode run together on the same node. Because each virtual node can use all of the server's resources, this setup makes it easy to compare and analyze virtualized Hadoop against Hadoop deployed in a traditional physical environment. As shown in Figure 2, the performance of virtualized Hadoop is almost on par with that of the physical environment.

Figure 1: Deployment used for the performance comparison

Figure 2: Apache Hadoop 1.2 performance comparison of physical and virtualized deployments

Figure 3 shows the deployment topology we recommend for production environments, with multiple virtual nodes deployed on a single physical server. As shown in Figure 2, this deployment increases resource utilization and so achieves higher performance.

Figure 3: Deployment of multiple virtual machines

At the same time, we have built these experimental findings into the system configuration of Hadoop clusters deployed by vSphere BDE, hiding the complexity of performance tuning from users. Although different data center setups and cluster configurations lead to different performance, here are some general lessons for creating, configuring, and expanding a Hadoop cluster:

Tuning experience for Hadoop virtualization:

(1) Plan the initial scale: Cluster size is closely tied to the data center's infrastructure and configuration. When performance in the environment cannot be predicted at the outset, we advise users to start with a small cluster, such as 5 or 6 servers, deploy Hadoop, and run the standard Hadoop benchmarks to understand the characteristics of their data center. Then incrementally add resources such as servers and storage as needed.

(2) Select servers: We recommend CPUs of no less than 2 × quad-core with HT (Hyper-Threading) enabled, at least 4 GB of memory per compute core, and 6% of memory reserved for virtualization overhead. Hadoop performance is sensitive to I/O, so we recommend configuring each server with multiple local disks rather than a few large-capacity drives. Given the cost of task scheduling, we do not recommend more than 2 local disks per compute core. 10 Gigabit network adapters are recommended for high performance. Consider dual power supplies for the master node server (running the NameNode and JobTracker) to improve reliability.
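The sizing rules in item (2) can be turned into a quick sanity check. The following is a minimal sketch rather than a BDE feature: the `ServerSpec` class and the function names are hypothetical, "compute core" is assumed to mean a physical core, and the 4 GB per core rule is assumed to apply to total installed memory before the 6% reserve is subtracted.

```python
from dataclasses import dataclass

# Thresholds taken from item (2); the constant names are illustrative.
MIN_CORES = 8                 # 2 x quad-core
MIN_GB_PER_CORE = 4           # at least 4 GB of memory per core
VIRT_RESERVE_FRACTION = 0.06  # 6% of memory reserved for virtualization
MAX_DISKS_PER_CORE = 2        # no more than 2 local disks per core


@dataclass
class ServerSpec:
    cores: int        # physical compute cores (assumption: not HT threads)
    memory_gb: float  # total installed memory
    local_disks: int  # number of local data disks


def usable_memory_gb(spec):
    """Memory left for guests after the virtualization reserve."""
    return spec.memory_gb * (1 - VIRT_RESERVE_FRACTION)


def check_server(spec):
    """Return a list of warnings; an empty list means the spec fits the guidance."""
    warnings = []
    if spec.cores < MIN_CORES:
        warnings.append("fewer than 2 x quad-core")
    if spec.memory_gb < spec.cores * MIN_GB_PER_CORE:
        warnings.append("less than 4 GB of memory per core")
    if spec.local_disks < 2:
        warnings.append("only one local disk; multiple disks are recommended")
    if spec.local_disks > spec.cores * MAX_DISKS_PER_CORE:
        warnings.append("more than 2 local disks per core")
    return warnings


if __name__ == "__main__":
    spec = ServerSpec(cores=8, memory_gb=32, local_disks=8)
    print(check_server(spec))
    print(round(usable_memory_gb(spec), 2), "GB usable after the reserve")
```

If your environment counts hyper-threads as cores, or applies the memory rule after the reserve, adjust the constants and comparisons accordingly.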

(3) Virtualization configuration: Avoid configuring local storage as RAID, and create one datastore for each physical disk. For the virtualized network configuration, isolate the management network from the Hadoop cluster network, for both reliability and network transmission efficiency, as shown in Figure 4:

Figure 4: Virtualized network configuration

(4) System setup: BDE automatically configures virtual disk and operating system parameters derived from the experiments, hiding the specific details of performance optimization. For performance-sensitive users, we recommend replacing the default template with CentOS 6.*, because THP (Transparent Huge Pages) in the CentOS 6.* kernel and EPT (Extended Page Tables, on Intel processors) together help virtualized performance.

(5) Hadoop configuration: BDE automatically generates and configures the Hadoop configuration files (mainly mapred-site.xml, core-site.xml, and hdfs-site.xml), including block size, session management, and logging. However, some parameters related to MapReduce tasks, including mapred.reduce.parallel.copies, io.sort.mb, io.sort.factor, io.sort.record.percent, and tasktracker.http.threads, need to be set according to the workload.
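To make the workload-dependent parameters concrete, here is a minimal sketch that renders them as a Hadoop `<configuration>` fragment. The property names use the Hadoop 1.x spellings of the parameters listed above; the values are placeholders chosen only for illustration, not recommendations from this article, and the helper name `render_site_xml` is hypothetical. How such overrides are handed to BDE is outside the scope of the sketch.

```python
import xml.etree.ElementTree as ET

# Placeholder values for illustration only; they are NOT tuning advice
# and should be set according to the actual workload.
EXAMPLE_OVERRIDES = {
    "mapred.reduce.parallel.copies": "10",
    "io.sort.mb": "200",
    "io.sort.factor": "20",
    "io.sort.record.percent": "0.1",
    "tasktracker.http.threads": "50",
}


def render_site_xml(properties):
    """Build a <configuration> document from a name -> value mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(EXAMPLE_OVERRIDES))
```

Keeping the overrides in one mapping makes it easy to maintain a separate set per workload type and compare benchmark results between them.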

(6) Scale-out recommendations: If CPU utilization in the cluster often exceeds 80%, we recommend adding new nodes. In addition, we do not recommend more than 24 TB of capacity on a single storage node; otherwise, once the node fails, re-replicating its data is liable to cause congestion. Expansion can be planned based on the benchmarking experience and resource usage observed on the small cluster.
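The two scale-out thresholds above can be expressed as simple checks. This is a minimal sketch under stated assumptions: the function names are hypothetical, CPU utilization samples are given as fractions between 0 and 1, and "often" is interpreted (arbitrarily) as more than half of the samples exceeding the threshold.

```python
CPU_THRESHOLD = 0.80         # 80% CPU utilization, from the guidance above
MAX_NODE_CAPACITY_TB = 24.0  # per-node storage limit, from the guidance above


def cpu_often_high(samples, threshold=CPU_THRESHOLD, often_fraction=0.5):
    """True if more than `often_fraction` of the samples exceed `threshold`.

    The meaning of "often" is an assumption; adjust `often_fraction` to taste.
    """
    if not samples:
        return False
    high = sum(1 for s in samples if s > threshold)
    return high / len(samples) > often_fraction


def oversized_nodes(capacity_tb_by_node, limit_tb=MAX_NODE_CAPACITY_TB):
    """Return the names of storage nodes whose capacity exceeds the limit."""
    return sorted(
        name for name, tb in capacity_tb_by_node.items() if tb > limit_tb
    )


if __name__ == "__main__":
    if cpu_often_high([0.91, 0.85, 0.62, 0.88]):
        print("CPU is often above 80%: consider adding a node")
    print("Nodes over 24 TB:", oversized_nodes({"dn1": 24, "dn2": 36}))
```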

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
