Hadoop Virtualization: Performance Comparison and Tuning Experience

Source: Internet
Author: User

Virtualization brings unprecedented vitality to Hadoop. From the perspective of IT production management, the benefits are as follows:

· Deploying Hadoop in a shared data center alongside other applications that consume different types of resources can improve overall resource utilization;

· Flexible virtual machine operations let users dynamically create and expand their own Hadoop clusters according to available data center resources, or shrink the current cluster and release resources to support other applications when needed;

· Integrating with the HA (High Availability) and FT (Fault Tolerance) features of the virtualization platform avoids the single point of failure in a traditional Hadoop cluster. Combined with Hadoop's own data reliability, this gives enterprise big data applications a reliable foundation.

For these reasons, vSphere Big Data Extensions (BDE) provides users with effective support for flexible deployment and management of Hadoop clusters in a virtualized environment. Given these advantages, a natural question is whether virtualization hurts Hadoop performance. To answer it, we compared and tuned the performance of Hadoop clusters of the same scale deployed in virtualized and physical environments. The experiments show that a virtualized Hadoop cluster can support production workloads well.

Performance Comparison between a virtualized environment and a physical environment

Figure 1 shows the deployment used in the performance tuning tests. Only one virtual machine is deployed on each physical server, and the TaskTracker and DataNode run together on the same node. Because each virtual node can use all of the server's resources, this setup makes it easy to compare and analyze Hadoop performance in virtualized and traditional physical environments. The test results in Figure 2 show that the performance of virtualized Hadoop is almost the same as that of the physical environment.


Figure 1: Performance comparison deployment

Figure 2: Performance comparison between physical and virtual deployments of Apache Hadoop 1.2


Figure 3 shows the deployment topology recommended for production environments, in which multiple virtual nodes are deployed on one physical server. As Figure 2 shows, this deployment increases resource utilization and thereby improves performance.

Figure 3: Multi-virtual-machine deployment


We have also built these experimental findings into the system configuration of Hadoop clusters deployed with vSphere BDE, shielding users from the complexity of performance tuning. Although different data center settings and cluster configurations may lead to different performance, we share some general lessons below, in the order of creating, configuring, and expanding a Hadoop cluster:


Hadoop virtualization optimization experience:


1) Initial scale planning: Cluster performance is closely tied to the data center infrastructure and configuration. When the performance of the environment is still unknown, we recommend starting with a small cluster, for example five or six servers, deploying Hadoop on it, and running the standard Hadoop benchmarks to understand the characteristics of your data center. Then add servers, storage, and other resources as needed.


2) Server selection: We recommend at least two quad-core CPUs with Hyper-Threading (HT) enabled. Configure at least 4 GB of memory per CPU core and reserve 6% of memory so that the virtualization layer runs efficiently. Hadoop performance is very sensitive to I/O, so we recommend several local disks per server rather than one large disk. Considering the cost of task scheduling, we do not recommend more than two local disks per CPU core. A 10 GbE NIC is recommended for high performance. For the master node server, which runs the NameNode and JobTracker, configure dual power supplies to improve reliability.
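The sizing rules of thumb above can be turned into a quick check. The following is a minimal Python sketch, not part of BDE: the function name `check_server` and the example figures are hypothetical, and it assumes that the 4 GB per core applies to the memory left after the 6% reservation, which the text does not state explicitly.

```python
def check_server(cores, memory_gb, local_disks):
    """Check one server against the rules of thumb in this article.

    Returns a list of warning strings; an empty list means no rule was violated.
    """
    warnings = []

    # At least 2 x quad-core CPUs, i.e. 8 physical cores.
    if cores < 8:
        warnings.append("fewer than 8 cores (2 x quad-core)")

    # Assumption: reserve 6% for virtualization first, then require 4 GB per core.
    usable_gb = memory_gb * 94 / 100
    if usable_gb < 4 * cores:
        warnings.append("less than 4 GB of usable memory per core")

    # Several local disks rather than one large disk ...
    if local_disks < 2:
        warnings.append("only one local disk")
    # ... but no more than two local disks per core.
    if local_disks > 2 * cores:
        warnings.append("more than two local disks per core")

    return warnings


# Hypothetical example server: 8 cores, 64 GB of memory, 8 local disks.
print(check_server(cores=8, memory_gb=64, local_disks=8))
```

An empty list means the server passes all four checks; each warning names the rule that was violated.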


3) Virtualization configuration: Do not configure local storage as RAID; instead, create one datastore for each physical disk. In the virtual network configuration, isolate the management network from the Hadoop cluster network for reliability and network transmission efficiency, as shown in Figure 4:

Figure 4: Virtual network configuration


4) System settings: BDE automatically configures virtual disks and operating system parameters based on these experimental findings, shielding users from the details of performance tuning. For performance-sensitive users, we recommend replacing the default template with CentOS 6.x, because Transparent Huge Pages (THP) in the 6.x Linux kernel and Extended Page Tables (EPT, on Intel processors) help virtualization performance.
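To confirm that THP is active inside a guest, one can read the kernel's sysfs setting. The sketch below is a minimal Python example under stated assumptions: it looks at the upstream path `/sys/kernel/mm/transparent_hugepage/enabled` and at the `redhat_transparent_hugepage` variant used by RHEL/CentOS 6 kernels, and the helper names are hypothetical rather than anything BDE provides.

```python
import re
from pathlib import Path

# Candidate sysfs files; which one exists depends on the kernel build.
THP_PATHS = [
    Path("/sys/kernel/mm/transparent_hugepage/enabled"),
    Path("/sys/kernel/mm/redhat_transparent_hugepage/enabled"),
]


def parse_thp_mode(text):
    """Return the active mode, which the kernel marks with square brackets."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode():
    """Return the active THP mode, or None if no known sysfs file exists."""
    for path in THP_PATHS:
        if path.exists():
            return parse_thp_mode(path.read_text())
    return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```

A result of `always` means THP is applied to all eligible memory; `None` means neither file was found on the machine.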


5) Hadoop configuration: BDE automatically generates and configures the Hadoop configuration files (primarily mapred-site.xml, core-site.xml, and hdfs-site.xml), including the block size, session management, and logging. However, some MapReduce task parameters must be set according to the workload, including mapred.reduce.parallel.copies, io.sort.mb, io.sort.factor, io.sort.record.percent, and tasktracker.http.threads.
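The load-dependent parameters listed above are ordinary Hadoop 1.x properties that live in `mapred-site.xml`. As a minimal sketch, the Python snippet below renders such overrides as `<property>` elements. The values are placeholders chosen for illustration only, not results or recommendations from the tests described here, and the function name `render_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only; tune them for your own workload.
overrides = {
    "mapred.reduce.parallel.copies": 10,
    "io.sort.mb": 200,
    "io.sort.factor": 20,
    "io.sort.record.percent": 0.1,
    "tasktracker.http.threads": 60,
}

xml_text = render_site_xml(overrides)
print(xml_text)
```

Keeping the overrides in one dictionary per workload makes it easy to compare benchmark runs against the exact settings that produced them.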


6) Scaling recommendations: If CPU usage in the cluster often exceeds 80%, we recommend adding new nodes. In addition, the storage capacity of a single node should not exceed 24 TB; otherwise, when that node fails, re-replicating its data may congest the network. Expansion can be planned from the benchmark results and resource usage observed on the small-scale cluster.
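The two thresholds in this recommendation are easy to automate. The following Python sketch is illustrative only: the function names are hypothetical, and because the text does not define "often", the sketch assumes it means more than half of the collected samples.

```python
CPU_THRESHOLD = 0.80      # from the article: CPU usage above 80%
MAX_NODE_STORAGE_TB = 24  # from the article: per-node storage limit
OFTEN_FRACTION = 0.5      # assumption: "often" = more than half of the samples


def should_add_node(cpu_samples):
    """Return True if usage exceeds the threshold in more than OFTEN_FRACTION of samples."""
    if not cpu_samples:
        return False
    busy = sum(1 for usage in cpu_samples if usage > CPU_THRESHOLD)
    return busy / len(cpu_samples) > OFTEN_FRACTION


def oversized_nodes(storage_tb_by_node):
    """Return the sorted names of nodes whose storage exceeds the 24 TB limit."""
    return sorted(
        name for name, tb in storage_tb_by_node.items() if tb > MAX_NODE_STORAGE_TB
    )


# Hypothetical measurements: CPU usage as fractions, storage in TB.
print(should_add_node([0.90, 0.85, 0.50]))
print(oversized_nodes({"node1": 12, "node2": 32, "node3": 24}))
```

In practice the CPU samples would come from whatever monitoring system the data center already uses, collected over a window long enough to smooth out short spikes.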



If you have any questions, please send an email to bigdata_apac@vmware.com.



About vSphere Big Data Extensions:

VMware vSphere Big Data Extensions (BDE) supports big data and Hadoop jobs on the vSphere platform. Built on the open-source Serengeti project, BDE provides enterprise users with a set of integrated management tools. By virtualizing Hadoop on vSphere, it helps users deploy, run, and manage big data workloads on their own infrastructure flexibly, elastically, securely, and quickly.
