From the perspective of IT production management, virtualization has brought significant new momentum to Hadoop:
· Deploying Hadoop alongside other applications that consume different types of resources in a shared data center improves overall resource utilization;
· Flexible virtual machine operations allow users to dynamically create and expand their own Hadoop clusters based on available data center resources, and also to shrink the current cluster and release resources to support other applications when needed;
· By integrating with the HA (High Availability) and FT (Fault Tolerance) features provided by the virtualization architecture, the single points of failure of a traditional Hadoop cluster are avoided; together with Hadoop's own data reliability, this provides a solid guarantee for enterprise big data applications.
For these reasons, vSphere Big Data Extensions (BDE) gives users the flexibility to deploy and manage Hadoop clusters in a virtualized environment. Beyond these advantages, does virtualization hurt Hadoop performance? To answer this question, we compared and optimized the performance of Hadoop clusters of the same scale in virtualized and physical deployments. Experiments show that virtualized Hadoop clusters can support production environments well.
Performance comparison between virtualized and physical environments
Figure 1 shows the deployment used for the performance tuning tests: each physical server hosts only one virtual machine, with the TaskTracker and DataNode running together on the same node. Because each virtual node can use all of the server's resources, this setup makes it convenient to compare and analyze the performance of virtualized Hadoop against a traditional physical deployment. The results are shown in Figure 2: the performance of virtualized Hadoop is nearly on par with the physical environment.
Figure 1: Performance Comparison Deployment
Figure 2: Performance Comparison of Apache Hadoop 1.2 Physical and Virtualized Deployments
Figure 3 shows the deployment topology more commonly recommended for production environments, with multiple virtual nodes deployed on one physical server. As shown in Figure 2, this deployment increases resource utilization and delivers better performance.
Figure 3: Multi-VM deployment
At the same time, we have folded these experimental findings into the system configuration of Hadoop clusters deployed by vSphere BDE, hiding the complexity of performance optimization from the user. Although different data center setups and cluster configurations may yield different performance, here are some common practices for creating, configuring, and scaling Hadoop clusters:
Hadoop virtualization tuning experience:
(1) Start small: Cluster performance is closely tied to the data center infrastructure and its configuration. When the performance of an environment is not yet known, users are advised to start with a small cluster of 5 or 6 servers, deploy Hadoop, and run a standard Hadoop benchmark to understand the characteristics of their data center. Then gradually add resources such as servers and storage as needed.
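As a minimal sketch of that benchmarking step, the TeraSort suite bundled with Hadoop 1.x can be used to characterize a new cluster. The jar path, HDFS paths, and data size below are assumptions, not values from the article; adjust them to your installation:

```shell
#!/bin/sh
# Sketch: benchmark a small Hadoop 1.x cluster with the bundled TeraSort suite.
# EXAMPLES_JAR, the HDFS paths, and ROWS are assumptions; adjust to your setup.
EXAMPLES_JAR=/usr/lib/hadoop/hadoop-examples.jar
ROWS=10000000   # 10M rows x 100 bytes each = roughly 1 GB of test data

# Build the commands first so they can be reviewed before running.
gen_cmd="hadoop jar $EXAMPLES_JAR teragen $ROWS /benchmarks/terasort-input"
sort_cmd="hadoop jar $EXAMPLES_JAR terasort /benchmarks/terasort-input /benchmarks/terasort-output"

echo "$gen_cmd"
echo "$sort_cmd"
# On a real cluster, run them (uncomment):
# $gen_cmd && $sort_cmd
```

Repeating the same run after each hardware or configuration change gives a like-for-like baseline for the data center.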
(2) Server selection: CPUs should be no less than 2 × quad-core, with HT (Hyper-Threading) enabled; configure at least 4 GB of memory per compute core, and reserve 6% of memory for virtualization overhead. Hadoop performance is very sensitive to I/O, so it is recommended to configure multiple local disks per server rather than a single large-capacity disk. Considering task-scheduling overhead, it is not advisable to configure more than two local disks per compute core. For high performance, 10 GbE network cards are recommended. Consider dual power supplies for the master node server (running the NameNode and JobTracker) for increased reliability.
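The memory sizing rule can be sketched as a quick calculation. The 64 GB server below is a hypothetical example, not a figure from the article:

```shell
#!/bin/sh
# Sketch of the sizing rule: >= 4 GB RAM per physical core, reserve 6% of
# physical memory for the hypervisor. The 64 GB server is hypothetical.
CORES=8             # 2 x quad-core (Hyper-Threading doubles threads, not cores)
GB_PER_CORE=4
PHYS_RAM_GB=64      # hypothetical server

min_ram_gb=$((CORES * GB_PER_CORE))      # minimum RAM the rule calls for
vm_ram_gb=$((PHYS_RAM_GB * 94 / 100))    # usable by VMs after the 6% reserve

echo "minimum RAM: ${min_ram_gb} GB, usable by VMs: ${vm_ram_gb} GB"
```

So a 64 GB machine comfortably meets the 32 GB minimum and leaves about 60 GB for the Hadoop virtual machines.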
(3) Virtualization configuration: Avoid RAID on local storage where possible, and create a separate datastore for each physical disk. For reliability and network transmission efficiency, isolate the management network from the Hadoop cluster network in the virtual network configuration, as shown in Figure 4:
Figure 4: Virtualized Network Configuration
(4) System settings: BDE automatically configures virtual disk and operating system parameters derived from our experiments, shielding users from the details of performance tuning. Performance-sensitive users are advised to replace the default template with CentOS 6.*, because THP (Transparent Huge Pages) and EPT (Extended Page Tables) in the 6.* series kernel work together to improve virtualization performance.
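As a small sketch, one can verify whether THP is actually enabled on a node before benchmarking. The sysfs paths below are the mainline and Red Hat kernel locations; which one exists depends on the guest kernel:

```shell
#!/bin/sh
# Sketch: report Transparent Huge Pages status on the guest OS.
thp_status() {
    # Red Hat 6 kernels expose THP under redhat_transparent_hugepage;
    # mainline kernels use transparent_hugepage.
    for f in /sys/kernel/mm/transparent_hugepage/enabled \
             /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
        if [ -r "$f" ]; then
            cat "$f"   # the bracketed word is the active mode, e.g. [always]
            return 0
        fi
    done
    echo "THP interface not found (kernel may not support it)"
}

thp_status
```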
(5) Hadoop configuration: BDE automatically generates and configures the Hadoop configuration files (mainly mapred-site.xml, core-site.xml, and hdfs-site.xml), covering block size, session management, and logging. However, some parameters related to MapReduce tasks, including mapred.reduce.parallel.copies, io.sort.mb, io.sort.factor, io.sort.record.percent, and tasktracker.http.threads, need to be set according to the specific workload.
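A minimal sketch of setting those workload-dependent parameters in mapred-site.xml follows; the values are illustrative starting points, not tuned recommendations from the article:

```shell
#!/bin/sh
# Sketch: write the workload-dependent MapReduce parameters into a
# mapred-site.xml fragment. The values are illustrative, not recommendations.
cat > mapred-site-snippet.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>10</value>   <!-- parallel fetches per reduce during shuffle -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>  <!-- map-side sort buffer, in MB -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>50</value>   <!-- streams merged at once when sorting spills -->
  </property>
  <property>
    <name>io.sort.record.percent</name>
    <value>0.15</value> <!-- share of io.sort.mb kept for record metadata -->
  </property>
  <property>
    <name>tasktracker.http.threads</name>
    <value>60</value>   <!-- worker threads serving map output to reducers -->
  </property>
</configuration>
EOF
echo "wrote mapred-site-snippet.xml"
```

In practice these values are iterated on: rerun the benchmark after each change and keep the setting that improves the workload.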
(6) Scaling suggestions: If CPU utilization in the cluster frequently exceeds 80%, consider adding new nodes. In addition, a single storage node should not exceed 24 TB of capacity; otherwise, if the node fails, re-replicating its data can easily cause congestion. Scaling decisions can be based on the performance benchmarks and resource usage observed on the small cluster.
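The two thresholds above can be sketched as a trivial check; collecting the metrics themselves is out of scope here, and the readings fed in below are hypothetical:

```shell
#!/bin/sh
# Sketch: apply the two scaling thresholds to observed metrics.
CPU_THRESHOLD=80     # % utilization above which adding nodes is suggested
NODE_CAP_MAX_TB=24   # per-node storage cap to keep re-replication manageable

check_scaling() {
    cpu_pct=$1; node_tb=$2
    advice="ok"
    [ "$cpu_pct" -gt "$CPU_THRESHOLD" ] && advice="add nodes (CPU ${cpu_pct}% > ${CPU_THRESHOLD}%)"
    [ "$node_tb" -gt "$NODE_CAP_MAX_TB" ] && advice="$advice; spread storage (node ${node_tb}TB > ${NODE_CAP_MAX_TB}TB)"
    echo "$advice"
}

# Hypothetical readings from monitoring:
check_scaling 85 20
check_scaling 60 30
```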
If you have any questions, you can email bigdata_apac@vmware.com.
About vSphere Big Data Extensions:
VMware vSphere Big Data Extensions (BDE) supports big data and Hadoop workloads on the vSphere platform. Based on the open source Serengeti project, BDE provides enterprise users with a suite of integrated management tools that, by virtualizing Hadoop on vSphere, help them deploy, operate, and manage big data on their infrastructure in a way that is flexible, elastic, secure, and fast. To learn more about VMware vSphere Big Data Extensions, see http://www.vmware.com/hadoop.
About the Author
Li Xinhui
VMware Software Senior Engineer
He is now a senior engineer on the VMware Big Data team, dedicated to service-oriented, efficient big data in cloud computing centers, and working on distributed system performance optimization. Li Xinhui graduated from the Institute of Computing Technology, Chinese Academy of Sciences, and then joined IBM's lab for distributed computing, working mainly in cloud computing and parallel data processing and providing industrial solutions for the monitoring, operation, and maintenance of large-scale data centers. He holds nine patents registered in the United States and China and has published five papers at internationally renowned conferences and in academic journals.
Original link: http://vbigdata.blog.51cto.com/7526470/1298757