Why is physical hardware preferred over virtualization for deploying a Hadoop cluster?

Source: Internet
Author: User

Blade Server, SAN, virtualization Technology

Progress tends to move in a spiral in every field, and large-scale data storage and processing is no exception.

In the past, when managers needed more performance, they bought higher-specification servers, a practice known as "scaling up". Later, once it became clear that vertical scaling brought higher overhead, people began solving the problem by buying more servers rather than higher-end ones, a practice called "scaling out". Today's data centers work this way. Because rack space is a significant constraint, compact form factors such as 1U and 2U servers and blade servers were developed so that more servers fit in a single rack. It then turned out that many of those servers were poorly utilized, so virtualization was developed to make fuller use of each server's resources.

Storage followed the same path. First, single-disk storage chased higher capacity and higher I/O. When that hit its limits, several disks inside one machine were combined into RAID for small-scale expansion. Later, standalone SAN and NAS systems with large numbers of disks provided large-scale expansion.

Hadoop was also built on the scale-out concept. So does Hadoop, like traditional services, use today's popular blade servers, SANs, and virtualization to improve performance? Not really.

First, consider how Hadoop operates. Hadoop is very sensitive to I/O performance and performs a large number of I/O operations while running. Hadoop does not need RAID; in fact, using RAID may degrade worker-node performance. Hadoop prefers disks in JBOD form (Just a Bunch Of Disks): put plainly, the disks need no special treatment and are simply mounted by the operating system. The advantage is that each disk handles its own I/O independently, with no coordination between disks; whichever disk has its data ready sends it straight from the buffer to the process that requested it. The trouble with RAID is that the disks must coordinate, so the slowest disk often determines the performance of the whole array. More seriously, all disks go through a single RAID controller chip, and that chip often becomes the bottleneck for disk performance.
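The difference can be illustrated with a deliberately simplified model. The sketch below assumes that independent JBOD disks add their throughput together, that a striped array runs every member at the pace of its slowest disk, and that the RAID controller imposes a fixed ceiling. The function names and all the numbers are hypothetical and chosen only for illustration; real arrays behave in more complicated ways.

```python
def jbod_throughput(disk_mb_s):
    """Independent disks: each serves its own stream, so the rates add up."""
    return sum(disk_mb_s)


def striped_raid_throughput(disk_mb_s, controller_cap_mb_s):
    """Striped array (toy model): every stripe waits for the slowest member,
    and all traffic passes through one controller with a fixed ceiling."""
    lockstep = len(disk_mb_s) * min(disk_mb_s)
    return min(lockstep, controller_cap_mb_s)


# Hypothetical per-disk sequential rates in MB/s; the last disk is degraded.
disks = [100, 100, 100, 60]

print(jbod_throughput(disks))               # 360
print(striped_raid_throughput(disks, 300))  # 240, held back by the slow disk
print(striped_raid_throughput(disks, 200))  # 200, held back by the controller
```

In this model a single slow disk costs the JBOD node only that disk's own shortfall, while the striped array loses throughput on every member.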

The same problem shows up when Hadoop meets virtualization. Each virtualized system is unaware of the other systems sharing the host, so none of them can optimize disk I/O scheduling as a whole. Unlike flash, a physical disk is fast at sequential reads and writes but slow to seek. Virtualization prevents the operating system from optimizing the scheduling of seek operations, which produces a large number of seeks, fails to take full advantage of the disk's sequential read/write capability, and lowers performance. (This is one reason Hadoop blocks are so large: data is written to and read from disk sequentially, which significantly improves I/O performance.)
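A rough calculation shows why seeks matter so much. The sketch below assumes that every block read costs one seek plus the sequential transfer time; the 10 ms seek time, the 100 MB/s transfer rate, and the block sizes are illustrative assumptions, not measurements.

```python
def effective_throughput(block_mb, seek_ms=10.0, transfer_mb_s=100.0):
    """MB/s achieved when each block read costs one seek plus the transfer.

    seek_ms and transfer_mb_s are assumed figures for a spinning disk.
    """
    seek_s = seek_ms / 1000.0
    transfer_s = block_mb / transfer_mb_s
    return block_mb / (seek_s + transfer_s)


# Compare small, seek-dominated reads with large sequential blocks.
for block_mb in (0.064, 1, 64, 128):
    rate = effective_throughput(block_mb)
    print(f"{block_mb:>7} MB block: {rate:6.1f} MB/s")
```

Under these assumptions a 1 MB read reaches only half the disk's sequential rate, while a block of 64 MB or more comes close to the full rate. When several virtual machines interleave their requests on one disk, more reads end up in the seek-dominated regime.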

The problem with blade servers is that their internal space is very limited. As noted above, Hadoop needs a lot of I/O, so the more independent disks a node has, the better its I/O performance. This is why, in the configuration proposed in the previous article, a larger number of 1 TB disks is preferred over a smaller number of 3 TB disks. The space constraints inside a blade server limit how many disks can be added.

From this we can better understand why Hadoop is said to run on standalone commodity servers, and why it deliberately adopts a shared-nothing architecture. With independent tasks and independent I/O, Hadoop performs better; in other words, the shared-nothing architecture makes fuller use of the hardware resources.

Reference 1: http://twang2218.github.io/readings/hadoop-operations/hadoop-operations-notes-ch04.html

Reference 2: http://www.ha97.com/5673.html


This article is from the "Divinity New Space" blog. Please keep this source attribution: http://abool.blog.51cto.com/8355508/1883429
