Virtualization technology: A deep dive into the Hadoop disk deployment algorithm


There are different types of nodes in a Hadoop cluster, and their disk requirements differ. The master node focuses on storage reliability, while data nodes require better read/write performance and larger capacity.

In a virtual cluster, storage (datastores) can be divided into two types: local and shared. Local storage can only be accessed by virtual machines on the host where it resides, while shared storage is also accessible to virtual machines on other hosts. Local storage offers better read and write performance; shared storage is more reliable.
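To make the distinction concrete, here is a minimal Python sketch; the Datastore fields and the accessible_from helper are illustrative assumptions for this article, not Serengeti's actual data model.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Datastore:
    # Simplified, illustrative view of a vSphere datastore.
    name: str
    shared: bool            # True for SAN/NAS-backed shared storage
    host: Optional[str]     # owning ESXi host for local storage, None if shared
    free_gb: int

def accessible_from(ds: Datastore, vm_host: str) -> bool:
    # A VM can reach a shared datastore from any host, but a local
    # datastore only from the host that owns it.
    return ds.shared or ds.host == vm_host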

Based on user input, the disk deployment algorithm chooses the optimal storage scheme for each type of Hadoop node.

First, look at the master node in the Hadoop cluster. Shared storage is necessary because the master node requires higher reliability and is typically configured with the vSphere High Availability (HA) and Fault Tolerance (FT) features. The following fragment of a JSON-formatted configuration file shows how to specify the storage for the master node group.

 1 {
 2   "nodeGroups": [
 3     {
 4       "name": "master",
 5       "roles": [
 6         "hadoop_namenode",
 7         "hadoop_jobtracker"
 8       ],
 9       "instanceNum": 1,
10       "instanceType": "LARGE",
11       "cpuNum": 2,
12       "memCapacityMB": 4096,
13       "storage": {
14         "type": "SHARED",
15         "sizeGB": 20
16       },
17       "haFlag": "on",
18       "rpNames": [
19         "rp1"
20       ]
21     },


Lines 13 to 16 configure the storage, specifying the shared type (line 14, "type": "SHARED") with a size of 20 GB. Line 17, "haFlag": "on", specifies that the vSphere High Availability (HA) feature should be used. When allocating disks, Serengeti assigns shared storage to the master node.


Now let's look at the data nodes in the Hadoop cluster. These nodes perform a large amount of disk reads and writes, and depending on the type of storage the user specifies, the system uses a different disk deployment algorithm.

When the user specifies shared storage, the underlying storage is SAN or NAS, which is already composed of multiple physical disks whose scheduling is hidden behind the storage controller. The optimal disk deployment is therefore to allocate the required space on a single shared datastore. If one shared datastore does not have enough space, the system places the remaining storage on the next available shared datastore, and so on, until all of the required space has been allocated.
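The following is a minimal Python sketch of this greedy spill-over allocation, assuming hypothetical datastore names and a simple (name, free space) list; it illustrates the idea rather than Serengeti's actual implementation.

from typing import List, Tuple

def place_on_shared(required_gb: int,
                    shared_free_gb: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    # Greedy placement on shared storage: fill one shared datastore,
    # then spill the remainder onto the next, until the requested
    # capacity is satisfied. Returns (datastore, allocated GB) pairs.
    placement = []
    remaining = required_gb
    for name, free_gb in shared_free_gb:
        if remaining <= 0:
            break
        take = min(remaining, free_gb)
        if take > 0:
            placement.append((name, take))
            remaining -= take
    if remaining > 0:
        raise RuntimeError("not enough free space on the shared datastores")
    return placement

# Example: a data node asking for 100 GB across two shared datastores
print(place_on_shared(100, [("shared-ds1", 80), ("shared-ds2", 200)]))
# -> [('shared-ds1', 80), ('shared-ds2', 20)]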

The other scenario is user-specified local storage. For local storage, we recommend that users define each physical disk as a separate datastore (one datastore per physical disk). In this way, Hadoop's overall disk throughput can be increased by (1) using as much local storage as possible and (2) exposing more of the physical disk topology.
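As a rough sketch of this case, the snippet below spreads a node's requested capacity evenly across the host's per-disk local datastores, so the resulting virtual disks land on different spindles and can be read and written in parallel. The datastore names and the even-split policy are illustrative assumptions, not Serengeti's exact algorithm.

from typing import Dict, List, Tuple

def place_on_local(required_gb: int,
                   local_free_gb: Dict[str, int]) -> List[Tuple[str, int]]:
    # Spread the requested capacity evenly across the host's local
    # datastores (one datastore per physical disk); any remainder
    # goes to the first datastores in sorted order.
    names = sorted(local_free_gb)
    if not names:
        raise RuntimeError("no local datastores on this host")
    per_disk = required_gb // len(names)
    remainder = required_gb % len(names)
    placement = []
    for i, name in enumerate(names):
        want = per_disk + (1 if i < remainder else 0)
        if want > local_free_gb[name]:
            raise RuntimeError(f"{name} does not have {want} GB free")
        if want > 0:
            placement.append((name, want))
    return placement

# Example: 90 GB spread across three single-disk local datastores
print(place_on_local(90, {"local-ds1": 500, "local-ds2": 500, "local-ds3": 500}))
# -> [('local-ds1', 30), ('local-ds2', 30), ('local-ds3', 30)]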
