There are different types of nodes in a Hadoop cluster, each with different disk requirements. The master node emphasizes storage reliability, while data nodes require better read/write performance and larger capacity.
In a virtualized cluster, storage (datastores) falls into two types: local and shared. Local storage can be accessed only by virtual machines on the host where it resides, while shared storage is also accessible to virtual machines on other hosts. Local storage offers better read/write performance; shared storage is more reliable.
The disk deployment algorithm chooses the optimal storage scheme for each type of Hadoop node based on user input.
First, consider the master node in the Hadoop cluster. Shared storage is necessary here because the master node requires higher reliability and is typically protected with the vSphere High Availability (HA) and Fault Tolerance (FT) features. The following fragment of a JSON-formatted configuration file shows how to specify storage for the master node group.
 1 {
 2   "nodeGroups": [
 3     {
 4       "name": "master",
 5       "roles": [
 6         "hadoop_namenode",
 7         "hadoop_jobtracker"
 8       ],
 9       "instanceNum": 1,
10       "instanceType": "LARGE",
11       "cpuNum": 2,
12       "memCapacityMB": 4096,
13       "storage": {
14         "type": "SHARED",
15         "sizeGB": 20
16       },
17       "haFlag": "on",
18       "rpNames": [
19         "rp1"
20       ]
21     },
The storage configuration starts at line 13, specifying shared storage ("type": "SHARED" on line 14) with a size of 20 GB. Line 17, "haFlag": "on", enables the vSphere High Availability (HA) feature. When allocating disks, Serengeti assigns shared storage to the master node.
Now consider the data nodes in the Hadoop cluster. These nodes perform a large volume of disk reads and writes, and the system uses a different disk deployment algorithm depending on the type of storage the user specifies.
When the user specifies shared storage, the underlying storage is SAN or NAS, which is already backed by multiple physical disks whose scheduling is hidden behind the storage controller. The optimal deployment algorithm is therefore to allocate the required space on a single shared datastore. If one shared datastore does not have enough free space, the system places the remaining space on the next available shared datastore, and so on until all required storage has been allocated.
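The shared-storage placement described above can be sketched as a simple greedy fill that spills the remainder onto the next datastore. This is an illustrative sketch only; the function and datastore names are hypothetical, not the actual Serengeti implementation.

```python
def place_on_shared(required_gb, shared_datastores):
    """Greedily fill the requested capacity on shared datastores.

    shared_datastores: list of (name, free_gb) pairs.
    Returns a list of (name, allocated_gb) pairs.
    """
    placement = []
    remaining = required_gb
    for name, free_gb in shared_datastores:
        if remaining <= 0:
            break
        chunk = min(remaining, free_gb)  # take what fits on this datastore
        if chunk > 0:
            placement.append((name, chunk))
            remaining -= chunk
    if remaining > 0:
        raise RuntimeError("not enough shared storage for this node")
    return placement

# 50 GB requested: 30 GB fits on the first datastore, the rest spills over.
print(place_on_shared(50, [("ds-shared-1", 30), ("ds-shared-2", 40)]))
# → [('ds-shared-1', 30), ('ds-shared-2', 20)]
```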
The other scenario is user-specified local storage. For local storage, we recommend that users expose each physical disk as a separate datastore (one datastore per physical disk). In this way, the system can increase Hadoop's overall disk throughput by (1) using as much local storage as possible and (2) providing more physical disk topology information.
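With one datastore per physical disk, the benefit comes from spreading a node's data disks evenly across the local datastores so that I/O is parallelized across spindles. A minimal sketch of such even spreading, with hypothetical names and not the actual Serengeti code:

```python
def spread_across_local(required_gb, local_datastores):
    """Split the requested capacity evenly across local datastores.

    Each local datastore corresponds to one physical disk, so each
    resulting VM data disk lands on a distinct spindle and reads/writes
    can proceed in parallel.
    """
    n = len(local_datastores)
    per_disk, extra = divmod(required_gb, n)
    placement = []
    for i, name in enumerate(local_datastores):
        # the first `extra` disks get one extra GB so sizes sum exactly
        size = per_disk + (1 if i < extra else 0)
        placement.append((name, size))
    return placement

print(spread_across_local(100, ["local-disk-1", "local-disk-2", "local-disk-3"]))
# → [('local-disk-1', 34), ('local-disk-2', 33), ('local-disk-3', 33)]
```

Because each physical disk is visible as its own datastore, the placement itself carries disk topology information that Hadoop can exploit when scheduling I/O.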