Using Hadoop to drive large-scale data analysis does not necessarily rule out the good, old storage array; in some scenarios, shared enterprise storage can be the better choice over distributed local disks.
Hadoop's original architecture was designed to use relatively inexpensive commodity servers and their local storage in a scale-out manner. Its goal was to cost-effectively exploit data that, in the past, simply could not be processed economically. We have all heard the terms used to describe these previously unmanageable data sets: data volume, data variety, data velocity, and so on. Given how broad that definition is, most companies can point to some sort of large-scale data they plan to exploit.
With large-scale data expanding daily, storage vendors have started joining the big-data party with their relatively expensive SAN and network-attached storage (NAS) systems. They cannot afford to leave all of that data to server vendors and their chassis full of commodity disk drives. Although Hadoop adoption is still in its early stages, the competitive marketing is already noisy and chaotic.
High-level Hadoop and HDFS
In Hadoop's scale-out design, each physical node in the cluster hosts both local compute and a share of the data; the design targets applications that repeatedly traverse very large data sets, such as search. Much of Hadoop's value lies in how effectively it executes parallel algorithms against the blocks distributed across a scale-out cluster.
Hadoop consists of a compute engine based on MapReduce and a data service called the Hadoop Distributed File System (HDFS). Hadoop exploits data "locality": HDFS spreads a large data set across many nodes, and parallel compute tasks are dispatched to the nodes that hold the data (the MapReduce "map" step); subsequent shuffle and collation steps then produce the results (the "reduce" step).
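To make the map and reduce steps concrete, the sketch below is the classic word-count job written against Hadoop's Java MapReduce API; the class names and the job name are illustrative only.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map step: runs in parallel on the nodes holding the data blocks,
      // emitting a (word, 1) pair for every token it sees.
      public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce step: after the shuffle gathers all counts for a given word,
      // sum them to produce the final tally.
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Each mapper runs against the blocks stored on its own node and emits (word, 1) pairs; after the shuffle, each reducer sums the counts for the words assigned to it.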
Normally, each HDFS data node is assigned its own direct-attached (DAS) disks. HDFS then replicates the data across the data nodes, usually keeping two or three copies of each block. Replicas are placed on different server nodes, and one of them is placed on a node in a different rack so that a rack-level failure does not cause data loss.
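Both the replication factor and the rack awareness are ordinary Hadoop configuration. A minimal sketch, with a placeholder script path, might look like this:

    <!-- hdfs-site.xml: keep three copies of every block (the usual default) -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    <!-- core-site.xml: a topology script tells Hadoop which rack each data node
         sits in, so replicas can be spread across racks -->
    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>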
Clearly, replication consumes more raw capacity than RAID, but it also has advantages, such as avoiding lengthy RAID rebuild windows.
Why use enterprise storage?
So if HDFS handles even the largest data sets with its MapReduce-friendly approach, using relatively inexpensive local disks and providing built-in "fabric-aware" replication, why consider enterprise storage at all? For one thing, the HDFS metadata server node remains a potential point of vulnerability. Although each Hadoop release has improved HDFS reliability, whether to put the HDFS metadata server on more reliable RAID-based storage is still debated.
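One common and simple mitigation, shown here purely as an illustration with placeholder paths, is to point the metadata server (the NameNode) at more than one directory, with at least one of them on RAID-protected or externally mounted storage:

    <!-- hdfs-site.xml: the NameNode writes its fsimage and edit log to every
         directory listed here, so one copy can live on a local disk and another
         on a RAID array or NFS mount -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///data/1/dfs/nn,file:///mnt/raid_array/dfs/nn</value>
    </property>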
There are many IT reasons to use external shared storage for large amounts of data. First, although Hadoop scales horizontally to handle multiple petabytes, most large-scale data sets are likely to fall in the 10 TB to 50 TB range. Traditional data sets of a few terabytes are hardly a processing challenge at all, yet they sit squarely within the cost-effective range of scale-out SAN and NAS solutions. Those shared data sets are often integral to a company's existing business processes and can be controlled, managed, and integrated at the enterprise level more efficiently than in HDFS.
Despite the security-minded components in the Hadoop ecosystem (such as Sentry and Accumulo), data security and data protection are another major reason to consider external storage. Backing up, securing, or auditing native HDFS is not easy, while NAS and SAN systems have excellent built-in data protection and snapshots.
With external enterprise storage, a highly available Hadoop application, which is becoming more common as Hadoop evolves toward real-time query and streaming analytics, may never even notice that a disk failure has occurred.
Building Hadoop on external storage lets you not only manage storage separately but also take advantage of independent "vectors of growth": storage and compute can each be expanded without adding the other. There is also a cost advantage, because an enterprise RAID solution uses less raw disk capacity than Hadoop's full replicas of every block.
Sharing is where external storage really wins, because moving large amounts of data into and out of a Hadoop cluster is a genuine challenge. With external storage, multiple applications and users can reach the same "master" data set through their usual block or file clients, and can even update and write data while Hadoop applications are using it.
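As one illustration of that sharing (the jar, class, and paths below are placeholders, and it assumes the NAS export is mounted at the same path on every node), a Hadoop job can read the master copy in place through a file:// URI instead of first copying it into HDFS:

    hadoop jar analytics.jar com.example.LogSummary \
        file:///mnt/shared_nas/weblogs \
        hdfs:///user/analytics/weblog_summary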
Virtualize Hadoop
External storage also has advantages in virtualized Hadoop scenarios, which we expect to become a common way of deploying Hadoop in the enterprise. Deploying Hadoop scale-out nodes as virtual machines allows on-demand provisioning and makes it easy to grow or shrink a cluster.
Multiple virtual Hadoop nodes can be hosted on each hypervisor, and more or fewer resources can easily be allocated to a given application. Hypervisor-level high availability (HA) and fault tolerance can be applied to production-grade Hadoop applications. Performance is a concern, but more resources can be applied dynamically wherever they are needed, delivering adequate, if not superior, performance for at least some Hadoop applications.
Virtual storage of large-scale data
One compelling reason to stay with the physical Hadoop architecture is to avoid a costly SAN, especially as data sets grow larger. In a virtual environment, however, it may be more appropriate to consider external storage. One reason is that spinning up a compute-only virtual Hadoop cluster is fairly straightforward, while distributing a large data set to it can still be a challenge. By hosting the data on external shared storage, provisioning virtual Hadoop becomes almost trivial, and hypervisor capabilities such as DRS and HA can be leveraged.
Because a single large data set can easily be shared "in place" among multiple virtualized Hadoop clusters, the same storage can serve multiple clients. Hadoop as an enterprise-grade, production-level application becomes more manageable and easier to support: multiple copies of the data set are eliminated, data migration is reduced, and availability and data protection improve. The TCO of hosting virtualized Hadoop on inexpensive virtual servers backed by more expensive storage can still be lower than insisting on a dedicated physical cluster of commodity servers.
How you use it is the key
External storage is more expensive than the default DAS option, but other factors help balance the books for the data stored on it. Decisions about using external storage must be based on TCO, taking into account both how data sets arrive and the end-to-end workflow. Other workloads may be able to share a single repository of data effectively, and existing assets and skills can be leveraged. On the other hand, high-end storage may be limited in ingest rate, performance, capacity, or scalability.