Big data applications: Does local server storage trump the SAN?

Source: Internet
Author: User

Disk storage is like closet space: there is never enough of it, especially in the age of big data. "Big data" simply means more data than a traditional storage platform can handle, and choosing a storage service for it is not entirely straightforward.

What is big data?

First, we need to be clear about how big data differs from other kinds of data, and about the technology associated with it (primarily analytics applications). Big data itself means too much data to be handled with standard storage techniques. It may consist of terabytes (or even petabytes) of information, including structured data (databases, logs, SQL, and so on) and unstructured data (social media posts, sensor readings, multimedia, and so on). In addition, most of this data lacks indexes or any other organizational structure, and it may consist of many different file types.

Because this data lacks consistency, standard processing and storage technologies are of little use against it, and the operational overhead and sheer volume make it difficult to process effectively with the traditional server-and-SAN approach. In other words, big data needs a different approach: its own platform. This is where Hadoop comes in.

Hadoop is an open source distributed computing platform that consists of standardized hardware (servers and internal server storage) assembled into a cluster able to handle big data requests in parallel. On the storage side, the key component of this open source project is the Hadoop Distributed File System (HDFS), which can store very large files across multiple members of a cluster. HDFS provides convenient, reliable, and fast access by creating multiple copies of each data block and distributing them across the nodes of the cluster.
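
To make the replication mechanism concrete, here is a minimal sketch using the standard Hadoop Java client. The NameNode URI, file path, and replication factor are illustrative assumptions, not details from the article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; substitute your own cluster's URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Write a small file into HDFS.
        Path file = new Path("/data/sample.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello big data".getBytes(StandardCharsets.UTF_8));
        }

        // Ask HDFS to keep three replicas of every block of this file; the
        // NameNode spreads the copies across different DataNodes in the cluster.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```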

For now, the easiest way to build a big data storage platform is to buy a set of servers, each with several terabytes of drives, and let Hadoop do the rest. For some smaller companies, it really may be that simple. However, once processing performance, algorithmic complexity, and data mining enter the picture, this approach alone will not necessarily guarantee success.
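
As a rough sizing sketch (all figures below are hypothetical), remember that HDFS replication divides raw capacity before you store a single byte of unique data:

```java
public class ClusterCapacitySketch {
    public static void main(String[] args) {
        // Hypothetical cluster: 10 servers, 12 TB of raw disk each.
        int servers = 10;
        double rawTbPerServer = 12.0;
        int replicationFactor = 3;       // HDFS default
        double tempSpaceFraction = 0.25; // assumed scratch space for job intermediates

        double rawTb = servers * rawTbPerServer;
        double afterReplication = rawTb / replicationFactor;
        double usableTb = afterReplication * (1.0 - tempSpaceFraction);

        System.out.printf("Raw: %.0f TB, after 3x replication: %.0f TB, usable: %.0f TB%n",
                rawTb, afterReplication, usableTb);
    }
}
```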

Your storage architecture

All of this boils down to the storage architecture and the network performance involved. Organizations that analyze big data frequently may require a separate infrastructure, because bandwidth demands increase as the number of compute nodes in the cluster grows. In general, an HDFS cluster with many nodes generates a large amount of network traffic when processing larger data sets, because Hadoop moves data (and compute work) among the cluster's member servers.
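
One way to see why these traffic patterns matter is to ask HDFS where the replicas of a file actually live; schedulers use the same information to place work next to the data instead of copying it across the network. A minimal sketch, again assuming a hypothetical hdfs://namenode:9000 cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class BlockLocalitySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode URI and file path.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:9000"), new Configuration());

        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));

        // Each block reports which DataNodes hold a replica; running a task on
        // one of those hosts avoids pulling the block over the network.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset %d: hosts %s%n",
                    block.getOffset(), String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```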

In most cases, server-based local storage is written off as inefficient, which is why many companies have moved to high-speed Fibre Channel SANs to maximize throughput. However, the SAN approach is not necessarily suited to big data deployments, particularly those that use Hadoop: a SAN centralizes data on shared disk arrays, which means every compute server must go to the same SAN to retrieve data that would normally be distributed across the cluster.

However, when comparing local server storage with SAN-based storage, local storage has two advantages: cost and overall performance. In short, raw, non-RAID disks attached to each compute node will outperform a SAN when servicing HDFS requests. Server-based disk does have a flaw, though, and it is primarily one of scalability.
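
In practice, this "raw disk" layout is usually a JBOD configuration: the DataNode is pointed at each disk's mount point individually rather than at a RAID volume, and replication rather than RAID provides the redundancy. A sketch of the relevant settings (the mount paths are hypothetical; the property names are those used by Hadoop 2 and later):

```java
import org.apache.hadoop.conf.Configuration;

public class JbodConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // JBOD layout: list each raw disk's mount point separately instead of
        // one RAID volume; the DataNode rotates new blocks across the volumes.
        // These mount paths are hypothetical examples.
        conf.set("dfs.datanode.data.dir",
                 "/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data");

        // Redundancy comes from block replication, not from RAID.
        conf.set("dfs.replication", "3");

        System.out.println(conf.get("dfs.datanode.data.dir"));
    }
}
```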

The problem is how to add capacity when the servers depend on local storage. There are usually two ways out of this dilemma: add more servers with more local storage, or add disk capacity to the existing cluster servers. Both require purchasing and configuring hardware, both mean downtime, and both may force a redesign of the architecture. Either way, however, there is a significant cost advantage over adding capacity to a SAN.
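
A back-of-the-envelope comparison illustrates the cost argument; every price below is a hypothetical placeholder for your own quotes:

```java
public class ExpansionCostSketch {
    public static void main(String[] args) {
        double extraUsableTbNeeded = 100.0;

        // Hypothetical unit costs; substitute real vendor pricing.
        double commodityServerCost = 6000.0; // server with 12 TB of local disk
        double rawTbPerServer = 12.0;
        int replicationFactor = 3;           // HDFS triple replication
        double sanCostPerUsableTb = 2500.0;  // fully provisioned SAN capacity

        // Local disk: replication means buying 3 TB raw per usable TB.
        double usableTbPerServer = rawTbPerServer / replicationFactor;
        int serversNeeded = (int) Math.ceil(extraUsableTbNeeded / usableTbPerServer);
        double localCost = serversNeeded * commodityServerCost;

        double sanCost = extraUsableTbNeeded * sanCostPerUsableTb;

        System.out.printf("Local scale-out: %d servers, ~$%.0f%n", serversNeeded, localCost);
        System.out.printf("SAN expansion:   ~$%.0f%n", sanCost);
    }
}
```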

When it comes to Hadoop, however, there are other storage choices. Some leading storage vendors are building appliances specifically for Hadoop and big data analytics: EMC, for example, currently offers a Hadoop solution in its Greenplum HD Data Computing Appliance, and Oracle is looking to deepen its Exadata series of appliances, which combine computational power with high-speed storage.

The last storage option is cloud-based storage. Cloudera, Microsoft, Amazon, and many other vendors offer cloud-based big data solutions that bundle processing power, storage, and support.

When choosing a big data storage solution, you need to consider how much space is needed, how often the data will be analyzed, and what types of data must be processed. These factors, along with security, budget, and processing time, should all shape the decision.

From a risk standpoint, a pilot project may be a good place to start, and commodity hardware keeps the investment in such a pilot low.
