Preparing storage for big data

Today we are constantly exposed to the term "big data," yet the industry still lacks a standardized definition of what big data actually is. So what does big data mean for the data storage infrastructure?

The Enterprise Strategy Group (ESG) defines big data as "data sets that exceed the boundaries of conventional processing capabilities, forcing you to resort to unconventional means." Put simply, we can apply the term big data to any data collection that outgrows traditional IT's ability to support the day-to-day operations of the business.

These boundaries may occur in the following situations:

High transaction volumes push traditional storage systems to a bottleneck where they can no longer complete every operation in time; in short, they cannot service that many I/O requests. At some point the disks in the environment simply cannot keep up with all of the I/O. This often drives users to place only a small portion of the data on each drive, a practice known as "short-stroking": using only a fraction of each disk, and therefore many more drives, to raise the effective I/O rate per gigabyte of stored data. It can also drive users to deploy many storage systems in parallel that never reach full capacity because of the performance bottleneck, or both. Either way, it is a costly approach: you buy far more disk drives than you need and leave most of their capacity empty. (A rough arithmetic sketch of this trade-off follows these three scenarios.)

The size of individual data items (records, files, or objects) exceeds what traditional systems can move in time; the system simply does not have the throughput. This may be nothing more than insufficient bandwidth for the transaction volume, but the bandwidth challenge is a demanding one. We see many businesses resort to short-stroking here as well, adding drives purely to gain bandwidth, which again leads to low utilization and higher overhead.

Total capacity exceeds what a traditional storage system can hold. Simply put, the system cannot provide enough capacity for the data. Storage then sprawls across dozens or hundreds of arrays managed by tens or hundreds of management nodes, resulting in low utilization, a large physical footprint, and high energy and cooling costs.
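
To make the short-stroking trade-off from the first scenario concrete, here is a minimal sketch of the arithmetic. All of the figures (per-drive capacity, per-drive IOPS, the usable fraction, and the IOPS target) are assumptions chosen for illustration, not numbers from the article.

```python
# Illustrative only: why short-stroking drives to hit an IOPS target
# leaves most of the purchased capacity empty. All figures are assumed.

DRIVE_CAPACITY_GB = 600      # assumed capacity of one high-RPM drive
DRIVE_IOPS = 180             # assumed random IOPS one such drive sustains
USED_FRACTION = 0.25         # short-stroking: only 25% of each disk is used

def drives_needed(target_iops: int) -> int:
    """Drives required to reach an IOPS target, regardless of capacity."""
    return -(-target_iops // DRIVE_IOPS)   # ceiling division

target_iops = 50_000
n = drives_needed(target_iops)
raw_tb = n * DRIVE_CAPACITY_GB / 1024
usable_tb = raw_tb * USED_FRACTION

print(f"Drives needed for {target_iops} IOPS: {n}")
print(f"Raw capacity purchased: {raw_tb:.1f} TB")
print(f"Capacity actually used: {usable_tb:.1f} TB "
      f"({USED_FRACTION:.0%} utilization)")
```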

These symptoms can also strike at the same time: nothing says a user won't face a huge volume of data, held in very large files, that also demands heavy I/O, all at once. In fact, the term "big data" first surfaced in IT discussions within particular vertical industries, such as healthcare, media and entertainment, and oil and gas.

Storage infrastructure that supports big data

We are seeing several new approaches to storage infrastructure for handling the growing data volumes associated with big data. Each approach has its own characteristics, though they overlap.

For I/O-intensive transaction processing, ESG's research finds a number of infrastructure approaches that scale up by adding disks. These are the most traditional solutions, typified by systems such as the EMC VMAX, IBM DS8000, and HDS VSP.

To cope with large file sizes, leading-edge enterprises began adopting scale-out systems a few years ago, configuring enough bandwidth to handle big files and thereby addressing that aspect of big data. Such systems include DataDirect Networks, HP Ibrix, Isilon (since acquired by EMC), and Panasas. They meet performance requirements by scaling up (adding disks) and scaling out (adding bandwidth and processing power). As large data sizes become more common, some of these systems are also moving into more mainstream business applications. These mainstream environments typically mix I/O-sensitive and throughput-sensitive high-performance requirements, so the ability to scale both out and up is essential.

Finally, on the pure-capacity side, we are seeing scale-out, object-based storage infrastructures that can grow to tens of billions of data objects within a single, simply managed system. The advantages of this type of system are that it is easier to manage, it tracks rich metadata, and it can be built on high-density, low-cost hard drives; the Dell DX is one example.

About Hadoop

No discussion of big data applications would be complete without Hadoop. Hadoop's ability to shorten business analysis cycles from weeks to hours or even minutes, at a reasonable cost, is attractive to businesses. This open-source technology typically runs on inexpensive servers using low-cost direct-attached storage (DAS).

Hadoop is used to process large amounts of data and is made up of two parts: MapReduce and the Hadoop Distributed File System (HDFS). MapReduce manages the compute tasks, while HDFS automatically manages where data is stored across the cluster (relieving developers of that burden). When a job is started, MapReduce takes over and breaks it into subtasks that can run in parallel. MapReduce queries HDFS for the locations of the data each subtask needs, then dispatches those subtasks to the compute nodes where that data resides; in effect, it sends the computation to the data. The results of each subtask are sent back to MapReduce to be merged into the final result.
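
To make that flow concrete, here is a minimal sketch of the MapReduce pattern written as ordinary Python rather than a real Hadoop job; the word-count task, the sample chunks, and the in-process shuffle are illustrative stand-ins for what MapReduce and HDFS do across a cluster.

```python
# A toy version of the MapReduce flow described above, run entirely in one
# Python process. The "chunks" list stands in for data blocks that HDFS
# would normally spread across the cluster's data nodes.
from collections import defaultdict
from itertools import chain

def map_phase(chunk: str):
    """Map subtask: emit (word, 1) pairs for one chunk of input."""
    for word in chunk.split():
        yield word.lower(), 1

def reduce_phase(word: str, counts):
    """Reduce subtask: merge all partial results for one key."""
    return word, sum(counts)

chunks = [
    "big data needs big storage",
    "storage for big data is not one size fits all",
]

# Map step: in Hadoop each chunk would be processed on the node holding
# that block; here the map calls simply run one after another.
mapped = chain.from_iterable(map_phase(c) for c in chunks)

# Shuffle step: group the intermediate (key, value) pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce step: combine the partial counts into the final answer.
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)   # e.g. {'big': 3, 'data': 2, 'storage': 2, ...}
```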

In contrast, a traditional system would need a single very large, expensive server with substantial computing power, plus an equally expensive storage array, to accomplish the same task. It has to read all of the required data in a more or less sequential fashion, run the analysis, and produce the result, and for the same amount of data that takes far longer than Hadoop's MapReduce approach.

The difference can be summed up with a simple analogy. Suppose 20 shoppers in a grocery store all go through the same checkout lane. Each buys $200 worth of goods and needs 2 minutes to have everything scanned; even the best cashier needs 40 minutes to handle the $4,000 in purchases. Now apply the Hadoop approach: open 10 checkout lanes, each staffed by a low-cost, part-time student who needs 50% more time (3 minutes) per transaction. The same 20 shoppers are finished in 6 minutes, and you still collect the $4,000 (the arithmetic is spelled out in the sketch below). From a business point of view, what does compressing a 40-minute job into 6 minutes mean? What could you do with the 34 minutes you gain back? More research, a better understanding of market trends? That is the appeal on the business side: you no longer have to wait long for the analytical results you want.
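
For completeness, here is the analogy's arithmetic worked through in a few lines of Python, using the figures given above:

```python
# The checkout analogy as arithmetic: one fast lane vs. ten slower lanes.
customers = 20
spend_per_customer = 200            # dollars
serial_minutes_per_customer = 2     # one skilled cashier
parallel_lanes = 10
parallel_minutes_per_customer = 3   # part-time cashiers, 50% slower

serial_total = customers * serial_minutes_per_customer                          # 40 minutes
parallel_total = (customers / parallel_lanes) * parallel_minutes_per_customer   # 6 minutes
revenue = customers * spend_per_customer                                        # $4,000 either way

print(f"One cashier: {serial_total:.0f} minutes for ${revenue:,}")
print(f"{parallel_lanes} lanes:    {parallel_total:.0f} minutes for ${revenue:,}")
```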

Hadoop is not a perfect solution. Clustered file systems are complex, and that complexity often translates into administrators spending a great deal of time building Hadoop clusters and getting HDFS to run efficiently. In addition, the HDFS data map that tracks the location of all data (the metadata), known as the NameNode, is a single point of failure in the current Apache Hadoop releases; this is one of the important issues slated to be addressed in the next major release. Data protection also rests on administrator settings: the replication setting determines how many copies of each data file are kept within the cluster. The default is 3, which makes raw capacity three times the capacity actually used (the capacity math is sketched below), and this only protects data within the local cluster; current Hadoop releases do not address disaster recovery to a remote site. Keep in mind, too, that seasoned Hadoop experts are scarce. Companies such as Cloudera, EMC, and MapR now play an important role in training, but building a skilled in-house team takes time, and that should not be overlooked: recent studies put the cost of outside consulting services as high as $250,000 a year.
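
A quick sketch of the replication overhead, assuming a 100 TB dataset purely for illustration (dfs.replication is the actual HDFS setting; the dataset size is made up for the example):

```python
# Capacity math behind HDFS replication: the default factor of 3 triples
# the raw storage required, and that protection is local to the cluster.

replication_factor = 3        # HDFS default (hdfs-site.xml: dfs.replication)
logical_data_tb = 100         # assumed size of the data you actually store

raw_capacity_tb = logical_data_tb * replication_factor
overhead_tb = raw_capacity_tb - logical_data_tb

print(f"{logical_data_tb} TB of data at replication factor {replication_factor} "
      f"needs {raw_capacity_tb} TB of raw capacity ({overhead_tb} TB of copies).")
```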

Big data, bigger facts

These shortcomings, combined with the huge potential market, have large storage vendors such as EMC, IBM, and NetApp focused on the big data opportunity. Vendors have released, or are about to release, storage systems designed for Hadoop environments that help users overcome HDFS's weaknesses in manageability, scalability, and data protection. Most of these replace the HDFS storage layer with one that exposes open interfaces such as NFS and CIFS, and some also provide their own MapReduce framework that promises better performance than the open-source distribution. Others offer features that fill the gaps in open-source HDFS, such as the ability to share data with other applications through standard NFS and CIFS interfaces, or data protection and disaster recovery capabilities.

NetApp takes a rather different approach. It stays with standard open-source Hadoop and keeps DAS on the data nodes: rather than substituting a proprietary file system for HDFS, it supplies that DAS as SAS-connected JBOD on its low-end Engenio platform. For the NameNode, it attaches a FAS box over NFS so that a failed NameNode can be recovered quickly. It is a "best of both worlds" hybrid approach.

It is still too early to say whether the market will be willing to pay for these more reliable, more polished versions of the tools.

Big data is real, and it is varied: different types of big data call for different storage approaches. If big data is already causing you problems and you are running into these obstacles, it is time to take a different approach, and the best way to discuss requirements with vendors is to focus on the problem itself rather than on the buzzword. Talk about the business issues and the use cases; they help narrow the problem down to a specific workload, so you can quickly find the storage solution that fits.
