Introduction to the Storage and Management of Big Data

Last Update:2017-02-27 Source: Internet

Author: User


Every machine has physical limits: memory capacity, disk capacity, processor speed, and so on. We have to trade these hardware limits off against performance. Memory is faster to read than disk, for example, so an in-memory database outperforms a disk-based one, but a machine with 2GB of memory cannot hold 100GB of data in memory. A machine with 128GB of memory can, but once the data grows to 200GB it is out of options as well.

As data keeps growing, system performance keeps falling, and even continual hardware upgrades struggle to keep pace. Mainstream computer hardware, however, is now cheap and easy to scale out: buying eight machines with 8 cores and 128GB of RAM each is far more cost-effective than buying a single 64-core server with terabytes of memory, and machines can be added or removed to cope with future changes. This distributed architecture suits massive data, so many massive-data systems spread their data across many machines, but doing so introduces many problems that single-machine systems never faced.

The following sections introduce four kinds of database system that have emerged in the course of large-scale data storage and management.

Parallel database

Parallel databases are database systems that process data on a shared-nothing architecture. Most of them adopt the relational data model and support SQL queries. To execute SQL queries in parallel, they rely on two key techniques: horizontal partitioning of relational tables and partitioned execution of SQL queries.

The main idea of horizontal partitioning is to distribute the tuples of a relational table across the nodes of a cluster according to some strategy. Every node holds a table with the same structure, so the tuples can be processed in parallel. Existing partitioning strategies include hash partitioning, range partitioning, round-robin partitioning, and so on. For example, to distribute the tuples of table T over n nodes with hash partitioning, the system applies the same hash function to one or more attributes of each tuple t, such as hash(t.attribute1) mod n, and places the tuple on the node selected by the result.
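The two simplest strategies above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular parallel database implements partitioning: the table T, its columns, and the function names are all hypothetical, and MD5 is used only as a convenient hash that gives the same value on every run.

```python
import hashlib


def stable_hash(value):
    """Return a hash that is identical across runs and machines.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used here instead (an assumption of this sketch).
    """
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_partition(tuples, attribute, n):
    """Place each tuple t on node stable_hash(t[attribute]) mod n."""
    nodes = [[] for _ in range(n)]
    for t in tuples:
        nodes[stable_hash(t[attribute]) % n].append(t)
    return nodes


def round_robin_partition(tuples, n):
    """Place the i-th tuple on node i mod n, ignoring its contents."""
    nodes = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        nodes[i % n].append(t)
    return nodes


# Hypothetical table T with an integer key and a payload column.
T = [{"id": i, "amount": 10 * i} for i in range(12)]
by_hash = hash_partition(T, "id", n=3)
by_turn = round_robin_partition(T, n=3)
```

Round-robin always balances the nodes evenly, while hash partitioning guarantees that tuples with the same attribute value land on the same node, which is what lets a lookup on that attribute be routed to a single node.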

Processing an SQL query over a partitioned table requires a partitioned execution strategy. Consider a query for the tuples of table T within some range of values: the system first generates an overall execution plan P for the whole table T, then splits P into n child plans {P1, ..., Pn}, and each child plan Pi executes independently on node ni. Finally, every node sends its intermediate result to a selected node, which aggregates the intermediate results to produce the final result.
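As a rough sketch of this strategy, consider a query that averages a column over the tuples whose key falls in a given range. The table, the column names, and the functions child_plan and coordinator below are hypothetical; a real system would run the child plans concurrently on separate machines rather than in one process.

```python
def child_plan(partition, lo, hi):
    """Child plan Pi: runs on one node against its local tuples only."""
    matching = [t["amount"] for t in partition if lo <= t["id"] < hi]
    return sum(matching), len(matching)  # intermediate result: (sum, count)


def coordinator(partials):
    """Merge step on the selected node: combine the intermediate results."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


def run_query(partitions, lo, hi):
    # The child plans are independent of each other, so they could run
    # in parallel; this sketch simply runs them one after another.
    partials = [child_plan(p, lo, hi) for p in partitions]
    return coordinator(partials)


# Hypothetical table T, split over three nodes by id mod 3.
T = [{"id": i, "amount": 10 * i} for i in range(12)]
partitions = [[t for t in T if t["id"] % 3 == k] for k in range(3)]

result = run_query(partitions, lo=2, hi=8)
print(result)  # 45.0
```

Each node returns a (sum, count) pair rather than a local average, because averages of unequal-sized partitions cannot be combined correctly without the counts.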

The goals of a parallel database system are high performance and high availability: by executing database tasks in parallel on multiple nodes, it improves the performance and availability of the database system as a whole. Many techniques for improving performance have emerged in recent years, such as indexing, compression, materialized views, result caching, and I/O sharing; these are relatively mature and have stood the test of time. Unlike some early systems, such as Teradata, which had to be deployed on proprietary hardware, more recently developed systems such as Aster and Vertica can be deployed on commodity machines and can be regarded as quasi-cloud systems.

The main drawback of parallel database systems is their lack of elasticity, a property that is valuable to small and medium enterprises and start-ups. Parallel databases are designed and optimized on the assumption that the number of nodes in the cluster is fixed, so expanding or shrinking the cluster requires a comprehensive plan for migrating the data. This migration is expensive and leaves the system inaccessible for some time, which directly undermines the elasticity of parallel databases and their fit with pay-as-you-go business models.
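The cost of resizing is easy to see with the simple "hash mod n" placement rule. The sketch below is an illustration under stated assumptions rather than a measurement of any real system: it uses hypothetical integer keys as stand-ins for hash values and counts how many of them change node when a cluster grows from 4 to 5 nodes.

```python
def node_for(key, n):
    """Placement rule assumed in this sketch: key mod n."""
    return key % n


def fraction_moved(keys, old_n, new_n):
    """Share of keys whose node changes when the cluster is resized."""
    moved = sum(1 for k in keys if node_for(k, old_n) != node_for(k, new_n))
    return moved / len(keys)


keys = range(10_000)  # hypothetical integer keys standing in for hash values
grow = fraction_moved(keys, old_n=4, new_n=5)
print(f"4 -> 5 nodes: {grow:.0%} of tuples must be transferred")
# prints "4 -> 5 nodes: 80% of tuples must be transferred"
```

Adding a single node therefore relocates most of the table under this rule, which is why resizing has to be planned as a bulk data migration.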

Another problem with parallel databases is poor fault tolerance. In the past, node failure was considered a rare special case, so these systems provide fault tolerance only at the transaction level: if a node fails while a query is running, the whole query has to start again from scratch. This restart strategy makes it hard for parallel databases to run long queries on clusters with thousands of nodes, because node failures are frequent in clusters of that size. For these reasons, parallel databases are suitable only for applications whose resource requirements are relatively fixed. Even so, many of their design principles are a good reference for the design and optimization of other massive-data systems.

NoSQL Data Management System

The term NoSQL first appeared in 1998 as the name of a lightweight, open-source relational database, developed by Carlo Strozzi, that did not provide an SQL interface. (Strozzi has argued that because the later NoSQL movement departs from the traditional relational database model, it should have a brand-new name, such as "NoREL" or something similar.)

In 2009, Johan Oskarsson of Last.fm organized a discussion on open-source distributed databases, and Eric Evans of Rackspace reintroduced the term NoSQL. In this sense, NoSQL refers mainly to database designs that are non-relational, distributed, and do not provide ACID guarantees.

The no:sql(east) conference held in Atlanta in 2009 was a milestone; its slogan was "select fun, profit from real_world where relational=false;". The most common interpretation of NoSQL is therefore "non-relational", which emphasizes the merits of key-value stores and document databases rather than simple opposition to relational databases.

Traditional relational databases struggle with data-intensive applications, mainly because of their limited flexibility, poor scalability, and poor performance. Some recent storage systems have abandoned the design ideas of traditional relational database management systems and adopted different solutions to meet scalability requirements. These systems, which have no fixed data schema and can scale horizontally, are now collectively called NoSQL (some people consider NoREL the more accurate name). Here NoSQL stands for "Not only SQL": these systems complement relational SQL data systems. Techniques commonly used in NoSQL systems include:

Simple data model. Unlike distributed databases, most NoSQL systems adopt a simpler data model in which each record has a unique key. The system supports atomicity only at the level of a single record and does not support foreign keys or relationships across records. Restricting each operation to a single record greatly improves scalability, because every operation can be carried out on one machine without the overhead of a distributed transaction.

Separation of metadata and application data. A NoSQL data management system maintains two kinds of data: metadata and application data. Metadata is used for system management, such as the mapping of data to the nodes and replicas in a cluster. Application data is the business data that users store in the system. The system separates the two because they have different consistency requirements. For the system to work correctly, the metadata must be consistent and up to date, whereas the consistency required of application data varies with the application. To achieve scalability, NoSQL systems therefore manage the two kinds of data with different strategies. Some NoSQL systems have no metadata at all and solve the problem of mapping data to nodes in other ways.

Weak consistency. NoSQL systems replicate application data and must keep the replicas consistent with one another. Synchronizing every replica on each update is expensive, so to reduce this overhead, weak consistency models such as eventual consistency and timeline consistency are widely used.
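The interplay between the single-record data model and eventual consistency can be sketched with a toy replicated key-value store. Everything here is a simplification for illustration: the class names are hypothetical, a single global version counter stands in for the clocks or version vectors that real systems use, and conflicts are resolved by keeping the newest version (last-writer-wins).

```python
class Replica:
    """One copy of the application data: key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        # Keep only the newest version of a record (last-writer-wins).
        if key not in self.data or self.data[key][0] < version:
            self.data[key] = (version, value)

    def get(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class EventuallyConsistentStore:
    """Writes are accepted by one replica and propagated to the rest later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # updates not yet delivered to the other replicas
        self.version = 0   # simplification: one global version counter

    def put(self, key, value, replica=0):
        # Each write touches exactly one record on one replica.
        self.version += 1
        self.replicas[replica].apply(key, self.version, value)
        self.pending.append((replica, key, self.version, value))

    def get(self, key, replica=0):
        return self.replicas[replica].get(key)

    def sync(self):
        """Background propagation of pending updates to every other replica."""
        for source, key, version, value in self.pending:
            for i, r in enumerate(self.replicas):
                if i != source:
                    r.apply(key, version, value)
        self.pending.clear()


store = EventuallyConsistentStore()
store.put("user:1", "alice", replica=0)
stale = store.get("user:1", replica=2)  # None: replica 2 has not seen the write
store.sync()
fresh = store.get("user:1", replica=2)  # "alice": the replicas now agree
```

The write returns without waiting for the other replicas, which is where the performance gain comes from; the price is that a read served by another replica can return an old value until propagation completes.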

With these techniques, NoSQL systems cope well with the challenges of massive data. Compared with relational databases, the main advantages of NoSQL data storage and management systems are:

Avoiding unnecessary complexity. Relational databases provide a wide variety of features and strong consistency, but many of those features are useful only in particular applications and most are rarely used. NoSQL systems provide fewer features in exchange for better performance.

High throughput. Some NoSQL data systems achieve much higher throughput than traditional relational data management systems; Google, for example, uses MapReduce to process 20PB of data stored in Bigtable every day.

High horizontal scalability on clusters of low-end hardware. NoSQL data systems scale out well, and unlike clustering a relational database, this expansion does not come at a high cost. Designing for low-end hardware saves the users of NoSQL data systems a great deal of hardware expense.

Avoiding expensive object-relational mapping. Many NoSQL systems can store data objects directly, which avoids the cost of converting between the relational model in the database and the object model in the program.

NoSQL offers an efficient and inexpensive data management solution. Many companies no longer use Oracle or even MySQL and instead build their own massive-data storage and management systems on the main ideas of Amazon's Dynamo and Google's Bigtable. Some of these systems have since been open-sourced; Facebook, for example, donated Cassandra, which it developed, to the Apache Software Foundation.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
