Storage and management of big data


Any machine has physical limits: memory capacity, disk capacity, processor speed, and so on. We must trade these hardware restrictions off against performance. For example, memory reads are much faster than disk reads, so an in-memory database outperforms a disk-based one, but a machine with 2 GB of memory cannot keep all of its data in memory. A machine with far more memory can, yet even it is helpless once the data grows larger still.

As data continues to grow, the performance of a single-machine system declines, and even constant hardware upgrades struggle to keep up with the data growth rate. Mainstream computer hardware today, however, is relatively cheap and scalable: buying eight 8-core machines with moderate memory is far more cost-effective than buying a single 64-core server with terabytes of memory, and machines can later be added or removed to cope with future change. This distributed architecture strategy suits massive data, so many massive-data systems choose to place data on multiple machines, although doing so brings many problems not found in single-machine systems.

The following sections describe three classes of database systems that emerged during the development of big-data storage and management: parallel databases, NoSQL systems, and NewSQL systems.

Parallel Databases

Parallel databases [1] are database systems that operate on data in a shared-nothing architecture. Most of them use the relational data model and support SQL queries. To execute SQL queries in parallel, these systems rely on two key techniques: horizontal table partitioning and partitioned query execution.

The main idea of horizontal partitioning is to distribute the tuples of a relational table across the nodes of a cluster according to some policy. Each node holds the same table schema, so the tuples can be processed in parallel. Common partitioning policies include hash partitioning, range partitioning, and round-robin partitioning. For example, a hash-partitioning policy distributes the tuples of table T across n nodes by applying a common hash function to one or more attributes of each tuple, such as hash(T.attribute1) mod n, and then placing each tuple on the node indicated by the hash value.
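The hash-partitioning policy above can be sketched in a few lines of Python. The function name `partition_for` and the sample tuples are illustrative, not from the text, and Python's built-in `hash` stands in for the stable hash function a real system would use.

```python
# Sketch of hash partitioning: assign each tuple to one of N nodes by
# hashing a chosen attribute, i.e. hash(T.attribute1) mod n.
# Illustrative only; real systems also handle NULLs, skew,
# and re-partitioning when the cluster changes size.

N_NODES = 4

def partition_for(attribute_value, n_nodes=N_NODES):
    """Return the node index for a tuple via hash(attr) mod n."""
    return hash(attribute_value) % n_nodes

# Distribute some example (id, name) tuples by their 'id' attribute.
tuples = [(1, "a"), (2, "b"), (3, "c"), (42, "d")]
nodes = {i: [] for i in range(N_NODES)}
for t in tuples:
    nodes[partition_for(t[0])].append(t)
```

Every tuple lands on exactly one node, so a later query on the partitioning attribute can be routed to a single machine.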

Queries over a partitioned table use a partition-based execution strategy. For example, to retrieve the tuples of table T whose values fall in some range, the system first generates a total execution plan P for the whole table, then splits P into n sub-plans {P1, ..., Pn}; sub-plan Pi executes independently on node Ni. Finally, each node sends its intermediate results to a selected node, which aggregates them into the final result.
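The scatter-gather execution just described can be sketched as follows. The names `run_subplan` and `coordinator_query` are hypothetical, and in-memory lists stand in for the nodes' local partitions.

```python
# Sketch of partitioned query execution: the coordinator splits a range
# query into one sub-plan per node, each node scans its local partition
# independently, and a selected node merges the intermediate results.

partitions = {
    0: [(1, 10), (5, 50)],   # node 0's local tuples: (key, value)
    1: [(2, 20), (6, 60)],   # node 1's local tuples
    2: [(3, 30), (7, 70)],   # node 2's local tuples
}

def run_subplan(node_tuples, lo, hi):
    """Sub-plan Pi: scan one node's partition for keys in [lo, hi]."""
    return [t for t in node_tuples if lo <= t[0] <= hi]

def coordinator_query(lo, hi):
    """Total plan P: scatter sub-plans to all nodes, gather, and merge."""
    intermediate = [run_subplan(p, lo, hi) for p in partitions.values()]
    merged = [t for part in intermediate for t in part]
    return sorted(merged)  # final aggregation on the selected node
```

Calling `coordinator_query(2, 6)` returns all tuples with keys between 2 and 6, regardless of which node stores them.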

The goals of a parallel database system are high performance and high availability: it executes database tasks in parallel across many nodes to improve both. In recent years, new techniques such as indexing, compression, materialized views, result caching, and I/O sharing have further improved performance; these techniques are mature and time-proven. Unlike earlier systems such as Teradata, which had to be deployed on proprietary hardware, more recently developed systems such as Aster and Vertica can run on common commodity machines and can be regarded as quasi-cloud systems.

The main disadvantage of parallel database systems is their lack of elasticity, a property that matters especially to small, medium, and start-up enterprises. Parallel databases are designed and optimized under the assumption that the number of nodes in the cluster is fixed; expanding or shrinking the cluster requires a comprehensive plan for migrating data. Data migration is expensive, and the system may be inaccessible for a period of time. This inflexibility directly limits the elasticity of parallel databases and their fit with the pay-as-you-go business model.

Another problem with parallel databases is poor fault tolerance. Node failures were once considered rare exceptions, so these systems provide only transaction-level fault tolerance: if a node fails during a query, the entire query must be re-executed from the beginning. This restart policy makes it difficult for parallel databases to run long queries on clusters with thousands of nodes, where node failures occur frequently. For these reasons, parallel databases suit only applications with relatively fixed resource requirements. Even so, many of their design principles remain a good reference for the design and optimization of other massive-data systems.

NoSQL Data Management Systems

The term NoSQL [5] first appeared in 1998 as the name of a lightweight, open-source relational database developed by Carlo Strozzi that did not provide a SQL interface. Because the later NoSQL movement departs from the traditional relational model entirely, Strozzi argued it deserved a brand-new name, such as "NoREL" or similar [6].

On July 6, 2009, Johan Oskarsson of Last.fm initiated a discussion on distributed open-source databases [7], and Eric Evans of Rackspace proposed the term NoSQL once again. By this time NoSQL mainly referred to database designs that are non-relational and distributed and that do not provide ACID guarantees.

The "no:sql(east)" seminar held in Atlanta in 2009 was a milestone; its slogan was "select fun, profit from real_world where relational=false;". Accordingly, the most common interpretation of NoSQL is "non-relational": it emphasizes the advantages of key-value stores and document databases rather than simple opposition to relational databases.

Traditional relational databases struggle with data-intensive applications, mainly because of poor flexibility, poor scalability, and poor performance. Recently, some storage systems have abandoned the design philosophy of traditional relational database management systems and adopted different approaches to meet scalability requirements. These systems, which have no fixed data model and can scale horizontally, are now collectively called NoSQL (some argue that NoREL would be more accurate). Here NoSQL means "Not Only SQL", that is, a complement to relational SQL systems. NoSQL systems generally rely on the following techniques:

Simple data model. Unlike distributed relational databases, most NoSQL systems use a simpler data model: each record has a unique key, and the system guarantees atomicity only at the level of a single record; foreign keys and cross-record relationships are not supported. Restricting each operation to a single record greatly improves scalability, because data operations can execute on a single machine without the overhead of distributed transactions.
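A minimal sketch of such a single-record-atomic key-value store follows, assuming an in-process Python dictionary guarded by a lock. The class and method names are illustrative; a real distributed store would lock per key or per partition rather than per store, and would persist and replicate the data.

```python
# Sketch of the NoSQL "simple data model": every record is addressed by
# a unique key, and only single-record operations are atomic. There are
# no cross-record transactions and no foreign keys, so one local lock
# suffices and no distributed transaction protocol is needed.

import threading

class SimpleKVStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # store-level lock, for the sketch

    def put(self, key, record):
        """Atomically create or replace a single record."""
        with self._lock:
            self._data[key] = record

    def get(self, key):
        """Atomically read a single record (or None if absent)."""
        with self._lock:
            return self._data.get(key)

store = SimpleKVStore()
store.put("user:1", {"name": "Ada"})
```

Note what is deliberately missing: there is no way to update two keys atomically, which is exactly the restriction that lets such systems shard keys across machines freely.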

Separation of metadata and application data. A NoSQL data management system maintains two kinds of data: metadata and application data. Metadata is used for system management, for example the mapping of data partitions to nodes and replicas in the cluster. Application data is the business data users store in the system. The two are separated because they have different consistency requirements: metadata must be consistent and up to date for the system to run correctly, while the consistency requirements of application data vary by scenario. To achieve scalability, NoSQL systems therefore manage the two kinds of data with different policies. Some NoSQL systems keep no metadata at all and solve the data-to-node mapping problem in other ways.

Weak consistency. NoSQL systems replicate application data across nodes. This design makes replica synchronization expensive when data is updated; to reduce the overhead, weak consistency models such as eventual consistency and timeline consistency are widely used.
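The idea behind eventual consistency can be sketched with a last-writer-wins merge. The `Replica` class and integer versions are illustrative assumptions (real systems typically use timestamps or vector clocks), not a description of any particular product.

```python
# Sketch of weak (eventual) consistency: replicas accept writes locally,
# tag each value with a version, and exchange state in the background.
# A last-writer-wins merge makes all replicas converge once updates stop
# and propagation completes. Illustrative only.

class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        """Accept a write locally without coordinating with peers."""
        self.data[key] = (version, value)

    def merge_from(self, other):
        """Anti-entropy: keep the higher-versioned value for each key."""
        for key, (ver, val) in other.data.items():
            if key not in self.data or self.data[key][0] < ver:
                self.data[key] = (ver, val)

a, b = Replica(), Replica()
a.write("k", "v1", version=1)  # an early write lands on replica a
b.write("k", "v2", version=2)  # a later write lands on replica b
# Before synchronization the replicas disagree (the "weak" window) ...
a.merge_from(b)
b.merge_from(a)
# ... after anti-entropy both hold the newest write.
```

The cost saved is coordination on the write path; the price paid is the window during which readers may see stale values.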

With these techniques, NoSQL systems cope well with the challenges of massive data. Compared with relational databases, NoSQL data storage and management systems have the following advantages:

Avoiding unnecessary complexity. Relational databases provide a wide range of features and strong consistency guarantees, but many of those features are useful only to particular applications and are rarely used in practice. NoSQL systems offer fewer features in exchange for better performance.

High throughput. Some NoSQL systems offer much higher throughput than traditional relational data management systems. For example, Google uses MapReduce to process 20 PB of data stored in Bigtable every day.

Horizontal scalability on low-end hardware clusters. NoSQL systems scale out well, and unlike relational database clusters, this scaling need not be costly. Designs that target low-end commodity hardware save NoSQL users substantial hardware expense.

Avoiding expensive object-relational mapping. Many NoSQL systems can store data as objects directly, avoiding the cost of converting between the relational model in the database and the object model in the program.

NoSQL offers efficient, inexpensive data management. Many companies no longer rely on Oracle or even MySQL; instead they have built their own massive-data storage and management systems on the ideas behind Amazon's Dynamo and Google's Bigtable, and some of these systems have been open-sourced. Facebook, for example, donated its Cassandra system to the Apache Software Foundation.

Although NoSQL databases provide high scalability and flexibility, they also have shortcomings, mainly the following:

The data model and query language lack formal verification. SQL, being based on relational algebra and relational calculus, has a solid mathematical foundation: however complex a structured query is, it retrieves exactly the data that satisfies its conditions. Because NoSQL systems do not use SQL, some of their models lack such a foundation, which is one of the main sources of confusion around NoSQL.

ACID properties are not supported. This cuts both ways for NoSQL: transactions are still needed in many cases, and ACID properties are what allow a system to execute online transactions correctly in the face of interruptions.

Limited functionality. Most NoSQL systems provide only simple functions, which shifts work to the application layer. Implementing ACID semantics at the application layer, for example, is extremely painful for programmers.

No unified query model. NoSQL systems each provide their own query model, which adds to developers' burden.

NewSQL Data Management Systems

It has been widely believed that supporting ACID and SQL limits a traditional database's scalability and its ability to handle massive data, and that these features should therefore be sacrificed. Others disagree: in their view it is not ACID or SQL support but mechanisms such as locking, logging, and buffer management that restrict system performance, and once those mechanisms are optimized, a relational database system can still perform well on massive data.

 

When a relational database processes transactions, several components significantly affect performance and need optimization:

Communication. Communication between the application and the DBMS through ODBC or JDBC is a major overhead in OLTP transactions.

Logging. Changes made by a transaction must be recorded in the log, and the log must be continually flushed to disk to guarantee durability. This is expensive and reduces transaction performance.

Locking. A transaction must lock the data it modifies, which requires writes to the lock table and incurs a certain amount of overhead.

Latching. Some shared data structures in a relational database, such as B-trees, the lock table, and resource tables, are read by many concurrent threads and must be protected by short-term latches, which affects transaction performance.

Buffer management. A relational database organizes data into fixed-size pages, and caching disk pages in memory incurs a certain amount of overhead.

To address these problems, some new databases adopt different designs: they remove the resource-hungry buffer pool and run the entire database in memory; they abandon locking in favor of single-threaded execution; and they use redundant machines for replication and failure recovery in place of expensive log-based recovery. Such scalable, high-performance SQL databases are called NewSQL, where "new" marks the difference from traditional relational database systems. NewSQL is a broad label, first used by the 451 Group in a report covering two kinds of systems: those that offer relational database products and services and bring the benefits of the relational model to distributed architectures, and those that improve relational database performance to the point where horizontal scaling becomes unnecessary. The first kind includes Clustrix, GenieDB, ScalArc, ScaleBase, and NimbusDB, as well as MySQL Cluster with NDB and Drizzle; the second includes Tokutek and JustOne DB. There are also "NewSQL as a service" offerings, including Amazon Relational Database Service, Microsoft SQL Azure, and FathomDB.

Of course, NewSQL and NoSQL also overlap. RethinkDB, for example, can be viewed either as a high-speed cache in front of a key-value NoSQL store or as a MySQL storage engine in a NewSQL setting. Currently, many NewSQL providers use their own databases to serve data without fixed schemas, while some NoSQL databases have begun to support SQL queries and ACID transactions.

NewSQL can retain the qualities of SQL databases while providing NoSQL-style scalability. VoltDB is one implementation of NewSQL: its development company's CTO claims the system processes transactions 45 times faster than traditional database systems, and that VoltDB scales to 39 machines, processing 16 million transactions per minute on 300 CPU cores, with far fewer machines than a comparable Hadoop cluster would require.

With the rapid rise of the NoSQL and NewSQL camps, database systems are flourishing, and hundreds of systems now exist. Figure 1-1 classifies database systems in a broad sense.

 

Figure 1-1 Classification of Database Systems

In Figure 1-1, databases are divided into relational databases, non-relational databases, and database caching systems. Here, non-relational databases mainly means NoSQL databases, which fall into four categories: key-value stores, column stores, graph stores, and document stores. Relational databases include both traditional relational database systems and NewSQL databases.

The demands of high-volume, highly distributed, and highly complex applications keep pushing traditional databases beyond their capacity limits. The six key factors driving the adoption of data management technologies other than traditional relational databases can be summarized by the acronym "SPRAIN", namely:

Scalability: hardware prices

(High) Performance: MySQL's performance bottlenecks

Relaxed consistency: the CAP theorem

Agility: diversity of persistence

Intricacy: massive data

Necessity: open source

 

 

Author Profile

Lu Jiaheng is a professor and doctoral advisor at Renmin University of China. He received his doctorate from the Department of Computer Science at the National University of Singapore, did postdoctoral research at the University of California, Irvine, and then joined Renmin University of China, where he was promoted to professor in 2012. His main research fields are database technology and cloud computing technology. He has published more than 40 database papers at major international conferences and in journals such as SIGMOD, VLDB, ICDE, and WWW, and has edited several cloud computing and big data textbooks and other works.

This article is excerpted from the book Big Data Challenges and NoSQL Database Technology, edited by Lu Jiaheng and published by the Publishing House of Electronics Industry.

 

 
