From NoSQL to NewSQL: Key Points in Building Transactional Distributed Databases


In the previous article, "From Architectural Features to Functional Defects: Re-understanding Analytical Distributed Databases", we completed a horizontal comparison of different "distributed databases". In this article, Ivan continues the breakdown with the second part: drawing on the differences between NoSQL and NewSQL, he sketches the key technical points of "distributed database" implementations aimed at OLTP scenarios. This article is both an extension of the previous one and the opening of a special topic on distributed databases, whose main points Ivan will also elaborate in separate articles.

I. NewSQL & NoSQL

NewSQL is the focus of this topic. As mentioned in the previous article, "distributed databases" in this sense are suited to OLTP scenarios: high concurrency and low latency, with characteristics close to traditional databases such as Oracle and DB2, but relying on commodity x86 servers for horizontal scaling and able to withstand the performance pressure of massive transaction volumes.

Well-known NewSQL systems currently include Google's Spanner/F1, Alibaba's OceanBase, CockroachDB, and TiDB. The latter two are fast-growing open source projects that both released 2.0 versions in 2018.

NewSQL and NoSQL are deeply related, so the following introduction to NewSQL's key technical points also covers the corresponding NoSQL implementations.

1. Storage Engine

B+Tree

B+Tree is the common index storage model of relational databases. It supports efficient range scans: leaf nodes are linked in primary-key order, which avoids costly tree traversal during a scan. The limitation of B+Tree is that it is not well suited to heavy random writes, where "write amplification" and "storage fragmentation" occur.

The following example, borrowed from the book in [1], illustrates how a B+Tree operates.

Consider a B+Tree of height 2, stored in 5 pages, where each page holds up to 4 records and the fan-out is 5. The figure (omitted here) shows the structure of this B+Tree, leaving out the pointers from leaf nodes to data and the sequential pointers between leaf nodes:

A B+Tree consists of two kinds of nodes: internal nodes, which contain only index information, and leaf nodes, which carry the pointers to the actual data.

When a record with index value 70 is inserted, the corresponding page is already full, so the B+Tree must be rearranged: the page containing the parent node is changed, and the records of adjacent pages are adjusted. After the redistribution, the result is as follows (figure omitted):

Two issues arise during this change:

    • Write amplification

In this example, only one record (yellow callout in the figure) logically needs to be written, but 7 index records across 3 pages are actually changed; the extra 6 records (green callout) are written purely to maintain the B+Tree structure. This is the write amplification the B+Tree produces.

Note: write amplification is the amount of data written to storage compared with the amount of data the application wrote, i.e. the ratio of the data actually written to disk to the data the application asked to write.

    • Non-contiguous storage

The new leaf node is linked into the ordered list of existing leaf nodes, so logically the structure remains continuous; on disk, however, the space allocated for the new page is likely not adjacent to the original pages. As a result, subsequent queries that touch the new leaf node can no longer be served by one sequential read, and disk seek time increases. Over time, heavy random writes to a B+Tree lead to storage fragmentation.

Database products that actually use B+Tree (such as MySQL) typically provide a fill factor for targeted tuning. If the fill factor is set too small, the number of pages inflates and the disk range to scan grows, hurting query performance; if it is set too large, inserts cause write amplification and frequent page splits, hurting insert performance and also query performance, because data ends up stored non-contiguously.
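
To make the write amplification concrete, here is a minimal back-of-the-envelope sketch in Python. The record size and page size are assumptions for illustration (they are not figures from the example above); the point is that storage engines flush whole pages, so rewriting 3 pages to insert one small record inflates the bytes actually written.

```python
# A rough sketch (assumed sizes) of write amplification for the B+Tree example above:
# 1 record is written logically, but 3 whole pages are rewritten on disk.

LOGICAL_RECORD_BYTES = 100      # assumed average record size
PAGE_BYTES = 16 * 1024          # assumed page size (e.g. a 16 KB page)
pages_rewritten = 3             # from the example: 3 pages are touched

bytes_requested = LOGICAL_RECORD_BYTES
bytes_written = pages_rewritten * PAGE_BYTES

write_amplification = bytes_written / bytes_requested
print(f"write amplification ~ {write_amplification:.0f}x")   # about 492x under these assumptions
```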

LSM-Tree

LSM-Tree (Log-Structured Merge-Tree) was first proposed by Patrick O'Neil, whose paper [2] systematically discusses its differences from B+Tree. Google later adopted the model in BigTable, as shown below (figure omitted):

The main idea of LSM-Tree is to use memory to turn random writes into sequential writes, which improves write performance; and because writes occupy the disk far less, reads get more of the disk's bandwidth and read performance is not unduly affected.

A simplified write path is as follows (see the sketch after the list):

    • When a write request arrives, it is first written to the in-memory memtable, which absorbs the incremental data changes, and a WAL (write-ahead log) entry is recorded;
    • When the in-memory incremental data reaches a certain threshold, the current memtable is frozen and a new one is created; the frozen memtable is then written sequentially to disk as an ordered file called an SSTable (Sorted String Table). This operation is called minor compaction (in HBase it is called a flush, and minor compaction there means something else);
    • When these SSTables satisfy certain rules, they are merged, i.e. major compaction, in which all SSTables under a column family are merged into one large SSTable.
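
The following is a minimal sketch of this write path. The names, thresholds, and in-memory stand-ins are illustrative assumptions of our own, not any specific engine's API: writes go to a WAL and a memtable, and a full memtable is frozen and flushed as a sorted run playing the role of an SSTable.

```python
# A minimal LSM-Tree write-path sketch (illustrative assumptions only).

import json

class MiniLSM:
    def __init__(self, memtable_limit=4):
        self.wal = []                 # write-ahead log (in-memory stand-in for a log file)
        self.memtable = {}            # current mutable memtable
        self.sstables = []            # flushed, immutable, sorted runs (newest last)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append(json.dumps({"k": key, "v": value}))   # 1. log first
        self.memtable[key] = value                            # 2. apply to the memtable
        if len(self.memtable) >= self.memtable_limit:         # 3. threshold reached
            self._minor_compaction()

    def _minor_compaction(self):
        # Freeze the memtable and write it out as a sorted run (an "SSTable").
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal.clear()              # entries are now durable in the SSTable

    def get(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None

db = MiniLSM()
for i in range(10):
    db.put(f"key{i}", i)
print(db.get("key3"), len(db.sstables))   # 3, with two flushed SSTables
```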

This model avoids the I/O inefficiency of random writes, effectively alleviates the write amplification problem of the B+Tree index, and greatly improves write efficiency.

The LSM-Tree model is widely used in NoSQL, including HBase, Cassandra, LevelDB, RocksDB, and other K/V stores.

Of course, LSM-Tree has its own flaws:

    • First, major compaction severely affects online reads and writes and itself produces write amplification. For this reason, HBase deployments usually disable automatic major compaction.

Note:

The point of major compaction is to reduce the time complexity of reads. Suppose the system contains multiple SSTable files holding N records in total, with each SSTable holding M records on average.

For a read, a binary search within a single SSTable costs O(log2 M), so the overall cost is O((N/M) * log2 M); after merging everything into a single SSTable, the cost drops to O(log2 N). (A numeric illustration appears after this list.)

    • Second is the impact on read efficiency: because the SSTable files all sit at the same level and are produced in the order the write batches were executed, the keys (record primary keys) of different files overlap. A read therefore has to look up every file, causing unnecessary I/O overhead.

    • Finally, space amplification: in the worst case the LSM-Tree needs free space equal to the size of its data to complete a compaction, i.e. 100% space amplification, whereas a B+Tree's space amplification is only about 1/3.
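
As a rough numeric illustration of the note above, the values of N and M below are assumptions; the comparison simply evaluates the two cost formulas.

```python
# Assumed numbers for the read-cost argument: N records spread over N/M SSTables
# versus a single merged SSTable.

import math

N = 1_000_000_000   # assumed total number of records
M = 1_000_000       # assumed records per SSTable

cost_unmerged = (N / M) * math.log2(M)   # binary search in every SSTable
cost_merged = math.log2(N)               # binary search in one big SSTable

print(f"{cost_unmerged:.0f} vs {cost_merged:.0f} key comparisons")  # ~19932 vs ~30
```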

Leveled LSM Tree

The Leveled LSM-Tree further organizes SSTables into levels, reducing write amplification and narrowing the range of files a read must touch. LevelDB pioneered this approach, and Cassandra introduced the strategy in version 1.0 [3].

The leveled design strategy for SSTables is as follows (see the sketch after this list):

    • The size of a single SSTable file is fixed; Cassandra sets it to 5 MB by default;
    • Levels start at Level 0, and the amount of data stored grows with the level, with a constant growth factor between levels. Cassandra's growth factor is 10: if Level 1 holds 1-10 MB then Level 2 holds 10-100 MB, so a 10 TB data volume reaches roughly Level 7;
    • Level 0 is special: it is fixed at 4 SSTables, and their key ranges may overlap; from Level 1 onward, SSTables within a level no longer have overlapping keys;
    • When Level 0 exceeds its capacity, its SSTables are compacted into Level 1; because Level 0 keys overlap, all Level 0 SSTables must be read. When Level 1 exceeds its size threshold, new Level 2 SSTables are created and the original Level 1 SSTables are deleted; if the key range being compacted overlaps more than one SSTable in the next level, those SSTables are rewritten as well, but because SSTable size is fixed, usually only a few SSTables are involved.
Figure: compaction between levels (omitted)
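
A small sketch of the sizing rule quoted above (growth factor 10, Level 1 holding about 10 MB, per the article's numbers), using decimal units as a rough estimate; the helper names are ours.

```python
# Leveled capacities with growth factor 10, and the level a given data volume reaches.

GROWTH_FACTOR = 10
LEVEL1_CAPACITY_MB = 10          # per the article: Level 1 holds up to ~10 MB

def level_capacity_mb(level):
    """Maximum data volume (MB) a level can hold under these assumptions."""
    return LEVEL1_CAPACITY_MB * GROWTH_FACTOR ** (level - 1)

def level_for_volume(total_mb):
    """Lowest level whose capacity covers the total data volume."""
    level = 1
    while level_capacity_mb(level) < total_mb:
        level += 1
    return level

ten_tb_in_mb = 10 * 1000 * 1000   # treating 1 TB as 10^6 MB for a rough estimate
print(level_for_volume(ten_tb_in_mb))   # 7, matching the article's estimate
```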

With multiple sorted SSTables per level, the heavyweight full rewrite of a major compaction is avoided; each compaction only rewrites part of the data, which lowers the write amplification rate.

For reads, using level metadata to pin down the relevant SSTables is clearly more efficient than binary searches and Bloom filter checks across all SSTables. Read efficiency therefore improves significantly; according to [3], 90% of reads touch only a single SSTable when rows are of roughly equal size.

Under this strategy, compactions run more frequently and cause more I/O overhead; for write-intensive workloads, whether the net result is efficient enough is uncertain and must be weighed against the application's needs.
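
To see why per-level metadata narrows reads so much, here is an illustrative sketch; the metadata layout and names are assumptions, not any engine's on-disk format. Because key ranges within a level do not overlap from Level 1 onward, at most one SSTable per level can contain a given key.

```python
# Locating candidate SSTables by per-level (min_key, max_key) metadata.

import bisect

# Assumed metadata: for each level >= 1, a sorted list of (min_key, max_key, sstable_id).
level_metadata = {
    1: [("a", "f", "L1-0"), ("g", "m", "L1-1"), ("n", "z", "L1-2")],
    2: [("a", "c", "L2-0"), ("d", "k", "L2-1"), ("l", "r", "L2-2"), ("s", "z", "L2-3")],
}

def sstables_to_check(key):
    """Return the single candidate SSTable per level whose range may contain the key."""
    candidates = []
    for level, runs in sorted(level_metadata.items()):
        mins = [run[0] for run in runs]
        i = bisect.bisect_right(mins, key) - 1       # last run whose min_key <= key
        if i >= 0 and runs[i][0] <= key <= runs[i][1]:
            candidates.append(runs[i][2])
    return candidates

print(sstables_to_check("h"))   # ['L1-1', 'L2-1'] -- at most one SSTable per level
```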

NewSQL's strategy

NewSQL databases widely use K/V stores as their storage layer, so they basically adopt the LSM-Tree model; both CockroachDB and TiDB use RocksDB in their KV layer. OceanBase takes a different approach to evade the impact of major compaction: it generally runs compaction on an idle replica (follower) so that reads are not blocked, then swaps the replica's role back after the compaction completes.

Meanwhile, K/V storage engines continue to evolve, with other improvements such as Fractal Trees, which we will not expand on here for reasons of space.

2. Sharding

Sharding is conceptually similar to partitioning in an RDBMS. It is the most important characteristic of a distributed database or distributed storage system, the foundation of horizontal scaling, and is widely used in NoSQL systems.

The goal of sharding is to spread data as evenly as possible across multiple nodes, using the storage and processing capacity of many nodes to improve the database's overall performance.

Range & Hash

Although sharding strategies differ across systems, they can broadly be summed up as two approaches: range and hash.

Range sharding favors range queries, while hash sharding distributes data more evenly. In practice range sharding seems more common, and many applications mix the two.
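
A minimal routing sketch of the two approaches; the shard count, split points, and hash function are made-up assumptions for illustration.

```python
# Hash routing spreads keys evenly; range routing keeps adjacent keys together.

import bisect
import hashlib

NUM_SHARDS = 4

def hash_shard(key: str) -> int:
    """Hash routing: even distribution, but adjacent keys scatter across shards."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Range routing: split points define key ranges assigned to shards 0..3.
SPLIT_POINTS = ["g", "n", "t"]   # assumed split keys

def range_shard(key: str) -> int:
    """Range routing: adjacent keys land on the same shard, enabling cheap range scans."""
    return bisect.bisect_right(SPLIT_POINTS, key)

for k in ["order0001", "order0002", "order0003"]:
    print(k, hash_shard(k), range_shard(k))
# All three sequential keys land on the same range shard; their hash shard depends only on the digest.
```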

HBase uses range sharding: data is ordered lexicographically by rowkey, and a region splits into two new regions when it exceeds its size limit. The advantage of range sharding is data locality, which makes range lookups cheap; the disadvantage is equally obvious: hotspots concentrate easily.

For example, in HBase it is generally not recommended to use a business serial number directly as the rowkey, because consecutively increasing numbers are mostly routed to the same RegionServer, so concurrent accesses compete for that RegionServer's resources. To avoid this, it is recommended to encode the rowkey, for example by reversing the sequence number or adding a salt. In essence this is an application-layer design that turns a range shard into something closer to a hash shard, as the sketch below illustrates.
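
Here is an illustrative sketch of the two rowkey techniques mentioned above, reversal and salting; the bucket count and key format are assumptions, not HBase conventions.

```python
# Breaking up a monotonically increasing serial number so consecutive business keys
# no longer map to one contiguous rowkey range.

import hashlib

NUM_SALT_BUCKETS = 8   # assumed number of salt prefixes

def reversed_rowkey(serial: str) -> str:
    """Reverse the digits so the fastest-changing digit comes first."""
    return serial[::-1]

def salted_rowkey(serial: str) -> str:
    """Prefix a stable salt derived from the key itself."""
    salt = int(hashlib.md5(serial.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    return f"{salt:02d}|{serial}"

for serial in ["202400000001", "202400000002", "202400000003"]:
    print(serial, "->", reversed_rowkey(serial), "|", salted_rowkey(serial))
# Reversed keys differ in their leading characters; salted keys spread across buckets,
# at the cost of fanning range scans out over all salt prefixes.
```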

Spanner's underlying storage inherits many of BigTable's design ideas but tweaks sharding: it adds dynamically allocated directories within a tablet to work around the mismatch between range shards and access hotspots [4]. This will be described in more detail later, together with transaction management.

Static Shards & Dynamic Shards

By how shards are generated, sharding can be divided into two classes: static and dynamic.

With static sharding, the number of shards is decided when the system is built, and changing it later is very costly; dynamic sharding determines shards from the data itself, is cheaper to change, and can be adjusted on demand.

The traditional database-plus-proxy scheme of horizontally splitting databases and tables is a common form of static sharding. Several Internet companies we know have made similar designs in large-scale trading systems, splitting data by default into a fixed number of shards, such as 100, 255, 1024, or whatever number you prefer. The shard count can be estimated from the system's target overall service capacity, data volume, and single-node capacity; of course, whether 100 shards or 1024 shards is the right number still involves a degree of guesswork.

Among NoSQL systems, Redis Cluster also uses this static sharding mode, with a fixed 16,384 hash slots (the equivalent of shards).
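
A minimal sketch of this slot-style routing: Redis Cluster itself maps a key with CRC16(key) mod 16384, but the sketch substitutes the standard library's CRC32 purely for illustration, and the slot-to-node assignment below is assumed.

```python
# Fixed-slot routing in the spirit of Redis Cluster's 16,384 hash slots.

import zlib

NUM_SLOTS = 16384

def slot_for_key(key: str) -> int:
    """Map a key to one of the fixed hash slots (illustrative CRC32 stand-in)."""
    return zlib.crc32(key.encode()) % NUM_SLOTS

# Assumed static assignment of slot ranges to three nodes.
def node_for_slot(slot: int) -> str:
    if slot < 5461:
        return "node-a"
    elif slot < 10923:
        return "node-b"
    return "node-c"

key = "user:1001"
slot = slot_for_key(key)
print(key, slot, node_for_slot(slot))
```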

The disadvantage of static sharding is that the shard count is fixed, which together with single-node processing capacity sets a capacity ceiling; it is inflexible, and adjusting the shard count later requires difficult, complex data migration. The advantage is also obvious: because the static sharding strategy is essentially frozen, the dependency on metadata management (shard keys, sharding strategy, and so on) is very low, and in distributed databases such metadata tends to become a single point that obstructs reliability and availability.

In contrast, dynamic sharding is more flexible and fits a wider range of scenarios, so NewSQL databases adopt it, at the cost of more complex metadata management.

In shard handling, NoSQL and NewSQL face very similar problems.

3. Replicas

Because commodity hardware is relatively unreliable, data must be kept reliable through replicas on multiple machines. This article focuses on two issues: replica consistency, and the difference between replica reliability and replica availability.

Replica Consistency

Multiple replicas inevitably introduce replica consistency issues. The CAP theorem is well known and I believe everyone is familiar with it, but it must be said again that consistency in CAP and consistency in transaction management are not the same thing. Ivan has met many people who misunderstand this and use CAP to argue that a distributed architecture cannot achieve strongly consistent transactions, only eventual consistency.

Transactional consistency means that different data entities changed within the same transaction either all succeed or all fail; consistency in CAP is about how consistency is guaranteed among replicas of data at atomic granularity, where the multiple replicas are logically the same data entity.

Replica synchronization can be roughly summed up in three modes (a small availability comparison follows the list):

    • Strong synchronization: an update must complete on all replicas before it is considered successful. The problem with this mode is high latency and low availability: a single operation waits for every replica to be updated, adding significant network overhead and latency, and the whole system is available only if every replica node is working, so any single-point failure makes the entire system unavailable. Assuming a single node's availability is 95%, a three-node replica group has an availability of 95% x 95% x 95% = 85.7%. So although mainstream databases such as Oracle and MySQL offer strong synchronization, it is rarely used in real enterprise production environments.

    • Semi-synchronous: MySQL provides a semi-synchronous mode in which multiple slave nodes replicate data from the master, and the write is considered successful once any one slave has synchronized it. This logical model avoids the problem of strong synchronization, turning the availability of multiple nodes from an "AND" into an "OR" and protecting overall availability. Unfortunately, the implementation has flaws and can degrade into asynchronous replication.

    • Paxos/Raft: participating nodes are divided into roles such as leader and follower; the leader writes data to multiple replicas, and once more than half of the nodes have written successfully, success is returned to the client. This avoids the impact of network jitter or individual node failures on the whole cluster. Other protocols, such as ZooKeeper's ZAB and Kafka's ISR mechanism, differ from Paxos/Raft but point in roughly the same direction.
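
The availability arithmetic above can be extended, under the simplifying assumption that nodes fail independently, to compare "all replicas must be up" with the majority quorum used by Paxos/Raft.

```python
# Reproducing the article's 95% x 95% x 95% figure and adding the majority-quorum case.

from math import comb

P_NODE_UP = 0.95   # assumed availability of a single node
N = 3              # three replicas

# Strong synchronization: every replica must be up.
avail_strong = P_NODE_UP ** N

# Majority quorum: at least 2 of 3 replicas must be up.
avail_quorum = sum(
    comb(N, k) * P_NODE_UP**k * (1 - P_NODE_UP)**(N - k)
    for k in range(2, N + 1)
)

print(f"strong sync: {avail_strong:.1%}, majority quorum: {avail_quorum:.1%}")
# strong sync: 85.7%, majority quorum: 99.3%
```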

Replica Reliability and Replica Availability

Replicas by themselves only guarantee data durability, i.e. that data is not lost. We also face replica availability: whether the data can continue to serve requests. HBASE-10070 illustrates this problem:

HBase stores multiple copies of data through the distributed file system HDFS, but to serve requests a client connects to a RegionServer, which in turn accesses the data on HDFS. Because each region is managed by exactly one RegionServer, the RegionServer is still a single point.

When a RegionServer goes down, the HMaster only notices after some interval; it then assigns a new RegionServer and loads the affected regions, and the whole process can take tens of seconds. In a large cluster, single-point failures happen frequently, and each one brings tens of seconds of local service interruption, greatly reducing HBase's availability.

To address this, HBase introduced the concept of a standby RegionServer, which continues to serve when the primary goes down. But a RegionServer is not a stateless service: it holds data in memory, so there is also the problem of synchronizing data between the primary and standby RegionServers.

So HBase achieves data reliability but still does not fully achieve data availability. The approach of CockroachDB and TiDB is to implement distributed K/V storage on top of Raft, which makes the distinction between in-memory and on-disk data on a single node irrelevant to the upper layers and thereby ensures data availability.

4. Transaction Management

Because of its complexity, distributed transaction processing was the first feature NoSQL dropped. But as large-scale Internet applications spread, its practical importance has become increasingly prominent, and it is once again a problem NewSQL cannot avoid. With NewSQL's improvements in transaction processing, the past decade-plus of database evolution has finally come close to completing a spiral ascent.

Given the complexity of distributed transaction management, Ivan only describes it briefly here and will expand on it in subsequent articles.

In terms of control mechanism, NewSQL transaction management divides into lock-based and lock-free approaches, where lock-free schemes usually coordinate transaction conflicts based on timestamps. In terms of resource acquisition, it divides into optimistic and pessimistic protocols, which differ in their expectation of resource conflicts: a pessimistic protocol assumes conflicts are frequent and therefore grabs resources as early as possible to ensure the transaction completes smoothly; an optimistic protocol assumes conflicts are rare and only grabs resources at the latest tolerable moment.
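
As a rough sketch of the lock-free, timestamp-based idea, here is a generic optimistic scheme of our own construction (not any product's algorithm): transactions read freely, record the versions they saw, and validate at commit time.

```python
# Optimistic, timestamp-based concurrency control in miniature (illustrative only).

import itertools

_ts = itertools.count(1)                 # assumed global, monotonically increasing timestamps
store = {"x": (0, 100), "y": (0, 200)}   # key -> (commit_ts, value)

class OptimisticTxn:
    def __init__(self):
        self.start_ts = next(_ts)
        self.reads = {}      # key -> commit_ts observed at read time
        self.writes = {}     # key -> new value

    def read(self, key):
        commit_ts, value = store[key]
        self.reads[key] = commit_ts
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validate: abort if any record we touched was committed after we observed it
        # (unread keys are treated conservatively).
        for key in set(self.reads) | set(self.writes):
            if key in store and store[key][0] > self.reads.get(key, 0):
                return False                      # conflict: caller should retry
        commit_ts = next(_ts)
        for key, value in self.writes.items():
            store[key] = (commit_ts, value)
        return True

t1, t2 = OptimisticTxn(), OptimisticTxn()
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 10)
print(t1.commit(), t2.commit())   # True False -- the later committer detects the conflict
```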

Ivan sums up this style of transaction design as: "the best way to handle a distributed transaction is not to do a distributed transaction at all, but to turn it into a local transaction." Early versions of OceanBase followed a similar idea, using a dedicated server, the UpdateServer, to centralize transaction operations.

Percolator

Percolator [5] is a web-indexing system developed by Google for incremental processing. Before it, Google used MapReduce to rebuild the page index in full: the processing time depended on the total number of pages in the corpus and was very long, and even if only a small fraction of pages changed in a day, the index was still rebuilt in full, wasting enormous resources and time. With Percolator's incremental processing, processing time dropped sharply.

The paper also presents a distributed transaction model that is a variation of the two-phase commit protocol: it simplifies the work of the second phase to the extreme and greatly improves processing efficiency.

In terms of implementation, Percolator builds distributed transaction management on top of BigTable using two mechanisms, MVCC and locks; all records written in a transaction are new versions rather than updates of existing versions. The benefit is that reads are never blocked during the whole transaction. (A rough sketch follows the list below.)

    • Locks in a transaction are divided into a primary lock and secondary locks: the record touched by the first operation in the transaction gets the primary lock, and the other records are given secondary locks one by one as the transaction proceeds, each pointing back to the primary-lock record. On a lock conflict, the lower-priority transaction releases its locks and rolls back;
    • After all records in the transaction have been written, it enters the second phase, in which only the state of the primary lock needs to be updated, and the transaction is done;
    • Cleaning up the secondary locks is left to asynchronous processes and to later reads that encounter them: since each secondary lock keeps a pointer to the primary-lock record, an asynchronous process or a read can easily determine the correct state of the lock from the primary record and update it accordingly.
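
The following is a heavily simplified sketch of the primary/secondary-lock flow described above; the row layout and function names are assumptions for illustration and do not reproduce Percolator's actual BigTable schema.

```python
# Prewrite all records with locks pointing at one primary, then commit by flipping
# only the primary's state; secondary cleanup can happen lazily.

rows = {}   # key -> {"data": {start_ts: value}, "lock": None or dict, "write": {commit_ts: start_ts}}

def _row(key):
    return rows.setdefault(key, {"data": {}, "lock": None, "write": {}})

def prewrite(writes, start_ts):
    """Phase 1: stage every write and lock it; the first key acts as the primary."""
    primary = writes[0][0]
    for key, value in writes:
        row = _row(key)
        if row["lock"] is not None:          # lock conflict: abort and let the caller retry
            return False
        row["data"][start_ts] = value
        row["lock"] = {"primary": primary, "start_ts": start_ts}
    return True

def commit(writes, start_ts, commit_ts):
    """Phase 2: committing the primary decides the transaction; secondaries follow lazily."""
    primary = writes[0][0]
    prow = rows[primary]
    prow["write"][commit_ts] = start_ts      # the commit point: the transaction is now decided
    prow["lock"] = None
    for key, _ in writes[1:]:                # in Percolator this cleanup may happen asynchronously
        rows[key]["write"][commit_ts] = start_ts
        rows[key]["lock"] = None

txn = [("account:A", 90), ("account:B", 110)]
if prewrite(txn, start_ts=10):
    commit(txn, start_ts=10, commit_ts=11)
print(rows["account:A"]["write"], rows["account:A"]["lock"])   # {11: 10} None
```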

Other aspects of distributed transaction management, including lock-free concurrency control and whether a global clock is necessary, will be discussed in follow-up articles.

II. Conclusion

The original idea was to address readers with several typical technical backgrounds, unpacking "distributed databases" from different directions and elaborating on some of the technical points, so that readers from different fields could gain some understanding of the relevant technologies and lay groundwork for those interested in studying further.

As the analysis deepened, the scope of the article grew too large to control, so the key technologies are covered in uneven depth. Ivan will try to fill in the gaps in the follow-up articles of this series; there are bound to be errors and omissions, and discussion and corrections are welcome.

References:

[1] Jiang Chengyao, MySQL Internals: InnoDB Storage Engine, China Machine Press, 2011

[2] Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree)

[3] Leveled Compaction in Apache Cassandra, https://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra

[4] James C. Corbett, Jeffrey Dean, Michael Epstein, et al., Spanner: Google's Globally-Distributed Database

[5] Daniel Peng and Frank Dabek, Large-scale Incremental Processing Using Distributed Transactions and Notifications
