With the rise of big data and its adoption in real business across industries, distributed databases have become a popular topic. This article covers the subject in three parts: the first reviews the history and current state of distributed databases, to give a comprehensive picture of the field; the second describes the architecture of TiDB and some recent progress; the last discusses possible future trends for distributed databases, based on what we have learned while building TiDB.
I. The history and current state of distributed databases
1. Starting from the stand-alone database
Relational databases date back to the 1970s, and they have two basic functions:
Save the data;
Satisfy the user's need to compute over the data.
The first is the most basic requirement: if a database cannot store data safely and completely, nothing else it does matters. Once that is satisfied, users then want to make use of the data, whether through simple queries, such as looking up a value by key, or complex ones, such as aggregations, joins, and group-by operations over the data. The second requirement is usually harder to satisfy than the first.
In the early stages of database development, these two requirements were not hard to meet. There were many excellent commercial database products, such as Oracle and DB2, and after 1990 the open-source databases MySQL and PostgreSQL appeared. These systems kept improving single-instance performance, and together with hardware advances following Moore's law, they could usually keep up with business growth.
Then came the Internet, and especially the rise of the mobile Internet: data volumes exploded, while the pace of hardware progress slowed and people began to worry that Moore's law would fail. In this situation, stand-alone databases found it increasingly difficult to meet user needs, even the most basic one of simply storing the data.
2. Distributed databases
Around 2005, people began to explore distributed databases, which set off the NoSQL wave. The primary problem these databases solved was that a single machine could no longer hold all the data; representative systems include HBase, Cassandra, and MongoDB. To scale capacity horizontally, they tended to abandon transactions or to offer only a simple KV interface. The simplified storage model made the storage systems easier to build, but weakened their support for the business layer.
(1) NoSQL
HBase is a typical representative. It is an important product in the Hadoop ecosystem and an open-source implementation of Google's BigTable, so let us start with BigTable.
BigTable is a distributed database used internally at Google. Built on GFS, it makes up for a distributed file system's weaknesses at inserting, updating, and randomly reading small objects. HBase follows the same architecture, with HDFS as the underlying storage. HBase itself does not actually store data: the write-ahead logs and SST files live on HDFS, while each Region server serves fast reads from its memtable, writes to the log first, and compacts in the background, turning random writes into sequential ones. Data is logically partitioned into Regions, and load balancing is achieved by adjusting which Region server is responsible for which Region intervals; as writes accumulate, a Region splits, and the resulting Regions are redistributed across Region servers by the load-balancing policy.
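To make the write path concrete, here is a minimal, simplified Python sketch of the log-then-memtable idea described above; the class, file layout, and flush threshold are invented for illustration and are not HBase's actual implementation.

```python
# Simplified LSM-style write path: append to a write-ahead log first, then
# update an in-memory memtable; when the memtable grows too large, flush it as
# a sorted, immutable file. This is an illustration of the idea only.

class RegionServerSketch:
    def __init__(self, wal_path, flush_threshold=1000):
        self.wal = open(wal_path, "a")   # persistent log (on HDFS in HBase)
        self.memtable = {}               # in-memory buffer for recent writes
        self.sstables = []               # flushed, immutable sorted runs
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # 1. Durability first: the random write becomes a sequential log append.
        self.wal.write(f"{key}\t{value}\n")
        self.wal.flush()
        # 2. Serve reads from memory until the memtable is flushed.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Persist the memtable as a sorted run; background compaction would
        # later merge these runs into larger files.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):  # newest run first
            for k, v in sst:
                if k == key:
                    return v
        return None

rs = RegionServerSketch("/tmp/region-wal.log", flush_threshold=2)
rs.put("row1", "a"); rs.put("row2", "b"); rs.put("row3", "c")
print(rs.get("row1"))  # "a", served from the flushed run
```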
As mentioned above, HBase itself does not store data: a Region is only a logical concept, and the data sits as files on HDFS. HBase does not concern itself with replica count, replica placement, or the horizontal scaling of storage; all of that is delegated to HDFS. Like BigTable, HBase provides row-level consistency. From the CAP point of view it is a CP system; regrettably, it goes no further and does not provide ACID cross-row transactions.
HBase's strength is that throughput scales almost linearly as Region servers are added, HDFS itself scales horizontally, and the whole system is mature and stable. But HBase still has shortcomings. First, the Hadoop stack is written in Java, and GC pauses are an unavoidable source of latency. Second, because HBase does not store data itself, every interaction with HDFS costs an extra layer of performance. Third, like BigTable, HBase does not support cross-row transactions, so teams inside Google built transaction layers such as Megastore and Percolator on top of BigTable. Jeff Dean has admitted that not adding such transactions to BigTable is something he regrets, which is one of the reasons Spanner appeared.
(2) The redemption of the RDBMS
Besides NoSQL, the RDBMS camp also worked hard to adapt to changing business requirements, namely through database middleware and sharding (splitting databases and tables). Building such middleware involves a lot of work: parsing the SQL, extracting the shard key, routing the request according to the shard key, and then merging the results. The middleware layer also has to maintain session and transaction state, and most such systems do not support cross-shard transactions, which inevitably makes them cumbersome for applications, since the business layer must manage transactional state itself. On top of that come elastic scaling and automatic failure recovery; as clusters grow, the complexity of operations and DDL rises by orders of magnitude.
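As a concrete illustration of what such middleware does for a single-shard query, here is a small sketch; the table, shard key, backend addresses, and hash rule are all hypothetical, and a real middleware additionally has to merge cross-shard results and track session and transaction state.

```python
# Naive shard routing: extract the shard key from the statement, pick a
# backend by a hash rule, and forward the query there.

import re

BACKENDS = ["mysql-shard-0:3306", "mysql-shard-1:3306",
            "mysql-shard-2:3306", "mysql-shard-3:3306"]

def route(sql, shard_key="user_id"):
    # Toy "parser": look for an equality predicate on the shard key.
    m = re.search(rf"{shard_key}\s*=\s*(\d+)", sql)
    if m is None:
        # No shard key: the query must be fanned out to every shard and the
        # results merged, which is where cross-shard transactions break down.
        return BACKENDS
    shard = int(m.group(1)) % len(BACKENDS)
    return [BACKENDS[shard]]

print(route("SELECT * FROM orders WHERE user_id = 42"))  # single shard
print(route("SELECT COUNT(*) FROM orders"))              # scatter-gather
```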
In China there are many well-known projects in this field, such as Alibaba's Cobar and TDDL, the later community improvement of Cobar called MyCAT, and the Atlas project open-sourced by Qihoo 360; all belong to this class of middleware. A well-known open-source project in the middleware camp is YouTube's Vitess, a comprehensive middleware product with a built-in hot-data cache, dynamic horizontal sharding, and read/write splitting, which also makes the whole project very complex.
Another project worth mentioning is PostgreSQL-XC. Its overall architecture somewhat resembles early versions of OceanBase: a central node coordinates distributed transactions while the data is spread across the storage nodes. It is probably the best distributed scaling solution currently available in the PostgreSQL community, and many people are building their own systems on top of it.
3. The rise of NewSQL
After Google published the Spanner and F1 papers in 2012-2013, the industry saw for the first time that the relational model and NoSQL-style scalability could be combined in a large-scale production system. Spanner uses special hardware (GPS clocks plus atomic clocks) to solve the clock-synchronization problem elegantly, and in distributed systems clocks are the most vexing problem of all. Spanner's power lies in the fact that even when two data centers are very far apart, the time uncertainty obtained through the TrueTime API is guaranteed to stay within a very small bound (within 10 ms) without requiring any communication. Spanner's bottom layer is still built on a distributed file system, though Google has said this is a point that could be optimized in the future.
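The commit-wait idea behind TrueTime can be sketched in a few lines: the API returns an interval guaranteed to contain the true time, and a transaction only acknowledges commit after waiting out the uncertainty, so its timestamp is in the past at every replica. The uncertainty bound below is purely illustrative, and the code is a sketch of the idea, not Spanner's implementation.

```python
import time
from collections import namedtuple

TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

EPSILON = 0.005  # assumed clock uncertainty in seconds (illustrative only)

def tt_now():
    """TrueTime-style interval guaranteed to contain the true time."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit(apply_mutations):
    # Pick a commit timestamp that is not earlier than the true time anywhere.
    commit_ts = tt_now().latest
    apply_mutations(commit_ts)
    # Commit wait: only acknowledge once commit_ts is definitely in the past,
    # i.e. TT.now().earliest > commit_ts, so later reads cannot miss it.
    while tt_now().earliest <= commit_ts:
        time.sleep(0.001)
    return commit_ts

print(commit(lambda ts: None))  # returns after roughly 2 * EPSILON
```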
Most of the database storage inside Google keeps 3 to 5 replicas, with important data kept in 7, and these replicas are spread across data centers on multiple continents. Thanks to the extensive use of Paxos, latency can be kept within an acceptable range (write latencies above 100 ms), and Paxos also brings automatic failover, so even the loss of an entire data center is transparent and invisible to the business layer. F1 is built on top of Spanner and exposes a SQL interface to the outside. It is a distributed MPP SQL layer that does not itself store data; it translates the client's SQL into operations on KV and calls Spanner to complete the request.
The appearance of Spanner and F1 marked the first time NewSQL ran in a production environment, providing all of the following features in a single system:
SQL support
ACID transactions
Horizontal scalability
Automatic failover
Multi-data-center disaster tolerance
With so many enticing features, a large amount of business inside Google has migrated from BigTable to Spanner. This is bound to have a huge influence on how the industry thinks; as with Hadoop, Google's infrastructure software tends to show where the technology is heading before the community gets there.
The Spanner/F1 papers attracted wide attention in the community, and followers soon appeared. The first was CockroachDB from Cockroach Labs. Its design is similar to Spanner's, but instead of the TrueTime API it uses an HLC (hybrid logical clock), that is, NTP plus a logical clock, in place of TrueTime timestamps. In addition, CockroachDB chose Raft as the data replication protocol, RocksDB as the underlying storage, and the PostgreSQL wire protocol as the external interface.
CockroachDB's technical choices are fairly aggressive, for example relying on the HLC for transactions. Its timestamp accuracy cannot reach a bound within 10 ms the way TrueTime can, so the commit-wait time has to be chosen by users themselves according to the clock error of their NTP service, which is quite unfriendly to users. Of course, these choices also buy good usability: all the logic lives in a single component and deployment is very simple, which is a very big advantage.
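For reference, a minimal hybrid logical clock can be written as below. This follows the textbook HLC algorithm (a physical part that tracks the local clock plus a logical counter to break ties), not CockroachDB's actual code.

```python
import time

class HLC:
    """Hybrid logical clock: timestamps are (l, c), where l tracks physical
    time and c is a logical counter for events sharing the same l."""

    def __init__(self):
        self.l = 0
        self.c = 0

    def _pt(self):
        return int(time.time() * 1000)  # NTP-synchronized wall clock, in ms

    def now(self):
        """Timestamp for a local or send event."""
        prev = self.l
        self.l = max(prev, self._pt())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def update(self, lm, cm):
        """Merge a timestamp (lm, cm) received from another node."""
        prev = self.l
        self.l = max(prev, lm, self._pt())
        if self.l == prev and self.l == lm:
            self.c = max(self.c, cm) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == lm:
            self.c = cm + 1
        else:
            self.c = 0
        return (self.l, self.c)

clock = HLC()
a = clock.now()
b = clock.update(a[0] + 5, 0)  # message from a node whose clock runs ahead
assert b > a                    # causally later events get larger timestamps
```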
The other follower is TiDB, which we are building. The project has been in development for two years, and of course we spent a long time preparing before we started. I will introduce the project next.
II. The architecture and recent progress of TiDB
TiDB is essentially a more orthodox implementation of Spanner and F1. It does not follow CockroachDB's choice of fusing SQL and KV into one component, but keeps them separate, as Spanner and F1 do. The following is the TiDB architecture diagram:
This layered approach runs through the entire TiDB project. It makes testing, rolling upgrades, and controlling the complexity of each layer much easier. TiDB also chose to be compatible with the MySQL protocol and syntax, so ORM frameworks and operational tools from the MySQL community can be applied to TiDB directly. Like Spanner, TiDB is a stateless MPP SQL layer; the distributed storage and distributed transactions underneath are provided by TiKV. TiKV's distributed transaction model is based on Google's Percolator model, with many optimizations on top. Percolator's strength is its high degree of decentralization: the whole cluster needs no standalone transaction-management module, and transaction commit state is in fact scattered across the metadata of the individual keys involved. The only thing the model depends on is a single timestamp service. On our system, this service can allocate more than 4 million monotonically increasing timestamps per second in the extreme case, which is enough for almost any workload (scenarios at Google's scale are, after all, rare), and in TiKV the timestamp service itself is highly available, with no single point of failure.
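The timestamp service can be thought of as a timestamp oracle (TSO) that hands out strictly increasing integers and persists only a high-water mark, in batches, which is how such a service can reach millions of allocations per second. The sketch below illustrates that technique with an invented storage stub; it is not PD's actual implementation.

```python
import threading

class TimestampOracle:
    BATCH = 100_000  # durably reserve timestamps in windows of 100k

    def __init__(self, storage):
        self.lock = threading.Lock()
        self.storage = storage               # durable store for the high-water mark
        self.next_ts = storage.load_limit()  # never reuse timestamps after restart
        self.limit = self.next_ts

    def get_ts(self):
        with self.lock:
            if self.next_ts >= self.limit:
                # Persist the next window before handing out anything in it,
                # so a crash can only skip timestamps, never repeat them.
                self.limit += self.BATCH
                self.storage.save_limit(self.limit)
            self.next_ts += 1
            return self.next_ts

class FakeStorage:
    """Stand-in for a durable, replicated store; a real TSO would persist the
    limit through something highly available rather than process memory."""
    def __init__(self):
        self._limit = 0
    def load_limit(self):
        return self._limit
    def save_limit(self, v):
        self._limit = v

tso = TimestampOracle(FakeStorage())
print(tso.get_ts(), tso.get_ts())  # strictly increasing
```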
Above is the TiKV architecture diagram. Like CockroachDB, TiKV chose Raft as the foundation of the entire database. One difference is that TiKV is written in Rust; as a language with no GC and no runtime, it leaves more room to squeeze out performance. Replicas on different TiKV instances together form a Raft group, and PD is responsible for scheduling the placement of replicas; by configuring the scheduling policy, the replicas of a Raft group can be guaranteed never to land on the same machine, rack, or data center.
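The isolation constraint being enforced can be illustrated with a small check over replica labels; the label names and store list below are hypothetical and only show the rule being verified, not PD's scheduler.

```python
def violates_isolation(replica_stores, level="rack"):
    """Return True if two replicas of one Raft group share the given label."""
    seen = set()
    for store in replica_stores:
        label = store["labels"][level]
        if label in seen:
            return True
        seen.add(label)
    return False

group = [
    {"id": 1, "labels": {"host": "h1", "rack": "r1", "dc": "dc1"}},
    {"id": 4, "labels": {"host": "h7", "rack": "r2", "dc": "dc1"}},
    {"id": 9, "labels": {"host": "h3", "rack": "r1", "dc": "dc2"}},
]
print(violates_isolation(group, "rack"))  # True: two replicas share rack r1
print(violates_isolation(group, "host"))  # False: all hosts differ
```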
Beyond the core TiDB and TiKV, we also provide a number of easy-to-use tools for data migration and backup. For example, Syncer can not only replicate data from a single MySQL instance into TiDB, but also aggregate data from multiple MySQL instances into one TiDB cluster, and can even collect data from databases that have already been sharded into multiple databases and tables. This makes data synchronization much more flexible and convenient.
TiDB is about to release its RC3 version, and GA is expected in June. The upcoming RC3 release contains a great deal of work on MySQL compatibility, the SQL optimizer, system stability, and performance. For OLTP scenarios the focus is on optimizing write performance. RC3 also adds privilege management, so users can control access to data in the MySQL way. For OLAP scenarios, a lot of work has gone into the optimizer, including optimization of more statements and support for the SortMergeJoin and IndexLookupJoin operators. Memory usage has also been heavily optimized; in some scenarios it drops by 75%.
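Because TiDB speaks the MySQL protocol and manages privileges the MySQL way, any standard MySQL client library can be used to manage users. The sketch below uses PyMySQL with hypothetical host, user, and database names; 4000 is TiDB's usual listening port, and whether a given statement is accepted depends on the version you run.

```python
import pymysql

# Connect to TiDB exactly as if it were a MySQL server (hypothetical endpoint).
conn = pymysql.connect(host="tidb.example.com", port=4000,
                       user="root", password="", database="test")
with conn.cursor() as cur:
    # MySQL-style user and privilege management.
    cur.execute("CREATE USER 'analyst'@'%' IDENTIFIED BY 'secret'")
    cur.execute("GRANT SELECT ON test.* TO 'analyst'@'%'")
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())
conn.close()
```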
Besides the optimizations to TiDB itself, we are working on a new project called TiSpark. In short, its goal is to let Spark access TiDB better. Spark can in fact already read data from TiDB through the JDBC interface, but there are two problems: 1. data can only be read through a single TiDB node, and all of it has to flow from TiKV through that TiDB node; 2. the access cannot be combined with Spark's optimizer. We want to integrate with the Spark optimizer so that filters and aggregations can be accelerated by TiKV's distributed computing capability. Development has started, and we expect to open-source the project soon, with a first version in May.
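For comparison, the current JDBC path looks roughly like the PySpark sketch below (host, table, and credentials are hypothetical). Every row funnels through the single TiDB node named in the URL, and the filter and aggregation run inside Spark rather than being pushed down to TiKV, which is exactly what TiSpark aims to change.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tidb-jdbc-read").getOrCreate()

# Read a table from TiDB over the MySQL JDBC driver, through one TiDB node.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://tidb.example.com:4000/test")
          .option("dbtable", "orders")
          .option("user", "analyst")
          .option("password", "secret")
          .load())

# This filter and aggregation run in Spark after all rows have already passed
# through the single TiDB node; with TiSpark they could be pushed down to TiKV.
orders.filter(orders.amount > 100).groupBy("user_id").count().show()
```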
III. Future trends in distributed databases
As for the future, I think databases will follow several trends, which are also the goals of the TiDB project:
1. Databases will move to the cloud along with the business. In the future all business will run in the cloud, whether private or public; the operations team may no longer touch real physical machines, only isolated containers or "computing resources". This is a challenge for databases, because a database is inherently stateful: the data ultimately lives on physical disks, and the cost of moving data can be far greater than the cost of moving a container.
2. Multi-tenancy will become standard. One large database will carry all the business, with data unified at the bottom layer and isolation at the upper layers provided by permissions, containers, and similar techniques, while sharing and scaling the data becomes very simple. Combined with the cloud trend in point 1, the business layer will no longer need to care about the capacity or topology of physical machines; it only needs to assume that underneath it sits an effectively infinite database platform, with no more worries about single-machine capacity or load balancing.
3. OLAP and OLTP workloads will converge. Once users have stored their data, they need efficient ways to access it, even though OLTP and OLAP require very different implementations at the SQL optimizer/executor layer. In the past, users typically relied on ETL tools to synchronize data from the OLTP database to the OLAP database, which wastes resources and also reduces the freshness of OLAP results. A better experience for users is to read and write data with a single set of standard syntax and semantics.
4. In future distributed database systems, backward replication methods such as master-slave log shipping will be replaced by stronger distributed consensus algorithms such as Multi-Paxos and Raft. Manual database operations cannot keep up with managing large-scale clusters; all recovery and high availability will be highly automated.