The Evolution and Development Philosophy of the TiDB Architecture


This article comes from the cover story of the February 2017 issue of CSDN's Programmer magazine.
Building a database from zero raises a long list of questions: which language to use, how to design the overall architecture, whether to open source, how to test... Far too many things need to be considered.

In this article, PingCAP co-founder and CTO Huang Dongxu gives a detailed account of TiDB's development process and reconstructs the evolution of its architecture.

About two years ago, I had some experience with MySQL sharding (splitting databases and tables) and middleware. At the time, to shard in the middleware layer and expand a 16-node MySQL deployment to 32 nodes, we rehearsed for almost a month in advance and then spent a week going live. I wondered: could there be a database that would free us from ever thinking about sharding again? We had just finished Codis, and felt that a distributed approach was the right direction. I had also been following the latest academic work on distributed databases, including the Spanner and F1 papers Google published in 2013. So I decided to simply start writing a database and solve MySQL's scalability problem at the root.

Once the decision was made, the problems turned out to be very complex: which language to choose, how to structure the whole system, whether to open source... There is one crucial thing about basic software: writing it is not the hard part; the hard part is proving that what you wrote is correct. This matters especially to the businesses on top, whose correctness rests entirely on the correctness of the underlying software. So for a distributed system, what does "written correctly" mean, and how do you test for it? These questions occupied me for a long time.

Everything is hard at the beginning. We decided to calm down and set one goal first: solve MySQL's problem. MySQL is a standalone database with no way to scale out. We chose MySQL compatibility, starting with compatibility at the protocol and grammar level, because the community has a massive body of tests built around them. The second reason is migration cost: users can migrate very smoothly. The third is that since everything is hard at the start, you need a clear goal; picking a single goal puts the least psychological pressure on the developers. After setting the target, our founding team of three left our original company, took a sizeable round of venture capital, and formally began.

The simplest MySQL-compatible solution is to use MySQL itself. To ship something as soon as possible, we started with a simple version: reuse MySQL's front-end code and implement a distributed storage engine underneath. This seemed quite simple, so we were very optimistic and thought the strategy was perfect.

I finished the first version of this architecture in six weeks, around April 2015, but never had the nerve to open-source it: although it ran, its performance was completely unacceptable. I wondered why it was so slow, and went through it layer by layer wanting to make changes, only to find the amount of work enormous. The MySQL SQL optimizer, the transaction model, and so on left no place to start. As the architecture diagram showed, the MySQL engine layer exposes so little that there was simply no way in.

So the first experiment failed. In hindsight, there is no way around writing your own SQL parser and optimizer, so we simply decided to write everything from scratch. The only consolation was that I could finally use our favorite programming language: Go.

Unlike most engineers who build this kind of software, we chose to write from the top down, starting with the topmost SQL layer. I wanted to make sure it looked exactly like MySQL, including the network protocol and the grammar layer. So we wrote everything from top to bottom: TiDB's network protocol, SQL parser, SQL optimizer, executor, and so on. This phase lasted about three months, and from it we gradually distilled a few pieces of practical development philosophy.

First, all problems in computer science can be solved by another level of abstraction.

By the time we completed version 0.2, TiDB was essentially just a SQL parser and could not store data at all, because the storage engine underneath was not implemented. I wanted first to make sure the database was right at the SQL layer, and to have it pass the full MySQL test suite. As for the storage underneath, a fake or in-memory implementation was enough, as long as my SQL ran correctly.

While grinding through the TiDB SQL parser we also did many other things. We collected every test we could find: MySQL unit tests, SQL logic tests, ORM test suites, and so on; to date we have roughly ten million integration test cases. Another thing we did was abstract the concept of a storage engine into a very thin set of interfaces, so that TiDB could plug into any KV engine. Most KV engines, such as LevelDB and RocksDB, have very clear interface semantics.

A few months later the team had grown to more than ten people. Because I strictly required every layer to be divided by interfaces, work at each layer could proceed in parallel, which was a great advantage for moving the whole project forward.

Around September 2015 we open-sourced TiDB: the first version, without a storage engine, perhaps the first database in history that could not store data. It was very popular on Hacker News and was recommended on the front page.

Second, talk is cheap, show me the tests.

For basic software, tests matter more than code. For example, when you submit a feature, I cannot directly judge whether to merge it; I need to see your tests. TiDB now runs on GitHub, and if a new commit lowers the project's overall test coverage, I do not allow it to be merged into the trunk branch. We are very strict about this. The hardest part of building a database is not writing it but proving it is right, and testing a distributed system is even harder than testing a single-machine one: every node can crash, the latency of every network link can fluctuate, and all kinds of anomalies can occur. When we built the database, the first step was to finish the SQL layer; the second was to abstract every IO operation and every interaction between cluster nodes into an interface, so that we could replay an entire run, down to the order in which TCP/IP packets were received. Once a bug is found, we replay it and turn it into a unit test. Whether it is a new developer or a new module, we never trust the "human"; we only trust the machine. Only strong tests can keep the project within a controllable range.

Later came version 0.5. With the SQL layer done and almost completely separated from the storage layer, we could finally, as in 0.1, plug a distributed engine underneath; this time we picked HBase. HBase was a deliberately staged choice: since our SQL layer was stable enough, we wanted to try a distributed engine, but I could not introduce too many variables into the architecture at once, so we chose the most stable distributed engine we could find on the market and saw whether the whole system could run. It ran, but our requirements were higher, so we later threw HBase away. Still, plugging in HBase was a sign that the interface abstraction of our SQL layer was stable enough, and our tests robust enough, to let us move down the stack and build the distributed storage ourselves.

The architecture at that point looked roughly like this:

At the top is the business layer: you can connect with any MySQL client, and if your data grows large you no longer need to shard; just treat it as an infinitely large MySQL. The user experience is very good. Because TiDB is designed to be stateless and stores no data itself, you can deploy any number of TiDB instances behind a load balancer. At the bottom, at first, was HBase. That was November, six months from the start.

Because a huge body of tests backed us, the design process was not too difficult. But there is a question here: when doing technology selection with so much freedom, how do you control your appetite and your ambition to expand? Your enemy is not the budget but complexity. How you control the complexity of each layer is critical; for an architect in particular, all the work comes down to avoiding complexity and improving development efficiency and stability.

At that point we chose a very niche programming language: Rust. It is a high-performance language with no GC and no runtime; much of its innovation happens in the compiler. Its biggest feature is safety, safety, and safety. I think of it as a more modern C++, whereas C++'s biggest problem is that it easily chops off your hands and feet if you are not familiar with it. When I chose Rust, many friends asked me why. To be honest, I was scared at first: the language was new and relatively niche, and the community not that big, but for our team's situation at the time it was the best choice. We wrote TiKV in Rust, and it became one of the largest open-source projects in the Rust community. Because we adopted Rust so early, the Rust team has repeatedly invited us to share our experience, and we enthusiastically embrace the Rust community. The Rust community's weekly newsletter even carries a fixed column called "This Week in TiKV", built just for us :)

We spent the winter of 2015 in a tangle. First, Rust was brand new and nobody on the team had touched it before; second, we wanted four things at once, elastic scaling, true high availability, high performance, and strong consistency, each of which is very hard on its own.

What to do? Embrace the community and stop building everything ourselves: first, our headcount was limited; second, reuse is a good habit, and if others have already built something, there is no need to repeat the work. To build a truly highly available database, we surveyed highly available distributed storage and found etcd. The algorithm behind etcd is called Raft, a consensus algorithm equivalent to Paxos, and the most stable implementation of Raft is the one in etcd. Moreover, etcd's Raft is genuinely proven in production environments. I read the etcd source carefully: every state transition is abstracted into an interface, so it can be tested entirely apart from the network, the IO, and the hardware environment. I think this design is excellent; it is why CoreOS's etcd is used for metadata storage by systems like Kubernetes, and its quality is very high and its performance excellent. But the problem is that etcd is written in Go, while we had already decided to build the underlying storage in Rust. For an algorithm like Paxos, I don't believe any company outside of Google's Chubby has the ability to implement it correctly. Raft is different: it is still hard, but it is achievable. So, for the sake of quality and to speed up our development, we did something rather crazy: we translated the etcd Raft state machine into Rust, line by line. And the first thing we ported was all of etcd's own test cases: we wrote exactly the same tests to make sure nothing went wrong in the porting process.

TiDB's underlying engine could not store data at first, so eventually it was time to choose a real storage engine, and we regarded this as a huge pit. Having a small team write a local storage engine from scratch is basically unrealistic, so we chose to build on RocksDB. RocksDB can be regarded as a standalone key-value engine; its predecessor is LevelDB, the key-value storage engine Google open-sourced around 2011. The structure behind RocksDB is the LSM-tree, a data structure that is very friendly to writes, and when your machine has relatively large memory its read performance is great as well. One more important piece of work around a storage engine is tuning it for your machine's characteristics: just as MySQL tuning quickly turns into black magic, RocksDB tuning is something you could write a book about. As you can see, the new generation of distributed databases all choose RocksDB as their storage engine; I think this is the trend.

From the start of winter 2015, we ground out five months of code in Rust, until April 1, 2016, when TiKV was finally open-sourced.

As you can see from the diagram, at the bottom is RocksDB, and the distributed layer above it uses Raft. Although these two layers sit in our stack, their quality is guaranteed with the help of our allies in the community. Above Raft is MVCC, and from there up everything is written by us. So TiKV is finally a storage engine that scales elastically, supports ACID transactions with global consistency, is highly available across data centers, and performs very well, because underneath it I am not sitting on a filesystem like HDFS.

In fact, from open-sourcing until now we have kept working on TiKV's performance tuning, stability, and much else, but from an architectural perspective I feel this architecture will see no big changes for at least the next five years. The point I keep stressing is that complexity is your biggest enemy; I would rather hold a stable posture now to cope with changing future demands. Fortunately, the demands placed on a database do not change much.

Third, where there's a metric, there's a way.

Let's talk about metrics. An important task for an architect is to find the bottlenecks in the system and solve them. We found many, many points in the database field that, if solved, yield a tenfold improvement. My view is that anything with metrics, anything that can be monitored, can be solved. That is: "Where there's a metric, there's a way." Once you can observe performance repeatably, performance problems become the easiest to solve; writing the system correctly remains the hardest.

In general, companies judge metrics with monitoring tools. For a small team like ours, or any team that embraces the community, building our own is simply not worth the candle: you spend great effort writing one that is still not as good as the community's, and it is very troublesome to maintain. So we embedded Prometheus and Grafana into the database. Prometheus is extremely hot in Silicon Valley right now. It is essentially a distributed time-series database, well suited to metrics collection and performance tuning, and one thing it does particularly well is provide a DSL for querying and for writing alerting rules. It does not come with a good-looking dashboard, so another friend in the community stepped up and said, I'll build you a good-looking interface; that project is Grafana, a visualization dashboard that lets you customize the position, type, style, size, and width of each panel. For collection, Prometheus provides two modes, push and pull, and the instrumentation code has almost no impact on performance.

A few last things to add. The first is tooling. For a team born of the internet, tools are something we value greatly, because they minimize the cost of migrating a business. For engineers doing refactoring or building basic software in big companies, I think the best way to work is silently, like rain soaking into the soil: the user has no idea you are refactoring anything underneath. That is the most perfect state. For example, when I did Codis, my requirement was that a user currently on Twemproxy must be able to move to the new solution without changing a single line of code, ideally without even knowing a migration is happening. I think this should be the self-cultivation of every infrastructure team.

The second is: no surprises. If you are building a database and claim to be compatible with MySQL, then any behavior you exhibit that differs from MySQL will scare the user. This is a very important development principle.

The third is pessimistic presets. All sorts of disgusting things and abnormal situations will happen; a distributed systems engineer should tell themselves this every day. Business data is extremely important, but any piece of infrastructure can go down: your network cable may break, the entire data center may lose power... You have to presuppose that your database is bound to lose data; design under that assumption and your database will be better. To protect ourselves and protect the business, we built some magical tools. One is Syncer, which makes a TiDB cluster act as a fake slave to MySQL: the business keeps running on MySQL on top, while what is actually saving the data behind it is a cluster. We also did something even more perverse: the reverse. The business runs on TiDB on top and replicates down to MySQL below. MySQL can be TiDB's slave, and TiDB can be MySQL's slave. This feature is delightful for the business. Some customers want to use TiDB but are a little afraid at first. I tell them it's fine: attach it as a replica and run your queries against that. For example, a SQL query that used to run for twenty minutes can now finish in under ten seconds; and after it has run for half a year without losing data and with the system stable, they cut the main business over, with no code changes.

An architect should always think about what will happen in the next ten years. There is a term, cloud-native: I believe everything in the future will run in the cloud, so how do you design basic software for that environment? One very important principle in database design is that once a node goes down, the system can automatically repair itself and rebalance the data; humans only need to add machines to the cluster. The cluster must be able to think for itself. How to build infrastructure for the cloud is a question that must be considered going forward.
