This series is compiled from the talk "Deep Exploration of Distributed System Testing" given at the 26th PingCAP NewSQL Meetup. Because the full article is long, it has been split into three parts for easier reading; this is the middle part.
Continued from the previous part:
Of course, testing may make your code less elegant. For example:
This is the well-known Kubernetes code: the DaemonSetsController has three test points injected into it. For instance, a handler is injected here, and you can think of every injection point as an interface. Say you write a trivial 1+1=2 program, a little calculator whose only job is to add numbers; it is hard to inject an error into that, so you have to inject the test logic into the correct code itself. For example, when someone calls your add function, can it ever return an error? It may never return one, so you have to check by hand whether the application behaves correctly. After addition, suppose we add division. Everyone knows division has an exceptional case to handle: what happens when the divisor is zero? If the code never handles it, you can write a test for 6÷3, reach 100% coverage, and still have the system collapse the moment a divide-by-zero appears. That is exactly when you need to inject errors. The famous Kubernetes uses a similar approach to test its various abnormal paths: the structure is not long, a dozen or so members, and three injection points are placed inside it through which errors can be injected.
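To make that concrete, here is a minimal Go sketch of the same pattern (not the actual Kubernetes code): the fallible operation sits behind an interface, so a test can swap in an implementation that fails on demand. All the names here (Divider, faultyDivider, Calc) are invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Divider is the injection point: production code uses the real
// implementation, tests inject one that fails on demand.
// (Divider, realDivider, faultyDivider are hypothetical names.)
type Divider interface {
	Divide(a, b int) (int, error)
}

type realDivider struct{}

func (realDivider) Divide(a, b int) (int, error) {
	if b == 0 {
		return 0, errors.New("division by zero")
	}
	return a / b, nil
}

// faultyDivider always fails, so the error path can be exercised.
type faultyDivider struct{}

func (faultyDivider) Divide(a, b int) (int, error) {
	return 0, errors.New("injected fault")
}

// Calc is the "correct code" that now has to carry the injection point.
func Calc(d Divider, a, b int) string {
	q, err := d.Divide(a, b)
	if err != nil {
		return "error: " + err.Error()
	}
	return fmt.Sprintf("%d", q)
}

func main() {
	fmt.Println(Calc(realDivider{}, 6, 3))   // normal path: "2"
	fmt.Println(Calc(faultyDivider{}, 6, 3)) // injected error path
}
```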
So what did we think about testing when we were designing TiDB? First of all, millions of tests cannot be written by hand. If you define your own SQL dialect, or your own query language, you need to build millions of test cases, and even if the whole company did nothing but write tests for two years it would not be enough, so that is clearly not workable. Unless your query language is extremely simple, like early MongoDB, where a query is just a "greater than" or "equals" condition, then you really do not need millions of tests. But if you are building a SQL database, you must build an extremely complex test suite, and it cannot be something the whole company spends two years writing, right? So what is the best approach? Tests written for MySQL-compatible systems can be reused, so we chose to be compatible with the MySQL protocol, which means we can take a huge number of existing MySQL tests. I do not know whether anyone has counted them, but the number of production-level MySQL tests is scary, in the tens of millions. Then there are the many ORMs, and the applications that support MySQL each have their own tests. As you know, every language builds its own ORMs, and a single language often has several of them; for MySQL there is one after another, and we can take all of these to test our system.
But for some applications this is a bit of a pain: you have to deploy the application, drive it by hand, and then check the results, WordPress for example. So to avoid testing purely by hand, we built a program to automate this with Record-Replay: the first time the application runs, we record every SQL statement it executes; the next time we need to rerun it, we do not start the application at all, we simply replay the SQL we recorded earlier, which is equivalent to modeling the whole behavior of the program. So this part is automated as well.
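As a rough illustration of the idea (not our actual tool), here is a small Go sketch of Record-Replay: the first run logs every SQL statement to a file, and later runs simply replay that file. The Executor interface and all the names are invented for this sketch; a real version would sit in front of the MySQL protocol or wrap *sql.DB.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// Executor abstracts "send one SQL statement to the database".
// In a real setup this would wrap *sql.DB; here it is a stand-in.
type Executor interface {
	Exec(query string) error
}

// printExecutor just prints the statement, standing in for a real DB.
type printExecutor struct{}

func (printExecutor) Exec(q string) error { fmt.Println("exec:", q); return nil }

// Recorder wraps an Executor and appends every statement to a log file.
type Recorder struct {
	inner Executor
	log   *os.File
}

func (r *Recorder) Exec(query string) error {
	if _, err := fmt.Fprintln(r.log, query); err != nil {
		return err
	}
	return r.inner.Exec(query)
}

// Replay re-executes a recorded SQL log without ever starting
// the original application again.
func Replay(path string, db Executor) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if err := db.Exec(sc.Text()); err != nil {
			return err
		}
	}
	return sc.Err()
}

func main() {
	logFile, _ := os.CreateTemp("", "sql-record-*.log")
	rec := &Recorder{inner: printExecutor{}, log: logFile}
	rec.Exec("INSERT INTO t VALUES (1)") // first run: record
	rec.Exec("SELECT * FROM t")
	logFile.Close()
	Replay(logFile.Name(), printExecutor{}) // later runs: replay
}
```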
So, after all of that, what have we actually covered? Only the correct paths: those millions of tests are tests of the correct path. What about the wrong paths? The typical question is how to do fault injection. For hardware, the simple, crude way to simulate a network fault is to pull out the network cable and plug it back in, but this approach is extremely inefficient, and it does not scale, because it requires a person.
Then there is the CPU; the probability of a CPU failing is actually quite high, especially on machines past their warranty. Then there is the disk, with roughly an 8% failure rate over three years, which is a figure given in a paper. I remember Google also published data on how often CPUs, NICs, and disks fail over a given number of years.
Then there is the clock, which people do not pay much attention to. A while ago we found that the system clock can jump backward, so we decided to add a monitoring module inside the program, and as soon as the system clock jumps back we detect it immediately. Of course, when we first deployed this, the user felt it was impossible: how could the clock jump back? I said it's fine, just let our program monitor it, and after a while it was detected: the system clock had recently jumped back. So configuring NTP properly matters a lot. Then there is more, for example the file system. Have you ever thought about what happens when a disk write fails? Fine, say the write succeeds, but then a sector goes bad and the data you read back is corrupted, what then? Do you have a checksum? If there is no checksum, we just use that data and hand it straight back to the user, and that can be fatal. And if that data happens to be metadata, and the metadata points at other data, and you then write more data based on that metadata, it is even more fatal: the corruption may spread further.
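For the checksum point, here is a minimal Go sketch of the idea, assuming a made-up block layout of a 4-byte CRC32 followed by the payload: corruption is detected on read instead of being silently handed back to the user.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// ErrChecksum is returned when a block read back from "disk" is corrupted.
var ErrChecksum = errors.New("checksum mismatch: data corrupted")

// writeBlock prepends a CRC32 of the payload.
// (The 4-byte-checksum-plus-payload layout is invented for this sketch.)
func writeBlock(payload []byte) []byte {
	block := make([]byte, 4+len(payload))
	binary.LittleEndian.PutUint32(block[:4], crc32.ChecksumIEEE(payload))
	copy(block[4:], payload)
	return block
}

// readBlock verifies the checksum before handing the payload back.
func readBlock(block []byte) ([]byte, error) {
	if len(block) < 4 {
		return nil, ErrChecksum
	}
	want := binary.LittleEndian.Uint32(block[:4])
	payload := block[4:]
	if crc32.ChecksumIEEE(payload) != want {
		return nil, ErrChecksum
	}
	return payload, nil
}

func main() {
	block := writeBlock([]byte("meta: points to block 42"))

	// Simulate a bad sector flipping one byte.
	block[7] ^= 0xff

	if _, err := readBlock(block); err != nil {
		fmt.Println("caught:", err) // without the checksum this corruption goes unnoticed
	}
}
```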
So what is the better way?
Fault Injection
  Hardware
    Disk error
    Network card
    CPU
    Clock
  Software
    File system
    Network & Protocol
Simulate everything
Simulate everything: the disk is simulated, the network is simulated, and then we can monitor it and inject any error we want, at any time, in any scenario. For example, when you write to the disk, I can tell you the disk is full, or that the disk is broken, or I can make the write hang, say sleep for more than 50 seconds. We really did see this on the cloud: a single write hung for 53 seconds before it finally completed; it must have been a network disk, right? This kind of thing is genuinely scary. Nobody expects one disk write to take 53 seconds, but when those 53 seconds do happen, what does the whole program do? TiDB uses Raft heavily, so what happened at the time was this: after 53 seconds of silence all the machines decided something must be wrong and started electing, a new leader was chosen, and then the node that had been stuck for 53 seconds popped up and said "I'm done", by which point the whole system's state had already migrated to somewhere completely new. What is the benefit of this kind of error injection? It tells you how serious the consequences can be when things go wrong, and it makes the whole system predictable, which is what matters. If the error paths are never tested, here is a simple question: suppose you hit one of those error paths right now, what does the whole system do? Not knowing is very scary. You do not know whether the data may be corrupted, whether the business will block, or whether it will retry.
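As a toy illustration of "simulate everything" (not TiDB's actual test code), here is a Go sketch of a simulated disk into which a test can inject a disk-full error, an I/O error, or a long hang like that 53-second write; every type and name here is invented for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Fault describes what the simulated disk should do on the next write.
type Fault int

const (
	None Fault = iota
	DiskFull
	IOError
	SlowWrite // e.g. the 53-second hang seen on a cloud network disk
)

// SimDisk is a simulated disk: a test can schedule any fault at any time.
// (SimDisk, its fields, and the Fault values are made up for this sketch.)
type SimDisk struct {
	next  Fault
	delay time.Duration
	data  [][]byte
}

func (d *SimDisk) Inject(f Fault, delay time.Duration) { d.next, d.delay = f, delay }

func (d *SimDisk) Write(p []byte) error {
	switch d.next {
	case DiskFull:
		return errors.New("injected: no space left on device")
	case IOError:
		return errors.New("injected: input/output error")
	case SlowWrite:
		time.Sleep(d.delay) // the caller should survive this, Raft elections and all
	}
	d.data = append(d.data, p)
	return nil
}

func main() {
	d := &SimDisk{}
	fmt.Println(d.Write([]byte("ok")))          // normal path
	d.Inject(DiskFull, 0)
	fmt.Println(d.Write([]byte("should fail"))) // injected disk-full error
	d.Inject(SlowWrite, 100*time.Millisecond)   // scaled down from 53s for the demo
	fmt.Println(d.Write([]byte("slow")))
}
```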
I ran into a very interesting problem before. At the time we were building a messaging system with a huge number of connections, roughly 800,000 connections on a single machine, doing message push. And as I recall, the swap partition was enabled. What does that mean? When you have more connections, memory blows up, right? When memory blows up, swap kicks in automatically, but once swap kicks in, the system slows to a crawl. Users on the outside see their requests fail, so they disconnect and reconnect, but by the time your now-sluggish program manages to respond, maybe 30 seconds later, the user has already decided it timed out, dropped the connection, and reconnected again. What state does that leave you in? The system retries forever and never succeeds. So was that behavior predictable? Had that error path been properly tested beforehand? That was a very important lesson.
The old-school approach to hardware testing looks like this (joke):
Suppose a disk breaks, suppose a machine dies, and then a further hypothetical that may or may not ever happen: what if it catches fire? A couple of months ago a bank, in Switzerland I think, ran a test like this. The guy was quite funny: he physically blew at the server, watched the monitoring data change, and the alarms immediately went off. And that is just blowing on it. With more complex tests, such as where the fire starts, whether the hard drive burns first or the network card burns first, the result may be different. Of course the cost is very high, it is not a plan that scales, and it is hard to reproduce.
This is not just hardware monitoring; it can also be thought of as error injection, at the cluster level: what if I burn one machine right now? Fire is a classic case. Important data centers all have fire-prevention and waterproofing strategies, but what actually happens when a fire breaks out? Of course you cannot really set a fire, that would take out more than one machine, so we need fault injection to simulate it.
Let me introduce what fault injection is with a concrete example. Everyone here has used Unix or Linux, and the first command many people type after opening a shell is ls, to list the files in a directory. But have you ever thought about an interesting question: if you want to test the correctness of the ls command's implementation, how would you test it? What if there is no source code, how should it be tested? If it is a black box, how should it be tested? What if a disk error occurs while ls is running? What happens if reading a sector fails?
Here is a very fun tool that I recommend everyone play with. Before you get into deeper testing, it lets you see what fault injection is and how powerful it can be. In a moment we will use it to reproduce a MySQL bug.
libfiu: Fault injection in userspace
It can be used to perform fault injection in the POSIX API without having to modify the application's source code, which can help to test failure handling in an easy and reproducible way.
It is mostly used to hook these POSIX APIs, and importantly it provides a library that can be embedded into your program to hook them. For example, when you read a file, it can return "file does not exist", it can return a disk error, and so on. Most important of all, the failures are reproducible.
For example, normally when we type the ls command, we can certainly list the current directory.
What does this run do? We run ls under fiu-run with one parameter, in this case enable_random, which gives all of the POSIX I/O APIs underneath a 5% failure rate. The first run got lucky and nothing failed, so the whole directory was listed. Then we ran it again, and this time it reported that a read failed: while reading the directory it hit a "bad file descriptor", and you can see fewer files listed than above, because one path was made to fail. We ran it yet again, and it listed part of the directory before the next read went wrong. Run it once more and the luck is better again, and everything is listed. And this is only simulating a 5% failure rate: with a 5% probability each read or open fails. Yet you can see the ls command behaves very robustly; there are no segfaults or anything like that.
You might say that is not very exciting, just checking whether the ls command has bugs, so let's use it to reproduce a MySQL bug.
Bug #76020
InnoDB does not report filename in I/O error message for reads
fiu-run -x -c "enable_random name=posix/io/*,probability=0.05" bin/mysqld --basedir=/data/ushastry/server/mysql-5.6.24 --datadir=/data/ushastry/server/mysql-5.6.24/76020 --core-file --socket=/tmp/mysql_ushastry.sock --port=15000
2015-05-20 19:12:07 31030 [ERROR] InnoDB: Error in system call pread(). The operating system error number is 5.
2015-05-20 19:12:07 7f7986efc720 InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.
2015-05-20 19:12:07 31030 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation
This is a MySQL bug reproduced with libfiu. The bug number is 76020: InnoDB does not report the filename when an I/O error occurs on a read. The database hands you an error and you are left staring at it: what exactly went wrong here, and how was it triggered? You can see we used the fiu-run command just mentioned to simulate failures, with the same failure probability and the parameters unchanged, started MySQL, ran it, and there it is: when InnoDB reports the error it does not report the filename, just "File (unknown): 'read' returned OS error", so you have no idea which file it was.
Look at it another way: without such a tool, what would it cost to reproduce this bug? Think about it, how would you reproduce it, how would you make MySQL hit a read error? Making that happen on the normal path is far too hard; it might not happen for years. Zoom out a bit further: this bug also exists in 5.7, and it has probably been in MySQL for more than ten years without most people ever running into it, yet with the help of this tool it pops out immediately. So a very important benefit fault injection brings is that it makes a problem far easier to reproduce. And this was still with a simulated failure probability of only 5%. I put this example together last night to give you an intuitive feel for it, but error injection in a distributed system is more complex than this. And if you hit an error that has not appeared in ten years, doesn't that feel lonely? You probably remember the movie, starring Will Smith, where he is the last man alive in the world and his only companion is a dog.
Actually, it is not like that: there are plenty of people out there suffering from the same things we are.
One example is Netflix's system.
In October 2014 they wrote a blog post called "Failure Injection Testing", about how errors are injected throughout their system. They describe it as Internet scale, the scale of multiple data centers across the Internet; you may remember that when Spanner first came out Google called it global scale. In the diagram, the blue boxes are injection points and the black arrows are network calls, and for any request, every one of those blue boxes can be made to fail. Think about a microservice system where one business call may fan out into dozens of internal calls: what if the first one fails? What if the second fails, or the third? Has any system been tested like that? Has your own program verified that every anticipated error is predictable? That becomes very important. Take the cache as an example here: every access to Cassandra can go wrong, so that is where they place an error injection point.
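To show the shape of the idea (this is not Netflix's FIT implementation), here is a tiny Go sketch in which every downstream call goes through an injection point that a test can force to fail; the service names and structure are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// callFn is one downstream call in a request's fan-out.
type callFn func() error

// injected lists the downstream services a test wants to fail; in a real
// system this would come from a fault-injection control service.
// (The names here are invented for this sketch.)
var injected = map[string]bool{"cassandra-cache": true}

// call wraps every downstream request with an injection point.
func call(service string, fn callFn) error {
	if injected[service] {
		return errors.New("injected failure: " + service)
	}
	return fn()
}

func handleRequest() error {
	// One business call fans out into several internal calls,
	// and any one of them may be forced to fail.
	if err := call("user-service", func() error { return nil }); err != nil {
		return err
	}
	if err := call("cassandra-cache", func() error { return nil }); err != nil {
		// The interesting question: does the caller degrade gracefully here?
		return fmt.Errorf("request failed at cache layer: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(handleRequest())
}
```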
Next, let's talk about OpenStack.
OpenStack Fault-injection Library:
https://pypi.python.org/pypi/os-faults/0.1.2
The well-known OpenStack also has a fault-injection library, and I have posted the link here; if you are interested, have a look. We may not have paid much attention to this before, but everyone hurts at this point: plenty of people curse OpenStack for poor stability, even though they have been working very hard. The whole system really is extraordinarily complex, because there are so many components. And once you introduce many fault points there is another problem: the fault points can combine. A fails, then B fails, or A and B fail together, and that is only a handful of cases if you are lucky. If you have 100,000 fault points, how do you deal with the combinations? Of course there are now new papers studying this; around 2015 there was a paper that probes the execution paths of your program and then injects errors along the corresponding paths.
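The combinatorial problem is easy to see in code. Here is a small Go sketch, with made-up fault names, that enumerates every combination of a toy set of fault points: n points already give 2^n scenarios, which is why exhaustive combination testing falls apart at scale.

```go
package main

import "fmt"

// faultPoints is a toy list of injectable faults; a real system
// may have thousands or hundreds of thousands of them.
var faultPoints = []string{"A: disk full", "B: network partition", "C: clock jump"}

// combinations enumerates every non-empty subset of fault points with a
// bitmask, which is exactly why this explodes: n points give 2^n cases.
func combinations(points []string) [][]string {
	var out [][]string
	for mask := 1; mask < 1<<len(points); mask++ {
		var combo []string
		for i, p := range points {
			if mask&(1<<i) != 0 {
				combo = append(combo, p)
			}
		}
		out = append(out, combo)
	}
	return out
}

func main() {
	for _, c := range combinations(faultPoints) {
		fmt.Println(c) // each line is one fault scenario a test would have to cover
	}
	// With 100,000 fault points this is hopeless, which is why newer work
	// instead traces actual execution paths and injects faults along them.
}
```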
Next, let's talk about Jepsen.
Jepsen: Distributed Systems Safety Analysis
Basically every well-known open-source distributed system you have heard of has had bugs found in it by Jepsen. Before that, everyone felt fine: our system is pretty stable. But when a new tool or a new method appears, like the path-probing error-injection paper I just mentioned, its power to find faults is amazing, because it detects problems for you automatically. I will also introduce another very powerful tool, Namazu, later. First, Jepsen. This is a heavy weapon: ZooKeeper, MongoDB, Redis, and so on have all had bugs found by it, and nowadays all the databases use it to hunt for bugs. Its biggest problem is that it is written in Clojure, a niche language, which makes it a bit cumbersome to extend. Let me explain the basic principle of Jepsen: a typical Jepsen test runs a Clojure program on a control node, and the control node logs into the system nodes (which Jepsen calls db nodes) over SSH to perform the test operations.
When our distributed system starts up, the control node starts many processes, and each process uses a specific client to access our distributed system. A generator generates a series of operations for each process to execute, such as get/set/cas, and every operation is recorded in the history. While the operations are running, another nemesis process tries to break the distributed system, for example using iptables to cut the network, and so on. When all operations have completed, Jepsen uses a checker to analyze whether the system behaved as expected. PingCAP's chief architect, Tang Liu, has written two articles about how we actually use Jepsen to test TiDB; you can search for them, and I will not go into detail here.
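Jepsen itself is Clojure, but the structure it describes (generator, history, nemesis, checker) can be sketched in a few dozen lines of Go. The sketch below is only an illustration under those assumptions: the "system under test" is a single in-memory register, and the checker merely counts failures rather than verifying linearizability against a model the way Jepsen's checkers do.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Op is one entry in the history (all names here are made up for this sketch).
type Op struct {
	Proc  int
	Kind  string // "set" or "get"
	Value int
	OK    bool
}

// register stands in for the distributed system under test.
type register struct {
	v       int
	nemesis bool // when true, operations fail, imitating a network partition
}

func (r *register) apply(op *Op) {
	if r.nemesis {
		op.OK = false
		return
	}
	if op.Kind == "set" {
		r.v = op.Value
	} else {
		op.Value = r.v
	}
	op.OK = true
}

func main() {
	sys := &register{}
	var history []Op

	// Generator: produce a stream of random get/set operations for 3 processes.
	for i := 0; i < 30; i++ {
		op := Op{Proc: i % 3, Kind: "get"}
		if rand.Intn(2) == 0 {
			op.Kind, op.Value = "set", rand.Intn(100)
		}
		// Nemesis: break the system for the middle third of the run.
		sys.nemesis = i >= 10 && i < 20
		sys.apply(&op)
		history = append(history, op) // every operation is recorded
	}

	// Checker: a real checker verifies the history against a model
	// (e.g. linearizability); here we only count failures under the nemesis.
	failed := 0
	for _, op := range history {
		if !op.OK {
			failed++
		}
	}
	fmt.Printf("%d/%d operations failed while the nemesis was active\n", failed, len(history))
}
```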
FoundationDB is the pioneer here; it was acquired by Apple in 2015. They did a lot of work to solve the problem of injecting errors, or rather of making errors reproducible, and the most important piece is determinism. If I give you the same input and run it several times, can I get the same output? That sounds natural and scientific, but in reality most of our programs are not like that. For example, do you use random numbers anywhere in your logic? How many threads do you have? Do you check the disk space? Do you look at the time? Will the answer be the same the next time you check? You run it again with the same input, but the behavior differs, for example because you used a random number, or because your check of the disk space came back different from the last check.
So they set out to guarantee "given the same input, you always get the same output", and spent about two years building a library for it. This library has the following characteristics: it is single-threaded and uses pseudo-concurrency. Why? Because with multiple threads, how do you turn the same input into the same output? Who grabs the lock first? There are too many problems, so they chose a single thread, though a single thread has problems of its own. And if you use the Go language, even a single thread is concurrent: the language specification tells us that if a select is waiting on two channels and both are ready, the choice is random, so by the definition of the language you cannot get determinism. Fortunately, FoundationDB is written in C++.
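That point about Go's select comes straight from the language specification, and a tiny sketch shows it: with both channels ready, the chosen case can differ from run to run even though the input is identical.

```go
package main

import "fmt"

func main() {
	a, b := make(chan string, 1), make(chan string, 1)
	a <- "a"
	b <- "b"

	// Both channels are ready, so the Go spec says select picks one of the
	// ready cases at random: the same input does not give the same output.
	select {
	case v := <-a:
		fmt.Println("picked", v)
	case v := <-b:
		fmt.Println("picked", v)
	}
}
```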
FoundationDB
  Single-threaded pseudo-concurrency
  Simulated implementations of all external communication
  Determinism
  Disasters happen more frequently than in the real world
In addition, FoundationDB simulates the entire network: two nodes think they are talking to each other over the network, but in fact they communicate through the simulated layer. They make a very important point: if a disk fails with, say, 8% probability over three years, then that is also the probability that a user ever sees it; but once a user does see it, the consequences are serious. So how do they approach the problem? They make failures happen all the time inside their own simulation. They might produce a disk failure every two minutes, which is vastly more frequent than reality; likewise the probability of a network card failing is very low in the real world, but in the simulator you can make it happen every minute. So the probability of your system meeting these errors under test is far greater than in reality, and something that might take three years of production running to reproduce once can be reproduced in perhaps 30 seconds.
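Here is a toy Go sketch of that idea (not FoundationDB's simulator): a single seeded random source drives a simulated clock and a deliberately exaggerated disk-failure rate, so failures show up within seconds and the same seed replays exactly the same run. All names and rates are invented for this sketch.

```go
package main

import (
	"fmt"
	"math/rand"
)

// SimEnv is a toy deterministic simulator: one seeded RNG drives every
// "random" event, so the same seed replays the same run exactly.
// (SimEnv and its rates are made up for this sketch.)
type SimEnv struct {
	rng          *rand.Rand
	diskFailRate float64 // per simulated operation, cranked far above reality
	now          int     // simulated time in seconds; no wall clock is consulted
}

func NewSimEnv(seed int64) *SimEnv {
	return &SimEnv{rng: rand.New(rand.NewSource(seed)), diskFailRate: 0.01}
}

func (e *SimEnv) writeDisk() error {
	e.now++ // time only advances inside the simulation
	if e.rng.Float64() < e.diskFailRate {
		return fmt.Errorf("simulated disk failure at t=%ds", e.now)
	}
	return nil
}

func main() {
	// Same seed, same input, same output: the failure happens at the same
	// simulated time on every run, so it can always be reproduced.
	env := NewSimEnv(42)
	for i := 0; i < 10000; i++ {
		if err := env.writeDisk(); err != nil {
			fmt.Println(err)
			return
		}
	}
	fmt.Println("no failure in this run")
}
```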
But what is the scariest thing about a bug? That it cannot be reproduced. You find a bug, you say you fixed it, but you cannot reproduce it, so did you actually fix it? You do not know, and that is terrifying. With determinism, reproduction is guaranteed: I recorded the input, so whenever the bug has appeared once, I replay the input and it will appear again. Of course, the cost of full determinism is very high, so academia is now taking another route: not complete determinism, just "reasonable" reproducibility. For example, being able to reproduce an issue within 30 minutes is enough; it does not need to be reproducible within three seconds. Every further step costs correspondingly more.
To be continued...