Testing the Linearizability of Distributed Systems with Chaos


Background

In earlier articles on testing the linearizability of distributed systems and on using Porcupine for linearizability testing, I introduced Porcupine, a linearizability checker written in Go, along with a few simple usage examples. Here I briefly introduce Chaos, a simple linearizability testing framework for distributed systems built on top of Porcupine.

For linearizability testing of distributed systems we usually use Jepsen, and TiDB of course supports Jepsen. So why bother building another linearizability testing framework? I think the main reasons are as follows:

    • Clojure: Jepsen is written in Clojure, a functional programming language that runs on the JVM. It is very powerful, but I am not proficient in it, so reading Jepsen's code is a torment for me every time, and most of my teammates do not know Clojure at all.
    • OOM: Jepsen's linearizability check easily runs out of memory when a test runs a little longer, so our test cases cannot run for very long.

I have long wanted to write a linearizability testing framework in Go, but the main difficulty was how to do the linearizability check in Go. Fortunately I found Porcupine, which made it possible to start the work. So I first built a simple version of Chaos; if it proves feasible, I will continue to improve it.

Architecture

Like Jepsen, Chaos runs the DB service on five nodes, named n1 to n5. We can connect to a node by its name; for example, ssh n1 logs in to node n1 directly.

Chaos also has a controller node that is used to control the entire cluster, including initializing the DB to be tested, creating the corresponding client running the actual test, starting the Nemesis to interfere with the system, and finally verifying the linearizability of the history. The architecture diagram is as follows:


In Jepsen, the controller performs every operation by sending commands to the nodes over SSH. Chaos instead starts an agent on each node, and the controller interacts with the agent through an HTTP API to manipulate the node. The main reason for this design is that the DB and nemesis logic can be written directly in Go, instead of invoking Linux commands every time as Jepsen does.

The one drawback of the agent is that it has to be started explicitly on each node, which makes Chaos a little more troublesome to use than Jepsen, although this can be handled by a script.
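
The controller-to-agent interaction can be pictured with a minimal sketch. The URL layout (/db/<name>/<action>), agentHandler, callAgent, and demoAgent are hypothetical names invented for this illustration and are not taken from the Chaos source; the agent is simulated in-process with httptest rather than running on a real node.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// agentHandler plays the role of an agent. It accepts POST /db/<name>/<action>
// (a hypothetical URL layout) and reports what it would have done.
func agentHandler(w http.ResponseWriter, r *http.Request) {
	parts := strings.Split(strings.Trim(r.URL.Path, "/"), "/")
	if r.Method != http.MethodPost || len(parts) != 3 || parts[0] != "db" {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "%s %s ok", parts[1], parts[2])
}

// callAgent is what a controller might do instead of running a command over SSH.
func callAgent(baseURL, db, action string) (string, error) {
	resp, err := http.Post(baseURL+"/db/"+db+"/"+action, "text/plain", nil)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("agent returned %s", resp.Status)
	}
	return string(body), nil
}

// demoAgent starts one in-process agent (standing in for node n1) and sends
// it the given actions in order, collecting the replies.
func demoAgent(actions ...string) ([]string, error) {
	agent := httptest.NewServer(http.HandlerFunc(agentHandler))
	defer agent.Close()

	var replies []string
	for _, a := range actions {
		reply, err := callAgent(agent.URL, "tidb", a)
		if err != nil {
			return replies, err
		}
		replies = append(replies, reply)
	}
	return replies, nil
}

func main() {
	replies, err := demoAgent("setup", "start")
	fmt.Println(replies, err)
}
```

Because the agent is an ordinary Go HTTP handler, whatever it does on the node can be plain Go code, which is the point of the design described above.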

Because Go is a statically typed language, to verify the linearizability of our DB with Chaos we first need to implement the relevant interfaces and then register them with Chaos so that Chaos can run the verification. Here we take TiDB as an example.

DB

The DB interface corresponds to the DB we actually want to test. It is defined as follows:

type DB interface {
    SetUp(ctx context.Context, nodes []string, node string) error
    TearDown(ctx context.Context, nodes []string, node string) error
    Start(ctx context.Context, node string) error
    Stop(ctx context.Context, node string) error
    Kill(ctx context.Context, node string) error
    IsRunning(ctx context.Context, node string) bool
    Name() string
}

We initialize the entire DB cluster in SetUp and use TearDown to tear it down. The meanings of functions such as Start and Stop are self-explanatory. Name returns the DB name; because the DB is registered with Chaos, the name must be unique. For example, the name of our TiDB implementation is "tidb".

The nodes parameter lists the nodes of the whole cluster, usually n1 to n5, and node is the name of the current node.

For TiDB, SetUp downloads the TiDB binary, unpacks it into a fixed location, updates the configuration file, and then starts the entire cluster. TearDown, conversely, sends a kill command to kill the entire cluster. In Start, we start pd-server, tikv-server, and tidb-server separately on each node.

After implementing the DB interface for TiDB, we register it with Chaos through the RegisterDB function, so that the agent can look up TiDB by its DB name and operate on it.
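
To make the interface and the name-based lookup concrete, here is a minimal sketch with an in-memory stand-in for a real DB. The memDB type, the registry type, and demoRegistry are hypothetical and exist only for this illustration; the real RegisterDB in Chaos may have a different signature and behavior.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// DB mirrors the interface shown above.
type DB interface {
	SetUp(ctx context.Context, nodes []string, node string) error
	TearDown(ctx context.Context, nodes []string, node string) error
	Start(ctx context.Context, node string) error
	Stop(ctx context.Context, node string) error
	Kill(ctx context.Context, node string) error
	IsRunning(ctx context.Context, node string) bool
	Name() string
}

// memDB is a hypothetical stand-in that only tracks which nodes are "running".
type memDB struct {
	mu      sync.Mutex
	running map[string]bool
}

func (d *memDB) set(node string, up bool) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.running[node] = up
	return nil
}

func (d *memDB) SetUp(ctx context.Context, nodes []string, node string) error {
	return d.set(node, true)
}
func (d *memDB) TearDown(ctx context.Context, nodes []string, node string) error {
	return d.set(node, false)
}
func (d *memDB) Start(ctx context.Context, node string) error { return d.set(node, true) }
func (d *memDB) Stop(ctx context.Context, node string) error  { return d.set(node, false) }
func (d *memDB) Kill(ctx context.Context, node string) error  { return d.set(node, false) }
func (d *memDB) IsRunning(ctx context.Context, node string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.running[node]
}
func (d *memDB) Name() string { return "memdb" }

// registry is a local illustration of name-based registration.
type registry struct{ dbs map[string]DB }

// register rejects duplicate names, which is why Name must be unique.
func (r *registry) register(db DB) error {
	if _, ok := r.dbs[db.Name()]; ok {
		return fmt.Errorf("db %q is already registered", db.Name())
	}
	r.dbs[db.Name()] = db
	return nil
}

// demoRegistry registers a DB, drives it by name, then registers a duplicate.
func demoRegistry() (started, afterKill bool, dupErr error) {
	r := &registry{dbs: map[string]DB{}}
	_ = r.register(&memDB{running: map[string]bool{}}) // cannot fail on an empty registry
	ctx := context.Background()
	db := r.dbs["memdb"] // an agent would look the DB up by name like this
	db.Start(ctx, "n1")
	started = db.IsRunning(ctx, "n1")
	db.Kill(ctx, "n1")
	afterKill = db.IsRunning(ctx, "n1")
	dupErr = r.register(&memDB{running: map[string]bool{}})
	return started, afterKill, dupErr
}

func main() {
	fmt.Println(demoRegistry())
}
```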

Client

The Client is the component that the controller side uses to interact with the DB to be tested. The Client interface is defined as follows:

type Client interface {
    SetUp(ctx context.Context, nodes []string, node string) error
    TearDown(ctx context.Context, nodes []string, node string) error
    Invoke(ctx context.Context, node string, request interface{}) interface{}
    NextRequest() interface{}
}

Invoke is the function through which the client actually sends a command to the DB. Because we do not know the command parameters of the different DB clients, the request is an interface{}. Invoke returns a response; since we do not know the actual response type of each client either, it is also expressed as an interface{}.

NextRequest returns the next request to invoke, because only the client knows how to construct a request.

In the TiDB bank case, we define a bank client whose NextRequest randomly chooses either to read the balances of all accounts or to pick two accounts for a transfer. For a read, the response is the queried data; for a transfer, the response says whether it succeeded. Note that in a distributed system an operation can have three outcomes: success, failure, and unknown, so the response side also has to handle Unknown. Refer to the issue above for details.
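
A minimal sketch of such a request generator and of the three-outcome mapping follows. The bankRequest and bankResponse shapes reuse the field names that appear in the Step function later in this article; everything else (newBankClient, transferResponse, errRejected, errTimeout, the 50/50 read/transfer split, the amount range) is an assumption made for illustration, not the actual TiDB bank client.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

type bankRequest struct {
	Op     int // 0: read all balances, 1: transfer
	From   int
	To     int
	Amount int64
}

type bankResponse struct {
	Balances []int64
	Ok       bool
	Unknown  bool
}

// Hypothetical errors: one definite failure, one indeterminate outcome.
var (
	errRejected = errors.New("transfer rejected")
	errTimeout  = errors.New("timeout")
)

type bankClient struct {
	rnd      *rand.Rand
	accounts int // must be at least 2
}

func newBankClient(seed int64, accounts int) *bankClient {
	return &bankClient{rnd: rand.New(rand.NewSource(seed)), accounts: accounts}
}

// NextRequest randomly returns either a read of all accounts or a transfer
// between two distinct accounts.
func (c *bankClient) NextRequest() interface{} {
	if c.rnd.Intn(2) == 0 {
		return bankRequest{Op: 0}
	}
	from := c.rnd.Intn(c.accounts)
	to := c.rnd.Intn(c.accounts - 1)
	if to >= from {
		to++ // skip over from so that to != from
	}
	return bankRequest{Op: 1, From: from, To: to, Amount: int64(1 + c.rnd.Intn(5))}
}

// transferResponse maps the result of a transfer to one of three outcomes.
// Any error that is not a definite rejection is treated as Unknown, because
// the transfer may or may not have been applied.
func transferResponse(err error) bankResponse {
	switch {
	case err == nil:
		return bankResponse{Ok: true}
	case errors.Is(err, errRejected):
		return bankResponse{}
	default:
		return bankResponse{Unknown: true}
	}
}

func main() {
	c := newBankClient(1, 5)
	for i := 0; i < 3; i++ {
		fmt.Printf("%+v\n", c.NextRequest())
	}
	fmt.Printf("%+v\n", transferResponse(errTimeout))
}
```

Treating every ambiguous error as Unknown is the conservative choice: reporting a definite failure for a transfer that was in fact committed would make a correct database look non-linearizable.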

Because we have five nodes, the controller has a corresponding client for each node, so we also need to implement a ClientCreator to generate multiple clients.

type ClientCreator interface {
    Create(node string) Client
}

Linearizability Check

Above we discussed the Client interface: we use NextRequest to generate a request, invoke it, and get a response. The controller records both the request and the response in a history file, so each operation produces two events: a request and a response.

For simplicity, we write the request and the response directly into the history as JSON. When the test finishes, we need to analyze whether the history is linearizable. First we need to parse the history, which requires implementing a record parser:

type RecordParser interface {
    OnRequest(data json.RawMessage) (interface{}, error)
    OnResponse(data json.RawMessage) (interface{}, error)
    OnNoopResponse() interface{}
}

When the parser reads a record, it first determines whether the record is a request or a response, then calls the corresponding RecordParser method to decode the data into the actual type.

Pay attention to OnNoopResponse here. As mentioned above, a response may be Unknown. In that case OnResponse must return nil so that Chaos ignores the event for now; at the end, Chaos calls OnNoopResponse to obtain a response that completes the pending operation.
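
The parsing flow just described can be sketched as follows. The on-disk record layout (the action, proc, and data fields), the operation type, and parseHistory are assumptions made for this illustration and may differ from the format Chaos really writes. The sketch also assumes that a process whose response is unknown issues no further requests.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

type RecordParser interface {
	OnRequest(data json.RawMessage) (interface{}, error)
	OnResponse(data json.RawMessage) (interface{}, error)
	OnNoopResponse() interface{}
}

type bankRequest struct {
	Op, From, To int
	Amount       int64
}

type bankResponse struct {
	Balances []int64
	Ok       bool
	Unknown  bool
}

type bankParser struct{}

func (bankParser) OnRequest(data json.RawMessage) (interface{}, error) {
	var r bankRequest
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	return r, nil
}

// OnResponse returns nil for an Unknown response so the event is skipped for now.
func (bankParser) OnResponse(data json.RawMessage) (interface{}, error) {
	var r bankResponse
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	if r.Unknown {
		return nil, nil
	}
	return r, nil
}

func (bankParser) OnNoopResponse() interface{} { return bankResponse{Unknown: true} }

// record is a hypothetical one-line-per-event history format.
type record struct {
	Action string          `json:"action"` // "request" or "response"
	Proc   int             `json:"proc"`
	Data   json.RawMessage `json:"data"`
}

type operation struct {
	Proc          int
	Input, Output interface{}
}

// parseHistory pairs each request with its response. Operations whose response
// was skipped are completed at the end with OnNoopResponse.
func parseHistory(lines []string, p RecordParser) ([]operation, error) {
	var ops []operation
	pending := map[int]interface{}{}
	for _, line := range lines {
		var rec record
		if err := json.Unmarshal([]byte(line), &rec); err != nil {
			return nil, err
		}
		switch rec.Action {
		case "request":
			if _, busy := pending[rec.Proc]; busy {
				return nil, fmt.Errorf("proc %d already has a pending request", rec.Proc)
			}
			in, err := p.OnRequest(rec.Data)
			if err != nil {
				return nil, err
			}
			pending[rec.Proc] = in
		case "response":
			out, err := p.OnResponse(rec.Data)
			if err != nil {
				return nil, err
			}
			if out == nil {
				continue // unknown outcome: leave the operation pending
			}
			ops = append(ops, operation{rec.Proc, pending[rec.Proc], out})
			delete(pending, rec.Proc)
		default:
			return nil, fmt.Errorf("unknown action %q", rec.Action)
		}
	}
	procs := make([]int, 0, len(pending))
	for proc := range pending {
		procs = append(procs, proc)
	}
	sort.Ints(procs) // deterministic order for the completed operations
	for _, proc := range procs {
		ops = append(ops, operation{proc, pending[proc], p.OnNoopResponse()})
	}
	return ops, nil
}

func sampleHistory() []string {
	return []string{
		`{"action":"request","proc":0,"data":{"Op":0}}`,
		`{"action":"request","proc":1,"data":{"Op":1,"From":0,"To":1,"Amount":5}}`,
		`{"action":"response","proc":1,"data":{"Unknown":true}}`,
		`{"action":"response","proc":0,"data":{"Balances":[10,10],"Ok":true}}`,
	}
}

func main() {
	ops, err := parseHistory(sampleHistory(), bankParser{})
	fmt.Printf("%+v %v\n", ops, err)
}
```

In the sample history, the transfer issued by process 1 has an unknown outcome, so it is emitted last, after the completed read of process 0, with the no-op response attached.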

To implement the linearizability check, we also need to implement our own Porcupine model and then call the VerifyHistory(historyFile string, m porcupine.Model, p RecordParser) function to validate the resulting history.

The key Step function of the Porcupine model for the TiDB bank case is defined as follows:

Step: func(state interface{}, input interface{}, output interface{}) (bool, interface{}) {
    st := state.([]int64)
    inp := input.(bankRequest)
    out := output.(bankResponse)

    if inp.Op == 0 {
        // read
        ok := out.Unknown || reflect.DeepEqual(st, out.Balances)
        return ok, state
    }

    // for transfer
    if !out.Ok && !out.Unknown {
        return true, state
    }

    newSt := append([]int64{}, st...)
    newSt[inp.From] -= inp.Amount
    newSt[inp.To] += inp.Amount
    return out.Ok || out.Unknown, newSt
}

For a read, we check whether the result equals the current state or is Unknown. For a transfer that definitely failed, the state stays unchanged; otherwise we apply the transfer to the current state and return the new state.
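
To see how this Step function behaves, the sketch below replays a few operations strictly in the order given. This is only an illustration of the state transitions: Porcupine itself searches over the possible orderings of concurrent operations, which this sketch does not do. The op type and the replay helper are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

type bankRequest struct {
	Op, From, To int // Op 0: read, otherwise transfer
	Amount       int64
}

type bankResponse struct {
	Balances []int64
	Ok       bool
	Unknown  bool
}

// bankStep has the same logic as the Step function shown above.
func bankStep(state, input, output interface{}) (bool, interface{}) {
	st := state.([]int64)
	inp := input.(bankRequest)
	out := output.(bankResponse)
	if inp.Op == 0 {
		return out.Unknown || reflect.DeepEqual(st, out.Balances), state
	}
	if !out.Ok && !out.Unknown {
		return true, state // definite failure: nothing changes
	}
	newSt := append([]int64{}, st...)
	newSt[inp.From] -= inp.Amount
	newSt[inp.To] += inp.Amount
	return out.Ok || out.Unknown, newSt
}

type op struct {
	in  bankRequest
	out bankResponse
}

// replay applies ops sequentially and returns the index of the first rejected
// operation (or -1 if all are accepted) together with the state reached.
func replay(init []int64, ops []op) (int, []int64) {
	var state interface{} = init
	for i, o := range ops {
		ok, next := bankStep(state, o.in, o.out)
		if !ok {
			return i, state.([]int64)
		}
		state = next
	}
	return -1, state.([]int64)
}

func main() {
	ops := []op{
		{bankRequest{Op: 1, From: 0, To: 1, Amount: 3}, bankResponse{Ok: true}},
		{bankRequest{Op: 0}, bankResponse{Balances: []int64{7, 13}}},
		{bankRequest{Op: 0}, bankResponse{Balances: []int64{10, 10}}}, // stale read
	}
	idx, st := replay([]int64{10, 10}, ops)
	fmt.Println(idx, st) // prints: 2 [7 13]
}
```

The third operation reads the balances from before the transfer, so in this fixed order the model rejects it.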

Nemesis

While the test is running, the controller also periodically performs nemesis operations to disrupt the system, such as killing all the DB processes at once or dropping the network packets sent from some node. The Nemesis interface is defined as follows:

type Nemesis interface {
    Invoke(ctx context.Context, node string, args ...string) error
    Recover(ctx context.Context, node string, args ...string) error
    Name() string
}

Because a nemesis is also registered with Chaos, its Name must be unique. We use Invoke to disrupt the system and Recover to restore it. When you implement your own nemesis, you also need to call RegisterNemesis so that the agent can use it.

On the controller side, we need to implement a NemesisGenerator:

type NemesisGenerator interface {
    Generate(nodes []string) []*NemesisOperation
    Name() string
}

Generate produces a NemesisOperation for each node. A NemesisOperation defines the nemesis to execute, its arguments, and how long it runs. The controller sends the NemesisOperation to the agent, which executes the corresponding nemesis.
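
A minimal sketch of a nemesis and a generator working together is shown below. The fields of NemesisOperation, the convention that a nil entry means "do nothing on this node", and the fakeKill, killOneGenerator, runRound, and demoNemesis names are all assumptions made for this illustration; the real Chaos types may differ, and a real nemesis would run on the agent rather than in the controller process.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type Nemesis interface {
	Invoke(ctx context.Context, node string, args ...string) error
	Recover(ctx context.Context, node string, args ...string) error
	Name() string
}

// NemesisOperation is an assumed shape: which nemesis to run, with which
// arguments, and for how long.
type NemesisOperation struct {
	Name        string
	InvokeArgs  []string
	RecoverArgs []string
	RunTime     time.Duration
}

type NemesisGenerator interface {
	Generate(nodes []string) []*NemesisOperation
	Name() string
}

// fakeKill only records which nodes are "down"; a real nemesis would kill
// processes or drop packets.
type fakeKill struct{ down map[string]bool }

func (k *fakeKill) Invoke(ctx context.Context, node string, args ...string) error {
	k.down[node] = true
	return nil
}
func (k *fakeKill) Recover(ctx context.Context, node string, args ...string) error {
	delete(k.down, node)
	return nil
}
func (k *fakeKill) Name() string { return "fake_kill" }

// killOneGenerator targets one node per round, rotating through the nodes.
type killOneGenerator struct{ next int }

func (g *killOneGenerator) Generate(nodes []string) []*NemesisOperation {
	ops := make([]*NemesisOperation, len(nodes)) // nil entry: leave node alone
	if len(nodes) == 0 {
		return ops
	}
	ops[g.next%len(nodes)] = &NemesisOperation{Name: "fake_kill", RunTime: 10 * time.Millisecond}
	g.next++
	return ops
}
func (g *killOneGenerator) Name() string { return "kill_one" }

// runRound invokes the nemesis on each targeted node, waits for RunTime,
// then recovers. It returns the nodes that were disrupted.
func runRound(ctx context.Context, n Nemesis, nodes []string, ops []*NemesisOperation) ([]string, error) {
	var hit []string
	for i, op := range ops {
		if op == nil {
			continue
		}
		if err := n.Invoke(ctx, nodes[i], op.InvokeArgs...); err != nil {
			return hit, err
		}
		hit = append(hit, nodes[i])
		select {
		case <-time.After(op.RunTime):
		case <-ctx.Done():
		}
		if err := n.Recover(ctx, nodes[i], op.RecoverArgs...); err != nil {
			return hit, err
		}
	}
	return hit, nil
}

// demoNemesis runs two rounds and reports the disrupted nodes and how many
// nodes are still down afterwards.
func demoNemesis() ([]string, int, error) {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	n := &fakeKill{down: map[string]bool{}}
	var g NemesisGenerator = &killOneGenerator{}
	var all []string
	for round := 0; round < 2; round++ {
		hit, err := runRound(context.Background(), n, nodes, g.Generate(nodes))
		all = append(all, hit...)
		if err != nil {
			return all, len(n.down), err
		}
	}
	return all, len(n.down), nil
}

func main() {
	fmt.Println(demoNemesis())
}
```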

Agent and Controller

Once we have defined our own DB, client, nemesis, and so on, we need to tie them together. First we register our DB and the related nemesis in the agent. In the cmd/agent/main.go file, the TiDB-related registration code is as follows:

    // register nemesis
    _ "github.com/siddontang/chaos/pkg/nemesis"
    // register tidb
    _ "github.com/siddontang/chaos/tidb"

Then we start the agent on each node and create a controller with:

NewController(cfg *Config, clientCreator core.ClientCreator, nemesisGenerators []core.NemesisGenerator) *Controller

The controller takes a ClientCreator and a list of NemesisGenerators. Config specifies the maximum number of requests each client sends, the duration of the whole test, and the name of the DB to operate on.

Start the controller to run the test. When it finishes, a history file is generated, and we can verify its linearizability.

Summary

Chaos is still very rudimentary at this stage, and much work remains, such as better interface definitions and better ease of use. But it at least works now. Currently there is only the TiDB transfer test; later I will add more linearizability tests for TiDB. If you are interested, you are welcome to contribute linearizability test cases for other open source projects.

Chaos: https://github.com/siddontang/chaos
