Understanding the paper "All About Eve: Execute-verify replication for multi-core servers"
Eve is a distributed replication system designed for multi-core servers. State machine replication (SMR) provides fault tolerance, but it traditionally requires all replicas to execute requests in the same order, which is difficult to guarantee when requests run in parallel on multiple cores. Eve removes the requirement that all replicas execute requests in the same order. This does not mean that requests are executed in parallel without any preparation; otherwise the replicas would never reach a consistent state. Eve groups requests into batches whose requests are not expected to affect each other, so all the requests in a batch can be executed in parallel.
Traditionally, to achieve consistency, replicas must agree on the request execution order before execution. In Eve, requests are executed in parallel first, and the replicas then verify that their results are consistent. If the replicas diverge, Eve rolls back and re-executes the requests in a serialized order. To make divergence less likely, Eve uses a mixer, which groups requests so that the requests executed in parallel do not interfere with each other; this also makes repair more effective.
Eve's execute-verify model can be used for both crash tolerance and Byzantine fault tolerance. Eve's robustness comes from two aspects. First, the mixer reduces the likelihood of triggering latent concurrency bugs, because requests that the mixer allows to run in parallel do not interfere with each other. Second, when replicas do diverge, the verification step detects it and Eve recovers through rollback and serialized re-execution.
Why not deterministic execution?
In general, multithreaded execution is nondeterministic. How to make multithreaded execution deterministic is still the key open problem.
Making all replicas process the same sequence of inputs is not the only solution. Another idea is to use request semantics to coordinate replicas. For example, an SMR system does not need replicas to process read requests in the same order, because read requests do not modify the state of the replicated application. If a request can be identified as a read, it can be executed in parallel without ordering; for other types of requests, the replicas still need to agree on a processing order.
Read requests are executed only on a preferred quorum of replicas rather than on all replicas. Why?
Synchronous primary-backup
The primary accepts requests and groups them into batches. Once a batch B is formed, the primary sends the message <execute-batch, n, B, ND> to the backup, where n is the batch sequence number and ND is the data needed to execute nondeterministic calls such as random() and gettimeofday() consistently. The backup executes the batch based on this message and returns a token to the primary. The primary compares the backup's token with its own. If the tokens match, no divergence has occurred and the primary marks the batch sequence number as stable. If they differ, the replicas have diverged. In this case, the primary rolls back to the last stable batch sequence number and has the backup roll back to it as well; that is, the batch that was sent is discarded.
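The exchange above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not Eve's implementation: the Replica class, the dict-based state, the (key, delta) request format, the "salt" entry in ND, and the use of a SHA-256 hash of the state as the token are all hypothetical choices made for the example.

```python
import hashlib
import random
from copy import deepcopy


class Replica:
    """Toy replica (hypothetical): state is a dict, requests are (key, delta) pairs."""

    def __init__(self):
        self.state = {}
        self.stable_state = {}  # snapshot taken at the last stable batch
        self.stable_seq = 0     # sequence number of the last stable batch

    def execute_batch(self, batch, nd):
        # nd carries the nondeterministic inputs chosen by the primary, so
        # every replica uses the same values instead of calling random() itself.
        for key, delta in batch:
            self.state[key] = self.state.get(key, 0) + delta + nd["salt"]
        return self.token()

    def token(self):
        # Assumed token format: a hash of the state reached after the batch.
        payload = repr(sorted(self.state.items())).encode()
        return hashlib.sha256(payload).hexdigest()

    def commit(self, seq):
        self.stable_state = deepcopy(self.state)
        self.stable_seq = seq

    def rollback(self):
        self.state = deepcopy(self.stable_state)


def run_batch(primary, backup, seq, batch):
    """Execute one batch on both replicas, then verify by comparing tokens."""
    nd = {"salt": random.randint(0, 9)}         # chosen once, by the primary
    message = ("execute-batch", seq, batch, nd)  # what the primary would send
    primary_token = primary.execute_batch(batch, nd)
    backup_token = backup.execute_batch(message[2], message[3])
    if primary_token == backup_token:
        primary.commit(seq)   # no divergence: the batch becomes stable
        backup.commit(seq)
        return True
    primary.rollback()        # divergence: both give up the batch
    backup.rollback()
    return False
```

In this sketch the network is replaced by a direct method call, and re-execution after a rollback is left out; the point is only the order of steps: execute, exchange tokens, then either mark the batch stable or roll back.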
Evaluation
How does Eve's mixer affect performance? How does Eve's throughput compare with that of unreplicated multithreaded execution? How does Eve handle concurrency bugs? The evaluation answers these questions using a key-value store application and the H2 database engine.
The limitations of Eve's prototype include: (i) it does not implement the extra protection mode optimization for the asynchronous configurations; (ii) it does not handle applications that include objects whose Java finalize method modifies state that needs to be consistent across replicas; (iii) it only supports in-memory application state.
Eve's advantage: with 16 threads, Eve is 6.5 times faster than sequential execution. Eve also has a limitation: as the workload gets lighter (the execution time per request decreases), Eve's overhead becomes more pronounced.
To perform well, Eve needs a good mixer. For Eve, however, it is easy to construct a mixer that detects all conflicts while still allowing a large amount of parallelism.
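To make the idea of a mixer more concrete, here is a minimal sketch in Python. It assumes that each request can declare the keys it reads and writes, which fits a key-value store but is an assumption of this example; the Request type, the conflicts rule, and the greedy grouping in mix are hypothetical and are not taken from Eve's code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Request:
    """Hypothetical request that declares the keys it reads and writes."""
    name: str
    reads: frozenset = field(default_factory=frozenset)
    writes: frozenset = field(default_factory=frozenset)


def conflicts(a, b):
    """Two requests conflict if one writes a key the other reads or writes."""
    return bool(a.writes & (b.reads | b.writes) or b.writes & a.reads)


def mix(batch):
    """Partition a batch into groups whose requests do not conflict.

    The groups are meant to run one after another, while the requests inside
    a group can run in parallel. The function is deterministic, so every
    replica that receives the same batch derives the same groups.
    """
    groups = []
    for request in batch:
        for group in groups:
            # Place the request in the first group it does not conflict with.
            if not any(conflicts(request, other) for other in group):
                group.append(request)
                break
        else:
            groups.append([request])
    return groups
```

With this rule, two reads of the same key land in the same group, while a read and a write of the same key are placed in different groups. A mixer that misses a conflict does not break correctness in Eve, because verification still catches the divergence; it only makes rollbacks more frequent.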
Failure and recovery
In the experiment, the primary fails at 30 s and is restored at 60 s; a secondary fails at 90 s and is subsequently restored. Each time a failure occurs, the system recovers within a short period of time, which shows that Eve's fault tolerance is good.
How well does Eve handle concurrency faults? If a bug manifests in only one replica, Eve can detect it and repair it through rollback and re-ordered execution. However, if the bug appears in the same way in both replicas, Eve cannot detect it, which is a limitation of Eve.
Remus
Compared with Remus, Eve uses two orders of magnitude less network bandwidth.
Latency and batching
Eve has a tradeoff between latency and throughput. As the load increases, Eve's latency rises until the system reaches its saturation point at a throughput of 1225 requests per second, which is below the 1470 requests per second of the unreplicated server. How can Eve achieve both low latency and high throughput? Eve uses a dynamic batching scheme in which the batch size varies with the load. For example, when the system starts to become saturated, the batch size increases to obtain more concurrency.
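One simple way to get this behavior is to let each batch contain whatever requests queued up while the previous batch was being processed, up to a cap. The sketch below illustrates that policy in Python; the policy itself, the next_batch function, and the max_batch parameter are assumptions made for illustration and may differ from what Eve actually does.

```python
from collections import deque


def next_batch(pending, max_batch=64):
    """Take the requests that are currently queued, up to max_batch of them.

    Under light load the queue is short, so batches are small and latency
    stays low. Near saturation the queue grows, so batches get larger and
    expose more requests that can run in parallel.
    """
    size = min(len(pending), max_batch)
    return [pending.popleft() for _ in range(size)]
```

For example, calling next_batch on a queue holding 3 requests returns all 3, while calling it on a queue holding 100 requests returns the first 64 and leaves the rest for the following batch.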