Designing Data-Intensive Applications: opportunities and challenges for distributed systems


The first part of *Designing Data-Intensive Applications* introduces the basic theory and knowledge of data systems, all on a single node. The second part of DDIA (Distributed Data) extends the field of view to distributed data systems. There are three main reasons to distribute data:

    • Scalability
    • Fault tolerance / high availability
    • Reduced latency

When load increases, there are two ways to respond: scale up vs. scale out. The former means using more powerful but expensive hardware: faster CPUs, more cores, more RAM, higher capacity, faster disks. This shared-memory approach is not only expensive, it also has poor fault tolerance. The latter, the shared-nothing architecture used by distributed data systems, copes with growing load by adding more commodity machine nodes, and is the main way of handling large volumes of data.

There are two ways to distribute data across multiple nodes: replication and partitioning. The figure in *Distributed systems for fun and profit* illustrates both:


Of course, distributed systems are not a silverver bullet. While distribution provides scalability and high availability, it poses many challenges, such as distributed transactions and consensus.

As shown, replication (a replica set) stores copies of the data on multiple nodes. This redundancy has the following benefits:

    • Reduce latency: keep data geographically close to your users
    • Increase availability: allow the system to continue working even if some of its parts have failed
    • Increase read throughput: scale out the number of machines that can serve read queries

The biggest challenge of a replica set is data consistency: how to ensure that all replicas in the replication set agree under certain constraints. Depending on the roles in the replication set (leader, follower), there are three kinds of algorithms: single-leader, multi-leader, and leaderless. I covered the centralized (single-leader) replication protocol in more detail in an earlier article on distributed centralized replica sets.

Single leader

A centralized (single-leader) replication protocol needs to consider the following issues:

(1) Whether data is replicated synchronously or asynchronously between nodes

(2) How a new follower (secondary) catches up with the data quickly

(3) How to handle node failures: a failed follower (secondary) needs to catch up, while a failed leader (primary) requires electing a new leader. Determining that the leader has failed, ensuring no data is lost during leader failover, and avoiding split brain (multiple leaders at the same time) are all challenges.

In many cases, asynchronous replication is the better approach because it offers better availability and higher concurrency. However, asynchronous writes must deal with replication lag: the delay between leader and follower means the data a user reads from different nodes in the replica set may be inconsistent. Below are some specific cases showing how to ensure a certain degree of consistency.

Reading Your Own writes

Users should be able to see content they have successfully updated themselves, without caring whether other users can see it immediately. This requires read-after-write consistency.

Implementation method:

(1) When reading content that the user may have modified, read from the leader; otherwise read from a follower

(2) Record the update time; once a certain interval has passed, read from a follower
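The two routing rules above can be sketched together. This is a minimal illustration, not a real client library: `"leader"`/`"follower"` stand in for replica handles, and the one-second lag window is an assumed upper bound on replication lag.

```python
import time

class ReadYourWritesRouter:
    """Route reads so a user always sees their own recent writes (sketch)."""

    def __init__(self, lag_window=1.0):
        self.lag_window = lag_window   # assumed max replication lag, seconds
        self.last_write = {}           # user_id -> time of that user's last write

    def record_write(self, user_id, now=None):
        self.last_write[user_id] = now if now is not None else time.time()

    def pick_replica(self, user_id, now=None):
        now = now if now is not None else time.time()
        wrote_at = self.last_write.get(user_id)
        # Within the lag window after the user's own write, read from the
        # leader; otherwise any follower is safe *for this user*.
        if wrote_at is not None and now - wrote_at < self.lag_window:
            return "leader"
        return "follower"
```

Note that this only protects a user from missing their own writes; other users may still see stale data.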

Monotonic reads

This only means that if one user makes several reads in sequence, they will not see time go backward.

That is, once a user has read a new version of the data, repeated reads will never return an older version.

Implementation approach:

(1) Each user always reads from the same fixed replica
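One common way to pin each user to a fixed replica is to hash the user ID. A small sketch, assuming replicas are just names; a stable hash (rather than Python's per-process `hash()`) keeps the mapping consistent across runs and machines:

```python
import hashlib

def replica_for_user(user_id: str, replicas: list) -> str:
    """Pin a user to one replica so repeated reads never go 'back in time'."""
    digest = hashlib.sha256(user_id.encode()).digest()
    # Use the first 8 bytes of the digest as a stable integer key.
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```

The obvious caveat: if that replica fails, the user must be rerouted, and monotonicity may briefly be lost.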

Consistent prefix reads

Causality: for example, "ask a question" and "answer that question"; the former must occur first. But with asynchronous replication between the nodes of a replica set, a third party (an observer) may see the answer first and the question afterwards, which violates causality:


This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

One solution is to make sure that any writes that are causally related are written to the same partition. This problem does not occur in a single-partition setup; it only arises in partitioned (sharded) environments.


(1) Causally related operations are routed to the same partition

Leaderless replication

Leaderless replication is a decentralized replication protocol: there is no central node in the replica set, all nodes have equal status, any node can accept update requests, and the nodes coordinate with one another to reach consensus on the data. Amazon's Dynamo ("Dynamo: Amazon's Highly Available Key-value Store") and its open-source descendants (Riak, Cassandra, and Voldemort) use leaderless replication.

The biggest advantage of leaderless replication is high availability: the failure of a single (minority) node does not make the system unavailable. The core of this availability is the quorum protocol: with N nodes in the replica set, if a write is acknowledged by W nodes and each read queries R nodes, then as long as W + R > N, the R responses must include the latest data. As shown below:


In fact, every write or read is sent to all nodes, but the client is notified of the result as soon as W (for writes) or R (for reads) nodes have returned successfully.
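The W + R > N overlap argument can be made concrete. A minimal sketch, assuming each replica response is a `(version, value)` pair where a larger version number means a more recent write:

```python
def quorum_read(replica_values, n, w, r):
    """Return the freshest (version, value) seen by an r-node read.

    replica_values: list of (version, value) pairs, one per responding
    replica. Sketch only: assumes w + r > n so that the read quorum must
    overlap the most recent write quorum in at least one node.
    """
    if w + r <= n:
        raise ValueError("w + r must exceed n for the quorums to overlap")
    if len(replica_values) < r:
        raise ValueError("not enough replicas responded")
    # The overlap guarantees at least one response carries the latest write;
    # pick the highest version among the r responses.
    return max(replica_values[:r], key=lambda pair: pair[0])
```

With n = 3, w = 2, r = 2, any two read responses must include at least one node that took part in the last successful write.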

As shown, Node 3 (Replica 3) returns stale data because a write to it failed; the data system needs to reconcile the replicas to reach eventual consistency. There are two ways:

  Read repair: when a read returns data from several replicas, repair the outdated ones.

  Anti-entropy process: a background process constantly checks for differences between replicas.

Quorums are not omnipotent. Even with quorums, leaderless replication has the following potential problems:

    • Conflicts caused by concurrent writes at different nodes, which are the biggest challenge of leaderless replication
    • Under read-write concurrency there is a lack of isolation, so stale data may be read
    • A failed write (fewer than W nodes succeeded) is not rolled back

Detecting Concurrent Writes

Leaderless concurrent writes can conflict, and conflicts can also surface during read repair or hinted handoff. The following is an example of a conflict:


Under concurrency, if each node applies write requests in the order it receives them, the replica set cannot agree: as shown, different nodes end up with inconsistent data. One way to resolve concurrency conflicts is last write wins (LWW), which is how Cassandra resolves conflicts. Its prerequisites are accurately determining which write is most recent, and copying every write to all replicas. The disadvantage is data loss: some writes are silently discarded even though the client was told the write succeeded.
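The LWW rule itself is tiny, which is exactly why its data-loss behavior is easy to miss. A sketch, assuming each write carries a `(timestamp, value)` pair attached at write time:

```python
def last_write_wins(versions):
    """Resolve conflicting writes by keeping the highest timestamp.

    versions: list of (timestamp, value) pairs for the same key. Every
    version with a lower timestamp is silently discarded -- even writes
    that were already acknowledged to a client -- which is the data-loss
    risk discussed above.
    """
    return max(versions, key=lambda pair: pair[0])[1]
```

If node clocks disagree, the "highest timestamp" may not be the causally latest write, which is why LWW plus skewed clocks loses data.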

The "happens-before" relationship and concurrency

How to tell whether two operations are concurrent: there is no happens-before relationship between them.

An operation A happens before another operation B if B knows about A, or depends on A, or builds upon A in some way.

In fact, we can simply say that two operations are concurrent if neither happens before the other.

If there is a happens-before relationship, it is safe for the later operation to overwrite the earlier one; only concurrent operations conflict.

Version vectors can be used to determine the dependencies between multiple write operations.
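A version-vector comparison can be written directly from the happens-before definition. A sketch, representing a vector as a dict mapping node name to counter (a missing node counts as 0):

```python
def happens_before(a, b):
    """True iff version vector a happened before b (a < b componentwise)."""
    nodes = set(a) | set(b)
    at_most = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    strictly = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return at_most and strictly

def concurrent(a, b):
    """Two writes conflict iff neither happened before the other."""
    return not happens_before(a, b) and not happens_before(b, a)
```

When `concurrent(a, b)` is true the system must keep both versions (siblings) or merge them; when one happens before the other, the later one may safely overwrite.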


I have already described partitioning (sharding) in detail in an earlier study of distributed-system data partitioning, for reference. Therefore this section only adds new knowledge.

The main reason for partitioning is scalability. How to split the data and how to rebalance it are the two basic problems partitioning needs to solve.

If one partition receives more data or queries than the others, that phenomenon is called skew, and the high-load partition is a hot spot.

Partitioning and secondary Indexes

Partitioning shards data according to the primary key; how, then, are secondary indexes handled?

There are two main approaches to partitioning a database with secondary indexes: document-based partitioning and term-based partitioning.

Partitioning secondary Indexes by Document

Each partition maintains its own secondary index, covering only the documents on that partition.

A document-partitioned index is also known as a local index.

Therefore, a write only needs to modify the local secondary index on its own shard.

When querying by a secondary index, the query must be executed on all shards and the results merged (scatter/gather). In the figure, color is a secondary index.
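The scatter/gather path can be sketched in a few lines. Here each shard is just a dict mapping document ID to document, standing in for a real shard connection; the point is that every shard must be queried and the partial results merged:

```python
def scatter_gather(shards, field, value):
    """Query a document-partitioned (local) secondary index (sketch)."""
    results = []
    for shard in shards:                  # scatter: ask every shard
        for doc_id, doc in shard.items():
            if doc.get(field) == value:   # each shard scans only its own index
                results.append(doc_id)
    return sorted(results)                # gather: merge the partial results
```

This is why reads on a local index get slower as the shard count grows, even when only one shard actually holds matching documents.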


Local indexes are widely used: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud and VoltDB.

Partitioning secondary Indexes by term

Also known as a global index: the secondary-index data is itself partitioned.

Compared to a local index, the advantage is that reads via the secondary index are more efficient (no scatter/gather). The disadvantage is that writes are slower and more complex (a distributed transaction is required for correctness).

Rebalancing partitions

The goals of rebalancing are:

    1. Load is balanced across nodes after the rebalance

    2. The rebalance does not interrupt read and write services

    3. No more data than necessary is migrated between nodes

Request Routing

In a sharded environment, how does the client learn which node to communicate with?

This is an instance of a more general problem called service discovery.


(1) The client connects to any node; if that node cannot process the request, it forwards the request to the correct node

(2) The client sends requests to a routing tier

This routing tier does not itself handle any requests; it only acts as a partition-aware load balancer.

(3) The client knows the mapping between shard information and nodes
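Approaches (2) and (3) both need a key-to-node mapping. A sketch of a partition-aware router under assumed hash-range partitioning; the node names and range boundaries are invented for illustration:

```python
import bisect
import hashlib

class PartitionRouter:
    """Map a key to the node owning its hash range (sketch)."""

    def __init__(self, boundaries, nodes):
        # boundaries[i] is the (exclusive) upper bound of partition i's
        # hash range; nodes[i] owns that range.
        assert len(boundaries) == len(nodes)
        self.boundaries = boundaries
        self.nodes = nodes

    def node_for(self, key: str) -> str:
        # Stable hash of the key, folded into the total hash space.
        h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
        h %= self.boundaries[-1]
        # Binary-search for the partition whose range contains h.
        return self.nodes[bisect.bisect_right(self.boundaries, h)]
```

In a real system this mapping changes during rebalancing, which is why many systems keep it in a coordination service such as ZooKeeper.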


Transactions are an important means of improving system reliability in the face of various hardware and software faults.

A transaction is a way for an application to group several reads and writes together into a logical unit.

Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback).

The multiple operations making up a transaction either all succeed (commit) or none take effect (rollback, abort); there is no partially successful execution. This is the all-or-nothing property.

Transactions simplify exception handling in the application layer. Whether a system needs transactions depends on the safety they provide versus the costs they incur. Traditional relational databases chose to support transactions, whereas many distributed (NoSQL) databases abandoned transaction support, on the view that transactions run counter to scalability and can hurt the performance and availability of large systems.

When we talk about transactions, we generally refer to their ACID properties.

One database's implementation (or even understanding) of ACID is not necessarily equivalent to another's; the most complicated property is isolation.

Isolation means that two concurrent transactions do not interfere with each other: one transaction cannot see the intermediate states of another running transaction. Concurrent reads do not interfere with each other; only read-write or write-write concurrency causes race conditions. The strongest form of isolation is serializability, which achieves the same effect as executing transactions sequentially, but this approach has performance problems. Therefore databases provide different isolation levels to trade isolation against concurrency.

I intend to write a separate note on isolation levels.

The trouble with distributed Systems

Distributed systems bring more challenges and more unexpected errors and anomalies. In addition to the problems of single-node systems, the two challenges that distributed systems need to address are:

    • Problems with networks
    • Clocks and timing issues

Unlike a single-node system, a distributed system is prone to partial failure: some parts work while others fail. The biggest problem with partial failure is nondeterminism. Distributed systems need to implement fault tolerance at the software level to cope with partial failures.

Unreliable network vs detecting faults vs timeout

The network used by distributed systems is unreliable: packets may be lost or delayed, and the loss or delay can occur on either the request path or the response path, nondeterministically.

One of the most important applications of network messaging is the heartbeat.

The system needs to detect abnormal nodes: for example, a load balancer must monitor for non-working backends, and a single-leader replication protocol must monitor the leader.

When a node crashes, it is best if this can be accurately detected and the other nodes in the system notified. But often it is impossible to determine whether a node has crashed, and a node may be unable to make progress without having crashed; in both cases we still depend on heartbeat timeouts. I previously wrote an article, "Hey, man, are you OK?", introducing heartbeats, fault detection, and the lease mechanism.

When using timeouts for failure detection, choosing the timeout is a problem: if it is too long, detection takes a long time; if it is too short, it is easy to misjudge.

If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse, leading to cascading failure.

Moreover, network delay varies greatly across environments: congestion can queue packets at the sender, in network switches, in the hypervisor, or while the receiving CPU is busy, and in multi-tenant (oversold) environments other services' traffic may inflate delays. A more robust approach is to adjust the timeout automatically based on measured network delay, as in the Phi accrual failure detector; TCP retransmission timeouts use a similar idea.

Unreliable clocks vs process pause

Time matters because it implies order, duration, and points in time.

The time we usually use is time-of-day (wall-clock) time: the time returned according to a calendar. In programs, wall-clock time has some problems:

    • NTP synchronization may cause time to jump backward
    • Leap seconds are usually ignored

Therefore wall-clock time is not suitable for measuring elapsed time.

Thus operating systems provide another kind of clock, the monotonic clock, such as clock_gettime(CLOCK_MONOTONIC) on Linux. A monotonic clock guarantees that time never jumps backward.
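In Python the distinction shows up as `time.time()` (wall clock) versus `time.monotonic()` (which wraps a monotonic source such as CLOCK_MONOTONIC on Linux). A minimal sketch of measuring a duration the safe way:

```python
import time

def measure_elapsed(fn):
    """Run fn() and measure its duration with the monotonic clock.

    time.monotonic() never jumps backward, so the difference is a valid
    duration even if NTP steps the wall clock mid-measurement; the same
    code written with time.time() could report a negative duration.
    """
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start
    return result, elapsed
```

Note that monotonic readings are only meaningful as differences within one process; they are not comparable across machines.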

When the clocks of the nodes in a distributed system disagree, various problems arise. A common but error-prone pattern is using timestamps to determine the order of events across multiple nodes.


In a leaderless replica set, last write wins (LWW) is one way to resolve concurrency conflicts; if node clocks disagree, it can cause data to be silently lost.

Even with NTP, clock consistency across nodes cannot be fully guaranteed. An interesting idea is to use confidence intervals:

Clock readings have a confidence interval:

It doesn't make sense to think of a clock reading as a point in time; it is more like a range of times, within a confidence interval.

Many algorithms and protocols rely on local time judgments, such as leases. Even if every node's clock agrees, there can still be problems, namely process pauses.

For example, code may check its lease before executing, find the lease still valid, then suffer a process pause; by the time it resumes, the lease may no longer be valid. Since a pause can happen anywhere, you cannot simply re-check.

There are many things that can cause a process pause:

    • GC
    • A virtual machine being suspended and resumed
    • Multithreading
    • Disk IO: unintended disk access, such as a Python import
    • Swap
    • Unix SIGSTOP (Ctrl-Z)

The most characteristic is the GC's stop-the-world behavior, which occurs in memory-managed languages such as Java and Python.

A GC-induced process pause, as has occurred in HBase:


In a distributed-lock implementation that uses leases, after a stop-the-world GC pause, client 1 still believes it holds the lease, while in fact its lease has expired. So in a distributed system:

The truth is defined by the majority.

A node cannot necessarily trust its own judgment of a situation.

The solution is simple: fencing tokens.
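The fencing-token idea can be sketched from the storage side. Assumed setup: the lock service hands out a monotonically increasing token with each lease, and the storage service remembers the highest token it has seen, so a paused client holding an expired lease (and therefore an old token) can no longer corrupt data:

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token (sketch)."""

    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token, key, value):
        # A token lower than one already processed must come from a client
        # whose lease has expired (e.g. it was paused by a GC).
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value
```

The key point is that the check happens on the storage side, not in the client: the client's own belief about its lease cannot be trusted.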

System Model and Reality

When we discuss algorithms and protocols, we always do so under some system model; the system model states the premises and assumptions about the environment in which the algorithm works.

A system model is an abstraction that describes what things an algorithm may assume.

Assumptions about the time:

    1. Synchronous model
    2. Partially synchronous model: synchronous (bounded delays) the vast majority of the time, occasionally exceeding the bound; relies on the timeout mechanism
    3. Asynchronous model

Assumptions about node failures:

    1. Crash-stop faults
    2. Crash-recovery faults (nodes are assumed to have stable storage)
    3. Byzantine (arbitrary) faults

How do we judge whether an algorithm's design and implementation are correct? Under the system model, the promised properties are satisfied: for example, a uniqueness property, or the atomicity property of a transaction.

Properties can be divided into two categories:

Safety: nothing bad happens.

Liveness: something good eventually happens.

A distributed algorithm must satisfy its safety properties under any system model:

For distributed algorithms, it is common to require that safety properties always hold, in all possible situations of a system model; however, with liveness properties we are allowed to make caveats.

Consistency and Consensus

This chapter discusses the fault-tolerant algorithms and protocols in distributed systems.

The best way to build a fault-tolerant system is to find a generic abstraction that solves a whole class of problems, implement it once, and let the application layer ignore those problems even when various faults occur, just like the transactions provided by a database. In distributed systems, the important abstraction is consensus: getting all of the nodes to agree on something.

Linearizability & Causality

I described the CAP theorem in an article on CAP and MongoDB. CAP says that a distributed data store can provide at most two of consistency (C), availability (A), and partition tolerance (P). Strong consistency guarantees that every read returns the most recently written data or an error.

Linearizability achieves strong consistency, because it can:

make a system appear as if there were only one copy of the data, and all operations on it are atomic.

Linearizability is a useful property: for example, electing a leader via a lock requires the lock to be linearizable, as do uniqueness constraints.

Can the different replication protocols provide linearizability? For single-leader: if you read only from the leader, it is basically linearizable, with exceptions (for example when data is rolled back). For leaderless: quorums theoretically ensure linearizability, but in practice non-linearizable behavior still occurs, as shown below.

The figure illustrates that even when quorums are satisfied there is no guarantee of linearizability: dirty reads can occur, and if a write succeeds on only some nodes, subsequent reads cannot guarantee linearizability.

Linearizability is in effect strong consistency. Although it is easy to understand and easy to use, most distributed systems do not provide it, because linearizability has poor fault tolerance and poor performance.

In distributed consistency semantics, linearizability means there is effectively only one copy of the data and each operation takes effect atomically at some point in time, which implies an order on operations.

Linearizability is a total order: there is one copy, operations take effect atomically, and all operations are ordered relative to one another. In fact many operations can run concurrently, as long as they do not affect each other.

Causal consistency is a partial order: some operations are ordered, and others can be concurrent.

In fact, causal consistency is the strongest possible consistency model that does not slow down due to network delays and remains available in the face of network failures. Linearizability is expensive, and often unnecessary.
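Causal ordering across nodes can be captured with Lamport timestamps. A minimal sketch: each node keeps a counter, bumps it on every local event, attaches it to every message, and fast-forwards it on receipt; the resulting timestamps give an order consistent with causality.

```python
class LamportClock:
    """Per-node Lamport timestamp (sketch)."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the returned timestamp travels
        # with the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Fast-forward past the sender's timestamp so the receive event
        # is ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

Unlike version vectors, Lamport timestamps cannot distinguish concurrent operations from causally ordered ones; they only guarantee that if A happened before B, A's timestamp is smaller.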

Recording causal order across multiple nodes is more complicated; see Lamport timestamps for the details.

Consensus & epoch & quorum

Causality does not solve every problem. For example, when two users register the same username, there is no causal relationship between the two operations, yet the uniqueness constraint is violated; a consensus algorithm is needed. Consensus means getting several nodes to agree on something, and it clearly solves the uniqueness-constraint problem. Single-leader election and atomic commit for distributed transactions also require consensus.

Two-phase commit (2PC) is the classic means of implementing distributed transactions, and through 2PC one can also reach consensus. However, 2PC has poor fault tolerance: node failures and network timeouts lead to retries that block until the node or network recovers.
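The coordinator side of 2PC fits in a few lines, which makes its weakness easy to see. A sketch with an invented in-memory `Participant` standing in for a resource manager; timeouts and coordinator crashes are deliberately not modeled, and those are exactly where 2PC blocks:

```python
class Participant:
    """In-memory stand-in for a resource manager (names invented)."""

    def __init__(self, vote_yes=True):
        self.vote_yes = vote_yes
        self.state = "init"

    def prepare(self):
        # Phase 1: durably record the intent, then vote.
        self.state = "prepared" if self.vote_yes else "aborted"
        return self.vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Coordinator: commit iff every participant votes yes."""
    # Phase 1: collect votes.
    if all(p.prepare() for p in participants):
        # Phase 2: unanimous yes -> commit everywhere.
        for p in participants:
            p.commit()
        return "commit"
    # Any no vote (or, in reality, a timeout) aborts the transaction.
    for p in participants:
        p.abort()
    return "abort"
```

If the coordinator crashes between the two phases, prepared participants are stuck holding locks until it recovers, which is the poor fault tolerance noted above.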

The consensus problem is defined as:

One or more nodes may propose values, and the consensus algorithm decides on one of those values.

The properties a consensus algorithm must satisfy:

Uniform agreement

No two nodes decide differently.

Integrity

No node decides twice.

Validity

If a node decides value v, then v was proposed by some node.

Termination

Every node that does not crash eventually decides some value.

The first three are safety properties; the last is a liveness property, and it also requires the system to be fault-tolerant (2PC cannot satisfy it).

A single leader can guarantee agreement, but electing that leader itself relies on consensus. Common fault-tolerant consensus algorithms include Viewstamped Replication (VSR), Paxos, Zab, and Raft.

Consensus algorithms rely on a leader, but the leader is not fixed: the protocols define an epoch number (called the ballot number in Paxos, the view number in Viewstamped Replication, and the term number in Raft) and guarantee that within each epoch, the leader is unique.

Therefore a single leader is only a delaying tactic: it does not remove the need for consensus, but avoids reaching consensus on every operation.

Different data systems meet the consensus needs of leader election in different ways. MongoDB uses a Raft-like algorithm to elect a leader among replica nodes. Other systems, such as HBase, outsource consensus and fault detection to a service such as ZooKeeper, leaving the specialized work to specialists and greatly simplifying the data system.


The second part of DDIA is very dense, involving a large number of algorithms and theories; reading the book alone makes them hard to fully understand. To me, leaderless replication and consensus are still not completely clear, e.g. causality under leaderless replication, vector clocks, Lamport clocks, and the Paxos and Raft algorithms. It will take more time to look into them.


Designing Data-intensive Applications

Distributed systems for fun and profit
