1. Abstract & Introduction
Ref: http://static.googleusercontent.com/media/research.google.com/zh-cn//archive/spanner-osdi2012.pdf
Spanner is Google's new generation of database system, launched to make up for BigTable's shortcomings. First, let's look at those shortcomings.
Before reading this article, you may want to read my notes on BigTable: http://www.cnblogs.com/zwCHAN/p/3698191.html
- BigTable supports only row-level transactions; it does not support cross-row transactions, let alone cross-table, cross-database, or cross-datacenter transactions;
- By relying on GFS, BigTable naturally gets the fault tolerance and read/write-separation concurrency that other databases achieve through complex replication; but for cross-datacenter disaster recovery GFS offers no effective solution, so BigTable certainly does not support it either;
- It has the advantages of the key-value model for massive data, but also the disadvantages of that model: compared with relational databases that support SQL, its query capability is extremely limited;
In addition, I was surprised to learn that applications I use all the time, such as Gmail, Calendar, Android Market, and App Engine, run on Megastore.
Therefore, Spanner's design goal is to combine the advantages of traditional relational databases and key-value databases: a scalable, multi-version, globally distributed, synchronously replicated semi-relational database. To support these features it introduces the concept of TrueTime, uses the Paxos consensus algorithm, handles read-only transactions without locks, and provides an SQL-like query language similar to Megastore's. To support globally distributed (synchronously replicated) operation, Spanner has several instructive features:
- Fine-grained control over data replication and placement policy is left to the application. In a traditional database such as MySQL this is transparent; here you clearly cannot have both the fish and the bear's paw: the theoretical global round-trip latency is over 100 ms, and even Google cannot beat the speed of light. So the application makes the trade-off according to its own needs: it controls how many replicas the data has (replicas in different datacenters; within a single datacenter, redundancy is already handled by GFS) and in which datacenters those replicas are placed, while Spanner takes care of replicating the data among the designated datacenters. A hypothetical placement sketch follows this list.
- Spanner supports globally consistent reads and writes at a timestamp. This is very hard to do.
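To make the replication-policy idea concrete, here is a minimal sketch of what an application-controlled placement specification could look like. This is purely illustrative: `PlacementPolicy`, its field names, and the datacenter labels are all made up, not Spanner's actual interface.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical illustration only: these names are not Spanner's API.
@dataclass
class PlacementPolicy:
    num_replicas: int        # how many replicas, one per chosen datacenter
    datacenters: List[str]   # which datacenters may hold a replica
    leader_preference: str   # where the Paxos leader should live if possible

# An application serving mostly US users might trade global durability
# against write latency like this; Spanner would then keep the replicas
# in the designated datacenters in sync.
mail_policy = PlacementPolicy(
    num_replicas=3,
    datacenters=["us-east", "us-central", "eu-west"],
    leader_preference="us-east",
)
print(mail_policy)
```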
2. The logical structure of the implementation
2.1 Spanner:
- The outermost level, the universe, is the largest unit of organization for the whole database deployment; from that name alone you can see how ambitious Google is.
- The universemaster, apart from its grand name, only provides some monitoring and debugging functions;
- Placement driver: despite such a low-key name, this is the real heavyweight: it is responsible for automatic cross-zone data movement and replication;
- Zone: from the zone level down, the structure is basically equivalent to a BigTable deployment.
- Zonemaster
- Location proxy
- Spanserver
- Directory: in addition, the directory extends the locality group concept of BigTable; it is the basic unit of data replication, management, and movement. A major change with directories is that within a directory, keys are globally unique and sorted lexicographically, while keys in different directories may overlap. This is different from BigTable. (A rough sketch follows below.)
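A rough way to picture the directory abstraction (my own sketch, not Spanner code): a directory is a contiguous, lexicographically sorted set of keys sharing a common prefix, and data is moved between Paxos groups one whole directory at a time.

```python
# Toy model: directory prefix -> sorted keys belonging to that directory.
directories = {
    "user:alice/": sorted(["user:alice/inbox", "user:alice/sent"]),
    "user:bob/":   sorted(["user:bob/drafts", "user:bob/inbox"]),
}

def directory_of(key):
    """Return the directory (key prefix) that a key belongs to."""
    for prefix in directories:
        if key.startswith(prefix):
            return prefix
    return None

# Within a directory keys are unique and ordered; the directory as a whole
# is the unit of replication, management, and movement.
assert directory_of("user:alice/inbox") == "user:alice/"
```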
2.2 Software Architecture
- Data Model:
- (key:string, timestamp:int64) → string; columns are not shown here, but they certainly exist. The biggest difference is that the timestamp is part of the data, which resembles MySQL's multi-version concurrency control (MVCC) strategy. This clearly corrects BigTable's design of exposing the timestamp outside the data model. (A minimal MVCC read sketch follows this list.)
- Spanner requires that in each row some columns be designated as the primary key, with the remaining columns forming the value associated with that key. So it still looks like a key-value model.
- The paper provides two diagrams of the software architecture;
- I put the two diagrams together to understand them better. The function of each module:
- Paxos: each tablet has a corresponding Paxos state machine, which stores the state of that tablet's replica within each zone and keeps the replicas consistent. From this state information you can also determine whether a replica is sufficiently up-to-date to serve a timestamp-based read.
- Paxos group: the Paxos state machines of all replicas of a directory together form a Paxos group, and one of them is elected as the leader of the group;
- Lock table: each leader maintains a lock table recording which key ranges of its replica are currently held by in-flight transactional operations (a minimal lock-table sketch follows this list);
- Leader: each directory's data has multiple replicas, distributed across different zones; the Paxos state machine of one replica's tablet is elected leader and is responsible for synchronization among the replicas;
- Transaction manager: each spanserver has a transaction manager service responsible for coordinating transactions that touch data in different directories;
- Participant leader/slave: the Paxos-leader replica of a participating group naturally becomes the participant leader, and the others are participant slaves. When only one Paxos group participates in a transaction, the synchronization provided by the lock table and Paxos is enough to give the transactional properties: atomicity, consistency, isolation, durability (ACID).
- Coordinator: when multiple Paxos groups participate in a transaction, the groups must perform a two-phase commit (2PC). 2PC requires a coordinator, hence the concept. It differs from the leader: the leader is the leader within one Paxos group, needed because the Paxos consensus algorithm requires a leader, whereas the coordinator is the "leader" among the Paxos groups, needed because the 2PC algorithm requires a coordinator (it is not called a leader, but plays essentially the same role). As for coordinator leader/slave, that is just the usual distinction within the Paxos group that is elected coordinator, with no special additional meaning.
- Movedir is the task responsible for moving directories. Because one Paxos group holds multiple directories and only one directory can occupy the Paxos state machine at a time, Movedir reduces the chance of contention by moving the data in two passes: the first pass does not hold Paxos and does not block any reads or writes (think of it as first copying the data into a buffer); the second pass does take the Paxos state machine, but only has to move the data that changed between the first pass and the moment the Paxos state lock was acquired;
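To illustrate the (key, timestamp) → value data model mentioned above and why it enables lock-free snapshot reads, here is a minimal MVCC sketch of my own (not Spanner's storage code): each key keeps a list of timestamped versions, and a read at timestamp t returns the newest version with timestamp <= t.

```python
class VersionedStore:
    """Toy multi-version store: (key, timestamp) -> value."""

    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value), kept sorted

    def write(self, key, timestamp, value):
        self.versions.setdefault(key, []).append((timestamp, value))
        self.versions[key].sort(key=lambda tv: tv[0])

    def read_at(self, key, timestamp):
        """Return the newest value whose version timestamp is <= `timestamp`."""
        result = None
        for ts, value in self.versions.get(key, []):
            if ts <= timestamp:
                result = value
            else:
                break
        return result

store = VersionedStore()
store.write("balance", timestamp=10, value="100")
store.write("balance", timestamp=20, value="80")
assert store.read_at("balance", 15) == "100"  # snapshot read sees the t=10 version
assert store.read_at("balance", 25) == "80"   # later snapshot sees the t=20 version
```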
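And a minimal sketch of the lock-table idea kept by each leader (again my own illustration, using simple per-key write locks instead of the key-range locks the paper describes):

```python
class LockTable:
    """Toy lock table on a Paxos leader: key -> id of the owning transaction."""

    def __init__(self):
        self.locks = {}

    def acquire(self, txn_id, keys):
        """Try to take write locks on all keys; fail if any is held by another txn."""
        if any(self.locks.get(k) not in (None, txn_id) for k in keys):
            return False                 # conflict: caller must wait and retry
        for k in keys:
            self.locks[k] = txn_id
        return True

    def release(self, txn_id):
        self.locks = {k: t for k, t in self.locks.items() if t != txn_id}

table = LockTable()
assert table.acquire("txn-1", ["user:alice/inbox"])
assert not table.acquire("txn-2", ["user:alice/inbox"])  # blocked by txn-1
table.release("txn-1")
assert table.acquire("txn-2", ["user:alice/inbox"])
```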
3. TrueTime
The TrueTime concept is very important in Spanner, but the paper does not explain how it is implemented; it says this is described in another paper, which I could not find. TrueTime's API:
TT.now(): in an ordinary API such an interface would return an exact time, but here it returns a time interval, meaning that the true time at the moment TT.now() is called is guaranteed to lie within the returned interval. This is actually easy to understand. On a single machine, determining a time is simple: just read the current clock. But if you want every machine, at the same absolute moment (absolute time meaning the objective time that does not depend on any computer's clock), to return the same value from TT.now(), then all of the machines' clocks would have to be exactly identical, which is impossible (see http://www.cnblogs.com/zwCHAN/p/3652948.html). Suppose the clock difference among all machines (the slowest clock minus the fastest clock) is E; then calling TT.now() on any machine, the true time at that moment is bounded by (TT.now() - E, TT.now() + E). For example, suppose a perfect clock marks absolute time, and 10 s after the absolute time 9:00:00 we read the time on three servers at the same instant; they get different values, with an error of 7 s. Then someone who does not know the absolute time, but does know that the error is 7 s, reads 9:00:08 on server1 and can be sure the true time lies in (9:00:01, 9:00:15) (clock frequency drift is below 10^-6 and can be ignored).
In practice it is impossible to obtain the exact error E (that would amount to solving clock synchronization); we can only determine some E' with E' >= E, which is equivalent to taking a "safe" bound. For example, assuming a 10 s error in the example above is certainly correct. This also shows that the smaller the real skew between servers, the smaller the E' we can use.
The paper says Google uses GPS-based clock synchronization, which can keep E within about 10 ms, and their goal is 1 ms. The effect of the TrueTime error on the system is also stark: within one error interval, only one (conflicting) transaction can commit.
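A minimal sketch of the TrueTime API shape as I understand it (my own illustration: the fixed epsilon here is just an assumed safe bound E', not how Google's time masters actually compute the uncertainty):

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # true absolute time is guaranteed to be >= earliest
    latest: float    # true absolute time is guaranteed to be <= latest

class TrueTime:
    """Toy TrueTime: wraps the local clock with an assumed error bound E'."""

    def __init__(self, epsilon_seconds):
        self.epsilon = epsilon_seconds  # safe bound E' >= the real error E

    def now(self):
        t = time.time()
        return TTInterval(earliest=t - self.epsilon, latest=t + self.epsilon)

    def after(self, timestamp):
        """True only once `timestamp` has definitely passed in absolute time."""
        return self.now().earliest > timestamp

tt = TrueTime(epsilon_seconds=0.010)        # roughly the 10 ms bound from the paper
interval = tt.now()
print(interval.latest - interval.earliest)  # the interval width is 2 * epsilon
```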
4. Concurrency Control
In Spanner, only read-write transactions use a pessimistic locking mechanism; the other operations are lock-free. Spanner relies on the timestamps assigned to transactions being monotonically increasing, i.e. it must guarantee an ordering of transactions.
For read-write transactions, the coordinator leader, which sits at the top of the software architecture, assigns a timestamp s_i to the write, making sure that s_i is not less than TT.now().latest, i.e. this timestamp is no smaller than the timestamps proposed on all the other machines involved in the transaction. When every transaction does this, timestamps are guaranteed to be monotonically increasing (of course, the coordinator may have to wait to satisfy this condition).
Let's walk through a complete read-write transaction: a client transaction needs to read and then update values in group1, group2, and group3 (the numbers in the figure identify the steps). Once you understand read-write transactions, the most complex case, the other transaction types are easy. A sketch of the coordinator's commit-wait logic follows this list.
- The client asks the relevant leader replicas to acquire read locks;
- The client reads the most recently updated data from a replica;
- The client buffers all of its writes locally;
- The client chooses a coordinator group and then initiates a 2PC commit;
- The client sends the identity of the chosen coordinator plus the buffered writes to every participant leader (this avoids sending everything to the coordinator and having the coordinator forward it to the participants; at this point the transaction is optimistically assumed to succeed);
- Each non-coordinator participant leader (i.e. the leader of a Paxos group that is not the coordinator) acquires write locks within its group, then chooses a prepare timestamp larger than the timestamps of all previous transactions in that group, and writes a prepare record through Paxos;
- Each non-coordinator participant leader then sends its chosen prepare timestamp to the coordinator (this completes the 2PC prepare phase);
- The coordinator also acquires write locks locally (it evidently does not need to go through the prepare phase itself; the client already did it the favor of sending the writes). It then waits for the prepare timestamps sent by all participant leaders (step 7) and takes their maximum, Pmax;
- The coordinator gets TT.now() and compares TT.now().latest with Pmax. If TT.now().latest >= Pmax, it takes this TT.now().latest as the commit timestamp s and proceeds; otherwise it repeats this step (step 9). (This step guarantees that transaction timestamps increase monotonically);
- The coordinator commits a commit record to its Paxos state machine (marking the transaction as committed);
- Once the coordinator has committed the record to its Paxos state machine, the other participants know that the transaction has committed (though not yet which commit timestamp the coordinator chose). But the replicas that need to be synchronized do not get the data right away: the coordinator must wait until TT.after(s) holds before allowing it to become visible. This means waiting out roughly twice the clock error, to make sure the timestamp (as a time interval) has truly passed (up to this point, no locks have been released). Only then does the coordinator notify all participants and the client of the commit timestamp and the transaction's outcome;
- Each participant logs the transaction's outcome through its own Paxos group;
- All participants release their lock resources once they have obtained the transaction's outcome and timestamp;
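Pulling steps 8-11 together, here is a hedged sketch of the coordinator's timestamp choice and commit wait, built on the toy TrueTime object above (my own simplification; the real coordinator does this inside its Paxos leader with more constraints):

```python
import time

def coordinator_commit(tt, prepare_timestamps):
    """Choose a commit timestamp s and honor commit wait before releasing locks.

    `tt` is the toy TrueTime object sketched in section 3; `prepare_timestamps`
    are the values sent by the non-coordinator participant leaders (step 7).
    """
    p_max = max(prepare_timestamps)   # step 8: maximum of the prepare timestamps

    # Step 9: take s = TT.now().latest, retrying until it is >= Pmax, which
    # keeps transaction timestamps monotonically increasing.
    s = tt.now().latest
    while s < p_max:
        s = tt.now().latest

    # Step 10 would log the commit record through Paxos here (omitted).

    # Step 11: commit wait. Do not reveal the result or release any locks until
    # s has definitely passed in absolute time, i.e. until TT.after(s) holds.
    while not tt.after(s):
        time.sleep(0.001)

    return s  # now it is safe to notify the participants and the client
```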
5. Evaluation
The first figure shows that as the number of replicas increases, snapshot-read throughput grows roughly linearly (it is lock-free), write-transaction throughput drops, and latency changes little (acquiring locks is efficient).
The next figure shows the scalability of 2PC: around 50 participants the degradation is not obvious; beyond 100 participants it becomes evident.
Another figure shows that with Google's GPS-based clock synchronization, the clock error E can be kept basically within 10 ms, so write-transaction latency is 10 ms and up.
The last figure plots latency and clock deviation over 24 hours in a real Spanner deployment (spanning datacenters across the US east coast). These results are not particularly optimistic either.