Google Megastore Distributed Storage Technology Revealed (2)
2011-02-16 | Source: CSDN
Megastore supports transactions and concurrency control. A transaction's writes are first recorded in the corresponding entity group's log before the actual data is updated. Bigtable can store multiple versions of a value with different timestamps in the same row/column. Megastore exploits this feature to implement multi-version concurrency control (MVCC; Oracle and InnoDB also implement ACID semantics this way): the values written by a transaction's mutations are stamped with the transaction's timestamp, while read operations use the timestamp of the last fully applied transaction so that they never observe partially applied data. Reads and writes therefore do not block each other, and reads are isolated from in-flight write transactions.
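To make the MVCC idea concrete, here is a minimal sketch of reads and writes over timestamped cells, assuming a toy in-memory store; the class and method names are invented for illustration and are not Megastore's API.

```python
# Toy illustration of MVCC over timestamped cells (not Megastore's API).
class VersionedStore:
    def __init__(self):
        self.cells = {}            # (row, column) -> list of (timestamp, value)
        self.last_applied_ts = 0   # last transaction fully applied to the data

    def write(self, row, column, value, txn_ts):
        """A transaction's mutation: the written value carries the txn timestamp."""
        self.cells.setdefault((row, column), []).append((txn_ts, value))

    def mark_applied(self, txn_ts):
        """Record that a transaction's writes have been fully applied."""
        self.last_applied_ts = max(self.last_applied_ts, txn_ts)

    def read(self, row, column, read_ts=None):
        """Read at the last fully applied timestamp, so newer, possibly
        half-applied writes are never observed."""
        ts = self.last_applied_ts if read_ts is None else read_ts
        visible = [(t, v) for (t, v) in self.cells.get((row, column), []) if t <= ts]
        return max(visible)[1] if visible else None


store = VersionedStore()
store.write("cal/alice", "title", "Standup", txn_ts=10)
store.mark_applied(10)
store.write("cal/alice", "title", "Retro", txn_ts=20)  # in flight, not applied
print(store.read("cal/alice", "title"))                 # -> "Standup"
```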
Megastore provides three read levels: current, snapshot, and inconsistent. Current and snapshot reads are always scoped to a single entity group. For a current read, the transaction system first confirms that all previously committed writes have been applied, and then reads the data at the timestamp of the last successfully committed transaction. For a snapshot read, the system takes the timestamp of the last known fully applied transaction and reads directly at that position; unlike a current read, committed transactions whose updates have not yet been applied may be missed (commit and apply are distinct steps). The third kind of read Megastore provides is the inconsistent read, which ignores the log state entirely and reads the latest values directly. This is useful for operations with aggressive latency requirements that can tolerate stale or partially applied data.
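Building on the toy VersionedStore above, the three read levels could be distinguished roughly as follows; EntityGroup, catch_up, and the field names are assumptions made for this sketch, not Megastore's real interfaces.

```python
# Illustrative sketch of current, snapshot, and inconsistent reads.
class EntityGroup:
    def __init__(self, store):
        self.store = store
        self.last_committed_ts = 0   # highest timestamp accepted into the log
        self.pending = []            # committed log entries not yet applied

    def catch_up(self):
        """Apply any committed-but-unapplied log entries (details elided)."""
        for ts, apply_fn in self.pending:
            apply_fn()
            self.store.mark_applied(ts)
        self.pending.clear()

    def current_read(self, row, column):
        # Confirm all committed writes are applied, then read at the
        # timestamp of the last committed transaction.
        self.catch_up()
        return self.store.read(row, column, self.last_committed_ts)

    def snapshot_read(self, row, column):
        # Read at the last fully applied timestamp; commits that have not
        # been applied yet may be missed.
        return self.store.read(row, column, self.store.last_applied_ts)

    def inconsistent_read(self, row, column):
        # Ignore log state entirely and return the newest value present.
        return self.store.read(row, column, read_ts=float("inf"))
```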
A write transaction always begins with a current read to determine the next available log position. The commit operation gathers the data changes into a log entry, assigns it a timestamp higher than any earlier one, and appends it to the log using Paxos. The protocol uses optimistic concurrency: even if several writers race for the same log position, only one will succeed. All the failed writers observe the winning value, then abort and retry their operations. Advisory locking is available to reduce the effects of contention, and batching writes through session affinity to a particular front-end server can avoid contention altogether.
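The "only one writer wins a log position" behavior can be sketched like this; paxos_choose below is a trivial single-process stand-in for the real consensus round, not Megastore code.

```python
# Sketch of optimistic concurrency on a log position: several writers may
# race for the same slot, but only one value can be chosen there.
def try_commit(position, my_entry, paxos_choose):
    """Attempt to place `my_entry` at `position` in the replicated log."""
    chosen = paxos_choose(position, my_entry)   # consensus value for this slot
    if chosen is my_entry:
        return True                             # our write won the position
    # Another write won this position: observe it, abort, and retry at the
    # next position after re-reading current state.
    return False


# Trivial stand-in for "the first proposer's value gets chosen":
slots = {}
def paxos_choose(position, entry):
    return slots.setdefault(position, entry)

a, b = {"op": "writer A"}, {"op": "writer B"}
print(try_commit(7, a, paxos_choose))   # True: A wins position 7
print(try_commit(7, b, paxos_choose))   # False: B aborts and retries at position 8
```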
The full transaction lifecycle consists of the following steps:
1. Read: obtain the timestamp and log position of the last committed transaction
2. Application logic: read from Bigtable and gather the writes into a log entry
3. Commit: use Paxos to append the log entry to the log
4. Apply: write the mutations into the Bigtable entities and indexes
5. Clean up: delete data that is no longer needed
A write operation can return to the caller at any point after the commit step, though it is preferable to wait until the nearest replica has applied the change before returning.
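Putting the five steps together, a rough sketch of the lifecycle might look like the following; every name here (group, bigtable, build_mutations, paxos_append) is an illustrative placeholder rather than Megastore's real interface.

```python
# Rough sketch of the five-step transaction lifecycle described above.
def run_transaction(group, bigtable, build_mutations, paxos_append):
    while True:
        # 1. Read: timestamp and log position of the last committed transaction.
        last_ts = group.last_committed_ts
        position = group.next_log_position()

        # 2. Application logic: read from Bigtable and gather the writes
        #    into a log entry stamped with a higher timestamp.
        entry = {"timestamp": last_ts + 1,
                 "mutations": build_mutations(bigtable, read_ts=last_ts)}

        # 3. Commit: use Paxos to append the entry at the chosen position.
        if not paxos_append(position, entry):
            continue                      # lost the race for the slot; retry

        # 4. Apply: write the mutations into Bigtable entities and indexes.
        for mutation in entry["mutations"]:
            bigtable.apply(mutation)
        group.mark_applied(entry["timestamp"])

        # 5. Clean up: delete log data that is no longer needed (elided).
        group.cleanup()

        # The write may return any time after step 3; waiting until the
        # nearest replica has applied it (step 4) is preferable.
        return entry["timestamp"]
```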
Megastore provides message queues that deliver transactional messages between different entity groups. They can be used for cross-entity-group operations, to perform many updates in a single transaction, or to defer work. A transaction on one entity group can atomically send or receive multiple messages in addition to updating its own entities. Each message has a sending and a receiving entity group; if the two differ, delivery is asynchronous.
Message queues provide a way to affect many entity groups in one operation. For example, in a calendar application each calendar is a separate entity group, and sending an invitation to many other people's calendars means touching all of them: a single transaction can atomically send invitation messages to many distinct calendars. Each receiving calendar picks up the message in its own transaction, which adds the invitation, updates the invitee's state, and deletes the message. Megastore uses this pattern at large scale: declaring a queue automatically creates an inbox on every entity group.
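A sketch of the calendar-invitation pattern might look like this; the transaction helper (txn) and its methods are invented names used only to show the flow of messages between entity groups.

```python
# Sketch of cross-entity-group messaging in the calendar example.
def send_invitations(sender_calendar, invitee_calendars, event, txn):
    """One transaction on the sender's entity group updates its own entities
    and atomically enqueues a message to each invitee's entity group."""
    txn.update(sender_calendar, event)
    for calendar in invitee_calendars:
        # Delivery is asynchronous whenever the receiving group differs
        # from the sending group.
        txn.enqueue(to_group=calendar, message={"invite": event})


def process_inbox(calendar, txn):
    """Each receiving calendar consumes its inbox in its own transaction."""
    for message in txn.receive(calendar):
        txn.update(calendar, {"invitation": message["invite"], "status": "pending"})
        txn.delete_message(calendar, message)   # the same transaction removes it
```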
Megastore also supports atomic updates across entity groups using two-phase commit, but its use is generally discouraged because these transactions have much higher latency and increase the risk of contention.
The next topic is the most central part of Megastore: its synchronous replication scheme, a low-latency implementation of Paxos. Megastore's replication system provides a single, consistent view of the data; reads and writes can be served from any replica, and ACID semantics are preserved regardless of which replica a client starts from. Replication is done per entity group by synchronously replicating the group's transaction log to a quorum of replicas. Writes typically require one round of inter-datacenter communication, and healthy-case reads run locally. Current reads come with the following guarantees:
1. A read always observes the last acknowledged write. (visibility)
2. After a write has been observed, all future reads observe that write. (durability; a write may be observed before it is acknowledged)
As in other databases that use Paxos, the transaction log is replicated with Paxos, and each position in the log corresponds to its own Paxos instance. A new value is proposed at the position following the last chosen one.
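The "one Paxos instance per log position" structure can be sketched as follows; PaxosInstance here is a trivial single-node stand-in for a real consensus implementation, not actual protocol code.

```python
# Sketch of a replicated log in which every position is its own Paxos instance.
class PaxosInstance:
    """Trivial stand-in: the first proposed value is chosen."""
    def __init__(self, position):
        self.position = position
        self.chosen = None

    def propose(self, value):
        if self.chosen is None:
            self.chosen = value
        return self.chosen

    def is_chosen(self):
        return self.chosen is not None


class ReplicatedLog:
    def __init__(self):
        self.instances = {}   # log position -> PaxosInstance

    def instance(self, position):
        return self.instances.setdefault(position, PaxosInstance(position))

    def append(self, value):
        """Propose `value` at the position after the last chosen one."""
        position = self.next_unchosen_position()
        chosen = self.instance(position).propose(value)
        return position, chosen

    def next_unchosen_position(self):
        pos = 0
        while pos in self.instances and self.instances[pos].is_chosen():
            pos += 1
        return pos
```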
Before walking through its Paxos variant, Megastore sets an early requirement: current reads should be possible at any replica, without RPC interaction with any other replica. Since writes normally succeed on all replicas, it is realistic to allow local reads everywhere. These local reads pay off with low latency in all regions, fine-grained read failover, and a simpler programming experience.
To make this work, the Megastore design introduces a service called the coordinator, with one coordinator in the datacenter of each replica. A coordinator server tracks the set of entity groups for which its replica has observed all Paxos writes. For entity groups in that set, the replica has enough state to serve local reads.
The write algorithm is responsible for keeping coordinator state conservative: if a write fails on some replica, the operation cannot be considered committed until the entity group's key has been evicted from that replica's coordinator.
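A sketch of the coordinator's role, including the conservative eviction rule, might look like this; the class and the replica/coordinator interfaces are assumptions of the example, not Megastore's actual service API.

```python
# Sketch of coordinator state: which entity groups this replica may read locally.
class Coordinator:
    def __init__(self):
        self.up_to_date_groups = set()   # groups for which all writes were observed

    def can_serve_local_read(self, group_key):
        return group_key in self.up_to_date_groups

    def mark_observed_all_writes(self, group_key):
        self.up_to_date_groups.add(group_key)

    def evict(self, group_key):
        self.up_to_date_groups.discard(group_key)


def finish_write(group_key, replicas, coordinators, entry):
    """Conservative rule: any replica that misses the write must have the group
    evicted from its coordinator before the write is considered committed."""
    for replica, coordinator in zip(replicas, coordinators):
        if replica.accept(entry):
            coordinator.mark_observed_all_writes(group_key)
        else:
            coordinator.evict(group_key)   # must happen before acknowledging commit
    return True                            # only now is the write acknowledged
```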
To achieve fast, single-round-trip writes, Megastore adopts an optimization borrowed from master-slave systems: a successful write grants the next write the right to skip the prepare phase for the next log position and go straight to the accept phase. Megastore does not use dedicated masters, however; it uses leaders.
Megastore runs an independent Paxos instance for each log position. The leader for each log position is a distinguished replica chosen alongside the preceding log position's consensus value. The leader arbitrates which value may use proposal number zero: the first writer to submit a value to the leader wins the right to ask all replicas to accept that value as proposal number zero, and all other writers must fall back on two-phase Paxos.
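The fast path enabled by the leader (proposal number zero) and the fallback to two-phase Paxos can be sketched as follows; every class and function here is an illustrative stand-in, not the actual protocol code.

```python
# Sketch of the single-round-trip write: the first writer to reach the leader
# skips the prepare phase and uses proposal number zero; later writers fall
# back to full two-phase Paxos.
class PositionLeader:
    def __init__(self):
        self.zero_granted = False   # has proposal number 0 been handed out?

    def request_proposal_zero(self):
        """Only the first writer to ask gets proposal number zero."""
        if self.zero_granted:
            return False
        self.zero_granted = True
        return True


def prepare_round(replicas):
    """Stand-in for Paxos phase one: pick a proposal number above zero."""
    return 1


def write_at_position(leader, replicas, value):
    if leader.request_proposal_zero():
        # Fast path: skip the prepare phase and use proposal number 0.
        number = 0
    else:
        # Slow path: full two-phase Paxos with a higher proposal number.
        number = prepare_round(replicas)
    accepted = sum(r.accept(proposal_number=number, value=value) for r in replicas)
    return accepted > len(replicas) // 2   # chosen once a majority accepts
```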
Because a writer must interact with the leader before submitting its value to the other replicas, the latency between writer and leader should be minimized. Megastore's rule for choosing the next write's leader is built around the observation that most applications submit writes repeatedly from the same region. This leads to a simple but effective heuristic: use the closest replica, i.e., make the leader the replica nearest to where most writes originate.
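The leader-selection heuristic ("use the closest replica") could be expressed roughly like this; the region names and latency table are made up purely for illustration.

```python
# Sketch: pick as the next position's leader the replica closest to the region
# that has been submitting most of the recent writes.
from collections import Counter

def choose_next_leader(replicas, recent_writer_regions, latency_ms):
    busiest = Counter(recent_writer_regions).most_common(1)[0][0]
    return min(replicas, key=lambda r: latency_ms[(busiest, r)])


replicas = ["us-east", "us-west", "europe"]
recent = ["us-east", "us-east", "europe", "us-east"]
latency = {("us-east", "us-east"): 1, ("us-east", "us-west"): 40,
           ("us-east", "europe"): 90}
print(choose_next_leader(replicas, recent, latency))   # -> "us-east"
```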