One, introduction to the quorum mechanism
Distributed systems are subject to the CAP theorem, and in practice P (partition tolerance) is unavoidable: processing in a distributed system does not happen on one machine, but across many machines communicating over a network, so network partitions and communication failures cannot be ruled out. The only choice left is to seek a balance between C and A. For data storage, availability is increased by keeping replica backups; HDFS, for example, defaults to three replicas per data block. If the machine holding a data block goes down, the block can still be read from a machine holding one of its replicas (note that data is distributed in units of "data blocks").
The problem, however, is that when the data needs to be modified, all of the replicas must be updated to ensure consistency. A trade-off between C (consistency) and A (availability) is therefore required.
The quorum mechanism is exactly such a trade-off mechanism: a model that trades read cost against write cost. Before introducing quorum, consider an extreme case: the WARO mechanism.
WARO (Write All Read One) is a simple replica-control protocol: when a client writes (updates) data to the replicas, the write succeeds only if every replica is updated; otherwise it is considered a failure.
Two points follow: ① the write operation is fragile, because the write is considered a failure as soon as a single replica fails to update; ② the read operation is simple, because a successful write means every replica was updated, so all replicas are consistent and reading any one replica suffices. Assuming there are N replicas, even if N-1 of them are down the remaining replica can still serve reads, but a write fails as soon as a single replica is down.
WARO sacrifices the availability of the update service to maximize the availability of the read service. Quorum, by contrast, is a trade-off between the update service and the read service.
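As a minimal sketch (in Python; class and variable names are illustrative, not from any real system), WARO can be modeled like this: a write must reach every replica, while a read may use any single live one.

```python
# A minimal sketch of WARO: Write All, Read One.
class WaroStore:
    def __init__(self, n):
        self.replicas = [None] * n          # one value slot per replica
        self.up = set(range(n))             # indices of live replicas

    def write(self, value):
        # Write All: every replica must be reachable, or the write fails.
        if len(self.up) < len(self.replicas):
            return False                    # one down replica fails the whole write
        for i in self.up:
            self.replicas[i] = value
        return True

    def read(self):
        # Read One: all replicas are identical, so any live one suffices.
        i = next(iter(self.up))
        return self.replicas[i]

store = WaroStore(3)
store.up.discard(2)                         # one replica goes down...
print(store.write("V2"))                    # ...writes now fail (False)
print(store.read())                         # ...but reads still work
```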
The quorum mechanism is an application of the pigeonhole principle ("drawer principle"). It is defined as follows: assume there are N replicas; an update operation Wi is considered successful only once it has succeeded on at least W replicas, at which point its data counts as "successfully committed". For a read operation to be guaranteed to see the committed data, at least R replicas must be read, where W + R > N, so that the write set and the read set always overlap. Typically W + R = N + 1.
Suppose the system has 5 replicas, with W = 3 and R = 3. The initial data is (V1, V1, V1, V1, V1); the successfully committed version number is 1.
When an update succeeds on 3 replicas, the update is considered successful, and the data becomes (V2, V2, V2, V1, V1); after the successful commit, the version number becomes 2.
Therefore, reading at most 3 replicas is guaranteed to return V2 (the successfully updated data). In the background, the remaining V1 replicas can be synchronized up to V2 without the client needing to know.
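The example above can be sketched in Python as follows (a toy model, not real replication code): replicas hold (version, value) pairs, a write succeeds once it reaches W replicas, and a read samples R replicas and returns the highest version it sees.

```python
import random

N, W, R = 5, 3, 3
replicas = [(1, "V1")] * N                 # initial state, committed version 1

def quorum_write(version, value, reachable):
    # The write counts as successful only once at least W replicas accept it.
    if len(reachable) < W:
        return False
    for i in sorted(reachable)[:W]:
        replicas[i] = (version, value)
    return True

def quorum_read():
    # Read any R replicas; since W + R > N, the read set must overlap the
    # write set, so the highest version seen includes the last commit.
    sample = random.sample(range(N), R)
    return max(replicas[i] for i in sample)

quorum_write(2, "V2", set(range(N)))       # replicas: (V2,V2,V2,V1,V1)
print(quorum_read())                       # always sees at least one V2
```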
Two, quorum mechanism analysis
① The quorum mechanism does not guarantee strong consistency
Strong consistency means that at any time, any user or node can read the most recently committed replica data. It is the strongest consistency requirement and the hardest to achieve in practice.
The reason is that the quorum mechanism alone cannot determine the latest successfully committed version number.
For example, suppose V2 above has been successfully committed (written to W = 3 replicas). Reading 3 replicas is guaranteed to return at least one V2. If the read happens to return (V2, V2, V2), the reader knows this is the latest committed data, because W = 3 and V2 was seen on exactly 3 replicas. But if the read returns (V2, V1, V1), it cannot be determined whether V2 was successfully committed; the reader must keep reading further replicas until it has seen V2 on 3 of them before it can be sure that V2 is the latest committed data.
1) How is the latest data read? --- Given the version number of the latest successful commit, reading at most R replicas is guaranteed to return the latest data.
2) How is it determined that the highest version number seen belongs to successfully committed data? --- Keep reading further replicas until the highest version read has appeared on W replicas. A sketch of this rule follows.
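Rule 2) can be sketched like this (a toy model; it assumes replicas are read one by one and that version numbers only ever increase):

```python
# Read replicas one by one until the highest version seen so far has
# appeared on W replicas, which proves that version was committed.
def latest_committed(replicas, w, r):
    seen = []
    for rep in replicas:
        seen.append(rep)
        if len(seen) < r:
            continue                       # always read at least R replicas
        top = max(version for version, _ in seen)
        if sum(1 for version, _ in seen if version == top) >= w:
            return top                     # highest version seen W times
    return None                            # undetermined: the highest version
                                           # never reached W copies

# (V2,V1,V1) after three reads is inconclusive; reading on finds more V2s.
print(latest_committed([(2, "V2"), (1, "V1"), (1, "V1"), (2, "V2"), (2, "V2")],
                       w=3, r=3))          # -> 2
```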
② Primary election based on the quorum mechanism
The central node (server) reads R replicas and selects the replica with the highest version number among them as the new primary.
The newly elected primary cannot serve requests immediately; it must first synchronize its data to at least W replicas, so that the quorum rule W + R > N continues to hold.
As for conflicting data encountered during synchronization, it is resolved in favor of availability: the highest version among the R replicas read becomes the baseline.
For example, with (V2, V2, V1, V1, V1) and R = 3: if the 3 replicas read are (V1, V1, V1), the higher version V2 must be discarded (it cannot be confirmed as committed);
if the 3 replicas read are (V2, V1, V1), the lower-version V1 replicas must be synchronized up to V2.
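A sketch of this election-and-reconciliation step (illustrative only; a real system would typically also re-commit the baseline under a fresh, higher version number):

```python
# Elect a primary from R replicas, then sync its data to at least W replicas.
def elect_and_sync(replicas, r, w):
    read_set = replicas[:r]                # the R replicas actually read
    version, value = max(read_set)         # highest version in the read set wins
    # Conflicting data is resolved against this baseline: replicas with a
    # lower version are synced up; any higher, unconfirmed version outside
    # the read set is simply overwritten (discarded).
    for i in range(w):
        replicas[i] = (version, value)     # W replicas now agree with the primary
    return version, value

reps = [(2, "V2"), (1, "V1"), (1, "V1"), (1, "V1"), (2, "V2")]
print(elect_and_sync(reps, r=3, w=3))      # read set (V2,V1,V1) -> baseline V2
```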
Three, quorum mechanism application example
HDFS high availability implementation
The operation of HDFS depends on the NameNode: if the NameNode goes down, the whole of HDFS becomes unusable, i.e. a single point of failure; likewise, if the NameNode has to be stopped for an upgrade or maintenance, the whole of HDFS is out of service. To solve this problem, the QJM (Quorum Journal Manager) mechanism is used to implement HA (high availability) for HDFS. Note that HA was initially implemented with a "shared storage" mechanism; for the shortcomings of shared storage, consult the QJM design document in the references (which also mentions the advantages of QJM).
To achieve HA, two NameNode machines are required: one is the active NameNode, which handles client requests; the other is the standby NameNode, which keeps its state synchronized with the active NameNode, "maintaining enough state to provide a fast failover if necessary" (as the official documentation puts it).
So here is the question: how does the standby NameNode synchronize the data on the active NameNode, and what data is actually synchronized?
The quorum mechanism is used for the synchronization, and the data synchronized is mainly the editlog.
Data synchronization goes through a third-party "cluster": the JournalNodes. Both the active NameNode and the standby NameNode communicate with the JournalNodes to synchronize.
Each time the NameNode writes the editlog, in addition to writing it to the local disk it also sends a write request, in parallel, to every JournalNode in the JournalNode cluster; as long as a majority of the JournalNodes succeed, the editlog write to the JournalNode cluster is considered successful. If there are 2N+1 JournalNodes then, by the majority principle, up to N JournalNode failures can be tolerated.
This is exactly the quorum mechanism: each write is considered successful once the number of JournalNodes written reaches a majority (W).
In this way, every time the metadata on the active NameNode is modified, the modification is considered successful only after it has been written to a majority of the machines in the JournalNode cluster.
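The majority-write rule can be sketched like this (a toy model; real HDFS issues these RPCs asynchronously and in parallel, and the function names here are invented for illustration):

```python
# Send one editlog entry to every journal node; the write counts as
# successful once a majority (the quorum W) has acknowledged it.
def write_editlog(journal_nodes, entry):
    acks = 0
    for send in journal_nodes:             # each `send` models one RPC
        try:
            send(entry)
            acks += 1
        except IOError:
            pass                           # a down JournalNode loses its vote
    majority = len(journal_nodes) // 2 + 1
    return acks >= majority

def ok(entry):
    pass                                   # healthy node: accepts the entry

def down(entry):
    raise IOError("journal node unreachable")

# 2n+1 = 5 journal nodes tolerate n = 2 failures:
print(write_editlog([ok, ok, ok, down, down], "edit#1"))   # -> True
```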
When the active NameNode goes down, the standby NameNode synchronizes the editlog from the JournalNodes, ensuring HA.
The active NameNode submits the editlog to the JournalNode cluster, but the standby NameNode only pulls the editlog from the JournalNode cluster at timed intervals, so the standby NameNode's in-memory file-system image is likely to lag behind the active NameNode. When the standby NameNode transitions to active, it must therefore first catch up on the editlog it is missing.
The specific synchronization process can be consulted in: Hadoop NameNode High Availability Implementation Analysis.
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
In other words, to achieve fast failover, the standby NameNode also needs to hear from every DataNode in real time so that it holds up-to-date address information for every data block. Why is this necessary?
Because the address information of each block is not "meta information" and is not saved in the fsimage, checkpoints, and so on; block address information simply changes too frequently. For example, when a DataNode goes offline, all the block address information it held becomes invalid, and to maintain the configured replication factor the affected blocks must be re-replicated onto other machines.
And fast failover means that when the active NameNode goes down, the standby NameNode can provide service immediately. DataNodes therefore also need to send block reports to the standby NameNode in real time.
In addition, there are both manual and automatic failover; automatic failover requires ZooKeeper support. For details see the official documentation: HDFS High Availability Using the Quorum Journal Manager.
How is the "split brain" problem avoided?
Split brain means that at some moment two NameNodes both believe they are in the active state.
When a NameNode sends any message (or remote procedure call) to a JournalNode, it includes its epoch number as part of the request. Whenever the JournalNode receives such a message, it compares the epoch number against a locally stored value called the promised epoch. If the request is coming from a newer epoch, then it records this new epoch as its promised epoch. If instead the request is coming from an older epoch, then it rejects the request. This simple policy avoids split brain.
This can be understood simply as follows: every NameNode communicating with the JournalNodes must carry an epoch number (epoch numbers are unique and only ever increase), and each JournalNode stores a local promised epoch. The NameNode with the larger epoch number causes the JournalNodes to raise their promised epoch and thereby wins the majority, while the NameNode with the smaller epoch number is rejected by the majority.
The NameNode with the larger epoch number is the real active NameNode and holds the permission to write to the JournalNodes. Note: only one NameNode is allowed to have JournalNode write permission at any time.
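The fencing rule can be sketched as follows (a simplified model of the behavior quoted above, not the actual HDFS code):

```python
# Each JournalNode keeps a promised epoch and rejects any request that
# carries an older epoch, so only the NameNode holding the newest epoch
# can continue writing.
class JournalNode:
    def __init__(self):
        self.promised_epoch = 0
        self.log = []

    def request(self, epoch, entry):
        if epoch < self.promised_epoch:
            raise IOError("fenced: stale epoch %d" % epoch)
        self.promised_epoch = epoch        # a newer epoch raises the promise
        self.log.append(entry)

jn = JournalNode()
jn.request(epoch=2, entry="edit from the new active NameNode")   # accepted
try:
    jn.request(epoch=1, entry="edit from the old active NameNode")
except IOError as err:
    print(err)                             # the stale NameNode is rejected
```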
When using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario.
The specific implementation can be consulted in the QJM design document listed in the references (which also mentions the advantages of QJM).
Four, references
Wikipedia: Quorum (distributed computing)
QJM design document: https://issues.apache.org/jira/secure/attachment/12547598/qjournal-design.pdf
Hadoop 2.6.0 study notes (ix): the quorum mechanism as an SPOF solution
HDFS HA and QJM [official website]
Quorum mechanism in distributed-system theory