How hard is translation, really? Errata for a CSDN translation


Today fenng posted on his microblog a link to CSDN's Chinese translation, "Replay and Reflection on the Worst Downtime Incident in GitHub's History".


http://weibo.com/1577826897/zdd2J1oh8


Trusting the fenng brand, I clicked through to take a look.


The opening paragraph of the CSDN translation reads as follows ......

The open-source China community translated part of this blog post, but unfortunately left out a lot of important content. CSDN has compiled and translated the remaining parts to share with everyone.



So this Chinese translation has supposedly been translated and reviewed at least twice ...... I spent two and a half hours comparing it against the original, ignored every typo and every awkward sentence, and recorded below only the most basic and serious semantic errors. Except for items 1 and 2, each entry is a case where the translation misstates the content or says the opposite of the original.

CSDN translation: http://www.csdn.net/article/2013-01-05/2813427-Github-Downtime-last-Saturday

Original post: https://github.com/blog/1364-downtime-last-saturday

In each erratum below, the first block is the original English, the second is the CSDN translation, and the third is my translation.

1.
Before, ......
Before that, ......

2.
(N-Service
In-service

3.
...
Our traffic was low enough at the time that it didn't pose any real problems.

The real problem was not solved.

It didn't cause any real problems.

4.

...
Revert the software update and return to a redundant state at 1300 PST if we did not have a plan for resolving the issues...

...and rolled back to the state as of 13:00 Pacific time.
If there was no plan for resolving the issues, start rolling back the upgrade at 13:00 PST and return to the fully redundant state.

5.

...
When the agent on one of the switches is terminated, the peer has a 5 second timeout period where it waits to hear from it again. If it does not hear from the peer, but still sees active links between them, it assumes that the other switch is still running but in an inconsistent state. In this situation it is not able to safely takeover the shared resources so it...

When the agent deployed on a switch is terminated, a node has a 5-second delay while waiting for the second response. The nodes cannot respond to each other, but the links between them are still connected. We can expect that the other switch is running in a similar state, but they are already in a state where messages cannot be synchronized. In this case, the switch cannot safely take over the shared resources, so it ......
After the agent on one of the switches is terminated (the switches are deployed in pairs), the other node waits five seconds to see whether the first one comes back. If it receives no response from the first node but the link between the two is still active, it assumes that the peer is still running but no longer in a synchronized state. In this case it cannot safely take over the resources it manages jointly with the other switch, so it ......
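Neither rendering above is easy to follow, so here is a minimal sketch in Python of the decision the peer switch is described as making. It is my own illustration, not the vendor's firmware; the constant name and the returned strings are all made up.

    # Minimal sketch of the peer switch's takeover decision described above.
    # My own illustration, not the actual switch firmware.
    HEARTBEAT_TIMEOUT = 5.0  # seconds the peer waits to hear from the other switch

    def takeover_decision(seconds_since_last_heartbeat, link_is_active):
        """Decide how the surviving switch should treat its silent peer."""
        if seconds_since_last_heartbeat < HEARTBEAT_TIMEOUT:
            return "peer healthy: no action"
        if not link_is_active:
            # Silent peer and a dead link: the peer is assumed to be gone,
            # so the shared resources can be taken over safely.
            return "take over shared resources (graceful failover)"
        # Silent peer but the link is still up: assume the peer is still
        # running in an inconsistent state, so a safe takeover is impossible
        # and the more disruptive failover method is used.
        return "cannot take over safely: use the more disruptive failover"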

6.

When the agent was terminated on the first switch, the links between peers did not go down since the agent is unable to instruct the hardware to reset the links. They do not reset until the agent restarts and is again able to issue commands to the underlying switching hardware. With unlucky timing and the extra time that is required for the agent to record its running state for analysis, the link remained active long enough for the peer switch to detect a lack of heartbeat messages while still seeing an active link and failover using the more disruptive method.

When the agent on the first switch is terminated, the links between the nodes are not brought down, but the agent cannot instruct the hardware to reset the links until the agent restarts and then issues commands to the related switching hardware.
(Note: half of the paragraph is lost here.) When the agent on the first switch was terminated, the links between the two switches were not interrupted, because the agent could not drive the hardware to bring them down; commands can only be sent to the underlying hardware after the agent restarts. With unlucky timing, plus the extra time the agent needed to record its running state for later analysis, the links stayed active long enough for the peer switch to detect the missing heartbeat messages while still seeing an active link, so the failover that followed used the more disruptive method.
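Feeding the unlucky timing described here into the toy takeover_decision function sketched under item 5 gives the disruptive outcome (the 12-second figure is only a placeholder; the original post does not say how long the agent actually took):

    # Replaying the scenario from this paragraph with the illustrative
    # takeover_decision function from item 5. The agent cannot reset the
    # links, so link_is_active stays True while it dumps state and restarts;
    # 12.0 is a placeholder duration, not a figure from the original post.
    print(takeover_decision(seconds_since_last_heartbeat=12.0, link_is_active=True))
    # -> cannot take over safely: use the more disruptive failover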

7.

When this happened it caused a great deal of churn within the network as all of our aggregated links had to be re-established, leader election for spanning-tree had to take place, and all of the links in the network had to go through a spanning-tree reconvergence. This effectively caused...

When this happens, it causes huge traffic loss and all of our links need to be re-established. The leader chooses to use the spanning-tree protocol (the spanning-tree network protocol), and all the links in the network are restored through the spanning-tree protocol.
This process caused huge churn in the network, because all of the aggregated links had to be re-established, the leader election required by the spanning-tree protocol had to take place, and every link in the network had to go through spanning-tree reconvergence. All of this directly caused ......

8.

We want to be certain that we don't wind up in a "split-brain" situation where data is written to both nodes simultaneously since this could result in potentially unrecoverable data corruption.

Note that we want to make sure we do not end up in a "split" state, for example when data is written to both nodes and the data transfer is interrupted.
We want to make sure we do not end up in a "split-brain" situation (that is, data being written to both nodes at the same time), because this could lead to potentially unrecoverable data corruption.

(Note: the error in the first half is not that serious, apart from turning the restrictive "where ..." clause into a "for example". As for the second half ...... is that machine translation?)
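For readers who only saw the two translations above, a toy example may make the point concrete. This is purely my own sketch of why simultaneous writes to both nodes cannot be reconciled; it has nothing to do with GitHub's actual storage stack.

    # Toy illustration of why a split-brain write pattern is unrecoverable:
    # once both nodes accept writes for the same resource, neither copy is a
    # superset of the other, so there is no safe automatic merge.
    node_a = {"README": "v1"}
    node_b = {"README": "v1"}

    # Split brain: both nodes believe they are the active one and accept writes.
    node_a["README"] = "v2 written via node A"
    node_b["README"] = "v2 written via node B"

    assert node_a != node_b  # the replicas have diverged
    # Neither node can tell which value is "correct", which is exactly the
    # potentially unrecoverable corruption the original sentence warns about.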

9.

When the network recovered and the cluster messaging between nodes came back, a number of pairs were in a state where both nodes expected to be active for the same resource. This resulted in a race where the nodes terminated one another and we wound up with both nodes stopped for a number of our fileserver pairs.

When the network recovered, the cluster information distributed between the nodes was sent back. Many peer nodes scrambled for the same resources, which produced fierce competition. We shut these nodes down.
When the network recovered and the messages between nodes were flowing again, many server pairs were in this state: both nodes believed they should take over the same shared resource. This race caused the two nodes to stop each other (note: via the STONITH mechanism mentioned earlier), and a number of our fileserver pairs ended up with both nodes stopped.
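The race is easy to lose in both translations, so here is a purely illustrative sketch of how both nodes of a pair can end up stopped. It is not code from any real cluster manager; the class and field names are invented.

    # Sketch of the race in item 9: both nodes of a pair believe they should
    # be active for the same resource, each issues a STONITH request against
    # its peer, and the pair ends up fully stopped.
    class Node:
        def __init__(self, name):
            self.name = name
            self.running = True
            self.wants_to_be_active = True   # both sides reach this conclusion

    def stonith(target):
        target.running = False               # terminate / power off the peer

    a, b = Node("fs1a"), Node("fs1b")
    if a.wants_to_be_active:
        stonith(b)
    if b.wants_to_be_active:                 # b's request was already in flight
        stonith(a)

    assert not a.running and not b.running   # both nodes of the pair are stopped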

10.

We monitored the network for roughly thirty minutes to ensure that it was stable before beginning recovery.

We closely monitored the network for 30 minutes to make sure the recovery was stable.
(Before beginning the recovery,) we monitored the network closely for roughly thirty minutes ......

(Note: it is worrying that a step taken before the recovery was turned into one that happens after it has started.)

11.

When both nodes are stopped in this way it's important that the node that was active before the failure is active again when brought back online, since it has the most up to date view of what the current state of the filesystem should be.

When both nodes have stopped because of the failure described above and are brought back online, it is especially important to activate these active nodes again, because they affect the current state of the file system.
When a pair of nodes that were both stopped in this way is brought back online, it is particularly important that the node that was active before the failure becomes active again, because it holds the most up-to-date view of what the current state of the filesystem should be.
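To spell out the rule that both translations stumbled over, here is a small sketch of how the node to reactivate would be chosen. The record format and names are hypothetical, not GitHub's tooling.

    # Sketch of the recovery rule in item 11: when both nodes of a pair were
    # stopped, the node that was active before the failure must become active
    # again, because it holds the most up-to-date view of the filesystem.
    pair_state = {
        "fs1a": {"active_before_failure": True,  "data_version": 1042},
        "fs1b": {"active_before_failure": False, "data_version": 1038},
    }

    def choose_node_to_activate(pair):
        # Prefer the previously active node; its data is the current one.
        for name, info in pair.items():
            if info["active_before_failure"]:
                return name
        raise RuntimeError("no record of the previously active node; recover manually")

    assert choose_node_to_activate(pair_state) == "fs1a"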

12.

This recovery was a very time consuming process and we made the decision to leave the site in maintenance mode until we had recovered every fileserver pair.

This recovery is a very time-consuming process. We decided to leave the site in maintenance mode until every file server pair was finally restored.
......, we decided to keep (github.com running in) maintenance mode until every fileserver pair had been recovered.

13.

...and we returned the site to service at 20:23 PST.

...... we returned to the site at 20:23 Pacific time.
...... we brought the site back into service at 20:23 PST.

14.

Our vendor plans to revisit the respective timeouts so that more time is given for link failure to be detected to guard against this type of event.

Our network vendor will review the individual latencies in order to fix link failures and prevent such incidents from happening again.
Our network vendor will revisit the respective timeouts (i.e. by increasing the timeout settings) so that more time is given to detect and confirm a link failure, to guard against this type of event.

15.

We are postponing any software upgrades to the aggregation network until we have a functional duplicate of our production environment in staging to test against.

We have postponed all software upgrades to the aggregation network until we have successfully tested the functional replication mode in the production environment.
...... until we have built, in the staging environment, a duplicate that is functionally identical to our production environment to test against.

16.

The fact that the cluster communication between fileserver nodes relies on any network infrastructure has been a known problem for some time. We're actively working with our hosting provider to address this.

The failure of the network infrastructure that the fileserver nodes depend on affects cluster communication. We are actively coordinating with our hosting provider to solve this problem.
That cluster communication between fileserver nodes depends on the (rest of the) network infrastructure has been a known problem for some time. We are actively working with our hosting provider on a solution.

17.

We are reviewing all of our high availability deployments with fresh eyes to make sure that the failover behavior is appropriate.

We are re-evaluating our high-availability configuration environment and implementing failover in a completely new way.
We are bringing fresh eyes to re-evaluate all of our high-availability deployments, to make sure the failover behavior is appropriate.


That is all.

