Chapter 9 Fault Tolerance
Because today's clusters are large, complex, and commonly built from low-cost hardware, the probability of errors during operation is far higher than on a single, stable minicomputer server, and yet a running cluster can almost never be allowed to stop. This demands error management far more elaborate than in a single-machine environment. In fact, a considerable part of our effort at every stage of product design, development, and operation goes into detecting the various possible faults and handling the errors they produce. Our handling follows one overall idea: the software first senses, finds, and locates the point of failure and then makes a judgment; if the software can resolve the failure, it does so through its self-repair mechanism, otherwise the error is submitted to the cluster administrator for manual processing. With these two baselines, the error handling for each kind of fault is built into the corresponding module, so that fault-tolerant management covers the entire cluster. Below, we explain the major faults a Laxcus cluster faces and how they are dealt with, from the two aspects of hardware and software.
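To make the two baselines concrete, here is a minimal Java sketch of that decision, assuming hypothetical names such as Fault, selfRepair and alertAdministrator that are not part of the actual Laxcus code:

// Minimal sketch of the two-tier fault handling baseline described above.
// All class and method names (Fault, selfRepair, alertAdministrator) are
// illustrative assumptions, not the actual Laxcus API.
public class FaultHandler {

    /** A detected fault: where it happened and whether software can fix it. */
    public static class Fault {
        final String location;      // failing node or component
        final boolean repairable;   // can the software self-repair?

        Fault(String location, boolean repairable) {
            this.location = location;
            this.repairable = repairable;
        }
    }

    /** First try software self-repair; otherwise hand the fault to the administrator. */
    public void handle(Fault fault) {
        if (fault.repairable && selfRepair(fault)) {
            return; // repaired silently by the software layer
        }
        alertAdministrator(fault); // escalate for manual processing
    }

    private boolean selfRepair(Fault fault) {
        // e.g. regenerate a data block from a healthy replica (see section 9.3)
        return false; // placeholder: the outcome depends on the fault type
    }

    private void alertAdministrator(Fault fault) {
        System.err.println("ALERT: manual intervention required at " + fault.location);
    }
}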
9.1 Hardware fault tolerance
In Laxcus cluster error management, problems caused by the running environment or by the hardware itself, which the software cannot fix on its own, are called hardware errors. Handling hardware errors is the job of the cluster administrator; the software's role is to discover them and raise an alarm. Based on our past experience, this section describes some frequently occurring hardware failures and the ways the software senses them.
9.1.1 Network failure
Current network failures come from the following hardware components: switches, routers, hubs, network cables, connectors, and network cards. Some of these faults can be repaired by hand, such as a loose connector or a poorly seated network card; the rest are hardware damage and require the device to be taken offline and replaced. Discovering such failures in software is also simple, mainly through network handshakes: ICMP probing is built into the software, nodes are traced during operation, and when something suspicious is found, comparing results inside and outside the network segment quickly judges and locates the point of failure. This kind of checking is usually performed by the management node; if other kinds of nodes find problems or failures during operation, they actively submit them to the management node for further checking.
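As an illustration of this kind of probing, the following Java sketch uses the standard InetAddress.isReachable call (which may fall back to a TCP probe when raw ICMP is not permitted); the node list, timeout and alarm output are assumptions made for the example, not the Laxcus implementation:

import java.net.InetAddress;

// Illustrative reachability probe such as a management node might run against
// its registered nodes.
public class NetworkProbe {

    /** Ping each node; return the addresses that failed to answer in time. */
    public static java.util.List<String> findUnreachable(java.util.List<String> nodes,
                                                         int timeoutMillis) {
        java.util.List<String> suspects = new java.util.ArrayList<>();
        for (String host : nodes) {
            try {
                if (!InetAddress.getByName(host).isReachable(timeoutMillis)) {
                    suspects.add(host); // no answer: wiring, NIC, switch or host fault
                }
            } catch (java.io.IOException e) {
                suspects.add(host);     // resolution or I/O error also counts as suspect
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Comparing results inside and outside a network segment helps decide
        // whether the fault is a single host or the switch/router between them.
        System.out.println(findUnreachable(java.util.List.of("192.168.1.10", "192.168.1.11"), 3000));
    }
}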
9.1.2 Computer failure
Computer failure is caused by the failure of a computer's internal components, including the motherboard, chips, memory, network card, hard disk, and power supply. In our experience, most computer failures are caused by motherboards and hard drives, with motherboards taking up a significant portion. Computer failures are checked and discovered through both internal and external means: internally, the node's self-check mechanism examines the components; externally, the node is traced by the management node. Whichever side detects the fault first, it is reported to the administrator immediately once the software has sensed it.
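The external half of this check can be imagined as a simple heartbeat table kept by the management node, as in the sketch below; the class name, timeout handling and alert output are assumptions for illustration only:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal heartbeat tracker of the kind a management node could use for the
// external tracing described above.
public class HeartbeatTracker {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever a node reports in (its internal self-check passed). */
    public void beat(String nodeAddress) {
        lastSeen.put(nodeAddress, System.currentTimeMillis());
    }

    /** Periodic sweep: any node silent past the timeout is reported to the administrator. */
    public void sweep() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (now - e.getValue() > timeoutMillis) {
                System.err.println("Node " + e.getKey() + " missed its heartbeat - possible hardware failure");
            }
        }
    }
}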
9.1.3 Hard drive failure
There are three cases of hard disk failure: the hard disk is completely damaged (usually due to boot area corruption), a sector is damaged, or the disk space is full. The first case usually shows up when the computer starts, and the best way to catch it is for the administrator to watch the computer's behaviour as it boots. The second kind of fault usually arises while data is being read or written; it is detected immediately by real-time fault perception and reported to the administrator. The third case is an overflow caused by writing too much data, not a problem of the hard disk itself; it is captured immediately by the node, which then notifies the administrator.
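For the third case, a node can watch its own free space before writes overflow. The following sketch uses the standard Java File space queries; the data directory path and the 95% threshold are illustrative assumptions:

import java.io.File;

// Sketch of detecting a nearly full disk before writes overflow.
public class DiskSpaceCheck {

    /** Returns true when used space on the volume exceeds the given ratio. */
    public static boolean nearlyFull(File volume, double threshold) {
        long total = volume.getTotalSpace();
        long usable = volume.getUsableSpace();
        if (total == 0) {
            return true; // cannot read the volume at all: treat as a fault
        }
        double used = 1.0 - (double) usable / total;
        return used >= threshold;
    }

    public static void main(String[] args) {
        File dataDir = new File("/laxcus/data"); // hypothetical data block directory
        if (nearlyFull(dataDir, 0.95)) {
            System.err.println("Disk almost full - notify the administrator before writes fail");
        }
    }
}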
9.2 Node fault tolerance
The "node" mentioned in this node contains the two concepts of "process" of software and "computer" of hardware. This is slightly different from what you have mentioned before, so please pay attention. In earlier versions, node failures were more caused by software failures, such as improper handling of node operation management mechanisms, API interface coordination between modules, and convergence of errors. These issues are closely related to detailed design and programming, and as the version evolves, more and more cases are now being caused by hardware problems. In the Laxcus cluster, because the front node belongs to the user, and the function is simple, the essence is only a display terminal for the input and output, so this section ignores it, and will mainly introduce the node fault tolerance under the management of the Cluster Administrator.
9.2.1 Management node fault tolerance
As mentioned earlier, whether in the primary domain cluster or a subdomain cluster, there can be only one master management node responsible for managing the cluster. It is unique within its own cluster and is the key to keeping the whole cluster running normally. At the same time, to prevent a master node failure from throwing cluster management into confusion, there are usually one to several monitor management nodes serving as backups; they watch the master node as it runs.
Our test environment had 1 master node and 2 monitor nodes. To check the fault tolerance of the management node, we ran the following experiment. We used the Linux kill command to kill the master node process. Within the first 5 seconds, one of the monitor nodes sensed that the master node had failed and immediately started the failure negotiation mechanism, asking the other monitor node for its judgment of the master node; the two sides soon jointly confirmed that the master node had failed. They then sorted themselves by network address and selected the monitor node with the largest address to become the new master node. The new master node immediately switched itself from monitor state to master state and notified all subordinate nodes (including the other monitor node) to re-register under it. It also notified the watch node of the failed master node and of the switchover. The entire fault-tolerance process completed within 20 seconds.
Figure 9.2.1 Management node fault tolerant processing process
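The outcome of that negotiation can be condensed into the sketch below: once the monitors agree the master has failed, the one whose network address sorts largest takes over and tells everyone else to re-register. The plain string comparison and the notification hooks are simplifying assumptions, not the actual Laxcus protocol, which also carries the confirmation exchange between the monitors:

import java.util.Collections;
import java.util.List;

// Condensed sketch of the master failover election described above.
public class MasterElection {

    /** Pick the new master: the monitor whose address sorts largest. */
    public static String elect(List<String> monitorAddresses) {
        return Collections.max(monitorAddresses);
    }

    public static void failover(String localAddress, List<String> monitors,
                                List<String> subordinates, String watchNode) {
        String newMaster = elect(monitors);
        if (newMaster.equals(localAddress)) {
            // This monitor switches from monitor state to master state ...
            System.out.println("Switching to master state: " + localAddress);
            // ... then tells every subordinate node (including the other monitor)
            // to re-register under the new master, and informs the watch node.
            for (String node : subordinates) {
                System.out.println("Notify " + node + ": re-register at " + newMaster);
            }
            System.out.println("Notify watch node " + watchNode + ": master switched to " + newMaster);
        }
    }
}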
9.2.2 Data node fault tolerance
The data nodes hold all of the data in the cluster, and their importance is second only to the management node, so fault-tolerant management of the data nodes also differs considerably from the other nodes. As described above, every node outage is captured promptly by the management node. After a data node goes down, the management node raises an alarm and then runs the following process: it takes the failed data node's identity, walks through its data block numbers, finds the homologous backups, and generates new backups on other nodes. If the failed node was a data master node, backups held on slave nodes are restored to a new master node, and those blocks are upgraded from slave blocks to primary blocks. If it was a data slave node, backups are generated from the primary node and distributed to the other slave nodes. Because a data master node generally holds a large amount of data (up to several terabytes), and the process must be kept from taking up too much network bandwidth, recovery is actually slow. During this time the administrator has enough time to check and recover the failed computer. When that computer restarts, it checks for primary block conflicts on the network to avoid duplicate homologous data blocks. In particular, after a primary node failure and before recovery completes, the data may be incomplete; during this period the call node rejects update/delete operations until all data has been fully recovered.
Figure 9.2.2 Data node fault tolerance
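The recovery loop described above can be sketched as follows; the BlockCatalog and BlockTransfer interfaces are hypothetical placeholders standing in for the management node's real block metadata and transfer services:

import java.util.List;

// Simplified sketch of rebuilding redundancy after a data node failure.
public class DataNodeRecovery {

    interface BlockCatalog {
        List<Long> blocksOn(String failedNode);          // block numbers the dead node held
        List<String> nodesHolding(long blockId);         // surviving nodes with the same block
        boolean wasPrimary(String failedNode, long blockId);
    }

    interface BlockTransfer {
        void copy(long blockId, String fromNode, String toNode);
        void promoteToPrimary(long blockId, String onNode);
    }

    /** Rebuild redundancy for every block the failed node used to hold. */
    public static void recover(String failedNode, BlockCatalog catalog,
                               BlockTransfer transfer, String spareNode) {
        for (long blockId : catalog.blocksOn(failedNode)) {
            List<String> holders = catalog.nodesHolding(blockId);
            if (holders.isEmpty()) {
                System.err.println("Block " + blockId + " lost - administrator must restore it manually");
                continue;
            }
            String source = holders.get(0);
            // Copy a healthy replica to another node to restore the replica count.
            transfer.copy(blockId, source, spareNode);
            // If the dead node held the primary copy, promote a surviving slave copy.
            if (catalog.wasPrimary(failedNode, blockId)) {
                transfer.promoteToPrimary(blockId, source);
            }
        }
    }
}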
9.2.3 Fault tolerance of other nodes in the cluster
Compared with the two kinds above, the other nodes in the cluster neither store data nor take on cluster management work; they only serve the distributed processing flow. There is usually more than one node of each type, and nodes of the same type can substitute for one another, so the exit or joining of any single one of them has only a small impact on the operation of the whole cluster. In the Laxcus fault-tolerant design, therefore, the fault-tolerant management of these nodes is much looser: their failures are usually reported to the watch node by an alarm and are judged and handled by the administrator, while the work they leave behind is taken over by other nodes of the same type at the management node's notification.
9.3 Data fault tolerance
In our statistics across multiple groups of Laxcus cluster failures, failures caused by nodes are not frequent; the bulk of the errors are data errors. The main cause of data errors is bad areas on the disk, which are usually caught when the computer reads or writes disk data. Data errors are handled by redundant replication, implemented as a self-healing mechanism on the data node that owns the faulty block. The process is as follows: the data node that owns the faulty block searches the network for a data block with the same number on other data nodes, downloads the correct block, and replaces the faulty one. While the block is being copied, all data blocks under that table are locked until the update completes. The faulty block is flagged by the data node and taken out of use. Finally, the data node sends the block number and its own network address to the watch node, notifying the administrator that a data error has occurred.
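Seen from the data node that owns the faulty block, the self-healing path roughly follows the sketch below; the Peer download hook, the on-disk replacement and the single table lock are simplifications introduced for the example:

import java.util.concurrent.locks.ReentrantLock;

// Sketch of the block self-healing path described above.
public class BlockSelfHealing {

    private final ReentrantLock tableLock = new ReentrantLock();

    interface Peer {
        byte[] download(long blockId);   // fetch the same-numbered block from another data node
    }

    public void heal(long badBlockId, Peer peer, String watchNodeAddress) {
        tableLock.lock();                // lock the table's blocks during the copy
        try {
            byte[] goodCopy = peer.download(badBlockId);
            replaceOnDisk(badBlockId, goodCopy);  // overwrite the damaged block
        } finally {
            tableLock.unlock();
        }
        // Flag the old block as unusable and tell the administrator via the watch node.
        System.err.println("Report to " + watchNodeAddress + ": block " + badBlockId + " was repaired after a disk error");
    }

    private void replaceOnDisk(long blockId, byte[] data) {
        // Placeholder for the actual file rewrite on the local disk.
    }
}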
9.4 Distributed Task Component fault tolerance
At present, distributed task component failures are essentially the user's responsibility: they are the result of mistakes a programmer made while coding. Such errors are hard to find during programming and in the cluster test environment, and only surface in the formal operating environment. The responsibility for supervising distributed task component errors lies with the sandbox; this is its other primary responsibility besides security management. To prevent fault propagation, the sandbox confines a distributed task component's errors to the component's own space, so they do not affect the node or other distributed task components. When a fault occurs, the sandbox immediately unloads the distributed task component, sends the error code and error stack back to the source, and submits the information to the administrator, who then follows up with the user.
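The containment behaviour can be pictured as a guarded call around the user component, as in the sketch below; the Task interface, the unload step and the reporting strings are illustrative assumptions rather than the real sandbox API:

// Sketch of running a user component so its failure stays inside the sandbox.
public class TaskSandbox {

    interface Task {
        void execute() throws Exception;
    }

    /** Run a user component; on failure, unload it and report instead of crashing the node. */
    public void run(Task component, String sourceAddress, String watchNodeAddress) {
        try {
            component.execute();
        } catch (Throwable fault) {           // catch everything so the node itself survives
            unload(component);                // remove the faulty component immediately
            String stack = java.util.Arrays.toString(fault.getStackTrace());
            System.err.println("Return to " + sourceAddress + ": error code and stack " + stack);
            System.err.println("Report to " + watchNodeAddress + ": user component failed, administrator should contact the user");
        }
    }

    private void unload(Task component) {
        // Placeholder for dropping the component's class loader and resources.
    }
}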
9.5 Administrator Responsibility
Although the cluster provides fault-awareness and implements some self-recovery from errors, there are still various follow-up management tasks that must be resolved by the administrator. To accomplish these tasks, the administrator needs a certain degree of professional knowledge and professional responsibility.
For many failures caused by software problems, the cause can now basically be traced through logs and breakpoint analysis; hardware failures rely more on maintenance experience and expertise, which take time on the job to accumulate and cost a great deal of time and study. In fact, according to our experience operating and managing clusters, networks and clusters will keep growing as the storage and computing demanded by the future big data market increase. Doing cluster management well is not an easy task: the cluster administrator should understand the cluster and the performance parameters of the various nodes, the scope of their processing, and the characteristics and causes of faults; be able to solve problems quickly after discovering them; and communicate and coordinate with many kinds of people both online and offline. As a cluster administrator, one needs to be fully prepared for these requirements.