Christmas is coming, and a reliable server is at the top of every IT team's wish list. With virtualization, a single physical server carries dozens of workloads, so the need for reliability is more intense: a hardware failure or a failed migration that crashes the server now disrupts every workload it hosts.
Server reliability technologies, such as redundant power supplies and memory error monitoring and correction, have been slow to evolve. The protocols and behaviors needed to identify, accommodate, and resolve fault conditions are expensive to implement, and there is no broad implementation standard for interoperability across all tiers. This article describes some of the latest tools that make it easier for IT staff to build reliable servers.
Memory Subsystem Reliability
Parity checking and error-correcting code (ECC) technology date back more than 10 years, and the newer memory sparing and mirroring features are also fairly mature. But as virtualization drives up both the amount of memory in a server and its importance, we need more robust memory protection technology.
Demand and patrol scrubbing are advanced applications of ECC memory. With demand scrubbing, the system corrects a random or accidental ECC error when the affected data is read during normal operation. Patrol scrubbing proactively sweeps system memory to locate and correct errors. If these actions fail to repair a memory error, the failure is permanent. A potentially persistent failure triggers resiliency features, such as failing over to a copy of the data in mirrored memory mode. Some systems also flag failed locations to prevent future use of the problematic memory.
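The difference between the two scrubbing styles can be illustrated with a toy model. This is only a sketch under stated assumptions: the class ScrubbedMemory, its methods, and the "soft"/"hard" fault labels are hypothetical names made up for illustration, and real scrubbing is carried out by the memory controller hardware, not by software like this.

```python
class ScrubbedMemory:
    """Toy model of demand and patrol scrubbing (illustrative only)."""

    def __init__(self, size):
        self.data = [0] * size
        self.fault = [None] * size   # None, "soft" (transient) or "hard" (permanent)
        self.retired = set()         # addresses flagged as unusable

    def inject(self, addr, kind):
        """Simulate an error appearing at an address."""
        self.fault[addr] = kind

    def _correct(self, addr):
        if self.fault[addr] is None:
            return "clean"
        if self.fault[addr] == "soft":
            # Writing the corrected data back clears a transient error.
            self.fault[addr] = None
            return "corrected"
        # The error survives the rewrite, so treat it as a permanent failure.
        self.retired.add(addr)
        return "retired"

    def read(self, addr):
        """Demand scrub: check and repair a location when it is accessed."""
        status = self._correct(addr)
        return self.data[addr], status

    def patrol(self):
        """Patrol scrub: sweep every address that is still in service."""
        report = {}
        for addr in range(len(self.data)):
            if addr in self.retired:
                continue
            status = self._correct(addr)
            if status != "clean":
                report[addr] = status
        return report


mem = ScrubbedMemory(8)
mem.inject(2, "soft")
print(mem.read(2))      # the error is repaired on access
mem.inject(3, "soft")
mem.inject(5, "hard")
print(mem.patrol())     # the sweep repairs address 3 and retires address 5
```

In this model a retired address is simply skipped by later sweeps; a real system would at this point fail over to spare or mirrored memory.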
ECC alone can only correct single-bit errors at any memory location, so more advanced errors call for other techniques. Single device data correction (SDDC), also called advanced ECC, combines ECC modes to correct multi-bit memory errors within a single memory chip. By comparison, double device data correction (DDDC) lets the server withstand simultaneous multi-bit errors on two memory chips. Enhanced DDDC, or DDDC+1, can find and correct an additional single-bit error on top of that. These technologies address a wider range of memory faults and prevent a total workload crash.
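To show why basic ECC stops at single-bit errors, here is a minimal sketch of a textbook extended Hamming code (single error correction, double error detection) protecting a 4-bit value. It is an assumption-laden illustration: memory controllers protect much wider words with their own codes, the function names encode and decode are hypothetical, and the sketch does not model SDDC or DDDC.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (a list of bits).

    Index 0 holds an overall parity bit; indexes 1..7 are the classic
    Hamming(7,4) positions, with parity bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos           # XOR of the positions of all set bits
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        code[syndrome] ^= 1           # one flipped bit; the syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None  # two flipped bits: detected, not repairable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                          # a single-bit error is corrected
print(decode(word))
word[2] ^= 1                          # a second error can only be detected
print(decode(word))
```

The second flipped bit is the case the chip-level schemes described above are designed to handle.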
Memory mirroring protects against DIMM failures by synchronously replicating memory contents to a second copy. When a memory failure is detected, the system switches to the mirrored copy until the faulty DIMM is replaced. New servers on the market also support partial memory mirroring, which mirrors only the portion of server memory used by mission-critical workloads. This reduces cost, because less memory has to be set aside for the mirror.
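The trade-off of partial mirroring can be sketched as follows. This is a simplified model under assumptions, not a description of any vendor's implementation: the class PartiallyMirroredMemory, the MemoryFailure exception and the primary_failed flag are hypothetical names, and the mirrored region is assumed to be the lowest addresses.

```python
class MemoryFailure(Exception):
    """Raised when failed memory has no mirrored copy to fall back on."""


class PartiallyMirroredMemory:
    """Toy model: only addresses below mirrored_size have a second copy."""

    def __init__(self, size, mirrored_size):
        self.primary = [0] * size
        self.mirror = [0] * mirrored_size
        self.primary_failed = False

    def write(self, addr, value):
        mirrored = addr < len(self.mirror)
        if self.primary_failed and not mirrored:
            raise MemoryFailure(f"address {addr} is not mirrored")
        if not self.primary_failed:
            self.primary[addr] = value
        if mirrored:
            self.mirror[addr] = value   # synchronous copy on every write

    def read(self, addr):
        if not self.primary_failed:
            return self.primary[addr]
        if addr < len(self.mirror):
            return self.mirror[addr]    # fail over to the mirrored copy
        raise MemoryFailure(f"address {addr} is not mirrored")


mem = PartiallyMirroredMemory(size=8, mirrored_size=4)
mem.write(1, 11)            # critical workload, mirrored region
mem.write(6, 66)            # ordinary workload, unprotected region
mem.primary_failed = True   # simulate a DIMM failure
print(mem.read(1))          # still readable from the mirror
```

After the simulated failure, reading address 6 raises MemoryFailure, which is the price of mirroring only part of the memory.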
Processor Subsystem Reliability
The greatest threat to server reliability arises when memory or processor errors propagate through the system and pass between workloads. Data containment mode identifies errors in one or more memory locations and prevents other processes from continuing to use the affected data. For example, in the event of an unrecoverable error, filtering mode stops the system from moving data to the network or across the PCIe bus, which isolates the server and prevents any corrupted data from being transmitted to users or other servers.
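The containment idea can be pictured with a small model. Everything here is an illustrative assumption: the class ContainedMemory, the exceptions, and the fatal flag are hypothetical names, and real containment is enforced by the processor and chipset, not by application code.

```python
class PoisonedDataError(Exception):
    """Raised when a consumer touches data marked as corrupt."""


class ContainmentError(Exception):
    """Raised when outbound I/O is attempted after an unrecoverable error."""


class ContainedMemory:
    """Toy model of error containment (illustrative only)."""

    def __init__(self, size):
        self.data = [0] * size
        self.poisoned = set()    # locations known to hold corrupt data
        self.isolated = False    # True once outbound transfers are blocked

    def report_error(self, addr, fatal=False):
        """Mark a location as corrupt; a fatal error also isolates the server."""
        self.poisoned.add(addr)
        if fatal:
            self.isolated = True

    def load(self, addr):
        if addr in self.poisoned:
            raise PoisonedDataError(f"address {addr} holds corrupt data")
        return self.data[addr]

    def send(self, addr, outbox):
        """Move a word toward the network (modeled as a list)."""
        if self.isolated:
            raise ContainmentError("outbound I/O is blocked")
        outbox.append(self.load(addr))


mem = ContainedMemory(4)
mem.data[1] = 7
outbox = []
mem.report_error(0)              # contained: only address 0 is off limits
mem.send(1, outbox)              # healthy data still flows
mem.report_error(2, fatal=True)  # unrecoverable: the server is isolated
print(outbox, mem.isolated)
```

Once the server is isolated in this model, every call to send raises ContainmentError, so nothing further reaches the outbox.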
The server uses processor sparing to seamlessly migrate a workload from a faulty processor core to an idle core. The faulty core stays idle until the fault is resolved. As with memory sparing, processor sparing only works when the server has a free core, so it is a poor fit for highly utilized hosts, which are often the ones that can least tolerate downtime. If your server has a socket disable feature, it can even boot with a faulty processor installed.
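The dependency on a free core is easy to see in a sketch. This is a hypothetical model, not how any particular platform implements sparing: the class CorePool and its methods are invented for illustration, and a workload is just a name.

```python
class CorePool:
    """Toy model of processor sparing (illustrative only)."""

    def __init__(self, core_count):
        self.workload = {core: None for core in range(core_count)}  # None = idle
        self.offline = set()

    def assign(self, core, name):
        self.workload[core] = name

    def fail(self, core):
        """Take a core offline and move its workload to an idle core.

        Returns the spare core that took over, or None if the failed core
        was idle. Raises RuntimeError when no idle core is available.
        """
        self.offline.add(core)
        name = self.workload[core]
        if name is None:
            return None
        for spare, load in self.workload.items():
            if spare not in self.offline and load is None:
                self.workload[spare] = name
                self.workload[core] = None
                return spare
        raise RuntimeError(f"no idle core available for {name!r}")


pool = CorePool(3)
pool.assign(0, "db")
pool.assign(1, "web")
print(pool.fail(0))   # "db" moves to the idle core 2
```

A second failure on core 1 would raise RuntimeError, because core 0 is offline and core 2 is busy. That is the situation a fully utilized host is in from the start.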
Other Server Reliability Features
In the past, a server failure meant shutting down the entire system to repair the faulty device. Some servers now include hot-add or hot-plug features, which let you upgrade or replace core components such as CPUs, DIMMs, and PCIe cards while the server is running.
Hot add is the product of electrical engineering, BIOS, and operating system intelligence working together. Some operating systems, such as Windows Server 2008 R2, Red Hat Enterprise Linux 6, and SUSE Linux Enterprise Server 11, can recognize and configure new resources while the server is running.