Reflection on Architect: System Reliability

Source: Internet
Author: User

I have recently been studying system reliability and related topics, and I will summarize what I learned here.

First, what is system reliability? System reliability is the ability of a system to perform its required functions within a specified time and under specified conditions; in other words, it is the probability that the system runs without failure.

I will summarize the main content from the following aspects:

1. Fault Model

2. Reliability Model

3. Reliability Indicators

4. Reliability Design

Fault Model

System faults are hardware or software errors. Faults are generally introduced by component failure, physical interference from the environment, operator error, or incorrect design.

By duration, faults can be classified as permanent, intermittent, or transient.

By level, faults include logical faults, data-structure faults, software faults and errors, and system-level faults.

Reliability Model

Corresponding to the fault model is the system reliability model. There are three common models: the time model, the fault-injection model, and the data model.

I do not yet fully understand these three models (a bit dizzying).

Reliability Indicators

Reliability Indicators mainly include the following:

Mean Time To Failure (MTTF)

It indicates how long, on average, a system runs before failing.

A related indicator is the failure rate λ, given by λ = 1/MTTF.

Mean Time To Repair (MTTR)

The average time required for each repair.

Mean Time Between Failures (MTBF)

At first glance, we can see that MTBF = MTTF + MTTR.

In practice, MTTR is usually small relative to MTTF, so we often take MTBF ≈ MTTF.

MTTF measures how long a software system can run normally; the larger the value, the more reliable the system. The calculation method is simple.
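The relations above can be written out directly in Python. This is a minimal sketch of my own; the numeric values are hypothetical, and availability (a standard metric derived from the same quantities) is included for completeness even though the text does not name it:

```python
def failure_rate(mttf_hours: float) -> float:
    """Failure rate as the reciprocal of MTTF (failures per hour)."""
    return 1.0 / mttf_hours

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up,
    MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical system: fails every 1000 h on average, takes 2 h to repair.
mtbf = 1000.0 + 2.0                 # MTBF = MTTF + MTTR
rate = failure_rate(1000.0)         # 0.001 failures per hour
up = availability(1000.0, 2.0)      # roughly 99.8% uptime
```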

Reliability Calculation

The reliability of a system often cannot be obtained directly. A computer system is complex, and the factors affecting its reliability are equally complicated. We therefore establish an appropriate mathematical model, divide the large system into several subsystems, and combine the subsystem results according to certain rules.

This computation method simplifies the analysis process.

For system composition, we can distinguish series systems, parallel systems, modular redundancy systems, and hybrid systems. (A modular redundancy system consists of n subsystems in parallel, of which at least m must work for the whole system to work; the parallel units are typically followed by a voter.)

When calculating the reliability of these systems, we first determine the reliability of each subsystem and then combine the results: for a series system the subsystem reliabilities are multiplied, while for a parallel system the subsystem failure probabilities are multiplied (and the product subtracted from 1). This finally yields the reliability of the whole system.
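These combination rules can be sketched in Python. This is my own minimal illustration, not from the original article, and the reliability values in the usage note are hypothetical:

```python
from math import comb

def series(reliabilities):
    """Series system: every subsystem must work, so reliabilities multiply."""
    r = 1.0
    for x in reliabilities:
        r *= x
    return r

def parallel(reliabilities):
    """Parallel system: it fails only if every subsystem fails."""
    fail = 1.0
    for x in reliabilities:
        fail *= 1.0 - x
    return 1.0 - fail

def m_of_n(r, m, n):
    """Modular redundancy: at least m of n identical units (reliability r)
    must work; sum the binomial probabilities for k = m..n working units."""
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(m, n + 1))
```

For example, two subsystems of reliability 0.9 give 0.81 in series but 0.99 in parallel, and 2-of-3 modular redundancy (triple modular redundancy with voting) on 0.9 units gives 0.972.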

Reliability Design

This section is the focus of the whole article.

There are two ways to improve system reliability: fault avoidance and fault tolerance. Fault avoidance means taking measures in advance so that errors do not arise during operation. Fault tolerance means that the system can keep running even when some components fail because of errors, or that when data or files are damaged or lost, the system can automatically restore them to a previous state so that normal operation continues.

Testing is the most common fault-avoidance technique; fault tolerance is generally implemented through redundancy.

Redundancy Technology

Redundancy is the main means of fault tolerance. By providing redundant resources, including hardware, software, information, and time, we can greatly improve a system's fault tolerance.

Structural Redundancy

Structural redundancy is further divided into static and dynamic redundancy.

Static redundancy generally adds components with the same function that run simultaneously; a voter then compares their outputs and takes the majority result as the system's final result.

Dynamic redundancy keeps spare units in reserve. When the system detects that a unit has failed, it switches in the corresponding spare. This process of detection, switching, and recovery is what characterizes dynamic redundancy. The spare units may run alongside the main module (hot backup) or stay idle until needed (cold backup). The disadvantage of cold backup is that when the main module fails, the backup may not take over in time because it cannot obtain all of the original machine's data.

In fact, we can also combine the strengths of both types and use a hybrid redundancy scheme when designing structural redundancy for a system.

Information Redundancy

Additional information is attached to the data to verify its correctness, for example error-detecting and error-correcting codes.
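As a minimal illustration of information redundancy (my own example, not from the article), here is an even-parity check: one extra bit is added so that any single-bit error becomes detectable.

```python
def add_parity(bits):
    """Append an even-parity bit: the total count of 1s becomes even."""
    return bits + [sum(bits) % 2]

def check_parity(word):
    """A valid even-parity word has an even number of 1s;
    any single flipped bit makes the count odd and is detected."""
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])   # -> [1, 0, 1, 1, 1]
ok = check_parity(word)           # True: no corruption
word[0] ^= 1                      # simulate a single-bit error
bad = check_parity(word)          # False: the error is detected
```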

Time Redundancy

Similar to structural redundancy, except that here the same computation is repeated on the same device.

Fault Recovery Strategies

Once a fault has occurred, some method is needed to recover from it. Generally there are two recovery strategies: forward and backward.

Forward recovery restores the system to a consistent, correct state without stopping the current computation. It requires a detailed description of the error. For example, when a system failure occurs, we can capture and log all the exception information for the record and try to keep the system running. This is also the most common strategy.

Backward recovery restores the system to a previous state and continues execution from there. This method is relatively simple, but it makes the program's execution inconsistent and is unsuitable for demanding systems such as real-time systems.

Software Fault Tolerance

There are mainly the following methods:

Recovery Block Method

This method is a dynamic fault-masking technique that uses a backward recovery strategy. It provides a primary module and multiple alternate modules with the same function. After the primary module finishes its computation, an acceptance test is run; if the test fails, an alternate module redoes the computation, and if the test fails again, the next alternate module is tried.

The independence of the primary block and the alternate blocks should be ensured during design so that they do not affect each other.
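The recovery-block scheme above can be sketched as follows. This is a hypothetical example of my own: buggy_sqrt stands in for a faulty primary block, library_sqrt for an independently written alternate, and the acceptance test checks the result before it is accepted:

```python
import math

def recovery_block(state, acceptance_test, primary, *alternates):
    """Run the primary block; if its result fails the acceptance test,
    restart from the saved input state with the next alternate block."""
    checkpoint = state  # state saved before any variant runs (backward recovery)
    for variant in (primary,) + alternates:
        result = variant(checkpoint)
        if acceptance_test(result):
            return result
    raise RuntimeError("all variants failed the acceptance test")

def buggy_sqrt(x):       # hypothetical faulty primary block
    return -1.0

def library_sqrt(x):     # independently implemented alternate block
    return math.sqrt(x)

def acceptable(r):       # acceptance test: r really is the square root of 9
    return r >= 0 and abs(r * r - 9.0) < 1e-9

result = recovery_block(9.0, acceptable, buggy_sqrt, library_sqrt)  # 3.0
```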

N-Version Programming

This method is a static fault-masking technique and adopts a forward recovery strategy.

N programs with the same function run simultaneously, and a voter selects the final result.

Key points:

The N versions should be developed with different methods, for example different design languages and different development environments and tools.

At the same time, because the N programs run simultaneously and vote at the end, the concurrency among them must be handled.
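A minimal sketch of N-version voting (my own illustration; the three lambdas stand in for independently developed implementations, one of them deliberately faulty):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def n_version(x, *versions):
    """Run every version concurrently on the same input and vote:
    the result returned by a strict majority of versions wins."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        results = list(pool.map(lambda f: f(x), versions))
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among the versions")
    return value

double_a = lambda x: x * 2        # version 1
double_b = lambda x: x + x        # version 2 (different design, same function)
double_bad = lambda x: x * 2 + 1  # version 3: hypothetical faulty version

answer = n_version(3, double_a, double_b, double_bad)  # 6, by 2-of-3 majority
```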

Defensive Programming

The basic idea of this method is to include error-detection code in the program itself. Once an error occurs, the program can undo it and restore a known correct state. It comprises error detection, damage assessment, and error recovery.

This method implements fault tolerance purely in software; that is, the software itself has strong fault tolerance. It is quite commonly used.

Cluster

A cluster is a loosely coupled set of computing nodes built from two or more servers. It presents users with a single client view of network services or applications (including databases, web services, and file services) and offers fault-recovery capability approaching that of fault-tolerant machines.

When clusters come up, we generally think of using them to give applications scalable high performance. However, clusters can also give applications high fault tolerance. The cluster categories are:

High-performance scientific computing clusters, load-balancing clusters, and high-availability clusters.

In practical applications, these three basic types are often used together.

Hardware configuration

(1) Dual-host backup storage

Two separate servers act as mirror servers, with mirroring software synchronizing data over the network. Performance is lower than that of a single machine.

Features: simple and inexpensive, but reliability and performance are low, and mirroring occupies network resources.

(2) Dual-host and disk array cabinets

This method also uses dual servers, while the back-end data storage uses a disk array cabinet. The cabinet presents logical disk arrays to both hosts, and new physical disks cannot be added arbitrarily.

Because no data synchronization is required, performance is much higher than with dual-machine mirroring. However, there may be a single point of failure (SPOF): when one part of the system fails, the whole system goes down. If the disk array fails, all stored data may be lost.

Features: high performance, but with a possible single point of failure.

(3) Fiber Channel dual-host dual-controller Cluster System

Optical fiber is used to establish the connection channel, and mirrored configurations are allowed.

Features: High scalability and high cost.

With the development of hardware and network operating systems, cluster technology will gradually improve system availability, reliability, and redundancy.

(For example, a cluster can use a cluster file system to access all files, devices, and network resources in the system, presenting a single system image.)

Summary

Reliability Engineering

Link: http://wenku.baidu.com/view/98b021225901020207409c76.html

This article describes how to carry out reliability engineering from an engineering point of view and illustrates how to apply it in the software development process.

Concept and development

A simple definition: the prediction, modeling, estimation, measurement, and management of software product reliability.

The goal is to improve the reliability of the software system. To achieve this goal, we need to understand the cause of failure.

The core issue is how to develop highly reliable software; a second issue is how to evaluate the reliability of existing systems.

Application in software development

Reliability Engineering runs through various stages of the software development lifecycle.

Project development plan and requirement analysis stage

In this phase, the reliability requirements should be clarified and system reliability indicators established. Reliability work can generally be arranged as follows:

1) Define the functional profile

The functional profile describes the functions in the system, the environment in which each is used, and the probability of each being used.

2) Define and classify failures

3) Confirm the user's reliability requirements

4) Trade-off studies

5) Establish reliability indicators

Software design and function implementation

Main work in this phase:

1) Allocate Reliability Indicators between modules

The system is decomposed into multiple modules, and reliability indicators are allocated among them so that the combined result meets the overall target.

2) Design Based on Reliability Indicators

For details about the reliability design, refer to the content above.

3) Configure resources according to the functional profile

4) Control the introduction and propagation of errors

Software review (code review) and software testing (unit testing and integration testing).

5) Test the reliability of off-the-shelf (reused) software
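The indicator allocation in step 1 can be sketched under a simple equal-apportionment assumption (my own example, not the article's method): for n modules composed in series, each module must meet the n-th root of the overall target.

```python
def allocate_equal(target: float, n_modules: int) -> float:
    """Per-module reliability so that n equally reliable modules in
    series (product of reliabilities) meet the overall target."""
    return target ** (1.0 / n_modules)

# Hypothetical overall target of 0.9 split across 3 series modules:
per_module = allocate_equal(0.9, 3)  # each module needs roughly 0.965
```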

System Testing and On-Site Trial Run

This stage is the final stage to ensure reliability. Main Tasks:

1) Operational profile

The operational profile describes the operations (commands) available to the end user, their usage environment, and the probability of each being used.

2) Reliability enhancement test

System testing and delivery testing.

Execute test cases according to the probabilities given in the operational profile, so that testing reflects the way users actually use the application.

3) Verify through testing whether the reliability indicators have been met

Collect failure data and plan additional tests as needed.

4) On-site reliability evaluation

Analyze the data and the causes of any discrepancies.
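Executing test cases in proportion to the operational profile (step 2 above) can be sketched like this; the profile below is hypothetical, invented for illustration:

```python
import random

# Hypothetical operational profile: operation -> probability of use.
PROFILE = {"query": 0.60, "update": 0.25, "report": 0.10, "admin": 0.05}

def sample_operations(profile, n, seed=0):
    """Draw n operations to test, weighted by the operational profile,
    so testing effort mirrors how users actually exercise the system."""
    rng = random.Random(seed)
    ops, weights = zip(*profile.items())
    return rng.choices(ops, weights=weights, k=n)

plan = sample_operations(PROFILE, 1000)
# The most frequently used operations dominate the resulting test plan.
```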

Maintenance Phase

Main Tasks:

1) Plan the personnel required after delivery.

2) Monitor on-site reliability and make appropriate adjustments.

3) Monitor failures caused by new features and maintain them.

4) Analyze the causes of post-delivery failures to guide process improvement and reduce the chance of introducing similar errors.

Success stories

This article uses the development of a switch as an example of applying reliability engineering, which brought remarkable benefits to the product:

Fewer problems, lower maintenance costs, shorter testing cycles, shorter intervals between new product introductions, and improved customer satisfaction.

The reasons are as follows:

(1) Reliability is used as the criterion for deciding whether to release the product. This prevents users from encountering too many problems in use, along with the corresponding maintenance work.

(2) An operational-profile-driven testing method improves test efficiency: 20% of the operations cover 95% of usage, and 20% of the errors account for 95% of the failures; testing the 20% most frequently used operations first accelerates reliability growth.

Conclusion

There is not yet a systematic reliability engineering theory at home or abroad. We need to continue researching and summarizing in combination with practice, and strive to make reliability work planned, organized, and targeted.

High reliability test

Link: http://tech.it168.com/a2008/0829/202/000000202483.shtml

Taking the author's craftgs system as an example, this article describes how testing techniques are used to ensure high software reliability. These techniques include software verification, software validation, and software test management.

Summary

High-reliability software is software whose failure during operation would cause a major catastrophe or serious economic loss. Aerospace software, banking software, medical software, and communication software generally fall into this category.

The author's craftgs system is a software system with high reliability requirements; the reliability indicator of each subsystem is above 0.95.

Solution: software verification techniques + software validation techniques + software test management.

Verification techniques are carried out manually through face-to-face inquiry, document spot checks, informal meetings, and peer review.

Software validation techniques focus primarily on finding errors in program code, and a high degree of automation is now available.

Engineering quality control relies mainly on test management, which can be divided into software test team organization management, software test plan management, software defect (error) tracking management, and testware management.

Software Verification Technology

It mainly includes the following aspects:

Requirement Specification Verification

Ensure that all user requirements (functional, service, non-functional, and constraint) have been allocated to requirement items in the Software Requirements Specification.

Design Specification Verification

It mainly checks, step by step, whether the high-level design and the detailed design cover all the results of the preceding analysis. Database design must also be verified.

Code Verification

Including code specification review, code review, and static code analysis.

Delivery Verification

After testing is complete, delivery verification and testing are required before the system is delivered to the customer. Delivery verification consists of installation verification and usage verification, ensuring that the software matches the user manual.

Software validation Technology

This is in fact testing technology. It includes:

Unit Test (white box)

Build stub modules and driver modules to exercise the unit under test (a function, class, or module), running the designed test cases against each unit.

Integration Testing (gray box)

Verify whether the assembled modules meet the design goals in the high-level design specification: whether there are conflicts between the modules and whether they work together correctly. Integration testing generally proceeds bottom-up, from small assemblies to large ones.

System Testing (black box)

Check whether the system meets all requirements in the Software Requirements Specification, including business requirements, functional requirements, non-functional requirements (quality attributes), and constraints. Although the code is not examined directly, the requirements cover a wide range, so there are many testing methods, such as:

Functional testing, path testing, reliability testing, stress testing, recoverability testing, portability testing, and so on.

What these tests have in common: design and run test cases under specified environmental conditions (such as simulated field conditions or extreme conditions), then assess from the result data whether the system meets the various software requirements.

Delivery Test

The main participants in delivery testing are the target customers, and the more customers participate, the better. It mainly covers installation testing, usability testing, alpha testing, beta testing, and so on.

Software Test Management

Software testing team organization management

Whether a proper test team is set up directly affects the progress and quality of testing. The test team for the author's craftgs system includes senior test experts, testers, part-time personnel (for peer review), and new testers.

Software Test Plan Management

In essence, this means planning and scheduling the testing process.

It mainly includes software test planning, tailoring of testing techniques, test schedule management, and cost management.

Software Defect (error) Tracking Management

Track the full lifecycle of each defect to ensure that every error is corrected promptly without introducing new ones. When a tester submits a defect, the development team should be urged to fix it promptly, and regression testing should follow each fix. A bug-tracking system is generally used for this.

Testware Management

Efforts should be made to build a testware library for the test team and to train team members so they can make good use of it.

Testware comprises the assets produced by testing, including lessons learned, testing skills, testing tools, specification documents, and common scripts.

Testware management mainly involves building this library and training people to use it.

Conclusion

At present, how to test high-reliability software is still an immature field, and systematic methods are lacking.
