Software Reliability and Verification
He Guowei, Reliability Engineering Technology Center, National Defense Science and Technology Commission
This topic: Software Reliability
Software is playing an ever larger role in modern systems. As we all know, the hardware in a system carries reliability requirements, and those requirements must be verified. But if the software in the system has no reliability requirement, or the requirement is never verified, then the reliability of the system as a whole still cannot be guaranteed. Catastrophic accidents caused by software reliability problems occur from time to time. This special topic therefore focuses on software reliability issues, including:
1. Software Reliability and Verification
After reviewing the development of software reliability, this article introduces some important software reliability metrics and their verification, and, in light of the actual state of China's software industry, proposes how reasonable software reliability requirements can be set in China.
2. Combined use of software fault tolerance and software fault tree technology
This article introduces how to develop a highly reliable software system by combining software fault tolerance techniques with software fault tree techniques.
3. Software Reliability Engineering Practices
Ensuring software reliability has become an important engineering task. This article introduces a software reliability engineering method.
4. Software Reliability Test
The purpose of software reliability testing is to estimate the reliability of a software product correctly. This article introduces the basic concepts of software reliability testing, the issues to be aware of, and the specific test steps.
5. Aerospace Control Software Reliability Engineering Management
This article analyzes the problems that exist in the reliability engineering management of aerospace control software, and puts forward the concepts of software engineering management, software pre-analysis, and reliability measurement, aiming to advance the reliability engineering management of aerospace control software.
Editor in charge of this special topic: Liu Xue
I. Introduction
Software reliability problems have already been exposed to a considerable extent. The European Space Agency's Ariane 5 rocket was lost to a software fault, at a cost of hundreds of millions of dollars. In China, too, software faults have caused accidents with losses running into the millions. All of this shows how serious the problem is.
The author has exchanged views on software reliability with friends at NASA. In NASA's experience, software reliability is generally an order of magnitude lower than hardware reliability. NASA's software development is strictly organized according to software engineering, and every unit responsible for software development must pass the five-level maturity assessment defined by the SEI (Software Engineering Institute) in the United States. It is therefore a reasonable estimate that the reliability level of our software is, in general, no higher than that of our hardware.
More and more civil and military systems contain more and more software. Hardware reliability carries both requirements and verification: for example, the minimum acceptable value of hardware MTBF (mean time between failures) is verified in China according to the military standard GJB 899. For software, however, China has essentially no reliability requirements, and no verification is demanded. Software reliability therefore cannot be guaranteed, and even if the hardware meets its required reliability, the reliability of the system as a whole is still not assured. This situation is clearly unacceptable for both civil and military systems. When no software reliability requirement is stated, there is no incentive to improve software reliability; when a stated requirement need not be verified, there is no pressure on the software development unit. The key to improving software reliability is therefore not so much to allocate requirements to software the way hardware reliability is allocated among components, but above all to state requirements and verify them.
The American national standard ANSI/AIAA R-013-1992, Recommended Practice for Software Reliability, gives a verifiable failure-rate requirement of about 10^-4/h, that is, an MTBF of about 10,000 hours. Given the lower level of software development in China compared with the United States, a software MTBF between 1,000 and 10,000 hours should be considered an advanced requirement.
At present, the most important software reliability metric is MTBF. Other metrics remain to be explored. Hardware, for example, has an MTTR (mean time to repair) requirement; for software, however, users should not modify the software on their own, because self-made modifications are likely to make things worse and worse. Normally, after a fault occurs, the software must be modified by the original development organization and re-delivered only after a large number of necessary regression tests, and this period is often quite long. How to define an MTTR-like metric for software is therefore an open question. The reliability metric discussed in this article is limited to MTBF.
II. Software Defects, Faults, Failure Rate, and Mean Life
Many people in China are still unclear about the basic concepts of software reliability, so this article first clarifies some of them.
The period from the time a software product is first proposed to the time the product can no longer be used is called the "life cycle" of the software. At every stage of the software life cycle, human actions can make the software unable, or eventually unable, to perform its specified functions under the specified conditions. For example:
· The user's requirements are incomplete.
· The software producer misunderstands the user's requirements.
· The software producer does not fully implement the user's requirements.
· A later step in the production process does not fully implement the requirements of the preceding step.
· The algorithms or decision logic used are incorrect or incomplete.
· Simple human omission.
Such actions plant the root causes of software errors, which are called "defects".
A piece of software may contain several defects. When the software executes a task, however, it does not always use every part of the program. If the defective parts of the program are not exercised, the software can still complete the task correctly. When execution does reach a defective part of the program, and some behavior or output of the software departs from what is specified, a fault (failure) occurs. A failure, then, is a defect being exposed when the program path containing it is exercised under the particular conditions that trigger it.
The various parts of a program are used with very different frequencies during task execution: some are used constantly, some rarely. The likelihood of failure caused by different defects therefore differs, and the difference can span several orders of magnitude.
For a given software defect, the mean time between the failures it causes during operation is called the mean life of that defect, its MTBF, denoted θ. Its reciprocal λ = 1/θ is called the failure rate of that defect.
E. N. Adams of the United States published statistics on nine representative IBM software products in the IBM Journal of Research and Development in 1984. He divided software defects into eight classes, each with a representative MTBF value (in years):
1.58, 5, 15.8, 50, 158, 500, 1580, 5000
Let the representative MTBF of class i be θi and the number of class-i defects be Di. The total number of software defects is then
D = ΣDi (summed over the eight classes). (Formula 1)
The fraction of the total defect count contributed by class i is Ri = Di/D, and the total failure rate of the class-i defects is λi = Di/θi. The total failure rate of the software is
λ = Σλi. (Formula 2)
The fraction of the total failure rate contributed by class i is Fi = λi/λ, as shown in Table 1.
[Table 1. Adams' statistics]
Several important observations follow from these statistics: the failure rates of software defects can differ by three to four orders of magnitude; defects with high failure rates are few, while defects with low failure rates are many; and the small number of high-failure-rate defects accounts for the majority of the total failure rate (classes 1, 2, and 3 together account for 72.5% of it) while making up only a small fraction of the defect count. The number of defects in a piece of software therefore bears no direct relationship to its failure rate.
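These relationships can be sketched in a few lines of Python. The eight representative MTBF values are Adams' own; the per-class defect counts Di below are hypothetical, since Table 1's figures are not reproduced in the text:

```python
# Adams' eight defect classes: representative mean life (MTBF) in years.
mtbf_years = [1.58, 5, 15.8, 50, 158, 500, 1580, 5000]

# Hypothetical per-class defect counts Di (Table 1's real values are not
# reproduced here): few high-failure-rate defects, many low-rate ones.
counts = [1, 2, 4, 8, 15, 25, 40, 60]

# Per-class failure rate: lambda_i = Di / theta_i (failures per year).
lams = [d / theta for d, theta in zip(counts, mtbf_years)]

total_defects = sum(counts)                 # D = sum(Di)
total_rate = sum(lams)                      # lambda = sum(lambda_i)
shares = [l / total_rate for l in lams]     # Fi = lambda_i / lambda

# A handful of high-rate defects dominate the total failure rate,
# even though they are a small fraction of the defect count.
print(f"D = {total_defects}, lambda = {total_rate:.3f}/year")
print(f"classes 1-3 carry {sum(shares[:3]):.0%} of the failure rate")
```

Even with these invented counts, the first three classes hold under 5% of the defects but most of the total failure rate, mirroring Adams' observation.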
If you deliver software to a user and say that ten defects remain, the user cannot be at ease, because if those ten are class-1 defects, the first failure may occur within a month or two. Conversely, a hundred class-8 defects, each with a mean life of 5,000 years, would be acceptable. What the user really cares about is therefore: "However many defects remain in the delivered software, what is their total MTBF, or total λ?"
III. Analysis of the Failure Rates of Software Defects
Arrange the defects of a piece of software in descending order of failure rate: λ1 ≥ λ2 ≥ λ3 ≥ … In theory, these λ values can be arranged into a Pareto chart, as shown in Figure 1.
[Figure 1. Theoretical Pareto chart of software defect failure rates]
For hardware, such a Pareto chart not only exists in theory but can be obtained through reliability prediction: the failure rate of each electronic component can be looked up in GJB 299A (for domestic components) or MIL-HDBK-217F (for imported ones). For software, however, the failure rate of an individual defect cannot be predicted.
A television set, for example, has its own failure-rate Pareto chart: the picture tube and the high-frequency head (tuner) have the highest failure rates, while apart from a few such parts, the other components, such as integrated circuits, resistors, and capacitors, all have low failure rates. Other electronic products follow other patterns. The hardware of different electronic products can differ greatly; there is no single pattern common to all hardware.
For hardware, though, the Pareto chart is objective and its pattern can be extrapolated. A television set, say, has five faults with high failure rates, and from the sixth fault on the failure rate drops sharply; so once five major faults have been eliminated in succession, it can be concluded that the sixth remaining fault is not a major one. Software reliability design technology is far less mature than hardware design. Does software failure rate follow any such law?
[Formula 3]
Objectively, a piece of software does have a Pareto chart, but a pattern found in one piece of software does not necessarily fit another. Moreover, a pattern that fits the failure rates of the first few defects cannot necessarily be extrapolated to later ones.
Suppose the software failure rate follows some pattern. At t = 0, the software failure rate is
λ(0) = Σλi (summed over all defects present). (Formula 4)
At t = t1 the first fault occurs; after the corresponding defect is eliminated, the failure rate drops:
λ(t) = λ(0) − λ1 for t1 < t ≤ t2, assuming the fault was caused by defect 1. (Formula 5)
[Figure 2. λ(t) versus t]
λ(t) is therefore a step function that descends over time, as shown in Figure 2. This is the general pattern.
Some authors point out that, in the process of eliminating software defects, new defects may be introduced, even defects with high failure rates, so that λ(t) may rebound. They therefore assume probability distributions for the occurrence and size of rebounds, which makes the λ(t) model extremely complicated. From the perspective of engineering practice, however, under strict configuration management, whether a new defect has been introduced can be determined immediately, rather than being discovered later as a rebound.
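The step-down behavior of λ(t) can be illustrated with a small simulation, under the idealized assumptions above: each fault is traced to exactly one defect, repair removes that defect, and no new defect is introduced. The per-defect failure rates are invented for illustration:

```python
import random

random.seed(1)

# Hypothetical per-defect failure rates (failures per 1000 h), sorted
# descending as in the Pareto ordering lambda_1 >= lambda_2 >= ...
rates = sorted([0.9, 0.4, 0.15, 0.05, 0.02], reverse=True)

t = 0.0
history = [(t, sum(rates))]   # (time, current total failure rate)

# Each fault is caused by one defect, chosen with probability proportional
# to its rate; repairing removes that defect, so lambda(t) steps down.
while rates:
    lam = sum(rates)
    t += random.expovariate(lam)              # time to next failure
    hit = random.choices(range(len(rates)), weights=rates)[0]
    rates.pop(hit)                            # perfect-repair assumption
    history.append((t, sum(rates)))

# lambda(t) is a non-increasing step function ending at zero.
assert all(a[1] >= b[1] for a, b in zip(history, history[1:]))
print(history)
```

High-rate defects tend to be hit first, so the early steps are large and the later ones small, which is exactly the descending staircase of Figure 2.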
IV. Testing and Estimation of the Failure Rate λ(t)
Let the failure rate of the software be λ, with corresponding mean life θ, and let the software be put into operation at t = 0. The time t to the first failure is then an exponentially distributed random variable, with cumulative distribution function
F(t) = 1 − e^(−λt) = 1 − e^(−t/θ).
If, each time a fault occurs, the corresponding defect is eliminated and the software is put back into operation, the failure rate drops after each repair. That means there is only one observation t for each value of the failure rate λ. From the distribution above:
t falls in (0.105θ, 2.303θ) with probability 80%;
t falls in (0.0513θ, 2.996θ) with probability 90%;
t falls in (0.0253θ, 3.689θ) with probability 95%.
A fluctuation of twenty or thirty times between observed values of t is thus entirely normal. This is because the standard deviation σ of t equals its mean θ: the fluctuation is on the order of the true value itself.
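The interval endpoints quoted above follow directly from the exponential distribution: the p-quantile of t is −θ ln(1 − p). A quick check in Python:

```python
import math

def quantile(p, theta=1.0):
    """Inverse exponential CDF: the t with P(T <= t) = p, for mean life theta."""
    return -theta * math.log(1.0 - p)

# Symmetric two-sided intervals at 80%, 90% and 95% confidence
# (endpoints are in units of theta, since they scale linearly with it).
for conf in (0.80, 0.90, 0.95):
    tail = (1.0 - conf) / 2.0
    lo, hi = quantile(tail), quantile(1.0 - tail)
    print(f"{conf:.0%}: t in ({lo:.4f} theta, {hi:.4f} theta)")
```

Running this reproduces the three intervals in the text: (0.105θ, 2.303θ), (0.0513θ, 2.996θ), and (0.0253θ, 3.689θ).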
If the software is operated, and after each fault the defect is eliminated and operation resumes, λ(t) changes as
λ(0) → λ(t1) → λ(t2) → …
But each λ(ti) has only a single observation, and that observation fluctuates enormously. It is impossible to obtain an accurate law for λ(t) from such trial data.
By the statistical analysis above, for each λ(ti) the observed value ti+1 − ti is only a rough estimate. Some have attempted to describe it with fuzzy mathematics. But even then, its membership function would be neither a simple triangular function nor a normal distribution, but an extremely wide one, so treating the problem with fuzzy mathematics yields results of little practical value.
V. The Software Operational Profile and Reliability Data
A piece of software has many possible inputs, and the various possible inputs do not occur with equal probability. Each possible input of the software is therefore associated with a probability of occurrence. In a production-control application, for example, the inputs corresponding to normal conditions have high probability, while abnormal inputs generally have low probability. A complete set of input values forms one point p of the input space; all such points make up the input space S, on which a probability density function f(p) is defined. If G is a subset of S, the probability that the input point p falls in G is
P = ∫_G f(p) dp.
The pair {S, f(p)} is called the "operational profile" of the software.
The software user must specify the operational profile, which covers both the normal inputs to the software and the abnormal inputs that may occur. The requirements on the software include producing normal output under normal input, and responding correctly to possible abnormal input. For the onboard software of a satellite, for example, the normal inputs are the satellite's prior state information, the data measured by the current sensors, and the commands and information sent from the ground or by the astronauts to control the satellite.
Because of heavy-particle bombardment in space, however, bits stored in the computer's memory may flip between 0 and 1. Such abnormal inputs are possible, and the software must respond to them correctly.
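Random testing by operational profile means drawing test inputs with the same frequencies they would have in real operation. A minimal sketch, with input categories and probabilities invented loosely after the satellite example (they are assumptions, not values from the text):

```python
import random

random.seed(7)

# Illustrative operational profile {S, f(p)}: input categories with their
# assumed probabilities of occurrence in real operation (sum to 1).
profile = {
    "normal telemetry":      0.90,
    "ground command":        0.07,
    "sensor out of range":   0.02,
    "memory bit flip (SEU)": 0.01,   # abnormal input the software must handle
}

def draw_test_cases(n):
    """Sample n random test inputs with frequencies matching the profile."""
    cats = list(profile)
    weights = list(profile.values())
    return random.choices(cats, weights=weights, k=n)

cases = draw_test_cases(10_000)
freq = {c: cases.count(c) / len(cases) for c in profile}
print(freq)  # empirical frequencies approximate the profile probabilities
```

The point of sampling this way is that rare but critical inputs, such as the bit-flip case, appear in the test set roughly as often as they would in service, so the measured failure rate reflects operational reliability.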
A defect di of the software affects only one point set Gi in S: when the input point p ∈ Gi, defect di causes the software to fail. The probability of this is
Pi = ∫_Gi f(p) dp, (Formula 6)
and the corresponding failure rate is λi. When the fault domains Gi of the λi are mutually independent,
λ = Σλi, (Formula 7)
and λ(t) = Σλi(t), summed over the defects not yet eliminated at time t.
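Formula 6 can be read operationally: sample inputs from f(p) and count how often they land in the fault domain Gi. A toy one-dimensional sketch, where the uniform profile and the interval-shaped fault domain are both assumptions for illustration:

```python
import random

random.seed(3)

# Toy input space S = [0, 1) with uniform profile f(p) = 1.
# Defect d_i triggers a failure only when the input falls in its
# fault domain G_i, here taken to be the interval [0.20, 0.25).
def in_fault_domain(p):
    return 0.20 <= p < 0.25

# P_i = integral of f(p) over G_i, estimated by random sampling.
n = 100_000
hits = sum(in_fault_domain(random.random()) for _ in range(n))
p_i = hits / n
print(f"estimated P_i = {p_i:.3f} (exact value 0.050)")
```

With a non-uniform f(p), the same defect domain would yield a different Pi, which is precisely why a failure rate measured under the wrong profile does not reflect operational reliability.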
It must be stressed that software reliability is reliability under the specified conditions, that is, the probability of completing the task under the actual operating conditions. A failure rate measured under anything other than the actual operating conditions does not reflect the true reliability. To take an extreme example: feed the software one input that it handles correctly, repeat it thousands of times, and the output will always be correct, but this says nothing about reliability. For the same reason, the common practice of demonstrating a handful of test cases during acceptance shows nothing about reliability; even passing a hundred such test cases does not show that reliability is high. Only when inputs are drawn at random according to the operational profile does the software failure rate correctly reflect the software's reliability.
Software reliability can therefore be obtained in only two situations: from the failure rate measured during actual operation, or from the failure rate measured under random-input testing that simulates the actual operating conditions. Since software that has not been adequately tested should not be delivered for field operation, the software must, before delivery, be exercised with a large number of random input points, drawn under simulated operating conditions, as test cases; the failure rate computed from these tests serves as the reliability assessment.
Failure statistics from white-box and similar tests in software engineering cannot be used as the basis for reliability assessment. White-box testing is an important means of improving software reliability, but it is not random testing according to the operational profile, so its statistics do not reflect reliability. For example, 100% statement coverage in a white-box test, with every statement behaving correctly, does not show that the software's reliability is high. Some rigorous white-box tests require branch coverage of 85~95% and program-path coverage of 60~80%; even if all the covered branches and paths are correct, high reliability is not assured, because only part of the software has been covered, and the failure rate in the uncovered part may still be high. The various tests of software engineering are necessary for improving software reliability, but their statistics do not indicate software reliability.
The NTDS software data cited by many software reliability model researchers is a typical case. It contains 26 data points from the development phase and 5 from the test phase. Many authors use these data for reliability estimation; in fact, this is incorrect.
VI. Software Reliability Verification Testing
Under random input according to the operational profile, software time-to-failure is exponentially distributed. Verification testing of such software reliability can follow the reliability qualification and acceptance tests of GJB 899 (or MIL-STD-781D).
The target value of the software MTBF is the desired operational requirement for the software.
The threshold value of the software MTBF is the operational requirement the software must meet.
The specified value of the software MTBF is the contractual requirement stated in the software development task statement or contract. It is the basis on which the developer designs for software reliability, and it is determined with reference to the target value.
The minimum acceptable value of the software MTBF is the contractual requirement, stated in the software development task statement or contract, that forms the basis of verification. It is determined with reference to the threshold value.
The lower test MTBF θ1 is taken as the minimum acceptable MTBF value; the upper test MTBF θ0 is determined with reference to the specified value.
The discrimination ratio is d = θ0/θ1; usually d = 1.5, 2.0, or 3.0 is chosen.
The producer's risk α is the probability that the product is judged to fail when the true MTBF equals θ0; the consumer's risk β is the probability that the product is judged to pass when the true MTBF equals θ1. α and β are usually set to 10% or 20%; in special cases the risk may be as high as 30%.
The test time is a multiple of θ1. Ac is the number of failures at or below which the decision is "accept"; Re is the number of failures at or above which the decision is "reject".
Based on the discrimination ratio and the values of α and β, GJB 899 provides the fixed-duration test plans of Table 2.
[Table 2. Fixed-duration test plans from GJB 899]
In plan 17, for example, d = 3.0 (that is, θ0 = 3θ1), α = 20%, β = 20%, and the test time is 4.3θ1. If the number of failures observed is 2 or fewer, the software passes; if it is 3 or more, it fails.
Plans 19 to 21 are also recommended; an appropriate GJB 899 sequential test plan may of course be chosen instead.
Even if the software passes the verification test, the faults exposed during the test must have their defects identified and eliminated, regression testing must be performed, and it must be confirmed that no new defects have been introduced before delivery.
Fixed-duration test plans make it easier to arrange the test schedule.
The test time is the cumulative test time; any number of computers may run the reliability test simultaneously. In the example above, with θ1 = 1000 h, the test time is 4300 h; if three computers run the random-input verification at the same time, the test finishes in 1433 h.
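Under the exponential assumption, the number of failures observed in a fixed test time follows a Poisson distribution, so the risks of plan 17 can be checked directly. This is a verification sketch of the plan's arithmetic, not a procedure from GJB 899 itself:

```python
import math

def poisson_cdf(k, mean):
    """P(N <= k) for N ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i)
               for i in range(k + 1))

theta1 = 1000.0          # lower test MTBF (minimum acceptable value), hours
d = 3.0                  # discrimination ratio, so theta0 = d * theta1
T = 4.3 * theta1         # total test time for plan 17
accept = 2               # Ac: accept if the number of failures is <= 2

# Consumer's risk beta: probability of acceptance when true MTBF = theta1.
beta = poisson_cdf(accept, T / theta1)
# Producer's risk alpha: probability of rejection when true MTBF = theta0.
alpha = 1.0 - poisson_cdf(accept, T / (d * theta1))

print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")   # both close to 0.20
machines = 3
print(f"per-machine test time: {T / machines:.0f} h")
```

The computed risks come out near the nominal 20% for both sides, and dividing the 4300 h of test time across three machines gives the 1433 h mentioned above.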
The reason defects are not eliminated during the verification test itself is that eliminating a defect requires a complete and strict set of management and technical procedures, and regression testing is required after each change; the cycle for eliminating a defect is generally long.
Passing the verification test demonstrates the minimum acceptable value of the software MTBF; it does not demonstrate that the software MTBF has reached the specified value.
After acceptance, software reliability growth testing (SRGT) should continue: faults exposed after delivery and during use should be collected, and defects should go on being eliminated, so that the software's reliability keeps growing toward the specified value.