Avoiding failures in critical data centers is always a top priority. If even a short outage can hurt the market value of your business, the physical infrastructure that supports the network environment must be reliable. But how can you be confident that a solution is reliable? Mean time between failures (MTBF) is the most common measure of reliability. Without a thorough understanding of MTBF, however, you may fail to meet your business reliability goals. "Mean Time Between Failure: Explanation and Standards" describes the basic principles of MTBF. An MTBF figure is meaningless if the definition of a failure is unclear or if the underlying assumptions are unrealistic or misinterpreted.
This article explains how to use MTBF as a basis for specifying and selecting systems. It also provides a checklist to help ensure fair and meaningful comparisons across systems.
Methods and steps of MTBF Comparative Analysis
There are several methods for predicting MTBF. With so many available, it can seem impossible to find two systems whose figures were derived with the same method. However, one method can be applied across the different processes used in most organizations. The field data evaluation method uses actual field failure data, so it gives a more accurate assessment of the failure rate than simulation does. Such data may not be available for new products or for products made in small batches, but it should always be used for products that are widely deployed in the field. For cross-system comparison, therefore, a comparison based on field data evaluation is the most reasonable and realistic. Note that this method, like many others, assumes a constant failure rate.
This article walks through the steps of this method and, for each step, lists and describes the variables that can affect the result. If key assumptions or variables differ between the systems being compared, it is important to assess how those differences could affect the MTBF estimates.
Step 1: Define the sample population. The aim is to determine the annual failure rate (AFR) of the product and, from it, the MTBF. The first step is to determine the sample population of the specific product to be analyzed. Is the calculation based on a specific product model or on the entire product series? Over what span of production time (in days or months) should products be included in the sample population, and on which production dates does that span start and end? It is very important that the products selected for sampling be very similar in design and numerous enough to ensure the statistical validity of the collected data.
Step 2: Determine the sample time range. The second step is to determine the time range over which failure data is collected from the sample population. Data is usually collected when a product user reports a failure to the supplier. The appropriate interval between the latest production date in the sample population and the start of the sample period depends on the product, geographic location, distribution process, and inventory location. For example, if a product sits in a factory warehouse for two months and spends two more months in the distribution channel, sampling can start only four months after the latest production date in the sample population. For products that pass through wholesalers, dealers, and retailers, four months is considered a reasonable allowance for these variables.
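The warehouse-plus-distribution example above can be sketched as a small date calculation. This is only an illustration: the function name, the production date, and the choice of 61 days as "two months" are assumptions made for the example and do not come from any supplier's process.

```python
from datetime import date, timedelta


def earliest_sample_start(last_production_date, warehouse_days, distribution_days):
    """Return the earliest date on which the sample period should begin.

    The delay is the time a unit spends in the factory warehouse plus the
    time it spends in the distribution channel before it is deployed.
    """
    return last_production_date + timedelta(days=warehouse_days + distribution_days)


# Hypothetical values: last unit built on 31 January 2023, two months
# (taken here as 61 days) in the warehouse and two more in distribution.
start = earliest_sample_start(date(2023, 1, 31), warehouse_days=61, distribution_days=61)
print("Start sampling no earlier than", start)
```

A supplier with a longer or shorter channel would substitute its own delays; the point is only that the sample period is anchored to the latest production date, not the earliest.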
Two variables are important here: (1) the interval between the latest production date in the sample population and the start of the sample period must be long enough, and (2) the data collection window must be wide enough to make the results reliable.
If the interval between the latest production date in the sample population and the start of the sample period is too short, sampling may begin before the products in the sample population are fully deployed. This has two consequences. First, because products that have not been deployed cannot fail, the failure rate tends to be underestimated. Second, the sample period may capture a large number of installation or setup failures. Because the failure rate of new products may follow the standard "bathtub" curve, a large number of installation failures can lead to an overestimated failure rate. Although the two effects clearly pull in opposite directions, they cannot be expected to cancel each other out.
The other important sampling issue is the duration of the window: how long does it take to collect enough failure data? The window must be wide enough to remove statistical "noise" from the sample. The duration needed for reasonable accuracy depends on the size of the sample population. For example, one month may be enough for a high-volume product, while a low-volume product may need several months.
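One way to get a feel for how wide the window needs to be is to estimate how many failures it is likely to capture. The sketch below is not part of the method described here; it assumes that failures arrive as a Poisson process (consistent with a constant failure rate), in which case the relative uncertainty of a count is roughly one over the square root of the expected count. The population sizes and the 1% AFR are hypothetical.

```python
import math


def expected_failures(population, assumed_afr, window_weeks, weeks_per_year=52):
    """Expected number of failures observed in the window (assumed AFR is a guess)."""
    return population * assumed_afr * window_weeks / weeks_per_year


def relative_uncertainty(expected_count):
    """Approximate relative standard error of a Poisson count: 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(expected_count)


# Hypothetical comparison: a high-volume product sampled for 4 weeks
# versus a low-volume product sampled for 26 weeks, both at an assumed 1% AFR.
for population, weeks in [(100_000, 4), (1_000, 26)]:
    count = expected_failures(population, 0.01, weeks)
    print(population, weeks, round(count, 1), round(relative_uncertainty(count), 2))
```

Under these assumptions the low-volume product still has a much noisier estimate after six months than the high-volume product has after one, which is why the window has to scale with the size of the sample population.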
Step 3: Define a failure. Before any failures can be counted, a failure must be precisely defined and the evaluation process must be consistent.
Suppose that when a "failed" product is returned to the factory, each technician applies a personal definition of failure. One technician may count only products with major failures, while another may count all products with any failure at all (including major failures). These two extremes make it almost impossible to assess the failure rate of a specific product accurately, let alone the effect of process control on the product. The supplier must therefore have a clear definition of failure before any product is diagnosed. A supplier may also use different failure definitions to calculate the MTBF of specific events. For example, a supplier may evaluate one MTBF for failures that drop the critical load and another for less serious failures that allow the product to keep running.
Step 4: Receive, diagnose, and repair the product samples. Enough time must be allowed between the end of the sample period and the AFR calculation to receive, diagnose, and repair the products reported as failed. The diagnosis determines the failure type, and the repair verifies the diagnosis. Small products are usually sent back to the supplier, which introduces a receipt delay (the shipping time). Once a product reaches the supplier it must be diagnosed and repaired, which introduces a further delay called the diagnosis delay. Large products are usually diagnosed and repaired at the customer's site, so there is essentially no delay. In either case, the products must be diagnosed and repaired before AFR is calculated. For high-volume products, some units will probably still be awaiting diagnosis when the diagnosis delay ends. In these cases, it is sometimes assumed that the units not yet diagnosed fail in the same proportion as those already diagnosed. Depending on the production volume and type of the product being evaluated, the receipt and diagnosis delays can be covered by adding a few weeks after the end of the sample period, at which point AFR can be calculated.
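The timing and the treatment of units still in the queue can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and all numbers are hypothetical, and the extrapolation treats returned units that have not yet been diagnosed as failing in the same proportion as those already diagnosed, which is one reading of the assumption described above.

```python
from datetime import date, timedelta


def afr_calculation_date(sample_end, receipt_weeks, diagnosis_weeks):
    """Earliest date to calculate AFR: sample end plus receipt and diagnosis delays."""
    return sample_end + timedelta(weeks=receipt_weeks + diagnosis_weeks)


def estimate_total_failures(confirmed_failures, diagnosed_units, undiagnosed_units):
    """Extrapolate confirmed failures to returned units that are still undiagnosed."""
    if diagnosed_units <= 0:
        raise ValueError("at least one unit must have been diagnosed")
    confirmed_fraction = confirmed_failures / diagnosed_units
    return confirmed_failures + confirmed_fraction * undiagnosed_units


# Hypothetical case: sample period ends 30 June 2023, two weeks to receive
# returns and one week to diagnose them; 8 of 10 diagnosed returns were
# confirmed as failures and 5 returns are still waiting for diagnosis.
print(afr_calculation_date(date(2023, 6, 30), receipt_weeks=2, diagnosis_weeks=1))
print(estimate_total_failures(confirmed_failures=8, diagnosed_units=10, undiagnosed_units=5))
```

For products repaired at the customer's site both delays would be set to zero.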
Step 5: Calculate the annual failure rate. The annual failure rate is the expected number of failures of a specific product in a calendar year, expressed as a fraction of the installed units.
The first step in calculating this value is to annualize the failure data by multiplying the number of failures in the sample period by the number of sample periods in a year. The second step is to determine the failure rate of the whole sample population by dividing the annualized number of failures by the number of products installed during the sample period. That is: AFR = failures in the sample period x (52 weeks per year / weeks in the sample period) / number of installed products.
This formula rests on two assumptions: (1) the products run continuously, 24 hours a day, 365 days a year, and (2) all products in the sample population start to run at the same time. So although the formula can be applied to any product, it is best suited to products that run continuously.
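The two-step calculation described above can be written as a short function. The function and parameter names are chosen for this sketch only; the arithmetic follows the description of annualizing the failures and then dividing by the installed population, and it inherits the same two assumptions (continuous operation, simultaneous start).

```python
def annual_failure_rate(failures, sample_weeks, installed_units, weeks_per_year=52):
    """AFR for continuously running products.

    Step 1: annualize the failures seen in the sample period.
    Step 2: divide by the number of units installed during the sample period.
    """
    if sample_weeks <= 0 or installed_units <= 0:
        raise ValueError("sample_weeks and installed_units must be positive")
    annualized_failures = failures * (weeks_per_year / sample_weeks)
    return annualized_failures / installed_units


# Hypothetical input: 10 failures in an 8-week sample period, 10,000 installed units.
afr = annual_failure_rate(failures=10, sample_weeks=8, installed_units=10_000)
print(f"AFR = {afr:.4f} ({afr:.2%})")
```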
Consider an intermittently operated product such as a car. The sample population is 10,000 vehicles. Failure data for the sample population is collected over two months (the sample period). On average, a car runs 400 hours a year. During the two-month sample period, 10 vehicles failed.
Formula 1: AFR = 10 failures x (52 weeks per year / 8 weeks in the sample period) / 10,000 units in the sample population = 0.0065, or 0.65%.
Formula 2: assuming that these products all start to run at the same time*, the cumulative operating time of the sample population is 10,000 x 400 hours per year = 4,000,000 hours, or 4,000,000 / 8,760 hours per year = 457 cumulative operating years.
AFR = 10 failures x (52 weeks per year / 8 weeks in the sample period) / 457 cumulative operating years = 0.14, or 14%.
* This assumption simplifies the example. In reality, products are sold throughout the period, so the actual cumulative running time is smaller than the figure above and the resulting AFR is higher.
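The vehicle example can be reproduced with a few lines of code. This is a sketch with names chosen for illustration; it uses the operating-time form of the calculation, where the denominator is the cumulative operating time of the population expressed in years of continuous running. Note that the code does not round the denominator to 457, so the result is about 0.142 and not exactly 0.14.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def afr_by_operating_time(failures, sample_weeks, units, hours_per_unit_per_year,
                          weeks_per_year=52):
    """AFR based on cumulative operating time instead of installed units."""
    annualized_failures = failures * (weeks_per_year / sample_weeks)
    cumulative_operating_years = units * hours_per_unit_per_year / HOURS_PER_YEAR
    return annualized_failures / cumulative_operating_years


# Cars running 400 hours a year, as in the example above.
intermittent = afr_by_operating_time(10, 8, 10_000, 400)

# The same population if every unit ran continuously (8,760 hours a year).
continuous = afr_by_operating_time(10, 8, 10_000, HOURS_PER_YEAR)

print(f"intermittent: {intermittent:.3f}")
print(f"continuous:   {continuous:.4f}")
```

With continuous operation the cumulative operating years equal the number of units, so the operating-time form gives the same value as the population-based form.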
If the preceding example had used a continuously running product, the two AFR values would be equal. Even without the assumption that all products start to run at the same time, the two values would still be very close. Understanding whether a product runs continuously or intermittently is therefore essential to a correct analysis.
Step 6: Convert AFR to MTBF. Converting AFR to MTBF is the easiest of all the steps, but also the most often misunderstood. AFR can be converted to MTBF only when the failure rate is constant.
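The text does not spell out the conversion, so the following is a sketch based on the standard relationship for a constant failure rate: if AFR is the expected number of failures per unit per year of continuous operation, MTBF in hours is the number of hours in a year divided by AFR. The function name is illustrative, and the input value is the 0.65% figure from the vehicle example, used here only to show the arithmetic.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def afr_to_mtbf_hours(afr):
    """Convert AFR (failures per unit per year) to MTBF in hours.

    Valid only under the constant-failure-rate assumption.
    """
    if afr <= 0:
        raise ValueError("AFR must be positive to convert to MTBF")
    return HOURS_PER_YEAR / afr


mtbf_hours = afr_to_mtbf_hours(0.0065)
print(f"MTBF = {mtbf_hours:,.0f} hours ({mtbf_hours / HOURS_PER_YEAR:.0f} years)")
```

An MTBF of well over a hundred years for this input does not mean that any single unit is expected to last that long. It is a statement about the failure rate of the population during the period in which that rate is constant, which is one reason this step is so often misunderstood.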