Accident Analysis Report on the Fukushima Nuclear Power Plant in Japan
Common Problems in Software Engineering Management
Event Review:
A magnitude 9 earthquake struck Japan on March 11, 2011. The epicenter was in the Pacific Ocean east of Miyagi Prefecture, with a focal depth of 20 kilometers. The earthquake triggered a tsunami with waves up to 10 meters high that swept across the coastal area. After the earthquake, several nuclear power plants in Miyagi and Fukushima prefectures shut down automatically. Although nuclear fission had stopped, the reactors still needed several days of cooling before they could be completely shut down. The subsequent tsunami damaged the emergency power supply for the cooling system of the Fukushima nuclear power plant, and the cooling system failed. At around 15:36 on March 12, the Fukushima No. 1 nuclear power plant exploded, injuring four people, and it was feared that the fuel might melt. Officials ordered the emergency evacuation of residents within 10 kilometers, a range later extended to 20 kilometers. Unit 3 exploded at around 11:00 on March 14. Unit 2 exploded at around 06:10 on March 15. Unit 3 exploded again on March 15, and Unit 4 caught fire, causing a large release of radiation. At the time of writing, Japan is still actively handling the nuclear accident.
One could call this a natural disaster. However, I think it is a man-made failure. Many design and implementation defects along the way allowed the accident to escalate. This reminded me of some problems we often encounter when implementing a large software project. I will list them and explain them one by one.
1. Architecture Selection
The Fukushima nuclear power plant where the accident occurred is the world's largest nuclear power plant when its two stations are counted together. It is located in the Fukushima industrial zone in Japan and consists of the Fukushima 1 and Fukushima 2 stations. Together they have 10 units, 6 of them at the first station, with a total output of 9,096 megawatts. Construction of Unit 1 of the Fukushima 1 station started in September 1967, and the unit was connected to the power grid in November 1970. Unit 4 of the Fukushima 2 station was put into operation in 1987.
First, a brief introduction to nuclear power. At present, the reactors used in nuclear power plants are mainly pressurized water reactors and boiling water reactors. A pressurized water reactor has two loops. The water in the first loop is heated directly by the nuclear fuel and then flows into a heat exchanger. After giving up its heat, it flows back to the reactor to cool the fuel, continuously carrying away the heat the fuel produces. The water in the second loop is heated in the heat exchanger by the hot water of the first loop and turns into steam, which drives the turbine generator. The first loop is radioactive, while the second loop is not. A boiling water reactor has only one loop. The water heated by the nuclear fuel turns directly into steam, which drives the turbine generator. By comparison, the pressurized water reactor is safer than the boiling water reactor.
According to reports, the Fukushima nuclear power plant uses an old-fashioned boiling water reactor with only one loop, and its cooling water is drawn directly from the sea. Because there is only one loop, the steam produced by the boiling water drives the turbine directly. That steam carries radioactive material, so the whole turbine is in direct contact with radioactive steam, and any accident that causes a leak in the loop releases it. Turbine structures are complex and contain many pipes, so leaks are hard to avoid and the potential for leakage is high.
It can be seen that the early construction date of the Fukushima nuclear power plant and the choice of the comparatively less safe boiling water reactor planted a hidden danger that led to today's accident.
The Fukushima nuclear power plant can be compared to a large portal website that serves hundreds of millions of page views (PVs) each day. To run without interruption 24 hours a day, seven days a week, such a site needs a well-structured, scalable, safe, and reliable overall architecture. Therefore, when selecting the overall architecture, you must choose a highly secure, reliable, and scalable framework. A wrong choice creates hidden risks for future operation.
2. Infrastructure
There are three major seismic belts in the world: the circum-Pacific seismic belt, the Eurasian seismic belt, and the mid-ocean ridge seismic belt. Japan lies on the circum-Pacific belt, a region where earthquakes are frequent. The Fukushima nuclear power plant is built directly on the Pacific coast, with no continents or islands to shield it, so Pacific typhoons and tsunamis threaten it directly. From a safety standpoint, Fukushima is not an appropriate site for a nuclear power plant.
When building a large website, the location of the data center also needs careful consideration, for example poor interconnection between northern and southern networks or slow access over international lines. At the initial stage of construction, we must identify the unfavorable factors, prepare a solution for each one, and build those solutions in, so that we are not forced into crisis response later.
3. Exception Handling
After the earthquake, the Fukushima nuclear power plant shut down its reactors and stopped the chain reaction, which prevented a direct reactor explosion. Emergency shutdown measures for abnormal reactor conditions had been designed in when the plant was built.
A large website has a complex structure and runs on a large number of servers, so hardware failure is an everyday event. The system also contains a large amount of software written by developers of varying skill levels, so coding errors and runtime exceptions are inevitable. If you do not plan for these issues, the system will be stuck in endless maintenance.
According to the 80/20 rule, the code that implements business functions may account for only 20% of all our code, while the code that detects and handles exceptions so that those functions run reliably may account for 80%. To use the image of an iceberg at sea, the business code is the tip above the waterline, and the fault detection and handling code is the huge body below it.
To reduce system faults and improve overall availability, we need to treat the abnormal state as the normal case rather than as a rare exception. When writing code, developers should first work out how to handle a failure of any hardware, any resource, or even any variable, and only then write the business logic for the normal path.
Dereferencing a NULL pointer in C/C++ is one of the main causes of crashes. Yet a simple if statement can detect a null pointer before it is used and avoid the crash. This simple method improves the safety of the code. Using a null object in Java is likewise a major cause of crashes, and checking with an if statement before use is an equally simple way to prevent such problems.
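The guard-before-use habit can be sketched as follows. This is a minimal illustration in Python, where `None` plays the role of a null pointer or null object; the `USERS` table and the `find_user` and `greeting` functions are hypothetical names invented for this example.

```python
from typing import Optional

# Hypothetical in-memory user table, for illustration only.
USERS = {1: "alice", 2: "bob"}


def find_user(user_id: int) -> Optional[str]:
    """Return the user name, or None when the user does not exist."""
    return USERS.get(user_id)


def greeting(user_id: int) -> str:
    name = find_user(user_id)
    # Handle the abnormal case first: the lookup may return None.
    if name is None:
        return "Hello, guest"
    # Normal business logic runs only after the check has passed.
    return "Hello, " + name.capitalize()
```

Without the `if name is None` check, calling `greeting` with an unknown id would raise an `AttributeError` at `name.capitalize()`, which is the Python counterpart of the crash described above.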
4. Earthquakes
The direct cause of the accident at the Fukushima nuclear power plant was the earthquake beneath the floor of the Pacific Ocean.
A large website, because of its scale, may be distributed across multiple locations in a country or even around the world. Each data center's environment is different, and even with the best planning, serious accidents can still happen, such as the failure of an entire data center or a break in an intercontinental communication cable. We therefore need to plan countermeasures during construction, for example backup sites and Dynamic DNS failover. The main drawbacks of such measures are that they are relatively expensive, take relatively long to carry out, and respond relatively slowly.
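The selection step behind a backup-site failover might look like the sketch below. It only chooses which site should receive traffic; actually updating DNS records is outside its scope. The function name `pick_site` and the idea of passing the health check in as a callable are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def pick_site(sites: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        try:
            if is_healthy(site):
                return site
        except Exception:
            # A health check that itself fails is treated as an unhealthy site.
            continue
    return None
```

In use, `sites` would list the primary site first and the backup sites after it, and `is_healthy` would wrap whatever probe the operator trusts, such as an HTTP request with a short timeout.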
5. Tsunami
After the earthquake, the tsunami it triggered struck the Fukushima nuclear power plant directly and destroyed its emergency power generation system. As a result, the cooling circulation system lost power, and the heat produced by the reactors could not be removed in time. The reactor temperature rose, the pressure in the pressure vessel rose, and a zirconium-water reaction produced a large amount of hydrogen. Hydrogen released through the relief valve may have caused the explosions, and some of the nuclear fuel also melted at the high temperature.
The main cause of the accident at the Fukushima nuclear power plant was the tsunami. Although the plant was built with a 5-meter seawall, the tsunami was as high as 10 meters, far above the seawall's design height. In fact, the waves in this tsunami were not the highest on record. The tsunami that followed the Meiji Sanriku earthquake about one hundred years ago was even higher. For example, it reached 38.2 meters in Iwate Prefecture, 24.4 meters at Yoshihama, and 14.6 meters at Taro, a few hundred kilometers up the coast from Fukushima.
It can be seen that, for various reasons, the plant's tsunami defenses were not given enough consideration when it was built. As a result, the accident could not be handled in time once it began, and it kept escalating.
When building a large website, peak access load is our first concern. A sudden surge in traffic caused by a breaking event, or a DDoS attack launched by hackers, can hit like a tsunami and overwhelm and paralyze the website within a short time. A simple way to guard against this is to limit how often each user can refresh, and to block IP addresses with obviously abnormal behavior at the router or firewall. The most fundamental solution, however, is to let the system adapt dynamically to the load, that is, to scale in real time.
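A per-user refresh interval limit can be sketched in a few lines. The class name `RefreshLimiter`, its parameters, and the injectable `clock` are assumptions for this example, and the sketch keeps all state in one process's memory, which a real multi-server site would need to replace with shared storage.

```python
import time
from typing import Callable, Dict


class RefreshLimiter:
    """Reject requests from a client that arrive less than min_interval seconds apart."""

    def __init__(self, min_interval: float, clock: Callable[[], float] = time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        last = self.last_seen.get(client_id)
        if last is not None and now - last < self.min_interval:
            return False  # too soon since the last accepted request
        # Only accepted requests move the timestamp forward.
        self.last_seen[client_id] = now
        return True
```

A request handler would call `allow(user_id)` first and return an error page when it is `False`. Note that `last_seen` grows with the number of distinct clients, so a long-running service would also need to expire old entries.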
6. Nuclear Radiation
The nuclear reactor is the core of a nuclear power plant. Nuclear fuel releases enormous energy to drive the turbine generators. Powerful as it is, nuclear fuel that leaks can contaminate thousands of square kilometers and do great harm to the surrounding environment. If the leak cannot be dealt with in time, the reactor itself will have to be scrapped.
The core of a large website is the database that stores its data. The availability of the database and the logical correctness of the data are the biggest factors in the availability of the website. If the database becomes unusable because of external factors, or the data becomes inconsistent, the website loses the trust of its users. Database unavailability is easy to detect and can be guarded against with a dual-machine hot standby. Inconsistent data, however, is often hard to find. Frequently it is discovered only when a user notices it and contacts customer service. Because operation logs may have been lost by then, it is difficult to restore the correct data, which results in direct business losses.
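One way to find inconsistent data before users do is a periodic reconciliation job. The sketch below assumes a hypothetical ledger in which each account's stored balance should equal the sum of its logged transactions; the data layout and the name `find_mismatches` are illustrative assumptions, not something described in the original text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_mismatches(
    balances: Dict[str, int],
    transactions: Iterable[Tuple[str, int]],
) -> List[str]:
    """Return accounts whose stored balance differs from the sum of their transactions.

    Amounts are integers (for example cents) to avoid floating-point rounding.
    """
    totals: Dict[str, int] = defaultdict(int)
    for account, amount in transactions:
        totals[account] += amount
    # Check every account that appears in either source.
    accounts = set(balances) | set(totals)
    return sorted(a for a in accounts if balances.get(a, 0) != totals.get(a, 0))
```

Running such a check on a schedule and alerting on a non-empty result turns silent data disorder into something that can be detected, although it only works while the transaction log itself is intact.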
7. Monitoring System and Automatic Control
In a properly run nuclear power plant, temperature monitoring, pressure monitoring, radiation dose detection, and other monitoring systems are deployed everywhere. They are essential to the stable and safe operation of the plant. Once an abnormal condition is detected, the monitoring system responds automatically and immediately, for example by shutting down the reactors and starting the backup systems.
A large website may have thousands of servers. Relying on people to check every day whether they are working normally has two problems: the workload and cost are high, and people overlook things and respond slowly. An automatic monitoring system is therefore required to watch the working condition of all servers and report to the administrators promptly. More importantly, it must work together with an automatic failover system that switches over promptly when a fault occurs, so that services keep running without being affected.
Developing the monitoring system has become a core task in building large-scale systems, and it is central to the stability and reliability of the website architecture. At present, apart from a few first-tier domestic Internet companies that have comprehensive monitoring systems, small and medium-sized companies generally do not meet this requirement. The development cost of the monitoring system may even exceed that of the business system.
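A heartbeat check is one simple form such automatic monitoring can take. In this sketch each server is assumed to report a timestamp periodically; `find_failed`, `check_once`, and the `alert` callback are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List


def find_failed(last_heartbeat: Dict[str, float], now: float, timeout: float) -> List[str]:
    """Return, in sorted order, servers whose last heartbeat is older than timeout seconds."""
    return sorted(name for name, seen in last_heartbeat.items() if now - seen > timeout)


def check_once(
    last_heartbeat: Dict[str, float],
    now: float,
    timeout: float,
    alert: Callable[[str], None],
) -> List[str]:
    """Run one monitoring pass: call alert for every failed server and return the list."""
    failed = find_failed(last_heartbeat, now, timeout)
    for name in failed:
        alert(name)  # for example, notify an administrator or start a failover
    return failed
```

A scheduler would call `check_once` every few seconds. The returned list is what an automatic failover step would consume to take the failed servers out of rotation.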
8. Insufficient Integration Testing
After the tsunami damaged the backup generators at the Fukushima nuclear power plant, emergency batteries were brought online to supply power, but they lasted only 8 hours, and after those 8 hours the backup generators had still not been restored. No effective way to cool the reactor cores was found in time, so the core temperature rose, leading to hydrogen explosions, melting of nuclear fuel, and the release of radiation.
Although some emergency measures are considered during the routine maintenance of a nuclear power plant, there was no preparation for the case in which an earthquake's tsunami destroys the generators.
We can see that when building a large website, we should plan for the most serious events and test for them during development. Before going online, real drills and tests should be carried out for some of these events. Each part may work perfectly during development, yet problems can appear when the parts work together in the running system. Such minor problems are hard to expose in tests of a single part, but once the system is running they may directly affect its stability. It is therefore necessary to give the whole system a reasonable period of online integration testing, to discover problems thoroughly and solve them promptly. This testing must not be skipped because the schedule is tight.
9. Insufficient Regular Drills and Response
From the events above, we can also see that after Unit 1 exploded, Units 3, 2, and 4 exploded one after another. Why could the same problem not be prevented in the other units? On the one hand, the effects of the earthquake and tsunami made the response difficult. On the other hand, the people handling the accident were not well practiced and were unfamiliar with the handling process.
A good company needs well-trained employees. In China today, staff turnover in the IT industry is high. To save money, companies recruit large numbers of recent college graduates, and many companies have a campus-like atmosphere. A new employee has no work experience and does not know the details and procedures of the job. Although experience accumulates over time, this takes a while, and proficiency does not come on the first attempt. Some situations occur only once in a hundred years, so nobody has experience with them, and anyone who really meets one is likely to panic. To cope with this, you need to write a job description for each IT post that clearly describes the responsibilities of the position and the process for handling each kind of event, hold regular assessments and handling drills for employees to strengthen their ability, and reduce the inefficiency caused by unfamiliar work and unfamiliar processes. Remove unsuitable personnel promptly and build a high-quality team.
10. System Reconstruction
Built in the 1960s, the Fukushima nuclear power plant has been in operation for nearly 40 years, against a design life of about 30 years. Because of the power shortage in Japan, and with the approval of the International Atomic Energy Agency, Tokyo Electric Power Company drew up a long-term conservative operation plan that extended the plant's service life by another 20 years. This long service beyond the design life is also one of the reasons the plant could not handle the accident in time.
For a large website, the number of users keeps growing over time and the system load keeps increasing, until it eventually exceeds the capacity the system was originally designed for. After running for a certain period, therefore, the whole system needs to be upgraded, and the most thorough method is reconstruction. Generally, reconstruction should be considered after a system has been running for five years. It is costly and takes a long time. A website like Taobao may take more than one year to develop, so reconstruction should begin one year before the system load reaches saturation. This tests the foresight and long-term planning ability of the company's management. At the same time, although reconstruction can solve some of the system's original problems, a poor reconstruction may introduce new problems that the original system did not have, and sometimes the reconstruction fails completely. A good quality management approach should therefore adopt the prototype development model: refine the design step by step, implement the system, and keep feeding results back into the design. This balances the strengths and weaknesses of the technology while pursuing overall quality. It is not merely a job but an art of strategy.
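The timing rule above can be turned into a rough calculation. The sketch below assumes, purely for illustration, that load grows at a constant compound monthly rate; real traffic rarely behaves so neatly, and the function names and the 12-month default lead time are this example's own choices.

```python
import math


def months_until_saturation(current_load: float, capacity: float, monthly_growth: float) -> float:
    """Months until load reaches capacity, assuming constant compound monthly growth."""
    if current_load <= 0 or capacity <= 0 or monthly_growth <= 0:
        raise ValueError("load, capacity, and growth rate must be positive")
    if current_load >= capacity:
        return 0.0  # already saturated
    # Solve current_load * (1 + g) ** m == capacity for m.
    return math.log(capacity / current_load) / math.log(1.0 + monthly_growth)


def months_until_rebuild_start(
    current_load: float,
    capacity: float,
    monthly_growth: float,
    lead_time_months: float = 12.0,
) -> float:
    """Months left before reconstruction must start; 0.0 means start now."""
    remaining = months_until_saturation(current_load, capacity, monthly_growth)
    return max(0.0, remaining - lead_time_months)
```

For example, a system running at a quarter of its capacity with 5% monthly growth saturates in roughly 28 months under this model, so with a one-year lead time the reconstruction would need to start in a little over 16 months.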