How can we avoid being a "fireman" network engineer?
Author profile:
Zhang Yongfu
Solution architect at Dahe Yunlian and a network veteran with more than 10 years of experience in traditional networks. He has taken part in dozens of network construction projects across industries such as carriers, finance, government, and transportation. He joined Dahe Yunlian in 2016 to work on SDN, and has since participated in SDN product design, network architecture design, O&M automation system design, and solution design. He is committed to deploying SDN in commercial projects and to advancing the SDN industry together with partners who share a passion for advanced technology.
Whether they work on basic network O&M or business-driven operations, network engineers run into all kinds of technical problems and network faults in their daily work. The 36 network O&M plans are intended to help network engineers reduce faults in O&M and nip problems in the bud. The plans fall into the following three categories.
Troubleshooting based on technical knowledge: Engineers learn and master the necessary knowledge to improve their technical skills. They draw lessons from every fault they handle and summarize their experience, continuously sharpening their logical thinking.
O&M automation and O&M process system: Moving from manual to automated O&M reduces O&M costs and maintenance complexity. A well-defined process system, in turn, improves work efficiency and reduces communication costs.
Cross-department collaboration: The network is the intermediate link connecting the various business systems, so network engineers must cooperate with upstream and downstream departments. Handling problems collaboratively achieves twice the result with half the effort.
See the following two representative cases.
Case 1: exercise the "scalability" in network troubleshooting
Network engineers can master technologies by reading books, reading documentation, and experimenting. Troubleshooting, however, requires not only a solid foundation of knowledge but also extensive hands-on practice on live networks, together with the habit of summarizing what each troubleshooting effort has taught you.
Here is a case in which the troubleshooting approach was unclear because of gaps in basic knowledge. I hope you will review and reflect on it. Making mistakes is nothing to fear, but we must summarize the experience and learn the lessons promptly to pave the way for future work.
Lao Gao is a senior O&M engineer on a carrier's backbone network. He has been through countless incidents and handles faults decisively and calmly. In an internal O&M training class, Lao Gao shared a personal experience and stressed to the new O&M staff how important it is to be sharp at analyzing and judging fault symptoms, that is, to be able to work out a clear handling logic from what the fault looks like. If your fundamentals are weak, the problem may get worse, or may never be solved at all.
Reconstructing the fault
When Lao Gao was just starting out (he was still "Xiao Gao" then), he worked as a network engineer in the O&M department of a tier-2 carrier and often had to take night shifts. One night, around midnight, the alarm in the duty hall suddenly went off and alarm log messages began scrolling across the monitoring screen.
Xiao Gao had met this scenario several times while on duty, and it was usually easy to diagnose: a transient interruption on a backbone transmission link, for example, or a fault in some piece of hardware. Transmission interruptions were generally handed straight to the transmission department for troubleshooting. For hardware faults, the on-duty staff usually called the vendor's engineers directly and helped them collect information.
Thanks to his diligence, Xiao Gao had built up a lot of experience in determining the cause of a failure from logs.
Troubleshooting Process
Following the company's fault-handling process, Xiao Gao first checked the monitoring alarm logs and confirmed that the alarming device was a PE (Provider Edge) router in one region. He then logged on to the device to troubleshoot. The device logs showed a large number of BGP sessions flapping frequently.
Next, he checked the router's physical ports, which were all up. The CPU check, however, showed that CPU usage on the board in slot 4 was holding at around 80%, which was clearly abnormal.
At this point Xiao Gao was a little confused. He kept going through the logs hoping to find something else, and discovered a small number of board error messages buried among the large volume of log entries.
When he saw the ipc_send_rpc_blocked field, a light went on. He vaguely remembered handling an IPC alarm fault together with the vendor's engineers. That time, the board's IPC processing channel had been blocked, the board could not work properly, and restarting the board restored service. Judging from that experience, Xiao Gao immediately restarted the board, but the fault persisted after the restart.
Fault mitigation
Even after the board restart failed, Xiao Gao's thinking remained stuck on how to clear the IPC log warning. He still believed that a board problem was causing the BGP flapping, so he contacted the on-site duty staff in the equipment room to swap boards by process of elimination. When the field engineers swapped the boards in slots 4 and 3, the fault stayed in slot 4, which suggested the board itself was not the cause.
The troubleshooting options were narrowing. To restore services as quickly as possible, he fell back on isolation: shut down the physical ports on the affected slot one by one, re-enable each in turn, and observe the fault symptoms throughout.
When the fifth port was shut down, the BGP flapping stopped and the CPU returned to normal. Although the root cause was still unknown, the port that triggered the fault had been found and most services were restored.
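The port-by-port isolation described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `set_port_enabled` and `is_healthy` are hypothetical callbacks standing in for whatever device access and health check an operator actually has, and the simulated device at the bottom is invented for the example.

```python
from typing import Callable, Iterable, Optional


def isolate_faulty_port(
    ports: Iterable[int],
    set_port_enabled: Callable[[int, bool], None],
    is_healthy: Callable[[], bool],
) -> Optional[int]:
    """Shut ports down one at a time and return the first port whose
    shutdown clears the fault, or None if no single port does.

    Ports that are not the trigger are re-enabled before moving on.
    The suspect port is left shut so the other services stay restored.
    """
    for port in ports:
        set_port_enabled(port, False)
        if is_healthy():
            return port
        set_port_enabled(port, True)
    return None


# Simulated device (hypothetical): the fault persists while port 5 is enabled.
enabled = {port: True for port in range(1, 9)}


def set_port_enabled(port: int, state: bool) -> None:
    enabled[port] = state


def is_healthy() -> bool:
    return not enabled[5]


culprit = isolate_faulty_port(range(1, 9), set_port_enabled, is_healthy)
print("suspect port:", culprit)
```

In a real network the health check would need a settling delay before it is trusted, since BGP sessions and CPU load take time to stabilize after a port change.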
Further investigation by Xiao Gao found that this port used VLAN access and served as the customer's gateway, attaching a layer-2 network of several hundred computers, whereas the company required all ports to use layer-3 BGP access or point-to-point static routing. Xiao Gao contacted the customer on the fifth port to ask what was happening. The customer reported that a cutover was in progress and that a layer-2 loop had formed during the operation, flooding the network with ARP broadcast packets.
After the customer's network was restored, Xiao Gao re-enabled the customer's line on the PE router. At that point all services had recovered. Xiao Gao also contacted the Business Planning Department to standardize customer access methods.
Post-event reflection and summary
The next day, Xiao Gao asked other network experts for help and consulted the router's documentation to learn the specific cause of the fault and the troubleshooting methods for similar problems. He also summarized the lessons from the troubleshooting process as follows:
The affected router was an old model from several years earlier, and he was not familiar with how it processes packets. When you lack a thorough grasp of the fundamentals, ask an expert engineer for support.
When handling a fault, do not only read the logs. Also confirm the device configuration and check whether there is any nonstandard access.
When several fault symptoms appear together, analyze the problem as a whole and broaden your thinking.
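The second lesson, checking for nonstandard access, lends itself to a routine audit. A minimal sketch follows, assuming a hypothetical inventory that maps each port to its access type. The port names and the labels `l3_bgp`, `p2p_static`, and `l2_vlan` are invented for illustration; only the policy itself (layer-3 BGP access or point-to-point static routing) comes from the case.

```python
from typing import Dict, List

# Access types permitted by the policy described in the case.
ALLOWED_ACCESS = {"l3_bgp", "p2p_static"}


def find_nonstandard_ports(ports: Dict[str, str]) -> List[str]:
    """Return the sorted names of ports whose access type violates the policy."""
    return sorted(
        name for name, access in ports.items() if access not in ALLOWED_ACCESS
    )


# Hypothetical inventory: port name -> access type.
inventory = {
    "4/1": "l3_bgp",
    "4/5": "l2_vlan",     # customer gateway attached over a VLAN
    "4/7": "p2p_static",
}

violations = find_nonstandard_ports(inventory)
print("nonstandard access on:", violations)
```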
After sharing the case, Lao Gao added: "Walk by the river often enough and your shoes will get wet. Even O&M veterans cannot afford to be careless."
Case 2: Using automated O&M tools to improve work efficiency
Fault handling before automation
I currently work at an SDN software development company. At first, my understanding of SDN was that all O&M could be done visually, without network engineers having to log on to devices and type command lines.
After I joined the company and started working on SDN network construction and O&M, however, I found that reality was far from what I had imagined. Although all service activation was done through the SDN controller, when a fault occurred, O&M engineers still had to find and repair it across the whole network based on their own experience.
When we found a fault in daily O&M work, we could not quickly determine its scope of impact or whether it actually affected customers' business. For example, when a transmission line was interrupted, O&M engineers had to log on to the SDN controller and the network switches to troubleshoot, and to work out how many services were carried on the line, which sensitive services were affected, and whether the fault lay in the transmission or in a network switch.
All of this had to be confirmed manually, so the pressure on duty staff and O&M engineers is easy to imagine. The situation was almost the same as maintaining a traditional network, and the company's O&M capability depended entirely on the skill of its O&M engineers.
Developing an automated O&M platform to improve efficiency
As a young software company that embraces new technologies and SDN, and facing the difficulties its network engineers had run into, the company decided to develop an SDN-based automated O&M platform following the DevOps concept, and set up a virtual working group for it.
The team includes front-line O&M network engineers, system engineers, R&D engineers, and big data analysis engineers. The work runs from system planning and design through front-line requirement collection, development design, coding, and testing, on to system release, deployment, and operation, and then back to planning and design, forming a complete DevOps loop.
Once the project was set up, the team adopted a lean management model of agile development and rapid iteration. Phase I of the automated O&M platform took only two months from kickoff to launch, and eliminated 40% of the manual confirmation work that O&M engineers had previously done. (Figure: architecture design of the automated O&M platform.)
On the O&M platform, the module that helps O&M engineers most is monitoring and alarming. Through calls between associated systems and big data analysis, alarms are automatically merged and filtered, and alarms of different defined levels are sent through different channels. For example, high-priority alarms with service impact trigger a direct phone call to O&M personnel, medium-priority faults are notified through DingTalk, and low-priority faults are not notified at all. They are only stored on the O&M platform for O&M engineers to query online.
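The merge-then-route policy above can be illustrated with a short sketch. It is not the company's implementation: the `Alarm` fields, the channel labels, and the sample alarms are assumptions made for the example, and actual delivery (phone call, DingTalk message) is left out on purpose.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class Priority(Enum):
    HIGH = "high"      # service-affecting
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Alarm:
    device: str
    kind: str
    priority: Priority


# One channel per priority, following the policy described in the text.
CHANNELS = {
    Priority.HIGH: "phone_call",
    Priority.MEDIUM: "dingtalk",
    Priority.LOW: "store_only",
}


def merge_alarms(alarms: Iterable[Alarm]) -> List[Tuple[Alarm, int]]:
    """Collapse identical alarms into one entry with an occurrence count."""
    counts: Dict[Alarm, int] = {}
    for alarm in alarms:
        counts[alarm] = counts.get(alarm, 0) + 1
    return list(counts.items())


def route(alarms: Iterable[Alarm]) -> List[Tuple[str, Alarm, int]]:
    """Return (channel, alarm, count) for each merged alarm."""
    return [(CHANNELS[a.priority], a, n) for a, n in merge_alarms(alarms)]


# Hypothetical alarm stream: three repeats of one flap plus two other alarms.
raw = [Alarm("pe-1", "bgp_flap", Priority.HIGH)] * 3 + [
    Alarm("sw-2", "fan_warning", Priority.LOW),
    Alarm("sw-3", "link_degraded", Priority.MEDIUM),
]
routed = route(raw)
for channel, alarm, count in routed:
    print(channel, alarm.device, alarm.kind, "x", count)
```

Merging before routing is what keeps a flapping device from generating one phone call per log line.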
Since the automated O&M system went live, on-duty staff no longer need to watch the monitoring screens. They only need to keep their mobile phones on to learn, once a fault occurs, its impact scope and severity and which resources must be coordinated to handle it.
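As an illustration of how the impact scope of a line failure might be computed automatically, here is a minimal sketch. It assumes a hypothetical service inventory in which each service lists the links it traverses. The service names, link IDs, and the `sensitive` flag are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Service:
    name: str
    path: List[str]          # IDs of the links the service traverses
    sensitive: bool = False  # marks services that need priority handling


def impact_summary(services: List[Service], failed_link: str) -> Dict[str, object]:
    """Summarize which services ride the failed link and which are sensitive."""
    hit = [s for s in services if failed_link in s.path]
    return {
        "failed_link": failed_link,
        "affected": [s.name for s in hit],
        "sensitive": [s.name for s in hit if s.sensitive],
    }


# Hypothetical inventory of services and the links they use.
services = [
    Service("bank-vpn", ["link-a", "link-b"], sensitive=True),
    Service("office-internet", ["link-b", "link-c"]),
    Service("backup-sync", ["link-d"]),
]

summary = impact_summary(services, "link-b")
print(summary)
```

A summary like this is the sort of payload that could accompany a high-priority alarm, so the engineer sees the affected and sensitive services without logging on to the controller.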
Both O&M engineers and on-duty staff can also raise development requirements based on their own experience and the problems they meet. The R&D engineers design and code them, and they enter the next round of iterative development, testing, and release. If a feature is verified to meet the requirement, the requirement is closed. If not, the feature is optimized further until it meets expectations.
In addition, the O&M Department has developed a fault-handling process based on its past experience and its understanding of the existing O&M system. The process covers both faults that need manual intervention and faults that the software must identify. Each case is used to improve the internal knowledge base and to drive iterations of the platform's fault self-healing module. (Figure: troubleshooting process.)
To date, the company's automated O&M system has reached its third phase and has reduced network O&M engineers' workload by 60%. Work that used to be tedious and repetitive has been handed over to software, so engineers can spend more time on technological innovation and productivity improvement, and everyone can create more value.