This article describes a structured method for diagnosing faults on TCP/IP networks. It serves as an introduction; subsequent articles will discuss some of the key issues raised here.
So what comes to mind when you hear the phrase "TCP/IP network fault diagnosis"? Many people picture a flowchart, or a fixed series of steps. Many others feel confused and do not know where to start.
Diagnosing TCP/IP faults seems simple. After all, TCP/IP is only a four-layer architecture with several protocols at each layer. However, simplicity on the surface does not mean that faults are easy to resolve. Let's take a look.
Traditional Fault Diagnosis Methods
A few years ago, when I first learned how to set up a TCP/IP network, I was taught a simple fault diagnosis procedure. It consisted of the following steps:
1. Type ipconfig to check whether the IP address, subnet mask, and default gateway are correct.
2. Run ping 127.0.0.1 to check whether the local TCP/IP stack is working.
3. Ping the local machine's own IP address to check whether it is correct and valid.
4. Ping the IP address of another computer on the same subnet to check whether it is reachable.
5. Ping the default gateway (that is, the router interface connecting your subnet to the rest of the network) to check whether it is reachable.
6. Try to ping the IP address of a computer on a different subnet.
7. Try to ping the IP address of a computer on the Internet.
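The ping steps of this traditional procedure can be sketched in Python. This is a minimal illustration, not a definitive tool: the function names and the example addresses are assumptions made for this sketch, and the exit code of the system ping command is only a rough signal of reachability.

```python
import platform
import subprocess


def build_ping_sequence(local_ip, neighbor_ip, gateway_ip, remote_ip, internet_ip):
    """Return the traditional ping targets, nearest first, as (label, target) pairs."""
    return [
        ("loopback", "127.0.0.1"),
        ("local address", local_ip),
        ("same-subnet neighbor", neighbor_ip),
        ("default gateway", gateway_ip),
        ("host on another subnet", remote_ip),
        ("host on the Internet", internet_ip),
    ]


def ping_command(target, count=1):
    """Build a ping command line; Windows takes -n for the count, most other systems -c."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), target]


def ping_ok(target):
    """Probe a target with the system ping; True when the command exits with code 0."""
    result = subprocess.run(
        ping_command(target),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def first_failure(sequence, probe=ping_ok):
    """Walk the sequence in order and return the label of the first failing step, or None."""
    for label, target in sequence:
        if not probe(target):
            return label
    return None
```

Calling first_failure(build_ping_sequence(...)) with your own addresses reports the nearest step that fails. The probe is a parameter so that the ordering logic can be exercised without sending any packets.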
I find this method a little rigid, because you can follow the steps without thinking. It is also somewhat inefficient: it assumes that the problem is most likely on your own computer or close to it (your network card, the computer's IP address configuration, the local subnet), and only then considers remote machines. This approach may have served well before the Internet grew rapidly, that is, before DNS became the widely used name resolution system and before firewalls and VPNs became part of most enterprise networks.
Here is what I mean. Suppose one of your users says, "I cannot connect to the server now." Where is the problem? It helps to break this sentence into parts and examine each one.
"I cannot ..."
First, ask whether only one user is reporting the problem. Are other people having similar problems? If so, one thing is clear: you do not need to begin fault diagnosis at the user's computer, as the rigid method above would have you do. Instead, the problem is very likely elsewhere. Your DNS server may be offline, or your DNS provider's service may be down. A router on the internal network may be faulty and dropping packets. Or the server the users are trying to reach may have crashed.
Stop and think about what the affected users have in common. For example, are their machines on the same subnet? If so, the default gateway configured for that subnet may be wrong, or the router may be down. Perhaps a worker cut the cable that runs from the subnet's workgroup switch to the backbone switch. Or a malicious user may have installed a rogue DHCP server on that subnet, one that hijacks address assignment and hands out unroutable addresses to those computers, causing a denial of service.
Of course, if only one user has the problem, start with basic questions such as "Is the computer turned on? Is the network cable firmly plugged into the back of the computer?"
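One way to check whether the affected machines share a subnet is to group their addresses with Python's standard ipaddress module. The sketch below is illustrative only; the function name, the subnet list, and the sample addresses are assumptions. Addresses that fall outside every known subnet are collected separately, since they may point to a rogue DHCP server or a failed address lease.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(client_ips, subnets):
    """Group client addresses under the first known subnet that contains them.

    Addresses that match no known subnet are collected under "unknown".
    """
    networks = [ipaddress.ip_network(s) for s in subnets]
    groups = defaultdict(list)
    for ip in client_ips:
        addr = ipaddress.ip_address(ip)
        home = next((str(net) for net in networks if addr in net), "unknown")
        groups[home].append(ip)
    return dict(groups)
```

If every reported address lands in one group, that subnet's gateway, switch uplink, and DHCP service are natural places to look first.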
"... Connect ..."
Ask the user: "What do you mean by connect?" "Connect" is a highly technical term, and many users do not really know what they mean by it. There are many kinds of connection: MAC-level communication, TCP sessions, password authentication, access permissions and privileges, connections across NAT, firewall traversal, and application-layer sessions. As a network administrator, you need to find out what the user's actual problem is. What were the users doing when they could not "connect" to the server? Were they accessing a share on the server? Did they receive an "Access Denied" message? Did they get a logon window prompting for credentials (such as an account name and password)? Did the server reject the credentials? Are they having trouble finding or using shares published in Active Directory? Did the problem appear on a mapped drive? Were they looking for the server by browsing their network neighborhood? And so on.
Do these users fail only when connecting to one server, or when connecting to any network node? It is important to determine the scope of the fault: is connectivity failing in one way or in several?
"... Server ..."
You have the user, the server, and the network in between. Where does the connection break down? Start with where the server is. Is it on the user's subnet? On an adjacent subnet? In a different department? On a different floor? In a different building? What kind of network connects the user to that server? Wired Ethernet? A wireless LAN? A VPN tunnel across the Internet? A dial-up modem connection? A cable modem or DSL modem? First determine the connection type (or types) between the user and the server, and then consider where a fault could occur. The CSU/DSU may be faulty; try power-cycling it, or contact the provider responsible for monitoring it. Someone may have bumped a power switch while cleaning and taken an Ethernet switch offline. If you use managed switches, check the alerts in your network management software. The office where the remote server is located may have lost power; try phoning to find out.
Is the user unable to connect to just one server, or to several? Are other people also unable to connect to those servers? Do the affected servers have anything in common? (The problem may lie with the user's computer or with the network infrastructure itself.)
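The scoping questions above (one user or many, one server or several) can be turned into a small decision aid. The following Python sketch assumes you have recorded, for each client and server pair you tested, whether the connection worked. The function name and the result labels are hypothetical, and the output is only a hint about where to look first, not a diagnosis.

```python
def classify_scope(results):
    """Suggest where to look first.

    results maps (client, server) pairs to True when the connection worked.
    Returns "none", "network", "server", "client", or "mixed".
    """
    failed = [pair for pair, ok in results.items() if not ok]
    if not failed:
        return "none"
    if len(failed) == len(results):
        # Every tested pair fails: suspect shared infrastructure.
        return "network"

    failing_clients = {client for client, _ in failed}
    failing_servers = {server for _, server in failed}

    def always_fails(index, name):
        # True when every tested pair involving this client or server failed.
        return all(not ok for pair, ok in results.items() if pair[index] == name)

    if len(failing_servers) == 1 and always_fails(1, next(iter(failing_servers))):
        return "server"
    if len(failing_clients) == 1 and always_fails(0, next(iter(failing_clients))):
        return "client"
    return "mixed"
```

For example, if two clients both fail against one server but reach another, the function returns "server", which matches the reasoning that a fault shared by several users is unlikely to sit on any one user's machine.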
"... Now"
Timing is crucial in fault diagnosis. Ask: Did the problem just start? When was the last successful connection to the server? How long has the problem lasted? Is it continuous or intermittent? Intermittent network problems, such as an unreliable WAN link, are among the hardest to solve, especially when they are brief or occur only occasionally.
Timing can also tie the problem to other events that affect the network. Did the problem occur at 10:20 A.M. today? What else was happening on your network at that time? Was the WSUS server deploying patches? Was scheduled maintenance under way on a domain server?
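Correlating the time of a fault with other network events is easy to automate if you keep a change log. The sketch below is a minimal illustration; the function name, the 30-minute default window, and the sample events are assumptions, not part of the method described here.

```python
from datetime import datetime, timedelta


def events_near(problem_time, events, window_minutes=30):
    """Return the names of events that happened within the window around problem_time.

    events is a list of (name, datetime) pairs, for example from a change log.
    """
    window = timedelta(minutes=window_minutes)
    return [name for name, when in events if abs(when - problem_time) <= window]


# Hypothetical change log entries, for illustration only.
problem = datetime(2024, 1, 15, 10, 20)
change_log = [
    ("WSUS patch deployment", datetime(2024, 1, 15, 10, 5)),
    ("scheduled server maintenance", datetime(2024, 1, 14, 22, 0)),
    ("switch firmware upgrade", datetime(2024, 1, 15, 10, 45)),
]
suspects = events_near(problem, change_log)
```

A match only suggests where to look; an event that coincides with the fault is a lead to verify, not proof of the cause.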
Structured Method
My structured fault diagnosis method for TCP/IP networks consists of three key parts:
1. Identify the elements of the problem. That is, consider the following:
Client: the client machine experiencing the problem.
Server: the server, printer, or other network resource (such as the Internet) that the client cannot access.
Network in between: cables (if not wireless), hubs, switches, routers, firewalls, proxy servers, and any other network infrastructure between the client and the server.
Environment: external conditions that may affect your network, such as power fluctuations or building maintenance.
Scope: whether one or many clients and servers are affected.
Timing: continuous, intermittent, or occasional; when it started; and so on.
Connection type: physical layer, network layer, transport layer, or application layer? Authentication or access control? And so on.
Symptoms: error messages, logon dialog boxes, and other indications on the client machine.
2. With those elements in mind, decide which fault diagnosis steps to apply, including:
Verify the physical media for the client, server, and network infrastructure hardware. That is, check cables, make sure network adapters are seated correctly, and look for network connections that show a media-disconnected status.
Verify the TCP/IP configuration of the client, server, and network infrastructure hardware. On the client, this means checking the IP address, subnet mask, default gateway, DNS settings, and so on. For network infrastructure hardware, it means checking routing tables and the Internet gateway on routers.
Verify routing connectivity between the client and the server. That is, use ping, pathping, tracert, or similar tools to verify end-to-end TCP/IP connectivity at the network layer; use packet sniffing to monitor transport-layer sessions; and use nslookup, telnet, and other tools to diagnose application-layer problems such as name resolution and authentication.
3. Understand, ask, and test:
Understanding is key: what the protocols do and what the packets mean. Successful TCP/IP fault diagnosis rests on understanding how TCP/IP works and how the related test tools work. If you have never tried to make sense of a Network Monitor trace, you will have difficulty diagnosing some problems.
Asking the right questions is also critical. Knowing when to proceed step by step and when to leap straight to the heart of the matter is the essence of the art of fault diagnosis. It calls for both imagination and careful reasoning.
Finally, it is critical to test and isolate the problem methodically. For that you need a toolbox of fault diagnosis tools. And nothing helps you solve complex problems better than experience.
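The connectivity checks in part 2 can be approximated in a few lines of Python as a rough stand-in for running nslookup and telnet by hand. This is a sketch under stated assumptions: the function name and the stage labels are made up for illustration, and it covers only name resolution and TCP connection setup, not authentication or application-layer behavior.

```python
import socket


def check_layers(host, port, timeout=3.0):
    """Return the first stage that fails, or None if both stages succeed.

    Stages: "name resolution" (resolving host to an address),
    then "tcp connect" (opening a TCP connection to the port).
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "name resolution"

    # Try only the first resolved address; a fuller tool would try them all.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return "tcp connect"
    return None
```

Separating the two stages reflects the point made earlier about connection types: a user who "cannot connect" may have a name resolution problem rather than a transport problem, and the two call for different follow-up steps.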
Summary
Diagnosing TCP/IP network faults can be frustrating, but it can also be fun. In future articles, we will elaborate on the fault diagnosis steps and tools that help you solve problems on your network.