Like Windows, the Linux operating system has its share of problems and faults. Many new Linux users are afraid of faults and feel helpless when a problem occurs; some even give up on Linux because of it. In fact, there is no need to fear problems: learning is a process of discovering and solving them. Once we master the basic approach to troubleshooting, faults can be resolved, provided, of course, that we also have solid knowledge to back that approach up.
I. How to Handle Linux System Faults
A qualified Linux system administrator must have a clear set of troubleshooting ideas so that, when a problem occurs, it can be located and solved quickly. The general approach is as follows:
Pay attention to the error message: almost every error produces a message, and this message usually points to where the problem lies. If you ignore the error information, the problem will never be resolved.
Refer to the log files: sometimes the error message shows only the surface of the problem. To learn more, view the corresponding log files, which fall into two groups: system log files (under /var/log) and application log files. Combining the two is usually enough to locate the problem.
Analyze and locate the problem: this is the most complicated step. Starting from the error message and the log files, you also need to consider other relevant circumstances to find the cause.
Solve the problem: once the cause has been found, fixing it is usually simple.
This process shows that troubleshooting is mostly a matter of analyzing the problem and searching for its cause. Once the cause is determined, the fault can be fixed accordingly.
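The first two steps above (read the error message, then consult the logs) can be sketched as a small shell helper. This is only a minimal sketch: the function name recent_errors, the keyword list, and the default of 20 lines are assumptions made for illustration, and the log path in the usage comment is simply the usual system log location on Red Hat systems.

```shell
# recent_errors FILE [COUNT]
# Print the last COUNT (default 20) lines of FILE that mention a
# common failure keyword. The name and the keyword list are
# illustrative assumptions, not part of any standard tool.
recent_errors() {
    log_file=$1
    count=${2:-20}
    grep -iE 'error|fail|fatal|panic' "$log_file" | tail -n "$count"
}

# Usage sketch (needs read access to the log file):
#   recent_errors /var/log/messages 10
```

Filtering by keyword only narrows the search; the lines around a match often carry the real cause, so it is still worth opening the log itself.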
II. Forgetting the Linux root Password
This happens quite often, but it is easy to fix in Linux. Restart the system and boot into single-user mode (init 1). Single-user mode does not ask for a login password, so you get a root shell directly and can change the root password there.
The detailed steps follow, using Red Hat Linux as the example:
(1) Restart the system. When the boot process reaches the GRUB menu, find the boot entry for the current system (press an arrow key to reveal the menu if it is hidden). A single-processor system has only one boot entry and a multi-processor system has three or more; the default entry is the one the system currently boots.
(2) Use the arrow keys to move the cursor to the boot entry of the current system, and press "e" to enter edit mode.
(3) Use the up and down keys to select the line that begins with kernel, press "e" again to edit that line, and append a space followed by the word single to the end of the line, similar to the following:
kernel /vmlinuz-2.6.18-8.el5 ro root=LABEL=/ rhgb quiet single
(4) After making the change, press Enter to return to the previous screen.
(5) Press "b" to boot the system.
The system now starts in single-user mode, which is similar to Safe Mode in Windows: only the most basic system is started, and the network and application services are not. Once single-user mode is up, the system drops to a command prompt similar to "sh-3.1#". Run passwd and press Enter; the system prompts for the new root password twice and then reports that the password was changed successfully. The root password is now reset. To bring the system up normally, enter "init 3" to switch to multi-user mode, then log on as root to check that the new password works.
III. Solutions When Linux Fails to Boot
Common causes of a Linux boot failure are:
Improper file system configuration, such as an error in or the loss of /etc/inittab or /etc/fstab, so that the system reports an error and cannot boot.
An improper shutdown that damages the root file system, that is, the Linux root partition, so that the system cannot boot normally.
A Linux kernel crash, so that the system cannot boot.
A boot loader problem, such as GRUB being lost or damaged, so that the system cannot boot.
A hardware fault, for example in the motherboard, power supply, or hard disk, so that Linux cannot boot.
These common faults show that boot failures have two main sources: hardware and the operating system. Hardware problems are solved by replacing the faulty device. Operating system problems vary widely, but in most cases the system can be restored with relatively simple and uniform methods. The following sections give common solutions to the problems listed above, using a Red Hat Linux environment.
1. The /etc/fstab file is lost, causing the system to fail to boot
The /etc/fstab file stores information about the file systems on the system. Linux reads this file at boot and mounts the partitions it lists. If the file is misconfigured or missing, the system cannot boot normally. The failure typically shows up at the point where the partitions are checked and mounted:
Starting system logger
The boot process then stops at this point.
To solve this problem, the first thought is to restore /etc/fstab. Once the file is restored, the system can mount each partition automatically and boot normally. Many readers may think of switching to single-user mode, mounting the partitions by hand, and then rebuilding /etc/fstab from the system information.
However, this method does not work. With fstab missing, Linux cannot mount any partition; even if it manages to switch to single-user mode, the root file system is read-only at that point and nothing can be written to disk.
The alternative is to boot into Linux rescue mode, obtain the partition and mount point information, and reconstruct /etc/fstab.
Take RHEL 5 as an example. Put the first installation disc in the optical drive and set the BIOS to boot from the optical drive. When the boot prompt appears, type linux rescue, as shown in Figure 1:
Figure 1 Booting into linux rescue
The system then boots automatically and displays the screen shown in Figure 2:
Figure 2 select a language
This screen selects the language used in rescue mode. Set it as needed; here we select "English", press Tab to move to "OK", and press Enter to go to the next step.
The keyboard selection screen comes next, as shown in Figure 3. Select the default "us" here.
Figure 3 select a keyboard type
The network configuration page is displayed, as shown in figure 4:
Figure 4 enable Network
Here we choose whether to enable the network. Because the system cannot boot and we are working on it locally, it makes no difference whether the network is enabled, so we choose not to enable it.
Next, as shown in Figure 5, rescue mode offers to mount all partitions of the installed system under /mnt/sysimage. Selecting "Continue" puts the rescue environment in read-write mode, so the partitions can be read and written; selecting "Read-Only" puts it in read-only mode. Because we want to recreate the fstab file under /etc, select "Continue" to enter read-write mode.
Figure 5 Startup Mode of Repair Mode
A notice then appears, as shown in Figure 6. Rescue mode also reads /etc/fstab to find the partitions, so with the fstab file missing it cannot find any partition to mount. Press Enter to go to the next step.
Figure 6 unable to mount any system partition
The rescue environment then drops to a command line, as shown in Figure 7, where the repair can be carried out.
Figure 7 Repair Mode Command Line
The steps above show in detail how to enter Linux rescue mode. In many cases where Linux cannot boot, you can log on this way to repair or change the system.
The detailed process of restoring the/etc/fstab file is as follows:
First, check the system partition information as follows:
sh-3.1# fdisk -l
Disk /dev/sda: 42.9 GB, 42949672960 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          25      200781   83  Linux
/dev/sda2              26        1300    10241437+  83  Linux
/dev/sda3            1301        1682     3068415   83  Linux
/dev/sda4            1683        5221    28427017+   5  Extended
/dev/sda5            1683        1873     1534176   83  Linux
/dev/sda6            1874        2064     1534176   83  Linux
/dev/sda7            2065        2255     1534176   83  Linux
/dev/sda8            2256        2382     1020096   83  Linux
/dev/sda9            2383        2484      819283+  82  Linux swap / Solaris
/dev/sda10           2485        5221    21984921   83  Linux
Because the partitions are not damaged, fdisk shows the complete partition information. It does not show the label of each partition, however, so run the e2label command to view them:
sh-3.1# e2label /dev/sda1
/boot
sh-3.1# e2label /dev/sda2
/usr
sh-3.1# e2label /dev/sda3
/
sh-3.1# e2label /dev/sda5
/var
sh-3.1# e2label /dev/sda6
/tmp
sh-3.1# e2label /dev/sda7
/home
sh-3.1# e2label /dev/sda8
/opt
sh-3.1# e2label /dev/sda10
/webdata
This gives the mount point of every partition, which is enough to construct an fstab file.
Tip: use the fstab file of another system as a reference for the format, and combine it with this system's partition and mount point information to construct your own fstab file.
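Since each label in the e2label output equals the mount point, the ext3 entries can be generated rather than typed by hand. The helper below is a hypothetical sketch: the name fstab_line is invented for illustration, and it assumes every partition is ext3 with the defaults option and a dump flag of 1. The entries for swap and for pseudo file systems such as proc and devpts still have to be added separately.

```shell
# fstab_line MOUNT_POINT [FSCK_PASS]
# Print one ext3 fstab entry for a partition whose label equals its
# mount point. Hypothetical helper; ext3 and "defaults" are assumed.
fstab_line() {
    mount_point=$1
    fsck_pass=${2:-2}    # 1 for the root file system, 2 for the others
    printf 'LABEL=%s\t%s\text3\tdefaults\t1 %s\n' \
        "$mount_point" "$mount_point" "$fsck_pass"
}

# Usage sketch: the root partition is checked first, the rest second.
fstab_line / 1
for mp in /boot /usr /var /tmp /home /opt /webdata; do
    fstab_line "$mp"
done
```

Redirecting this output into a scratch file and reviewing it before copying it into place is safer than writing to the real fstab directly.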
Because fstab lives in /etc on the root partition of the installed system, that partition must be mounted first. The e2label output shows that the root partition is /dev/sda3. Create a mount point under the temporary root file system that rescue mode provides, and mount the original root partition on it:
sh-3.1# pwd
/
sh-3.1# mkdir temp
sh-3.1# mount /dev/sda3 /temp
sh-3.1# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev                    515644         0    515644   0% /dev
/tmp/loop0               79872     79872         0 100% /mnt/runtime
/dev/sda3              2972268    259916   2558932  10% /temp
All files of the original root partition are now available under /temp, so we can create the fstab file we need.
sh-3.1# vi /temp/etc/fstab
LABEL=/                 /                       ext3    defaults        1 1
LABEL=/boot             /boot                   ext3    defaults        1 2
LABEL=/webdata          /webdata                ext3    defaults        1 2
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
LABEL=/home             /home                   ext3    defaults        1 2
LABEL=/opt              /opt                    ext3    defaults        1 2
proc                    /proc                   proc    defaults        0 0
sysfs                   /sys                    sysfs   defaults        0 0
LABEL=/tmp              /tmp                    ext3    defaults        1 2
LABEL=/usr              /usr                    ext3    defaults        1 2
LABEL=/var              /var                    ext3    defaults        1 2
LABEL=SWAP-sda9         swap                    swap    defaults        0 0
After the configuration is complete, save and exit, and then restart the system.
sh-3.1# reboot
2. The root file system is damaged, causing the system to fail to boot
The ext3 file system is widely used on Linux. It is a journaling file system, which gives it basic fault tolerance and recovery. Even so, if power is lost suddenly while an ext3 file system is under heavy read/write load, its internal structures can be left inconsistent, and the file system is damaged.
At boot, Linux automatically checks the system partitions. Simple file system errors are repaired automatically. If the damage is too serious for automatic repair, the system enters single-user mode or shows an interactive prompt asking the user to repair the file system manually. The output looks similar to the following:
Checking root filesystem
/dev/sdb5 contains a file system with errors, check forced
/dev/sdb5:
Unattached inode 68338812
/dev/sdb5: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY
(i.e., without -a or -p options)
FAILED
/ contains a file system with errors check forced
An error occurred during the file system check
*** Dropping you to a shell; the system will reboot
*** when you leave the shell
Press enter for maintenance
(or type Control-D to continue):
give root password for maintenance
This output shows that the root file system has errors that could not be repaired automatically at boot, so the system shows an interactive prompt asking for a manual repair.
This problem is common. It is usually caused by a sudden power loss that leaves the file system structures inconsistent, and it is generally fixed by forcing a repair with the fsck command.
According to the message above, pressing Control-D makes the system reboot, while entering the root password gives a maintenance shell. In that shell, run fsck as follows:
[root@localhost /]# umount /dev/sdb5
[root@localhost /]# fsck.ext3 -y /dev/sdb5
e2fsck 1.39 (29-May-2006)
/ contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 6833812 ref count is 2, should be 1. Fix<y>? yes
Unattached inode 6833812
Connect to /lost+found<y>? yes
Inode 6833812 ref count is 2, should be 1. Fix<y>? yes
Pass 5: Checking group summary information
Block bitmap differences: -(519--529) -9273
Fix<y>? yes
...... ......
/: ***** FILE SYSTEM WAS MODIFIED *****
/: 19/128520 files (15.8% non-contiguous), 46034/514048 blocks
The output above shows fsck repairing the damaged file system. The detailed usage of fsck is described in Chapter 4 of this book. Note that the partition must be unmounted before fsck is run on it. Keep this in mind!
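The rule that a partition must be unmounted before it is repaired can be enforced with a small guard. The sketch below is illustrative only: is_mounted and safe_fsck are hypothetical names, and the check assumes the device appears under the same name in the first column of /proc/mounts (a root device may be listed there under a different name, in which case this check would miss it).

```shell
# is_mounted DEVICE [MOUNTS_FILE]
# Succeed (exit 0) if DEVICE appears in the first column of the mount
# table. MOUNTS_FILE defaults to /proc/mounts; the second argument
# exists only so the check can be tried against a sample file.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# safe_fsck DEVICE
# Refuse to run fsck while DEVICE is still mounted.
safe_fsck() {
    if is_mounted "$1"; then
        echo "$1 is mounted; unmount it first (umount $1)" >&2
        return 1
    fi
    fsck.ext3 -y "$1"
}

# Usage sketch (as root, in the maintenance shell):
#   safe_fsck /dev/sdb5
```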
3. General solutions for other faults
If the boot loader has a problem, boot from the CD into Linux rescue mode, then fix the boot loader configuration or reinstall the boot loader.
If the Linux kernel crashes or is lost, enter Linux rescue mode, mount the root partition, and recompile the kernel.
In the worst case, where the file system is seriously damaged and the kernel has crashed as well, reinstalling the system is easier. First back up useful data and files from the Linux system to another device, then reinstall the whole system.
It is impossible to give a detailed solution for every problem that may arise; problems vary widely and each is handled differently. What this book aims to teach is the general approach and strategy for solving problems on a Linux system. Once these are mastered, they can be applied to almost any Linux problem.
IV. Troubleshooting Common Network Faults in Linux
Linux has powerful network services: Web, DNS, mail, database, FTP, and other servers can all be deployed on it. The same breadth brings many network problems. According to statistics, 60% of Linux faults come from the network and 40% from the system itself, so skill in troubleshooting network faults goes a long way toward mastering Linux.
Network troubleshooting on Linux should start from the lowest network layer of the Linux system and work outward step by step. The general process is as follows:
Check for hardware transmission problems: check whether the network cable, NIC, hub, router, and switch are working, to confirm whether the fault is caused by hardware.
Check whether the NIC works properly: check whether the NIC driver is loaded, whether the NIC's IP address is set correctly, and whether the system routes are set correctly.
Check whether DNS resolution is set correctly: check the Linux DNS client configuration file /etc/resolv.conf and the local hosts file /etc/hosts.
Check whether the service is running normally: use the telnet or netstat command to check whether the service port is open.
Check whether access is permitted: check the local iptables firewall and SELinux, the mandatory access control mechanism of the Linux kernel.
Check whether hosts on the LAN are reachable: ping your own IP address, the IP addresses of other hosts on the LAN, and the gateway address.
Next, we will discuss in detail the general idea of solving network problems described above.
1. Check network hardware transmission problems
When checking a network fault, first check whether the network hardware has a problem: whether the network cable is intact, and whether the network adapter, hub, router, and switch are working. These are the basic conditions for the network to operate normally. If a device has failed, replacing the hardware solves the problem.
2. Check whether the NIC works properly
(1) Check whether the NIC is properly loaded
The lsmod and ifconfig commands show whether the NIC is loaded normally. If ifconfig displays the configuration of the network interfaces (eth0, eth1, and so on), the system has recognized the NIC driver and detected the device, so the NIC is loaded normally.
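As a quick cross-check, the interfaces the kernel has registered can also be read from /proc/net/dev. The sketch below is an illustration rather than a replacement for lsmod or ifconfig; the function name list_ifaces is hypothetical, and it assumes the usual layout of that file (two header lines, then one line per interface starting with the interface name and a colon).

```shell
# list_ifaces [FILE]
# Print the interface names found in FILE, one per line.
# FILE defaults to /proc/net/dev, whose first two lines are headers.
list_ifaces() {
    awk -F: 'NR > 2 { gsub(/[ \t]/, "", $1); print $1 }' \
        "${1:-/proc/net/dev}"
}

# Usage sketch: if eth0 is missing from the list, the driver is
# probably not loaded.
#   list_ifaces | grep -x eth0
```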
(2) Check whether the NIC's IP settings are correct
Next, check the software settings of the NIC: whether an IP address is configured, whether it is configured correctly, and whether it conflicts with the address of another computer on the LAN.
(3) Check whether the system routing table is correct
Finally, check whether the system routing table is set correctly. If a Linux system has two NICs whose IP addresses are not on the same network segment, pay special attention to the routing table settings.
For example, the network interface information of the system is as follows:
[root@webserver ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:12:3F:FF:65:24
          inet addr:10.10.1.239  Bcast:10.10.1.255  Mask:255.255.255.0
          inet6 addr: fe80::212:3fff:feff:6524/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:20632289 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20223702 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:793608426 (756.8 MiB)  TX bytes:2567481473 (2.3 GiB)
          Interrupt:201
eth1      Link encap:Ethernet  HWaddr 00:12:3F:FF:65:25
          inet addr:192.168.200.30  Bcast:192.168.200.255  Mask:255.255.255.0
          inet6 addr: fe80::212:3fff:feff:6525/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:15496910 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8028739 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1048038084 (999.4 MiB)  TX bytes:3195989266 (2.9 GiB)
          Interrupt:209
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:508961 errors:0 dropped:0 overruns:0 frame:0
          TX packets:508961 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:574086961 (547.4 MiB)  TX bytes:574086961 (547.4 MiB)
According to this output, the system has two NICs configured with IP addresses on different network segments. Assume that eth0 provides ssh access to the outside through mapping, while eth1 is used only for data sharing between hosts on the LAN.
The symptom is that remote ssh login to the system fails, even though the NICs are loaded and their IP settings are correct. Next, look at the system's routing table:
[root@webserver ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.10.1.0       *               255.255.255.0   U     0      0        0 eth0
192.168.200.0   *               255.255.255.0   U     0      0        0 eth1
default         192.168.200.1   0.0.0.0         UG    0      0        0 eth1
At this point the problem has basically been located. According to the route output, the default route of the system goes through 192.168.200.1, but the 192.168.200 segment is used only for data sharing between LAN hosts. Replies to outside ssh clients therefore leave through eth1 instead of eth0, so it is no surprise that the outside cannot connect to this Linux system.
Once the problem is located, the fix is simple: delete the default route on the 192.168.200 segment and add a default route on the 10.10.1 segment:
[root@webserver ~]# route delete default
[root@webserver ~]# route add default gw 10.10.1.254
In this case, you can remotely connect to the linux system through the ssh service.
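On a host with several interfaces, it helps to pull the default route out of the routing table before and after such a change. The filter below is a minimal sketch that assumes the column layout of the route output shown above; the function name default_route is hypothetical. It accepts both the symbolic form (default) and the numeric form (0.0.0.0) that route -n prints.

```shell
# default_route
# Read `route` or `route -n` output on standard input and print
# "GATEWAY INTERFACE" for every default route found.
default_route() {
    awk '($1 == "default" || $1 == "0.0.0.0") && $4 ~ /G/ { print $2, $NF }'
}

# Usage sketch:
#   route -n | default_route
```

For the table in this example, the filter would report 192.168.200.1 on eth1 before the fix, which is exactly the mismatch described above.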
3. Check whether the DNS resolution file is set correctly.
On Linux, two files tell the system where to look when resolving domain names: /etc/host.conf and /etc/nsswitch.conf.
The /etc/host.conf file specifies how the system resolves host names. Linux uses the resolver library to obtain the IP address that corresponds to a host name. After Red Hat Linux is installed, the default content of /etc/host.conf is:
order hosts,bind
Here "order" specifies the lookup order for host names: the /etc/hosts file is searched first, and if no match is found, the name servers specified in /etc/resolv.conf are queried.
The /etc/nsswitch.conf file was developed by SUN and manages the lookup order of several configuration databases in the system. Because nsswitch.conf offers more control over these sources, it has basically replaced host.conf. Both files exist by default on Linux, but it is nsswitch.conf that actually takes effect.
Each line in the nsswitch.conf file starts with a keyword, followed by a colon, whitespace, and then a list of lookup methods.
For example:
hosts: files dns
With this setting, the system first queries the hosts file (/etc/hosts). If no match is found, it then queries the DNS servers specified in the DNS configuration file.
Once the principle and process of domain name resolution on Linux are understood, the settings in these two files show the resolution order, which helps narrow down possible resolution problems.
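To see the effective lookup order at a glance, the hosts line can be extracted from nsswitch.conf. This is a minimal sketch under the line format described above (keyword, colon, whitespace, list of methods); the function name hosts_order is hypothetical, and commented-out lines are skipped simply because their first field does not equal "hosts:".

```shell
# hosts_order [FILE]
# Print the lookup methods from the first active "hosts:" line.
# FILE defaults to /etc/nsswitch.conf.
hosts_order() {
    awk '$1 == "hosts:" {
             for (i = 2; i <= NF; i++)
                 printf "%s%s", $i, (i < NF ? " " : "\n")
             exit
         }' "${1:-/etc/nsswitch.conf}"
}

# Usage sketch:
#   hosts_order
```

If files comes first in the result, a stale entry in /etc/hosts will override DNS, which is a common source of confusing resolution results.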
4. Check whether the service is enabled normally.
When an application fails, check the service itself: whether it is running, whether its configuration is correct, and so on. Checking a service takes two steps. The first step is to check whether the service port is open.
For example, suppose we cannot log on to the Linux server 192.168.60.x as root over ssh. First check whether port 22 of the sshd service is open:
[root@localhost init.d]# telnet 192.168.60.x 22
SSH-2.0-OpenSSH_4.3
This output indicates that port 22 on 192.168.60.x is open to the outside, that is, the sshd service is running. If there is no output, the service may not be started or the port may be blocked.
You can also run the netstat command on the server to check whether port 22 is Enabled:
[root@localhost xinetd.d]# netstat -ntl
tcp        0      0 0.0.0.0:3306        0.0.0.0:*           LISTEN
tcp        0      0 :::80               :::*                LISTEN
tcp        0      0 :::22               :::*                LISTEN
As shown, port 22 is open on the server, as are ports 3306 and 80.
Now for the second step. Since the service is running, the problem may lie in the sshd configuration. Check the sshd configuration file /etc/ssh/sshd_config and find the following line:
PermitRootLogin no
This shows that the ssh server configuration forbids root from logging on. To allow root logins, change the line to:
PermitRootLogin yes
We have now checked both the port and the service configuration and found the root cause of the problem. Note that the point here is not how to allow root to log on to a Linux system, but to use this example to show the ideas and methods for solving similar problems.
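The configuration check in the second step can be scripted so that it is easy to repeat on other servers. The helper below is a sketch with a hypothetical name, sshd_option; it only reads the file and prints the value of the first uncommented line for a keyword, matching the keyword without regard to case. If the keyword is absent it prints nothing, in which case sshd falls back to its built-in default.

```shell
# sshd_option KEYWORD [CONFIG_FILE]
# Print the value of the first uncommented KEYWORD line.
# CONFIG_FILE defaults to /etc/ssh/sshd_config.
sshd_option() {
    awk -v key="$1" 'tolower($1) == tolower(key) { print $2; exit }' \
        "${2:-/etc/ssh/sshd_config}"
}

# Usage sketch (needs read access to the file):
#   sshd_option PermitRootLogin
```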
5. Check whether the access permission is enabled.
(1) Check the status of the system firewall iptables
When a service cannot be reached, check whether it is blocked by iptables, the local Linux firewall. The iptables -L command shows the configured policies. For example, suppose we cannot access the www service provided by a Linux server. The network and domain name resolution have been checked and are normal, and the service is running. Checking the server's iptables policy then shows the following:
[root@localhost ~]# iptables -L -n
Chain INPUT (policy DROP)
target     prot opt source               destination
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
Chain OUTPUT (policy DROP)
target     prot opt source               destination
This output shows that the server has only default policies set and, fatally, both the INPUT chain and the OUTPUT chain default to DROP: no outside data can enter the server and no data can leave it. This is equivalent to having no network at all.
To make the www service on this server accessible, add two rules:
[root@localhost ~]# iptables -A INPUT -i eth0 -p tcp --dport 80 -j ACCEPT
[root@localhost ~]# iptables -A OUTPUT -p tcp --sport 80 -m state --state ESTABLISHED -j ACCEPT
In this way, other people on the internet can access our www Service.
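When a rule set is long, the default policy of each chain is easy to overlook. The filter below is a minimal sketch that assumes the "Chain NAME (policy TARGET ...)" header format shown in the listing above; the function name chain_policies is hypothetical.

```shell
# chain_policies
# Read `iptables -L -n` output on standard input and print
# "CHAIN POLICY" for each built-in chain header it finds.
chain_policies() {
    sed -n 's/^Chain \([A-Z]*\) (policy \([A-Z]*\).*$/\1 \2/p'
}

# Usage sketch (as root): list chains whose default policy is DROP.
#   iptables -L -n | chain_policies | grep DROP
```

A chain with a DROP policy and no ACCEPT rules blocks all traffic in that direction, which is the situation described in this example.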
(2) Check whether selinux is enabled
The previous chapter covered the meaning and functions of SELinux, which can secure a Linux system to a great extent. However, SELinux sometimes causes problems for software running on Linux, mostly because its policies are poorly understood. The quickest way to locate such a problem is to disable SELinux first and then test whether the software runs normally. This is not a good long-term practice, because SELinux is a valuable access control mechanism. If you are not yet familiar with SELinux access control policies, we recommend disabling it temporarily and enabling it again once you understand it better.
6. Check whether the LAN host is connected normally.
After the five steps above, problems on the Linux system itself have basically been ruled out. Next, extend the check to the network outside the Linux host to see whether connectivity between networks is at fault. Use the ping command to test connectivity between hosts on the LAN, then ping the gateway to check that the host can communicate with it.
Every network failure has a cause. If we work through the process above step by step, 99% of problems can be resolved.