Linux System Fault Analysis and Troubleshooting Solutions
When dealing with faults in Linux, the symptoms are what you notice first, but identifying the cause is the key to resolving the fault. Being familiar with log management on Linux and understanding how to analyze and solve common faults helps the administrator locate the point of failure quickly and apply the right remedy to resolve system problems promptly.
1. Log Analysis and Management
A log file records the various messages generated while a Linux system is running; it is effectively the "diary" of a Linux host. Different log files record different types of information, such as Linux kernel messages, user logon records, and program errors. Log files are helpful for diagnosing and solving system problems because programs running on Linux usually write system messages and error messages to the corresponding log files, so that once a problem occurs there is a documented record to consult. In addition, when the host is under attack, log files can help find traces left by attackers. The following sections introduce the main logs in Linux and the methods for analyzing and managing them.
1.1 Main log files include the following types:
> Kernel and system logs: this log data is centrally managed by the syslog system service. The settings in the main configuration file "/etc/syslog.conf" determine where kernel messages and the messages of various system programs are recorded. A considerable number of programs hand their logging over to syslog, so the log records of these programs use a similar format.
> User logs: these logs record information about users logging on to and off the Linux system, including user names, logon terminals, logon times, source hosts, and processes in use.
> Program logs: some applications manage their own log files independently (instead of handing them over to the syslog service) to record events that occur while the program runs. Because these programs are only responsible for their own log files, the record formats used by different programs may differ significantly.
By default, the log files of Linux and of most server programs are stored in the "/var/log" directory. Some programs share one log file and some use a log file of their own, while some large server programs write more than one log file and therefore create a subdirectory under "/var/log/" to hold them. This keeps the structure of the log directory clear and makes log files quick to locate. A considerable number of log files can be read only by the root user, which protects the log information.
Eg: list the log files and subdirectories in the "/var/log" directory.
It is worth being familiar with the purposes of the common log files in Linux systems so that you can locate problems more quickly and resolve faults in a timely manner. For example:
> /var/log/messages: records Linux kernel messages and the common log messages of various applications, including startup, I/O errors, network errors, and program faults. For applications or services that do not use an independent log file, relevant event records can usually be found in this file.
> /var/log/cron: records the event messages generated by crond scheduled tasks.
> /var/log/dmesg: records event information from the Linux boot process.
> /var/log/maillog: records email activity entering or leaving the system.
> /var/log/lastlog: records the most recent logon event of each user.
> /var/log/rpmpkgs: records the list of rpm packages installed in the system.
> /var/log/secure: records event information related to user logon authentication.
> /var/log/wtmp: records the logon and logout of each user as well as system startup and shutdown events.
> /var/run/utmp: records the details of each user currently logged on.
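To get a quick overview on a particular host, the list above can be turned into a small check. The following is a minimal sketch only: the helper name check_logs is hypothetical, and the paths are simply those of the text-format and logon logs listed above, so some of them may be absent on other distributions or releases.

```shell
#!/bin/sh
# check_logs: report which of the given log files exist and are readable.
# The function name is illustrative only; it is not a standard tool.
check_logs() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "readable: $f"
        elif [ -e "$f" ]; then
            echo "exists (not readable, try root): $f"
        else
            echo "missing: $f"
        fi
    done
}

check_logs /var/log/messages /var/log/cron /var/log/dmesg \
           /var/log/maillog /var/log/lastlog /var/log/secure \
           /var/log/wtmp
```

Run as an ordinary user, this also shows which logs are restricted to root.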
1.2 Log File Analysis
Now that we are familiar with the main logs in the system, we turn to the methods for analyzing them. The purpose of log file analysis is to find key information in the logs, debug system services, and determine the cause of a fault. This section describes the basic formats and analysis methods of the three types of log files.
Most log files are in text format (such as kernel and system logs and most program logs), and you only need text processing tools such as tail, more, less, and cat to view their content. For log files in binary format (for example, user logs), you need to use the corresponding query commands.
1. Kernel and system logs:
The kernel and system log functions are provided by the sysklogd-1.4.1-39.2 software package, which is installed by default. The package installs two programs, klogd and syslogd, which are controlled through the syslog service and record messages of the system kernel and messages of various applications respectively. The configuration file used by the syslog service is "/etc/syslog.conf".
In general, the kernel and most system messages are recorded in the public log file "/var/log/messages", while the messages of some other programs are recorded in separate files. Log messages can also be written to specific storage devices or sent directly to users.
Eg: view the content of the log configuration file "/etc/syslog.conf".
From the configuration file "/etc/syslog.conf" we can see that the log files managed by syslogd are the most important log files in Linux: they record the most basic system messages, such as kernel, user authentication, email, and scheduled task messages. In the Linux kernel, log messages are classified into priority levels according to their importance (the smaller the number, the higher the priority and the more important the message).
> 0 EMERG (emergency): the host system may be unusable.
> 1 ALERT (alert): action must be taken immediately.
> 2 CRIT (critical): a serious condition.
> 3 ERR (error): an error occurred during operation.
> 4 WARNING (warning): important events that may affect system functions and should be brought to attention.
> 5 NOTICE (notice): events that do not affect normal functions but are worth noting.
> 6 INFO (information): general information.
> 7 DEBUG (debugging): program or system debugging information.
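The "smaller number, more important" rule is easy to get backwards when reading selectors in syslog.conf, where a selector such as "mail.err" matches messages at that priority and at every more important one. The sketch below encodes the table above; the function names priority_name and is_at_least are hypothetical helpers written for this illustration.

```shell
#!/bin/sh
# priority_name: map a numeric syslog priority (0-7) to its keyword.
priority_name() {
    case "$1" in
        0) echo EMERG ;;
        1) echo ALERT ;;
        2) echo CRIT ;;
        3) echo ERR ;;
        4) echo WARNING ;;
        5) echo NOTICE ;;
        6) echo INFO ;;
        7) echo DEBUG ;;
        *) echo "unknown priority: $1" >&2; return 1 ;;
    esac
}

# is_at_least: succeed if message priority $1 is at least as important
# as threshold $2 (a smaller number means a more important message).
is_at_least() {
    [ "$1" -le "$2" ]
}

priority_name 3
is_at_least 2 4 && echo "CRIT passes a WARNING threshold"
```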
Most of the log files managed by the syslog service use the same format. The following uses the public log file "/var/log/messages" as an example to describe the basic format of kernel and system log records.
Eg: view the last two lines of the public log file "/var/log/messages".
Each line in the log file represents one message. Each message consists of four fields in a fixed format.
>: Time stamp: the date and time when the message was generated.
>: Host name: name of the computer on which the message was generated.
>: Subsystem name: name of the application that generated the message.
>: Message: the specific content of the message.
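The four fields can be separated mechanically, which is handy when filtering a large log. The following is a sketch under the assumption that records use the traditional "Mon DD HH:MM:SS host tag: message" layout; the function name split_syslog_line is hypothetical, and the sample record is invented for illustration rather than taken from a real log.

```shell
#!/bin/sh
# split_syslog_line: print the four fields of one syslog-style record.
# Assumes the layout "Mon DD HH:MM:SS host tag: message".
split_syslog_line() {
    printf '%s\n' "$1" | awk '{
        time = $1 " " $2 " " $3
        host = $4
        tag  = $5
        sub(/:$/, "", tag)        # drop the colon that ends the tag
        msg = $0
        # strip the first five whitespace-separated fields from the record
        for (i = 1; i <= 5; i++) sub(/^[ \t]*[^ \t]+[ \t]*/, "", msg)
        print "time: " time
        print "host: " host
        print "subsystem: " tag
        print "message: " msg
    }'
}

# Hypothetical sample record, invented for illustration only.
split_syslog_line 'Sep  1 10:15:32 localhost crond[2345]: (root) CMD (run-parts /etc/cron.hourly)'
```

On a real host the same function could be fed lines from "tail /var/log/messages".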
In some cases, you can configure syslog so that, while recording log information to a file, it also sends the information to a printer. In that way, no matter how a network intruder modifies the logs, the printed traces of the intrusion cannot be erased. The syslog service is a conspicuous and frequently attacked target; if it is damaged, it becomes difficult for administrators to detect an intrusion and its traces. Therefore, pay special attention to monitoring its daemon processes and configuration files.
2. User logs:
Log files such as wtmp, utmp, and lastlog store event messages about users logging on to and off the system. These are binary data files and cannot be viewed directly with text tools such as tail and less; instead, use user query commands such as who, w, users, last, and ac to obtain the log information.
We will not demonstrate them here.
3. Program logs:
In Linux, a considerable number of applications do not use the syslog service to manage their logs; the program maintains its own log records. For example, the httpd web service uses two log files, access_log and error_log, generally stored in the "/var/log/httpd" directory, to record client access events and error events respectively, and the FTP service can record messages about file uploads and downloads in the xferlog file. Because the record formats of different applications vary greatly, no unified format is described here.
Special case: log management policies for servers:
In view of the importance of log data, targeted management policies must be adopted for the various log files generated during system operation to ensure the accuracy, security, and authenticity of log data. Generally, you can consider the following aspects.
>: Log backup and archiving: log files are also important data and must be backed up and archived.
>: Extend the log retention period: if storage space permits, log data should be retained for as long as possible.
>: Control log access permissions: log data may contain various types of sensitive information, such as accounts and passwords. Therefore, you must strictly control the access permissions.
>: Centralized log management: use a centralized log server to manage the log records sent by each server. The advantage is that it facilitates log collection, sorting, and analysis and guards against accidental loss, malicious tampering, or deletion.
For example, server A (IP address 173.17.17.3/24) is used to store log records centrally.
The logs generated by the crond service on client B (173.17.17.11/24) are to be saved to the "/var/log/cron" file on server A.
1. Set up the log server
On log server A, edit the startup parameter file "/etc/sysconfig/syslog" of the syslog service and change the SYSLOGD_OPTIONS variable to "-r -x -m 0". The "-r" option allows syslogd to receive logs sent from other hosts, the "-x" option disables DNS name lookups for messages received from remote hosts, and "-m" sets the interval at which mark timestamps are written to the log (0 disables this function). These options are described in the man page of the syslogd program.
*: Modify the "/etc/sysconfig/syslog" file on log server A, add the centralized management parameter "-r", and restart the syslog service.
vi /etc/sysconfig/syslog // modify the SYSLOGD_OPTIONS line
SYSLOGD_OPTIONS="-r -x -m 0"
service syslog restart
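If the change has to be applied on several servers, the manual edit can be scripted. The sketch below rehearses it on a scratch file rather than on "/etc/sysconfig/syslog" itself. The function name set_syslogd_options is hypothetical, and the two lines of sample file content are an assumption about what a default file looks like.

```shell
#!/bin/sh
# set_syslogd_options: rewrite the SYSLOGD_OPTIONS line of a
# sysconfig-style file. $1 = file, $2 = new option string.
# The option string must not contain "|" or "&" (sed metacharacters here).
set_syslogd_options() {
    tmp=$(mktemp) || return 1
    sed "s|^SYSLOGD_OPTIONS=.*|SYSLOGD_OPTIONS=\"$2\"|" "$1" > "$tmp" &&
        cat "$tmp" > "$1"    # write back in place, keeping owner and mode
    rm -f "$tmp"
}

# Scratch file standing in for /etc/sysconfig/syslog (assumed content).
conf=$(mktemp)
cat > "$conf" <<'EOF'
SYSLOGD_OPTIONS="-m 0"
KLOGD_OPTIONS="-x"
EOF

set_syslogd_options "$conf" "-r -x -m 0"
grep '^SYSLOGD_OPTIONS=' "$conf"
rm -f "$conf"
```

On the real server, the syslog service still has to be restarted afterwards for the new options to take effect.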
2. Set up client B
On client B, modify the "/etc/syslog.conf" configuration file so that the log messages of cron scheduled tasks are written to the "/var/log/cron" file on server A. To specify a remote host as the destination for log messages, use the format "@173.17.17.3".
*: Modify the "/etc/syslog.conf" file on client B, find the configuration line for the cron log, change the log destination to "@173.17.17.3", and restart the syslog service.
vi /etc/syslog.conf // change the cron line as follows
cron.* @173.17.17.3
service syslog restart
3. Verify the centralized log management function
On client B, run the "crontab -e" command, add a scheduled task entry, then save and exit. If you then view the "/var/log/cron" log file on the local machine (client B), no new records are found; the records are sent to server A instead and should appear in its "/var/log/cron" file.
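On server A, the records that arrived from client B can be picked out by the host field. The following sketch assumes that, with DNS lookups disabled, the host field of a forwarded record carries the client's IP address. The function name entries_from_host is hypothetical, and both sample records are invented for illustration.

```shell
#!/bin/sh
# entries_from_host: print the records of a syslog-format file whose
# host field (4th whitespace-separated field) equals the given host.
# $1 = log file, $2 = host name or address.
entries_from_host() {
    awk -v host="$2" '$4 == host' "$1"
}

# Scratch file standing in for /var/log/cron on server A.
# Both records are invented for illustration only.
log=$(mktemp)
cat > "$log" <<'EOF'
Sep  1 10:20:01 localhost crond[2101]: (root) CMD (run-parts /etc/cron.hourly)
Sep  1 10:21:47 173.17.17.11 crontab[3456]: (root) BEGIN EDIT (root)
EOF

entries_from_host "$log" 173.17.17.11
rm -f "$log"
```

On the real server the first argument would be "/var/log/cron".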
2. System startup troubleshooting
The Linux startup process involves the MBR (Master Boot Record), the GRUB boot menu, the system initialization configuration file, and the partition mounting configuration file. A failure in any of these parts may prevent the system from starting normally, so be sure to back up the relevant files. The following are some common system startup faults:
2.1 MBR sector failure
The MBR (Master Boot Record) is located in the first sector (512 bytes) of the physical hard disk; this sector is also known as the master boot sector (MBR sector). Besides part of the system boot program, it contains the partition table of the entire hard disk. When the master boot sector fails, you may be unable to reach the boot menu, or the system cannot be loaded because the correct partition locations cannot be found. A host booted from that hard disk is then likely to hang at a black screen.
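One quick sanity check on an MBR sector is its boot signature: in the standard layout, the last two bytes of the 512-byte sector (offsets 510 and 511) are 0x55 and 0xAA. This detail is not covered in the text above and is added here as background. The sketch below reads those bytes from a scratch image file instead of a real disk, and the function name mbr_signature is hypothetical.

```shell
#!/bin/sh
# mbr_signature: print the two bytes at offsets 510-511 of a disk or
# image file as lowercase hex, for example "55aa".
mbr_signature() {
    dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Demonstration on a scratch image file rather than a real disk.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=1 2>/dev/null
# Write 0x55 0xAA (octal 125 252) at offset 510 without truncating.
printf '\125\252' | dd of="$img" bs=1 seek=510 conv=notrunc 2>/dev/null

if [ "$(mbr_signature "$img")" = "55aa" ]; then
    echo "boot signature present"
else
    echo "boot signature missing"
fi
rm -f "$img"
```

Against a real disk the argument would be the device file, for example "/dev/sda", and the check needs root privileges.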
The following describes how to back up the MBR sector, simulate damage to it, and repair it.
>: Back up MBR sector data
Because the MBR sector contains the partition table of the entire hard disk, the backup of this sector must be stored on another storage device; otherwise, the backup file cannot be read when it is needed for restoration.
Use the dd command to back up the MBR sector of the 1st hard disk (sda) to the sdb1 partition of the 2nd hard disk (mounted on the "/backup" directory).
mkdir /backup
mount /dev/sdb1 /backup
dd if=/dev/sda of=/backup/sda.mbr.bak bs=512 count=1
>: Simulate MBR sector failure
We again use the dd command, this time to overwrite the MBR sector with zeros and thereby simulate an MBR fault.
dd if=/dev/zero of=/dev/sda bs=512 count=1
After this operation, restart the system. The message "Operating system not found" appears, indicating that no bootable operating system can be found, so the host cannot start.
>: Recover MBR sector data from the backup file.
After the MBR sector is damaged, the system cannot be started from that hard disk. You therefore need to boot from an operating system on another hard disk, or directly from the RHEL5 installation disc. Either way, the goal is the same: to obtain a shell environment in which you can run the commands that restore the MBR sector from the backup file.
Take booting from the RHEL5 installation CD as an example. At the "boot:" prompt of the installer, enter "linux rescue" and press Enter to boot the Linux system on the CD in rescue mode. Next, press Enter to accept the default language and keyboard, and choose "No" when asked whether to configure the network interface. The system then automatically looks for Linux partitions on the hard disk and tries to mount them under the "/mnt/sysimage" directory (select "Continue" to confirm).
Pay special attention at the next step: if a warning window is displayed, select "No" to avoid further damage to the hard disk data.
Finally, select "OK" to confirm and enter the Bash shell environment with the "sh-3.1#" prompt. There, run the commands to mount the partition (sdb1) that stores the backup file and restore the data to the hard disk "/dev/sda". Note that the current system environment is the Linux directory structure on the CD.
*: Confirm the partition status of the 1st hard disk (no valid partition table information can be obtained), then restore the MBR sector data.
fdisk -l /dev/sda
mkdir /tmpdir
mount /dev/sdb1 /tmpdir
dd if=/tmpdir/sda.mbr.bak of=/dev/sda bs=512 count=1 // restore backup data
After the recovery is complete, run "reboot" to restart the host (remember to remove the RHEL5 installation disc).
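Because overwriting the MBR of a real disk is destructive, it can be worth rehearsing the whole backup, damage, and restore cycle on an image file first. The sketch below does this in a scratch directory. The function names backup_first_sector and restore_first_sector are hypothetical, and the image file merely stands in for "/dev/sda".

```shell
#!/bin/sh
# backup_first_sector: copy the first 512 bytes of $1 to backup file $2.
backup_first_sector() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# restore_first_sector: write backup file $1 over the first sector of $2.
# conv=notrunc matters when $2 is a regular file: without it dd would
# truncate the image to 512 bytes. It is harmless on a block device.
restore_first_sector() {
    dd if="$1" of="$2" bs=512 count=1 conv=notrunc 2>/dev/null
}

work=$(mktemp -d)
disk="$work/disk.img"           # stand-in for /dev/sda
backup="$work/sda.mbr.bak"

# Build a 4-sector fake disk with random content and keep a reference copy.
dd if=/dev/urandom of="$disk" bs=512 count=4 2>/dev/null
cp "$disk" "$work/original.img"

backup_first_sector "$disk" "$backup"                                 # 1. back up
dd if=/dev/zero of="$disk" bs=512 count=1 conv=notrunc 2>/dev/null    # 2. damage
restore_first_sector "$backup" "$disk"                                # 3. restore

# 4. Confirm that the image is byte-identical to the reference copy.
if cmp -s "$disk" "$work/original.img"; then
    echo "restore verified"
else
    echo "restore mismatch"
fi
rm -rf "$work"
```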
2.2 GRUB boot fault
GRUB is the default boot loader used by most Linux systems; its boot menu lets you choose which operating system to start (if more than one is installed). If the "/boot/grub/grub.conf" configuration file is lost, a key setting in it is wrong, or the boot program in the MBR is damaged, the Linux host may stop at a "grub>" prompt after power-on and be unable to complete startup.
At this prompt, you can enter the appropriate boot commands by hand (following the configuration in the "/boot/grub/grub.conf" file) and then run the "boot" command to boot the Linux system.
Eg: manually enter the boot commands in the "grub>" environment to start Linux.
grub> root (hd0,0)
grub> kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/LogVol00 rhgb quiet
grub> initrd /initrd-2.6.18-8.el5.img
grub> boot
From there on, startup proceeds exactly as in a normal RHEL5 boot. After logging on to the system, locate the configuration file "/boot/grub/grub.conf" and fix the errors, or recreate the file outright. For details, refer to the file of the same name on another working host.
Eg: view the main content of the grub.conf boot menu configuration file.
grep -v "^#" /boot/grub/grub.conf
Description of the main configuration items:
>: title: the name of the operating system displayed in the boot menu.
>: root: specifies the location of the /boot partition that contains boot files such as the kernel.
>: kernel: specifies the location of the kernel file. The "ro" option mounts the root file system read-only while the kernel loads, and "root=" specifies the device file of the root partition.
>: initrd: specifies the location of the temporary system image file used when starting the kernel.
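When a grub.conf file has been rebuilt by hand, a quick check that all four items are present can catch an incomplete menu entry before the next reboot. The following is a sketch only: the function name check_grub_conf is hypothetical, and the sample file is an illustration in which the "default" and "timeout" values and the title text are assumptions rather than content taken from a real host.

```shell
#!/bin/sh
# check_grub_conf: report which of the four main directives are absent
# from a GRUB menu file. Returns non-zero if any of them is missing.
check_grub_conf() {
    missing=0
    for key in title root kernel initrd; do
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]" "$1"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Sample menu file in a scratch location (assumed values, see above).
conf=$(mktemp)
cat > "$conf" <<'EOF'
# sample grub.conf for illustration
default=0
timeout=5
title Red Hat Enterprise Linux Server (2.6.18-8.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/LogVol00 rhgb quiet
        initrd /initrd-2.6.18-8.el5.img
EOF

grep -v "^#" "$conf"      # show the file without comment lines
check_grub_conf "$conf" && echo "all main directives present"
rm -f "$conf"
```

The check only confirms that the directives exist; it does not verify that the kernel and initrd files they name are actually present on the /boot partition.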
The commands used in the "grub>" environment are complex, and the command options and kernel parameters are hard to remember. Therefore, you can instead boot from the RHEL5 installation CD and enter rescue mode. If the partition table is not damaged, rescue mode finds the Linux root partition on the hard disk and mounts it on the "/mnt/sysimage/" directory in the CD directory structure.
After entering the "sh-3.1#" shell environment, run the "chroot /mnt/sysimage" command to switch the directory structure to the Linux system to be repaired, and then create a new grub.conf configuration file.
Eg: confirm the mounting status of the partitions of the Linux system to be repaired and recreate the grub.conf file.
chroot /mnt/sysimage // switch to the root environment of the Linux system to be repaired
mount
..... (part omitted)
vi /boot/grub/grub.conf // rebuild the grub.conf file
exit // exit the chroot environment
exit // exit the sh-3.1 environment; the system restarts automatically
In the preceding example, if you do not run the "chroot /mnt/sysimage" command, the re-created grub.conf configuration file should be located at "/mnt/sysimage/boot/grub/grub.conf".
If the boot program in the MBR sector is damaged, the system may still fail to start after the grub.conf configuration file has been rebuilt. In this case, you can reinstall GRUB from the shell environment of rescue mode.
Eg: enter the root environment of the Linux system to be repaired and reinstall the GRUB boot program to the MBR sector of the first hard disk (sda).
chroot /mnt/sysimage
grub-install /dev/sda
exit
exit
The above method also applies when Linux can no longer be started on a dual-boot host after Windows has been installed or reinstalled. A Windows system installed later overwrites the MBR sector with its own boot data, so the GRUB menu no longer appears after power-on and the Linux system cannot be entered. If Linux is installed after Windows, by contrast, the GRUB program automatically detects the Windows system on the hard disk and adds it to the GRUB menu configuration.
2.3 The /etc/inittab file is missing
The "/etc/inittab" file is the configuration file of the system initialization process init. If the file is deleted by mistake or configured incorrectly, the system may fail to start. When the "/etc/inittab" file is lost, the error message "INIT: No inittab file found" is displayed at startup.
This type of fault can also be fixed in the rescue mode of the RHEL5 installation CD. If the file is configured incorrectly, correct it or restore it from a backup file. Note that if you do not use the chroot command to switch the environment, the file to modify is "/mnt/sysimage/etc/inittab".
If the inittab file is lost and no backup is available, you need to reinstall the initscripts package from the RHEL5 CD.
For example, in the "sh-3.1#" environment of rescue mode, mount the RHEL5 disc and reinstall the initscripts package, using the "--replacepkgs" option of the rpm command to replace the existing files.
chroot /mnt/sysimage
mount /dev/hdc /media/cdrom
rpm -ivh --replacepkgs /media/cdrom/Server/initscripts-8.45.14.EL.i386.rpm
In the rescue mode shell environment, the "cdrom" link file usually does not exist, so the CD is accessed directly through the device file "/dev/hdc". After the installation, restart the system.
2.4 The /etc/fstab file is lost
The "/etc/fstab" configuration file determines how each partition, such as the root partition "/" and the "/boot" partition, is mounted after the Linux system starts. If these partitions cannot be mounted, the system cannot start successfully. When the "/etc/fstab" file is lost, an error message is displayed at startup.
Similarly, use the RHEL5 installation CD to enter the rescue mode shell environment. Because the fstab file is missing, the CD system cannot find the Linux partitions to be repaired. You must therefore locate and mount the root partition manually, recreate the fstab configuration file, and then restart the system.
Eg: in the rescue mode shell environment, scan for logical volume groups and activate the logical volumes to locate the root partition device. Then mount the root partition manually and recreate the fstab configuration file.
lvm vgscan // scan for logical volume groups
lvm vgchange -ay /dev/VolGroup00 // activate the logical volumes that were found
mkdir /tmpdir
mount /dev/VolGroup00/LogVol00 /tmpdir // mount the root partition on the /tmpdir directory
vi /tmpdir/etc/fstab // recreate the fstab configuration file, or copy a backup file directly
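When the fstab file is recreated by hand, a missing column is an easy mistake to make. The sketch below writes a sample file and checks that every entry has six fields. The function name check_fstab is hypothetical, and the sample entries are an assumed typical layout: in particular, the "/boot" label and the LogVol01 swap volume are assumptions that must be adjusted to the actual devices. The last two fields are optional in principle, but requiring all six catches lines damaged during editing.

```shell
#!/bin/sh
# check_fstab: verify that every non-comment, non-blank line of an
# fstab-format file has exactly six fields. Returns non-zero otherwise.
check_fstab() {
    awk '
        /^[ \t]*(#|$)/ { next }      # skip comments and blank lines
        NF != 6 {
            printf "line %d: expected 6 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Sample file in a scratch location (assumed layout, see above).
# Columns: device, mount point, type, options, dump, fsck order.
fstab=$(mktemp)
cat > "$fstab" <<'EOF'
/dev/VolGroup00/LogVol00  /         ext3    defaults        1 1
LABEL=/boot               /boot     ext3    defaults        1 2
tmpfs                     /dev/shm  tmpfs   defaults        0 0
devpts                    /dev/pts  devpts  gid=5,mode=620  0 0
sysfs                     /sys      sysfs   defaults        0 0
proc                      /proc     proc    defaults        0 0
/dev/VolGroup00/LogVol01  swap      swap    defaults        0 0
EOF

check_fstab "$fstab" && echo "fstab layout looks complete"
rm -f "$fstab"
```

In rescue mode the file to check would be "/tmpdir/etc/fstab".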
2.5 Forgotten root password
>: Reset the password of the root account in single-user mode (not described );
>: Reset the password of the root account in emergency mode.
If you boot from the RHEL5 installation CD into the rescue mode shell environment, you only need to switch to the root directory environment of the Linux system to be repaired and run the "passwd root" command to reset the root password. Alternatively, edit the "/etc/shadow" file, clear the password field of the root user, restart, log on to the system normally, and then change the password.
Eg: in rescue mode, switch to the root partition environment of the Linux system to be repaired and change the password of the root account.
chroot /mnt/sysimage
passwd root
....
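For the alternative of clearing the password field in "/etc/shadow", the edit can be previewed on a copy before touching the real file. The following sketch only prints the modified content. The function name clear_password_field is hypothetical, and the sample entries (including the hash text) are placeholders invented for illustration.

```shell
#!/bin/sh
# clear_password_field: print a shadow-format file with the password
# field (2nd colon-separated field) of one user emptied.
# $1 = file, $2 = user name. The input file itself is not modified.
clear_password_field() {
    awk -F: -v OFS=: -v user="$2" '$1 == user { $2 = "" } { print }' "$1"
}

# Scratch copy standing in for /mnt/sysimage/etc/shadow.
# The hash text and the numbers are placeholders, not real data.
shadow=$(mktemp)
cat > "$shadow" <<'EOF'
root:$1$placeholder$hash:14000:0:99999:7:::
bin:*:14000:0:99999:7:::
EOF

clear_password_field "$shadow" root
rm -f "$shadow"
```

An empty password field lets root log on without a password, so set a new password with "passwd root" immediately after the first logon.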