Fault Analysis and Troubleshooting of Linux Systems

Last Update:2017-05-26 Source: Internet

Author: User

Tags syslog system log


When dealing with failures in a Linux system, the symptoms are the first thing you notice, but finding the cause is the key to resolving the problem. Familiarity with Linux log management and with common faults and their solutions helps administrators locate the point of failure quickly and fix system problems in time.

Log analysis and Management

A log file records the messages generated while a Linux system runs; it is the equivalent of a "journal" for a Linux host. Different log files record different types of information, such as Linux kernel messages, user login records, and program errors. Log files help in diagnosing and resolving problems, because programs running on Linux usually write system and error messages to the appropriate log file, so there is a record to consult once a problem occurs. In addition, when a host is attacked, the log files can help reveal traces left by the attacker. The main logs in a Linux system and the methods for analyzing and managing them are introduced below.

1. The main log files fall into the following three types:

> Kernel and system logs: These logs are managed by the syslog system service, which decides where kernel messages and messages from various system programs are recorded according to the settings in its main configuration file, "/etc/syslog.conf". Quite a few programs hand their log messages to syslog, so the records written for these programs share a similar format.

> User logs: These logs record information about users logging in to and out of the Linux system, including the user name, login terminal, login time, source host, and the processes in use.

> Program logs: Some applications manage their own log files independently instead of handing them to the syslog service, and use them to record events that occur while the program runs. Because each program is responsible only for its own log files, the record formats used by different programs can differ significantly.

By default, the log files of the Linux system itself and of most server programs are placed in the "/var/log" directory. Some programs share a log file and some use a single log file of their own, while some large server programs have more than one log file and therefore create a subdirectory under "/var/log/" to hold them. This keeps the log directory clearly structured and makes log files quick to locate. A significant portion of the log files can be read only by the root user, which protects the security of the log information.

Eg: List the log files and subdirectories in the "/var/log" directory.

It is worth becoming familiar with the purpose of the common log files in a Linux system, so that problems can be found faster and faults resolved in time. For example:

> /var/log/messages: Records Linux kernel messages and general log information from various applications, including startup messages, I/O errors, network errors, and program failures. For applications or services that do not use a log file of their own, the related event records can generally be found in this file.
> /var/log/cron: Records event messages generated by crond scheduled tasks.
> /var/log/dmesg: Records event information from the Linux boot process.
> /var/log/maillog: Records e-mail activity entering or leaving the system.
> /var/log/lastlog: Records the most recent login event of each user.
> /var/log/rpmpkgs: Records the list of RPM packages installed on the system.
> /var/log/secure: Records event information from user login authentication.
> /var/log/wtmp: Records each user login and logout, as well as system startup and shutdown events.
> /var/run/utmp: Records details of each user who is currently logged on (unlike the files above, this one normally lives under "/var/run" rather than "/var/log").

2. Log file analysis

With the main log files in mind, we can turn to how to analyze them. The purpose of log analysis is to find key information by browsing the logs, to debug system services, and to determine the cause of a failure. The basic formats of the three types of log files and the ways to analyze them are described here.

Most log files are plain text (kernel and system logs and most program logs, for example), and their content can be viewed with text tools such as tail, more, less, and cat. Some log files are in binary format (user logs, for example) and must be read with the appropriate query commands.
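As a minimal sketch of browsing a text-format log with these tools, the helper below combines grep and tail to show only the most recent lines that mention a keyword. The function name recent_matches and the sample log content are hypothetical, made up for illustration; on a real host the file argument would typically be "/var/log/messages".

```shell
# recent_matches FILE PATTERN [COUNT]
# Print the last COUNT (default 5) lines of FILE that contain PATTERN,
# ignoring case. FILE can be any text-format log file.
recent_matches() {
    grep -i -- "$2" "$1" | tail -n "${3:-5}"
}

# Hypothetical sample log, so the sketch can be tried without root access.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
May 26 10:00:01 host1 kernel: eth0: link up
May 26 10:00:05 host1 sshd[812]: error: bind to port 22 failed
May 26 10:01:10 host1 crond[901]: (root) CMD (run-parts /etc/cron.hourly)
May 26 10:02:30 host1 kernel: I/O error on device sda
EOF

# Show only the most recent line that mentions "error".
recent_matches "$sample_log" error 1
```

For watching a log while reproducing a fault, "tail -f" on the same file is the usual interactive alternative.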

> Kernel and System logs:

The kernel and system logging functions are provided mainly by the sysklogd-1.4.1-39.2 package, which is installed by default. It installs two programs, klogd and syslogd, both controlled by the syslog service, which record messages from the system kernel and messages from various applications, respectively. The configuration file used by the syslog service is "/etc/syslog.conf".

Typically, kernel messages and most system messages are recorded in the common log file "/var/log/messages", while messages from other programs are recorded in separate files. Log messages can also be written to a specific device or sent directly to users.

Eg: View the contents of the log configuration file "/etc/syslog.conf".

As the configuration file "/etc/syslog.conf" shows, the log files managed by the syslogd service are the most important ones in a Linux system; they record the most basic system messages, such as kernel, user authentication, mail, and scheduled task messages. Log messages are divided into priority levels according to their importance (the smaller the number, the higher the priority and the more important the message).

> 0 emerg (Emergency): A condition that makes the host system unusable.
> 1 alert (Alert): A problem that requires immediate action.
> 2 crit (Critical): A serious condition.
> 3 err (Error): An error that occurred at run time.
> 4 warning (Warning): An important event that may affect system functions and that users should be warned about.
> 5 notice (Notice): An event that does not affect normal operation but deserves attention.
> 6 info (Information): General information.
> 7 debug (Debug): Program or system debugging information.
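The "smaller number means more important" rule can be sketched as a small comparison helper. The function names severity_num and is_at_least are hypothetical; they only illustrate the ordering of the eight levels listed above and are not part of any syslog tool.

```shell
# severity_num NAME
# Print the numeric level (0-7) for a syslog severity name.
severity_num() {
    case "$1" in
        emerg)   echo 0 ;;
        alert)   echo 1 ;;
        crit)    echo 2 ;;
        err)     echo 3 ;;
        warning) echo 4 ;;
        notice)  echo 5 ;;
        info)    echo 6 ;;
        debug)   echo 7 ;;
        *)       echo "unknown severity: $1" >&2; return 1 ;;
    esac
}

# is_at_least MSG_LEVEL THRESHOLD
# Succeed if MSG_LEVEL is at least as important as THRESHOLD,
# that is, if its number is smaller than or equal to the threshold's.
is_at_least() {
    msg=$(severity_num "$1") || return 2
    min=$(severity_num "$2") || return 2
    [ "$msg" -le "$min" ]
}

is_at_least crit err && echo "crit passes an err threshold"
is_at_least info err || echo "info does not pass an err threshold"
```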

The log files managed centrally by the syslog service use essentially the same record format. The common log file "/var/log/messages" is used below as an example to explain the basic format of kernel and system log records.

Eg: View the last two lines of the common log file "/var/log/messages".

Each line in the log file is one message, and each message consists of four fields in a fixed format.

> Time stamp: The date and time when the message was issued.
> Host name: The name of the computer that generated the message.
> Subsystem name: The name of the application that issued the message.
> Message: The specific content of the message.
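The four fields can be pulled apart with awk, as in the sketch below. It assumes the traditional layout in which the time stamp occupies the first three whitespace-separated tokens; the function name parse_syslog_line and the sample line are hypothetical.

```shell
# parse_syslog_line LINE
# Split one traditional syslog line into its four fields:
# time stamp, host name, subsystem name, and message.
parse_syslog_line() {
    printf '%s\n' "$1" | awk '{
        timestamp = $1 " " $2 " " $3      # e.g. "May 26 10:00:05"
        host = $4
        subsystem = $5
        sub(/:$/, "", subsystem)          # drop the trailing colon
        message = $0
        # strip the first five tokens, leaving only the message text
        sub(/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ */, "", message)
        print "time=" timestamp
        print "host=" host
        print "subsystem=" subsystem
        print "message=" message
    }'
}

parse_syslog_line 'May 26 10:00:05 host1 sshd[812]: error: bind to port 22 failed'
```

The subsystem field often carries a process ID in square brackets, which the sketch leaves in place.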

In some cases, syslog can be set up to write log records to a file and send them to a printer at the same time, so that no matter how a network intruder modifies the log files, the printed traces of the intrusion cannot be erased. The syslog service is a prominent target for attackers, because destroying it makes it difficult for administrators to detect an intrusion and trace it. Pay special attention to monitoring its daemon and configuration file.

> User Logs

Log files such as wtmp, utmp, and lastlog store event messages about users logging in to and out of the system. These files are binary data files and cannot be browsed directly with text tools such as tail and less; use user query commands such as who, w, users, last, and ac to obtain the information.

These commands are not demonstrated here.

3. Program Log

In a Linux system, a significant portion of applications do not use the syslog service to manage their logs; the program maintains its own log records instead. For example, the httpd web server uses two log files, access_log and error_log, typically stored in the "/var/log/httpd" directory, to record client access events and error events, and an FTP server can record messages about file uploads and downloads in the xferlog file. Because log formats differ widely between applications and no uniform format is strictly followed, they are not described in detail here.

Extension: Log management strategy for servers

Given the importance of log data, the log files generated while the system runs need a targeted management strategy that ensures their accuracy, security, and authenticity. In general, the following aspects can be considered.

> Back up and archive logs: Log files are important data too and need to be backed up and archived.

> Extend the log retention period: Keep log data as long as possible when storage space allows.

> Control log access: Log data may contain sensitive information such as account names and passwords, so access to it must be strictly controlled.

> Manage logs centrally: Use a central log server to manage the log records sent by each server. This makes logs easier to collect, organize, and analyze, and guards against accidental loss, malicious tampering, and deletion.

Eg: Server A (IP address 173.17.17.3/24) is used to store log records centrally.

The log records generated by the crond service on client B (IP address 173.17.17.11/24) are to be saved in the "/var/log/cron" file on Server A.

Set up log Server A

On log Server A, edit "/etc/sysconfig/syslog", the startup parameter file of the syslog service, and change the value of the SYSLOGD_OPTIONS variable to "-r -x -m 0". The "-r" option allows log records sent by other hosts to be received, the "-x" option disables DNS name resolution, and "-m" sets the interval of the periodic mark time stamps (0 disables the feature). These options are described in the man page of the syslogd program.

Eg: Modify the "/etc/sysconfig/syslog" file on log Server A, add the "-r" option needed for centralized management, and restart the syslog service.

vi /etc/sysconfig/syslog    //Modify the SYSLOGD_OPTIONS line
SYSLOGD_OPTIONS="-r -x -m 0"
service syslog restart

Set up client B

On client B, modify the "/etc/syslog.conf" configuration file so that log messages for cron scheduled tasks are written to the "/var/log/cron" file on Server A. The host to send the log to is specified in the form "@173.17.17.3".

Eg: Modify the "/etc/syslog.conf" file on client B, locate the configuration line for the cron log, change the log destination to "@173.17.17.3", and restart the syslog service.

vi /etc/syslog.conf
cron.*    @173.17.17.3
service syslog restart

Verify centralized log management

Run the "crontab -e" command on client B, write a scheduled task, then save and exit. If you then view the "/var/log/cron" log file on client B itself, you will find no new records, because the cron messages are now sent to Server A and should appear in its "/var/log/cron" file instead.
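Before testing with a real scheduled task, it can help to confirm that the client's configuration really contains the forwarding rule. The sketch below checks a syslog.conf-style file for a "facility.*" line whose destination starts with "@". The function name forward_target is hypothetical, and the sample file is a made-up excerpt rather than a complete configuration.

```shell
# forward_target CONF FACILITY
# Print the remote host that "FACILITY.*" is forwarded to in a
# syslog.conf-style file; fail if no such rule exists.
forward_target() {
    awk -v sel="$2.*" '
        $1 == sel && $2 ~ /^@/ { print substr($2, 2); found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Hypothetical excerpt of the /etc/syslog.conf file on client B.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# mail stays local, cron goes to the log server
mail.*    -/var/log/maillog
cron.*    @173.17.17.3
EOF

forward_target "$conf" cron
forward_target "$conf" mail || echo "mail is not forwarded"
```

On a live client, a test message can also be generated with the logger command, for example "logger -p cron.info test", and then looked for on Server A.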

4. System startup troubleshooting

The Linux startup process involves the MBR (master boot record), the GRUB boot menu, the system initialization configuration file, the partition mount configuration file, and so on. A failure in any one of these links can prevent the system from starting normally, so be sure to back up the related files. Some typical startup failures follow:

> MBR sector failure

The MBR is located in the first sector (512 bytes) of the physical hard disk, also known as the master boot sector (MBR sector). In addition to part of the system boot program, it contains the partition table of the entire hard disk. When the master boot sector fails, the boot menu may not appear, or the system cannot be loaded because the correct partition locations cannot be found; booting the host from the hard disk is then likely to end at a black screen.

The process of backing up, destroying, and repairing the MBR sector is described below.

Backing up MBR sector data

Because the MBR sector contains the partition table of the entire hard disk, the backup of that sector must be stored on another storage device; otherwise the backup file cannot be read at recovery time.

Use the dd command to back up the MBR sector of the first hard disk (sda) to the sdb1 partition of the second hard disk (mounted on the /backup directory).

mkdir /backup
mount /dev/sdb1 /backup
dd if=/dev/sda of=/backup/sda.mbr.bak bs=512 count=1

Simulating an MBR sector failure

To simulate an MBR failure, use the dd command again to deliberately overwrite the contents of the MBR sector.

dd if=/dev/zero of=/dev/sda bs=512 count=1

When you restart the system after doing this, the message "Operating System not found" appears, indicating that no bootable operating system could be found, so the host cannot start.

Recovering MBR sector data from the backup file

Since the MBR sector has been destroyed, the system can no longer boot from this hard disk. It must be booted from an operating system on another hard disk or directly from the RHEL5 installation disc. Either way, the goal is the same: to obtain a shell environment in which commands can be run to restore the MBR sector from the backup file.

Taking the RHEL5 installation disc as an example: when the "boot:" prompt of the installation wizard appears, type "linux rescue" and press Enter to boot the Linux system on the disc in rescue mode. Then press Enter to accept the default language and keyboard settings, and choose "No" when prompted to configure the network card. The system then automatically looks for a Linux partition on the hard disk and tries to mount it on the "/mnt/sysimage" directory (select "Continue" to confirm and continue). Pay special attention when a warning window about initializing the disk appears:

Be sure to select "No" to avoid further damage to the hard disk data.

Finally, select "OK" to enter the Bash shell environment with the "sh-3.1#" prompt. From there, run the appropriate commands to mount the partition that holds the backup file (sdb1) and restore the data to the hard disk "/dev/sda". Note that the system environment in use at this point is the Linux directory structure on the disc.

Eg: Check the partitions of the first hard disk (no valid partition table information will be found), then restore the MBR sector data.

fdisk -l /dev/sda

mkdir /tmpdir
mount /dev/sdb1 /tmpdir
dd if=/tmpdir/sda.mbr.bak of=/dev/sda bs=512 count=1    //Restore the backup data

After the restore is complete, run "reboot" to restart the host (remember to remove the RHEL5 installation disc).
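The backup, damage, and restore cycle described in this section can be rehearsed safely on a scratch image file before touching a real disk. The sketch below uses a small file in a temporary directory as a stand-in for "/dev/sda"; all file names in it are hypothetical. The conv=notrunc option is added because the target here is a regular file, which dd would otherwise truncate to one sector.

```shell
# Rehearse MBR backup, damage, and restore on a scratch image file,
# so that no real disk is touched.
workdir=$(mktemp -d)
disk="$workdir/disk.img"        # stands in for /dev/sda
backup="$workdir/sda.mbr.bak"   # stands in for /backup/sda.mbr.bak

# Build a 4-sector fake disk filled with non-zero bytes.
dd if=/dev/zero bs=512 count=4 2>/dev/null | tr '\0' 'M' > "$disk"
cp "$disk" "$workdir/original.img"

# 1. Back up the first sector (512 bytes).
dd if="$disk" of="$backup" bs=512 count=1 2>/dev/null

# 2. Simulate the failure: overwrite only the first sector with zeros.
dd if=/dev/zero of="$disk" bs=512 count=1 conv=notrunc 2>/dev/null
cmp -s "$disk" "$workdir/original.img" || echo "first sector damaged"

# 3. Restore the first sector from the backup.
dd if="$backup" of="$disk" bs=512 count=1 conv=notrunc 2>/dev/null

# The restored image should be byte-for-byte identical to the original.
cmp -s "$disk" "$workdir/original.img" && echo "MBR restored"
```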

5. Grub Boot Failure

GRUB is the boot loader that most Linux systems use by default; its boot menu lets you choose which operating system to start (if there is more than one). When the "/boot/grub/grub.conf" configuration file is missing or contains a critical configuration error, or when the boot program in the MBR is damaged, the Linux host may stop at a "grub>" prompt after power-on and be unable to continue the boot process.

At this prompt, you can enter the corresponding boot commands by hand (refer to the configuration in the "/boot/grub/grub.conf" file) and then run the "boot" command to start the Linux system.

Eg: Start the Linux system by manually entering the boot commands in the "grub>" environment.

grub> root (hd0,0)
grub> kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/LogVol00 rhgb quiet
grub> initrd /initrd-2.6.18-8.el5.img
grub> boot

The rest of the startup is identical to a normal start of the RHEL5 system. After logging in, find the configuration file "/boot/grub/grub.conf" and fix the error, or rebuild the file. For the content, refer to the file of the same name on another, working host.

Eg: View the main contents of the grub.conf boot menu configuration file.

grep -v "^#" /boot/grub/grub.conf

The main configuration items have the following meanings:

> title: Specifies the name of the operating system as displayed in the boot menu.

> root: Specifies the location of the /boot partition, which contains boot files such as the kernel.

> kernel: Specifies the location of the kernel file. "ro" loads the kernel with the root file system read-only, and "root=" specifies the device file of the root partition.

> initrd: Specifies the location of the temporary system image file used while booting the kernel.

Because the commands used in the "grub>" environment are fairly complex, the command options and kernel load parameters are generally hard to remember. Users can therefore use another repair method: boot into rescue mode from the RHEL5 installation disc. If the partition table is intact, rescue mode finds the Linux root partition on the hard disk and mounts it on the "/mnt/sysimage/" directory of the disc's directory structure.

After entering the "sh-3.1#" shell environment, run the "chroot /mnt/sysimage" command to switch the directory structure to the Linux system to be repaired, and then re-create the grub.conf configuration file.

Eg: Check how the partitions of the Linux system to be repaired are mounted and rebuild the grub.conf file.

chroot /mnt/sysimage    //Switch to the root environment of the Linux system to be repaired
mount
..... Omit part of the content
vi /boot/grub/grub.conf    //Rebuild the grub.conf file (content not shown)
exit    //Leave the chroot environment
exit    //Leave the sh-3.1 environment; the system restarts automatically

In the example above, if the "chroot /mnt/sysimage" command is not executed, the rebuilt grub.conf configuration file must be placed at "/mnt/sysimage/boot/grub/grub.conf".

If the boot program in the MBR sector is damaged, the system may still fail to start after grub.conf has been rebuilt. In that case, reinstall GRUB from the rescue mode shell environment.

Eg: Enter the root environment of the Linux system to be repaired and reinstall the GRUB boot program into the MBR sector of the first hard disk (sda).

chroot /mnt/sysimage
grub-install /dev/sda
exit
exit

The same approach applies when Linux fails to boot after Windows has been installed on a host that already runs Linux (without overwriting the Linux system). On a dual-boot host, a Windows system installed later overwrites the MBR sector with its own boot program, so the GRUB menu no longer appears at power-on and Linux cannot be started. If Linux is installed after Windows, GRUB automatically detects the Windows system on the hard disk and adds it to the GRUB menu configuration.

> /etc/inittab file missing

The "/etc/inittab" file is the configuration file of init, the system initialization process. If the file is deleted by mistake or misconfigured, the system may fail to boot. When the "/etc/inittab" file is lost, the error message "INIT: No inittab file found" appears at startup.

This type of failure can also be repaired in the rescue mode of the RHEL5 installation disc. If the file is misconfigured, correct it or restore it from a backup. If you do not use the chroot command to switch environments, the file to modify is "/mnt/sysimage/etc/inittab".

If the inittab file is missing and no backup is available, reinstall the initscripts package from the RHEL5 disc.

Eg: In the "sh-3.1#" environment of rescue mode, mount the RHEL5 disc and reinstall the initscripts package, using the "--replacepkgs" option of the rpm command to replace the files of the already installed package.

chroot /mnt/sysimage
mount /dev/hdc /media/cdrom
rpm -ivh --replacepkgs /media/cdrom/Server/initscripts-8.45.14.el.i386.rpm

The cdrom link file is usually not available in the rescue mode shell environment, so the disc is accessed directly through its device file "/dev/hdc". Restart the system after the installation.

> /etc/fstab file missing

The "/etc/fstab" configuration file determines how the Linux system mounts its partitions at boot, such as the root partition "/" and the "/boot" partition. If these partitions cannot be mounted, the system cannot start. After the "/etc/fstab" file is lost, an error message appears at startup.

Again, boot into the rescue mode shell environment from the RHEL5 installation disc. Because the fstab file is missing, the system on the disc cannot find the Linux partition to be repaired, so you must locate and mount the root partition manually, rebuild the fstab configuration file, and then restart the system.

Eg: In the rescue mode shell environment, scan for logical volume groups and activate them to locate the root partition device, then mount the root partition manually and rebuild the fstab configuration file.

lvm vgscan    //Scan for logical volume groups
lvm vgchange -ay /dev/VolGroup00    //Activate the volume group that was found
mkdir /tmpdir
mount /dev/VolGroup00/LogVol00 /tmpdir    //Mount the root partition on the /tmpdir directory
vi /tmpdir/etc/fstab    //Rebuild the fstab file, or copy a backup into place
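A rebuilt fstab is easy to get subtly wrong, so a quick structural check before rebooting may be worthwhile. The sketch below assumes the conventional six-field layout (device, mount point, file system type, options, dump, pass) and verifies that an entry for "/" exists. The function name check_fstab is hypothetical, and the sample entries are assumptions modeled on the volume names used above, not the content of any particular system.

```shell
# check_fstab FILE
# Succeed only if every entry has six fields and one entry mounts "/".
check_fstab() {
    awk '
        NF == 0    { next }                 # skip blank lines
        $1 ~ /^#/  { next }                 # skip comment lines
        NF != 6    { printf "line %d: expected 6 fields, got %d\n", NR, NF; bad = 1 }
        $2 == "/"  { root = 1 }
        END {
            if (!root) { print "no entry for /"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}

# Hypothetical rebuilt fstab for illustration only.
fstab=$(mktemp)
cat > "$fstab" <<'EOF'
# device                  mountpoint  type  options   dump pass
/dev/VolGroup00/LogVol00  /           ext3  defaults  1 1
LABEL=/boot               /boot       ext3  defaults  1 2
/dev/VolGroup00/LogVol01  swap        swap  defaults  0 0
EOF

check_fstab "$fstab" && echo "fstab looks structurally sound"
```

In the rescue environment described above, the file to check would be "/tmpdir/etc/fstab". The check covers structure only; it does not confirm that the listed devices exist.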

6. Forgotten root user's password

> Reset the root account password via single-user mode (not described here).

> Reset the root account password via rescue mode.

If you boot into the rescue mode shell environment from the RHEL5 disc, simply switch to the root environment of the Linux system to be repaired and run the "passwd root" command to reset the root user's password. Alternatively, edit the "/etc/shadow" file and empty the password field of the root user, restart, and set a new password after entering the system normally.

Eg: In rescue mode, switch to the root environment of the Linux system to be repaired and change the password of the root account.

chroot /mnt/sysimage
passwd root
....

Reposted from http://os.51cto.com/art/201405/438510.htm


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
