In a real operations and maintenance (O&M) environment, the operating system needs maintenance too. An application system is a whole: it includes not only the application running on the application server and the database running on the database server, but also the operating system, network, storage, and even the hardware. Only by monitoring the application system as a whole can we keep it running as stably as possible.
In most cases, the operating system in our environment runs continuously without causing major problems. But once a machine or server hangs, the results can be catastrophic, up to and including data loss. It is therefore better to nip problems in the bud. Regularly checking the system's running status, disk space, CPU usage, and the various logs helps us find and fix operating system problems as early as possible.
This article walks through a simple fix for a bug in a Linux service process.
1. Problem Introduction
We took over a newly accepted system. Both the application server and the database server run Red Hat Enterprise Linux 6. The system architecture is relatively simple, and there had been no serious faults after one year of operation.
[root@TESTDB ~]# uname -r
2.6.32-131.0.15.el6.x86_64
[root@TESTDB ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.1 (Santiago)
[root@TESTDB ~]# uptime
 11:28:14 up 66 days, 1 user, load average: 0.50, 0.44, 0.37    # routine shutdown for maintenance
On Linux, most logs live under the /var/log directory. Checking /var/log/messages is our most direct way to inspect the logs.
[root@TESTDB ~]# tail -n 10 /var/log/messages
Mar 26 08:31:42 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:32:12 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:32:42 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:33:12 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:33:42 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:34:12 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:34:42 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:35:12 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:35:42 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:36:12 TESTDB cachefilesd[1591]: Scan complete
The log volume is large. Judging from the weekly automatic archives, it has been large for a long time.
[root@TESTDB ~]# cd /var/log/
[root@TESTDB log]# ls -l | grep message
-rw-------. 1 root 549637 Mar 26 08:55 messages
-rw-------. 1 root 1193545 Mar 2 messages-20140302
-rw-------. 1 root 1191893 Mar 9 messages-20140309
-rw-------. 1 root 1194902 Mar 16 messages-20140316
-rw-------. 1 root 1195079 Mar 23 messages-20140323
According to the log, the cachefilesd service process writes a record every 30 seconds. Apart from the flood of redundant entries, the log shows no other problems.
The messages log is meant to carry both neutral notifications and error information. If routine messages arrive too frequently, they easily drown out the errors, so we want to fix this.
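To see how much of the log a single process accounts for, the entries can be tallied per program. The following is a minimal sketch, not part of the original procedure. The function name count_by_program is made up for illustration, and it assumes the traditional syslog layout shown above, where the program tag is the fifth whitespace-separated field.

```shell
#!/bin/sh
# count_by_program FILE
# Hypothetical helper: prints "count program" pairs, most frequent first.
# Assumes lines look like "Mon DD HH:MM:SS host program[pid]: text".
count_by_program() {
    awk '{
        tag = $5
        sub(/\[.*$/, "", tag)   # strip a "[pid]:" suffix if present
        sub(/:$/, "", tag)      # strip the trailing colon of tags without a pid
        counts[tag]++
    }
    END { for (t in counts) print counts[t], t }' "$1" | sort -k1,1nr -k2,2
}

# Example (run as root on the server):
#   count_by_program /var/log/messages
```

If one program dominates the output, as cachefilesd does here, it is a good candidate for the kind of investigation described next.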
2. Fault Analysis
Faults fall into categories. At one extreme are emergencies: for example, the operating system crashes or hangs without responding, which directly affects business operations and can even cause data loss. At the other extreme are "minor faults" that will not cause major problems in the short term. Serious and urgent faults test the knowledge, experience, and composure of O&M personnel, while minor faults test their professionalism and diligence.
I had no good ideas about this issue, so I turned to the vendor's knowledge base. In the customer subscription area of the Red Hat official website, I found the article "Why server is flodded with 'cachefilesd Scan complete' messages?", which describes the same problem.
The cachefilesd process manages the local file and directory cache for network file systems. For example, network file systems such as AFS and NFS keep their cache objects on the local system. The problem is caused by a bug in the cachefilesd service itself: a log level is set incorrectly inside the daemon. As a result, every time cachefilesd completes a scan, it writes an entry to the /var/log/messages log file.
Red Hat has registered this issue as a bug, with ID 680127. cachefilesd runs as a background service of the operating system. When '/var/cache/fscache/cache' is empty, the "Scan complete" message is written to the log automatically.
At this rate, two entries are written every minute, which matches what we see on our system.
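The write interval can be confirmed directly from the timestamps. The sketch below is an illustration only: scan_intervals is a hypothetical helper name, and it assumes the HH:MM:SS time is the third field of each line, as in the log excerpt above.

```shell
#!/bin/sh
# scan_intervals FILE
# Hypothetical helper: prints the gap in seconds between each pair of
# consecutive cachefilesd "Scan complete" entries.
scan_intervals() {
    awk '/cachefilesd/ && /Scan complete/ {
        split($3, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) {
            gap = now - prev
            if (gap < 0) gap += 86400   # the pair straddles midnight
            print gap
        }
        prev = now
        seen = 1
    }' "$1"
}

# Example (run as root on the server):
#   scan_intervals /var/log/messages | sort -n | uniq -c
```

A steady gap of 30 seconds corresponds to the two entries per minute described in the Red Hat article.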
The affected release is Red Hat Enterprise Linux 6, with cachefilesd package version 0.10.1-2. Check the version installed on the current system.
[root@TESTDB ~]# rpm -qa | grep cachefilesd
cachefilesd-0.10.1-2.el6.x86_64
The solution is to upgrade cachefilesd to the latest version, which no longer has this problem.
3. Problem Solving
With the problem located, the fix is to upgrade the cachefilesd package. Search the official website for the corresponding rpm package and download the latest version, 0.10.2-1. Then install it with rpm.
[root@TESTDB ~]# cd /
[root@TESTDB /]# mkdir updates
[root@TESTDB /]# cd updates
[root@TESTDB updates]# ls -l
total 36
-rw-r--r--. 1 root 35332 Mar 26 cachefilesd-0.10.2-1.el6.x86_64.rpm
The -Uvh options make rpm check the currently installed version. If the package is not installed yet, rpm installs it directly; otherwise it upgrades the existing package. The -v and -h options only add verbose output and the hash-mark progress bar.
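Before running the upgrade, it can be useful to confirm that the downloaded package really is newer than the installed one. The sketch below only approximates that check: is_upgrade is a hypothetical helper, and it relies on the version ordering of GNU sort -V, which is an assumption and is not identical to rpm's own version comparison.

```shell
#!/bin/sh
# is_upgrade INSTALLED CANDIDATE
# Hypothetical helper: succeeds when CANDIDATE sorts after INSTALLED.
# Uses GNU "sort -V" ordering as a stand-in for rpm's comparison (assumption).
is_upgrade() {
    installed=$1
    candidate=$2
    [ "$installed" != "$candidate" ] &&
        [ "$(printf '%s\n%s\n' "$installed" "$candidate" | sort -V | tail -n 1)" = "$candidate" ]
}

# Example with the versions from this article:
#   is_upgrade 0.10.1-2 0.10.2-1 && echo "candidate is newer"
```

rpm -U performs its own comparison in any case, so this is only a quick sanity check before touching the system.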
[root@TESTDB updates]# rpm -Uvh cachefilesd-0.10.2-1.el6.x86_64.rpm
warning: cachefilesd-0.10.2-1.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID fd431d51: NOKEY
Preparing...                ########################################### [100%]
   1:cachefilesd            ########################################### [100%]
Finally, check the result. The log shows the cachefilesd service being stopped and restarted. After the restart, no new "Scan complete" entries are generated.
Mar 26 08:55:12 TESTDB cachefilesd[1591]: Scan complete
Mar 26 08:55:21 TESTDB cachefilesd[1591]: Daemon Terminated
Mar 26 08:55:21 TESTDB kernel: CacheFiles: File cache on sda3 unregistering
Mar 26 08:55:21 TESTDB kernel: FS-Cache: Withdrawing cache "mycache"
Mar 26 08:55:21 TESTDB cachefilesd[10518]: About to bind cache
Mar 26 08:55:21 TESTDB cachefilesd[10518]: Bound cache
Mar 26 08:55:21 TESTDB kernel: FS-Cache: Cache "mycache" added (type cachefiles)
Mar 26 08:55:21 TESTDB kernel: CacheFiles: File cache on sda3 registered
Mar 26 08:55:21 TESTDB cachefilesd[10519]: Daemon Started
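Rather than reading the log by eye, the check can be repeated later with a small script. This is a sketch under assumptions: scans_after_restart is a hypothetical helper, and it treats the last "Daemon Started" line from cachefilesd as the restart point. If the file contains no such line, it simply counts every "Scan complete" entry.

```shell
#!/bin/sh
# scans_after_restart FILE
# Hypothetical helper: prints how many "Scan complete" entries cachefilesd
# wrote after its most recent "Daemon Started" entry.
scans_after_restart() {
    awk '/cachefilesd/ && /Daemon Started/ { n = 0; next }   # reset at each restart
         /cachefilesd/ && /Scan complete/  { n++ }
         END { print n + 0 }' "$1"
}

# Example (run as root on the server):
#   scans_after_restart /var/log/messages
```

A result of 0 some minutes after the upgrade suggests the fix has taken effect.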
The cachefilesd service itself is also running normally.
[root@TESTDB ~]# service cachefilesd status
cachefilesd (pid 10519) is running...
[root@TESTDB ~]# chkconfig --list cachefilesd
cachefilesd     0:off   1:off   2:on    3:on    4:on    5:on    6:off
The problem is solved.
4. Conclusion
Various faults can occur in a real O&M environment, and diagnosing and solving them takes a lot of experience. Detecting problems early and nipping them in the bud is the best guarantee of continuous, healthy system operation. This is why the "firefighter" is no match for the steady "old workhorse".