Using the tmpfs file system to quickly optimize ganglia disk I/O performance, 20151203
I recently ran into the following problem.
Problem: the ganglia monitoring host responded slowly, and ordinary command-line operations would hang, especially when opening and editing files.
Problem analysis: Troubleshooting with ganglia's own monitoring of the host, together with top, iotop, vmstat and other tools, showed a large volume of disk write I/O at all times. This server also runs MongoDB and MySQL slave databases and other production applications. According to iotop, the write I/O came mainly from the gmond process: each time monitoring data is collected, a large number of writes go to the rrd files.
Further background: we use ganglia to monitor various system and application metrics in the production environment (about 15000 items). The monitoring interval is 15 seconds by default, the rrds directory is 9.3 GB, and the disk is a RAID 5 array of three 2 TB SAS hard disks.
This suggests that about 9.3 GB of disk writes are generated every 15 seconds (9.3 * 1000 MB / 15 s = 620 MB/s). With mechanical disks in RAID 5 and a MongoDB slave database on the same machine, it is fortunate that the system ran for so long without going down.
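The estimate above can be reproduced with shell arithmetic. This is a minimal sketch that takes 9.3 GB as 9300 MB, as the text does, and it is only an upper bound under the text's assumption that the whole rrds directory is rewritten every interval. The function name write_rate_mb_s is made up for illustration.

```shell
#!/bin/sh
# Figures from the text: 9.3 GB of rrd files (taken as 9300 MB), 15-second interval.
rrds_mb=9300
interval_s=15

# write_rate_mb_s SIZE_MB INTERVAL_S
# Prints the write rate in MB/s if SIZE_MB were written once per interval
# (integer division, so fractions are dropped).
write_rate_mb_s() {
    echo $(( $1 / $2 ))
}

echo "$(write_rate_mb_s "$rrds_mb" "$interval_s") MB/s"   # prints: 620 MB/s
```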
Solution: Further analysis showed that the rrd files do not change noticeably in size as long as the set of monitored metrics stays the same (a property of rrd being a ring database, for better and for worse). In other words, the rrds directory stays at around 9.3 GB, with heavy writes and very few reads. Only SSDs or memory suit this kind of workload. Switching to SSDs was not possible for the time being, so I used memory. Fortunately the server has 16 GB, which is enough for now; more memory will be needed when more metrics are monitored.
Memory is volatile, though, and the data would be lost when the server restarts. So I created a crontab entry that synchronizes the 9.3 GB of rrds to the disk file system every 10 minutes as a backup. Monitoring data is important, but losing up to 10 minutes of it is an acceptable price for reducing disk pressure and the impact on other services so that the system works normally. In testing, synchronizing the 9.3 GB of files to disk took less than 1 minute. If a 10-minute window turns out to be too long, the synchronization can run every 5 minutes instead.
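Before moving the data into memory it may be worth checking that it fits with room to spare. The following is only a sketch: fits_in_ram is a hypothetical helper, and the 4000 MB reserve for the other services on the host is an assumed figure, not one from this article.

```shell
#!/bin/sh
# fits_in_ram DATA_MB TOTAL_MB RESERVE_MB
# Succeeds if the data plus a reserve for everything else fits in total RAM.
fits_in_ram() {
    [ $(( $1 + $3 )) -le "$2" ]
}

# 9300 MB of rrds and 16000 MB of RAM come from the text;
# the 4000 MB reserve is an assumption.
if fits_in_ram 9300 16000 4000; then
    echo "rrds fits in memory"
else
    echo "not enough memory for a ramdisk"
fi
```

On a real host the inputs could come from `du -sm /var/lib/ganglia/rrds` and `free -m` rather than being hard-coded.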
Perform the following operations:
Stop ganglia
/etc/init.d/gmetad stop
/etc/init.d/gmond stop
Back up the current rrds directory
mv /var/lib/ganglia/rrds/ /var/lib/ganglia/rrds.mirror/
Create a tmpfs File System
mkdir -p /mnt/ramdisk/
mount -t tmpfs -o size=14G tmpfs /mnt/ramdisk/
Point the rrds directory of ganglia at the tmpfs mount
ln -s /mnt/ramdisk/ /var/lib/ganglia/rrds
Change the owner of the rrds directory to ganglia:ganglia. ganglia needs write permission for this directory.
chown -R ganglia:ganglia /var/lib/ganglia/rrds
chown -R ganglia:ganglia /mnt/ramdisk/
Start ganglia
/etc/init.d/gmetad start
/etc/init.d/gmond start
Check the ganglia monitoring charts. If they all look normal, the change is working.
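Besides looking at the charts, it can help to confirm that /mnt/ramdisk really is a tmpfs mount. The sketch below reads a mounts table in the /proc/mounts format; mount_fstype is a hypothetical helper, and the sample table exists only to make the example self-contained.

```shell
#!/bin/sh
# mount_fstype MOUNTPOINT [MOUNTS_FILE]
# Prints the file system type recorded for MOUNTPOINT in a table laid out
# like /proc/mounts (device, mount point, type, options, ...).
# If the mount point is listed more than once, the last entry wins.
mount_fstype() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' "${2:-/proc/mounts}"
}

# Example against a sample table (made-up contents):
sample=$(mktemp)
printf '%s\n' '/dev/sda1 / ext4 rw 0 0' \
              'tmpfs /mnt/ramdisk tmpfs rw,size=14680064k 0 0' > "$sample"
mount_fstype /mnt/ramdisk "$sample"   # prints: tmpfs
rm -f "$sample"
```

On the monitoring host itself, `mount_fstype /mnt/ramdisk` with no second argument reads the live /proc/mounts.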
Extension 1:
Because an in-memory file system is volatile, its data files are lost whenever the power fails, the system restarts, or the tmpfs file system is unmounted and remounted. To limit how much data is lost in these cases (a trade-off you have to balance for yourself), set up a crontab entry that synchronizes the data in the ramdisk to the disk file system every 10 minutes as a backup. We accept losing up to 10 minutes of metric data, which means the monitoring graphs may show a gap of up to 10 minutes.
[root@min40]# crontab -l
# Synchronize ganglia rrds files in memory to disk every 10 minutes laijingli 20151202
*/10 * * * * time rsync -av /var/lib/ganglia/rrds/ /var/lib/ganglia/rrds.mirror/ > /tmp/rsync_ganglia_rrds_to_disk.log
Synchronizing the 9.3 GB (about 15000 rrd files) takes about 43 seconds.
[root@min40]# time rsync -av /var/lib/ganglia/rrds/ /var/lib/ganglia/rrds.mirror/
real 0m42.122s
user 0m41.545s
sys 0m22.430s
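If a sync ever took longer than the cron interval, two rsync runs could overlap. One common way to guard against that is a lock directory, since mkdir is atomic. The sketch below is not part of the original setup; run_locked and the lock path are hypothetical names.

```shell
#!/bin/sh
# run_locked LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created. mkdir either creates the
# directory or fails, so two overlapping runs cannot both hold the lock.
# Returns 1 without running COMMAND when the lock is already held.
run_locked() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 1
    fi
    status=0
    "$@" || status=$?
    rmdir "$lockdir"
    return "$status"
}

# Intended use from cron (shown as a comment so that running this file
# only defines the function):
# run_locked /tmp/ganglia_rrds_sync.lock rsync -a /var/lib/ganglia/rrds/ /var/lib/ganglia/rrds.mirror/
```

A lock directory left behind by a crashed run has to be removed by hand, which is the price of keeping the sketch this small.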
Extension 2:
To keep the monitoring system working after the ganglia monitoring server next restarts, add the operations above to the startup sequence and automatically load the latest rrds files into the ramdisk, so that the historical monitoring graphs stay continuous.
[root@ynadmin40]# more /etc/rc.local
### Optimize ganglia rrd file write performance to solve disk I/O problems
### At startup: mount tmpfs --> link rrds to ramdisk --> load the latest rrds data into ramdisk --> change the owner of the rrds directory to ganglia
mount -t tmpfs -o size=14G tmpfs /mnt/ramdisk/
ln -sfn /mnt/ramdisk/ /var/lib/ganglia/rrds
rsync -av /var/lib/ganglia/rrds.mirror/ /var/lib/ganglia/rrds/ > /tmp/rsync_ganglia_rrds_to_ramdisk.log
chown -R ganglia:ganglia /var/lib/ganglia/rrds
chown -R ganglia:ganglia /mnt/ramdisk/
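The startup steps can also be wrapped in a small script that can be previewed before it touches anything. This is a sketch rather than the rc.local actually used here: the DRY_RUN switch, the run helper and the restore_ramdisk function are assumptions added for illustration, and the mountpoint check is an extra guard against mounting twice.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default in this sketch) every command is printed
# instead of executed, so the plan can be reviewed first.
DRY_RUN=${DRY_RUN:-1}
RAMDISK=/mnt/ramdisk
RRDS=/var/lib/ganglia/rrds
MIRROR=/var/lib/ganglia/rrds.mirror

# run COMMAND [ARGS...]: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

restore_ramdisk() {
    run mkdir -p "$RAMDISK"
    # In a real run, skip the mount if something is already mounted there.
    if [ "$DRY_RUN" = 1 ] || ! mountpoint -q "$RAMDISK"; then
        run mount -t tmpfs -o size=14G tmpfs "$RAMDISK"
    fi
    # -n replaces an existing rrds symlink instead of following it.
    run ln -sfn "$RAMDISK" "$RRDS"
    # Load the latest copy from disk back into memory.
    run rsync -a "$MIRROR/" "$RAMDISK/"
    run chown -R ganglia:ganglia "$RAMDISK"
}

restore_ramdisk
```

Running it as is only prints the five commands; setting DRY_RUN=0 as root would execute them.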
Extension 3:
tmpfs does not suit every application scenario, and it carries the risk of data loss. It does suit the ganglia monitoring scenario in this article, as well as scenarios such as squid cache files. Whatever fits your own situation is the best choice.
Comparison before and after optimization:
After moving ganglia's rrds directory to a memory disk (tmpfs), the improvement was very obvious (about 10 times better). For details, see the monitoring graphs.