High-performance Linux cluster monitoring

Monitoring is a core task of cluster management. Monitoring data can be used to schedule tasks, balance load, report software and hardware faults to administrators, and broadly control system usage. The monitoring information must be obtained without affecting the performance of the cluster itself. This article discusses how to obtain monitoring data using the /proc file system and Java.

Java applications in Linux clusters

Java offers cluster-management developers many advantages. It is dynamic, flexible, and portable, qualities that make it an ideal foundation for building cluster management on heterogeneous networks and platforms.

Java has an extensive class library that handles IP protocols such as TCP and UDP with ease, and it simplifies writing network programs on multi-homed hosts; creating a network connection takes less effort than in C or C++. Through the Java Native Interface (JNI), Java code running in the Java Virtual Machine (JVM) can also interoperate with applications and libraries written in other languages.
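As a minimal illustration (not taken from the article; the host name, port, and sample line are placeholders), sending one text-format monitoring sample to a management host over UDP takes only a few lines, using java.net.DatagramSocket, DatagramPacket, and InetAddress:

// Minimal sketch: send one text-format monitoring sample over UDP.
// "mgmt-host" and port 9999 are placeholders, not values from the article.
DatagramSocket socket = new DatagramSocket();
byte[] payload = "loadavg 0.42 0.38 0.35".getBytes("US-ASCII");
DatagramPacket packet = new DatagramPacket(payload, payload.length,
        InetAddress.getByName("mgmt-host"), 9999);
socket.send(packet);
socket.close();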

Java is already a natural choice for building cluster monitoring and management systems. However, it is usually used only for the front end or the cluster head host, while daemons written in C are installed on the cluster nodes. Given the advantages the Java language offers, can it effectively replace the C daemons running on each node for high-performance cluster monitoring? That question is the focus of this article.

High-performance monitoring

Traditionally, Linux cluster monitoring tools provide a limited amount of data at intervals measured in seconds. High-performance cluster monitoring is defined here as "the ability to collect data from nodes efficiently at intra-second frequencies". On large clusters, inefficiency in the monitoring software becomes more serious, because the running applications must coordinate with one another and share global resources.

Interference on one node can affect jobs running on other nodes; for example, an MPI function may need to synchronize across all participating nodes. One solution is to collect only a small amount of data and transmit it infrequently, but for high-performance monitoring this is unacceptable: a heavily utilized cluster should be monitored frequently and continuously. The local job scheduler must be able to make quick decisions based on resource usage, and administrators often want immediate notification of emergencies as well as the ability to observe historical trends. Neither is possible if the cluster cannot be monitored frequently and continuously. Measures must therefore be taken, such as using more efficient algorithms, increasing transmission concurrency, improving the efficiency of the transmission protocol and data format, and reducing redundancy.

Profiling an application, whether to debug it or to optimize it, requires tracking how it uses resources while it runs. For a given application, the use of dynamic resources such as memory, network, and CPU can change rapidly, and high-frequency monitoring is one practical way to observe how the application uses them.

Even when users are not interested in high-frequency monitoring, an efficient algorithm consumes few resources at any monitoring frequency. In heterogeneous clusters this efficiency matters even more: a user's job may be distributed across fast and slow nodes, and a slow node needs every CPU cycle just to keep up with and stay synchronized with the fast nodes. The CPU time a monitoring program consumes on a slow node therefore lies on the critical path of the job.

Monitoring phases

Cluster monitoring mainly consumes two important resources: CPU cycles and network bandwidth. However, the consumption problem is fundamentally different for the two resources. CPU utilization is entirely local to the node and can be addressed by creating efficient collection and merging algorithms. Network bandwidth is a shared resource and a scalability problem; it is addressed by minimizing the amount of data transmitted over the network.

To address these two problems, we divide cluster monitoring into three phases: collection, merging, and transmission. The collection phase loads data from the operating system, parses the values, and stores them. The merge phase combines data from multiple sources, determines which values have changed, and filters out the rest. The transmission phase compresses and transmits the data. This article focuses on the collection phase of Linux cluster monitoring.

1. Collection phase
Linux offers several ways of gathering system statistics, each with its own advantages and disadvantages.
◆ Use existing tools
Standard and non-standard tools such as rstatd or SNMP agents can perform one or more of the collection, merge, and transmission phases. However, the information provided by the standard rstatd daemon is limited, and it is slow and inefficient.
◆ Kernel module
Several system-monitoring projects use kernel modules to access monitoring data. In general this is a very efficient way to collect system data. The problem with this approach is that the module's code must be kept consistent with ongoing changes in the mainline kernel source. A kernel module may also conflict with other kernel modules you want to use, and the module must be obtained and installed before the monitoring system can be used.
◆/Proc Virtual File System
The /proc virtual file system is a fast and efficient way to perform system monitoring. The main disadvantage of using /proc is that parsing code must be kept in sync with changes to the /proc file formats. As it turns out, the Linux kernel changes more frequently than the /proc file formats do, so this approach causes fewer problems than kernel modules.
◆ Hybrid system
Some monitoring systems take a hybrid approach, using a kernel module to collect the data and the /proc virtual file system as the data interface.

2. Merge phase
The merge phase can be implemented on the node, on the cluster management host, or distributed across both. For efficiency, we merge only on the nodes, since each node is both the collector and the provider of its monitoring data. Two or more near-simultaneous requests for data then do not cause two rounds of operating-system calls; instead, the data obtained for the first request is cached and served to the second. This reduces the burden on the operating system and improves the responsiveness of the monitoring system. The merge phase can also combine data from multiple sources collected at independent rates, because not all data changes at the same speed or needs to be collected at the same frequency.

Another reason to merge at the node level is to reduce the amount of information that must be transmitted. Many /proc files contain both dynamic and static data; by dropping values that have not changed since the last transmission, the amount of data a node sends can be greatly reduced. Merging eliminates the transmission not only of dynamic values that change infrequently but also of static values that never change.
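A minimal sketch of this on-node caching and filtering follows; the class and method names are ours, not the article's. A request arriving within the collection interval is served from the cache, and only changed values are handed to the transmission phase:

import java.util.HashMap;
import java.util.Map;

// Hypothetical on-node merge cache (names are illustrative). Callers check
// needsCollection() before collecting, store() the fresh sample, and then
// ask for changedValues() to obtain only the data worth transmitting.
class MergeCache {
    private final long intervalMillis;                 // minimum time between real collections
    private long collectedAt = -1;
    private Map<String, String> cached;
    private final Map<String, String> lastSent = new HashMap<String, String>();

    MergeCache(long intervalMillis) { this.intervalMillis = intervalMillis; }

    // True only when the cached sample is missing or stale, so concurrent
    // requests trigger a single round of operating-system calls.
    synchronized boolean needsCollection(long now) {
        return cached == null || now - collectedAt >= intervalMillis;
    }

    synchronized void store(Map<String, String> sample, long now) {
        cached = sample;
        collectedAt = now;
    }

    // Returns only the values that differ from what was last transmitted.
    synchronized Map<String, String> changedValues() {
        Map<String, String> delta = new HashMap<String, String>();
        for (Map.Entry<String, String> e : cached.entrySet()) {
            String previous = lastSent.put(e.getKey(), e.getValue());
            if (!e.getValue().equals(previous)) {
                delta.put(e.getKey(), e.getValue());
            }
        }
        return delta;
    }
}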

3. Transmission phase
Monitoring data is almost always organized in a hierarchical structure, so the task of the transmission phase is to encode that hierarchical data into a format that can be transmitted efficiently. The Java property-file format is one effective way of storing hierarchical data and is easy to use through the provided Java APIs. S-expressions have been considered another effective way of transmitting such data.
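As a concrete illustration (the node name and values below are hypothetical, not from the article), one node's sample might be encoded as a nested S-expression before compression and transmission:

(node042
  (loadavg (1min 0.42) (5min 0.38) (15min 0.35))
  (meminfo (memtotal 2060188) (memfree 81920)))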

A common question about transmitting monitoring data is whether it should be encoded in binary or in text form. Binary data is more compact and can therefore be transmitted more efficiently. When the /proc file system is used, however, the monitoring data usually starts out in human-readable text, so converting it to binary before transmission costs additional processing and time. Keeping the collected data in text form leaves more of the node's resources for non-monitoring work.

Data in text format provides the following additional benefits:
◆ Platform independence
When monitoring a heterogeneous cluster, the byte ordering of data is not always the same across machines. Using a text format sidesteps this issue in the encoding itself, and the architecture independence imposes no extra processing requirements.
◆ Easy-to-read format
Text data can be organized into a human-readable layout, which makes it easy to debug the programs or to let users watch the data stream when needed.
◆ Effective compression
Numeric data in text form is composed of characters drawn from a set of ten symbols (the digits), rather than from the 256 possible values of a binary byte. The relative frequencies of the digits and the patterns they form allow dictionary-based and entropy-based compression algorithms to work effectively.
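For instance (a minimal sketch assuming java.util.zip; the helper name is ours, not the article's), a text-format sample can be gzip-compressed before it is written to the network:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip-compress a text-format sample before transmission.
static byte[] compressSample(String sample) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(bytes);
    gzip.write(sample.getBytes("US-ASCII"));
    gzip.finish();                      // flush the remaining compressed data
    return bytes.toByteArray();
}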

The /proc virtual file system

The /proc virtual file system (procfs) is the Linux implementation of a virtual file system also found in other Unix operating systems, including Sun Solaris and the BSD family. To the user, /proc appears as a standard file system containing directories named after the IDs of the running processes. The files in /proc occupy no disk space, however; they exist only in working memory. /proc was originally designed to ease access to process information, but in Linux it is now used by nearly every part of the kernel to report something.

Of the hundreds of values provided by the /proc file system, we focus on the minimal set required for cluster monitoring:
◆ /proc/loadavg: contains the system load averages;
◆ /proc/meminfo: contains memory management statistics;
◆ /proc/net/dev: contains network interface (NIC) metrics;
◆ /proc/stat: contains kernel statistics;
◆ /proc/uptime: contains the total system uptime and idle time.

The number of values each file provides differs. The complete list of useful values in these files follows.
◆ /proc/loadavg provides the following data:

1-minute load average;
5-minute load average;
15-minute load average;
Total number of jobs;
Number of running jobs.

◆ /proc/meminfo provides:

Active memory;
Inactive memory;
Buffer memory;
Cached memory;
Total free memory;
Total high memory;
Free high memory;
Total low memory;
Free low memory;
Shared memory;
Swap memory;
Swap cache memory;
Free swap memory;
Total memory.

◆ /proc/net/dev provides the following data for each network interface:

Bytes received;
Compressed packets received;
Receive errors;
Receive drop errors;
Receive FIFO errors;
Receive frame errors;
Multicast packets received;
Total packets received;
Bytes transmitted;
Compressed packets transmitted;
Total transmit errors;
Transmit carrier errors;
Transmit collisions;
Transmit drop errors;
Transmit FIFO errors;
Total packets transmitted.

◆ /proc/stat provides:

Boot time;
Number of context switches;
Total interrupts;
Total pages paged in;
Total pages paged out;
Total processes created;
Total pages swapped in;
Total pages swapped out;
Total CPU idle time;
Total CPU nice time;
Total CPU system time;
Total CPU user time.
And for each CPU:
Idle time of the individual CPU;
Nice time of the individual CPU;
System time of the individual CPU;
User time of the individual CPU.
And the following data for each disk drive:
Blocks read from the individual disk;
Blocks written to the individual disk;
Total I/O operations of the individual disk;
I/O reads of the individual disk;
I/O writes of the individual disk.

◆ /proc/uptime provides:

Total time the system has been up;
Total time the system has been idle.
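As a brief sketch of reading one of these files (variable names and the sample line are illustrative, not from the article), the /proc/loadavg values can be pulled out of a single block read:

import java.io.RandomAccessFile;

// Sketch: read /proc/loadavg in one block and split it into its fields.
RandomAccessFile loadFile = new RandomAccessFile("/proc/loadavg", "r");
byte[] buffer = new byte[256];
int length = loadFile.read(buffer);            // e.g. "0.42 0.38 0.35 1/180 12345"
String[] fields = new String(buffer, 0, length, "US-ASCII").trim().split("\\s+");
String oneMinute = fields[0];                  // 1-minute load average
String[] jobs = fields[3].split("/");          // running jobs / total jobs
loadFile.close();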

It is worth noting that each time a /proc file is read, a handler function in the kernel or in a specific module is called to generate the data. The data is generated on the fly, and whether a single character or a large block is read, the whole file is regenerated. This is critical for efficiency: any system monitor that uses /proc should therefore gulp the entire file in one read rather than reading and processing it piecemeal.

Java provides a rich set of file I/O classes, including the stream-based classes, the block-oriented classes, and the new I/O (NIO) library introduced with J2SDK 1.4. Experimentation shows that, in general, basic block reads perform best using the RandomAccessFile class. For example, a block read of a file looks like this:

mFile = new RandomAccessFile("/proc/meminfo", "r");
// Open the file in read-only mode
mFile.read(mBuffer); // Read a block of the file into the byte buffer mBuffer
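A natural extension of this snippet (a sketch under our assumptions, not code from the article) keeps the file open between samples and re-reads it from offset 0 on each pass, which matches the advice in the conclusion below:

// Sketch: keep the /proc file open and re-read it from offset 0 on each sample.
RandomAccessFile mFile = new RandomAccessFile("/proc/meminfo", "r");
byte[] mBuffer = new byte[4096];
while (monitoring) {                  // 'monitoring' is an assumed loop-control flag
    mFile.seek(0);                    // /proc data is regenerated on every read
    int length = mFile.read(mBuffer); // gulp the whole file in one block
    // ... parse 'length' bytes of mBuffer, merge on the node, transmit ...
    Thread.sleep(100);                // sub-second sampling interval (illustrative)
}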

Conclusion

This article has discussed how to use Java effectively for high-performance monitoring on Linux cluster nodes. When programming, pay attention to the following points:
◆ Use the /proc file system;
◆ Read /proc files in whole blocks rather than line by line or character by character;
◆ Keep files open between reads;
◆ Eliminate unnecessary data conversion;
◆ Merge data on the nodes;
◆ Transmit data in compressed form;
◆ Be aware of language- or library-related performance issues.

For high-performance monitoring, kernel modules are not necessary. This gives a large degree of portability across Linux versions and distributions, and it leaves many options open for the monitor's implementation language. The performance of the /proc file system does, however, depend on the efficiency of the underlying kernel code, so a sound understanding of that mechanism has a significant impact on the performance of a monitor written in any language.

