High-Performance Linux Cluster Monitoring

Source: Internet
Author: User

Monitoring is the core task of cluster management. Monitoring data can be used to schedule tasks, balance load, report hardware and software failures to administrators, and track system usage broadly. Monitoring information must be obtained without impacting the performance of the cluster. This article discusses ways to obtain monitoring data using the /proc file system and Java.

Java in Linux clusters

Java technology offers cluster management developers a number of solutions. Java is dynamic, flexible, and portable, and this combination of features makes it an ideal basis for building cluster management on heterogeneous networks and platforms.

Java has extensive threading support, handles IP protocols such as TCP and UDP easily, supports network programming on multi-homed hosts, and makes it easier to create network connections than C or C++. With the Java Native Interface (JNI), Java code running in the Java Virtual Machine (JVM) can interoperate with applications and libraries written in other languages.

Java has long been an option for building cluster monitoring and management systems. However, Java is typically used only in the front end or on the cluster management host, while a daemon written in C is installed on each cluster node. Although Java offers many advantages, can it effectively replace the C daemon running on each node for high-performance cluster monitoring? That question is the focus of this article.

High-performance monitoring

Tools for monitoring Linux clusters have traditionally provided limited data at measurement intervals of seconds. High-performance cluster monitoring is defined here as "the ability to collect data efficiently from nodes using subsecond measurement frequencies". When large clusters are involved, inefficiency in the monitoring software becomes more serious because the running applications must coordinate or share global resources.

Interference at one node can affect the operation of a job on other nodes. For example, an MPI function may need to synchronize with all participating nodes. One solution is to collect small amounts of data and transmit it infrequently. For high-performance monitoring, however, this solution is unacceptable, because heavily utilized clusters should be monitored frequently and continuously. The local job scheduler must be able to make quick decisions based on resource usage. Administrators often want immediate notification of urgent events and want to observe historical trends, neither of which is possible unless the cluster is monitored frequently and continuously. Other measures must therefore be taken, such as using more efficient algorithms, increasing the parallelism of transmission, improving the efficiency of the transfer protocol and data format, and reducing redundancy.

Profiling an application by tracking its resource usage during a run helps you debug or optimize the program. For a given application, the use of dynamic resources such as memory, network, and CPU can change rapidly, and high-frequency monitoring is one way to see how the application uses these resources.

Even if the user is not interested in high-frequency monitoring, an efficient algorithm consumes fewer resources at any monitoring frequency. This efficiency matters even more in heterogeneous clusters, where a user's job can be spread across faster and slower nodes. The slower nodes need all of their CPU to keep up and synchronize with the faster ones, so the CPU time that a monitoring program spends on the slower nodes lies on the critical path of the job.

Monitoring phases

Cluster monitoring mainly consumes two important resources: CPU cycles and network bandwidth. However, the nature of the consumption problem is fundamentally different for the two. CPU utilization is a problem entirely local to each node, and it can be addressed with efficient collection and merge algorithms. Network bandwidth is a shared resource, so its consumption is a problem of scale, and it can be addressed by minimizing the amount of data transmitted over the network.

To address these two problems, we divide cluster monitoring into three phases: collection, merge, and transmission. The collection phase is responsible for loading data from the operating system, parsing the data values, and storing them. The merge phase is responsible for combining data from multiple data sources, determining whether data values have changed, and filtering them. The transmission phase is responsible for compressing and transferring the data. This article focuses on the collection phase of Linux cluster monitoring.
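As a rough illustration of how the three phases fit together, the following Java sketch chains them into a single pass. The class name MonitorPipeline and its three functional parameters are hypothetical and chosen only for this example; they are not taken from the monitoring system the article describes.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: one pass through collection, merge, and transmission.
final class MonitorPipeline {
    private final Supplier<String> collector;        // collection: load raw data
    private final Function<String, String> merger;   // merge: combine and filter
    private final Consumer<String> transmitter;      // transmission: send the data

    MonitorPipeline(Supplier<String> collector,
                    Function<String, String> merger,
                    Consumer<String> transmitter) {
        this.collector = collector;
        this.merger = merger;
        this.transmitter = transmitter;
    }

    // Runs the three phases once. Nothing is sent when the merge phase
    // filters everything out (signalled here by an empty string).
    void runOnce() {
        String raw = collector.get();
        String merged = merger.apply(raw);
        if (!merged.isEmpty()) {
            transmitter.accept(merged);
        }
    }
}
```

Keeping the phases behind separate interfaces means that, in a design like this, the collection code can be tuned or replaced without touching the merge or transmission code.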

1. The collection phase

Linux offers several methods for collecting system statistics, each with its own advantages and disadvantages.

Using existing tools

Standard and non-standard tools, such as rstatd or SNMP tools, can perform one or more of the collection, merge, and transmission phases. However, the standard rstatd daemon provides limited information and is slow and inefficient.

Kernel modules

Several system monitoring projects use kernel modules to access monitoring data. In general, this is a very effective way to collect system data. The problem with this approach, however, is that the module code must be kept consistent with changes in the main kernel source. A kernel module may also conflict with other kernel modules that the user wants to use. In addition, the user must obtain or request the module before using the monitoring system.

The /proc virtual file system

The /proc virtual file system is a fast and efficient way to perform system monitoring. The main disadvantage of using /proc is the need to keep the parsing code synchronized with changes to the /proc file formats. In practice, the Linux kernel changes more frequently than the /proc file formats do, so the /proc virtual file system causes fewer problems than kernel modules.

Hybrid systems

Some monitoring systems adopt a hybrid approach: they collect data with kernel modules and use the /proc virtual file system as the data interface.

2. The merge phase

The merge phase can be implemented on the node, on the cluster management host, or on both. With efficiency in mind, we merge only on the node, because the node is the collector and provider of the monitoring data. Two or more simultaneous data requests then do not cause two operating system calls to collect the data; instead, the result of the first request is cached and served to the second. This reduces the load on the operating system and improves the responsiveness of the monitoring system. The merge phase can also combine data from multiple data sources that have independent collection rates, since not all data changes at the same rate or needs to be collected at the same rate.
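The caching behavior described above could look roughly like the following Java sketch. The class CachedSample, its maximum-age parameter, and the injectable clock are assumptions made for illustration; the article does not specify how its cache is implemented.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: requests that arrive close together share one sample,
// so the operating system is asked for the data only once.
final class CachedSample {
    private final Supplier<String> source;     // e.g. a reader for a /proc file
    private final long maxAgeMillis;           // how long a sample stays fresh
    private final LongSupplier clockMillis;    // injectable clock (assumption)
    private String cached;
    private long cachedAt;

    CachedSample(Supplier<String> source, long maxAgeMillis, LongSupplier clockMillis) {
        this.source = source;
        this.maxAgeMillis = maxAgeMillis;
        this.clockMillis = clockMillis;
    }

    // Returns the cached sample if it is still fresh; otherwise collects again.
    synchronized String get() {
        long now = clockMillis.getAsLong();
        if (cached == null || now - cachedAt >= maxAgeMillis) {
            cached = source.get();
            cachedAt = now;
        }
        return cached;
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`. Giving each data source its own CachedSample with its own maximum age is one way to support independent collection rates.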

Another reason to merge at the node level is to reduce the amount of information that is transmitted. Many /proc files contain both dynamic and static data. By dropping values that have not changed since the last transmission, the amount of data sent by a node can be greatly reduced. Merging eliminates the transmission not only of dynamic values that happen not to have changed, but also of static values that never change.
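A minimal sketch of this change filtering in Java follows. The class name DeltaFilter and the use of string key-value maps are assumptions for illustration only; values are assumed to be non-null.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: remembers the last value seen for each key and
// reports only the entries that differ from it.
final class DeltaFilter {
    private final Map<String, String> lastSeen = new HashMap<>();

    // Returns the entries of 'current' that are new or changed since the
    // previous call, and records 'current' as the new baseline.
    Map<String, String> changedSince(Map<String, String> current) {
        Map<String, String> changed = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            String previous = lastSeen.put(e.getKey(), e.getValue());
            if (!e.getValue().equals(previous)) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed;
    }
}
```

With a filter like this, a static value such as total memory is reported on the first pass only, while a dynamic value such as free memory is reported whenever it actually changes.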

3. The transmission phase

Monitoring data is almost always organized hierarchically. The task of the transmission phase is to encode this hierarchical data in a format that can be transmitted efficiently. Java's own file formats are an effective way to store hierarchical data and are easy to handle with the provided Java APIs. S-expressions have also been considered as an effective way to transmit this data.

A common debate about transmitting monitoring data is whether the data should be encoded in binary or text format. Binary data is more compact and therefore more efficient to transmit. However, when the /proc file system is used, the monitoring data is already in human-readable form, and converting it to binary before transmission would require additional processing resources and time. Keeping the collected data in text format leaves more node resources for work that is unrelated to monitoring.

Data in text format will provide the following additional benefits:

Platform Independence

When monitoring heterogeneous clusters, the byte order of binary data is not always the same from machine to machine. Using a text format solves this problem, so architecture independence comes without additional processing.
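The byte-order issue can be seen in a small Java sketch. The class ByteOrderDemo is hypothetical; it uses the standard java.nio.ByteBuffer API to show that the same integer has two binary layouts but only one text representation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: binary encoding depends on byte order, text does not.
final class ByteOrderDemo {
    // Encodes a 32-bit integer as 4 bytes in the given byte order.
    static byte[] binary(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Encodes the same integer as ASCII decimal digits.
    static byte[] text(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.US_ASCII);
    }
}
```

For the value 1, the big-endian encoding ends with the byte 0x01 while the little-endian encoding starts with it, so a receiver must know the sender's byte order. The text encoding is the single character '1' on every architecture.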

Easy to read format

Text data can be organized in a format that people can read easily. This makes it easier to debug programs and lets users view the data stream if needed.

Effective compression

The text representation of numeric data is drawn from a set of only 10 characters (the digits), rather than from the 256 possible values of a binary byte. The relative frequencies of the digits and the patterns they form allow dictionary-based and entropy-based compression algorithms to work efficiently.
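As an illustration, the following Java sketch compresses monitoring text with the standard java.util.zip.Deflater class, which combines dictionary (LZ77) and entropy (Huffman) coding. The class TextCompression and the choice of the BEST_SPEED level are assumptions for this example; the article does not say which compressor its system uses. The input is assumed to be ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: DEFLATE compression of ASCII monitoring text.
final class TextCompression {
    static byte[] compress(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(text.getBytes(StandardCharsets.US_ASCII));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        while (!deflater.finished()) {
            out.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();   // release native resources
        return out.toByteArray();
    }

    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                // Stop instead of looping forever on truncated input.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) break;
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

BEST_SPEED is chosen here on the assumption that CPU time on the node matters more than the last few percent of compression ratio; a different level may suit a bandwidth-constrained network better.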

The /proc virtual file system

The /proc virtual file system (also known as procfs) is the Linux implementation of a virtual file system found in UNIX operating systems such as Sun Solaris and BSD. On the surface, /proc appears as a standard file system and contains entries named after the IDs of the running processes. However, files in /proc do not consume disk space; they exist only in working memory. The original purpose of /proc was to provide easy access to process information, but in Linux it can now be used by any part of the kernel to report information.

Of the hundreds of values provided by the /proc file system, we focus on the minimum set required for cluster monitoring, which includes:

/proc/loadavg: contains the system load averages;

/proc/meminfo: contains memory management statistics;

/proc/net/dev: contains network interface metrics;

/proc/stat: contains kernel statistics;

/proc/uptime: contains the total system uptime and idle time.

Each file provides a different number of values. The values available from these files are listed below.

/proc/loadavg provides the following data:

1-minute load average;

5-minute load average;

15-minute load average;

Total number of jobs;

Number of jobs that are currently running.
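The fields listed above can be parsed from a single line of text. The Java sketch below assumes the common layout of /proc/loadavg, for example "0.20 0.18 0.12 1/80 11206", where the fourth field is running jobs over total jobs. The class LoadAvg and its field names are hypothetical.

```java
// Hypothetical value class for one parsed /proc/loadavg sample.
final class LoadAvg {
    final double oneMinute;
    final double fiveMinutes;
    final double fifteenMinutes;
    final int runningJobs;
    final int totalJobs;

    private LoadAvg(double one, double five, double fifteen, int running, int total) {
        this.oneMinute = one;
        this.fiveMinutes = five;
        this.fifteenMinutes = fifteen;
        this.runningJobs = running;
        this.totalJobs = total;
    }

    // Parses a line such as "0.20 0.18 0.12 1/80 11206".
    // Any fields after the fourth are ignored.
    static LoadAvg parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 4) {
            throw new IllegalArgumentException("unexpected format: " + line);
        }
        String[] jobs = fields[3].split("/");   // "running/total"
        if (jobs.length != 2) {
            throw new IllegalArgumentException("unexpected job field: " + fields[3]);
        }
        return new LoadAvg(Double.parseDouble(fields[0]),
                           Double.parseDouble(fields[1]),
                           Double.parseDouble(fields[2]),
                           Integer.parseInt(jobs[0]),
                           Integer.parseInt(jobs[1]));
    }
}
```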

The memory information provided by /proc/meminfo includes:

Active memory;

Inactive memory;

Buffer memory;

Cached memory;

Total free memory;

Total high memory;

Free high memory;

Total low memory;

Free low memory;

Shared memory;

Total swap;

Swap cached;

Free swap;

Total memory.
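A minimal Java sketch for parsing this file follows. It assumes the "Name: value kB" line layout, for example "MemTotal: 16384 kB"; older kernels used additional summary lines, which this sketch does not interpret specially. The class name MemInfoParser is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parses "Name:   12345 kB" lines into a map.
final class MemInfoParser {
    // Returns each name mapped to the first number that follows the colon.
    // Lines without a colon or without a leading number are skipped.
    static Map<String, Long> parse(String text) {
        Map<String, Long> values = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String name = line.substring(0, colon).trim();
            String[] rest = line.substring(colon + 1).trim().split("\\s+");
            try {
                values.put(name, Long.parseLong(rest[0]));
            } catch (NumberFormatException e) {
                // Not a numeric value; ignore this line.
            }
        }
        return values;
    }
}
```

For example, `MemInfoParser.parse(text).get("MemFree")` would return the free memory figure in the units the kernel reports, normally kB.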

/proc/net/dev includes the following data for each network interface:

Bytes received;

Compressed packets received;

Receive errors;

Receive drops;

Receive FIFO errors;

Receive frame errors;

Multicast packets received;

Total packets received;

Bytes transmitted;

Compressed packets transmitted;

Total transmit errors;

Transmit carrier errors;

Transmit collisions;

Transmit drops;

Transmit FIFO errors;

Total packets transmitted.
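The per-interface counters can be extracted with a short Java sketch like the one below. It assumes the usual file layout of two header lines followed by one "name: counters" line per interface, with eight receive counters (bytes, packets, errs, drop, fifo, frame, compressed, multicast) followed by eight transmit counters (bytes, packets, errs, drop, fifo, colls, carrier, compressed). The class NetDevParser and its index constants are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parses /proc/net/dev text into per-interface counters.
final class NetDevParser {
    // Positions in the counter array, assuming the column order described above.
    static final int RX_BYTES = 0;
    static final int RX_PACKETS = 1;
    static final int TX_BYTES = 8;
    static final int TX_PACKETS = 9;

    // Returns interface name -> counters in file order.
    static Map<String, long[]> parse(String text) {
        Map<String, long[]> result = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;   // the two header lines contain no colon
            }
            String[] fields = line.substring(colon + 1).trim().split("\\s+");
            long[] counters = new long[fields.length];
            for (int i = 0; i < fields.length; i++) {
                counters[i] = Long.parseLong(fields[i]);
            }
            result.put(line.substring(0, colon).trim(), counters);
        }
        return result;
    }
}
```

Splitting at the colon rather than on whitespace also handles kernels that print the first counter directly after the interface name with no space in between.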

/proc/stat provides:

Boot time;

Number of context switches;

Total number of interrupts;

Total number of pages paged in;

Total number of pages paged out;

Total number of processes;

Total number of swap-ins;

Total number of swap-outs;

Total CPU idle time;

Total CPU nice time;

Total CPU system time;

Total CPU user time.

The following are also available for each CPU:

Per-CPU idle time;

Per-CPU nice time;

Per-CPU system time;

Per-CPU user time.

As well as the following data for each disk drive:

Blocks read per disk;

Blocks written per disk;

Total number of I/O operations per disk;

Read I/O operations per disk;

Write I/O operations per disk.
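Because the CPU times in /proc/stat are cumulative counters, utilization is derived from the difference between two samples. The Java sketch below assumes the aggregate line has the form "cpu user nice system idle ..." and uses only those first four counters, which are the ones listed above; newer kernels append further fields that this sketch ignores. The class CpuUsage is hypothetical.

```java
// Hypothetical sketch: CPU utilization from two /proc/stat samples.
final class CpuUsage {
    // Extracts {user, nice, system, idle} from the aggregate "cpu" line.
    static long[] parseCpuLine(String statText) {
        for (String line : statText.split("\n")) {
            if (line.startsWith("cpu ")) {   // "cpu0", "cpu1", ... are per-CPU lines
                String[] f = line.trim().split("\\s+");
                if (f.length < 5) {
                    break;
                }
                return new long[] {
                    Long.parseLong(f[1]), Long.parseLong(f[2]),
                    Long.parseLong(f[3]), Long.parseLong(f[4])
                };
            }
        }
        throw new IllegalArgumentException("no aggregate cpu line found");
    }

    // Fraction of time (0.0 to 1.0) that the CPU was not idle between samples.
    static double busyFraction(long[] before, long[] after) {
        long total = 0;
        for (int i = 0; i < 4; i++) {
            total += after[i] - before[i];
        }
        long idle = after[3] - before[3];
        return total == 0 ? 0.0 : (double) (total - idle) / total;
    }
}
```

The same calculation applies to the per-CPU lines if the prefix test is changed to match "cpu0 ", "cpu1 ", and so on.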

/proc/uptime includes:

Total uptime of the system;

Total idle time of the system.

It is important to note that each time a /proc file is read, a handler function is called by the kernel or by a specific module to produce the data. The data is generated on the fly, and the entire file is rebuilt whether you read a single character or a large block. This is critical to efficiency: any system monitor that uses /proc should read the entire file in one operation rather than process it piece by piece.

Java provides a rich set of file I/O classes, including stream-based classes, block-oriented classes, and the New I/O (NIO) library introduced in J2SDK 1.4. Experiments show that, in general, the RandomAccessFile class gives the best I/O performance for basic block reads and writes of files. For example, a block read of a file looks like this:

    mFile = new RandomAccessFile("/proc/meminfo", "r");  // open the file for reading
    mFile.read(mBuffer);                                  // read the file as a block
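A fuller sketch of this block-read technique follows. It keeps the file open between samples and rewinds with seek(0) before each read, which is an assumption about how the fragment above would be used repeatedly. The class ProcFileReader and the fixed buffer size are hypothetical; content beyond the buffer size is not returned.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: repeated block reads of one file through one open handle.
final class ProcFileReader implements AutoCloseable {
    private final RandomAccessFile file;
    private final byte[] buffer;   // reused for every read to avoid allocation

    ProcFileReader(String path, int bufferSize) throws IOException {
        this.file = new RandomAccessFile(path, "r");   // opened once, kept open
        this.buffer = new byte[bufferSize];
    }

    // Rewinds to the start and reads the current contents in blocks,
    // up to the size of the buffer.
    String readAll() throws IOException {
        file.seek(0);
        int total = 0;
        while (total < buffer.length) {
            int n = file.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;   // end of file
            }
            total += n;
        }
        return new String(buffer, 0, total, StandardCharsets.US_ASCII);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

A monitor could create one reader per file, for example `new ProcFileReader("/proc/meminfo", 4096)`, and call `readAll()` at each sampling interval. The buffer should be sized comfortably larger than the file so that one read call normally returns everything.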


This article has discussed how to use the Java language effectively for high-performance monitoring on Linux cluster nodes. When designing such a program, pay attention to the following:

Use the /proc file system;

Read /proc files in blocks, not line by line or character by character;

Keep the files open between reads;

Eliminate unnecessary data conversion;

Merge data on the nodes;

Transmit data in compressed form;

Be aware of performance issues specific to the language or library.

Kernel modules are not a requirement for high-performance monitoring. This is important because it allows a great degree of portability across Linux versions and distributions and leaves a wide choice of languages for implementing the monitor. However, the performance of the /proc file system depends heavily on the efficiency of the kernel code behind it, so a proper understanding of the mechanism has a significant effect on the performance of a monitor written in any language.
