In Parts 1 and 2 of this three-part series, I presented techniques and patterns for monitoring Java applications, focusing on the JVM and application classes. In this final installment, I present techniques for collecting performance and availability data from an application's dependencies, such as the underlying operating system, the network, or the application's back-end database. I conclude with a discussion of patterns for managing the collected data and of ways to report and visualize it.
Spring-based collectors
In Part 2, I implemented a basic Spring-based component model for managing monitoring services. The basic principles and benefits of this model are:
XML-based configuration makes it easier to manage the large number of parameter sets needed to configure more complex performance data collectors.
A separation-of-concerns structure lets you use simpler components that interact with one another through Spring's dependency injection.
Spring provides a lifecycle for simple collection beans, consisting of initialization, start, and stop operations, plus the option to expose a Java Management Extensions (JMX) management interface on a bean so that it can be controlled, monitored, and troubleshot at run time.
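The lifecycle and management-interface idea in the last point can be sketched in plain Java, without Spring. This is a minimal illustration under stated assumptions, not the collector model from Part 2: the names CollectorLifecycleSketch, CollectorControl, and PollingCollector are hypothetical, and the bean is registered by hand with the platform MBean server instead of being exported by a Spring container.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

// Hypothetical names throughout; this is not the collector API from Part 2.
class CollectorLifecycleSketch {

    /** Management interface exposed over JMX (must be public for JMX to accept it). */
    public interface CollectorControl {
        boolean isRunning();        // exposed as the JMX attribute "Running"
        long getCollectionCount();  // exposed as the JMX attribute "CollectionCount"
        void start();               // exposed as a JMX operation
        void stop();                // exposed as a JMX operation
    }

    /** Runs a collection task at a fixed period between start() and stop(). */
    static class PollingCollector implements CollectorControl {
        private final Runnable task;
        private final long periodMillis;
        private final AtomicLong collections = new AtomicLong();
        private ScheduledExecutorService scheduler; // null while stopped

        PollingCollector(Runnable task, long periodMillis) {
            this.task = task;
            this.periodMillis = periodMillis;
        }

        @Override
        public synchronized void start() {
            if (scheduler != null) {
                return; // already running
            }
            scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "polling-collector");
                t.setDaemon(true); // do not keep the JVM alive just for monitoring
                return t;
            });
            scheduler.scheduleAtFixedRate(this::collectOnce, 0, periodMillis, TimeUnit.MILLISECONDS);
        }

        @Override
        public synchronized void stop() {
            if (scheduler == null) {
                return; // already stopped
            }
            scheduler.shutdownNow();
            scheduler = null;
        }

        @Override
        public synchronized boolean isRunning() {
            return scheduler != null;
        }

        @Override
        public long getCollectionCount() {
            return collections.get();
        }

        /** Runs the task once; a failing task is swallowed so the schedule survives. */
        void collectOnce() {
            try {
                task.run();
                collections.incrementAndGet();
            } catch (RuntimeException e) {
                // A real collector would log or trace the failure here.
            }
        }
    }

    /** Registers the collector with the platform MBean server under the given name. */
    static ObjectName register(PollingCollector collector, String name) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName objectName = new ObjectName(name);
        server.registerMBean(new StandardMBean(collector, CollectorControl.class), objectName);
        return objectName;
    }
}
```

Once registered, a JMX console such as JConsole can read the Running and CollectionCount attributes and invoke start and stop. In a Spring deployment the container would typically drive these lifecycle calls and handle the JMX export, so the manual registration here is only a stand-in.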
I provide more detail on the Spring-based collectors in the relevant sections of this article.
Monitoring hosts and operating systems
Java applications always run on underlying hardware and on an operating system that supports the JVM. One of the most critical components of a comprehensive monitoring infrastructure is the ability to collect performance, health, and availability metrics from the hardware and the OS, typically through the OS itself. This section covers a number of techniques for obtaining this data and tracing it to the application performance management (APM) system through the ITracer class introduced in Part 1.
Typical OS performance metrics
The following summary lists typical metrics that are relevant across a wide range of operating systems. Although the details of data collection vary considerably and the data must be interpreted in the context of the given OS, these metrics are roughly equivalent on most standard hosts:
CPU utilization: Represents how busy the CPU is on a given host. The unit is typically a percentage, which at its lowest level means the share of a specific period of elapsed clock time during which the CPU was busy. A host can have multiple CPUs, and a CPU can contain multiple cores, but the OS usually abstracts each core as one CPU. For example, a host with two dual-core CPUs is represented as having four CPUs. Metrics can usually be collected per CPU or as a total resource utilization that aggregates the usage of all processors. Whether to monitor each CPU individually or only the aggregate usually depends on the nature of the software and its internal architecture. A standard multithreaded Java application typically balances its load across all CPUs by default, so the aggregate is sufficient. In some cases, however, individual OS processes are pinned to a specific CPU, and aggregate metrics may not capture the appropriate level of granularity.
CPU utilization is usually broken down into four categories:
System: Processor time spent executing system or OS kernel-level activity
User: Processor time spent executing user activity
I/O wait: Processor time spent idle while waiting for an I/O request to complete
Idle: Time during which there is no processor activity
Two other related metrics are the run queue length (in essence, the backlog of requests waiting for CPU time) and context switches (instances in which the allocation of processor time switches from one process to another).
Memory: The simplest memory metric is the percentage of physical memory available or in use. Other considerations are virtual memory, rates of memory allocation and deallocation, and finer-grained metrics that indicate which areas of memory are in use.
Disk and I/O: Disk metrics are simple (but critical) reports of the space available or in use on each logical or physical disk device, as well as the read and write rates for those devices.
Network: The rate of data transfer and the incidence of errors on network interfaces, usually broken down by higher-level network protocol category, such as TCP and IP.
Process and process groups: All of the metrics described above can be expressed as totals for a given host. They can also be broken down into the same metrics expressed as the consumption or activity of an individual process or a group of related processes. Monitoring resource utilization per process helps explain what share of the host's resources each application or service consumes. Some applications run as a single process; in other cases, a service such as Apache 2 Web Server starts a group of processes that together represent one logical service.
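The CPU utilization definition above (busy time as a percentage of an elapsed period, broken down into system, user, I/O wait, and idle) can be illustrated with a short calculation. The sketch below assumes the OS exposes cumulative per-category tick counters that only ever increase, which is how many systems report CPU time, and derives percentages from the difference between two samples. The class names CpuSample and CpuUtilization are hypothetical, and counting I/O wait as non-busy time is an assumption based on the description of I/O wait as an idle state.

```java
import java.util.List;

// Hypothetical types for illustration; not part of the article's collector code.
final class CpuSample {
    // Cumulative tick counters, assumed to be monotonically increasing.
    final long user;
    final long system;
    final long ioWait;
    final long idle;

    CpuSample(long user, long system, long ioWait, long idle) {
        this.user = user;
        this.system = system;
        this.ioWait = ioWait;
        this.idle = idle;
    }

    long total() {
        return user + system + ioWait + idle;
    }

    /** Sums per-CPU counters into one host-wide (aggregate) sample. */
    static CpuSample aggregate(List<CpuSample> perCpu) {
        long user = 0, system = 0, ioWait = 0, idle = 0;
        for (CpuSample s : perCpu) {
            user += s.user;
            system += s.system;
            ioWait += s.ioWait;
            idle += s.idle;
        }
        return new CpuSample(user, system, ioWait, idle);
    }
}

final class CpuUtilization {
    final double userPct;
    final double systemPct;
    final double ioWaitPct;
    final double idlePct;

    private CpuUtilization(double userPct, double systemPct, double ioWaitPct, double idlePct) {
        this.userPct = userPct;
        this.systemPct = systemPct;
        this.ioWaitPct = ioWaitPct;
        this.idlePct = idlePct;
    }

    /** Percentage of the interval between two samples spent in each category. */
    static CpuUtilization between(CpuSample earlier, CpuSample later) {
        long elapsed = later.total() - earlier.total();
        if (elapsed <= 0) {
            throw new IllegalArgumentException("samples must be in order and taken at different times");
        }
        double scale = 100.0 / elapsed;
        return new CpuUtilization(
                (later.user - earlier.user) * scale,
                (later.system - earlier.system) * scale,
                (later.ioWait - earlier.ioWait) * scale,
                (later.idle - earlier.idle) * scale);
    }

    /** Busy time, here taken as user plus system (I/O wait counted as not busy). */
    double busyPct() {
        return userPct + systemPct;
    }
}
```

For example, if a host-wide sample moves from (user 100, system 50, I/O wait 10, idle 840) to (160, 70, 20, 950), the interval covers 200 ticks and the breakdown is 30% user, 10% system, 5% I/O wait, and 55% idle. Calling between on individual per-CPU samples instead of the aggregate gives the finer granularity needed when processes are pinned to specific CPUs.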