Overview
Recently we have been strengthening the stability and reliability of the log system itself. A stable and reliable system is inseparable from monitoring, and by monitoring we mean not only liveness checks on the services but also the collection of core metrics from each component. For this we built a scheduled task execution engine. Now that the general idea and design have taken shape, today I would like to share how we selected and designed the scheduled task mechanism of the log system.
Component run-time monitoring
From the articles I shared before, it is easy to see which components we chose for our log system:
- Collection agent: Flume-NG
- Messaging system: Kafka
- Real-time stream processing: Storm
- Distributed search / log storage (for now): Elasticsearch
This is a common stack for Internet log solutions. However, when we investigated the monitoring options these components provide themselves and the third-party monitoring tools they support, the results were uneven:
- Flume-NG: exposes metrics over HTTP/JMX; supported monitoring tool: Ganglia
- Kafka: exposes metrics over JMX; supported monitoring tool: Yahoo!'s Kafka Manager
- Storm: exposes metrics over JMX, plus the Storm UI
- Elasticsearch: exposes status via HTTP requests
Judging from the above, the components' own monitoring capabilities, and their ability to integrate with third-party monitoring systems, are uneven. This clearly does not meet our expectations. We care about several points:
- Unified monitoring, rather than a heterogeneous mix of tools
- The freedom to configure whichever metrics we consider important and must watch for the stability of the system
- Unified visualization: we want to see the metrics we care about at a glance on our own console
To summarize: although these components differ in monitoring capability, they share one trait. Every component supports metrics requests over at least one of the two protocols, JMX or HTTP.
In fact, unifying monitoring is not hard to do. We could pick the mainstream open-source monitoring tool Zabbix (for JMX metrics collection, Zabbix ships native support: the Java gateway). But for personalized monitoring, such as extracting and presenting specific metrics, Zabbix would need to be customized. For various reasons, we are not adopting a Zabbix-based custom solution for the time being.
JMX Metrics Collection
Because Zabbix provides native support for JMX collection and is itself open-source software, our JMX metrics collection is customized on top of the Zabbix Java gateway.
A quick look at the Zabbix Java gateway: Zabbix has provided native JMX support since version 2.0, and its architecture is very simple.
Working principle: when the Zabbix server wants to know a specific JMX value on a host, it asks the Zabbix Java gateway, and the Java gateway uses the JMX management API to query the particular application. The prerequisite is that the application was started with the -Dcom.sun.management.jmxremote flag to enable JMX queries.
The Zabbix server has a special type of process used to connect to the Java gateway, configured by the StartJavaPollers option. The Java gateway is a standalone Java daemon that acts like a proxy, decoupling Zabbix from the components that expose JMX metrics.
We reused the Java gateway's JMX collection code (the JMXItemChecker.java class), then dumped the collected metrics into our own database for display on the log system's console. Since we did not adopt the whole Zabbix stack, we will not go into unrelated details here.
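As a sketch of what such collection boils down to, the snippet below reads a single MBean attribute through the standard JMX API. It polls the local platform MBean server so that it runs as-is; against a remote component you would first obtain the connection via JMXConnectorFactory.connect(new JMXServiceURL(...)). The class and method names here are illustrative, not the Java gateway's actual code.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

public class JmxPollerSketch {

    // Resolve an MBean object name and read one attribute from it --
    // essentially what the Java gateway's JMXItemChecker does for each item key.
    static Object readAttribute(MBeanServerConnection conn,
                                String objectName,
                                String attribute) throws Exception {
        return conn.getAttribute(new ObjectName(objectName), attribute);
    }

    public static void main(String[] args) throws Exception {
        // Poll the local platform MBean server so the sketch runs as-is.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        CompositeData heap = (CompositeData) readAttribute(
                conn, "java.lang:type=Memory", "HeapMemoryUsage");
        System.out.println("heap.used=" + heap.get("used"));
    }
}
```

The collected value would then be written to the database keyed by component and metric name, which is what our console reads.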
HTTP Metrics Collection
HTTP metrics collection is mainly used to monitor Elasticsearch (because it does not support JMX). We use an HTTP client to send requests and store the collected information in our database as well.
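A minimal sketch of such a poller, using the JDK's built-in java.net.http client against Elasticsearch's _cluster/health endpoint. The address localhost:9200 and the regex-based status extraction are assumptions for illustration; a real poller would use a JSON parser:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EsHealthPoller {

    // Pull the "status" field out of the health response body.
    // Deliberately crude; swap in a JSON library for production use.
    static String extractStatus(String body) {
        Matcher m = Pattern.compile("\"status\"\\s*:\\s*\"(\\w+)\"").matcher(body);
        return m.find() ? m.group(1) : "unknown";
    }

    public static void main(String[] args) throws Exception {
        // Assumed address of a local Elasticsearch node.
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://localhost:9200/_cluster/health")).GET().build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println("cluster status: " + extractStatus(resp.body()));
    }
}
```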
Scheduled Task Framework Selection: Quartz
Quartz is an open-source, powerful, and mainstream framework for executing scheduled tasks. Let's briefly go over a few of its core concepts:
- Job: defines the concrete processing logic of a task
- JobDetail: encapsulates the information the Quartz framework needs to execute a job
- Trigger: controls when a job fires
- JobDataMap: encapsulates the data a job needs during execution
Of course the Quartz framework has many other concepts, but for the purposes of this article these are enough.
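To show how the four concepts fit together without pulling in the Quartz dependency itself (the real API lives in org.quartz: JobBuilder, TriggerBuilder, and so on), here is a toy model using only the JDK. A fixed-rate period stands in for a real Quartz trigger, and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class QuartzConceptsSketch {

    // Job: the concrete processing logic (mirrors org.quartz.Job).
    interface Job { void execute(Map<String, Object> jobDataMap); }

    // JobDetail: what the framework needs to run a job -- identity,
    // group (our "category"), the job itself, and its JobDataMap.
    record JobDetail(String name, String group, Job job,
                     Map<String, Object> jobDataMap) {}

    // Trigger: decides when the job fires; a fixed-rate period stands in
    // for Quartz's cron/simple triggers here.
    static ScheduledExecutorService fire(JobDetail detail, long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> detail.job().execute(detail.jobDataMap()),
                0, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }

    public static void main(String[] args) throws Exception {
        JobDetail detail = new JobDetail("kafka-jmx-poll", "MetricsPoller",
                data -> System.out.println("polling " + data.get("endpoint")),
                Map.of("endpoint", "kafka-host:9999"));
        ScheduledExecutorService ses = fire(detail, 300); // fires now, then every 5 min
        TimeUnit.MILLISECONDS.sleep(200);                 // let the first run happen
        ses.shutdown();
    }
}
```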
Overall design of the scheduled task execution engine
We have introduced the open-source scheduled task framework Quartz, but a framework alone is not enough; we also need to plan, classify, manage, and distribute these tasks.
Types of Scheduled Tasks
For the time being, we divide our scheduled tasks into the following categories:
- Simple offline calculation: OfflineCalc
- Metrics collection: MetricsPoller
- Other routine maintenance tasks of the log system, such as management of daily indices
Metrics collection is the main reason we introduced scheduled tasks, so we will use it as the main thread for introducing our scheduled task execution engine.
Metadata storage and design
Based on the Quartz concepts described above, and the generalization of tasks we want to achieve, we need to think about how to make changes to the scheduled task execution engine more automated and more extensible. This brings us to the metadata management that scheduled task execution requires.
We designed a hierarchical structure, from top to bottom:
- Job category
- Job type
- Job
- Job metadata
- Job trigger
Category
A category divides jobs broadly, for example the OfflineCalc and MetricsPoller categories mentioned above. Quartz has a native notion of job groups, and we use it as the basis for this grouping.
Type
A type defines a kind of task and belongs to a category. A type does more than organize jobs: to some extent it corresponds to a job class, a group of jobs that follow the same processing logic. The JMX and HTTP metrics pollers mentioned above are examples.
Job
A job corresponds to a Quartz job, and its granularity has to be weighed. Take the JMX metrics poller type as an example: if you only need to collect the metrics of one component, a job's granularity can be as fine as fetching a single metric. But if you need to extract many metrics from multiple components, the granularity cannot be that fine; one job may have to be responsible for all the metrics of one component. It depends on your workload and on keeping the number of jobs in the scheduler under reasonable control.
Job metadata
Job metadata stores what a job needs at run time. As mentioned above, a job is an abstract execution unit for a class of identical business logic; individual jobs are not exactly the same, and what distinguishes them is the metadata their execution needs. The job-to-metadata correspondence is one-to-many. For the JMX metrics poller mentioned above, the metadata stores the collection of MBean object attributes a job needs to extract.
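As an illustration, the metadata for one JMX poller job might be stored as a JSON document like the following. The field names and the Kafka MBean shown are hypothetical examples, not our exact schema:

```json
{
  "component": "kafka",
  "jmxUrl": "service:jmx:rmi:///jndi/rmi://kafka-host:9999/jmxrmi",
  "metrics": [
    {
      "objectName": "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
      "attribute": "OneMinuteRate"
    }
  ]
}
```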
Job trigger
A job trigger corresponds to a Quartz Trigger. The job-to-trigger correspondence is one-to-one.
All of the data above can be configured and managed from the console.
Automated Lifecycle Management
To improve the scalability and self-management of the scheduled task execution engine, we chose Zookeeper to store the topology and metadata of the jobs described above.
Zookeeper is an excellent metadata management tool and a very mainstream distributed coordination tool. Its event mechanism makes automated management of the job lifecycle possible: by watching the child znodes of each znode, we can dynamically perceive changes in jobs and nodes, and create or delete jobs accordingly.
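The hierarchy described earlier maps naturally onto a znode tree; a layout along the following lines (the paths and names are hypothetical, for illustration only) lets a watch on each level's children detect additions and removals at that level:

```
/log-system/jobs
    /MetricsPoller                  <- job category (Quartz job group)
        /jmx                        <- job type
            /kafka-broker-metrics   <- job
                /metadata           <- MBean object/attribute list to poll
                /trigger            <- schedule, e.g. a cron expression
        /http
            /es-cluster-health
                /metadata
                /trigger
```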
Summary
In this article, we took the monitoring of the log system's components as a practical requirement and described how we designed our tasks on top of the mainstream scheduled task execution framework Quartz to make them more scalable, while combining it with Zookeeper to give task management automation capabilities.
Scheduled Task Execution Engine for the Log System