A stable and reliable system cannot do without monitoring. We monitor not only whether a service is alive, but also how healthy it is. Health monitoring mainly centers on the core components: acquiring and capturing their metrics, then analyzing and alerting on them.
First, the data to be monitored
Log data that is monitored typically includes:
- Run logs of APP, PC, Web and other systems: collected with Flume-ng
- User logs: collected with Flume-ng
- Back-end server (SOA) logs: collected with Flume-ng
- Metrics of big data components: collected via JMX and HTTP
- MySQL and other database logs: collected with Canal
Different companies have different design requirements here, so I will not go into detail.
Second, component run-time monitoring
- Collection agent: Flume-ng
- Message system: Kafka
- Database message system: MQ
- Real-time stream processing: Storm
- Distributed log storage: HBase
- Distributed Search: Elasticsearch
This stack is a common choice for many Internet log solutions. However, the monitoring interfaces these components expose, and the third-party monitoring tools they support, differ:
- Flume-ng: exposes metrics over HTTP/JMX; supported monitoring tool: Ganglia
- Kafka: exposes JMX metrics; supported monitoring tool: Yahoo! Kafka Manager
- Storm: exposes JMX metrics and ships with the Storm UI
- Elasticsearch: exposes its status over HTTP APIs (see the polling sketch below)
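As a concrete illustration of the HTTP side, here is a minimal polling sketch. It assumes an Elasticsearch node reachable at `es-node-1:9200` (cluster health API) and a Flume-ng agent started with `-Dflume.monitoring.type=http -Dflume.monitoring.port=34545`, which serves its counters as JSON at `/metrics`; the host names and ports are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal HTTP metrics poller: fetch a JSON payload from a component's
// monitoring endpoint. Parsing and alerting would happen downstream.
public class HttpMetricsPoller {

    public static String fetchJson(String endpoint) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(3000);
        conn.setReadTimeout(3000);
        conn.setRequestMethod("GET");

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // Elasticsearch cluster health and Flume-ng HTTP metrics (hosts are placeholders).
        System.out.println(fetchJson("http://es-node-1:9200/_cluster/health"));
        System.out.println(fetchJson("http://flume-agent-1:34545/metrics"));
    }
}
```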
Judging from the above, this clearly falls short of our expectations. We have several points of concern:
- Unified monitoring, rather than a heterogeneous mix of tools
- Easy, stable configuration, so we can freely configure the metrics we consider important
- Unified visualization, so the metrics we care about can be seen at a glance on a single console
To summarize, these components differ in their monitoring capabilities, but they do have one thing in common: metrics can be requested over either HTTP or JMX, and every component supports at least one of the two protocols.
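The JMX side can be polled with the standard javax.management client. The sketch below reads a broker throughput meter from Kafka; the connection URL, the JMX port (9999) and the MBean/attribute names (BrokerTopicMetrics / OneMinuteRate) are assumptions that should be checked against your broker version.

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Minimal remote JMX poller: connect to a broker's JMX port and read
// the one-minute rate of its "MessagesInPerSec" meters.
public class JmxMetricsPoller {

    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://kafka-broker-1:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Matches the aggregate meter and the per-topic meters on the broker.
            ObjectName pattern = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,*");
            Set<ObjectName> names = mbsc.queryNames(pattern, null);

            for (ObjectName name : names) {
                Object rate = mbsc.getAttribute(name, "OneMinuteRate");
                System.out.println(name + " -> " + rate);
            }
        }
    }
}
```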
Third, metadata storage and design
To make data collection general and extensible, and to give the scheduled collection jobs better adaptability and automation, we need to standardize the collected data and design and manage its metadata.
We designed a hierarchical organizational structure from top to bottom:
- Meta Category
- Meta Type
- Meta Source
- Job Metadata
- Job Scheduler
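One way to picture how these five levels combine into a single collection job is shown below; this is a sketch only, and the class and field names are illustrative assumptions, not the actual schema.

```java
// Sketch: a collection job carrying its position in the metadata hierarchy.
public class JobMetadata {
    public String metaCategory; // Meta Category, e.g. application log vs. component metrics
    public String metaType;     // Meta Type, e.g. "jmx" or "http"
    public String metaSource;   // Meta Source, e.g. "kafka-broker-1:9999"
    public String jobId;        // Job Metadata: identifier and parameters of the collection job
    public String schedule;     // Job Scheduler: when and how often the job runs
}
```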
All of the above can be configured and managed from the console. To improve the scalability and self-management of the scheduled jobs, we chose ZooKeeper to store the job topology and metadata. ZooKeeper is an excellent metadata management tool and a very mainstream distributed coordination service; its event (watch) mechanism makes automated management of the job lifecycle possible. By watching the children znodes under each job znode, we can detect jobs being added or removed and dynamically create or delete the corresponding collection tasks.
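A minimal sketch of this watch-driven job management, using the plain ZooKeeper client, is shown below. The znode layout (`/monitor/jobs/<job-id>`) and the connection string are assumptions, and a production setup might use a higher-level client such as Curator instead.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Watches the children of a jobs znode and reacts when jobs are added or removed.
public class JobWatcher implements Watcher {

    private static final String JOBS_PATH = "/monitor/jobs"; // assumed layout; must already exist
    private final ZooKeeper zk;

    public JobWatcher(String connectString) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        this.zk = new ZooKeeper(connectString, 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
    }

    // Register (and later re-register) a children watch on the jobs znode.
    public void watchJobs() throws Exception {
        List<String> jobIds = zk.getChildren(JOBS_PATH, this);
        System.out.println("current jobs: " + jobIds);
        // Here the scheduler would diff jobIds against running tasks and
        // start collection tasks for new znodes / stop tasks for removed ones.
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
            try {
                watchJobs(); // ZooKeeper watches are one-shot, so re-arm after every event
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new JobWatcher("zk-1:2181,zk-2:2181,zk-3:2181").watchJobs();
        Thread.sleep(Long.MAX_VALUE); // keep the demo process alive
    }
}
```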