Hadoop Status Analysis System: Chukwa

Apache's open-source project Hadoop, a distributed storage and computing system, is widely used in industry, and many large enterprises have built their own Hadoop-based applications and extensions. Now that Hadoop clusters of more than 1,000 nodes are common, how can the cluster's own operational information be collected and analyzed? Apache's answer to this problem is Chukwa.

Overview
The Chukwa official website describes it as follows: Chukwa is an open-source data collection system for monitoring large distributed systems. Built on Hadoop's HDFS and Map/Reduce framework, it inherits Hadoop's scalability and robustness. Chukwa also includes a powerful and flexible toolkit for displaying, monitoring, and analyzing the collected data.
Some websites even call Chukwa a "full-stack solution for log processing and analysis".
Sounds appealing, doesn't it?
Let's take a closer look at what Chukwa is, and what it is not.

What Chukwa is not
1. Chukwa is not a standalone system. Deploying Chukwa on a single node is useless: Chukwa is a distributed log processing system built on Hadoop. In other words, before setting up a Chukwa environment, you must first set up a Hadoop environment and then build Chukwa on top of it. This relationship is also visible in Chukwa's architecture, and follows from Chukwa's assumption that the data volume to be processed is on the terabyte scale.
2. Chukwa is not a real-time error monitoring system. Systems such as Ganglia and Nagios already do that job well and can react to data within seconds. Chukwa analyzes data at the minute level: cluster-wide figures such as overall CPU usage become available only after several minutes.
3. Chukwa is not a closed system. Although Chukwa ships with many analysis tasks for Hadoop clusters, this does not mean it can only monitor and analyze Hadoop. Chukwa provides a complete framework for collecting, storing, analyzing, and displaying large volumes of log data, as its architecture makes clear.

What is Chukwa?
The previous section covered what Chukwa is not; now let's look at what it actually does.
Specifically, Chukwa addresses the following needs:
1. Overall, Chukwa can monitor the running state of large Hadoop clusters (2,000+ nodes generating terabytes of data per day) and analyze their logs.
2. For cluster users: Chukwa shows how long their jobs have run, how many resources they consumed, how many resources remain available, why a job failed, and on which node a read or write operation occurred.
3. For cluster operations engineers: Chukwa reveals hardware errors in the cluster, changes in cluster performance, and cluster resource bottlenecks.
4. For cluster managers: Chukwa displays resource consumption and overall job execution, which helps with budgeting and cluster resource planning.
5. For cluster developers: Chukwa highlights the cluster's major performance bottlenecks and common errors, so that effort can be focused on the most important problems.

Basic Architecture
With that general picture in mind, let's look at the architecture. The overall structure of Chukwa is as follows:
[Figure: Chukwa overall architecture]

The main components are:
1. Agents: collect the raw data on each node and send it to the collectors.
2. Adaptors: the interfaces and tools that actually gather the data; one agent manages the data collection of multiple adaptors.
3. Collectors: receive the data sent by agents and periodically write it to the cluster.
4. Map/Reduce jobs: started periodically to classify, sort, deduplicate, and merge the data stored in the cluster.
5. HICC: displays the data.

Related Designs
Adaptors and agents
On each node that produces data (essentially every node in the cluster), Chukwa uses an agent to collect the data it is interested in. Each kind of data is handled by an adaptor, and the data type (DataType) is specified in the corresponding configuration. By default, Chukwa ships with adaptors for common data sources: command-line output, log files, and HTTP senders. These adaptors either run periodically (for example, reading the output of df once a minute) or are event-driven (for example, triggered by an error log in the kernel). If the built-in adaptors are not enough, it is easy to implement your own, along the conceptual lines sketched below.
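As a concrete illustration, the following Java sketch shows what a periodic adaptor does conceptually: run df once a minute, tag the raw output with a data type, and hand the chunk to the agent for forwarding. Every name in it (the class, the BiConsumer-based receiver, the "Df" type tag) is invented for this example; it is not Chukwa's actual Adaptor API.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.function.BiConsumer;

// Conceptual sketch of a periodic adaptor; names are illustrative only.
public class DfAdaptorSketch implements Runnable {
    private final BiConsumer<String, String> chunkReceiver; // (dataType, payload)

    public DfAdaptorSketch(BiConsumer<String, String> chunkReceiver) {
        this.chunkReceiver = chunkReceiver;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Run "df" and capture its output.
                Process p = new ProcessBuilder("df").start();
                StringBuilder out = new StringBuilder();
                try (BufferedReader r = new BufferedReader(
                        new InputStreamReader(p.getInputStream()))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        out.append(line).append('\n');
                    }
                }
                p.waitFor();
                // Tag the raw output with its data type and hand it to the
                // agent, which forwards it to a collector.
                chunkReceiver.accept("Df", out.toString());
                Thread.sleep(60_000); // once a minute, as in the df example above
            }
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

An event-driven adaptor would look the same, except that it would block on its source (for example, tailing a log file) instead of sleeping on a timer.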

To guard against failures of the data collection process itself, the Chukwa agent uses a so-called 'watchdog' mechanism that automatically restarts terminated collection processes, so that no raw data is lost.
On the other hand, duplicate data is automatically removed during Chukwa's later processing stages, so for critical data you can deploy the same agent on several machines to achieve fault tolerance.
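The watchdog idea can be pictured as a simple monitor loop that restarts the collection task whenever it dies. This is an illustrative sketch only; the thread-based structure and the polling interval are assumptions, not Chukwa's implementation.

```java
// Minimal watchdog sketch: restart a collection task whenever it dies.
public class WatchdogSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for an adaptor's collection loop.
        Runnable collectionTask = () -> {
            try {
                while (true) {
                    // ... collect and forward data here ...
                    Thread.sleep(60_000);
                }
            } catch (InterruptedException e) {
                // treat interruption as termination
            }
        };
        Thread worker = new Thread(collectionTask, "adaptor");
        worker.start();
        while (true) {
            Thread.sleep(5_000);          // poll the collection thread
            if (!worker.isAlive()) {      // it terminated unexpectedly...
                worker = new Thread(collectionTask, "adaptor");
                worker.start();           // ...so restart it to avoid data loss
            }
        }
    }
}
```

The downstream deduplication is what makes aggressive restarting (and running duplicate agents on several machines) safe: any replayed or doubled data is removed during processing.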
Collectors
The data collected by agents is stored on the Hadoop cluster. Hadoop is good at handling a small number of large files, while handling a large number of small files is not its strength. To address this, Chukwa introduces the collector role, which partially merges the data before writing it to the cluster, so that large numbers of small files are avoided.
On the other hand, to prevent the collectors from becoming a performance bottleneck or a single point of failure, Chukwa allows and encourages multiple collectors. Each agent randomly selects a collector from the collector list to transmit its data to; if that collector fails or is busy, the agent switches to another. Load balancing is achieved this way, and practice has shown that the load across collectors is nearly even. A sketch of this selection-with-failover logic follows.
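Here is a minimal Java sketch of that logic, assuming a plain list of collector addresses; the class name and the sendTo placeholder are inventions for the example, not Chukwa's agent code.

```java
import java.util.List;
import java.util.Random;

// Agent-side collector selection as described above: pick a collector at
// random, and fail over to another if the send does not succeed.
public class CollectorPicker {
    private final List<String> collectors; // e.g. "host:port" strings; assumed non-empty
    private final Random random = new Random();

    public CollectorPicker(List<String> collectors) {
        this.collectors = collectors;
    }

    /** Try collectors in random order until one accepts the data. */
    public boolean send(byte[] chunk) {
        int start = random.nextInt(collectors.size());
        for (int i = 0; i < collectors.size(); i++) {
            String collector = collectors.get((start + i) % collectors.size());
            if (sendTo(collector, chunk)) {
                return true; // delivered
            }
            // collector failed or busy: fall through and try the next one
        }
        return false; // every collector was unavailable
    }

    private boolean sendTo(String collector, byte[] chunk) {
        // placeholder: a real agent would transmit the chunk over the network here
        return true;
    }
}
```

Random selection spreads agents evenly across collectors, which matches the observation above that collector load is nearly average in practice.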
Demux and archive
The data stored on the cluster is analyzed by Map/Reduce jobs. For this stage, Chukwa provides two built-in job types: demux jobs and archive jobs.
Demux jobs classify, sort, and deduplicate the data. The agents section introduced the notion of a data type (DataType): the data that collectors write to the cluster carries its type, and during a demux job each record is processed by the data processing class that the configuration file specifies for its type. Typically this turns unstructured data into structured data by extracting attributes from it. Because demux is essentially a Map/Reduce job, you can develop your own demux jobs for whatever complex analysis logic you need, and the demux interface Chukwa provides is easily extended in Java.
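As an illustration of the kind of work a demux job does, here is a small mapper written against the plain Hadoop MapReduce API (not Chukwa's own demux processor interface) that structures an unstructured log line; the tab-separated "level, message" input layout is a made-up example.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Demux-style mapper: extract attributes from an unstructured log line and
// emit them as a structured (key, value) record.
public class LogLineDemuxSketch
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical input layout: "<level>\t<message>".
        String[] fields = line.toString().split("\t", 2);
        if (fields.length == 2) {
            // key = extracted attribute (log level), value = the message body
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
        // malformed lines are simply dropped in this sketch
    }
}
```

In a real Chukwa deployment the processing class for each data type is named in the configuration file, as described above; this sketch hard-codes one parsing rule for brevity.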
Archive jobs merge data files of the same type. On the one hand, this keeps data of the same type together for further analysis; on the other hand, it reduces the number of files and thus the storage pressure on the Hadoop cluster.
Dbadmin
The data stored on the cluster satisfies the needs of long-term storage and large-scale computation, but it is not convenient to display directly. Chukwa therefore takes two approaches:
1. The MDL language is used to extract data from the cluster into a MySQL database. Data from the past week is kept in full; older data is "diluted" over time, so the older the data, the coarser the interval at which it is retained (a sketch of this idea follows below). MySQL then serves as the data source for display.
2. HBase or a similar technology is used to store indexed data directly on the cluster.
Up to Chukwa version 0.4.0, the first approach is used, but the second is the more elegant and convenient one.
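The dilution policy can be pictured as choosing a retention interval based on the age of each sample. The thresholds and intervals below are invented for illustration; Chukwa's actual schedule may differ.

```java
import java.time.Duration;

// Illustrative sketch of time-based "dilution": the older a sample is, the
// coarser the interval at which it is retained.
public class DilutionSketch {

    /** Return the retention interval appropriate for data of a given age. */
    static Duration retentionInterval(Duration age) {
        if (age.toDays() < 7) {
            return Duration.ofMinutes(5);   // last week: keep full resolution
        } else if (age.toDays() < 30) {
            return Duration.ofHours(1);     // assumed: hourly beyond a week
        } else {
            return Duration.ofDays(1);      // assumed: daily beyond a month
        }
    }

    /** Keep a sample only if it lands on the retention grid for its age. */
    static boolean keep(long sampleEpochSeconds, long nowEpochSeconds) {
        Duration age = Duration.ofSeconds(nowEpochSeconds - sampleEpochSeconds);
        long interval = retentionInterval(age).getSeconds();
        return sampleEpochSeconds % interval == 0;
    }
}
```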
HICC
HICC is the name of Chukwa's data presentation end. HICC provides a set of default presentation widgets, such as "list", "graph", "multi-graph", "column chart", and "area chart", which can present one or more kinds of data and make trends easy to see at a glance. Moreover, HICC applies a round-robin strategy to the continuously arriving new data and to historical data, which both keeps the load on the server from growing without bound and "dilutes" the data along the timeline, making it practical to present data over long periods.
In essence, HICC is a web server implemented with Jetty, built internally on JSP and JavaScript. The various data views and page layouts can be arranged by simple drag-and-drop. More complex presentations can combine the required data with SQL, and if even that does not meet your needs, you can modify the JSP code directly.
Other data interfaces
If you have new requirements on the raw data, you can also access it directly on the cluster through Map/Reduce jobs or the Pig language to produce the results you want. Chukwa also provides command-line interfaces for direct access to the data on the cluster.
Default data support
For Hadoop-related data such as CPU usage, memory usage, disk usage, cluster-average CPU usage, overall cluster memory usage, overall cluster storage usage, changes in the number of files in the cluster, and changes in the number of jobs, Chukwa provides built-in support all the way from collection to display; it only needs to be configured before use, which is quite convenient.
As you can see, Chukwa provides comprehensive support for the entire data life cycle, from generation and collection through storage and analysis to presentation.
