Hadoop Series Six: Data Collection and Analysis System


Earlier articles in this series covered the deployment of Hadoop distributed storage and computing clusters, ZooKeeper clusters, and HBase distributed deployments. When a Hadoop cluster grows to 1000+ nodes, the amount of information the cluster produces about itself increases dramatically. To process this data, Apache developed Chukwa, an open source data collection and analysis system. Chukwa has several very attractive features: a clear architecture that is easy to deploy; support for a wide range of data types and good scalability; and seamless integration with Hadoop for gathering and collating massive amounts of data.

1 Introduction to Chukwa

-------------------------------------------------- ------------------------------

Chukwa is described on its website, https://chukwa.apache.org/, as an open source data collection system for monitoring large distributed systems. It is built on HDFS and the MapReduce framework and inherits Hadoop's excellent scalability and robustness. For data analysis, Chukwa provides a flexible and powerful set of tools for monitoring and analyzing results, so the collected data can be put to better use.

To show what Chukwa does more simply and intuitively, consider a hypothetical scenario. Suppose we run a large website (large sites almost always involve a lot of Hadoop...). The site produces a huge number of log files every day, and collecting and analyzing them is not an easy task. Readers may think that Hadoop is well suited to this kind of work, and indeed many large sites use it that way. But then the questions start: how do we collect data scattered across many nodes, how do we handle duplicates in the collected data, and how do we integrate all of this with Hadoop? Writing our own code for this pipeline would take a lot of effort and would inevitably introduce bugs. This is where Chukwa comes in. Chukwa is open source software to which many capable developers have contributed their wisdom. It can monitor log file changes on every node in real time and append the new content to HDFS, and it can also deduplicate, sort, and merge the data. By the time Hadoop reads the files from HDFS, they are already SequenceFiles that need no further conversion; Chukwa handles the complicated steps in between. Log files are only one example of its use: Chukwa can also collect data from sockets, or periodically run a specified command and capture its output, and so on; see the official Chukwa documentation for details. If none of that is enough, we can write our own adapters to implement more advanced functionality.

2 Chukwa's architecture

-------------------------------------------------- ------------------------------

Chukwa aims to provide a flexible and powerful platform for distributed data collection and large-scale data processing, one that is not only usable now but can also evolve as newer storage technologies such as HDFS and HBase mature. To maintain this flexibility, Chukwa is designed as a pipeline of collection and processing stages with clearly defined, narrow interfaces between them.

The main components are:

1. Agents: collect the raw data and send it to the Collectors.

2. Adapters: the interfaces and tools that actually acquire the data; one Agent can manage multiple Adapters.

3. Collectors: gather the data sent by the Agents and periodically write it to the cluster.

4. MapReduce jobs: started on a schedule; responsible for classifying, sorting, deduplicating, and merging the data in the cluster.

5. HICC (Hadoop Infrastructure Care Center): responsible for displaying the data.

3 Design of the main components

-------------------------------------------------- ------------------------------

3.1 Adapters, Agents

-------------------------------------------------- ------------------------------

Wherever data is generated (essentially on every node in the cluster), Chukwa runs an Agent to collect the data it is interested in. Each type of data is handled by an Adapter, which maps it to the corresponding data model. Out of the box, Chukwa provides adapters for common data sources such as command-line output, log files, httpSender, and so on. These adapters either run on a schedule (for example, reading the output of df every minute) or are event driven (for example, reacting to an error entry in a kernel log). If the built-in adapters are not enough, users can easily implement their own to meet their needs.
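As an illustrative sketch, an adapter can also be registered on a running agent through its control port (9093 by default); the data type FooData and the file /tmp/foo below are placeholder names, not part of the original walkthrough:

telnet localhost 9093
add filetailer.FileTailingAdaptor FooData /tmp/foo 0
list
close

Here add registers a file-tailing adaptor for the given file, list shows the adaptors currently running on this agent, and close ends the control session.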

To guard against failures of the Agent itself on the data collection side, Chukwa's Agent uses a so-called 'watchdog' mechanism that automatically restarts a terminated data collection process so that no original data is lost.

As for duplicate data, Chukwa automatically deduplicates during data processing, so fault tolerance for critical data can be achieved simply by deploying the same Agent on multiple machines.

3.2 Collectors

-------------------------------------------------- ------------------------------

The data collected by the Agents is stored on the Hadoop cluster. Hadoop is good at handling a small number of large files, but processing a large number of small files is not its strength. For this reason, Chukwa introduces the Collector role, which partially merges the data before writing it to the cluster and thus avoids writing a large number of small files.

To prevent a Collector from becoming a performance bottleneck or a single point of failure, Chukwa allows and encourages running multiple Collectors. Agents randomly pick a Collector from the collectors list to transfer data to, and if that Collector fails or becomes busy, they switch to another. This balances the load, and in practice the load across multiple Collectors turns out to be nearly even.

3.3 demux, archive

-------------------------------------------------- ------------------------------

The data placed on the cluster is analyzed by MapReduce jobs; Chukwa provides two built-in job types, demux and archive.

The demux job is responsible for classifying, sorting, and deduplicating the data. The Agent section introduced the notion of a data type (DataType), and the data the Collectors write into the cluster carries its own type. During demux processing, the data type and the processing class specified in the configuration file determine which analysis is run; in general, unstructured data is given structure and the relevant attributes are extracted. Since demux is essentially a MapReduce operation, we can write our own demux jobs for whatever complex analysis logic we need; the demux interface Chukwa provides is easy to extend in Java.

The archive job merges data files of the same type. On the one hand this keeps data of the same type together for further analysis, and on the other hand it reduces the number of files and relieves storage pressure on the Hadoop cluster.

3.4 dbadmin

-------------------------------------------------- ------------------------------

Although storing data on the cluster meets the needs of long-term storage and large-scale computation, it is not convenient for display. To address this, Chukwa has made two efforts:

1. Use the MDL language to extract cluster data into a MySQL database. Data from the past week is kept in full; older data is thinned out according to its age, so the older the data, the longer the interval between preserved samples. MySQL then serves as the data source for display.

2. Use HBase or a similar technology to index the data stored on the cluster directly.

Up to version 0.4.0, Chukwa used the first approach, but the second is more elegant and more convenient.

3.5 hicc

-------------------------------------------------- ------------------------------

hicc is the name of Chukwa's data presentation side. For display, Chukwa provides a set of default widgets that can show one or more types of data as lists, curves, multi-curves, bar charts, or area charts, giving an intuitive view of trends in the data. On the hicc side, a round-robin strategy is applied to continuously generated new data and to historical data: to keep ever-growing data from increasing the load on the server, the data is "diluted" along the timeline, which makes it possible to display data over long periods.

In essence, hicc is a web server implemented with Jetty, using JSP and JavaScript internally. The data types to display and the page layout can be arranged simply by dragging and dropping. For more complex displays, SQL can be used to combine the data as needed, and if even that is not enough, the JSP code can be modified directly.

3.6 Other data interface

-------------------------------------------------- ------------------------------

As new needs for the raw data arise, users can also access the raw data on the cluster directly through MapReduce jobs or the Pig language to produce the desired results. Chukwa also provides a command-line interface for direct access to the data on the cluster.

3.7 default data support

-------------------------------------------------- ------------------------------

For Hadoop-related data such as CPU usage, memory usage, disk usage, the cluster's average CPU usage, the cluster's overall memory and storage usage, the number of files in the cluster, and the number of jobs, Chukwa has built-in support for the whole pipeline from collection to display; it only needs to be configured to be used, which is quite convenient.

As you can see, Chukwa provides full support for the entire life cycle of data: generation, collection, storage, and analysis.

4 What is Chukwa?

-------------------------------------------------- ------------------------------

4.1 What Chukwa is not

-------------------------------------------------- ------------------------------

1. Chukwa is not a stand-alone system; deploying it on a single node is essentially useless. Chukwa is a distributed log processing system built on top of Hadoop. In other words, you need a working Hadoop environment first and then build Chukwa on top of it. This follows from Chukwa's assumption that the amount of data to be processed is on the terabyte scale.

2. Chukwa is not a real-time error monitoring system. Systems such as Ganglia and Nagios already do this well, with data freshness on the order of seconds. The data Chukwa analyzes is at minute granularity; it considers a delay of a few minutes in data such as the cluster's overall CPU usage to be acceptable.

3. Chukwa is not a closed system. Although it ships with many analyses for Hadoop clusters, this does not mean it can only monitor and analyze Hadoop. Chukwa provides a complete solution and framework for collecting, storing, analyzing, and presenting large volumes of log data, with near-complete support for every stage of the data life cycle, as its architecture shows.

4.2 What is chukwa?

-------------------------------------------------- ------------------------------

The previous section said a lot about what Chukwa is not; so what exactly is it? Specifically, Chukwa is devoted to the following areas of work:

1. Overall, Chukwa can be used to monitor the overall health of a Hadoop cluster and to analyze its logs at large scale (2000+ nodes producing terabytes of data per day).

2. For cluster users: Chukwa shows how long their jobs have run, how many resources they consumed, how many resources are still available, why a job failed, and on which node a read or write had a problem.

3. For cluster operations and maintenance engineers: Chukwa shows hardware errors in the cluster, changes in cluster performance, and where the cluster's resource bottlenecks are.

4. For cluster managers: Chukwa shows the cluster's resource consumption and overall job execution, and can be used to assist in budgeting and coordinating cluster resources.

5. For cluster developers: Chukwa shows the cluster's main performance bottlenecks and frequently occurring errors, so they can focus on the key issues.

5 Chukwa's deployment and configuration

-------------------------------------------------- ------------------------------

5.1 Preparation

-------------------------------------------------- ------------------------------

Chukwa is deployed on top of a Hadoop cluster, so a Hadoop cluster must be installed and deployed in advance, including passwordless SSH login and the JDK. For details, please refer to the earlier posts in this series on Hadoop cluster setup.

The Hadoop cluster consists of 1 Master, 1 Backup (standby), and 3 Slaves (created as virtual machines). The node IP addresses are:

rango (Master) 192.168.56.1 namenode

vm1 (Backup) 192.168.56.101 secondarynode

vm2 (Slave1) 192.168.56.102 datanode

vm3 (Slave2) 192.168.56.103 datanode

vm4 (Slave3) 192.168.56.104 datanode

5.2 Install Chukwa

-------------------------------------------------- ------------------------------

Only the source package chukwa-incubating-src-0.5.0.tar.gz can be downloaded from the official mirror page http://www.apache.org/dyn/closer.cgi/incubator/chukwa/chukwa-0.5.0; the latest binary release, chukwa-incubating-0.5.0.tar.gz, can be downloaded from http://people.apache.org/~eyang/chukwa-0.5.0-rc0/.

Unzip it, then rename it and move it to the /usr directory:

tar zxvf chukwa-incubating-0.5.0.tar.gz; mv chukwa-incubating-0.5.0 /usr/chukwa

A copy of Chukwa needs to be kept on every node that is to be monitored (that is, every node from which information is collected), and each of these nodes runs an agent (in this setup every node also acts as a collector). After the configuration is complete, use the scp command to copy the installation to all nodes in the cluster.

5.3 Configure Chukwa

-------------------------------------------------- ------------------------------

5.3.1 Configure Environment Variables

-------------------------------------------------- ------------------------------

Edit /etc/profile and add the following lines:

# set chukwa path
export CHUKWA_HOME=/usr/chukwa
export CHUKWA_CONF_DIR=/usr/chukwa/etc/chukwa
export PATH=$PATH:$CHUKWA_HOME/bin:$CHUKWA_HOME/sbin:$CHUKWA_CONF_DIR
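To check that the variables have taken effect (a quick sanity check, assuming a bash shell):

source /etc/profile
echo $CHUKWA_HOME    # should print /usr/chukwa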

5.3.2 Configure Hadoop and HBase Clusters

-------------------------------------------------- ------------------------------

First copy Chukwa's configuration and library files into the Hadoop installation:

mv $HADOOP_HOME/conf/log4j.properties $HADOOP_HOME/conf/log4j.properties.bak
mv $HADOOP_HOME/conf/hadoop-metrics2.properties $HADOOP_HOME/conf/hadoop-metrics2.properties.bak
cp $CHUKWA_CONF_DIR/hadoop-log4j.properties $HADOOP_HOME/conf/log4j.properties
cp $CHUKWA_CONF_DIR/hadoop-metrics2.properties $HADOOP_HOME/conf/hadoop-metrics2.properties
cp $CHUKWA_HOME/share/chukwa/chukwa-0.5.0-client.jar $HADOOP_HOME/lib
cp $CHUKWA_HOME/share/chukwa/lib/json-simple-1.1.jar $HADOOP_HOME/lib

Then start the HBase cluster and set it up for Chukwa: the tables needed for data storage are created in HBase simply by importing the bundled schema into the hbase shell:

bin/hbase shell < $CHUKWA_CONF_DIR/hbase.schema
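To confirm the schema was imported, the tables can be listed from the HBase installation directory (an illustrative check):

echo "list" | bin/hbase shell    # the Chukwa tables should appear in the output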

5.3.3 Configuring a Collector

-------------------------------------------------- ------------------------------

Set Chukwa's environment variables by editing the $CHUKWA_CONF_DIR/chukwa-env.sh file:

export JAVA_HOME=/usr/java/jdk1.7.0_45
#export HBASE_CONF_DIR="${HBASE_CONF_DIR}"
#export HADOOP_CONF_DIR="${HADOOP_CONF_DIR}"
#export CHUKWA_LOG_DIR=/tmp/chukwa/log
#export CHUKWA_DATA_DIR="${CHUKWA_HOME}/data"

Note: set JAVA_HOME on the first line and comment out the other four lines. HBASE_CONF_DIR and HADOOP_CONF_DIR are commented out because the agent is only used to collect data and does not need Hadoop itself. Also comment out CHUKWA_PID_DIR and CHUKWA_LOG_DIR; otherwise they point into the /tmp temporary directory, where the PID and log files may be deleted for no apparent reason and cause problems in later operations. Once they are commented out, the defaults are used and the PID and log files are created under the Chukwa installation directory.

When more than one machine is required as a collector, you need to edit the $CHUKWA_CONF_DIR/collectors file:

192.168.56.1

192.168.56.101

192.168.56.102

192.168.56.103

192.168.56.104

The $CHUKWA_CONF_DIR/initial_adaptors file specifies which logs Chukwa monitors and how, and at what frequency, they are monitored. The default configuration can be used as-is:

add sigar.SystemMetrics SystemMetrics 60 0

add SocketAdaptor HadoopMetrics 9095 0

add SocketAdaptor Hadoop 9096 0

add SocketAdaptor ChukwaMetrics 9097 0

add SocketAdaptor JobSummary 9098 0
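For example, to tail an extra log file, a line in the same "add <adaptor class> <data type> <params> <offset>" format can be appended; the SysLog data type and the /var/log/messages path below are only illustrative choices:

add filetailer.CharFileTailingAdaptorUTF8 SysLog /var/log/messages 0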

$CHUKWA_CONF_DIR/chukwa-collector-conf.xml maintains Chukwa's basic collector configuration. The HDFS location needs to be set in this file:

<property>
  <name>writer.hdfs.filesystem</name>
  <value>hdfs://192.168.56.1:9000/</value>
  <description>HDFS to dump to</description>
</property>

Then the data sink directory and the port the collector listens on can be specified with the following settings:

<property>
  <name>chukwaCollector.outputDir</name>
  <value>/chukwa/logs/</value>
  <description>chukwa data sink directory</description>
</property>

<property>
  <name>chukwaCollector.http.port</name>
  <value>8080</value>
  <description>The HTTP port number the collector will listen on</description>
</property>

Note: /chukwa/logs/ is a path in HDFS. By default the collector listens on port 8080, but this can be changed; every agent sends its data to this port.
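Once a collector is running, a quick way to check that it is listening is to hit its servlet with the ping parameter (illustrative, assuming the default port configured above):

curl 'http://192.168.56.1:8080/chukwa?ping=true'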

5.3.4 Configuring Agent

-------------------------------------------------- ------------------------------

Edit the $CHUKWA_CONF_DIR/agents file:

192.168.56.1

192.168.56.101

192.168.56.102

192.168.56.103

192.168.56.104

The $CHUKWA_CONF_DIR/chukwa-agent-conf.xml file maintains the agent's basic configuration. The most important setting is the cluster name, which identifies the cluster being monitored; this value is stored in every collected chunk and is used to distinguish data from different clusters, for example cluster="chukwa":

<property>
  <name>chukwaAgent.tags</name>
  <value>cluster="chukwa"</value>
  <description>The cluster's name for this agent</description>
</property>

chukwaAgent.checkpoint.dir is the directory where Chukwa periodically writes checkpoints for the running adaptors. It must not be a shared directory: it can only be a local directory, not a directory on a network file system:

<property>
  <name>chukwaAgent.checkpoint.dir</name>
  <value>${CHUKWA_LOG_DIR}/</value>
  <description>the location to put the agent's checkpoint file(s)</description>
</property>

5.4 Pig data analysis tool installation and configuration

-------------------------------------------------- ------------------------------

Pig is a scripting language for exploring large data sets. It is an extension of the Hadoop project that simplifies Hadoop programming (enormously so) and provides a higher level of abstraction for data processing while keeping Hadoop simple and reliable. Pig is a data flow processing language on top of HDFS and MapReduce: it translates data flow operations into a series of map and reduce functions, freeing programmers from writing that code themselves.

Pig consists of two parts:

The language used to describe the data stream, called Pig Latin;

And the execution environment for running Pig Latin programs;


5.4.1 Installation

-------------------------------------------------- ------------------------------

Pig has two installation modes. One is local mode, essentially a stand-alone mode: Pig can only access the local host, there is nothing distributed about it, and you do not even need Hadoop installed; all commands run and all files are read and written locally, which is commonly used for experiments. The other is MapReduce mode, the working mode of real applications: files can be uploaded to HDFS, and when a Pig Latin job runs, the work is distributed across the Hadoop cluster, which reflects the MapReduce idea and lets us connect to the Hadoop cluster through the Pig client for data management and analysis. The MapReduce installation mode is described below, and it is also the mode this article requires.


Download and unzip a stable version of Pig (choosing one compatible with your Hadoop version), then rename it and move it to the /usr directory to complete the installation.

PS: Pig only needs to be installed on the master node.

5.4.2 Configuration

-------------------------------------------------- ------------------------------

Edit /etc/profile:

# set pig path
export PIG_HOME=/usr/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR:$HBASE_CONF_DIR
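After reloading the profile, a simple (illustrative) check that Pig can reach the cluster is to start it in MapReduce mode:

source /etc/profile
pig -x mapreduce    # should connect to HDFS and drop into the grunt> shell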

5.4.3 Binding with Hadoop and HBase

-------------------------------------------------- ------------------------------

Package the HBase configuration directory (HBASE_CONF_DIR) into a jar file:

jar cf $CHUKWA_HOME/hbase-env.jar $HBASE_CONF_DIR

Create a recurring analysis job that runs every ten minutes:

echo "*/10 * * * * pig -Dpig.additional.jars=${HBASE_HOME}/hbase-0.96.1.1.jar:${HBASE_HOME}/lib/zookeeper-3.4.5.jar:${PIG_HOME}/pig-0.12.0.jar:${CHUKWA_HOME}/hbase-env.jar ${CHUKWA_HOME}/script/pig/ClusterSummary.pig > /dev/null 2>&1" >> /etc/crontab
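To confirm the line was appended with the variables expanded as intended:

tail -n 1 /etc/crontab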

5.5 Copy Chukwa to other cluster nodes and configure the environment variables for each node

-------------------------------------------------- ------------------------------

scp -r /usr/chukwa <node ip>:/usr
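For the node layout in section 5.1, this can be wrapped in a loop (an illustrative sketch that assumes the root account is used on every node):

for ip in 192.168.56.101 192.168.56.102 192.168.56.103 192.168.56.104; do
  scp -r /usr/chukwa root@$ip:/usr
done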

5.6 Running Chukwa

-------------------------------------------------- ------------------------------

Restart Hadoop and HBase, then run the following on a single node, such as the master node of the Hadoop cluster:

Start the collectors: start-collectors.sh (starts a collector on every node registered in the collectors file)

Stop the collectors: stop-collectors.sh (stops the collectors on every node registered in the collectors file)

Start the agents: start-agents.sh (starts an agent on every node registered in the agents file)

Stop the agents: stop-agents.sh (stops the agents on every node registered in the agents file)

Start HICC: chukwa hicc

Once started, HICC can be accessed in a browser at http://<hostname>:<port>/hicc

The default port is 4080;

The default username and password are: admin

These defaults can be changed by editing jetty.xml under /WEB-INF/ inside the $CHUKWA_HOME/webapps/hicc.war file as needed.

Start-up process summary:

1) Start Hadoop and HBase

2) Start Chukwa: sbin/start-chukwa.sh

3) Start HICC: bin/chukwa hicc

5.7 Problem solving

-------------------------------------------------- ------------------------------

When running the Chukwa collector, the following error appears:

cat /root/share/chukwa/VERSION: No such file or directory

Solution: edit the $CHUKWA_HOME/libexec/chukwa-config.sh file

and change lines 30 and 31 from:

# the root of the Chukwa installation
export CHUKWA_HOME=`pwd -P ${CHUKWA_LIBEXEC}/..`

to:

# the root of the Chukwa installation
export CHUKWA_HOME=/usr/chukwa

where /usr/chukwa is the actual Chukwa installation path.

5.8 Basic command introduction

-------------------------------------------------- ------------------------------

bin/chukwa agent: start the local agent

bin/chukwa agent stop: stop the local agent

bin/chukwa collector: start the local collector

bin/chukwa collector stop: stop the local collector

bin/chukwa archive: run the archive job periodically; it packs the data into SequenceFiles and removes duplicate content

bin/chukwa archive stop: stop the archive job

bin/chukwa demux: start the demux manager, which is equivalent to starting a MapReduce job; the default processor is TsProcessor, and we can also define our own data processing modules, as mentioned later

bin/chukwa demux stop: stop the demux manager

bin/chukwa dp: start the demux post-processor, which periodically sorts and merges files and eliminates redundant data

bin/chukwa dp stop: stop the dp operation

bin/chukwa hicc: start HICC, which works like a portal and presents the data graphically

slaves.sh

The slaves.sh command is very useful, especially when you have many nodes. For example, with 50 nodes, how do you create a directory abc on each of them? Creating it machine by machine would be far too tedious. Fortunately slaves.sh can do it for us: bin/slaves.sh mkdir /home/hadoop/abc creates the corresponding directory on every node.

start-agents.sh

This command will start all agents registered in the agents file

start-collectors.sh

This command starts all Collectors registered in the collectors file

stop-agents.sh

This command will stop all agents registered in the agents file

stop-collectors.sh

This command will stop all Collectors registered in the collectors file

start-data-processors.sh

This command is a combination of the following three commands:

bin/chukwa archive

bin/chukwa demux

bin/chukwa dp

It starts all three of them, so they do not have to be started one by one.

stop-data-processors.sh

Correspondingly, this stops the archive, demux, and dp services.

6 Summary

-------------------------------------------------- ------------------------------

This article described how to set up a Chukwa cluster, including installing and configuring the data analysis tool Pig. Chukwa's design concept is very simple, its structure is clear and easy to understand, and it is an open source product, so we can build more powerful features on top of it. That will be the subject of follow-up work.
