Application of Chukwa in data collection and processing

Source: Internet
Author: User

Chukwa Introduction

Put simply, Chukwa is a data collection system: it gathers all kinds of data into files that Hadoop can consume, so that Hadoop can run MapReduce jobs over them. Chukwa also ships with a number of built-in features that help us collect and collate data.

Chukwa Application Scenario Introduction

To make Chukwa more concrete, let's start with a hypothetical scenario. Suppose we run a large website (anything involving Hadoop tends to be large) that produces a huge volume of log files every day. Collecting and analyzing these logs is not easy. Readers may think that Hadoop is well suited to this kind of work, and many large sites do use it. But that raises some questions: how do we collect data scattered across many nodes, how do we handle duplicate data, and how do we get the collected data into a form that integrates with Hadoop? Writing our own code for all of this takes a lot of effort and inevitably introduces bugs. This is where Chukwa comes in. Chukwa is open source software with many capable developers contributing to it. It can monitor log files for changes on each node in real time, incrementally write the new content to HDFS, and also deduplicate and sort the data. By the time Hadoop reads the files from HDFS they are already SequenceFiles, so no conversion is needed; Chukwa has taken care of the tedious parts.

Log files are just one example. Chukwa can also collect data from a socket, or run a command we specify and capture its output; see the official Chukwa documentation for details. If that is still not enough, we can define our own adaptors for more advanced needs. Later in this article we will see how to do that.

Architecture Design of Chukwa

So far we have described Chukwa's uses only in general terms, which may not make clear how it actually does its job. Let's look at it from the perspective of its architecture, starting with a diagram.
Figure 1: Architecture diagram

Let's look at the example of the log we just mentioned.

First, two key terms:

Agent

An Agent is the program responsible for collecting data on each node. An Agent is in turn made up of several adaptors. The adaptors run inside the Agent process and perform the actual data collection, while the Agent is responsible for managing the adaptors.

Collector

A Collector gathers the data sent by the various Agents and writes it to HDFS.

With these two terms in mind, readers probably already have a rough picture of the data flow. It really is that simple: data is collected by the Agents and sent to a Collector, the Collector writes it to HDFS, and MapReduce jobs then process it.

Chukwa environment setup and deployment

This section describes how to install, deploy, and use Chukwa.

1. Prerequisites
A Linux environment (Red Hat is used here), JDK 1.6, and SSH support on the system.

2. Download Chukwa
Here is the download address on one of the mirrors:
wget http://mirror.bjtu.edu.cn/apache/hadoop/chukwa/chukwa-0.4.0/chukwa-0.4.0.tar.gz
This link is for Chukwa 0.4.0. Other versions can be downloaded from the official website:
http://incubator.apache.org/chukwa/

3. Download Hadoop
Downloading and installing Hadoop is not the focus of this article, so it is omitted here.
At the time of writing, 0.20.2 is the more stable version. Version 0.21.0, which is under development, changed the jar structure and the configuration and is incompatible with the current version of Chukwa, so we recommend the stable version of Hadoop.

4. Unpack
tar -xzf chukwa-0.4.0.tar.gz
tar -xzf hadoop-0.20.2.tar.gz
After decompression, we assume the directory names are chukwa-0.4.0 and hadoop-0.20.2 respectively.

Configuration of Hadoop

Refer to the tutorial on the official Hadoop website; for reasons of space we omit it here.

Configuration of Chukwa

Here we will complete the configuration of the agent and collector in the simplest steps, so that the reader can quickly understand it.

Configure the Agent

Edit the $CHUKWA_HOME/conf/chukwa-env.sh file. Set JAVA_HOME here, and comment out HADOOP_HOME and HADOOP_CONF_DIR: the Agent only collects data, so it does not need Hadoop. Also comment out CHUKWA_PID_DIR and CHUKWA_LOG_DIR. If they are left uncommented, they point to locations under the /tmp temporary directory, where the PID and log files may be deleted unexpectedly, causing errors in later operations. Once they are commented out, the default paths are used and the PID and log files are created under the Chukwa installation directory.

Edit the $CHUKWA_HOME/conf/collectors file. List the collector addresses here in the format http://hostname:port/, one per line. An Agent normally picks one collector at random and sends its data there; if that collector fails, the Agent moves on and tries the next one. This gives a degree of load balancing and avoids the situation where every Agent writes to the same collector and overloads it.

Edit the $CHUKWA_HOME/conf/initial_adaptors file. Chukwa ships with a template named initial_adaptors.template; rename it to initial_adaptors. It contains a few example adaptors that are easy to follow. As the name suggests, this file lists the initial adaptors, which start working when the Agent starts.
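The collector selection described above (pick one at random, move on to the next if it fails) can be illustrated with a small standalone Java sketch. This is not Chukwa's own code: the class and method names, the host names host1 and host2, and the reachability check are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical illustration of random collector choice with failover.
public class CollectorFailoverSketch {

    /** Parses collectors-file content: one URL per line, blank lines ignored. */
    public static List<String> parseCollectors(String fileContent) {
        List<String> urls = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    /**
     * Shuffles the collectors, then returns the first one that passes the
     * caller-supplied reachability check, or null if none does.
     */
    public static String pickCollector(List<String> collectors,
                                       Predicate<String> isReachable,
                                       Random rng) {
        List<String> shuffled = new ArrayList<>(collectors);
        Collections.shuffle(shuffled, rng);
        for (String url : shuffled) {
            if (isReachable.test(url)) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up collectors file content in the http://hostname:port/ format.
        List<String> collectors =
                parseCollectors("http://host1:8080/\n\nhttp://host2:8080/\n");
        // Pretend host1 is down, so the choice must fall through to host2.
        String chosen = pickCollector(collectors, url -> !url.contains("host1"), new Random());
        System.out.println("Sending to " + chosen);
    }
}
```

Shuffling the list once and walking it in order gives both behaviours at the same time: different agents start with different collectors, and each agent still tries every collector before giving up.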

Configure the Collector

Edit the $CHUKWA_HOME/conf/chukwa-env.sh file and set JAVA_HOME, HADOOP_HOME, and HADOOP_CONF_DIR to appropriate values. As with the Agent, comment out CHUKWA_PID_DIR and CHUKWA_LOG_DIR.

Start Hadoop

bin/start-all.sh

Start the Collector

bin/chukwa collector

Start the Agent

bin/chukwa agent

You should see results like the following:
Listing 1. Agent-side log fragment

 2010-12-23 10:20:28,315 INFO Timer-1 ExecAdaptor - calling exec
 2010-12-23 10:20:28,377 INFO Timer-1 ExecAdaptor - calling exec
 2010-12-23 10:20:28,438 INFO Timer-1 ExecAdaptor - calling exec
 2010-12-23 10:20:28,451 INFO HTTP post thread ChukwaHttpSender - collected chunks for post_26923
 2010-12-23 10:20:28,452 INFO HTTP post thread ChukwaHttpSender - >>>>>> HTTP post_26923 to http://xi-pli:8080/ length = 17788
 2010-12-23 10:20:28,459 INFO HTTP post thread ChukwaHttpSender - >>>>>> HTTP got success back from http://xi-pli:8080/chukwa response length 924
 2010-12-23 10:20:28,459 INFO HTTP post thread ChukwaHttpSender - post_26923 sent 0 chunks, got back ACKs
 2010-12-23 10:20:28,500 INFO Timer-1 ExecAdaptor - calling exec

From this we can see that the Timer-1 ExecAdaptor "calling exec" task runs on a schedule, and that the Agent has sent its data to the Collector we designated:
Listing 2. Collector-side logs

				
 2010-12-23 10:30:22,207 INFO Timer-4 SeqFileWriter - rotating sink file /chukwa/logs/201023102522181_xipli_15db999712d1106ead87ffe.chukwa
 2010-12-23 10:30:22,784 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:15,numberchunks:1110
 2010-12-23 10:30:23,220 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=797670 dataRate=26587

From the log we can see that the Collector has written the collected data, through its writer, to the file /chukwa/logs/201023102522181_xipli_15db999712d1106ead87ffe.chukwa

So how do we know the data has really been written to HDFS? We can inspect HDFS with a Hadoop command:
Listing 3. Viewing logs

 bin/hadoop fs -ls /chukwa/logs

If nothing went wrong, we can see that the data has been written to HDFS.
Listing 4. checking Files

				
 Found 205 items
 -rw-r--r--   3 hadoop supergroup  676395 2010-12-22 17:12 /chukwa/logs/201022171225851_xipli_18f1ec1212d0d4c13067ffa.done
 -rw-r--r--   3 hadoop supergroup 6046366 2010-12-22 17:17 /chukwa/logs/201022171725877_xipli_18f1ec1212d0d4c13067ff8.chukwa
 -rw-r--r--   3 hadoop supergroup 8352420 2010-12-22 17:32 /chukwa/logs/201022173249756_xipli_1f33c45712d0d6c7cdd8000.done

To deploy agents on multiple nodes, each node needs the same configuration. As the number of nodes grows, managing the agents one by one becomes harder. In that case we can manage all agents in batches from a single node.

The first file to edit is the agents file in the conf directory. By default there is a template named agents.template, which we need to rename to agents. Add the hostname or IP of each agent node to this file, one node per line. Then bin/start-agents.sh and bin/stop-agents.sh start and stop the agents in batches. If an agent cannot be shut down gracefully, a temporary workaround is to change kill -1 to kill -9 in the chukwa script on each node. Collectors are managed in a similar way.

Basic command Introduction

There are many executable files in the bin directory. Here we cover only the ones we care about most:

chukwa

The help output of the chukwa command is as follows:

Listing 5. Basic commands

 Usage: chukwa [--config confdir] COMMAND
 where COMMAND is one of:
   agent         run a Chukwa Agent
   archive       run the Archive Manager
   collector     run a Chukwa Collector
   demux         run the Demux Manager
   dp            run the post-Demux data processors
   hicc          run a HICC webserver
   droll         run a daily rolling job (deprecated)
   hroll         run an hourly rolling job (deprecated)
   version       print the version

 Utilities:
   backfill      run a backfill data loader utility
   dumpArchive   view an archive file
   dumpRecord    view a record file
   tail          start tailing a file

 Most commands print help when invoked w/o parameters.

The commands work as follows:

bin/chukwa agent: starts the local agent.

bin/chukwa agent stop: stops the local agent.

bin/chukwa collector: starts the local collector.

bin/chukwa collector stop: stops the local collector.

bin/chukwa archive: runs the archive job on a schedule. It organizes the files into SequenceFiles and removes duplicate content.

bin/chukwa archive stop: stops the archive job.

bin/chukwa demux: starts the Demux manager, which is equivalent to starting a map/reduce job. The default processor is TsProcessor. We can also define our own data processing module, as described later.

bin/chukwa demux stop: stops the Demux manager.

bin/chukwa dp: starts the post-Demux processor, which periodically sorts and merges files and removes redundant data.

bin/chukwa dp stop: stops the dp run.

bin/chukwa hicc: an interesting one. It works like a portal that displays the data graphically. However, the current 0.4.0 version has many problems; interested readers can try version 0.5.0, which is under development.

The remaining commands are relatively simple and can be run correctly by following their prompts, so they are not described in detail.

slaves.sh

The slaves.sh command is handy when you have many nodes. Suppose you have 50 nodes and want to create a directory abc on each of them. Doing this machine by machine is far too tedious. Instead, run bin/slaves.sh mkdir /home/hadoop/abc, and the directory is created on every node.

start-agents.sh

This command starts all agents registered in the agents file.

start-collectors.sh

This command starts all collectors registered in the collectors file.

stop-agents.sh

This command stops all agents registered in the agents file.

stop-collectors.sh

This command stops all collectors registered in the collectors file.

start-data-processors.sh

This command combines the following three commands:

bin/chukwa archive

bin/chukwa demux

bin/chukwa dp

It starts the three in turn, so you do not have to start each one yourself.

stop-data-processors.sh

Stops the archive, demux, and dp services in turn.

Agent-side commands

Once an agent is running, we can control its adaptors dynamically. Each agent starts a telnet service at startup, which is used to control that agent individually. The default port is 9093.

Running telnet localhost 9093 opens the telnet console, which displays the following:
Listing 6. Agent console information

 Trying ::1...
 Connected to localhost.
 Escape character is '^]'.

Press Enter and then type the help command to display detailed command help:
Listing 7. Agent console information

 You're talking to the Chukwa agent.  Commands available:
 add [adaptorname] [args] [offset] -- start an adaptor
 shutdown [adaptornumber]  -- graceful stop
 stop [adaptornumber]  -- abrupt stop
 list -- list running adaptors
 close -- close this connection
 stopagent -- stop the whole agent process
 stopall -- stop all adaptors
 reloadCollectors -- reload the list of collectors
 help -- print this message
         Command names are case-blind.

Entering list shows all the adaptors that are currently running, as shown below:
Listing 8. Agent console information

 adaptor_8567b8b00a5dc746ccd5b8d179873db1)  org.apache.hadoop.chukwa.datacollection.adaptor.ExecAdaptor Top 60 /usr/bin/top -b -n 1 -c 505728
 adaptor_e69999b07d7023e6ba08060c85bd9ad7)  org.apache.hadoop.chukwa.datacollection.adaptor.ExecAdaptor Df 60 /bin/df -l 353792
The other commands are easy to understand, so let's look only at the add command. Consider the following example:
Listing 9. Adding an adaptor to the Agent

 add filetailer.LWFTAdaptor DataType /foo/bar 0

This command adds the LWFTAdaptor adaptor to the Agent and starts it. DataType is the data type, which is used later by the Demux service; /foo/bar is the adaptor's parameter; and the final 0 is the data offset. The adaptors supported by Chukwa 0.4.0 fall broadly into the following kinds: watching files or directories for changes, listening for UDP data, and running specific shell scripts. See the official Chukwa documentation for more detailed descriptions. Note that in version 0.5 the adaptors have been strengthened and standardized; interested readers can take a look.

To leave the telnet console, enter close.
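The same control commands can also be sent from a program instead of an interactive telnet session. The following Java sketch builds the add command from Listing 9 and, optionally, sends it to the agent's control port. It assumes the port accepts plain newline-terminated text, as the telnet session suggests; the class name, the --send flag, and the 5-second timeout are inventions of this sketch, and the format of the agent's reply is not assumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for talking to an agent's control port.
public class AgentControlSketch {

    /** Builds an "add" line in the order used in Listing 9: adaptor, data type, args, offset. */
    public static String buildAddCommand(String adaptorClass, String dataType,
                                         String adaptorArgs, long offset) {
        return "add " + adaptorClass + " " + dataType + " " + adaptorArgs + " " + offset;
    }

    /** Sends one command line and returns the first line read back (may be null). */
    public static String sendCommand(String host, int port, String command) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.setSoTimeout(5000); // stop waiting if nothing arrives within 5 s
            Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.US_ASCII);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            out.write(command + "\n");
            out.flush();
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String command = buildAddCommand("filetailer.LWFTAdaptor", "DataType", "/foo/bar", 0L);
        System.out.println(command);
        // Only contact an agent when explicitly asked to; requires a running agent on 9093.
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println(sendCommand("localhost", 9093, command));
        }
    }
}
```

Keeping the command builder separate from the network call makes it possible to check the command text without a running agent.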

Introduction to internal data processing timing
Figure 2: Internal data processing timing

The Collector writes the data sent by Agents to logs/*.chukwa files. When a file reaches 64 MB or a certain time interval has elapsed, the Collector renames the *.chukwa file to *.done to mark the file as complete.

DemuxManager checks for *.done files every 20 seconds. If any exist, it moves them to the demuxProcessing/mrInput folder, which the Demux MapReduce job uses as its input. If the job succeeds (it can be retried up to 3 times), the map/reduce output in the demuxProcessing/mrOutput folder is passed on to the postProcess directory and the data is archived under dataSinkArchives/[yyyyMMdd]/*/*.done. Otherwise, if the job fails, the files are moved to dataSinkArchives/InError/[yyyyMMdd]/*/*.done.

PostProcessManager runs every few minutes and is responsible for merging, deduplicating, and sorting the files in the postProcess directory. When the run completes, the data is written to the repos directory, organized into subdirectories by cluster name and data type.

Of the steps above, Demux deserves the most attention, since much of the data processing happens there. We can also define our own Demux processor for a data type.
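The Collector's rotation rule (close the sink file at 64 MB or after a time interval, then rename it from *.chukwa to *.done) can be written down as two small functions. This is a simplified Java illustration, not the actual writer implementation; the class and method names and the five-minute interval used in main are assumptions, since the text only says "a certain interval".

```java
// Hypothetical illustration of the sink file rotation rule.
public class SinkRotationSketch {

    /** Size threshold stated in the text: 64 MB. */
    public static final long MAX_BYTES = 64L * 1024 * 1024;

    /** A sink file is rotated when it is big enough OR old enough. */
    public static boolean shouldRotate(long bytesWritten, long millisSinceOpen,
                                       long rotateIntervalMillis) {
        return bytesWritten >= MAX_BYTES || millisSinceOpen >= rotateIntervalMillis;
    }

    /** Maps an open sink file name (*.chukwa) to its finished name (*.done). */
    public static String doneName(String sinkFileName) {
        String suffix = ".chukwa";
        if (!sinkFileName.endsWith(suffix)) {
            throw new IllegalArgumentException("not an open sink file: " + sinkFileName);
        }
        return sinkFileName.substring(0, sinkFileName.length() - suffix.length()) + ".done";
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L; // assumed interval, for illustration only
        String open = "201022171725877_xipli_18f1ec1212d0d4c13067ff8.chukwa";
        if (shouldRotate(6046366L, fiveMinutes, fiveMinutes)) {
            System.out.println(open + " -> " + doneName(open));
        }
    }
}
```

Because the rename happens only after the file is closed, a downstream job that looks only for *.done files never reads a file that is still being written.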

How to support new data types

Chukwa's built-in adaptors can collect data from files and sockets. Here we define a new adaptor of our own to collect data from JMS. A custom adaptor typically extends org.apache.hadoop.chukwa.datacollection.adaptor.AbstractAdaptor. Look at the following code:
Listing 10. Adaptor example

				
import org.apache.hadoop.chukwa.datacollection.adaptor.AbstractAdaptor;
import org.apache.hadoop.chukwa.datacollection.adaptor.AdaptorException;
import org.apache.hadoop.chukwa.datacollection.adaptor.AdaptorShutdownPolicy;

public class JMSAdaptor extends AbstractAdaptor {

	// Reports the adaptor's current state to the agent.
	@Override
	public String getCurrentStatus() {
		// TODO Auto-generated method stub
		return null;
	}

	// Stops the adaptor according to the requested shutdown policy.
	@Override
	public long shutdown(AdaptorShutdownPolicy shutdownPolicy)
			throws AdaptorException {
		// TODO Auto-generated method stub
		return 0;
	}

	// Starts collecting data from the given offset.
	@Override
	public void start(long offset) throws AdaptorException {
		// TODO Auto-generated method stub
	}

	// Parses the arguments supplied by the user.
	@Override
	public String parseArgs(String s) {
		// TODO Auto-generated method stub
		return null;
	}

}

There are four methods to implement:
parseArgs: parses the arguments supplied by the user.
start: starts the adaptor.
shutdown: stops the adaptor.
getCurrentStatus: returns the current state of the adaptor. It is called by the AdaptorManager and is used to report the adaptor's status periodically.

How the data is obtained depends entirely on the user's own implementation. Taking JMS as an example, we can implement a JMS consumer and start it in the start method; the parameters the consumer needs are obtained through parseArgs. When the user stops the adaptor from the agent's telnet console, the shutdown method is called so that the program can close down.

Once an adaptor is implemented, we need to register it with the agent. The registration method is the add command described in the basic commands section above.
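To make the lifecycle of these four methods concrete without depending on the Chukwa or JMS libraries, here is a standalone Java sketch. It is only an analogy: an in-memory BlockingQueue stands in for the JMS destination, the class does not extend AbstractAdaptor, and the method signatures, the null-on-bad-arguments convention, and the status string format are assumptions of this sketch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in showing the parseArgs / start / shutdown / status lifecycle.
public class QueueAdaptorSketch {
    private final BlockingQueue<String> source;      // stand-in for a JMS destination
    private final AtomicLong consumed = new AtomicLong();
    private volatile boolean running;
    private Thread consumer;
    private String queueName;

    public QueueAdaptorSketch(BlockingQueue<String> source) {
        this.source = source;
    }

    /** Keeps the user's argument; returns null when it is unusable (sketch convention). */
    public String parseArgs(String s) {
        if (s == null || s.trim().isEmpty()) {
            return null;
        }
        queueName = s.trim();
        return queueName;
    }

    /** Starts the consumer thread, counting characters from the given offset. */
    public void start(long offset) {
        consumed.set(offset);
        running = true;
        consumer = new Thread(() -> {
            try {
                // Keep going until asked to stop AND the queue has been drained.
                while (running || !source.isEmpty()) {
                    String message = source.poll(50, TimeUnit.MILLISECONDS);
                    if (message != null) {
                        consumed.addAndGet(message.length());
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
    }

    /** Graceful stop: waits for the consumer to drain the queue, returns the final offset. */
    public long shutdown() {
        running = false;
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumed.get();
    }

    /** One-line status: the parsed argument followed by the current offset. */
    public String getCurrentStatus() {
        return queueName + " " + consumed.get();
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("hello");
        QueueAdaptorSketch adaptor = new QueueAdaptorSketch(queue);
        adaptor.parseArgs("orders");
        adaptor.start(0);
        System.out.println("final offset: " + adaptor.shutdown());
        System.out.println(adaptor.getCurrentStatus());
    }
}
```

A real JMS adaptor would replace the queue polling with a JMS consumer and hand each message to Chukwa as a chunk; the start and stop sequence stays the same.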


How to customize the data processing module

In the previous section we defined our own adaptor to collect data of a custom type. Next we describe how that data is processed on the Collector side. Assume the collected data type is foo. We then only need to add the following fragment to the chukwa-demux-conf.xml configuration file on the Collector side:
Listing 11. Registering a custom data processing module

				
 <property>
  <name>foo</name>
  <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.FooProcessor</value>
  <description>Parser class for foo</description>
 </property>


FooProcessor is our new data processing module; it is the map part of map/reduce. The class can be placed anywhere. With the map registered, we also need to consider reduce. The reduce part does not need to be registered in the configuration file. Look at the following code:
Listing 12. Custom data processing module code example

				
import org.apache.hadoop.chukwa.extraction.demux.processor.mapper.AbstractProcessor;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FooProcessor extends AbstractProcessor {

	@Override
	protected void parse(String recordEntry,
			OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
			Reporter reporter) throws Throwable {
		ChukwaRecordKey key1 = new ChukwaRecordKey();
		// The reduce type on the key selects which reduce class handles this record.
		String reduceType = "Foo";
		key1.setReduceType(reduceType);
		ChukwaRecord record = new ChukwaRecord();
		..... // map work
		output.collect(key1, record);
	}

}

As we can see, this is a very ordinary map/reduce class in which we can do whatever we need. The only special part is that the reduce step is selected through the key: key1.setReduceType(reduceType). There is one limitation: a reduce class we define ourselves must live under the org.apache.hadoop.chukwa.extraction.demux.processor package. Chukwa looks up the reduce class in this package according to the reduce type and creates an instance with Class.forName. The reduce class must implement the org.apache.hadoop.chukwa.extraction.demux.processor.reducer.ReduceProcessor interface. The interface is simple, with two methods to implement: getDataType, which returns the type of data the reduce handles as a String, and an ordinary reduce function, omitted here. With this done, our whole data flow is complete: data collection, processing, and the definition and registration of custom modules.
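The lookup mechanism described here (combine a fixed package prefix with the reduce type, load the class with Class.forName, and instantiate it) can be demonstrated with a standalone Java sketch. The Reducer interface, the Foo and Identity classes, and the fallback to Identity when no class is found are stand-ins invented for this illustration; they are not Chukwa's classes, and nested classes are used only so the example fits in one file.

```java
// Hypothetical illustration of finding a reduce class by name via reflection.
public class ReducerLookupSketch {

    /** Stand-in for the ReduceProcessor interface described in the text. */
    public interface Reducer {
        String getDataType();
    }

    /** A made-up reducer for the "Foo" reduce type. */
    public static class Foo implements Reducer {
        @Override
        public String getDataType() {
            return "Foo";
        }
    }

    /** Fallback used when no class matches (an assumption of this sketch). */
    public static class Identity implements Reducer {
        @Override
        public String getDataType() {
            return "Identity";
        }
    }

    /** Loads prefix + reduceType by name and instantiates it with its no-arg constructor. */
    public static Reducer lookup(String classPrefix, String reduceType) {
        try {
            Class<?> cls = Class.forName(classPrefix + reduceType);
            return (Reducer) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Unknown type, or a class that is not a Reducer: use the fallback.
            return new Identity();
        }
    }

    public static void main(String[] args) {
        // Nested classes are named Outer$Inner, so the prefix ends with '$' here;
        // with top-level classes it would be a package name ending with '.'.
        String prefix = ReducerLookupSketch.class.getName() + "$";
        System.out.println(lookup(prefix, "Foo").getDataType());
        System.out.println(lookup(prefix, "Bar").getDataType());
    }
}
```

This name-based lookup is the reason the reduce class has to sit in a known package: the reduce type set on the key is the only information available for locating it.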


Conclusion

After the introduction above, readers should have a basic understanding of how Chukwa works and how to customize and deploy it. Because its design is simple, its structure is clear, and it is open source, we can build more powerful functionality of our own on top of it. It should be stressed that Chukwa is an open source project in the Apache Incubator and is evolving rapidly. The differences between versions are still significant, and the documentation tends to lag behind, so you may run into some odd problems in use. These can all be raised on the Chukwa mailing list. Finally, I hope this article proves helpful in your work.
