Ganglia monitoring configuration and metrics for Hadoop


Ganglia configuration for Hadoop 2.0.0-cdh4.3.0:

Modify the metrics configuration file in $HADOOP_HOME/etc/hadoop/ and add the following content:
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
# default for supportsparse is false
*.sink.ganglia.supportsparse=true
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
# multicast address; 8801 is the port for receiving and sending data
namenode.sink.ganglia.servers=8801
datanode.sink.ganglia.servers=8801
jobtracker.sink.ganglia.servers=8801
tasktracker.sink.ganglia.servers=8801
maptask.sink.ganglia.servers=8801
reducetask.sink.ganglia.servers=8801
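
The slope and dmax options pack several metric=value pairs into a single property value, so a value can itself contain "=" signs. The minimal Python sketch below shows how such lines split apart: the key ends at the first "=", and the value is then split on commas. The helper names are hypothetical and the parser is deliberately simplified (plain key=value lines and "#" comments only); it is not Hadoop's own configuration loader.

```python
# Hypothetical, simplified parser for the sink options shown above.


def parse_properties(text):
    """Parse plain key=value lines; lines starting with '#' are comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only: slope/dmax values contain '=' themselves.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_metric_map(value):
    """Split 'metricA=x,metricB=y' (the slope/dmax format) into a dict."""
    result = {}
    for pair in value.split(","):
        name, _, setting = pair.strip().partition("=")
        result[name] = setting
    return result


config = """
*.sink.ganglia.period=10
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
"""

props = parse_properties(config)
slope = parse_metric_map(props["*.sink.ganglia.slope"])
dmax = {k: int(v) for k, v in parse_metric_map(props["*.sink.ganglia.dmax"]).items()}
print(slope)  # {'jvm.metrics.gcCount': 'zero', 'jvm.metrics.memHeapUsedM': 'both'}
print(dmax)   # {'jvm.metrics.threadsBlocked': 70, 'jvm.metrics.memHeapUsedM': 40}
```

Splitting on the first "=" only is the key design choice: splitting on every "=" would break the slope and dmax values apart.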

Monitoring metrics:

DataNode metrics (dfs.datanode.*):
dfs.datanode.blockChecksumOp_avg_time: average time of block checksum (verification) operations
dfs.datanode.blockChecksumOp_num_ops: number of block checksum operations
dfs.datanode.blockReports_avg_time: average time of block reports
dfs.datanode.blockReports_num_ops: number of block reports
dfs.datanode.block_verification_failures: number of block verification failures
dfs.datanode.blocks_read: total number of blocks read from disk
dfs.datanode.blocks_removed: number of deleted blocks
dfs.datanode.blocks_replicated: total number of blocks replicated
dfs.datanode.blocks_verified: total number of block verifications
dfs.datanode.blocks_written: total number of blocks written to disk
dfs.datanode.bytes_read: total bytes read, including the bytes of the CRC checksum files
dfs.datanode.bytes_written: total bytes written (summed over all packets written)
dfs.datanode.copyBlockOp_avg_time: average time of copy block operations (ms)
dfs.datanode.copyBlockOp_num_ops: number of copy block operations
dfs.datanode.heartBeats_avg_time: average time of heartbeats sent to the NameNode
dfs.datanode.heartBeats_num_ops: total number of heartbeats sent to the NameNode
dfs.datanode.readBlockOp_avg_time: average block read time (ms)
dfs.datanode.readBlockOp_num_ops: total number of block reads; generally equal to dfs.datanode.blocks_read, because the DataNode first opens the input stream from disk and increments blocks_read, then increments this counter
dfs.datanode.reads_from_local_client: number of reads from local clients
dfs.datanode.reads_from_remote_client: number of reads from remote clients
dfs.datanode.replaceBlockOp_avg_time: average time of replace block operations (used for load balancing)
dfs.datanode.replaceBlockOp_num_ops: number of replace block operations (used for load balancing)
dfs.datanode.volumeFailures: number of volume (disk) failures on the DataNode
dfs.datanode.writeBlockOp_avg_time: average block write time
dfs.datanode.writeBlockOp_num_ops: total number of block writes; generally equal to dfs.datanode.blocks_written, which is incremented first when the block is written to disk, followed by this counter
dfs.datanode.writes_from_local_client: number of writes from local clients
dfs.datanode.writes_from_remote_client: number of writes from remote clients
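
Many DataNode metrics come in _avg_time/_num_ops pairs, and their product approximates the total time spent on an operation. The Python sketch below summarizes one sample of dfs.datanode.* values that way. It also checks that readBlockOp_num_ops stays close to blocks_read, as described above, and computes the share of reads from local clients. The function name, the 5% tolerance and the sample numbers are illustrative assumptions, and all values are assumed to come from the same sampling interval.

```python
# Hypothetical helper: summarize one sample of dfs.datanode.* metrics.
# Assumes every value in the dict comes from the same sampling interval.
PREFIX = "dfs.datanode."


def datanode_summary(m, tolerance=0.05):
    reads = m[PREFIX + "readBlockOp_num_ops"]
    writes = m[PREFIX + "writeBlockOp_num_ops"]
    local = m[PREFIX + "reads_from_local_client"]
    remote = m[PREFIX + "reads_from_remote_client"]
    blocks_read = m[PREFIX + "blocks_read"]
    return {
        # avg_time * num_ops approximates the total time (ms) spent on the operation
        "read_time_ms": m[PREFIX + "readBlockOp_avg_time"] * reads,
        "write_time_ms": m[PREFIX + "writeBlockOp_avg_time"] * writes,
        # share of reads that came from local clients
        "local_read_fraction": local / (local + remote) if local + remote else 0.0,
        # readBlockOp_num_ops and blocks_read should normally track each other
        "read_counters_consistent": abs(reads - blocks_read)
        <= tolerance * max(reads, blocks_read, 1),
    }


sample = {
    "dfs.datanode.readBlockOp_num_ops": 200,
    "dfs.datanode.readBlockOp_avg_time": 3.5,
    "dfs.datanode.writeBlockOp_num_ops": 50,
    "dfs.datanode.writeBlockOp_avg_time": 12.0,
    "dfs.datanode.reads_from_local_client": 150,
    "dfs.datanode.reads_from_remote_client": 50,
    "dfs.datanode.blocks_read": 198,
}
summary = datanode_summary(sample)
print(summary)
```

A large, persistent gap between the two read counters would be worth investigating. Small gaps are expected because the counters are incremented at slightly different moments.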

JVM metrics (jvm.metrics.*):
jvm.metrics.gcCount: total number of garbage collections
jvm.metrics.gcTimeMillis: total time spent in garbage collection (ms)
jvm.metrics.logError: number of ERROR-level log events
jvm.metrics.logFatal: number of FATAL-level log events
jvm.metrics.logInfo: number of INFO-level log events
jvm.metrics.logWarn: number of WARN-level log events
jvm.metrics.maxMemoryM: maximum memory (MB) the JVM will attempt to use; if there is no limit, Long.MAX_VALUE is returned
jvm.metrics.memHeapCommittedM: committed heap memory (MB)
jvm.metrics.memHeapUsedM: used heap memory (MB)
jvm.metrics.memNonHeapCommittedM: committed non-heap memory (MB)
jvm.metrics.memNonHeapUsedM: used non-heap memory (MB)
jvm.metrics.threadsBlocked: number of threads blocked waiting for a monitor lock
jvm.metrics.threadsNew: number of threads that have not yet started
jvm.metrics.threadsRunnable: number of threads in the runnable (executing) state
jvm.metrics.threadsTerminated: number of threads that have exited
jvm.metrics.threadsTimedWaiting: number of threads waiting for another thread to perform an action for up to a specified waiting time
jvm.metrics.threadsWaiting: number of threads waiting indefinitely for another thread to perform a particular action
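
Because gcCount and gcTimeMillis are described above as totals, GC pressure over a period comes from the difference between two samples. The Python sketch below computes the share of wall-clock time spent in GC and the mean pause length, plus used/committed heap. The helper names and sample values are hypothetical. The sketch assumes the two GC counters are cumulative and that memory values are in MB.

```python
# Hypothetical helpers for jvm.metrics.* samples. Assumes gcCount and
# gcTimeMillis are cumulative totals and that memory values are in MB.


def gc_stats(prev, curr, interval_ms):
    """GC activity between two samples taken interval_ms apart."""
    count = curr["jvm.metrics.gcCount"] - prev["jvm.metrics.gcCount"]
    gc_ms = curr["jvm.metrics.gcTimeMillis"] - prev["jvm.metrics.gcTimeMillis"]
    return {
        "gc_count": count,
        # share of wall-clock time spent in GC during the interval
        "gc_time_fraction": gc_ms / interval_ms,
        # mean duration of one collection (ms)
        "avg_gc_ms": gc_ms / count if count else 0.0,
    }


def heap_usage(sample):
    """Fraction of the committed heap that is currently used."""
    return sample["jvm.metrics.memHeapUsedM"] / sample["jvm.metrics.memHeapCommittedM"]


prev = {"jvm.metrics.gcCount": 120, "jvm.metrics.gcTimeMillis": 4000}
curr = {"jvm.metrics.gcCount": 130, "jvm.metrics.gcTimeMillis": 4500}
stats = gc_stats(prev, curr, interval_ms=10_000)  # 10 s, matching period=10
usage = heap_usage({"jvm.metrics.memHeapUsedM": 384.0,
                    "jvm.metrics.memHeapCommittedM": 512.0})
print(stats, usage)
```

A daemon restart resets the counters, which would show up here as a negative difference. The sketch does not handle that case.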

RPC metrics (rpc.metrics.*):
rpc.metrics.NumOpenConnections: number of open RPC connections
rpc.metrics.ReceivedBytes: number of bytes received over RPC
rpc.metrics.RpcProcessingTime_avg_time: average time to process an RPC operation in the last interval
rpc.metrics.RpcProcessingTime_num_ops: number of RPC operations processed in the last interval
rpc.metrics.RpcQueueTime_avg_time: average time RPC calls waited in the queue
rpc.metrics.RpcQueueTime_num_ops: number of RPC operations that passed through the queue
rpc.metrics.SentBytes: number of bytes sent over RPC
rpc.metrics.callQueueLen: length of the RPC call queue
rpc.metrics.rpcAuthenticationFailures: number of failed RPC authentications
rpc.metrics.rpcAuthenticationSuccesses: number of successful RPC authentications
rpc.metrics.rpcAuthorizationFailures: number of failed authorizations
rpc.metrics.rpcAuthorizationSuccesses: number of successful authorizations
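
RpcQueueTime_avg_time and RpcProcessingTime_avg_time are per-interval averages. Averaging them naively across several intervals gives quiet intervals too much weight. The Python sketch below weights each interval's average by its matching _num_ops. It then adds queue time and processing time as a rough estimate of server-side time per call. The helper name and sample values are hypothetical.

```python
# Hypothetical helper: combine per-interval averages weighted by operation count.


def merge_avg(samples, name):
    """Weighted overall average of <name>_avg_time across reporting intervals."""
    total_ops = sum(s[name + "_num_ops"] for s in samples)
    if total_ops == 0:
        return 0.0
    # avg_time * num_ops recovers each interval's total time
    total_time = sum(s[name + "_avg_time"] * s[name + "_num_ops"] for s in samples)
    return total_time / total_ops


intervals = [
    {"rpc.metrics.RpcQueueTime_avg_time": 2.0, "rpc.metrics.RpcQueueTime_num_ops": 100,
     "rpc.metrics.RpcProcessingTime_avg_time": 1.0, "rpc.metrics.RpcProcessingTime_num_ops": 100},
    {"rpc.metrics.RpcQueueTime_avg_time": 4.0, "rpc.metrics.RpcQueueTime_num_ops": 300,
     "rpc.metrics.RpcProcessingTime_avg_time": 1.0, "rpc.metrics.RpcProcessingTime_num_ops": 300},
]
queue = merge_avg(intervals, "rpc.metrics.RpcQueueTime")            # 3.5, not the naive 3.0
processing = merge_avg(intervals, "rpc.metrics.RpcProcessingTime")  # 1.0
print(queue + processing)  # rough server-side time per call
```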

MapReduce shuffle and TaskTracker metrics (mapred.*):
mapred.shuffleInput.shuffle_failed_fetches: number of failed map-output fetches during the shuffle
mapred.shuffleInput.shuffle_fetchers_busy_percent: percentage of fetcher threads busy fetching map outputs in parallel
mapred.shuffleInput.shuffle_input_bytes: bytes read during the shuffle
mapred.shuffleInput.shuffle_success_fetches: number of successful map-output fetches
mapred.shuffleOutput.shuffle_failed_outputs: number of map outputs that failed to be sent to reducers
mapred.shuffleOutput.shuffle_handler_busy_percent: percentage of server threads busy sending map output to reducers (the thread count is configured by tasktracker.http.threads)
mapred.shuffleOutput.shuffle_output_bytes: bytes output during the shuffle
mapred.shuffleOutput.shuffle_success_outputs: number of map outputs successfully sent to reducers
mapred.tasktracker.mapTaskSlots: configured number of map slots
mapred.tasktracker.maps_running: number of running map tasks
mapred.tasktracker.reduceTaskSlots: configured number of reduce slots
mapred.tasktracker.reduces_running: number of running reduce tasks
mapred.tasktracker.tasks_completed: number of completed tasks
mapred.tasktracker.tasks_failed_ping: number of tasks that failed because communication (ping) between the TaskTracker and the task failed
mapred.tasktracker.tasks_failed_timeout: number of tasks killed because they made no progress within mapred.task.timeout (10 minutes by default)
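
The slot and shuffle counters above are most useful as ratios. The Python sketch below computes map and reduce slot usage and the fraction of failed shuffle fetches from one mapred.* sample. The helper names and numbers are hypothetical. The sketch also assumes the reduce-slot metric is named mapred.tasktracker.reduceTaskSlots, mirroring mapTaskSlots.

```python
# Hypothetical helpers for one sample of mapred.* metrics.


def ratio(part, whole):
    """Safe division: 0.0 when the denominator is zero."""
    return part / whole if whole else 0.0


def tasktracker_health(m):
    """Slot usage and shuffle fetch failure rate from one mapred.* sample."""
    tt = "mapred.tasktracker."
    si = "mapred.shuffleInput."
    ok = m[si + "shuffle_success_fetches"]
    failed = m[si + "shuffle_failed_fetches"]
    return {
        "map_slot_usage": ratio(m[tt + "maps_running"], m[tt + "mapTaskSlots"]),
        "reduce_slot_usage": ratio(m[tt + "reduces_running"], m[tt + "reduceTaskSlots"]),
        "fetch_failure_rate": ratio(failed, ok + failed),
    }


sample = {
    "mapred.tasktracker.mapTaskSlots": 8,
    "mapred.tasktracker.maps_running": 6,
    "mapred.tasktracker.reduceTaskSlots": 4,
    "mapred.tasktracker.reduces_running": 1,
    "mapred.shuffleInput.shuffle_success_fetches": 95,
    "mapred.shuffleInput.shuffle_failed_fetches": 5,
}
health = tasktracker_health(sample)
print(health)
```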

Detailed per-method RPC metrics (rpc.detailed-metrics.*):
rpc.detailed-metrics.canCommit_avg_time: average time of RPC calls asking whether a task may commit its output
rpc.detailed-metrics.canCommit_num_ops: number of RPC calls asking whether a task may commit its output
rpc.detailed-metrics.commitPending_avg_time: average time of RPC calls reporting that a task has finished but its commit is still pending
rpc.detailed-metrics.commitPending_num_ops: number of RPC calls reporting that a task has finished but its commit is still pending
rpc.detailed-metrics.done_avg_time: average time of RPC calls reporting that a task completed successfully
rpc.detailed-metrics.done_num_ops: number of RPC calls reporting that a task completed successfully
rpc.detailed-metrics.fatalError_avg_time: average time of RPC calls reporting that a task hit a fatal error
rpc.detailed-metrics.fatalError_num_ops: number of RPC calls reporting that a task hit a fatal error
rpc.detailed-metrics.getBlockInfo_avg_time: average time to get block information from a specified DataNode
rpc.detailed-metrics.getBlockInfo_num_ops: number of times block information was fetched from a specified DataNode
rpc.detailed-metrics.getMapCompletionEvents_avg_time: average time for reducers to fetch events for completed map outputs (including their locations)
rpc.detailed-metrics.getMapCompletionEvents_num_ops: number of times reducers fetched completed map-output events
rpc.detailed-metrics.getProtocolVersion_avg_time: average time to get the RPC protocol version
rpc.detailed-metrics.getProtocolVersion_num_ops: number of times the RPC protocol version was requested
rpc.detailed-metrics.getTask_avg_time: average time for a child JVM to obtain its task after starting
rpc.detailed-metrics.getTask_num_ops: number of task requests made by child JVMs at startup
rpc.detailed-metrics.ping_avg_time: average time of the child process's periodic check that the parent process is still alive
rpc.detailed-metrics.ping_num_ops: number of periodic checks by the child process that the parent process is still alive
rpc.detailed-metrics.recoverBlock_avg_time: average time to start generation-stamp recovery for a specified block
rpc.detailed-metrics.recoverBlock_num_ops: number of generation-stamp recoveries started for a specified block
rpc.detailed-metrics.reportDiagnosticInfo_avg_time: average time to report task error messages to the parent process; such calls should be sparing, because all these messages are held in the JobTracker
rpc.detailed-metrics.reportDiagnosticInfo_num_ops: number of times task error messages were reported to the parent process
rpc.detailed-metrics.startBlockRecovery_avg_time: average time to start block recovery
rpc.detailed-metrics.startBlockRecovery_num_ops: number of block recoveries started
rpc.detailed-metrics.statusUpdate_avg_time: average time for the child process to report its progress to the parent process
rpc.detailed-metrics.statusUpdate_num_ops: number of progress reports from the child process to the parent process
rpc.detailed-metrics.updateBlock_avg_time: average time to update a block to a new generation stamp and length
rpc.detailed-metrics.updateBlock_num_ops: number of times a block was updated to a new generation stamp and length
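
Since every rpc.detailed-metrics.* name follows the pattern <method>_avg_time / <method>_num_ops, metrics can be grouped by RPC method and ranked by approximate total time. The result shows which calls dominate. The Python sketch below does this for one sample. The function name and values are hypothetical, and it ignores any metric outside the rpc.detailed-metrics. prefix.

```python
from collections import defaultdict

# Hypothetical helper: group rpc.detailed-metrics.* by RPC method.
PREFIX = "rpc.detailed-metrics."
SUFFIXES = ("_avg_time", "_num_ops")


def rank_rpc_methods(sample):
    """Rank RPC methods by approximate total time (avg_time * num_ops)."""
    per_method = defaultdict(dict)
    for name, value in sample.items():
        if not name.startswith(PREFIX):
            continue
        short = name[len(PREFIX):]
        for suffix in SUFFIXES:
            if short.endswith(suffix):
                # e.g. "statusUpdate_num_ops" -> per_method["statusUpdate"]["num_ops"]
                per_method[short[: -len(suffix)]][suffix[1:]] = value
    totals = {
        method: stats.get("avg_time", 0.0) * stats.get("num_ops", 0)
        for method, stats in per_method.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


sample = {
    "rpc.detailed-metrics.statusUpdate_avg_time": 0.5,
    "rpc.detailed-metrics.statusUpdate_num_ops": 400,
    "rpc.detailed-metrics.ping_avg_time": 0.25,
    "rpc.detailed-metrics.ping_num_ops": 300,
    "rpc.detailed-metrics.getTask_avg_time": 5.0,
    "rpc.detailed-metrics.getTask_num_ops": 10,
    "jvm.metrics.gcCount": 42,  # ignored: not a detailed RPC metric
}
ranking = rank_rpc_methods(sample)
print(ranking)  # [('statusUpdate', 200.0), ('ping', 75.0), ('getTask', 50.0)]
```

Matching on the full suffix, rather than splitting on the last underscore, matters here. Otherwise "canCommit_avg_time" would split into "canCommit_avg" and "time".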
