Supported cluster metrics

By default, AutoActions collects metrics from the YARN Resource Manager and the MapReduce Application Master (AM) for all apps running on (or submitted to) the target cluster. The MapReduce AM also maintains various counters. Users can use these metrics and counters when defining an AutoActions rule. Additionally, there are Hive/Workflow and Spark metrics that can be used to define rules.

Note

Monitoring is performed on most live running apps, allowing users to take proactive action when violations are detected. See here for limitations on AutoActions.

MapReduce AM metrics are monitored only when a user defines an AutoActions rule that requires polling or aggregating a MapReduce AM metric.

Hive/Workflow metrics

Metric	Definition
duration	Total time taken by the app.
totalDfsBytesRead	Total HDFS bytes read.
totalDfsBytesWritten	Total HDFS bytes written.

MapReduce application master and MapReduce metrics

Type	Metric	Definition
	elapsedAppTime	Time since the app was started.
Map
	mapsCompleted	Number of completed maps.
	mapsPending	Number of maps still to be run.
	mapsRunning	Number of running maps.
	mapsTotal	Total number of maps.
Map Attempts
	failedMapAttempts	Number of failed map attempts.
	killedMapAttempts	Number of killed map attempts.
	newMapAttempts	Number of new map attempts.
	runningMapAttempts	Number of running map attempts.
Reduce
	reducesCompleted	Number of completed reduces.
	reducesPending	Number of reduces still to be run.
	reducesRunning	Number of running reduces.
	reducesTotal	Total number of reduces.
Reduce Attempts
	failedReduceAttempts	Number of failed reduce attempts.
	killedReduceAttempts	Number of killed reduce attempts.
	newReduceAttempts	Number of new reduce attempts.
	runningReduceAttempts	Number of running reduce attempts.
	successfulReduceAttempts	Number of successful reduce attempts.

For more details see: Apache MapReduce REST Jobs API
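As a rough illustration of how the metrics in this table can be read, the sketch below parses a payload shaped like the MapReduce AM jobs response and derives two values from it. The sample values, the job id, the helper names (`map_progress`, `jobs_with_failed_maps`), and the `max_failed` threshold are all assumptions made for this example; they are not part of AutoActions.

```python
import json

# Sample payload shaped like the MapReduce AM "jobs" REST response.
# All values here are made up for illustration.
SAMPLE_JOBS_RESPONSE = """
{"jobs": {"job": [{
    "id": "job_0000000000000_0001",
    "state": "RUNNING",
    "mapsTotal": 40, "mapsCompleted": 10, "mapsPending": 22, "mapsRunning": 8,
    "reducesTotal": 4, "reducesCompleted": 0, "reducesPending": 4, "reducesRunning": 0,
    "failedMapAttempts": 3, "killedMapAttempts": 1
}]}}
"""


def map_progress(job):
    """Fraction of maps completed, or 0.0 when the job has no maps."""
    total = job.get("mapsTotal", 0)
    return job.get("mapsCompleted", 0) / total if total else 0.0


def jobs_with_failed_maps(payload, max_failed):
    """Ids of jobs whose failedMapAttempts exceeds max_failed (a hypothetical threshold)."""
    jobs = json.loads(payload)["jobs"]["job"]
    return [j["id"] for j in jobs if j.get("failedMapAttempts", 0) > max_failed]


job = json.loads(SAMPLE_JOBS_RESPONSE)["jobs"]["job"][0]
print(map_progress(job))
print(jobs_with_failed_maps(SAMPLE_JOBS_RESPONSE, max_failed=2))
```

The payload is embedded as a string so the sketch runs without a cluster; against a live cluster the same JSON would come from the AM's REST endpoint.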

MapReduce and file system counters

Metric	Definition
fileBytesRead	Amount of data read from local file system.
fileBytesWritten	Amount of data written to local file system.
fileReadOps	Number of read operations from local file system.
fileLargeReadOps	Number of read operations of large files from local file system.
fileWriteOps	Number of write operations to local file system.
hdfsBytesRead	Amount of data read from HDFS.
hdfsBytesWritten	Amount of data written to HDFS.
hdfsReadOps	Number of read operations from HDFS.
hdfsLargeReadOps	Number of read operations of large files from HDFS.
hdfsWriteOps	Number of write operations to HDFS.

Job counters

Type	Metric	Definition
Map
	dataLocalMaps	Number of map tasks which were launched on the nodes containing required data.
	mbMillisMaps	Total megabyte-milliseconds taken by all map tasks.
	millisMaps	Total time spent by all map tasks (in ms).
	slotsMillisMaps	Total time spent by all executing maps in occupied slots (in ms).
	vcoresMillisMaps	Total vCore-milliseconds taken by all map tasks.
Reduce
	mbMillisReduces	Total megabyte-milliseconds taken by all reduce tasks.
	millisReduces	Total time spent by all reduce tasks (in ms).
	slotsMillisReduces	Total time spent by all executing reduces in occupied slots (in ms).
	totalLaunchedReduces	Total number of launched reduce tasks.
	vcoresMillisReduces	Total vCore-milliseconds taken by all reduce tasks.

File input/output format counters

Metric	Definition
bytesRead	Amount of data read by every task for every file system.
bytesWritten	Amount of data written by every task for every file system.

For more details see: Apache MapReduce REST Jobs Counters API
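The counters API reports names in upper snake case (for example `FILE_BYTES_READ`), while the tables here use camel case (`fileBytesRead`). The sketch below flattens a counters payload into a dictionary keyed by the camel-case form. The one-to-one name correspondence is an assumption inferred from the metric names in this page, and the sample values and helper names are made up for illustration.

```python
import json

# Sample payload shaped like the MapReduce job counters REST response.
# Counter values are made up for illustration.
SAMPLE_COUNTERS_RESPONSE = """
{"jobCounters": {"id": "job_0000000000000_0001", "counterGroup": [
  {"counterGroupName": "org.apache.hadoop.mapreduce.FileSystemCounter",
   "counter": [
     {"name": "FILE_BYTES_READ", "totalCounterValue": 2048,
      "mapCounterValue": 1024, "reduceCounterValue": 1024},
     {"name": "HDFS_LARGE_READ_OPS", "totalCounterValue": 0,
      "mapCounterValue": 0, "reduceCounterValue": 0}
   ]},
  {"counterGroupName": "org.apache.hadoop.mapreduce.JobCounter",
   "counter": [
     {"name": "TOTAL_LAUNCHED_REDUCES", "totalCounterValue": 4,
      "mapCounterValue": 0, "reduceCounterValue": 0}
   ]}
]}}
"""


def to_camel_case(counter_name):
    """FILE_BYTES_READ -> fileBytesRead (assumed naming correspondence)."""
    head, *tail = counter_name.lower().split("_")
    return head + "".join(part.capitalize() for part in tail)


def flatten_counters(payload):
    """Map camel-case counter names to their total values across all groups."""
    groups = json.loads(payload)["jobCounters"]["counterGroup"]
    return {
        to_camel_case(counter["name"]): counter["totalCounterValue"]
        for group in groups
        for counter in group["counter"]
    }


counters = flatten_counters(SAMPLE_COUNTERS_RESPONSE)
print(counters)
```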

MapReduce framework counters

Type	Metric	Definition
Map
	failedShuffle	Total number of mappers that failed to go through the shuffle phase.
	mapInputRecords	Total number of records processed by all the mappers.
	mapOutputBytes	Total amount of (uncompressed) data produced by mappers.
	mapOutputMaterializedBytes	Amount of (compressed) data which was actually written to disk.
	mapOutputRecords	Total number of records produced by all the mappers.
	mergedMapOutputs	Total number of mapper output files that have gone through the shuffle phase.
	shuffledMaps	Total number of mappers that have gone through the shuffle phase.
Reduce
	reduceInputGroups	Total number of unique keys.
	reduceInputRecords	Total number of records processed by all reducers.
	reduceOutputRecords	Total number of records produced by all reducers.
	reduceShuffleBytes	Amount of data processed in shuffle and reduce phase.
Records
	combineInputRecords	Total number of records processed by combiners.
	combineOutputRecords	Total number of records produced by combiners.
	spilledRecords	Total number of map and reduce records that were spilled to disk.
Time
	gcTimeMillis	Wall clock time spent in Java Garbage Collection.
	cpuMilliseconds	Cumulative CPU time for all tasks.
Memory
	committedHeapBytes	Total amount of memory available for JVM.
	physicalMemoryBytes	Total physical memory used by all tasks including spilled data.
	splitRawBytes	Amount of data consumed for meta-data representation during splits.
	virtualMemoryBytes	Total virtual memory used by all tasks.

Shuffle errors

Metric	Definition
badId	Total number of errors related to the interpretation of IDs from shuffle headers.
connection	Total number of errors related to establishing network connections.
ioError	Total number of errors related to reading and writing intermediate data.
wrongLength	Total number of errors related to compression and decompression of intermediate data.
wrongMap	Total number of errors related to duplication of the mapper output data.
wrongReduce	Total number of errors related to attempts to shuffle data to the wrong reducer.

Spark metrics

In addition to the metric set supported by MapReduce apps, Spark apps can be polled on:

Type	Metric	Definition
Join
	joinInputRowCount	The total input rows of the first join of the SQL query, aggregated for all the queries that are part of the app.
	totalJoinInputRowCount	Total input row count for all join operators of all SQL queries that are part of the app.
	totalJoinOutputRowCount	Total output row count for all join operators of all SQL queries that are part of the app.
	joinOutputRowCount	The total output rows of the first join of the SQL query, aggregated for all the queries that are part of the app.
Partitions
	inputPartitions	Total number of input partitions for all SQL queries that are part of the app.
	outputPartitions	Total number of output partitions for all SQL queries that are part of the app.
Records
	inputRecords	Cumulative number of input records for all SQL queries that are part of the app (collected at stage level).
	outputRecords	Cumulative number of output records for all SQL queries that are part of the app (collected at stage level).
	outputToInputRecordsRatio	outputRecords / inputRecords if inputRecords > 0, else 0.
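
The definition of outputToInputRecordsRatio can be written out as a small function. This is only a restatement of the table entry in code; the function name and the sample values are chosen for this example.

```python
def output_to_input_records_ratio(input_records, output_records):
    """outputRecords / inputRecords if inputRecords > 0, else 0."""
    if input_records > 0:
        return output_records / input_records
    return 0


# A job that reads 200 records and writes 50 has a ratio of 0.25.
print(output_to_input_records_ratio(200, 50))
# With no input records, the ratio is defined as 0 rather than an error.
print(output_to_input_records_ratio(0, 50))
```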

YARN resource manager metrics

Metric	Definition
allocatedMB	The sum of memory in MB allocated to the app’s running containers.
allocatedVCores	The sum of virtual cores allocated to the app’s running containers.
appCount	Total number of apps.
elapsedTime	The elapsed time since the app started (in ms).
runningContainers	The number of containers currently running for the app.
memorySeconds	The amount of memory the app has allocated (megabyte-seconds).
vcoreSeconds	The amount of CPU resources the app has allocated (virtual core-seconds).

For more details see: Apache MapReduce REST Cluster Applications API
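
To show how these metrics relate to each other, the sketch below reads one app from a payload shaped like the Resource Manager cluster applications response, compares selected metrics against limits, and derives the average allocated memory from memorySeconds and elapsedTime. The `thresholds` dictionary and the helper names are hypothetical and do not reflect the actual AutoActions rule format; the sample values are made up.

```python
import json

# Sample payload shaped like the YARN Resource Manager "apps" REST response.
# All values here are made up for illustration.
SAMPLE_APPS_RESPONSE = """
{"apps": {"app": [{
    "id": "application_0000000000000_0001",
    "state": "RUNNING",
    "allocatedMB": 12288,
    "allocatedVCores": 6,
    "runningContainers": 6,
    "elapsedTime": 120000,
    "memorySeconds": 1228800,
    "vcoreSeconds": 600
}]}}
"""


def violations(app, thresholds):
    """Return {metric: (value, limit)} for each metric above its limit."""
    return {
        metric: (app[metric], limit)
        for metric, limit in thresholds.items()
        if app.get(metric, 0) > limit
    }


def average_allocated_mb(app):
    """memorySeconds is in MB-seconds and elapsedTime in ms, so convert ms to s first."""
    elapsed_seconds = app["elapsedTime"] / 1000
    return app["memorySeconds"] / elapsed_seconds if elapsed_seconds else 0.0


app = json.loads(SAMPLE_APPS_RESPONSE)["apps"]["app"][0]
# Hypothetical limits, not an AutoActions rule definition.
thresholds = {"allocatedMB": 8192, "runningContainers": 10}
print(violations(app, thresholds))
print(average_allocated_mb(app))
```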
