Supported cluster metrics

By default, AutoActions collects metrics from the YARN Resource Manager and the MapReduce Application Master (AM) for all apps running on (or submitted to) the target cluster. The MapReduce AM also maintains various counters. You can use these metrics and counters when defining an AutoActions rule. Hive/Workflow and Spark metrics are also available for defining rules.

Note

Monitoring is performed on most live running apps, which lets users take proactive action when violations are detected. See here for limitations on AutoActions.

MapReduce AM metrics are monitored only when the user defines an AutoActions rule that requires polling or aggregating a MapReduce AM metric.
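As a rough illustration of how a metric from this page could feed a rule, the sketch below compares a polled value against a threshold. The `ThresholdRule` class, its fields, and `is_violated` are hypothetical names made up for this example; they do not represent the actual AutoActions rule format, and the sample values are invented.

```python
import operator
from dataclasses import dataclass

# Hypothetical rule model for illustration only; NOT the AutoActions rule format.
@dataclass(frozen=True)
class ThresholdRule:
    metric: str       # a metric name from the tables on this page
    comparator: str   # one of ">", ">=", "<", "<="
    threshold: float

_COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}

def is_violated(rule, metrics):
    """Return True when the polled value satisfies the rule's condition.

    A metric that is missing from the polled values never counts as a violation.
    """
    value = metrics.get(rule.metric)
    if value is None:
        return False
    return _COMPARATORS[rule.comparator](value, rule.threshold)

# Invented sample: flag apps holding more than 100 GB of container memory.
rule = ThresholdRule(metric="allocatedMB", comparator=">", threshold=102400)
print(is_violated(rule, {"allocatedMB": 204800, "allocatedVCores": 40}))
```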

Databricks metrics

| Metric | Definition |
| --- | --- |
| totalDuration | Databricks job duration. |
| cost | Databricks job cost. |

Hive/Workflow metrics

| Metric | Definition |
| --- | --- |
| duration | Total time taken by the app. |
| totalDfsBytesRead | Total HDFS bytes read. |
| totalDfsBytesWritten | Total HDFS bytes written. |

Impala apps metrics

| Metric | Definition |
| --- | --- |
| ImpalaDuration | Total time taken by the Impala app. |
| ImpalaHdfsBytesRead | Total HDFS bytes read. |
| ImpalaHdfsBytesWritten | Total HDFS bytes written. |

MapReduce application master and MapReduce metrics

| Type | Metric | Definition |
| --- | --- | --- |
| | elapsedAppTime | Time since the app was started. |
| Map | mapsCompleted | Number of completed maps. |
| Map | mapsPending | Number of maps still to be run. |
| Map | mapsRunning | Number of running maps. |
| Map | mapsTotal | Total number of maps. |
| Map Attempts | failedMapAttempts | Number of failed map attempts. |
| Map Attempts | killedMapAttempts | Number of killed map attempts. |
| Map Attempts | newMapAttempts | Number of new map attempts. |
| Map Attempts | runningMapAttempts | Number of running map attempts. |
| Reduce | reducesCompleted | Number of completed reduces. |
| Reduce | reducesPending | Number of reduces still to be run. |
| Reduce | reducesRunning | Number of running reduces. |
| Reduce | reducesTotal | Total number of reduces. |
| Reduce Attempts | failedReduceAttempts | Number of failed reduce attempts. |
| Reduce Attempts | killedReduceAttempts | Number of killed reduce attempts. |
| Reduce Attempts | newReduceAttempts | Number of new reduce attempts. |
| Reduce Attempts | runningReduceAttempts | Number of running reduce attempts. |
| Reduce Attempts | successfulReduceAttempts | Number of successful reduce attempts. |

For more details see: Apache MapReduce REST Jobs API
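To show how these fields combine, here is a minimal sketch that derives map progress and a failed-attempt rate from a job record. It assumes the record is a dictionary keyed by the metric names in the table above, wrapped in a top-level `job` object as in the Apache MapReduce REST Jobs API response. The function names and sample values are invented for illustration.

```python
def map_progress(job):
    """Fraction of map tasks completed, in [0, 1]; 0.0 when the job has no maps."""
    total = job.get("mapsTotal", 0)
    return job.get("mapsCompleted", 0) / total if total else 0.0

def failed_attempts_per_map(job):
    """Failed map attempts per map task; 0.0 when the job has no maps."""
    total = job.get("mapsTotal", 0)
    return job.get("failedMapAttempts", 0) / total if total else 0.0

# Invented sample shaped like a single-job response.
response = {
    "job": {
        "mapsTotal": 200,
        "mapsCompleted": 150,
        "mapsPending": 30,
        "mapsRunning": 20,
        "failedMapAttempts": 10,
    }
}

job = response["job"]
print(map_progress(job))
print(failed_attempts_per_map(job))
```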

MapReduce and file system counters

| Metric | Definition |
| --- | --- |
| fileBytesRead | Amount of data read from local file system. |
| fileBytesWritten | Amount of data written to local file system. |
| fileReadOps | Number of read operations from local file system. |
| fileLargeReadOps | Number of read operations of large files from local file system. |
| fileWriteOps | Number of write operations to local file system. |
| hdfsBytesRead | Amount of data read from HDFS. |
| hdfsBytesWritten | Amount of data written to HDFS. |
| hdfsReadOps | Number of read operations from HDFS. |
| hdfsLargeReadOps | Number of read operations of large files from HDFS. |
| hdfsWriteOps | Number of write operations to HDFS. |

Job counters

| Type | Metric | Definition |
| --- | --- | --- |
| Map | dataLocalMaps | Number of map tasks that were launched on nodes containing the required data. |
| Map | mbMillisMaps | Total megabyte-milliseconds taken by all map tasks. |
| Map | millisMaps | Total time spent by all map tasks (in ms). |
| Map | slotsMillisMaps | Total time spent by all executing maps in occupied slots (in ms). |
| Map | vcoresMillisMaps | Total vCore-milliseconds taken by all map tasks. |
| Reduce | mbMillisReduces | Total megabyte-milliseconds taken by all reduce tasks. |
| Reduce | millisReduces | Total time spent by all reduce tasks (in ms). |
| Reduce | slotsMillisReduces | Total time spent by all executing reduces in occupied slots (in ms). |
| Reduce | totalLaunchedReduces | Total number of launched reduce tasks. |
| Reduce | vcoresMillisReduces | Total vCore-milliseconds taken by all reduce tasks. |
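The resource-time counters are easier to reason about once converted to familiar units. The sketch below assumes that the `Millis` part of the counter names denotes milliseconds, so `mbMillisMaps` is memory (MB) multiplied by task time (ms). The function names and sample values are invented for illustration.

```python
MS_PER_HOUR = 3_600_000
MB_PER_GB = 1024

def average_container_mb(mb_millis, millis):
    """Time-weighted average memory per task in MB, e.g. mbMillisMaps / millisMaps."""
    return mb_millis / millis if millis else 0.0

def memory_gb_hours(mb_millis):
    """Convert megabyte-milliseconds to gigabyte-hours."""
    return mb_millis / (MB_PER_GB * MS_PER_HOUR)

# Invented sample: one hour of total map time in 2048 MB containers.
counters = {"mbMillisMaps": 7_372_800_000, "millisMaps": 3_600_000}

print(average_container_mb(counters["mbMillisMaps"], counters["millisMaps"]))
print(memory_gb_hours(counters["mbMillisMaps"]))
```

The same helpers apply to the reduce-side counters (`mbMillisReduces`, `millisReduces`).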

File input/output format counters

| Metric | Definition |
| --- | --- |
| bytesRead | Amount of data read by every task for every file system. |
| bytesWritten | Amount of data written by every task for every file system. |

For more details see: Apache MapReduce REST Jobs Counters API
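The counter names on this page look like camelCase forms of the upper-case Hadoop counter names (for example, `HDFS_BYTES_READ` and `hdfsBytesRead`). That mapping is an assumption inferred from the naming, not something this page states. Under that assumption, the sketch below flattens a counters response shaped like the Apache MapReduce REST Jobs Counters API output into a dictionary keyed by the names used here. The job ID and values are invented.

```python
def to_camel_case(counter_name):
    """Convert an upper-case counter name such as HDFS_BYTES_READ to hdfsBytesRead."""
    head, *rest = counter_name.lower().split("_")
    return head + "".join(part.capitalize() for part in rest)

def flatten_counters(response):
    """Map each counter's camelCase name to its total value across all groups."""
    flat = {}
    for group in response["jobCounters"].get("counterGroup", []):
        for counter in group.get("counter", []):
            flat[to_camel_case(counter["name"])] = counter["totalCounterValue"]
    return flat

# Invented sample shaped like a job counters response.
response = {
    "jobCounters": {
        "id": "job_0001",
        "counterGroup": [
            {
                "counterGroupName": "org.apache.hadoop.mapreduce.FileSystemCounter",
                "counter": [
                    {"name": "HDFS_BYTES_READ", "totalCounterValue": 2048,
                     "mapCounterValue": 2048, "reduceCounterValue": 0},
                    {"name": "FILE_LARGE_READ_OPS", "totalCounterValue": 0,
                     "mapCounterValue": 0, "reduceCounterValue": 0},
                ],
            },
            {
                "counterGroupName": "org.apache.hadoop.mapreduce.JobCounter",
                "counter": [
                    {"name": "MB_MILLIS_MAPS", "totalCounterValue": 7372800000,
                     "mapCounterValue": 0, "reduceCounterValue": 0},
                ],
            },
        ],
    }
}

print(flatten_counters(response))
```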

MapReduce framework counters

| Type | Metric | Definition |
| --- | --- | --- |
| Map | failedShuffle | Total number of mappers that failed to go through the shuffle phase. |
| Map | mapInputRecords | Total number of records processed by all the mappers. |
| Map | mapOutputBytes | Total amount of (uncompressed) data produced by mappers. |
| Map | mapOutputMaterializedBytes | Amount of (compressed) data that was written to disk. |
| Map | mapOutputRecords | Total number of records produced by all the mappers. |
| Map | mergedMapOutputs | Total number of mapper output files that went through the shuffle phase. |
| Map | shuffledMaps | Total number of mappers that underwent the shuffle phase. |
| Reduce | reduceInputGroups | Total number of unique keys. |
| Reduce | reduceInputRecords | Total number of records processed by all reducers. |
| Reduce | reduceOutputRecords | Total number of records produced by all reducers. |
| Reduce | reduceShuffleBytes | Amount of data processed in the shuffle and reduce phase. |
| Records | combineInputRecords | Total number of records processed by combiners. |
| Records | combineOutputRecords | Total number of records produced by combiners. |
| Records | spilledRecords | Total number of map and reduce records that were spilled to disk. |
| Time | gcTimeMillis | Wall clock time spent in Java Garbage Collection. |
| Time | cpuMilliseconds | Cumulative CPU time for all tasks. |
| Memory | committedHeapBytes | Total amount of memory available for JVM. |
| Memory | physicalMemoryBytes | Total physical memory used by all tasks, including spilled data. |
| Memory | splitRawBytes | Amount of data consumed for meta-data representation during splits. |
| Memory | virtualMemoryBytes | Total virtual memory used by all tasks. |

Shuffle errors

| Metric | Definition |
| --- | --- |
| badId | Total number of errors related to the interpretations of IDs from shuffle headers. |
| connection | Total number of established network connections. |
| ioError | Total number of errors related to reading and writing intermediate data. |
| wrongLength | Total number of errors related to compression and decompression of intermediate data. |
| wrongMap | Total number of errors related to duplication of the mapper output data. |
| wrongReduce | Total number of errors related to the attempts of shuffling data for the wrong reducer. |

Spark metrics

In addition to the metric set supported by MapReduce apps, Spark apps can be polled on:

| Type | Metric | Definition |
| --- | --- | --- |
| Join | joinInputRowCount | Total input rows of the first join of each SQL query, aggregated over all queries that are part of the app. |
| Join | totalJoinInputRowCount | Total input row count for all join operators of all SQL queries that are part of the app. |
| Join | totalJoinOutputRowCount | Total output row count for all join operators of all SQL queries that are part of the app. |
| Join | joinOutputRowCount | Total output rows of the first join of each SQL query, aggregated over all queries that are part of the app. |
| Partitions | inputPartitions | Total number of input partitions for all SQL queries that are part of the app. |
| Partitions | outputPartitions | Total number of output partitions for all SQL queries that are part of the app. |
| Records | inputRecords | Cumulative number of input records for all SQL queries that are part of the app (collected at stage level). |
| Records | outputRecords | Cumulative number of output records for all SQL queries that are part of the app (collected at stage level). |
| Records | outputToInputRecordsRatio | outputRecords / inputRecords if inputRecords > 0, else 0. |
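The definition of `outputToInputRecordsRatio` can be written out directly. This is a minimal sketch of the stated formula, not the product's implementation; the function name is invented for illustration.

```python
def output_to_input_records_ratio(input_records, output_records):
    """outputRecords / inputRecords if inputRecords > 0, else 0."""
    if input_records > 0:
        return output_records / input_records
    return 0.0

print(output_to_input_records_ratio(1000, 250))  # 0.25
print(output_to_input_records_ratio(0, 250))     # 0.0
```

A ratio well above 1 suggests the queries expand their input (for example, through joins), while a ratio near 0 suggests heavy filtering or aggregation.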

YARN resource manager metrics

| Metric | Definition |
| --- | --- |
| allocatedMB | The sum of memory in MB allocated to the app’s running containers. |
| allocatedVCores | The sum of virtual cores allocated to the app’s running containers. |
| appCount | Total number of apps. |
| elapsedTime | The elapsed time since the app started (in ms). |
| runningContainers | The number of containers currently running for the app. |
| memorySeconds | The amount of memory the app has allocated (megabyte-seconds). |
| vcoreSeconds | The amount of CPU resources the app has allocated (virtual core-seconds). |

For more details, see: Apache MapReduce REST Cluster Applications API
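Because `memorySeconds` and `vcoreSeconds` are cumulative while `elapsedTime` is in milliseconds, dividing one by the other gives an app's average allocation over its lifetime. The sketch below assumes an app record keyed by the metric names in the table above. The function name, the output keys `avgMemoryMB` and `avgVCores`, and the sample values are invented for illustration.

```python
def average_allocation(app):
    """Average memory (MB) and vCores held by the app over its elapsed time."""
    elapsed_seconds = app["elapsedTime"] / 1000.0  # elapsedTime is in ms
    if elapsed_seconds <= 0:
        return {"avgMemoryMB": 0.0, "avgVCores": 0.0}
    return {
        "avgMemoryMB": app["memorySeconds"] / elapsed_seconds,
        "avgVCores": app["vcoreSeconds"] / elapsed_seconds,
    }

# Invented sample: an app that has run for one hour.
app = {
    "elapsedTime": 3_600_000,
    "memorySeconds": 7_372_800,
    "vcoreSeconds": 7_200,
    "allocatedMB": 4096,
    "allocatedVCores": 4,
    "runningContainers": 2,
}

print(average_allocation(app))
```

Comparing the averages with the current `allocatedMB` and `allocatedVCores` shows whether the app is holding more resources now than it has on average.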