Supported cluster metrics
By default, AutoActions collects metrics from the YARN Resource Manager and the MapReduce Application Master (AM) for all apps running on, or submitted to, the target cluster. The MapReduce AM also maintains various counters. Users can use these metrics and counters when defining an AutoActions rule. Hive/Workflow and Spark metrics are also available for defining rules.
Note
Monitoring is performed on most live running apps, allowing users to take proactive action when violations are detected. See here for limitations on AutoActions.
MapReduce AM metrics are monitored only when the user defines an AutoActions rule that requires polling or aggregating a MapReduce AM metric.
Databricks metrics
Metric | Definition |
---|---|
totalDuration | Databricks job duration. |
cost | Databricks job cost. |
Hive/Workflow metrics
Metric | Definition |
---|---|
duration | Total time taken by the app. |
totalDfsBytesRead | Total HDFS bytes read. |
totalDfsBytesWritten | Total HDFS bytes written. |
Impala apps metrics
Metric | Definition |
---|---|
ImpalaDuration | Total time taken by the Impala app. |
ImpalaHdfsBytesRead | Total HDFS bytes read. |
ImpalaHdfsBytesWritten | Total HDFS bytes written. |
MapReduce application master and MapReduce metrics
Type | Metric | Definition |
---|---|---|
| | elapsedAppTime | Time since the app was started. |
Map | mapsCompleted | Number of completed maps. |
Map | mapsPending | Number of maps still to be run. |
Map | mapsRunning | Number of running maps. |
Map | mapsTotal | Total number of maps. |
Map Attempts | failedMapAttempts | Number of failed map attempts. |
Map Attempts | killedMapAttempts | Number of killed map attempts. |
Map Attempts | newMapAttempts | Number of new map attempts. |
Map Attempts | runningMapAttempts | Number of running map attempts. |
Reduce | reducesCompleted | Number of completed reduces. |
Reduce | reducesPending | Number of reduces still to be run. |
Reduce | reducesRunning | Number of running reduces. |
Reduce | reducesTotal | Total number of reduces. |
Reduce Attempts | failedReduceAttempts | Number of failed reduce attempts. |
Reduce Attempts | killedReduceAttempts | Number of killed reduce attempts. |
Reduce Attempts | newReduceAttempts | Number of new reduce attempts. |
Reduce Attempts | runningReduceAttempts | Number of running reduce attempts. |
Reduce Attempts | successfulReduceAttempts | Number of successful reduce attempts. |
For more details, see: Apache MapReduce REST Jobs API
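As an illustration of how these values can be read, the sketch below parses a response shaped like the one returned by the MapReduce AM REST Jobs API and derives map and reduce progress from the metrics in the table above. The sample payload values and the helper names `summarize_job` and `summarize_jobs` are assumptions made for this example; they are not part of AutoActions. Note that the REST API reports the elapsed time as `elapsedTime`, while the table above lists it as `elapsedAppTime`.

```python
import json

# Sample shaped like a MapReduce AM REST Jobs API response
# (GET .../ws/v1/mapreduce/jobs). All values are made up for illustration.
SAMPLE_RESPONSE = """
{"jobs": {"job": [{
  "id": "job_0000000000000_0001",
  "elapsedTime": 120000,
  "mapsTotal": 40, "mapsCompleted": 30, "mapsPending": 6, "mapsRunning": 4,
  "failedMapAttempts": 2, "killedMapAttempts": 0,
  "reducesTotal": 8, "reducesCompleted": 0, "reducesPending": 8, "reducesRunning": 0
}]}}
"""


def _fraction(done: int, total: int) -> float:
    """Return done/total, or 0.0 when there is nothing to run."""
    return done / total if total else 0.0


def summarize_job(job: dict) -> dict:
    """Derive simple progress figures from one job entry (hypothetical helper)."""
    return {
        "id": job["id"],
        "mapProgress": _fraction(job.get("mapsCompleted", 0), job.get("mapsTotal", 0)),
        "reduceProgress": _fraction(job.get("reducesCompleted", 0), job.get("reducesTotal", 0)),
        "failedMapAttempts": job.get("failedMapAttempts", 0),
    }


def summarize_jobs(response_text: str) -> list:
    """Parse a Jobs API response body and summarize every job in it."""
    payload = json.loads(response_text)
    jobs = (payload.get("jobs") or {}).get("job", [])
    return [summarize_job(job) for job in jobs]


if __name__ == "__main__":
    for summary in summarize_jobs(SAMPLE_RESPONSE):
        print(summary)
```

In a live setup the response text would come from an HTTP request to the Application Master (typically through the Resource Manager proxy) rather than from a constant.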
MapReduce and file system counters
Metric | Definition |
---|---|
fileBytesRead | Amount of data read from local file system. |
fileBytesWritten | Amount of data written to local file system. |
fileReadOps | Number of read operations from local file system. |
fileLargeReadOps | Number of read operations of large files from local file system. |
fileWriteOps | Number of write operations to local file system. |
hdfsBytesRead | Amount of data read from HDFS. |
hdfsBytesWritten | Amount of data written to HDFS. |
hdfsReadOps | Number of read operations from HDFS. |
hdfsLargeReadOps | Number of read operations of large files from HDFS. |
hdfsWriteOps | Number of write operations to HDFS. |
Job counters
Type | Metric | Definition |
---|---|---|
Map | dataLocalMaps | Number of map tasks launched on nodes containing the required data. |
Map | mbMillisMaps | Total megabyte-milliseconds taken by all map tasks. |
Map | millisMaps | Total time spent by all map tasks (in ms). |
Map | slotsMillisMaps | Total time spent by all executing maps in occupied slots (in ms). |
Map | vcoresMillisMaps | Total vCore-milliseconds taken by all map tasks. |
Reduce | mbMillisReduces | Total megabyte-milliseconds taken by all reduce tasks. |
Reduce | millisReduces | Total time spent by all reduce tasks (in ms). |
Reduce | slotsMillisReduces | Total time spent by all executing reduces in occupied slots (in ms). |
Reduce | totalLaunchedReduces | Total number of launched reduce tasks. |
Reduce | vcoresMillisReduces | Total vCore-milliseconds taken by all reduce tasks. |
File input/output format counters
Metric | Definition |
---|---|
bytesRead | Amount of data read by every task for every file system. |
bytesWritten | Amount of data written by every task for every file system. |
For more details, see: Apache MapReduce REST Jobs Counters API
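The counter names in these tables look like camelCase forms of the upper-case names Hadoop reports (for example, `FILE_BYTES_READ` and `fileBytesRead`). The sketch below assumes that naming relationship, which this page does not state explicitly, and flattens a response shaped like the MapReduce Job Counters API output into a dictionary keyed by the camelCase names. The sample values and the helper names `to_camel_case` and `flatten_counters` are hypothetical.

```python
import json

# Sample shaped like a Job Counters API response
# (GET .../ws/v1/mapreduce/jobs/{jobid}/counters). Values are made up.
SAMPLE_COUNTERS = """
{"jobCounters": {"id": "job_0000000000000_0001", "counterGroup": [
  {"counterGroupName": "org.apache.hadoop.mapreduce.FileSystemCounter",
   "counter": [
     {"name": "FILE_BYTES_READ", "totalCounterValue": 2048,
      "mapCounterValue": 1024, "reduceCounterValue": 1024},
     {"name": "HDFS_BYTES_WRITTEN", "totalCounterValue": 512,
      "mapCounterValue": 0, "reduceCounterValue": 512}
   ]},
  {"counterGroupName": "org.apache.hadoop.mapreduce.JobCounter",
   "counter": [
     {"name": "MB_MILLIS_MAPS", "totalCounterValue": 4096000,
      "mapCounterValue": 0, "reduceCounterValue": 0}
   ]}
]}}
"""


def to_camel_case(counter_name: str) -> str:
    """Convert an UPPER_SNAKE_CASE counter name to camelCase.

    Assumption: the metric names on this page follow this convention.
    """
    head, *rest = counter_name.lower().split("_")
    return head + "".join(part.capitalize() for part in rest)


def flatten_counters(response_text: str) -> dict:
    """Map each camelCase counter name to its total value across all groups."""
    payload = json.loads(response_text)
    groups = payload.get("jobCounters", {}).get("counterGroup", [])
    flat = {}
    for group in groups:
        for counter in group.get("counter", []):
            flat[to_camel_case(counter["name"])] = counter["totalCounterValue"]
    return flat


if __name__ == "__main__":
    print(flatten_counters(SAMPLE_COUNTERS))
```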
MapReduce framework counters
Type | Metric | Definition |
---|---|---|
Map | failedShuffle | Total number of mappers that failed to go through the shuffle phase. |
Map | mapInputRecords | Total number of records processed by all the mappers. |
Map | mapOutputBytes | Total amount of (uncompressed) data produced by mappers. |
Map | mapOutputMaterializedBytes | Amount of (compressed) data that was written to disk. |
Map | mapOutputRecords | Total number of records produced by all the mappers. |
Map | mergedMapOutputs | Total number of mapper output files that went through the shuffle phase. |
Map | shuffledMaps | Total number of mappers that went through the shuffle phase. |
Reduce | reduceInputGroups | Total number of unique keys. |
Reduce | reduceInputRecords | Total number of records processed by all reducers. |
Reduce | reduceOutputRecords | Total number of records produced by all reducers. |
Reduce | reduceShuffleBytes | Amount of data processed in the shuffle and reduce phase. |
Records | combineInputRecords | Total number of records processed by combiners. |
Records | combineOutputRecords | Total number of records produced by combiners. |
Records | spilledRecords | Total number of map and reduce records that were spilled to disk. |
Time | gcTimeMillis | Wall clock time spent in Java garbage collection. |
Time | cpuMilliseconds | Cumulative CPU time for all tasks. |
Memory | committedHeapBytes | Total amount of memory available to the JVM. |
Memory | physicalMemoryBytes | Total physical memory used by all tasks, including spilled data. |
Memory | splitRawBytes | Amount of data consumed for metadata representation during splits. |
Memory | virtualMemoryBytes | Total virtual memory used by all tasks. |
Shuffle errors
Metric | Definition |
---|---|
badId | Total number of errors related to the interpretations of IDs from shuffle headers. |
connection | Total number of errors related to establishing network connections. |
ioError | Total number of errors related to reading and writing intermediate data. |
wrongLength | Total number of errors related to compression and decompression of intermediate data. |
wrongMap | Total number of errors related to duplication of the mapper output data. |
wrongReduce | Total number of errors related to the attempts of shuffling data for the wrong reducer. |
Spark metrics
In addition to the metric set supported by MapReduce apps, Spark apps can be polled on:
Type | Metric | Definition |
---|---|---|
Join | joinInputRowCount | The total input rows of the first join of the SQL query, aggregated for all the queries that are part of the app. |
Join | totalJoinInputRowCount | Total input row count for all join operators of all SQL queries that are part of the app. |
Join | totalJoinOutputRowCount | Total output row count for all join operators of all SQL queries that are part of the app. |
Join | joinOutputRowCount | The total output rows of the first join of the SQL query, aggregated for all the queries that are part of the app. |
Partitions | inputPartitions | Total number of input partitions for all SQL queries that are part of the app. |
Partitions | outputPartitions | Total number of output partitions for all SQL queries that are part of the app. |
Records | inputRecords | Cumulative number of input records for all SQL queries that are part of the app (collected at stage level). |
Records | outputRecords | Cumulative number of output records for all SQL queries that are part of the app (collected at stage level). |
Records | outputToInputRecordsRatio | outputRecords / inputRecords if inputRecords > 0, else 0. |
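The definition of `outputToInputRecordsRatio` can be written out as a small function. This is only a sketch of the formula in the table; the function name is chosen for this example.

```python
def output_to_input_records_ratio(input_records: int, output_records: int) -> float:
    """Return outputRecords / inputRecords, or 0 when there are no input records."""
    if input_records > 0:
        return output_records / input_records
    return 0.0


if __name__ == "__main__":
    print(output_to_input_records_ratio(200, 50))  # 0.25
    print(output_to_input_records_ratio(0, 50))    # 0.0
```

The guard for zero input records avoids a division by zero for apps that have not read any input yet.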
YARN resource manager metrics
Metric | Definition |
---|---|
allocatedMB | The sum of memory in MB allocated to the app’s running containers. |
allocatedVCores | The sum of virtual cores allocated to the app’s running containers. |
appCount | Total number of apps. |
elapsedTime | The elapsed time since the app started (in ms). |
runningContainers | The number of containers currently running for the app. |
memorySeconds | The amount of memory the app has allocated (megabyte-seconds). |
vcoreSeconds | The amount of CPU resources the app has allocated (virtual core-seconds). |
For more details, see: Apache MapReduce REST Cluster Applications API
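Several of the metrics above correspond to per-app fields in the Resource Manager's Cluster Applications API response. As a rough sketch, the example below parses a response of that shape and derives average memory and vCore usage by dividing `memorySeconds` and `vcoreSeconds` by the elapsed time. The sample values and the helper names `average_usage` and `summarize_apps` are assumptions for illustration, and the derived averages are not metrics that AutoActions itself exposes.

```python
import json

# Sample shaped like a Cluster Applications API response
# (GET http://<rm-host>:<port>/ws/v1/cluster/apps). Values are made up.
SAMPLE_APPS = """
{"apps": {"app": [{
  "id": "application_0000000000000_0001",
  "state": "RUNNING",
  "elapsedTime": 600000,
  "allocatedMB": 8192,
  "allocatedVCores": 4,
  "runningContainers": 4,
  "memorySeconds": 2457600,
  "vcoreSeconds": 1200
}]}}
"""


def average_usage(app: dict) -> dict:
    """Average MB and vCores held by the app over its lifetime so far."""
    elapsed_seconds = app.get("elapsedTime", 0) / 1000  # elapsedTime is in ms
    if elapsed_seconds <= 0:
        return {"avgMB": 0.0, "avgVCores": 0.0}
    return {
        "avgMB": app.get("memorySeconds", 0) / elapsed_seconds,
        "avgVCores": app.get("vcoreSeconds", 0) / elapsed_seconds,
    }


def summarize_apps(response_text: str) -> dict:
    """Return average usage per app ID; an empty cluster yields an empty dict."""
    payload = json.loads(response_text)
    apps = (payload.get("apps") or {}).get("app", [])
    return {app["id"]: average_usage(app) for app in apps}


if __name__ == "__main__":
    print(summarize_apps(SAMPLE_APPS))
```

Comparing the lifetime average with the current `allocatedMB` gives a quick sense of whether an app's memory footprint is growing or shrinking.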