HBase alerts and metrics
Alerts
Alerts are generated and stored along with metrics. The Unravel UI plots this information as appropriate.
Category | Alert | Suggested Action |
---|---|---|
Data availability | Table offline | Run |
Data availability | Region offline | Run |
Data availability | Region in transition beyond threshold period | If a region server is dead, this is common. If not, run |
Server availability | Dead region servers | Check region server logs for more information. |
Performance | Region servers with reads > 20% of average | Region server hotspotting. Split regions or randomize the keys. |
Performance | Region servers with writes > 20% of average | Region server hotspotting. Split regions or randomize the keys. |
Performance | Regions within a table with reads > 20% of average for that table | Table hotspotting. Split regions or randomize the keys. |
Performance | Regions within a table with writes > 20% of average for that table | Table hotspotting. Split regions or randomize the keys. |
Performance | Regions within a region server with reads > 20% of average for that table | Region server hotspotting. Split regions or randomize the keys. |
Performance | Regions within a region server with writes > 20% of average for that table | Region server hotspotting. Split regions or randomize the keys. |
Performance | Load, osload > 20% of average | Check for compactions, regions in transition, and server logs. |
Performance | Balancer not running | Enable the balancer. |
Performance | Number of compactions and length of compaction | Disable periodic automatic major compactions by setting hbase.hregion.majorcompaction to 0. |
Storage | Region servers with storage (storeFileSize sum) > 20% of average | Split regions or randomize the keys. |
Storage | Regions within a table with storage (storeFileSize sum) > 20% of average for that table | Split regions or randomize the keys. |
Temporal | Requests more than 20% higher in the last hour than in the prior 3 hours (example threshold only) | Check master and region server alerts, or environment issues that could be slowing down reads and writes. |
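The threshold checks in the table above can be sketched in a few lines of Python. This is a minimal illustration, not Unravel's actual alert logic: it assumes "> 20% of average" means "more than 20% above the group average", and the function names, server names, and request counts are all hypothetical.

```python
from statistics import mean


def find_hotspots(counts, threshold=0.20):
    """Return the names whose value exceeds the group average by more than `threshold`.

    `counts` maps a region server (or region) name to a request or storage figure.
    Assumption: "> 20% of average" is read as "more than 20% above the average".
    """
    if not counts:
        return []
    limit = mean(counts.values()) * (1 + threshold)
    return sorted(name for name, value in counts.items() if value > limit)


def is_temporal_spike(last_hour_rate, prior_hourly_rates, threshold=0.20):
    """Compare the last hour's request rate with the mean of the prior hours."""
    baseline = mean(prior_hourly_rates)
    return baseline > 0 and last_hour_rate > baseline * (1 + threshold)


# Hypothetical read request counts per region server.
reads = {"rs1": 1000, "rs2": 1100, "rs3": 2500, "rs4": 900}
hot_servers = find_hotspots(reads)

# Hypothetical hourly request rates: last hour versus the prior 3 hours.
spike = is_temporal_spike(1300, [1000, 1000, 1000])
```

The same `find_hotspots` helper applies to reads, writes, or store file size, at either the region server or the region level, because each alert compares one member of a group against that group's average.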
Metrics
Master/Cluster & JMX metrics
Metric | Description | Unit |
---|---|---|
averageLoad | Average number of regions per region server. | count |
clusterRequests | Number of read and write requests across Cluster. | count |
masterActiveTime | Master Active Time | epoch in milliseconds |
masterStartTime | Master Start Time | epoch in milliseconds |
numDeadRegionServers | Number of dead Region Servers. | count |
numRegionServers | Number of live Region Servers. | count |
ritCount | The number of regions in transition. | count |
ritCountOverThreshold | The number of regions that have been in transition longer than a threshold time. | count |
ritOldestAge | The age of the oldest region in transition. | milliseconds |
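As an illustration of how these metrics might be consumed, the sketch below pulls the master metrics out of a JMX-style JSON document and converts the epoch timestamps to readable dates. It assumes the payload has a top-level `beans` list of flat attribute maps, as Hadoop-style JMX JSON servlets typically return; the bean names and every value in the sample are made up for the example.

```python
import json
from datetime import datetime, timezone

# Metric names taken from the table above.
MASTER_METRICS = (
    "averageLoad", "clusterRequests", "masterActiveTime", "masterStartTime",
    "numDeadRegionServers", "numRegionServers",
    "ritCount", "ritCountOverThreshold", "ritOldestAge",
)


def extract_metrics(jmx_payload, names=MASTER_METRICS):
    """Collect the first occurrence of each named attribute across all beans."""
    found = {}
    for bean in jmx_payload.get("beans", []):
        for name in names:
            if name in bean and name not in found:
                found[name] = bean[name]
    return found


def epoch_ms_to_utc(epoch_ms):
    """Convert an epoch-in-milliseconds metric to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# Hypothetical payload; bean names and values are illustrative only.
sample = json.loads("""
{"beans": [
  {"name": "example:Master,sub=Server",
   "averageLoad": 42.5, "numRegionServers": 4, "numDeadRegionServers": 1,
   "clusterRequests": 123456, "masterStartTime": 1700000000000},
  {"name": "example:Master,sub=AssignmentManager",
   "ritCount": 2, "ritCountOverThreshold": 1, "ritOldestAge": 65000}
]}
""")

metrics = extract_metrics(sample)
started = epoch_ms_to_utc(metrics["masterStartTime"])
```

Scanning every bean for the attribute names, rather than hard-coding which bean holds which metric, keeps the sketch independent of how a particular HBase version groups them.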
OS Metrics (Ambari Only)
OS Metrics | Description | Unit |
---|---|---|
jvm_* | JVM metrics | number |
rpc_* | RPC metrics | number |
Region server metrics
JMX metrics
JMX Metrics | Description | Unit |
---|---|---|
compactionQueueLength | Current depth of the compaction request queue. A growing queue means store file compaction is falling behind. | count |
hlogFileSize | Size of all WAL Files. | bytes |
percentFilesLocal | Percent of store file data that can be read from the local DataNode, 0-100. | percentage |
readRequestCount | The number of read requests received. | count |
regionCount | The number of regions hosted by the regionserver. | count |
slowOPCount | The number of operations flagged as slow, where OP is one of delete, get, put, increment, or append. | count |
storeFileSize | Aggregate size of the store files on disk. | bytes |
writeRequestCount | The number of write requests received. | count |
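The note on compactionQueueLength suggests watching the trend rather than a single reading. One possible way to express that is sketched below; the function name, the sample values, and the rule itself (never decreasing and ending higher than it started) are assumptions made for illustration, not how Unravel evaluates the metric.

```python
def is_queue_growing(samples, min_samples=3):
    """Return True if the queue depth never drops and ends higher than it started.

    `samples` is a list of compactionQueueLength readings, oldest first.
    Fewer than `min_samples` readings is treated as too little data to judge.
    """
    if len(samples) < min_samples:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]


# Hypothetical readings taken at a fixed polling interval.
falling_behind = is_queue_growing([2, 4, 4, 9])
keeping_up = is_queue_growing([5, 3, 6])
```

A stricter or looser rule (for example, a slope over a sliding window) may suit a noisy cluster better; the point is only that the signal is the direction of the queue depth over time.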
OS Metrics (Ambari Only)
OS Metrics | Description | Unit |
---|---|---|
cpu_user | Percentage of CPU time spent in user space. | percentage |
disk.disk_free | Amount of free disk space. | bytes |
disk.write_bps | Number of bytes written per second to disk. | bytes per second |
disk.read_bps | Number of bytes read per second from disk. | bytes per second |
load.load_one | One-minute load average. | number |
memory.mem_free | Percentage of free memory. | percentage |
network.bytes_in | Total number of incoming bytes on the network. | bytes |
network.bytes_out | Total number of outgoing bytes on the network. | bytes |
Table/Region Metrics
Table and Region Metrics | Description | Unit |
---|---|---|
tableSize | Total table size in the region server. | bytes |
regionCount | Number of regions. | count |
averageRegionSize (Table only) | Average region size over the region server including memstore and storefile sizes. | bytes |
storeFileSize | Size of storefiles being served. | bytes |
readRequestCount | Number of read requests this region server has answered. | count |
writeRequestCount | Number of mutation requests this region server has answered. | count |