Monitoring helps to track potential issues in your cluster before they cause a critical error.
External monitoring collects data from the ClickHouse cluster and uses it for analysis and review. Recommended external monitoring systems include:
- Prometheus: Use the embedded exporter or clickhouse-exporter.
- Graphite: Use the embedded exporter. See config.xml.
- InfluxDB: Use the embedded exporter, plus Telegraf. For more information, see Graphite protocol support in InfluxDB.
ClickHouse can also record metrics internally by enabling
For dashboards:
- Grafana is recommended for graphs, reports, alerts, dashboards, etc.
- Other options are Nagios or Zabbix.
The following metrics should be collected:
- For the host machine:
  - Network (bytes/packets)
  - Storage (IOPS)
  - Disk space (free / used)
- For ClickHouse:
  - Connections (count)
  - Read / Write / Return (bytes)
  - Read / Write / Return (rows)
  - ZooKeeper operations (count)
  - Absolute delay
  - Query duration (optional)
  - Replication parts and queue (count)
- For ZooKeeper:
The following queries are recommended to be included in monitoring:
- SELECT * FROM system.replicas
  - For more information, see the ClickHouse guide on System Tables.
- SELECT * FROM system.merges
  - Checks on the speed and progress of currently executed merges.
- SELECT * FROM system.mutations
  - This is the source of information on the speed and progress of currently executed mutations.
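As a rough illustration of how the three queries above could be polled by a monitoring script, here is a minimal Python sketch. It assumes the ClickHouse HTTP interface is reachable on localhost:8123 (the address used in the curl checks in this document) without authentication. The names `build_url`, `http_get`, `parse_rows`, and `collect` are hypothetical helpers, not part of any ClickHouse client library.

```python
import json
import urllib.parse
import urllib.request

# The three recommended queries, keyed by a short label.
RECOMMENDED_QUERIES = {
    "replicas": "SELECT * FROM system.replicas",
    "merges": "SELECT * FROM system.merges",
    "mutations": "SELECT * FROM system.mutations",
}


def build_url(query, host="localhost", port=8123):
    """Build an HTTP-interface URL that returns one JSON object per row."""
    params = urllib.parse.urlencode({"query": f"{query} FORMAT JSONEachRow"})
    return f"http://{host}:{port}/?{params}"


def http_get(url):
    """Default transport (assumption: no authentication is required)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_rows(body):
    """Turn a JSONEachRow response body into a list of dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def collect(fetch=http_get, host="localhost"):
    """Run every recommended query and return {label: list of rows}."""
    return {
        label: parse_rows(fetch(build_url(query, host=host)))
        for label, query in RECOMMENDED_QUERIES.items()
    }
```

Calling `collect()` against a running server returns a dictionary with one list of rows per system table. The transport is passed in as `fetch` so that the parsing logic can be exercised without a live cluster.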
Monitor and Alerts
Configure notifications for events and thresholds based on the health checks in the following table:
| Check Name | Shell or SQL command | Severity |
|---|---|---|
| ClickHouse status | $ curl 'http://localhost:8123/' (expected response: Ok.) | Critical |
| Too many simultaneous queries. Maximum: 100 | select value from system.metrics where metric='Query' | Critical |
| Replication status | $ curl 'http://localhost:8123/replicas_status' (expected response: Ok.) | High |
| Read only replicas (reflected by replicas_status as well) | select value from system.metrics where metric='ReadonlyReplica' | High |
| ReplicaPartialShutdown (not reflected by replicas_status, but seems to correlate with ZooKeeperHardwareExceptions) | select value from system.events where event='ReplicaPartialShutdown' | High. Note: I turned this one off. It almost always correlates with ZooKeeperHardwareExceptions, and when it does not, nothing bad is happening… |
| Some replication tasks are stuck | select count() from system.replication_queue where num_tries > 100 | High |
| ZooKeeper is available | select count() from system.zookeeper where path='/' | Critical for writes |
| ZooKeeper exceptions | select value from system.events where event='ZooKeeperHardwareExceptions' | Medium |
| Other CH nodes are available | $ for node in $(echo "select distinct host_address from system.clusters where host_name !='localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-); do curl "http://$node:8123/" --silent ; done | |
| All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries) | $ for cluster in $(echo "select distinct cluster from system.clusters where host_name !='localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-) ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done | |
| There are files in 'detached' folders | $ find /var/lib/clickhouse/data/*/*/detached/* -type d; on ClickHouse 19.8+: select count() from system.detached_parts | |
| Too many parts: number of parts is growing; inserts are being delayed; inserts are being rejected | select value from system.asynchronous_metrics where metric='MaxPartCountForPartition'; select value from system.events where event='DelayedInserts'; select value from system.metrics where metric='DelayedInserts'; select value from system.events where event='RejectedInserts' | |
| Dictionaries: exception | select concat(name,': ',last_exception) from system.dictionaries where last_exception != '' | Medium |
| ClickHouse has been restarted | select uptime(); select value from system.asynchronous_metrics where metric='Uptime' | |
| DistributedFilesToInsert should not be always increasing | select value from system.metrics where metric='DistributedFilesToInsert' | Medium |
| A data part was lost | select value from system.events where event='ReplicatedDataLoss' | High |
| Data parts are not the same on different replicas | select value from system.events where event='DataAfterMergeDiffersFromReplica'; select value from system.events where event='DataAfterMutationDiffersFromReplica' | |
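The threshold-style checks in the table above could be wired into an alerting script roughly as follows. This is only a sketch under stated assumptions: the `Check` class, the `alert_at` field, and the `evaluate` function are hypothetical, and the rule "alert when the value is greater than or equal to `alert_at`" is an illustrative choice. Only the limit of 100 simultaneous queries comes from the table; the other two limits are assumed. The `run_query` argument is any callable that takes SQL text and returns the raw response as a string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    query: str
    severity: str
    alert_at: float  # assumed rule: alert when the value is >= alert_at


# Three rows taken from the table. Only the limit of 100 is stated there;
# the other alert_at values are illustrative assumptions.
CHECKS = [
    Check("Too many simultaneous queries",
          "select value from system.metrics where metric='Query'",
          "Critical", 100),
    Check("Read only replicas",
          "select value from system.metrics where metric='ReadonlyReplica'",
          "High", 1),
    Check("Some replication tasks are stuck",
          "select count() from system.replication_queue where num_tries > 100",
          "High", 1),
]


def evaluate(checks, run_query):
    """Return (name, severity, value) for every check that reaches its limit."""
    alerts = []
    for check in checks:
        raw = run_query(check.query).strip()
        value = float(raw) if raw else 0.0  # an empty result is treated as zero
        if value >= check.alert_at:
            alerts.append((check.name, check.severity, value))
    return alerts
```

Checks based on system.events are left out of the sketch on purpose: those counters accumulate for as long as the server runs, so they are better compared against their previous reading than against a fixed limit.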