ClickHouse Monitoring

Monitoring helps you catch potential issues in your cluster before they escalate into critical errors.

What should be monitored

The following metrics should be collected / monitored:

  • For Host Machine:

    • CPU
    • Memory
    • Network (bytes/packets)
    • Storage (iops)
    • Disk Space (free / used)
  • For ClickHouse:

    • Connections (count)
    • RWLocks
    • Read / Write / Return (bytes)
    • Read / Write / Return (rows)
    • ZooKeeper operations (count)
    • Absolute delay
    • Query duration (optional)
    • Replication parts and queue (count)
  • For ZooKeeper:

Monitoring tools

Prometheus (embedded exporter) + Grafana
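The embedded exporter is enabled via the `<prometheus>` section of the server config. A minimal sketch (port 9363 and the `/metrics` endpoint are the conventional values; adjust to your deployment):

```xml
<clickhouse> <!-- older versions use <yandex> as the root element -->
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>
        <metrics>true</metrics>                 <!-- export system.metrics -->
        <events>true</events>                   <!-- export system.events -->
        <asynchronous_metrics>true</asynchronous_metrics> <!-- export system.asynchronous_metrics -->
    </prometheus>
</clickhouse>
```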

Prometheus (embedded http handler with clickhouse-operator style metrics) + Grafana

  • Enable the http handler
  • Useful if you want to use the clickhouse-operator dashboard but do not run ClickHouse in k8s.

Prometheus (clickhouse-operator embedded exporter) + Grafana

Prometheus (clickhouse external exporter) + Grafana


Dashboards querying ClickHouse directly via the Vertamedia / Altinity plugin

Dashboard querying ClickHouse directly via the Grafana plugin



  • Use the embedded exporter. See docs and config.xml



Commercial solution

“Build your own” monitoring

ClickHouse exposes a lot of internal information through system tables. The main tables for monitoring data are:

  • system.metrics
  • system.asynchronous_metrics
  • system.events
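These tables complement each other: system.metrics holds current point-in-time gauges, system.asynchronous_metrics holds values recalculated periodically in the background, and system.events holds monotonic counters accumulated since server start. For example:

```sql
-- Current gauge: number of queries executing right now
SELECT value FROM system.metrics WHERE metric = 'Query';

-- Background-calculated gauge, refreshed periodically
SELECT value FROM system.asynchronous_metrics WHERE metric = 'Uptime';

-- Monotonic counter since server start
SELECT value FROM system.events WHERE event = 'ZooKeeperHardwareExceptions';
```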

Minimum necessary set of checks

| Check Name | Shell or SQL command | Severity |
|---|---|---|
| ClickHouse status | `$ curl 'http://localhost:8123/'` | |
| Too many simultaneous queries. Maximum: 100 (by default) | `select value from system.metrics where metric='Query'` | |
| Replication status | `$ curl 'http://localhost:8123/replicas_status'` | |
| Read-only replicas (reflected by replicas_status as well) | `select value from system.metrics where metric='ReadonlyReplica'` | |
| Some replication tasks are stuck | `select count() from system.replication_queue where num_tries > 100 or num_postponed > 1000` | |
| ZooKeeper is available | `select count() from system.zookeeper where path='/'` | Critical for writes |
| ZooKeeper exceptions | `select value from system.events where event='ZooKeeperHardwareExceptions'` | |
| Other CH nodes are available | `$ for node in $(echo "select distinct host_address from system.clusters where host_name !='localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-); do curl "http://$node:8123/" --silent; done \| sort -u` | |
| All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries) | `$ for cluster in $(echo "select distinct cluster from system.clusters where host_name !='localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-); do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)"; done` | Critical |
| There are files in 'detached' folders | `$ find /var/lib/clickhouse/data/*/*/detached/* -type d \| wc -l`; (19.8+) `select count() from system.detached_parts` | |
| Too many parts: number of parts is growing; inserts are being delayed; inserts are being rejected | `select value from system.asynchronous_metrics where metric='MaxPartCountForPartition'; select value from system.events where event='DelayedInserts'; select value from system.events where event='RejectedInserts'` | |
| Dictionaries: exception | `select concat(name, ': ', last_exception) from system.dictionaries where last_exception != ''` | |
| ClickHouse has been restarted | `select uptime(); select value from system.asynchronous_metrics where metric='Uptime'` | |
| DistributedFilesToInsert should not be always increasing | `select value from system.metrics where metric='DistributedFilesToInsert'` | |
| A data part was lost | `select value from system.events where event='ReplicatedDataLoss'` | |
| Data parts are not the same on different replicas | `select value from system.events where event='DataAfterMergeDiffersFromReplica'; select value from system.events where event='DataAfterMutationDiffersFromReplica'` | Medium |
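Checks like these can be wrapped in small shell helpers that compare a queried value against a threshold. A minimal sketch for the "too many parts" check (the 300 threshold mirrors the usual `parts_to_throw_insert` default; adjust to your MergeTree settings):

```shell
#!/bin/sh
# Decide whether a partition has too many parts.
too_many_parts() {
    # $1 = value of MaxPartCountForPartition, e.g. obtained with:
    #   select value from system.asynchronous_metrics where metric='MaxPartCountForPartition'
    if [ "$1" -ge 300 ]; then
        echo "ALERT"
    else
        echo "OK"
    fi
}

too_many_parts 42   # -> OK
```

The same pattern extends to the other numeric checks in the table; a cron job or your agent of choice can run the helper and page on "ALERT".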

The following queries are recommended to be included in monitoring:

  • SELECT * FROM system.replicas
  • SELECT * FROM system.merges
    • Checks on the speed and progress of currently executed merges.
  • SELECT * FROM system.mutations
    • Checks on the speed and progress of currently executed mutations.
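For system.replicas, a query along these lines surfaces unhealthy replicas directly (the thresholds are illustrative, not ClickHouse defaults):

```sql
SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE is_readonly
   OR is_session_expired
   OR absolute_delay > 300
   OR queue_size > 100
```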

Logs monitoring

ClickHouse logs are another important source of information. Two logs are enabled by default:

  • /var/log/clickhouse-server/clickhouse-server.err.log (errors & warnings; you may want to keep an eye on it or ship it to a monitoring system)
  • /var/log/clickhouse-server/clickhouse-server.log (trace logs; very detailed and useful for debugging, but usually too verbose to monitor).

You can additionally enable the system.text_log table to access the logs via SQL queries (make sure you do not expose information to users who should not see it).

$ cat /etc/clickhouse-server/config.d/text_log.xml
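A minimal sketch of such a file (the table name and flush interval shown are common defaults; the level filter is optional):

```xml
<clickhouse> <!-- older versions use <yandex> as the root element -->
    <text_log>
        <database>system</database>
        <table>text_log</table>
        <level>warning</level> <!-- optional: keep only warnings and above -->
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </text_log>
</clickhouse>
```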

OpenTelemetry support


Other sources