ClickHouse Monitoring

Monitoring helps you track potential issues in your cluster before they cause a critical error.

What should be monitored

The following metrics should be collected / monitored:

  • For Host Machine:

    • CPU
    • Memory
    • Network (bytes/packets)
    • Storage (iops)
    • Disk Space (free / used)
  • For ClickHouse:

    • Connections (count)
    • RWLocks
    • Read / Write / Return (bytes)
    • Read / Write / Return (rows)
    • Zookeeper operations (count)
    • Absolute delay
    • Query duration (optional)
    • Replication parts and queue (count)
  • For ZooKeeper:

Monitoring tools

  • Prometheus (embedded exporter) + Grafana
    • Use the embedded exporter. See docs and config.xml.
  • clickhouse-operator embedded exporter
  • Prometheus exporter (external) + Grafana
  • Dashboards querying ClickHouse directly via the vertamedia / Altinity plugin
  • Dashboard querying ClickHouse directly via the Grafana plugin
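The embedded exporter is configured in the <prometheus> section of the server config; a minimal sketch (port 9363 is the conventional default, and the flags shown enable exporting of metrics, events, and asynchronous metrics):

```xml
<clickhouse>
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>
        <metrics>true</metrics>
        <events>true</events>
        <asynchronous_metrics>true</asynchronous_metrics>
    </prometheus>
</clickhouse>
```

On older releases the root tag is <yandex> instead of <clickhouse>.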



Commercial solution

“Build your own” monitoring

ClickHouse exposes a lot of its internals through system tables. The main tables for accessing monitoring data are:

  • system.metrics
  • system.asynchronous_metrics
  • system.events
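A quick illustration of the difference between these tables (the metric and event names below are real; pick the ones relevant to your alerting):

```sql
-- system.metrics: current (gauge-like) values
SELECT metric, value FROM system.metrics WHERE metric IN ('Query', 'TCPConnection');

-- system.asynchronous_metrics: values recalculated periodically in the background
SELECT metric, value FROM system.asynchronous_metrics WHERE metric = 'Uptime';

-- system.events: cumulative counters since server start
SELECT event, value FROM system.events WHERE event LIKE 'ZooKeeper%';
```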

Minimum necessary set of checks

Each check below lists the shell or SQL command to run; severity is noted in parentheses where it matters.

  • ClickHouse status
    $ curl 'http://localhost:8123/'
  • Too many simultaneous queries. Maximum: 100 (by default)
    select value from system.metrics where metric='Query'
  • Replication status
    $ curl 'http://localhost:8123/replicas_status'
  • Read-only replicas (reflected by replicas_status as well)
    select value from system.metrics where metric='ReadonlyReplica'
  • Some replication tasks are stuck
    select count() from system.replication_queue where num_tries > 100 or num_postponed > 1000
  • ZooKeeper is available (critical for writes)
    select count() from system.zookeeper where path='/'
  • ZooKeeper exceptions
    select value from system.events where event='ZooKeeperHardwareExceptions'
  • Other CH nodes are available
    $ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done | sort -u
  • All CH clusters are available, i.e. every configured cluster has enough replicas to serve queries (critical)
    $ for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done
  • There are files in 'detached' folders
    $ find /var/lib/clickhouse/data/*/*/detached/* -type d | wc -l
    select count() from system.detached_parts   -- 19.8+
  • Too many parts: the number of parts is growing; inserts are being delayed or rejected
    select value from system.asynchronous_metrics where metric='MaxPartCountForPartition';
    select value from system.events where event='DelayedInserts';
    select value from system.events where event='RejectedInserts'
  • Dictionaries: exception
    select concat(name,': ',last_exception) from system.dictionaries where last_exception != ''
  • ClickHouse has been restarted
    select uptime();
    select value from system.asynchronous_metrics where metric='Uptime'
  • DistributedFilesToInsert should not be always increasing
    select value from system.metrics where metric='DistributedFilesToInsert'
  • A data part was lost
    select value from system.events where event='ReplicatedDataLoss'
  • Data parts are not the same on different replicas (medium)
    select value from system.events where event='DataAfterMergeDiffersFromReplica';
    select value from system.events where event='DataAfterMutationDiffersFromReplica'
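Any of the checks above can be wrapped into a small shell probe for cron- or Nagios-style alerting; a sketch for the "too many parts" check (the URL and the 300-part threshold are example values, adjust for your setup):

```shell
#!/bin/sh
# Alert when some partition has accumulated too many active parts.
CH_URL="${CH_URL:-http://localhost:8123/}"
THRESHOLD="${THRESHOLD:-300}"

query="select value from system.asynchronous_metrics where metric='MaxPartCountForPartition'"
parts=$(echo "$query" | curl --silent "$CH_URL" --data-binary @-) || exit 3

# Strip a possible decimal part before the integer comparison.
if [ "${parts%%.*}" -gt "$THRESHOLD" ]; then
    echo "CRITICAL: MaxPartCountForPartition=$parts"
    exit 2
fi
echo "OK: MaxPartCountForPartition=$parts"
```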

The following queries are recommended to be included in monitoring:

  • SELECT * FROM system.replicas
    • For more information, see the ClickHouse guide on System Tables
  • SELECT * FROM system.merges
    • Shows the speed and progress of currently executed merges.
  • SELECT * FROM system.mutations
    • Shows the speed and progress of currently executed mutations.
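The system.replicas output is wide; for alerting it is usually reduced to the problem rows only, for example (the 60-second delay threshold is an example value):

```sql
SELECT database, table, is_readonly, absolute_delay
FROM system.replicas
WHERE is_readonly OR absolute_delay > 60
```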

Logs monitoring

ClickHouse logs can be another important source of information. There are two logs enabled by default:

  • /var/log/clickhouse-server/clickhouse-server.err.log (errors & warnings; you may want to keep an eye on it or ship it to a monitoring system)
  • /var/log/clickhouse-server/clickhouse-server.log (trace log, very detailed, useful for debugging, usually too verbose to monitor).

You can additionally enable the system.text_log table to get access to the logs via ClickHouse SQL queries (make sure you do not expose information to users who should not see it).

$ cat /etc/clickhouse-server/config.d/text_log.xml
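The referenced file could look like the following sketch (the level and flush interval are example values):

```xml
<clickhouse>
    <text_log>
        <database>system</database>
        <table>text_log</table>
        <level>warning</level>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </text_log>
</clickhouse>
```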

OpenTelemetry support
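ClickHouse can record trace spans for executed queries into the system.opentelemetry_span_log table (the span log must be enabled in the server configuration). A brief sketch, assuming a server version with OpenTelemetry support:

```sql
-- Force tracing for the current session (sampling probability 0..1)
SET opentelemetry_start_trace_probability = 1;

-- Run any query, then inspect the recorded spans
SELECT trace_id, operation_name, finish_time_us - start_time_us AS duration_us
FROM system.opentelemetry_span_log
ORDER BY finish_time_us DESC
LIMIT 10;
```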


Other sources