ClickHouse® Monitoring

Tracking potential issues in your cluster before they cause a critical error

What to read / watch on the subject:

Altinity webinar “ClickHouse Monitoring 101: What to monitor and how”. Watch the video or download the slides.
The ClickHouse docs

What should be monitored

The following metrics should be collected and monitored:

For Host Machine:
- CPU
- Memory
- Network (bytes/packets)
- Storage (iops)
- Disk Space (free / used)
For ClickHouse:
- Connections (count)
- RWLocks
- Read / Write / Return (bytes)
- Read / Write / Return (rows)
- ZooKeeper operations (count)
- Absolute delay
- Query duration (optional)
- Replication parts and queue (count)
For ZooKeeper:
- See separate article

ClickHouse monitoring tools

Prometheus (embedded exporter) + Grafana

Enable embedded exporter
Grafana dashboards https://grafana.com/grafana/dashboards/14192 or https://grafana.com/grafana/dashboards/13500
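
As a rough sketch of what enabling the embedded exporter involves, the shell function below writes a `prometheus` section into a drop-in config file. The function name `write_prometheus_config`, the file name `prometheus.xml`, and port 9363 are assumptions chosen for illustration; check the linked instructions for the settings your version supports.

```shell
#!/usr/bin/env bash
# Sketch only: write a drop-in config that turns on the embedded Prometheus endpoint.
# The function name, file name, and port 9363 are illustrative assumptions.

write_prometheus_config() {
    # $1: config.d directory, e.g. /etc/clickhouse-server/config.d
    conf_dir="$1"
    mkdir -p "$conf_dir" || return 1
    # Older releases expect <yandex> as the root element instead of <clickhouse>.
    cat > "$conf_dir/prometheus.xml" <<'EOF'
<clickhouse>
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>
        <metrics>true</metrics>
        <events>true</events>
        <asynchronous_metrics>true</asynchronous_metrics>
    </prometheus>
</clickhouse>
EOF
}
```

Once the server has picked up the file (a restart is the safe option), something like `curl --silent http://localhost:9363/metrics | grep '^ClickHouseMetrics_Query'` should return a sample if the endpoint is up.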

Prometheus (embedded http handler with Altinity Kubernetes Operator for ClickHouse style metrics) + Grafana

Enable http handler
Useful if you want to use the dashboard from the Altinity Kubernetes Operator for ClickHouse but do not run ClickHouse in k8s.

Prometheus (embedded exporter in the Altinity Kubernetes Operator for ClickHouse) + Grafana

The exporter is included in the Altinity Kubernetes Operator for ClickHouse and is enabled automatically
See the instructions for installing Prometheus and Grafana (if you don’t have them already)
Grafana dashboard https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard
Prometheus alerts https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml

Prometheus (ClickHouse external exporter) + Grafana

clickhouse-exporter
Dashboard: https://grafana.com/grafana/dashboards/882

(unmaintained)

Dashboards querying ClickHouse directly via vertamedia / Altinity plugin

Overview: https://grafana.com/grafana/dashboards/13606
Queries dashboard (analyzing system.query_log) https://grafana.com/grafana/dashboards/2515

Dashboard querying ClickHouse directly via Grafana plugin

https://grafana.com/blog/2022/05/05/introducing-the-official-clickhouse-plugin-for-grafana/

Zabbix

Graphite

Use the embedded exporter. See docs and config.xml

InfluxDB

You can use the embedded exporter plus Telegraf. For more information, see Graphite protocol support in InfluxDB.

Nagios/Icinga

https://github.com/exogroup/check_clickhouse/

Commercial solution

“Build your own” ClickHouse monitoring

ClickHouse exposes many of its internals through system tables. The main tables for monitoring data are:

system.metrics
system.asynchronous_metrics
system.events
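
To illustrate how these three tables can feed a home-grown collector, here is a minimal sketch that reads them in one query and picks out a single value. The helper names `metrics_snapshot_sql` and `pick_value` are hypothetical, and the cast to Float64 is just one way to make the three `value` columns compatible in a UNION ALL.

```shell
#!/usr/bin/env bash
# Sketch: read the three monitoring tables in one query and pick a single value.
# Helper names are hypothetical.

# Print one SQL statement returning (source, name, value) rows as TSV.
metrics_snapshot_sql() {
    cat <<'SQL'
SELECT 'metrics' AS source, metric AS name, toFloat64(value) AS value
FROM system.metrics
UNION ALL
SELECT 'asynchronous_metrics', metric, value
FROM system.asynchronous_metrics
UNION ALL
SELECT 'events', event, toFloat64(value)
FROM system.events
FORMAT TSV
SQL
}

# pick_value SOURCE NAME: read the TSV snapshot on stdin and print the matching value.
# Exits non-zero when no row matches.
pick_value() {
    awk -F '\t' -v s="$1" -v n="$2" \
        '$1 == s && $2 == n { print $3; found = 1 } END { exit !found }'
}
```

A possible invocation, using the same HTTP pattern as the checks in this article: `metrics_snapshot_sql | curl 'http://localhost:8123/' --silent --data-binary @- | pick_value metrics Query`.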

Minimum necessary set of checks

Check Name	Shell or SQL command	Severity
ClickHouse status	`$ curl 'http://localhost:8123/'` (expect `Ok.`)	Critical
Too many simultaneous queries. Maximum: 100 (by default)	`select value from system.metrics where metric='Query'`	Critical
Replication status	`$ curl 'http://localhost:8123/replicas_status'` (expect `Ok.`)	High
Read only replicas (reflected by `replicas_status` as well)	`select value from system.metrics where metric='ReadonlyReplica'`	High
Some replication tasks are stuck	`select count() from system.replication_queue where num_tries > 100 or num_postponed > 1000`	High
ZooKeeper is available	`select count() from system.zookeeper where path='/'`	Critical for writes
ZooKeeper exceptions	`select value from system.events where event='ZooKeeperHardwareExceptions'`	Medium
Other CH nodes are available	`$ for node in $(echo "select distinct host_address from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-); do curl "http://$node:8123/" --silent; done | sort -u` (expect `Ok.`)	High
All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries)	`$ for cluster in $(echo "select distinct cluster from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-); do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)"; done`	Critical
There are files in 'detached' folders	`$ find /var/lib/clickhouse/data/*/*/detached/* -type d | wc -l`; on 19.8+: `select count() from system.detached_parts`	Medium
Too many parts: number of parts is growing; inserts are being delayed; inserts are being rejected	`select value from system.asynchronous_metrics where metric='MaxPartCountForPartition'`; `select value from system.events/system.metrics where event/metric='DelayedInserts'`; `select value from system.events where event='RejectedInserts'`	Critical
Dictionaries: exception	`select concat(name,': ',last_exception) from system.dictionaries where last_exception != ''`	Medium
ClickHouse has been restarted	`select uptime()`; `select value from system.asynchronous_metrics where metric='Uptime'`
DistributedFilesToInsert should not be always increasing	`select value from system.metrics where metric='DistributedFilesToInsert'`	Medium
A data part was lost	`select value from system.events where event='ReplicatedDataLoss'`	High
Data parts are not the same on different replicas	`select value from system.events where event='DataAfterMergeDiffersFromReplica'`; `select value from system.events where event='DataAfterMutationDiffersFromReplica'`	Medium
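
A few of the SQL checks from the table can be wired into a small script along these lines. This is a sketch under stated assumptions, not a complete monitoring agent: `CH_URL`, the helper names `ch_query`, `check_max`, and `run_checks`, and the thresholds are all hypothetical and should be tuned to your setup.

```shell
#!/usr/bin/env bash
# Sketch: run a few SQL checks from the table over the HTTP interface.
# CH_URL, the helper names, and the thresholds are illustrative assumptions.

CH_URL="${CH_URL:-http://localhost:8123/}"

# Send one SQL statement to ClickHouse over HTTP and print the response body.
ch_query() {
    curl --silent --fail --max-time 5 --data-binary "$1" "$CH_URL"
}

# check_max NAME SEVERITY THRESHOLD SQL
# Passes when the single integer returned by SQL is <= THRESHOLD.
check_max() {
    local name="$1" severity="$2" threshold="$3" sql="$4" value
    if ! value="$(ch_query "$sql")"; then
        echo "$severity $name: query failed"
        return 1
    fi
    # Anything other than a plain non-negative integer is treated as a failure.
    case "$value" in
        ''|*[!0-9]*)
            echo "$severity $name: unexpected result '$value'"
            return 1
            ;;
    esac
    if [ "$value" -gt "$threshold" ]; then
        echo "$severity $name: $value > $threshold"
        return 1
    fi
    echo "OK $name: $value"
}

# Run every check and return non-zero if any of them failed.
run_checks() {
    local failed=0
    check_max "simultaneous queries" Critical 80 \
        "select value from system.metrics where metric='Query'" || failed=1
    check_max "read-only replicas" High 0 \
        "select value from system.metrics where metric='ReadonlyReplica'" || failed=1
    check_max "stuck replication tasks" High 0 \
        "select count() from system.replication_queue where num_tries > 100 or num_postponed > 1000" || failed=1
    check_max "detached parts" Medium 0 \
        "select count() from system.detached_parts" || failed=1
    return "$failed"
}
```

Calling `run_checks` from cron or from a monitoring agent prints one line per check and exits non-zero on any failure, which is enough for most alerting systems to act on.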

The following queries are recommended to be included in monitoring:

SELECT * FROM system.replicas
- For more information, see the ClickHouse guide on System Tables
SELECT * FROM system.merges
- Checks on the speed and progress of currently executed merges.
SELECT * FROM system.mutations
- This is the source of information on the speed and progress of currently executed mutations.
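
In practice `SELECT *` is often narrowed to the columns that matter for alerting. The sketch below shows one possible narrowing for each of the three tables; the helper names and the default 300-second delay threshold are assumptions, not recommendations from this article.

```shell
#!/usr/bin/env bash
# Sketch: narrower versions of the three recommended queries.
# Helper names and the default delay threshold are assumptions.

# Replicas that are read-only or lag more than $1 seconds (default 300).
replica_problems_sql() {
    local max_delay="${1:-300}"
    case "$max_delay" in
        ''|*[!0-9]*) echo "max delay must be a non-negative integer" >&2; return 1 ;;
    esac
    cat <<SQL
SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE is_readonly OR absolute_delay > ${max_delay}
ORDER BY absolute_delay DESC
SQL
}

# Merges currently running, longest-running first.
running_merges_sql() {
    cat <<'SQL'
SELECT database, table, round(elapsed) AS elapsed_s,
       round(progress * 100, 1) AS progress_pct, num_parts
FROM system.merges
ORDER BY elapsed DESC
SQL
}

# Mutations that have not finished, with the last failure reason if any.
unfinished_mutations_sql() {
    cat <<'SQL'
SELECT database, table, mutation_id, command, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time
SQL
}
```

Each function only prints SQL, so it can be piped to the HTTP interface, for example `replica_problems_sql 60 | curl 'http://localhost:8123/' --silent --data-binary @-`.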

Monitoring ClickHouse logs

ClickHouse logs can be another important source of information. Two logs are enabled by default:

/var/log/clickhouse-server/clickhouse-server.err.log (error & warning, you may want to keep an eye on that or send it to some monitoring system)
/var/log/clickhouse-server/clickhouse-server.log (trace logs, very detailed, useful for debugging, usually too verbose to monitor).

You can additionally enable the system.text_log table to access the logs through ClickHouse SQL queries (make sure this does not expose information to users who should not see it).

$ cat /etc/clickhouse-server/config.d/text_log.xml
<yandex>
    <text_log>
        <database>system</database>
        <table>text_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <level>warning</level>
    </text_log>
</yandex>
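
With `text_log` enabled, recent errors can be pulled with a query such as the one generated below. This is a minimal sketch: the helper name `recent_errors_sql`, the 10-minute default window, and the 50-row limit are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: build a query for recent error-level lines in system.text_log.
# The helper name, default window, and row limit are assumptions.

# recent_errors_sql [MINUTES]: print SQL for the last MINUTES minutes (default 10).
recent_errors_sql() {
    local minutes="${1:-10}"
    # Reject anything that is not a plain integer before embedding it in SQL.
    case "$minutes" in
        ''|*[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 1 ;;
    esac
    cat <<SQL
SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE event_time > now() - INTERVAL ${minutes} MINUTE
  AND level IN ('Fatal', 'Critical', 'Error')
ORDER BY event_time DESC
LIMIT 50
SQL
}
```

For example, `recent_errors_sql 30 | curl 'http://localhost:8123/' --silent --data-binary @-` would list error-level messages from the last half hour.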

OpenTelemetry support

See https://clickhouse.com/docs/en/operations/opentelemetry/

Other sources

Last modified 2025.03.05: SEO updates, cross-links (886fe71)