ClickHouse Monitoring

Monitoring helps to track potential issues in your cluster before they cause a critical error.

What to read / watch on the subject:

What should be monitored

The following metrics should be collected / monitored:

  • For Host Machine:

    • CPU
    • Memory
    • Network (bytes/packets)
    • Storage (iops)
    • Disk Space (free / used)
  • For ClickHouse:

    • Connections (count)
    • RWLocks
    • Read / Write / Return (bytes)
    • Read / Write / Return (rows)
    • ZooKeeper operations (count)
    • Absolute delay
    • Query duration (optional)
    • Replication parts and queue (count)
  • For ZooKeeper:

Monitoring tools

Prometheus (embedded exporter) + Grafana
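ClickHouse can expose metrics for Prometheus directly, without an external exporter, through the prometheus section of the server configuration. A minimal sketch is shown below; the file name and port are illustrative (9363 is the commonly used default) and should be adjusted to your environment. Prometheus then scrapes http://<clickhouse-host>:9363/metrics.

$ cat /etc/clickhouse-server/config.d/prometheus.xml
<yandex>
    <prometheus>
        <endpoint>/metrics</endpoint>                       <!-- HTTP path to scrape -->
        <port>9363</port>                                   <!-- listen port for the exporter -->
        <metrics>true</metrics>                             <!-- export system.metrics -->
        <events>true</events>                               <!-- export system.events -->
        <asynchronous_metrics>true</asynchronous_metrics>   <!-- export system.asynchronous_metrics -->
    </prometheus>
</yandex>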

Prometheus (embedded http handler with clickhouse-operator style metrics) + Grafana

  • Enable http handler
  • Useful if you want to use the clickhouse-operator dashboard but do not run ClickHouse in k8s.

Prometheus (clickhouse-operator embedded exporter) + Grafana

Prometheus (clickhouse external exporter) + Grafana

(unmaintained)

Dashboards querying ClickHouse directly via the vertamedia / Altinity plugin

Dashboards querying ClickHouse directly via the Grafana plugin

Zabbix

Graphite

  • Use the embedded exporter. See docs and config.xml; a minimal config sketch is shown below.
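A sketch of such a config override follows; the file name, host, and interval are illustrative (the default config.xml ships with a commented-out graphite section covering the same settings):

$ cat /etc/clickhouse-server/config.d/graphite.xml
<yandex>
    <graphite>
        <host>graphite.example.com</host>                   <!-- Graphite / carbon relay host (illustrative) -->
        <port>2003</port>                                   <!-- plaintext carbon port -->
        <interval>60</interval>                             <!-- send metrics every 60 seconds -->
        <root_path>one_min</root_path>                      <!-- prefix for metric names -->
        <metrics>true</metrics>                             <!-- export system.metrics -->
        <events>true</events>                               <!-- export system.events -->
        <asynchronous_metrics>true</asynchronous_metrics>   <!-- export system.asynchronous_metrics -->
    </graphite>
</yandex>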

InfluxDB

Nagios/Icinga

Commercial solution

“Build your own” monitoring

ClickHouse exposes a lot of internal information through system tables. The main tables for accessing monitoring data are:

  • system.metrics
  • system.asynchronous_metrics
  • system.events
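For example, current gauges and cumulative counters can be read with plain SQL; the metric names below are only illustrations (they also appear in the checks in the next section):

select metric, value from system.metrics where metric in ('Query', 'ReadonlyReplica');
select event, value from system.events where event in ('DelayedInserts', 'RejectedInserts');
select metric, value from system.asynchronous_metrics where metric = 'MaxPartCountForPartition';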

Minimum necessary set of checks

Each check below lists the check name with its severity, followed by the shell or SQL command to run.

  • ClickHouse status (Critical):
    $ curl 'http://localhost:8123/'
    Expected response: Ok.
  • Too many simultaneous queries; the maximum is 100 by default (Critical):
    select value from system.metrics where metric='Query'
  • Replication status (High):
    $ curl 'http://localhost:8123/replicas_status'
    Expected response: Ok.
  • Read-only replicas, also reflected by replicas_status (High):
    select value from system.metrics where metric='ReadonlyReplica'
  • Some replication tasks are stuck (High):
    select count() from system.replication_queue where num_tries > 100 or num_postponed > 1000
  • ZooKeeper is available (Critical for writes):
    select count() from system.zookeeper where path='/'
  • ZooKeeper exceptions (Medium):
    select value from system.events where event='ZooKeeperHardwareExceptions'
  • Other ClickHouse nodes are available (High):
    $ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done | sort -u
    Expected response: Ok.
  • All ClickHouse clusters are available, i.e. every configured cluster has enough replicas to serve queries (Critical):
    $ for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done
  • There are files in 'detached' folders (Medium):
    $ find /var/lib/clickhouse/data/*/*/detached/* -type d | wc -l
    On 19.8+: select count() from system.detached_parts
  • Too many parts: the number of parts is growing, inserts are being delayed or rejected (Critical):
    select value from system.asynchronous_metrics where metric='MaxPartCountForPartition';
    select value from system.events where event='DelayedInserts'; (also available in system.metrics)
    select value from system.events where event='RejectedInserts'
  • Dictionaries: exception (Medium):
    select concat(name,': ',last_exception) from system.dictionaries where last_exception != ''
  • ClickHouse has been restarted:
    select uptime();
    select value from system.asynchronous_metrics where metric='Uptime'
  • DistributedFilesToInsert should not be constantly increasing (Medium):
    select value from system.metrics where metric='DistributedFilesToInsert'
  • A data part was lost (High):
    select value from system.events where event='ReplicatedDataLoss'
  • Data parts are not the same on different replicas (Medium):
    select value from system.events where event='DataAfterMergeDiffersFromReplica';
    select value from system.events where event='DataAfterMutationDiffersFromReplica'
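For alerting systems that expect an exit code (e.g. Nagios/Icinga-style probes), any of the checks above can be wrapped into a small script. A hedged sketch for the "too many simultaneous queries" check; the threshold simply mirrors the default limit of 100 and should be tuned:

#!/bin/bash
# Probe: alert when the number of simultaneously running queries approaches the default limit.
threshold=100
running=$(echo "select value from system.metrics where metric='Query'" \
    | curl 'http://localhost:8123/' --silent --data-binary @-)
if [ "$running" -ge "$threshold" ]; then
    echo "CRITICAL: $running simultaneous queries (threshold $threshold)"
    exit 2
fi
echo "OK: $running simultaneous queries"
exit 0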

The following queries are recommended to be included in monitoring:

  • SELECT * FROM system.replicas
  • SELECT * FROM system.merges
    • Checks on the speed and progress of currently running merges.
  • SELECT * FROM system.mutations
    • This is the source of information on the speed and progress of currently running mutations.
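As a hedged example, a per-table replica health query on system.replicas could look like the one below; the thresholds are arbitrary and should be tuned to your workload:

select database, table, is_readonly, is_session_expired, absolute_delay, queue_size, inserts_in_queue, merges_in_queue
from system.replicas
where is_readonly
   or is_session_expired
   or absolute_delay > 300
   or queue_size > 100;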

Logs monitoring

ClickHouse logs can be another important source of information. There are two logs enabled by default:

  • /var/log/clickhouse-server/clickhouse-server.err.log (errors and warnings; you may want to keep an eye on it or send it to a monitoring system)
  • /var/log/clickhouse-server/clickhouse-server.log (trace logs, very detailed, useful for debugging, usually too verbose to monitor).

You can additionally enable the system.text_log table to access the logs via ClickHouse SQL queries (make sure you do not expose information to users who should not see it).

$ cat /etc/clickhouse-server/config.d/text_log.xml
<yandex>
    <text_log>
        <database>system</database>
        <table>text_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <level>warning</level>
    </text_log>
</yandex>
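Once the table is enabled, recent problems can be pulled with a plain query, for example (the level filter and time window are illustrative):

select event_time, level, logger_name, message
from system.text_log
where level in ('Fatal', 'Critical', 'Error')
  and event_time > now() - interval 1 hour
order by event_time desc
limit 20;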

OpenTelemetry support

See https://clickhouse.com/docs/en/operations/opentelemetry/

Other sources