ClickHouse Monitoring

Monitoring helps to track potential issues in your cluster before they cause a critical error.

What to read / watch on the subject:

What should be monitored

The following metrics should be collected / monitored:

  • For Host Machine:

    • CPU
    • Memory
    • Network (bytes/packets)
    • Storage (iops)
    • Disk Space (free / used)
  • For ClickHouse:

    • Connections (count)
    • RWLocks
    • Read / Write / Return (bytes)
    • Read / Write / Return (rows)
    • ZooKeeper operations (count)
    • Absolute delay
    • Query duration (optional)
    • Replication parts and queue (count)
  • For ZooKeeper:

Monitoring tools

Prometheus (embedded exporter) + Grafana
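ClickHouse can expose metrics for Prometheus directly, without an external exporter, through the prometheus section of the server configuration. A minimal sketch is shown below; the file name and port are illustrative (9363 is the commonly used default) and should be adjusted to your environment. Prometheus then scrapes http://<clickhouse-host>:9363/metrics.

$ cat /etc/clickhouse-server/config.d/prometheus.xml
<yandex>
    <prometheus>
        <endpoint>/metrics</endpoint>                       <!-- HTTP path to scrape -->
        <port>9363</port>                                   <!-- listen port for the exporter -->
        <metrics>true</metrics>                             <!-- export system.metrics -->
        <events>true</events>                               <!-- export system.events -->
        <asynchronous_metrics>true</asynchronous_metrics>   <!-- export system.asynchronous_metrics -->
    </prometheus>
</yandex>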

Prometheus (embedded http handler with clickhouse-operator style metrics) + Grafana

  • Enable http handler
  • Useful if you want to use the clickhouse-operator dashboard but do not run ClickHouse in k8s.

Prometheus (clickhouse-operator embedded exporter) + Grafana

Prometheus (clickhouse external exporter) + Grafana

(unmaintained)

Dashboards querying ClickHouse directly via the vertamedia / Altinity plugin

Dashboards querying ClickHouse directly via the Grafana plugin

Zabbix

Graphite

  • Use the embedded exporter. See docs and config.xml; a minimal config sketch is shown below.
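A sketch of such a config override follows; the file name, host, and interval are illustrative (the default config.xml ships with a commented-out graphite section covering the same settings):

$ cat /etc/clickhouse-server/config.d/graphite.xml
<yandex>
    <graphite>
        <host>graphite.example.com</host>                   <!-- Graphite / carbon relay host (illustrative) -->
        <port>2003</port>                                   <!-- plaintext carbon port -->
        <interval>60</interval>                             <!-- send metrics every 60 seconds -->
        <root_path>one_min</root_path>                      <!-- prefix for metric names -->
        <metrics>true</metrics>                             <!-- export system.metrics -->
        <events>true</events>                               <!-- export system.events -->
        <asynchronous_metrics>true</asynchronous_metrics>   <!-- export system.asynchronous_metrics -->
    </graphite>
</yandex>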

InfluxDB

Nagios/Icinga

Commercial solution

“Build your own” monitoring

ClickHouse exposes a lot of internal information through system tables. The main tables for accessing monitoring data are:

  • system.metrics
  • system.asynchronous_metrics
  • system.events
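For example, current gauges and cumulative counters can be read with plain SQL; the metric names below are only illustrations (they also appear in the checks in the next section):

select metric, value from system.metrics where metric in ('Query', 'ReadonlyReplica');
select event, value from system.events where event in ('DelayedInserts', 'RejectedInserts');
select metric, value from system.asynchronous_metrics where metric = 'MaxPartCountForPartition';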

Minimum necessary set of checks

Each check below lists the check name with its severity, followed by the shell or SQL command to run.

  • ClickHouse status (Critical):
    $ curl 'http://localhost:8123/'
    Expected response: Ok.
  • Too many simultaneous queries; the maximum is 100 by default (Critical):
    select value from system.metrics where metric='Query'
  • Replication status (High):
    $ curl 'http://localhost:8123/replicas_status'
    Expected response: Ok.
  • Read-only replicas, also reflected by replicas_status (High):
    select value from system.metrics where metric='ReadonlyReplica'
  • Some replication tasks are stuck (High):
    select count() from system.replication_queue where num_tries > 100 or num_postponed > 1000
  • ZooKeeper is available (Critical for writes):
    select count() from system.zookeeper where path='/'
  • ZooKeeper exceptions (Medium):
    select value from system.events where event='ZooKeeperHardwareExceptions'
  • Other ClickHouse nodes are available (High):
    $ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done | sort -u
    Expected response: Ok.
  • All ClickHouse clusters are available, i.e. every configured cluster has enough replicas to serve queries (Critical):
    $ for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done
  • There are files in 'detached' folders (Medium):
    $ find /var/lib/clickhouse/data/*/*/detached/* -type d | wc -l
    On 19.8+: select count() from system.detached_parts
  • Too many parts: the number of parts is growing, inserts are being delayed or rejected (Critical):
    select value from system.asynchronous_metrics where metric='MaxPartCountForPartition';
    select value from system.events where event='DelayedInserts'; (also available in system.metrics)
    select value from system.events where event='RejectedInserts'
  • Dictionaries: exception (Medium):
    select concat(name,': ',last_exception) from system.dictionaries where last_exception != ''
  • ClickHouse has been restarted:
    select uptime();
    select value from system.asynchronous_metrics where metric='Uptime'
  • DistributedFilesToInsert should not be constantly increasing (Medium):
    select value from system.metrics where metric='DistributedFilesToInsert'
  • A data part was lost (High):
    select value from system.events where event='ReplicatedDataLoss'
  • Data parts are not the same on different replicas (Medium):
    select value from system.events where event='DataAfterMergeDiffersFromReplica';
    select value from system.events where event='DataAfterMutationDiffersFromReplica'
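For alerting systems that expect an exit code (e.g. Nagios/Icinga-style probes), any of the checks above can be wrapped into a small script. A hedged sketch for the "too many simultaneous queries" check; the threshold simply mirrors the default limit of 100 and should be tuned:

#!/bin/bash
# Probe: alert when the number of simultaneously running queries approaches the default limit.
threshold=100
running=$(echo "select value from system.metrics where metric='Query'" \
    | curl 'http://localhost:8123/' --silent --data-binary @-)
if [ "$running" -ge "$threshold" ]; then
    echo "CRITICAL: $running simultaneous queries (threshold $threshold)"
    exit 2
fi
echo "OK: $running simultaneous queries"
exit 0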

The following queries are recommended to be included in monitoring:

  • SELECT * FROM system.replicas
  • SELECT * FROM system.merges
    • Checks on the speed and progress of currently running merges.
  • SELECT * FROM system.mutations
    • This is the source of information on the speed and progress of currently running mutations.
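As a hedged example, a per-table replica health query on system.replicas could look like the one below; the thresholds are arbitrary and should be tuned to your workload:

select database, table, is_readonly, is_session_expired, absolute_delay, queue_size, inserts_in_queue, merges_in_queue
from system.replicas
where is_readonly
   or is_session_expired
   or absolute_delay > 300
   or queue_size > 100;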

Logs monitoring

ClickHouse logs can be another important source of information. There are two logs enabled by default:

  • /var/log/clickhouse-server/clickhouse-server.err.log (errors and warnings; you may want to keep an eye on it or send it to a monitoring system)
  • /var/log/clickhouse-server/clickhouse-server.log (trace logs, very detailed, useful for debugging, usually too verbose to monitor).

You can additionally enable the system.text_log table to access the logs via ClickHouse SQL queries (make sure you do not expose information to users who should not see it).

$ cat /etc/clickhouse-server/config.d/text_log.xml
<yandex>
    <text_log>
        <database>system</database>
        <table>text_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <level>warning</level>
    </text_log>
</yandex>
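Once the table is enabled, recent problems can be pulled with a plain query, for example (the level filter and time window are illustrative):

select event_time, level, logger_name, message
from system.text_log
where level in ('Fatal', 'Critical', 'Error')
  and event_time > now() - interval 1 hour
order by event_time desc
limit 20;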

OpenTelemetry support

See https://clickhouse.com/docs/en/operations/opentelemetry/

Other sources