ClickHouse® Monitoring
Monitoring helps to track potential issues in your cluster before they cause a critical error.
Further reading and viewing on the subject:
- Altinity webinar “ClickHouse Monitoring 101: What to monitor and how” (recording and slides)
- docs https://clickhouse.com/docs/en/operations/monitoring/
What should be monitored
The following metrics should be collected and monitored.
For Host Machine:
- CPU
- Memory
- Network (bytes/packets)
- Storage (iops)
- Disk Space (free / used)
For ClickHouse:
- Connections (count)
- RWLocks
- Read / Write / Return (bytes)
- Read / Write / Return (rows)
- ZooKeeper operations (count)
- Absolute delay
- Query duration (optional)
- Replication parts and queue (count)
For ZooKeeper:
Monitoring tools
Prometheus (embedded exporter) + Grafana
- Enable embedded exporter
- Grafana dashboards https://grafana.com/grafana/dashboards/14192 or https://grafana.com/grafana/dashboards/13500
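Once the embedded exporter is enabled, its output can be inspected without a full Prometheus setup. The Python sketch below parses the Prometheus text exposition format into a dictionary. It assumes the exporter listens on port 9363 under /metrics (the values commonly shown in the sample server configuration; yours may differ) and that no timestamps are appended to sample lines. The function names are hypothetical.

```python
import urllib.request


def parse_exposition(text):
    """Parse Prometheus text exposition lines into {sample_name: float}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. Labels are
    not interpreted: a sample with labels keeps them as part of the key.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is taken as the last space-separated token on the line
        # (assumption: the exporter does not append a timestamp).
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # not a plain "name value" sample line
    return samples


def fetch_exporter(host="localhost", port=9363, path="/metrics", timeout=5):
    """Fetch and parse exporter output; host, port and path are assumptions."""
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_exposition(resp.read().decode("utf-8"))
```

Calling `fetch_exporter()` on a host with the exporter enabled should return a dictionary keyed by the exported sample names, which is handy for a quick sanity check before wiring up Grafana.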
Prometheus (embedded http handler with Altinity Kubernetes Operator for ClickHouse style metrics) + Grafana
- Enable http handler
- Useful if you want to use the dashboard from the Altinity Kubernetes Operator for ClickHouse but do not run ClickHouse in k8s.
Prometheus (embedded exporter in the Altinity Kubernetes Operator for ClickHouse) + Grafana
- The exporter is included in the Altinity Kubernetes Operator for ClickHouse and is enabled automatically
- See the instructions for installing Prometheus and Grafana (if you don’t have them yet)
- Grafana dashboard https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard
- Prometheus alerts https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml
Prometheus (ClickHouse external exporter) + Grafana
(unmaintained)
Dashboards querying ClickHouse directly via vertamedia / Altinity plugin
- Overview: https://grafana.com/grafana/dashboards/13606
- Queries dashboard (analyzing system.query_log) https://grafana.com/grafana/dashboards/2515
Dashboard querying ClickHouse directly via Grafana plugin
Zabbix
- https://www.zabbix.com/integrations/clickhouse
- https://github.com/Altinity/clickhouse-zabbix-template
Graphite
- Use the embedded exporter. See docs and config.xml
InfluxDB
- You can use the embedded exporter together with Telegraf. For more information, see Graphite protocol support in InfluxDB.
Nagios/Icinga
Commercial solutions
- Datadog https://docs.datadoghq.com/integrations/clickhouse/?tab=host
- Sematext https://sematext.com/docs/integration/clickhouse/
- Instana https://www.instana.com/supported-technologies/clickhouse-monitoring/
- site24x7 https://www.site24x7.com/plugins/clickhouse-monitoring.html
- Acceldata Pulse https://www.acceldata.io/blog/acceldata-pulse-for-clickhouse-monitoring
“Build your own” monitoring
ClickHouse exposes many of its internals through system tables. The main tables for monitoring data are:
- system.metrics
- system.asynchronous_metrics
- system.events
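As a starting point for a home-grown collector, the three tables above can be read over the HTTP interface. The sketch below is a minimal example using only the Python standard library. It assumes the HTTP interface is reachable at http://localhost:8123/ without authentication; the helper names are hypothetical. The name column is `metric` in system.metrics and system.asynchronous_metrics, and `event` in system.events.

```python
import urllib.parse
import urllib.request

# Column that holds the counter name in each system table.
NAME_COLUMN = {
    "metrics": "metric",
    "asynchronous_metrics": "metric",
    "events": "event",
}


def build_query(table):
    """Build a SELECT returning name/value pairs for one system table."""
    column = NAME_COLUMN[table]  # KeyError for tables this sketch does not cover
    return f"SELECT {column}, value FROM system.{table} FORMAT TabSeparated"


def parse_tsv_metrics(body):
    """Turn 'name<TAB>value' lines (TabSeparated output) into a dict."""
    result = {}
    for line in body.splitlines():
        if not line:
            continue
        name, value = line.split("\t", 1)
        result[name] = float(value)
    return result


def read_system_table(table, base_url="http://localhost:8123/", timeout=5):
    """Query one system table over HTTP; base_url is an assumption."""
    url = base_url + "?" + urllib.parse.urlencode({"query": build_query(table)})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_tsv_metrics(resp.read().decode("utf-8"))
```

For example, `read_system_table("metrics")` would return the current gauges as a dictionary that a scheduler can forward to whatever storage or alerting backend is in use.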
Minimum necessary set of checks
Check Name | Shell or SQL command | Severity |
ClickHouse status | $ curl 'http://localhost:8123/' | Critical |
Too many simultaneous queries. Maximum: 100 (by default) | select value from system.metrics | Critical |
Replication status | $ curl 'http://localhost:8123/replicas_status' | High |
Read only replicas (reflected by replicas_status as well) | select value from system.metrics | High |
Some replication tasks are stuck | select count() | High |
ZooKeeper is available | select count() from system.zookeeper | Critical for writes |
ZooKeeper exceptions | select value from system.events | Medium |
Other CH nodes are available | $ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done | sort -u | High |
All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries) | for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done | Critical |
There are files in 'detached' folders | $ find /var/lib/clickhouse/data/*/*/detached/* -type d | wc -l; \ 19.8+ | Medium |
Too many parts: Number of parts is growing; Inserts are being delayed; Inserts are being rejected | select value from system.asynchronous_metrics | Critical |
Dictionaries: exception | select concat(name,': ',last_exception) | Medium |
ClickHouse has been restarted | select uptime(); | |
DistributedFilesToInsert should not be always increasing | select value from system.metrics | Medium |
A data part was lost | select value from system.events | High |
Data parts are not the same on different replicas | select value from system.events where event='DataAfterMergeDiffersFromReplica'; select value from system.events where event='DataAfterMutationDiffersFromReplica' | Medium |
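The two HTTP checks in the table above (server status and replication status) can be scripted without curl. The following Python sketch is one way to do it. It assumes that a healthy server answers both endpoints with HTTP 200 and the body `Ok.`, and that any other status or body should be treated as a failure; the function names and the `CHECKS` list are hypothetical, with severities copied from the table.

```python
import urllib.error
import urllib.request

# (check name, path, severity) taken from the table of minimum checks.
CHECKS = [
    ("ClickHouse status", "/", "Critical"),
    ("Replication status", "/replicas_status", "High"),
]


def interpret(status, body):
    """Return (ok, detail); ok only for HTTP 200 with a body of 'Ok.'."""
    detail = body.strip()
    return status == 200 and detail == "Ok.", detail


def http_check(url, timeout=5):
    """Run one HTTP check and classify the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as exc:
        # Non-2xx answers still carry a body describing the problem.
        return interpret(exc.code, exc.read().decode("utf-8", "replace"))
    except OSError as exc:  # URLError and socket timeouts derive from OSError
        return False, f"unreachable: {exc}"


def run_checks(base_url="http://localhost:8123"):
    """Return a list of (name, severity, detail) for every failing check."""
    failures = []
    for name, path, severity in CHECKS:
        ok, detail = http_check(base_url + path)
        if not ok:
            failures.append((name, severity, detail))
    return failures
```

An empty list from `run_checks()` means both endpoints answered as expected; anything else can be passed to the alerting channel together with its severity.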
The following queries are recommended to be included in monitoring:
SELECT * FROM system.replicas
- For more information, see the ClickHouse guide on System Tables
SELECT * FROM system.merges
- Shows the speed and progress of currently running merges.
SELECT * FROM system.mutations
- Shows the status and progress of mutations that are currently being executed.
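Rows returned by these queries still have to be turned into a decision. The sketch below shows one possible way to flag problematic replicas from system.replicas output in JSONEachRow format. The thresholds (300 seconds of delay, 100 queue entries) are purely illustrative assumptions, and the function names are hypothetical; `is_readonly`, `absolute_delay` and `queue_size` are columns of system.replicas.

```python
import json

# Query whose output the functions below expect.
REPLICAS_QUERY = (
    "SELECT database, table, is_readonly, absolute_delay, queue_size "
    "FROM system.replicas FORMAT JSONEachRow"
)


def parse_json_each_row(body):
    """Parse JSONEachRow output: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def problem_replicas(rows, max_delay_seconds=300, max_queue_size=100):
    """Return [(table_name, [reasons])] for replicas that need attention.

    Values are passed through int() because 64-bit integers may arrive
    as quoted strings, depending on server settings.
    """
    problems = []
    for row in rows:
        reasons = []
        if int(row.get("is_readonly", 0)):
            reasons.append("read-only")
        if int(row.get("absolute_delay", 0)) > max_delay_seconds:
            reasons.append("delay")
        if int(row.get("queue_size", 0)) > max_queue_size:
            reasons.append("queue")
        if reasons:
            problems.append((f'{row["database"]}.{row["table"]}', reasons))
    return problems
```

The same pattern applies to system.merges and system.mutations: select only the columns the check needs, parse the rows, and compare them against thresholds that suit the workload.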
Logs monitoring
ClickHouse logs are another important source of information. Two logs are enabled by default:
- /var/log/clickhouse-server/clickhouse-server.err.log (error & warning, you may want to keep an eye on that or send it to some monitoring system)
- /var/log/clickhouse-server/clickhouse-server.log (trace logs, very detailed, useful for debugging, usually too verbose to monitor).
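To keep an eye on the error log without shipping it to a full log pipeline, a small script can count entries per severity. The Python sketch below assumes each entry starts with a line shaped like `2024.01.15 10:00:00.123456 [ 4242 ] {query_id} <Error> Logger: message`; if your log format differs, the regular expression needs adjusting. Continuation lines such as stack traces do not match and are skipped. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed layout: timestamp, [ thread id ], {query id}, <Level>, message.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[ ?(?P<thread>\d+) ?\] "
    r"\{(?P<query_id>[^}]*)\} "
    r"<(?P<level>\w+)> "
    r"(?P<message>.*)$"
)


def count_levels(lines):
    """Count log entries per level, ignoring lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            counts[match.group("level")] += 1
    return counts


def count_levels_in_file(path="/var/log/clickhouse-server/clickhouse-server.err.log"):
    """Count levels in a log file; the default path is the one listed above."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_levels(handle)
```

A periodic job could compare the `Error` count between runs and raise an alert when it jumps.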
You can additionally enable the system.text_log table to access the logs through ClickHouse SQL queries (make sure this does not expose information to users who should not see it).
$ cat /etc/clickhouse-server/config.d/text_log.xml
<yandex>
    <text_log>
        <database>system</database>
        <table>text_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <level>warning</level>
    </text_log>
</yandex>
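With text_log enabled, recent errors can be pulled with an ordinary query. The sketch below only builds the SQL text, so it can be sent through any client. It assumes the table exposes the `event_time`, `level`, `logger_name` and `message` columns and that the level names are `Fatal`, `Critical` and `Error`; the function name and defaults are hypothetical.

```python
# Levels treated as errors in this sketch (assumed level names).
ERROR_LEVELS = ("Fatal", "Critical", "Error")


def recent_errors_query(minutes=10, limit=50):
    """Build a query returning the latest error-level entries from text_log."""
    minutes = int(minutes)  # int() keeps arbitrary strings out of the SQL text
    limit = int(limit)
    if minutes <= 0 or limit <= 0:
        raise ValueError("minutes and limit must be positive")
    levels = ", ".join(f"'{level}'" for level in ERROR_LEVELS)
    return (
        "SELECT event_time, level, logger_name, message "
        "FROM system.text_log "
        f"WHERE event_time > now() - INTERVAL {minutes} MINUTE "
        f"AND level IN ({levels}) "
        f"ORDER BY event_time DESC LIMIT {limit}"
    )
```

Because the configuration above stores entries at warning level and higher, adding `'Warning'` to `ERROR_LEVELS` would also surface warnings.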
OpenTelemetry support
See https://clickhouse.com/docs/en/operations/opentelemetry/