Monitoring Considerations

Monitoring helps to track potential issues in your cluster before they cause a critical error.

External Monitoring

External monitoring collects data from the ClickHouse cluster and uses it for analysis and review. Recommended external monitoring systems include:

ClickHouse can also record metrics internally when system.metric_log is enabled in config.xml.
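As a rough sketch of what enabling the table can look like, the shell function below writes a metric_log override file into a configuration directory. The function name write_metric_log_config, the file name metric_log.xml, the use of a config.d style override instead of editing config.xml directly, and the two interval values are all illustrative assumptions. Older releases use <yandex> rather than <clickhouse> as the root element, so check the result against the documentation for your ClickHouse version.

```shell
# Hypothetical helper: write a metric_log override file into the given
# configuration directory (for example /etc/clickhouse-server/config.d).
write_metric_log_config() {
    conf_dir="$1"
    mkdir -p "$conf_dir" || return 1
    cat > "$conf_dir/metric_log.xml" <<'EOF'
<clickhouse>
    <metric_log>
        <database>system</database>
        <table>metric_log</table>
        <!-- Assumed values: flush to the table every 7.5 s, sample every 1 s -->
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <collect_interval_milliseconds>1000</collect_interval_milliseconds>
    </metric_log>
</clickhouse>
EOF
}
```

Calling write_metric_log_config with the server's override directory, followed by a server restart, would typically be enough for system.metric_log to start filling; adjust the intervals to the retention and granularity you need.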

For the dashboard system:

  • Grafana is recommended for graphs, reports, alerts, dashboards, etc.
  • Other options are Nagios or Zabbix.

The following metrics should be collected:

  • For Host Machine:
    • CPU
    • Memory
    • Network (bytes/packets)
    • Storage (iops)
    • Disk Space (free / used)
  • For ClickHouse:
    • Connections (count)
    • RWLocks
    • Read / Write / Return (bytes)
    • Read / Write / Return (rows)
    • ZooKeeper operations (count)
    • Absolute delay
    • Query duration (optional)
    • Replication parts and queue (count)
  • For ZooKeeper:

Include the following queries in monitoring:

  • SELECT * FROM system.replicas
    • For more information, see the ClickHouse guide on System Tables
  • SELECT * FROM system.merges
    • Shows the speed and progress of currently running merges.
  • SELECT * FROM system.mutations
    • Shows the speed and progress of currently running mutations.
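The queries above can be collected in one pass. The following is a minimal sketch, assuming the HTTP interface listens on localhost:8123 as in the health checks; the names CH_URL, ch_query, and dump_system_tables are hypothetical, and FORMAT Vertical is only an assumed choice to keep wide rows readable.

```shell
# Assumption: ClickHouse HTTP interface on localhost:8123 unless CH_URL is set.
CH_URL="${CH_URL:-http://localhost:8123/}"

# Send one SQL statement, read from stdin, to the HTTP interface.
ch_query() {
    curl --silent --show-error --fail --data-binary @- "$CH_URL"
}

# Print each recommended system table under a labelled header.
dump_system_tables() {
    for table in replicas merges mutations; do
        echo "== system.$table =="
        echo "SELECT * FROM system.$table FORMAT Vertical" | ch_query
    done
}
```

Running dump_system_tables from a scheduler and keeping the output gives a simple history of replica state, merges, and mutations; in practice you would likely narrow SELECT * to the columns you alert on.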

Monitoring and Alerts

Configure notifications for events and thresholds based on the health checks listed below.

Health Checks

The following health checks should be monitored:

Each check lists the check name, the shell or SQL command, and the severity.

  • ClickHouse status
    • Command: $ curl 'http://localhost:8123/'
    • Expected output: Ok.
    • Severity: Critical
  • Too many simultaneous queries. Maximum: 100 (by default)
    • Command: select value from system.metrics where metric='Query'
    • Severity: Critical
  • Replication status
    • Command: $ curl 'http://localhost:8123/replicas_status'
    • Expected output: Ok.
    • Severity: High
  • Read only replicas (reflected by replicas_status as well)
    • Command: select value from system.metrics where metric='ReadonlyReplica'
    • Severity: High
  • Some replication tasks are stuck
    • Command: select count() from system.replication_queue where num_tries > 100 or num_postponed > 1000
    • Severity: High
  • ZooKeeper is available
    • Command: select count() from system.zookeeper where path='/'
    • Severity: Critical for writes
  • ZooKeeper exceptions
    • Command: select value from system.events where event='ZooKeeperHardwareExceptions'
    • Severity: Medium
  • Other CH nodes are available
    • Command: $ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done | sort -u
    • Expected output: Ok.
    • Severity: High
  • All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries)
    • Command: $ for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done
    • Severity: Critical
  • There are files in 'detached' folders
    • Command: $ find /var/lib/clickhouse/data/*/*/detached/* -type d | wc -l
    • Command (19.8+): select count() from system.detached_parts
    • Severity: Medium
  • Too many parts: Number of parts is growing; Inserts are being delayed; Inserts are being rejected
    • Command: select value from system.asynchronous_metrics where metric='MaxPartCountForPartition'
    • Command: select value from system.events where event='DelayedInserts' (or from system.metrics where metric='DelayedInserts')
    • Command: select value from system.events where event='RejectedInserts'
    • Severity: Critical
  • Dictionaries: exception
    • Command: select concat(name,': ',last_exception) from system.dictionaries where last_exception != ''
    • Severity: Medium
  • ClickHouse has been restarted
    • Command: select uptime()
    • Command: select value from system.asynchronous_metrics where metric='Uptime'
  • DistributedFilesToInsert should not be always increasing
    • Command: select value from system.metrics where metric='DistributedFilesToInsert'
    • Severity: Medium
  • A data part was lost
    • Command: select value from system.events where event='ReplicatedDataLoss'
    • Severity: High
  • Data parts are not the same on different replicas
    • Command: select value from system.events where event='DataAfterMergeDiffersFromReplica'
    • Command: select value from system.events where event='DataAfterMutationDiffersFromReplica'
    • Severity: Medium
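Most of the SQL checks above return a single number, so turning them into alerts comes down to comparing that number with a threshold. The sketch below shows one possible way to do that; evaluate_check and check_readonly_replicas are hypothetical helper names, the output wording is arbitrary, and the threshold of 0 for read only replicas is an assumption (any read only replica is treated as a problem).

```shell
# Compare a numeric check result with its threshold and print one status line.
# Arguments: check name, observed value, maximum allowed value, severity.
# Returns 0 when within the threshold, 1 when exceeded, 2 when not numeric.
evaluate_check() {
    name="$1"; value="$2"; max="$3"; severity="$4"
    case "$value" in
        ''|*[!0-9]*)
            # Empty or non-numeric output usually means the query itself failed.
            echo "UNKNOWN: $name returned '$value'"
            return 2
            ;;
    esac
    if [ "$value" -gt "$max" ]; then
        echo "$severity: $name is $value (threshold $max)"
        return 1
    fi
    echo "OK: $name is $value"
}

# Example wiring for one check (assumes the HTTP interface on localhost:8123).
check_readonly_replicas() {
    value=$(echo "select value from system.metrics where metric='ReadonlyReplica'" \
        | curl 'http://localhost:8123/' --silent --data-binary @-)
    evaluate_check "Read only replicas" "$value" 0 High
}
```

The non-zero return codes let a scheduler or alerting agent react without parsing the text. For counters read from system.events, such as ReplicatedDataLoss, comparing against the previously observed value is usually more meaningful than a fixed threshold, because those counters only grow while the server is running.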

Monitoring References