ZooKeeper

Requirements

TLDR version:

  1. USE DEDICATED FAST DISKS for the transaction log! (crucial for performance because of the write-ahead log; NVMe is preferred for heavily loaded setups).
  2. use 3 nodes (more nodes = slower quorum, fewer nodes = no HA).
  3. low network latency between zookeeper nodes is very important (latency, not bandwidth).
  4. have at least 4 GB of RAM, disable swap, tune JVM sizes and garbage collector settings.
  5. ensure that zookeeper will not be CPU-starved by other processes.
  6. monitor zookeeper.

Side note: in many cases, the slowness of zookeeper is actually a symptom of an issue with the clickhouse schema or usage pattern (the most typical issues: an enormous number of partitions/tables/databases combined with real-time inserts, or tiny and frequent inserts).
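
A quick way to verify points 1 and 4 from the list above on a ZooKeeper node (a minimal sketch; the /var/lib/zookeeper/logs mount point is an assumption, use your actual dataLogDir):

# swap should be disabled (swapon prints nothing, free shows 0 swap)
swapon --show
free -h | grep -i swap

# check which device backs the transaction log directory
df -h /var/lib/zookeeper/logs

# disable swap for the current boot (also remove/comment swap entries in /etc/fstab)
sudo swapoff -a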

How to install

Quoted from https://zookeeper.apache.org/doc/r3.5.7/zookeeperAdmin.html#sc_commonProblems :

Things to Avoid

Here are some common problems you can avoid by configuring ZooKeeper correctly:

  • inconsistent lists of servers : The list of ZooKeeper servers used by the clients must match the list of ZooKeeper servers that each ZooKeeper server has. Things work okay if the client list is a subset of the real list, but things will really act strange if clients have a list of ZooKeeper servers that are in different ZooKeeper clusters. Also, the server lists in each Zookeeper server configuration file should be consistent with one another.
  • incorrect placement of transaction log : The most performance critical part of ZooKeeper is the transaction log. ZooKeeper syncs transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, increase the snapCount so that snapshot files are generated less often; it does not eliminate the problem, but it makes more resources available for the transaction log.
  • incorrect Java heap size : You should take special care to set your Java max heap size correctly. In particular, you should not create a situation in which ZooKeeper swaps to disk. The disk is death to ZooKeeper. Everything is ordered, so if processing one request swaps the disk, all other queued requests will probably do the same. DON’T SWAP. Be conservative in your estimates: if you have 4G of RAM, do not set the Java max heap size to 6G or even 4G. For example, it is more likely you would use a 3G heap for a 4G machine, as the operating system and the cache also need memory. The best and only recommended practice for estimating the heap size your system needs is to run load tests, and then make sure you are well below the usage limit that would cause the system to swap.
  • Publicly accessible deployment : A ZooKeeper ensemble is expected to operate in a trusted computing environment. It is thus recommended to deploy ZooKeeper behind a firewall.

How to check the number of followers:

echo mntr | nc zookeeper 2181 | grep foll
zk_synced_followers    2
zk_synced_non_voting_followers    0
zk_avg_follower_sync_time    0.0
zk_min_follower_sync_time    0
zk_max_follower_sync_time    0
zk_cnt_follower_sync_time    0
zk_sum_follower_sync_time    0
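
To see which node is the leader and which ones are followers, the same four-letter-word commands can be run against every node of the ensemble (zookeeper-1..3 are placeholder hostnames):

for host in zookeeper-1 zookeeper-2 zookeeper-3; do
  echo -n "$host: "
  echo srvr | nc "$host" 2181 | grep Mode
done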

Tools

https://github.com/apache/zookeeper/blob/master/zookeeper-docs/src/main/resources/markdown/zookeeperTools.md

Alternative for zkCli

Web UI

1 - clickhouse-keeper-initd


An init.d script for clickhouse-keeper. This example is based on zkServer.sh.

#!/bin/bash
### BEGIN INIT INFO
# Provides:          clickhouse-keeper
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Required-Start:
# Required-Stop:
# Short-Description: Start keeper daemon
# Description: Start keeper daemon
### END INIT INFO

NAME=clickhouse-keeper
ZOOCFGDIR=/etc/$NAME
ZOOCFG="$ZOOCFGDIR/keeper.xml"
ZOO_LOG_DIR=/var/log/$NAME
USER=clickhouse
GROUP=clickhouse
ZOOPIDDIR=/var/run/$NAME
ZOOPIDFILE=$ZOOPIDDIR/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME

#echo "Using config: $ZOOCFG" >&2
ZOOCMD="clickhouse-keeper -C ${ZOOCFG} start --daemon"

# ensure the PID directory exists, otherwise stop will fail
mkdir -p "$(dirname "$ZOOPIDFILE")"

if [ ! -w "$ZOO_LOG_DIR" ] ; then
    mkdir -p "$ZOO_LOG_DIR"
fi

case $1 in
start)
    echo -n "Starting keeper ... "
    if [ -f "$ZOOPIDFILE" ]; then
      if kill -0 `cat "$ZOOPIDFILE"` > /dev/null 2>&1; then
         echo already running as process `cat "$ZOOPIDFILE"`.
         exit 0
      fi
    fi
    sudo -u clickhouse $ZOOCMD
    if [ $? -eq 0 ]
    then
      pgrep -f "$ZOOCMD" > "$ZOOPIDFILE"
      if [ $? -eq 0 ]
      then
        echo "PID:" `cat "$ZOOPIDFILE"`
        sleep 1
        echo STARTED
      else
        echo FAILED TO WRITE PID
        exit 1
      fi
    else
      echo SERVER DID NOT START
      exit 1
    fi
    ;;
start-foreground)
    sudo -u clickhouse clickhouse-keeper -C "$ZOOCFG" start
    ;;
print-cmd)
    echo "sudo -u clickhouse ${ZOOCMD}"
    ;;
stop)
    echo -n "Stopping keeper ... "
    if [ ! -f "$ZOOPIDFILE" ]
    then
      echo "no keeper to stop (could not find file $ZOOPIDFILE)"
    else
      ZOOPID=$(cat "$ZOOPIDFILE")
      echo $ZOOPID
      kill $ZOOPID
      while true; do
         sleep 3
         if kill -0 $ZOOPID > /dev/null 2>&1; then
            echo $ZOOPID is still running
         else
            break
         fi
      done
      rm "$ZOOPIDFILE"
      echo STOPPED
    fi
    exit 0
    ;;
restart)
    shift
    "$0" stop ${@}
    sleep 3
    "$0" start ${@}
    ;;
status)
    clientPortAddress="localhost"
    clientPort=2181
    STAT=`echo srvr | nc $clientPortAddress $clientPort 2> /dev/null | grep Mode`
    if [ "x$STAT" = "x" ]
    then
        echo "Error contacting service. It is probably not running."
        exit 1
    else
        echo $STAT
        exit 0
    fi
    ;;
*)
    echo "Usage: $0 {start|start-foreground|stop|restart|status|print-cmd}" >&2

esac

2 - clickhouse-keeper-service


installation

You need to install either clickhouse-common-static + clickhouse-keeper OR clickhouse-common-static + clickhouse-server. Both work; use the first option if you don’t need a clickhouse server locally.

dpkg -i clickhouse-common-static_{%version}.deb clickhouse-keeper_{%version}.deb
dpkg -i clickhouse-common-static_{%version}.deb clickhouse-server_{%version}.deb clickhouse-client_{%version}.deb

Create directories

mkdir -p /etc/clickhouse-keeper/config.d
mkdir -p /var/log/clickhouse-keeper
mkdir -p /var/lib/clickhouse-keeper/coordination/log
mkdir -p /var/lib/clickhouse-keeper/coordination/snapshots
mkdir -p /var/lib/clickhouse-keeper/cores

chown -R clickhouse:clickhouse /etc/clickhouse-keeper /var/log/clickhouse-keeper /var/lib/clickhouse-keeper

config

cat /etc/clickhouse-keeper/config.xml

<?xml version="1.0"?>
<clickhouse>
    <logger>
        <!-- Possible levels [1]:

          - none (turns off logging)
          - fatal
          - critical
          - error
          - warning
          - notice
          - information
          - debug
          - trace
          - test (not for production usage)

            [1]: https://github.com/pocoproject/poco/blob/poco-1.9.4-release/Foundation/include/Poco/Logger.h#L105-L114
        -->
        <level>trace</level>
        <log>/var/log/clickhouse-keeper/clickhouse-keeper.log</log>
        <errorlog>/var/log/clickhouse-keeper/clickhouse-keeper.err.log</errorlog>
        <!-- Rotation policy
             See https://github.com/pocoproject/poco/blob/poco-1.9.4-release/Foundation/include/Poco/FileChannel.h#L54-L85
          -->
        <size>1000M</size>
        <count>10</count>
        <!-- <console>1</console> --> <!-- Default behavior is autodetection (log to console if not daemon mode and is tty) -->

        <!-- Per level overrides (legacy):

        For example to suppress logging of the ConfigReloader you can use:
        NOTE: levels.logger is reserved, see below.
        -->
        <!--
        <levels>
          <ConfigReloader>none</ConfigReloader>
        </levels>
        -->

        <!-- Per level overrides:

        For example to suppress logging of the RBAC for default user you can use:
        (But please note that the logger name maybe changed from version to version, even after minor upgrade)
        -->
        <!--
        <levels>
          <logger>
            <name>ContextAccess (default)</name>
            <level>none</level>
          </logger>
          <logger>
            <name>DatabaseOrdinary (test)</name>
            <level>none</level>
          </logger>
        </levels>
        -->
        <!-- Structured log formatting:
        You can specify the log format (for now, JSON only). In that case, the console log will be printed
        in specified format like JSON.
        For example, as below:
        {"date_time":"1650918987.180175","thread_name":"#1","thread_id":"254545","level":"Trace","query_id":"","logger_name":"BaseDaemon","message":"Received signal 2","source_file":"../base/daemon/BaseDaemon.cpp; virtual void SignalListener::run()","source_line":"192"}
        To enable JSON logging support, just uncomment <formatting> tag below.
        -->
        <!-- <formatting>json</formatting> -->
    </logger>

    <!-- Listen specified address.
     Use :: (wildcard IPv6 address), if you want to accept connections both with IPv4 and IPv6 from everywhere.
     Notes:
     If you open connections from wildcard address, make sure that at least one of the following measures applied:
     - server is protected by firewall and not accessible from untrusted networks;
     - all users are restricted to subset of network addresses (see users.xml);
     - all users have strong passwords, only secure (TLS) interfaces are accessible, or connections are only made via TLS interfaces.
     - users without password have readonly access.
     See also: https://www.shodan.io/search?query=clickhouse
    -->
    <!-- <listen_host>::</listen_host> -->


    <!-- Same for hosts without support for IPv6: -->
    <!-- <listen_host>0.0.0.0</listen_host> -->

    <!-- Default values - try listen localhost on IPv4 and IPv6. -->
    <!--
    <listen_host>::1</listen_host>
    <listen_host>127.0.0.1</listen_host>
    -->

    <!-- <interserver_listen_host>::</interserver_listen_host> -->
    <!-- Listen host for communication between replicas. Used for data exchange -->
    <!-- Default values - equal to listen_host -->

    <!-- Don't exit if IPv6 or IPv4 networks are unavailable while trying to listen. -->
    <!-- <listen_try>0</listen_try> -->

    <!-- Allow multiple servers to listen on the same address:port. This is not recommended.
    -->
    <!-- <listen_reuse_port>0</listen_reuse_port> -->
    <!-- <listen_backlog>4096</listen_backlog> -->

    <path>/var/lib/clickhouse-keeper/</path>
    <core_path>/var/lib/clickhouse-keeper/cores</core_path>

    <keeper_server>
        <tcp_port>2181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse-keeper/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse-keeper/coordination/snapshots</snapshot_storage_path>

        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>trace</raft_logs_level>
            <rotate_log_storage_interval>10000</rotate_log_storage_interval>
        </coordination_settings>

        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>localhost</hostname>
                <port>9444</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>

cat /etc/clickhouse-keeper/config.d/keeper.xml

<?xml version="1.0"?>
<clickhouse>
    <listen_host>::</listen_host>
    <keeper_server>
        <tcp_port>2181</tcp_port>
        <server_id>1</server_id>
        <raft_configuration>
            <server>
               <id>1</id>
               <hostname>keeper-host-1</hostname>
               <port>9444</port>
            </server>
            <server>
               <id>2</id>
               <hostname>keeper-host-2</hostname>
               <port>9444</port>
            </server>
            <server>
               <id>3</id>
               <hostname>keeper-host-3</hostname>
               <port>9444</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>

systemd service

cat /lib/systemd/system/clickhouse-keeper.service
[Unit]
Description=ClickHouse Keeper (analytic DBMS for big data)
Requires=network-online.target
# NOTE: that After/Wants=time-sync.target is not enough, you need to ensure
# that the time was adjusted already, if you use systemd-timesyncd you are
# safe, but if you use ntp or some other daemon, you should configure it
# additionally.
After=time-sync.target network-online.target
Wants=time-sync.target

[Service]
Type=simple
User=clickhouse
Group=clickhouse
Restart=always
RestartSec=30
RuntimeDirectory=clickhouse-keeper
ExecStart=/usr/bin/clickhouse-keeper --config=/etc/clickhouse-keeper/config.xml --pid-file=/run/clickhouse-keeper/clickhouse-keeper.pid
# Minus means that this file is optional.
EnvironmentFile=-/etc/default/clickhouse
LimitCORE=infinity
LimitNOFILE=500000
CapabilityBoundingSet=CAP_NET_ADMIN CAP_IPC_LOCK CAP_SYS_NICE CAP_NET_BIND_SERVICE

[Install]
# ClickHouse should not start from the rescue shell (rescue.target).
WantedBy=multi-user.target

systemctl daemon-reload

systemctl status clickhouse-keeper

systemctl start clickhouse-keeper

debug start without service (as foreground application)

sudo -u clickhouse /usr/bin/clickhouse-keeper --config=/etc/clickhouse-keeper/config.xml
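
Once it is running, clickhouse-keeper answers the same four-letter-word commands as ZooKeeper, so the usual checks apply (assuming the tcp_port 2181 from the config above):

echo ruok | nc localhost 2181   # should answer imok
echo stat | nc localhost 2181   # brief status, including the Mode
echo mntr | nc localhost 2181   # monitoring counters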

3 - Install standalone Zookeeper for ClickHouse on Ubuntu / Debian

Install standalone Zookeeper for ClickHouse on Ubuntu / Debian.

Reference script to install standalone Zookeeper for Ubuntu / Debian

Tested on Ubuntu 20.

# install java runtime environment
sudo apt-get update
sudo apt install default-jre

# prepare folders, logs folder should be on the low-latency disk.
sudo mkdir -p /var/lib/zookeeper/data /var/lib/zookeeper/logs /etc/zookeeper /var/log/zookeeper /opt 

# download and install files 
export ZOOKEEPER_VERSION=3.6.3
wget https://dlcdn.apache.org/zookeeper/zookeeper-${ZOOKEEPER_VERSION}/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz -O /tmp/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz
sudo tar -xvf /tmp/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz -C /opt
rm -rf /tmp/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz

# create the user 
sudo groupadd -r zookeeper
sudo useradd -r -g zookeeper --home-dir=/var/lib/zookeeper --shell=/bin/false zookeeper

# symlink pointing to the used version of the zookeeper distribution
sudo ln -s /opt/apache-zookeeper-${ZOOKEEPER_VERSION}-bin /opt/zookeeper 
sudo chown -R zookeeper:zookeeper /var/lib/zookeeper /var/log/zookeeper /etc/zookeeper /opt/apache-zookeeper-${ZOOKEEPER_VERSION}-bin
sudo chown -h zookeeper:zookeeper /opt/zookeeper

# shortcuts in /usr/local/bin/
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkCli.sh "$@"'             | sudo tee /usr/local/bin/zkCli
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkServer.sh "$@"'          | sudo tee /usr/local/bin/zkServer
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkCleanup.sh "$@"'         | sudo tee /usr/local/bin/zkCleanup
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkSnapShotToolkit.sh "$@"' | sudo tee /usr/local/bin/zkSnapShotToolkit
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkTxnLogToolkit.sh "$@"'   | sudo tee /usr/local/bin/zkTxnLogToolkit
sudo chmod +x /usr/local/bin/zkCli /usr/local/bin/zkServer /usr/local/bin/zkCleanup /usr/local/bin/zkSnapShotToolkit /usr/local/bin/zkTxnLogToolkit

# put the config in place
sudo cp /opt/zookeeper/conf/* /etc/zookeeper
cat <<EOF | sudo tee /etc/zookeeper/zoo.cfg
initLimit=20
syncLimit=10
maxSessionTimeout=60000000
maxClientCnxns=2000
preAllocSize=131072
snapCount=3000000
dataDir=/var/lib/zookeeper/data
# dataLogDir should be on a dedicated low-latency disk!
dataLogDir=/var/lib/zookeeper/logs
clientPort=2181
#clientPortAddress=nthk-zoo1.localdomain
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
4lw.commands.whitelist=*
EOF
sudo chown -R zookeeper:zookeeper /etc/zookeeper

# create systemd service file
cat <<EOF | sudo tee /etc/systemd/system/zookeeper.service
[Unit]
Description=Zookeeper Daemon
Documentation=http://zookeeper.apache.org
Requires=network.target
After=network.target

[Service]
Type=forking
WorkingDirectory=/var/lib/zookeeper
User=zookeeper
Group=zookeeper
# ZK_SERVER_HEAP is in megabytes; adjust to ~80-90% of available RAM (more than 8 GB is rather overkill)
Environment=ZK_SERVER_HEAP=1536
Environment=SERVER_JVMFLAGS="-Xms256m -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"
Environment=ZOO_LOG_DIR=/var/log/zookeeper
ExecStart=/opt/zookeeper/bin/zkServer.sh start /etc/zookeeper/zoo.cfg
ExecStop=/opt/zookeeper/bin/zkServer.sh stop /etc/zookeeper/zoo.cfg
ExecReload=/opt/zookeeper/bin/zkServer.sh restart /etc/zookeeper/zoo.cfg
TimeoutSec=30
Restart=on-failure

[Install]
WantedBy=default.target
EOF

# start zookeeper
sudo systemctl daemon-reload
sudo systemctl start zookeeper.service 

# check status etc.
echo stat | nc localhost 2181
echo ruok | nc localhost 2181
echo mntr | nc localhost 2181
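
# enable start on boot (optional but usually desired)
sudo systemctl enable zookeeper.service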

4 - clickhouse-keeper


Since 2021, a built-in alternative to ZooKeeper has been under development; its goal is to address several design pitfalls and to get rid of the extra dependency.

See slides: https://presentations.clickhouse.com/meetup54/keeper.pdf and video https://youtu.be/IfgtdU1Mrm0?t=2682

Current status (last updated: July 2023)

Since version 23.3 we recommend using clickhouse-keeper for new installations.

It is even better to use the latest version of clickhouse-keeper (currently 23.7); it is not necessary to run the same version of clickhouse-keeper as clickhouse itself.

For existing systems that currently use Apache Zookeeper, you can consider migrating to clickhouse-keeper, especially if you plan to upgrade clickhouse as well.

But please remember that on heavily loaded systems the change may give no performance benefit and can sometimes even lead to worse performance.

The development pace of the keeper code is still high, so every new version brings improvements and fixes, and stability/maturity grows from version to version. So if you want to try clickhouse-keeper in some environment, please use the most recent ClickHouse releases! And of course: share your feedback :)

How does it work

Official docs: https://clickhouse.com/docs/en/guides/sre/keeper/clickhouse-keeper/

clickhouse-keeper still needs to be started separately on a few nodes (similar to ’normal’ zookeeper) and it speaks the normal zookeeper protocol, which simplifies A/B testing against real zookeeper.

To test it, you need to run 3 instances of clickhouse-server (which will mimic zookeeper) with an extra config like this:

https://github.com/ClickHouse/ClickHouse/blob/master/tests/integration/test_keeper_multinode_simple/configs/enable_keeper1.xml

https://github.com/ClickHouse/ClickHouse/blob/master/tests/integration/test_keeper_snapshots/configs/enable_keeper.xml

or even a single instance with a config like this: https://github.com/ClickHouse/ClickHouse/blob/master/tests/config/config.d/keeper_port.xml https://github.com/ClickHouse/ClickHouse/blob/master/tests/config/config.d/zookeeper.xml

Then point all the clickhouse servers (the zookeeper config section) to those nodes/ports, as in the sketch below.
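
A minimal sketch of such a zookeeper section on the ClickHouse side (the hostnames keeper-host-1..3 and port 2181 are placeholders and must match your keeper configuration):

cat /etc/clickhouse-server/config.d/zookeeper.xml

<?xml version="1.0"?>
<clickhouse>
    <zookeeper>
        <node>
            <host>keeper-host-1</host>
            <port>2181</port>
        </node>
        <node>
            <host>keeper-host-2</host>
            <port>2181</port>
        </node>
        <node>
            <host>keeper-host-3</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>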

The latest version is recommended (even testing/master builds). We will be thankful for any feedback.

systemd service file

See https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/clickhouse-keeper-service/

init.d script

See https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/clickhouse-keeper-initd/

Example of a simple cluster with 2 nodes of Clickhouse using built-in keeper

For example, you can start two ClickHouse nodes (hostname1, hostname2).

hostname1

$ cat /etc/clickhouse-server/config.d/keeper.xml

<?xml version="1.0" ?>
<yandex>
    <keeper_server>
        <tcp_port>2181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>trace</raft_logs_level>
            <rotate_log_storage_interval>10000</rotate_log_storage_interval>
        </coordination_settings>

        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>hostname1</hostname>
                <port>9444</port>
            </server>
            <server>
                <id>2</id>
                <hostname>hostname2</hostname>
                <port>9444</port>
            </server>
        </raft_configuration>

    </keeper_server>

    <zookeeper>
        <node>
            <host>localhost</host>
            <port>2181</port>
        </node>
    </zookeeper>

    <distributed_ddl>
        <path>/clickhouse/testcluster/task_queue/ddl</path>
    </distributed_ddl>
</yandex>

$ cat /etc/clickhouse-server/config.d/macros.xml

<?xml version="1.0" ?>
<yandex>
    <macros>
        <cluster>testcluster</cluster>
        <replica>replica1</replica>
        <shard>1</shard>
    </macros>
</yandex>

hostname2

$ cat /etc/clickhouse-server/config.d/keeper.xml

<?xml version="1.0" ?>
<yandex>
    <keeper_server>
        <tcp_port>2181</tcp_port>
        <server_id>2</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>trace</raft_logs_level>
            <rotate_log_storage_interval>10000</rotate_log_storage_interval>
        </coordination_settings>

        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>hostname1</hostname>
                <port>9444</port>
            </server>
            <server>
                <id>2</id>
                <hostname>hostname2</hostname>
                <port>9444</port>
            </server>
        </raft_configuration>

    </keeper_server>

    <zookeeper>
        <node>
            <host>localhost</host>
            <port>2181</port>
        </node>
    </zookeeper>

    <distributed_ddl>
        <path>/clickhouse/testcluster/task_queue/ddl</path>
    </distributed_ddl>
</yandex>

$ cat /etc/clickhouse-server/config.d/macros.xml

<?xml version="1.0" ?>
<yandex>
    <macros>
        <cluster>testcluster</cluster>
        <replica>replica2</replica>
        <shard>1</shard>
    </macros>
</yandex>

on both

$ cat /etc/clickhouse-server/config.d/clusters.xml

<?xml version="1.0" ?>
<yandex>
    <remote_servers>
        <testcluster>
            <shard>
                <replica>
                    <host>hostname1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>hostname2</host>
                    <port>9000</port>
                </replica>
            </shard>
        </testcluster>
    </remote_servers>
</yandex>

Then create a table

CREATE TABLE test ON CLUSTER '{cluster}' (A Int64, S String)
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
ORDER BY A;

INSERT INTO test SELECT number, '' FROM numbers(100000000);

-- on both nodes:
SELECT count() FROM test;
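
To verify that the two replicas actually see each other through the built-in keeper, you can check system.replicas and the keeper counters (a quick sanity check, not a full monitoring setup):

clickhouse-client -q "SELECT database, table, is_leader, total_replicas, active_replicas FROM system.replicas WHERE table = 'test'"
echo mntr | nc hostname1 2181 | grep zk_synced_followers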

5 - How to check the list of watches


ZooKeeper uses watches to notify clients about znode changes. This article explains how to check the watches set on ZooKeeper servers and how this is used.

Solution:

The 'wchc' four-letter-word command lists all watches set on a ZooKeeper server, grouped by session.

# echo wchc | nc zookeeper 2181

Reference

https://zookeeper.apache.org/doc/r3.4.12/zookeeperAdmin.html

The wchp and wchc commands are not enabled by default because of their known DOS vulnerability. For more information, see ZOOKEEPER-2693 and Zookeeper 3.5.2 - Denial of Service.

By default those commands are disabled; they can be enabled via a Java system property:

-Dzookeeper.4lw.commands.whitelist=*

or in the zookeeper config: 4lw.commands.whitelist=*
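
If you only need a summary (number of connections and watch count) instead of the full per-session list, the lighter wchs command is usually enough and is safer on a busy ensemble:

echo wchs | nc zookeeper 2181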

6 - JVM sizes and garbage collector settings


TLDR version

Use a fresh Java version (11 or newer), disable swap, and set up (for a 4 GB node):

JAVA_OPTS="-Xms512m -Xmx3G -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"

If you have a node with more RAM, change it accordingly; for example, for an 8 GB node:

JAVA_OPTS="-Xms512m -Xmx7G -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"

Details

  1. ZooKeeper runs in a JVM. Depending on the version, different garbage collectors are available.

  2. Recent JVM versions (starting from 10) use the G1 garbage collector by default, which should work fine. On JVM 13-14, the ZGC or Shenandoah garbage collector may reduce pauses. On older JVM versions (before 10) you may want to do some tuning to decrease pauses; ParNew + CMS (like in the Yandex config below) is one of the best options.

  3. One of the most important settings for a JVM application is the heap size. A heap size of more than 1 GB is recommended for most use cases; also monitor heap usage to make sure garbage collection does not cause delays. We recommend at least 4 GB of RAM for zookeeper nodes (8 GB is better, but that makes a difference only when zookeeper is heavily loaded).

Set the Java heap size smaller than available RAM size on the node. This is very important to avoid swapping, which will seriously degrade ZooKeeper performance. Be conservative - use a maximum heap size of 3GB for a 4GB machine.

  4. Set the min heap size (Xms) to a value like 512 MB, or even to the same value as the max (Xmx) to avoid resizing and returning RAM to the OS. Add the -XX:+AlwaysPreTouch flag as well to load the memory pages into memory at zookeeper startup.

  5. -XX:MaxGCPauseMillis=50 (200 by default) sets the ’target’ acceptable pause for garbage collection (in milliseconds).

  6. jute.maxbuffer limits the maximum size of znode content. By default it is 1 MB. In some use cases (a lot of partitions in a table) ClickHouse may need to create bigger znodes.

  7. (optional) enable GC logs: -Xloggc:/path_to/gc.log (see the sketch below for where to put these flags)
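
Where to put these flags depends on how ZooKeeper is started. With the systemd unit from the installation script in this section, they go into the ZK_SERVER_HEAP / SERVER_JVMFLAGS environment variables (a sketch for a 4 GB node; adjust the heap to your RAM):

# fragment of /etc/systemd/system/zookeeper.service
Environment=ZK_SERVER_HEAP=3072
Environment=SERVER_JVMFLAGS="-Xms512m -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"

ZK_SERVER_HEAP is in megabytes and is turned into -Xmx by the stock zkEnv.sh, so 3072 here corresponds to the -Xmx3G from the TLDR above.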

Zookeeper configuration used by Yandex Metrika (from 2017)

The configuration used by Yandex (https://clickhouse.tech/docs/en/operations/tips/#zookeeper): they use an older JVM version (with the UseParNewGC garbage collector) and tune GC logging heavily:

JAVA_OPTS="-Xms{{ cluster.get('xms','128M') }} \
    -Xmx{{ cluster.get('xmx','1G') }} \
    -Xloggc:/var/log/$NAME/zookeeper-gc.log \
    -XX:+UseGCLogFileRotation \
    -XX:NumberOfGCLogFiles=16 \
    -XX:GCLogFileSize=16M \
    -verbose:gc \
    -XX:+PrintGCTimeStamps \
    -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails \
    -XX:+PrintTenuringDistribution \
    -XX:+PrintGCApplicationStoppedTime \
    -XX:+PrintGCApplicationConcurrentTime \
    -XX:+PrintSafepointStatistics \
    -XX:+UseParNewGC \
    -XX:+UseConcMarkSweepGC \
    -XX:+CMSParallelRemarkEnabled"

7 - Proper setup


Main docs article

https://docs.altinity.com/operationsguide/clickhouse-zookeeper/zookeeper-installation/

Hardware requirements

TLDR version:

  1. USE DEDICATED FAST DISKS for the transaction log! (crucial for performance because of the write-ahead log; NVMe is preferred for heavily loaded setups).
  2. use 3 nodes (more nodes = slower quorum, fewer nodes = no HA).
  3. low network latency between zookeeper nodes is very important (latency, not bandwidth).
  4. have at least 4 GB of RAM, disable swap, tune JVM sizes and garbage collector settings.
  5. ensure that zookeeper will not be CPU-starved by other processes.
  6. monitor zookeeper.

Side note: in many cases, the slowness of zookeeper is actually a symptom of an issue with the clickhouse schema or usage pattern (the most typical issues: an enormous number of partitions/tables/databases combined with real-time inserts, or tiny and frequent inserts).

Some documentation on the subject:

Quoted from https://zookeeper.apache.org/doc/r3.5.7/zookeeperAdmin.html#sc_commonProblems :

Things to Avoid

Here are some common problems you can avoid by configuring ZooKeeper correctly:

  • inconsistent lists of servers : The list of ZooKeeper servers used by the clients must match the list of ZooKeeper servers that each ZooKeeper server has. Things work okay if the client list is a subset of the real list, but things will really act strange if clients have a list of ZooKeeper servers that are in different ZooKeeper clusters. Also, the server lists in each Zookeeper server configuration file should be consistent with one another.
  • incorrect placement of transaction log : The most performance critical part of ZooKeeper is the transaction log. ZooKeeper syncs transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, increase the snapCount so that snapshot files are generated less often; it does not eliminate the problem, but it makes more resources available for the transaction log.
  • incorrect Java heap size : You should take special care to set your Java max heap size correctly. In particular, you should not create a situation in which ZooKeeper swaps to disk. The disk is death to ZooKeeper. Everything is ordered, so if processing one request swaps the disk, all other queued requests will probably do the same. DON’T SWAP. Be conservative in your estimates: if you have 4G of RAM, do not set the Java max heap size to 6G or even 4G. For example, it is more likely you would use a 3G heap for a 4G machine, as the operating system and the cache also need memory. The best and only recommended practice for estimating the heap size your system needs is to run load tests, and then make sure you are well below the usage limit that would cause the system to swap.
  • Publicly accessible deployment : A ZooKeeper ensemble is expected to operate in a trusted computing environment. It is thus recommended to deploy ZooKeeper behind a firewall.

8 - Recovering from complete metadata loss in ZooKeeper


Problem

Every ClickHouse user experiences a loss of ZooKeeper one day. While the data remains available and replicas respond to queries, inserts are no longer possible. ClickHouse uses ZooKeeper to store the reference version of the table structure and information about data parts, and when ZooKeeper is not available it can not guarantee data consistency anymore. Replicated tables turn read-only. In this article we describe step-by-step instructions on how to restore ZooKeeper metadata and bring a ClickHouse cluster back to normal operation.

In order to restore ZooKeeper we have to solve two tasks. First, we need to restore table metadata in ZooKeeper. Currently, the only way to do it is to recreate the table with the CREATE TABLE DDL statement.

CREATE TABLE table_name ... ENGINE=ReplicatedMergeTree('zookeeper_path','replica_name');

The second and more difficult task is to populate zookeeper with information about clickhouse data parts. As mentioned above, ClickHouse stores the reference data about all parts of replicated tables in ZooKeeper, so we have to traverse all partitions and re-attach them to the recovered replicated table in order to fix that.

https://altinity.com/blog/a-new-way-to-restore-clickhouse-after-zookeeper-metadata-is-lost

Test case

Let’s say we have replicated table table_repl.

CREATE TABLE table_repl 
(
   `number` UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/table_repl','{replica}')
PARTITION BY intDiv(number, 1000)
ORDER BY number;

And populate it with some data

SELECT * FROM system.zookeeper WHERE path='/clickhouse/cluster_1/tables/01/';

INSERT INTO table_repl SELECT * FROM numbers(1000,2000);

SELECT partition, sum(rows) AS rows, count() FROM system.parts WHERE table='table_repl' AND active GROUP BY partition;

Now let’s remove metadata in zookeeper using ZkCli.sh at ZooKeeper host:

deleteall  /clickhouse/cluster_1/tables/01/table_repl

And try to resync clickhouse replica state with zookeeper:

SYSTEM RESTART REPLICA table_repl;

If we try to insert some data into the table, an error happens:

INSERT INTO table_repl SELECT number AS number FROM numbers(1000,2000) WHERE number % 2 = 0;

Now we get an exception saying that we lost all metadata in zookeeper. It is time to recover!

Current Solution

  1. Detach replicated table.

    DETACH TABLE table_repl;
    
  2. Save the table’s attach script and change the engine of the replicated table to its non-replicated *MergeTree analogue. The table definition is located in the ‘metadata’ folder, ‘/var/lib/clickhouse/metadata/default/table_repl.sql’ in our example. Please make a backup copy and modify the file as follows:

    ATTACH TABLE table_repl
    (
       `number` UInt32
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/table_repl', '{replica}')
    PARTITION BY intDiv(number, 1000)
    ORDER BY number
    SETTINGS index_granularity = 8192
    

    Needs to be replaced with this:

    ATTACH TABLE table_repl
    (
       `number` UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY intDiv(number, 1000)
    ORDER BY number
    SETTINGS index_granularity = 8192
    
  3. Attach non-replicated table.

    ATTACH TABLE table_repl;
    
  4. Rename non-replicated table.

    RENAME TABLE table_repl TO table_repl_old;
    
  5. Create a new replicated table. Take the saved attach script and replace ATTACH with CREATE, and run it.

    CREATE TABLE table_repl
    (
       `number` UInt32
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/table_repl', '{replica}')
    PARTITION BY intDiv(number, 1000)
    ORDER BY number
    SETTINGS index_granularity = 8192
    
  6. Attach the partitions from the old table to the new one.

    ALTER TABLE table_repl ATTACH PARTITION 1 FROM table_repl_old;
    
    ALTER TABLE table_repl ATTACH PARTITION 2 FROM table_repl_old;
    

If the table has many partitions, a small shell script can make this easier; see the sketch below.
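
A rough sketch of such a script, using system.parts to enumerate the partitions (the database default and the table names from the example above are assumptions; review the generated statements before running them):

#!/bin/bash
# attach every active partition of the old non-replicated table to the new replicated one
for partition_id in $(clickhouse-client -q "SELECT DISTINCT partition_id FROM system.parts WHERE database = 'default' AND table = 'table_repl_old' AND active"); do
    echo "Attaching partition ${partition_id}"
    clickhouse-client -q "ALTER TABLE default.table_repl ATTACH PARTITION ID '${partition_id}' FROM default.table_repl_old"
done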

Automated approach

For a large number of tables, you can use the script https://github.com/Altinity/clickhouse-zookeeper-recovery which partially automates the approach above.

9 - ZooKeeper backup


Question: Do I need to back up the ZooKeeper database, since it is pretty important for ClickHouse?

TLDR answer: NO, just back up the ClickHouse data itself and run SYSTEM RESTORE REPLICA during recovery to recreate the zookeeper data.

Details:

ZooKeeper does not store any data; it stores the STATE of the distributed system (“that replica has those parts”, “2 merges still need to be done”, “an alter is being applied”, etc.). That state changes constantly, and you can not capture, back up, and recover it in a safe manner. So even a backup from a few seconds ago represents some ‘old state from the past’ which is INCONSISTENT with the actual state of the data.

In other words: if clickhouse is working, the state of the distributed system is always changing, and it is almost impossible to capture the current state of zookeeper (while you are collecting it, it will change many times). The only exception is a ‘stop-the-world’ scenario: shut down all clickhouse nodes and all other zookeeper clients, then shut down zookeeper itself, and only then take the backups. In that scenario the backups of zookeeper and clickhouse will be consistent, and restoring the backup is as simple as (and equal to) starting all the nodes that were stopped before. But usually that scenario is impractical because it requires a huge downtime.

So what should you do instead? It is enough to back up the clickhouse data itself; to recover the state of zookeeper, just run the SYSTEM RESTORE REPLICA command AFTER restoring the clickhouse data (see the example below). That recreates the state of the replica in zookeeper to match what exists on the filesystem after the backup recovery.
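
For example, once the restored table comes up in read-only mode, one statement per table is enough to recreate its znodes (the table name is illustrative):

clickhouse-client -q "SYSTEM RESTORE REPLICA db.table_name"

The statement also supports ON CLUSTER, so it can be run once per table for the whole cluster.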

Normally a ZooKeeper ensemble consists of 3 nodes, which is enough to survive hardware failures.

On older versions (which do not have the SYSTEM RESTORE REPLICA command) it can be done manually using these instructions: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication/#converting-from-mergetree-to-replicatedmergetree ; at scale you can try https://github.com/Altinity/clickhouse-zookeeper-recovery

10 - ZooKeeper cluster migration


Here is a plan for ZK 3.4.9 (no dynamic reconfiguration):

  1. Add the 3 new ZK nodes to the old cluster. No changes needed for the 3 old ZK nodes at this time.
    1. Configure one of the new ZK nodes as a cluster of 4 nodes (3 old + 1 new), start it.
    2. Configure the other two new ZK nodes as a cluster of 6 nodes (3 old + 3 new), start them.
  2. Make sure the 3 new ZK nodes have connected to the old ZK cluster as followers (run echo stat | nc localhost 2181 on the 3 new ZK nodes)
  3. Confirm that the leader has 5 synced followers (run echo mntr | nc localhost 2181 on the leader, look for zk_synced_followers)
  4. Stop data ingestion in CH (this is to minimize errors when CH loses ZK).
  5. Change the zookeeper section in the configs on the CH nodes (remove the 3 old ZK servers, add the 3 new ZK servers)
  6. Make sure that there are no connections from CH to the 3 old ZK nodes (run echo stat | nc localhost 2181 on the 3 old nodes, check their Clients section). Restart all CH nodes if necessary (In some cases CH can reconnect to different ZK servers without a restart).
  7. Remove the 3 old ZK nodes from zoo.cfg on the 3 new ZK nodes.
  8. Restart the 3 new ZK nodes. They should form a cluster of 3 nodes.
  9. When CH reconnects to ZK, start data loading.
  10. Turn off the 3 old ZK nodes.

This plan works, but it is not the only way to do this; it can be adapted if needed. The checks from steps 2, 3 and 6 can be scripted, for example as shown below.
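
A minimal sketch of those checks (hostnames are placeholders; replace old-zk1 with the current leader):

for host in old-zk1 old-zk2 old-zk3 new-zk1 new-zk2 new-zk3; do
  echo -n "$host: "
  echo srvr | nc "$host" 2181 | grep Mode
done

# on the current leader: all 5 followers should be synced before switching ClickHouse over
echo mntr | nc old-zk1 2181 | grep zk_synced_followers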

12 - ZooKeeper schema


/metadata

Table schema.

date column -> legacy MergeTree partition expression.
sampling expression -> SAMPLE BY
index granularity -> index_granularity
mode -> type of MergeTree table
sign column -> sign - CollapsingMergeTree / VersionedCollapsingMergeTree
primary key -> ORDER BY key if PRIMARY KEY not defined.
sorting key -> ORDER BY key if PRIMARY KEY defined.
data format version -> 1
partition key -> PARTITION BY
granularity bytes -> index_granularity_bytes

types of MergeTree tables:
Ordinary            = 0
Collapsing          = 1
Summing             = 2
Aggregating         = 3
Replacing           = 5
Graphite            = 6
VersionedCollapsing = 7

/mutations

Log of latest mutations

/columns

List of columns for latest (reference) table version. Replicas would try to reach this state.

/log

Log of latest actions with table.

Related settings:

  • max_replicated_logs_to_keep (UInt64, default 1000): how many records may be in the log if there is an inactive replica; an inactive replica becomes lost when this number is exceeded.
  • min_replicated_logs_to_keep (UInt64, default 10): keep about this number of the last records in the ZooKeeper log, even if they are obsolete. It doesn't affect the work of tables: it is used only to diagnose the ZooKeeper log before cleaning.

/replicas

List of table replicas.

/replicas/replica_name/

/replicas/replica_name/mutation_pointer

Pointer to the latest mutation executed by replica

/replicas/replica_name/log_pointer

Pointer to the latest task from replication_queue executed by replica

/replicas/replica_name/max_processed_insert_time

/replicas/replica_name/metadata

Table schema of a specific replica.

/replicas/replica_name/columns

Column list of a specific replica.

/quorum

Used for quorum inserts.
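
All of these znodes can be inspected from ClickHouse itself through the system.zookeeper table; for example, for the test table from the keeper example above (adjust the path to the zookeeper_path of your table):

clickhouse-client -q "SELECT name, value FROM system.zookeeper WHERE path = '/clickhouse/testcluster/tables/default/test' FORMAT Vertical"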