JVM sizes and garbage collector settings

JVM sizes and garbage collector settings

TLDR version

use fresh Java version (11 or newer), disable swap and set up (for 4 Gb node):

JAVA_OPTS="-Xms3G -Xmx3G -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"

If you have a node with more RAM - change it accordingly, for example for 8Gb node:

JAVA_OPTS="-Xms7G -Xmx7G -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"

Details

  1. ZooKeeper runs as in JVM. Depending on version different garbage collectors are available.

  2. Recent JVM versions (starting from 10) use G1 garbage collector by default (should work fine). On JVM 13-14 using ZGC or Shenandoah garbage collector may reduce pauses. On older JVM version (before 10) you may want to make some tuning to decrease pauses, ParNew + CMS garbage collectors (like in Yandex config) is one of the best options.

  3. One of the most important setting for JVM application is heap size. A heap size of >1 GB is recommended for most use cases and monitoring heap usage to ensure no delays are caused by garbage collection. We recommend to use at least 4Gb of RAM for zookeeper nodes (8Gb is better, that will make difference only when zookeeper is heavily loaded).

Set the Java heap size smaller than available RAM size on the node. This is very important to avoid swapping, which will seriously degrade ZooKeeper performance. Be conservative - use a maximum heap size of 3GB for a 4GB machine.

  1. Set min (Xms) and max (Xmx) heap size to same value to avoid resizing. Add XX:+AlwaysPreTouch flag as well to load the memory pages into memory at the start of the zookeeper.

  2. MaxGCPauseMillis=50 (by default 200) - the ‘target’ acceptable pause for garbage collection (milliseconds)

  3. jute.maxbuffer limits the maximum size of znode content. By default it’s 1Mb. In some usecases (lot of partitions in table) ClickHouse may need to create bigger znodes.

  4. (optional) enable GC logs: -Xloggc:/path_to/gc.log

Zookeeper configurarion used by Yandex Metrika (from 2017)

The configuration used by Yandex ( https://clickhouse.tech/docs/en/operations/tips/#zookeeper ) - they use older JVM version (with UseParNewGC garbage collector), and tune GC logs heavily:

JAVA_OPTS="-Xms{{ cluster.get('xms','128M') }} \
    -Xmx{{ cluster.get('xmx','1G') }} \
    -Xloggc:/var/log/$NAME/zookeeper-gc.log \
    -XX:+UseGCLogFileRotation \
    -XX:NumberOfGCLogFiles=16 \
    -XX:GCLogFileSize=16M \
    -verbose:gc \
    -XX:+PrintGCTimeStamps \
    -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails
    -XX:+PrintTenuringDistribution \
    -XX:+PrintGCApplicationStoppedTime \
    -XX:+PrintGCApplicationConcurrentTime \
    -XX:+PrintSafepointStatistics \
    -XX:+UseParNewGC \
    -XX:+UseConcMarkSweepGC \
    -XX:+CMSParallelRemarkEnabled"

See also

Last modified 2021.09.07 : Replaced HTML code artifacts. (465040d6)