Setup & maintenance
- 1: S3 & object storage
- 1.1: AWS S3 Recipes
- 1.2: S3Disk
- 2: AggregateFunction(uniq, UUID) doubled after ClickHouse upgrade
- 3: Can not connect to my ClickHouse server
- 4: cgroups and kubernetes cloud providers
- 5: Transforming ClickHouse logs to ndjson using Vector.dev
- 6: ClickHouse operator
- 7: Custom Settings
- 8: Description of asynchronous_metrics
- 9: http handler example
- 10: Logging
- 11: Precreate parts using clickhouse-local
- 12: Access Control and Account Management example (RBAC)
- 13: recovery-after-complete-data-loss
- 14: source parts size is greater than the current maximum
- 15: Successful ClickHouse deployment plan
- 16: Timeouts during OPTIMIZE FINAL
- 17: Useful settings to turn on/Defaults that should be reconsidered
- 18: ZooKeeper session has expired
- 19: Altinity packaging compatibility >21.x and earlier
- 20: AWS EBS
- 21: ClickHouse in Docker
- 22: ClickHouse Monitoring
- 23: ClickHouse versions
- 24: clickhouse-backup
- 25: Converting MergeTree to Replicated
- 26: Data Migration
- 26.1: clickhouse-copier
- 26.1.1: clickhouse-copier 20.3 and earlier
- 26.1.2: clickhouse-copier 20.4 - 21.6
- 26.2: Fetch Alter Table
- 26.3: Remote table function
- 26.4: rsync
- 27: DDLWorker
- 28: differential backups using clickhouse-backup
- 29: High CPU usage
- 30: Load balancers
- 31: memory configuration settings
- 32: Moving table to another device.
- 33: Object consistency in a cluster
- 34: Production Cluster Configuration Guide
- 34.1: Backups
- 34.2: Cluster Configuration FAQ
- 34.3: Cluster Configuration Process
- 34.4: Hardware Requirements
- 34.5: Network Configuration
- 34.6: Version Upgrades
- 35: Replication queue
- 36: Schema migration tools for ClickHouse
- 36.1: golang-migrate
- 37: Server config files
- 38: Settings to adjust
- 39: Shutting down a node
- 40: SSL connection unexpectedly closed
- 41: Suspiciously many broken parts
- 42: System tables eat my disk
- 43: Threads
- 44: Who ate my memory
- 45: X rows of Y total rows in filesystem are suspicious
- 46: ZooKeeper
- 46.1: Install standalone Zookeeper for ClickHouse on Ubuntu / Debian
- 46.2: clickhouse-keeper
- 46.3: How to check the list of watches
- 46.4: JVM sizes and garbage collector settings
- 46.5: Proper setup
- 46.6: Recovering from complete metadata loss in ZooKeeper
- 46.7: ZooKeeper backup
- 46.8: ZooKeeper cluster migration
- 46.9: ZooKeeper Monitoring
- 46.10: ZooKeeper schema
1 - S3 & object storage
1.1 - AWS S3 Recipes
Using AWS IAM — Identity and Access Management roles
For an EC2 instance, there is an option to configure an IAM role:
The role should contain a policy with permissions like:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "allow-put-and-get",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::BUCKET_NAME/test_s3_disk/*"
    }
  ]
}
Corresponding configuration of ClickHouse:
<clickhouse>
    <storage_configuration>
        <disks>
            <disk_s3>
                <type>s3</type>
                <endpoint>http://s3.us-east-1.amazonaws.com/BUCKET_NAME/test_s3_disk/</endpoint>
                <use_environment_credentials>true</use_environment_credentials>
            </disk_s3>
        </disks>
        <policies>
            <policy_s3_only>
                <volumes>
                    <volume_s3>
                        <disk>disk_s3</disk>
                    </volume_s3>
                </volumes>
            </policy_s3_only>
        </policies>
    </storage_configuration>
</clickhouse>
Small check:
CREATE TABLE table_s3 (number Int64) ENGINE=MergeTree() ORDER BY tuple() PARTITION BY tuple() SETTINGS storage_policy='policy_s3_only';
INSERT INTO table_s3 SELECT * FROM system.numbers LIMIT 100000000;
SELECT * FROM table_s3;
DROP TABLE table_s3;
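To confirm the parts actually landed on the S3 disk, you can also check system.parts before dropping the table (a sketch; the disk_name column is available in recent ClickHouse versions):

```sql
SELECT name, disk_name, rows
FROM system.parts
WHERE table = 'table_s3' AND active;
```

All active parts should show disk_name = 'disk_s3' if the storage policy works as intended.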
1.2 - S3Disk
Settings
<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>http://s3.us-east-1.amazonaws.com/BUCKET_NAME/test_s3_disk/</endpoint>
                <access_key_id>ACCESS_KEY_ID</access_key_id>
                <secret_access_key>SECRET_ACCESS_KEY</secret_access_key>
                <skip_access_check>true</skip_access_check>
                <send_metadata>true</send_metadata>
            </s3>
        </disks>
    </storage_configuration>
</clickhouse>
- skip_access_check — if true, it's possible to use read-only credentials with a regular MergeTree table. But you would need to disable merges (the prefer_not_to_merge setting) on the s3 volume as well.
- send_metadata — if true, ClickHouse will populate the s3 object with the initial part & file path, which allows you to recover metadata from s3 and makes debugging easier.
Restore metadata from S3
Default
Limitations:
- ClickHouse needs RW access to this bucket
In order to restore metadata, you need to create a restore file in the metadata_path/_s3_disk_name_
directory:
touch /var/lib/clickhouse/disks/_s3_disk_name_/restore
In that case ClickHouse will restore to the same bucket and path, and update only the metadata files in the s3 bucket.
Custom
Limitations:
- ClickHouse needs RO access to the old bucket and RW to the new.
- ClickHouse will copy objects in case of restoring to a different bucket or path.
If you would like to change the bucket or path, you need to populate the restore file with settings in key=value format:
cat /var/lib/clickhouse/disks/_s3_disk_name_/restore
source_bucket=s3disk
source_path=vol1/
Links
https://altinity.com/blog/integrating-clickhouse-with-minio
https://altinity.com/blog/clickhouse-object-storage-performance-minio-vs-aws-s3
https://altinity.com/blog/tips-for-high-performance-clickhouse-clusters-with-s3-object-storage
2 - AggregateFunction(uniq, UUID) doubled after ClickHouse upgrade
What happened
After a ClickHouse upgrade from a version before 21.6 to a version after 21.6, the count of unique UUIDs in AggregatingMergeTree tables nearly doubled when data generated by different ClickHouse versions was merged.
Why it happened
A pull request changed the internal representation of big-integer data types (and UUID): the SipHash64 hash function used by the uniq aggregate function for the UUID data type was replaced with intHash64, which produces a different result for the same UUID value across ClickHouse versions. Therefore counts double when uniqState values created by different ClickHouse versions are merged together.
Related issue.
Solution
You need to replace any occurrence of uniqState(uuid) in MATERIALIZED VIEWs with uniqState(sipHash64(uuid)) and change the data type for already saved data from AggregateFunction(uniq, UUID) to AggregateFunction(uniq, UInt64), because the result data type of sipHash64 is UInt64.
-- On ClickHouse version 21.3
CREATE TABLE uniq_state
(
`key` UInt32,
`value` AggregateFunction(uniq, UUID)
)
ENGINE = MergeTree
ORDER BY key
INSERT INTO uniq_state SELECT
number % 10000 AS key,
uniqState(reinterpretAsUUID(number))
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.404 sec. Processed 1.05 million rows, 8.38 MB (2.59 million rows/s., 20.74 MB/s.)
SELECT
key % 20,
uniqMerge(value)
FROM uniq_state
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│ 0 │ 50000 │
│ 1 │ 50000 │
│ 2 │ 50000 │
│ 3 │ 50000 │
│ 4 │ 50000 │
│ 5 │ 50000 │
│ 6 │ 49999 │
│ 7 │ 50000 │
│ 8 │ 49999 │
│ 9 │ 50000 │
│ 10 │ 50000 │
│ 11 │ 50000 │
│ 12 │ 50000 │
│ 13 │ 50000 │
│ 14 │ 50000 │
│ 15 │ 50000 │
│ 16 │ 50000 │
│ 17 │ 50000 │
│ 18 │ 50000 │
│ 19 │ 50000 │
└─────────────────┴──────────────────┘
-- After upgrade of ClickHouse to 21.8
SELECT
key % 20,
uniqMerge(value)
FROM uniq_state
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│ 0 │ 50000 │
│ 1 │ 50000 │
│ 2 │ 50000 │
│ 3 │ 50000 │
│ 4 │ 50000 │
│ 5 │ 50000 │
│ 6 │ 49999 │
│ 7 │ 50000 │
│ 8 │ 49999 │
│ 9 │ 50000 │
│ 10 │ 50000 │
│ 11 │ 50000 │
│ 12 │ 50000 │
│ 13 │ 50000 │
│ 14 │ 50000 │
│ 15 │ 50000 │
│ 16 │ 50000 │
│ 17 │ 50000 │
│ 18 │ 50000 │
│ 19 │ 50000 │
└─────────────────┴──────────────────┘
20 rows in set. Elapsed: 0.240 sec. Processed 10.00 thousand rows, 1.16 MB (41.72 thousand rows/s., 4.86 MB/s.)
CREATE TABLE uniq_state_2
ENGINE = MergeTree
ORDER BY key AS
SELECT *
FROM uniq_state
Ok.
0 rows in set. Elapsed: 0.128 sec. Processed 10.00 thousand rows, 1.16 MB (78.30 thousand rows/s., 9.12 MB/s.)
INSERT INTO uniq_state_2 SELECT
number % 10000 AS key,
uniqState(reinterpretAsUUID(number))
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.266 sec. Processed 1.05 million rows, 8.38 MB (3.93 million rows/s., 31.48 MB/s.)
SELECT
key % 20,
uniqMerge(value)
FROM uniq_state_2
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│ 0 │ 99834 │ <- Count of unique values nearly doubled.
│ 1 │ 100219 │
│ 2 │ 100128 │
│ 3 │ 100457 │
│ 4 │ 100272 │
│ 5 │ 100279 │
│ 6 │ 99372 │
│ 7 │ 99450 │
│ 8 │ 99974 │
│ 9 │ 99632 │
│ 10 │ 99562 │
│ 11 │ 100660 │
│ 12 │ 100439 │
│ 13 │ 100252 │
│ 14 │ 100650 │
│ 15 │ 99320 │
│ 16 │ 100095 │
│ 17 │ 99632 │
│ 18 │ 99540 │
│ 19 │ 100098 │
└─────────────────┴──────────────────┘
20 rows in set. Elapsed: 0.356 sec. Processed 20.00 thousand rows, 2.33 MB (56.18 thousand rows/s., 6.54 MB/s.)
CREATE TABLE uniq_state_3
ENGINE = MergeTree
ORDER BY key AS
SELECT *
FROM uniq_state
0 rows in set. Elapsed: 0.126 sec. Processed 10.00 thousand rows, 1.16 MB (79.33 thousand rows/s., 9.24 MB/s.)
-- Option 1, create separate column
ALTER TABLE uniq_state_3
ADD COLUMN `value_2` AggregateFunction(uniq, UInt64) DEFAULT unhex(hex(value));
ALTER TABLE uniq_state_3
UPDATE value_2 = value_2 WHERE 1;
SELECT *
FROM system.mutations
WHERE is_done = 0;
Ok.
0 rows in set. Elapsed: 0.008 sec.
INSERT INTO uniq_state_3 (key, value_2) SELECT
number % 10000 AS key,
uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.337 sec. Processed 1.05 million rows, 8.38 MB (3.11 million rows/s., 24.89 MB/s.)
SELECT
key % 20,
uniqMerge(value),
uniqMerge(value_2)
FROM uniq_state_3
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┬─uniqMerge(value_2)─┐
│ 0 │ 50000 │ 50000 │
│ 1 │ 50000 │ 50000 │
│ 2 │ 50000 │ 50000 │
│ 3 │ 50000 │ 50000 │
│ 4 │ 50000 │ 50000 │
│ 5 │ 50000 │ 50000 │
│ 6 │ 49999 │ 49999 │
│ 7 │ 50000 │ 50000 │
│ 8 │ 49999 │ 49999 │
│ 9 │ 50000 │ 50000 │
│ 10 │ 50000 │ 50000 │
│ 11 │ 50000 │ 50000 │
│ 12 │ 50000 │ 50000 │
│ 13 │ 50000 │ 50000 │
│ 14 │ 50000 │ 50000 │
│ 15 │ 50000 │ 50000 │
│ 16 │ 50000 │ 50000 │
│ 17 │ 50000 │ 50000 │
│ 18 │ 50000 │ 50000 │
│ 19 │ 50000 │ 50000 │
└─────────────────┴──────────────────┴────────────────────┘
20 rows in set. Elapsed: 0.768 sec. Processed 20.00 thousand rows, 4.58 MB (26.03 thousand rows/s., 5.96 MB/s.)
-- Option 2, modify column in-place with String as intermediate data type.
ALTER TABLE uniq_state_3
MODIFY COLUMN `value` String
Ok.
0 rows in set. Elapsed: 0.280 sec.
ALTER TABLE uniq_state_3
MODIFY COLUMN `value` AggregateFunction(uniq, UInt64)
Ok.
0 rows in set. Elapsed: 0.254 sec.
INSERT INTO uniq_state_3 (key, value) SELECT
number % 10000 AS key,
uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.554 sec. Processed 1.05 million rows, 8.38 MB (1.89 million rows/s., 15.15 MB/s.)
SELECT
key % 20,
uniqMerge(value),
uniqMerge(value_2)
FROM uniq_state_3
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┬─uniqMerge(value_2)─┐
│ 0 │ 50000 │ 50000 │
│ 1 │ 50000 │ 50000 │
│ 2 │ 50000 │ 50000 │
│ 3 │ 50000 │ 50000 │
│ 4 │ 50000 │ 50000 │
│ 5 │ 50000 │ 50000 │
│ 6 │ 49999 │ 49999 │
│ 7 │ 50000 │ 50000 │
│ 8 │ 49999 │ 49999 │
│ 9 │ 50000 │ 50000 │
│ 10 │ 50000 │ 50000 │
│ 11 │ 50000 │ 50000 │
│ 12 │ 50000 │ 50000 │
│ 13 │ 50000 │ 50000 │
│ 14 │ 50000 │ 50000 │
│ 15 │ 50000 │ 50000 │
│ 16 │ 50000 │ 50000 │
│ 17 │ 50000 │ 50000 │
│ 18 │ 50000 │ 50000 │
│ 19 │ 50000 │ 50000 │
└─────────────────┴──────────────────┴────────────────────┘
20 rows in set. Elapsed: 0.589 sec. Processed 30.00 thousand rows, 6.87 MB (50.93 thousand rows/s., 11.66 MB/s.)
SHOW CREATE TABLE uniq_state_3;
CREATE TABLE default.uniq_state_3
(
`key` UInt32,
`value` AggregateFunction(uniq, UInt64),
`value_2` AggregateFunction(uniq, UInt64) DEFAULT unhex(hex(value))
)
ENGINE = MergeTree
ORDER BY key
SETTINGS index_granularity = 8192
-- Option 3, CAST uniqState(UInt64) to String.
CREATE TABLE uniq_state_4
ENGINE = MergeTree
ORDER BY key AS
SELECT *
FROM uniq_state
Ok.
0 rows in set. Elapsed: 0.146 sec. Processed 10.00 thousand rows, 1.16 MB (68.50 thousand rows/s., 7.98 MB/s.)
INSERT INTO uniq_state_4 (key, value) SELECT
number % 10000 AS key,
CAST(uniqState(sipHash64(reinterpretAsUUID(number))), 'String')
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.476 sec. Processed 1.05 million rows, 8.38 MB (2.20 million rows/s., 17.63 MB/s.)
SELECT
key % 20,
uniqMerge(value)
FROM uniq_state_4
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│ 0 │ 50000 │
│ 1 │ 50000 │
│ 2 │ 50000 │
│ 3 │ 50000 │
│ 4 │ 50000 │
│ 5 │ 50000 │
│ 6 │ 49999 │
│ 7 │ 50000 │
│ 8 │ 49999 │
│ 9 │ 50000 │
│ 10 │ 50000 │
│ 11 │ 50000 │
│ 12 │ 50000 │
│ 13 │ 50000 │
│ 14 │ 50000 │
│ 15 │ 50000 │
│ 16 │ 50000 │
│ 17 │ 50000 │
│ 18 │ 50000 │
│ 19 │ 50000 │
└─────────────────┴──────────────────┘
20 rows in set. Elapsed: 0.281 sec. Processed 20.00 thousand rows, 2.33 MB (71.04 thousand rows/s., 8.27 MB/s.)
SHOW CREATE TABLE uniq_state_4;
CREATE TABLE default.uniq_state_4
(
`key` UInt32,
`value` AggregateFunction(uniq, UUID)
)
ENGINE = MergeTree
ORDER BY key
SETTINGS index_granularity = 8192
3 - Can not connect to my ClickHouse server
Errors like “Connection reset by peer, while reading from socket”
- Ensure that clickhouse-server is running:
systemctl status clickhouse-server
If the server was restarted recently and does not accept connections after the restart, it is most probably still starting up. During the startup sequence it needs to iterate over all data folders in /var/lib/clickhouse-server. If you have a very high number of folders there (usually caused by wrong partitioning, or a very high number of tables / databases), startup can take a long time (the same can happen if the disk is very slow, for example NFS).
You can check that by looking for the 'Ready for connections' line in
/var/log/clickhouse-server/clickhouse-server.log
(the Information log level is needed).
- Ensure you use the proper port / IP / interface:
Ensure you're not trying to connect to a secure port without TLS / HTTPS, or vice versa.
For clickhouse-client, pay attention to the host / port / secure flags.
Ensure the interface you're connecting to is one that ClickHouse listens on (by default ClickHouse listens only on localhost).
Note: if you uncomment only the line
<listen_host>0.0.0.0</listen_host>
ClickHouse will listen only on IPv4 interfaces, while localhost (used by clickhouse-client) may resolve to an IPv6 address, so clickhouse-client may fail to connect.
How to check which interfaces / ports ClickHouse listens on:
sudo lsof -i -P -n | grep LISTEN
echo listen_host
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=listen_host
echo tcp_port
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=tcp_port
echo tcp_port_secure
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=tcp_port_secure
echo http_port
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=http_port
echo https_port
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=https_port
- For a secure connection:
  - ensure that the server uses a certificate which can be validated by the client,
  - OR disable certificate checks on the client (INSECURE).
- Check for errors in /var/log/clickhouse-server/clickhouse-server.err.log.
- Is ClickHouse able to serve some trivial TCP / HTTP requests from localhost?
curl 127.0.0.1:9000
curl 127.0.0.1:8123
- Check the number of sockets opened by ClickHouse:
sudo lsof -i -a -p $(pidof clickhouse-server)
# or (adjust the 9000 / 8123 ports if needed)
netstat -tn 2>/dev/null | tail -n +3 | awk '{ printf("%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $5, $6) }' | clickhouse-local -S "Proto String, RecvQ Int64, SendQ Int64, LocalAddress String, ForeignAddress String, State LowCardinality(String)" --query="SELECT * FROM table WHERE LocalAddress like '%:9000' FORMAT PrettyCompact"
netstat -tn 2>/dev/null | tail -n +3 | awk '{ printf("%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $5, $6) }' | clickhouse-local -S "Proto String, RecvQ Int64, SendQ Int64, LocalAddress String, ForeignAddress String, State LowCardinality(String)" --query="SELECT * FROM table WHERE LocalAddress like '%:8123' FORMAT PrettyCompact"
ClickHouse has a limit on the number of open connections (4000 by default).
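You can see how close you are to that limit via system.metrics (a sketch; exact metric names may vary between versions):

```sql
SELECT metric, value
FROM system.metrics
WHERE metric LIKE '%Connection%';
```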
- Check also:
# the overall system supports a limited number of connections it can handle
netstat
# you may also be reaching one of the process ulimits (Max open files)
cat /proc/$(pidof -s clickhouse-server)/limits
- Check firewall / selinux rules (if used).
4 - cgroups and kubernetes cloud providers
Why is my ClickHouse slow after an upgrade to version 22.2 or higher?
The probable reason is that ClickHouse 22.2 started to respect cgroup limits (Respect cgroups limits in max_threads autodetection, #33342 by JaySon).
You can observe that max_threads = 1:
SELECT
name,
value
FROM system.settings
WHERE name = 'max_threads'
┌─name────────┬─value─────┐
│ max_threads │ 'auto(1)' │
└─────────────┴───────────┘
This makes ClickHouse execute all queries with a single thread (normal behavior is half of the available CPU cores: with 64 cores you would see 'auto(32)').
We observed this cgroup behavior in AWS EKS (Kubernetes) environments with the Altinity ClickHouse Operator when requests.cpu and limits.cpu are not set for a resource.
Workaround
We suggest setting requests.cpu equal to half of the available CPU cores, and limits.cpu to the number of CPU cores.
For example, in the case of 16 CPU cores:
resources:
requests:
memory: ...
cpu: 8
limits:
memory: ....
cpu: 16
Then you should get a new result:
SELECT
name,
value
FROM system.settings
WHERE name = 'max_threads'
┌─name────────┬─value─────┐
│ max_threads │ 'auto(8)' │
└─────────────┴───────────┘
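If you cannot change the pod resources, another option is to pin max_threads explicitly in a user profile (a sketch using a standard users.d override; the value 8 here is just an example for a 16-core node):

```xml
<clickhouse>
    <profiles>
        <default>
            <!-- override the cgroup-based autodetection -->
            <max_threads>8</max_threads>
        </default>
    </profiles>
</clickhouse>
```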
In depth
For some reason, when requests.cpu & limits.cpu are empty, AWS EKS sets the cgroup kernel parameters to:
# cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1
# cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000
# cat /sys/fs/cgroup/cpu/cpu.shares
2
This makes ClickHouse set max_threads = 1 because of:
cgroup_share = /sys/fs/cgroup/cpu/cpu.shares (2)
PER_CPU_SHARES = 1024
share_count = ceil( cgroup_share / PER_CPU_SHARES ) ---> ceil(2 / 1024) ---> 1
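The same arithmetic can be reproduced directly in ClickHouse (a sketch; the constants mirror the formula above):

```sql
SELECT
    2 AS cgroup_share,                                  -- /sys/fs/cgroup/cpu/cpu.shares
    1024 AS per_cpu_shares,
    ceil(cgroup_share / per_cpu_shares) AS share_count  -- rounds 0.002 up to 1
```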
Fix
The incorrect calculation was fixed in https://github.com/ClickHouse/ClickHouse/pull/35815 and works correctly in newer releases.
5 - Transforming ClickHouse logs to ndjson using Vector.dev
Installation of vector.dev
# arm64
wget https://packages.timber.io/vector/0.15.2/vector_0.15.2-1_arm64.deb
# amd64
wget https://packages.timber.io/vector/0.15.2/vector_0.15.2-1_amd64.deb
dpkg -i vector_0.15.2-1_*.deb
systemctl stop vector
mkdir /var/log/clickhouse-server-json
chown vector.vector /var/log/clickhouse-server-json
usermod -a -G clickhouse vector
vector config
# cat /etc/vector/vector.toml
data_dir = "/var/lib/vector"
[sources.clickhouse-log]
type = "file"
include = [ "/var/log/clickhouse-server/clickhouse-server.log" ]
fingerprinting.strategy = "device_and_inode"
message_start_indicator = '^\d+\.\d+\.\d+ \d+:\d+:\d+'
multi_line_timeout = 1000
[transforms.clickhouse-log-text]
inputs = [ "clickhouse-log" ]
type = "remap"
source = '''
. |= parse_regex!(.message, r'^(?P<timestamp>\d+\.\d+\.\d+ \d+:\d+:\d+\.\d+) \[\s?(?P<thread_id>\d+)\s?\] \{(?P<query_id>.*)\} <(?P<severity>\w+)> (?s)(?P<message>.*$)')
'''
[sinks.emit-clickhouse-log-json]
type = "file"
inputs = [ "clickhouse-log-text" ]
compression = "none"
path = "/var/log/clickhouse-server-json/clickhouse-server.%Y-%m-%d.ndjson"
encoding.only_fields = ["timestamp", "thread_id", "query_id", "severity", "message" ]
encoding.codec = "ndjson"
start
systemctl start vector
tail /var/log/clickhouse-server-json/clickhouse-server.2022-04-21.ndjson
{"message":"DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 166.80 GiB.","query_id":"","severity":"Debug","thread_id":"283239","timestamp":"2022.04.21 13:43:21.164660"}
{"message":"MergedBlockOutputStream: filled checksums 202204_67118_67118_0 (state Temporary)","query_id":"","severity":"Trace","thread_id":"283239","timestamp":"2022.04.21 13:43:21.166810"}
{"message":"system.metric_log (e3365172-4c9b-441b-b803-756ae030e741): Renaming temporary part tmp_insert_202204_67118_67118_0 to 202204_171703_171703_0.","query_id":"","severity":"Trace","thread_id":"283239","timestamp":"2022.04.21 13:43:21.167226"}
....
sink logs into ClickHouse table
Be careful with logging ClickHouse messages into the same ClickHouse instance; it can cause endless recursive self-logging.
create table default.clickhouse_logs(
timestamp DateTime64(3),
host LowCardinality(String),
thread_id LowCardinality(String),
severity LowCardinality(String),
query_id String,
message String)
Engine = MergeTree
Partition by toYYYYMM(timestamp)
Order by (toStartOfHour(timestamp), host, severity, query_id);
create user vector identified by 'vector1234';
grant insert on default.clickhouse_logs to vector;
create settings profile or replace profile_vector settings log_queries=0 readonly TO vector;
[sinks.clickhouse-output-clickhouse]
inputs = ["clickhouse-log-text"]
type = "clickhouse"
host = "http://localhost:8123"
database = "default"
auth.strategy = "basic"
auth.user = "vector"
auth.password = "vector1234"
healthcheck = true
table = "clickhouse_logs"
encoding.timestamp_format = "unix"
buffer.type = "disk"
buffer.max_size = 104900000
buffer.when_full = "block"
request.in_flight_limit = 20
encoding.only_fields = ["host", "timestamp", "thread_id", "query_id", "severity", "message"]
select * from default.clickhouse_logs limit 10;
┌───────────────timestamp─┬─host───────┬─thread_id─┬─severity─┬─query_id─┬─message─────────────────────────────────────────────────────
│ 2022-04-21 19:08:13.443 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Authentic
│ 2022-04-21 19:08:13.443 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Authentic
│ 2022-04-21 19:08:13.443 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Creating
│ 2022-04-21 19:08:13.447 │ clickhouse │ 283155 │ Debug │ │ MemoryTracker: Peak memory usage (for query): 4.00 MiB.
│ 2022-04-21 19:08:13.447 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Destroyin
│ 2022-04-21 19:08:13.495 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Authentic
│ 2022-04-21 19:08:13.495 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Authentic
│ 2022-04-21 19:08:13.495 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Creating
│ 2022-04-21 19:08:13.496 │ clickhouse │ 283155 │ Debug │ │ MemoryTracker: Peak memory usage (for query): 4.00 MiB.
│ 2022-04-21 19:08:13.496 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Destroyin
└─────────────────────────┴────────────┴───────────┴──────────┴──────────┴─────────────────────────────────────────────────────────────
6 - ClickHouse operator
https://github.com/Altinity/clickhouse-operator/blob/master/docs/README.md
7 - Custom Settings
Using custom settings in config
You cannot use custom settings in the config file 'as is', because ClickHouse doesn't know which data type should be used to parse them.
cat /etc/clickhouse-server/users.d/default_profile.xml
<?xml version="1.0"?>
<yandex>
<profiles>
<default>
<custom_data_version>1</custom_data_version> <!-- will not work! see below -->
</default>
</profiles>
</yandex>
That will end up with the following error:
2021.09.24 12:50:37.369259 [ 264905 ] {} <Error> ConfigReloader: Error updating configuration from '/etc/clickhouse-server/users.xml' config.: Code: 536. DB::Exception: Couldn't restore Field from dump: 1: while parsing value '1' for setting 'custom_data_version'. (CANNOT_RESTORE_FROM_FIELD_DUMP), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x9440eba in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
1. DB::Field::restoreFromDump(std::__1::basic_string_view<char, std::__1::char_traits<char> > const&)::$_4::operator()() const @ 0x10449da0 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
2. DB::Field::restoreFromDump(std::__1::basic_string_view<char, std::__1::char_traits<char> > const&) @ 0x10449bf1 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
3. DB::BaseSettings<DB::SettingsTraits>::stringToValueUtil(std::__1::basic_string_view<char, std::__1::char_traits<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x1042e2bf in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
4. DB::UsersConfigAccessStorage::parseFromConfig(Poco::Util::AbstractConfiguration const&) @ 0x1041a097 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
5. void std::__1::__function::__policy_invoker<void (Poco::AutoPtr<Poco::Util::AbstractConfiguration>, bool)>::__call_impl<std::__1::__function::__default_alloc_func<DB::UsersConfigAccessStorage::load(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<std::__1::shared_ptr<zkutil::ZooKeeper> ()> const&)::$_0, void (Poco::AutoPtr<Poco::Util::AbstractConfiguration>, bool)> >(std::__1::__function::__policy_storage const*, Poco::AutoPtr<Poco::Util::AbstractConfiguration>&&, bool) @ 0x1042e7ff in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
6. DB::ConfigReloader::reloadIfNewer(bool, bool, bool, bool) @ 0x11caf54e in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
7. DB::ConfigReloader::run() @ 0x11cb0f8f in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
8. ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::ConfigReloader::*)(), DB::ConfigReloader*>(void (DB::ConfigReloader::*&&)(), DB::ConfigReloader*&&)::'lambda'()::operator()() @ 0x11cb19f1 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
9. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x9481f5f in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
10. void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda0'()> >(void*) @ 0x9485843 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
11. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
12. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
(version 21.10.1.8002 (official build))
2021.09.29 11:36:07.722213 [ 2090 ] {} <Error> Application: DB::Exception: Couldn't restore Field from dump: 1: while parsing value '1' for setting 'custom_data_version'
To make it work you need to change it in the following way:
cat /etc/clickhouse-server/users.d/default_profile.xml
<?xml version="1.0"?>
<yandex>
<profiles>
<default>
<custom_data_version>UInt64_1</custom_data_version>
</default>
</profiles>
</yandex>
or
cat /etc/clickhouse-server/users.d/default_profile.xml
<?xml version="1.0"?>
<yandex>
<profiles>
<default>
<custom_data_version>'1'</custom_data_version>
</default>
</profiles>
</yandex>
The list of recognized prefixes is in the sources: https://github.com/ClickHouse/ClickHouse/blob/ea13a8b562edbc422c07b5b4ecce353f79b6cb63/src/Core/Field.cpp#L253-L270
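Once the server accepts the config, you can read the custom setting back with the getSetting function (assuming the custom_data_version profile setting shown above):

```sql
SELECT getSetting('custom_data_version');
```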
8 - Description of asynchronous_metrics
CompiledExpressionCacheCount -- number of compiled cached expressions (if CompiledExpressionCache is enabled)
jemalloc -- parameters of the jemalloc allocator; they are not very useful or interesting
MarkCacheBytes / MarkCacheFiles -- the cache for .mrk files (default size is 5GB); you can see whether it uses all 5GB or not
MemoryCode -- how much memory is allocated for the ClickHouse executable
MemoryDataAndStack -- virtual memory allocated for data and stack
MemoryResident -- real memory used by ClickHouse (the same as top RES/RSS)
MemoryShared -- shared memory used by ClickHouse
MemoryVirtual -- virtual memory used by ClickHouse (the same as top VIRT)
NumberOfDatabases
NumberOfTables
ReplicasMaxAbsoluteDelay -- important parameter - replica max absolute delay in seconds
ReplicasMaxRelativeDelay -- replica max relative delay (from other replicas) in seconds
ReplicasMaxInsertsInQueue -- max number of parts to fetch for a single Replicated table
ReplicasSumInsertsInQueue -- sum of parts to fetch for all Replicated tables
ReplicasMaxMergesInQueue -- max number of merges in queue for a single Replicated table
ReplicasSumMergesInQueue -- total number of merges in queue for all Replicated tables
ReplicasMaxQueueSize -- max number of tasks for a single Replicated table
ReplicasSumQueueSize -- total number of tasks in replication queue
UncompressedCacheBytes/UncompressedCacheCells -- allocated memory for uncompressed cache (disabled by default)
Uptime -- uptime seconds
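All of the metrics above can be inspected through the system.asynchronous_metrics table, for example:

```sql
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'Replicas%'
ORDER BY metric;
```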
9 - http handler example
http handler example (how to disable /play)
# cat /etc/clickhouse-server/config.d/play_disable.xml
<?xml version="1.0" ?>
<yandex>
<http_handlers>
<rule>
<url>/play</url>
<methods>GET</methods>
<handler>
<type>static</type>
<status>403</status>
<content_type>text/plain; charset=UTF-8</content_type>
<response_content></response_content>
</handler>
</rule>
<defaults/> <!-- handler to save default handlers ?query / ping -->
</http_handlers>
</yandex>
10 - Logging
Q. I get errors:
File not found: /var/log/clickhouse-server/clickhouse-server.log.0.
File not found: /var/log/clickhouse-server/clickhouse-server.log.8.gz.
...
File not found: /var/log/clickhouse-server/clickhouse-server.err.log.0, Stack trace (when copying this message, always include the lines below):
0. Poco::FileImpl::handleLastErrorImpl(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x11c2b345 in /usr/bin/clickhouse
1. Poco::PurgeOneFileStrategy::purge(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x11c84618 in /usr/bin/clickhouse
2. Poco::FileChannel::log(Poco::Message const&) @ 0x11c314cc in /usr/bin/clickhouse
3. DB::OwnFormattingChannel::logExtended(DB::ExtendedLogMessage const&) @ 0x8681402 in /usr/bin/clickhouse
4. DB::OwnSplitChannel::logSplit(Poco::Message const&) @ 0x8682fa8 in /usr/bin/clickhouse
5. DB::OwnSplitChannel::log(Poco::Message const&) @ 0x8682e41 in /usr/bin/clickhouse
A. Check that you have proper permissions on the log folder, and enough disk space (& free inodes) on the block device used for logging.
ls -la /var/log/clickhouse-server/
df -Th
df -Thi
Q. How to configure logging in clickhouse?
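A. Logging is configured in the <logger> section of the server config. A minimal sketch using a standard config.d override (the keys shown are the usual logger settings; adjust the level and rotation values to taste):

```xml
<clickhouse>
    <logger>
        <!-- trace / debug / information / warning / error -->
        <level>information</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <!-- rotation: max size per file and number of rotated files to keep -->
        <size>1000M</size>
        <count>10</count>
    </logger>
</clickhouse>
```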
11 - Precreate parts using clickhouse-local
rm -rf /tmp/precreate_parts
mkdir -p /tmp/precreate_parts/metadata/local/
cd /tmp/precreate_parts
## 1. Imagine we want to process this file:
cat <<EOF > /tmp/data.csv
1,2020-01-01,"String"
2,2020-02-02,"Another string"
3,2020-03-03,"One more string"
4,2020-01-02,"String for first partition"
EOF
## 2. that is the metadata for the table we want to fill
## schema should match the schema of the table from server
## (the easiest way is just to copy it from the server)
## I've added sleepEachRow(0.5) here just to mimic slow insert
cat <<EOF > metadata/local/test.sql
ATTACH TABLE local.test (id UInt64, d Date, s String, x MATERIALIZED sleepEachRow(0.5)) Engine=MergeTree ORDER BY id PARTITION BY toYYYYMM(d);
EOF
## 3a. that is the metadata for the input file we want to read
## it should match the structure of source file
## use stdin to read from pipe
cat <<EOF > metadata/local/stdin.sql
ATTACH TABLE local.stdin (id UInt64, d Date, s String) Engine=File(CSV, stdin);
EOF
## 3b. Instead of stdin you can use file path
cat <<EOF > metadata/local/data_csv.sql
ATTACH TABLE local.data_csv (id UInt64, d Date, s String) Engine=File(CSV, '/tmp/data.csv');
EOF
## All preparations done,
## the rest is simple:
# option a (if 3a used) with pipe / reading stdin
cat /tmp/data.csv | clickhouse-local --query "INSERT INTO local.test SELECT * FROM local.stdin" -- --path=.
# option b (if 3b used) with file path
cd /tmp/precreate_parts
clickhouse-local --query "INSERT INTO local.test SELECT * FROM local.data_csv" -- --path=.
# now you can check what was inserted (I did both options, so the data is doubled)
clickhouse-local --query "SELECT _part,* FROM local.test ORDER BY id" -- --path=.
202001_4_4_0 1 2020-01-01 String
202001_1_1_0 1 2020-01-01 String
202002_5_5_0 2 2020-02-02 Another string
202002_2_2_0 2 2020-02-02 Another string
202003_6_6_0 3 2020-03-03 One more string
202003_3_3_0 3 2020-03-03 One more string
202001_4_4_0 4 2020-01-02 String for first partition
202001_1_1_0 4 2020-01-02 String for first partition
# But you can't do OPTIMIZE (local will die with coredump) :) That would be too good
# clickhouse-local --query "OPTIMIZE TABLE local.test FINAL" -- --path=.
## now you can upload those parts to a server (in detached subfolder) and attach them.
mfilimonov@laptop5591:/tmp/precreate_parts$ ls -la data/local/test/
total 40
drwxrwxr-x 9 mfilimonov mfilimonov 4096 paź 15 11:15 .
drwxrwxr-x 3 mfilimonov mfilimonov 4096 paź 15 11:15 ..
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202001_1_1_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202001_4_4_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202002_2_2_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202002_5_5_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202003_3_3_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202003_6_6_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 detached
-rw-rw-r-- 1 mfilimonov mfilimonov 1 paź 15 11:15 format_version.txt
mfilimonov@laptop5591:/tmp/precreate_parts$ ls -la data/local/test/202001_1_1_0/
total 44
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 .
drwxrwxr-x 9 mfilimonov mfilimonov 4096 paź 15 11:15 ..
-rw-rw-r-- 1 mfilimonov mfilimonov 250 paź 15 11:15 checksums.txt
-rw-rw-r-- 1 mfilimonov mfilimonov 79 paź 15 11:15 columns.txt
-rw-rw-r-- 1 mfilimonov mfilimonov 1 paź 15 11:15 count.txt
-rw-rw-r-- 1 mfilimonov mfilimonov 155 paź 15 11:15 data.bin
-rw-rw-r-- 1 mfilimonov mfilimonov 144 paź 15 11:15 data.mrk3
-rw-rw-r-- 1 mfilimonov mfilimonov 10 paź 15 11:15 default_compression_codec.txt
-rw-rw-r-- 1 mfilimonov mfilimonov 4 paź 15 11:15 minmax_d.idx
-rw-rw-r-- 1 mfilimonov mfilimonov 4 paź 15 11:15 partition.dat
-rw-rw-r-- 1 mfilimonov mfilimonov 16 paź 15 11:15 primary.idx
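The parts created above can then be attached on the server. A sketch (the server path and the target table default.test are assumptions; use the names matching your setup):

```sql
-- 1. copy the part directories into the target table's detached/ folder, e.g.:
--    scp -r data/local/test/202001_1_1_0 server:/var/lib/clickhouse/data/default/test/detached/
-- 2. then attach each part:
ALTER TABLE default.test ATTACH PART '202001_1_1_0';
```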
12 - Access Control and Account Management example (RBAC)
Documentation https://clickhouse.com/docs/en/operations/access-rights/
Example: 3 roles (dba, dashboard_ro, ingester_rw)
You need to create an .xml file on each node to allow the default user
to manage access using SQL.
cat /etc/clickhouse-server/users.d/access_management.xml
<?xml version="1.0"?>
<yandex>
<users>
<default>
<access_management>1</access_management>
</default>
</users>
</yandex>
create role dba on cluster '{cluster}';
grant all on *.* to dba on cluster '{cluster}';
create user `user1` identified by 'pass1234' on cluster '{cluster}';
grant dba to user1 on cluster '{cluster}';
create role dashboard_ro on cluster '{cluster}';
grant select on default.* to dashboard_ro on cluster '{cluster}';
grant dictGet on *.* to dashboard_ro on cluster '{cluster}';
create settings profile or replace profile_dashboard_ro on cluster '{cluster}'
settings max_concurrent_queries_for_user = 10 READONLY,
max_threads = 16 READONLY,
max_memory_usage_for_user = '30G' READONLY,
max_memory_usage = '30G' READONLY,
max_execution_time = 60 READONLY,
max_rows_to_read = 1000000000 READONLY,
max_bytes_to_read = '5000G' READONLY
TO dashboard_ro;
create user `dash1` identified by 'pass1234' on cluster '{cluster}';
grant dashboard_ro to dash1 on cluster '{cluster}';
create role ingester_rw on cluster '{cluster}';
grant select,insert on default.* to ingester_rw on cluster '{cluster}';
create settings profile or replace profile_ingester_rw on cluster '{cluster}'
settings max_concurrent_queries_for_user = 40 READONLY, -- the user can run 40 queries (select, insert ...) simultaneously
max_threads = 10 READONLY, -- each query can use up to 10 CPU threads (READONLY means the user cannot override the value)
max_memory_usage_for_user = '30G' READONLY, -- all queries of the user together can use up to 30G RAM
max_memory_usage = '25G' READONLY, -- each query can use up to 25G RAM
max_execution_time = 200 READONLY, -- each query can execute no longer than 200 seconds
max_rows_to_read = 1000000000 READONLY, -- each query can read up to 1 billion rows
max_bytes_to_read = '5000G' READONLY -- each query can read up to 5 TB from a MergeTree
TO ingester_rw;
create user `ingester_app1` identified by 'pass1234' on cluster '{cluster}';
grant ingester_rw to ingester_app1 on cluster '{cluster}';
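To double-check what each user ended up with, you can inspect the grants (sketch):

```sql
show grants for user1;
show grants for dash1;
show grants for ingester_app1;
```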
check
$ clickhouse-client -u dash1 --password pass1234
create table test ( A Int64) Engine=Log;
DB::Exception: dash1: Not enough privileges
$ clickhouse-client -u user1 --password pass1234
create table test ( A Int64) Engine=Log;
Ok.
drop table test;
Ok.
$ clickhouse-client -u ingester_app1 --password pass1234
select count() from system.numbers limit 1000000000000;
DB::Exception: Received from localhost:9000. DB::Exception: Limit for rows or bytes to read exceeded, max rows: 1.00 billion
clean up
show profiles;
┌─name─────────────────┐
│ default │
│ profile_dashboard_ro │
│ profile_ingester_rw │
│ readonly │
└──────────────────────┘
drop profile if exists readonly on cluster '{cluster}';
drop profile if exists profile_dashboard_ro on cluster '{cluster}';
drop profile if exists profile_ingester_rw on cluster '{cluster}';
show roles;
┌─name─────────┐
│ dashboard_ro │
│ dba │
│ ingester_rw │
└──────────────┘
drop role if exists dba on cluster '{cluster}';
drop role if exists dashboard_ro on cluster '{cluster}';
drop role if exists ingester_rw on cluster '{cluster}';
show users;
┌─name──────────┐
│ dash1 │
│ default │
│ ingester_app1 │
│ user1 │
└───────────────┘
drop user if exists ingester_app1 on cluster '{cluster}';
drop user if exists user1 on cluster '{cluster}';
drop user if exists dash1 on cluster '{cluster}';
13 - recovery-after-complete-data-loss
Atomic & Ordinary databases.
srv1 – good replica
srv2 – lost replica / we will restore it from srv1
test data (3 tables, in Atomic, Ordinary, and default databases)
srv1
create database testatomic on cluster '{cluster}' engine=Atomic;
create table testatomic.test on cluster '{cluster}' (A Int64, D Date, s String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
partition by toYYYYMM(D)
order by A;
insert into testatomic.test select number, today(), '' from numbers(1000000);
create database testordinary on cluster '{cluster}' engine=Ordinary;
create table testordinary.test on cluster '{cluster}' (A Int64, D Date, s String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
partition by toYYYYMM(D)
order by A;
insert into testordinary.test select number, today(), '' from numbers(1000000);
create table default.test on cluster '{cluster}' (A Int64, D Date, s String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
partition by toYYYYMM(D)
order by A;
insert into default.test select number, today(), '' from numbers(1000000);
destroy srv2
srv2
/etc/init.d/clickhouse-server stop
rm -rf /var/lib/clickhouse/*
generate script to re-create databases (create_database.sql).
srv1
$ cat /home/ubuntu/generate_schema.sql
SELECT concat('CREATE DATABASE "', name, '" ENGINE = ', engine, ' COMMENT \'', comment, '\';')
FROM system.databases
WHERE name NOT IN ('INFORMATION_SCHEMA', 'information_schema', 'system', 'default');
clickhouse-client < /home/ubuntu/generate_schema.sql > create_database.sql
check the result
$ cat create_database.sql
CREATE DATABASE "testatomic" ENGINE = Atomic COMMENT '';
CREATE DATABASE "testordinary" ENGINE = Ordinary COMMENT '';
transfer this create_database.sql to srv2 (scp / rsync)
make a copy of schema sql files (metadata_schema.tar)
srv1
cd /var/lib/clickhouse/
tar -cvhf /home/ubuntu/metadata_schema.tar metadata
-h is important! (-h, --dereference: follow symlinks; archive and dump the files they point to.)
transfer this metadata_schema.tar to srv2 (scp / rsync)
create databases at srv2
srv2
/etc/init.d/clickhouse-server start
clickhouse-client < create_database.sql
/etc/init.d/clickhouse-server stop
create tables at srv2
srv2
cd /var/lib/clickhouse/
tar xkfv /home/ubuntu/metadata_schema.tar
sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data
/etc/init.d/clickhouse-server start
In tar xkfv the -k is important! It preserves the folders/symlinks created by CREATE DATABASE (-k, --keep-old-files: don't replace existing files when extracting.)
check a recovery
srv2
SELECT count() FROM testatomic.test;
┌─count()─┐
│ 1000000 │
└─────────┘
SELECT count() FROM testordinary.test;
┌─count()─┐
│ 1000000 │
└─────────┘
SELECT count() FROM default.test;
┌─count()─┐
│ 1000000 │
└─────────┘
14 - source parts size is greater than the current maximum
Symptom
I see messages like: source parts size (...) is greater than the current maximum (...)
in the logs and/or inside system.replication_queue
Cause
Usually that means that there are already a few big merges running. You can see the running merges using the query:
SELECT * FROM system.merges
That logic is needed to prevent picking a lot of huge merges simultaneously (otherwise they would take all available slots and ClickHouse would not be able to do smaller merges, which are usually important for keeping the number of parts stable).
Action
It is normal to see those messages on some stale replicas, and it should be resolved automatically after some time. So just wait & monitor the system.merges & system.replication_queue tables - it should resolve on its own.
If it happens often, or doesn't resolve on its own over a longer period of time, it could be caused by:
- increased insert pressure
- disk issues / high load (it works slowly, not enough space etc.)
- high CPU load (not enough CPU power to catch up with merges)
- issue with table schemas leading to high merges pressure (high / increased number of tables / partitions / etc.)
Start from checking dmesg / system journals / clickhouse monitoring to find the anomalies.
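To see which replication queue entries are postponed for this reason, you can filter system.replication_queue on postpone_reason (a sketch):

```sql
SELECT database, table, type, num_postponed, postpone_reason
FROM system.replication_queue
WHERE postpone_reason LIKE '%greater than the current maximum%'
ORDER BY num_postponed DESC;
```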
15 - Successful ClickHouse deployment plan
Successful ClickHouse deployment plan
Stage 0. Build POC
- Install single node clickhouse
- Start with creating a single table (the biggest one), use MergeTree engine. Create ‘some’ schema (most probably it will be far from optimal). Prefer denormalized approach for all immutable dimensions, for mutable dimensions - consider dictionaries.
- Load some amount of data (at least 5 Gb, and 10 mln rows) - preferably the real data, or as close to real as possible. Usually the simplest options are either CSV / TSV files, or
insert into clickhouse_table select * FROM mysql(...) where ...
- Create several representative queries.
- Check the columns cardinality, and appropriate types, use minimal needed type
- Review the partition by and order by. https://kb.altinity.com/engines/mergetree-table-engine-family/pick-keys/
- Create the schema(s) with better/promising order by / partitioning, load data in. Pick the best schema.
- consider different improvements of particular columns (codecs / better data types etc.) https://kb.altinity.com/altinity-kb-schema-design/codecs/altinity-kb-how-to-test-different-compression-codecs/
- If the performance of certain queries is not enough - consider using PREWHERE / skipping indexes
- Repeat 2-9 for the next big table(s). Avoid scenarios when you need to join big tables.
- Pick the client library for your programming language (the most mature are python / golang / java / c++), build some pipeline - for inserts (low QPS, a lot of rows in a single insert, check acknowledgements & retry the same block on failures), ETLs if needed, some reporting layer (https://kb.altinity.com/altinity-kb-integrations/bi-tools/)
Stage 1. Planning the production setup
- Collect more data / estimate insert speed, estimate the column sizes per day / month.
- Measure the speed of queries
- Consider improvement using materialized views / projections / dictionaries.
- Collect requirements (ha / number of simultaneous queries / insert pressure / ’exactly once’ etc)
- Do a cluster sizing estimation, plan the hardware
- plan the network, if needed - consider using LoadBalancers etc.
- If you need sharding - consider different sharding approaches.
Stage 2. Preprod setup & development
- Install clickhouse in cluster - several nodes / VMs + zookeeper
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/cluster-production-configuration-guide/cluster-configuration-process/
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/altinity-kb-proper-setup/
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/install_ubuntu/
- Create good config & automate config / os / restarts (ansible / puppet etc)
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-settings-to-adjust/
- for docker: https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-clickhouse-in-docker/
- for k8s, use clickhouse-operator OR https://kb.altinity.com/altinity-kb-kubernetes/altinity-kb-possible-issues-with-running-clickhouse-in-k8s/
- Set up monitoring / log processing / alerts etc.
- Set up users.
- Think of schema management. Deploy the schema.
- Design backup / failover strategies:
- Develop pipelines / queries, create test suite, CI/CD
- Do benchmark / stress tests
- Test configuration changes / server restarts / failovers / version upgrades
- Review the security topics (tls, limits / restrictions, network, passwords)
- Document the solution for operations
Stage 3. Production setup
- Deploy the production setup (consider also canary / blue-green deployments etc)
- Schedule ClickHouse upgrades every 6 to 12 months (if possible)
16 - Timeouts during OPTIMIZE FINAL
Timeout exceeded ... or executing longer than distributed_ddl_task_timeout during OPTIMIZE FINAL
Timeout may occur:
- due to the fact that the client reaches its timeout interval
  - in case of TCP / native clients - you can change send_timeout / receive_timeout + tcp_keep_alive_timeout + driver timeout settings
  - in case of HTTP clients - you can change http_send_timeout / http_receive_timeout + tcp_keep_alive_timeout + driver timeout settings
- (in the case of ON CLUSTER queries) due to the fact that the timeout for query execution by shards ends - see the setting distributed_ddl_task_timeout
In the first case you additionally may get the misleading messages: Cancelling query. ... Query was cancelled.
In both cases, this does NOT stop the execution of the OPTIMIZE command. It continues to work even after the client is disconnected. You can see the progress of that in system.processes / SHOW PROCESSLIST / system.merges / system.query_log.
The same applies to queries like:
INSERT ... SELECT
CREATE TABLE ... AS SELECT
CREATE MATERIALIZED VIEW ... POPULATE ...
It is possible to run a query with some special query_id
and then poll the status from the processlist (in the case of a cluster, it can be a bit more complicated).
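For example (the query_id value is an arbitrary label and the table name is hypothetical):

```sql
-- start the long-running command with a known query_id:
--   clickhouse-client --query_id='optimize-test-1' --query='OPTIMIZE TABLE default.test FINAL'

-- poll its status from another session:
SELECT query_id, elapsed, memory_usage
FROM system.processes
WHERE query_id = 'optimize-test-1';

-- on a cluster you can check all replicas:
-- SELECT hostName(), query_id, elapsed
-- FROM clusterAllReplicas('{cluster}', system.processes)
-- WHERE query_id = 'optimize-test-1';
```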
See also
17 - Useful settings to turn on/Defaults that should be reconsidered
Useful settings to turn on/Defaults that should be reconsidered
Some settings that are not enabled by default.
Enables or disables complete dropping of data parts where all rows are expired in MergeTree tables.
When ttl_only_drop_parts is disabled (by default), the ClickHouse server only deletes expired rows according to their TTL.
When ttl_only_drop_parts is enabled, the ClickHouse server drops a whole part when all rows in it are expired.
Dropping whole parts instead of partial cleaning TTL-d rows allows having shorter merge_with_ttl_timeout times and lower impact on system performance.
join_use_nulls
You might not expect that missing columns are filled with default values (instead of the classic NULLs) during a JOIN.
Sets the type of JOIN behaviour. When merging tables, empty cells may appear. ClickHouse fills them differently based on this setting.
Possible values:
0 — The empty cells are filled with the default value of the corresponding field type. 1 — JOIN behaves the same way as in standard SQL. The type of the corresponding field is converted to Nullable, and empty cells are filled with NULL.
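A small illustration (the right side has no matching row, so t2.y has no value; with join_use_nulls = 0 it is filled with the default 0, with join_use_nulls = 1 it becomes NULL):

```sql
set join_use_nulls = 1;
select t1.x, t2.y
from (select 1 as x) t1
left join (select 2 as x, 3 as y) t2 using (x);
```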
Default behaviour is not compatible with ANSI SQL (ClickHouse avoids Nullable types for performance reasons)
select sum(x), avg(x) from (select 1 x where 0);
┌─sum(x)─┬─avg(x)─┐
│ 0 │ nan │
└────────┴────────┘
set aggregate_functions_null_for_empty=1;
select sum(x), avg(x) from (select 1 x where 0);
┌─sumOrNull(x)─┬─avgOrNull(x)─┐
│ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
└──────────────┴──────────────┘
18 - ZooKeeper session has expired
Q. I get “ZooKeeper session has expired” once. What should I do? Should I worry?
Getting exceptions or a lack of acknowledgement in a distributed system from time to time is a normal situation. Your client should do the retry. If that happened once and your client does retries correctly - nothing to worry about.
If it happens often, or with every retry - it may be a sign of some misconfiguration / issue in the cluster (see below).
Q. we see a lot of these: ZooKeeper session has expired. Switching to a new session
A. There is a single zookeeper session per server. But there are many threads that can use zookeeper simultaneously. So the same event (we lose the single zookeeper session we had), will be reported by all the threads/queries which were using that zookeeper session.
Usually after losing the zookeeper session that exception is printed by all the threads which watch zookeeper replication queues, and all the threads which had some in-flight zookeeper operations (for example inserts, ON CLUSTER commands etc).
If you see a lot of those simultaneously - that just means you have a lot of threads talking to zookeeper simultaneously (or maybe you have many replicated tables?).
BTW: every Replicated table comes with its own cost, so you can’t scale the number of replicated tables indefinitely.
Typically after several hundreds (sometimes thousands) of replicated tables, the clickhouse server becomes unusable: it can’t do any other work, but only keeping replication housekeeping tasks. ‘ClickHouse-way’ is to have a few (maybe dozens) of very huge tables instead of having thousands of tiny tables. (Side note: the number of not-replicated tables can be scaled much better).
So again if during short period of time you see lot of those exceptions and that don’t happen anymore for a while - nothing to worry about. Just ensure your client is doing retries properly.
Q. We are wondering what is causing that session to “timeout” as the default looks like 30 seconds, and there’s certainly stuff happening much more frequently than every 30 seconds.
Typically that has nothing to do with an expiration/timeout - even if you do nothing there are heartbeat events in the zookeeper protocol.
So internally inside clickhouse:
- we have a ‘zookeeper client’ which in practice is a single zookeeper connection (TCP socket), with 2 threads - one serving reads, the second serving writes, and some API around.
- while everything is ok zookeeper client keeps a single logical ‘zookeeper session’ (also by sending heartbeats etc).
- we may have hundreds of ‘users’ of that zookeeper client - those are threads that do some housekeeping, serve queries etc.
- the zookeeper client normally has dozens of ‘in-flight’ requests (asked by different threads). If something bad happens (disconnect, some issue with the zookeeper server, some other failure), the zookeeper client needs to re-establish the connection and switch to a new session, so all those ‘in-flight’ requests are terminated with a ‘session expired’ exception.
Q. That problem happens very often (all the time, every X minutes / hours / days).
Sometimes the real issue can be visible somewhere close to the first ‘session expired’ exception in the log. (i.e. zookeeper client thread can know & print to logs the real reason, while all ‘user’ threads just get ‘session expired’).
Also, zookeeper logs may often have a clue as to what the real problem was.
Known issues which can lead to session termination by zookeeper:
- connectivity / network issues.
- jute.maxbuffer overrun. Happens if you need to pass too much data in a single zookeeper transaction (often when you do ALTER TABLE ... UPDATE or another mutation on a table with a big number of parts). The fix is adjusting the JVM setting: -Djute.maxbuffer=8388608. See https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/jvm-sizes-and-garbage-collector-settings/
- XID overflow. XID is a transaction counter in zookeeper; if you do too many transactions the counter reaches maxint32, and to restart the counter zookeeper closes all the connections. Usually that happens rarely, and is not avoidable in zookeeper (in clickhouse-keeper that problem is solved). There are some corner cases / schemas which may end up with XID overflow happening quite often (the worst case we saw was once per 3 weeks).
Q. “ZooKeeper session has expired” happens every time I try to start the mutation / do other ALTER on Replicated table.
During ALTERing a replicated table, ClickHouse needs to create a record in zookeeper listing all the parts which should be mutated (that usually means listing the names of all parts of the table). If the size of that list exceeds the maximum buffer size - zookeeper drops the connection.
Part name length can differ between tables. On average, with the default jute.maxbuffer (1Mb), mutations start to fail for tables which have more than 5000 parts.
Solutions:
- rethink partitioning; a high number of parts in a table is usually not recommended
- increase jute.maxbuffer on the zookeeper side to values around 8M
- use the IN PARTITION clause for mutations (where applicable) - since 20.12
- switch to clickhouse-keeper
Related issues:
19 - Altinity packaging compatibility >21.x and earlier
Working with Altinity & Yandex packaging together
Since version 21.1 Altinity switched to the same packaging as used by Yandex. That was needed to keep things in sync and introduced several improvements (like adding a systemd service file).
Unfortunately, that change leads to compatibility issues - automatic dependencies resolution gets confused by the conflicting package names: both when you update ClickHouse to the new version (the one which uses older packaging) and when you want to install older altinity packages (20.8 and older).
Installing old clickhouse version (with old packaging schema)
When you try to install versions 20.8 or older from Altinity repo -
version=20.8.12.2-1.el7
yum install clickhouse-client-${version} clickhouse-server-${version}
yum outputs something like
yum install clickhouse-client-${version} clickhouse-server-${version}
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
* base: centos.hitme.net.pl
* extras: centos1.hti.pl
* updates: centos1.hti.pl
Altinity_clickhouse-altinity-stable/x86_64/signature | 833 B 00:00:00
Altinity_clickhouse-altinity-stable/x86_64/signature | 1.0 kB 00:00:01 !!!
Altinity_clickhouse-altinity-stable-source/signature | 833 B 00:00:00
Altinity_clickhouse-altinity-stable-source/signature | 951 B 00:00:00 !!!
Resolving Dependencies
--> Running transaction check
---> Package clickhouse-client.x86_64 0:20.8.12.2-1.el7 will be installed
---> Package clickhouse-server.x86_64 0:20.8.12.2-1.el7 will be installed
--> Processing Dependency: clickhouse-server-common = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
Package clickhouse-server-common is obsoleted by clickhouse-server, but obsoleting package does not provide for requirements
--> Processing Dependency: clickhouse-common-static = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
--> Running transaction check
---> Package clickhouse-common-static.x86_64 0:20.8.12.2-1.el7 will be installed
---> Package clickhouse-server.x86_64 0:20.8.12.2-1.el7 will be installed
--> Processing Dependency: clickhouse-server-common = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
Package clickhouse-server-common is obsoleted by clickhouse-server, but obsoleting package does not provide for requirements
--> Finished Dependency Resolution
Error: Package: clickhouse-server-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
Requires: clickhouse-server-common = 20.8.12.2-1.el7
Available: clickhouse-server-common-1.1.54370-2.x86_64 (clickhouse-stable)
clickhouse-server-common = 1.1.54370-2
Available: clickhouse-server-common-1.1.54378-2.x86_64 (clickhouse-stable)
clickhouse-server-common = 1.1.54378-2
...
Available: clickhouse-server-common-20.8.11.17-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
clickhouse-server-common = 20.8.11.17-1.el7
Available: clickhouse-server-common-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
clickhouse-server-common = 20.8.12.2-1.el7
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
As you can see, yum has an issue resolving the clickhouse-server-common
dependency, which is marked as obsoleted by newer packages.
Solution with Old Packaging Scheme
add --setopt=obsoletes=0
flag to the yum call.
version=20.8.12.2-1.el7
yum install --setopt=obsoletes=0 clickhouse-client-${version} clickhouse-server-${version}
Alternatively, you can add obsoletes=0
into /etc/yum.conf
.
To update to new ClickHouse version (from old packaging schema to new packaging schema)
version=21.1.7.1-2
yum install clickhouse-client-${version} clickhouse-server-${version}
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
* base: centos.hitme.net.pl
* extras: centos1.hti.pl
* updates: centos1.hti.pl
Altinity_clickhouse-altinity-stable/x86_64/signature | 833 B 00:00:00
Altinity_clickhouse-altinity-stable/x86_64/signature | 1.0 kB 00:00:01 !!!
Altinity_clickhouse-altinity-stable-source/signature | 833 B 00:00:00
Altinity_clickhouse-altinity-stable-source/signature | 951 B 00:00:00 !!!
Nothing to do
It is caused by wrong dependency resolution.
Solution with New Package Scheme
To update to the latest available version - just add clickhouse-server-common
:
yum install clickhouse-client clickhouse-server clickhouse-server-common
This way the latest available version will be installed (even if you request some other version explicitly).
To install some specific version remove old packages first, then install new ones.
yum erase clickhouse-client clickhouse-server clickhouse-server-common clickhouse-common-static
version=21.1.7.1
yum install clickhouse-client-${version} clickhouse-server-${version}
Downgrade from new version to old one
version=20.8.12.2-1.el7
yum downgrade clickhouse-client-${version} clickhouse-server-${version}
will not work:
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
* base: ftp.agh.edu.pl
* extras: ftp.agh.edu.pl
* updates: centos.wielun.net
Resolving Dependencies
--> Running transaction check
---> Package clickhouse-client.x86_64 0:20.8.12.2-1.el7 will be a downgrade
---> Package clickhouse-client.noarch 0:21.1.7.1-2 will be erased
---> Package clickhouse-server.x86_64 0:20.8.12.2-1.el7 will be a downgrade
--> Processing Dependency: clickhouse-server-common = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
Package clickhouse-server-common-20.8.12.2-1.el7.x86_64 is obsoleted by clickhouse-server-21.1.7.1-2.noarch which is already installed
--> Processing Dependency: clickhouse-common-static = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
---> Package clickhouse-server.noarch 0:21.1.7.1-2 will be erased
--> Finished Dependency Resolution
Error: Package: clickhouse-server-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
Requires: clickhouse-common-static = 20.8.12.2-1.el7
Installed: clickhouse-common-static-21.1.7.1-2.x86_64 (@clickhouse-stable)
clickhouse-common-static = 21.1.7.1-2
Available: clickhouse-common-static-1.1.54378-2.x86_64 (clickhouse-stable)
clickhouse-common-static = 1.1.54378-2
Error: Package: clickhouse-server-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
...
Available: clickhouse-server-common-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
clickhouse-server-common = 20.8.12.2-1.el7
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
Solution With Downgrading
Remove packages first, then install older versions:
yum erase clickhouse-client clickhouse-server clickhouse-server-common clickhouse-common-static
version=20.8.12.2-1.el7
yum install --setopt=obsoletes=0 clickhouse-client-${version} clickhouse-server-${version}
20 - AWS EBS
| Volume type | gp3 | gp2 |
|---|---|---|
| Max throughput per volume | 1000 MiB/s | 250 MiB/s |
| Price | $0.08/GB-month; 3,000 IOPS free and $0.005/provisioned IOPS-month over 3,000; 125 MB/s free and $0.04/provisioned MB/s-month over 125 | $0.10/GB-month |
GP2
In usual conditions ClickHouse is limited by the throughput of volumes; beyond a certain number, the amount of provisioned IOPS doesn't make any big difference for performance. So the most natural choices for ClickHouse are gp2 and gp3 volumes.
Because gp2 volumes have a hard limit of 250 MiB/s per volume (for volumes bigger than 334 GB), it usually makes sense to split one big volume into multiple smaller volumes of at least 334 GB each in order to get the maximum possible total throughput.
EC2 instances also have an EBS throughput limit, it depends on the size of the EC2 instance. That means if you would attach multiple volumes which would have high potential throughput, you would be limited by your EC2 instance, so usually there is no reason to have more than 4-5 volumes per node.
It’s pretty straightforward to set up a ClickHouse for using multiple EBS volumes with storage_policies.
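A minimal storage configuration sketch for two EBS volumes (the disk paths and policy name are illustrative):

```xml
<yandex>
    <storage_configuration>
        <disks>
            <ebs1><path>/mnt/ebs1/clickhouse/</path></ebs1>
            <ebs2><path>/mnt/ebs2/clickhouse/</path></ebs2>
        </disks>
        <policies>
            <ebs_jbod>
                <volumes>
                    <main>
                        <disk>ebs1</disk>
                        <disk>ebs2</disk>
                    </main>
                </volumes>
            </ebs_jbod>
        </policies>
    </storage_configuration>
</yandex>
```

Tables then reference it via CREATE TABLE ... SETTINGS storage_policy = 'ebs_jbod'.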
GP3
It’s a new type of volume, which is 20% cheaper than gp2 per GB-month and has lower free throughput: only 125 MB/s vs 250 MB/s. But you can buy additional throughput for a volume, and gp3 pricing becomes comparable with multiple gp2 volumes starting from 1000-1500 GB sizes. It also works better if most of your queries read only one or several parts, because in that case you are not limited by the performance of a single EBS disk, as a part can be located on only one disk at a time.
For best performance, it’s suggested to buy:
- 7000 IOPS
- Throughput up to the limit of your EC2 instance
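The gp2 vs gp3 cost comparison can be sketched from the list prices in the table above (a simplified model: gp2 has no purchasable throughput, and extra gp3 throughput/IOPS are billed at the quoted rates):

```python
def gp2_monthly_cost(size_gb: float) -> float:
    """gp2: flat $0.10 per GB-month."""
    return 0.10 * size_gb

def gp3_monthly_cost(size_gb: float, throughput_mbps: float = 125, iops: int = 3000) -> float:
    """gp3: $0.08/GB-month, plus paid throughput over 125 MB/s and paid IOPS over 3000."""
    return (0.08 * size_gb
            + max(0.0, throughput_mbps - 125) * 0.04
            + max(0, iops - 3000) * 0.005)

# a 1500 GB gp3 volume bumped to gp2-like 250 MB/s is still cheaper than gp2:
print(gp3_monthly_cost(1500, throughput_mbps=250))  # 125.0
print(gp2_monthly_cost(1500))                       # 150.0
```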
https://altinity.com/blog/2019/11/27/amplifying-clickhouse-capacity-with-multi-volume-storage-part-1
https://altinity.com/blog/2019/11/29/amplifying-clickhouse-capacity-with-multi-volume-storage-part-2
https://calculator.aws/#/createCalculator/EBS?nc2=h_ql_pr_calc
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html
21 - ClickHouse in Docker
Do you have documentation on Docker deployments?
Check
- https://hub.docker.com/r/yandex/clickhouse-server/
- https://docs.altinity.com/clickhouseonkubernetes/
- sources of entry point - https://github.com/ClickHouse/ClickHouse/blob/master/docker/server/entrypoint.sh
Important things:
- Use a concrete version tag (avoid using latest).
- If possible, use --network=host (for performance reasons).
- You need to mount the folder /var/lib/clickhouse to have persistency.
- You MAY also mount the folder /var/log/clickhouse-server to have logs accessible outside of the container.
- Also, you may mount in some files or folders in the configuration folder, e.g. /etc/clickhouse-server/config.d/listen_ports.xml
- Use --ulimit nofile=262144:262144
- You can also turn on some Linux capabilities to enable extra ClickHouse features (not obligatory): SYS_PTRACE NET_ADMIN IPC_LOCK SYS_NICE
- You may also mount in the folder /docker-entrypoint-initdb.d/ - all SQL or bash scripts there will be executed during container startup.
- If you use cgroup limits, ClickHouse may misbehave: https://github.com/ClickHouse/ClickHouse/issues/2261 (set <max_server_memory_usage> manually)
- There are several ENV switches, see: https://github.com/ClickHouse/ClickHouse/blob/master/docker/server/entrypoint.sh
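As an illustration of the config.d mount mentioned above, a minimal listen_ports.xml override might look like this (the port values are hypothetical; http_port and tcp_port are the standard ClickHouse settings, and overriding them is handy with --network=host when the default ports are already taken on the host):

```xml
<yandex>
    <!-- hypothetical non-default ports to avoid clashing with another server -->
    <http_port>8124</http_port>
    <tcp_port>9001</tcp_port>
</yandex>
```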
TLDR version: use it as a starting point:
docker run -d \
--name some-clickhouse-server \
--ulimit nofile=262144:262144 \
--volume=$(pwd)/data:/var/lib/clickhouse \
--volume=$(pwd)/logs:/var/log/clickhouse-server \
--volume=$(pwd)/configs/memory_adjustment.xml:/etc/clickhouse-server/config.d/memory_adjustment.xml \
--cap-add=SYS_NICE \
--cap-add=NET_ADMIN \
--cap-add=IPC_LOCK \
--cap-add=SYS_PTRACE \
--network=host \
yandex/clickhouse-server:21.1.7
docker exec -it some-clickhouse-server clickhouse-client
docker exec -it some-clickhouse-server bash
22 - ClickHouse Monitoring
Monitoring helps to track potential issues in your cluster before they cause a critical error.
What to read / watch on subject:
- Altinity webinar “ClickHouse Monitoring 101: What to monitor and how”. recording, slides
- docs https://clickhouse.com/docs/en/operations/monitoring/
What should be monitored
The following metrics should be collected / monitored
- For the host machine:
  - CPU
  - Memory
  - Network (bytes/packets)
  - Storage (iops)
  - Disk Space (free / used)
- For ClickHouse:
  - Connections (count)
  - RWLocks
  - Read / Write / Return (bytes)
  - Read / Write / Return (rows)
  - Zookeeper operations (count)
  - Absolute delay
  - Query duration (optional)
  - Replication parts and queue (count)
- For Zookeeper:
Monitoring tools
Prometheus (embedded exporter) + Grafana
- Enable embedded exporter
- Grafana dashboards https://grafana.com/grafana/dashboards/14192 or https://grafana.com/grafana/dashboards/13500
clickhouse-operator embedded exporter
- exporter is included in clickhouse-operator, and enabled automatically
- see the instructions for installing Prometheus and Grafana (if you don’t have them already)
- Grafana dashboard https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard
- Prometheus alerts https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml
Prometheus exporter (external) + Grafana
(unmaintained)
Dashboards querying ClickHouse directly via the Vertamedia / Altinity plugin
- Overview: https://grafana.com/grafana/dashboards/13606
- Queries dashboard (analyzing system.query_log) https://grafana.com/grafana/dashboards/2515
Dashboard querying ClickHouse directly via the Grafana plugin
Zabbix
- https://www.zabbix.com/integrations/clickhouse
- https://github.com/Altinity/clickhouse-zabbix-template
Graphite
- Use the embedded exporter. See docs and config.xml
InfluxDB
- You can use embedded exporter, plus Telegraf. For more information, see Graphite protocol support in InfluxDB.
Nagios/Icinga
Commercial solution
- Datadog https://docs.datadoghq.com/integrations/clickhouse/?tab=host
- Sematext https://sematext.com/docs/integration/clickhouse/
- Instana https://www.instana.com/supported-technologies/clickhouse-monitoring/
- site24x7 https://www.site24x7.com/plugins/clickhouse-monitoring.html
- Acceldata Pulse https://www.acceldata.io/blog/acceldata-pulse-for-clickhouse-monitoring
“Build your own” monitoring
ClickHouse allows access to a lot of internals via system tables. The main tables for monitoring data are:
- system.metrics
- system.asynchronous_metrics
- system.events
Minimum necessary set of checks
Check Name | Shell or SQL command | Severity
---|---|---
ClickHouse status | $ curl 'http://localhost:8123/' | Critical
Too many simultaneous queries. Maximum: 100 (by default) | select value from system.metrics | Critical
Replication status | $ curl 'http://localhost:8123/replicas_status' | High
Read only replicas (reflected by replicas_status as well) | select value from system.metrics | High
Some replication tasks are stuck | select count() | High
ZooKeeper is available | select count() from system.zookeeper | Critical for writes
ZooKeeper exceptions | select value from system.events | Medium
Other CH nodes are available | $ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done \| sort -u | High
All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries) | for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done | Critical
There are files in 'detached' folders (19.8+) | $ find /var/lib/clickhouse/data/*/*/detached/* -type d \| wc -l | Medium
Too many parts: number of parts is growing; inserts are being delayed; inserts are being rejected | select value from system.asynchronous_metrics | Critical
Dictionaries: exception | select concat(name,': ',last_exception) | Medium
ClickHouse has been restarted | select uptime(); |
DistributedFilesToInsert should not be always increasing | select value from system.metrics | Medium
A data part was lost | select value from system.events | High
Data parts are not the same on different replicas | select value from system.events where event='DataAfterMergeDiffersFromReplica'; select value from system.events where event='DataAfterMutationDiffersFromReplica' | Medium
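As a sketch of how the first check above might be wired into a script: the endpoint and the expected "Ok." body match the default ClickHouse HTTP interface, and the curl call is left as a comment so the classification logic can be shown standalone.

```shell
# Classify the response of the ClickHouse status endpoint.
# In real use: resp=$(curl -s 'http://localhost:8123/')
check_status() {
  resp="$1"
  if [ "$resp" = "Ok." ]; then
    echo "OK"
  else
    echo "CRITICAL"
  fi
}

check_status "Ok."   # a healthy server answers "Ok."
check_status ""      # empty answer / connection refused
```

The same pattern extends to the SQL checks: run them via clickhouse-client or the HTTP interface and compare the returned value against a threshold.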
The following queries are recommended to be included in monitoring:
SELECT * FROM system.replicas
- For more information, see the ClickHouse guide on System Tables
SELECT * FROM system.merges
- Checks on the speed and progress of currently executed merges.
SELECT * FROM system.mutations
- This is the source of information on the speed and progress of currently executed mutations.
Logs monitoring
ClickHouse logs can be another important source of information. There are 2 logs enabled by default:
- /var/log/clickhouse-server/clickhouse-server.err.log (errors & warnings; you may want to keep an eye on it or ship it to a monitoring system)
- /var/log/clickhouse-server/clickhouse-server.log (trace logs; very detailed, useful for debugging, but usually too verbose to monitor)
You can additionally enable the system.text_log table to get access to the logs from ClickHouse SQL queries (make sure you don’t expose information to users who should not see it).
$ cat /etc/clickhouse-server/config.d/text_log.xml
<yandex>
<text_log>
<database>system</database>
<table>text_log</table>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<level>warning</level>
</text_log>
</yandex>
OpenTelemetry support
See https://clickhouse.com/docs/en/operations/opentelemetry/
Other sources
- https://tech.marksblogg.com/clickhouse-prometheus-grafana.html
- Key Metrics for Monitoring ClickHouse
- ClickHouse Monitoring Key Metrics to Monitor
- ClickHouse Monitoring Tools: Five Tools to Consider
- Monitoring ClickHouse
- Monitor ClickHouse with Datadog
- Unsorted notes on monitor and Alerts
- https://intl.cloud.tencent.com/document/product/1026/36887
- https://chowdera.com/2021/03/20210301161806704Y.html
- https://chowdera.com/2021/03/20210301160252465m.html#
23 - ClickHouse versions
ClickHouse versioning schema
Example:
21.3.10.1-lts
- 21 is the year of release.
- 3 indicates a Feature Release. This is an increment where features are delivered.
- 10 is the bugfix / maintenance version. When that version is incremented, it means that some bugs were fixed compared to 21.3.9.
- 1 - the build number; it means nothing for end users.
- lts - the type of release (long-term support).
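The schema above can be demonstrated by splitting a version string into its components (a plain POSIX-shell sketch):

```shell
v="21.3.10.1-lts"
# Replace the '.' and '-' separators with spaces and read the five components
set -- $(echo "$v" | tr '.-' ' ')
echo "year=$1 feature=$2 maintenance=$3 build=$4 type=$5"
```

For 21.3.10.1-lts this prints year=21, feature release 3, maintenance version 10, build 1, release type lts.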
What is Altinity Stable version?
It is one of the general / public versions of ClickHouse that has passed extra testing, whose upgrade path and changelog have been analyzed, whose known issues are documented, and which at least a few big companies use in production. All of that takes time, so Altinity Stable usually lags a bit behind the main releases.
The Altinity Stable version is an option for conservative users who prefer a slightly older but better-known release.
Usually there is no reason to use a version older than the current Altinity Stable. If a new Altinity Stable has arrived and you are still on an older version, you should certainly consider an upgrade.
Additionally, for Altinity clients we provide extended support for those versions for a longer time (and we also support newer versions).
Which version should I use?
We recommend the following approach:
- When you start using ClickHouse and before you go on production - pick the latest stable version.
- If you already have ClickHouse running on production:
- Check all the new queries / schemas on the staging first, especially if some new ClickHouse features are used.
- Do minor (bugfix) upgrades regularly: monitor new maintenance releases of the feature release you use.
- When considering upgrade - check Altinity Stable release docs, if you want to use newer release - analyze changelog and known issues.
- Check latest stable or test versions of ClickHouse on your staging environment regularly and pass the feedback to us or on the official ClickHouse github.
- Consider blue/green or canary upgrades.
See also: https://clickhouse.tech/docs/en/faq/operations/production/
How do I upgrade?
Warning
Check the upgrade / downgrade scenario on staging first.
- Check whether you need to adjust some settings / opt out of new features you don’t need (this may be required to keep the downgrade path possible, or to let two versions work together).
- Upgrade packages on the odd replicas.
- (If needed / depending on the use case) stop ingestion into the odd replicas / remove them from the load balancer, etc.
- Restart the clickhouse-server service on the odd replicas.
- Once the odd replicas are back online, repeat the same procedure on the even replicas.
In some upgrade scenarios (depending on which version you upgrade from and to), while different replicas run different ClickHouse versions you may see the following issues:
- replication doesn’t work at all and delays grow;
- errors about ‘checksum mismatch’ and increased traffic between replicas (they need to resync merge results).
Both problems go away once all replicas are upgraded.
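The odd/even procedure above can be sketched as a small driver script. The host names and the actual commands are placeholders: replace the echo lines with your package-manager and service-restart commands.

```shell
# Hypothetical replica groups; substitute your own host names.
odd_replicas="ch-1 ch-3"
even_replicas="ch-2 ch-4"

upgrade_group() {
  for host in $1; do
    echo "upgrading packages on $host"         # e.g. ssh "$host" 'apt-get install clickhouse-server=<version>'
    echo "restarting clickhouse-server on $host"
  done
}

upgrade_group "$odd_replicas"
# ...wait until the odd replicas are back online and replication catches up...
upgrade_group "$even_replicas"
```

Splitting replicas into two halves keeps every shard serving traffic throughout the upgrade.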
Bugs?
ClickHouse development moves at a very fast pace, and the product already has thousands of features. The CI system runs tens of thousands of tests (including tests with different sanitizers) against every commit.
All core features are well-tested and very stable, and the code quality is high. But as with any other software, bad things may happen. Most bugs occur in new, freshly added functionality or in complex combinations of several features (of course, all possible combinations of features simply can’t be tested). Usually new features are adopted by the community and stabilize quickly.
What should I do if I found a bug in clickhouse?
- First of all: try to upgrade to the latest bugfix release Example: if you use v21.3.5.42-lts but you know that v21.3.10.1-lts already exists - start with upgrade to that. Upgrades to latest maintenance releases are smooth and safe.
- Look for similar issues in github. Maybe the fix is on the way.
- If you can reproduce the bug: try to isolate it - remove pieces of the query one by one / simplify the scenario as far as the issue still reproduces. This way you can figure out which part is responsible for the bug, and you can try to create a minimal reproducible example.
- Once you have minimal reproducible example:
- report it to github (or to Altinity Support)
- check if it reproduces on newer clickhouse versions
24 - clickhouse-backup
Installation and configuration
Download the latest clickhouse-backup.tar.gz from the assets at https://github.com/AlexAkulov/clickhouse-backup/releases
This tar.gz contains a single clickhouse-backup binary and an example config file.
Backblaze has an S3-compatible API but requires an empty acl parameter: acl: "".
https://www.backblaze.com/ offers a 15-day free trial with 10 GB of S3 storage.
$ mkdir clickhouse-backup
$ cd clickhouse-backup
$ wget https://github.com/AlexAkulov/clickhouse-backup/releases/download/1.0.0-beta2/clickhouse-backup.tar.gz
$ tar zxf clickhouse-backup.tar.gz
$ rm clickhouse-backup.tar.gz
$ cat config.yml
general:
remote_storage: s3
max_file_size: 1099511627776
disable_progress_bar: false
backups_to_keep_local: 0
backups_to_keep_remote: 0
log_level: info
allow_empty_backups: false
clickhouse:
username: default
password: ""
host: localhost
port: 9000
disk_mapping: {}
skip_tables:
- system.*
timeout: 5m
freeze_by_part: false
secure: false
skip_verify: false
sync_replicated_tables: true
log_sql_queries: false
s3:
access_key: 0****1
secret_key: K****1
bucket: "mybucket"
endpoint: s3.us-west-000.backblazeb2.com
region: us-west-000
acl: ""
force_path_style: false
path: clickhouse-backup
disable_ssl: false
part_size: 536870912
compression_level: 1
compression_format: tar
sse: ""
disable_cert_verification: false
storage_class: STANDARD
I have a database test with a table test:
select count() from test.test;
┌─count()─┐
│ 400000 │
└─────────┘
clickhouse-backup list should work without errors (it scans local and remote (s3) folders):
$ sudo ./clickhouse-backup list -c config.yml
$
Backup
- create a local backup of database test
- upload this backup to remote
- remove the local backup
- drop the source database
$ sudo ./clickhouse-backup create --tables='test.*' bkp01 -c config.yml
2021/05/31 23:11:13 info done backup=bkp01 operation=create table=test.test
2021/05/31 23:11:13 info done backup=bkp01 operation=create
$ sudo ./clickhouse-backup upload bkp01 -c config.yml
1.44 MiB / 1.44 MiB [=====================] 100.00% 2s
2021/05/31 23:12:13 info done backup=bkp01 operation=upload table=test.test
2021/05/31 23:12:17 info done backup=bkp01 operation=upload
$ sudo ./clickhouse-backup list -c config.yml
bkp01 1.44MiB 31/05/2021 23:11:13 local
bkp01 1.44MiB 31/05/2021 23:11:13 remote tar
$ sudo ./clickhouse-backup delete local bkp01 -c config.yml
2021/05/31 23:13:29 info delete 'bkp01'
DROP DATABASE test;
Restore
- download the remote backup
- restore database
$ sudo ./clickhouse-backup list -c config.yml
bkp01 1.44MiB 31/05/2021 23:11:13 remote tar
$ sudo ./clickhouse-backup download bkp01 -c config.yml
2021/05/31 23:14:41 info done backup=bkp01 operation=download table=test.test
1.47 MiB / 1.47 MiB [=====================] 100.00% 0s
2021/05/31 23:14:43 info done backup=bkp01 operation=download table=test.test
2021/05/31 23:14:43 info done backup=bkp01 operation=download
$ sudo ./clickhouse-backup restore bkp01 -c config.yml
2021/05/31 23:16:04 info done backup=bkp01 operation=restore table=test.test
2021/05/31 23:16:04 info done backup=bkp01 operation=restore
SELECT count() FROM test.test;
┌─count()─┐
│ 400000 │
└─────────┘
Delete backups
$ sudo ./clickhouse-backup delete local bkp01 -c config.yml
2021/05/31 23:17:05 info delete 'bkp01'
$ sudo ./clickhouse-backup delete remote bkp01 -c config.yml
25 - Converting MergeTree to Replicated
Options here are:
- Use INSERT INTO foo_replicated SELECT * FROM foo.
- Create a table aside and attach all partitions from the existing table, then drop the original table (uses hard links, doesn’t require extra disk space): ALTER TABLE foo_replicated ATTACH PARTITION ID 'bar' FROM foo. You can easily auto-generate those commands using a query like: SELECT DISTINCT 'ALTER TABLE foo_replicated ATTACH PARTITION ID \'' || partition_id || '\' FROM foo;' from system.parts WHERE table = 'foo';
- Do it ‘in place’ using some file manipulation. see the procedure described here: https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replication/#converting-from-mergetree-to-replicatedmergetree
- Do a backup of MergeTree and recover as ReplicatedMergeTree. https://github.com/AlexAkulov/clickhouse-backup/blob/master/Examples.md#how-to-convert-mergetree-to-replicatedmegretree
- Embedded command for that should be added in future.
example for option 2
Note: ATTACH PARTITION ID 'bar' FROM foo is practically free from a compute and disk-space perspective. This feature utilizes filesystem hard links and the fact that files are immutable in ClickHouse (it’s at the core of the ClickHouse design; filesystem hard links and such file manipulations are widely used).
create table foo( A Int64, D Date, S String )
Engine MergeTree
partition by toYYYYMM(D) order by A;
insert into foo select number, today(), '' from numbers(1e8);
insert into foo select number, today()-60, '' from numbers(1e8);
select count() from foo;
┌───count()─┐
│ 200000000 │
└───────────┘
create table foo_replicated as foo
Engine ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}/{shard}','{replica}')
partition by toYYYYMM(D) order by A;
SYSTEM STOP MERGES;
SELECT DISTINCT 'ALTER TABLE foo_replicated ATTACH PARTITION ID \'' || partition_id || '\' FROM foo;' from system.parts WHERE table = 'foo' AND active;
┌─concat('ALTER TABLE foo_replicated ATTACH PARTITION ID \'', partition_id, '\' FROM foo;')─┐
│ ALTER TABLE foo_replicated ATTACH PARTITION ID '202111' FROM foo; │
│ ALTER TABLE foo_replicated ATTACH PARTITION ID '202201' FROM foo; │
└───────────────────────────────────────────────────────────────────────────────────────────┘
clickhouse-client -q "SELECT DISTINCT 'ALTER TABLE foo_replicated ATTACH PARTITION ID \'' || partition_id || '\' FROM foo;' from system.parts WHERE table = 'foo' format TabSeparatedRaw" |clickhouse-client -mn
SYSTEM START MERGES;
SELECT count() FROM foo_replicated;
┌───count()─┐
│ 200000000 │
└───────────┘
rename table foo to foo_old, foo_replicated to foo;
-- you can drop foo_old any time later, it's kinda a cheap backup,
-- it cost nothing until you insert a lot of additional data into foo_replicated
26 - Data Migration
Export & Import into common data formats
Pros:
- Data can be inserted into any DBMS.
Cons:
- Decoding & encoding of common data formats may be slower / require more CPU
- The data size is usually bigger than ClickHouse formats.
- Some of the common data formats have limitations.
Info
The best approach is to use clickhouse-client; in that case, encoding/decoding of the format happens client-side, while client and server speak the ClickHouse Native format (columnar & compressed).
In contrast, when you use the HTTP protocol, the server does the encoding/decoding, and more data is passed between client and server.
remote/remoteSecure or cluster/Distributed table
Pros:
- Simple to run.
- It’s possible to change the schema and distribution of data between shards.
- It’s possible to copy only some subset of data.
- Needs only access to ClickHouse TCP port.
Cons:
- Uses CPU / RAM (mostly on the receiver side)
See details in:
clickhouse-copier
Pros:
- Possible to do some changes in schema.
- Needs only access to ClickHouse TCP port.
- It’s possible to change the distribution of data between shards.
- Suitable for large clusters: many clickhouse-copier instances can execute the same task together.
Cons:
- May create an inconsistent result if source cluster data is changing during the process.
- Hard to setup.
- Requires zookeeper.
- Uses CPU / RAM (mostly on the clickhouse-copier and receiver side)
Info
Internally it works like a smart INSERT INTO cluster(…) SELECT * FROM ... with some consistency checks.
Info
Run clickhouse-copier on the same nodes as the receiving ClickHouse servers, to avoid doubling the network load. See details in:
Manual parts moving: freeze / rsync / attach
Pros:
- Low CPU / RAM usage.
Cons:
- Table schema should be the same.
- A lot of manual operations/scripting.
Info
With some additional care and scripting, it’s possible to do cheap re-sharding at the parts level. See details in:
clickhouse-backup
Pros:
- Low CPU / RAM usage.
- Suitable to recover both schema & data for all tables at once.
Cons:
- Table schema should be the same.
Just create the backup on server 1, upload it to server 2, and restore the backup.
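Using the clickhouse-backup commands shown earlier in this guide, the flow looks like this (bkp01 is an arbitrary backup name; each line states on which server the command runs):

```shell
# Print the four steps of the server-to-server migration flow.
for step in \
  "server1: clickhouse-backup create bkp01" \
  "server1: clickhouse-backup upload bkp01" \
  "server2: clickhouse-backup download bkp01" \
  "server2: clickhouse-backup restore bkp01"; do
  echo "$step"
done
```

Both servers must point at the same remote storage in their config.yml for the upload/download pair to work.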
See https://github.com/AlexAkulov/clickhouse-backup
https://altinity.com/blog/introduction-to-clickhouse-backups-and-clickhouse-backup
Fetch from zookeeper path
Pros:
- Low CPU / RAM usage.
Cons:
- Table schema should be the same.
- Works only when the source and the destination clickhouse servers share the same zookeeper (without chroot)
- Needs access to zookeeper and to the ClickHouse replication ports (interserver_http_port or interserver_https_port)
ALTER TABLE table_name FETCH PARTITION partition_expr FROM 'path-in-zookeeper'
Replication protocol
Just make one more replica in another place.
Pros:
- Simple to setup
- Data is consistent all the time automatically.
- Low CPU and network usage.
Cons:
- Needs to reach both the zookeeper client port (2181) and the ClickHouse replication ports (interserver_http_port or interserver_https_port)
- In case of cluster migration, zookeeper needs to be migrated too.
- Replication works both ways.
../altinity-kb-zookeeper/altinity-kb-zookeeper-cluster-migration.md
See also
Github issues
https://github.com/ClickHouse/ClickHouse/issues/10943 https://github.com/ClickHouse/ClickHouse/issues/20219 https://github.com/ClickHouse/ClickHouse/pull/17871
Other links
26.1 - clickhouse-copier
The description of the utility and its parameters, as well as examples of the config files that you need to create for the copier are in the doc https://clickhouse.tech/docs/en/operations/utilities/clickhouse-copier/
The steps to run a task:
-
Create a config file for clickhouse-copier (zookeeper.xml)
https://clickhouse.tech/docs/en/operations/utilities/clickhouse-copier/#format-of-zookeeper-xml
-
Create a config file for the task (task1.xml)
-
Create the task in ZooKeeper and start an instance of clickhouse-copier
clickhouse-copier --daemon --base-dir=/opt/clickhouse-copier --config /opt/clickhouse-copier/zookeeper.xml --task-path /clickhouse/copier/task1 --task-file /opt/clickhouse-copier/task1.xml
If the node in ZooKeeper already exists and you want to change it, you need to add the task-upload-force parameter:
clickhouse-copier --daemon --base-dir=/opt/clickhouse-copier --config /opt/clickhouse-copier/zookeeper.xml --task-path /clickhouse/copier/task1 --task-file /opt/clickhouse-copier/task1.xml --task-upload-force 1
If you want to run another instance of clickhouse-copier for the same task, you need to copy the config file (zookeeper.xml) to another server, and run this command:
clickhouse-copier --daemon --base-dir=/opt/clickhouse-copier --config /opt/clickhouse-copier/zookeeper.xml --task-path /clickhouse/copier/task1
The number of simultaneously running instances is controlled by the max_workers parameter in your task configuration file. If you run more workers than that, the superfluous workers will sleep and log messages like this:
<Debug> ClusterCopier: Too many workers (1, maximum 1). Postpone processing
See also
- https://clickhouse.tech/docs/en/operations/utilities/clickhouse-copier/
- Никита Михайлов. Кластер ClickHouse ctrl-с ctrl-v. HighLoad++ Весна 2021 slides
- 21.7 has a big batch of fixes / improvements: https://github.com/ClickHouse/ClickHouse/pull/23518
- https://altinity.com/blog/2018/8/22/clickhouse-copier-in-practice
- http://www.clickhouse.com.cn/topic/601fb322b06e5e0f21ba79e1
- https://github.com/getsentry/snuba/blob/master/docs/clickhouse-copier.md
- https://hughsite.com/post/clickhouse-copier-usage.html
- https://www.jianshu.com/p/c058edd664a6
26.1.1 - clickhouse-copier 20.3 and earlier
Clickhouse-copier was created to move data between clusters. It runs simple INSERT…SELECT queries and can copy data between tables with different engine parameters and between clusters with different number of shards. In the task configuration file you need to describe the layout of the source and the target cluster, and list the tables that you need to copy. You can copy whole tables or specific partitions. Clickhouse-copier uses temporary distributed tables to select from the source cluster and insert into the target cluster.
The process is as follows
- Process the configuration files.
- Discover the list of partitions if not provided in the config.
- Copy partitions one by one.
- Drop the partition from the target table if it’s not empty
- Copy data from source shards one by one.
- Check if there is data for the partition on a source shard.
- Check the status of the task in ZooKeeper.
- Create target tables on all shards of the target cluster.
- Insert the partition of data into the target table.
- Mark the partition as completed in ZooKeeper.
If there are several workers running simultaneously, they will assign themselves to different source shards. If a worker was interrupted, another worker can be started to continue the task. The next worker will drop incomplete partitions and resume the copying.
Configuring the engine of the target table
Clickhouse-copier uses the engine from the task configuration file for these purposes:
- to create target tables if they don’t exist.
- PARTITION BY: to SELECT a partition of data from the source table, to DROP existing partitions from target tables.
Clickhouse-copier does not support the old MergeTree format. However, you can create the target tables manually and specify the engine in the task configuration file in the new format so that clickhouse-copier can parse it for its SELECT queries.
How to monitor the status of running tasks
Clickhouse-copier uses ZooKeeper to keep track of the progress and to communicate between workers. Here is a list of queries that you can use to see what’s happening.
--task-path /clickhouse/copier/task1
-- The task config
select * from system.zookeeper
where path='<task-path>'
name | ctime | mtime
----------------------------+---------------------+--------------------
description | 2019-10-18 15:40:00 | 2020-09-11 16:01:14
task_active_workers_version | 2019-10-18 16:00:09 | 2020-09-11 16:07:08
tables | 2019-10-18 16:00:25 | 2019-10-18 16:00:25
task_active_workers | 2019-10-18 16:00:09 | 2019-10-18 16:00:09
-- Running workers
select * from system.zookeeper
where path='<task-path>/task_active_workers'
-- The list of processed tables
select * from system.zookeeper
where path='<task-path>/tables'
-- The list of processed partitions
select * from system.zookeeper
where path='<task-path>/tables/<table>'
name | ctime
-------+--------------------
201909 | 2019-10-18 18:24:18
-- The status of a partition
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>'
name | ctime
-------------------------+--------------------
shards | 2019-10-18 18:24:18
partition_active_workers | 2019-10-18 18:24:18
-- The status of source shards
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>/shards'
name | ctime | mtime
-----+---------------------+--------------------
1 | 2019-10-18 22:37:48 | 2019-10-18 22:49:29
26.1.2 - clickhouse-copier 20.4 - 21.6
Clickhouse-copier was created to move data between clusters. It runs simple INSERT…SELECT queries and can copy data between tables with different engine parameters and between clusters with different number of shards. In the task configuration file you need to describe the layout of the source and the target cluster, and list the tables that you need to copy. You can copy whole tables or specific partitions. Clickhouse-copier uses temporary distributed tables to select from the source cluster and insert into the target cluster.
The behavior of clickhouse-copier was changed in 20.4:
- Now clickhouse-copier inserts data into intermediate tables, and after the insert finishes successfully it attaches the completed partition into the target table. This allows incremental data copying, because the data in the target table stays intact during the process. Important note: ATTACH PARTITION respects the max_partition_size_to_drop limit. Make sure the max_partition_size_to_drop limit is big enough (or set to zero) in the destination cluster. If clickhouse-copier is unable to attach a partition because of the limit, it will proceed to the next partition, and it will drop the intermediate table when the task is finished (if the intermediate table is less than the max_table_size_to_drop limit). Another important note: ATTACH PARTITION is replicated. The attached partition will need to be downloaded by the other replicas. This can create significant network traffic between ClickHouse nodes. If an attach takes a long time, clickhouse-copier will log a timeout and will proceed to the next step.
- Now clickhouse-copier splits the source data into chunks and copies them one by one. This is useful for big source tables, when inserting one partition of data can take hours. If there is an error during the insert, clickhouse-copier has to drop the whole partition and start again. The number_of_splits parameter lets you split your data into chunks so that in case of an exception clickhouse-copier has to re-insert only one chunk of the data.
- Now clickhouse-copier runs OPTIMIZE target_table PARTITION ... DEDUPLICATE for non-Replicated MergeTree tables. Important note: this is a very strange feature that can do more harm than good. We recommend disabling it by configuring the engine of the target table as Replicated in the task configuration file, and creating the target tables manually if they are not supposed to be replicated. Intermediate tables are always created as plain MergeTree.
The process is as follows
- Process the configuration files.
- Discover the list of partitions if not provided in the config.
- Copy partitions one by one. (I’m not sure of the order, since I was copying from 1 shard to 4 shards; the metadata in ZooKeeper suggests the order described here.)
- Copy chunks of data one by one.
- Copy data from source shards one by one.
- Create intermediate tables on all shards of the target cluster.
- Check the status of the chunk in ZooKeeper.
- Drop the partition from the intermediate table if the previous attempt was interrupted.
- Insert the chunk of data into the intermediate tables.
- Mark the shard as completed in ZooKeeper
- Copy data from source shards one by one.
- Attach the chunks of the completed partition into the target table one by one
- Attach a chunk into the target table.
- non-Replicated: Run OPTIMIZE target_table DEDUPLICATE for the partition on the target table.
- Copy chunks of data one by one.
- Drop intermediate tables (may not succeed if the tables are bigger than max_table_size_to_drop).
If there are several workers running simultaneously, they will assign themselves to different source shards. If a worker was interrupted, another worker can be started to continue the task. The next worker will drop incomplete partitions and resume the copying.
Configuring the engine of the target table
Clickhouse-copier uses the engine from the task configuration file for these purposes:
- to create target and intermediate tables if they don’t exist.
- PARTITION BY: to SELECT a partition of data from the source table, to ATTACH partitions into target tables, to DROP incomplete partitions from intermediate tables, to OPTIMIZE partitions after they are attached to the target.
- ORDER BY: to SELECT a chunk of data from the source table.
Here is an example of SELECT that clickhouse-copier runs to get the sixth of ten chunks of data:
WHERE (<the PARTITION BY clause> = (<a value of the PARTITION BY expression> AS partition_key))
AND (cityHash64(<the ORDER BY clause>) % 10 = 6 )
Clickhouse-copier does not support the old MergeTree format. However, you can create the intermediate tables manually with the same engine as the target tables (otherwise ATTACH will not work), and specify the engine in the task configuration file in the new format so that clickhouse-copier can parse it for SELECT, ATTACH PARTITION and DROP PARTITION queries.
Important note: always configure engine as Replicated to disable OPTIMIZE … DEDUPLICATE (unless you know why you need clickhouse-copier to run OPTIMIZE … DEDUPLICATE).
How to configure the number of chunks
The default value for number_of_splits
is 10.
You can change this parameter in the table
section of the task configuration file. We recommend setting it to 1 for smaller tables.
<cluster_push>target_cluster</cluster_push>
<database_push>target_database</database_push>
<table_push>target_table</table_push>
<number_of_splits>1</number_of_splits>
<engine>Engine=Replicated...</engine>
How to monitor the status of running tasks
Clickhouse-copier uses ZooKeeper to keep track of the progress and to communicate between workers. Here is a list of queries that you can use to see what’s happening.
--task-path /clickhouse/copier/task1
-- The task config
select * from system.zookeeper
where path='<task-path>'
name | ctime | mtime
----------------------------+---------------------+--------------------
description | 2021-03-22 13:15:48 | 2021-03-22 13:25:28
task_active_workers_version | 2021-03-22 13:15:48 | 2021-03-22 20:32:09
tables | 2021-03-22 13:16:47 | 2021-03-22 13:16:47
task_active_workers | 2021-03-22 13:15:48 | 2021-03-22 13:15:48
-- Running workers
select * from system.zookeeper
where path='<task-path>/task_active_workers'
-- The list of processed tables
select * from system.zookeeper
where path='<task-path>/tables'
-- The list of processed partitions
select * from system.zookeeper
where path='<task-path>/tables/<table>'
name | ctime
-------+--------------------
202103 | 2021-03-22 13:16:47
202102 | 2021-03-22 13:18:31
202101 | 2021-03-22 13:27:36
202012 | 2021-03-22 14:05:08
-- The status of a partition
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>'
name | ctime
---------------+--------------------
piece_0 | 2021-03-22 13:18:31
attach_is_done | 2021-03-22 14:05:05
-- The status of a piece
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>/piece_N'
name | ctime
-------------------------------+--------------------
shards | 2021-03-22 13:18:31
is_dirty | 2021-03-22 13:26:51
partition_piece_active_workers | 2021-03-22 13:26:54
clean_start | 2021-03-22 13:26:54
-- The status of source shards
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>/piece_N/shards'
name | ctime | mtime
-----+---------------------+--------------------
1 | 2021-03-22 13:26:54 | 2021-03-22 14:05:05
26.2 - Fetch Alter Table
FETCH Parts from Zookeeper
This is a detailed explanation of how to move data by fetching partitions or parts between replicas.
Get partitions by database and table:
SELECT
hostName() AS host,
database,
table,
partition_id,
name as part_id
FROM cluster('{cluster}', system.parts)
WHERE database IN ('db1','db2' ... 'dbn') AND active
This query will return all the partitions and parts stored on these nodes for the listed databases and their tables.
Fetch the partitions:
Before starting the fetching process, it is recommended to check the system.detached_parts
table of the destination node. There is a chance that the detached folders already contain some old parts, and you will have to remove them all before starting to move data. Otherwise you would attach those old parts together with the fetched parts. You could also run into issues if there are detached folders with the same names as the ones you are fetching (not very probable, but possible). Simply delete the detached parts and continue with the process.
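The check described above can be done with a query like this (the table name is a placeholder):
```sql
-- List leftover parts in the detached area of the destination node;
-- folders with matching names would collide with the fetched parts.
SELECT database, table, name, reason
FROM system.detached_parts
WHERE table = '<tablename>';
```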
To fetch a partition:
ALTER TABLE <tablename> FETCH PARTITION <partition_id> FROM '/clickhouse/{cluster}/tables/{shard}/{table}'
The FROM
path is the ZooKeeper path of the source replica, and you have to specify the shard from which you’re fetching the partition. Next, execute the DDL query:
ALTER TABLE <tablename> ATTACH PARTITION <partition_id>
will attach the partitions to the table. Again, because the process is manual, it is recommended to check that the fetched partitions are attached correctly and that there are no detached parts left. Check both the system.parts
and system.detached_parts
tables.
Detach tables and delete replicas:
If needed, after moving the data and checking that everything is sound, you can detach the tables and delete the replicas.
-- Required for DROP REPLICA
DETACH TABLE <table_name>;
-- It will remove everything from /table_path_in_z
-- but not the data. You could reattach the table again and
-- restore the replica if needed
SYSTEM DROP REPLICA 'replica_name' FROM ZKPATH '/table_path_in_zk/';
Query to generate all the DDL:
With this query you can generate the DDL script that will do the fetch and attach operations for each table and partition.
SELECT
DISTINCT
'alter table '||database||'.'||table||' FETCH PARTITION '''||partition_id||''' FROM '''||zookeeper_path||'''; '
||'alter table '||database||'.'||table||' ATTACH PARTITION '''||partition_id||''';'
FROM system.parts INNER JOIN system.replicas USING (database, table)
WHERE database IN ('db1','db2' ... 'dbn') AND active
You could add an ORDER BY to manually make the list in the order you need, or use ORDER BY rand() to randomize it. You will then need to split the commands between the shards.
26.3 - Remote table function
remote(…) table function
Suitable for moving up to hundreds of gigabytes of data.
For bigger tables, the recommended approach is to slice the original data by some WHERE
condition, ideally applying the condition on the partitioning key, to avoid writing data to many partitions at once.
INSERT INTO staging_table SELECT * FROM remote(...) WHERE date='2021-04-13';
INSERT INTO staging_table SELECT * FROM remote(...) WHERE date='2021-04-12';
INSERT INTO staging_table SELECT * FROM remote(...) WHERE date='2021-04-11';
....
Q. Can it create a bigger load on the source system?
Yes, it uses disk read and network write bandwidth on the source. But typically write speed is worse than read speed, so most probably the receiver side will be the bottleneck and the sender side will not be overloaded.
Of course this should be checked; every case is different.
Q. Can I tune INSERT speed to make it faster?
Yes, at the cost of extra memory usage (on the receiver side).
ClickHouse tries to form blocks of data in memory, and when one of the limits (min_insert_block_size_rows
or min_insert_block_size_bytes
) is hit, it flushes this block to disk. If ClickHouse executes the insert in parallel (max_insert_threads > 1
), it forms multiple blocks at the same time.
So the maximum memory usage can be estimated like this: max_insert_threads * (min_insert_block_size_rows OR min_insert_block_size_bytes, whichever limit is hit first)
Default values:
┌─name────────────────────────┬─value─────┐
│ min_insert_block_size_rows │ 1048545 │
│ min_insert_block_size_bytes │ 268427520 │
│ max_insert_threads │ 0 │ <- Values 0 or 1 means that INSERT SELECT is not run in parallel.
└─────────────────────────────┴───────────┘
Tune those settings depending on your table average row size and amount of memory which are safe to occupy by INSERT SELECT
query.
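As a rough illustration of the formula above, here is a back-of-the-envelope estimate (the thread count is hypothetical; the byte limit is the default shown in the table):
```shell
# Rough upper bound on extra receiver-side memory for a parallel INSERT ... SELECT:
# up to max_insert_threads blocks can be buffered at once, each growing to
# min_insert_block_size_bytes before it is flushed.
max_insert_threads=4                   # hypothetical value
min_insert_block_size_bytes=268427520  # default (~256 MiB)
echo $((max_insert_threads * min_insert_block_size_bytes))  # prints 1073710080 (~1 GiB)
```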
Q. I’ve got the error “All connection tries failed”
SELECT count()
FROM remote('server.from.remote.dc:9440', 'default.table', 'admin', 'password')
Received exception from server (version 20.8.11):
Code: 519. DB::Exception: Received from localhost:9000. DB::Exception: All attempts to get table structure failed. Log:
Code: 279, e.displayText() = DB::NetException: All connection tries failed. Log:
Code: 209, e.displayText() = DB::NetException: Timeout: connect timed out: 192.0.2.1:9440 (server.from.remote.dc:9440) (version 20.8.11.17 (official build))
Code: 209, e.displayText() = DB::NetException: Timeout: connect timed out: 192.0.2.1:9440 (server.from.remote.dc:9440) (version 20.8.11.17 (official build))
Code: 209, e.displayText() = DB::NetException: Timeout: connect timed out: 192.0.2.1:9440 (server.from.remote.dc:9440) (version 20.8.11.17 (official build))
- You are using the remote(…) table function with a secure TCP port (default value is 9440). Use the remoteSecure() function for that instead.
- High (>50 ms) ping between servers; the values of
connect_timeout_with_failover_ms,
connect_timeout_with_failover_secure_ms
need to be adjusted accordingly.
Default values:
┌─name────────────────────────────────────┬─value─┐
│ connect_timeout_with_failover_ms │ 50 │
│ connect_timeout_with_failover_secure_ms │ 100 │
└─────────────────────────────────────────┴───────┘
26.4 - rsync
Short Instruction
- Do FREEZE TABLE on the needed table/partition. It produces a consistent snapshot of the table data.
- Run the rsync command:
rsync -ravlW --bwlimit=100000 /var/lib/clickhouse/data/shadow/N/database/table root@remote_host:/var/lib/clickhouse/data/database/table/detached
--bwlimit
is the transfer limit in KBytes per second.
- Run ATTACH PARTITION for each partition from the
./detached
directory.
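The last step can be sketched like this (database, table, and partition names are placeholders):
```sql
-- On the destination host, after rsync has populated .../detached:
ALTER TABLE db.table ATTACH PARTITION 202103;
-- Repeat for every partition found in the detached directory.
```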
27 - DDLWorker
DDLWorker is a subprocess (thread) of clickhouse-server that executes ON CLUSTER
tasks at the node.
When you execute a DDL query with an ON CLUSTER mycluster
section, the query executor at the current node reads the cluster mycluster
definition (remote_servers / system.clusters) and places tasks into the ZooKeeper znode task_queue/ddl/...
for the members of the cluster mycluster
.
The DDLWorker at every ClickHouse node constantly checks this task_queue
for its tasks, executes them locally, and reports the result back into the task_queue
.
A common issue is different hostnames/IP addresses in the cluster definition and locally.
So the initiator node puts tasks for a host named Host1, but Host1 thinks of its own name as localhost or xdgt634678d (an internal Docker hostname) and never sees the tasks for Host1 because it is looking for tasks for xdgt634678d. The same applies to internal vs. external IP addresses.
Another issue is that sometimes the DDLWorker thread can crash; then the ClickHouse node stops executing ON CLUSTER
tasks.
Check that DDLWorker is alive:
ps -eL|grep DDL
18829 18876 ? 00:00:00 DDLWorkerClnr
18829 18879 ? 00:00:00 DDLWorker
ps -ef|grep 18829|grep -v grep
clickho+ 18829 18828 1 Feb09 ? 00:55:00 /usr/bin/clickhouse-server --con...
As you can see there are two threads: DDLWorker
and DDLWorkerClnr
.
The second thread – DDLWorkerCleaner
cleans old tasks from task_queue
. You can configure how many recent tasks to store:
config.xml
<yandex>
<distributed_ddl>
<path>/clickhouse/task_queue/ddl</path>
<max_tasks_in_queue>1000</max_tasks_in_queue>
<task_max_lifetime>604800</task_max_lifetime>
<cleanup_delay_period>60</cleanup_delay_period>
</distributed_ddl>
</yandex>
Default values:
cleanup_delay_period = 60 seconds – Sets how often to start cleanup to remove outdated data.
task_max_lifetime = 7 * 24 * 60 * 60 (in seconds = week) – Delete task if its age is greater than that.
max_tasks_in_queue = 1000 – How many tasks could be in the queue.
27.1 - There are N unfinished hosts (0 of them are currently active).
Sometimes your distributed DDL queries get stuck and are not executed on all nodes or on a subset of them. There are a lot of possible reasons for that kind of behavior, so it can take some time and effort to investigate.
Possible reasons
Clickhouse node can’t recognize itself
SELECT * FROM system.clusters; -- check is_local column, it should have 1 for itself
getent hosts clickhouse.local.net # or other name which should be local
hostname --fqdn
cat /etc/hosts
cat /etc/hostname
Debian / Ubuntu
There is an issue in Debian-based images where the hostname is mapped to the 127.0.1.1 address, which doesn’t literally match a network interface, so ClickHouse fails to detect this address as local.
https://github.com/ClickHouse/ClickHouse/issues/23504
Previous task is being executed and taking some time
It’s usually some heavy operation like merges, mutations, or ALTER COLUMN queries, so it makes sense to check those tables:
SHOW PROCESSLIST;
SELECT * FROM system.merges;
SELECT * FROM system.mutations;
In that case, you can just wait for the completion of the previous task.
Previous task is stuck because of some error
In that case, the first step is to understand which exact task is stuck and why. There are some queries which can help with that.
-- list of all distributed ddl queries, path can be different in your installation
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl/';
-- information about specific task.
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/';
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl/' AND name = 'query-0000001000';
-- How many nodes executed this task
SELECT name, numChildren as finished_nodes FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/' AND name = 'finished';
┌─name─────┬─finished_nodes─┐
│ finished │ 0 │
└──────────┴────────────────┘
-- The nodes that are running the task
SELECT name, value, ctime, mtime FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/active/';
-- What was the result for the finished nodes
SELECT name, value, ctime, mtime FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/finished/';
-- Latest successfully executed tasks from query_log.
SELECT query FROM system.query_log WHERE query LIKE '%ddl_entry%' AND type = 2 ORDER BY event_time DESC LIMIT 5;
SELECT
FQDN(),
*
FROM clusterAllReplicas('cluster', system.metrics)
WHERE metric LIKE '%MaxDDLEntryID%'
┌─FQDN()───────────────────┬─metric────────┬─value─┬─description───────────────────────────┐
│ chi-ab.svc.cluster.local │ MaxDDLEntryID │ 1468 │ Max processed DDL entry of DDLWorker. │
└──────────────────────────┴───────────────┴───────┴───────────────────────────────────────┘
-- Information about task execution from logs.
grep -C 40 "ddl_entry" /var/log/clickhouse-server/clickhouse-server*.log
Issues that can prevent the task execution
Obsolete replicas left in zookeeper.
SELECT database, table, zookeeper_path, replica_path zookeeper FROM system.replicas WHERE total_replicas != active_replicas;
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/cluster/tables/01/database/table/replicas';
SYSTEM DROP REPLICA 'replica_name';
SYSTEM STOP REPLICATION QUEUES;
SYSTEM START REPLICATION QUEUES;
https://clickhouse.tech/docs/en/sql-reference/statements/system/#query_language-system-drop-replica
Tasks were removed from the DDL queue but left in the Replicated*MergeTree table queue.
grep -C 40 "ddl_entry" /var/log/clickhouse-server/clickhouse-server*.log
/var/log/clickhouse-server/clickhouse-server.log:2021.05.04 12:41:28.956888 [ 599 ] {} <Debug> DDLWorker: Processing task query-0000211211 (ALTER TABLE db.table_local ON CLUSTER `all-replicated` DELETE WHERE id = 1)
/var/log/clickhouse-server/clickhouse-server.log:2021.05.04 12:41:29.053555 [ 599 ] {} <Error> DDLWorker: ZooKeeper error: Code: 999, e.displayText() = Coordination::Exception: No node, Stack trace (when copying this message, always include the lines below):
/var/log/clickhouse-server/clickhouse-server.log-
/var/log/clickhouse-server/clickhouse-server.log-0. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error, int) @ 0xfb2f6b3 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-1. Coordination::Exception::Exception(Coordination::Error) @ 0xfb2fb56 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log:2. DB::DDLWorker::createStatusDirs(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<zkutil::ZooKeeper> const&) @ 0xeb3127a in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log:3. DB::DDLWorker::processTask(DB::DDLTask&) @ 0xeb36c96 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log:4. DB::DDLWorker::enqueueTask(std::__1::unique_ptr<DB::DDLTask, std::__1::default_delete<DB::DDLTask> >) @ 0xeb35f22 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-5. ? @ 0xeb47aed in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-6. ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) @ 0x8633bcd in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-7. ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()>(void&&, void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&&...)::'lambda'()::operator()() @ 0x863612f in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-8. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x8630ffd in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-9. ? @ 0x8634bb3 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-10. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
/var/log/clickhouse-server/clickhouse-server.log-11. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
/var/log/clickhouse-server/clickhouse-server.log- (version 21.1.8.30 (official build))
/var/log/clickhouse-server/clickhouse-server.log:2021.05.04 12:41:29.053951 [ 599 ] {} <Debug> DDLWorker: Processing task query-0000211211 (ALTER TABLE db.table_local ON CLUSTER `all-replicated` DELETE WHERE id = 1)
The context of this problem is:
- Constant pressure of cheap ON CLUSTER DELETE queries.
- One replica was down for a long time (multiple days).
- Because of the pressure on the DDL queue, old records were purged due to the
task_max_lifetime
setting.
- When the lagging replica came back up, it failed to execute old queries from the DDL queue, because at this point they had been purged from it.
Solution:
- Reload/restore this replica from scratch.
28 - differential backups using clickhouse-backup
differential backups using clickhouse-backup
- Download the latest clickhouse-backup for your platform https://github.com/AlexAkulov/clickhouse-backup/releases
# ubuntu / debian
wget https://github.com/AlexAkulov/clickhouse-backup/releases/download/v1.0.0/clickhouse-backup_1.0.0_amd64.deb
sudo dpkg -i clickhouse-backup_1.0.0_amd64.deb
# centos / redhat / fedora
sudo yum install https://github.com/AlexAkulov/clickhouse-backup/releases/download/v1.0.0/clickhouse-backup-1.0.0-1.x86_64.rpm
# other platforms
wget https://github.com/AlexAkulov/clickhouse-backup/releases/download/v1.0.0/clickhouse-backup.tar.gz
sudo mkdir /etc/clickhouse-backup/
sudo mv clickhouse-backup/config.yml /etc/clickhouse-backup/config.yml.example
sudo mv clickhouse-backup/clickhouse-backup /usr/bin/
rm -rf clickhouse-backup clickhouse-backup.tar.gz
- Create a runner script for the crontab
mkdir /opt/clickhouse-backup-diff/
cat << 'END' > /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh
#!/bin/bash
set +x
command_line_argument=$1
backup_name=$(date +%Y-%m-%d-%H-%M-%S)
echo "Creating local backup '${backup_name}' (full, using hardlinks)..."
clickhouse-backup create "${backup_name}"
if [[ "run_diff" == "${command_line_argument}" && "2" -le "$(clickhouse-backup list local | wc -l)" ]]; then
prev_backup_name="$(clickhouse-backup list local | tail -n 2 | head -n 1 | cut -d " " -f 1)"
echo "Uploading the backup '${backup_name}' as diff from the previous backup ('${prev_backup_name}')"
clickhouse-backup upload --diff-from "${prev_backup_name}" "${backup_name}"
elif [[ "" == "${command_line_argument}" ]]; then
echo "Uploading the backup '${backup_name}', and removing old unneeded backups"
KEEP_BACKUPS_LOCAL=1 KEEP_BACKUPS_REMOTE=1 clickhouse-backup upload "${backup_name}"
fi
END
chmod +x /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh
- Create the configuration for clickhouse-backup
# Check the example: /etc/clickhouse-backup/config.yml.example
vim /etc/clickhouse-backup/config.yml
- Edit the crontab
crontab -e
# full backup at 0:00 Monday
0 0 * * 1 clickhouse /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh
# differential backup every hour (except of 00:00) Monday
0 1-23 * * 1 clickhouse /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh run_diff
# differential backup every hour Sunday, Tuesday-Saturday
0 */1 * * 0,2-6 clickhouse /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh run_diff
- Recover the last backup:
last_remote_backup="$(clickhouse-backup list remote | tail -n 1 | cut -d " " -f 1)"
clickhouse-backup download "${last_remote_backup}"
clickhouse-backup restore --rm "${last_remote_backup}"
29 - High CPU usage
In general, it is a NORMAL situation for ClickHouse to use a lot of (or all of) the server resources while processing a huge dataset. It is ‘by design’, just to make the answers faster.
The main directions for reducing CPU usage are to review the schema/queries to limit the amount of data that needs to be processed, and to plan the resources in a way that a single running query will not impact the others.
Any attempt to reduce CPU usage will end up with slower queries!
How to slow down queries to reduce the CPU usage
If it is acceptable for you - please check the following options for limiting the CPU usage:
-
setting
max_threads
: reduces the number of threads that one request is allowed to use. Fewer threads = more free cores for other requests. By default, it’s allowed to take half of the available CPU cores; adjust only when needed. So if you have 10 cores, then max_threads = 10
will run about twice as fast as max_threads = 5
, but will take 100% of the CPU (max_threads = 5 will use half of the cores, so 50%).
setting
os_thread_priority
: increasing niceness for selected requests. In this case, the operating system, when choosing which of the running processes to allocate processor time, will prefer processes with lower niceness. 0 is the default niceness. The higher the niceness, the lower the priority of the process. The maximum niceness value is 19.
These are custom settings that can be tweaked in several ways:
-
by specifying them when connecting a client, for example
clickhouse-client --os_thread_priority=19 -q 'SELECT max(number) from numbers(100000000)'
echo 'SELECT max(number) from numbers(100000000)' | curl 'http://localhost:8123/?os_thread_priority=19' --data-binary @-
-
via dedicated API / connection parameters in client libraries
-
using the SQL command SET (works only within the session)
SET os_thread_priority = 19;
SELECT max(number) from numbers(100000000);
-
using different profiles of settings for different users. Something like
<?xml version="1.0"?>
<yandex>
    <profiles>
        <default>
            ...
        </default>
        <lowcpu>
            <os_thread_priority>19</os_thread_priority>
            <max_threads>4</max_threads>
        </lowcpu>
    </profiles>
    <!-- Users and ACL. -->
    <users>
        <!-- If user name was not specified, 'default' user is used. -->
        <limited_user>
            <password>123</password>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>lowcpu</profile>
            <!-- Quota for user. -->
            <quota>default</quota>
        </limited_user>
    </users>
</yandex>
There are also plans to introduce a system of more flexible control over the assignment of resources to different requests.
Also, if these are manually created queries, then you can try to discipline users by adding quotas to them (they can be formulated as “you can read no more than 100GB of data per hour” or “no more than 10 queries”, etc.)
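The quota examples above can be expressed with RBAC SQL (quota name, user, and limits are hypothetical):
```sql
-- At most 10 queries and ~100 GB read per hour for some_user.
CREATE QUOTA limit_heavy_user
FOR INTERVAL 1 hour MAX queries = 10, read_bytes = 100000000000
TO some_user;
```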
If these are automatically generated queries, it may make sense to check if there is no way to write them in a more efficient way.
30 - Load balancers
In general, one of the simplest options for load balancing is to implement it on the client side:
list several endpoints for ClickHouse connections and add some logic to pick one of the nodes.
Many client libraries support that.
ClickHouse native protocol (port 9000)
Currently there are no protocol-aware proxies for the ClickHouse native protocol, so a proxy / load balancer can work only at the TCP level.
One of the best options for a TCP load balancer is haproxy; nginx can also work in that mode.
Haproxy picks one upstream when the connection is established, and after that it keeps the client connected to the same server until the client or server disconnects (or some timeout happens).
It can’t send different queries coming via a single connection to different servers, as it knows nothing about the ClickHouse protocol and doesn’t know when one query ends and another starts; it just sees the binary stream.
So for the native protocol, there are only 3 possibilities:
- close the connection after each query on the client side
- close the connection after each query on the server side (currently there is only one setting for that - idle_connection_timeout=0, which is not exactly what you need, but similar)
- use a ClickHouse server with a Distributed table as a proxy
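The last option can be sketched like this (cluster, database, table, and column names are hypothetical); a lightweight ClickHouse node holding only a Distributed table will spread queries across the cluster behind it:
```sql
-- On the proxy node; 'mycluster' must be defined in remote_servers.
-- Each query against events_proxy is forwarded to default.events
-- on the cluster shards, with rand() picking the shard for writes.
CREATE TABLE default.events_proxy
(
    event_date Date,
    user_id UInt64
)
ENGINE = Distributed('mycluster', 'default', 'events', rand());
```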
HTTP protocol (port 8123)
There are many more options and you can use haproxy / nginx / chproxy, etc. chproxy gives some extra ClickHouse-specific features; you can find a list of them at https://github.com/Vertamedia/chproxy
31 - memory configuration settings
max_memory_usage. Single query memory usage
max_memory_usage - the maximum amount of memory a single query is allowed to take. By default, it’s 10Gb. The default value is good; don’t adjust it in advance.
There are scenarios when you need to relax the limit for particular queries (if you hit ‘Memory limit (for query) exceeded’), or use a lower limit if you need to discipline the users or increase the number of simultaneous queries.
Server memory usage
Server memory usage = constant memory footprint (used by different caches, dictionaries, etc) + sum of memory temporary used by running queries (a theoretical limit is a number of simultaneous queries multiplied by max_memory_usage).
Since 20.4 you can set up a global limit using the max_server_memory_usage
setting. If something hits that limit, you will see ‘Memory limit (total) exceeded’ errors in random places.
By default it is 90% of the physical RAM of the server. https://clickhouse.tech/docs/en/operations/server-configuration-parameters/settings/#max_server_memory_usage https://github.com/ClickHouse/ClickHouse/blob/e5b96bd93b53d2c1130a249769be1049141ef386/programs/server/config.xml#L239-L250
You can decrease that in some scenarios (e.g. if you need to leave more free RAM for the page cache or for some other software).
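For reference, a sketch of a config override (the file path and the value are assumptions) could look like:
```xml
<!-- e.g. /etc/clickhouse-server/config.d/memory.xml (hypothetical path) -->
<yandex>
    <!-- Hard cap for total server memory usage, in bytes (here ~100 GB). -->
    <max_server_memory_usage>100000000000</max_server_memory_usage>
</yandex>
```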
How to check what is using my RAM?
See: altinity-kb-who-ate-my-memory.md
Mark cache
https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup39/mark-cache.pdf
32 - Moving table to another device.
Suppose we mount a new device at path /mnt/disk_1
and want to move table_4
to it.
- Create a directory on the new device for ClickHouse data. In server shell:
mkdir /mnt/disk_1/clickhouse
- Change the ownership of the created directory to the ClickHouse user. In server shell:
chown -R clickhouse:clickhouse /mnt/disk_1/clickhouse
- Create a special storage policy which includes both disks: old and new. In server shell:
nano /etc/clickhouse-server/config.d/storage.xml
###################/etc/clickhouse-server/config.d/storage.xml###########################
<yandex>
<storage_configuration>
<disks>
<!--
default disk is special, it always
exists even if not explicitly
configured here, but you can't change
it's path here (you should use <path>
on top level config instead)
-->
<default>
<!--
You can reserve some amount of free space
on any disk (including default) by adding
keep_free_space_bytes tag
-->
</default>
<disk_1> <!-- disk name -->
<path>/mnt/disk_1/clickhouse/</path>
</disk_1>
</disks>
<policies>
<move_from_default_to_disk_1> <!-- name for new storage policy -->
<volumes>
<default>
<disk>default</disk>
<max_data_part_size_bytes>10000000</max_data_part_size_bytes>
</default>
<disk_1_vol> <!-- name of volume -->
<!--
we have only one disk in that volume
and we reference here the name of disk
as configured above in <disks> section
-->
<disk>disk_1</disk>
</disk_1_vol>
</volumes>
<move_factor>0.99</move_factor>
</move_from_default_to_disk_1>
</policies>
</storage_configuration>
</yandex>
#########################################################################################
- Update storage_policy setting of tables to new policy.
ALTER TABLE table_4 MODIFY SETTING storage_policy='move_from_default_to_disk_1';
- Wait till all parts of the table change their disk_name to the new disk.
SELECT name,disk_name, path from system.parts WHERE table='table_4' and active;
SELECT disk_name, path, sum(rows), sum(bytes_on_disk), uniq(partition), count() FROM system.parts WHERE table='table_4' and active GROUP BY disk_name, path ORDER BY disk_name, path;
- Remove ‘default’ disk from new storage policy. In server shell:
nano /etc/clickhouse-server/config.d/storage.xml
###################/etc/clickhouse-server/config.d/storage.xml###########################
<yandex>
<storage_configuration>
<disks>
<!--
default disk is special, it always
exists even if not explicitly
configured here, but you can't change
it's path here (you should use <path>
on top level config instead)
-->
<default>
<!--
You can reserve some amount of free space
on any disk (including default) by adding
keep_free_space_bytes tag
-->
</default>
<disk_1> <!-- disk name -->
<path>/mnt/disk_1/clickhouse/</path>
</disk_1>
</disks>
<policies>
<move_from_default_to_disk_1> <!-- name for new storage policy -->
<volumes>
<disk_1_vol> <!-- name of volume -->
<!--
we have only one disk in that volume
and we reference here the name of disk
as configured above in <disks> section
-->
<disk>disk_1</disk>
</disk_1_vol>
</volumes>
<move_factor>0.99</move_factor>
</move_from_default_to_disk_1>
</policies>
</storage_configuration>
</yandex>
#########################################################################################
ClickHouse won’t auto-reload the config because we removed some disks from the storage policy, so we need to restart it by hand.
- Restart ClickHouse server.
- Make sure that storage policy uses the right disks.
SELECT * FROM system.storage_policies WHERE policy_name='move_from_default_to_disk_1';
33 - Object consistency in a cluster
List of missing tables
WITH (
SELECT groupArray(FQDN()) FROM clusterAllReplicas('{cluster}',system,one)
) AS hosts
SELECT database,
table,
arrayFilter( i-> NOT has(groupArray(host),i), hosts) miss_table
FROM (
SELECT FQDN() host, database, name table
FROM clusterAllReplicas('{cluster}',system,tables)
WHERE engine NOT IN ('Log','Memory','TinyLog')
)
GROUP BY database, table
HAVING miss_table <> []
SETTINGS skip_unavailable_shards=1;
┌─database─┬─table─┬─miss_table────────────────┐
│ default │ test │ ['host366.mynetwork.net'] │
└──────────┴───────┴───────────────────────────┘
List of inconsistent tables
SELECT database, name, engine, uniqExact(create_table_query) AS ddl
FROM clusterAllReplicas('{cluster}',system.tables)
GROUP BY database, name, engine HAVING ddl > 1
List of inconsistent columns
WITH (
SELECT groupArray(FQDN()) FROM clusterAllReplicas('{cluster}',system,one)
) AS hosts
SELECT database,
table,
column,
arrayStringConcat(arrayMap( i -> i.2 ||': '|| i.1,
(groupArray( (type,host) ) AS g)),', ') diff
FROM (
SELECT FQDN() host, database, table, name column, type
FROM clusterAllReplicas('{cluster}',system,columns)
)
GROUP BY database, table, column
HAVING length(arrayDistinct(g.1)) > 1 OR length(g.1) <> length(hosts)
SETTINGS skip_unavailable_shards=1;
┌─database─┬─table───┬─column────┬─diff────────────────────────────────┐
│ default │ z │ A │ ch-host22: Int64, ch-host21: String │
└──────────┴─────────┴───────────┴─────────────────────────────────────┘
List of inconsistent dictionaries
WITH (
SELECT groupArray(FQDN()) FROM clusterAllReplicas('{cluster}',system,one)
) AS hosts
SELECT database,
dictionary,
arrayFilter( i-> NOT has(groupArray(host),i), hosts) miss_dict,
arrayReduce('median', (groupArray((element_count, host)) AS ec).1 )
FROM (
SELECT FQDN() host, database, name dictionary, element_count
FROM clusterAllReplicas('{cluster}',system,dictionaries)
)
GROUP BY database, dictionary
HAVING miss_dict <> []
SETTINGS skip_unavailable_shards=1;
34 - Production Cluster Configuration Guide
Moving from a single ClickHouse server to a clustered format provides several benefits:
- Replication guarantees data integrity.
- Provides redundancy.
- Failover by being able to restart half of the nodes without encountering downtime.
Moving from an unsharded ClickHouse environment to a sharded cluster requires redesign of schema and queries. Starting with a sharded cluster from the beginning makes it easier in the future to scale the cluster up.
Setting up a ClickHouse cluster for a production environment requires the following stages:
- Hardware Requirements
- Network Configuration
- Create Host Names
- Monitoring Considerations
- Configuration Steps
- Setting Up Backups
- Staging Plans
- Upgrading The Cluster
34.1 - Backups
ClickHouse is currently at the design stage of creating some universal backup solution. Some custom backup strategies are:
- Each shard is backed up separately.
- FREEZE the table/partition. For more information, see Alter Freeze Partition.
- This creates hard links in shadow subdirectory.
- rsync that directory to a backup location, then remove that subfolder from shadow.
- Cloud users are recommended to use Rclone.
- Always add the full contents of the metadata subfolder that contains the current DB schema and clickhouse configs to your backup.
- For a second replica, it’s enough to copy metadata and configuration.
- Data in ClickHouse is already compressed with lz4. A backup can be compressed a bit further, but avoid CPU-heavy compression algorithms like gzip; use something like zstd instead.
The tool automating that process is clickhouse-backup.
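The FREEZE + rsync flow described above can be sketched as a small script. Everything here is illustrative: the table name `default.events`, the destination `backup-host:/backups`, and the data path `/var/lib/clickhouse` are assumptions to adapt, and the FREEZE step needs a running server. The snippet only writes the script; run it yourself once adjusted:

```shell
# Write a sketch of the FREEZE + rsync backup flow to a script file.
# Table, destination host, and paths are hypothetical placeholders.
cat > /tmp/backup_table.sh <<'EOF'
#!/bin/sh
set -e
TABLE="default.events"                      # table to back up (hypothetical)
DEST="backup-host:/backups/$(date +%F)"     # rsync destination (hypothetical)
clickhouse-client --query "ALTER TABLE ${TABLE} FREEZE"   # hard links under shadow/
rsync -a /var/lib/clickhouse/shadow/ "${DEST}/data/"
rsync -a /var/lib/clickhouse/metadata/ "${DEST}/metadata/" # always back up the schema
rm -rf /var/lib/clickhouse/shadow/*                        # clean up after copying
EOF
chmod +x /tmp/backup_table.sh
```

For repeated production use, clickhouse-backup automates essentially the same sequence.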
34.2 - Cluster Configuration FAQ
ClickHouse does not start, or some other unexpected behavior is happening
Check clickhouse logs, they are your friends:
tail -n 1000 /var/log/clickhouse-server/clickhouse-server.err.log | less
tail -n 10000 /var/log/clickhouse-server/clickhouse-server.log | less
How Do I Restrict Memory Usage?
See our knowledge base article and official documentation for more information.
ClickHouse died during big query execution
A misconfigured ClickHouse can try to allocate more RAM than is available on the system.
In that case an OS component called the OOM killer can kill the clickhouse process.
That event leaves traces inside the system logs (they can be checked by running the dmesg command).
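One way to check for those traces is to filter the kernel log for OOM killer messages. The `oom_filter` helper below is a hypothetical name; the demo runs it on a synthetic log line, while real usage would pipe dmesg through it:

```shell
# oom_filter: keep only kernel-log lines that look like OOM killer activity.
oom_filter() {
  grep -iE 'out of memory|oom-killer|killed process'
}

# Real usage (may need root): dmesg -T | oom_filter | grep -i clickhouse
# Demo on a synthetic log line so the filter can be seen working:
printf 'oom-killer: Killed process 1234 (clickhouse-serv)\nboot ok\n' \
  | oom_filter > /tmp/oom_demo.txt
cat /tmp/oom_demo.txt
```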
How Do I make huge 'GROUP BY' queries use less RAM?
Enable on-disk GROUP BY (it is slower, so it is disabled by default):
Set max_bytes_before_external_group_by to a value about 70-80% of your max_memory_usage value.
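As a sketch of that rule of thumb, the snippet below derives the threshold from a hypothetical 10 GB max_memory_usage and writes a users.d profile fragment (the file path and both values are illustrative, not recommendations):

```shell
# Derive the external GROUP BY spill threshold as ~80% of max_memory_usage
# and emit a users.d profile fragment. All values are hypothetical.
MAX_MEMORY_USAGE=10000000000                        # 10 GB per-query limit
EXTERNAL_GROUP_BY=$((MAX_MEMORY_USAGE * 80 / 100))  # 8 GB spill threshold
cat > /tmp/memory_usage.xml <<EOF
<yandex>
<profiles>
<default>
<max_memory_usage>${MAX_MEMORY_USAGE}</max_memory_usage>
<max_bytes_before_external_group_by>${EXTERNAL_GROUP_BY}</max_bytes_before_external_group_by>
</default>
</profiles>
</yandex>
EOF
cat /tmp/memory_usage.xml
```

In practice such a fragment would go into /etc/clickhouse-server/users.d/ (user-level settings apply without a restart).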
Data returned in chunks by clickhouse-client
See altinity-kb-clickhouse-client
I Can’t Connect From Other Hosts. What do I do?
Check the <listen_host> setting in config.xml and your firewall rules.
34.3 - Cluster Configuration Process
So you set up 3 nodes with zookeeper (zookeeper1, zookeeper2, zookeeper3 - How to install zookeeper?) and 4 nodes with ClickHouse (clickhouse-sh1r1, clickhouse-sh1r2, clickhouse-sh2r1, clickhouse-sh2r2 - how to install ClickHouse?). Now we need to make them work together.
Use ansible/puppet/salt or other systems to control the servers’ configurations.
- Configure ClickHouse access to Zookeeper by adding the file zookeeper.xml in /etc/clickhouse-server/config.d/ folder. This file must be placed on all ClickHouse servers.
<yandex>
<zookeeper>
<node>
<host>zookeeper1</host>
<port>2181</port>
</node>
<node>
<host>zookeeper2</host>
<port>2181</port>
</node>
<node>
<host>zookeeper3</host>
<port>2181</port>
</node>
</zookeeper>
</yandex>
- On each server put the file macros.xml in the /etc/clickhouse-server/config.d/ folder.
<yandex>
<!--
These macros are defined per server,
and they can be used in DDL to make the DB schema cluster/server neutral
-->
<macros>
<cluster>prod_cluster</cluster>
<shard>01</shard>
<replica>clickhouse-sh1r1</replica> <!-- better - use the same as hostname -->
</macros>
</yandex>
- On each server place the file cluster.xml in the /etc/clickhouse-server/config.d/ folder. Before 20.10, ClickHouse used the default user to connect to other nodes (this is configurable, and other users can be used); since 20.10 we recommend passwordless intercluster authentication based on a common secret (HMAC auth).
<yandex>
<remote_servers>
<prod_cluster> <!-- you need to give some name to the cluster -->
<!--
<secret>some_random_string, same on all cluster nodes, keep it safe</secret>
-->
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse-sh1r1</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-sh1r2</host>
<port>9000</port>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse-sh2r1</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-sh2r2</host>
<port>9000</port>
</replica>
</shard>
</prod_cluster>
</remote_servers>
</yandex>
- A good practice is to create 2 additional cluster configurations, similar to prod_cluster above, with the following distinctions:
- all-replicated: All nodes are listed as replicas in a single shard.
- all-sharded: All nodes are listed as separate shards with no replicas.
Once this is complete, other queries that span nodes can be performed. For example:
CREATE TABLE test_table_local ON CLUSTER '{cluster}'
(
id UInt8
)
Engine=ReplicatedMergeTree('/clickhouse/tables/{database}/{table}/{shard}', '{replica}')
ORDER BY (id);
That will create a table on all servers in the cluster. You can insert data into this table and it will be replicated automatically to the other replicas. To store the data or read the data from all shards at the same time, create a Distributed table that links to the ReplicatedMergeTree table.
CREATE TABLE test_table ON CLUSTER '{cluster}' AS test_table_local
Engine=Distributed('{cluster}', 'default', 'test_table_local', rand());
Hardening ClickHouse Security
See https://docs.altinity.com/operationsguide/security/clickhouse-hardening-guide/
Additional Settings
See altinity-kb-settings-to-adjust
Users
Disable the default users default and readonly, or add passwords for them, if your server is accessible from non-trusted networks.
If you add a password to the default user, you will need to adjust the cluster configuration, since the other servers need to know the default user's password to connect to each other.
If you're inside a trusted network, you can leave the default user without a password to allow the ClickHouse nodes to communicate with each other.
Engines & ClickHouse building blocks
For general explanations of roles of different engines - check the post Distributed vs Shard vs Replicated ahhh, help me!!!.
Zookeeper Paths
Use conventions for zookeeper paths. For example, use:
ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/table_name', '{replica}')
for:
SELECT * FROM system.zookeeper WHERE path='/...';
Configuration Best Practices
Attribution: modified from a post [on GitHub by Mikhail Filimonov](https://github.com/ClickHouse/ClickHouse/issues/3607#issuecomment-440235298).
The following are recommended Best Practices when it comes to setting up a ClickHouse Cluster with Zookeeper:
- Don’t edit/overwrite default configuration files. Sometimes a newer version of ClickHouse introduces some new settings or changes the defaults in config.xml and users.xml.
- Set configurations via the extra files in the conf.d directory. For example, to overwrite the listen interface, save the file config.d/listen.xml with the following:
<?xml version="1.0"?>
<yandex>
<listen_host replace="replace">::</listen_host>
</yandex>
- The same is true for users. For example, change the default profile by putting the file in users.d/profile_default.xml:
<?xml version="1.0"?>
<yandex>
<profiles>
<default replace="replace">
<max_memory_usage>15000000000</max_memory_usage>
<max_bytes_before_external_group_by>12000000000</max_bytes_before_external_group_by>
<max_bytes_before_external_sort>12000000000</max_bytes_before_external_sort>
<distributed_aggregation_memory_efficient>1</distributed_aggregation_memory_efficient>
<use_uncompressed_cache>0</use_uncompressed_cache>
<load_balancing>random</load_balancing>
<log_queries>1</log_queries>
<max_execution_time>600</max_execution_time>
</default>
</profiles>
</yandex>
- Or you can create a user by putting a file users.d/user_xxx.xml (since 20.5 you can also use CREATE USER)
<?xml version="1.0"?>
<yandex>
<users>
<xxx>
<!-- PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-' -->
<password_sha256_hex>...</password_sha256_hex>
<networks incl="networks" />
<profile>readonly</profile>
<quota>default</quota>
<allow_databases incl="allowed_databases" />
</xxx>
</users>
</yandex>
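The hashing command from the comment above can be run as follows (a sketch; the 12-byte random password length is an arbitrary choice):

```shell
# Generate a random password and the sha256 hex digest expected by
# <password_sha256_hex> (mirrors the one-liner in the XML comment above).
PASSWORD=$(head -c 12 /dev/urandom | base64)
HASH=$(printf '%s' "$PASSWORD" | sha256sum | awk '{print $1}')
echo "password: $PASSWORD"
echo "password_sha256_hex: $HASH"
echo "$HASH" > /tmp/pwd_hash.txt   # keep the digest for pasting into users.d
```

The plaintext password goes to the client; only the hex digest goes into the users.d file.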
- Some parts of the configuration will contain repeated elements (like allowed IPs for all the users). To avoid repetition, use a substitutions file. By default it is /etc/metrika.xml, but you can change it, for example to /etc/clickhouse-server/substitutions.xml, with the <include_from> section of the main config. Put the repeated parts into the substitutions file, like this:
<?xml version="1.0"?>
<yandex>
<networks>
<ip>::1</ip>
<ip>127.0.0.1</ip>
<ip>10.42.0.0/16</ip>
<ip>192.168.0.0/24</ip>
</networks>
</yandex>
These files can be common for all the servers inside the cluster or can be individualized per server. If you choose to use one substitutions file per cluster, not per node, you will also need to generate the file with macros, if macros are used.
This way you have full flexibility; you’re not limited to the settings described in the template. You can change any settings per server or data center just by assigning files with some settings to that server or server group. It becomes easy to navigate, edit, and assign files.
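If you generate the macros file per server, one hedged sketch is to derive it from the hostname. The naming convention clickhouse-sh&lt;N&gt;r&lt;M&gt; and the output path below are assumptions taken from the earlier examples, not a standard:

```shell
# Sketch: derive macros.xml from a structured hostname like clickhouse-sh1r2.
# In practice H would come from $(hostname -s); the convention is hypothetical.
H=clickhouse-sh1r2
SHARD=$(echo "$H" | sed -E 's/.*sh([0-9]+)r[0-9]+$/\1/')   # -> 1
cat > /tmp/macros.xml <<EOF
<yandex>
<macros>
<cluster>prod_cluster</cluster>
<shard>0${SHARD}</shard>
<replica>${H}</replica>
</macros>
</yandex>
EOF
cat /tmp/macros.xml
```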
Other Configuration Recommendations
Other configurations that should be evaluated:
- <listen_host> in config.xml: determines which IP addresses and ports the ClickHouse servers listen on for incoming communications.
- <max_memory_usage> and <max_bytes_before_external_group_by> in users.xml: these are part of the profile.
- <max_execution_time>
- <log_queries>
The following extra debug logs should be considered:
- part_log
- text_log
Understanding The Configuration
ClickHouse configuration stores most of its information in two files:
- config.xml: stores server configuration parameters. They are server-wide, some are hierarchical, and most of them can't be changed at runtime. The list of settings that apply without a restart changes from version to version. Some settings can be verified using system tables, for example:
- macros (system.macros)
- remote_servers (system.clusters)
- users.xml: Configure users, and user level / session level settings.
- Each user can change these during their session by:
- Using parameter in http query
- By using parameter for clickhouse-client
- Sending query like set allow_experimental_data_skipping_indices=1.
- Those settings and their current values are visible in system.settings. You can make some settings global by editing default profile in users.xml, which does not need restart.
- You can forbid users to change their settings by using readonly=2 for that user, or using setting constraints.
- Changes in users.xml are applied w/o restart.
For both config.xml and users.xml, it’s preferable to put adjustments in the config.d and users.d subfolders instead of editing config.xml and users.xml directly.
You can check if the config file was reread by checking /var/lib/clickhouse/preprocessed_configs/ folder.
34.4 - Hardware Requirements
ClickHouse
ClickHouse will use all available hardware to maximize performance. So the more hardware - the better. As of this publication, the hardware requirements are:
- Minimum Hardware: 4-core CPU with support of SSE4.2, 16 Gb RAM, 1Tb HDD.
- Recommended for development and staging environments.
- SSE4.2 is required, and going below 4 Gb of RAM is not recommended.
- Recommended Hardware: >=16-cores, >=64Gb RAM, HDD-raid or SSD.
- For processing up to hundreds of millions / billions of rows.
For clouds: disk throughput is the more important factor compared to IOPS. Be aware of burst / baseline disk speed difference.
See also: https://clickhouse.tech/benchmark/hardware/
Zookeeper
Zookeeper requires separate servers from those used for ClickHouse. Zookeeper has poor performance when installed on the same node as ClickHouse.
Hardware Requirements for Zookeeper:
- Fast disk speed (ideally NVMe, 128Gb should be enough).
- Any modern CPU (one core, better 2)
- 4Gb of RAM
For clouds - be careful with burstable network disks (like gp2 on AWS): you may need up to 1000 IOPS on the disk in the long run, so gp3 with a 3000 IOPS baseline is a better choice.
The number of Zookeeper instances depends on the environment:
- Production: 3 is an optimal number of zookeeper instances.
- Development and Staging: 1 zookeeper instance is sufficient.
See also:
- https://docs.altinity.com/operationsguide/clickhouse-zookeeper/
- altinity-kb-proper-setup
- zookeeper-monitoring
ClickHouse Hardware Configuration
Configure the servers according to the ClickHouse Usage Recommendations.
Test Your Hardware
Be sure to test the following:
- RAM speed.
- Network speed.
- Storage speed.
It’s better to find any performance issues before installing ClickHouse.
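As a minimal illustration of a storage check, a crude sequential-write smoke test can be done with dd (the 16 MB size and /tmp path keep it quick and harmless; for real benchmarks use a dedicated tool such as fio, with a much larger file, on the actual data volume):

```shell
# Crude sequential-write smoke test. conv=fdatasync forces the data to disk
# before dd reports, so the printed throughput includes the flush.
dd if=/dev/zero of=/tmp/ch_disk_test bs=1M count=16 conv=fdatasync 2>&1 | tail -n1
ls -l /tmp/ch_disk_test
```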
34.5 - Network Configuration
Networking And Server Room Planning
The network used for your ClickHouse cluster should be a fast network, ideally 10 Gbit or more. ClickHouse nodes generate a lot of traffic to exchange data between nodes (port 9009 for replication, and 9000 for distributed queries). ZooKeeper traffic in normal circumstances is moderate, but in some special cases it can also be very significant.
For ZooKeeper, low latency is more important than bandwidth.
Keep the replicas isolated on the hardware level. This allows for cluster failover from possible outages.
- For Physical Environments: Avoid placing 2 ClickHouse replicas on the same server rack. Ideally, they should be on isolated network switches and an isolated power supply.
- For Clouds Environments: Use different availability zones between the ClickHouse replicas when possible (but be aware of the interzone traffic costs)
The same considerations apply to the Zookeeper nodes.
For example:
Rack | Server | Server | Server | Server |
---|---|---|---|---|
Rack 1 | CH_SHARD1_R1 | CH_SHARD2_R1 | CH_SHARD3_R1 | ZOO_1 |
Rack 2 | CH_SHARD1_R2 | CH_SHARD2_R2 | CH_SHARD3_R2 | ZOO_2 |
Rack 3 | | | | ZOO_3 |
Network Ports And Firewall
ClickHouse listens on the following ports:
- 9000: clickhouse-client, native clients, other clickhouse-servers connect to here.
- 8123: HTTP clients
- 9009: Other replicas will connect here to download data.
For more information, see CLICKHOUSE NETWORKING, PART 1.
Zookeeper listens on the following ports:
- 2181: Client connections.
- 2888: Inter-ensemble connections.
- 3888: Leader election.
Outbound traffic from ClickHouse connects to the following ports:
- ZooKeeper: On port 2181.
- Other CH nodes in the cluster: On port 9000 and 9009.
- Dictionary sources: Depending on what was configured such as HTTP, MySQL, Mongo, etc.
- Kafka or Hadoop: If those integrations were enabled.
SSL
For non-trusted networks enable SSL/HTTPS. If acceptable, it is better to keep interserver communications unencrypted for performance reasons.
Naming Schema
The best time to start creating a naming schema for the servers is before they’re created and configured.
There are a few features based on good server naming in ClickHouse:
- clickhouse-client prompts: Allows a different prompt for clickhouse-client per server hostname.
- Nearest hostname load balancing: For more information, see Nearest Hostname.
A good option is to use the following:
{datacenter}-{serverroom}-{rack identifier}-{clickhouse cluster identifier}-{shard number or server number}.
Other examples:
- rxv-olap-ch-master-sh01-r01:
- rxv - location (rack#15)
- olap - product name
- ch = clickhouse
- master = stage
- sh01 = shard 1
- r01 = replica 1
- hetnzerde1-ch-prod-01.local:
- hetnzerde1 - location (also replica id)
- ch = clickhouse
- prod = stage
- 01 - server number / shard number in that DC
- sh01.ch-front.dev.aws-east1a.example.com:
- sh01 - shard 01
- ch-front - cluster name
- dev = stage
- aws = cloud provider
- east1a = region and availability zone
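A structured name like the last example can be split mechanically, which is handy when generating macros or client prompts. A sketch, using the hostname and field meanings from the example above:

```shell
# Split a structured hostname into its parts (scheme from the example above).
HOST=sh01.ch-front.dev.aws-east1a.example.com
SHARD=$(echo "$HOST" | cut -d. -f1)     # sh01
CLUSTER=$(echo "$HOST" | cut -d. -f2)   # ch-front
STAGE=$(echo "$HOST" | cut -d. -f3)     # dev
ZONE=$(echo "$HOST" | cut -d. -f4)      # aws-east1a
echo "$SHARD $CLUSTER $STAGE $ZONE" > /tmp/host_parts.txt
cat /tmp/host_parts.txt
```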
Host Name References
- What are the best practices for domain names (dev, staging, production)?
- 9 Best Practices and Examples for Working with Kubernetes Labels
- Thoughts On Hostname Nomenclature
Additional Hostname Tips
- Hostnames configured on the server should not change. If you do need to change the host name, one reference to use is How to Change Hostname on Ubuntu 18.04.
- The server should be accessible to other servers in the cluster via its hostname. Otherwise you will need to configure interserver_hostname in your config.
- Ensure that hostname --fqdn and getent hosts $(hostname --fqdn) return the correct name and IP.
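Both checks can be combined into a quick script (the output depends entirely on your DNS/hosts setup):

```shell
# Sanity-check that the FQDN is set and resolves back to an address.
FQDN=$(hostname --fqdn 2>/dev/null || hostname)   # fall back to the short name
echo "fqdn: $FQDN" | tee /tmp/fqdn_check.txt
getent hosts "$FQDN" || echo "warning: $FQDN does not resolve"
```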
34.6 - Version Upgrades
The update itself is simple: update the packages, then restart the clickhouse-server service.
- Check if the version you want to upgrade to is stable. We highly recommend the Altinity ClickHouse Stable Releases.
- Review the changelog to ensure that no configuration changes are needed.
- Update staging and test to verify all systems are working.
- Prepare and test downgrade procedures so the server can be returned to the previous version if necessary.
- Start with a “canary” update. This is one replica with one shard that is upgraded to make sure that the procedure works.
- Test and verify that everything works properly. Check for any errors in the log files.
- If everything is working well, update the rest of the cluster.
For small clusters, the BlueGreenDeployment technique is also a good option.
35 - Replication queue
SELECT
database,
table,
type,
max(last_exception),
max(postpone_reason),
min(create_time),
max(last_attempt_time),
max(last_postpone_time),
max(num_postponed) AS max_postponed,
max(num_tries) AS max_tries,
min(num_tries) AS min_tries,
countIf(last_exception != '') AS count_err,
countIf(num_postponed > 0) AS count_postponed,
countIf(is_currently_executing) AS count_executing,
count() AS count_all
FROM system.replication_queue
GROUP BY
database,
table,
type
ORDER BY count_all DESC
36 - Schema migration tools for ClickHouse
- golang-migrate tool - see golang-migrate
- bytebase
- Flyway - there are a lot of PRs introducing ClickHouse support, but the maintainer doesn't merge them (maybe he will change his mind soon); it's not hard to build flyway from one of those PRs (latest at the top):
- https://github.com/flyway/flyway/pull/3333 Сlickhouse support
- https://github.com/flyway/flyway/pull/3134 Сlickhouse support
- https://github.com/flyway/flyway/pull/3133 Add support clickhouse
- https://github.com/flyway/flyway/pull/2981 Clickhouse replicated
- https://github.com/flyway/flyway/pull/2640 Yet another ClickHouse support
- https://github.com/flyway/flyway/pull/2166 Clickhouse support (#1772)
- https://github.com/flyway/flyway/pull/1773 Fixed #1772: Add support for ClickHouse (https://clickhouse.yandex/)
- liquibase
- custom tool for ClickHouse
- phpMigrations
- dbmate
know more?
36.1 - golang-migrate
migrate is a simple schema migration tool written in golang. No external dependencies are required (like an interpreter or JRE), only one platform-specific executable: golang-migrate/migrate.
migrate supports several databases, including ClickHouse (support was introduced by @kshvakov).
To store information about the migration state, migrate creates one additional table in the target database; by default that table is called schema_migrations.
Install
Download the migrate executable for your platform and put it in a folder listed in your $PATH.
#wget https://github.com/golang-migrate/migrate/releases/download/v3.2.0/migrate.linux-amd64.tar.gz
wget https://github.com/golang-migrate/migrate/releases/download/v4.14.1/migrate.linux-amd64.tar.gz
tar -xzf migrate.linux-amd64.tar.gz
mkdir -p ~/bin
mv migrate.linux-amd64 ~/bin/migrate
rm migrate.linux-amd64.tar.gz
Sample usage
mkdir migrations
echo 'create table test(id UInt8) Engine = Memory;' > migrations/000001_my_database_init.up.sql
echo 'DROP TABLE test;' > migrations/000001_my_database_init.down.sql
# you can also auto-create file with new migrations with automatic numbering like that:
migrate create -dir migrations -seq -digits 6 -ext sql my_database_init
edit migrations/000001_my_database_init.up.sql & migrations/000001_my_database_init.down.sql
migrate -database 'clickhouse://localhost:9000' -path ./migrations up
1/u my_database_init (6.502974ms)
migrate -database 'clickhouse://localhost:9000' -path ./migrations down
1/d my_database_init (2.164394ms)
# clears the database (use carefully - will not ask any confirmations)
migrate -database 'clickhouse://localhost:9000' -path ./migrations drop
Connection string format
clickhouse://host:port?username=user&password=qwerty&database=clicks
URL Query | Description |
---|---|
x-migrations-table | Name of the migrations table |
database | The name of the database to connect to |
username | The user to sign in as |
password | The user's password |
host | The host to connect to |
port | The port to bind to |
secure | Use a secure connection (for self-signed certificates also add skip_verify=1) |
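For illustration, a connection string can be assembled from its parts like this (all values are hypothetical):

```shell
# Build a migrate connection string from parts (hypothetical host/credentials).
CH_HOST=ch-host; CH_PORT=9000; CH_USER=app; CH_PASS=secret; CH_DB=clicks
URL="clickhouse://${CH_HOST}:${CH_PORT}?username=${CH_USER}&password=${CH_PASS}&database=${CH_DB}&x-migrations-table=app_migrations"
echo "$URL" > /tmp/migrate_url.txt
cat /tmp/migrate_url.txt
# Then: migrate -database "$URL" -path ./migrations up
```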
Replicated / Distributed / Cluster environments
By default migrate creates the table schema_migrations with the following structure:
CREATE TABLE schema_migrations (
version UInt32,
dirty UInt8,
sequence UInt64
) ENGINE = TinyLog
That allows storing the schema version locally.
If you need to use migrate in a multi-server environment (replicated / cluster), you should create schema_migrations manually with the same structure and with the appropriate engine (Replicated / Distributed); otherwise the other servers will not know the version of the DB schema. As an alternative, you can force the current version number on another server manually, like this:
migrate -database 'clickhouse://localhost:9000' -path ./migrations force 123456 # force version 123456
Known issues
could not load time location: unknown time zone Europe/Moscow in line 0:
This happens due to the missing tzdata package in the migrate/migrate docker image of golang-migrate. There are 2 possible solutions:
- You can build your own golang-migrate image from the official one with the tzdata package added.
- If you are using it as part of your CI, add installing the tzdata package as a step in CI before using golang-migrate.
Related GitHub issues: https://github.com/golang-migrate/migrate/issues/494 https://github.com/golang-migrate/migrate/issues/201
Using database name in x-migrations-table
- migrate creates a table literally named database.table.
- When running migrations, migrate actually uses the database from the query settings and treats database.table as the table name: other_database.`database.table`
37 - Server config files
Config management (recommended structure)
ClickHouse server config consists of two parts: server settings (config.xml) and user settings (users.xml).
By default they are stored in the folder /etc/clickhouse-server/ in two files: config.xml & users.xml.
We suggest never changing vendor config files; place your changes into separate .xml files in sub-folders instead. This is easier to maintain and eases ClickHouse upgrades.
/etc/clickhouse-server/users.d – sub-folder for user settings.
/etc/clickhouse-server/config.d – sub-folder for server settings.
/etc/clickhouse-server/conf.d – sub-folder for any (both) settings.
File names of your xml files can be arbitrary but they are applied in alphabetical order.
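The alphabetical-order rule is easy to inspect by listing the folder sorted; the directory and file names below are made up for illustration:

```shell
# Fragments are merged in alphabetical order of file name; a sorted listing
# shows the effective order. Directory contents here are illustrative only.
mkdir -p /tmp/demo_config.d
touch /tmp/demo_config.d/10_listen.xml \
      /tmp/demo_config.d/macros.xml \
      /tmp/demo_config.d/zz_overrides.xml
ls /tmp/demo_config.d | sort | tee /tmp/config_order.txt
```

A zz_ prefix is a common trick to make a fragment apply last and win conflicts.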
Examples:
$ cat /etc/clickhouse-server/config.d/listen_host.xml
<?xml version="1.0" ?>
<yandex>
<listen_host>::</listen_host>
</yandex>
$ cat /etc/clickhouse-server/config.d/macros.xml
<?xml version="1.0" ?>
<yandex>
<macros>
<cluster>test</cluster>
<replica>host22</replica>
<shard>0</shard>
<server_id>41295</server_id>
<server_name>host22.server.com</server_name>
</macros>
</yandex>
cat /etc/clickhouse-server/config.d/zoo.xml
<?xml version="1.0" ?>
<yandex>
<zookeeper>
<node>
<host>localhost</host>
<port>2181</port>
</node>
</zookeeper>
<distributed_ddl>
<path>/clickhouse/test/task_queue/ddl</path>
</distributed_ddl>
</yandex>
cat /etc/clickhouse-server/users.d/enable_access_management_for_user_default.xml
<?xml version="1.0" ?>
<yandex>
<users>
<default>
<access_management>1</access_management>
</default>
</users>
</yandex>
cat /etc/clickhouse-server/users.d/memory_usage.xml
<?xml version="1.0" ?>
<yandex>
<profiles>
<default>
<max_bytes_before_external_group_by>25290221568</max_bytes_before_external_group_by>
<max_memory_usage>50580443136</max_memory_usage>
</default>
</profiles>
</yandex>
BTW, you can define any macro in your configuration and use them in Zookeeper paths
ReplicatedMergeTree('/clickhouse/{cluster}/tables/my_table','{replica}')
or in your code using function getMacro:
CREATE OR REPLACE VIEW srv_server_info AS
SELECT (SELECT getMacro('shard')) AS shard_num,
       (SELECT getMacro('server_name')) AS server_name,
       (SELECT getMacro('server_id')) AS server_key
Settings can be appended to an XML tree (default behaviour) or replaced or removed.
Example of how to delete tcp_port & http_port defined on a higher level in the main config.xml (this disables the open TCP & HTTP ports if you configured secure SSL):
cat /etc/clickhouse-server/config.d/disable_open_network.xml
<?xml version="1.0"?>
<yandex>
<http_port remove="1"/>
<tcp_port remove="1"/>
</yandex>
Example of how to replace the remote_servers section defined on a higher level in the main config.xml (it allows removing the default test clusters):
<?xml version="1.0" ?>
<yandex>
<remote_servers replace="1">
<mycluster>
....
</mycluster>
</remote_servers>
</yandex>
Settings & restart
General 'rule of thumb':
- server settings (config.xml and config.d) changes require a restart;
- user settings (users.xml and users.d) changes don't require a restart.
But there are exceptions from those rules (see below).
Server config (config.xml) sections which don’t require restart
<max_server_memory_usage>
<max_server_memory_usage_to_ram_ratio>
<max_table_size_to_drop>
<max_partition_size_to_drop>
<max_concurrent_queries>
<macros>
<remote_servers>
<dictionaries_config>
<user_defined_executable_functions_config>
<models_config>
<keeper_server>
<zookeeper>
(but reconnect doesn't happen automatically)
<storage_configuration>
<user_directories>
<access_control_path>
<encryption_codecs>
<logger>
(since 21.11)
Those sections (live in separate files):
<dictionaries>
<functions>
<models>
User settings which require restart.
Most user setting changes don't require a restart, but they are applied at connect time, so an existing connection may still use old user-level settings. That means that the new setting will be applied to new sessions / after reconnect.
The list of user setting which require server restart:
<background_buffer_flush_schedule_pool_size>
<background_pool_size>
<background_merges_mutations_concurrency_ratio>
<background_move_pool_size>
<background_fetches_pool_size>
<background_common_pool_size>
<background_schedule_pool_size>
<background_message_broker_schedule_pool_size>
<background_distributed_schedule_pool_size>
<max_replicated_fetches_network_bandwidth_for_server>
<max_replicated_sends_network_bandwidth_for_server>
See also select * from system.settings where description ilike '%start%'
Also there are several 'long-running' user sessions which are almost never restarted and can keep settings from the server start (e.g. DDLWorker, Kafka, and some other service things).
Dictionaries
We suggest storing each dictionary description in a separate (own) file in the /etc/clickhouse-server/dict sub-folder.
$ cat /etc/clickhouse-server/dict/country.xml
<?xml version="1.0"?>
<dictionaries>
<dictionary>
<name>country</name>
<source>
<http>
...
</http>
</source>
</dictionary>
</dictionaries>
and add to the configuration
$ cat /etc/clickhouse-server/config.d/dictionaries.xml
<?xml version="1.0"?>
<yandex>
<dictionaries_config>dict/*.xml</dictionaries_config>
<dictionaries_lazy_load>true</dictionaries_lazy_load>
</yandex>
dict/*.xml – a relative path; the server seeks files in the folder /etc/clickhouse-server/dict. More info in Multiple Clickhouse instances.
incl attribute & metrica.xml
The incl attribute allows including an XML section from a special include file multiple times.
By default the include file is /etc/metrika.xml. You can use many include files for each XML section.
For example to avoid repetition of user/password for each dictionary you can create an XML file:
$ cat /etc/clickhouse-server/dict_sources.xml
<?xml version="1.0"?>
<yandex>
<mysql_config>
<port>3306</port>
<user>user</user>
<password>123</password>
<replica>
<host>mysql_host</host>
<priority>1</priority>
</replica>
<db>my_database</db>
</mysql_config>
</yandex>
Include this file:
$ cat /etc/clickhouse-server/config.d/dictionaries.xml
<?xml version="1.0"?>
<yandex>
...
<include_from>/etc/clickhouse-server/dict_sources.xml</include_from>
</yandex>
And use it in dictionary descriptions (incl="mysql_config"):
$ cat /etc/clickhouse-server/dict/country.xml
<?xml version="1.0"?>
<dictionaries>
<dictionary>
<name>country</name>
<source>
<mysql incl="mysql_config">
<table>my_table</table>
<invalidate_query>select max(id) from my_table</invalidate_query>
</mysql>
</source>
...
</dictionary>
</dictionaries>
Multiple Clickhouse instances at one host
By default Clickhouse server configs are in /etc/clickhouse-server/ because clickhouse-server runs with the parameter --config-file /etc/clickhouse-server/config.xml
config-file is defined in startup scripts:
- /etc/init.d/clickhouse-server – init-V
- /etc/systemd/system/clickhouse-server.service – systemd
Clickhouse uses the path from the config-file parameter as the base folder and seeks other configs by relative path. All sub-folders users.d / config.d are relative to it.
You can start multiple clickhouse-servers, each with its own --config-file.
For example:
/usr/bin/clickhouse-server --config-file /etc/clickhouse-server-node1/config.xml
/etc/clickhouse-server-node1/ config.xml ... users.xml
/etc/clickhouse-server-node1/config.d/disable_open_network.xml
/etc/clickhouse-server-node1/users.d/....
/usr/bin/clickhouse-server --config-file /etc/clickhouse-server-node2/config.xml
/etc/clickhouse-server-node2/ config.xml ... users.xml
/etc/clickhouse-server-node2/config.d/disable_open_network.xml
/etc/clickhouse-server-node2/users.d/....
If you need to run multiple servers for CI purposes you can combine all settings in a single fat XML file and start ClickHouse without config folders/sub-folders.
/usr/bin/clickhouse-server --config-file /tmp/ch1.xml
/usr/bin/clickhouse-server --config-file /tmp/ch2.xml
/usr/bin/clickhouse-server --config-file /tmp/ch3.xml
Each ClickHouse instance must work with its own data folder and tmp folder.
By default ClickHouse uses /var/lib/clickhouse/. It can be overridden in the path settings:
<path>/data/clickhouse-ch1/</path>
<tmp_path>/data/clickhouse-ch1/tmp/</tmp_path>
<user_files_path>/data/clickhouse-ch1/user_files/</user_files_path>
<local_directory>
<path>/data/clickhouse-ch1/access/</path>
</local_directory>
<format_schema_path>/data/clickhouse-ch1/format_schemas/</format_schema_path>
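Besides separate data folders, each instance also needs its own set of ports, otherwise the second server will fail to bind. A sketch of such a per-instance override (the port numbers here are arbitrary examples):

```xml
<yandex>
    <tcp_port>9001</tcp_port>
    <http_port>8124</http_port>
    <interserver_http_port>9010</interserver_http_port>
</yandex>
```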
preprocessed_configs
ClickHouse server watches config files and folders. When you change, add or remove XML files, ClickHouse immediately assembles them into a combined file. These combined files are stored in the /var/lib/clickhouse/preprocessed_configs/ folder.
You can verify that your changes are valid by checking /var/lib/clickhouse/preprocessed_configs/config.xml and /var/lib/clickhouse/preprocessed_configs/users.xml.
If something is wrong with your settings, e.g. an unclosed XML element or a typo, you will see alerts about these mistakes in /var/log/clickhouse-server/clickhouse-server.log
If you see your changes in preprocessed_configs it does not mean that the changes are applied on the running server; check Settings & restart.
38 - Settings to adjust
- query_log and other *_log tables - set up TTL, or some other cleanup procedures.
cat /etc/clickhouse-server/config.d/query_log.xml
<yandex>
    <query_log replace="1">
        <database>system</database>
        <table>query_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <engine>
            ENGINE = MergeTree
            PARTITION BY event_date
            ORDER BY (event_time)
            TTL event_date + INTERVAL 90 DAY
            SETTINGS ttl_only_drop_parts=1
        </engine>
    </query_log>
</yandex>
- query_thread_log - typically not too useful for end users; you can disable it (or set up a TTL). We do not recommend removing this table completely, as you might need it for debugging one day, and thread logging can easily be disabled/enabled without a restart through user profiles:
$ cat /etc/clickhouse-server/users.d/z_log_queries.xml
<yandex>
    <profiles>
        <default>
            <log_query_threads>0</log_query_threads>
        </default>
    </profiles>
</yandex>
- If you have good monitoring outside ClickHouse you don't need to store the history of metrics in ClickHouse:
cat /etc/clickhouse-server/config.d/disable_metric_logs.xml
<yandex>
    <metric_log remove="1" />
    <asynchronous_metric_log remove="1" />
</yandex>
- part_log - may be nice, especially at the beginning / during system tuning and analysis:
cat /etc/clickhouse-server/config.d/part_log.xml
<yandex>
    <part_log replace="1">
        <database>system</database>
        <table>part_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <engine>
            ENGINE = MergeTree
            PARTITION BY toYYYYMM(event_date)
            ORDER BY (event_time)
            TTL toStartOfMonth(event_date) + INTERVAL 3 MONTH
            SETTINGS ttl_only_drop_parts=1
        </engine>
    </part_log>
</yandex>
- on older versions log_queries is disabled by default; it's worth having it enabled always:
$ cat /etc/clickhouse-server/users.d/log_queries.xml
<yandex>
    <profiles>
        <default>
            <log_queries>1</log_queries>
        </default>
    </profiles>
</yandex>
- quite often you want on-disk GROUP BY / ORDER BY enabled (both are disabled by default):
cat /etc/clickhouse-server/users.d/enable_on_disk_operations.xml
<yandex>
    <profiles>
        <default>
            <max_bytes_before_external_group_by>2000000000</max_bytes_before_external_group_by>
            <max_bytes_before_external_sort>2000000000</max_bytes_before_external_sort>
        </default>
    </profiles>
</yandex>
- quite often you want to create more users with different limitations. The most typical one is <max_execution_time>. It's not really a way to plan/share existing resources better, but it at least disciplines users. Introducing some restrictions on query complexity can also be a good option to discipline users.
You can find the preset example here. Also, force_index_by_date + force_primary_key can be a nice idea to avoid queries that 'accidentally' do full scans, as well as max_concurrent_queries_for_user.
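A sketch of such a restricted profile, e.g. in a file like /etc/clickhouse-server/users.d/restricted_profile.xml (the profile name, file name and values here are just examples to adapt):

```xml
<yandex>
    <profiles>
        <restricted>
            <max_execution_time>60</max_execution_time>
            <max_concurrent_queries_for_user>10</max_concurrent_queries_for_user>
            <force_index_by_date>1</force_index_by_date>
            <force_primary_key>1</force_primary_key>
        </restricted>
    </profiles>
</yandex>
```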
- merge_tree settings:
  - max_bytes_to_merge_at_max_space_in_pool (may be reduced in some scenarios)
  - fsync_* settings
  - inactive_parts_to_throw_insert - can be enabled
  - replicated_deduplication_window - can be extended if a single insert creates a lot of parts
  - merge_with_ttl_timeout - when you use TTL
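These merge_tree settings can be changed globally via a config override, e.g. a file like /etc/clickhouse-server/config.d/merge_tree.xml (the values below are illustrative, not recommendations):

```xml
<yandex>
    <merge_tree>
        <max_bytes_to_merge_at_max_space_in_pool>53687091200</max_bytes_to_merge_at_max_space_in_pool>
        <inactive_parts_to_throw_insert>100</inactive_parts_to_throw_insert>
        <replicated_deduplication_window>1000</replicated_deduplication_window>
        <merge_with_ttl_timeout>3600</merge_with_ttl_timeout>
    </merge_tree>
</yandex>
```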
- settings: default_database_engine / insert_distributed_sync / fsync_metadata / do_not_merge_across_partitions_select_final / fsync
- memory usage per server / query / user: memory configuration settings
See also:
https://docs.altinity.com/operationsguide/security/clickhouse-hardening-guide/
39 - Shutting down a node
It's possible to shut down the server on the fly, but that would cause some running queries to fail.
A safer way:
- Remove the server (which is going to be disabled) from the remote_servers section of config.xml on all servers.
- Remove the server from the load balancer, so new queries won't hit it.
- Detach Kafka / Rabbit / Buffer tables (if used), and Materialized* databases.
- Wait until all already running queries finish their execution on it. You can check this via the query:
SHOW PROCESSLIST;
- Ensure there is no pending data in distributed tables:
SELECT * FROM system.distribution_queue;
SYSTEM FLUSH DISTRIBUTED <table_name>;
- Run a sync replica query on the related shard replicas (other than the one you are removing):
SYSTEM SYNC REPLICA db.table;
- Shut down the server.
The SYSTEM SHUTDOWN query by default doesn't wait for query completion and tries to kill all queries immediately after receiving the signal; if you want to change this behavior, you need to enable the setting shutdown_wait_unfinished_queries.
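On recent versions this can be done with a server-level config override, e.g. a file like /etc/clickhouse-server/config.d/shutdown_wait.xml (a sketch; check that your ClickHouse version supports these settings):

```xml
<yandex>
    <!-- wait for running queries to finish instead of killing them on shutdown -->
    <shutdown_wait_unfinished_queries>1</shutdown_wait_unfinished_queries>
    <!-- how long to wait, in seconds -->
    <shutdown_wait_unfinished>60</shutdown_wait_unfinished>
</yandex>
```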
40 - SSL connection unexpectedly closed
ClickHouse doesn't probe the CA path that is the default on CentOS and Amazon Linux.
ClickHouse client
cat /etc/clickhouse-client/conf.d/openssl-ca.xml
<config>
<openSSL>
<client> <!-- Used for connection to server's secure tcp port -->
<caConfig>/etc/ssl/certs</caConfig>
</client>
</openSSL>
</config>
ClickHouse server
cat /etc/clickhouse-server/conf.d/openssl-ca.xml
<config>
<openSSL>
<server> <!-- Used for https server AND secure tcp port -->
<caConfig>/etc/ssl/certs</caConfig>
</server>
<client> <!-- Used for connecting to https dictionary source and secured Zookeeper communication -->
<caConfig>/etc/ssl/certs</caConfig>
</client>
</openSSL>
</config>
41 - Suspiciously many broken parts
Symptom:
ClickHouse doesn't start, with the message DB::Exception: Suspiciously many broken parts to remove.
Cause:
That exception is just a safeguard check/circuit breaker, triggered when clickhouse detects a lot of broken parts during server startup.
Parts are considered broken if they have bad checksums or some files are missing or malformed. Usually, that means the data was corrupted on the disk.
Why could the data be corrupted?
- The most frequent reason is a hard restart of the system, leading to loss of the data which was not yet fully flushed to disk from the system page cache. Please be aware that by default ClickHouse doesn't do fsync, so data is considered inserted after it has been passed to the Linux page cache. See fsync-related settings in ClickHouse.
- It can also be caused by disk failures: maybe there are bad blocks on the hard disk, or logical problems, or some RAID issue. Check system journals, use fsck / mdadm and other standard tools to diagnose the disk problem.
- Other reasons: manual intervention / bugs etc., for example, the data files or folders were removed by mistake or moved to another folder.
Action:
- If you are OK to accept the data loss: set up the force_restore_data flag and ClickHouse will move the broken parts to detached. Data loss is possible if the issue is a result of misconfiguration (i.e. someone accidentally changed xml configs with incorrect shard/replica macros; the data will be moved to the detached folder and can be recovered).
sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data
Then restart clickhouse: the table will be attached, and the broken parts will be detached, which means the data from those parts will not be available for selects. You can see the list of those parts in the system.detached_parts table and drop them if needed using
ALTER TABLE ... DROP DETACHED PART ...
commands.
If you are OK to tolerate bigger losses automatically, you can make that safeguard less sensitive by increasing the max_suspicious_broken_parts setting:
cat /etc/clickhouse-server/config.d/max_suspicious_broken_parts.xml
<?xml version="1.0"?>
<yandex>
    <merge_tree>
        <max_suspicious_broken_parts>50</max_suspicious_broken_parts>
    </merge_tree>
</yandex>
This limit is set to 10 by default; you can set a bigger value (50, 100 or more), but the corrupted data will still be lost.
Check also the similar setting max_suspicious_broken_parts_bytes.
See https://clickhouse.com/docs/en/operations/settings/merge-tree-settings/
- If you can't accept the data loss - you should recover the data from backups / re-insert it once again etc.
If you don't want to tolerate automatic detaching of broken parts, you can set max_suspicious_broken_parts_bytes and max_suspicious_broken_parts to 0.
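A sketch of such a zero-tolerance override, e.g. a file like /etc/clickhouse-server/config.d/no_broken_parts.xml (with these values the server will not detach broken parts automatically):

```xml
<yandex>
    <merge_tree>
        <max_suspicious_broken_parts>0</max_suspicious_broken_parts>
        <max_suspicious_broken_parts_bytes>0</max_suspicious_broken_parts_bytes>
    </merge_tree>
</yandex>
```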
Scenario illustrating / testing
- Create table
create table t111(A UInt32) Engine=MergeTree order by A settings max_suspicious_broken_parts=1;
insert into t111 select number from numbers(100000);
- Detach the table and corrupt the data
detach table t111;
cd /var/lib/clickhouse/data/default/t111/all_***
Truncate a data file to corrupt it:
> data.bin
Repeat for 2 or more data files.
- Attach the table:
attach table t111;
Received exception from server (version 21.12.3):
Code: 231. DB::Exception: Received from localhost:9000. DB::Exception: Suspiciously many (2) broken parts to remove.. (TOO_MANY_UNEXPEC
- Set up the force_restore_data flag
sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data
sudo service clickhouse-server restart
Then the table t111 will be attached, with the corrupted data lost.
42 - System tables eat my disk
Note 1: System database stores virtual tables (parts, tables, columns, etc.) and *_log tables.
Virtual tables do not persist on disk. They reflect ClickHouse memory (c++ structures). They cannot be changed or removed.
Log tables are named with postfix *_log and have the MergeTree engine.
You can drop / rename / truncate *_log tables at any time. ClickHouse will recreate them in about 7 seconds (flush period).
Note 2: Log tables with numeric postfixes (_1 / _2 / _3 ...), e.g.
query_log_1 query_thread_log_3
are results of ClickHouse upgrades. When a new version of ClickHouse starts and discovers that a system log table's schema is incompatible with the new schema, ClickHouse renames the old *_log table to a name with such a postfix and creates a table with the new schema. You can drop such tables if you don't need such historic data.
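For example (the _1 postfix below assumes such a renamed table actually exists on your server):

```sql
-- inspect leftover renamed log tables
SELECT name FROM system.tables WHERE database = 'system' AND name LIKE '%\_log\_%';

-- drop one if the historic data is not needed
DROP TABLE IF EXISTS system.query_log_1;
```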
You can disable all / any of them
Do not create log tables at all (a restart is needed for these changes to take effect).
$ cat /etc/clickhouse-server/config.d/z_log_disable.xml
<?xml version="1.0"?>
<yandex>
<asynchronous_metric_log remove="1"/>
<metric_log remove="1"/>
<query_thread_log remove="1" />
<query_log remove="1" />
<query_views_log remove="1" />
<part_log remove="1"/>
<session_log remove="1"/>
<text_log remove="1" />
<trace_log remove="1"/>
<crash_log remove="1"/>
<opentelemetry_span_log remove="1"/>
<zookeeper_log remove="1"/>
</yandex>
We do not recommend removing query_log and query_thread_log, as they contain very useful information for debugging; query logging can easily be turned off without a restart through user profiles:
$ cat /etc/clickhouse-server/users.d/z_log_queries.xml
<yandex>
<profiles>
<default>
<log_queries>0</log_queries> <!-- normally it's better to keep it turned on! -->
<log_query_threads>0</log_query_threads>
</default>
</profiles>
</yandex>
Hint: z_log_disable.xml is named with z_ at the beginning, which means this config will be applied last and will override the same sections in all other config files (configs are applied in alphabetical order).
You can also configure these settings to reduce the amount of data in the system.query_log
table:
name | value | description
----------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
log_queries_min_type | QUERY_START | Minimal type in query_log to log, possible values (from low to high): QUERY_START, QUERY_FINISH, EXCEPTION_BEFORE_START, EXCEPTION_WHILE_PROCESSING.
log_queries_min_query_duration_ms | 0 | Minimal time for the query to run, to get to the query_log/query_thread_log.
log_queries_cut_to_length | 100000 | If query length is greater than specified threshold (in bytes), then cut query when writing to query log. Also limit length of printed query in ordinary text log.
log_profile_events | 1 | Log query performance statistics into the query_log and query_thread_log.
log_query_settings | 1 | Log query settings into the query_log.
log_queries_probability           | 1           | Log queries with the specified probability.
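For example, to log only queries that run longer than one second, and only a 10% sample of them, you could drop a profile override into users.d (the file name and values are examples):

```xml
<yandex>
    <profiles>
        <default>
            <log_queries_min_query_duration_ms>1000</log_queries_min_query_duration_ms>
            <log_queries_probability>0.1</log_queries_probability>
        </default>
    </profiles>
</yandex>
```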
You can configure TTL
Example for query_log
. It drops partitions with data older than 14 days:
$ cat /etc/clickhouse-server/config.d/query_log_ttl.xml
<?xml version="1.0"?>
<yandex>
<query_log>
<database>system</database>
<table>query_log</table>
<engine>ENGINE = MergeTree PARTITION BY (event_date)
ORDER BY (event_time)
TTL event_date + INTERVAL 14 DAY DELETE
SETTINGS ttl_only_drop_parts=1
</engine>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>
</yandex>
After that you need to restart ClickHouse and drop or rename the existing system.query_log table, then CH creates a new table with these settings.
RENAME TABLE system.query_log TO system.query_log_1;
The important part here is the daily partitioning PARTITION BY (event_date) and ttl_only_drop_parts=1. In this case ClickHouse drops whole partitions. Dropping partitions is a very cheap operation in terms of CPU / disk I/O.
A usual TTL (without ttl_only_drop_parts=1) is a heavy, CPU- and disk-I/O-consuming operation, which rewrites data parts without the expired rows.
You can add TTL without ClickHouse restart (and table dropping or renaming):
ALTER TABLE system.query_log MODIFY SETTING ttl_only_drop_parts = 1;
ALTER TABLE system.query_log MODIFY TTL event_date + INTERVAL 14 DAY;
But in this case ClickHouse will drop only whole monthly partitions (will store data older than 14 days).
One more way to configure TTL for system tables
This way just adds TTL to a table and leaves monthly (default) partitioning (will store data older than 14 days).
$ cat /etc/clickhouse-server/config.d/query_log_ttl.xml
<?xml version="1.0"?>
<yandex>
<query_log>
<database>system</database>
<table>query_log</table>
<ttl>event_date + INTERVAL 30 DAY DELETE</ttl>
</query_log>
</yandex>
After that you need to restart ClickHouse and drop or rename the existing system.query_log table, then CH creates a new table with this TTL setting.
You can disable logging on a session level or in user’s profile (for all or specific users)
But only for logs generated on session level (query_log
/ query_thread_log
)
In this case a restart is not needed.
Let’s disable query logging for all users (profile = default, all other profiles inherit it).
cat /etc/clickhouse-server/users.d/log_queries.xml
<yandex>
<profiles>
<default>
<log_queries>0</log_queries>
<log_query_threads>0</log_query_threads>
</default>
</profiles>
</yandex>
43 - Threads
Count threads used by clickhouse-server
cat /proc/$(pidof -s clickhouse-server)/status | grep Threads
Threads: 103
ps hH $(pidof -s clickhouse-server) | wc -l
103
ps hH -AF | grep clickhouse | wc -l
116
Thread counts by type (using ps & clickhouse-local)
ps H -o 'tid comm' $(pidof -s clickhouse-server) | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $2) }' | clickhouse-local -S "threadid UInt16, name String" -q "SELECT name, count() FROM table GROUP BY name WITH TOTALS ORDER BY count() DESC FORMAT PrettyCompact"
Threads used by running queries:
SELECT query, length(thread_ids) AS threads_count FROM system.processes ORDER BY threads_count;
Thread pools limits & usage
SELECT
name,
value
FROM system.settings
WHERE name LIKE '%pool%'
┌─name─────────────────────────────────────────┬─value─┐
│ connection_pool_max_wait_ms │ 0 │
│ distributed_connections_pool_size │ 1024 │
│ background_buffer_flush_schedule_pool_size │ 16 │
│ background_pool_size │ 16 │
│ background_move_pool_size │ 8 │
│ background_fetches_pool_size │ 8 │
│ background_schedule_pool_size │ 16 │
│ background_message_broker_schedule_pool_size │ 16 │
│ background_distributed_schedule_pool_size │ 16 │
│ postgresql_connection_pool_size │ 16 │
│ postgresql_connection_pool_wait_timeout │ -1 │
│ odbc_bridge_connection_pool_size │ 16 │
└──────────────────────────────────────────────┴───────┘
SELECT
metric,
value
FROM system.metrics
WHERE metric LIKE 'Background%'
┌─metric──────────────────────────────────┬─value─┐
│ BackgroundPoolTask │ 0 │
│ BackgroundFetchesPoolTask │ 0 │
│ BackgroundMovePoolTask │ 0 │
│ BackgroundSchedulePoolTask │ 0 │
│ BackgroundBufferFlushSchedulePoolTask │ 0 │
│ BackgroundDistributedSchedulePoolTask │ 0 │
│ BackgroundMessageBrokerSchedulePoolTask │ 0 │
└─────────────────────────────────────────┴───────┘
SELECT *
FROM system.asynchronous_metrics
WHERE lower(metric) LIKE '%thread%'
ORDER BY metric ASC
┌─metric───────────────────────────────────┬─value─┐
│ HTTPThreads │ 0 │
│ InterserverThreads │ 0 │
│ MySQLThreads │ 0 │
│ OSThreadsRunnable │ 2 │
│ OSThreadsTotal │ 2910 │
│ PostgreSQLThreads │ 0 │
│ TCPThreads │ 1 │
│ jemalloc.background_thread.num_runs │ 0 │
│ jemalloc.background_thread.num_threads │ 0 │
│ jemalloc.background_thread.run_intervals │ 0 │
└──────────────────────────────────────────┴───────┘
SELECT *
FROM system.metrics
WHERE lower(metric) LIKE '%thread%'
ORDER BY metric ASC
Query id: 6acbb596-e28f-4f89-94b2-27dccfe88ee9
┌─metric─────────────┬─value─┬─description───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ GlobalThread │ 151 │ Number of threads in global thread pool. │
│ GlobalThreadActive │ 144 │ Number of threads in global thread pool running a task. │
│ LocalThread │ 0 │ Number of threads in local thread pools. The threads in local thread pools are taken from the global thread pool. │
│ LocalThreadActive │ 0 │ Number of threads in local thread pools running a task. │
│ QueryThread │ 0 │ Number of query processing threads │
└────────────────────┴───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Stack traces of the working threads from the pools
SET allow_introspection_functions = 1;
WITH arrayMap(x -> demangle(addressToSymbol(x)), trace) AS all
SELECT
thread_id,
query_id,
arrayStringConcat(all, '\n') AS res
FROM system.stack_trace
WHERE res ILIKE '%Pool%'
FORMAT Vertical;
44 - Who ate my memory
SELECT *, formatReadableSize(value) FROM system.asynchronous_metrics WHERE metric like '%Cach%' or metric like '%Mem%' order by metric format PrettyCompactMonoBlock;
SELECT event_time, metric, value, formatReadableSize(value) FROM system.asynchronous_metric_log WHERE event_time > now() - 600 and (metric like '%Cach%' or metric like '%Mem%') and value <> 0 order by metric, event_time format PrettyCompactMonoBlock;
SELECT formatReadableSize(sum(bytes_allocated)) FROM system.dictionaries;
SELECT
database,
name,
formatReadableSize(total_bytes)
FROM system.tables
WHERE engine IN ('Memory','Set','Join');
SELECT formatReadableSize(sum(memory_usage)) FROM system.merges;
SELECT formatReadableSize(sum(memory_usage)) FROM system.processes;
SELECT
initial_query_id,
elapsed,
formatReadableSize(memory_usage),
formatReadableSize(peak_memory_usage),
query
FROM system.processes
ORDER BY peak_memory_usage DESC
LIMIT 10;
SELECT
metric,
formatReadableSize(value)
FROM system.asynchronous_metrics
WHERE metric IN ('UncompressedCacheBytes', 'MarkCacheBytes');
SELECT
formatReadableSize(sum(primary_key_bytes_in_memory)) AS primary_key_bytes_in_memory,
formatReadableSize(sum(primary_key_bytes_in_memory_allocated)) AS primary_key_bytes_in_memory_allocated
FROM system.parts;
SELECT
type,
event_time,
initial_query_id,
formatReadableSize(memory_usage),
query
FROM system.query_log
WHERE (event_date >= today()) AND (event_time >= (now() - 7200))
ORDER BY memory_usage DESC
LIMIT 10;
SELECT sum(data_uncompressed_bytes) FROM system.parts WHERE part_type = 'InMemory';
for i in `seq 1 600`; do clickhouse-client --empty_result_for_aggregation_by_empty_set=0 -q "select (select 'Merges: \
'||formatReadableSize(sum(memory_usage)) from system.merges), (select \
'Processes: '||formatReadableSize(sum(memory_usage)) from system.processes)";\
sleep 3; done
Merges: 96.57 MiB Processes: 41.98 MiB
Merges: 82.24 MiB Processes: 41.91 MiB
Merges: 66.33 MiB Processes: 41.91 MiB
Merges: 66.49 MiB Processes: 37.13 MiB
Merges: 67.78 MiB Processes: 37.13 MiB
echo " Merges Processes PrimaryK TempTabs Dicts"; \
for i in `seq 1 600`; do clickhouse-client --empty_result_for_aggregation_by_empty_set=0 -q "select \
(select leftPad(formatReadableSize(sum(memory_usage)),15, ' ') from system.merges)||
(select leftPad(formatReadableSize(sum(memory_usage)),15, ' ') from system.processes)||
(select leftPad(formatReadableSize(sum(primary_key_bytes_in_memory_allocated)),15, ' ') from system.parts)|| \
(select leftPad(formatReadableSize(sum(total_bytes)),15, ' ') from system.tables \
WHERE engine IN ('Memory','Set','Join'))||
(select leftPad(formatReadableSize(sum(bytes_allocated)),15, ' ') FROM system.dictionaries)
"; sleep 3; done
Merges Processes PrimaryK TempTabs Dicts
0.00 B 0.00 B 21.36 MiB 1.58 GiB 911.07 MiB
0.00 B 0.00 B 21.36 MiB 1.58 GiB 911.07 MiB
0.00 B 0.00 B 21.35 MiB 1.58 GiB 911.07 MiB
0.00 B 0.00 B 21.36 MiB 1.58 GiB 911.07 MiB
45 - X rows of Y total rows in filesystem are suspicious
Warning
The local set of parts of table doesn't look like the set of parts in ZooKeeper. 100.00 rows of 150.00 total rows in filesystem are suspicious. There are 1 unexpected parts with 100 rows (1 of them is not just-written with 100 rows), 0 missing parts (with 0 blocks).: Cannot attach table.
ClickHouse keeps a registry of parts in ZooKeeper, and during startup it checks that the list of parts on the local disk is consistent with the list in ZooKeeper. If the lists differ too much, ClickHouse refuses to start, because it could be an issue with settings, or wrong shard or replica macros. This safeguard throws an exception if the difference is more than 50% (in rows).
In this case the table is very small, and the difference >50% (100.00 vs 150.00) comes from a single part mismatch, which can be the result of a hard restart.
SELECT * FROM system.merge_tree_settings WHERE name = 'replicated_max_ratio_of_wrong_parts'
┌─name────────────────────────────────┬─value─┬─changed─┬─description──────────────────────────────────────────────────────────────────────────┬─type──┐
│ replicated_max_ratio_of_wrong_parts │ 0.5 │ 0 │ If ratio of wrong parts to total number of parts is less than this - allow to start. │ Float │
└─────────────────────────────────────┴───────┴─────────┴──────────────────────────────────────────────────────────────────────────────────────┴───────┘
You can set another value of replicated_max_ratio_of_wrong_parts
for all MergeTree tables or per table.
https://clickhouse.tech/docs/en/operations/settings/merge-tree-settings
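For example, to relax the check globally via a config.d override (0.9 here is an arbitrary example value; per table the same setting can go in the table's SETTINGS clause):

```xml
<yandex>
    <merge_tree>
        <replicated_max_ratio_of_wrong_parts>0.9</replicated_max_ratio_of_wrong_parts>
    </merge_tree>
</yandex>
```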
After manipulation with storage_policies and disks
When the storage policy changed (one disk was removed from it), ClickHouse compared the parts on disk with this replica's state in ZooKeeper and found out that a lot of parts (from the removed disk) had disappeared. So ClickHouse removed them from the replica state in ZooKeeper and scheduled fetching them from other replicas.
After we add the removed disk back to the storage_policy, ClickHouse finds the missing parts, but at this moment they are not registered for that replica. ClickHouse produces an error message like this:
Warning
default.tbl from metadata file /var/lib/clickhouse/metadata/default/tbl.sql from query ATTACH TABLE default.tbl ... ENGINE=ReplicatedMergeTree('/clickhouse/tables/0/default/tbl', 'replica-0') ... SETTINGS index_granularity = 1024, storage_policy = 'ebs_hot_and_cold': while loading database default from path /var/lib/clickhouse/metadata/data
At this point, it’s possible to either tune setting replicated_max_ratio_of_wrong_parts
or do force restore, but it will end up downloading all “missing” parts from other replicas, which can take a lot of time for big tables.
ClickHouse 21.7+
- Rename table SQL attach script in order to prevent ClickHouse from attaching it at startup.
mv /var/lib/clickhouse/metadata/default/tbl.sql /var/lib/clickhouse/metadata/default/tbl.sql.bak
-
Start ClickHouse server.
-
Remove metadata for this replica from ZooKeeper.
SYSTEM DROP REPLICA 'replica-0' FROM ZKPATH '/clickhouse/tables/0/default/tbl';
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/tables/0/default/tbl/replicas';
- Rename table SQL attach script back to normal name.
mv /var/lib/clickhouse/metadata/default/tbl.sql.bak /var/lib/clickhouse/metadata/default/tbl.sql
- Attach the table to the ClickHouse server. Because there is no metadata in ZooKeeper, ClickHouse will attach it in read-only state.
ATTACH TABLE default.tbl;
- Run
SYSTEM RESTORE REPLICA
in order to sync state on disk and in ZooKeeper.
SYSTEM RESTORE REPLICA default.tbl;
- Run
SYSTEM SYNC REPLICA
to download missing parts from other replicas.
SYSTEM SYNC REPLICA default.tbl;
46 - ZooKeeper
Requirements
TLDR version:
- USE DEDICATED FAST DISKS for the transaction log! (crucial for performance due to write-ahead-log, NVMe is preferred for heavy load setup).
- use 3 nodes (more nodes = slower quorum, less = no HA).
- low network latency between zookeeper nodes is very important (latency, not bandwidth).
- have at least 4Gb of RAM, disable swap, tune JVM sizes, and garbage collector settings
- ensure that zookeeper will not be CPU-starved by some other processes
- monitor zookeeper.
Side note: in many cases, the slowness of the zookeeper is actually a symptom of some issue with clickhouse schema/usage pattern (the most typical issues: an enormous number of partitions/tables/databases with real-time inserts, tiny & frequent inserts).
How to install
- https://docs.altinity.com/operationsguide/clickhouse-zookeeper/zookeeper-installation/
- altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/install_ubuntu/
Random links on best practices
- https://docs.confluent.io/platform/current/zookeeper/deployment.html
- https://zookeeper.apache.org/doc/r3.4.9/zookeeperAdmin.html#sc_commonProblems
- https://clickhouse.tech/docs/en/operations/tips/#zookeeper
- https://lucene.apache.org/solr/guide/7_4/setting-up-an-external-zookeeper-ensemble.html
- https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
Cite from https://zookeeper.apache.org/doc/r3.5.7/zookeeperAdmin.html#sc_commonProblems :
Things to Avoid
Here are some common problems you can avoid by configuring ZooKeeper correctly:
- inconsistent lists of servers : The list of ZooKeeper servers used by the clients must match the list of ZooKeeper servers that each ZooKeeper server has. Things work okay if the client list is a subset of the real list, but things will really act strange if clients have a list of ZooKeeper servers that are in different ZooKeeper clusters. Also, the server lists in each Zookeeper server configuration file should be consistent with one another.
- incorrect placement of transaction log : The most performance critical part of ZooKeeper is the transaction log. ZooKeeper syncs transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, increase the snapCount so that snapshot files are generated less often; it does not eliminate the problem, but it makes more resources available for the transaction log.
- incorrect Java heap size : You should take special care to set your Java max heap size correctly. In particular, you should not create a situation in which ZooKeeper swaps to disk. The disk is death to ZooKeeper. Everything is ordered, so if processing one request swaps the disk, all other queued requests will probably do the same. DON'T SWAP. Be conservative in your estimates: if you have 4G of RAM, do not set the Java max heap size to 6G or even 4G. For example, it is more likely you would use a 3G heap for a 4G machine, as the operating system and the cache also need memory. The best and only recommended practice for estimating the heap size your system needs is to run load tests, and then make sure you are well below the usage limit that would cause the system to swap.
- Publicly accessible deployment : A ZooKeeper ensemble is expected to operate in a trusted computing environment. It is thus recommended to deploy ZooKeeper behind a firewall.
How to check number of followers:
echo mntr | nc zookeeper 2187 | grep foll
zk_synced_followers 2
zk_synced_non_voting_followers 0
zk_avg_follower_sync_time 0.0
zk_min_follower_sync_time 0
zk_max_follower_sync_time 0
zk_cnt_follower_sync_time 0
zk_sum_follower_sync_time 0
Tools
Alternatives for zkCli
- https://github.com/go-zkcli/zkcli
- https://github.com/outbrain/zookeepercli
- https://idata.co.il/2018/07/a-day-at-the-zoo-graphic-uis-for-apache-zookeeper/
Web UI
46.1 - Install standalone Zookeeper for ClickHouse on Ubuntu / Debian
Reference script to install standalone Zookeeper for Ubuntu / Debian
Tested on Ubuntu 20.
# install java runtime environment
sudo apt-get update
sudo apt install default-jre
# prepare folders, logs folder should be on the low-latency disk.
sudo mkdir -p /var/lib/zookeeper/data /var/lib/zookeeper/logs /etc/zookeeper /var/log/zookeeper /opt
# download and install files
export ZOOKEEPER_VERSION=3.6.3
wget https://dlcdn.apache.org/zookeeper/zookeeper-${ZOOKEEPER_VERSION}/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz -O /tmp/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz
sudo tar -xvf /tmp/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz -C /opt
rm -rf /tmp/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz
# create the user
sudo groupadd -r zookeeper
sudo useradd -r -g zookeeper --home-dir=/var/lib/zookeeper --shell=/bin/false zookeeper
# symlink pointing to the used version of zookeeper distibution
sudo ln -s /opt/apache-zookeeper-${ZOOKEEPER_VERSION}-bin /opt/zookeeper
sudo chown -R zookeeper:zookeeper /var/lib/zookeeper /var/log/zookeeper /etc/zookeeper /opt/apache-zookeeper-${ZOOKEEPER_VERSION}-bin
sudo chown -h zookeeper:zookeeper /opt/zookeeper
# shortcuts in /usr/local/bin/
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkCli.sh "$@"' | sudo tee /usr/local/bin/zkCli
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkServer.sh "$@"' | sudo tee /usr/local/bin/zkServer
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkCleanup.sh "$@"' | sudo tee /usr/local/bin/zkCleanup
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkSnapShotToolkit.sh "$@"' | sudo tee /usr/local/bin/zkSnapShotToolkit
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkTxnLogToolkit.sh "$@"' | sudo tee /usr/local/bin/zkTxnLogToolkit
sudo chmod +x /usr/local/bin/zkCli /usr/local/bin/zkServer /usr/local/bin/zkCleanup /usr/local/bin/zkSnapShotToolkit /usr/local/bin/zkTxnLogToolkit
# put in the config
sudo cp /opt/zookeeper/conf/* /etc/zookeeper
cat <<EOF | sudo tee /etc/zookeeper/zoo.cfg
initLimit=20
syncLimit=10
maxSessionTimeout=60000000
maxClientCnxns=2000
preAllocSize=131072
snapCount=3000000
dataDir=/var/lib/zookeeper/data
# dataLogDir: use a low-latency disk!
dataLogDir=/var/lib/zookeeper/logs
clientPort=2181
#clientPortAddress=nthk-zoo1.localdomain
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
4lw.commands.whitelist=*
EOF
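The config above is for a single standalone node. For a quorum, each server additionally needs `server.N` lines in zoo.cfg and a `myid` file in dataDir whose number matches its own entry. A sketch with hypothetical hostnames:

```
# appended to /etc/zookeeper/zoo.cfg on every node (hostnames are examples)
server.1=zoo1.localdomain:2888:3888
server.2=zoo2.localdomain:2888:3888
server.3=zoo3.localdomain:2888:3888
```

Then on the node with id 1: `echo 1 | sudo tee /var/lib/zookeeper/data/myid`, and likewise with 2 and 3 on the other nodes.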
sudo chown -R zookeeper:zookeeper /etc/zookeeper
# create systemd service file
cat <<EOF | sudo tee /etc/systemd/system/zookeeper.service
[Unit]
Description=Zookeeper Daemon
Documentation=http://zookeeper.apache.org
Requires=network.target
After=network.target
[Service]
Type=forking
WorkingDirectory=/var/lib/zookeeper
User=zookeeper
Group=zookeeper
# ZK_SERVER_HEAP is in megabytes; adjust to ~80-90% of available RAM (more than 8 GB is usually overkill)
Environment=ZK_SERVER_HEAP=1536
Environment=SERVER_JVMFLAGS="-Xms256m -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"
Environment=ZOO_LOG_DIR=/var/log/zookeeper
ExecStart=/opt/zookeeper/bin/zkServer.sh start /etc/zookeeper/zoo.cfg
ExecStop=/opt/zookeeper/bin/zkServer.sh stop /etc/zookeeper/zoo.cfg
ExecReload=/opt/zookeeper/bin/zkServer.sh restart /etc/zookeeper/zoo.cfg
TimeoutSec=30
Restart=on-failure
[Install]
WantedBy=default.target
EOF
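If you later need a bigger heap, one option is a systemd drop-in instead of editing the unit (the value below is an example): create `/etc/systemd/system/zookeeper.service.d/heap.conf` with the contents below, then run `sudo systemctl daemon-reload && sudo systemctl restart zookeeper.service`.

```ini
[Service]
Environment=ZK_SERVER_HEAP=4096
```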
# start zookeeper (and enable it at boot)
sudo systemctl daemon-reload
sudo systemctl enable zookeeper.service
sudo systemctl start zookeeper.service
# check status etc. (a healthy server replies "imok" to ruok)
echo stat | nc localhost 2181
echo ruok | nc localhost 2181
echo mntr | nc localhost 2181
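The mntr command returns tab-separated key/value metrics, so a single metric is easy to extract with awk. A small sketch, run here against a canned sample so it works without a live server; against a real node, replace `sample_mntr` with `echo mntr | nc localhost 2181`:

```shell
# canned stand-in for `echo mntr | nc localhost 2181` (values are examples)
sample_mntr() {
  printf 'zk_version\t3.6.3\nzk_server_state\tstandalone\nzk_znode_count\t5\n'
}
# pick one metric out of the tab-separated key/value output
sample_mntr | awk -F'\t' '$1 == "zk_znode_count" {print $2}'
# prints: 5
```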
46.2 - clickhouse-keeper
Since version 21.3 ClickHouse ships its own ZooKeeper replacement, clickhouse-keeper. It is still experimental, it still needs to be started separately on a few nodes (similar to 'normal' ZooKeeper), and it speaks the standard ZooKeeper protocol, which simplifies A/B testing against a real ZooKeeper.
There are no docs for now, only the PR with code & tests. Of course, if you want to play with it, you can, and early feedback is very valuable. But be prepared for a lot of tiny issues here and there, so don't be disappointed if it does not satisfy your expectations for some reason. It is very fresh: ready for some trial runs, but not yet for production use cases.
To test it you need to run 3 instances of clickhouse-server (which will mimic ZooKeeper) with an extra keeper_server config section, or even a single instance with a config like these test configs:
https://github.com/ClickHouse/ClickHouse/blob/master/tests/config/config.d/keeper_port.xml
https://github.com/ClickHouse/ClickHouse/blob/master/tests/config/config.d/zookeeper.xml
Then point all the ClickHouse servers (the zookeeper config section) to those nodes / ports.
Latest testing version is recommended. We will be thankful for any feedback.
Example of a simple cluster with 2 nodes of ClickHouse using the built-in keeper
For example, you can start two ClickHouse nodes (hostname1, hostname2):
hostname1
$ cat /etc/clickhouse-server/config.d/keeper.xml
<?xml version="1.0" ?>
<yandex>
<keeper_server>
<tcp_port>2181</tcp_port>
<server_id>1</server_id>
<log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
<snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
<coordination_settings>
<operation_timeout_ms>10000</operation_timeout_ms>
<session_timeout_ms>30000</session_timeout_ms>
<raft_logs_level>trace</raft_logs_level>
<rotate_log_storage_interval>10000</rotate_log_storage_interval>
</coordination_settings>
<raft_configuration>
<server>
<id>1</id>
<hostname>hostname1</hostname>
<port>9444</port>
</server>
<server>
<id>2</id>
<hostname>hostname2</hostname>
<port>9444</port>
</server>
</raft_configuration>
</keeper_server>
<zookeeper>
<node>
<host>localhost</host>
<port>2181</port>
</node>
</zookeeper>
<distributed_ddl>
<path>/clickhouse/testcluster/task_queue/ddl</path>
</distributed_ddl>
</yandex>
$ cat /etc/clickhouse-server/config.d/macros.xml
<?xml version="1.0" ?>
<yandex>
<macros>
<cluster>testcluster</cluster>
<replica>replica1</replica>
<shard>1</shard>
</macros>
</yandex>
hostname2
$ cat /etc/clickhouse-server/config.d/keeper.xml
<?xml version="1.0" ?>
<yandex>
<keeper_server>
<tcp_port>2181</tcp_port>
<server_id>2</server_id>
<log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
<snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
<coordination_settings>
<operation_timeout_ms>10000</operation_timeout_ms>
<session_timeout_ms>30000</session_timeout_ms>
<raft_logs_level>trace</raft_logs_level>
<rotate_log_storage_interval>10000</rotate_log_storage_interval>
</coordination_settings>
<raft_configuration>
<server>
<id>1</id>
<hostname>hostname1</hostname>
<port>9444</port>
</server>
<server>
<id>2</id>
<hostname>hostname2</hostname>
<port>9444</port>
</server>
</raft_configuration>
</keeper_server>
<zookeeper>
<node>
<host>localhost</host>
<port>2181</port>
</no