Setup & maintenance
- 1: S3 & object storage
- 1.1: AWS S3 Recipes
- 1.2: S3Disk
- 2: AggregateFunction(uniq, UUID) doubled after ClickHouse upgrade
- 3: Can not connect to my ClickHouse server
- 4: cgroups and kubernetes cloud providers
- 5: ClickHouse and different filesystems
- 6: Transforming ClickHouse logs to ndjson using Vector.dev
- 7: ClickHouse operator
- 8: Compatibility layer for clickhouse-operator metric exporter
- 9: How to convert uniqExact states to approximate uniq functions states
- 10: Custom Settings
- 11: Description of asynchronous_metrics
- 12: Clickhouse data/disk encryption (at rest)
- 13: How ALTERs work in ClickHouse
- 14: http handler example
- 15: Logging
- 16: Precreate parts using clickhouse-local
- 17: Access Control and Account Management (RBAC)
- 18: recovery-after-complete-data-loss
- 19: Replication: Can not resolve host of another clickhouse server
- 20: source parts size is greater than the current maximum
- 21: Successful ClickHouse deployment plan
- 22: sysall database (system tables on a cluster level)
- 23: Timeouts during OPTIMIZE FINAL
- 24: Useful settings to turn on/Defaults that should be reconsidered
- 25: Who ate my CPU
- 26: ZooKeeper session has expired
- 27: Altinity packaging compatibility >21.x and earlier
- 28: AWS EC2 Storage
- 29: ClickHouse in Docker
- 30: ClickHouse Monitoring
- 31: ClickHouse versions
- 32: clickhouse-backup
- 33: Converting MergeTree to Replicated
- 34: Data Migration
- 34.1: MSSQL bcp pipe to clickhouse-client
- 34.2: clickhouse-copier
- 34.2.1: clickhouse-copier 20.3 and earlier
- 34.2.2: clickhouse-copier 20.4 - 21.6
- 34.2.3: Kubernetes job for clickhouse-copier
- 34.3: Distributed table to Cluster
- 34.4: Fetch Alter Table
- 34.5: Remote table function
- 34.6: rsync
- 35: DDLWorker
- 36: differential backups using clickhouse-backup
- 37: High CPU usage
- 38: Load balancers
- 39: memory configuration settings
- 40: Moving a table to another device.
- 41: Object consistency in a cluster
- 42: Production Cluster Configuration Guide
- 42.1: Backups
- 42.2: Cluster Configuration FAQ
- 42.3: Cluster Configuration Process
- 42.4: Hardware Requirements
- 42.5: Network Configuration
- 42.6: Version Upgrades
- 43: Replication queue
- 44: Schema migration tools for ClickHouse
- 44.1: golang-migrate
- 45: Server config files
- 46: Settings to adjust
- 47: Shutting down a node
- 48: SSL connection unexpectedly closed
- 49: Suspiciously many broken parts
- 50: System tables ate my disk
- 51: Threads
- 52: Who ate my memory
- 53: X rows of Y total rows in filesystem are suspicious
- 54: ZooKeeper
- 54.1: clickhouse-keeper-service
- 54.2: Install standalone Zookeeper for ClickHouse on Ubuntu / Debian
- 54.3: clickhouse-keeper
- 54.4: How to check the list of watches
- 54.5: JVM sizes and garbage collector settings
- 54.6: Proper setup
- 54.7: Recovering from complete metadata loss in ZooKeeper
- 54.8: ZooKeeper backup
- 54.9: ZooKeeper cluster migration
- 54.10: ZooKeeper Monitoring
- 54.11: ZooKeeper schema
1 - S3 & object storage
1.1 - AWS S3 Recipes
Using AWS IAM (Identity and Access Management) roles
For an EC2 instance, there is an option to configure an IAM role:
The role should contain a policy with permissions like:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "allow-put-and-get",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject"
],
"Resource": "arn:aws:s3:::BUCKET_NAME/test_s3_disk/*"
}
]
}
Corresponding configuration of ClickHouse:
<clickhouse>
<storage_configuration>
<disks>
<disk_s3>
<type>s3</type>
<endpoint>http://s3.us-east-1.amazonaws.com/BUCKET_NAME/test_s3_disk/</endpoint>
<use_environment_credentials>true</use_environment_credentials>
</disk_s3>
</disks>
<policies>
<policy_s3_only>
<volumes>
<volume_s3>
<disk>disk_s3</disk>
</volume_s3>
</volumes>
</policy_s3_only>
</policies>
</storage_configuration>
</clickhouse>
Small check:
CREATE TABLE table_s3 (number Int64) ENGINE=MergeTree() ORDER BY tuple() PARTITION BY tuple() SETTINGS storage_policy='policy_s3_only';
INSERT INTO table_s3 SELECT * FROM system.numbers LIMIT 100000000;
SELECT * FROM table_s3;
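-- Optional sketch: verify that the parts of table_s3 actually landed on the S3-backed disk
-- (system.parts exposes disk_name and bytes_on_disk)
SELECT disk_name, count() AS parts, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active AND table = 'table_s3'
GROUP BY disk_name;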
DROP TABLE table_s3;
1.2 - S3Disk
Settings
<clickhouse>
<storage_configuration>
<disks>
<s3>
<type>s3</type>
<endpoint>http://s3.us-east-1.amazonaws.com/BUCKET_NAME/test_s3_disk/</endpoint>
<access_key_id>ACCESS_KEY_ID</access_key_id>
<secret_access_key>SECRET_ACCESS_KEY</secret_access_key>
<skip_access_check>true</skip_access_check>
<send_metadata>true</send_metadata>
</s3>
</disks>
</storage_configuration>
</clickhouse>
- skip_access_check — if true, it's possible to use read-only credentials with a regular MergeTree table. But you would need to disable merges (the prefer_not_to_merge setting) on the s3 volume as well.
- send_metadata — if true, ClickHouse will populate the s3 object with the initial part & file path, which allows you to recover metadata from s3 and makes debugging easier.
Restore metadata from S3
Default
Limitations:
- ClickHouse needs RW access to this bucket
In order to restore metadata, you need to create a restore file in the metadata_path/_s3_disk_name_ directory:
touch /var/lib/clickhouse/disks/_s3_disk_name_/restore
In that case ClickHouse will restore to the same bucket and path, and update only the metadata files in the s3 bucket.
Custom
Limitations:
- ClickHouse needs RO access to the old bucket and RW to the new.
- ClickHouse will copy objects in case of restoring to a different bucket or path.
If you would like to change the bucket or path, you need to populate the restore file with settings in key=value format:
cat /var/lib/clickhouse/disks/_s3_disk_name_/restore
source_bucket=s3disk
source_path=vol1/
2 - AggregateFunction(uniq, UUID) doubled after ClickHouse upgrade
What happened
After a ClickHouse upgrade from a version before 21.6 to a version after 21.6, the count of unique UUIDs in AggregatingMergeTree tables nearly doubled when data generated by different ClickHouse versions was merged.
Why it happened
A pull request changed the internal representation of big integer data types (and UUID). The SipHash64 hash function used by the uniq aggregate function for the UUID data type was replaced with intHash64, which produces different results for the same UUID value across ClickHouse versions. Therefore, counts double when uniqState values created by different ClickHouse versions are merged together.
Related issue.
Solution
You need to replace any occurrence of uniqState(uuid) in MATERIALIZED VIEWs with uniqState(sipHash64(uuid)) and change the data type of already saved data from AggregateFunction(uniq, UUID) to AggregateFunction(uniq, UInt64), because the result data type of sipHash64 is UInt64.
-- On ClickHouse version 21.3
CREATE TABLE uniq_state
(
`key` UInt32,
`value` AggregateFunction(uniq, UUID)
)
ENGINE = MergeTree
ORDER BY key
INSERT INTO uniq_state SELECT
number % 10000 AS key,
uniqState(reinterpretAsUUID(number))
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.404 sec. Processed 1.05 million rows, 8.38 MB (2.59 million rows/s., 20.74 MB/s.)
SELECT
key % 20,
uniqMerge(value)
FROM uniq_state
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│ 0 │ 50000 │
│ 1 │ 50000 │
│ 2 │ 50000 │
│ 3 │ 50000 │
│ 4 │ 50000 │
│ 5 │ 50000 │
│ 6 │ 49999 │
│ 7 │ 50000 │
│ 8 │ 49999 │
│ 9 │ 50000 │
│ 10 │ 50000 │
│ 11 │ 50000 │
│ 12 │ 50000 │
│ 13 │ 50000 │
│ 14 │ 50000 │
│ 15 │ 50000 │
│ 16 │ 50000 │
│ 17 │ 50000 │
│ 18 │ 50000 │
│ 19 │ 50000 │
└─────────────────┴──────────────────┘
-- After upgrade of ClickHouse to 21.8
SELECT
key % 20,
uniqMerge(value)
FROM uniq_state
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│ 0 │ 50000 │
│ 1 │ 50000 │
│ 2 │ 50000 │
│ 3 │ 50000 │
│ 4 │ 50000 │
│ 5 │ 50000 │
│ 6 │ 49999 │
│ 7 │ 50000 │
│ 8 │ 49999 │
│ 9 │ 50000 │
│ 10 │ 50000 │
│ 11 │ 50000 │
│ 12 │ 50000 │
│ 13 │ 50000 │
│ 14 │ 50000 │
│ 15 │ 50000 │
│ 16 │ 50000 │
│ 17 │ 50000 │
│ 18 │ 50000 │
│ 19 │ 50000 │
└─────────────────┴──────────────────┘
20 rows in set. Elapsed: 0.240 sec. Processed 10.00 thousand rows, 1.16 MB (41.72 thousand rows/s., 4.86 MB/s.)
CREATE TABLE uniq_state_2
ENGINE = MergeTree
ORDER BY key AS
SELECT *
FROM uniq_state
Ok.
0 rows in set. Elapsed: 0.128 sec. Processed 10.00 thousand rows, 1.16 MB (78.30 thousand rows/s., 9.12 MB/s.)
INSERT INTO uniq_state_2 SELECT
number % 10000 AS key,
uniqState(reinterpretAsUUID(number))
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.266 sec. Processed 1.05 million rows, 8.38 MB (3.93 million rows/s., 31.48 MB/s.)
SELECT
key % 20,
uniqMerge(value)
FROM uniq_state_2
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│ 0 │ 99834 │ <- Count of unique values nearly doubled.
│ 1 │ 100219 │
│ 2 │ 100128 │
│ 3 │ 100457 │
│ 4 │ 100272 │
│ 5 │ 100279 │
│ 6 │ 99372 │
│ 7 │ 99450 │
│ 8 │ 99974 │
│ 9 │ 99632 │
│ 10 │ 99562 │
│ 11 │ 100660 │
│ 12 │ 100439 │
│ 13 │ 100252 │
│ 14 │ 100650 │
│ 15 │ 99320 │
│ 16 │ 100095 │
│ 17 │ 99632 │
│ 18 │ 99540 │
│ 19 │ 100098 │
└─────────────────┴──────────────────┘
20 rows in set. Elapsed: 0.356 sec. Processed 20.00 thousand rows, 2.33 MB (56.18 thousand rows/s., 6.54 MB/s.)
CREATE TABLE uniq_state_3
ENGINE = MergeTree
ORDER BY key AS
SELECT *
FROM uniq_state
0 rows in set. Elapsed: 0.126 sec. Processed 10.00 thousand rows, 1.16 MB (79.33 thousand rows/s., 9.24 MB/s.)
-- Option 1, create separate column
ALTER TABLE uniq_state_3
ADD COLUMN `value_2` AggregateFunction(uniq, UInt64) DEFAULT unhex(hex(value));
ALTER TABLE uniq_state_3
UPDATE value_2 = value_2 WHERE 1;
SELECT *
FROM system.mutations
WHERE is_done = 0;
Ok.
0 rows in set. Elapsed: 0.008 sec.
INSERT INTO uniq_state_3 (key, value_2) SELECT
number % 10000 AS key,
uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.337 sec. Processed 1.05 million rows, 8.38 MB (3.11 million rows/s., 24.89 MB/s.)
SELECT
key % 20,
uniqMerge(value),
uniqMerge(value_2)
FROM uniq_state_3
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┬─uniqMerge(value_2)─┐
│ 0 │ 50000 │ 50000 │
│ 1 │ 50000 │ 50000 │
│ 2 │ 50000 │ 50000 │
│ 3 │ 50000 │ 50000 │
│ 4 │ 50000 │ 50000 │
│ 5 │ 50000 │ 50000 │
│ 6 │ 49999 │ 49999 │
│ 7 │ 50000 │ 50000 │
│ 8 │ 49999 │ 49999 │
│ 9 │ 50000 │ 50000 │
│ 10 │ 50000 │ 50000 │
│ 11 │ 50000 │ 50000 │
│ 12 │ 50000 │ 50000 │
│ 13 │ 50000 │ 50000 │
│ 14 │ 50000 │ 50000 │
│ 15 │ 50000 │ 50000 │
│ 16 │ 50000 │ 50000 │
│ 17 │ 50000 │ 50000 │
│ 18 │ 50000 │ 50000 │
│ 19 │ 50000 │ 50000 │
└─────────────────┴──────────────────┴────────────────────┘
20 rows in set. Elapsed: 0.768 sec. Processed 20.00 thousand rows, 4.58 MB (26.03 thousand rows/s., 5.96 MB/s.)
-- Option 2, modify column in-place with String as intermediate data type.
ALTER TABLE uniq_state_3
MODIFY COLUMN `value` String
Ok.
0 rows in set. Elapsed: 0.280 sec.
ALTER TABLE uniq_state_3
MODIFY COLUMN `value` AggregateFunction(uniq, UInt64)
Ok.
0 rows in set. Elapsed: 0.254 sec.
INSERT INTO uniq_state_3 (key, value) SELECT
number % 10000 AS key,
uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.554 sec. Processed 1.05 million rows, 8.38 MB (1.89 million rows/s., 15.15 MB/s.)
SELECT
key % 20,
uniqMerge(value),
uniqMerge(value_2)
FROM uniq_state_3
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┬─uniqMerge(value_2)─┐
│ 0 │ 50000 │ 50000 │
│ 1 │ 50000 │ 50000 │
│ 2 │ 50000 │ 50000 │
│ 3 │ 50000 │ 50000 │
│ 4 │ 50000 │ 50000 │
│ 5 │ 50000 │ 50000 │
│ 6 │ 49999 │ 49999 │
│ 7 │ 50000 │ 50000 │
│ 8 │ 49999 │ 49999 │
│ 9 │ 50000 │ 50000 │
│ 10 │ 50000 │ 50000 │
│ 11 │ 50000 │ 50000 │
│ 12 │ 50000 │ 50000 │
│ 13 │ 50000 │ 50000 │
│ 14 │ 50000 │ 50000 │
│ 15 │ 50000 │ 50000 │
│ 16 │ 50000 │ 50000 │
│ 17 │ 50000 │ 50000 │
│ 18 │ 50000 │ 50000 │
│ 19 │ 50000 │ 50000 │
└─────────────────┴──────────────────┴────────────────────┘
20 rows in set. Elapsed: 0.589 sec. Processed 30.00 thousand rows, 6.87 MB (50.93 thousand rows/s., 11.66 MB/s.)
SHOW CREATE TABLE uniq_state_3;
CREATE TABLE default.uniq_state_3
(
`key` UInt32,
`value` AggregateFunction(uniq, UInt64),
`value_2` AggregateFunction(uniq, UInt64) DEFAULT unhex(hex(value))
)
ENGINE = MergeTree
ORDER BY key
SETTINGS index_granularity = 8192
-- Option 3, CAST uniqState(UInt64) to String.
CREATE TABLE uniq_state_4
ENGINE = MergeTree
ORDER BY key AS
SELECT *
FROM uniq_state
Ok.
0 rows in set. Elapsed: 0.146 sec. Processed 10.00 thousand rows, 1.16 MB (68.50 thousand rows/s., 7.98 MB/s.)
INSERT INTO uniq_state_4 (key, value) SELECT
number % 10000 AS key,
CAST(uniqState(sipHash64(reinterpretAsUUID(number))), 'String')
FROM numbers(1000000)
GROUP BY key
Ok.
0 rows in set. Elapsed: 0.476 sec. Processed 1.05 million rows, 8.38 MB (2.20 million rows/s., 17.63 MB/s.)
SELECT
key % 20,
uniqMerge(value)
FROM uniq_state_4
GROUP BY key % 20
┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│ 0 │ 50000 │
│ 1 │ 50000 │
│ 2 │ 50000 │
│ 3 │ 50000 │
│ 4 │ 50000 │
│ 5 │ 50000 │
│ 6 │ 49999 │
│ 7 │ 50000 │
│ 8 │ 49999 │
│ 9 │ 50000 │
│ 10 │ 50000 │
│ 11 │ 50000 │
│ 12 │ 50000 │
│ 13 │ 50000 │
│ 14 │ 50000 │
│ 15 │ 50000 │
│ 16 │ 50000 │
│ 17 │ 50000 │
│ 18 │ 50000 │
│ 19 │ 50000 │
└─────────────────┴──────────────────┘
20 rows in set. Elapsed: 0.281 sec. Processed 20.00 thousand rows, 2.33 MB (71.04 thousand rows/s., 8.27 MB/s.)
SHOW CREATE TABLE uniq_state_4;
CREATE TABLE default.uniq_state_4
(
`key` UInt32,
`value` AggregateFunction(uniq, UUID)
)
ENGINE = MergeTree
ORDER BY key
SETTINGS index_granularity = 8192
3 - Can not connect to my ClickHouse server
Errors like “Connection reset by peer, while reading from socket”
Ensure that the clickhouse-server is running
systemctl status clickhouse-server
If the server was restarted recently and does not accept connections after the restart, it is most probably still starting up. During the startup sequence it needs to iterate over all data folders in /var/lib/clickhouse-server. If you have a very high number of folders there (usually caused by wrong partitioning or a very high number of tables / databases), or if the disk is very slow (for example NFS), that startup can take a lot of time.
You can check that by looking for the 'Ready for connections' line in
/var/log/clickhouse-server/clickhouse-server.log
(Information log level needed).
Ensure you use the proper port / IP / interface.
Ensure you're not trying to connect to a secure port without TLS / HTTPS, or vice versa.
For clickhouse-client, pay attention to the host / port / secure flags.
Ensure the interface you're connecting to is the one ClickHouse listens on (by default ClickHouse listens only on localhost).
Note: If you uncomment only the line
<listen_host>0.0.0.0</listen_host>
ClickHouse will listen only on IPv4 interfaces, while localhost (used by clickhouse-client) may resolve to an IPv6 address, and clickhouse-client may fail to connect.
How to check which interfaces / ports ClickHouse listens on?
sudo lsof -i -P -n | grep LISTEN

echo listen_host
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=listen_host
echo tcp_port
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=tcp_port
echo tcp_port_secure
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=tcp_port_secure
echo http_port
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=http_port
echo https_port
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=https_port
For a secure connection:
- ensure that the server uses a certificate which can be validated by the client
- OR disable certificate checks on the client (INSECURE)
Check for errors in /var/log/clickhouse-server/clickhouse-server.err.log
Is clickhouse able to serve some trivial tcp / http requests from localhost?
curl 127.0.0.1:9000
curl 127.0.0.1:8123
Check number of sockets opened by clickhouse
sudo lsof -i -a -p $(pidof clickhouse-server)

# or (adjust 9000 / 8123 ports if needed)
netstat -tn 2>/dev/null | tail -n +3 | awk '{ printf("%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $5, $6) }' | clickhouse-local -S "Proto String, RecvQ Int64, SendQ Int64, LocalAddress String, ForeignAddress String, State LowCardinality(String)" --query="SELECT * FROM table WHERE LocalAddress like '%:9000' FORMAT PrettyCompact"

netstat -tn 2>/dev/null | tail -n +3 | awk '{ printf("%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $5, $6) }' | clickhouse-local -S "Proto String, RecvQ Int64, SendQ Int64, LocalAddress String, ForeignAddress String, State LowCardinality(String)" --query="SELECT * FROM table WHERE LocalAddress like '%:8123' FORMAT PrettyCompact"
ClickHouse has a limit on the number of open connections (4000 by default).
Check also:
# the system overall can handle only a limited number of connections
netstat

# you can also be reaching one of the process ulimits (Max open files)
cat /proc/$(pidof -s clickhouse-server)/limits
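You can also check the current number of connections from inside ClickHouse; a minimal sketch using system.metrics (which exposes counters such as TCPConnection and HTTPConnection):
SELECT metric, value
FROM system.metrics
WHERE metric LIKE '%Connection%';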
Check firewall / selinux rules (if used)
4 - cgroups and kubernetes cloud providers
Why my ClickHouse is slow after upgrade to version 22.2 and higher?
The probable reason is that ClickHouse 22.2 started to respect cgroups (Respect cgroups limits in max_threads autodetection. #33342 (JaySon)).
You can observe that max_threads = 1:
SELECT
name,
value
FROM system.settings
WHERE name = 'max_threads'
┌─name────────┬─value─────┐
│ max_threads │ 'auto(1)' │
└─────────────┴───────────┘
This makes ClickHouse execute all queries with a single thread (the normal behavior is half of the available CPU cores; with 64 cores, max_threads would be 'auto(32)').
We observe this cgroups behavior in AWS EKS (Kubernetes) environments with the Altinity ClickHouse Operator when requests.cpu and limits.cpu are not set for a resource.
Workaround
We suggest setting requests.cpu to half of the available CPU cores, and limits.cpu to the number of CPU cores.
For example in case of 16 CPU cores:
resources:
requests:
memory: ...
cpu: 8
limits:
memory: ....
cpu: 16
Then you should get a new result:
SELECT
name,
value
FROM system.settings
WHERE name = 'max_threads'
┌─name────────┬─value─────┐
│ max_threads │ 'auto(8)' │
└─────────────┴───────────┘
in depth
For some reason, when requests.cpu & limits.cpu are empty, AWS EKS sets the cgroup kernel parameters to the following:
# cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1
# cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000
# cat /sys/fs/cgroup/cpu/cpu.shares
2
This makes ClickHouse set max_threads = 1 because of:
cgroup_share = /sys/fs/cgroup/cpu/cpu.shares (2)
PER_CPU_SHARES = 1024
share_count = ceil( cgroup_share / PER_CPU_SHARES ) ---> ceil(2 / 1024) ---> 1
Fix
The incorrect calculation was fixed in https://github.com/ClickHouse/ClickHouse/pull/35815 and works correctly in newer releases.
5 - ClickHouse and different filesystems
In general ClickHouse should work with any POSIX-compatible filesystem.
- hard links and soft links support is mandatory.
- ClickHouse can use O_DIRECT mode to bypass the cache (and async io).
- ClickHouse can use the renameat2 system call for some atomic operations (not all filesystems support it).
- Depending on the schema and details of the usage, the filesystem load can vary between setups. The most typical load is high throughput with low or moderate IOPS.
- Data is compressed in ClickHouse (LZ4 by default), while indexes / marks / metadata files are not. Enabling disk-level compression can sometimes improve compression, but can affect read / write speed.
ext4
no issues, fully supported.
The minimum kernel version required is 3.15 (newer are recommended)
XFS
Performance issues have been reported by users; use at your own risk. Old kernels are not recommended (4.0 or newer is recommended).
According to users' feedback, XFS behaves worse with ClickHouse under heavy load. We don't have real proofs/benchmarks though; example reports:
- In GitHub there are complaints about XFS from Cloudflare.
- Recently a colleague discovered that two ClickHouse servers performed worse in a cluster than the others, and it turned out those servers were accidentally set up with XFS instead of ext4.
- In the system journal you can sometimes see reports like 'task XYZ blocked for more than 120 seconds' with a stack trace pointing to XFS code (example: https://gist.github.com/filimonov/85b894268f978c2ccc18ea69bae5adbd ).
- The system sometimes goes to 99% I/O in the kernel under load.
- We have XFS; sometimes ClickHouse goes to "sleep" because the XFS daemon is doing something unknown.
Maybe the above problems can be worked around with some tuning/settings, but so far we do not have a working and confirmed way to do this.
ZFS
Limitations exist, extra tuning may be needed, and having more RAM is recommended. Old kernels are not recommended.
Memory usage control: the ZFS adaptive replacement cache (ARC) can take a lot of RAM and can be the reason for out-of-memory issues when memory is also requested by ClickHouse.
- It seems that the most important setting is zfs_arc_max: you just need to limit the maximum size of the ARC so that the maximum ARC size plus ClickHouse itself does not exceed the available RAM. For example, we set a limit of 80% of RAM for ClickHouse and 10% for ARC; 10% remains for the system and other applications.
Tuning:
- Another potentially interesting setting is primarycache=metadata; see a benchmark example: https://www.ikus-soft.com/en/blog/2018-05-23-proxmox-primarycache-all-metadata/
- Examples of tuning ZFS for MySQL: https://wiki.freebsd.org/ZFSTuningGuide - perhaps some of this can also be useful (atime, recordsize), but everything needs to be carefully checked with benchmarks (we have not had the chance to do so).
- best practices
Important note: ZFS does not support the renameat2 system call, which is used by the Atomic database engine, and therefore some of the Atomic functionality will not be available.
In old versions of ClickHouse, you can face issues with the O_DIRECT mode.
Also, there is a well-known (and controversial) Linus Torvalds opinion: "Don't Use ZFS on Linux" [1], [2], [3].
BTRFS
Not enough information. Some users report performance improvement for their use case.
ReiserFS
Not enough information.
Lustre
There are reports that some people successfully use it in their setups. A fast network is required.
There were some reports about data damage on the disks with older ClickHouse versions, which could be caused by issues with O_DIRECT or async io support on Lustre.
NFS (and EFS)
According to reports it works, and throughput depends a lot on the network speed. IOPS / the number of file operations per second can be very low (due to the locking mechanism).
https://github.com/ClickHouse/ClickHouse/issues/31113
MooseFS
There are installations using that. No extra info.
GlusterFS
There are installations using that. No extra info.
Ceph
There are installations using that. Some information: https://github.com/ClickHouse/ClickHouse/issues/8315
6 - Transforming ClickHouse logs to ndjson using Vector.dev
ClickHouse 22.8
Starting from version 22.8, ClickHouse supports writing logs in JSON format:
<?xml version="1.0"?>
<clickhouse>
<logger>
<!-- Structured log formatting:
You can specify log format(for now, JSON only). In that case, the console log will be printed
in specified format like JSON.
For example, as below:
{"date_time":"1650918987.180175","thread_name":"#1","thread_id":"254545","level":"Trace","query_id":"","logger_name":"BaseDaemon","message":"Received signal 2","source_file":"../base/daemon/BaseDaemon.cpp; virtual void SignalListener::run()","source_line":"192"}
To enable JSON logging support, just uncomment <formatting> tag below.
-->
<formatting>json</formatting>
</logger>
</clickhouse>
Transforming ClickHouse logs to ndjson using Vector.dev
Installation of vector.dev
# arm64
wget https://packages.timber.io/vector/0.15.2/vector_0.15.2-1_arm64.deb
# amd64
wget https://packages.timber.io/vector/0.15.2/vector_0.15.2-1_amd64.deb
dpkg -i vector_0.15.2-1_*.deb
systemctl stop vector
mkdir /var/log/clickhouse-server-json
chown vector.vector /var/log/clickhouse-server-json
usermod -a -G clickhouse vector
vector config
# cat /etc/vector/vector.toml
data_dir = "/var/lib/vector"
[sources.clickhouse-log]
type = "file"
include = [ "/var/log/clickhouse-server/clickhouse-server.log" ]
fingerprinting.strategy = "device_and_inode"
message_start_indicator = '^\d+\.\d+\.\d+ \d+:\d+:\d+'
multi_line_timeout = 1000
[transforms.clickhouse-log-text]
inputs = [ "clickhouse-log" ]
type = "remap"
source = '''
. |= parse_regex!(.message, r'^(?P<timestamp>\d+\.\d+\.\d+ \d+:\d+:\d+\.\d+) \[\s?(?P<thread_id>\d+)\s?\] \{(?P<query_id>.*)\} <(?P<severity>\w+)> (?s)(?P<message>.*$)')
'''
[sinks.emit-clickhouse-log-json]
type = "file"
inputs = [ "clickhouse-log-text" ]
compression = "none"
path = "/var/log/clickhouse-server-json/clickhouse-server.%Y-%m-%d.ndjson"
encoding.only_fields = ["timestamp", "thread_id", "query_id", "severity", "message" ]
encoding.codec = "ndjson"
start
systemctl start vector
tail /var/log/clickhouse-server-json/clickhouse-server.2022-04-21.ndjson
{"message":"DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 166.80 GiB.","query_id":"","severity":"Debug","thread_id":"283239","timestamp":"2022.04.21 13:43:21.164660"}
{"message":"MergedBlockOutputStream: filled checksums 202204_67118_67118_0 (state Temporary)","query_id":"","severity":"Trace","thread_id":"283239","timestamp":"2022.04.21 13:43:21.166810"}
{"message":"system.metric_log (e3365172-4c9b-441b-b803-756ae030e741): Renaming temporary part tmp_insert_202204_67118_67118_0 to 202204_171703_171703_0.","query_id":"","severity":"Trace","thread_id":"283239","timestamp":"2022.04.21 13:43:21.167226"}
....
sink logs into ClickHouse table
Be careful with logging ClickHouse messages into the same ClickHouse instance; it will cause endless recursive self-logging.
create table default.clickhouse_logs(
timestamp DateTime64(3),
host LowCardinality(String),
thread_id LowCardinality(String),
severity LowCardinality(String),
query_id String,
message String)
Engine = MergeTree
Partition by toYYYYMM(timestamp)
Order by (toStartOfHour(timestamp), host, severity, query_id);
create user vector identified by 'vector1234';
grant insert on default.clickhouse_logs to vector;
create settings profile or replace profile_vector settings log_queries=0 readonly TO vector;
[sinks.clickhouse-output-clickhouse]
inputs = ["clickhouse-log-text"]
type = "clickhouse"
host = "http://localhost:8123"
database = "default"
auth.strategy = "basic"
auth.user = "vector"
auth.password = "vector1234"
healthcheck = true
table = "clickhouse_logs"
encoding.timestamp_format = "unix"
buffer.type = "disk"
buffer.max_size = 104900000
buffer.when_full = "block"
request.in_flight_limit = 20
encoding.only_fields = ["host", "timestamp", "thread_id", "query_id", "severity", "message"]
select * from default.clickhouse_logs limit 10;
┌───────────────timestamp─┬─host───────┬─thread_id─┬─severity─┬─query_id─┬─message─────────────────────────────────────────────────────
│ 2022-04-21 19:08:13.443 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Authentic
│ 2022-04-21 19:08:13.443 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Authentic
│ 2022-04-21 19:08:13.443 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Creating
│ 2022-04-21 19:08:13.447 │ clickhouse │ 283155 │ Debug │ │ MemoryTracker: Peak memory usage (for query): 4.00 MiB.
│ 2022-04-21 19:08:13.447 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Destroyin
│ 2022-04-21 19:08:13.495 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Authentic
│ 2022-04-21 19:08:13.495 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Authentic
│ 2022-04-21 19:08:13.495 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Creating
│ 2022-04-21 19:08:13.496 │ clickhouse │ 283155 │ Debug │ │ MemoryTracker: Peak memory usage (for query): 4.00 MiB.
│ 2022-04-21 19:08:13.496 │ clickhouse │ 283155 │ Debug │ │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Destroyin
└─────────────────────────┴────────────┴───────────┴──────────┴──────────┴─────────────────────────────────────────────────────────────
7 - ClickHouse operator
https://github.com/Altinity/clickhouse-operator/blob/master/docs/README.md
8 - Compatibility layer for clickhouse-operator metric exporter
It’s possible to expose clickhouse-server metrics in clickhouse-operator style. This is useful for the clickhouse-operator Grafana dashboard.
CREATE VIEW system.operator_compatible_metrics
(
`name` String,
`value` Float64,
`help` String,
`labels` Map(String, String),
`type` String
) AS
SELECT
concat('chi_clickhouse_event_', event) AS name,
CAST(value, 'Float64') AS value,
description AS help,
map('hostname', hostName()) AS labels,
'counter' AS type
FROM system.events
UNION ALL
SELECT
concat('chi_clickhouse_metric_', metric) AS name,
CAST(value, 'Float64') AS value,
description AS help,
map('hostname', hostName()) AS labels,
'gauge' AS type
FROM system.metrics
UNION ALL
SELECT
concat('chi_clickhouse_metric_', metric) AS name,
value,
'' AS help,
map('hostname', hostName()) AS labels,
'gauge' AS type
FROM system.asynchronous_metrics
UNION ALL
SELECT
'chi_clickhouse_metric_MemoryDictionaryBytesAllocated' AS name,
CAST(sum(bytes_allocated), 'Float64') AS value,
'Memory size allocated for dictionaries' AS help,
map('hostname', hostName()) AS labels,
'gauge' AS type
FROM system.dictionaries
UNION ALL
SELECT
'chi_clickhouse_metric_LongestRunningQuery' AS name,
CAST(max(elapsed), 'Float64') AS value,
'Longest running query time' AS help,
map('hostname', hostName()) AS labels,
'gauge' AS type
FROM system.processes
UNION ALL
WITH
['chi_clickhouse_table_partitions', 'chi_clickhouse_table_parts', 'chi_clickhouse_table_parts_bytes', 'chi_clickhouse_table_parts_bytes_uncompressed', 'chi_clickhouse_table_parts_rows', 'chi_clickhouse_metric_DiskDataBytes', 'chi_clickhouse_metric_MemoryPrimaryKeyBytesAllocated'] AS names,
[uniq(partition), count(), sum(bytes), sum(data_uncompressed_bytes), sum(rows), sum(bytes_on_disk), sum(primary_key_bytes_in_memory_allocated)] AS values,
arrayJoin(arrayZip(names, values)) AS tpl
SELECT
tpl.1 AS name,
CAST(tpl.2, 'Float64') AS value,
'' AS help,
map('database', database, 'table', table, 'active', toString(active), 'hostname', hostName()) AS labels,
'gauge' AS type
FROM system.parts
GROUP BY
active,
database,
table
UNION ALL
WITH
['chi_clickhouse_table_mutations', 'chi_clickhouse_table_mutations_parts_to_do'] AS names,
[CAST(count(), 'Float64'), CAST(sum(parts_to_do), 'Float64')] AS values,
arrayJoin(arrayZip(names, values)) AS tpl
SELECT
tpl.1 AS name,
tpl.2 AS value,
'' AS help,
map('database', database, 'table', table, 'hostname', hostName()) AS labels,
'gauge' AS type
FROM system.mutations
WHERE is_done = 0
GROUP BY
database,
table
UNION ALL
WITH if(coalesce(reason, 'unknown') = '', 'detached_by_user', coalesce(reason, 'unknown')) AS detach_reason
SELECT
'chi_clickhouse_metric_DetachedParts' AS name,
CAST(count(), 'Float64') AS value,
'' AS help,
map('database', database, 'table', table, 'disk', disk, 'hostname', hostName()) AS labels,
'gauge' AS type
FROM system.detached_parts
GROUP BY
database,
table,
disk,
reason
ORDER BY name ASC
nano /etc/clickhouse-server/config.d/operator_metrics.xml
<clickhouse>
<http_handlers>
<rule>
<url>/metrics</url>
<methods>POST,GET</methods>
<handler>
<type>predefined_query_handler</type>
<query>SELECT * FROM system.operator_compatible_metrics FORMAT Prometheus</query>
<content_type>text/plain; charset=utf-8</content_type>
</handler>
</rule>
<defaults/>
<rule>
<url>/</url>
<methods>POST,GET</methods>
<headers><pragma>no-cache</pragma></headers>
<handler>
<type>dynamic_query_handler</type>
<query_param_name>query</query_param_name>
</handler>
</rule>
</http_handlers>
</clickhouse>
curl http://localhost:8123/metrics
# HELP chi_clickhouse_metric_Query Number of executing queries
# TYPE chi_clickhouse_metric_Query gauge
chi_clickhouse_metric_Query{hostname="LAPTOP"} 1
# HELP chi_clickhouse_metric_Merge Number of executing background merges
# TYPE chi_clickhouse_metric_Merge gauge
chi_clickhouse_metric_Merge{hostname="LAPTOP"} 0
# HELP chi_clickhouse_metric_PartMutation Number of mutations (ALTER DELETE/UPDATE)
# TYPE chi_clickhouse_metric_PartMutation gauge
chi_clickhouse_metric_PartMutation{hostname="LAPTOP"} 0
9 - How to convert uniqExact states to approximate uniq functions states
uniqExactState
uniqExactState is stored in two parts: a count of values in LEB128 format + the list of values without a delimiter.
In our case, each value is the sipHash128 of the strings passed to the uniqExact function.
┌─hex(uniqExactState(toString(arrayJoin([1]))))─┐
│ 01E2756D8F7A583CA23016E03447724DE7 │
└───────────────────────────────────────────────┘
01 E2756D8F7A583CA23016E03447724DE7
^ ^
LEB128 sipHash128
┌─hex(uniqExactState(toString(arrayJoin([1, 2]))))───────────────────┐
│ 024809CB4528E00621CF626BE9FA14E2BFE2756D8F7A583CA23016E03447724DE7 │
└────────────────────────────────────────────────────────────────────┘
02 4809CB4528E00621CF626BE9FA14E2BF E2756D8F7A583CA23016E03447724DE7
^ ^ ^
LEB128 sipHash128 sipHash128
So, our task is to find out how we can generate such values ourselves.
In the case of the String data type, it is just the simple sipHash128 function.
┌─hex(sipHash128(toString(2)))─────┬─hex(sipHash128(toString(1)))─────┐
│ 4809CB4528E00621CF626BE9FA14E2BF │ E2756D8F7A583CA23016E03447724DE7 │
└──────────────────────────────────┴──────────────────────────────────┘
The second task is to read a state and split it into an array of values.
Luckily for us, ClickHouse uses the exact same serialization (LEB128 + list of values) for Arrays (in this case, when uniqExactState and Array are serialized into the RowBinary format).
We need one helper, a UDF function, to do that conversion:
cat /etc/clickhouse-server/pipe_function.xml
<clickhouse>
<function>
<type>executable</type>
<execute_direct>0</execute_direct>
<name>pipe</name>
<return_type>Array(FixedString(16))</return_type>
<argument>
<type>String</type>
</argument>
<format>RowBinary</format>
<command>cat</command>
<send_chunk_header>0</send_chunk_header>
</function>
</clickhouse>
This UDF, pipe, converts uniqExactState to Array(FixedString(16)).
┌─arrayMap(x -> hex(x), pipe(uniqExactState(toString(arrayJoin([1, 2])))))──────────────┐
│ ['4809CB4528E00621CF626BE9FA14E2BF','E2756D8F7A583CA23016E03447724DE7'] │
└───────────────────────────────────────────────────────────────────────────────────────┘
And here is the full example of how you can convert uniqExactState(string) to uniqState(string) or uniqCombinedState(string) using the pipe UDF and arrayReduce('func', [..]).
-- Generate demo with random data, uniqs are stored as heavy uniqExact
CREATE TABLE aggregates
(
`id` UInt32,
`uniqExact` AggregateFunction(uniqExact, String)
)
ENGINE = AggregatingMergeTree
ORDER BY id as
SELECT
number % 10000 AS id,
uniqExactState(toString(number))
FROM numbers(10000000)
GROUP BY id;
0 rows in set. Elapsed: 2.042 sec. Processed 10.01 million rows, 80.06 MB (4.90 million rows/s., 39.21 MB/s.)
-- Let's add a new columns to store optimized, approximate uniq & uniqCombined
ALTER TABLE aggregates
ADD COLUMN `uniq` AggregateFunction(uniq, FixedString(16))
default arrayReduce('uniqState', pipe(uniqExact)),
ADD COLUMN `uniqCombined` AggregateFunction(uniqCombined, FixedString(16))
default arrayReduce('uniqCombinedState', pipe(uniqExact));
-- Materialize defaults in the new columns
ALTER TABLE aggregates UPDATE uniqCombined = uniqCombined, uniq = uniq
WHERE 1 settings mutations_sync=2;
-- Let's reset the defaults to remove the table's dependency on the UDF
ALTER TABLE aggregates
modify COLUMN `uniq` remove default,
modify COLUMN `uniqCombined` remove default;
-- Alternatively you can populate data in the new columns directly without using DEFAULT columns
-- ALTER TABLE aggregates UPDATE
-- uniqCombined = arrayReduce('uniqCombinedState', pipe(uniqExact)),
-- uniq = arrayReduce('uniqState', pipe(uniqExact))
-- WHERE 1 settings mutations_sync=2;
-- Check results; they are slightly different, because uniq & uniqCombined are approximate functions
SELECT
id % 20 AS key,
uniqExactMerge(uniqExact),
uniqCombinedMerge(uniqCombined),
uniqMerge(uniq)
FROM aggregates
GROUP BY key
┌─key─┬─uniqExactMerge(uniqExact)─┬─uniqCombinedMerge(uniqCombined)─┬─uniqMerge(uniq)─┐
│ 0 │ 500000 │ 500195 │ 500455 │
│ 1 │ 500000 │ 502599 │ 501549 │
│ 2 │ 500000 │ 498058 │ 504428 │
│ 3 │ 500000 │ 499748 │ 500195 │
│ 4 │ 500000 │ 500791 │ 500836 │
│ 5 │ 500000 │ 502430 │ 497558 │
│ 6 │ 500000 │ 500262 │ 501785 │
│ 7 │ 500000 │ 501514 │ 495758 │
│ 8 │ 500000 │ 500121 │ 498597 │
│ 9 │ 500000 │ 502173 │ 500455 │
│ 10 │ 500000 │ 499144 │ 498386 │
│ 11 │ 500000 │ 500525 │ 503139 │
│ 12 │ 500000 │ 503624 │ 497103 │
│ 13 │ 500000 │ 499986 │ 497992 │
│ 14 │ 500000 │ 502027 │ 494833 │
│ 15 │ 500000 │ 498831 │ 500983 │
│ 16 │ 500000 │ 501103 │ 500836 │
│ 17 │ 500000 │ 499409 │ 496791 │
│ 18 │ 500000 │ 501641 │ 502991 │
│ 19 │ 500000 │ 500648 │ 500881 │
└─────┴───────────────────────────┴─────────────────────────────────┴─────────────────┘
20 rows in set. Elapsed: 2.312 sec. Processed 10.00 thousand rows, 7.61 MB (4.33 thousand rows/s., 3.29 MB/s.)
Now, let's repeat the same insert, but this time we will also populate uniq & uniqCombined with values converted via the sipHash128 function.
If we did everything right, the uniq counts will not change, because we inserted the exact same values.
INSERT INTO aggregates SELECT
number % 10000 AS id,
uniqExactState(toString(number)),
uniqState(sipHash128(toString(number))),
uniqCombinedState(sipHash128(toString(number)))
FROM numbers(10000000)
GROUP BY id;
0 rows in set. Elapsed: 5.386 sec. Processed 10.01 million rows, 80.06 MB (1.86 million rows/s., 14.86 MB/s.)
SELECT
id % 20 AS key,
uniqExactMerge(uniqExact),
uniqCombinedMerge(uniqCombined),
uniqMerge(uniq)
FROM aggregates
GROUP BY key
┌─key─┬─uniqExactMerge(uniqExact)─┬─uniqCombinedMerge(uniqCombined)─┬─uniqMerge(uniq)─┐
│ 0 │ 500000 │ 500195 │ 500455 │
│ 1 │ 500000 │ 502599 │ 501549 │
│ 2 │ 500000 │ 498058 │ 504428 │
│ 3 │ 500000 │ 499748 │ 500195 │
│ 4 │ 500000 │ 500791 │ 500836 │
│ 5 │ 500000 │ 502430 │ 497558 │
│ 6 │ 500000 │ 500262 │ 501785 │
│ 7 │ 500000 │ 501514 │ 495758 │
│ 8 │ 500000 │ 500121 │ 498597 │
│ 9 │ 500000 │ 502173 │ 500455 │
│ 10 │ 500000 │ 499144 │ 498386 │
│ 11 │ 500000 │ 500525 │ 503139 │
│ 12 │ 500000 │ 503624 │ 497103 │
│ 13 │ 500000 │ 499986 │ 497992 │
│ 14 │ 500000 │ 502027 │ 494833 │
│ 15 │ 500000 │ 498831 │ 500983 │
│ 16 │ 500000 │ 501103 │ 500836 │
│ 17 │ 500000 │ 499409 │ 496791 │
│ 18 │ 500000 │ 501641 │ 502991 │
│ 19 │ 500000 │ 500648 │ 500881 │
└─────┴───────────────────────────┴─────────────────────────────────┴─────────────────┘
20 rows in set. Elapsed: 3.318 sec. Processed 20.00 thousand rows, 11.02 MB (6.03 thousand rows/s., 3.32 MB/s.)
Let’s compare the data size; uniq won in this case, but check the article Functions to count uniqs, your mileage may vary.
optimize table aggregates final;
SELECT
column,
formatReadableSize(sum(column_data_compressed_bytes) AS size) AS compressed,
formatReadableSize(sum(column_data_uncompressed_bytes) AS usize) AS uncompressed
FROM system.parts_columns
WHERE (active = 1) AND (table LIKE 'aggregates') and column like '%uniq%'
GROUP BY column
ORDER BY size DESC;
┌─column───────┬─compressed─┬─uncompressed─┐
│ uniqExact │ 153.21 MiB │ 152.61 MiB │
│ uniqCombined │ 76.62 MiB │ 76.32 MiB │
│ uniq │ 38.33 MiB │ 38.18 MiB │
└──────────────┴────────────┴──────────────┘
10 - Custom Settings
Using custom settings in config
You cannot use custom settings in the config file ‘as is’, because ClickHouse doesn’t know which data type should be used to parse them.
cat /etc/clickhouse-server/users.d/default_profile.xml
<?xml version="1.0"?>
<yandex>
<profiles>
<default>
<custom_data_version>1</custom_data_version> <!-- will not work! see below -->
</default>
</profiles>
</yandex>
That will end up with the following error:
2021.09.24 12:50:37.369259 [ 264905 ] {} <Error> ConfigReloader: Error updating configuration from '/etc/clickhouse-server/users.xml' config.: Code: 536. DB::Exception: Couldn't restore Field from dump: 1: while parsing value '1' for setting 'custom_data_version'. (CANNOT_RESTORE_FROM_FIELD_DUMP), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x9440eba in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
1. DB::Field::restoreFromDump(std::__1::basic_string_view<char, std::__1::char_traits<char> > const&)::$_4::operator()() const @ 0x10449da0 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
2. DB::Field::restoreFromDump(std::__1::basic_string_view<char, std::__1::char_traits<char> > const&) @ 0x10449bf1 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
3. DB::BaseSettings<DB::SettingsTraits>::stringToValueUtil(std::__1::basic_string_view<char, std::__1::char_traits<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x1042e2bf in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
4. DB::UsersConfigAccessStorage::parseFromConfig(Poco::Util::AbstractConfiguration const&) @ 0x1041a097 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
5. void std::__1::__function::__policy_invoker<void (Poco::AutoPtr<Poco::Util::AbstractConfiguration>, bool)>::__call_impl<std::__1::__function::__default_alloc_func<DB::UsersConfigAccessStorage::load(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<std::__1::shared_ptr<zkutil::ZooKeeper> ()> const&)::$_0, void (Poco::AutoPtr<Poco::Util::AbstractConfiguration>, bool)> >(std::__1::__function::__policy_storage const*, Poco::AutoPtr<Poco::Util::AbstractConfiguration>&&, bool) @ 0x1042e7ff in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
6. DB::ConfigReloader::reloadIfNewer(bool, bool, bool, bool) @ 0x11caf54e in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
7. DB::ConfigReloader::run() @ 0x11cb0f8f in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
8. ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::ConfigReloader::*)(), DB::ConfigReloader*>(void (DB::ConfigReloader::*&&)(), DB::ConfigReloader*&&)::'lambda'()::operator()() @ 0x11cb19f1 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
9. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x9481f5f in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
10. void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda0'()> >(void*) @ 0x9485843 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
11. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
12. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
(version 21.10.1.8002 (official build))
2021.09.29 11:36:07.722213 [ 2090 ] {} <Error> Application: DB::Exception: Couldn't restore Field from dump: 1: while parsing value '1' for setting 'custom_data_version'
To make it work you need to change it in the following way:
cat /etc/clickhouse-server/users.d/default_profile.xml
<?xml version="1.0"?>
<yandex>
<profiles>
<default>
<custom_data_version>UInt64_1</custom_data_version>
</default>
</profiles>
</yandex>
or
cat /etc/clickhouse-server/users.d/default_profile.xml
<?xml version="1.0"?>
<yandex>
<profiles>
<default>
<custom_data_version>'1'</custom_data_version>
</default>
</profiles>
</yandex>
The list of recognized prefixes is in the sources: https://github.com/ClickHouse/ClickHouse/blob/ea13a8b562edbc422c07b5b4ecce353f79b6cb63/src/Core/Field.cpp#L253-L270
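Once the profile is applied, a quick way to confirm the value from SQL is the getSetting function (a minimal check):
SELECT getSetting('custom_data_version');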
11 - Description of asynchronous_metrics
CompiledExpressionCacheCount -- number of compiled cached expressions (if CompiledExpressionCache is enabled)
jemalloc -- parameters of the jemalloc allocator; they are not very useful or interesting
MarkCacheBytes / MarkCacheFiles -- cache for .mrk files (default size is 5GB); you can see whether it uses all 5GB or not
MemoryCode -- how much memory is allocated for the ClickHouse executable
MemoryDataAndStack -- virtual memory allocated for data and stack
MemoryResident -- real memory used by ClickHouse (the same as top RES/RSS)
MemoryShared -- shared memory used by ClickHouse
MemoryVirtual -- virtual memory used by ClickHouse (the same as top VIRT)
NumberOfDatabases
NumberOfTables
ReplicasMaxAbsoluteDelay -- important parameter - replica max absolute delay in seconds
ReplicasMaxRelativeDelay -- replica max relative delay (from other replicas) in seconds
ReplicasMaxInsertsInQueue -- max number of parts to fetch for a single Replicated table
ReplicasSumInsertsInQueue -- sum of parts to fetch for all Replicated tables
ReplicasMaxMergesInQueue -- max number of merges in queue for a single Replicated table
ReplicasSumMergesInQueue -- total number of merges in queue for all Replicated tables
ReplicasMaxQueueSize -- max number of tasks for a single Replicated table
ReplicasSumQueueSize -- total number of tasks in replication queue
UncompressedCacheBytes/UncompressedCacheCells -- allocated memory for uncompressed cache (disabled by default)
Uptime -- uptime seconds
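All of these values can be queried from the system.asynchronous_metrics table; for example, a minimal query for the replication-related ones:
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'Replicas%'
ORDER BY metric;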
12 - Clickhouse data/disk encryption (at rest)
Create folder
mkdir /data/clickhouse_encrypted
chown clickhouse.clickhouse /data/clickhouse_encrypted
Configure encrypted disk and storage
- https://clickhouse.com/docs/en/operations/storing-data/#encrypted-virtual-file-system
- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server-settings-encryption
cat /etc/clickhouse-server/config.d/encrypted_storage.xml
<clickhouse>
<storage_configuration>
<disks>
<disk1>
<type>local</type>
<path>/data/clickhouse_encrypted/</path>
</disk1>
<encrypted_disk>
<type>encrypted</type>
<disk>disk1</disk>
<path>encrypted/</path>
<algorithm>AES_128_CTR</algorithm>
<key_hex id="0">00112233445566778899aabbccddeeff</key_hex>
<current_key_id>0</current_key_id>
</encrypted_disk>
</disks>
<policies>
<encrypted>
<volumes>
<encrypted_volume>
<disk>encrypted_disk</disk>
</encrypted_volume>
</volumes>
</encrypted>
</policies>
</storage_configuration>
</clickhouse>
systemctl restart clickhouse-server
select name, path, type, is_encrypted from system.disks;
┌─name───────────┬─path──────────────────────────────────┬─type──┬─is_encrypted─┐
│ default │ /var/lib/clickhouse/ │ local │ 0 │
│ disk1 │ /data/clickhouse_encrypted/ │ local │ 0 │
│ encrypted_disk │ /data/clickhouse_encrypted/encrypted/ │ local │ 1 │
└────────────────┴───────────────────────────────────────┴───────┴──────────────┘
select * from system.storage_policies;
┌─policy_name─┬─volume_name──────┬─volume_priority─┬─disks──────────────┬─volume_type─┬─max_data_part_size─┬─move_factor─┬─prefer_not_to_merge─┐
│ default │ default │ 1 │ ['default'] │ JBOD │ 0 │ 0 │ 0 │
│ encrypted │ encrypted_volume │ 1 │ ['encrypted_disk'] │ JBOD │ 0 │ 0 │ 0 │
└─────────────┴──────────────────┴─────────────────┴────────────────────┴─────────────┴────────────────────┴─────────────┴─────────────────────┘
Create table
CREATE TABLE bench_encrypted(c_int Int64, c_str varchar(255), c_float Float64)
engine=MergeTree order by c_int
settings storage_policy = 'encrypted';
cat /data/clickhouse_encrypted/encrypted/store/906/9061167e-d5f7-45ea-8e54-eb6ba3b678dc/format_version.txt
ENC�AdruM�˪h"��^�
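Once data has been inserted (as in the benchmark below), you can also confirm from SQL which disk the table's parts are stored on; a quick check via system.parts, which exposes disk_name:
SELECT disk_name, count() AS parts
FROM system.parts
WHERE active AND table = 'bench_encrypted'
GROUP BY disk_name;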
Compare performance of encrypted and not encrypted tables
CREATE TABLE bench_encrypted(c_int Int64, c_str varchar(255), c_float Float64)
engine=MergeTree order by c_int
settings storage_policy = 'encrypted';
insert into bench_encrypted
select toInt64(cityHash64(number)), lower(hex(MD5(toString(number)))), number/cityHash64(number)*10000000
from numbers_mt(100000000);
0 rows in set. Elapsed: 33.357 sec. Processed 100.66 million rows, 805.28 MB (3.02 million rows/s., 24.14 MB/s.)
CREATE TABLE bench_unencrypted(c_int Int64, c_str varchar(255), c_float Float64)
engine=MergeTree order by c_int;
insert into bench_unencrypted
select toInt64(cityHash64(number)), lower(hex(MD5(toString(number)))), number/cityHash64(number)*10000000
from numbers_mt(100000000);
0 rows in set. Elapsed: 31.175 sec. Processed 100.66 million rows, 805.28 MB (3.23 million rows/s., 25.83 MB/s.)
select avg(c_float) from bench_encrypted;
1 row in set. Elapsed: 0.195 sec. Processed 100.00 million rows, 800.00 MB (511.66 million rows/s., 4.09 GB/s.)
select avg(c_float) from bench_unencrypted;
1 row in set. Elapsed: 0.150 sec. Processed 100.00 million rows, 800.00 MB (668.71 million rows/s., 5.35 GB/s.)
select sum(c_int) from bench_encrypted;
1 row in set. Elapsed: 0.281 sec. Processed 100.00 million rows, 800.00 MB (355.74 million rows/s., 2.85 GB/s.)
select sum(c_int) from bench_unencrypted;
1 row in set. Elapsed: 0.193 sec. Processed 100.00 million rows, 800.00 MB (518.88 million rows/s., 4.15 GB/s.)
set max_threads=1;
select avg(c_float) from bench_encrypted;
1 row in set. Elapsed: 0.934 sec. Processed 100.00 million rows, 800.00 MB (107.03 million rows/s., 856.23 MB/s.)
select avg(c_float) from bench_unencrypted;
1 row in set. Elapsed: 0.874 sec. Processed 100.00 million rows, 800.00 MB (114.42 million rows/s., 915.39 MB/s.)
read key_hex from environment variable
- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server-settings-encryption
- https://serverfault.com/questions/413397/how-to-set-environment-variable-in-systemd-service
cat /etc/clickhouse-server/config.d/encrypted_storage.xml
<clickhouse>
<storage_configuration>
<disks>
<disk1>
<type>local</type>
<path>/data/clickhouse_encrypted/</path>
</disk1>
<encrypted_disk>
<type>encrypted</type>
<disk>disk1</disk>
<path>encrypted/</path>
<algorithm>AES_128_CTR</algorithm>
<key_hex from_env="DiskKey"/>
</encrypted_disk>
</disks>
<policies>
<encrypted>
<volumes>
<encrypted_volume>
<disk>encrypted_disk</disk>
</encrypted_volume>
</volumes>
</encrypted>
</policies>
</storage_configuration>
</clickhouse>
cat /etc/default/clickhouse-server
DiskKey=00112233445566778899aabbccddeeff
systemctl restart clickhouse-server
13 - How ALTERs work in ClickHouse
ADD (COLUMN/INDEX/PROJECTION)
Lightweight, will only change table metadata. The new entity will be added when new parts are created during INSERTs or during merges of old parts.
In the case of a COLUMN, ClickHouse will calculate the column value on the fly in the query context.
Warning
CREATE TABLE test_materialization
(
`key` UInt32,
`value` UInt32
)
ENGINE = MergeTree
ORDER BY key;
INSERT INTO test_materialization(key, value) SELECT 1, 1;
INSERT INTO test_materialization(key, value) SELECT 2, 2;
ALTER TABLE test_materialization ADD COLUMN inserted_at DateTime DEFAULT now();
SELECT key, inserted_at FROM test_materialization;
┌─key─┬─────────inserted_at─┐
│ 1 │ 2022-09-01 03:28:58 │
└─────┴─────────────────────┘
┌─key─┬─────────inserted_at─┐
│ 2 │ 2022-09-01 03:28:58 │
└─────┴─────────────────────┘
SELECT key, inserted_at FROM test_materialization;
┌─key─┬─────────inserted_at─┐
│ 1 │ 2022-09-01 03:29:11 │
└─────┴─────────────────────┘
┌─key─┬─────────inserted_at─┐
│ 2 │ 2022-09-01 03:29:11 │
└─────┴─────────────────────┘
Each query returns a different inserted_at value, because the now() function is executed each time.
INSERT INTO test_materialization(key, value) SELECT 3, 3;
SELECT key, inserted_at FROM test_materialization;
┌─key─┬─────────inserted_at─┐
│ 3 │ 2022-09-01 03:29:36 │ -- < This value was materialized during ingestion, that's why it's smaller than value for keys 1 & 2
└─────┴─────────────────────┘
┌─key─┬─────────inserted_at─┐
│ 1 │ 2022-09-01 03:29:53 │
└─────┴─────────────────────┘
┌─key─┬─────────inserted_at─┐
│ 2 │ 2022-09-01 03:29:53 │
└─────┴─────────────────────┘
OPTIMIZE TABLE test_materialization FINAL;
SELECT key, inserted_at FROM test_materialization;
┌─key─┬─────────inserted_at─┐
│ 1 │ 2022-09-01 03:30:52 │
│ 2 │ 2022-09-01 03:30:52 │
│ 3 │ 2022-09-01 03:29:36 │
└─────┴─────────────────────┘
SELECT key, inserted_at FROM test_materialization;
┌─key─┬─────────inserted_at─┐
│ 1 │ 2022-09-01 03:30:52 │
│ 2 │ 2022-09-01 03:30:52 │
│ 3 │ 2022-09-01 03:29:36 │
└─────┴─────────────────────┘
So, data inserted after the addition of the column can have a lower inserted_at value than old data without materialization.
If you want to backpopulate data for old parts, you have multiple options (an example follows this list):
MATERIALIZE (COLUMN/INDEX/PROJECTION) (PART[ITION ID] ‘’)
Will materialize this entity.
OPTIMIZE TABLE xxxx (PART[ITION ID] ‘’) (FINAL)
Will trigger merge, which will lead to materialization of all entities in affected parts.
ALTER TABLE xxxx UPDATE column_name = column_name WHERE 1;
Will trigger mutation, which will materialize this column.
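For the test_materialization table above, a minimal sketch of the first option (ALTER TABLE ... MATERIALIZE COLUMN is available in recent ClickHouse versions):
ALTER TABLE test_materialization MATERIALIZE COLUMN inserted_at;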
DROP (COLUMN/INDEX/PROJECTION)
Lightweight; it only changes table metadata and removes the corresponding files from the filesystem. For Compact parts it will trigger a merge, which can be heavy. issue
MODIFY COLUMN (DATA TYPE)
- Change the column type in the table schema.
- Schedule a mutation to change the type for old parts (you can track its progress as shown below).
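A minimal check for mutations that are still in progress:
SELECT table, command, parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0;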
Mutations
Affected parts are the parts with rows matching the condition.
ALTER TABLE xxxxx DELETE WHERE column_1 = 1;
- Will overwrite all column data in the affected parts.
- For all part(ition)s, will create new directories on disk and write new data to them, or create hardlinks if they are untouched.
- Will register the new part names in ZooKeeper.
ALTER TABLE xxxxx DELETE IN PARTITION ID ’’ WHERE column_1 = 1;
Will do the same but only for specific partition.
ALTER TABLE xxxxx UPDATE column_2 = column_2, column_3 = column_3 WHERE column_1 = 1;
- Will overwrite column_2 and column_3 data in the affected parts.
- For all part(ition)s, will create new directories on disk and write new data to them, or create hardlinks if they are untouched.
- Will register the new part names in ZooKeeper.
DELETE FROM xxxxx WHERE column_1 = 1;
- Will create & populate a hidden boolean column in the affected parts (the _row_exists column).
- For all part(ition)s, will create new directories on disk and write new data to them, or create hardlinks if they are untouched.
- Will register the new part names in ZooKeeper.
Even though LWD mutations do not rewrite all columns, steps 2 & 3 can take significant time for big tables.
14 - http handler example
http handler example (how to disable /play)
# cat /etc/clickhouse-server/config.d/play_disable.xml
<?xml version="1.0" ?>
<yandex>
<http_handlers>
<rule>
<url>/play</url>
<methods>GET</methods>
<handler>
<type>static</type>
<status>403</status>
<content_type>text/plain; charset=UTF-8</content_type>
<response_content></response_content>
</handler>
</rule>
<defaults/> <!-- handler to save default handlers ?query / ping -->
</http_handlers>
</yandex>
15 - Logging
Q. I get errors:
File not found: /var/log/clickhouse-server/clickhouse-server.log.0.
File not found: /var/log/clickhouse-server/clickhouse-server.log.8.gz.
...
File not found: /var/log/clickhouse-server/clickhouse-server.err.log.0, Stack trace (when copying this message, always include the lines below):
0. Poco::FileImpl::handleLastErrorImpl(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x11c2b345 in /usr/bin/clickhouse
1. Poco::PurgeOneFileStrategy::purge(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x11c84618 in /usr/bin/clickhouse
2. Poco::FileChannel::log(Poco::Message const&) @ 0x11c314cc in /usr/bin/clickhouse
3. DB::OwnFormattingChannel::logExtended(DB::ExtendedLogMessage const&) @ 0x8681402 in /usr/bin/clickhouse
4. DB::OwnSplitChannel::logSplit(Poco::Message const&) @ 0x8682fa8 in /usr/bin/clickhouse
5. DB::OwnSplitChannel::log(Poco::Message const&) @ 0x8682e41 in /usr/bin/clickhouse
A. Check that you have proper permissions on the log folder, and enough disk space (& inodes) on the block device used for logging.
ls -la /var/log/clickhouse-server/
df -Th
df -Thi
Q. How to configure logging in clickhouse?
16 - Precreate parts using clickhouse-local
Precreate parts using clickhouse-local
rm -rf /tmp/precreate_parts
mkdir -p /tmp/precreate_parts/metadata/local/
cd /tmp/precreate_parts
## 1. Imagine we want to process this file:
cat <<EOF > /tmp/data.csv
1,2020-01-01,"String"
2,2020-02-02,"Another string"
3,2020-03-03,"One more string"
4,2020-01-02,"String for first partition"
EOF
## 2. that is the metadata for the table we want to fill
## schema should match the schema of the table from server
## (the easiest way is just to copy it from the server)
## I've added sleepEachRow(0.5) here just to mimic slow insert
cat <<EOF > metadata/local/test.sql
ATTACH TABLE local.test (id UInt64, d Date, s String, x MATERIALIZED sleepEachRow(0.5)) Engine=MergeTree ORDER BY id PARTITION BY toYYYYMM(d);
EOF
## 3a. that is the metadata for the input file we want to read
## it should match the structure of source file
## use stdin to read from pipe
cat <<EOF > metadata/local/stdin.sql
ATTACH TABLE local.stdin (id UInt64, d Date, s String) Engine=File(CSV, stdin);
EOF
## 3b. Instead of stdin you can use file path
cat <<EOF > metadata/local/data_csv.sql
ATTACH TABLE local.data_csv (id UInt64, d Date, s String) Engine=File(CSV, '/tmp/data.csv');
EOF
## All preparations done,
## the rest is simple:
# option a (if 3a used) with pipe / reading stdin
cat /tmp/data.csv | clickhouse-local --query "INSERT INTO local.test SELECT * FROM local.stdin" -- --path=.
# option b (if 3b used) with filepath
cd /tmp/precreate_parts
clickhouse-local --query "INSERT INTO local.test SELECT * FROM local.data_csv" -- --path=.
# now you can check what was inserted (i did both options so i have doubled data)
clickhouse-local --query "SELECT _part,* FROM local.test ORDER BY id" -- --path=.
202001_4_4_0 1 2020-01-01 String
202001_1_1_0 1 2020-01-01 String
202002_5_5_0 2 2020-02-02 Another string
202002_2_2_0 2 2020-02-02 Another string
202003_6_6_0 3 2020-03-03 One more string
202003_3_3_0 3 2020-03-03 One more string
202001_4_4_0 4 2020-01-02 String for first partition
202001_1_1_0 4 2020-01-02 String for first partition
# But you can't do OPTIMIZE (local will die with coredump) :) That would be too good
# clickhouse-local --query "OPTIMIZE TABLE local.test FINAL" -- --path=.
## now you can upload those parts to a server (in detached subfolder) and attach them.
mfilimonov@laptop5591:/tmp/precreate_parts$ ls -la data/local/test/
total 40
drwxrwxr-x 9 mfilimonov mfilimonov 4096 paź 15 11:15 .
drwxrwxr-x 3 mfilimonov mfilimonov 4096 paź 15 11:15 ..
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202001_1_1_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202001_4_4_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202002_2_2_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202002_5_5_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202003_3_3_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 202003_6_6_0
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 detached
-rw-rw-r-- 1 mfilimonov mfilimonov 1 paź 15 11:15 format_version.txt
mfilimonov@laptop5591:/tmp/precreate_parts$ ls -la data/local/test/202001_1_1_0/
total 44
drwxrwxr-x 2 mfilimonov mfilimonov 4096 paź 15 11:15 .
drwxrwxr-x 9 mfilimonov mfilimonov 4096 paź 15 11:15 ..
-rw-rw-r-- 1 mfilimonov mfilimonov 250 paź 15 11:15 checksums.txt
-rw-rw-r-- 1 mfilimonov mfilimonov 79 paź 15 11:15 columns.txt
-rw-rw-r-- 1 mfilimonov mfilimonov 1 paź 15 11:15 count.txt
-rw-rw-r-- 1 mfilimonov mfilimonov 155 paź 15 11:15 data.bin
-rw-rw-r-- 1 mfilimonov mfilimonov 144 paź 15 11:15 data.mrk3
-rw-rw-r-- 1 mfilimonov mfilimonov 10 paź 15 11:15 default_compression_codec.txt
-rw-rw-r-- 1 mfilimonov mfilimonov 4 paź 15 11:15 minmax_d.idx
-rw-rw-r-- 1 mfilimonov mfilimonov 4 paź 15 11:15 partition.dat
-rw-rw-r-- 1 mfilimonov mfilimonov 16 paź 15 11:15 primary.idx
17 - Access Control and Account Management (RBAC)
Documentation https://clickhouse.com/docs/en/operations/access-rights/
Enable RBAC and create admin user
Create an admin user (like root in MySQL or postgres in PostgreSQL) to do the DBA/admin ops. Define it in the users.xml file and set the access_management property for the admin user:
<clickhouse>
<users>
<default>
....
</default>
<admin>
<!--
Password could be specified in plaintext or in SHA256 (in hex format).
If you want to specify password in plaintext (not recommended), place it in 'password' element.
Example: <password>qwerty</password>.
Password could be empty.
If you want to specify SHA256, place it in 'password_sha256_hex' element.
Example: <password_sha256_hex>65e84be33532fb784c48129675f9eff3a682b27168c0ea744b2cf58ee02337c5</password_sha256_hex>
Restrictions of SHA256: impossibility to connect to ClickHouse using MySQL JS client (as of July 2019).
If you want to specify double SHA1, place it in 'password_double_sha1_hex' element.
Example: <password_double_sha1_hex>e395796d6546b1b65db9d665cd43f0e858dd4303</password_double_sha1_hex>
-->
<password></password>
<networks>
<ip>::/0</ip>
</networks>
<!-- Settings profile for user. -->
<profile>default</profile>
<!-- Quota for user. -->
<quota>default</quota>
<!-- Set This parameter to Enable RBAC
Admin user can create other users and grant rights to them. -->
<access_management>1</access_management>
</admin>
...
</clickhouse>
default user
As the default user is used for many internal and background operations, it is not convenient to set it up with a password, because you would have to change it in many configs/parts. The best way to secure the default user is to only allow localhost or trusted network connections, like this in users.xml:
<clickhouse>
<users>
<default>
......
<networks>
<ip>127.0.0.1/8</ip>
<ip>10.10.10.0/24</ip>
</networks>
......
</default>
</clickhouse>
replication user
The replication user is usually default. There is no need to create this user unless you know what you're doing, because you need a specific user for replication. Ports 9009 and 9010 (TLS) provide low-level data access between servers. These ports should not be accessible from untrusted networks. You can specify credentials for authentication between replicas. This is required when interserver_https_port is accessible from untrusted networks. You can do so by creating a user with the default profile:
CREATE USER replication IDENTIFIED WITH sha256_password BY 'password' SETTINGS PROFILE 'default'
After this assign this user to the interserver credentials:
<interserver_http_credentials>
<user>replication</user>
<password>password</password>
</interserver_http_credentials>
We also can use sha256 passwords like this:
<password_sha256_hex>65e84be33532fb784c48129675f9eff3a682b27168c0ea744b2cf58ee02337c5</password_sha256_hex>
When the CREATE USER query is executed in clickhouse-client, it will echo the sha256 digest so you can copy it wherever you need.
Create users and roles
Now we can setup users/roles using a generic best-practice approach for RBAC from other databases, like using roles, granting permissions to roles, creating users for different applications, etc…
Example: 3 roles (dba, dashboard_ro, ingester_rw)
create role dba on cluster '{cluster}';
grant all on *.* to dba on cluster '{cluster}';
create user `user1` identified by 'pass1234' on cluster '{cluster}';
grant dba to user1 on cluster '{cluster}';
create role dashboard_ro on cluster '{cluster}';
grant select on default.* to dashboard_ro on cluster '{cluster}';
grant dictGet on *.* to dashboard_ro on cluster '{cluster}';
create settings profile or replace profile_dashboard_ro on cluster '{cluster}'
settings max_concurrent_queries_for_user = 10 READONLY,
max_threads = 16 READONLY,
max_memory_usage_for_user = '30G' READONLY,
max_memory_usage = '30G' READONLY,
max_execution_time = 60 READONLY,
max_rows_to_read = 1000000000 READONLY,
max_bytes_to_read = '5000G' READONLY
TO dashboard_ro;
create user `dash1` identified by 'pass1234' on cluster '{cluster}';
grant dashboard_ro to dash1 on cluster '{cluster}';
create role ingester_rw on cluster '{cluster}';
grant select,insert on default.* to ingester_rw on cluster '{cluster}';
create settings profile or replace profile_ingester_rw on cluster '{cluster}'
settings max_concurrent_queries_for_user = 40 READONLY, -- user can run 40 queries (select, insert ...) simultaneously
max_threads = 10 READONLY, -- each query can use up to 10 cpu (READONLY means user cannot override a value)
max_memory_usage_for_user = '30G' READONLY, -- all queries of the user can use up to 30G RAM
max_memory_usage = '25G' READONLY, -- each query can use up to 25G RAM
max_execution_time = 200 READONLY, -- each query can executes no longer 200 seconds
max_rows_to_read = 1000000000 READONLY, -- each query can read up to 1 billion rows
max_bytes_to_read = '5000G' READONLY -- each query can read up to 5 TB from a MergeTree
TO ingester_rw;
create user `ingester_app1` identified by 'pass1234' on cluster '{cluster}';
grant ingester_rw to ingester_app1 on cluster '{cluster}';
check
$ clickhouse-client -u dash1 --password pass1234
create table test ( A Int64) Engine=Log;
DB::Exception: dash1: Not enough privileges
$ clickhouse-client -u user1 --password pass1234
create table test ( A Int64) Engine=Log;
Ok.
drop table test;
Ok.
$ clickhouse-client -u ingester_app1 --password pass1234
select count() from system.numbers limit 1000000000000;
DB::Exception: Received from localhost:9000. DB::Exception: Limit for rows or bytes to read exceeded, max rows: 1.00 billion
clean up
show profiles;
┌─name─────────────────┐
│ default │
│ profile_dashboard_ro │
│ profile_ingester_rw │
│ readonly │
└──────────────────────┘
drop profile if exists readonly on cluster '{cluster}';
drop profile if exists profile_dashboard_ro on cluster '{cluster}';
drop profile if exists profile_ingester_rw on cluster '{cluster}';
show roles;
┌─name─────────┐
│ dashboard_ro │
│ dba │
│ ingester_rw │
└──────────────┘
drop role if exists dba on cluster '{cluster}';
drop role if exists dashboard_ro on cluster '{cluster}';
drop role if exists ingester_rw on cluster '{cluster}';
show users;
┌─name──────────┐
│ dash1 │
│ default │
│ ingester_app1 │
│ user1 │
└───────────────┘
drop user if exists ingester_app1 on cluster '{cluster}';
drop user if exists user1 on cluster '{cluster}';
drop user if exists dash1 on cluster '{cluster}';
18 - recovery-after-complete-data-loss
Atomic & Ordinary databases.
srv1 – good replica
srv2 – lost replica / we will restore it from srv1
test data (3 tables (atomic & ordinary databases))
srv1
create database testatomic on cluster '{cluster}' engine=Atomic;
create table testatomic.test on cluster '{cluster}' (A Int64, D Date, s String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
partition by toYYYYMM(D)
order by A;
insert into testatomic.test select number, today(), '' from numbers(1000000);
create database testordinary on cluster '{cluster}' engine=Ordinary;
create table testordinary.test on cluster '{cluster}' (A Int64, D Date, s String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
partition by toYYYYMM(D)
order by A;
insert into testordinary.test select number, today(), '' from numbers(1000000);
create table default.test on cluster '{cluster}' (A Int64, D Date, s String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
partition by toYYYYMM(D)
order by A;
insert into default.test select number, today(), '' from numbers(1000000);
destroy srv2
srv2
/etc/init.d/clickhouse-server stop
rm -rf /var/lib/clickhouse/*
generate script to re-create databases (create_database.sql).
srv1
$ cat /home/ubuntu/generate_schema.sql
SELECT concat('CREATE DATABASE "', name, '" ENGINE = ', engine, ' COMMENT \'', comment, '\';')
FROM system.databases
WHERE name NOT IN ('INFORMATION_SCHEMA', 'information_schema', 'system', 'default');
clickhouse-client < /home/ubuntu/generate_schema.sql > create_database.sql
check the result
$ cat create_database.sql
CREATE DATABASE "testatomic" ENGINE = Atomic COMMENT '';
CREATE DATABASE "testordinary" ENGINE = Ordinary COMMENT '';
transfer this create_database.sql to srv2 (scp / rsync)
make a copy of schema sql files (metadata_schema.tar)
srv1
cd /var/lib/clickhouse/
tar -cvhf /home/ubuntu/metadata_schema.tar metadata
The -h flag is important! (-h, --dereference: follow symlinks; archive and dump the files they point to.)
transfer this metadata_schema.tar to srv2 (scp / rsync)
create databases at srv2
srv2
/etc/init.d/clickhouse-server start
clickhouse-client < create_database.sql
/etc/init.d/clickhouse-server stop
create tables at srv2
srv2
cd /var/lib/clickhouse/
tar xkfv /home/ubuntu/metadata_schema.tar
sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data
/etc/init.d/clickhouse-server start
In tar xkfv, the -k flag is important! It keeps folders/symlinks created by CREATE DATABASE (-k, --keep-old-files: don't replace existing files when extracting).
check a recovery
srv2
SELECT count() FROM testatomic.test;
┌─count()─┐
│ 1000000 │
└─────────┘
SELECT count() FROM testordinary.test;
┌─count()─┐
│ 1000000 │
└─────────┘
SELECT count() FROM default.test;
┌─count()─┐
│ 1000000 │
└─────────┘
19 - Replication: Can not resolve host of another clickhouse server
Symptom
When configuring replication, the ClickHouse cluster nodes experience communication issues, and an error message appears in the log stating that the ClickHouse host cannot be resolved.
<Error> DNSResolver: Cannot resolve host (xxxxx), error 0: DNS error.
auto DB::StorageReplicatedMergeTree::processQueueEntry(ReplicatedMergeTreeQueue::SelectedEntryPtr)::(anonymous class)::operator()(DB::StorageReplicatedMergeTree::LogEntryPtr &) const: Code: 198. DB::Exception: Not found address of host: xxxx. (DNS_ERROR),
Cause:
The error message indicates that the hostname of one of the nodes of the cluster cannot be resolved by other cluster members, causing communication issues between the nodes.
Each node in the replication setup pushes its Fully Qualified Domain Name (FQDN) to Zookeeper, and if other nodes cannot access it using its FQDN, this can cause issues.
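A quick sanity check (not part of the original article): the FQDN() function returns the fully qualified name of the server, which should correspond to what the replica pushes to ZooKeeper, so you can verify it on each node directly from SQL.

SELECT hostName(), FQDN();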
Action:
There are two possible solutions to this problem:
- Change the FQDN to allow other nodes to access it. This solution can also help to keep the environment more organized. To do this, use the following command to edit the hostname file:
sudo vim /etc/hostname
Or use the following command to change the hostname:
sudo hostnamectl set-hostname ...
- Use the configuration parameter
<interserver_http_host>
to specify the IP address or hostname that the nodes can use to communicate with each other. This solution can have some issues, such as the one described in this link: https://github.com/ClickHouse/ClickHouse/issues/2154. To configure this parameter, refer to the documentation for more information: https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#interserver-http-host.
20 - source parts size is greater than the current maximum
Symptom
I see messages like: source parts size (...) is greater than the current maximum (...)
in the logs and/or inside system.replication_queue
Cause
Usually that means that there are already few big merges running. You can see the running merges using the query:
SELECT * FROM system.merges
That logic is needed to prevent picking a lot of huge merges simultaneously (otherwise they would take all available slots and ClickHouse would not be able to do smaller merges, which usually are important for keeping the number of parts stable).
Action
It is normal to see those messages on some stale replicas, and it should be resolved automatically after some time. So just wait and monitor the system.merges & system.replication_queue tables; it should resolve on its own.
If it happens often, or doesn't resolve on its own over a longer period of time, it could be caused by:
- increased insert pressure
- disk issues / high load (it works slowly, not enough space etc.)
- high CPU load (not enough CPU power to catch up with merges)
- issue with table schemas leading to high merges pressure (high / increased number of tables / partitions / etc.)
Start from checking dmesg / system journals / clickhouse monitoring to find the anomalies.
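A sketch of a query that can help here - grouping postponed replication tasks by reason (column names from system.replication_queue):

SELECT database, table, type, postpone_reason, count() AS tasks
FROM system.replication_queue
WHERE postpone_reason != ''
GROUP BY database, table, type, postpone_reason
ORDER BY tasks DESC;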
21 - Successful ClickHouse deployment plan
Successful ClickHouse deployment plan
Stage 0. Build POC
- Install single node clickhouse
- Start with creating a single table (the biggest one), use MergeTree engine. Create ‘some’ schema (most probably it will be far from optimal). Prefer denormalized approach for all immutable dimensions, for mutable dimensions - consider dictionaries.
- Load some amount of data (at least 5 GB, and 10 mln rows) - preferably the real one, or as close to real as possible. Usually the simplest options are either through CSV / TSV files (or
insert into clickhouse_table select * FROM mysql(...) where ...
) - Create several representative queries.
- Check the columns cardinality, and appropriate types, use minimal needed type
- Review the partition by and order by. https://kb.altinity.com/engines/mergetree-table-engine-family/pick-keys/
- Create the schema(s) with better/promising order by / partitioning, load data in. Pick the best schema.
- consider different improvements of particular columns (codecs / better data types etc.) https://kb.altinity.com/altinity-kb-schema-design/codecs/altinity-kb-how-to-test-different-compression-codecs/
- If the performance of certain queries is not enough - consider using PREWHERE / skipping indexes
- Repeat 2-9 for next big table(s). Avoid scenarios when you need to join big tables.
- Pick the client library for your programming language (the most mature are python / golang / java / c++), build some pipeline - for inserts (low QPS, lots of rows in a single insert, check acknowledgements & retry the same block on failures), ETLs if needed, some reporting layer (https://kb.altinity.com/altinity-kb-integrations/bi-tools/)
Stage 1. Planning the production setup
- Collect more data / estimate insert speed, estimate the column sizes per day / month.
- Measure the speed of queries
- Consider improvement using materialized views / projections / dictionaries.
- Collect requirements (ha / number of simultaneous queries / insert pressure / ’exactly once’ etc)
- Do a cluster sizing estimation, plan the hardware
- plan the network, if needed - consider using LoadBalancers etc.
- If you need sharding - consider different sharding approaches.
Stage 2. Preprod setup & development
- Install clickhouse in cluster - several nodes / VMs + zookeeper
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/cluster-production-configuration-guide/cluster-configuration-process/
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/altinity-kb-proper-setup/
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/install_ubuntu/
- Create good config & automate config / os / restarts (ansible / puppet etc)
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-settings-to-adjust/
- for docker: https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-clickhouse-in-docker/
- for k8, use clickhouse-operator OR https://kb.altinity.com/altinity-kb-kubernetes/altinity-kb-possible-issues-with-running-clickhouse-in-k8s/
- Set up monitoring / log processing / alerts etc.
- Set up users.
- Think of schema management. Deploy the schema.
- Design backup / failover strategies:
- Develop pipelines / queries, create test suite, CI/CD
- Do benchmark / stress tests
- Test configuration changes / server restarts / failovers / version upgrades
- Review the security topics (tls, limits / restrictions, network, passwords)
- Document the solution for operations
Stage 3. Production setup
- Deploy the production setup (consider also canary / blue-green deployments etc)
- Schedule ClickHouse upgrades every 6 to 12 months (if possible)
22 - sysall database (system tables on a cluster level)
Requirements
The idea is that you have a macros cluster
with cluster name.
For example you have a cluster named production
and this cluster includes all ClickHouse nodes.
$ cat /etc/clickhouse-server/config.d/clusters.xml
<?xml version="1.0" ?>
<yandex>
<remote_servers>
<production>
<shard>
...
And you need to have a macro cluster
set to production
:
cat /etc/clickhouse-server/config.d/macros.xml
<?xml version="1.0" ?>
<yandex>
<macros>
<cluster>production</cluster>
<replica>....</replica>
....
</macros>
</yandex>
Now you should be able to query all nodes using clusterAllReplicas
:
SELECT
hostName(),
FQDN(),
materialize(uptime()) AS uptime
FROM clusterAllReplicas('{cluster}', system.one)
SETTINGS skip_unavailable_shards = 1
┌─hostName()─┬─FQDN()──────────────┬──uptime─┐
│ chhost1 │ chhost1.localdomain │ 1071574 │
│ chhost2 │ chhost2.localdomain │ 1071517 │
└────────────┴─────────────────────┴─────────┘
skip_unavailable_shards
is necessary to query a system where some nodes are down.
Script to create DB objects
CREATE DATABASE sysall;
CREATE OR REPLACE VIEW sysall.cluster_state AS
SELECT
shard_num,
replica_num,
host_name,
host_address,
port,
errors_count,
uptime,
if(uptime > 0, 'UP', 'DOWN') AS node_state
FROM system.clusters
LEFT JOIN
(
SELECT
hostName() AS host_name,
FQDN() AS fqdn,
materialize(uptime()) AS uptime
FROM clusterAllReplicas('{cluster}', system.one)
) as hosts_info USING (host_name)
WHERE cluster = getMacro('cluster')
SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.asynchronous_metrics as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.asynchronous_metrics) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.query_log as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.query_log) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.dictionaries as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.dictionaries) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.replication_queue as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.replication_queue) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.replicas as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.replicas) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.merges as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.merges) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.mutations as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.mutations) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.parts as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.parts) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.detached_parts as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.detached_parts) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.disks as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.disks) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.distribution_queue as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.distribution_queue) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.databases as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.databases) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.events as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.events) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.metrics as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.metrics) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.macros as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.macros) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.tables as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.tables) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.clusters as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.clusters) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.columns as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.columns) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.processes as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.processes) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.errors as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.errors) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.settings as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.settings) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.parts_columns as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.parts_columns) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.zookeeper as select hostName() nodeHost, FQDN() nodeFQDN, *
from clusterAllReplicas('{cluster}', system.zookeeper) SETTINGS skip_unavailable_shards = 1;
Some examples
select * from sysall.cluster_state;
┌─shard_num─┬─replica_num─┬─host_name───────────┬─host_address─┬─port─┬─errors_count─┬──uptime─┬─node_state─┐
│ 1 │ 1 │ chhost1.localdomain │ 10.253.86.2 │ 9000 │ 0 │ 1071788 │ UP │
│ 2 │ 1 │ chhost2.localdomain │ 10.253.215.2 │ 9000 │ 0 │ 1071731 │ UP │
│ 3 │ 1 │ chhost3.localdomain │ 10.252.83.8 │ 9999 │ 0 │ 0 │ DOWN │
└───────────┴─────────────┴─────────────────────┴──────────────┴──────┴──────────────┴─────────┴────────────┘
SELECT
nodeFQDN,
path,
formatReadableSize(free_space) AS free,
formatReadableSize(total_space) AS total
FROM sysall.disks
┌─nodeFQDN────────────┬─path─────────────────┬─free───────┬─total──────┐
│ chhost1.localdomain │ /var/lib/clickhouse/ │ 511.04 GiB │ 937.54 GiB │
│ chhost2.localdomain │ /var/lib/clickhouse/ │ 495.77 GiB │ 937.54 GiB │
└─────────────────────┴──────────────────────┴────────────┴────────────┘
23 - Timeouts during OPTIMIZE FINAL
Timeout exceeded ...
or executing longer than distributed_ddl_task_timeout
during OPTIMIZE FINAL
A timeout may occur because the client reaches its timeout interval:
- in case of TCP / native clients - you can change send_timeout / receive_timeout + tcp_keep_alive_timeout + driver timeout settings
- in case of HTTP clients - you can change http_send_timeout / http_receive_timeout + tcp_keep_alive_timeout + driver timeout settings
In the case of ON CLUSTER queries, it may also occur because the timeout for query execution by shards is reached - see the setting distributed_ddl_task_timeout.
In the first case you additionally may get the misleading messages: Cancelling query. ... Query was cancelled.
In both cases, this does NOT stop the execution of the OPTIMIZE command. It continues to work even after
the client is disconnected. You can see the progress of that in system.processes
/ show processlist
/ system.merges
/ system.query_log
.
The same applies to queries like:
INSERT ... SELECT
CREATE TABLE ... AS SELECT
CREATE MATERIALIZED VIEW ... POPULATE ...
It is possible to run a query with some special query_id
and then poll the status from the processlist (in the case of a cluster, it can be a bit more complicated).
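For example (a sketch; the query_id value is hypothetical and is set on the client side, e.g. via clickhouse-client --query_id=...):

-- poll the status of a long-running OPTIMIZE started with a known query_id
SELECT query_id, elapsed, memory_usage
FROM system.processes
WHERE query_id = 'optimize-2024-01-01';

-- or check all replicas at once
SELECT hostName(), query_id, elapsed
FROM clusterAllReplicas('{cluster}', system.processes)
WHERE query_id = 'optimize-2024-01-01';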
See also
24 - Useful settings to turn on/Defaults that should be reconsidered
Useful settings to turn on/Defaults that should be reconsidered
Some settings that are not enabled by default.
Enables or disables complete dropping of data parts where all rows are expired in MergeTree tables.
When ttl_only_drop_parts is disabled (by default), the ClickHouse server only deletes expired rows according to their TTL.
When ttl_only_drop_parts is enabled, the ClickHouse server drops a whole part when all rows in it are expired.
Dropping whole parts instead of partial cleaning TTL-d rows allows having shorter merge_with_ttl_timeout times and lower impact on system performance.
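ttl_only_drop_parts is a MergeTree-level setting; a minimal sketch of enabling it on an existing table (the table name is hypothetical):

-- drop whole expired parts instead of rewriting them row by row
ALTER TABLE my_ttl_table MODIFY SETTING ttl_only_drop_parts = 1;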
You might not expect that during a JOIN missing columns will be filled with default values (instead of classic NULLs). This is controlled by the join_use_nulls setting:
Sets the type of JOIN behaviour. When merging tables, empty cells may appear. ClickHouse fills them differently based on this setting.
Possible values:
0 — The empty cells are filled with the default value of the corresponding field type. 1 — JOIN behaves the same way as in standard SQL. The type of the corresponding field is converted to Nullable, and empty cells are filled with NULL.
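A quick illustration of the difference (a sketch, not from the original article):

-- with the default join_use_nulls = 0 the missing value of v is '' (empty String)
SELECT * FROM (SELECT 1 AS k) AS l
LEFT JOIN (SELECT 2 AS k, 'x' AS v) AS r USING (k);

-- with join_use_nulls = 1 the missing value of v becomes NULL (Nullable(String))
SELECT * FROM (SELECT 1 AS k) AS l
LEFT JOIN (SELECT 2 AS k, 'x' AS v) AS r USING (k)
SETTINGS join_use_nulls = 1;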
Similarly, the default behaviour of aggregate functions on an empty set is not compatible with ANSI SQL (ClickHouse avoids Nullable types for performance reasons); see aggregate_functions_null_for_empty:
select sum(x), avg(x) from (select 1 x where 0);
┌─sum(x)─┬─avg(x)─┐
│ 0 │ nan │
└────────┴────────┘
set aggregate_functions_null_for_empty=1;
select sum(x), avg(x) from (select 1 x where 0);
┌─sumOrNull(x)─┬─avgOrNull(x)─┐
│ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
└──────────────┴──────────────┘
25 - Who ate my CPU
Merges
SELECT
table,
round((elapsed * (1 / progress)) - elapsed, 2) AS estimate,
elapsed,
progress,
is_mutation,
formatReadableSize(total_size_bytes_compressed) AS size,
formatReadableSize(memory_usage) AS mem
FROM system.merges
ORDER BY elapsed DESC
Mutations
SELECT
database,
table,
substr(command, 1, 30) AS command,
sum(parts_to_do) AS parts_to_do,
anyIf(latest_fail_reason, latest_fail_reason != '')
FROM system.mutations
WHERE NOT is_done
GROUP BY
database,
table,
command
Current Processes
select elapsed, query from system.processes where is_initial_query and elapsed > 2
Processes retrospectively
SELECT
normalizedQueryHash(query),
current_database,
sum(`ProfileEvents.Values`[indexOf(`ProfileEvents.Names`, 'UserTimeMicroseconds')]) AS userCPU,
count(),
avg(query_duration_ms) query_duration_ms,
any( substr(query, 1, 60) ) _query
FROM system.query_log
WHERE (type = 2) AND (event_date >= today())
GROUP BY
current_database,
normalizedQueryHash(query)
ORDER BY userCPU DESC
LIMIT 10;
26 - ZooKeeper session has expired
Q. I get “ZooKeeper session has expired” once. What should I do? Should I worry?
Getting exceptions or a lack of acknowledgement in a distributed system from time to time is a normal situation. Your client should do the retry. If that happened once and your client does retries correctly - nothing to worry about.
If it happens often, or with every retry - it may be a sign of some misconfiguration / issue in the cluster (see below).
Q. we see a lot of these: ZooKeeper session has expired. Switching to a new session
A. There is a single zookeeper session per server. But there are many threads that can use zookeeper simultaneously. So the same event (we lose the single zookeeper session we had), will be reported by all the threads/queries which were using that zookeeper session.
Usually after losing the zookeeper session that exception is printed by all the threads which watch zookeeper replication queues, and all the threads which had some in-flight zookeeper operations (for example inserts, ON CLUSTER
commands etc).
If you see a lot of those simultaneously - that just means you have a lot of threads talking to zookeeper simultaneously (or may be you have many replicated tables?).
BTW: every Replicated table comes with its own cost, so you can’t scale the number of replicated tables indefinitely.
Typically after several hundreds (sometimes thousands) of replicated tables, the clickhouse server becomes unusable: it can't do any other work but keep up with replication housekeeping tasks. The ‘ClickHouse-way’ is to have a few (maybe dozens) of very huge tables instead of thousands of tiny tables. (Side note: the number of not-replicated tables can be scaled much better).
So again: if during a short period of time you see a lot of those exceptions and they don't happen anymore for a while - nothing to worry about. Just ensure your client is doing retries properly.
Q. We are wondering what is causing that session to “timeout” as the default looks like 30 seconds, and there’s certainly stuff happening much more frequently than every 30 seconds.
Typically that has nothing to do with an expiration/timeout - even if you do nothing, there are heartbeat events in the zookeeper protocol.
So internally inside clickhouse:
- we have a ‘zookeeper client’ which in practice is a single zookeeper connection (TCP socket), with 2 threads - one serving reads, the second serving writes, and some API around.
- while everything is ok zookeeper client keeps a single logical ‘zookeeper session’ (also by sending heartbeats etc).
- we may have hundreds of ‘users’ of that zookeeper client - those are threads that do some housekeeping, serve queries etc.
- the zookeeper client normally has dozens of ‘in-flight’ requests (asked by different threads). And if something bad happens with that (disconnect, some issue with the zookeeper server, some other failure), the zookeeper client needs to re-establish the connection and switch to a new session, so all those ‘in-flight’ requests will be terminated with a ‘session expired’ exception.
Q. That problem happens very often (all the time, every X minutes / hours / days).
Sometimes the real issue can be visible somewhere close to the first ‘session expired’ exception in the log. (i.e. zookeeper client thread can know & print to logs the real reason, while all ‘user’ threads just get ‘session expired’).
Also zookeeper logs may often have a clue to what the real problem was.
Known issues which can lead to session termination by zookeeper:
- connectivity / network issues.
- jute.maxbuffer overrun. Happens if you need to pass too much data in a single zookeeper transaction (often when you do ALTER TABLE UPDATE or another mutation on a table with a big number of parts). The fix is adjusting the JVM setting: -Djute.maxbuffer=8388608. See https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/jvm-sizes-and-garbage-collector-settings/
- XID overflow. XID is a transaction counter in zookeeper; if you do too many transactions the counter reaches maxint32, and to restart the counter zookeeper closes all the connections. Usually that happens rarely and is not avoidable in zookeeper (in clickhouse-keeper that problem is solved). There are some corner cases / schemas which may end up with that XID overflow happening quite often (the worst case we saw was once per 3 weeks).
Q. “ZooKeeper session has expired” happens every time I try to start the mutation / do other ALTER on Replicated table.
During ALTERing a replicated table, ClickHouse needs to create a record in zookeeper listing all the parts which should be mutated (that usually means listing the names of all parts of the table). If the size of the list of parts exceeds the maximum buffer size - zookeeper drops the connection.
Part name length can differ between tables. On average, with the default jute.maxbuffer (1MB), mutations start to fail for tables which have more than 5000 parts.
Solutions:
- rethink partitioning, a high number of parts in a table is usually not recommended
- increase jute.maxbuffer on the zookeeper side to values about 8M
- use the IN PARTITION clause for mutations (where applicable, since 20.12) - see the sketch after this list
- switch to clickhouse-keeper
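A minimal sketch of the IN PARTITION approach (the table and partition id are hypothetical):

-- mutate one partition at a time so the part list stored in ZooKeeper stays small
ALTER TABLE my_table UPDATE col = col IN PARTITION ID '202401' WHERE 1;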
Related issues:
27 - Altinity packaging compatibility >21.x and earlier
Working with Altinity & Yandex packaging together
Since version 21.1 Altinity switched to the same packaging as used by Yandex. That is needed for syncing things and introduces several improvements (like adding a systemd service file).
Unfortunately, that change leads to compatibility issues - automatic dependency resolution gets confused by the conflicting package names: both when you update ClickHouse to the new version (from the one which uses the older packaging) and when you want to install older Altinity packages (20.8 and older).
Installing old clickhouse version (with old packaging schema)
When you try to install versions 20.8 or older from Altinity repo -
version=20.8.12.2-1.el7
yum install clickhouse-client-${version} clickhouse-server-${version}
yum outputs something like
yum install clickhouse-client-${version} clickhouse-server-${version}
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
* base: centos.hitme.net.pl
* extras: centos1.hti.pl
* updates: centos1.hti.pl
Altinity_clickhouse-altinity-stable/x86_64/signature | 833 B 00:00:00
Altinity_clickhouse-altinity-stable/x86_64/signature | 1.0 kB 00:00:01 !!!
Altinity_clickhouse-altinity-stable-source/signature | 833 B 00:00:00
Altinity_clickhouse-altinity-stable-source/signature | 951 B 00:00:00 !!!
Resolving Dependencies
--> Running transaction check
---> Package clickhouse-client.x86_64 0:20.8.12.2-1.el7 will be installed
---> Package clickhouse-server.x86_64 0:20.8.12.2-1.el7 will be installed
--> Processing Dependency: clickhouse-server-common = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
Package clickhouse-server-common is obsoleted by clickhouse-server, but obsoleting package does not provide for requirements
--> Processing Dependency: clickhouse-common-static = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
--> Running transaction check
---> Package clickhouse-common-static.x86_64 0:20.8.12.2-1.el7 will be installed
---> Package clickhouse-server.x86_64 0:20.8.12.2-1.el7 will be installed
--> Processing Dependency: clickhouse-server-common = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
Package clickhouse-server-common is obsoleted by clickhouse-server, but obsoleting package does not provide for requirements
--> Finished Dependency Resolution
Error: Package: clickhouse-server-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
Requires: clickhouse-server-common = 20.8.12.2-1.el7
Available: clickhouse-server-common-1.1.54370-2.x86_64 (clickhouse-stable)
clickhouse-server-common = 1.1.54370-2
Available: clickhouse-server-common-1.1.54378-2.x86_64 (clickhouse-stable)
clickhouse-server-common = 1.1.54378-2
...
Available: clickhouse-server-common-20.8.11.17-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
clickhouse-server-common = 20.8.11.17-1.el7
Available: clickhouse-server-common-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
clickhouse-server-common = 20.8.12.2-1.el7
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
As you can see yum has an issue with resolving clickhouse-server-common
dependency, which is marked as obsoleted by newer packages.
Solution with Old Packaging Scheme
add --setopt=obsoletes=0
flag to the yum call.
version=20.8.12.2-1.el7
yum install --setopt=obsoletes=0 clickhouse-client-${version} clickhouse-server-${version}
Alternatively, you can add obsoletes=0
into /etc/yum.conf
.
To update to new ClickHouse version (from old packaging schema to new packaging schema)
version=21.1.7.1-2
yum install clickhouse-client-${version} clickhouse-server-${version}
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
* base: centos.hitme.net.pl
* extras: centos1.hti.pl
* updates: centos1.hti.pl
Altinity_clickhouse-altinity-stable/x86_64/signature | 833 B 00:00:00
Altinity_clickhouse-altinity-stable/x86_64/signature | 1.0 kB 00:00:01 !!!
Altinity_clickhouse-altinity-stable-source/signature | 833 B 00:00:00
Altinity_clickhouse-altinity-stable-source/signature | 951 B 00:00:00 !!!
Nothing to do
It is caused by wrong dependency resolution.
Solution with New Package Scheme
To update to the latest available version - just add clickhouse-server-common
:
yum install clickhouse-client clickhouse-server clickhouse-server-common
This way the latest available version will be installed (even if you request some other version explicitly).
To install some specific version remove old packages first, then install new ones.
yum erase clickhouse-client clickhouse-server clickhouse-server-common clickhouse-common-static
version=21.1.7.1
yum install clickhouse-client-${version} clickhouse-server-${version}
Downgrade from new version to old one
version=20.8.12.2-1.el7
yum downgrade clickhouse-client-${version} clickhouse-server-${version}
will not work:
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
* base: ftp.agh.edu.pl
* extras: ftp.agh.edu.pl
* updates: centos.wielun.net
Resolving Dependencies
--> Running transaction check
---> Package clickhouse-client.x86_64 0:20.8.12.2-1.el7 will be a downgrade
---> Package clickhouse-client.noarch 0:21.1.7.1-2 will be erased
---> Package clickhouse-server.x86_64 0:20.8.12.2-1.el7 will be a downgrade
--> Processing Dependency: clickhouse-server-common = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
Package clickhouse-server-common-20.8.12.2-1.el7.x86_64 is obsoleted by clickhouse-server-21.1.7.1-2.noarch which is already installed
--> Processing Dependency: clickhouse-common-static = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
---> Package clickhouse-server.noarch 0:21.1.7.1-2 will be erased
--> Finished Dependency Resolution
Error: Package: clickhouse-server-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
Requires: clickhouse-common-static = 20.8.12.2-1.el7
Installed: clickhouse-common-static-21.1.7.1-2.x86_64 (@clickhouse-stable)
clickhouse-common-static = 21.1.7.1-2
Available: clickhouse-common-static-1.1.54378-2.x86_64 (clickhouse-stable)
clickhouse-common-static = 1.1.54378-2
Error: Package: clickhouse-server-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
...
Available: clickhouse-server-common-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
clickhouse-server-common = 20.8.12.2-1.el7
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
Solution With Downgrading
Remove packages first, then install older versions:
yum erase clickhouse-client clickhouse-server clickhouse-server-common clickhouse-common-static
version=20.8.12.2-1.el7
yum install --setopt=obsoletes=0 clickhouse-client-${version} clickhouse-server-${version}
28 - AWS EC2 Storage
EBS
The most natural choice for ClickHouse as fast storage, because it usually guarantees the best throughput, IOPS, and latency for a reasonable price.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html
General Purpose SSD volumes
Under usual conditions ClickHouse is limited by the throughput of volumes, and the amount of provisioned IOPS doesn't make any big difference for performance starting from a certain number. So the most natural choice for ClickHouse is gp3 and gp2 volumes.
EC2 instances also have an EBS throughput limit, which depends on the size of the EC2 instance. That means if you attach multiple volumes with high potential throughput, you will be limited by your EC2 instance, so usually there is no reason to have more than 1-3 GP3 volumes or 4-5 GP2 volumes per node.
It’s pretty straightforward to set up a ClickHouse for using multiple EBS volumes with jbod storage_policies.
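A sketch of attaching a table to such a policy (the policy name ebs_jbod is an assumption - it has to be defined in the server storage configuration first):

-- place the table on a multi-volume jbod policy defined in the server config
CREATE TABLE example_on_jbod
(
    id UInt64,
    d Date
)
ENGINE = MergeTree
ORDER BY id
SETTINGS storage_policy = 'ebs_jbod';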
Volume type | gp3 | gp2 |
---|---|---|
Max throughput per volume | 1000 MiB/s | 250 MiB/s |
Price | $0.08/GB-month; 3,000 IOPS free and $0.005/provisioned IOPS-month over 3,000; 125 MB/s free and $0.04/provisioned MB/s-month over 125 | $0.10/GB-month |
GP3
It's the recommended option, as it allows you to have only one volume for instances which have less than 10 Gbps EBS bandwidth (nodes with =<32 vCPU usually) and still have maximum performance. For bigger instances, it makes sense to look into the option of having several GP3 volumes.
It's a new type of volume, which is 20% cheaper than gp2 per GB-month and has lower free throughput: only 125 MiB/s vs 250 MiB/s. But you can buy additional throughput and IOPS for the volume. It also works better if most of your queries read only one or several parts, because in that case you are not limited by the performance of a single EBS disk, as a part can be located only on one disk at a time.
Because you need fewer GP3 volumes compared to the GP2 option, it's the suggested approach for now.
For best performance, it’s suggested to buy:
- 7000 IOPS
- Throughput up to the limit of your EC2 instance (1000 MiB/s is safe option)
GP2
GP2 volumes have a hard limit of 250 MiB/s per volume (for volumes bigger than 334 GB), so it usually makes sense to split one big volume into multiple smaller volumes larger than 334 GB in order to have the maximum possible throughput.
Throughput Optimized HDD volumes
ST1
Looks like a good candidate for cheap cold storage of old data, with a decent maximum throughput of 500 MiB/s. But it is achieved only for big volumes >5 TiB.
Throughput credits and burst performance
Provisioned IOPS SSD volumes
IO2 Block Express, IO2, IO1
In 99.99% of cases it doesn't give any benefit for ClickHouse compared to the GP3 option and performs worse, because maximum throughput is limited to 500 MiB/s per volume if you buy less than 32 000 IOPS, which is really expensive (compared to other options) and unneeded for ClickHouse. And if you have spare money, it's better to spend it on a better EC2 instance.
S3
The best option for cold data; it can give considerably good throughput and a really good price, but latencies and IOPS are much worse than the EBS option. Another interesting point: for an EC2 instance, the throughput limits for EBS and S3 are calculated separately, so if you access your data both from EBS and S3, you can get double throughput.
It's stated in the AWS documentation that S3 can fully utilize the network capacity of the EC2 instance (up to 100 Gb/s). Latency (first-byte-out) is estimated to be 100-200 milliseconds within a single region.
EFS
Works over NFSv4.1. We have clients which run their ClickHouse installations over NFS. It works considerably well as cold storage, so it's recommended to have EBS disks for hot data. A fast network is required.
ClickHouse doesn’t have any native option to reuse the same data on durable network disk via several replicas. You either need to store the same data twice or build custom tooling around ClickHouse and use it without Replicated*MergeTree tables.
FSx
Lustre
We have several clients who use Lustre (some of them use AWS FSx Lustre, others self-managed Lustre) without any big issues. A fast network is required. There were known problems with data damage on older versions caused by issues with O_DIRECT or async IO support on Lustre.
ClickHouse doesn’t have any native option to reuse the same data on durable network disk via several replicas. You either need to store the same data twice or build custom tooling around ClickHouse and use it without Replicated*MergeTree tables.
https://altinity.com/blog/2019/11/27/amplifying-clickhouse-capacity-with-multi-volume-storage-part-1
https://altinity.com/blog/2019/11/29/amplifying-clickhouse-capacity-with-multi-volume-storage-part-2
https://calculator.aws/#/createCalculator/EBS?nc2=h_ql_pr_calc
29 - ClickHouse in Docker
Do you have documentation on Docker deployments?
Check
- https://hub.docker.com/r/yandex/clickhouse-server/
- https://docs.altinity.com/clickhouseonkubernetes/
- sources of entry point - https://github.com/ClickHouse/ClickHouse/blob/master/docker/server/entrypoint.sh
Important things:
- use a concrete version tag (avoid using latest)
- if possible use --network=host (due to performance reasons)
- you need to mount the folder /var/lib/clickhouse to have persistency
- you MAY also mount the folder /var/log/clickhouse-server to have logs accessible outside of the container
- also, you may mount in some files or folders in the configuration folder, e.g. /etc/clickhouse-server/config.d/listen_ports.xml
- use --ulimit nofile=262144:262144
- you can also set some linux capabilities to enable some extra features of ClickHouse (not obligatory): SYS_PTRACE NET_ADMIN IPC_LOCK SYS_NICE
- you may also mount in the folder /docker-entrypoint-initdb.d/ - all SQL or bash scripts there will be executed during container startup
- if you use cgroup limits - it may misbehave https://github.com/ClickHouse/ClickHouse/issues/2261 (set up <max_server_memory_usage> manually)
- there are several ENV switches, see: https://github.com/ClickHouse/ClickHouse/blob/master/docker/server/entrypoint.sh
TLDR version: use it as a starting point:
docker run -d \
--name some-clickhouse-server \
--ulimit nofile=262144:262144 \
--volume=$(pwd)/data:/var/lib/clickhouse \
--volume=$(pwd)/logs:/var/log/clickhouse-server \
--volume=$(pwd)/configs/memory_adjustment.xml:/etc/clickhouse-server/config.d/memory_adjustment.xml \
--cap-add=SYS_NICE \
--cap-add=NET_ADMIN \
--cap-add=IPC_LOCK \
--cap-add=SYS_PTRACE \
--network=host \
yandex/clickhouse-server:21.1.7
docker exec -it some-clickhouse-server clickhouse-client
docker exec -it some-clickhouse-server bash
30 - ClickHouse Monitoring
ClickHouse Monitoring
Monitoring helps to track potential issues in your cluster before they cause a critical error.
What to read / watch on subject:
- Altinity webinar “ClickHouse Monitoring 101: What to monitor and how”. recording, slides
- docs https://clickhouse.com/docs/en/operations/monitoring/
What should be monitored
The following metrics should be collected / monitored
For Host Machine:
- CPU
- Memory
- Network (bytes/packets)
- Storage (iops)
- Disk Space (free / used)
For ClickHouse:
- Connections (count)
- RWLocks
- Read / Write / Return (bytes)
- Read / Write / Return (rows)
- Zookeeper operations (count)
- Absolute delay
- Query duration (optional)
- Replication parts and queue (count)
For Zookeeper:
Monitoring tools
Prometheus (embedded exporter) + Grafana
- Enable embedded exporter
- Grafana dashboards https://grafana.com/grafana/dashboards/14192 or https://grafana.com/grafana/dashboards/13500
Prometheus (embedded http handler with clickhouse-operator style metrics) + Grafana
- Enable http handler
- Useful, if you want to use clickhouse-operator dashboard, but do not run ClickHouse in k8s.
Prometheus (clickhouse-operator embedded exporter) + Grafana
- exporter is included in clickhouse-operator, and enabled automatically
- see instructions of Prometheus and Grafana installation (if you don’t have one)
- Grafana dashboard https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard
- Prometheus alerts https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml
Prometheus (clickhouse external exporter) + Grafana
(unmaintained)
Dashboards quering clickhouse directly via vertamedia / Altinity plugin
- Overview: https://grafana.com/grafana/dashboards/13606
- Queries dashboard (analyzing system.query_log) https://grafana.com/grafana/dashboards/2515
Dashboard quering clickhouse directly via Grafana plugin
Zabbix
- https://www.zabbix.com/integrations/clickhouse
- https://github.com/Altinity/clickhouse-zabbix-template
Graphite
- Use the embedded exporter. See docs and config.xml
InfluxDB
- You can use embedded exporter, plus Telegraf. For more information, see Graphite protocol support in InfluxDB.
Nagios/Icinga
Commercial solution
- Datadog https://docs.datadoghq.com/integrations/clickhouse/?tab=host
- Sematext https://sematext.com/docs/integration/clickhouse/
- Instana https://www.instana.com/supported-technologies/clickhouse-monitoring/
- site24x7 https://www.site24x7.com/plugins/clickhouse-monitoring.html
- Acceldata Pulse https://www.acceldata.io/blog/acceldata-pulse-for-clickhouse-monitoring
“Build your own” monitoring
ClickHouse allows access to a lot of internals using system tables. The main tables to access monitoring data are:
- system.metrics
- system.asynchronous_metrics
- system.events
Minimum necessary set of checks
Check Name | Shell or SQL command | Severity |
ClickHouse status | $ curl 'http://localhost:8123/' | Critical |
Too many simultaneous queries. Maximum: 100 (by default) | select value from system.metrics | Critical |
Replication status | $ curl 'http://localhost:8123/replicas_status' | High |
Read only replicas (reflected by replicas_status as well) | select value from system.metrics | High |
Some replication tasks are stuck | select count() | High |
ZooKeeper is available | select count() from system.zookeeper | Critical for writes |
ZooKeeper exceptions | select value from system.events | Medium |
Other CH nodes are available | $ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done | sort -u | High |
All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries) | for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done | Critical |
There are files in 'detached' folders | $ find /var/lib/clickhouse/data/*/*/detached/* -type d | wc -l; \ (19.8+) | Medium |
Too many parts: Number of parts is growing; Inserts are being delayed; Inserts are being rejected | select value from system.asynchronous_metrics | Critical |
Dictionaries: exception | select concat(name,': ',last_exception) | Medium |
ClickHouse has been restarted | select uptime(); | |
DistributedFilesToInsert should not be always increasing | select value from system.metrics | Medium |
A data part was lost | select value from system.events | High |
Data parts are not the same on different replicas | select value from system.events where event='DataAfterMergeDiffersFromReplica'; \ select value from system.events where event='DataAfterMutationDiffersFromReplica' | Medium |
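Several of the commands above are truncated in this printable view; as a sketch, fuller versions of a few of them may look like this (metric names come from the system tables, the thresholds are illustrative):

-- read-only replicas
SELECT value FROM system.metrics WHERE metric = 'ReadonlyReplica';

-- too many parts in a single partition
SELECT value FROM system.asynchronous_metrics WHERE metric = 'MaxPartCountForPartition';

-- replication tasks which look stuck
SELECT count()
FROM system.replication_queue
WHERE num_tries > 100 OR num_postponed > 1000;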
The following queries are recommended to be included in monitoring:
SELECT * FROM system.replicas
- For more information, see the ClickHouse guide on System Tables
SELECT * FROM system.merges
- Checks on the speed and progress of currently executed merges.
SELECT * FROM system.mutations
- This is the source of information on the speed and progress of currently executed mutations.
Logs monitoring
ClickHouse logs can be another important source of information. There are 2 logs enabled by default
- /var/log/clickhouse-server/clickhouse-server.err.log (error & warning, you may want to keep an eye on that or send it to some monitoring system)
- /var/log/clickhouse-server/clickhouse-server.log (trace logs, very detailed, useful for debugging, usually too verbose to monitor).
You can additionally enable the system.text_log table to have access to the logs from ClickHouse SQL queries (ensure that you will not expose information to users who should not see it).
$ cat /etc/clickhouse-server/config.d/text_log.xml
<yandex>
<text_log>
<database>system</database>
<table>text_log</table>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<level>warning</level>
</text_log>
</yandex>
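After enabling system.text_log, recent warnings and errors can be queried directly (a sketch):

SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE event_date = today() AND level IN ('Error', 'Warning')
ORDER BY event_time DESC
LIMIT 20;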
OpenTelemetry support
See https://clickhouse.com/docs/en/operations/opentelemetry/
Other sources
- https://tech.marksblogg.com/clickhouse-prometheus-grafana.html
- Key Metrics for Monitoring ClickHouse
- ClickHouse Monitoring Key Metrics to Monitor
- ClickHouse Monitoring Tools: Five Tools to Consider
- Monitoring ClickHouse
- Monitor ClickHouse with Datadog
- Unsorted notes on monitor and Alerts
- https://intl.cloud.tencent.com/document/product/1026/36887
- https://chowdera.com/2021/03/20210301161806704Y.html
- https://chowdera.com/2021/03/20210301160252465m.html#
31 - ClickHouse versions
ClickHouse versioning schema
Example:
21.3.10.1-lts
- 21 is the year of release.
- 3 indicates a Feature Release. This is an increment where features are delivered.
- 10 is the bugfix / maintenance version. When that version is incremented it means that some bugs were fixed compared to 21.3.9.
- 1 - build number, means nothing for end users.
- lts - type of release (long-term support).
What is Altinity Stable version?
It is one of the general / public versions of ClickHouse which has passed some extra testing, whose upgrade path and changelog were analyzed, whose known issues are documented, and which at least a few big companies use in production. All those things take some time, so usually that means that Altinity Stable is a bit behind the main releases.
The Altinity version is an option for conservative users, who prefer slightly older but better-known things.
Usually there is no reason to use a version older than Altinity Stable. If you see that a new Altinity Stable version has arrived and you still use some older version - you should definitely consider an upgrade.
Additionally, for Altinity clients we provide extra support for those versions for a longer time (and we also support newer versions).
Which version should I use?
We recommend the following approach:
- When you start using ClickHouse and before you go on production - pick the latest stable version.
- If you already have ClickHouse running on production:
- Check all new queries / schemas on staging first, especially if new ClickHouse features are used.
- Do minor (bugfix) upgrades regularly: monitor new maintenance releases of the feature release you use.
- When considering an upgrade, check the Altinity Stable release docs; if you want to use a newer release, analyze the changelog and known issues.
- Check the latest stable or test versions of ClickHouse on your staging environment regularly and pass the feedback to us or to the official ClickHouse github.
- Consider blue/green or canary upgrades.
See also: https://clickhouse.tech/docs/en/faq/operations/production/
How do I upgrade?
Warning
Check the upgrade / downgrade scenario on staging first.
- check if you need to adjust some settings or opt out of new features you don't need (this may be needed to keep the downgrade path possible, or to let two different versions work together).
- upgrade packages on the odd replicas.
- (if needed / depending on the use case) stop ingestion into the odd replicas / remove them from the load balancer, etc.
- restart the clickhouse-server service on the odd replicas.
- once the odd replicas are back online, repeat the same procedure on the even replicas.
In some upgrade scenarios (depending on which version you upgrade from and to), while different replicas run different ClickHouse versions you may see the following issues:
- replication doesn't work at all and delays grow.
- errors about 'checksum mismatch' and increased traffic between replicas (they need to resync merge results).
Both problems go away once all replicas are upgraded.
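For illustration, a minimal sketch of the per-replica upgrade step, assuming Debian/Ubuntu packages and 21.3.10.1 as the target version (adjust the package source and version to your setup):
# upgrade the packages on one replica
sudo apt-get update
sudo apt-get install clickhouse-common-static=21.3.10.1 clickhouse-server=21.3.10.1 clickhouse-client=21.3.10.1
# restart the server and verify the version
sudo systemctl restart clickhouse-server
clickhouse-client -q "SELECT version()"
# wait until replicated tables leave read-only mode and the replication queue drains
clickhouse-client -q "SELECT count() FROM system.replicas WHERE is_readonly OR queue_size > 0"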
Bugs?
ClickHouse development moves at a very fast pace, and the product already has thousands of features. The CI system runs tens of thousands of tests (including tests with different sanitizers) against every commit.
All core features are well-tested and very stable, and the code is high-quality. But as with any other software, bad things may happen. Most bugs happen in new, freshly added functionality and in complex combinations of several features (of course, all possible combinations of features physically can't be tested). Usually new features are adopted by the community and stabilize quickly.
What should I do if I found a bug in clickhouse?
- First of all: try to upgrade to the latest bugfix release. Example: if you use v21.3.5.42-lts but you know that v21.3.10.1-lts already exists, start by upgrading to that. Upgrades to the latest maintenance releases are smooth and safe.
- Look for similar issues in github. Maybe the fix is on the way.
- If you can reproduce the bug: try to isolate it - remove pieces of the query one by one / simplify the scenario while the issue still reproduces. This way you can figure out which part is responsible for the bug and create a minimal reproducible example.
- Once you have minimal reproducible example:
- report it to github (or to Altinity Support)
- check if it reproduces on newer clickhouse versions
32 - clickhouse-backup
Installation and configuration
Download the latest clickhouse-backup.tar.gz from the release assets at https://github.com/AlexAkulov/clickhouse-backup/releases.
The tar.gz contains a single clickhouse-backup binary and an example config file.
Backblaze has an S3-compatible API but requires an empty acl parameter (acl: "").
https://www.backblaze.com/ offers a 15-day trial with 10 GB of free S3-compatible storage.
$ mkdir clickhouse-backup
$ cd clickhouse-backup
$ wget https://github.com/AlexAkulov/clickhouse-backup/releases/download/2.2.0/clickhouse-backup.tar.gz
$ tar zxf clickhouse-backup.tar.gz
$ rm clickhouse-backup.tar.gz
$ cat config.yml
general:
remote_storage: s3
max_file_size: 1099511627776
disable_progress_bar: false
backups_to_keep_local: 0
backups_to_keep_remote: 0
log_level: info
allow_empty_backups: false
clickhouse:
username: default
password: ""
host: localhost
port: 9000
disk_mapping: {}
skip_tables:
- system.*
timeout: 5m
freeze_by_part: false
secure: false
skip_verify: false
sync_replicated_tables: true
log_sql_queries: false
s3:
access_key: 0****1
secret_key: K****1
bucket: "mybucket"
endpoint: https://s3.us-west-000.backblazeb2.com
region: us-west-000
acl: ""
force_path_style: false
path: clickhouse-backup
disable_ssl: false
part_size: 536870912
compression_level: 1
compression_format: tar
sse: ""
disable_cert_verification: false
storage_class: STANDARD
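Before creating backups it can be useful to verify that the config and the ClickHouse connection work, for example by listing the tables clickhouse-backup sees (the tables command is available in recent clickhouse-backup releases):
$ sudo ./clickhouse-backup tables -c config.yml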
I have a database test with a table test:
select count() from test.test;
┌─count()─┐
│ 400000 │
└─────────┘
clickhouse-backup list should work without errors (it scans local and remote (s3) folders):
$ sudo ./clickhouse-backup list -c config.yml
$
Backup
- create a local backup of database test
- upload this backup to remote
- remove the local backup
- drop the source database
$ sudo ./clickhouse-backup create --tables='test.*' bkp01 -c config.yml
2021/05/31 23:11:13 info done backup=bkp01 operation=create table=test.test
2021/05/31 23:11:13 info done backup=bkp01 operation=create
$ sudo ./clickhouse-backup upload bkp01 -c config.yml
1.44 MiB / 1.44 MiB [=====================] 100.00% 2s
2021/05/31 23:12:13 info done backup=bkp01 operation=upload table=test.test
2021/05/31 23:12:17 info done backup=bkp01 operation=upload
$ sudo ./clickhouse-backup list -c config.yml
bkp01 1.44MiB 31/05/2021 23:11:13 local
bkp01 1.44MiB 31/05/2021 23:11:13 remote tar
$ sudo ./clickhouse-backup delete local bkp01 -c config.yml
2021/05/31 23:13:29 info delete 'bkp01'
DROP DATABASE test;
Restore
- download the remote backup
- restore database
$ sudo ./clickhouse-backup list -c config.yml
bkp01 1.44MiB 31/05/2021 23:11:13 remote tar
$ sudo ./clickhouse-backup download bkp01 -c config.yml
2021/05/31 23:14:41 info done backup=bkp01 operation=download table=test.test
1.47 MiB / 1.47 MiB [=====================] 100.00% 0s
2021/05/31 23:14:43 info done backup=bkp01 operation=download table=test.test
2021/05/31 23:14:43 info done backup=bkp01 operation=download
$ sudo ./clickhouse-backup restore bkp01 -c config.yml
2021/05/31 23:16:04 info done backup=bkp01 operation=restore table=test.test
2021/05/31 23:16:04 info done backup=bkp01 operation=restore
SELECT count() FROM test.test;
┌─count()─┐
│ 400000 │
└─────────┘
Delete backups
$ sudo ./clickhouse-backup delete local bkp01 -c config.yml
2021/05/31 23:17:05 info delete 'bkp01'
$ sudo ./clickhouse-backup delete remote bkp01 -c config.yml
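For regular backups this can be wrapped into a small cron-driven script. A minimal sketch, assuming a clickhouse-backup version that supports the combined create_remote command and that remote retention is handled by backups_to_keep_remote in config.yml (all paths are examples):
#!/bin/bash
# e.g. /etc/cron.daily/clickhouse-backup-daily
set -e
BACKUP_NAME="daily-$(date +%Y-%m-%d)"
# create a local backup and upload it to remote storage in one step
/usr/local/bin/clickhouse-backup create_remote "$BACKUP_NAME" -c /etc/clickhouse-backup/config.yml
# drop the local copy; remote copies are rotated by backups_to_keep_remote
/usr/local/bin/clickhouse-backup delete local "$BACKUP_NAME" -c /etc/clickhouse-backup/config.yml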
33 - Converting MergeTree to Replicated
Options here are:
- Use INSERT INTO foo_replicated SELECT * FROM foo (suitable for small tables).
- Create a table aside and attach all partitions from the existing table, then drop the original table (uses hard links, doesn't require extra disk space): ALTER TABLE foo_replicated ATTACH PARTITION ID 'bar' FROM foo. You can easily auto-generate those commands using a query like: SELECT DISTINCT 'ALTER TABLE foo_replicated ATTACH PARTITION ID \'' || partition_id || '\' FROM foo;' from system.parts WHERE table = 'foo';
- Do it 'in place' using some file manipulation. See the procedure described here: https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replication/#converting-from-mergetree-to-replicatedmergetree
- Do a backup of MergeTree and recover it as ReplicatedMergeTree: https://github.com/AlexAkulov/clickhouse-backup/blob/master/Examples.md#how-to-convert-mergetree-to-replicatedmegretree
- An embedded command for this should be added in the future.
Example for option 2
Note: ATTACH PARTITION ID 'bar' FROM foo is practically free from a compute and disk space perspective. This feature utilizes filesystem hard links and the fact that files are immutable in ClickHouse (it's at the core of the ClickHouse design; filesystem hard links and such file manipulations are widely used).
create table foo( A Int64, D Date, S String )
Engine MergeTree
partition by toYYYYMM(D) order by A;
insert into foo select number, today(), '' from numbers(1e8);
insert into foo select number, today()-60, '' from numbers(1e8);
select count() from foo;
┌───count()─┐
│ 200000000 │
└───────────┘
create table foo_replicated as foo
Engine ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}/{shard}','{replica}')
partition by toYYYYMM(D) order by A;
SYSTEM STOP MERGES;
SELECT DISTINCT 'ALTER TABLE foo_replicated ATTACH PARTITION ID \'' || partition_id || '\' FROM foo;' from system.parts WHERE table = 'foo' AND active;
┌─concat('ALTER TABLE foo_replicated ATTACH PARTITION ID \'', partition_id, '\' FROM foo;')─┐
│ ALTER TABLE foo_replicated ATTACH PARTITION ID '202111' FROM foo; │
│ ALTER TABLE foo_replicated ATTACH PARTITION ID '202201' FROM foo; │
└───────────────────────────────────────────────────────────────────────────────────────────┘
clickhouse-client -q "SELECT DISTINCT 'ALTER TABLE foo_replicated ATTACH PARTITION ID \'' || partition_id || '\' FROM foo;' from system.parts WHERE table = 'foo' format TabSeparatedRaw" |clickhouse-client -mn
SYSTEM START MERGES;
SELECT count() FROM foo_replicated;
┌───count()─┐
│ 200000000 │
└───────────┘
rename table foo to foo_old, foo_replicated to foo;
-- you can drop foo_old any time later, it's kinda a cheap backup,
-- it cost nothing until you insert a lot of additional data into foo_replicated
34 - Data Migration
Export & Import into common data formats
Pros:
- Data can be inserted into any DBMS.
Cons:
- Decoding & encoding of common data formats may be slower / require more CPU
- The data size is usually bigger than ClickHouse formats.
- Some of the common data formats have limitations.
Info
The best approach is to use clickhouse-client: in that case, encoding/decoding of the format happens client-side, while the client and server speak the ClickHouse Native format (columnar & compressed).
In contrast, when you use the HTTP protocol, the server does the encoding/decoding and more data is transferred between the client and the server.
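A minimal sketch of exporting and re-importing through a common format with clickhouse-client (db.table and the file name are placeholders; the target table must already exist):
clickhouse-client -q "SELECT * FROM db.table FORMAT TSV" > table.tsv
clickhouse-client -q "INSERT INTO db.table FORMAT TSV" < table.tsv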
remote/remoteSecure or cluster/Distributed table
Pros:
- Simple to run.
- It’s possible to change the schema and distribution of data between shards.
- It’s possible to copy only some subset of data.
- Needs only access to ClickHouse TCP port.
Cons:
- Uses CPU / RAM (mostly on the receiver side)
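For example, on the destination server you could pull the data with something like this (hostname, port, credentials, and table names are placeholders):
INSERT INTO db.table
SELECT * FROM remoteSecure('source-host:9440', db, table, 'user', 'password');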
See details of both approaches in:
clickhouse-copier
Pros:
- Possible to do some changes in schema.
- Needs only access to ClickHouse TCP port.
- It’s possible to change the distribution of data between shards.
- Suitable for large clusters: many clickhouse-copier can execute the same task together.
Cons:
- May create an inconsistent result if source cluster data is changing during the process.
- Hard to setup.
- Requires zookeeper.
- Uses CPU / RAM (mostly on the clickhouse-copier and receiver side)
Info
Internally it works like a smart INSERT INTO cluster(…) SELECT * FROM ... with some consistency checks.
Info
Run clickhouse-copier on the same nodes as the receiving ClickHouse to avoid doubling the network load.
See details in:
Manual parts moving: freeze / rsync / attach
Pros:
- Low CPU / RAM usage.
Cons:
- Table schema should be the same.
- A lot of manual operations/scripting.
Info
With some additional care and scripting, it's possible to do cheap re-sharding at the parts level.
See details in:
clickhouse-backup
Pros:
- Low CPU / RAM usage.
- Suitable to recover both schema & data for all tables at once.
Cons:
- Table schema should be the same.
Just create the backup on server 1, upload it to server 2, and restore the backup.
See https://github.com/AlexAkulov/clickhouse-backup
https://altinity.com/blog/introduction-to-clickhouse-backups-and-clickhouse-backup
Fetch from zookeeper path
Pros:
- Low CPU / RAM usage.
Cons:
- Table schema should be the same.
- Works only when the source and the destination clickhouse servers share the same zookeeper (without chroot)
- Needs access to zookeeper and to the ClickHouse replication port (interserver_http_port or interserver_https_port).
ALTER TABLE table_name FETCH PARTITION partition_expr FROM 'path-in-zookeeper'
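The fetched partition lands in the detached directory of the table, so it needs to be attached afterwards. A sketch with placeholder table name, partition, and zookeeper path:
ALTER TABLE db.table FETCH PARTITION 202105 FROM '/clickhouse/tables/01/db/table';
ALTER TABLE db.table ATTACH PARTITION 202105;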
Replication protocol
Just make one more replica in another place.
Pros:
- Simple to setup
- Data is consistent all the time automatically.
- Low CPU and network usage.
Cons:
- Needs to reach both the zookeeper client port (2181) and the ClickHouse replication port (interserver_http_port or interserver_https_port).
- In case of cluster migration, zookeeper needs to be migrated too.
- Replication works both ways.
See: ZooKeeper cluster migration (../altinity-kb-zookeeper/altinity-kb-zookeeper-cluster-migration.md)
See also
Github issues
https://github.com/ClickHouse/ClickHouse/issues/10943 https://github.com/ClickHouse/ClickHouse/issues/20219 https://github.com/ClickHouse/ClickHouse/pull/17871
34.1 - MSSQL bcp pipe to clickhouse-client
How to pipe data from the bcp export tool for an MSSQL database
Prepare tables
LAPTOP.localdomain :) CREATE TABLE tbl(key UInt32) ENGINE=MergeTree ORDER BY key;
root@LAPTOP:/home/user# sqlcmd -U sa -P Password78
1> WITH t0(i) AS (SELECT 0 UNION ALL SELECT 0), t1(i) AS (SELECT 0 FROM t0 a, t0 b), t2(i) AS (SELECT 0 FROM t1 a, t1 b), t3(i) AS (SELECT 0 FROM t2 a, t2 b), t4(i) AS (SELECT 0 FROM t3 a, t3 b), t5(i) AS (SELECT 0 FROM t4 a, t3 b),n(i) AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) FROM t5) SELECT i INTO tbl FROM n WHERE i BETWEEN 1 AND 16777216
2> GO
(16777216 rows affected)
root@LAPTOP:/home/user# sqlcmd -U sa -P Password78 -Q "SELECT count(*) FROM tbl"
-----------
16777216
(1 rows affected)
Piping
root@LAPTOP:/home/user# mkfifo import_pipe
root@LAPTOP:/home/user# bcp "SELECT * FROM tbl" queryout import_pipe -t, -c -b 200000 -U sa -P Password78 -S localhost &
[1] 6038
root@LAPTOP:/home/user#
Starting copy...
1000 rows successfully bulk-copied to host-file. Total received: 1000
1000 rows successfully bulk-copied to host-file. Total received: 2000
1000 rows successfully bulk-copied to host-file. Total received: 3000
1000 rows successfully bulk-copied to host-file. Total received: 4000
1000 rows successfully bulk-copied to host-file. Total received: 5000
1000 rows successfully bulk-copied to host-file. Total received: 6000
1000 rows successfully bulk-copied to host-file. Total received: 7000
1000 rows successfully bulk-copied to host-file. Total received: 8000
1000 rows successfully bulk-copied to host-file. Total received: 9000
1000 rows successfully bulk-copied to host-file. Total received: 10000
1000 rows successfully bulk-copied to host-file. Total received: 11000
1000 rows successfully bulk-copied to host-file. Total received: 12000
1000 rows successfully bulk-copied to host-file. Total received: 13000
1000 rows successfully bulk-copied to host-file. Total received: 14000
1000 rows successfully bulk-copied to host-file. Total received: 15000
1000 rows successfully bulk-copied to host-file. Total received: 16000
1000 rows successfully bulk-copied to host-file. Total received: 17000
1000 rows successfully bulk-copied to host-file. Total received: 18000
1000 rows successfully bulk-copied to host-file. Total received: 19000
1000 rows successfully bulk-copied to host-file. Total received: 20000
1000 rows successfully bulk-copied to host-file. Total received: 21000
1000 rows successfully bulk-copied to host-file. Total received: 22000
1000 rows successfully bulk-copied to host-file. Total received: 23000
-- Enter
root@LAPTOP:/home/user# cat import_pipe | clickhouse-client --query "INSERT INTO tbl FORMAT CSV" &
...
1000 rows successfully bulk-copied to host-file. Total received: 16769000
1000 rows successfully bulk-copied to host-file. Total received: 16770000
1000 rows successfully bulk-copied to host-file. Total received: 16771000
1000 rows successfully bulk-copied to host-file. Total received: 16772000
1000 rows successfully bulk-copied to host-file. Total received: 16773000
1000 rows successfully bulk-copied to host-file. Total received: 16774000
1000 rows successfully bulk-copied to host-file. Total received: 16775000
1000 rows successfully bulk-copied to host-file. Total received: 16776000
1000 rows successfully bulk-copied to host-file. Total received: 16777000
16777216 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 11540 Average : (1453831.5 rows per sec.)
[1]- Done bcp "SELECT * FROM tbl" queryout import_pipe -t, -c -b 200000 -U sa -P Password78 -S localhost
[2]+ Done cat import_pipe | clickhouse-client --query "INSERT INTO tbl FORMAT CSV"
Another shell
root@LAPTOP:/home/user# for i in `seq 1 600`; do clickhouse-client -q "select count() from tbl";sleep 1; done
0
0
0
0
0
0
1048545
4194180
6291270
9436905
11533995
13631085
16777216
16777216
16777216
16777216
34.2 - clickhouse-copier
The description of the utility and its parameters, as well as examples of the config files that you need to create for the copier are in the official doc ClickHouse copier utility
The steps to run a task:
Create a config file for clickhouse-copier (zookeeper.xml)
Create a config file for the task (task1.xml)
Create the task in ZooKeeper and start an instance of clickhouse-copier
clickhouse-copier --daemon --base-dir=/opt/clickhouse-copier --config=/opt/clickhouse-copier/zookeeper.xml --task-path=/clickhouse/copier/task1 --task-file=/opt/clickhouse-copier/task1.xml
If the node in ZooKeeper already exists and you want to change it, you need to add the task-upload-force parameter:
clickhouse-copier --daemon --base-dir=/opt/clickhouse-copier --config=/opt/clickhouse-copier/zookeeper.xml --task-path=/clickhouse/copier/task1 --task-file=/opt/clickhouse-copier/task1.xml --task-upload-force=1
If you want to run another instance of clickhouse-copier for the same task, you need to copy the config file (zookeeper.xml) to another server, and run this command:
clickhouse-copier --daemon --base-dir=/opt/clickhouse-copier --config=/opt/clickhouse-copier/zookeeper.xml --task-path=/clickhouse/copier/task1
The number of simultaneously running instances is controlled by the max_workers parameter in your task configuration file. If you run more workers than that, the superfluous workers will sleep and log messages like this:
<Debug> ClusterCopier: Too many workers (1, maximum 1). Postpone processing
See also
- https://clickhouse.com/docs/en/operations/utilities/clickhouse-copier/
- Никита Михайлов. Кластер ClickHouse ctrl-с ctrl-v. HighLoad++ Весна 2021 slides
- 21.7 have a huge bulk of fixes / improvements. https://github.com/ClickHouse/ClickHouse/pull/23518
- https://altinity.com/blog/2018/8/22/clickhouse-copier-in-practice
- http://www.clickhouse.com.cn/topic/601fb322b06e5e0f21ba79e1
- https://github.com/getsentry/snuba/blob/master/docs/clickhouse-copier.md
- https://hughsite.com/post/clickhouse-copier-usage.html
- https://www.jianshu.com/p/c058edd664a6
34.2.1 - clickhouse-copier 20.3 and earlier
Clickhouse-copier was created to move data between clusters. It runs simple INSERT…SELECT queries and can copy data between tables with different engine parameters and between clusters with different numbers of shards. In the task configuration file you need to describe the layout of the source and the target cluster and list the tables that you need to copy. You can copy whole tables or specific partitions. Clickhouse-copier uses temporary distributed tables to select from the source cluster and insert into the target cluster.
The process is as follows
- Process the configuration files.
- Discover the list of partitions if not provided in the config.
- Copy partitions one by one.
- Drop the partition from the target table if it’s not empty
- Copy data from source shards one by one.
- Check if there is data for the partition on a source shard.
- Check the status of the task in ZooKeeper.
- Create target tables on all shards of the target cluster.
- Insert the partition of data into the target table.
- Mark the partition as completed in ZooKeeper.
If there are several workers running simultaneously, they will assign themselves to different source shards. If a worker was interrupted, another worker can be started to continue the task. The next worker will drop incomplete partitions and resume the copying.
Configuring the engine of the target table
Clickhouse-copier uses the engine from the task configuration file for these purposes:
- to create target tables if they don’t exist.
- PARTITION BY: to SELECT a partition of data from the source table, to DROP existing partitions from target tables.
Clickhouse-copier does not support the old MergeTree format. However, you can create the target tables manually and specify the engine in the task configuration file in the new format so that clickhouse-copier can parse it for its SELECT queries.
How to monitor the status of running tasks
Clickhouse-copier uses ZooKeeper to keep track of the progress and to communicate between workers. Here is a list of queries that you can use to see what’s happening.
--task-path /clickhouse/copier/task1
-- The task config
select * from system.zookeeper
where path='<task-path>'
name | ctime | mtime
----------------------------+---------------------+--------------------
description | 2019-10-18 15:40:00 | 2020-09-11 16:01:14
task_active_workers_version | 2019-10-18 16:00:09 | 2020-09-11 16:07:08
tables | 2019-10-18 16:00:25 | 2019-10-18 16:00:25
task_active_workers | 2019-10-18 16:00:09 | 2019-10-18 16:00:09
-- Running workers
select * from system.zookeeper
where path='<task-path>/task_active_workers'
-- The list of processed tables
select * from system.zookeeper
where path='<task-path>/tables'
-- The list of processed partitions
select * from system.zookeeper
where path='<task-path>/tables/<table>'
name | ctime
-------+--------------------
201909 | 2019-10-18 18:24:18
-- The status of a partition
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>'
name | ctime
-------------------------+--------------------
shards | 2019-10-18 18:24:18
partition_active_workers | 2019-10-18 18:24:18
-- The status of source shards
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>/shards'
name | ctime | mtime
-----+---------------------+--------------------
1 | 2019-10-18 22:37:48 | 2019-10-18 22:49:29
34.2.2 - clickhouse-copier 20.4 - 21.6
Clickhouse-copier was created to move data between clusters.
It runs simple INSERT…SELECT
queries and can copy data between tables with different engine parameters and between clusters with different numbers of shards.
In the task configuration file you need to describe the layout of the source and the target cluster, and list the tables that you need to copy. You can copy whole tables or specific partitions.
Clickhouse-copier uses temporary distributed tables to select from the source cluster and insert into the target cluster.
The behavior of clickhouse-copier was changed in 20.4:
- Now clickhouse-copier inserts data into intermediate tables, and after the insert finishes successfully it attaches the completed partition to the target table. This allows for incremental data copying, because the data in the target table stays intact during the process. Important note: ATTACH PARTITION respects the max_partition_size_to_drop limit. Make sure the max_partition_size_to_drop limit is big enough (or set to zero) in the destination cluster. If clickhouse-copier is unable to attach a partition because of the limit, it will proceed to the next partition, and it will drop the intermediate table when the task is finished (if the intermediate table is smaller than the max_table_size_to_drop limit). Another important note: ATTACH PARTITION is replicated, and the attached partition will need to be downloaded by the other replicas. This can create significant network traffic between ClickHouse nodes. If an attach takes a long time, clickhouse-copier will log a timeout and proceed to the next step.
- Now clickhouse-copier splits the source data into chunks and copies them one by one. This is useful for big source tables, where inserting one partition of data can take hours. If there is an error during the insert, clickhouse-copier has to drop the whole partition and start again. The number_of_splits parameter lets you split the data into chunks so that in case of an exception clickhouse-copier has to re-insert only one chunk of the data.
- Now clickhouse-copier runs OPTIMIZE target_table PARTITION ... DEDUPLICATE for non-Replicated MergeTree tables. Important note: this is a very strange feature that can do more harm than good. We recommend disabling it by configuring the engine of the target table as Replicated in the task configuration file, and creating the target tables manually if they are not supposed to be replicated. Intermediate tables are always created as plain MergeTree.
The process is as follows
- Process the configuration files.
- Discover the list of partitions if not provided in the config.
- Copy partitions one by one. The metadata in ZooKeeper suggests the order described here.
  - Copy chunks of data one by one.
    - Copy data from source shards one by one.
      - Create intermediate tables on all shards of the target cluster.
      - Check the status of the chunk in ZooKeeper.
      - Drop the partition from the intermediate table if the previous attempt was interrupted.
      - Insert the chunk of data into the intermediate tables.
      - Mark the shard as completed in ZooKeeper.
  - Attach the chunks of the completed partition into the target table one by one.
    - Attach a chunk into the target table.
    - non-Replicated: Run OPTIMIZE target_table DEDUPLICATE for the partition on the target table.
- Drop intermediate tables (may not succeed if the tables are bigger than max_table_size_to_drop).
If there are several workers running simultaneously, they will assign themselves to different source shards. If a worker was interrupted, another worker can be started to continue the task. The next worker will drop incomplete partitions and resume the copying.
Configuring the engine of the target table
Clickhouse-copier uses the engine from the task configuration file for these purposes:
- to create target and intermediate tables if they don’t exist.
- PARTITION BY: to SELECT a partition of data from the source table, to ATTACH partitions into target tables, to DROP incomplete partitions from intermediate tables, to OPTIMIZE partitions after they are attached to the target.
- ORDER BY: to SELECT a chunk of data from the source table.
Here is an example of SELECT that clickhouse-copier runs to get the sixth of ten chunks of data:
WHERE (<the PARTITION BY clause> = (<a value of the PARTITION BY expression> AS partition_key))
AND (cityHash64(<the ORDER BY clause>) % 10 = 6 )
Clickhouse-copier does not support the old MergeTree format. However, you can create the intermediate tables manually with the same engine as the target tables (otherwise ATTACH will not work), and specify the engine in the task configuration file in the new format so that clickhouse-copier can parse it for SELECT, ATTACH PARTITION and DROP PARTITION queries.
Important note: always configure engine as Replicated to disable OPTIMIZE … DEDUPLICATE (unless you know why you need clickhouse-copier to run OPTIMIZE … DEDUPLICATE).
How to configure the number of chunks
The default value for number_of_splits
is 10.
You can change this parameter in the table
section of the task configuration file. We recommend setting it to 1 for smaller tables.
<cluster_push>target_cluster</cluster_push>
<database_push>target_database</database_push>
<table_push>target_table</table_push>
<number_of_splits>1</number_of_splits>
<engine>Engine=Replicated...</engine>
How to monitor the status of running tasks
Clickhouse-copier uses ZooKeeper to keep track of the progress and to communicate between workers. Here is a list of queries that you can use to see what’s happening.
--task-path=/clickhouse/copier/task1
-- The task config
select * from system.zookeeper
where path='<task-path>'
name | ctime | mtime
----------------------------+---------------------+--------------------
description | 2021-03-22 13:15:48 | 2021-03-22 13:25:28
status | 2021-03-22 13:15:48 | 2021-03-22 13:25:28
task_active_workers_version | 2021-03-22 13:15:48 | 2021-03-22 20:32:09
tables | 2021-03-22 13:16:47 | 2021-03-22 13:16:47
task_active_workers | 2021-03-22 13:15:48 | 2021-03-22 13:15:48
-- Status
select * from system.zookeeper
where path='<task-path>/status'
-- Running workers
select * from system.zookeeper
where path='<task-path>/task_active_workers'
-- The list of processed tables
select * from system.zookeeper
where path='<task-path>/tables'
-- The list of processed partitions
select * from system.zookeeper
where path='<task-path>/tables/<table>'
name | ctime
-------+--------------------
202103 | 2021-03-22 13:16:47
202102 | 2021-03-22 13:18:31
202101 | 2021-03-22 13:27:36
202012 | 2021-03-22 14:05:08
-- The status of a partition
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>'
name | ctime
---------------+--------------------
piece_0 | 2021-03-22 13:18:31
attach_is_done | 2021-03-22 14:05:05
-- The status of a piece
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>/piece_N'
name | ctime
-------------------------------+--------------------
shards | 2021-03-22 13:18:31
is_dirty | 2021-03-22 13:26:51
partition_piece_active_workers | 2021-03-22 13:26:54
clean_start | 2021-03-22 13:26:54
-- The status of source shards
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>/piece_N/shards'
name | ctime | mtime
-----+---------------------+--------------------
1 | 2021-03-22 13:26:54 | 2021-03-22 14:05:05
34.2.3 - Kubernetes job for clickhouse-copier
ClickHouse-copier deployment in kubernetes
Clickhouse-copier can be deployed in a kubernetes environment to automate some simple backups or copy fresh data between clusters.
Some documentation to read:
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-data-migration/altinity-kb-clickhouse-copier/
- https://clickhouse.com/docs/en/operations/utilities/clickhouse-copier/
Deployment
Using a Kubernetes job is recommended, but a simple pod can be used if you only want to execute the copy once.
Just edit/change all the yaml
files to your needs.
1) Create the PVC:
First create a namespace in which all the pods and resources are going to be deployed
kubectl create namespace clickhouse-copier
Then create the PVC using the gp2-encrypted storageClass, or any other storageClass from other providers:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: copier-logs
namespace: clickhouse-copier
spec:
storageClassName: gp2-encrypted
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Mi
and deploy:
kubectl -n clickhouse-copier create -f ./kubernetes/copier-pvc.yaml
2) Create the configmap:
The configmap has both files zookeeper.xml
and task01.xml
with the zookeeper node listing and the parameters for the task respectively.
---
apiVersion: