Normally the upgrade procedure looks like that:
- pick the release to upgrade
- check the release notes/changelog between the release you use currently and the target release
- sometimes you may need to change some configuration settings to change the defaults (for better compatibility, etc)
- upgrade itself is simple:
- upgrade package (it doesn’t trigger the restart of clickhouse-server automatically)
- restart clickhouse-server
- check healthchecks / logs
- repeat on other nodes
- Mixing several versions working together in the same cluster may often lead to different degradations. Usually, it’s not recommended to have a big delay between upgrading different nodes on the same cluster. Usually, you do upgrade on the odd replicas first, and after they were back online - restart the even replicas.
- upgrade the dev / staging first
- ensure your schema/queries work properly on the staging env
- do the production upgrade.
1 - Vulnerabilities
2022-03-15: 7 vulnerabulities in ClickHouse were published.
See the details https://jfrog.com/blog/7-rce-and-dos-vulnerabilities-found-in-clickhouse-dbms/
Those vulnerabilities were fixed by 2 PRs:
All releases starting from v184.108.40.206 have that problem fixed.
Also, the fix was backported to 21.3 and 21.8 branches - versions v220.127.116.11-lts and v18.104.22.168-lts
accordingly have the problem fixed (and all newer releases in those branches).
The latest Altinity stable releases also contain the bugfix.
If you use some older version we recommend upgrading.
Before the upgrade - please ensure that ports 9000 and 8123 are not exposed to the internet, so external
clients who can try to exploit those vulnerabilities can not access your clickhouse node.
2 - Clickhouse Function/Engines/Settings Report
Report on ClickHouse functions, table functions, table engines, system and MergeTree settings, with availability information.
Follow this link for a complete report on ClickHouse features with their availability: https://github.com/anselmodadams/ChMisc/blob/main/report/report.md. It is frequently updated (at least once a month).
3 - Removing empty parts
Removing empty parts
Removing of empty parts is a new feature introduced in 20.12.
Earlier versions leave empty parts (with 0 rows) if TTL removes all rows from a part (https://github.com/ClickHouse/ClickHouse/issues/5491).
If you set up TTL for your data it is likely that there are quite many empty parts in your system.
The new version notices empty parts and tries to remove all of them immediately.
This is a one-time operation which runs right after an upgrade.
After that TTL will remove empty parts on its own.
There is a problem when different replicas of the same table start to remove empty parts at the same time. Because of the bug they can block each other (https://github.com/ClickHouse/ClickHouse/issues/23292).
What we can do to avoid this problem during an upgrade:
Drop empty partitions before upgrading to decrease the number of empty parts in the system.
SELECT concat('alter table ',database, '.', table, ' drop partition id ''', partition_id, ''';')
FROM system.parts WHERE active
GROUP BY database, table, partition_id
HAVING count() = countIf(rows=0)
Upgrade/restart one replica (in a shard) at a time.
If only one replica is cleaning empty parts there will be no deadlock because of replicas waiting for one another.
Restart one replica, wait for replication queue to process, then restart the next one.
Removing of empty parts can be disabled by adding
remove_empty_parts=0 to the default profile.
$ cat /etc/clickhouse-server/users.d/remove_empty_parts.xml
4 - Removing lost parts
Removing lost parts
There might be parts left in ZooKeeper that don’t exist on disk
The explanation is here https://github.com/ClickHouse/ClickHouse/pull/26716
The problem is introduced in 20.1.
The problem is fixed in 21.8 and backported to 21.3.16, 21.6.9, 21.7.6.
Regarding the procedure to reproduce the issue:
The procedure was not confirmed, but I think it should work.
Wait for a merge on a particular partition (or run an OPTIMIZE to trigger one)
At this point you can collect the names of parts participating in the merge from the system.merges table, or the system.parts table.
When the merge finishes, stop one of the replicas before the inactive parts are dropped (or detach the table).
Bring the replica back up (or attach the table).
Check that there are no inactive parts in system.parts, but they stayed in ZooKeeper.
Also check that the inactive parts got removed from ZooKeeper for another replica.
Here is the query to check ZooKeeper:
select name, ctime from system.zookeeper
and name like '<put an expression for the parts that were merged>'
- Drop the partition on the replica that DOES NOT have those extra parts in ZooKeeper.
Check the list of parts in ZooKeeper.
We hope that after this the parts on disk will be removed on all replicas, but one of the replicas will still have some parts left in ZooKeeper.
If this happens, then we think that after a restart of the replica with extra parts in ZooKeeper it will try to download them from another replica.
A query to find ‘forgotten’ parts
A query to drop empty partitions with failing replication tasks
select 'alter table '||database||'.'||table||' drop partition id '''||partition_id||''';'
select database, table, splitByChar('_',new_part_name) partition_id
where type='GET_PART' and not is_currently_executing and create_time < toStartOfDay(yesterday())
group by database, table, partition_id) q
(select database, table, partition_id, countIf(active) cnt_active, count() cnt_total
from system.parts group by database, table, partition_id
) p using database, table, partition_id