MultiDisk (JBOD) Balancing
ClickHouse provides two options to balance an insert across disks in a volume with more than one disk: round_robin and least_used .
Round Robin (Default):
ClickHouse selects the next disk in a round robin manner to write a part.
This is the default setting and is most effective when parts created on insert are roughly the same size.
Drawbacks: may lead to disk skew
Least Used:
ClickHouse selects the disk with the most available space and writes to that disk.
Changing to least_used when even disk space consumption is desirable or when you have a JBOD volume with differing disk sizes. To prevent hot-spots, it is best to set this policy on a fresh volume or on a volume that has already been (re)balanced.
Drawbacks: may lead to hot-spots
Configurations
Configurations that can affect disk selected:
- storage policy volume configuration:
least_used_ttl_ms. Only applies toleast_usedpolicy, 60s default. - disk setting:
keep_free_space_bytes,keep_free_space_ratio
Configuration to assist rebalancing:
- The MergeTree setting
min_bytes_to_rebalance_partition_over_jboddoes not control where data is written during inserts. Instead, it governs how parts are redistributed across disks within the same volume during merge operations.
Note: setting
min_bytes_to_rebalance_partition_over_jboddoes not guarantee balanced partitions and balanced disk usage.
Example of least_used policy:
<clickhouse>
<storage_configuration>
<disks>
<default>
<path>/var/lib/clickhouse/</path>
<keep_free_space_bytes>10737418240</keep_free_space_bytes>
</disk1>
<disk1>
<path>/mnt/disk1/</path>
<keep_free_space_bytes>10737418240</keep_free_space_bytes>
</disk1>
<disk2>
<path>/mnt/disk2/</path>
<keep_free_space_bytes>10737418240</keep_free_space_bytes>
</disk2>
</disks>
<policies>
<hot>
<volumes>
<default>
<disk>disk1</disk>
<disk>disk2</disk>
<load_balancing>least_used</load_balancing>
<least_used_ttl_ms>60000</least_used_ttl_ms> <!-- 60s -->
</default>
</volumes>
</hot>
</policies>
</storage_configuration>
</clickhouse>
Manual Rebalancing Parts over JBOD Disks
Following query will select large parts in target_tables and target_databases that can be candidates to move to another disk. Disk chosen should comply with the following requirements:
- Should only select valid moves for the same storage_policy used by that table
- storage_policy must be JBODs type
- moves to other disks in the same volume
- select a different disk, i.e not the same disk as the one that part is in
- select the disk to move the part to by order of largest free_space on that disk
Set target_tables and target_databases based on requirements.
WITH
'%' AS target_tables,
'%' AS target_databases
SELECT sub.q FROM
(
SELECT
'ALTER TABLE ' || parts.database || '.' || parts.`table` || ' MOVE PART \'' || parts.name ||'\' TO DISK \'' || other_disk_candidate || '\';' as q,
parts.database as db,
parts.`table` as t,
parts.name as part_name,
parts.disk_name as part_disk_name,
parts.bytes_on_disk AS part_bytes_on_disk,
sp.storage_policy as part_storage_policy,
arrayJoin(arrayRemove(v.disks, parts.disk_name)) AS other_disk_candidate,
candidate_disks.free_space AS candidate_disk_free_space
FROM system.parts AS parts
INNER JOIN ( SELECT database, `table`, storage_policy FROM system.tables where (name LIKE target_tables) AND (database LIKE target_databases) group by 1, 2, 3 ) AS sp ON sp.`table` = parts.`table` AND sp.database = parts.database
INNER JOIN ( SELECT policy_name, volume_name, disks AS disks FROM system.storage_policies WHERE volume_type = 0 ) AS v ON sp.storage_policy = v.policy_name
INNER JOIN ( SELECT name, free_space FROM system.disks ORDER BY free_space DESC ) AS candidate_disks ON candidate_disks.name = other_disk_candidate
WHERE parts.active = 1
AND (parts.bytes_on_disk >= 10737418240) --10GB prioritize larger parts
AND (parts.`table` LIKE target_tables)
AND (parts.database LIKE target_databases)
AND candidate_disks.free_space > parts.bytes_on_disk*2 -- 2x buffer
ORDER BY parts.bytes_on_disk DESC, candidate_disk_free_space DESC
LIMIT 1 BY db, t, part_name
) as sub
FORMAT TSVRaw