The Altinity Knowledge Base is licensed under Apache 2.0, and is available to all ClickHouse users. The information and code samples are available freely and distributed under the Apache 2.0 license.

For more detailed information about Altinity services support, see the following:

Altinity : Providers of Altinity.Cloud, providing SOC-2 certified support for ClickHouse.
Altinity.com Documentation : Detailed guides on working with:
Altinity Blog : Blog posts about ClickHouse the database and Altinity services.

The following sites are also useful references regarding ClickHouse:

ClickHouse.com documentation : Official documentation from ClickHouse Inc.
ClickHouse at Stackoverflow : Community driven responses to questions regarding ClickHouse
Google groups (Usenet) yes we remember it : The grandparent of all modern discussion boards.

1 - Engines

Learn about ClickHouse® engines, from MergeTree, Atomic Database to RocksDB.

Generally: the main engine in ClickHouse® is called MergeTree . It allows to store and process data on one server and feel all the advantages of ClickHouse. Basic usage of MergeTree does not require any special configuration, and you can start using it ‘out of the box’.

But one server and one copy of data are not fault-tolerant - something can happen with the server itself, with datacenter availability, etc. So you need to have the replica(s) - i.e. server(s) with the same data and which can ‘substitute’ the original server at any moment.

To have an extra copy (replica) of your data you need to use ReplicatedMergeTree engine. It can be used instead of MergeTree engine, and you can always upgrade from MergeTree to ReplicatedMergeTree (and downgrade back) if you need. To use that you need to have ZooKeeper installed and running. For tests, you can use one standalone Zookeeper instance, but for production usage, you should have zookeeper ensemble at least of 3 servers.

When you use ReplicatedMergeTree then the inserted data is copied automatically to all the replicas, but all the SELECTs are executed on the single server you have connected to. So you can have 5 replicas of your data, but if you will always connect to one replica - it will not ‘share’ / ‘balance’ that traffic automatically between all the replicas, one server will be loaded and the rest will generally do nothing. If you need that balancing of load between multiple replicas - you can use the internal ’loadbalancer’ mechanism which is provided by Distributed engine of ClickHouse. As an alternative in that scenario you can work without Distributed table , but with some external load balancer that will balance the requests between several replicas according to your specific rules or preferences, or just cluster-aware client which will pick one of the servers for the query time.

The Distributed engine does not store any data, but it can ‘point’ to the same ReplicatedMergeTree/MergeTree table on multiple servers. To use Distributed engine you need to configure <cluster> settings in your ClickHouse server config file.

So let’s say you have 3 replicas of table my_replicated_data with ReplicatedMergeTree engine. You can create a table with Distributed engine called my_distributed_replicated_data which will ‘point’ to all of that 3 servers, and when you will select from that my_distributed_replicated_data table the select will be forwarded and executed on one of the replicas. So in that scenario, each replica will get 1/3 of requests (but each request still will be fully executed on one chosen replica).

All that is great, and will work well while one copy of your data is fitting on a single physical server, and can be processed by the resources of one server. When you have too much data to be stored/processed on one server - you need to use sharding (it’s just a way to split the data into smaller parts). Sharding is the mechanism also provided by Distributed engine.

With sharding data is divided into parts (shards) according to some sharding key. You can just use random distribution, so let’s say - throw a coin to decide on each of the servers the data should be stored, or you can use some ‘smarter’ sharding scheme, to make the data connected to the same subject (let’s say to the same customer) stored on one server, and to another subject on another. So in that case all the shards should be requested at the same time and later the ‘common’ result should be calculated.

In ClickHouse each shard works independently and process its part of data, inside each shard replication can work. And later to query all the shards at the same time and combine the final result - Distributed engine is used. So Distributed work as load balancer inside each shard, and can combine the data coming from different shards together to make the ‘common’ result.

You can use Distributed table for inserts, in that case, it will pass the data to one of the shards according to the sharding key. Or you can insert to the underlying table on one of the shards bypassing the Distributed table.

Short summary

start with MergeTree
to have several copies of data use ReplicatedMergeTree
if your data is too big to fit/ to process on one server - use sharding
to balance the load between replicas and to combine the result of selects from different shards - use Distributed table .

Please check @alex-zaitsev presentation, which covers that subject: https://www.youtube.com/watch?v=zbjub8BQPyE ( Slides are here: https://yadi.sk/i/iLA5ssAv3NdYGy )

P.S. Actually you can create replication without Zookeeper and ReplicatedMergeTree, just by using the Distributed table above MergeTree and internal_replication=false cluster setting, but in that case, there will be no guarantee that all the replicas will have 100% the same data, so I rather would not recommend that scenario.

Based on my original answer on github: https://github.com/ClickHouse/ClickHouse/issues/2161

1.1 - ClickHouse® Atomic Database Engine

Capabilities of the Atomic database engine

In version 20.5, ClickHouse® first introduced database engine=Atomic.

Since version 20.10 it is a default database engine (before engine=Ordinary was used).

Those 2 database engine differs in a way how they store data on a filesystem, and engine Atomic allows to resolve some of the issues existed in engine=Ordinary.

engine=Atomic supports

non-blocking drop table / rename table
tables delete (&detach) async (wait for selects finish but invisible for new selects)
atomic drop table (all files / folders removed)
atomic table swap (table swap by “EXCHANGE TABLES t1 AND t2;”)
rename dictionary / rename database
unique automatic UUID paths in FS and ZK for Replicated

FAQ

Q. Data is not removed immediately

A. UseDROP TABLE t SYNC;

Or use parameter (user level) database_atomic_wait_for_drop_and_detach_synchronously:

SET database_atomic_wait_for_drop_and_detach_synchronously = 1;

Also, you can decrease the delay used by Atomic for real table drop (it’s 8 minutes by default)

cat /etc/clickhouse-server/config.d/database_atomic_delay_before_drop_table.xml
<clickhouse>
    <database_atomic_delay_before_drop_table_sec>1</database_atomic_delay_before_drop_table_sec>
</clickhouse>

Q. I cannot reuse zookeeper path after dropping the table.

A. This happens because real table deletion occurs with a controlled delay. See the previous question to remove the table immediately.

With engine=Atomic it’s possible (and is a good practice if you do it correctly) to include UUID into zookeeper path, i.e. :

CREATE ...
ON CLUSTER ...
ENGINE=ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}/', '{replica}')

It’s very important that the table will have the same UUID cluster-wide.

When the table is created using ON CLUSTER - all tables will get the same UUID automatically. When it needs to be done manually (for example - you need to add one more replica), pick CREATE TABLE statement with UUID from one of the existing replicas.

set show_table_uuid_in_table_create_qquery_if_not_nil=1　;
SHOW CREATE TABLE xxx; /* or SELECT create_table_query FROM system.tables WHERE ... */

Q. Should I use Atomic or Ordinary for new setups?

All things inside ClickHouse itself should work smoothly with Atomic.

But some external tools - backup tools, things involving other kinds of direct manipulations with ClickHouse files & folders may have issues with Atomic.

Ordinary layout on the filesystem is simpler. And the issues which address Atomic (lock-free renames, drops, atomic exchange of table) are not so critical in most cases.

	Ordinary	Atomic
filesystem layout	very simple	more complicated
external tool support (like `clickhouse-backup`)	good / mature	good / mature
some DDL queries (DROP / RENAME) may hang for a long time (waiting for some other things)	yes 👎	no 👍
Possibility to swap 2 tables	rename a to a_old, b to a, a_old to b; Operation is not atomic, and can break in the middle (while chances are low).	EXCHANGE TABLES t1 AND t2 Atomic, have no intermediate states.
uuid in zookeeper path	Not possible to use. The typical pattern is to add version suffix to zookeeper path when you need to create the new version of the same table.	You can use uuid in zookeeper paths. That requires some extra care when you expand the cluster, and makes zookeeper paths harder to map to real table. But allows to to do any kind of manipulations on tables (rename, recreate with same name etc).
Materialized view without TO syntax (!we recommend using TO syntax always!)	.inner.mv_name The name is predictable, easy to match with MV.	.inner_id.{uuid} The name is unpredictable, hard to match with MV (maybe problematic for MV chains, and similar scenarios)

Using Ordinary by default instead of Atomic

---
title: "cat /etc/clickhouse-server/users.d/disable_atomic_database.xml "
linkTitle: "cat /etc/clickhouse-server/users.d/disable_atomic_database.xml "
description: >
    cat /etc/clickhouse-server/users.d/disable_atomic_database.xml
---
<?xml version="1.0"?>
<clickhouse>
    <profiles>
        <default>
            <default_database_engine>Ordinary</default_database_engine>
        </default>
    </profiles>
</clickhouse>

Other sources

Presentation https://youtu.be/1LVJ_WcLgF8?t=2744

https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup46/database_engines.pdf

1.1.1 - How to Convert Ordinary to Atomic

New, official way

Implemented automatic conversion of database engine from Ordinary to Atomic (ClickHouse® Server 22.8+). Create empty convert_ordinary_to_atomic file in flags directory and all Ordinary databases will be converted automatically on next server start.
The conversion is not automatic between upgrades, you need to set the flag as explained below:

Warnings:
 * Server has databases (for example `test`) with Ordinary engine, which was deprecated. To convert this database to the new Atomic engine, create a flag /var/lib/clickhouse/flags/convert_ordinary_to_atomic and make sure that ClickHouse has write permission for it.
Example: sudo touch '/var/lib/clickhouse/flags/convert_ordinary_to_atomic' && sudo chmod 666 '/var/lib/clickhouse/flags/convert_ordinary_to_atomic'

Resolves #39546 . #39933 (Alexander Tokmakov )
There can be some problems if the default database is Ordinary and fails for some reason. You can add:

<clickhouse>
     <allow_reserved_database_name_tmp_convert>1</allow_reserved_database_name_tmp_convert>
</clickhouse>

More detailed info here

Don’t forget to remove detached parts from all Ordinary databases, or you can get the error:

│ 2025.01.28 11:34:57.510330 [ 7 ] {} <Error> Application: Code: 219. DB::Exception: Cannot drop: filesystem error: in remove: Directory not empty ["/var/lib/clickhouse/data/db/"]. Probably data │
│ base contain some detached tables or metadata leftovers from Ordinary engine. If you want to remove all data anyway, try to attach database back and drop it again with enabled force_remove_data_recursively_ │

1.1.2 - How to Convert Atomic to Ordinary

How to Convert Atomic to Ordinary

The following instructions are an example on how to convert a database with the Engine type Atomic to a database with the Engine type Ordinary.

Warning

That can be used only for simple schemas. Schemas with MATERIALIZED views will require extra manipulations.

CREATE DATABASE atomic_db ENGINE = Atomic;
CREATE DATABASE ordinary_db ENGINE = Ordinary;
CREATE TABLE atomic_db.x ENGINE = MergeTree ORDER BY tuple() AS system.numbers;
INSERT INTO atomic_db.x SELECT number FROM numbers(100000);
RENAME TABLE atomic_db.x TO ordinary_db.x;

ls -1 /var/lib/clickhouse/data/ordinary_db/x
all_1_1_0
detached
format_version.txt

DROP DATABASE atomic_db;
DETACH DATABASE ordinary_db;

mv /var/lib/clickhouse/metadata/ordinary_db.sql /var/lib/clickhouse/metadata/atomic_db.sql
vi /var/lib/clickhouse/metadata/atomic_db.sql
mv /var/lib/clickhouse/metadata/ordinary_db /var/lib/clickhouse/metadata/atomic_db
mv /var/lib/clickhouse/data/ordinary_db /var/lib/clickhouse/data/atomic_db

ATTACH DATABASE atomic_db;
SELECT count() FROM atomic_db.x
┌─count()─┐
│  100000 │
└─────────┘
SHOW CREATE DATABASE atomic_db
┌─statement──────────────────────────────────┐
│ CREATE DATABASE atomic_db
ENGINE = Ordinary │
└────────────────────────────────────────────┘

Schemas with Materialized VIEW

DROP DATABASE IF EXISTS atomic_db;
DROP DATABASE IF EXISTS ordinary_db;

CREATE DATABASE atomic_db engine=Atomic;
CREATE DATABASE ordinary_db engine=Ordinary;

CREATE TABLE atomic_db.x ENGINE = MergeTree ORDER BY tuple() AS system.numbers;
CREATE MATERIALIZED VIEW atomic_db.x_mv ENGINE = MergeTree ORDER BY tuple() AS SELECT * FROM atomic_db.x;
CREATE MATERIALIZED VIEW atomic_db.y_mv ENGINE = MergeTree ORDER BY tuple() AS SELECT * FROM atomic_db.x;
CREATE TABLE atomic_db.z ENGINE = MergeTree ORDER BY tuple() AS system.numbers;
CREATE MATERIALIZED VIEW atomic_db.z_mv TO atomic_db.z AS SELECT * FROM atomic_db.x;

INSERT INTO atomic_db.x SELECT * FROM numbers(100);

--- USE atomic_db;
---
--- Query id: 28af886d-a339-4e9c-979c-8bdcfb32fd95
---
--- ┌─name───────────────────────────────────────────┐
--- │ .inner_id.b7906fec-f4b2-455b-bf9b-2b18ca64842c │
--- │ .inner_id.bd32d79b-272d-4710-b5ad-bca78d09782f │
--- │ x                                              │
--- │ x_mv                                           │
--- │ y_mv                                           │
--- │ z                                              │
--- │ z_mv                                           │
--- └────────────────────────────────────────────────┘


SELECT mv_storage.database, mv_storage.name, mv.database, mv.name
FROM system.tables AS mv_storage
LEFT JOIN system.tables AS mv ON substring(mv_storage.name, 11) = toString(mv.uuid)
WHERE mv_storage.name LIKE '.inner_id.%' AND mv_storage.database = 'atomic_db';

-- ┌─database──┬─name───────────────────────────────────────────┬─mv.database─┬─mv.name─┐
-- │ atomic_db │ .inner_id.81e1a67d-3d02-4b2a-be17-84d8626d2328 │ atomic_db   │ y_mv    │
-- │ atomic_db │ .inner_id.e428225c-982a-4859-919b-ba5026db101d │ atomic_db   │ x_mv    │
-- └───────────┴────────────────────────────────────────────────┴─────────────┴─────────┘




/* STEP 1: prepare rename statements, also to rename implicit mv storage table to explicit one */

SELECT
if(
   t.name LIKE '.inner_id.%',
  'RENAME TABLE `' || t.database || '`.`' ||  t.name || '` TO `ordinary_db`.`' || mv.name || '_storage`;',
   'RENAME TABLE `' || t.database || '`.`' ||  t.name || '` TO `ordinary_db`.`' || t.name || '`;'
)
FROM system.tables as t
LEFT JOIN system.tables mv ON (substring(t.name,11) = toString(mv.uuid) AND t.database =  mv.database )
WHERE t.database = 'atomic_db' AND t.engine <> 'MaterializedView'
FORMAT TSVRaw;

-- RENAME TABLE `atomic_db`.`.inner_id.b7906fec-f4b2-455b-bf9b-2b18ca64842c` TO `ordinary_db`.`y_mv_storage`;
-- RENAME TABLE `atomic_db`.`.inner_id.bd32d79b-272d-4710-b5ad-bca78d09782f` TO `ordinary_db`.`x_mv_storage`;
-- RENAME TABLE `atomic_db`.`x` TO `ordinary_db`.`x`;
-- RENAME TABLE `atomic_db`.`z` TO `ordinary_db`.`z`;


/* STEP 2: prepare statements to reattach MV */
-- Can be done manually: pick existing MV definition (SHOW CREATE TABLE), and change it in the following way:
-- 1) add TO keyword 2) remove column names and engine settings after mv name


SELECT
if(
   t.name LIKE '.inner_id.%',
   replaceRegexpOne(mv.create_table_query, '^CREATE MATERIALIZED VIEW ([^ ]+) (.*? AS ', 'CREATE MATERIALIZED VIEW \\1 TO \\1_storage AS '),
   mv.create_table_query
)
FROM system.tables as mv
LEFT JOIN system.tables t ON (substring(t.name,11) = toString(mv.uuid) AND t.database =  mv.database)
WHERE mv.database = 'atomic_db' AND mv.engine='MaterializedView'
FORMAT TSVRaw;

-- CREATE MATERIALIZED VIEW atomic_db.x_mv TO atomic_db.x_mv_storage AS SELECT * FROM atomic_db.x
-- CREATE MATERIALIZED VIEW atomic_db.y_mv TO atomic_db.y_mv_storage AS SELECT * FROM atomic_db.x

/* STEP 3: stop inserts, fire renames statements prepared at the step 1 (hint: use clickhouse-client -mn) */

RENAME ...

/* STEP 4: ensure that only MaterializedView left in source db, and drop it.  */

SELECT * FROM system.tables WHERE database = 'atomic_db' and engine <> 'MaterializedView';
DROP DATABASE atomic_db;


/* STEP 4. rename table to old name: */

DETACH DATABASE ordinary_db;

-- rename files / folders:

mv /var/lib/clickhouse/metadata/ordinary_db.sql /var/lib/clickhouse/metadata/atomic_db.sql
vi /var/lib/clickhouse/metadata/atomic_db.sql
mv /var/lib/clickhouse/metadata/ordinary_db /var/lib/clickhouse/metadata/atomic_db
mv /var/lib/clickhouse/data/ordinary_db /var/lib/clickhouse/data/atomic_db

-- attach database atomic_db;

ATTACH DATABASE atomic_db;

/* STEP 5. restore MV using statements created on STEP 2 */

1.2 - EmbeddedRocksDB & dictionary

EmbeddedRocksDB & dictionary

RocksDB is faster than MergeTree on Key/Value queries because MergeTree primary key index is sparse. Probably it’s possible to speedup MergeTree by reducing index_granularity.

NVMe disk is used for the tests.

The main feature of RocksDB is instant updates. You can update a row instantly (microseconds):

select * from rocksDB where A=15645646;
┌────────A─┬─B────────────────────┐
│ 15645646 │ 12517841379565221195 │
└──────────┴──────────────────────┘
1 rows in set. Elapsed: 0.001 sec.

insert into rocksDB values (15645646, 'xxxx');
1 rows in set. Elapsed: 0.001 sec.

select * from rocksDB where A=15645646;
┌────────A─┬─B────┐
│ 15645646 │ xxxx │
└──────────┴──────┘
1 rows in set. Elapsed: 0.001 sec.

Let’s load 100 millions rows:

create table rocksDB(A UInt64, B String, primary key A) Engine=EmbeddedRocksDB();
insert into rocksDB select number, toString(cityHash64(number))
from numbers(100000000);

-- 0 rows in set. Elapsed: 154.559 sec. Processed 100.66 million rows, 805.28 MB (651.27 thousand rows/s., 5.21 MB/s.)
-- Size on disk: 1.5GB

create table mergeTreeDB(A UInt64, B String) Engine=MergeTree() order by A;
insert into mergeTreeDB select number, toString(cityHash64(number))
from numbers(100000000);

Size on disk: 973MB

CREATE DICTIONARY test_rocksDB(A UInt64,B String)
PRIMARY KEY A
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 TABLE rocksDB DB 'default'
         USER 'default'))
LAYOUT(DIRECT());

CREATE DICTIONARY test_mergeTreeDB(A UInt64,B String)
PRIMARY KEY A
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 TABLE mergeTreeDB DB 'default'
         USER 'default'))
LAYOUT(DIRECT());

Direct queries to tables to request 10000 rows by a random key

select count() from (
select * from rocksDB where A in (select toUInt64(rand64()%100000000)
 from numbers(10000)))
Elapsed: 0.076 sec. Processed 10.00 thousand rows

select count() from (
select * from mergeTreeDB where A in (select toUInt64(rand64()%100000000)
  from numbers(10000)))
Elapsed: 0.202 sec. Processed 55.95 million rows

RocksDB as expected is much faster: 0.076 sec. VS 0.202 sec.

RocksDB processes less rows: 10.00 thousand rows VS 55.95 million rows

dictGet – 100.00 thousand random rows

select count() from (
   select dictGet( 'default.test_rocksDB', 'B', toUInt64(rand64()%100000000) )
   from numbers_mt(100000))
Elapsed: 0.786 sec. Processed 100.00 thousand rows

select count() from (
   select dictGet( 'default.test_mergeTreeDB', 'B', toUInt64(rand64()%100000000) )
   from numbers_mt(100000))
Elapsed: 3.160 sec. Processed 100.00 thousand rows

dictGet – 1million random rows

select count() from (
   select dictGet( 'default.test_rocksDB', 'B', toUInt64(rand64()%100000000) )
   from numbers_mt(1000000))
Elapsed: 5.643 sec. Processed 1.00 million rows

select count() from (
   select dictGet( 'default.test_mergeTreeDB', 'B', toUInt64(rand64()%100000000) )
   from numbers_mt(1000000))
Elapsed: 31.111 sec. Processed 1.00 million rows

dictGet – 1million random rows from Hashed

CREATE DICTIONARY test_mergeTreeDBHashed(A UInt64,B String)
PRIMARY KEY A
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 TABLE mergeTreeDB DB 'default'
         USER 'default'))
LAYOUT(Hashed())
LIFETIME(0);

0 rows in set. Elapsed: 46.564 sec.

┌─name───────────────────┬─type───┬─status─┬─element_count─┬─RAM──────┐
│ test_mergeTreeDBHashed │ Hashed │ LOADED │     100000000 │ 7.87 GiB │
└────────────────────────┴────────┴────────┴───────────────┴──────────┘

select count() from (
   select dictGet( 'default.test_mergeTreeDBHashed', 'B', toUInt64(rand64()%100000000) )
   from numbers_mt(1000000))
Elapsed: 0.079 sec. Processed 1.00 million rows

dictGet – 1million random rows from SparseHashed

CREATE DICTIONARY test_mergeTreeDBSparseHashed(A UInt64,B String)
PRIMARY KEY A
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 TABLE mergeTreeDB DB 'default'
         USER 'default'))
LAYOUT(SPARSE_HASHED())
LIFETIME(0);
0 rows in set. Elapsed: 81.404 sec.

┌─name─────────────────────────┬─type─────────┬─status─┬─element_count─┬─RAM──────┐
│ test_mergeTreeDBSparseHashed │ SparseHashed │ LOADED │     100000000 │ 4.24 GiB │
└──────────────────────────────┴──────────────┴────────┴───────────────┴──────────┘

select count() from (
   select dictGet( 'default.test_mergeTreeDBSparseHashed', 'B', toUInt64(rand64()%100000000) )
   from numbers_mt(1000000))

Elapsed: 0.065 sec. Processed 1.00 million rows

1.3 - MergeTree table engine family

MergeTree table engine family

Internals:

https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup41/merge_tree.pdf

https://youtu.be/1UIl7FpNo2M?t=2467

1.3.1 - CollapsingMergeTree vs ReplacingMergeTree

CollapsingMergeTree vs ReplacingMergeTree

ReplacingMergeTree	CollapsingMergeTree
+ very easy to use (always replace)	- more complex (accounting-alike, put ‘rollback’ records to fix something)
+ you don’t need to store the previous state of the row	- you need to the store (somewhere) the previous state of the row, OR extract it from the table itself (point queries is not nice for ClickHouse®)
- no deletes	+ support deletes
- w/o FINAL - you can can always see duplicates, you need always to ‘pay’ FINAL performance penalty	+ properly crafted query can give correct results without final (i.e. `sum(amount * sign)` will be correct, no matter of you have duplicated or not)
- only `uniq()`-alike things can be calculated in materialized views	+ you can do basic counts & sums in materialized views

1.3.2 - Part names & MVCC

Part names & multiversion concurrency control.

Part names & multiversion concurrency control

Part name format is:

<partitionid>_<min_block_number>_<max_block_number>_<level>_<data_version>

system.parts contains all the information parsed.

partitionid is quite simple (it just comes from your partitioning key).

What are block_numbers?

DROP TABLE IF EXISTS part_names;
create table part_names (date Date, n UInt8, m UInt8) engine=MergeTree PARTITION BY toYYYYMM(date) ORDER BY n;

insert into part_names VALUES (now(), 0, 0);
select name, partition_id, min_block_number, max_block_number, level, data_version from system.parts where table = 'part_names' and active;
┌─name─────────┬─partition_id─┬─min_block_number─┬─max_block_number─┬─level─┬─data_version─┐
│ 202203_1_1_0 │ 202203       │                1 │                1 │     0 │            1 │
└──────────────┴──────────────┴──────────────────┴──────────────────┴───────┴──────────────┘

insert into part_names VALUES (now(), 0, 0);
select name, partition_id, min_block_number, max_block_number, level, data_version from system.parts where table = 'part_names' and active;
┌─name─────────┬─partition_id─┬─min_block_number─┬─max_block_number─┬─level─┬─data_version─┐
│ 202203_1_1_0 │ 202203       │                1 │                1 │     0 │            1 │
│ 202203_2_2_0 │ 202203       │                2 │                2 │     0 │            2 │
└──────────────┴──────────────┴──────────────────┴──────────────────┴───────┴──────────────┘

insert into part_names VALUES (now(), 0, 0);
select name, partition_id, min_block_number, max_block_number, level, data_version from system.parts where table = 'part_names' and active;
┌─name─────────┬─partition_id─┬─min_block_number─┬─max_block_number─┬─level─┬─data_version─┐
│ 202203_1_1_0 │ 202203       │                1 │                1 │     0 │            1 │
│ 202203_2_2_0 │ 202203       │                2 │                2 │     0 │            2 │
│ 202203_3_3_0 │ 202203       │                3 │                3 │     0 │            3 │
└──────────────┴──────────────┴──────────────────┴──────────────────┴───────┴──────────────┘

As you can see every insert creates a new incremental block_number which is written in part names both as <min_block_number> and <min_block_number> (and the level is 0 meaning that the part was never merged).

Those block numbering works in the scope of partition (for Replicated table) or globally across all partition (for plain MergeTree table).

ClickHouse® always merge only continuous blocks . And new part names always refer to the minimum and maximum block numbers.

OPTIMIZE TABLE part_names;

┌─name─────────┬─partition_id─┬─min_block_number─┬─max_block_number─┬─level─┬─data_version─┐
│ 202203_1_3_1 │ 202203       │                1 │                3 │     1 │            1 │
└──────────────┴──────────────┴──────────────────┴──────────────────┴───────┴──────────────┘

As you can see here - three parts (with block number 1,2,3) were merged and they formed the new part with name 1_3 as min/max block size. Level get incremented.

Now even while previous (merged) parts still exists in filesystem for a while (as inactive) ClickHouse is smart enough to understand that new part ‘covers’ same range of blocks as 3 parts of the prev ‘generation’

There might be a fifth section in the part name, data version.

Data version gets increased when a part mutates.

Every mutation takes one block number:

insert into part_names VALUES (now(), 0, 0);
insert into part_names VALUES (now(), 0, 0);
insert into part_names VALUES (now(), 0, 0);

select name, partition_id, min_block_number, max_block_number, level, data_version from system.parts where table = 'part_names' and active;

┌─name─────────┬─partition_id─┬─min_block_number─┬─max_block_number─┬─level─┬─data_version─┐
│ 202203_1_3_1 │ 202203       │                1 │                3 │     1 │            1 │
│ 202203_4_4_0 │ 202203       │                4 │                4 │     0 │            4 │
│ 202203_5_5_0 │ 202203       │                5 │                5 │     0 │            5 │
│ 202203_6_6_0 │ 202203       │                6 │                6 │     0 │            6 │
└──────────────┴──────────────┴──────────────────┴──────────────────┴───────┴──────────────┘

insert into part_names VALUES (now(), 0, 0);

alter table part_names update m=n where 1;

select name, partition_id, min_block_number, max_block_number, level, data_version from system.parts where table = 'part_names' and active;

┌─name───────────┬─partition_id─┬─min_block_number─┬─max_block_number─┬─level─┬─data_version─┐
│ 202203_1_3_1_7 │ 202203       │                1 │                3 │     1 │            7 │
│ 202203_4_4_0_7 │ 202203       │                4 │                4 │     0 │            7 │
│ 202203_5_5_0_7 │ 202203       │                5 │                5 │     0 │            7 │
│ 202203_6_6_0_7 │ 202203       │                6 │                6 │     0 │            7 │
│ 202203_8_8_0   │ 202203       │                8 │                8 │     0 │            8 │
└────────────────┴──────────────┴──────────────────┴──────────────────┴───────┴──────────────┘

OPTIMIZE TABLE part_names;

select name, partition_id, min_block_number, max_block_number, level, data_version from system.parts where table = 'part_names' and active;
┌─name───────────┬─partition_id─┬─min_block_number─┬─max_block_number─┬─level─┬─data_version─┐
│ 202203_1_8_2_7 │ 202203       │                1 │                8 │     2 │            7 │
└────────────────┴──────────────┴──────────────────┴──────────────────┴───────┴──────────────┘

1.3.3 - How to pick an ORDER BY / PRIMARY KEY / PARTITION BY for the MergeTree family table

Optimizing ClickHouse® MergeTree tables

Good order by usually has 3 to 5 columns, from lowest cardinal on the left (and the most important for filtering) to highest cardinal (and less important for filtering).

Practical approach to create a good ORDER BY for a table:

Pick the columns you use in filtering always
The most important for filtering and the lowest cardinal should be the left-most. Typically, it’s something like tenant_id
Next column is more cardinal, less important. It can be a rounded time sometimes, or site_id, or source_id, or group_id or something similar.
Repeat step 3 once again (or a few times)
If you already added all columns important for filtering and you’re still not addressing a single row with your pk - you can add more columns which can help to put similar records close to each other (to improve the compression)
If you have something like hierarchy / tree-like relations between the columns - put there the records from ‘root’ to ’leaves’ for example (continent, country, cityname). This way ClickHouse® can do a lookup by country/city even if the continent is not specified (it will just ‘check all continents’) special variants of MergeTree may require special ORDER BY to make the record unique etc.
For timeseries , it usually makes sense to put the timestamp as the latest column in ORDER BY, which helps with putting the same data nearby for better locality. There are only 2 major patterns for timestamps in ORDER BY: (…, toStartOf(Day|Hour|…)(timestamp), …, timestamp) and (…, timestamp). The first one is useful when you often query a small part of a table partition. (table partitioned by months, and you read only 1-4 days 90% of the time).
There are exceptions to the rule “low cordinality - first” related to compression ratio. For example, data with a lot of repeated attributes in rows (like clickstream), ordering by session_id will benefit compression and reduce disk read, while setting a low cardinality column (like event type) in the first place makes compression and overall query time worse.

Some examples of good ORDER BY:

ORDER BY (tenantid, site_id, utm_source, clientid, timestamp)

ORDER BY (site_id, toStartOfHour(timestamp), sessionid, timestamp )
PRIMARY KEY (site_id, toStartOfHour(timestamp), sessionid)

(FWIW, the Altinity blog has a great article on the LowCardinality datatype .)

For Summing / Aggregating

All dimensions go to ORDER BY, all metrics - outside of that.

The most important for filtering columns with the lowest cardinality should be the left-most.

If the number of dimensions is high, it typically makes sense to use a prefix of ORDER BY as a PRIMARY KEY to avoid polluting the sparse index.

Examples:

ORDER BY (tenant_id, hour, country_code, team_id, group_id, source_id)
PRIMARY KEY (tenant_id, hour, country_code, team_id)

For Replacing / Collapsing

You need to keep all ‘mutable’ columns outside of ORDER BY, and have some unique id (a base to collapse duplicates) inside. Typically the right-most column is some row identifier. And it’s often not needed in sparse index (so PRIMARY KEY can be a prefix of ORDER BY) The rest consideration are the same.

Examples:

ORDER BY (tenantid, site_id, eventid) --  utm_source is mutable, while tenantid, site_id is not
PRIMARY KEY (tenantid, site_id) -- eventid is not used for filtering, needed only for collapsing duplicates

Also read about LIGHT ORDER BY for speeding FINAL queries - https://kb.altinity.com/altinity-kb-queries-and-syntax/altinity-kb-final-clause-speed/#light-order-by

ORDER BY example

-- col1: high Cardinality
-- col2: low cardinality

CREATE TABLE tests.order_test
(    
     `col1` DateTime,    
     `col2` UInt8
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(col1)
ORDER BY (col1, col2)
--
SELECT count() 
┌───count()─┐ 
│ 126371225 │ 
└───────────┘

So let’s put the highest cardinal column to the left and the least to the right in the ORDER BY definition. This will impact in queries like:

SELECT * FROM order_test
WHERE col1 > toDateTime('2020-10-01')
ORDER BY col1, col2
FORMAT `Null`

Here for the filtering it will use the skipping index to select the parts WHERE col1 > xxx and the result won’t be need to be ordered because the ORDER BY in the query aligns with the ORDER BY in the table and the data is already ordered in disk. (FWIW, Alexander Zaitsev and Mikhail Filimonov wrote a great post on skipping indexes and how they work for the Altinity blog.)

executeQuery: (from [::ffff:192.168.11.171]:39428, user: admin) SELECT * FROM order_test WHERE col1 > toDateTime('2020-10-01') ORDER BY col1,col2 FORMAT Null; (stage: Complete)
ContextAccess (admin): Access granted: SELECT(col1, col2) ON tests.order_test
ContextAccess (admin): Access granted: SELECT(col1, col2) ON tests.order_test
InterpreterSelectQuery: FetchColumns -> Complete
tests.order_test (SelectExecutor): Key condition: (column 0 in [1601503201, +Inf))
tests.order_test (SelectExecutor): MinMax index condition: (column 0 in [1601503201, +Inf))
tests.order_test (SelectExecutor): Running binary search on index range for part 202010_367_545_8 (7612 marks)
tests.order_test (SelectExecutor): Running binary search on index range for part 202010_549_729_12 (37 marks)
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_689_719_2 (1403 marks)
tests.order_test (SelectExecutor): Running binary search on index range for part 202012_550_730_12 (3 marks)
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 37
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 3
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 1403
tests.order_test (SelectExecutor): Found continuous range in 11 steps
tests.order_test (SelectExecutor): Found continuous range in 3 steps
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_728_728_0 (84 marks)
tests.order_test (SelectExecutor): Found continuous range in 21 steps
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_725_725_0 (128 marks)
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 84
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_722_722_0 (128 marks)
tests.order_test (SelectExecutor): Found continuous range in 13 steps
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 128
tests.order_test (SelectExecutor): Found continuous range in 14 steps
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_370_686_19 (5993 marks)
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 5993
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found continuous range in 25 steps
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 128
tests.order_test (SelectExecutor): Found continuous range in 14 steps
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 7612
tests.order_test (SelectExecutor): Found continuous range in 25 steps
tests.order_test (SelectExecutor): Selected 8/9 parts by partition key, 8 parts by primary key, 15380/15380 marks by primary key, 15380 marks to read from 8 ranges
Ok.

0 rows in set. Elapsed: 0.649 sec. Processed 125.97 million rows, 629.86 MB (194.17 million rows/s., 970.84 MB/s.)

If we change the ORDER BY expression in the query, ClickHouse will need to retrieve the rows and reorder them:

SELECT * FROM order_test
WHERE col1 > toDateTime('2020-10-01')
ORDER BY col2, col1
FORMAT `Null`

As seen In the MergingSortedTransform message, the ORDER BY in the table definition is not aligned with the ORDER BY in the query, so ClickHouse has to reorder the resultset.

executeQuery: (from [::ffff:192.168.11.171]:39428, user: admin) SELECT * FROM order_test WHERE col1 > toDateTime('2020-10-01') ORDER BY col2,col1 FORMAT Null; (stage: Complete)
ContextAccess (admin): Access granted: SELECT(col1, col2) ON tests.order_test
ContextAccess (admin): Access granted: SELECT(col1, col2) ON tests.order_test
InterpreterSelectQuery: FetchColumns -> Complete
tests.order_test (SelectExecutor): Key condition: (column 0 in [1601503201, +Inf))
tests.order_test (SelectExecutor): MinMax index condition: (column 0 in [1601503201, +Inf))
tests.order_test (SelectExecutor): Running binary search on index range for part 202010_367_545_8 (7612 marks)
tests.order_test (SelectExecutor): Running binary search on index range for part 202012_550_730_12 (3 marks)
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_725_725_0 (128 marks)
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 3
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_689_719_2 (1403 marks)
tests.order_test (SelectExecutor): Running binary search on index range for part 202010_549_729_12 (37 marks)
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_728_728_0 (84 marks)
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found continuous range in 3 steps
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_722_722_0 (128 marks)
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 7612
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 37
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found continuous range in 11 steps
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 1403
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 84
tests.order_test (SelectExecutor): Found continuous range in 25 steps
tests.order_test (SelectExecutor): Running binary search on index range for part 202011_370_686_19 (5993 marks)
tests.order_test (SelectExecutor): Found continuous range in 21 steps
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 128
tests.order_test (SelectExecutor): Found continuous range in 13 steps
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found continuous range in 14 steps
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 128
tests.order_test (SelectExecutor): Found (LEFT) boundary mark: 0
tests.order_test (SelectExecutor): Found continuous range in 14 steps
tests.order_test (SelectExecutor): Found (RIGHT) boundary mark: 5993
tests.order_test (SelectExecutor): Found continuous range in 25 steps
tests.order_test (SelectExecutor): Selected 8/9 parts by partition key, 8 parts by primary key, 15380/15380 marks by primary key, 15380 marks to read from 8 ranges
tests.order_test (SelectExecutor): MergingSortedTransform: Merge sorted 1947 blocks, 125972070 rows in 1.423973879 sec., 88465155.05499662 rows/sec., 423.78 MiB/sec
Ok.

0 rows in set. Elapsed: 1.424 sec. Processed 125.97 million rows, 629.86 MB (88.46 million rows/s., 442.28 MB/s.)

PARTITION BY

Things to consider:

Good size for single partition is something like 1-300Gb.
For Summing/Replacing a bit smaller (400Mb-40Gb)
Better to avoid touching more that few dozens of partitions with typical SELECT query.
Single insert should bring data to one or few partitions.
The number of partitions in table - dozen or hundreds, not thousands.

The size of partitions you can check in system.parts table.

Examples:

-- for time-series:
PARTITION BY toYear(timestamp)          -- long retention, not too much data
PARTITION BY toYYYYMM(timestamp)        --  
PARTITION BY toMonday(timestamp)        -- 
PARTITION BY toDate(timestamp)          --
PARTITION BY toStartOfHour(timestamp)   -- short retention, lot of data

-- for table with some incremental (non time-bounded) counter

PARTITION BY intDiv(transaction_id, 1000000)

-- for some dimention tables (always requested with WHERE userid)
PARTITION BY userid % 16

For the small tables (smaller than few gigabytes) partitioning is usually not needed at all (just skip PARTITION BY expression when you create the table).

1.3.4 - ClickHouse® AggregatingMergeTree

FAQs for storing and merging pre-aggregated data

Q. What happens with columns which are not part of the ORDER BY key, nor have the AggregateFunction type?

A. it picks the first value met, (similar to any)

CREATE TABLE agg_test
(
    `a` String,
    `b` UInt8,
    `c` SimpleAggregateFunction(max, UInt8)
)
ENGINE = AggregatingMergeTree
ORDER BY a;

INSERT INTO agg_test VALUES ('a', 1, 1);
INSERT INTO agg_test VALUES ('a', 2, 2);

SELECT * FROM agg_test FINAL;

┌─a─┬─b─┬─c─┐
│ a │ 1 │ 2 │
└───┴───┴───┘

INSERT INTO agg_test VALUES ('a', 3, 3);

SELECT * FROM agg_test;

┌─a─┬─b─┬─c─┐
│ a │ 1 │ 2 │
└───┴───┴───┘
┌─a─┬─b─┬─c─┐
│ a │ 3 │ 3 │
└───┴───┴───┘

OPTIMIZE TABLE agg_test FINAL;

SELECT * FROM agg_test;

┌─a─┬─b─┬─c─┐
│ a │ 1 │ 3 │
└───┴───┴───┘

Last non-null value for each column

CREATE TABLE test_last
(
    `col1` Int32,
    `col2` SimpleAggregateFunction(anyLast, Nullable(DateTime)),
    `col3` SimpleAggregateFunction(anyLast, Nullable(DateTime))
)
ENGINE = AggregatingMergeTree
ORDER BY col1

Ok.

0 rows in set. Elapsed: 0.003 sec.

INSERT INTO test_last (col1, col2) VALUES (1, now());

Ok.

1 rows in set. Elapsed: 0.014 sec.

INSERT INTO test_last (col1, col3) VALUES (1, now())

Ok.

1 rows in set. Elapsed: 0.006 sec.

SELECT
    col1,
    anyLast(col2),
    anyLast(col3)
FROM test_last
GROUP BY col1

┌─col1─┬───────anyLast(col2)─┬───────anyLast(col3)─┐
│    1 │ 2020-01-16 20:57:46 │ 2020-01-16 20:57:51 │
└──────┴─────────────────────┴─────────────────────┘

1 rows in set. Elapsed: 0.005 sec.

SELECT *
FROM test_last
FINAL

┌─col1─┬────────────────col2─┬────────────────col3─┐
│    1 │ 2020-01-16 20:57:46 │ 2020-01-16 20:57:51 │
└──────┴─────────────────────┴─────────────────────┘

1 rows in set. Elapsed: 0.003 sec.

Merge two data streams

Q. I have 2 Kafka topics from which I am storing events into 2 different tables (A and B) having the same unique ID. I want to create a single table that combines the data in tables A and B into one table C. The problem is that data is received asynchronously and not all the data is available when a row arrives in Table A or vice-versa.

A. You can use AggregatingMergeTree with Nullable columns and any aggregation function or Non-Nullable column and max aggregation function if it is acceptable for your data.

CREATE TABLE table_C (
    id      Int64,
    colA    SimpleAggregatingFunction(any,Nullable(UInt32)),
    colB    SimpleAggregatingFunction(max, String)
) ENGINE = AggregatingMergeTree()
ORDER BY id;

CREATE MATERIALIZED VIEW mv_A TO table_C AS
SELECT id,colA FROM Kafka_A;

CREATE MATERIALIZED VIEW mv_B TO table_C AS
SELECT id,colB FROM Kafka_B;

Here is a more complicated example (from here https://gist.github.com/den-crane/d03524eadbbce0bafa528101afa8f794 )

CREATE TABLE states_raw(
    d date,
    uid UInt64,
    first_name String,
    last_name String,
    modification_timestamp_mcs DateTime64(3) default now64(3)
) ENGINE = Null;

CREATE TABLE final_states_by_month(
    d date,
    uid UInt64,
    final_first_name      AggregateFunction(argMax, String, DateTime64(3)),
    final_last_name      AggregateFunction(argMax, String, DateTime64(3)))
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(d)
ORDER BY (uid, d);

CREATE MATERIALIZED VIEW final_states_by_month_mv TO final_states_by_month AS
SELECT
    d, uid,
    argMaxState(first_name, if(first_name<>'', modification_timestamp_mcs, toDateTime64(0,3))) AS final_first_name,
    argMaxState(last_name, if(last_name<>'', modification_timestamp_mcs, toDateTime64(0,3)))   AS final_last_name
FROM states_raw
GROUP BY d, uid;


insert into states_raw(d,uid,first_name) values (today(), 1, 'Tom');
insert into states_raw(d,uid,last_name) values (today(),  1, 'Jones');
insert into states_raw(d,uid,first_name,last_name) values (today(), 2, 'XXX', '');
insert into states_raw(d,uid,first_name,last_name) values (today(), 2, 'YYY', 'YYY');


select uid, argMaxMerge(final_first_name) first_name, argMaxMerge(final_last_name) last_name 
from final_states_by_month group by uid

┌─uid─┬─first_name─┬─last_name─┐
│   2 │ YYY        │ YYY       │
│   1 │ Tom        │ Jones     │
└─────┴────────────┴───────────┘

optimize table final_states_by_month final;

select uid, finalizeAggregation(final_first_name) first_name, finalizeAggregation(final_last_name) last_name 
from final_states_by_month 

┌─uid─┬─first_name─┬─last_name─┐
│   1 │ Tom        │ Jones     │
│   2 │ YYY        │ YYY       │
└─────┴────────────┴───────────┘

1.3.5 - index & column files

index & column files

Key Condition

Links

https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup27/adaptive_index_granularity.pdf

1.3.6 - Merge performance and OPTIMIZE FINAL

Merge Performance

Main things affecting the merge speed are:

Schema (especially compression codecs, some bad types, sorting order…)
Horizontal vs Vertical merge
- Horizontal = reads all columns at once, do merge sort, write new part
- Vertical = first read columns from order by, do merge sort, write them to disk, remember permutation, then process the rest of columns on by one, applying permutation.
compact vs wide parts
Other things like server load, concurrent merges…

SELECT name, value
FROM system.merge_tree_settings
WHERE name LIKE '%vert%';

│ enable_vertical_merge_algorithm                  │ 1      
│ vertical_merge_algorithm_min_rows_to_activate    │ 131072
│ vertical_merge_algorithm_min_columns_to_activate │ 11

Vertical merge will be used if part has more than 131072 rows and more than 11 columns in the table.

-- Disable Vertical Merges
ALTER TABLE test MODIFY SETTING enable_vertical_merge_algorithm = 0

Horizontal merge used by default, will use more memory if there are more than 80 columns in the table

OPTIMIZE TABLE example FINAL DEDUPLICATE BY expr

When using deduplicate feature in OPTIMIZE FINAL, the question is which row will remain and won’t be deduped?

For SELECT operations ClickHouse® does not guarantee the order of the resultset unless you specify ORDER BY. This random ordering is affected by different parameters, like for example max_threads.

In a merge operation ClickHouse reads rows sequentially in storage order, which is determined by ORDER BY specified in CREATE TABLE statement, and only the first unique row in that order survives deduplication. So it is a bit different from how SELECT actually works. As FINAL clause is used then ClickHouse will merge all rows across all partitions (If it is not specified then the merge operation will be done per partition), and so the first unique row of the first partition will survive deduplication. Merges are single-threaded because it is too complicated to apply merge ops in-parallel, and it generally makes no sense.

1.3.7 - Nulls in order by

Nulls in order by

It is NOT RECOMMENDED for a general use
Use on your own risk
Use latest ClickHouse® version if you need that.

CREATE TABLE x
(
    `a` Nullable(UInt32),
    `b` Nullable(UInt32),
    `cnt` UInt32
)
ENGINE = SummingMergeTree
ORDER BY (a, b)
SETTINGS allow_nullable_key = 1;
INSERT INTO x VALUES (Null,2,1), (Null,Null,1), (3, Null, 1), (4,4,1);
INSERT INTO x VALUES (Null,2,1), (Null,Null,1), (3, Null, 1), (4,4,1);
SELECT * FROM x;
┌────a─┬────b─┬─cnt─┐
│    3 │  null │   2 │
│    4 │    4 │   2 │
│  null │    2 │   2 │
│  null │  null │   2 │
└──────┴──────┴─────┘

1.3.8 - ReplacingMergeTree

ReplacingMergeTree

ReplacingMergeTree is a powerful ClickHouse® MergeTree engine. It is one of the techniques that can be used to guarantee unicity or exactly once delivery in ClickHouse.

General Operations

Engine Parameters

Engine = ReplacingMergeTree([version_column],[is_deleted_column])
ORDER BY <list_of_columns>

ORDER BY – The ORDER BY defines the columns that need to be unique at merge time. Since merge time can not be decided most of the time, the FINAL keyword is required to remove duplicates.
version_column – An monotonically increasing number, which can be based on a timestamp. Used for make sure sure updates are executed in a right order.
is_deleted_column (23.2+ see https://github.com/ClickHouse/ClickHouse/pull/41005 ) – the column used to delete rows.

DML operations

CREATE – INSERT INTO t values(..)
READ – SELECT FROM t final
UPDATE – INSERT INTO t(..., _version) values (...), insert with incremented version
DELETE – INSERT INTO t(..., _version, is_deleted) values(..., 1)

FINAL

ClickHouse does not guarantee that merge will fire and replace rows using ReplacingMergeTree logic. FINAL keyword should be used in order to apply merge in a query time. It works reasonably fast when PK filter is used, but maybe slow for SELECT * type of queries:

See these links for reference:

Since 23.2, profile level final=1 can force final automatically, see https://github.com/ClickHouse/ClickHouse/pull/40945

ClickHouse merge parts only in scope of single partition, so if two rows with the same replacing key would land in different partitions, they would never be merged in single row. FINAL keyword works in other way, it merge all rows across all partitions. But that behavior can be changed viado_not_merge_across_partitions_select_final setting.

CREATE TABLE repl_tbl_part
(
    `key` UInt32,
    `value` UInt32,
    `part_key` UInt32
)
ENGINE = ReplacingMergeTree
PARTITION BY part_key
ORDER BY key;

INSERT INTO repl_tbl_part SELECT
    1 AS key,
    number AS value,
    number % 2 AS part_key
FROM numbers(4)
SETTINGS optimize_on_insert = 0;

SELECT * FROM repl_tbl_part;

┌─key─┬─value─┬─part_key─┐
│   1 │     1 │        1 │
│   1 │     3 │        1 │
└─────┴───────┴──────────┘
┌─key─┬─value─┬─part_key─┐
│   1 │     0 │        0 │
│   1 │     2 │        0 │
└─────┴───────┴──────────┘

SELECT * FROM repl_tbl_part FINAL;

┌─key─┬─value─┬─part_key─┐
│   1 │     3 │        1 │
└─────┴───────┴──────────┘

SELECT * FROM repl_tbl_part FINAL SETTINGS do_not_merge_across_partitions_select_final=1;

┌─key─┬─value─┬─part_key─┐
│   1 │     3 │        1 │
└─────┴───────┴──────────┘
┌─key─┬─value─┬─part_key─┐
│   1 │     2 │        0 │
└─────┴───────┴──────────┘

OPTIMIZE TABLE repl_tbl_part FINAL;

SELECT * FROM repl_tbl_part;

┌─key─┬─value─┬─part_key─┐
│   1 │     3 │        1 │
└─────┴───────┴──────────┘
┌─key─┬─value─┬─part_key─┐
│   1 │     2 │        0 │
└─────┴───────┴──────────┘

Deleting the data

Delete in partition: ALTER TABLE t DELETE WHERE ... in PARTITION 'partition' – slow and asynchronous, rebuilds the partition
Filter is_deleted in queries: SELECT ... WHERE is_deleted = 0
Before 23.2, use ROW POLICY to apply a filter automatically: CREATE ROW POLICY delete_masking on t using is_deleted = 0 for ALL;
23.2+ ReplacingMergeTree(version, is_deleted) ORDER BY .. SETTINGS clean_deleted_rows='Always' (see https://github.com/ClickHouse/ClickHouse/pull/41005 )

Other options:

Partition operations: ALTER TABLE t DROP PARTITION 'partition' – locks the table, drops full partition only
Lightweight delete: DELETE FROM t WHERE ... – experimental

Use cases

Last state

Tested on ClickHouse 23.6 version FINAL is good in all cases

CREATE TABLE repl_tbl
(
    `key` UInt32,
    `val_1` UInt32,
    `val_2` String,
    `val_3` String,
    `val_4` String,
    `val_5` UUID,
    `ts` DateTime
)
ENGINE = ReplacingMergeTree(ts)
ORDER BY key

SYSTEM STOP MERGES repl_tbl;

INSERT INTO repl_tbl SELECT number as key, rand() as val_1, randomStringUTF8(10) as val_2, randomStringUTF8(5) as val_3, randomStringUTF8(4) as val_4, generateUUIDv4() as val_5, now() as ts FROM numbers(10000000);
INSERT INTO repl_tbl SELECT number as key, rand() as val_1, randomStringUTF8(10) as val_2, randomStringUTF8(5) as val_3, randomStringUTF8(4) as val_4, generateUUIDv4() as val_5, now() as ts FROM numbers(10000000);
INSERT INTO repl_tbl SELECT number as key, rand() as val_1, randomStringUTF8(10) as val_2, randomStringUTF8(5) as val_3, randomStringUTF8(4) as val_4, generateUUIDv4() as val_5, now() as ts FROM numbers(10000000);
INSERT INTO repl_tbl SELECT number as key, rand() as val_1, randomStringUTF8(10) as val_2, randomStringUTF8(5) as val_3, randomStringUTF8(4) as val_4, generateUUIDv4() as val_5, now() as ts FROM numbers(10000000);

SELECT count() FROM repl_tbl

┌──count()─┐
│ 40000000 │
└──────────┘

Single key

-- GROUP BY
SELECT key, argMax(val_1, ts) as val_1, argMax(val_2, ts) as val_2, argMax(val_3, ts) as val_3, argMax(val_4, ts) as val_4, argMax(val_5, ts) as val_5, max(ts) FROM repl_tbl WHERE key = 10 GROUP BY key;
1 row in set. Elapsed: 0.008 sec.

-- ORDER BY LIMIT BY
SELECT * FROM repl_tbl WHERE key = 10 ORDER BY ts DESC LIMIT 1 BY key ;
1 row in set. Elapsed: 0.006 sec.

-- Subquery
SELECT * FROM repl_tbl WHERE key = 10 AND ts = (SELECT max(ts) FROM repl_tbl WHERE key = 10);
1 row in set. Elapsed: 0.009 sec.

-- FINAL
SELECT * FROM repl_tbl FINAL WHERE key = 10;
1 row in set. Elapsed: 0.008 sec.

Multiple keys

-- GROUP BY
SELECT key, argMax(val_1, ts) as val_1, argMax(val_2, ts) as val_2, argMax(val_3, ts) as val_3, argMax(val_4, ts) as val_4, argMax(val_5, ts) as val_5, max(ts) FROM repl_tbl WHERE key IN (SELECT toUInt32(number) FROM numbers(1000000) WHERE number % 100) GROUP BY key FORMAT Null;
Peak memory usage (for query): 2.19 GiB.
0 rows in set. Elapsed: 1.043 sec. Processed 5.08 million rows, 524.38 MB (4.87 million rows/s., 502.64 MB/s.)

-- SET optimize_aggregation_in_order=1;
Peak memory usage (for query): 349.94 MiB.
0 rows in set. Elapsed: 0.901 sec. Processed 4.94 million rows, 506.55 MB (5.48 million rows/s., 562.17 MB/s.)

-- ORDER BY LIMIT BY
SELECT * FROM repl_tbl WHERE key IN (SELECT toUInt32(number)　FROM numbers(1000000) WHERE number % 100) ORDER BY ts DESC LIMIT 1 BY key FORMAT Null;
Peak memory usage (for query): 1.12 GiB.
0 rows in set. Elapsed: 1.171 sec. Processed 5.08 million rows, 524.38 MB (4.34 million rows/s., 447.95 MB/s.)

-- Subquery
SELECT * FROM repl_tbl WHERE (key, ts) IN (SELECT key, max(ts) FROM repl_tbl WHERE key IN (SELECT toUInt32(number) FROM numbers(1000000) WHERE number % 100) GROUP BY key) FORMAT Null;
Peak memory usage (for query): 197.30 MiB.
0 rows in set. Elapsed: 0.484 sec. Processed 8.72 million rows, 507.33 MB (18.04 million rows/s., 1.05 GB/s.)

-- SET optimize_aggregation_in_order=1;
Peak memory usage (for query): 171.93 MiB.
0 rows in set. Elapsed: 0.465 sec. Processed 8.59 million rows, 490.55 MB (18.46 million rows/s., 1.05 GB/s.)

-- FINAL
SELECT * FROM repl_tbl FINAL WHERE key IN (SELECT toUInt32(number) FROM numbers(1000000) WHERE number % 100) FORMAT Null;
Peak memory usage (for query): 537.13 MiB.
0 rows in set. Elapsed: 0.357 sec. Processed 4.39 million rows, 436.28 MB (12.28 million rows/s., 1.22 GB/s.)

Full table

-- GROUP BY
SELECT key, argMax(val_1, ts) as val_1, argMax(val_2, ts) as val_2, argMax(val_3, ts) as val_3, argMax(val_4, ts) as val_4, argMax(val_5, ts) as val_5, max(ts) FROM repl_tbl GROUP BY key FORMAT Null;
Peak memory usage (for query): 16.08 GiB.
0 rows in set. Elapsed: 11.600 sec. Processed 40.00 million rows, 5.12 GB (3.45 million rows/s., 441.49 MB/s.)

-- SET optimize_aggregation_in_order=1;
Peak memory usage (for query): 865.76 MiB.
0 rows in set. Elapsed: 9.677 sec. Processed 39.82 million rows, 5.10 GB (4.12 million rows/s., 526.89 MB/s.)

-- ORDER BY LIMIT BY
SELECT * FROM repl_tbl ORDER BY ts DESC LIMIT 1 BY key FORMAT Null;
Peak memory usage (for query): 8.39 GiB.
0 rows in set. Elapsed: 14.489 sec. Processed 40.00 million rows, 5.12 GB (2.76 million rows/s., 353.45 MB/s.)

-- Subquery
SELECT * FROM repl_tbl WHERE (key, ts) IN (SELECT key, max(ts) FROM repl_tbl GROUP BY key) FORMAT Null;
Peak memory usage (for query): 2.40 GiB.
0 rows in set. Elapsed: 5.225 sec. Processed 79.65 million rows, 5.40 GB (15.24 million rows/s., 1.03 GB/s.)

-- SET optimize_aggregation_in_order=1;
Peak memory usage (for query): 924.39 MiB.
0 rows in set. Elapsed: 4.126 sec. Processed 79.67 million rows, 5.40 GB (19.31 million rows/s., 1.31 GB/s.)

-- FINAL
SELECT * FROM repl_tbl FINAL FORMAT Null;
Peak memory usage (for query): 834.09 MiB.
0 rows in set. Elapsed: 2.314 sec. Processed 38.80 million rows, 4.97 GB (16.77 million rows/s., 2.15 GB/s.)

1.3.8.1 - ReplacingMergeTree does not collapse duplicates

ReplacingMergeTree does not collapse duplicates

Hi there, I have a question about replacing merge trees. I have set up a Materialized View with ReplacingMergeTree table, but even if I call optimize on it, the parts don’t get merged. I filled that table yesterday, nothing happened since then. What should I do?

Merges are eventual and may never happen. It depends on the number of inserts that happened after, the number of parts in the partition, size of parts. If the total size of input parts are greater than the maximum part size then they will never be merged.

https://clickhouse.com/docs/en/operations/settings/merge-tree-settings#max-bytes-to-merge-at-max-space-in-pool

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree ReplacingMergeTree is suitable for clearing out duplicate data in the background in order to save space, but it doesn’t guarantee the absence of duplicates.

1.3.9 - Skip index

Skip index

Warning

When you are creating skip indexes

in non-regular (Replicated)MergeTree tables over non ORDER BY columns. ClickHouse® applies index condition on the first step of query execution, so it’s possible to get outdated rows.

--(1) create test table
drop table if exists test;
create table test
(
    version UInt32
    ,id UInt32
    ,state UInt8
    ,INDEX state_idx (state) type set(0) GRANULARITY 1
) ENGINE ReplacingMergeTree(version)
      ORDER BY (id);

--(2) insert sample data
INSERT INTO test (version, id, state) VALUES (1,1,1);
INSERT INTO test (version, id, state) VALUES (2,1,0);
INSERT INTO test (version, id, state) VALUES (3,1,1);

--(3) check the result:
-- expected 3, 1, 1
select version, id, state from test final;
┌─version─┬─id─┬─state─┐
│       3 │  1 │     1 │
└─────────┴────┴───────┘

-- expected empty result
select version, id, state from test final where state=0;
┌─version─┬─id─┬─state─┐
│       2 │  1 │     0 │
└─────────┴────┴───────┘

1.3.10 - SummingMergeTree

SummingMergeTree

Nested structures

In certain conditions it could make sense to collapse one of dimensions to set of arrays. It’s usually profitable to do if this dimension is not commonly used in queries. It would reduce amount of rows in aggregated table and speed up queries which doesn’t care about this dimension in exchange of aggregation performance by collapsed dimension.

CREATE TABLE traffic
(
    `key1` UInt32,
    `key2` UInt32,
    `port` UInt16,
    `bits_in` UInt32 CODEC (T64,LZ4),
    `bits_out` UInt32 CODEC (T64,LZ4),
    `packets_in` UInt32 CODEC (T64,LZ4),
    `packets_out` UInt32 CODEC (T64,LZ4)
)
ENGINE = SummingMergeTree
ORDER BY (key1, key2, port);

INSERT INTO traffic SELECT
    number % 1000,
    intDiv(number, 10000),
    rand() % 20,
    rand() % 753,
    rand64() % 800,
    rand() % 140,
    rand64() % 231
FROM numbers(100000000);

CREATE TABLE default.traffic_map
(
    `key1` UInt32,
    `key2` UInt32,
    `bits_in` UInt32 CODEC(T64, LZ4),
    `bits_out` UInt32 CODEC(T64, LZ4),
    `packets_in` UInt32 CODEC(T64, LZ4),
    `packets_out` UInt32 CODEC(T64, LZ4),
    `portMap.port` Array(UInt16),
    `portMap.bits_in` Array(UInt32) CODEC(T64, LZ4),
    `portMap.bits_out` Array(UInt32) CODEC(T64, LZ4),
    `portMap.packets_in` Array(UInt32) CODEC(T64, LZ4),
    `portMap.packets_out` Array(UInt32) CODEC(T64, LZ4)
)
ENGINE = SummingMergeTree
ORDER BY (key1, key2);

INSERT INTO traffic_map WITH rand() % 20 AS port
SELECT
    number % 1000 AS key1,
    intDiv(number, 10000) AS key2,
    rand() % 753 AS bits_in,
    rand64() % 800 AS bits_out,
    rand() % 140 AS packets_in,
    rand64() % 231 AS packets_out,
    [port],
    [bits_in],
    [bits_out],
    [packets_in],
    [packets_out]
FROM numbers(100000000);

┌─table───────┬─column──────────────┬─────rows─┬─compressed─┬─uncompressed─┬──ratio─┐
│ traffic     │ bits_out            │ 80252317 │ 109.09 MiB │ 306.14 MiB   │   2.81 │
│ traffic     │ bits_in             │ 80252317 │ 108.34 MiB │ 306.14 MiB   │   2.83 │
│ traffic     │ port                │ 80252317 │ 99.21 MiB  │ 153.07 MiB   │   1.54 │
│ traffic     │ packets_out         │ 80252317 │ 91.36 MiB  │ 306.14 MiB   │   3.35 │
│ traffic     │ packets_in          │ 80252317 │ 84.61 MiB  │ 306.14 MiB   │   3.62 │
│ traffic     │ key2                │ 80252317 │ 47.88 MiB  │ 306.14 MiB   │   6.39 │
│ traffic     │ key1                │ 80252317 │ 1.38 MiB   │ 306.14 MiB   │ 221.42 │
│ traffic_map │ portMap.bits_out    │ 10000000 │ 108.96 MiB │ 306.13 MiB   │   2.81 │
│ traffic_map │ portMap.bits_in     │ 10000000 │ 108.32 MiB │ 306.13 MiB   │   2.83 │
│ traffic_map │ portMap.port        │ 10000000 │ 92.00 MiB  │ 229.36 MiB   │   2.49 │
│ traffic_map │ portMap.packets_out │ 10000000 │ 90.95 MiB  │ 306.13 MiB   │   3.37 │
│ traffic_map │ portMap.packets_in  │ 10000000 │ 84.19 MiB  │ 306.13 MiB   │   3.64 │
│ traffic_map │ key2                │ 10000000 │ 23.46 MiB  │ 38.15 MiB    │   1.63 │
│ traffic_map │ bits_in             │ 10000000 │ 15.59 MiB  │ 38.15 MiB    │   2.45 │
│ traffic_map │ bits_out            │ 10000000 │ 15.59 MiB  │ 38.15 MiB    │   2.45 │
│ traffic_map │ packets_out         │ 10000000 │ 13.22 MiB  │ 38.15 MiB    │   2.89 │
│ traffic_map │ packets_in          │ 10000000 │ 12.62 MiB  │ 38.15 MiB    │   3.02 │
│ traffic_map │ key1                │ 10000000 │ 180.29 KiB │ 38.15 MiB    │ 216.66 │
└─────────────┴─────────────────────┴──────────┴────────────┴──────────────┴────────┘

-- Queries

SELECT
    key1,
    sum(packets_in),
    sum(bits_out)
FROM traffic
GROUP BY key1
FORMAT `Null`

0 rows in set. Elapsed: 0.488 sec. Processed 80.25 million rows, 963.03 MB (164.31 million rows/s., 1.97 GB/s.)

SELECT
    key1,
    sum(packets_in),
    sum(bits_out)
FROM traffic_map
GROUP BY key1
FORMAT `Null`

0 rows in set. Elapsed: 0.063 sec. Processed 10.00 million rows, 120.00 MB (159.43 million rows/s., 1.91 GB/s.)


SELECT
    key1,
    port,
    sum(packets_in),
    sum(bits_out)
FROM traffic
GROUP BY
    key1,
    port
FORMAT `Null`

0 rows in set. Elapsed: 0.668 sec. Processed 80.25 million rows, 1.12 GB (120.14 million rows/s., 1.68 GB/s.)

WITH arrayJoin(arrayZip(untuple(sumMap(portMap.port, portMap.packets_in, portMap.bits_out)))) AS tpl
SELECT
    key1,
    tpl.1 AS port,
    tpl.2 AS packets_in,
    tpl.3 AS bits_out
FROM traffic_map
GROUP BY key1
FORMAT `Null`

0 rows in set. Elapsed: 0.915 sec. Processed 10.00 million rows, 1.08 GB (10.93 million rows/s., 1.18 GB/s.)

1.3.11 - UPSERT by VersionedCollapsingMergeTree

How to aggregate mutating event stream with duplicates

Challenges with mutated data

When you have an incoming event stream with duplicates, updates, and deletes, building a consistent row state inside the ClickHouse® table is a big challenge.

The UPDATE/DELETE approach in the OLTP world won’t help with OLAP databases tuned to handle big batches. UPDATE/DELETE operations in ClickHouse are executed as “mutations,” rewriting a lot of data and being relatively slow. You can’t run such operations very often, as for OLTP databases. But the UPSERT operation (insert and replace) runs fast with the ReplacingMergeTree Engine. It’s even set as the default mode for INSERT without any special keyword. We can emulate UPDATE (or even DELETE) with the UPSERT operation.

There are a lot of blog posts on how to use ReplacingMergeTree Engine to handle mutated data streams. A properly designed table schema with ReplacingMergeTree Engine is a good instrument for building the DWH Dimensions table. But when maintaining metrics in Fact tables, there are several problems:

it’s not possible to use a valuable ClickHouse feature - online aggregation of incoming data by Materialized Views or Projections on top of the ReplacingMT table, because duplicates and updates will not be deduplicated by the engine during inserts, and calculated aggregates (like sum or count) will be incorrect. For significant amounts of data, it’s become critical because aggregating raw data during report queries will take too much time.
unfinished support for DELETEs. While in the newest versions of ClickHouse, it’s possible to add the is_deleted to ReplacingMergeTree parameters, the necessity of manually filtering out deleted rows after FINAL processing makes that feature less useful.
Mutated data should be localized to the same partition. If the “replacing” row is saved to a partition different from the previous one, the report query will be much slower or produce unexpected results.

-- multiple partitions problem
CREATE TABLE RMT
(
    `key` Int64,
    `someCol` String,
    `eventTime` DateTime
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(eventTime)
ORDER BY key;

INSERT INTO RMT Values (1, 'first', '2024-04-25T10:16:21');
INSERT INTO RMT Values (1, 'second', '2024-05-02T08:36:59');

with merged as (select * from RMT FINAL)
select * from merged
where eventTime < '2024-05-01'

You will get a row with ‘first’, not an empty set, as one might expect with the FINAL processing of a whole table.

Collapsing

ClickHouse has other table engines, such as CollapsingMergeTree and VersionedCollapsingMergeTree, that can be used even better for UPSERT operation.

Both work by inserting a “rollback row” to compensate for the previous insert. The difference between CollapsingMergeTree and VersionedCollapsingMergeTree is in the algorithm of collapsing. For Cluster configurations, it’s essential to understand which row came first and who should replace whom. That is why using ReplicatedVersionedCollapsingMergeTree is mandatory for Replicated Clusters.

When dealing with such complicated data streams, it needs to be solved 3 tasks simultaneously:

remove duplicates
process updates and deletes
calculate correct aggregates

It’s essential to understand how the collapsing algorithm of VersionedCollapsingMergeTree works. Quote from the documentation :

When ClickHouse merges data parts, it deletes each pair of rows that have the same primary key and version and different Sign. The order of rows does not matter.

The version column should increase over time. You may use a natural timestamp for that. Random-generated IDs are not suitable for the version column.

Replace data in another partition

Let’s first fix the problem with mutated data in a different partition.

CREATE TABLE VCMT
(
    key Int64,
    someCol String,
    eventTime DateTime,
    sign Int8
)
ENGINE = VersionedCollapsingMergeTree(sign,eventTime)
PARTITION BY toYYYYMM(eventTime)
ORDER BY key;

INSERT INTO VCMT Values (1, 'first', '2024-04-25 10:16:21',1);
INSERT INTO VCMT Values (1, 'first', '2024-04-25 10:16:21',-1), (1, 'second', '2024-05-02 08:36:59',1);

set do_not_merge_across_partitions_select_final=1; -- for fast FINAL

select 'no rows after:';
with merged as 
  (select * from VCMT FINAL)
select * from merged
where eventTime < '2024-05-01';

With VersionedCollapsingMergeTree, we can use more partition strategies, even with columns not tied to the row’s primary key. This could facilitate the creation of faster queries, more convenient TTLs (Time-To-Live), and backups.

Row deduplication

There are several ways to remove duplicates from the event stream. The most effective feature is block deduplication, which occurs when ClickHouse drops incoming blocks with the same checksum (or tag). However, this requires building a smart ingestor capable of saving positions in a transactional manner.

However, another method is possible: verifying whether a particular row already exists in the destination table to avoid redundant insertions. Together with block deduplication, that method also avoids using ReplacingMergeTree and FINAL during query time.

Ensuring accuracy and consistency in results requires executing this process on a single thread within one cluster node. This method is particularly suitable for less active event streams, such as those with up to 100,000 events per second. To boost performance, incoming streams should be segmented into several partitions (or ‘shards’) based on the table/event’s Primary Key, with each partition processed on a single thread.

An example of row deduplication:

create table Example1 (id Int64, metric UInt64) 
engine = MergeTree order by id;

create table Example1Null engine = Null as Example1;

create materialized view __Example1 to Example1 as
select * from Example1Null 
where id not in (
   select id from Example1 where id in (
      select id from Example1Null
   )
);

Here is the trick:

use Null table and MatView to be able to access both the insert block and the dest table
check the existence of IDs in the destination table with a fast index scan by a primary key using the IN operator
filter existing rows from insert block by NOT IN operator

In most cases, the insert block does not have too many rows (like 1000-100k), so checking the destination table for their existence by scanning the Primary Key (residing in memory) won’t take much time. However, due to the high table index granularity, it can still be noticeable on high load. To enhance performance, consider reducing index granularity to 4096 (from the default 8192) or even fewer values.

Getting old row

To process updates in CollapsingMergeTree, the ’last row state’ must be known before inserting the ‘compensation row.’ Sometimes, this is possible - CDC events coming from MySQL’s binlog or Postgres’s WAL contain not only ’new’ data but also ‘old’ values. If one of the columns includes a sequence-generated version or timestamp of the row’s update time, it can be used as the row’s ‘version’ for VersionedCollapsingMergeTree. When the incoming event stream lacks old metric values and suitable version information, we can retrieve that data by examining the ClickHouse table using the same method used for row deduplication in the previous example.

create table Example2 (id Int64, metric UInt64, sign Int8) 
engine = CollapsingMergeTree(sign) order by id;

create table Example2Null engine = Null as Example2;

create materialized view __Example2 to Example2 as
with _old as (
   select *, arrayJoin([-1,1]) as _sign 
   from Example2 where id in (select id from Example2Null)
   )
select id,
       if(_old._sign=-1, _old.metric, _new.metric) as metric
from Example2Null as _new
join _old using id;

I read more data from the Example2 table than from Example1. Instead of simply checking the row existence by the IN operator, a JOIN with existing rows is used to build a “compensate row.”

For UPSERT, the collapsing algorithm requires inserting two rows. So, I need to create two rows from any row that is found in the local table. It´s an essential part of the suggested approach, which allows me to produce proper rows for inserting with a human-readable code with clear if() statements. That is why I execute arrayJoin while reading old data.

Don’t try to run the code above. It’s just a short explanation of the idea, lacking many needed elements.

UPSERT by Collapsing

Here is a more realistic example with more checks that can be played with:

create table Example3 
(
    id              Int32,   
    metric1         UInt32,
    metric2         UInt32,
    _version        UInt64,
    sign            Int8 default 1
) engine = VersionedCollapsingMergeTree(sign, _version)
ORDER BY id
;
create table Stage engine=Null as Example3 ;

create materialized view Example3Transform to Example3 as
with __new as ( SELECT * FROM Stage order by  _version desc, sign desc limit 1 by id ),
 __old AS ( SELECT *, arrayJoin([-1,1]) AS _sign from
                 ( select * FROM Example3 final
                   PREWHERE id IN (SELECT id FROM __new)
                   where sign = 1
                 )
    )
select id,
    if(__old._sign = -1, __old.metric1, __new.metric1)   AS metric1,
    if(__old._sign = -1, __old.metric2, __new.metric2)   AS metric2,
    if(__old._sign = -1, __old._version, __new._version) AS _version,
    if(__old._sign = -1, -1, 1)                          AS sign
from __new left join __old
using id
where if(__new.sign=-1,
  __old._sign = -1,                -- insert only delete row if it's found in old data
  __new._version > __old._version  -- skip duplicates for updates
);

-- original
insert into Stage values (1,1,1,1,1), (2,2,2,1,1);
select 'step1',* from Example3 ;

-- no duplicates (with the same version) inserted
insert into Stage values (1,3,1,1,1),(2,3,2,1,1);
select 'step2',* from Example3 ;

-- delete a row with id=2. version for delete row does not have any meaning
insert into Stage values (2,2,2,0,-1);
select 'step3',* from Example3 final;

-- replace a row with id=1. row with sign=-1 not needed, but can be in the insert blocks (will be skipped)
insert into Stage values (1,1,1,0,-1),(1,3,3,2,1);
select 'step4',* from Example3 final;

Important additions:

When multiple events with the same ID and different versions are received in the one insert batch, the most recent event is applied.
“delete rows” with sign=-1 and the wrong version are not used for processing. For the Collapsing algorithm, the delete row version should match the version from the row stored in the local table, not the same version from the replacing row. That’s why I decided to skip such a “delete row” received from the incoming stream and build it from the table’s data.
using FINAL and PREWHERE (to speed up FINAL) while reading the destination table. PREWHERE filters are applied before FINAL processing, reducing the number of grouped rows.
filter to skip out-of-order events by checking the version
DELETE event processing (inside last WHERE)

Speed Test

set allow_experimental_analyzer=0;
create table Example3
(
    id              Int32,
    Department      String,
    metric1         UInt32,
    metric2         Float32,
    _version        UInt64,
    sign            Int8 default 1
) engine = VersionedCollapsingMergeTree(sign, _version)
      ORDER BY id
  partition by (id % 20)
settings index_granularity=4096
;

set do_not_merge_across_partitions_select_final=1;

-- make 100M table
INSERT INTO Example3
SELECT
    number AS id,
    ['HR', 'Finance', 'Engineering', 'Sales', 'Marketing'][rand() % 5 + 1] AS Department,
    rand() % 1000 AS metric1,
    (rand() % 10000) / 100.0 AS metric2,
    0 AS _version,
    1 AS sign
FROM numbers(1E8);

create function timeSpent as () ->
    date_diff('millisecond',(select ts from t1),now64(3));

-- measure plain INSERT time for 1M batch
create temporary table t1 (ts DateTime64(3)) as select now64(3);
INSERT INTO Example3
SELECT
    number AS id,
    ['HR', 'Finance', 'Engineering', 'Sales', 'Marketing'][rand() % 5 + 1] AS Department,
    rand() % 1000 AS metric1,
    (rand() % 10000) / 100.0 AS metric2,
    1 AS _version,
    1 AS sign
FROM numbers(1E6);
select '---',timeSpent(),'INSERT';

--create table Stage engine=MergeTree order by id as Example3 ;
create table Stage engine=Null as Example3 ;

create materialized view Example3Transform to Example3 as
with __new as ( SELECT * FROM Stage order by  _version desc,sign desc limit 1 by id ),
     __old AS ( SELECT *, arrayJoin([-1,1]) AS _sign from
         ( select * FROM Example3 final
             PREWHERE id IN (SELECT id FROM __new)
           where sign = 1
             )
                                                                                  )
select id,
       if(__old._sign = -1, __old.Department, __new.Department)   AS
           Department,
       if(__old._sign = -1, __old.metric1, __new.metric1)   AS metric1,
       if(__old._sign = -1, __old.metric2, __new.metric2)   AS metric2,
       if(__old._sign = -1, __old._version, __new._version) AS _version,
       if(__old._sign = -1, -1, 1)                          AS sign
from __new left join __old using id
where if(__new.sign=-1,
         __old._sign = -1,                -- insert only delete row if it's found in old data
         __new._version > __old._version  -- skip duplicates for updates
      );

-- calculate UPSERT time for 1M batch
drop table t1;
create temporary table t1 (ts DateTime64(3)) as select now64(3);
INSERT INTO Stage
SELECT
    (rand() % 1E6)*100 AS id,
    --number AS id,
    ['HR', 'Finance', 'Engineering', 'Sales', 'Marketing'][rand() % 5 + 1] AS Department,
    rand() % 1000 AS metric1,
    (rand() % 10000) / 100.0 AS metric2,
    2 AS _version,
    1 AS sign
FROM numbers(1E6);

select '---',timeSpent(),'UPSERT';

-- FINAL query
drop table t1;
create temporary table t1 (ts DateTime64(3)) as select now64(3);
select Department, count(), sum(metric1) from Example3 FINAL
group by Department order by Department
format Null
;
select '---',timeSpent(),'FINAL';

-- GROUP BY query
drop table t1;
create temporary table t1 (ts DateTime64(3)) as select now64(3);
select Department, sum(sign), sum(sign*metric1) from Example3
group by Department order by Department
format Null
;
select '---',timeSpent(),'GROUP BY';

optimize table Example3 final;
-- FINAL query
drop table t1;
create temporary table t1 (ts DateTime64(3)) as select now64(3);
select Department, count(), sum(metric1) from Example3 FINAL
group by Department order by Department
format Null
;
select '---',timeSpent(),'FINAL OPTIMIZED';

-- GROUP BY query
drop table t1;
create temporary table t1 (ts DateTime64(3)) as select now64(3);
select Department, sum(sign), sum(sign*metric1) from Example3
group by Department order by Department
format Null
;
select '---',timeSpent(),'GROUP BY OPTIMIZED';

You can use fiddle or clickhouse-local to run such a test:

cat test.sql | clickhouse-local -nm

Results (Mac A2 Pro), milliseconds:

---	252	INSERT
---	1710	UPSERT
---	763	FINAL
---	311	GROUP BY
---	314	FINAL OPTIMIZED
---	295	GROUP BY OPTIMIZED

UPSERT is six times slower than direct INSERT because it requires looking up the destination table. That is the price. It is better to use idempotent inserts with an exactly-once delivery guarantee. However, it’s not always possible.

The FINAL speed is quite good, especially if we split the table by 20 partitions, use do_not_merge_across_partitions_select_final setting, and keep most of the table’s partitions optimized (1 part per partition). But we can do it better.

Adding projections

Let’s add an aggregating projection, and also add a more useful updated_at timestamp instead of an abstract _version and replace String for Department dimension by LowCardinality(String). Let’s look at the difference in time execution.

https://fiddle.clickhouse.com/3140d341-ccc5-4f57-8fbf-55dbf4883a21

set allow_experimental_analyzer=0;
create table Example4
(
    id              Int32,
    Department      LowCardinality(String),
    metric1         Int32,
    metric2         Float32,
    _version        DateTime64(3) default now64(3),
    sign            Int8 default 1
) engine = VersionedCollapsingMergeTree(sign, _version)
      ORDER BY id
      partition by (id % 20)
      settings index_granularity=4096
;

set do_not_merge_across_partitions_select_final=1;

-- make 100M table
INSERT INTO Example4
SELECT
    number AS id,
    ['HR', 'Finance', 'Engineering', 'Sales', 'Marketing'][rand() % 5 + 1] AS Department,
    rand() % 1000 AS metric1,
    (rand() % 10000) / 100.0 AS metric2,
    0 AS _version,
    1 AS sign
FROM numbers(1E8);

create temporary table timeMark (ts DateTime64(3));
create function timeSpent as () ->
    date_diff('millisecond',(select max(ts) from timeMark),now64(3));

-- measure plain INSERT time for 1M batch
insert into timeMark select now64(3);
INSERT INTO Example4(id,Department,metric1,metric2)
SELECT
    number AS id,
    ['HR', 'Finance', 'Engineering', 'Sales', 'Marketing'][rand() % 5 + 1] AS Department,
    rand() % 1000 AS metric1,
    (rand() % 10000) / 100.0 AS metric2
FROM numbers(1E6);
select '---',timeSpent(),'INSERT';

--create table Stage engine=MergeTree order by id as Example4 ;
create table Stage engine=Null as Example4 ;

create materialized view Example4Transform to Example4 as
with __new as ( SELECT * FROM Stage order by  _version desc,sign desc limit 1 by id ),
     __old AS ( SELECT *, arrayJoin([-1,1]) AS _sign from
         ( select * FROM Example4 final
             PREWHERE id IN (SELECT id FROM __new)
           where sign = 1
             )
                                                                                    )
select id,
       if(__old._sign = -1, __old.Department, __new.Department)   AS
           Department,
       if(__old._sign = -1, __old.metric1, __new.metric1)   AS metric1,
       if(__old._sign = -1, __old.metric2, __new.metric2)   AS metric2,
       if(__old._sign = -1, __old._version, __new._version) AS _version,
       if(__old._sign = -1, -1, 1)                          AS sign
from __new left join __old using id
where if(__new.sign=-1,
         __old._sign = -1,                -- insert only delete row if it's found in old data
         __new._version > __old._version  -- skip duplicates for updates
      );

-- calculate UPSERT time for 1M batch
insert into timeMark select now64(3);
INSERT INTO Stage(id,Department,metric1,metric2)
SELECT
    (rand() % 1E6)*100 AS id,
    --number AS id,
    ['HR', 'Finance', 'Engineering', 'Sales', 'Marketing'][rand() % 5 + 1] AS Department,
    rand() % 1000 AS metric1,
    (rand() % 10000) / 100.0 AS metric2
FROM numbers(1E6);

select '---',timeSpent(),'UPSERT';

-- FINAL query
insert into timeMark select now64(3);
select Department, count(), sum(metric1) from Example4 FINAL
group by Department order by Department
    format Null
;
select '---',timeSpent(),'FINAL';

-- GROUP BY query
insert into timeMark select now64(3);
select Department, sum(sign), sum(sign*metric1) from Example4
group by Department order by Department
    format Null
;
select '---',timeSpent(),'GROUP BY';

--select '--parts1',partition, count() from system.parts where active and table='Example4'  group by partition;

insert into timeMark select now64(3);
optimize table Example4 final;
select '---',timeSpent(),'OPTIMIZE';

-- FINAL OPTIMIZED
insert into timeMark select now64(3);
select Department, count(), sum(metric1) from Example4 FINAL
group by Department order by Department
    format Null
;
select '---',timeSpent(),'FINAL OPTIMIZED';

-- GROUP BY OPTIMIZED
insert into timeMark select now64(3);
select Department, sum(sign), sum(sign*metric1) from Example4
group by Department order by Department
    format Null
;
select '---',timeSpent(),'GROUP BY OPTIMIZED';

--  UPSERT a little data to create more parts
INSERT INTO Stage(id,Department,metric1,metric2)
SELECT
    number AS id,
    ['HR', 'Finance', 'Engineering', 'Sales', 'Marketing'][rand() % 5 + 1] AS Department,
    rand() % 1000 AS metric1,
    (rand() % 10000) / 100.0 AS metric2
FROM numbers(1000);

--select '--parts2',partition, count() from system.parts where active and table='Example4' group by partition;

-- GROUP BY SEMI-OPTIMIZED
insert into timeMark select now64(3);
select Department, sum(sign), sum(sign*metric1) from Example4
group by Department order by Department
    format Null
;
select '---',timeSpent(),'GROUP BY SEMI-OPTIMIZED';

--alter table Example4 add column Smetric1 Int32 alias metric1*sign;
alter table Example4 add projection byDep  (select Department, sum(sign), sum(sign*metric1) group by Department);

-- Materialize Projection
insert into timeMark select now64(3);
alter table Example4 materialize projection byDep settings mutations_sync=1;
select '---',timeSpent(),'Materialize Projection';

-- GROUP BY query Projected
insert into timeMark select now64(3);
select Department, sum(sign), sum(sign*metric1) from Example4
group by Department order by Department
    settings force_optimize_projection=1
    format Null
;
select '---',timeSpent(),'GROUP BY Projected';

Results (Mac A2 Pro), milliseconds:

---	175	INSERT
---	1613	UPSERT
---	329	FINAL
---	102	GROUP BY
---	10498	OPTIMIZE
---	103	FINAL OPTIMIZED
---	90	GROUP BY OPTIMIZED
---	94	GROUP BY SEMI-OPTIMIZED
---	919	Materialize Projection
---	5	GROUP BY Projected

Some thoughts:

INSERT, UPSERT, and SELECT benefit from switching the Department column to LowCardinality. Fewer reads - faster queries.
OPTIMIZE is VERY expensive
FINAL is quite fast (especially for the OPTIMIZED table). You don’t need to OPTIMIZE the table till the 1 part for partition to remove FINAL from the query. Not having too many parts already gives you a performance boost.
GROUP BY for that task is still faster
projections building requires resources. Inserts to the table with Projections will be longer. Tune the insert timeouts.
Query over projection is very fast (as it should be). However, it’s not always possible to aggregate data in such a simple way.

DELETEs inaccuracy

The typical CDC event for DWH systems besides INSERT is UPSERT—a new row replaces the old one (with suitable aggregate corrections). But DELETE events are also supported (ones with column sign=-1). The Materialized View described above will correctly process the DELETE event by inserting only 1 row with sign=-1 if a row with a particular ID already exists in the table. In such cases, VersionedCollapsingMergeTree will wipe both rows (with sign=1 & -1) during merge or final operations.

However, it can lead to incorrect duplicate processing in some rare situations. Here is the scenario:

two events happen in the source database (insert and delete) for the very same ID
only insert event create a duplicate (delete event does not duplicate)
all 3 events (delete and two inserts) were processed in separate batches
ClickHouse executes the merge operation very quickly after the first INSERT and DELETE events are received, effectively removing the row with that ID from the table
the second (duplicated) insert is saved to the table because we lost the information about the first insertion

The probability of such a sequence is relatively low, especially in normal operations when the amount of DELETEs is not too significant. Processing events in big batches will reduce the probability even more.

Combine old and new

The presented technique can be used to reimplement the AggregatingMergeTree algorithm to combine old and new row data using VersionedCollapsingMergeTree.

https://fiddle.clickhouse.com/e1d7e04c-f1d6-4a25-9aac-1fe2b543c693

create table Example5 
(
    id              Int32,   
    metric1         UInt32,
    metric2         Nullable(UInt32),
    updated_at      DateTime64(3) default now64(3),
    sign            Int8 default 1
) engine = VersionedCollapsingMergeTree(sign, updated_at)
ORDER BY id
;
create table Stage engine=Null as Example5 ;
  
create materialized view Example5Transform to Example5 as
with __new as ( SELECT * FROM Stage order by sign desc, updated_at desc limit 1 by id ),
     __old AS ( SELECT *, arrayJoin([-1,1]) AS _sign from
                 ( select * FROM Example5 final
                   PREWHERE id IN (SELECT id FROM __new)
                   where sign = 1
                 )
    )
select id,
    if(__old._sign = -1, __old.metric1, greatest(__new.metric1, __old.metric1)) AS metric1,    
    if(__old._sign = -1, __old.metric2, ifNull(__new.metric2, __old.metric2)) AS metric2,
    if(__old._sign = -1, __old.updated_at, __new.updated_at) AS updated_at,
    if(__old._sign = -1, -1, 1)                          AS sign
from __new left join __old using id
where if(__new.sign=-1,
  __old._sign = -1,                -- insert only delete row if it's found in old data
  __new.updated_at > __old.updated_at  -- skip duplicates for updates
);

-- original
insert into Stage(id) values (1), (2);
select 'step0',* from Example5 ;

insert into Stage(id,metric1) values (1,1), (2,2);
select 'step1',* from Example5 final;

insert into Stage(id,metric2) values (1,11), (2,12);
select 'step2',* from Example5 final ;

Complex Primary Key

I used a simple, compact column with Int64 type for the primary key in previous examples. It’s better to go this route with monotonically growing IDs like autoincrement ID or SnowFlakeId (based on timestamp). However, in some cases, a more complex primary key is needed. For instance, when storing data for multiple tenants (Customers, partners, etc.) in the same table. This is not a problem for the suggested technique - use all the necessary columns in all filters and JOIN operations as Tuple.

create table Example6 
(
    id              Int64,  
    tenant_id       Int32, 
    metric1         UInt32,
    _version        UInt64,
    sign            Int8 default 1
) engine = VersionedCollapsingMergeTree(sign, _version)
ORDER BY (tenant_id,id)
;
create table Stage engine=Null as Example6 ;

create materialized view Example6Transform to Example6 as
with __new as ( SELECT * FROM Stage order by sign desc, _version desc limit 1 by tenant_id,id ),
     __old AS ( SELECT *, arrayJoin([-1,1]) AS _sign from
                 ( select * FROM Example6 final
                   PREWHERE (tenant_id,id) IN (SELECT tenant_id,id FROM __new)
                   where sign = 1
                 )
    )
select id,tenant_id,
    if(__old._sign = -1, __old.metric1, __new.metric1)   AS metric1,
    if(__old._sign = -1, __old._version, __new._version) AS _version,
    if(__old._sign = -1, -1, 1)                          AS sign
from __new left join __old
using (tenant_id,id)
where if(__new.sign=-1,
  __old._sign = -1,                -- insert only delete row if it's found in old data
  __new._version > __old._version  -- skip duplicates for updates
);

Sharding

The suggested approach works well when inserting data in a single thread on a single replica. This is suitable for up to 1M events per second. However, for higher traffic, it’s necessary to use multiple ingesting threads across several replicas. In such cases, collisions caused by parts manipulation and replication delay can disrupt the entire Collapsing algorithm.

But inserting different shards with a sharding key derived from ID works fine. Every shard will operate with its own non-intersecting set of IDs, and don’t interfere with each other.

The same approach can be implemented when inserting several threads into the same replica node. For big installations with high traffic and many shards and replicas, the ingesting app can split the data stream into a considerably large number of “virtual shards” (or partitions in Kafka terminology) and then map the “virtual shards” to the threads doing inserts to “physical shards.”

The incoming stream could be split into several ones by using an expression like cityHash64(id) % 50 = 0 as a sharding key. The ingesting app should calculate the shard number before sending data to internal buffers that will be flushed to INSERTs.

-- emulate insert into distributed table
INSERT INTO function remote('localhos{t,t,t}',default,Stage,id)
SELECT
    (rand() % 1E6)*100 AS id,
    --number AS id,
    ['HR', 'Finance', 'Engineering', 'Sales', 'Marketing'][rand() % 5 + 1] AS Department,
    rand() % 1000 AS metric1,
    (rand() % 10000) / 100.0 AS metric2,
    2 AS _version,
    1 AS sign
FROM numbers(1000)
settings prefer_localhost_replica=0;

2 - Queries & Syntax

Learn about ClickHouse® queries & syntax, including Joins & Window Functions.

2.1 - GROUP BY

Learn about the GROUP BY clause in ClickHouse®

Internal implementation

Code

ClickHouse® uses non-blocking? hash tables, so each thread has at least one hash table.

It makes easier to not care about sync between multiple threads, but has such disadvantages as:

Bigger memory usage.
Needs to merge those per-thread hash tables afterwards.

Because second step can be a bottleneck in case of a really big GROUP BY with a lot of distinct keys, another solution has been made.

Two-Level

https://youtu.be/SrucFOs8Y6c?t=2132

┌─name───────────────────────────────┬─value────┬─changed─┬─description────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─min──┬─max──┬─readonly─┬─type───┐
│ group_by_two_level_threshold       │ 100000   │       0 │ From what number of keys, a two-level aggregation starts. 0 - the threshold is not set.                                                                                                                    │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ UInt64 │
│ group_by_two_level_threshold_bytes │ 50000000 │       0 │ From what size of the aggregation state in bytes, a two-level aggregation begins to be used. 0 - the threshold is not set. Two-level aggregation is used when at least one of the thresholds is triggered. │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ UInt64 │
└────────────────────────────────────┴──────────┴─────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────┴──────┴──────────┴────────┘

In order to parallelize merging of hash tables, ie execute such merge via multiple threads, ClickHouse use two-level approach:

On the first step ClickHouse creates 256 buckets for each thread. (determined by one byte of hash function) On the second step ClickHouse can merge those 256 buckets independently by multiple threads.

https://github.com/ClickHouse/ClickHouse/blob/1ea637d996715d2a047f8cd209b478e946bdbfb0/src/Common/HashTable/TwoLevelHashTable.h#L6

GROUP BY in external memory

It utilizes a two-level group by and dumps those buckets on disk. And at the last stage ClickHouse will read those buckets from disk one by one and merge them. So you should have enough RAM to hold one bucket (1/256 of whole GROUP BY size).

https://clickhouse.com/docs/en/sql-reference/statements/select/group-by/#select-group-by-in-external-memory

optimize_aggregation_in_order GROUP BY

Usually it works slower than regular GROUP BY, because ClickHouse needs to read and process data in specific ORDER, which makes it much more complicated to parallelize reading and aggregating.

But it use much less memory, because ClickHouse can stream resultset and there is no need to keep it in memory.

Last item cache

ClickHouse saves value of previous hash calculation, just in case next value will be the same.

https://github.com/ClickHouse/ClickHouse/pull/5417 https://github.com/ClickHouse/ClickHouse/blob/808d9afd0f8110faba5ae027051bf0a64e506da3/src/Common/ColumnsHashingImpl.h#L40

StringHashMap

Actually uses 5 different hash tables

For empty strings
For strings < 8 bytes
For strings < 16 bytes
For strings < 24 bytes
For strings > 24 bytes

SELECT count()
FROM
(
    SELECT materialize('1234567890123456') AS key           -- length(key) = 16
    FROM numbers(1000000000)
)
GROUP BY key

Aggregator: Aggregation method: key_string

Elapsed: 8.888 sec. Processed 1.00 billion rows, 8.00 GB (112.51 million rows/s., 900.11 MB/s.)

SELECT count()
FROM
(
    SELECT materialize('12345678901234567') AS key          -- length(key) = 17
    FROM numbers(1000000000)
)
GROUP BY key

Aggregator: Aggregation method: key_string

Elapsed: 9.089 sec. Processed 1.00 billion rows, 8.00 GB (110.03 million rows/s., 880.22 MB/s.)

SELECT count()
FROM
(
    SELECT materialize('123456789012345678901234') AS key   -- length(key) = 24
    FROM numbers(1000000000)
)
GROUP BY key

Aggregator: Aggregation method: key_string

Elapsed: 9.134 sec. Processed 1.00 billion rows, 8.00 GB (109.49 million rows/s., 875.94 MB/s.)

SELECT count()
FROM
(
    SELECT materialize('1234567890123456789012345') AS key  -- length(key) = 25
    FROM numbers(1000000000)
)
GROUP BY key

Aggregator: Aggregation method: key_string

Elapsed: 12.566 sec. Processed 1.00 billion rows, 8.00 GB (79.58 million rows/s., 636.67 MB/s.)

length

16 8.89 17 9.09 24 9.13 25 12.57

For what GROUP BY statement use memory

Hash tables

It will grow with:

Amount of unique combinations of keys participated in GROUP BY

Size of keys participated in GROUP BY

States of aggregation functions:

Be careful with function, which state can use unrestricted amount of memory and grow indefinitely:

groupArray (groupArray(1000)())
uniqExact (uniq,uniqCombined)
quantileExact (medianExact) (quantile,quantileTDigest)
windowFunnel
groupBitmap
sequenceCount (sequenceMatch)
*Map

Why my GROUP BY eat all the RAM

run your query with set send_logs_level='trace'
Remove all aggregation functions from the query, try to understand how many memory simple GROUP BY will take.
One by one remove aggregation functions from query in order to understand which one is taking most of memory

2.1.1 - GROUP BY tricks

Tricks for GROUP BY memory usage optimization

Tricks

Testing dataset

CREATE TABLE sessions
(
    `app` LowCardinality(String),
    `user_id` String,
    `created_at` DateTime,
    `platform` LowCardinality(String),
    `clicks` UInt32,
    `session_id` UUID
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (app, user_id, session_id, created_at)

INSERT INTO sessions WITH
    CAST(number % 4, 'Enum8(\'Orange\' = 0, \'Melon\' = 1, \'Red\' = 2, \'Blue\' = 3)') AS app,
    concat('UID: ', leftPad(toString(number % 20000000), 8, '0')) AS user_id,
    toDateTime('2021-10-01 10:11:12') + (number / 300) AS created_at,
    CAST((number + 14) % 3, 'Enum8(\'Bat\' = 0, \'Mice\' = 1, \'Rat\' = 2)') AS platform,
    number % 17 AS clicks,
    generateUUIDv4() AS session_id
SELECT
    app,
    user_id,
    created_at,
    platform,
    clicks,
    session_id
FROM numbers_mt(1000000000)

0 rows in set. Elapsed: 46.078 sec. Processed 1.00 billion rows, 8.00 GB (21.70 million rows/s., 173.62 MB/s.)

┌─database─┬─table────┬─column─────┬─type───────────────────┬───────rows─┬─compressed_bytes─┬─compressed─┬─uncompressed─┬──────────────ratio─┬─codec─┐
│ default  │ sessions │ session_id │ UUID                   │ 1000000000 │      16065918103 │ 14.96 GiB  │ 14.90 GiB    │ 0.9958970223439835 │       │
│ default  │ sessions │ user_id    │ String                 │ 1000000000 │       3056977462 │ 2.85 GiB   │ 13.04 GiB    │   4.57968701896828 │       │
│ default  │ sessions │ clicks     │ UInt32                 │ 1000000000 │       1859359032 │ 1.73 GiB   │ 3.73 GiB     │  2.151278979023993 │       │
│ default  │ sessions │ created_at │ DateTime               │ 1000000000 │       1332089630 │ 1.24 GiB   │ 3.73 GiB     │ 3.0028009451586226 │       │
│ default  │ sessions │ platform   │ LowCardinality(String) │ 1000000000 │        329702248 │ 314.43 MiB │ 956.63 MiB   │  3.042446801879252 │       │
│ default  │ sessions │ app        │ LowCardinality(String) │ 1000000000 │          4825544 │ 4.60 MiB   │ 956.63 MiB   │ 207.87333386660654 │       │
└──────────┴──────────┴────────────┴────────────────────────┴────────────┴──────────────────┴────────────┴──────────────┴────────────────────┴───────┘

All queries and datasets are unique, so in different situations different hacks could work better or worse.

PreFilter values before GROUP BY

SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE created_at > '2021-11-01 00:00:00'
GROUP BY user_id
HAVING (argMax(clicks, created_at) = 16) AND (argMax(platform, created_at) = 'Rat')
FORMAT `Null`


<Debug> MemoryTracker: Peak memory usage (for query): 18.36 GiB.

SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE user_id IN (
    SELECT user_id
    FROM sessions
    WHERE (platform = 'Rat') AND (clicks = 16) AND (created_at > '2021-11-01 00:00:00') -- So we will select user_id which could potentially match our HAVING clause in OUTER query.
) AND (created_at > '2021-11-01 00:00:00')
GROUP BY user_id
HAVING (argMax(clicks, created_at) = 16) AND (argMax(platform, created_at) = 'Rat')
FORMAT `Null`

<Debug> MemoryTracker: Peak memory usage (for query): 4.43 GiB.

Use Fixed-width data types instead of String

For example, you have 2 strings which has values in special form like this

‘ABX 1412312312313’

You can just remove the first 4 characters and convert the rest to UInt64

toUInt64(substr(‘ABX 1412312312313’,5))

And you packed 17 bytes in 8, more than 2 times the improvement of size!

SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    user_id,
    platform
FORMAT `Null`

Aggregator: Aggregation method: serialized

<Debug> MemoryTracker: Peak memory usage (for query): 28.19 GiB.

Elapsed: 7.375 sec. Processed 1.00 billion rows, 27.00 GB (135.60 million rows/s., 3.66 GB/s.)

WITH
    CAST(user_id, 'FixedString(14)') AS user_fx,
    CAST(platform, 'FixedString(4)') AS platform_fx
SELECT
    user_fx,
    sum(clicks)
FROM sessions
GROUP BY
    user_fx,
    platform_fx
FORMAT `Null`

Aggregator: Aggregation method: keys256

MemoryTracker: Peak memory usage (for query): 22.24 GiB.

Elapsed: 6.637 sec. Processed 1.00 billion rows, 27.00 GB (150.67 million rows/s., 4.07 GB/s.)

WITH
    CAST(user_id, 'FixedString(14)') AS user_fx,
    CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 0)') AS platform_enum
SELECT
    user_fx,
    sum(clicks)
FROM sessions
GROUP BY
    user_fx,
    platform_enum
FORMAT `Null`

Aggregator: Aggregation method: keys128

MemoryTracker: Peak memory usage (for query): 14.14 GiB.

Elapsed: 5.335 sec. Processed 1.00 billion rows, 27.00 GB (187.43 million rows/s., 5.06 GB/s.)

WITH
    toUInt32OrZero(trim( LEADING '0' FROM substr(user_id,6))) AS user_int,
    CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 0)') AS platform_enum
SELECT
    user_int,
    sum(clicks)
FROM sessions
GROUP BY
    user_int,
    platform_enum
FORMAT `Null`

Aggregator: Aggregation method: keys64

MemoryTracker: Peak memory usage (for query): 10.14 GiB.

Elapsed: 8.549 sec. Processed 1.00 billion rows, 27.00 GB (116.97 million rows/s., 3.16 GB/s.)


WITH
    toUInt32('1' || substr(user_id,6)) AS user_int,
    CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 0)') AS platform_enum
SELECT
    user_int,
    sum(clicks)
FROM sessions
GROUP BY
    user_int,
    platform_enum
FORMAT `Null`

Aggregator: Aggregation method: keys64

Peak memory usage (for query): 10.14 GiB.

Elapsed: 6.247 sec. Processed 1.00 billion rows, 27.00 GB (160.09 million rows/s., 4.32 GB/s.)

It can be especially useful when you tries to do GROUP BY lc_column_1, lc_column_2 and ClickHouse® falls back to serialized algorithm.

Two LowCardinality Columns in GROUP BY

SELECT
    app,
    sum(clicks)
FROM sessions
GROUP BY app
FORMAT `Null`

Aggregator: Aggregation method: low_cardinality_key_string

MemoryTracker: Peak memory usage (for query): 43.81 MiB.

Elapsed: 0.545 sec. Processed 1.00 billion rows, 5.00 GB (1.83 billion rows/s., 9.17 GB/s.)

SELECT
    app,
    platform,
    sum(clicks)
FROM sessions
GROUP BY
    app,
    platform
FORMAT `Null`

Aggregator: Aggregation method: serialized -- Slowest method!

MemoryTracker: Peak memory usage (for query): 222.86 MiB.

Elapsed: 2.923 sec. Processed 1.00 billion rows, 6.00 GB (342.11 million rows/s., 2.05 GB/s.)

SELECT
    CAST(app, 'FixedString(6)') AS app_fx,
    CAST(platform, 'FixedString(4)') AS platform_fx,
    sum(clicks)
FROM sessions
GROUP BY
    app_fx,
    platform_fx
FORMAT `Null`

Aggregator: Aggregation method: keys128

MemoryTracker: Peak memory usage (for query): 160.23 MiB.

Elapsed: 1.632 sec. Processed 1.00 billion rows, 6.00 GB (612.63 million rows/s., 3.68 GB/s.)

Split your query in multiple smaller queries and execute them one BY one

SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    user_id,
    platform
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 28.19 GiB.

Elapsed: 7.375 sec. Processed 1.00 billion rows, 27.00 GB (135.60 million rows/s., 3.66 GB/s.)


SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE (cityHash64(user_id) % 4) = 0
GROUP BY
    user_id,
    platform
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 8.16 GiB.

Elapsed: 2.910 sec. Processed 1.00 billion rows, 27.00 GB (343.64 million rows/s., 9.28 GB/s.)

Shard your data by one of common high cardinal GROUP BY key

So on each shard you will have 1/N of all unique combination and this will result in smaller hash tables.

Let’s create 2 distributed tables with different distribution: rand() and by user_id

CREATE TABLE sessions_distributed AS sessions
ENGINE = Distributed('distr-groupby', default, sessions, rand());

INSERT INTO sessions_distributed WITH
    CAST(number % 4, 'Enum8(\'Orange\' = 0, \'Melon\' = 1, \'Red\' = 2, \'Blue\' = 3)') AS app,
    concat('UID: ', leftPad(toString(number % 20000000), 8, '0')) AS user_id,
    toDateTime('2021-10-01 10:11:12') + (number / 300) AS created_at,
    CAST((number + 14) % 3, 'Enum8(\'Bat\' = 0, \'Mice\' = 1, \'Rat\' = 2)') AS platform,
    number % 17 AS clicks,
    generateUUIDv4() AS session_id
SELECT
    app,
    user_id,
    created_at,
    platform,
    clicks,
    session_id
FROM numbers_mt(1000000000);

CREATE TABLE sessions_2 ON CLUSTER 'distr-groupby'
(
    `app` LowCardinality(String),
    `user_id` String,
    `created_at` DateTime,
    `platform` LowCardinality(String),
    `clicks` UInt32,
    `session_id` UUID
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (app, user_id, session_id, created_at);

CREATE TABLE sessions_distributed_2 AS sessions
ENGINE = Distributed('distr-groupby', default, sessions_2, cityHash64(user_id));

INSERT INTO sessions_distributed_2 WITH
    CAST(number % 4, 'Enum8(\'Orange\' = 0, \'Melon\' = 1, \'Red\' = 2, \'Blue\' = 3)') AS app,
    concat('UID: ', leftPad(toString(number % 20000000), 8, '0')) AS user_id,
    toDateTime('2021-10-01 10:11:12') + (number / 300) AS created_at,
    CAST((number + 14) % 3, 'Enum8(\'Bat\' = 0, \'Mice\' = 1, \'Rat\' = 2)') AS platform,
    number % 17 AS clicks,
    generateUUIDv4() AS session_id
SELECT
    app,
    user_id,
    created_at,
    platform,
    clicks,
    session_id
FROM numbers_mt(1000000000);

SELECT
    app,
    platform,
    sum(clicks)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS clicks
    FROM sessions_distributed
    GROUP BY user_id
)
GROUP BY
    app,
    platform;

[chi-distr-groupby-distr-groupby-0-0-0] MemoryTracker: Current memory usage (for query): 12.02 GiB.
[chi-distr-groupby-distr-groupby-1-0-0] MemoryTracker: Current memory usage (for query): 12.05 GiB.
[chi-distr-groupby-distr-groupby-2-0-0] MemoryTracker: Current memory usage (for query): 12.05 GiB.

MemoryTracker: Peak memory usage (for query): 12.20 GiB.

12 rows in set. Elapsed: 28.345 sec. Processed 1.00 billion rows, 32.00 GB (35.28 million rows/s., 1.13 GB/s.)

SELECT
    app,
    platform,
    sum(clicks)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS clicks
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform;

[chi-distr-groupby-distr-groupby-0-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-1-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-2-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.

MemoryTracker: Peak memory usage (for query): 5.61 GiB.

12 rows in set. Elapsed: 11.952 sec. Processed 1.00 billion rows, 32.00 GB (83.66 million rows/s., 2.68 GB/s.)

SELECT
    app,
    platform,
    sum(clicks)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS clicks
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS optimize_distributed_group_by_sharding_key = 1

[chi-distr-groupby-distr-groupby-0-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-1-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-2-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
MemoryTracker: Peak memory usage (for query): 5.61 GiB.

12 rows in set. Elapsed: 11.916 sec. Processed 1.00 billion rows, 32.00 GB (83.92 million rows/s., 2.69 GB/s.)


SELECT
    app,
    platform,
    sum(clicks)
FROM cluster('distr-groupby', view(
    SELECT
        app,
        platform,
        sum(clicks) as clicks
    FROM
    (
        SELECT
            argMax(app, created_at) AS app,
            argMax(platform, created_at) AS platform,
            user_id,
            argMax(clicks, created_at) AS clicks
        FROM sessions_2
        GROUP BY user_id
    )
    GROUP BY
        app,
        platform
))
GROUP BY
    app,
    platform;

[chi-distr-groupby-distr-groupby-0-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-1-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-2-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.

MemoryTracker: Peak memory usage (for query): 5.55 GiB.

12 rows in set. Elapsed: 9.491 sec. Processed 1.00 billion rows, 32.00 GB (105.36 million rows/s., 3.37 GB/s.)

Query with bigger state:


SELECT
    app,
    platform,
    sum(last_click) as sum,
    max(max_clicks) as max,
    min(min_clicks) as min,
    max(max_time) as max_time,
    min(min_time) as min_time
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS last_click,
        max(clicks) AS max_clicks,
        min(clicks) AS min_clicks,
        max(created_at) AS max_time,
        min(created_at) AS min_time
    FROM sessions_distributed
    GROUP BY user_id
)
GROUP BY
    app,
    platform;

MemoryTracker: Peak memory usage (for query): 19.95 GiB.
12 rows in set. Elapsed: 34.339 sec. Processed 1.00 billion rows, 32.00 GB (29.12 million rows/s., 932.03 MB/s.)

SELECT
    app,
    platform,
    sum(last_click) as sum,
    max(max_clicks) as max,
    min(min_clicks) as min,
    max(max_time) as max_time,
    min(min_time) as min_time
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS last_click,
        max(clicks) AS max_clicks,
        min(clicks) AS min_clicks,
        max(created_at) AS max_time,
        min(created_at) AS min_time
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform;


MemoryTracker: Peak memory usage (for query): 10.09 GiB.

12 rows in set. Elapsed: 13.220 sec. Processed 1.00 billion rows, 32.00 GB (75.64 million rows/s., 2.42 GB/s.)

SELECT
    app,
    platform,
    sum(last_click) AS sum,
    max(max_clicks) AS max,
    min(min_clicks) AS min,
    max(max_time) AS max_time,
    min(min_time) AS min_time
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS last_click,
        max(clicks) AS max_clicks,
        min(clicks) AS min_clicks,
        max(created_at) AS max_time,
        min(created_at) AS min_time
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS optimize_distributed_group_by_sharding_key = 1;

MemoryTracker: Peak memory usage (for query): 10.09 GiB.

12 rows in set. Elapsed: 13.361 sec. Processed 1.00 billion rows, 32.00 GB (74.85 million rows/s., 2.40 GB/s.)

SELECT
    app,
    platform,
    sum(last_click) AS sum,
    max(max_clicks) AS max,
    min(min_clicks) AS min,
    max(max_time) AS max_time,
    min(min_time) AS min_time
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS last_click,
        max(clicks) AS max_clicks,
        min(clicks) AS min_clicks,
        max(created_at) AS max_time,
        min(created_at) AS min_time
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS distributed_group_by_no_merge=2;

MemoryTracker: Peak memory usage (for query): 10.02 GiB.

12 rows in set. Elapsed: 9.789 sec. Processed 1.00 billion rows, 32.00 GB (102.15 million rows/s., 3.27 GB/s.)

SELECT
    app,
    platform,
    sum(sum),
    max(max),
    min(min),
    max(max_time) AS max_time,
    min(min_time) AS min_time
FROM cluster('distr-groupby', view(
    SELECT
        app,
        platform,
        sum(last_click) AS sum,
        max(max_clicks) AS max,
        min(min_clicks) AS min,
        max(max_time) AS max_time,
        min(min_time) AS min_time
    FROM
    (
        SELECT
            argMax(app, created_at) AS app,
            argMax(platform, created_at) AS platform,
            user_id,
            argMax(clicks, created_at) AS last_click,
            max(clicks) AS max_clicks,
            min(clicks) AS min_clicks,
            max(created_at) AS max_time,
            min(created_at) AS min_time
        FROM sessions_2
        GROUP BY user_id
    )
    GROUP BY
        app,
        platform
))
GROUP BY
    app,
    platform;

MemoryTracker: Peak memory usage (for query): 10.09 GiB.

12 rows in set. Elapsed: 9.525 sec. Processed 1.00 billion rows, 32.00 GB (104.98 million rows/s., 3.36 GB/s.)


SELECT
    app,
    platform,
    sum(sessions)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        uniq(session_id) as sessions
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform

MemoryTracker: Peak memory usage (for query): 14.57 GiB.
12 rows in set. Elapsed: 37.730 sec. Processed 1.00 billion rows, 44.01 GB (26.50 million rows/s., 1.17 GB/s.)


SELECT
    app,
    platform,
    sum(sessions)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        uniq(session_id) as sessions
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS optimize_distributed_group_by_sharding_key = 1;

MemoryTracker: Peak memory usage (for query): 14.56 GiB.

12 rows in set. Elapsed: 37.792 sec. Processed 1.00 billion rows, 44.01 GB (26.46 million rows/s., 1.16 GB/s.)

SELECT
    app,
    platform,
    sum(sessions)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        uniq(session_id) as sessions
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS distributed_group_by_no_merge = 2;

MemoryTracker: Peak memory usage (for query): 14.54 GiB.
12 rows in set. Elapsed: 17.762 sec. Processed 1.00 billion rows, 44.01 GB (56.30 million rows/s., 2.48 GB/s.)

SELECT
    app,
    platform,
    sum(sessions)
FROM cluster('distr-groupby', view(
SELECT
    app,
    platform,
    sum(sessions) as sessions
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        uniq(session_id) as sessions
    FROM sessions_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform))
GROUP BY
    app,
    platform   

MemoryTracker: Peak memory usage (for query): 14.55 GiB.

12 rows in set. Elapsed: 17.574 sec. Processed 1.00 billion rows, 44.01 GB (56.90 million rows/s., 2.50 GB/s.)

Reduce number of threads

Because each thread uses an independent hash table, if you lower thread amount it will reduce number of hash tables as well and lower memory usage at the cost of slower query execution.


SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    user_id,
    platform
FORMAT `Null`


MemoryTracker: Peak memory usage (for query): 28.19 GiB.

Elapsed: 7.375 sec. Processed 1.00 billion rows, 27.00 GB (135.60 million rows/s., 3.66 GB/s.)

SET max_threads = 2;

SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    user_id,
    platform
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 13.26 GiB.

Elapsed: 62.014 sec. Processed 1.00 billion rows, 27.00 GB (16.13 million rows/s., 435.41 MB/s.)

UNION ALL


SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    app,
    user_id
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 24.19 GiB.

Elapsed: 5.043 sec. Processed 1.00 billion rows, 27.00 GB (198.29 million rows/s., 5.35 GB/s.)


SELECT
    user_id,
    sum(clicks)
FROM sessions WHERE app = 'Orange'
GROUP BY
    user_id
UNION ALL
SELECT
    user_id,
    sum(clicks)
FROM sessions WHERE app = 'Red'
GROUP BY
    user_id
UNION ALL
SELECT
    user_id,
    sum(clicks)
FROM sessions WHERE app = 'Melon'
GROUP BY
    user_id
UNION ALL
SELECT
    user_id,
    sum(clicks)
FROM sessions WHERE app = 'Blue'
GROUP BY
    user_id
FORMAT Null

MemoryTracker: Peak memory usage (for query): 7.99 GiB.

Elapsed: 2.852 sec. Processed 1.00 billion rows, 27.01 GB (350.74 million rows/s., 9.47 GB/s.)

aggregation_in_order

SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE app = 'Orange'
GROUP BY user_id
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 969.33 MiB.

Elapsed: 2.494 sec. Processed 250.09 million rows, 6.75 GB (100.27 million rows/s., 2.71 GB/s.)



SET optimize_aggregation_in_order = 1;

SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE app = 'Orange'
GROUP BY
    app,
    user_id
FORMAT `Null`

AggregatingInOrderTransform: Aggregating in order

MemoryTracker: Peak memory usage (for query): 169.24 MiB.

Elapsed: 4.925 sec. Processed 250.09 million rows, 6.75 GB (50.78 million rows/s., 1.37 GB/s.)

Reduce dimensions from GROUP BY with functions like sumMap, *Resample

One

SELECT
    user_id,
    toDate(created_at) AS day,
    sum(clicks)
FROM sessions
WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange', 'Red', 'Blue'))
GROUP BY
    user_id,
    day
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 50.74 GiB.

Elapsed: 22.671 sec. Processed 594.39 million rows, 18.46 GB (26.22 million rows/s., 814.41 MB/s.)


SELECT
    user_id,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks
FROM
(
    SELECT
        user_id,
        sumResample(0, 31, 1)(clicks, toDate(created_at) - toDate('2021-10-01')) AS clicks_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange', 'Red', 'Blue'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks,
    arrayEnumerate(clicks_arr) AS date_diff
FORMAT `Null`

Peak memory usage (for query): 8.24 GiB.

Elapsed: 5.191 sec. Processed 594.39 million rows, 18.46 GB (114.50 million rows/s., 3.56 GB/s.)

Multiple


SELECT
    user_id,
    platform,
    toDate(created_at) AS day,
    sum(clicks)
FROM sessions
WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange')) AND user_id ='UID: 08525196'
GROUP BY
    user_id,
    platform,
    day
ORDER BY user_id,
    day,
    platform
FORMAT `Null`

Peak memory usage (for query): 29.50 GiB.

Elapsed: 8.181 sec. Processed 198.14 million rows, 6.34 GB (24.22 million rows/s., 775.14 MB/s.)

WITH arrayJoin(arrayZip(clicks_arr_lvl_2, range(3))) AS clicks_res
SELECT
    user_id,
    CAST(clicks_res.2 + 1, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)') AS platform,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks_res.1 AS clicks
FROM
(
    SELECT
        user_id,
        sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01')) AS clicks_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks_arr_lvl_2,
    range(31) AS date_diff
FORMAT `Null`

Peak memory usage (for query): 9.92 GiB.

Elapsed: 4.170 sec. Processed 198.14 million rows, 6.34 GB (47.52 million rows/s., 1.52 GB/s.)


WITH arrayJoin(arrayZip(clicks_arr_lvl_2, range(3))) AS clicks_res
SELECT
    user_id,
    CAST(clicks_res.2 + 1, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)') AS platform,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks_res.1 AS clicks
FROM
(
    SELECT
        user_id,
        sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01')) AS clicks_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks_arr_lvl_2,
    range(31) AS date_diff
WHERE clicks > 0
FORMAT `Null`

Peak memory usage (for query): 10.14 GiB.

Elapsed: 9.533 sec. Processed 198.14 million rows, 6.34 GB (20.78 million rows/s., 665.20 MB/s.)

SELECT
    user_id,
    CAST(plat + 1, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)') AS platform,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks
FROM
(
    WITH
        (SELECT flatten(arrayMap(x -> range(3) AS platforms, range(31) as days))) AS platform_arr,
        (SELECT flatten(arrayMap(x -> [x, x, x], range(31) as days))) AS days_arr
    SELECT
        user_id,
        flatten(sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01'))) AS clicks_arr,
        platform_arr,
        days_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks,
    platform_arr AS plat,
    days_arr AS date_diff
FORMAT `Null`

Peak memory usage (for query): 9.95 GiB.

Elapsed: 3.095 sec. Processed 198.14 million rows, 6.34 GB (64.02 million rows/s., 2.05 GB/s.)

SELECT
    user_id,
    CAST(plat + 1, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)') AS platform,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks
FROM
(
    WITH
        (SELECT flatten(arrayMap(x -> range(3) AS platforms, range(31) as days))) AS platform_arr,
        (SELECT flatten(arrayMap(x -> [x, x, x], range(31) as days))) AS days_arr
    SELECT
        user_id,
        sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01')) AS clicks_arr,
        arrayFilter(x -> ((x.1) > 0), arrayZip(flatten(clicks_arr), platform_arr, days_arr)) AS result
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    result.1 AS clicks,
    result.2 AS plat,
    result.3 AS date_diff
FORMAT `Null`

Peak memory usage (for query): 9.93 GiB.

Elapsed: 4.717 sec. Processed 198.14 million rows, 6.34 GB (42.00 million rows/s., 1.34 GB/s.)

SELECT
    user_id,
    CAST(range % 3, 'Enum8(\'Rat\' = 0, \'Mice\' = 1, \'Bat\' = 2)') AS platform,
    toDate('2021-10-01') + intDiv(range, 3) AS day,
    clicks
FROM
(
    WITH (
            SELECT range(93)
        ) AS range_arr
    SELECT
        user_id,
        sumResample(0, 93, 1)(clicks, ((toDate(created_at) - toDate('2021-10-01')) * 3) + toUInt8(CAST(platform, 'Enum8(\'Rat\' = 0, \'Mice\' = 1, \'Bat\' = 2)'))) AS clicks_arr,
        range_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks,
    range_arr AS range
FORMAT `Null`

Peak memory usage (for query): 8.24 GiB.

Elapsed: 4.838 sec. Processed 198.14 million rows, 6.36 GB (40.95 million rows/s., 1.31 GB/s.)

SELECT
    user_id,
    sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01')) AS clicks_arr
FROM sessions
WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
GROUP BY user_id
FORMAT `Null`

Peak memory usage (for query): 5.19 GiB.

0 rows in set. Elapsed: 1.160 sec. Processed 198.14 million rows, 6.34 GB (170.87 million rows/s., 5.47 GB/s.)

ARRAY JOIN can be expensive

https://kb.altinity.com/altinity-kb-functions/array-like-memory-usage/

sumMap, *Resample

https://kb.altinity.com/altinity-kb-functions/resample-vs-if-vs-map-vs-subquery/

Play with two-level

Disable:

SET group_by_two_level_threshold = 0, group_by_two_level_threshold_bytes = 0;

From 22.4 ClickHouse can predict, when it make sense to initialize aggregation with two-level from start, instead of rehashing on fly. It can improve query time. https://github.com/ClickHouse/ClickHouse/pull/33439

GROUP BY in external memory

Slow!

Use hash function for GROUP BY keys

GROUP BY cityHash64(‘xxxx’)

Can lead to incorrect results as hash functions is not 1 to 1 mapping.

Performance bugs

https://github.com/ClickHouse/ClickHouse/issues/15005

https://github.com/ClickHouse/ClickHouse/issues/29131

https://github.com/ClickHouse/ClickHouse/issues/31120

https://github.com/ClickHouse/ClickHouse/issues/35096 Fixed in 22.7

2.2 - Adjustable table partitioning

An approach that allows you to redefine partitioning without table creation

In that example, partitioning is being calculated via MATERIALIZED column expression toDate(toStartOfInterval(ts, toIntervalT(...))), but partition id also can be generated on application side and inserted to ClickHouse® as is.

CREATE TABLE tbl
(
    `ts` DateTime,
    `key` UInt32,
    `partition_key` Date MATERIALIZED toDate(toStartOfInterval(ts, toIntervalYear(1)))
)
ENGINE = MergeTree
PARTITION BY (partition_key, ignore(ts))
ORDER BY key;

SET send_logs_level = 'trace';

INSERT INTO tbl SELECT toDateTime(toDate('2020-01-01') + number) as ts, number as key FROM numbers(300);

Renaming temporary part tmp_insert_20200101-0_1_1_0 to 20200101-0_1_1_0

INSERT INTO tbl SELECT toDateTime(toDate('2021-01-01') + number) as ts, number as key FROM numbers(300);

Renaming temporary part tmp_insert_20210101-0_2_2_0 to 20210101-0_2_2_0

ALTER TABLE tbl
    MODIFY COLUMN `partition_key` Date MATERIALIZED toDate(toStartOfInterval(ts, toIntervalMonth(1)));

INSERT INTO tbl SELECT toDateTime(toDate('2022-01-01') + number) as ts, number as key FROM numbers(300);

Renaming temporary part tmp_insert_20220101-0_3_3_0 to 20220101-0_3_3_0
Renaming temporary part tmp_insert_20220201-0_4_4_0 to 20220201-0_4_4_0
Renaming temporary part tmp_insert_20220301-0_5_5_0 to 20220301-0_5_5_0
Renaming temporary part tmp_insert_20220401-0_6_6_0 to 20220401-0_6_6_0
Renaming temporary part tmp_insert_20220501-0_7_7_0 to 20220501-0_7_7_0
Renaming temporary part tmp_insert_20220601-0_8_8_0 to 20220601-0_8_8_0
Renaming temporary part tmp_insert_20220701-0_9_9_0 to 20220701-0_9_9_0
Renaming temporary part tmp_insert_20220801-0_10_10_0 to 20220801-0_10_10_0
Renaming temporary part tmp_insert_20220901-0_11_11_0 to 20220901-0_11_11_0
Renaming temporary part tmp_insert_20221001-0_12_12_0 to 20221001-0_12_12_0


ALTER TABLE tbl
    MODIFY COLUMN `partition_key` Date MATERIALIZED toDate(toStartOfInterval(ts, toIntervalDay(1)));

INSERT INTO tbl SELECT toDateTime(toDate('2023-01-01') + number) as ts, number as key FROM numbers(5);

Renaming temporary part tmp_insert_20230101-0_13_13_0 to 20230101-0_13_13_0
Renaming temporary part tmp_insert_20230102-0_14_14_0 to 20230102-0_14_14_0
Renaming temporary part tmp_insert_20230103-0_15_15_0 to 20230103-0_15_15_0
Renaming temporary part tmp_insert_20230104-0_16_16_0 to 20230104-0_16_16_0
Renaming temporary part tmp_insert_20230105-0_17_17_0 to 20230105-0_17_17_0


SELECT _partition_id, min(ts), max(ts), count() FROM tbl GROUP BY _partition_id ORDER BY _partition_id;

┌─_partition_id─┬─────────────min(ts)─┬─────────────max(ts)─┬─count()─┐
│ 20200101-0    │ 2020-01-01 00:00:00 │ 2020-10-26 00:00:00 │     300 │
│ 20210101-0    │ 2021-01-01 00:00:00 │ 2021-10-27 00:00:00 │     300 │
│ 20220101-0    │ 2022-01-01 00:00:00 │ 2022-01-31 00:00:00 │      31 │
│ 20220201-0    │ 2022-02-01 00:00:00 │ 2022-02-28 00:00:00 │      28 │
│ 20220301-0    │ 2022-03-01 00:00:00 │ 2022-03-31 00:00:00 │      31 │
│ 20220401-0    │ 2022-04-01 00:00:00 │ 2022-04-30 00:00:00 │      30 │
│ 20220501-0    │ 2022-05-01 00:00:00 │ 2022-05-31 00:00:00 │      31 │
│ 20220601-0    │ 2022-06-01 00:00:00 │ 2022-06-30 00:00:00 │      30 │
│ 20220701-0    │ 2022-07-01 00:00:00 │ 2022-07-31 00:00:00 │      31 │
│ 20220801-0    │ 2022-08-01 00:00:00 │ 2022-08-31 00:00:00 │      31 │
│ 20220901-0    │ 2022-09-01 00:00:00 │ 2022-09-30 00:00:00 │      30 │
│ 20221001-0    │ 2022-10-01 00:00:00 │ 2022-10-27 00:00:00 │      27 │
│ 20230101-0    │ 2023-01-01 00:00:00 │ 2023-01-01 00:00:00 │       1 │
│ 20230102-0    │ 2023-01-02 00:00:00 │ 2023-01-02 00:00:00 │       1 │
│ 20230103-0    │ 2023-01-03 00:00:00 │ 2023-01-03 00:00:00 │       1 │
│ 20230104-0    │ 2023-01-04 00:00:00 │ 2023-01-04 00:00:00 │       1 │
│ 20230105-0    │ 2023-01-05 00:00:00 │ 2023-01-05 00:00:00 │       1 │
└───────────────┴─────────────────────┴─────────────────────┴─────────┘


SELECT count() FROM tbl WHERE ts > '2023-01-04';

Key condition: unknown
MinMax index condition: (column 0 in [1672758001, +Inf))
Selected 1/17 parts by partition key, 1 parts by primary key, 1/1 marks by primary key, 1 marks to read from 1 ranges
Spreading mark ranges among streams (default reading)
Reading 1 ranges in order from part 20230105-0_17_17_0, approx. 1 rows starting from 0

2.3 - DateTime64

Subtract fractional seconds

WITH toDateTime64('2021-09-07 13:41:50.926', 3) AS time
SELECT
    time - 1,
    time - 0.1 AS no_affect,
    time - toDecimal64(0.1, 3) AS uncorrect_result,
    time - toIntervalMillisecond(100) AS correct_result -- from 22.4

Query id: 696722bd-3c22-4270-babe-c6b124fee97f

┌──────────minus(time, 1)─┬───────────────no_affect─┬────────uncorrect_result─┬──────────correct_result─┐
│ 2021-09-07 13:41:49.926 │ 2021-09-07 13:41:50.926 │ 1970-01-01 00:00:00.000 │ 2021-09-07 13:41:50.826 │
└─────────────────────────┴─────────────────────────┴─────────────────────────┴─────────────────────────┘


WITH
    toDateTime64('2021-03-03 09:30:00.100', 3) AS time,
    fromUnixTimestamp64Milli(toInt64(toUnixTimestamp64Milli(time) + (1.25 * 1000))) AS first,
    toDateTime64(toDecimal64(time, 3) + toDecimal64('1.25', 3), 3) AS second,
    reinterpret(reinterpret(time, 'Decimal64(3)') + toDecimal64('1.25', 3), 'DateTime64(3)') AS third,
    time + toIntervalMillisecond(1250) AS fourth, -- from 22.4
    addMilliseconds(time, 1250) AS fifth          -- from 22.4
SELECT
    first,
    second,
    third,
    fourth,
    fifth

Query id: 176cd2e7-68bf-4e26-a492-63e0b5a87cc5

┌───────────────────first─┬──────────────────second─┬───────────────────third─┬──────────────────fourth─┬───────────────────fifth─┐
│ 2021-03-03 09:30:01.350 │ 2021-03-03 09:30:01.350 │ 2021-03-03 09:30:01.350 │ 2021-03-03 09:30:01.350 │ 2021-03-03 09:30:01.350 │
└─────────────────────────┴─────────────────────────┴─────────────────────────┴─────────────────────────┴─────────────────────────┘

SET max_threads=1;

Starting from 22.4

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    time + toIntervalMillisecond(1250) AS fourth
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(fourth)

1 rows in set. Elapsed: 0.215 sec. Processed 100.03 million rows, 800.21 MB (464.27 million rows/s., 3.71 GB/s.)

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    addMilliseconds(time, 1250) AS fifth
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(fifth)

1 rows in set. Elapsed: 0.208 sec. Processed 100.03 million rows, 800.21 MB (481.04 million rows/s., 3.85 GB/s.)

###########

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    fromUnixTimestamp64Milli(reinterpretAsInt64(toUnixTimestamp64Milli(time) + (1.25 * 1000))) AS first
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(first)

1 rows in set. Elapsed: 0.370 sec. Processed 100.03 million rows, 800.21 MB (270.31 million rows/s., 2.16 GB/s.)

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    fromUnixTimestamp64Milli(toUnixTimestamp64Milli(time) + toInt16(1.25 * 1000)) AS first
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(first)

1 rows in set. Elapsed: 0.256 sec. Processed 100.03 million rows, 800.21 MB (391.06 million rows/s., 3.13 GB/s.)


WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    toDateTime64(toDecimal64(time, 3) + toDecimal64('1.25', 3), 3) AS second
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(second)

1 rows in set. Elapsed: 2.240 sec. Processed 100.03 million rows, 800.21 MB (44.65 million rows/s., 357.17 MB/s.)

SET decimal_check_overflow=0;

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    toDateTime64(toDecimal64(time, 3) + toDecimal64('1.25', 3), 3) AS second
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(second)

1 rows in set. Elapsed: 1.991 sec. Processed 100.03 million rows, 800.21 MB (50.23 million rows/s., 401.81 MB/s.)


WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    reinterpret(reinterpret(time, 'Decimal64(3)') + toDecimal64('1.25', 3), 'DateTime64(3)') AS third
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(third)

1 rows in set. Elapsed: 0.515 sec. Processed 100.03 million rows, 800.21 MB (194.39 million rows/s., 1.56 GB/s.)

SET decimal_check_overflow=0;

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    reinterpret(reinterpret(time, 'Decimal64(3)') + toDecimal64('1.25', 3), 'DateTime64(3)') AS third
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(third)

1 rows in set. Elapsed: 0.281 sec. Processed 100.03 million rows, 800.21 MB (356.21 million rows/s., 2.85 GB/s.)

2.4 - DISTINCT & GROUP BY & LIMIT 1 BY what the difference

DISTINCT


SELECT DISTINCT number
FROM numbers_mt(100000000)
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 4.00 GiB.

0 rows in set. Elapsed: 18.720 sec. Processed 100.03 million rows, 800.21 MB (5.34 million rows/s., 42.75 MB/s.)

SELECT DISTINCT number
FROM numbers_mt(100000000)
SETTINGS max_threads = 1
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 4.00 GiB.

0 rows in set. Elapsed: 18.349 sec. Processed 100.03 million rows, 800.21 MB (5.45 million rows/s., 43.61 MB/s.)

SELECT DISTINCT number
FROM numbers_mt(100000000)
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 21.56 MiB.

0 rows in set. Elapsed: 0.014 sec. Processed 589.54 thousand rows, 4.72 MB (43.08 million rows/s., 344.61 MB/s.)



SELECT DISTINCT number % 1000
FROM numbers_mt(1000000000)
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 1.80 MiB.

0 rows in set. Elapsed: 0.005 sec. Processed 589.54 thousand rows, 4.72 MB (127.23 million rows/s., 1.02 GB/s.)

SELECT DISTINCT number % 1000
FROM numbers(1000000000)
LIMIT 1001
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 847.05 KiB.

0 rows in set. Elapsed: 0.448 sec. Processed 1.00 billion rows, 8.00 GB (2.23 billion rows/s., 17.88 GB/s.)

Final distinct step is single threaded
Stream resultset

GROUP BY


SELECT number
FROM numbers_mt(100000000)
GROUP BY number
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 4.04 GiB.

0 rows in set. Elapsed: 8.212 sec. Processed 100.00 million rows, 800.00 MB (12.18 million rows/s., 97.42 MB/s.)

SELECT number
FROM numbers_mt(100000000)
GROUP BY number
SETTINGS max_threads = 1
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 6.00 GiB.

0 rows in set. Elapsed: 19.206 sec. Processed 100.03 million rows, 800.21 MB (5.21 million rows/s., 41.66 MB/s.)

SELECT number
FROM numbers_mt(100000000)
GROUP BY number
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 4.05 GiB.

0 rows in set. Elapsed: 4.852 sec. Processed 100.00 million rows, 800.00 MB (20.61 million rows/s., 164.88 MB/s.)

This query faster than first, because ClickHouse® doesn't need to merge states for all keys, only for first 1000 (based on LIMIT)


SELECT number % 1000 AS key
FROM numbers_mt(1000000000)
GROUP BY key
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 3.15 MiB.

0 rows in set. Elapsed: 0.770 sec. Processed 1.00 billion rows, 8.00 GB (1.30 billion rows/s., 10.40 GB/s.)

SELECT number % 1000 AS key
FROM numbers_mt(1000000000)
GROUP BY key
LIMIT 1001
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 3.77 MiB.

0 rows in set. Elapsed: 0.770 sec. Processed 1.00 billion rows, 8.00 GB (1.30 billion rows/s., 10.40 GB/s.)

Multi threaded
Will return result only after completion of aggregation

LIMIT BY

SELECT number
FROM numbers_mt(100000000)
LIMIT 1 BY number
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 6.00 GiB.

0 rows in set. Elapsed: 39.541 sec. Processed 100.00 million rows, 800.00 MB (2.53 million rows/s., 20.23 MB/s.)

SELECT number
FROM numbers_mt(100000000)
LIMIT 1 BY number
SETTINGS max_threads = 1
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 6.01 GiB.

0 rows in set. Elapsed: 36.773 sec. Processed 100.03 million rows, 800.21 MB (2.72 million rows/s., 21.76 MB/s.)

SELECT number
FROM numbers_mt(100000000)
LIMIT 1 BY number
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 10.56 MiB.

0 rows in set. Elapsed: 0.019 sec. Processed 589.54 thousand rows, 4.72 MB (30.52 million rows/s., 244.20 MB/s.)



SELECT number % 1000 AS key
FROM numbers_mt(1000000000)
LIMIT 1 BY key
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 5.14 MiB.

0 rows in set. Elapsed: 0.008 sec. Processed 589.54 thousand rows, 4.72 MB (71.27 million rows/s., 570.16 MB/s.)

SELECT number % 1000 AS key
FROM numbers_mt(1000000000)
LIMIT 1 BY key
LIMIT 1001
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 3.23 MiB.

0 rows in set. Elapsed: 36.027 sec. Processed 1.00 billion rows, 8.00 GB (27.76 million rows/s., 222.06 MB/s.)

Single threaded
Stream resultset
Can return arbitrary amount of rows per each key

2.5 - Imprecise parsing of literal Decimal or Float64

Imprecise parsing of literal Decimal or Float64

Decimal

SELECT
    9.2::Decimal64(2) AS postgresql_cast,
    toDecimal64(9.2, 2) AS to_function,
    CAST(9.2, 'Decimal64(2)') AS cast_float_literal,
    CAST('9.2', 'Decimal64(2)') AS cast_string_literal

┌─postgresql_cast─┬─to_function─┬─cast_float_literal─┬─cast_string_literal─┐
│             9.2 │        9.19 │               9.19 │                 9.2 │
└─────────────────┴─────────────┴────────────────────┴─────────────────────┘

When we try to type cast 64.32 to Decimal128(2) the resulted value is 64.31.

When it sees a number with a decimal separator it interprets as Float64 literal (where 64.32 have no accurate representation, and actually you get something like 64.319999999999999999) and later that Float is casted to Decimal by removing the extra precision.

Workaround is very simple - wrap the number in quotes (and it will be considered as a string literal by query parser, and will be transformed to Decimal directly), or use postgres-alike casting syntax:

select cast(64.32,'Decimal128(2)') a, cast('64.32','Decimal128(2)') b, 64.32::Decimal128(2) c;

┌─────a─┬─────b─┬─────c─┐
│ 64.31 │ 64.32 │ 64.32 │
└───────┴───────┴───────┘

Float64

SELECT
    toFloat64(15008753.) AS to_func,
    toFloat64('1.5008753E7') AS to_func_scientific,
    CAST('1.5008753E7', 'Float64') AS cast_scientific

┌──to_func─┬─to_func_scientific─┬────cast_scientific─┐
│ 15008753 │ 15008753.000000002 │ 15008753.000000002 │
└──────────┴────────────────────┴────────────────────┘

2.6 - Multiple aligned date columns in PARTITION BY expression

How to put multiple correlated date-like columns in partition key without generating a lot of partitions in case not exact match between them.

Alternative to doing that by minmax skip index .

CREATE TABLE part_key_multiple_dates
(
    `key` UInt32,
    `date` Date,
    `time` DateTime,
    `created_at` DateTime,
    `inserted_at` DateTime
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(date), ignore(created_at, inserted_at))
ORDER BY (key, time);


INSERT INTO part_key_multiple_dates SELECT
    number,
    toDate(x),
    now() + intDiv(number, 10) AS x,
    x - (rand() % 100),
    x + (rand() % 100)
FROM numbers(100000000);

SELECT count()
FROM part_key_multiple_dates
WHERE date > (now() + toIntervalDay(105));

┌─count()─┐
│ 8434210 │
└─────────┘

1 rows in set. Elapsed: 0.022 sec. Processed 11.03 million rows, 22.05 MB (501.94 million rows/s., 1.00 GB/s.)

SELECT count()
FROM part_key_multiple_dates
WHERE inserted_at > (now() + toIntervalDay(105));

┌─count()─┐
│ 9279818 │
└─────────┘

1 rows in set. Elapsed: 0.046 sec. Processed 11.03 million rows, 44.10 MB (237.64 million rows/s., 950.57 MB/s.)

SELECT count()
FROM part_key_multiple_dates
WHERE created_at > (now() + toIntervalDay(105));

┌─count()─┐
│ 9279139 │
└─────────┘

1 rows in set. Elapsed: 0.043 sec. Processed 11.03 million rows, 44.10 MB (258.22 million rows/s., 1.03 GB/s.)

2.7 - Row policies overhead (hiding 'removed' tenants)

One more approach to hide (delete) rows in ClickHouse®

No row policy

CREATE TABLE test_delete
(
    tenant Int64,
    key Int64,
    ts DateTime,
    value_a String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (tenant, key, ts);

INSERT INTO test_delete 
SELECT
    number%5,
    number,
    toDateTime('2020-01-01')+number/10,
    concat('some_looong_string', toString(number)), 
FROM numbers(1e8);

INSERT INTO test_delete  -- multiple small tenants
SELECT
    number%5000,
    number,
    toDateTime('2020-01-01')+number/10,
    concat('some_looong_string', toString(number)), 
FROM numbers(1e8);

Q1) SELECT tenant, count() FROM test_delete GROUP BY tenant ORDER BY tenant LIMIT 6;
┌─tenant─┬──count()─┐
│      0 │ 20020000 │
│      1 │ 20020000 │
│      2 │ 20020000 │
│      3 │ 20020000 │
│      4 │ 20020000 │
│      5 │    20000 │
└────────┴──────────┘
6 rows in set. Elapsed: 0.285 sec. Processed 200.00 million rows, 1.60 GB (702.60 million rows/s., 5.62 GB/s.)

Q2) SELECT uniq(value_a) FROM test_delete where tenant = 4;
┌─uniq(value_a)─┐
│      20016427 │
└───────────────┘
1 row in set. Elapsed: 0.265 sec. Processed 20.23 million rows, 863.93 MB (76.33 million rows/s., 3.26 GB/s.)

Q3) SELECT max(ts) FROM test_delete where tenant = 4;
┌─────────────max(ts)─┐
│ 2020-04-25 17:46:39 │
└─────────────────────┘
1 row in set. Elapsed: 0.062 sec. Processed 20.23 million rows, 242.31 MB (324.83 million rows/s., 3.89 GB/s.)

Q4) SELECT max(ts) FROM test_delete where tenant = 4 and key = 444;
┌─────────────max(ts)─┐
│ 2020-01-01 00:00:44 │
└─────────────────────┘
1 row in set. Elapsed: 0.009 sec. Processed 212.99 thousand rows, 1.80 MB (24.39 million rows/s., 206.36 MB/s.)

row policy using expression

CREATE ROW POLICY pol1 ON test_delete USING tenant not in (1,2,3) TO all;

Q1) SELECT tenant, count() FROM test_delete GROUP BY tenant ORDER BY tenant LIMIT 6;
┌─tenant─┬──count()─┐
│      0 │ 20020000 │
│      4 │ 20020000 │
│      5 │    20000 │
│      6 │    20000 │
│      7 │    20000 │
│      8 │    20000 │
└────────┴──────────┘
6 rows in set. Elapsed: 0.333 sec. Processed 140.08 million rows, 1.12 GB (420.59 million rows/s., 3.36 GB/s.)

Q2) SELECT uniq(value_a) FROM test_delete where tenant = 4;
┌─uniq(value_a)─┐
│      20016427 │
└───────────────┘
1 row in set. Elapsed: 0.287 sec. Processed 20.23 million rows, 863.93 MB (70.48 million rows/s., 3.01 GB/s.)

Q3) SELECT max(ts) FROM test_delete where tenant = 4;
┌─────────────max(ts)─┐
│ 2020-04-25 17:46:39 │
└─────────────────────┘
1 row in set. Elapsed: 0.080 sec. Processed 20.23 million rows, 242.31 MB (254.20 million rows/s., 3.05 GB/s.)

Q4) SELECT max(ts) FROM test_delete where tenant = 4 and key = 444;
┌─────────────max(ts)─┐
│ 2020-01-01 00:00:44 │
└─────────────────────┘
1 row in set. Elapsed: 0.011 sec. Processed 212.99 thousand rows, 3.44 MB (19.53 million rows/s., 315.46 MB/s.)

Q5) SELECT uniq(value_a) FROM test_delete where tenant = 1;
┌─uniq(value_a)─┐
│             0 │
└───────────────┘
1 row in set. Elapsed: 0.008 sec. Processed 180.22 thousand rows, 1.44 MB (23.69 million rows/s., 189.54 MB/s.)

DROP ROW POLICY pol1 ON test_delete;

row policy using table subquery

create table deleted_tenants(tenant Int64) ENGINE=MergeTree order by tenant;

CREATE ROW POLICY pol1 ON test_delete USING tenant not in deleted_tenants TO all;

SELECT tenant, count() FROM test_delete GROUP BY tenant ORDER BY tenant LIMIT 6;
┌─tenant─┬──count()─┐
│      0 │ 20020000 │
│      1 │ 20020000 │
│      2 │ 20020000 │
│      3 │ 20020000 │
│      4 │ 20020000 │
│      5 │    20000 │
└────────┴──────────┘
6 rows in set. Elapsed: 0.455 sec. Processed 200.00 million rows, 1.60 GB (439.11 million rows/s., 3.51 GB/s.)

insert into deleted_tenants values(1),(2),(3);

Q1) SELECT tenant, count() FROM test_delete GROUP BY tenant ORDER BY tenant LIMIT 6;
┌─tenant─┬──count()─┐
│      0 │ 20020000 │
│      4 │ 20020000 │
│      5 │    20000 │
│      6 │    20000 │
│      7 │    20000 │
│      8 │    20000 │
└────────┴──────────┘
6 rows in set. Elapsed: 0.329 sec. Processed 140.08 million rows, 1.12 GB (426.34 million rows/s., 3.41 GB/s.)

Q2) SELECT uniq(value_a) FROM test_delete where tenant = 4;
┌─uniq(value_a)─┐
│      20016427 │
└───────────────┘
1 row in set. Elapsed: 0.287 sec. Processed 20.23 million rows, 863.93 MB (70.56 million rows/s., 3.01 GB/s.)

Q3) SELECT max(ts) FROM test_delete where tenant = 4;
┌─────────────max(ts)─┐
│ 2020-04-25 17:46:39 │
└─────────────────────┘
1 row in set. Elapsed: 0.080 sec. Processed 20.23 million rows, 242.31 MB (251.39 million rows/s., 3.01 GB/s.)

Q4) SELECT max(ts) FROM test_delete where tenant = 4 and key = 444;
┌─────────────max(ts)─┐
│ 2020-01-01 00:00:44 │
└─────────────────────┘
1 row in set. Elapsed: 0.010 sec. Processed 213.00 thousand rows, 3.44 MB (20.33 million rows/s., 328.44 MB/s.)

Q5) SELECT uniq(value_a) FROM test_delete where tenant = 1;
┌─uniq(value_a)─┐
│             0 │
└───────────────┘
1 row in set. Elapsed: 0.008 sec. Processed 180.23 thousand rows, 1.44 MB (22.11 million rows/s., 176.90 MB/s.)

DROP ROW POLICY pol1 ON test_delete;
DROP TABLE deleted_tenants;

row policy using external dictionary (NOT dictHas)

create table deleted_tenants(tenant Int64, deleted UInt8 default 1) ENGINE=MergeTree order by tenant;

insert into deleted_tenants(tenant) values(1),(2),(3);

CREATE DICTIONARY deleted_tenants_dict (tenant UInt64, deleted UInt8) 
PRIMARY KEY tenant SOURCE(CLICKHOUSE(TABLE deleted_tenants)) 
LIFETIME(600) LAYOUT(FLAT());

CREATE ROW POLICY pol1 ON test_delete USING NOT dictHas('deleted_tenants_dict', tenant) TO all;

Q1) SELECT tenant, count() FROM test_delete GROUP BY tenant ORDER BY tenant LIMIT 6;
┌─tenant─┬──count()─┐
│      0 │ 20020000 │
│      4 │ 20020000 │
│      5 │    20000 │
│      6 │    20000 │
│      7 │    20000 │
│      8 │    20000 │
└────────┴──────────┘
6 rows in set. Elapsed: 0.388 sec. Processed 200.00 million rows, 1.60 GB (515.79 million rows/s., 4.13 GB/s.)

Q2) SELECT uniq(value_a) FROM test_delete where tenant = 4;
┌─uniq(value_a)─┐
│      20016427 │
└───────────────┘
1 row in set. Elapsed: 0.291 sec. Processed 20.23 million rows, 863.93 MB (69.47 million rows/s., 2.97 GB/s.)

Q3) SELECT max(ts) FROM test_delete where tenant = 4;
┌─────────────max(ts)─┐
│ 2020-04-25 17:46:39 │
└─────────────────────┘
1 row in set. Elapsed: 0.084 sec. Processed 20.23 million rows, 242.31 MB (240.07 million rows/s., 2.88 GB/s.)

Q4) SELECT max(ts) FROM test_delete where tenant = 4 and key = 444;
┌─────────────max(ts)─┐
│ 2020-01-01 00:00:44 │
└─────────────────────┘
1 row in set. Elapsed: 0.010 sec. Processed 212.99 thousand rows, 3.44 MB (21.45 million rows/s., 346.56 MB/s.)

Q5) SELECT uniq(value_a) FROM test_delete where tenant = 1;
┌─uniq(value_a)─┐
│             0 │
└───────────────┘
1 row in set. Elapsed: 0.046 sec. Processed 20.22 million rows, 161.74 MB (440.26 million rows/s., 3.52 GB/s.)

DROP ROW POLICY pol1 ON test_delete;
DROP DICTIONARY deleted_tenants_dict;
DROP TABLE deleted_tenants;

row policy using external dictionary (dictHas)

create table deleted_tenants(tenant Int64, deleted UInt8 default 1) ENGINE=MergeTree order by tenant;

insert into deleted_tenants(tenant) select distinct tenant from test_delete where tenant not in (1,2,3);

CREATE DICTIONARY deleted_tenants_dict (tenant UInt64, deleted UInt8) 
PRIMARY KEY tenant SOURCE(CLICKHOUSE(TABLE deleted_tenants)) 
LIFETIME(600) LAYOUT(FLAT());

CREATE ROW POLICY pol1 ON test_delete USING dictHas('deleted_tenants_dict', tenant) TO all;

Q1) SELECT tenant, count() FROM test_delete GROUP BY tenant ORDER BY tenant LIMIT 6;
┌─tenant─┬──count()─┐
│      0 │ 20020000 │
│      4 │ 20020000 │
│      5 │    20000 │
│      6 │    20000 │
│      7 │    20000 │
│      8 │    20000 │
└────────┴──────────┘
6 rows in set. Elapsed: 0.399 sec. Processed 200.00 million rows, 1.60 GB (501.18 million rows/s., 4.01 GB/s.)

Q2) SELECT uniq(value_a) FROM test_delete where tenant = 4;
┌─uniq(value_a)─┐
│      20016427 │
└───────────────┘
1 row in set. Elapsed: 0.284 sec. Processed 20.23 million rows, 863.93 MB (71.30 million rows/s., 3.05 GB/s.)

Q3) SELECT max(ts) FROM test_delete where tenant = 4;
┌─────────────max(ts)─┐
│ 2020-04-25 17:46:39 │
└─────────────────────┘
1 row in set. Elapsed: 0.080 sec. Processed 20.23 million rows, 242.31 MB (251.88 million rows/s., 3.02 GB/s.)

Q4) SELECT max(ts) FROM test_delete where tenant = 4 and key = 444;
┌─────────────max(ts)─┐
│ 2020-01-01 00:00:44 │
└─────────────────────┘
1 row in set. Elapsed: 0.010 sec. Processed 212.99 thousand rows, 3.44 MB (22.01 million rows/s., 355.50 MB/s.)

Q5) SELECT uniq(value_a) FROM test_delete where tenant = 1;
┌─uniq(value_a)─┐
│             0 │
└───────────────┘
1 row in set. Elapsed: 0.034 sec. Processed 20.22 million rows, 161.74 MB (589.90 million rows/s., 4.72 GB/s.)

DROP ROW POLICY pol1 ON test_delete;
DROP DICTIONARY deleted_tenants_dict;
DROP TABLE deleted_tenants;

row policy using engine=Set

create table deleted_tenants(tenant Int64) ENGINE=Set;

insert into deleted_tenants(tenant) values(1),(2),(3);

CREATE ROW POLICY pol1 ON test_delete USING tenant not in deleted_tenants TO all;

Q1) SELECT tenant, count() FROM test_delete GROUP BY tenant ORDER BY tenant LIMIT 6;
┌─tenant─┬──count()─┐
│      0 │ 20020000 │
│      4 │ 20020000 │
│      5 │    20000 │
│      6 │    20000 │
│      7 │    20000 │
│      8 │    20000 │
└────────┴──────────┘
6 rows in set. Elapsed: 0.322 sec. Processed 200.00 million rows, 1.60 GB (621.38 million rows/s., 4.97 GB/s.)

Q2) SELECT uniq(value_a) FROM test_delete where tenant = 4;
┌─uniq(value_a)─┐
│      20016427 │
└───────────────┘
1 row in set. Elapsed: 0.275 sec. Processed 20.23 million rows, 863.93 MB (73.56 million rows/s., 3.14 GB/s.)

Q3) SELECT max(ts) FROM test_delete where tenant = 4;
┌─────────────max(ts)─┐
│ 2020-04-25 17:46:39 │
└─────────────────────┘
1 row in set. Elapsed: 0.084 sec. Processed 20.23 million rows, 242.31 MB (240.07 million rows/s., 2.88 GB/s.)

Q4) SELECT max(ts) FROM test_delete where tenant = 4 and key = 444;
┌─────────────max(ts)─┐
│ 2020-01-01 00:00:44 │
└─────────────────────┘
1 row in set. Elapsed: 0.010 sec. Processed 212.99 thousand rows, 3.44 MB (20.69 million rows/s., 334.18 MB/s.)

Q5) SELECT uniq(value_a) FROM test_delete where tenant = 1;
┌─uniq(value_a)─┐
│             0 │
└───────────────┘
1 row in set. Elapsed: 0.030 sec. Processed 20.22 million rows, 161.74 MB (667.06 million rows/s., 5.34 GB/s.)

DROP ROW POLICY pol1 ON test_delete;
DROP TABLE deleted_tenants;

results

expression: CREATE ROW POLICY pol1 ON test_delete USING tenant not in (1,2,3) TO all;

table subq: CREATE ROW POLICY pol1 ON test_delete USING tenant not in deleted_tenants TO all;

ext. dict. NOT dictHas : CREATE ROW POLICY pol1 ON test_delete USING NOT dictHas('deleted_tenants_dict', tenant) TO all;

ext. dict. dictHas :

Q	no policy	expression	table subq	ext. dict. NOT	ext. dict.	engine=Set
Q1	0.285 / 200.00m	0.333 / 140.08m	0.329 / 140.08m	0.388 / 200.00m	0.399 / 200.00m	0.322 / 200.00m
Q2	0.265 / 20.23m	0.287 / 20.23m	0.287 / 20.23m	0.291 / 20.23m	0.284 / 20.23m	0.275 / 20.23m
Q3	0.062 / 20.23m	0.080 / 20.23m	0.080 / 20.23m	0.084 / 20.23m	0.080 / 20.23m	0.084 / 20.23m
Q4	0.009 / 212.99t	0.011 / 212.99t	0.010 / 213.00t	0.010 / 212.99t	0.010 / 212.99t	0.010 / 212.99t
Q5		0.008 / 180.22t	0.008 / 180.23t	0.046 / 20.22m	0.034 / 20.22m	0.030 / 20.22m

Expression in row policy seems to be fastest way (Q1, Q5).

2.8 - Why is simple `SELECT count()` Slow in ClickHouse®?

ClickHouse is a columnar database that provides excellent performance for analytical queries. However, in some cases, a simple count query can be slow. In this article, we’ll explore the reasons why this can happen and how to optimize the query.

Three Strategies for Counting Rows in ClickHouse

There are three ways to count rows in a table in ClickHouse:

optimize_trivial_count_query: This strategy extracts the number of rows from the table metadata. It’s the fastest and most efficient way to count rows, but it only works for simple count queries.
allow_experimental_projection_optimization: This strategy uses a virtual projection called _minmax_count_projection to count rows. It’s faster than scanning the table but slower than the trivial count query.
Scanning the smallest column in the table and reading rows from that. This is the slowest strategy and is only used when the other two strategies can’t be used.

Why Does ClickHouse Sometimes Choose the Slowest Counting Strategy?

In some cases, ClickHouse may choose the slowest counting strategy even when there are faster options available. Here are some possible reasons why this can happen:

Row policies are used on the table: If row policies are used, ClickHouse needs to filter rows to give the proper count. You can check if row policies are used by selecting from system.row_policies.
Experimental light-weight delete feature was used on the table: If the experimental light-weight delete feature was used, ClickHouse may use the slowest counting strategy. You can check this by looking into parts_columns for the column named _row_exists. To do this, run the following query:

SELECT DISTINCT database, table FROM system.parts_columns WHERE column = '_row_exists';

You can also refer to this issue on GitHub for more information: https://github.com/ClickHouse/ClickHouse/issues/47930 .

SELECT FINAL or final=1 setting is used.
max_parallel_replicas > 1 is used.
Sampling is used.
Some other features like allow_experimental_query_deduplication or empty_result_for_aggregation_by_empty_set is used.

2.9 - Collecting query execution flamegraphs using system.trace_log

ClickHouse® has embedded functionality to analyze the details of query performance.

It’s system.trace_log table.

By default it collects information only about queries when runs longer than 1 sec (and collects stacktraces every second).

You can adjust that per query using settings query_profiler_real_time_period_ns & query_profiler_cpu_time_period_ns.

Both works very similar (with desired interval dump the stacktraces of all the threads which execute the query). real timer - allows to ‘see’ the situations when cpu was not working much, but time was spend for example on IO. cpu timer - allows to see the ‘hot’ points in calculations more accurately (skip the io time).

Trying to collect stacktraces with a frequency higher than few KHz is usually not possible.

To check where most of the RAM is used you can collect stacktraces during memory allocations / deallocation, by using the setting memory_profiler_sample_probability.

clickhouse-speedscope

# install 
wget https://github.com/laplab/clickhouse-speedscope/archive/refs/heads/master.tar.gz -O clickhouse-speedscope.tar.gz
tar -xvzf clickhouse-speedscope.tar.gz
cd clickhouse-speedscope-master/
pip3 install -r requirements.txt

For debugging particular query:

clickhouse-client 

SET query_profiler_cpu_time_period_ns=1000000; -- 1000 times per 'cpu' sec
-- or SET query_profiler_real_time_period_ns=2000000; -- 500 times per 'real' sec.
-- or SET memory_profiler_sample_probability=0.1; -- to debug the memory allocations

SELECT ... <your select>

SYSTEM FLUSH LOGS;

-- get the query_id from the clickhouse-client output or from system.query_log (also pay attention on query_id vs initial_query_id for distributed queries).

Now let’s process that:

python3 main.py &  # start the proxy in background
python3 main.py --query-id 908952ee-71a8-48a4-84d5-f4db92d45a5d # process the stacktraces
fg # get the proxy from background 
Ctrl + C  # stop it.

To access ClickHouse with other username / password etc. - see the sources of https://github.com/laplab/clickhouse-speedscope/blob/master/main.py

clickhouse-flamegraph

Installation & usage instructions: https://github.com/Slach/clickhouse-flamegraph

pure flamegraph.pl examples

git clone https://github.com/brendangregg/FlameGraph /opt/flamegraph

clickhouse-client -q "SELECT  arrayStringConcat(arrayReverse(arrayMap(x -> concat( addressToLine(x), '#', demangle(addressToSymbol(x)) ), trace)), ';') AS stack, count() AS samples FROM system.trace_log WHERE event_time >= subtractMinutes(now(),10) GROUP BY trace FORMAT TabSeparated" | /opt/flamegraph/flamegraph.pl > flamegraph.svg

clickhouse-client -q "SELECT  arrayStringConcat((arrayMap(x -> concat(splitByChar('/', addressToLine(x))[-1], '#', demangle(addressToSymbol(x)) ), trace)), ';') AS stack, sum(abs(size)) AS samples FROM system.trace_log where trace_type = 'Memory' and event_date = today() group by trace order by samples desc FORMAT TabSeparated" | /opt/flamegraph/flamegraph.pl > allocs.svg
clickhouse-client -q "SELECT  arrayStringConcat(arrayReverse(arrayMap(x -> concat(splitByChar('/', addressToLine(x))[-1], '#', demangle(addressToSymbol(x)) ), trace)), ';') AS stack, count() AS samples FROM system.trace_log where trace_type = 'Memory' group by trace FORMAT TabSeparated SETTINGS allow_introspection_functions=1" | /opt/flamegraph/flamegraph.pl > ~/mem1.svg

similar using perf

apt-get update -y 
apt-get install -y linux-tools-common linux-tools-generic linux-tools-`uname -r`git
apt-get install -y clickhouse-common-static-dbg clickhouse-common-dbg
mkdir -p /opt/flamegraph
git clone https://github.com/brendangregg/FlameGraph /opt/flamegraph

perf record -F 99 -p $(pidof clickhouse) -G
perf script > /tmp/out.perf
/opt/flamegraph/stackcollapse-perf.pl /tmp/out.perf | /opt/flamegraph/flamegraph.pl > /tmp/flamegraph.svg

also

https://kb.altinity.com/altinity-kb-queries-and-syntax/troubleshooting/#flamegraph

https://github.com/samber/grafana-flamegraph-panel/pull/2

2.10 - Using array functions to mimic window-functions alike behavior

There are cases where you may need to mimic window functions using arrays in ClickHouse. This could be for optimization purposes, to better manage memory, or to enable on-disk spilling, especially if you’re working with an older version of ClickHouse that doesn’t natively support window functions.

Here’s an example demonstrating how to mimic a window function like runningDifference() using arrays:

Step 1: Create Sample Data

We’ll start by creating a test table with some sample data:

DROP TABLE IS EXISTS test_running_difference

CREATE TABLE test_running_difference
ENGINE = Log AS
SELECT 
    number % 20 AS id, 
    toDateTime('2010-01-01 00:00:00') + (intDiv(number, 20) * 15) AS ts, 
    (number * round(xxHash32(number % 20) / 1000000)) - round(rand() / 1000000) AS val
FROM numbers(100)


SELECT * FROM test_running_difference;

┌─id─┬──────────────────ts─┬────val─┐
│  0 │ 2010-01-01 00:00:00 │  -1209 │
│  1 │ 2010-01-01 00:00:00 │     43 │
│  2 │ 2010-01-01 00:00:00 │   4322 │
│  3 │ 2010-01-01 00:00:00 │    -25 │
│  4 │ 2010-01-01 00:00:00 │  13720 │
│  5 │ 2010-01-01 00:00:00 │    903 │
│  6 │ 2010-01-01 00:00:00 │  18062 │
│  7 │ 2010-01-01 00:00:00 │  -2873 │
│  8 │ 2010-01-01 00:00:00 │   6286 │
│  9 │ 2010-01-01 00:00:00 │  13399 │
│ 10 │ 2010-01-01 00:00:00 │  18320 │
│ 11 │ 2010-01-01 00:00:00 │  11731 │
│ 12 │ 2010-01-01 00:00:00 │    857 │
│ 13 │ 2010-01-01 00:00:00 │   8752 │
│ 14 │ 2010-01-01 00:00:00 │  23060 │
│ 15 │ 2010-01-01 00:00:00 │  41902 │
│ 16 │ 2010-01-01 00:00:00 │  39406 │
│ 17 │ 2010-01-01 00:00:00 │  50010 │
│ 18 │ 2010-01-01 00:00:00 │  57673 │
│ 19 │ 2010-01-01 00:00:00 │  51389 │
│  0 │ 2010-01-01 00:00:15 │  66839 │
│  1 │ 2010-01-01 00:00:15 │  19440 │
│  2 │ 2010-01-01 00:00:15 │  74513 │
│  3 │ 2010-01-01 00:00:15 │  10542 │
│  4 │ 2010-01-01 00:00:15 │  94245 │
│  5 │ 2010-01-01 00:00:15 │   8230 │
│  6 │ 2010-01-01 00:00:15 │  87823 │
│  7 │ 2010-01-01 00:00:15 │   -128 │
│  8 │ 2010-01-01 00:00:15 │  30101 │
│  9 │ 2010-01-01 00:00:15 │  54321 │
│ 10 │ 2010-01-01 00:00:15 │  64078 │
│ 11 │ 2010-01-01 00:00:15 │  31886 │
│ 12 │ 2010-01-01 00:00:15 │   8749 │
│ 13 │ 2010-01-01 00:00:15 │  28982 │
│ 14 │ 2010-01-01 00:00:15 │  61299 │
│ 15 │ 2010-01-01 00:00:15 │  95867 │
│ 16 │ 2010-01-01 00:00:15 │  93667 │
│ 17 │ 2010-01-01 00:00:15 │ 114072 │
│ 18 │ 2010-01-01 00:00:15 │ 124279 │
│ 19 │ 2010-01-01 00:00:15 │ 109605 │
│  0 │ 2010-01-01 00:00:30 │ 135082 │
│  1 │ 2010-01-01 00:00:30 │  37345 │
│  2 │ 2010-01-01 00:00:30 │ 148744 │
│  3 │ 2010-01-01 00:00:30 │  21607 │
│  4 │ 2010-01-01 00:00:30 │ 171744 │
│  5 │ 2010-01-01 00:00:30 │  14736 │
│  6 │ 2010-01-01 00:00:30 │ 155349 │
│  7 │ 2010-01-01 00:00:30 │  -3901 │
│  8 │ 2010-01-01 00:00:30 │  54303 │
│  9 │ 2010-01-01 00:00:30 │  89629 │
│ 10 │ 2010-01-01 00:00:30 │ 106595 │
│ 11 │ 2010-01-01 00:00:30 │  54545 │
│ 12 │ 2010-01-01 00:00:30 │  18903 │
│ 13 │ 2010-01-01 00:00:30 │  48023 │
│ 14 │ 2010-01-01 00:00:30 │  97930 │
│ 15 │ 2010-01-01 00:00:30 │ 152165 │
│ 16 │ 2010-01-01 00:00:30 │ 146130 │
│ 17 │ 2010-01-01 00:00:30 │ 174854 │
│ 18 │ 2010-01-01 00:00:30 │ 189194 │
│ 19 │ 2010-01-01 00:00:30 │ 170134 │
│  0 │ 2010-01-01 00:00:45 │ 207471 │
│  1 │ 2010-01-01 00:00:45 │  54323 │
│  2 │ 2010-01-01 00:00:45 │ 217984 │
│  3 │ 2010-01-01 00:00:45 │  31835 │
│  4 │ 2010-01-01 00:00:45 │ 252709 │
│  5 │ 2010-01-01 00:00:45 │  21493 │
│  6 │ 2010-01-01 00:00:45 │ 221271 │
│  7 │ 2010-01-01 00:00:45 │   -488 │
│  8 │ 2010-01-01 00:00:45 │  76827 │
│  9 │ 2010-01-01 00:00:45 │ 131066 │
│ 10 │ 2010-01-01 00:00:45 │ 149087 │
│ 11 │ 2010-01-01 00:00:45 │  71934 │
│ 12 │ 2010-01-01 00:00:45 │  25125 │
│ 13 │ 2010-01-01 00:00:45 │  65274 │
│ 14 │ 2010-01-01 00:00:45 │ 135980 │
│ 15 │ 2010-01-01 00:00:45 │ 210910 │
│ 16 │ 2010-01-01 00:00:45 │ 200007 │
│ 17 │ 2010-01-01 00:00:45 │ 235872 │
│ 18 │ 2010-01-01 00:00:45 │ 256112 │
│ 19 │ 2010-01-01 00:00:45 │ 229371 │
│  0 │ 2010-01-01 00:01:00 │ 275331 │
│  1 │ 2010-01-01 00:01:00 │  72668 │
│  2 │ 2010-01-01 00:01:00 │ 290366 │
│  3 │ 2010-01-01 00:01:00 │  46074 │
│  4 │ 2010-01-01 00:01:00 │ 329207 │
│  5 │ 2010-01-01 00:01:00 │  26770 │
│  6 │ 2010-01-01 00:01:00 │ 287619 │
│  7 │ 2010-01-01 00:01:00 │  -2207 │
│  8 │ 2010-01-01 00:01:00 │ 100456 │
│  9 │ 2010-01-01 00:01:00 │ 165688 │
│ 10 │ 2010-01-01 00:01:00 │ 194136 │
│ 11 │ 2010-01-01 00:01:00 │  94113 │
│ 12 │ 2010-01-01 00:01:00 │  35810 │
│ 13 │ 2010-01-01 00:01:00 │  85081 │
│ 14 │ 2010-01-01 00:01:00 │ 170256 │
│ 15 │ 2010-01-01 00:01:00 │ 265445 │
│ 16 │ 2010-01-01 00:01:00 │ 254828 │
│ 17 │ 2010-01-01 00:01:00 │ 297238 │
│ 18 │ 2010-01-01 00:01:00 │ 323494 │
│ 19 │ 2010-01-01 00:01:00 │ 286252 │
└────┴─────────────────────┴────────┘

100 rows in set. Elapsed: 0.003 sec.

This table contains IDs, timestamps (ts), and values (val), where each id appears multiple times with different timestamps.

Step 2: Running Difference Example

If you try using runningDifference directly, it works block by block, which can be problematic when the data needs to be ordered or when group changes occur.

select id, val, runningDifference(val) from (select * from test_running_difference order by id, ts);

┌─id─┬────val─┬─runningDifference(val)─┐
│  0 │  -1209 │                      0 │
│  0 │  66839 │                  68048 │
│  0 │ 135082 │                  68243 │
│  0 │ 207471 │                  72389 │
│  0 │ 275331 │                  67860 │
│  1 │     43 │                -275288 │
│  1 │  19440 │                  19397 │
│  1 │  37345 │                  17905 │
│  1 │  54323 │                  16978 │
│  1 │  72668 │                  18345 │
│  2 │   4322 │                 -68346 │
│  2 │  74513 │                  70191 │
│  2 │ 148744 │                  74231 │
│  2 │ 217984 │                  69240 │
│  2 │ 290366 │                  72382 │
│  3 │    -25 │                -290391 │
│  3 │  10542 │                  10567 │
│  3 │  21607 │                  11065 │
│  3 │  31835 │                  10228 │
│  3 │  46074 │                  14239 │
│  4 │  13720 │                 -32354 │
│  4 │  94245 │                  80525 │
│  4 │ 171744 │                  77499 │
│  4 │ 252709 │                  80965 │
│  4 │ 329207 │                  76498 │
│  5 │    903 │                -328304 │
│  5 │   8230 │                   7327 │
│  5 │  14736 │                   6506 │
│  5 │  21493 │                   6757 │
│  5 │  26770 │                   5277 │
│  6 │  18062 │                  -8708 │
│  6 │  87823 │                  69761 │
│  6 │ 155349 │                  67526 │
│  6 │ 221271 │                  65922 │
│  6 │ 287619 │                  66348 │
│  7 │  -2873 │                -290492 │
│  7 │   -128 │                   2745 │
│  7 │  -3901 │                  -3773 │
│  7 │   -488 │                   3413 │
│  7 │  -2207 │                  -1719 │
│  8 │   6286 │                   8493 │
│  8 │  30101 │                  23815 │
│  8 │  54303 │                  24202 │
│  8 │  76827 │                  22524 │
│  8 │ 100456 │                  23629 │
│  9 │  13399 │                 -87057 │
│  9 │  54321 │                  40922 │
│  9 │  89629 │                  35308 │
│  9 │ 131066 │                  41437 │
│  9 │ 165688 │                  34622 │
│ 10 │  18320 │                -147368 │
│ 10 │  64078 │                  45758 │
│ 10 │ 106595 │                  42517 │
│ 10 │ 149087 │                  42492 │
│ 10 │ 194136 │                  45049 │
│ 11 │  11731 │                -182405 │
│ 11 │  31886 │                  20155 │
│ 11 │  54545 │                  22659 │
│ 11 │  71934 │                  17389 │
│ 11 │  94113 │                  22179 │
│ 12 │    857 │                 -93256 │
│ 12 │   8749 │                   7892 │
│ 12 │  18903 │                  10154 │
│ 12 │  25125 │                   6222 │
│ 12 │  35810 │                  10685 │
│ 13 │   8752 │                 -27058 │
│ 13 │  28982 │                  20230 │
│ 13 │  48023 │                  19041 │
│ 13 │  65274 │                  17251 │
│ 13 │  85081 │                  19807 │
│ 14 │  23060 │                 -62021 │
│ 14 │  61299 │                  38239 │
│ 14 │  97930 │                  36631 │
│ 14 │ 135980 │                  38050 │
│ 14 │ 170256 │                  34276 │
│ 15 │  41902 │                -128354 │
│ 15 │  95867 │                  53965 │
│ 15 │ 152165 │                  56298 │
│ 15 │ 210910 │                  58745 │
│ 15 │ 265445 │                  54535 │
│ 16 │  39406 │                -226039 │
│ 16 │  93667 │                  54261 │
│ 16 │ 146130 │                  52463 │
│ 16 │ 200007 │                  53877 │
│ 16 │ 254828 │                  54821 │
│ 17 │  50010 │                -204818 │
│ 17 │ 114072 │                  64062 │
│ 17 │ 174854 │                  60782 │
│ 17 │ 235872 │                  61018 │
│ 17 │ 297238 │                  61366 │
│ 18 │  57673 │                -239565 │
│ 18 │ 124279 │                  66606 │
│ 18 │ 189194 │                  64915 │
│ 18 │ 256112 │                  66918 │
│ 18 │ 323494 │                  67382 │
│ 19 │  51389 │                -272105 │
│ 19 │ 109605 │                  58216 │
│ 19 │ 170134 │                  60529 │
│ 19 │ 229371 │                  59237 │
│ 19 │ 286252 │                  56881 │
└────┴────────┴────────────────────────┘

100 rows in set. Elapsed: 0.005 sec.

The output may look inconsistent because runningDifference requires ordered data within blocks.

Step 3: Using Arrays for Grouping and Calculation

Instead of using runningDifference, we can utilize arrays to group data, sort it, and apply similar logic more efficiently.

Grouping Data into Arrays - You can group multiple columns into arrays by using the groupArray function. For example, to collect several columns as arrays of tuples, you can use the following query:

SELECT 
    id, 
    groupArray(tuple(ts, val))
FROM test_running_difference
GROUP BY id

┌─id─┬─groupArray(tuple(ts, val))──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  0 │ [('2010-01-01 00:00:00',-1209),('2010-01-01 00:00:15',66839),('2010-01-01 00:00:30',135082),('2010-01-01 00:00:45',207471),('2010-01-01 00:01:00',275331)]  │
│  1 │ [('2010-01-01 00:00:00',43),('2010-01-01 00:00:15',19440),('2010-01-01 00:00:30',37345),('2010-01-01 00:00:45',54323),('2010-01-01 00:01:00',72668)]        │
│  2 │ [('2010-01-01 00:00:00',4322),('2010-01-01 00:00:15',74513),('2010-01-01 00:00:30',148744),('2010-01-01 00:00:45',217984),('2010-01-01 00:01:00',290366)]   │
│  3 │ [('2010-01-01 00:00:00',-25),('2010-01-01 00:00:15',10542),('2010-01-01 00:00:30',21607),('2010-01-01 00:00:45',31835),('2010-01-01 00:01:00',46074)]       │
│  4 │ [('2010-01-01 00:00:00',13720),('2010-01-01 00:00:15',94245),('2010-01-01 00:00:30',171744),('2010-01-01 00:00:45',252709),('2010-01-01 00:01:00',329207)]  │
│  5 │ [('2010-01-01 00:00:00',903),('2010-01-01 00:00:15',8230),('2010-01-01 00:00:30',14736),('2010-01-01 00:00:45',21493),('2010-01-01 00:01:00',26770)]        │
│  6 │ [('2010-01-01 00:00:00',18062),('2010-01-01 00:00:15',87823),('2010-01-01 00:00:30',155349),('2010-01-01 00:00:45',221271),('2010-01-01 00:01:00',287619)]  │
│  7 │ [('2010-01-01 00:00:00',-2873),('2010-01-01 00:00:15',-128),('2010-01-01 00:00:30',-3901),('2010-01-01 00:00:45',-488),('2010-01-01 00:01:00',-2207)]       │
│  8 │ [('2010-01-01 00:00:00',6286),('2010-01-01 00:00:15',30101),('2010-01-01 00:00:30',54303),('2010-01-01 00:00:45',76827),('2010-01-01 00:01:00',100456)]     │
│  9 │ [('2010-01-01 00:00:00',13399),('2010-01-01 00:00:15',54321),('2010-01-01 00:00:30',89629),('2010-01-01 00:00:45',131066),('2010-01-01 00:01:00',165688)]   │
│ 10 │ [('2010-01-01 00:00:00',18320),('2010-01-01 00:00:15',64078),('2010-01-01 00:00:30',106595),('2010-01-01 00:00:45',149087),('2010-01-01 00:01:00',194136)]  │
│ 11 │ [('2010-01-01 00:00:00',11731),('2010-01-01 00:00:15',31886),('2010-01-01 00:00:30',54545),('2010-01-01 00:00:45',71934),('2010-01-01 00:01:00',94113)]     │
│ 12 │ [('2010-01-01 00:00:00',857),('2010-01-01 00:00:15',8749),('2010-01-01 00:00:30',18903),('2010-01-01 00:00:45',25125),('2010-01-01 00:01:00',35810)]        │
│ 13 │ [('2010-01-01 00:00:00',8752),('2010-01-01 00:00:15',28982),('2010-01-01 00:00:30',48023),('2010-01-01 00:00:45',65274),('2010-01-01 00:01:00',85081)]      │
│ 14 │ [('2010-01-01 00:00:00',23060),('2010-01-01 00:00:15',61299),('2010-01-01 00:00:30',97930),('2010-01-01 00:00:45',135980),('2010-01-01 00:01:00',170256)]   │
│ 15 │ [('2010-01-01 00:00:00',41902),('2010-01-01 00:00:15',95867),('2010-01-01 00:00:30',152165),('2010-01-01 00:00:45',210910),('2010-01-01 00:01:00',265445)]  │
│ 16 │ [('2010-01-01 00:00:00',39406),('2010-01-01 00:00:15',93667),('2010-01-01 00:00:30',146130),('2010-01-01 00:00:45',200007),('2010-01-01 00:01:00',254828)]  │
│ 17 │ [('2010-01-01 00:00:00',50010),('2010-01-01 00:00:15',114072),('2010-01-01 00:00:30',174854),('2010-01-01 00:00:45',235872),('2010-01-01 00:01:00',297238)] │
│ 18 │ [('2010-01-01 00:00:00',57673),('2010-01-01 00:00:15',124279),('2010-01-01 00:00:30',189194),('2010-01-01 00:00:45',256112),('2010-01-01 00:01:00',323494)] │
│ 19 │ [('2010-01-01 00:00:00',51389),('2010-01-01 00:00:15',109605),('2010-01-01 00:00:30',170134),('2010-01-01 00:00:45',229371),('2010-01-01 00:01:00',286252)] │
└────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Sorting Arrays - To sort the arrays by a specific element, for example, by the second element of the tuple, you can use the arraySort function:

SELECT 
    id, 
    arraySort(x -> (x.2), groupArray((ts, val)))
FROM test_running_difference
GROUP BY id

┌─id─┬─arraySort(lambda(tuple(x), tupleElement(x, 2)), groupArray(tuple(ts, val)))─────────────────────────────────────────────────────────────────────────────────┐
│  0 │ [('2010-01-01 00:00:00',-1209),('2010-01-01 00:00:15',66839),('2010-01-01 00:00:30',135082),('2010-01-01 00:00:45',207471),('2010-01-01 00:01:00',275331)]  │
│  1 │ [('2010-01-01 00:00:00',43),('2010-01-01 00:00:15',19440),('2010-01-01 00:00:30',37345),('2010-01-01 00:00:45',54323),('2010-01-01 00:01:00',72668)]        │
│  2 │ [('2010-01-01 00:00:00',4322),('2010-01-01 00:00:15',74513),('2010-01-01 00:00:30',148744),('2010-01-01 00:00:45',217984),('2010-01-01 00:01:00',290366)]   │
│  3 │ [('2010-01-01 00:00:00',-25),('2010-01-01 00:00:15',10542),('2010-01-01 00:00:30',21607),('2010-01-01 00:00:45',31835),('2010-01-01 00:01:00',46074)]       │
│  4 │ [('2010-01-01 00:00:00',13720),('2010-01-01 00:00:15',94245),('2010-01-01 00:00:30',171744),('2010-01-01 00:00:45',252709),('2010-01-01 00:01:00',329207)]  │
│  5 │ [('2010-01-01 00:00:00',903),('2010-01-01 00:00:15',8230),('2010-01-01 00:00:30',14736),('2010-01-01 00:00:45',21493),('2010-01-01 00:01:00',26770)]        │
│  6 │ [('2010-01-01 00:00:00',18062),('2010-01-01 00:00:15',87823),('2010-01-01 00:00:30',155349),('2010-01-01 00:00:45',221271),('2010-01-01 00:01:00',287619)]  │
│  7 │ [('2010-01-01 00:00:30',-3901),('2010-01-01 00:00:00',-2873),('2010-01-01 00:01:00',-2207),('2010-01-01 00:00:45',-488),('2010-01-01 00:00:15',-128)]       │
│  8 │ [('2010-01-01 00:00:00',6286),('2010-01-01 00:00:15',30101),('2010-01-01 00:00:30',54303),('2010-01-01 00:00:45',76827),('2010-01-01 00:01:00',100456)]     │
│  9 │ [('2010-01-01 00:00:00',13399),('2010-01-01 00:00:15',54321),('2010-01-01 00:00:30',89629),('2010-01-01 00:00:45',131066),('2010-01-01 00:01:00',165688)]   │
│ 10 │ [('2010-01-01 00:00:00',18320),('2010-01-01 00:00:15',64078),('2010-01-01 00:00:30',106595),('2010-01-01 00:00:45',149087),('2010-01-01 00:01:00',194136)]  │
│ 11 │ [('2010-01-01 00:00:00',11731),('2010-01-01 00:00:15',31886),('2010-01-01 00:00:30',54545),('2010-01-01 00:00:45',71934),('2010-01-01 00:01:00',94113)]     │
│ 12 │ [('2010-01-01 00:00:00',857),('2010-01-01 00:00:15',8749),('2010-01-01 00:00:30',18903),('2010-01-01 00:00:45',25125),('2010-01-01 00:01:00',35810)]        │
│ 13 │ [('2010-01-01 00:00:00',8752),('2010-01-01 00:00:15',28982),('2010-01-01 00:00:30',48023),('2010-01-01 00:00:45',65274),('2010-01-01 00:01:00',85081)]      │
│ 14 │ [('2010-01-01 00:00:00',23060),('2010-01-01 00:00:15',61299),('2010-01-01 00:00:30',97930),('2010-01-01 00:00:45',135980),('2010-01-01 00:01:00',170256)]   │
│ 15 │ [('2010-01-01 00:00:00',41902),('2010-01-01 00:00:15',95867),('2010-01-01 00:00:30',152165),('2010-01-01 00:00:45',210910),('2010-01-01 00:01:00',265445)]  │
│ 16 │ [('2010-01-01 00:00:00',39406),('2010-01-01 00:00:15',93667),('2010-01-01 00:00:30',146130),('2010-01-01 00:00:45',200007),('2010-01-01 00:01:00',254828)]  │
│ 17 │ [('2010-01-01 00:00:00',50010),('2010-01-01 00:00:15',114072),('2010-01-01 00:00:30',174854),('2010-01-01 00:00:45',235872),('2010-01-01 00:01:00',297238)] │
│ 18 │ [('2010-01-01 00:00:00',57673),('2010-01-01 00:00:15',124279),('2010-01-01 00:00:30',189194),('2010-01-01 00:00:45',256112),('2010-01-01 00:01:00',323494)] │
│ 19 │ [('2010-01-01 00:00:00',51389),('2010-01-01 00:00:15',109605),('2010-01-01 00:00:30',170134),('2010-01-01 00:00:45',229371),('2010-01-01 00:01:00',286252)] │
└────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

20 rows in set. Elapsed: 0.004 sec.

This sorts each array by the val (second element of the tuple) for each id.

Simplified Sorting Example - We can rewrite the query in a more concise way using WITH clauses for better readability:

WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows
SELECT 
    id, 
    sorted_window_rows
FROM test_running_difference
GROUP BY id

Applying Calculations with Arrays - Once the data is sorted, you can apply array functions like arrayMap and arrayDifference to calculate differences between values in the arrays:

WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as sorted_window_rows_val_column,
	arrayDifference(sorted_window_rows_val_column) as sorted_window_rows_val_column_diff
SELECT 
    id, 
    sorted_window_rows_val_column_diff
FROM test_running_difference
GROUP BY id

┌─id─┬─sorted_window_rows_val_column_diff─┐
│  0 │ [0,68048,68243,72389,67860]        │
│  1 │ [0,19397,17905,16978,18345]        │
│  2 │ [0,70191,74231,69240,72382]        │
│  3 │ [0,10567,11065,10228,14239]        │
│  4 │ [0,80525,77499,80965,76498]        │
│  5 │ [0,7327,6506,6757,5277]            │
│  6 │ [0,69761,67526,65922,66348]        │
│  7 │ [0,2745,-3773,3413,-1719]          │
│  8 │ [0,23815,24202,22524,23629]        │
│  9 │ [0,40922,35308,41437,34622]        │
│ 10 │ [0,45758,42517,42492,45049]        │
│ 11 │ [0,20155,22659,17389,22179]        │
│ 12 │ [0,7892,10154,6222,10685]          │
│ 13 │ [0,20230,19041,17251,19807]        │
│ 14 │ [0,38239,36631,38050,34276]        │
│ 15 │ [0,53965,56298,58745,54535]        │
│ 16 │ [0,54261,52463,53877,54821]        │
│ 17 │ [0,64062,60782,61018,61366]        │
│ 18 │ [0,66606,64915,66918,67382]        │
│ 19 │ [0,58216,60529,59237,56881]        │
└────┴────────────────────────────────────┘

20 rows in set. Elapsed: 0.005 sec.

You can do also a lot of magic with arrayEnumerate and accessing different values by their ids.

Reverting Arrays Back to Rows - You can convert the arrays back into rows using arrayJoin:

WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as sorted_window_rows_val_column,
	arrayDifference(sorted_window_rows_val_column) as sorted_window_rows_val_column_diff,
	arrayJoin(sorted_window_rows_val_column_diff) as diff
SELECT 
    id, 
    diff
FROM test_running_difference
GROUP BY id

Or use ARRAY JOIN to join the arrays back to the original structure:

SELECT 
  id,
  diff,
  ts
FROM 
(
WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as sorted_window_rows_val_column
SELECT 
    id, 
	arrayDifference(sorted_window_rows_val_column) as sorted_window_rows_val_column_diff,
    arrayMap(x -> x.1, sorted_window_rows) as sorted_window_rows_ts_column
FROM test_running_difference
GROUP BY id
) as t1
ARRAY JOIN sorted_window_rows_val_column_diff as diff, sorted_window_rows_ts_column as ts

This allows you to manipulate and analyze data within arrays effectively, using powerful functions such as arrayMap, arrayDifference, and arrayEnumerate.

2.11 - -State & -Merge combinators

-State & -Merge combinators

The -State combinator in ClickHouse® does not store additional information about the -If combinator, which means that aggregate functions with and without -If have the same serialized data structure. This can be verified through various examples, as demonstrated below.

Example 1: maxIfState and maxState In this example, we use the maxIfState and maxState functions on a dataset of numbers, serialize the result, and merge it using the maxMerge function.

$ clickhouse-local --query "SELECT maxIfState(number,number % 2) as x, maxState(number) as y FROM numbers(10) FORMAT RowBinary" | clickhouse-local --input-format RowBinary --structure="x AggregateFunction(max,UInt64), y AggregateFunction(max,UInt64)" --query "SELECT maxMerge(x), maxMerge(y) FROM table"
9       9
$ clickhouse-local --query "SELECT maxIfState(number,number % 2) as x, maxState(number) as y FROM numbers(11) FORMAT RowBinary" | clickhouse-local --input-format RowBinary --structure="x AggregateFunction(max,UInt64), y AggregateFunction(max,UInt64)" --query "SELECT maxMerge(x), maxMerge(y) FROM table"
9       10

In both cases, the -State combinator results in identical serialized data footprints, regardless of the conditions in the -If variant. The maxMerge function merges the state without concern for the original -If condition.

Example 2: quantilesTDigestIfState Here, we use the quantilesTDigestIfState function to demonstrate that functions like quantile-based and sequence matching functions follow the same principle regarding serialized data consistency.

$ clickhouse-local --query "SELECT quantilesTDigestIfState(0.1,0.9)(number,number % 2) FROM  numbers(1000000) FORMAT RowBinary" | clickhouse-local --input-format RowBinary --structure="x AggregateFunction(quantileTDigestWeighted(0.5),UInt64,UInt8)" --query "SELECT quantileTDigestWeightedMerge(0.4)(x) FROM table"
400000

$ clickhouse-local --query "SELECT quantilesTDigestIfState(0.1,0.9)(number,number % 2) FROM  numbers(1000000) FORMAT RowBinary" | clickhouse-local --input-format RowBinary --structure="x AggregateFunction(quantilesTDigestWeighted(0.5),UInt64,UInt8)" --query "SELECT quantilesTDigestWeightedMerge(0.4,0.8)(x) FROM table"
[400000,800000]

Example 3: Quantile Functions with -Merge This example shows how the quantileState and quantileMerge functions work together to calculate a specific quantile.

SELECT quantileMerge(0.9)(x)
FROM
(
    SELECT quantileState(0.1)(number) AS x
    FROM numbers(1000)
)

┌─quantileMerge(0.9)(x)─┐
│                 899.1 │
└───────────────────────┘

Example 4: sequenceMatch and sequenceCount Functions with -Merge Finally, we demonstrate the behavior of sequenceMatchState and sequenceMatchMerge, as well as sequenceCountState and sequenceCountMerge, in ClickHouse.

SELECT
    sequenceMatchMerge('(?2)(?3)')(x) AS `2_3`,
    sequenceMatchMerge('(?1)(?4)')(x) AS `1_4`,
    sequenceMatchMerge('(?1)(?2)(?3)')(x) AS `1_2_3`
FROM
(
    SELECT sequenceMatchState('(?1)(?2)(?3)')(number, number = 8, number = 5, number = 6, number = 9) AS x
    FROM numbers(10)
)

┌─2_3─┬─1_4─┬─1_2_3─┐
│   1 │   1 │     0 │
└─────┴─────┴───────┘

Similarly, sequenceCountState and sequenceCountMerge functions behave consistently when merging states:


SELECT
    sequenceCountMerge('(?1)(?2)')(x) AS `2_3`,
    sequenceCountMerge('(?1)(?4)')(x) AS `1_4`,
    sequenceCountMerge('(?1)(?2)(?3)')(x) AS `1_2_3`
FROM
(
    WITH number % 4 AS cond
    SELECT sequenceCountState('(?1)(?2)(?3)')(number, cond = 1, cond = 2, cond = 3, cond = 5) AS x
    FROM numbers(11)
)

┌─2_3─┬─1_4─┬─1_2_3─┐
│   3 │   0 │     2 │
└─────┴─────┴───────┘

ClickHouse’s -State combinator stores serialized data in a consistent manner, irrespective of conditions used with -If. The same applies to a wide range of functions, including quantile and sequence-based functions. This behavior ensures that functions like maxMerge, quantileMerge, sequenceMatchMerge, and sequenceCountMerge work seamlessly, even across varied inputs.

2.12 - ALTER MODIFY COLUMN is stuck, the column is inaccessible.

ALTER MODIFY COLUMN is stuck, the column is inaccessible.

Problem

You’ve created a table in ClickHouse with the following structure:

CREATE TABLE modify_column(column_n String) ENGINE=MergeTree() ORDER BY tuple();

You populated the table with some data:

INSERT INTO modify_column VALUES ('key_a');
INSERT INTO modify_column VALUES ('key_b');
INSERT INTO modify_column VALUES ('key_c');

Next, you attempted to change the column type using this query:

ALTER TABLE modify_column MODIFY COLUMN column_n Enum8('key_a'=1, 'key_b'=2);

However, the operation failed, and you encountered an error when inspecting the system.mutations table:

SELECT *
FROM system.mutations
WHERE (table = 'modify_column') AND (is_done = 0)
FORMAT Vertical

Row 1:
──────
database:                   default
table:                      modify_column
mutation_id:                mutation_4.txt
command:                    MODIFY COLUMN `column_n` Enum8('key_a' = 1, 'key_b' = 2)
create_time:                2021-03-03 18:38:09
block_numbers.partition_id: ['']
block_numbers.number:       [4]
parts_to_do_names:          ['all_3_3_0']
parts_to_do:                1
is_done:                    0
latest_failed_part:         all_3_3_0
latest_fail_time:           2021-03-03 18:38:59
latest_fail_reason:         Code: 36, e.displayText() = DB::Exception: Unknown element 'key_c' for type Enum8('key_a' = 1, 'key_b' = 2): while executing 'FUNCTION CAST(column_n :: 0, 'Enum8(\'key_a\' = 1, \'key_b\' = 2)' :: 1) -> cast(column_n, 'Enum8(\'key_a\' = 1, \'key_b\' = 2)') Enum8('key_a' = 1, 'key_b' = 2) : 2': (while reading from part /var/lib/clickhouse/data/default/modify_column/all_3_3_0/): While executing MergeTree (version 21.3.1.6041)

The mutation result showed an error indicating that the value ‘key_c’ was not recognized in the Enum8 definition:

Unknown element 'key_c' for type Enum8('key_a' = 1, 'key_b' = 2)

Now, when trying to query the column, ClickHouse returns an exception and the column becomes inaccessible:

SELECT column_n
FROM modify_column

┌─column_n─┐
│ key_a    │
└──────────┘
┌─column_n─┐
│ key_b    │
└──────────┘
↓ Progress: 2.00 rows, 2.00 B (19.48 rows/s., 19.48 B/s.)
2 rows in set. Elapsed: 0.104 sec.

Received exception from server (version 21.3.1):
Code: 36. DB::Exception: Received from localhost:9000. DB::Exception: Unknown element 'key_c' for type Enum8('key_a' = 1, 'key_b' = 2): while executing 'FUNCTION CAST(column_n :: 0, 'Enum8(\'key_a\' = 1, \'key_b\' = 2)' :: 1) -> cast(column_n, 'Enum8(\'key_a\' = 1, \'key_b\' = 2)') Enum8('key_a' = 1, 'key_b' = 2) : 2': (while reading from part /var/lib/clickhouse/data/default/modify_column/all_3_3_0/): While executing MergeTreeThread.

This query results in:

Code: 36. DB::Exception: Unknown element 'key_c' for type Enum8('key_a' = 1, 'key_b' = 2)

Root Cause

The failure occurred because the Enum8 type only allows for predefined values. Since ‘key_c’ wasn’t included in the definition, the mutation failed and left the table in an inconsistent state.

Solution

Identify and Terminate the Stuck Mutation First, you need to locate the mutation that’s stuck in an incomplete state.

SELECT * FROM system.mutations WHERE table = 'modify_column' AND is_done=0 FORMAT Vertical;

Once you’ve identified the mutation, terminate it using:

KILL MUTATION WHERE table = 'modify_column' AND mutation_id = 'id_of_stuck_mutation';

This will stop the operation and allow you to revert the changes.

Revert the Column Type Next, revert the column back to its original type, which was String, to restore the table’s accessibility:

ALTER TABLE modify_column MODIFY COLUMN column_n String;

Verify the Column is Accessible Again To ensure the column is functioning normally, run a simple query to verify its data:

SELECT column_n, count() FROM modify_column GROUP BY column_n;

Apply the Correct Column Modification Now that the column is accessible, you can safely reapply the ALTER query, but this time include all the required enum values:

ALTER TABLE modify_column MODIFY COLUMN column_n Enum8('key_a'=1, 'key_b'=2, 'key_c'=3);

Monitor Progress You can monitor the progress of the column modification using the system.mutations or system.parts_columns tables to ensure everything proceeds as expected:

To track mutation progress:

SELECT
    command,
    parts_to_do,
    is_done
FROM system.mutations
WHERE table = 'modify_column';

To review the column’s active parts:

SELECT
    column,
    type,
    count() AS parts,
    sum(rows) AS rows,
    sum(bytes_on_disk) AS bytes
FROM system.parts_columns
WHERE (table = 'modify_column') AND (column = 'column_n') AND active
GROUP BY
    column,
    type;

2.13 - ANSI SQL mode

ANSI SQL mode

To make ClickHouse® more compatible with ANSI SQL standards (at the expense of some performance), you can adjust several settings. These configurations will bring ClickHouse closer to ANSI SQL behavior but may introduce a slowdown in query performance:

join_use_nulls=1

Introduced in: early versions Ensures that JOIN operations return NULL for non-matching rows, aligning with standard SQL behavior.

cast_keep_nullable=1

Introduced in: v20.5 Preserves the NULL flag when casting between data types, which is typical in ANSI SQL.

union_default_mode='DISTINCT'

Introduced in: v21.1 Makes the UNION operation default to UNION DISTINCT, which removes duplicate rows, following ANSI SQL behavior.

allow_experimental_window_functions=1

Introduced in: v21.3 Enables support for window functions, which are a standard feature in ANSI SQL.

prefer_column_name_to_alias=1

Introduced in: v21.4 This setting resolves ambiguities by preferring column names over aliases, following ANSI SQL conventions.

group_by_use_nulls=1

Introduced in: v22.7 Allows NULL values to appear in the GROUP BY clause, consistent with ANSI SQL behavior.

By enabling these settings, ClickHouse becomes more ANSI SQL-compliant, although this may come with a trade-off in terms of performance. Each of these options can be enabled as needed, based on the specific SQL compatibility requirements of your application.

2.14 - Async INSERTs

Comprehensive guide to ClickHouse Async INSERTs - configuration, best practices, and monitoring

Overview

Async INSERTs is a ClickHouse® feature that enables automatic server-side batching of data. While we generally recommend batching at the application/ingestor level for better control and decoupling, async inserts are valuable when you have hundreds or thousands of clients performing small inserts and client-side batching is not feasible.

Key Documentation: Official Async Inserts Documentation

How Async Inserts Work

When async_insert=1 is enabled, ClickHouse buffers incoming inserts and flushes them to disk when one of these conditions is met:

Buffer reaches specified size (async_insert_max_data_size)
Time threshold elapses (async_insert_busy_timeout_ms)
Maximum number of queries accumulate (async_insert_max_query_number)

Critical Configuration Settings

Core Settings

-- Enable async inserts (0=disabled, 1=enabled)
SET async_insert = 1;

-- Wait behavior (STRONGLY RECOMMENDED: use 1)
-- 0 = fire-and-forget mode (risky - no error feedback)
-- 1 = wait for data to be written to storage
SET wait_for_async_insert = 1;

-- Buffer flush conditions
SET async_insert_max_data_size = 1000000;  -- 1MB default
SET async_insert_busy_timeout_ms = 1000;    -- 1 second
SET async_insert_max_query_number = 100;    -- max queries before flush

Adaptive Timeout (Since 24.3)

-- Adaptive timeout automatically adjusts flush timing based on server load
-- Default: 1 (enabled) - OVERRIDES manual timeout settings
-- Set to 0 for deterministic behavior with manual settings
SET async_insert_use_adaptive_busy_timeout = 0;

Important Behavioral Notes

What Works and What Doesn’t

✅ Works with Async Inserts:

Direct INSERT with VALUES
INSERT with FORMAT (JSONEachRow, CSV, etc.)
Native protocol inserts (since 22.x)

❌ Does NOT Work:

INSERT .. SELECT statements - Other strategies are needed for managing performance and load. Do not use async_insert.

Data Safety Considerations

ALWAYS use wait_for_async_insert = 1 in production!

Risks with wait_for_async_insert = 0:

Silent data loss on errors (read-only table, disk full, too many parts)
Data loss on sudden restart (no fsync by default)
Data not immediately queryable after acknowledgment
No error feedback to client

Deduplication Behavior

Sync inserts: Automatic deduplication enabled by default
Async inserts: Deduplication disabled by default
Enable with async_insert_deduplicate = 1 (since 22.x)
Warning: Don’t use with deduplicate_blocks_in_dependent_materialized_views = 1

features / improvements

Async insert dedup: Support block deduplication for asynchronous inserts. Before this change, async inserts did not support deduplication, because multiple small inserts coexisted in one inserted batch:
- #38075
- #43304
Added system table asynchronous_insert_log. It contains information about asynchronous inserts (including results of queries in fire-and-forget mode. (with wait_for_async_insert=0)) for better introspection #42040
Support async inserts in clickhouse-client for queries with inlined data (Native protocol):
Async insert backpressure #4762
Limit the deduplication overhead when using async_insert_deduplicate #46549
SYSTEM FLUSH ASYNC INSERTS #49160
Adjustable asynchronous insert timeouts #58486

bugfixes

Fixed bug which could lead to deadlock while using asynchronous inserts #43233 .
Fix crash when async inserts with deduplication are used for ReplicatedMergeTree tables using a nondefault merging algorithm #51676
Async inserts not working with log_comment setting 48430
Fix misbehaviour with async inserts with deduplication #50663
Reject Insert if async_insert=1 and deduplicate_blocks_in_dependent_materialized_views=1#60888
Disable async_insert_use_adaptive_busy_timeout correctly with compatibility settings #61486

observability / introspection

In 22.x versions, it is not possible to relate part_log/query_id column with asynchronous_insert_log/query_id column. We need to use query_log/query_id:

asynchronous_insert_log shows up the query_id and flush_query_id of each async insert. The query_id from asynchronous_insert_log shows up in the system.query_log as type = 'QueryStart' but the same query_id does not show up in the query_id column of the system.part_log. Because the query_id column in the part_log is the identifier of the INSERT query that created a data part, and it seems it is for sync INSERTS but not for async inserts.

So in asynchronous_inserts table you can check the current batch that still has not been flushed. In the asynchronous_insert_log you can find a log of all the flushed async inserts.

This has been improved in ClickHouse 23.7 Flush queries for async inserts (the queries that do the final push of data) are now logged in the system.query_log where they appear as query_kind = 'AsyncInsertFlush' #51160

Versions

23.8 is a good version to start using async inserts because of the improvements and bugfixes.
24.3 the new adaptive timeout mechanism has been added so ClickHouse will throttle the inserts based on the server load.#58486 This new feature is enabled by default and will OVERRRIDE current async insert settings, so better to disable it if your async insert settings are working. Here’s how to do it in a clickhouse-client session: SET async_insert_use_adaptive_busy_timeout = 0; You can also add it as a setting on the INSERT or as a profile setting.

Metrics

SELECT name
FROM system.columns
WHERE (`table` = 'metric_log') AND ((name ILIKE '%asyncinsert%') OR (name ILIKE '%asynchronousinsert%'))

┌─name─────────────────────────────────────────────┐
│ ProfileEvent_AsyncInsertQuery                    │
│ ProfileEvent_AsyncInsertBytes                    │
│ ProfileEvent_AsyncInsertRows                     │
│ ProfileEvent_AsyncInsertCacheHits                │
│ ProfileEvent_FailedAsyncInsertQuery              │
│ ProfileEvent_DistributedAsyncInsertionFailures   │
│ CurrentMetric_AsynchronousInsertThreads          │
│ CurrentMetric_AsynchronousInsertThreadsActive    │
│ CurrentMetric_AsynchronousInsertThreadsScheduled │
│ CurrentMetric_AsynchronousInsertQueueSize        │
│ CurrentMetric_AsynchronousInsertQueueBytes       │
│ CurrentMetric_PendingAsyncInsert                 │
│ CurrentMetric_AsyncInsertCacheSize               │
└──────────────────────────────────────────────────┘

SELECT *
FROM system.metrics
WHERE (metric ILIKE '%asyncinsert%') OR (metric ILIKE '%asynchronousinsert%')

┌─metric─────────────────────────────┬─value─┬─description─────────────────────────────────────────────────────────────┐
│ AsynchronousInsertThreads          │     1 │ Number of threads in the AsynchronousInsert thread pool.                │
│ AsynchronousInsertThreadsActive    │     0 │ Number of threads in the AsynchronousInsert thread pool running a task. │
│ AsynchronousInsertThreadsScheduled │     0 │ Number of queued or active jobs in the AsynchronousInsert thread pool.  │
│ AsynchronousInsertQueueSize        │     1 │ Number of pending tasks in the AsynchronousInsert queue.                │
│ AsynchronousInsertQueueBytes       │   680 │ Number of pending bytes in the AsynchronousInsert queue.                │
│ PendingAsyncInsert                 │     7 │ Number of asynchronous inserts that are waiting for flush.              │
│ AsyncInsertCacheSize               │     0 │ Number of async insert hash id in cache                                 │
└────────────────────────────────────┴───────┴─────────────────────────────────────────────────────────────────────────┘

2.15 - Atomic insert

Atomic insert

An insert is atomic if it creates only one part.

An insert will create one part if:

Data is inserted directly into a MergeTree table
Data is inserted into a single partition.
Smaller blocks are properly squashed up to the configured block size (min_insert_block_size_rows and min_insert_block_size_bytes)
For INSERT FORMAT:
- Number of rows is less than max_insert_block_size (default is 1048545)
- Parallel formatting is disabled (For TSV, TSKV, CSV, and JSONEachRow formats setting input_format_parallel_parsing=0 is set).
For INSERT SELECT (including all variants with table functions), data for insert should be created fully deterministically.
- non-deterministic functions there like rand() not used in SELECT
- Number of rows/bytes is less than min_insert_block_size_rows and min_insert_block_size_bytes
- And one of:
  - setting max_threads to 1
  - adding ORDER BY to the table’s DDL (not ordering by tuple)
  - There is some ORDER BY inside SELECT
- See example
The MergeTree table doesn’t have Materialized Views (there is no atomicity Table <> MV)

https://github.com/ClickHouse/ClickHouse/issues/9195#issuecomment-587500824 https://github.com/ClickHouse/ClickHouse/issues/5148#issuecomment-487757235

Example how to make a large insert atomically

Generate test data in Native and TSV format ( 100 millions rows )

Text formats and Native format require different set of settings, here I want to find / demonstrate mandatory minimum of settings for any case.

clickhouse-client -q \
     'SELECT toInt64(number) A, toString(number) S FROM numbers(100000000) FORMAT Native' > t.native
clickhouse-client -q \
     'SELECT toInt64(number) A, toString(number) S FROM numbers(100000000) FORMAT TSV' > t.tsv

Insert with default settings (not atomic)

DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

-- Load data in Native format
clickhouse-client  -q 'INSERT INTO trg FORMAT Native' <t.native

-- Check how many parts is created
SELECT 
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
│      90 │    890935 │   1113585 │ 100000000 │
└─────────┴───────────┴───────────┴───────────┘

--- 90 parts! was created - not atomic



DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

-- Load data in TSV format
clickhouse-client  -q 'INSERT INTO trg FORMAT TSV' <t.tsv

-- Check how many parts is created
SELECT 
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
│      85 │    898207 │   1449610 │ 100000000 │
└─────────┴───────────┴───────────┴───────────┘

--- 85 parts! was created - not atomic

Insert with adjusted settings (atomic)

Atomic insert use more memory because it needs 100 millions rows in memory.

DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

clickhouse-client --input_format_parallel_parsing=0 \
                  --min_insert_block_size_bytes=0 \
                  --min_insert_block_size_rows=1000000000 \
                  -q 'INSERT INTO trg FORMAT Native' <t.native

-- Check that only one part is created
SELECT
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
│       1 │ 100000000 │ 100000000 │ 100000000 │
└─────────┴───────────┴───────────┴───────────┘

-- 1 part, success.



DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

-- Load data in TSV format
clickhouse-client --input_format_parallel_parsing=0 \
                  --min_insert_block_size_bytes=0 \
                  --min_insert_block_size_rows=1000000000 \
                  -q 'INSERT INTO trg FORMAT TSV' <t.tsv

-- Check that only one part is created
SELECT 
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
│       1 │ 100000000 │ 100000000 │ 100000000 │
└─────────┴───────────┴───────────┴───────────┘

-- 1 part, success.

2.16 - ClickHouse® Projections

Using this ClickHouse feature to optimize queries

Projections in ClickHouse act as inner tables within a main table, functioning as a mechanism to optimize queries by using these inner tables when only specific columns are needed. Essentially, a projection is similar to a Materialized View with an AggregatingMergeTree engine , designed to be automatically populated with relevant data.

However, too many projections can lead to excess storage, much like overusing Materialized Views. Projections share the same lifecycle as the main table, meaning they are automatically backfilled and don’t require query rewrites, which is particularly advantageous when integrating with BI tools.

Projection parts are stored within the main table parts, and their merges occur simultaneously as the main table merges, ensuring data consistency without additional maintenance.

compared to a separate table+MV setup:

A separate table gives you more freedom (like partitioning, granularity, etc), but projections - more consistency (parts managed as a whole)
Projections do not support many features (like indexes and FINAL). That becomes better with recent versions, but still a drawback

The design approach for projections is the same as for indexes. Create a table and give it to users. If you encounter a slower query, add a projection for that particular query (or set of similar queries). You can create 10+ projections per table, materialize, drop, etc - the very same as indexes. You exchange query speed for disk space/IO and CPU needed to build and rebuild projections on merges.

Why is a ClickHouse projection not used?

A query analyzer should have a reason for using a projection and should not have any limitation to do so.

the query should use ONLY the columns defined in the projection.
There should be a lot of data to read from the main table (gigabytes)
for ORDER BY projection WHERE statement referring to a column should be in the query
FINAL queries do not work with projections.
tables with DELETEd rows do not work with projections. This is because rows in a projection may be affected by a DELETE operation. But there is a MergeTree setting lightweight_mutation_projection_mode to change the behavior (Since 24.7)
Projection is used only if it is cheaper to read from it than from the table (expected amount of rows and GBs read is smaller)
Projection should be materialized. Verify that all parts have the needed projection by comparing system.parts and system.projection_parts (see query below)
a bug in a Clickhouse version. Look at changelog and search for projection.
If there are many projections per table, the analyzer can select any of them. If you think that it is better, use settings preferred_optimize_projection_name or force_optimize_projection_name
If expressions are used instead of plain column names, the query should use the exact expression as defined in the projection with the same functions and modifiers. Use column aliases to make the query the very same as in the projection definition:

CREATE TABLE test
(
    a Int64,
    ts DateTime,
    week alias toStartOfWeek(ts),
    PROJECTION weekly_projection
    (
        SELECT week, sum(a) group by week
    )
)
ENGINE = MergeTree ORDER BY a;

insert into test
select number, now()-number*100
from numbers(1e7);

--explain indexes=1
select week, sum(a) from test group by week
settings force_optimize_projection=1;

https://fiddle.clickhouse.com/7f331eb2-9408-4813-9c67-caef4cdd227d

Explain result: ReadFromMergeTree (weekly_projection)

Expression ((Project names + Projection))
  Aggregating
    Expression
      ReadFromMergeTree (weekly_projection)
      Indexes:
        PrimaryKey
          Condition: true
          Parts: 9/9
          Granules: 9/1223

check parts

has the projection materialized
does not have lightweight deletes

SELECT
    p.database AS base_database,
    p.table AS base_table,
    p.name AS base_part_name,         -- Name of the part in the base table
    p.has_lightweight_delete,
    pp.active
FROM system.parts AS p  -- Alias for the base table's parts
LEFT JOIN system.projection_parts AS pp -- Alias for the projection's parts
ON    p.database = pp.database AND p.table = pp.table
  AND p.name = pp.parent_name
  AND pp.name = 'projection'
WHERE
    p.database = 'database'
    AND p.table = 'table'
    AND p.active  -- Consider only active parts of the base table
  -- and not pp.active          -- see only missed in the list
ORDER BY p.database, p.table, p.name;

Recalculate on Merge

What happens in the case of non-trivial background merges in ReplacingMergeTree, AggregatingMergeTree and similar, and OPTIMIZE table DEDUPLICATE queries?

Before version 24.8, projections became out of sync with the main data.
Since version 24.8, it is controlled by a new table-level setting:
deduplicate_merge_projection_mode = throw/drop/rebuild
Somewhere later (before 25.3) ignore option was introduced. It can be helpful for cases when SummingMergeTree is used with Projections and no DELETE operation in any flavor (Replacing/Collapsing/DELETE/ALTER DELETE) is executed over the table.

However, projection usage is still disabled for FINAL queries. So, you have to use OPTIMIZE FINAL or SELECT …GROUP BY instead of FINAL for fighting duplicates between parts

CREATE TABLE users (uid Int16, name String, version Int16,
  projection xx (
     select name,uid,version order by name
  )
) ENGINE=ReplacingMergeTree order by uid
settings deduplicate_merge_projection_mode='rebuild'
  ;

INSERT INTO users
SELECT 
    number AS uid,
    concat('User_', toString(uid)) AS name,
    1 AS version  
FROM numbers(100000);

INSERT INTO users
SELECT 
    number AS uid,
    concat('User_', toString(uid)) AS name,
    2 AS version  
FROM numbers(100000);

SELECT 'duplicate',name,uid,version FROM users 
where name ='User_98304' 
settings force_optimize_projection=1 ;

SELECT 'dedup by group by/limit 1 by',name,uid,version FROM users 
where name ='User_98304' 
order by version DESC
limit 1 by uid
settings force_optimize_projection=1
;

optimize table users final ;

SELECT 'dedup after optimize',name,uid,version FROM users 
where name ='User_98304' 
settings force_optimize_projection=1 ;

https://fiddle.clickhouse.com/e1977a66-09ce-43c4-aabc-508c957d44d7

System tables

system.projections
system.projection_parts
system.projection_parts_columns

SELECT
    database,
    table,
    name,
    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_rate,
    sum(rows) AS rows,
    count() AS part_count
FROM system.projection_parts
WHERE active
GROUP BY
    database,
    table,
    name
ORDER BY size DESC;

How to receive a list of tables with projections?

select database, table from system.tables
where create_table_query ilike '%projection%'
  and database <> 'system'

Examples

Aggregating ClickHouse projections

create table z(Browser String, Country UInt8, F Float64)
Engine=MergeTree
order by Browser;

insert into z
     select toString(number%9999),
     number%33, 1
from numbers(100000000);

--Q1)
select sum(F), Browser
from z
group by Browser format Null;
Elapsed: 0.205 sec. Processed 100.00 million rows

--Q2)
select sum(F), Browser, Country
from z
group by Browser,Country format Null;
Elapsed: 0.381 sec. Processed 100.00 million rows

--Q3)
select sum(F),count(), Browser, Country
from z
group by Browser,Country format Null;
Elapsed: 0.398 sec. Processed 100.00 million rows

alter table z add projection pp
   (select Browser,Country, count(), sum(F)
    group by Browser,Country);
alter table z materialize projection pp;

---- 0 = don't use proj, 1 = use projection
set allow_experimental_projection_optimization=1;

--Q1)
select sum(F), Browser
from z
group by Browser format Null;
Elapsed: 0.003 sec. Processed 22.43 thousand rows

--Q2)
select sum(F), Browser, Country
from z
group by Browser,Country format Null;
Elapsed: 0.004 sec. Processed 22.43 thousand rows

--Q3)
select sum(F),count(), Browser, Country
from z
group by Browser,Country format Null;
Elapsed: 0.005 sec. Processed 22.43 thousand rows

Emulation of an inverted index using orderby projection

You can create an orderby projection and include all columns of a table, but if a table is very wide it will double the amount of stored data. This example demonstrate a trick, we create an orderby projection and include primary key columns and the target column and sort by the target column. This allows using subquery to find primary key values and after that to query the table using the primary key.

CREATE TABLE test_a
(
    `src` String,
    `dst` String,
    `other_cols` String,
    PROJECTION p1
    (
        SELECT
            src,
            dst
        ORDER BY dst
    )
)
ENGINE = MergeTree
ORDER BY src;

insert into test_a select number, -number, 'other_col '||toString(number) from numbers(1e8);

select * from test_a where src='42';
┌─src─┬─dst─┬─other_cols───┐
│ 42  │ -42 │ other_col 42 │
└─────┴─────┴──────────────┘
1 row in set. Elapsed: 0.005 sec. Processed 16.38 thousand rows, 988.49 KB (3.14 million rows/s., 189.43 MB/s.)


select * from test_a where dst='-42';
┌─src─┬─dst─┬─other_cols───┐
│ 42  │ -42 │ other_col 42 │
└─────┴─────┴──────────────┘
1 row in set. Elapsed: 0.625 sec. Processed 100.00 million rows, 1.79 GB (160.05 million rows/s., 2.86 GB/s.)

-- optimization using projection
select * from test_a where src in (select src from test_a where dst='-42') and dst='-42';
┌─src─┬─dst─┬─other_cols───┐
│ 42  │ -42 │ other_col 42 │
└─────┴─────┴──────────────┘
1 row in set. Elapsed: 0.013 sec. Processed 32.77 thousand rows, 660.75 KB (2.54 million rows/s., 51.26 MB/s.)

Elapsed: 0.625 sec. Processed 100.00 million rows – not optimized

Elapsed: 0.013 sec. Processed 32.77 thousand rows – optimized

2.17 - Cumulative Anything

Cumulative Anything

Sample data

CREATE TABLE events
(
    `ts` DateTime,
    `user_id` UInt32
)
ENGINE = Memory;

INSERT INTO events SELECT
    toDateTime('2021-04-29 10:10:10') + toIntervalHour(7 * number) AS ts,
    toDayOfWeek(ts) + (number % 2) AS user_id
FROM numbers(15);

Using window functions (starting from ClickHouse® 21.3)

SELECT
    toStartOfDay(ts) AS ts,
    uniqExactMerge(uniqExactState(user_id)) OVER (ORDER BY ts ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS uniq
FROM events
GROUP BY ts
ORDER BY ts ASC

┌──────────────────ts─┬─uniq─┐
│ 2021-04-29 00:00:00 │    2 │
│ 2021-04-30 00:00:00 │    3 │
│ 2021-05-01 00:00:00 │    4 │
│ 2021-05-02 00:00:00 │    5 │
│ 2021-05-03 00:00:00 │    7 │
└─────────────────────┴──────┘

SELECT
    ts,
    uniqExactMerge(state) OVER (ORDER BY ts ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS uniq
FROM
(
    SELECT
        toStartOfDay(ts) AS ts,
        uniqExactState(user_id) AS state
    FROM events
    GROUP BY ts
)
ORDER BY ts ASC

┌──────────────────ts─┬─uniq─┐
│ 2021-04-29 00:00:00 │    2 │
│ 2021-04-30 00:00:00 │    3 │
│ 2021-05-01 00:00:00 │    4 │
│ 2021-05-02 00:00:00 │    5 │
│ 2021-05-03 00:00:00 │    7 │
└─────────────────────┴──────┘

Using arrays

WITH
    groupArray(_ts) AS ts_arr,
    groupArray(state) AS state_arr
SELECT
    arrayJoin(ts_arr) AS ts,
    arrayReduce('uniqExactMerge', arrayFilter((x, y) -> (y <= ts), state_arr, ts_arr)) AS uniq
FROM
(
    SELECT
        toStartOfDay(ts) AS _ts,
        uniqExactState(user_id) AS state
    FROM events
    GROUP BY _ts
)
ORDER BY ts ASC

┌──────────────────ts─┬─uniq─┐
│ 2021-04-29 00:00:00 │    2 │
│ 2021-04-30 00:00:00 │    3 │
│ 2021-05-01 00:00:00 │    4 │
│ 2021-05-02 00:00:00 │    5 │
│ 2021-05-03 00:00:00 │    7 │
└─────────────────────┴──────┘

WITH arrayJoin(range(toUInt32(_ts) AS int, least(int + toUInt32((3600 * 24) * 5), toUInt32(toDateTime('2021-05-04 00:00:00'))), 3600 * 24)) AS ts_expanded
SELECT
    toDateTime(ts_expanded) AS ts,
    uniqExactMerge(state) AS uniq
FROM
(
    SELECT
        toStartOfDay(ts) AS _ts,
        uniqExactState(user_id) AS state
    FROM events
    GROUP BY _ts
)
GROUP BY ts
ORDER BY ts ASC

┌──────────────────ts─┬─uniq─┐
│ 2021-04-29 00:00:00 │    2 │
│ 2021-04-30 00:00:00 │    3 │
│ 2021-05-01 00:00:00 │    4 │
│ 2021-05-02 00:00:00 │    5 │
│ 2021-05-03 00:00:00 │    7 │
└─────────────────────┴──────┘

Using runningAccumulate (incorrect result over blocks)

SELECT
    ts,
    runningAccumulate(state) AS uniq
FROM
(
    SELECT
        toStartOfDay(ts) AS ts,
        uniqExactState(user_id) AS state
    FROM events
    GROUP BY ts
    ORDER BY ts ASC
)
ORDER BY ts ASC

┌──────────────────ts─┬─uniq─┐
│ 2021-04-29 00:00:00 │    2 │
│ 2021-04-30 00:00:00 │    3 │
│ 2021-05-01 00:00:00 │    4 │
│ 2021-05-02 00:00:00 │    5 │
│ 2021-05-03 00:00:00 │    7 │
└─────────────────────┴──────┘

2.18 - Data types on disk and in RAM

Data types on disk and in RAM

DataType	RAM size (=byteSize)	Disk Size
String	string byte length + 9 string length: 64 bit integer zero-byte terminator: 1 byte.	string length prefix (varint) + string itself: string shorter than 128 - string byte length + 1 string shorter than 16384 - string byte length + 2 string shorter than 2097152 - string byte length + 2 string shorter than 268435456 - string byte length + 4
AggregateFunction(count, ...)		varint

DataType

RAM size (=byteSize)

Disk Size

String

string byte length + 9

string length: 64 bit integer

zero-byte terminator: 1 byte.

string length prefix (varint) + string itself:

string shorter than 128 - string byte length + 1
string shorter than 16384 - string byte length + 2
string shorter than 2097152 - string byte length + 2
string shorter than 268435456 - string byte length + 4

AggregateFunction(count, ...)

varint

See also the presentation Data processing into ClickHouse® , especially slides 17-22.

2.19 - DELETE via tombstone column

DELETE via tombstone column

This article provides an overview of the different methods to handle row deletion in ClickHouse, using tombstone columns and ALTER UPDATE or DELETE. The goal is to highlight the performance impacts of different techniques and storage settings, including a scenario using S3 for remote storage.

Creating a Test Table We will start by creating a simple MergeTree table with a tombstone column (is_active) to track active rows:

CREATE TABLE test_delete
(
    `key` UInt32,
    `ts` UInt32,
    `value_a` String,
    `value_b` String,
    `value_c` String,
    `is_active` UInt8 DEFAULT 1
)
ENGINE = MergeTree
ORDER BY key;

Inserting Data Insert sample data into the table:

INSERT INTO test_delete (key, ts, value_a, value_b, value_c) SELECT
    number,
    1,
    concat('some_looong_string', toString(number)),
    concat('another_long_str', toString(number)),
    concat('string', toString(number))
FROM numbers(10000000);


INSERT INTO test_delete (key, ts, value_a, value_b, value_c) VALUES (400000, 2, 'totally different string', 'another totally different string', 'last string');

Querying the Data To verify the inserted data:

SELECT *
FROM test_delete
WHERE key = 400000;

┌────key─┬─ts─┬─value_a──────────────────┬─value_b──────────────────────────┬─value_c─────┬─is_active─┐
│ 400000 │  2 │ totally different string │ another totally different string │ last string │         1 │
└────────┴────┴──────────────────────────┴──────────────────────────────────┴─────────────┴───────────┘
┌────key─┬─ts─┬─value_a──────────────────┬─value_b────────────────┬─value_c──────┬─is_active─┐
│ 400000 │  1 │ some_looong_string400000 │ another_long_str400000 │ string400000 │         1 │
└────────┴────┴──────────────────────────┴────────────────────────┴──────────────┴───────────┘

This should return two rows with different ts values.

Soft Deletion Using ALTER UPDATE Instead of deleting a row, you can mark it as inactive by setting is_active to 0:


SET mutations_sync = 2;

ALTER TABLE test_delete
    UPDATE is_active = 0 WHERE (key = 400000) AND (ts = 1);
Ok.

0 rows in set. Elapsed: 0.058 sec.

After updating, you can filter out inactive rows:

SELECT *
FROM test_delete
WHERE (key = 400000) AND is_active=0;

┌────key─┬─ts─┬─value_a──────────────────┬─value_b────────────────┬─value_c──────┬─is_active─┐
│ 400000 │  1 │ some_looong_string400000 │ another_long_str400000 │ string400000 │         0 │
└────────┴────┴──────────────────────────┴────────────────────────┴──────────────┴───────────┘

Hard Deletion Using ALTER DELETE If you need to completely remove a row from the table, you can use ALTER DELETE:

ALTER TABLE test_delete
    DELETE WHERE (key = 400000) AND (ts = 1);

Ok.

0 rows in set. Elapsed: 1.101 sec. -- 20 times slower!!!

However, this operation is significantly slower compared to the ALTER UPDATE approach. For example:

ALTER DELETE: Takes around 1.1 seconds ALTER UPDATE: Only 0.05 seconds

The reason for this difference is that DELETE modifies the physical data structure, while UPDATE merely changes a column value.

SELECT *
FROM test_delete
WHERE key = 400000;

┌────key─┬─ts─┬─value_a──────────────────┬─value_b──────────────────────────┬─value_c─────┬─is_active─┐
│ 400000 │  2 │ totally different string │ another totally different string │ last string │         1 │
└────────┴────┴──────────────────────────┴──────────────────────────────────┴─────────────┴───────────┘

-- For ReplacingMergeTree -> https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree

OPTIMIZE TABLE test_delete FINAL;

Ok.

0 rows in set. Elapsed: 2.230 sec. -- 40 times slower!!!

SELECT *
FROM test_delete
WHERE key = 400000

┌────key─┬─ts─┬─value_a──────────────────┬─value_b──────────────────────────┬─value_c─────┬─is_active─┐
│ 400000 │  2 │ totally different string │ another totally different string │ last string │         1 │
└────────┴────┴──────────────────────────┴──────────────────────────────────┴─────────────┴───────────┘

Soft Deletion (via ALTER UPDATE): A quicker approach that does not involve physical data deletion but rather updates the tombstone column. Hard Deletion (via ALTER DELETE): Can take significantly longer, especially with large datasets stored in remote storage like S3.

Optimizing for Faster Deletion with S3 Storage If using S3 for storage, the DELETE operation becomes even slower due to the overhead of handling remote data. Here’s an example with a table using S3-backed storage:

CREATE TABLE test_delete
(
    `key` UInt32,
    `value_a` String,
    `value_b` String,
    `value_c` String,
    `is_deleted` UInt8 DEFAULT 0
)
ENGINE = MergeTree
ORDER BY key
SETTINGS storage_policy = 's3tiered';

INSERT INTO test_delete (key, value_a, value_b, value_c) SELECT
    number,
    concat('some_looong_string', toString(number)),
    concat('another_long_str', toString(number)),
    concat('really long string', toString(arrayMap(i -> cityHash64(i*number), range(50))))
FROM numbers(10000000);

OPTIMIZE TABLE test_delete FINAL;

ALTER TABLE test_delete MOVE PARTITION tuple() TO DISK 's3disk';

SELECT count() FROM test_delete;
┌──count()─┐
│ 10000000 │
└──────────┘
1 row in set. Elapsed: 0.002 sec.

DELETE Using ALTER UPDATE and Row Policy You can also control visibility at the query level using row policies. For example, to only show rows where is_active = 1:

To delete a row using ALTER UPDATE:

CREATE ROW POLICY pol1 ON test_delete USING is_active=1 TO all;

SELECT count() FROM test_delete;  -- select count() became much slower, it reads data now, not metadata
┌──count()─┐
│ 10000000 │
└──────────┘
1 row in set. Elapsed: 0.314 sec. Processed 10.00 million rows, 10.00 MB (31.84 million rows/s., 31.84 MB/s.)

ALTER TABLE test_delete UPDATE is_active = 0 WHERE (key = 400000) settings mutations_sync = 2;
0 rows in set. Elapsed: 1.256 sec.

SELECT count() FROM test_delete;
┌─count()─┐
│ 9999999 │
└─────────┘

This impacts the performance of queries like SELECT count(), as ClickHouse now needs to scan data instead of reading metadata.

DELETE Using ALTER DELETE - https://clickhouse.com/docs/en/sql-reference/statements/alter/delete To delete a row using ALTER DELETE:

ALTER TABLE test_delete DELETE WHERE (key = 400001) settings mutations_sync = 2;
0 rows in set. Elapsed: 955.672 sec.

SELECT count() FROM test_delete;
┌─count()─┐
│ 9999998 │
└─────────┘

This operation may take significantly longer compared to soft deletions (around 955 seconds in this example for large datasets):

DELETE Using DELETE Statement - https://clickhouse.com/docs/en/sql-reference/statements/delete The DELETE statement can also be used to remove data from a table:

DELETE FROM test_delete WHERE (key = 400002);
0 rows in set. Elapsed: 1.281 sec.

SELECT count() FROM test_delete;
┌─count()─┐
│ 9999997 │
└─────────┘

This operation is faster, with an elapsed time of around 1.28 seconds in this case:

The choice between ALTER UPDATE and ALTER DELETE depends on your use case. For soft deletes, updating a tombstone column is significantly faster and easier to manage. However, if you need to physically remove rows, be mindful of the performance costs, especially with remote storage like S3.

2.20 - EXPLAIN query

EXPLAIN query

EXPLAIN types

EXPLAIN AST
        SYNTAX
        PLAN indexes = 0,
             header = 0,
             description = 1,
             actions = 0,
             optimize = 1
             json = 0
        PIPELINE header = 0,
                 graph = 0,
                 compact = 1
        ESTIMATE
SELECT ...

AST - abstract syntax tree
SYNTAX - query text after AST-level optimizations
PLAN - query execution plan
PIPELINE - query execution pipeline
ESTIMATE - See Estimates for select query , available since ClickHouse® 21.9
indexes=1 supported starting from 21.6 (https://github.com/ClickHouse/ClickHouse/pull/22352 )
json=1 supported starting from 21.6 (https://github.com/ClickHouse/ClickHouse/pull/23082 )

References

https://clickhouse.com/docs/en/sql-reference/statements/explain/
Nikolai Kochetov from Yandeх. EXPLAIN query in ClickHouse. slides , video
https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup39/query-profiling.pdf
https://github.com/ClickHouse/ClickHouse/issues/28847

2.21 - Fill missing values at query time

Fill missing values at query time

CREATE TABLE event_table
(
    `key` UInt32,
    `created_at` DateTime,
    `value_a` UInt32,
    `value_b` String
)
ENGINE = MergeTree
ORDER BY (key, created_at)

INSERT INTO event_table SELECT
    1 AS key,
    toDateTime('2020-10-11 10:10:10') + number AS created_at,
    if((number = 0) OR ((number % 5) = 1), number + 1, 0) AS value_a,
    if((number = 0) OR ((number % 3) = 1), toString(number), '') AS value_b
FROM numbers(10)

SELECT
    main.key,
    main.created_at,
    a.value_a,
    b.value_b
FROM event_table AS main
ASOF INNER JOIN
(
    SELECT
        key,
        created_at,
        value_a
    FROM event_table
    WHERE value_a != 0
) AS a ON (main.key = a.key) AND (main.created_at >= a.created_at)
ASOF INNER JOIN
(
    SELECT
        key,
        created_at,
        value_b
    FROM event_table
    WHERE value_b != ''
) AS b ON (main.key = b.key) AND (main.created_at >= b.created_at)

┌─main.key─┬─────main.created_at─┬─a.value_a─┬─b.value_b─┐
│        1 │ 2020-10-11 10:10:10 │         1 │ 0         │
│        1 │ 2020-10-11 10:10:11 │         2 │ 1         │
│        1 │ 2020-10-11 10:10:12 │         2 │ 1         │
│        1 │ 2020-10-11 10:10:13 │         2 │ 1         │
│        1 │ 2020-10-11 10:10:14 │         2 │ 4         │
│        1 │ 2020-10-11 10:10:15 │         2 │ 4         │
│        1 │ 2020-10-11 10:10:16 │         7 │ 4         │
│        1 │ 2020-10-11 10:10:17 │         7 │ 7         │
│        1 │ 2020-10-11 10:10:18 │         7 │ 7         │
│        1 │ 2020-10-11 10:10:19 │         7 │ 7         │
└──────────┴─────────────────────┴───────────┴───────────┘

SELECT
    key,
    created_at,
    value_a,
    value_b
FROM
(
    SELECT
        key,
        groupArray(created_at) AS created_arr,
        arrayFill(x -> (x != 0), groupArray(value_a)) AS a_arr,
        arrayFill(x -> (x != ''), groupArray(value_b)) AS b_arr
    FROM
    (
        SELECT *
        FROM event_table
        ORDER BY
            key ASC,
            created_at ASC
    )
    GROUP BY key
)
ARRAY JOIN
    created_arr AS created_at,
    a_arr AS value_a,
    b_arr AS value_b

┌─key─┬──────────created_at─┬─value_a─┬─value_b─┐
│   1 │ 2020-10-11 10:10:10 │       1 │ 0       │
│   1 │ 2020-10-11 10:10:11 │       2 │ 1       │
│   1 │ 2020-10-11 10:10:12 │       2 │ 1       │
│   1 │ 2020-10-11 10:10:13 │       2 │ 1       │
│   1 │ 2020-10-11 10:10:14 │       2 │ 4       │
│   1 │ 2020-10-11 10:10:15 │       2 │ 4       │
│   1 │ 2020-10-11 10:10:16 │       7 │ 4       │
│   1 │ 2020-10-11 10:10:17 │       7 │ 7       │
│   1 │ 2020-10-11 10:10:18 │       7 │ 7       │
│   1 │ 2020-10-11 10:10:19 │       7 │ 7       │
└─────┴─────────────────────┴─────────┴─────────┘

2.22 - FINAL clause speed

FINAL clause speed

SELECT * FROM table FINAL

History

Before ClickHouse® 20.5 - always executed in a single thread and slow.
Since 20.5 - final can be parallel, see https://github.com/ClickHouse/ClickHouse/pull/10463
Since 20.10 - you can use do_not_merge_across_partitions_select_final setting. See https://github.com/ClickHouse/ClickHouse/pull/15938 and https://github.com/ClickHouse/ClickHouse/issues/11722
Since 22.6 - final even more parallel, see https://github.com/ClickHouse/ClickHouse/pull/36396
Since 22.8 - final doesn’t read excessive data, see https://github.com/ClickHouse/ClickHouse/pull/47801
Since 23.5 - final use less memory, see https://github.com/ClickHouse/ClickHouse/pull/50429
Since 23.9 - final doesn’t read PK columns if unneeded ie only one part in partition, see https://github.com/ClickHouse/ClickHouse/pull/53919
Since 23.12 - final applied only for intersecting ranges of parts, see https://github.com/ClickHouse/ClickHouse/pull/58120
Since 24.1 - final doesn’t compare rows from the same part with level > 0, see https://github.com/ClickHouse/ClickHouse/pull/58142
Since 24.1 - final use vertical algorithm (more cache friendly), see https://github.com/ClickHouse/ClickHouse/pull/54366
Since 25.6 - final supports skip indexes (use_skip_indexes_if_final=1 by default)
Since 25.12 - apply_prewhere_after_final and apply_row_policy_after_final settings for correct PREWHERE/row policy handling with FINAL
Since 26.2 - enable_automatic_decision_for_merging_across_partitions_for_final=1 by default (auto-enables cross-partition optimization when safe)

Partitioning

Proper partition design could speed up FINAL processing.

For example, if you have a table with Daily partitioning, you can:

After day end + some time interval during which you can get some updates run OPTIMIZE TABLE xxx PARTITION 'prev_day' FINAL
or add table SETTINGS min_age_to_force_merge_seconds=86400,min_age_to_force_merge_on_partition_only=1

In that case, using FINAL with do_not_merge_across_partitions_select_final will be cheap or even zero.

Example:

DROP TABLE IF EXISTS repl_tbl;

CREATE TABLE repl_tbl
(
    `key` UInt32,
    `val_1` UInt32,
    `val_2` String,
    `val_3` String,
    `val_4` String,
    `val_5` UUID,
    `ts` DateTime
)
ENGINE = ReplacingMergeTree(ts)
PARTITION BY toDate(ts)
ORDER BY key;


INSERT INTO repl_tbl SELECT number as key, rand() as val_1, randomStringUTF8(10) as val_2, randomStringUTF8(5) as val_3, randomStringUTF8(4) as val_4, generateUUIDv4() as val_5, '2020-01-01 00:00:00' as ts FROM numbers(10000000);
OPTIMIZE TABLE repl_tbl PARTITION ID '20200101' FINAL;
INSERT INTO repl_tbl SELECT number as key, rand() as val_1, randomStringUTF8(10) as val_2, randomStringUTF8(5) as val_3, randomStringUTF8(4) as val_4, generateUUIDv4() as val_5, '2020-01-02 00:00:00' as ts FROM numbers(10000000);
OPTIMIZE TABLE repl_tbl PARTITION ID '20200102' FINAL;
INSERT INTO repl_tbl SELECT number as key, rand() as val_1, randomStringUTF8(10) as val_2, randomStringUTF8(5) as val_3, randomStringUTF8(4) as val_4, generateUUIDv4() as val_5, '2020-01-03 00:00:00' as ts FROM numbers(10000000);
OPTIMIZE TABLE repl_tbl PARTITION ID '20200103' FINAL;
INSERT INTO repl_tbl SELECT number as key, rand() as val_1, randomStringUTF8(10) as val_2, randomStringUTF8(5) as val_3, randomStringUTF8(4) as val_4, generateUUIDv4() as val_5, '2020-01-04 00:00:00' as ts FROM numbers(10000000);
OPTIMIZE TABLE repl_tbl PARTITION ID '20200104' FINAL;

SYSTEM STOP MERGES repl_tbl;
INSERT INTO repl_tbl SELECT number as key, rand() as val_1, randomStringUTF8(10) as val_2, randomStringUTF8(5) as val_3, randomStringUTF8(4) as val_4, generateUUIDv4() as val_5, '2020-01-05 00:00:00' as ts FROM numbers(10000000);


SELECT count() FROM repl_tbl WHERE NOT ignore(*)

┌──count()─┐
│ 50000000 │
└──────────┘

1 rows in set. Elapsed: 1.504 sec. Processed 50.00 million rows, 6.40 GB (33.24 million rows/s., 4.26 GB/s.)

SELECT count() FROM repl_tbl FINAL WHERE NOT ignore(*)

┌──count()─┐
│ 10000000 │
└──────────┘

1 rows in set. Elapsed: 3.314 sec. Processed 50.00 million rows, 6.40 GB (15.09 million rows/s., 1.93 GB/s.)

/* more that 2 time slower, and will get worse once you will have more data */

set do_not_merge_across_partitions_select_final=1;

SELECT count() FROM repl_tbl FINAL WHERE NOT ignore(*)

┌──count()─┐
│ 50000000 │
└──────────┘

1 rows in set. Elapsed: 1.850 sec. Processed 50.00 million rows, 6.40 GB (27.03 million rows/s., 3.46 GB/s.)

/* only 0.35 sec slower, and while partitions have about the same size that extra cost will be about constant */

Since 26.2, enable_automatic_decision_for_merging_across_partitions_for_final=1 (default) auto-enables this when partition key columns are included in PRIMARY KEY

Light ORDER BY

All columns specified in ORDER BY will be read during FINAL processing, creating additional disk load. Use fewer columns and lighter column types to create faster queries.

Example: UUID vs UInt64

CREATE TABLE uuid_table (id UUID, value UInt64)    ENGINE = ReplacingMergeTree() ORDER BY id;
CREATE TABLE uint64_table (id UInt64,value UInt64) ENGINE = ReplacingMergeTree() ORDER BY id;

INSERT INTO uuid_table SELECT generateUUIDv4(), number FROM numbers(5E7);
INSERT INTO uint64_table SELECT number, number         FROM numbers(5E7);

SELECT sum(value) FROM uuid_table   FINAL format JSON;
SELECT sum(value) FROM uint64_table FINAL format JSON;

Results :

		"elapsed": 0.58738197,
		"rows_read": 50172032,
		"bytes_read": 1204128768

		"elapsed": 0.189792142,
		"rows_read": 50057344,
		"bytes_read": 480675040

Vertical FINAL Algorithm (24.1+)

When enable_vertical_final=1 (default since 24.1), ClickHouse uses a different deduplication strategy:

Marks duplicate rows as deleted instead of merging them immediately
Filters deleted rows in a later processing step
Reads different columns from different parts in parallel

This improves performance for queries that read only a subset of columns, as non-ORDER BY columns can be read independently from different parts.

PREWHERE and Row Policies with FINAL (25.12+)

By default, PREWHERE and row policies are applied before FINAL deduplication. This can cause incorrect results when:

PREWHERE references columns that differ across duplicate rows
Row policies should filter based on the “winning” row values after deduplication

Use these settings when needed:

apply_prewhere_after_final=1 - Apply PREWHERE after deduplication
apply_row_policy_after_final=1 - Apply row policies after deduplication

Example problem: if you have ReplacingMergeTree with a deleted column and PREWHERE filters on it, without apply_prewhere_after_final=1 you may get wrong results because PREWHERE sees rows before FINAL picks the winner.

FINAL with skip indexes:

Both use_skip_indexes_if_final and use_skip_indexes_if_final_exact_mode are enabled by default since 25.6
Skip indexes on PRIMARY KEY columns have lower overhead (no extra rescan needed since 26.1), see https://github.com/ClickHouse/ClickHouse/pull/78350

Settings reference

Setting	Default	Since	Description
`do_not_merge_across_partitions_select_final`	0	20.10	Skip cross-partition merging when partitions are pre-optimized
`max_final_threads`	0 (auto)	20.5	Thread limit for FINAL processing
`enable_vertical_final`	1	24.1	Read columns in parallel from different parts
`use_skip_indexes_if_final`	1	25.6	Allow skip indexes with FINAL
`use_skip_indexes_if_final_exact_mode`	1	25.6	Rescan newer parts to ensure correctness with skip indexes
`apply_prewhere_after_final`	0	25.12	Apply PREWHERE after deduplication (needed when PREWHERE references non-PK columns)
`enable_automatic_decision_for_merging_across_partitions_for_final`	1	26.2	Auto-enable `do_not_merge_across_partitions_select_final` when partition key is in PK

2.23 - Join with Calendar using Arrays

Join with Calendar using Arrays

Sample data

CREATE TABLE test_metrics (counter_id Int64, timestamp DateTime, metric UInt64)
Engine=Log;

INSERT INTO test_metrics SELECT number % 3,
    toDateTime('2021-01-01 00:00:00'), 1
FROM numbers(20);

INSERT INTO test_metrics SELECT number % 3,
    toDateTime('2021-01-03 00:00:00'), 1
FROM numbers(20);

SELECT counter_id, toDate(timestamp) dt, sum(metric)
FROM test_metrics
GROUP BY counter_id, dt
ORDER BY counter_id, dt;

┌─counter_id─┬─────────dt─┬─sum(metric)─┐
│          0 │ 2021-01-01 │           7 │
│          0 │ 2021-01-03 │           7 │
│          1 │ 2021-01-01 │           7 │
│          1 │ 2021-01-03 │           7 │
│          2 │ 2021-01-01 │           6 │
│          2 │ 2021-01-03 │           6 │
└────────────┴────────────┴─────────────┘

Calendar

WITH arrayMap(i -> (toDate('2021-01-01') + i), range(4)) AS Calendar
SELECT arrayJoin(Calendar);

┌─arrayJoin(Calendar)─┐
│          2021-01-01 │
│          2021-01-02 │
│          2021-01-03 │
│          2021-01-04 │
└─────────────────────┘

Join with Calendar using arrayJoin

SELECT counter_id, tuple.2 dt, sum(tuple.1) sum FROM
  (
  WITH arrayMap(i -> (0, toDate('2021-01-01') + i), range(4)) AS Calendar
   SELECT counter_id, arrayJoin(arrayConcat(Calendar, [(sum, dt)])) tuple
   FROM
             (SELECT counter_id, toDate(timestamp) dt, sum(metric) sum
              FROM test_metrics
              GROUP BY counter_id, dt)
  ) GROUP BY counter_id, dt
    ORDER BY counter_id, dt;

┌─counter_id─┬─────────dt─┬─sum─┐
│          0 │ 2021-01-01 │   7 │
│          0 │ 2021-01-02 │   0 │
│          0 │ 2021-01-03 │   7 │
│          0 │ 2021-01-04 │   0 │
│          1 │ 2021-01-01 │   7 │
│          1 │ 2021-01-02 │   0 │
│          1 │ 2021-01-03 │   7 │
│          1 │ 2021-01-04 │   0 │
│          2 │ 2021-01-01 │   6 │
│          2 │ 2021-01-02 │   0 │
│          2 │ 2021-01-03 │   6 │
│          2 │ 2021-01-04 │   0 │
└────────────┴────────────┴─────┘

With fill

SELECT
    counter_id,
    toDate(timestamp) AS dt,
    sum(metric) AS sum
FROM test_metrics
GROUP BY
    counter_id,
    dt
ORDER BY
    counter_id ASC WITH FILL,
    dt ASC WITH FILL FROM toDate('2021-01-01') TO toDate('2021-01-05');

┌─counter_id─┬─────────dt─┬─sum─┐
│          0 │ 2021-01-01 │   7 │
│          0 │ 2021-01-02 │   0 │
│          0 │ 2021-01-03 │   7 │
│          0 │ 2021-01-04 │   0 │
│          1 │ 2021-01-01 │   7 │
│          1 │ 2021-01-02 │   0 │
│          1 │ 2021-01-03 │   7 │
│          1 │ 2021-01-04 │   0 │
│          2 │ 2021-01-01 │   6 │
│          2 │ 2021-01-02 │   0 │
│          2 │ 2021-01-03 │   6 │
│          2 │ 2021-01-04 │   0 │
└────────────┴────────────┴─────┘

2.24 - JOINs

JOINs

Resources:

Overview of JOINs (Russian) - Presentation from Meetup 38 in 2019
Notes on JOIN options

Join Table Engine

The main purpose of JOIN table engine is to avoid building the right table for joining on each query execution. So it’s usually used when you have a high amount of fast queries which share the same right table for joining.

Updates

It’s possible to update rows with setting join_any_take_last_row enabled.

CREATE TABLE id_val_join
(
    `id` UInt32,
    `val` UInt8
)
ENGINE = Join(ANY, LEFT, id)
SETTINGS join_any_take_last_row = 1

Ok.

INSERT INTO id_val_join VALUES (1,21)(1,22)(3,23);

Ok.

SELECT *
FROM
(
    SELECT toUInt32(number) AS id
    FROM numbers(4)
) AS n
ANY LEFT JOIN id_val_join USING (id)

┌─id─┬─val─┐
│  0 │   0 │
│  1 │  22 │
│  2 │   0 │
│  3 │  23 │
└────┴─────┘

INSERT INTO id_val_join VALUES (1,40)(2,24);

Ok.

SELECT *
FROM
(
    SELECT toUInt32(number) AS id
    FROM numbers(4)
) AS n
ANY LEFT JOIN id_val_join USING (id)

┌─id─┬─val─┐
│  0 │   0 │
│  1 │  40 │
│  2 │  24 │
│  3 │  23 │
└────┴─────┘

Join table engine documentation

2.24.1 - JOIN optimization tricks

All tests below were done with default hash join. ClickHouse joins are evolving rapidly and behavior varies with other join types.

Data

For our exercise, we will use two tables from a well known TPS-DS benchmark: store_sales and customer. Table sizes are the following:

store_sales = 2 billion rows customer = 12 millions rows

So there are 200 rows in store_sales table per each customer on average. Also 90% of customers made 1-10 purchases.

Schema example:

CREATE TABLE store_sales
(
	`ss_sold_time_sk` DateTime,
	`ss_sold_date_sk` Date,
	`ss_ship_date_sk` Date,
	`ss_item_sk` UInt32,
	`ss_customer_sk` UInt32,
	`ss_cdemo_sk` UInt32,
	`ss_hdemo_sk` UInt32,
	`ss_addr_sk` UInt32,
	`ss_store_sk` UInt32,
	`ss_promo_sk` UInt32,
	`ss_ticket_number` UInt32,
	`ss_quantity` UInt32,
	`ss_wholesale_cost` Float64,
	`ss_list_price` Float64,
	`ss_sales_price` Float64,
	`ss_ext_discount_amt` Float64,
	`ss_ext_sales_price` Float64,
	`ss_ext_wholesale_cost` Float64,
	`ss_ext_list_price` Float64,
	`ss_ext_tax` Float64,
	`ss_coupon_amt` Float64,
	`ss_net_paid` Float64,
	`ss_net_paid_inc_tax` Float64,
	`ss_net_profit` Float64
)
ENGINE = MergeTree
ORDER BY ss_ticket_number

CREATE TABLE customer
(
	`c_customer_sk` UInt32,
	`c_current_addr_sk` UInt32,
	`c_first_shipto_date_sk` Date,
	`c_first_sales_date_sk` Date,
	`c_salutation` String,
	`c_c_first_name` String,
	`c_last_name` String,
	`c_preferred_cust_flag` String,
	`c_birth_date` Date,
	`c_birth_country` String,
	`c_login` String,
	`c_email_address` String,
	`c_last_review_date` Date
)
ENGINE = MergeTree
ORDER BY c_customer_id

Target query

SELECT
	sumIf(ss_sales_price, customer.c_first_name = 'James') AS sum_James,
	sumIf(ss_sales_price, customer.c_first_name = 'Lisa') AS sum_Lisa,
	sum(ss_sales_price) AS sum_total
FROM store_sales
INNER JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk

Baseline performance

SELECT
	sumIf(ss_sales_price, customer.c_first_name = 'James') AS sum_James,
	sumIf(ss_sales_price, customer.c_first_name = 'Lisa') AS sum_Lisa,
	sum(ss_sales_price) AS sum_total
FROM store_sales
INNER JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk

0 rows in set. Elapsed: 188.384 sec. Processed 2.89 billion rows, 40.60 GB (15.37 million rows/s., 216.92 MB/s.)

Manual pushdown of conditions

If we look at our query, we only care if sale belongs to customer named James or Lisa and dont care for rest of cases. We can use that.

Usually, ClickHouse is able to pushdown conditions, but not in that case, when conditions itself part of function expression, so you can manually help in those cases.

SELECT  
      sumIf(ss_sales_price, customer.c_first_name = 'James') as sum_James,
    	sumIf(ss_sales_price, customer.c_first_name = 'Lisa') as sum_Lisa,
    	sum(ss_sales_price) as sum_total
FROM store_sales LEFT JOIN (SELECT * FROM customer WHERE c_first_name = 'James' OR c_first_name = 'Lisa') as customer ON store_sales.ss_customer_sk = customer.c_customer_sk

1 row in set. Elapsed: 35.370 sec. Processed 2.89 billion rows, 40.60 GB (81.76 million rows/s., 1.15 GB/s.)

Reduce right table row size

Reduce attribute columns (push expression before JOIN step)

Our row from the right table consists of 2 fields: customer_sk and c_first_name. First one is needed to JOIN by it, so it’s not much we can do here, but we can transform a bit of the second column.

Again, let’s look in how we use this column in main query:

customer.c_first_name = ‘James’ customer.c_first_name = ‘Lisa’

We calculate 2 simple conditions(which don’t have any dependency on data from the left table) and nothing more. It does mean that we can move this calculation to the right table, it will make 3 improvements!

Right table will be smaller -> smaller RAM usage -> better cache hits
We will calculate our conditions over a smaller data set. In the right table we have only 10 million rows and after joining because of the left table we have 2 billion rows -> 200 times improvement!
Our resulting table after JOIN will not have an expensive String column, only 1 byte UInt8 instead -> less copy of data in memory.

Let’s do it:

There are several ways to rewrite that query, let’s not bother with simple once and go straight to most optimized:

Put our 2 conditions in hand-made bitmask:

In order to do that we will take our conditions and multiply them by

(c_first_name = 'James') + (2 * (c_first_name = 'Lisa')

C_first_name	| (c_first_name = 'James') + (2 * (c_first_name = 'Lisa')
   James        |         				00000001
   Lisa        	|         				00000010

As you can see, if you do it in that way, your conditions will not interfere with each other! But we need to be careful with the wideness of the resulting numeric type. Let’s write our calculations in type notation: UInt8 + UInt8*2 -> UInt8 + UInt16 -> UInt32

But we actually do not use more than first 2 bits, so we need to cast this expression back to UInt8

Last thing to do is use the bitTest function in order to get the result of our condition by its position.

And resulting query is:

SELECT
	sumIf(ss_sales_price, bitTest(customer.cond, 0)) AS sum_James,
	sumIf(ss_sales_price, bitTest(customer.cond, 1)) AS sum_Lisa,
	sum(ss_sales_price) AS sum_total
FROM store_sales
LEFT JOIN
(
	SELECT
    	c_customer_sk,
    	((c_first_name = 'James') + (2 * (c_first_name = 'Lisa')))::UInt8 AS cond 	FROM customer
	WHERE (c_first_name = 'James') OR (c_first_name = 'Lisa')
) AS customer ON store_sales.ss_customer_sk = customer.c_customer_sk

1 row in set. Elapsed: 31.699 sec. Processed 2.89 billion rows, 40.60 GB (91.23 million rows/s., 1.28 GB/s.)

Reduce key column size

But can we make something with our JOIN key column?

It’s type is Nullable(UInt64)

Let’s check if we really need to have a 0…18446744073709551615 range for our customer id, it sure looks like that we have much less people on earth than this number. The same about Nullable trait, we don’t care about Nulls in customer_id

SELECT max(c_customer_sk) FROM customer

For sure, we don’t need that wide type. Lets remove Nullable trait and cast column to UInt32, twice smaller in byte size compared to UInt64.

SELECT
	sumIf(ss_sales_price, bitTest(customer.cond, 0)) AS sum_James,
	sumIf(ss_sales_price, bitTest(customer.cond, 1)) AS sum_Lisa,
	sum(ss_sales_price) AS sum_total
FROM store_sales
LEFT JOIN
(
	SELECT
    	CAST(c_customer_sk, 'UInt32') AS c_customer_sk,
    	(c_first_name = 'James') + (2 * (c_first_name = 'Lisa')) AS cond
	FROM customer
	WHERE (c_first_name = 'James') OR (c_first_name = 'Lisa')
) AS customer ON store_sales.ss_customer_sk_nn = customer.c_customer_sk

1 row in set. Elapsed: 27.093 sec. Processed 2.89 billion rows, 26.20 GB (106.74 million rows/s., 967.16 MB/s.)

Another 10% perf improvement from using UInt32 key instead of Nullable(Int64) Looks pretty neat, we almost got 10 times improvement over our initial query. Can we do better?

Probably, but it does mean that we need to get rid of JOIN.

Use IN clause instead of JOIN

Despite that all DBMS support ~ similar feature set, feature performance on different database are different:

Small example, for PostgreSQL, is recommended to replace big IN clauses with JOINs, because IN clauses have bad performance. But for ClickHouse it’s the opposite!, IN works faster than JOIN, because it only checks key existence in HashSet and doesn’t need to extract any data from the right table in IN.

Let’s test that:

SELECT
	sumIf(ss_sales_price, ss_customer_sk IN (
    	SELECT c_customer_sk
    	FROM customer
    	WHERE c_first_name = 'James'
	)) AS sum_James,
	sumIf(ss_sales_price, ss_customer_sk IN (
    	SELECT c_customer_sk
    	FROM customer
    	WHERE c_first_name = 'Lisa'
	)) AS sum_Lisa,
	sum(ss_sales_price) AS sum_total
FROM store_sales

1 row in set. Elapsed: 16.546 sec. Processed 2.90 billion rows, 40.89 GB (175.52 million rows/s., 2.47 GB/s.)

Almost 2 times faster than our previous record with JOIN, what if we will improve the same hint with c_customer_sk key like in JOIN?

SELECT
	sumIf(ss_sales_price, ss_customer_sk_nn IN (
    	SELECT c_customer_sk::UInt32
    	FROM customer
    	WHERE c_first_name = 'James'
	)) AS sum_James,
	sumIf(ss_sales_price, ss_customer_sk_nn IN (
    	SELECT c_customer_sk::UInt32
    	FROM customer
    	WHERE c_first_name = 'Lisa'
	)) AS sum_Lisa,
	sum(ss_sales_price) AS sum_total
FROM store_sales

1 row in set. Elapsed: 12.355 sec. Processed 2.90 billion rows, 26.49 GB (235.06 million rows/s., 2.14 GB/s.)

Another 25% performance!

But, there is one big limitation with IN approach, what if we have more than just 2 conditions?

SELECT
	sumIf(ss_sales_price, ss_customer_sk_nn IN (
    	SELECT c_customer_sk::UInt32
    	FROM customer
    	WHERE c_first_name = 'James'
	)) AS sum_James,
	sumIf(ss_sales_price, ss_customer_sk_nn IN (
    	SELECT c_customer_sk::UInt32
    	FROM customer
    	WHERE c_first_name = 'Lisa'
	)) AS sum_Lisa,
	sumIf(ss_sales_price, ss_customer_sk_nn IN (
    	SELECT c_customer_sk::UInt32
    	FROM customer
    	WHERE c_last_name = 'Smith'
	)) AS sum_Smith,
	sumIf(ss_sales_price, ss_customer_sk_nn IN (
    	SELECT c_customer_sk::UInt32
    	FROM customer
    	WHERE c_last_name = 'Williams'
	)) AS sum_Williams,
	sum(ss_sales_price) AS sum_total
FROM store_sales

1 row in set. Elapsed: 23.690 sec. Processed 2.93 billion rows, 27.06 GB (123.60 million rows/s., 1.14 GB/s.)

Adhoc alternative to Dictionary with FLAT layout

But first is a short introduction. What the hell is a Dictionary with a FLAT layout?

Basically, it’s just a set of Array’s for each attribute where the value position in the attribute array is just a dictionary key For sure it put heavy limitation about what dictionary key could be, but it gives really good advantages:

['Alice','James', 'Robert','John', ...].length = 12mil, Memory usage ~ N*sum(sizeOf(String(N)) + 1)

It’s really small memory usage (good cache hit rate) & really fast key lookups (no complex hash calculation)

So, if it’s that great what are the caveats? First one is that your keys should be ideally autoincremental (with small number of gaps) And for second, lets look in that simple query and write down all calculations:

SELECT sumIf(ss_sales_price, dictGet(...) = 'James')

Dictionary call (2 billion times)
String equality check (2 billion times)

Although it’s really efficient in terms of dictGet call and memory usage by Dictionary, it still materializes the String column (memcpy) and we pay a penalty of execution condition on top of such a string column for each row.

But what if we could first calculate our required condition and create such a “Dictionary” ad hoc in query time?

And we can actually do that! But let’s repeat our analysis again:

SELECT sumIf(ss_sales_price, here_lives_unicorns(dictGet(...) = 'James'))

['Alice','James', 'Lisa','James', ...].map(x -> multiIf(x = 'James', 1, x = 'Lisa', 2, 0)) => [0,1,2,1,...].length = 12mil, Memory usage ~ N*sizeOf(UInt8) <- It’s event smaller than FLAT dictionary

And actions:

String equality check (12 million times)
Create Array (12 million elements)
Array call (2 billion times)
UInt8 equality check (2 billion times)

But what is here_lives_unicorns function, does it exist in ClickHouse?

No, but we can hack it with some array manipulation:

SELECT sumIf(ss_sales_price, arr[customer_id] = 2)

WITH (
    	SELECT groupArray(assumeNotNull((c_first_name = 'James') + (2 * (c_first_name = 'Lisa')))::UInt8)
    	FROM
    	(
        	SELECT *
        	FROM customer
        	ORDER BY c_customer_sk ASC
    	)
	) AS cond
SELECT
	sumIf(ss_sales_price, bitTest(cond[ss_customer_sk], 0)) AS sum_James,
	sumIf(ss_sales_price, bitTest(cond[ss_customer_sk], 1)) AS sum_Lisa,
	sum(ss_sales_price) AS sum_total
FROM store_sales

1 row in set. Elapsed: 13.006 sec. Processed 2.89 billion rows, 40.60 GB (222.36 million rows/s., 3.12 GB/s.)

WITH (
    	SELECT groupArray(assumeNotNull((c_first_name = 'James') + (2 * (c_first_name = 'Lisa')))::UInt8)
    	FROM
    	(
        	SELECT *
        	FROM customer
        	ORDER BY c_customer_sk ASC
    	)
	) AS cond,
	bitTest(cond[ss_customer_sk_nn], 0) AS cond_james,
	bitTest(cond[ss_customer_sk_nn], 1) AS cond_lisa
SELECT
	sumIf(ss_sales_price, cond_james) AS sum_James,
	sumIf(ss_sales_price, cond_lisa) AS sum_Lisa,
	sum(ss_sales_price) AS sum_total
FROM store_sales


1 row in set. Elapsed: 10.054 sec. Processed 2.89 billion rows, 26.20 GB (287.64 million rows/s., 2.61 GB/s.)

20% faster than the IN approach, what if we will have not 2 but 4 such conditions:

WITH (
    	SELECT groupArray(assumeNotNull((((c_first_name = 'James') + (2 * (c_first_name = 'Lisa'))) + (4 * (c_last_name = 'Smith'))) + (8 * (c_last_name = 'Williams')))::UInt8)
    	FROM
    	(
        	SELECT *
        	FROM customer
        	ORDER BY c_customer_sk ASC
    	)
	) AS cond
SELECT
	sumIf(ss_sales_price, bitTest(cond[ss_customer_sk_nn], 0)) AS sum_James,
	sumIf(ss_sales_price, bitTest(cond[ss_customer_sk_nn], 1)) AS sum_Lisa,
	sumIf(ss_sales_price, bitTest(cond[ss_customer_sk_nn], 2)) AS sum_Smith,
	sumIf(ss_sales_price, bitTest(cond[ss_customer_sk_nn], 3)) AS sum_Williams,
	sum(ss_sales_price) AS sum_total
FROM store_sales

1 row in set. Elapsed: 11.454 sec. Processed 2.89 billion rows, 26.39 GB (252.49 million rows/s., 2.30 GB/s.)

As we can see, that Array approach doesn’t even notice that we increased the amount of conditions by 2 times.

2.25 - JSONExtract to parse many attributes at a time

JSONExtract to parse many attributes at a time

Don’t use several JSONExtract for parsing big JSON. It’s very ineffective, slow, and consumes CPU. Try to use one JSONExtract to parse String to Tupes and next get the needed elements:

WITH JSONExtract(json, 'Tuple(name String, id String, resources Nested(description String, format String, tracking_summary Tuple(total UInt32, recent UInt32)), extras Nested(key String, value String))') AS parsed_json
SELECT
    tupleElement(parsed_json, 'name') AS name,
    tupleElement(parsed_json, 'id') AS id,
    tupleElement(tupleElement(parsed_json, 'resources'), 'description') AS `resources.description`,
    tupleElement(tupleElement(parsed_json, 'resources'), 'format') AS `resources.format`,
    tupleElement(tupleElement(tupleElement(parsed_json, 'resources'), 'tracking_summary'), 'total') AS `resources.tracking_summary.total`,
    tupleElement(tupleElement(tupleElement(parsed_json, 'resources'), 'tracking_summary'), 'recent') AS `resources.tracking_summary.recent`
FROM url('https://raw.githubusercontent.com/jsonlines/guide/master/datagov100.json', 'JSONAsString', 'json String')

However, such parsing requires static schema - all keys should be presented in every row, or you will get an empty structure. More dynamic parsing requires several JSONExtract invocations, but still - try not to scan the same data several times:

WITH
    '{"timestamp":"2024-06-12T14:30:00.001Z","functionality":"DOCUMENT","flowId":"210abdee-6de5-474a-83da-748def0facc1","step":"BEGIN","env":"dev","successful":true,"data":{"action":"initiate_view","stats":{"total":1,"success":1,"failed":0},"client_ip":"192.168.1.100","client_port":"8080"}}' AS json,
    JSONExtractKeysAndValues(json, 'String') AS m,
    mapFromArrays(m.1, m.2) AS p
SELECT
    extractKeyValuePairs(p['data'])['action'] AS data,
    (p['successful']) = 'true' AS successful
FORMAT Vertical

/*
Row 1:
──────
data:       initiate_view
successful: 1
*/

A good approach to get a proper schema from a json message is to let clickhouse-local schema inference do the job:

$ ls example_message.json         
example_message.json

$ clickhouse-local --query="DESCRIBE file('example_message.json', 'JSONEachRow')" --format="Vertical";

Row 1:
──────
name:               resourceLogs
type:               Array(Tuple(
    resource Nullable(String),
    scopeLogs Array(Tuple(
        logRecords Array(Tuple(
            attributes Array(Tuple(
                key Nullable(String),
                value Tuple(
                    stringValue Nullable(String)))),
            body Tuple(
                stringValue Nullable(String)),
            observedTimeUnixNano Nullable(String),
            spanId Nullable(String),
            traceId Nullable(String))),
        scope Nullable(String)))))

For very subnested dynamic JSON files, if you don’t need all the keys, you could parse sublevels specifically. Still this will require several JSONExtract calls but each call will have less data to parse so complexity will be reduced for each pass: O(log n)

CREATE TABLE better_parsing (json String) ENGINE = Memory;
INSERT INTO better_parsing FORMAT JSONAsString {"timestamp":"2024-06-12T14:30:00.001Z","functionality":"DOCUMENT","flowId":"210abdee-6de5-474a-83da-748def0facc1","step":"BEGIN","env":"dev","successful":true,"data":{"action":"initiate_view","stats":{"total":1,"success":1,"failed":0},"client_ip":"192.168.1.100","client_port":"8080"}}

WITH parsed_content AS
    (
      SELECT 
        JSONExtractKeysAndValues(json, 'String') AS 1st_level_arr,
        mapFromArrays(1st_level_arr.1, 1st_level_arr.2) AS 1st_level_map,
        JSONExtractKeysAndValues(1st_level_map['data'], 'String') AS 2nd_level_arr,
        mapFromArrays(2nd_level_arr.1, 2nd_level_arr.2) AS 2nd_level_map,
        JSONExtractKeysAndValues(2nd_level_map['stats'], 'String') AS 3rd_level_arr,
        mapFromArrays(3rd_level_arr.1, 3rd_level_arr.2) AS 3rd_level_map
      FROM json_tests.better_parsing
    ) 
SELECT 
  1st_level_map['timestamp'] AS timestamp,
  2nd_level_map['action'] AS action,
  3rd_level_map['total'] AS total
  3rd_level_map['nokey'] AS no_key_empty
FROM parsed_content

/*
   ┌─timestamp────────────────┬─action────────┬─total─┬─no_key_empty─┐
1. │ 2024-06-12T14:30:00.001Z │ initiate_view │ 1     │              │
   └──────────────────────────┴───────────────┴───────┴──────────────┘

1 row in set. Elapsed: 0.003 sec.
*/

2.26 - KILL QUERY

KILL QUERY

Unfortunately not all queries can be killed. KILL QUERY only sets a flag that must be checked by the query. A query pipeline is checking this flag before a switching to next block. If the pipeline has stuck somewhere in the middle it cannot be killed. If a query does not stop, the only way to get rid of it is to restart ClickHouse®.

See also:

How to replace a running query

Q. We are trying to abort running queries when they are being replaced with a new one. We are setting the same query id for this. In some cases this error happens:
Query with id = e213cc8c-3077-4a6c-bc78-e8463adad35d is already running and can’t be stopped
The query is still being killed but the new one is not being executed. Do you know anything about this and if there is a fix or workaround for it?

I guess you use replace_running_query + replace_running_query_max_wait_ms.

Unfortunately it’s not always possible to kill the query at random moment of time.

Kill don’t send any signals, it just set a flag. Which gets (synchronously) checked at certain moments of query execution, mostly after finishing processing one block and starting another.

On certain stages (executing scalar sub-query) the query can not be killed at all. This is a known issue and requires an architectural change to fix it.

I see. Is there a workaround?
This is our use case:
A user requests an analytics report which has a query that takes several settings, the user makes changes to the report (e.g. to filters, metrics, dimensions…). Since the user changed what he is looking for the query results from the initial query are never used and we would like to cancel it when starting the new query (edited)

You can just use 2 commands:

KILL QUERY WHERE query_id = ' ... ' ASYNC

SELECT ... new query ....

in that case you don’t need to care when the original query will be stopped.

2.27 - Lag / Lead

Lag / Lead

Sample data

CREATE TABLE llexample (
    g Int32,
    a Date )
ENGINE = Memory;

INSERT INTO llexample SELECT
    number % 3,
    toDate('2020-01-01') + number
FROM numbers(10);

SELECT * FROM llexample ORDER BY g,a;

┌─g─┬──────────a─┐
│ 0 │ 2020-01-01 │
│ 0 │ 2020-01-04 │
│ 0 │ 2020-01-07 │
│ 0 │ 2020-01-10 │
│ 1 │ 2020-01-02 │
│ 1 │ 2020-01-05 │
│ 1 │ 2020-01-08 │
│ 2 │ 2020-01-03 │
│ 2 │ 2020-01-06 │
│ 2 │ 2020-01-09 │
└───┴────────────┘

Using arrays

select g, (arrayJoin(tuple_ll) as ll).1 a, ll.2 prev, ll.3 next
from (
select g, arrayZip( arraySort(groupArray(a)) as aa,
                    arrayPopBack(arrayPushFront(aa, toDate(0))),
                    arrayPopFront(arrayPushBack(aa, toDate(0))) ) tuple_ll
from llexample
group by g)
order by g, a;

┌─g─┬──────────a─┬───────prev─┬───────next─┐
│ 0 │ 2020-01-01 │ 1970-01-01 │ 2020-01-04 │
│ 0 │ 2020-01-04 │ 2020-01-01 │ 2020-01-07 │
│ 0 │ 2020-01-07 │ 2020-01-04 │ 2020-01-10 │
│ 0 │ 2020-01-10 │ 2020-01-07 │ 1970-01-01 │
│ 1 │ 2020-01-02 │ 1970-01-01 │ 2020-01-05 │
│ 1 │ 2020-01-05 │ 2020-01-02 │ 2020-01-08 │
│ 1 │ 2020-01-08 │ 2020-01-05 │ 1970-01-01 │
│ 2 │ 2020-01-03 │ 1970-01-01 │ 2020-01-06 │
│ 2 │ 2020-01-06 │ 2020-01-03 │ 2020-01-09 │
│ 2 │ 2020-01-09 │ 2020-01-06 │ 1970-01-01 │
└───┴────────────┴────────────┴────────────┘

Using window functions (starting from ClickHouse® 21.3)

SET allow_experimental_window_functions = 1;

SELECT
    g,
    a,
    any(a) OVER (PARTITION BY g ORDER BY a ASC ROWS
                 BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev,
    any(a) OVER (PARTITION BY g ORDER BY a ASC ROWS
                 BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS next
FROM llexample
ORDER BY
    g ASC,
    a ASC;

┌─g─┬──────────a─┬───────prev─┬───────next─┐
│ 0 │ 2020-01-01 │ 1970-01-01 │ 2020-01-04 │
│ 0 │ 2020-01-04 │ 2020-01-01 │ 2020-01-07 │
│ 0 │ 2020-01-07 │ 2020-01-04 │ 2020-01-10 │
│ 0 │ 2020-01-10 │ 2020-01-07 │ 1970-01-01 │
│ 1 │ 2020-01-02 │ 1970-01-01 │ 2020-01-05 │
│ 1 │ 2020-01-05 │ 2020-01-02 │ 2020-01-08 │
│ 1 │ 2020-01-08 │ 2020-01-05 │ 1970-01-01 │
│ 2 │ 2020-01-03 │ 1970-01-01 │ 2020-01-06 │
│ 2 │ 2020-01-06 │ 2020-01-03 │ 2020-01-09 │
│ 2 │ 2020-01-09 │ 2020-01-06 │ 1970-01-01 │
└───┴────────────┴────────────┴────────────┘

Using lagInFrame/leadInFrame (starting from ClickHouse 21.4)

SELECT
    g,
    a,
    lagInFrame(a) OVER (PARTITION BY g ORDER BY a ASC ROWS
                 BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS prev,
    leadInFrame(a) OVER (PARTITION BY g ORDER BY a ASC ROWS
                 BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS next
FROM llexample
ORDER BY
    g ASC,
    a ASC;

┌─g─┬──────────a─┬───────prev─┬───────next─┐
│ 0 │ 2020-01-01 │ 1970-01-01 │ 2020-01-04 │
│ 0 │ 2020-01-04 │ 2020-01-01 │ 2020-01-07 │
│ 0 │ 2020-01-07 │ 2020-01-04 │ 2020-01-10 │
│ 0 │ 2020-01-10 │ 2020-01-07 │ 1970-01-01 │
│ 1 │ 2020-01-02 │ 1970-01-01 │ 2020-01-05 │
│ 1 │ 2020-01-05 │ 2020-01-02 │ 2020-01-08 │
│ 1 │ 2020-01-08 │ 2020-01-05 │ 1970-01-01 │
│ 2 │ 2020-01-03 │ 1970-01-01 │ 2020-01-06 │
│ 2 │ 2020-01-06 │ 2020-01-03 │ 2020-01-09 │
│ 2 │ 2020-01-09 │ 2020-01-06 │ 1970-01-01 │
└───┴────────────┴────────────┴────────────┘

Using neighbor (no grouping, incorrect result over blocks)

SELECT
    g,
    a,
    neighbor(a, -1) AS prev,
    neighbor(a, 1) AS next
FROM
(
    SELECT *
    FROM llexample
    ORDER BY
        g ASC,
        a ASC
);

┌─g─┬──────────a─┬───────prev─┬───────next─┐
│ 0 │ 2020-01-01 │ 1970-01-01 │ 2020-01-04 │
│ 0 │ 2020-01-04 │ 2020-01-01 │ 2020-01-07 │
│ 0 │ 2020-01-07 │ 2020-01-04 │ 2020-01-10 │
│ 0 │ 2020-01-10 │ 2020-01-07 │ 2020-01-02 │
│ 1 │ 2020-01-02 │ 2020-01-10 │ 2020-01-05 │
│ 1 │ 2020-01-05 │ 2020-01-02 │ 2020-01-08 │
│ 1 │ 2020-01-08 │ 2020-01-05 │ 2020-01-03 │
│ 2 │ 2020-01-03 │ 2020-01-08 │ 2020-01-06 │
│ 2 │ 2020-01-06 │ 2020-01-03 │ 2020-01-09 │
│ 2 │ 2020-01-09 │ 2020-01-06 │ 1970-01-01 │
└───┴────────────┴────────────┴────────────┘

2.28 - Machine learning in ClickHouse

Machine learning in ClickHouse

Resources

Machine Learning in ClickHouse - Presentation from 2019 (Meetup 31)
ML discussion: CatBoost / MindsDB / Fast.ai - Brief article from 2021
Machine Learning Forecase (Russian) - Presentation from 2019 (Meetup 38)

2.29 - Mutations

ALTER UPDATE / DELETE

How to know if `ALTER TABLE … DELETE/UPDATE mutation ON CLUSTER` was finished successfully on all the nodes?

A. mutation status in system.mutations is local to each replica, so use

SELECT hostname(), * FROM clusterAllReplicas('your_cluster_name', system.mutations);
-- you can also add WHERE conditions to that query if needed.

Look on is_done and latest_fail_reason columns

Are mutations being run in parallel or they are sequential in ClickHouse® (in scope of one table)

Mutations

ClickHouse runs mutations sequentially, but it can combine several mutations in a single and apply all of them in one merge. Sometimes, it can lead to problems, when a combined expression which ClickHouse needs to execute becomes really big. (If ClickHouse combined thousands of mutations in one)

Because ClickHouse stores data in independent parts, ClickHouse is able to run mutation(s) merges for each part independently and in parallel. It also can lead to high resource utilization, especially memory usage if you use x IN (SELECT ... FROM big_table) statements in mutation, because each merge will run and keep in memory its own HashSet. You can avoid this problem, if you will use Dictionary approach for such mutations.

Parallelism of mutations controlled by settings:

SELECT *
FROM system.merge_tree_settings
WHERE name LIKE '%mutation%'

┌─name───────────────────────────────────────────────┬─value─┬─changed─┬─description──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─type───┐
│ max_replicated_mutations_in_queue                  │ 8     │       0 │ How many tasks of mutating parts are allowed simultaneously in ReplicatedMergeTree queue.                                                                                    │ UInt64 │
│ number_of_free_entries_in_pool_to_execute_mutation │ 20    │       0 │ When there is less than specified number of free entries in pool, do not execute part mutations. This is to leave free threads for regular merges and avoid "Too many parts" │ UInt64 │
└────────────────────────────────────────────────────┴───────┴─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘

2.30 - OPTIMIZE vs OPTIMIZE FINAL

OPTIMIZE vs OPTIMIZE FINAL

OPTIMIZE TABLE xyz – this initiates an unscheduled merge.

Example

You have 40 parts in 3 partitions. This unscheduled merge selects some partition (i.e. February) and selects 3 small parts to merge, then merge them into a single part. You get 38 parts in the result.

OPTIMIZE TABLE xyz FINAL – initiates a cycle of unscheduled merges.

ClickHouse® merges parts in this table until will remains 1 part in each partition (if a system has enough free disk space). As a result, you get 3 parts, 1 part per partition. In this case, ClickHouse rewrites parts even if they are already merged into a single part. It creates a huge CPU / Disk load if the table (XYZ) is huge. ClickHouse reads / uncompress / merge / compress / writes all data in the table.

If this table has size 1TB it could take around 3 hours to complete.

So we don’t recommend running OPTIMIZE TABLE xyz FINAL against tables with more than 10million rows.

2.31 - Parameterized views

Parameterized views

ClickHouse® versions 23.1+ (23.1.6.42, 23.2.5.46, 23.3.1.2823) have inbuilt support for parametrized views :

CREATE VIEW my_new_view AS
SELECT *
FROM deals
WHERE category_id IN (
    SELECT category_id
    FROM deal_categories
    WHERE category = {category:String}
)

SELECT * FROM my_new_view(category = 'hot deals');

One more example

CREATE OR REPLACE VIEW v AS SELECT 1::UInt32 x WHERE x IN ({xx:Array(UInt32)});

select * from v(xx=[1,2,3]);
┌─x─┐
│ 1 │
└───┘

ClickHouse versions pre 23.1

Custom settings allows to emulate parameterized views.

You need to enable custom settings and define any prefixes for settings.

$ cat /etc/clickhouse-server/config.d/custom_settings_prefixes.xml
<?xml version="1.0" ?>
<yandex>
    <custom_settings_prefixes>my,my2</custom_settings_prefixes>
</yandex>

You can also set the default value for user settings in the default section of the user configuration.

cat /etc/clickhouse-server/users.d/custom_settings_default.xml
<?xml version="1.0"?>
<yandex>
    <profiles>
        <default>
            <my2_category>'hot deals'</my2_category>
        </default>
    </profiles>
</yandex>

A server restart is required for the default value to be applied

$ systemctl restart clickhouse-server

Now you can set settings as any other settings, and query them using getSetting() function.

SET my2_category='hot deals';

SELECT getSetting('my2_category');
┌─getSetting('my2_category')─┐
│ hot deals                  │
└────────────────────────────┘

-- you can query ClickHouse settings as well
SELECT getSetting('max_threads')
┌─getSetting('max_threads')─┐
│                         8 │
└───────────────────────────┘

Now we can create a view

CREATE VIEW my_new_view AS
SELECT *
FROM deals
WHERE category_id IN
(
    SELECT category_id
    FROM deal_categories
    WHERE category = getSetting('my2_category')
);

And query it

SELECT *
FROM my_new_view
SETTINGS my2_category = 'hot deals';

If the custom setting is not set when the view is being created, you need to explicitly define the list of columns for the view:

CREATE VIEW my_new_view (c1 Int, c2 String, ...)
AS
SELECT *
FROM deals
WHERE category_id IN
(
    SELECT category_id
    FROM deal_categories
    WHERE category = getSetting('my2_category')
);

2.32 - Use both projection and raw data in single query

How to write queries, which will use both data from projection and raw table.

CREATE TABLE default.metric
(
    `key_a` UInt8,
    `key_b` UInt32,
    `date` Date,
    `value` UInt32,
    PROJECTION monthly
    (
        SELECT
            key_a,
            key_b,
            min(date),
            sum(value)
        GROUP BY
            key_a,
            key_b
    )
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (key_a, key_b, date)
SETTINGS index_granularity = 8192;


INSERT INTO metric SELECT
    key_a,
    key_b,
    date,
    rand() % 100000 AS value
FROM
(
    SELECT
        arrayJoin(range(8)) AS key_a,
        number % 500000 AS key_b,
        today() - intDiv(number, 500000) AS date
    FROM numbers_mt(1080000000)
);

OPTIMIZE TABLE metric FINAL;

SET max_threads = 8;

WITH
    toDate('2015-02-27') AS start_date,
    toDate('2022-02-15') AS end_date,
    key_a IN (1, 3, 5, 7) AS key_a_cond
SELECT
    key_b,
    sum(value) AS sum
FROM metric
WHERE (date > start_date) AND (date < end_date) AND key_a_cond
GROUP BY key_b
ORDER BY sum DESC
LIMIT 25

25 rows in set. Elapsed: 6.561 sec. Processed 4.32 billion rows, 47.54 GB (658.70 million rows/s., 7.25 GB/s.)

WITH
    toDate('2015-02-27') AS start_date,
    toDate('2022-02-15') AS end_date,
    key_a IN (1, 3, 5, 7) AS key_a_cond
SELECT
    key_b,
    sum(value) AS sum
FROM
(
    SELECT
        key_b,
        value
    FROM metric
    WHERE indexHint(_partition_id IN CAST([toYYYYMM(start_date), toYYYYMM(end_date)], 'Array(String)')) AND (date > start_date) AND (date < end_date) AND key_a_cond
    UNION ALL
    SELECT
        key_b,
        sum(value) AS value
    FROM metric
    WHERE indexHint(_partition_id IN CAST(range(toYYYYMM(start_date) + 1, toYYYYMM(end_date)), 'Array(String)')) AND key_a_cond
    GROUP BY key_b
)
GROUP BY key_b
ORDER BY sum DESC
LIMIT 25

25 rows in set. Elapsed: 1.038 sec. Processed 181.86 million rows, 4.56 GB (175.18 million rows/s., 4.40 GB/s.)


WITH
    (toDate('2016-02-27'), toDate('2017-02-15')) AS period_1,
    (toDate('2018-05-27'), toDate('2022-08-15')) AS period_2,
    (date > (period_1.1)) AND (date < (period_1.2)) AS period_1_cond,
    (date > (period_2.1)) AND (date < (period_2.2)) AS period_2_cond,
    key_a IN (1, 3, 5, 7) AS key_a_cond
SELECT
    key_b,
    sumIf(value, period_1_cond) AS sum_per_1,
    sumIf(value, period_2_cond) AS sum_per_2
FROM metric
WHERE (period_1_cond OR period_2_cond) AND key_a_cond
GROUP BY key_b
ORDER BY sum_per_2 / sum_per_1 DESC
LIMIT 25

25 rows in set. Elapsed: 5.717 sec. Processed 3.47 billion rows, 38.17 GB (606.93 million rows/s., 6.68 GB/s.)

WITH
    (toDate('2016-02-27'), toDate('2017-02-15')) AS period_1,
    (toDate('2018-05-27'), toDate('2022-08-15')) AS period_2,
    (date > (period_1.1)) AND (date < (period_1.2)) AS period_1_cond,
    (date > (period_2.1)) AND (date < (period_2.2)) AS period_2_cond,
    CAST([toYYYYMM(period_1.1), toYYYYMM(period_1.2), toYYYYMM(period_2.1), toYYYYMM(period_2.2)], 'Array(String)') AS daily_parts,
    key_a IN (1, 3, 5, 7) AS key_a_cond
SELECT
    key_b,
    sumIf(value, period_1_cond) AS sum_per_1,
    sumIf(value, period_2_cond) AS sum_per_2
FROM
(
    SELECT
        key_b,
        date,
        value
    FROM metric
    WHERE indexHint(_partition_id IN (daily_parts)) AND (period_1_cond OR period_2_cond) AND key_a_cond
    UNION ALL
    SELECT
        key_b,
        min(date) AS date,
        sum(value) AS value
    FROM metric
    WHERE indexHint(_partition_id IN CAST(arrayConcat(range(toYYYYMM(period_1.1) + 1, toYYYYMM(period_1.2)), range(toYYYYMM(period_2.1) + 1, toYYYYMM(period_2.1))), 'Array(String)')) AND indexHint(_partition_id NOT IN (daily_parts)) AND key_a_cond
    GROUP BY
        key_b
)
GROUP BY key_b
ORDER BY sum_per_2 / sum_per_1 DESC
LIMIT 25


25 rows in set. Elapsed: 0.444 sec. Processed 140.34 million rows, 2.11 GB (316.23 million rows/s., 4.77 GB/s.)


WITH
    toDate('2022-01-03') AS start_date,
    toDate('2022-02-15') AS end_date,
    key_a IN (1, 3, 5, 7) AS key_a_cond
SELECT
    key_b,
    sum(value) AS sum
FROM metric
WHERE (date > start_date) AND (date < end_date) AND key_a_cond
GROUP BY key_b
ORDER BY sum DESC
LIMIT 25

25 rows in set. Elapsed: 0.208 sec. Processed 100.06 million rows, 1.10 GB (481.06 million rows/s., 5.29 GB/s.)


WITH
    toDate('2022-01-03') AS start_date,
    toDate('2022-02-15') AS end_date,
    key_a IN (1, 3, 5, 7) AS key_a_cond
SELECT
    key_b,
    sum(value) AS sum
FROM
(
    SELECT
        key_b,
        value
    FROM metric
    WHERE indexHint(_partition_id IN CAST([toYYYYMM(start_date), toYYYYMM(end_date)], 'Array(String)')) AND (date > start_date) AND (date < end_date) AND key_a_cond
    UNION ALL
    SELECT
        key_b,
        sum(value) AS value
    FROM metric
    WHERE indexHint(_partition_id IN CAST(range(toYYYYMM(start_date) + 1, toYYYYMM(end_date)), 'Array(String)')) AND key_a_cond
    GROUP BY key_b
)
GROUP BY key_b
ORDER BY sum DESC
LIMIT 25

25 rows in set. Elapsed: 0.216 sec. Processed 100.06 million rows, 1.10 GB (462.68 million rows/s., 5.09 GB/s.)


WITH
    toDate('2021-12-03') AS start_date,
    toDate('2022-02-15') AS end_date,
    key_a IN (1, 3, 5, 7) AS key_a_cond
SELECT
    key_b,
    sum(value) AS sum
FROM metric
WHERE (date > start_date) AND (date < end_date) AND key_a_cond
GROUP BY key_b
ORDER BY sum DESC
LIMIT 25

25 rows in set. Elapsed: 0.308 sec. Processed 162.09 million rows, 1.78 GB (526.89 million rows/s., 5.80 GB/s.)

WITH
    toDate('2021-12-03') AS start_date,
    toDate('2022-02-15') AS end_date,
    key_a IN (1, 3, 5, 7) AS key_a_cond
SELECT
    key_b,
    sum(value) AS sum
FROM
(
    SELECT
        key_b,
        value
    FROM metric
    WHERE indexHint(_partition_id IN CAST([toYYYYMM(start_date), toYYYYMM(end_date)], 'Array(String)')) AND (date > start_date) AND (date < end_date) AND key_a_cond
    UNION ALL
    SELECT
        key_b,
        sum(value) AS value
    FROM metric
    WHERE indexHint(_partition_id IN CAST(range(toYYYYMM(start_date) + 1, toYYYYMM(end_date)), 'Array(String)')) AND key_a_cond
    GROUP BY key_b
)
GROUP BY key_b
ORDER BY sum DESC
LIMIT 25

25 rows in set. Elapsed: 0.268 sec. Processed 102.08 million rows, 1.16 GB (381.46 million rows/s., 4.33 GB/s.)

2.33 - PIVOT / UNPIVOT

PIVOT / UNPIVOT

PIVOT

CREATE TABLE sales(suppkey UInt8, category String, quantity UInt32) ENGINE=Memory(); 

INSERT INTO sales VALUES (2, 'AA' ,7500),(1, 'AB' , 4000),(1, 'AA' , 6900),(1, 'AB', 8900), (1, 'AC', 8300), (1, 'AA', 7000), (1, 'AC', 9000), (2,'AA', 9800), (2,'AB', 9600), (1,'AC', 8900),(1, 'AD', 400), (2,'AD', 900), (2,'AD', 1200), (1,'AD', 2600), (2, 'AC', 9600),(1, 'AC', 6200);

Using Map data type (starting from ClickHouse® 21.1)

WITH CAST(sumMap([category], [quantity]), 'Map(String, UInt32)') AS map
SELECT
    suppkey,
    map['AA'] AS AA,
    map['AB'] AS AB,
    map['AC'] AS AC,
    map['AD'] AS AD
FROM sales
GROUP BY suppkey
ORDER BY suppkey ASC

┌─suppkey─┬────AA─┬────AB─┬────AC─┬───AD─┐
│       1 │ 13900 │ 12900 │ 32400 │ 3000 │
│       2 │ 17300 │  9600 │  9600 │ 2100 │
└─────────┴───────┴───────┴───────┴──────┘

WITH CAST(sumMap(map(category, quantity)), 'Map(LowCardinality(String), UInt32)') AS map
SELECT
    suppkey,
    map['AA'] AS AA,
    map['AB'] AS AB,
    map['AC'] AS AC,
    map['AD'] AS AD
FROM sales
GROUP BY suppkey
ORDER BY suppkey ASC

┌─suppkey─┬────AA─┬────AB─┬────AC─┬───AD─┐
│       1 │ 13900 │ 12900 │ 32400 │ 3000 │
│       2 │ 17300 │  9600 │  9600 │ 2100 │
└─────────┴───────┴───────┴───────┴──────┘

Using -If combinator

SELECT
    suppkey,
    sumIf(quantity, category = 'AA') AS AA,
    sumIf(quantity, category = 'AB') AS AB,
    sumIf(quantity, category = 'AC') AS AC,
    sumIf(quantity, category = 'AD') AS AD
FROM sales
GROUP BY suppkey
ORDER BY suppkey ASC

┌─suppkey─┬────AA─┬────AB─┬────AC─┬───AD─┐
│       1 │ 13900 │ 12900 │ 32400 │ 3000 │
│       2 │ 17300 │  9600 │  9600 │ 2100 │
└─────────┴───────┴───────┴───────┴──────┘

Using -Resample combinator

WITH sumResample(0, 4, 1)(quantity, transform(category, ['AA', 'AB', 'AC', 'AD'], [0, 1, 2, 3], 4)) AS sum
SELECT
    suppkey,
    sum[1] AS AA,
    sum[2] AS AB,
    sum[3] AS AC,
    sum[4] AS AD
FROM sales
GROUP BY suppkey
ORDER BY suppkey ASC

┌─suppkey─┬────AA─┬────AB─┬────AC─┬───AD─┐
│       1 │ 13900 │ 12900 │ 32400 │ 3000 │
│       2 │ 17300 │  9600 │  9600 │ 2100 │
└─────────┴───────┴───────┴───────┴──────┘

UNPIVOT

CREATE TABLE sales_w(suppkey UInt8, brand String, AA UInt32, AB UInt32, AC UInt32,
AD UInt32) ENGINE=Memory();

 INSERT INTO sales_w VALUES (1, 'BRAND_A', 1500, 4200, 1600, 9800), (2, 'BRAND_B', 6200, 1300, 5800, 3100), (3, 'BRAND_C', 5000, 8900, 6900, 3400);

SELECT
    suppkey,
    brand,
    category,
    quantity
FROM sales_w
ARRAY JOIN
    [AA, AB, AC, AD] AS quantity,
    splitByString(', ', 'AA, AB, AC, AD') AS category
ORDER BY suppkey ASC

┌─suppkey─┬─brand───┬─category─┬─quantity─┐
│       1 │ BRAND_A │ AA       │     1500 │
│       1 │ BRAND_A │ AB       │     4200 │
│       1 │ BRAND_A │ AC       │     1600 │
│       1 │ BRAND_A │ AD       │     9800 │
│       2 │ BRAND_B │ AA       │     6200 │
│       2 │ BRAND_B │ AB       │     1300 │
│       2 │ BRAND_B │ AC       │     5800 │
│       2 │ BRAND_B │ AD       │     3100 │
│       3 │ BRAND_C │ AA       │     5000 │
│       3 │ BRAND_C │ AB       │     8900 │
│       3 │ BRAND_C │ AC       │     6900 │
│       3 │ BRAND_C │ AD       │     3400 │
└─────────┴─────────┴──────────┴──────────┘

SELECT
    suppkey,
    brand,
    tpl.1 AS category,
    tpl.2 AS quantity
FROM sales_w
ARRAY JOIN tupleToNameValuePairs(CAST((AA, AB, AC, AD), 'Tuple(AA UInt32, AB UInt32, AC UInt32, AD UInt32)')) AS tpl
ORDER BY suppkey ASC

┌─suppkey─┬─brand───┬─category─┬─quantity─┐
│       1 │ BRAND_A │ AA       │     1500 │
│       1 │ BRAND_A │ AB       │     4200 │
│       1 │ BRAND_A │ AC       │     1600 │
│       1 │ BRAND_A │ AD       │     9800 │
│       2 │ BRAND_B │ AA       │     6200 │
│       2 │ BRAND_B │ AB       │     1300 │
│       2 │ BRAND_B │ AC       │     5800 │
│       2 │ BRAND_B │ AD       │     3100 │
│       3 │ BRAND_C │ AA       │     5000 │
│       3 │ BRAND_C │ AB       │     8900 │
│       3 │ BRAND_C │ AC       │     6900 │
│       3 │ BRAND_C │ AD       │     3400 │
└─────────┴─────────┴──────────┴──────────┘

2.34 - Possible deadlock avoided. Client should retry

Possible deadlock avoided. Client should retry

In ClickHouse® version 19.14 a serious issue was found: a race condition that can lead to server deadlock. The reason for that was quite fundamental, and a temporary workaround for that was added (“possible deadlock avoided”).

Those locks are one of the fundamental things that the core team was actively working on in 2020.

In 20.3 some of the locks leading to that situation were removed as a part of huge refactoring.

In 20.4 more locks were removed, the check was made configurable (see lock_acquire_timeout ) so you can say how long to wait before returning that exception

In 20.5 heuristics of that check (“possible deadlock avoided”) was improved.

In 20.6 all table-level locks which were possible to remove were removed, so alters are totally lock-free.

20.10 enables database=Atomic by default which allows running even DROP commands without locks.

Typically issue was happening when doing some concurrent select on system.parts / system.columns / system.table with simultaneous table manipulations (doing some kind of ALTERS / TRUNCATES / DROP)I

If that exception happens often in your use-case:

use recent clickhouse versions
ensure you use Atomic engine for the database (not Ordinary) (can be checked in system.databases)

Sometime you can try to workaround issue by finding the queries which uses that table concurently (especially to system.tables / system.parts and other system tables) and try killing them (or avoiding them).

2.35 - Roaring bitmaps for calculating retention

CREATE TABLE test_roaring_bitmap
ENGINE = MergeTree
ORDER BY h AS
SELECT
    intDiv(number, 5) AS h,
    groupArray(toUInt16(number - (2 * intDiv(number, 5)))) AS vals,
    groupBitmapState(toUInt16(number - (2 * intDiv(number, 5)))) AS vals_bitmap
FROM numbers(40)
GROUP BY h

SELECT
    h,
    vals,
    hex(vals_bitmap)
FROM test_roaring_bitmap

┌─h─┬─vals─────────────┬─hex(vals_bitmap)─────────┐
│ 0 │ [0,1,2,3,4]      │ 000500000100020003000400 │
│ 1 │ [3,4,5,6,7]      │ 000503000400050006000700 │
│ 2 │ [6,7,8,9,10]     │ 000506000700080009000A00 │
│ 3 │ [9,10,11,12,13]  │ 000509000A000B000C000D00 │
│ 4 │ [12,13,14,15,16] │ 00050C000D000E000F001000 │
│ 5 │ [15,16,17,18,19] │ 00050F001000110012001300 │
│ 6 │ [18,19,20,21,22] │ 000512001300140015001600 │
│ 7 │ [21,22,23,24,25] │ 000515001600170018001900 │
└───┴──────────────────┴──────────────────────────┘

SELECT
    groupBitmapAnd(vals_bitmap) AS uniq,
    bitmapToArray(groupBitmapAndState(vals_bitmap)) AS vals
FROM test_roaring_bitmap
WHERE h IN (0, 1)

┌─uniq─┬─vals──┐
│    2 │ [3,4] │
└──────┴───────┘

2.36 - SAMPLE by

SAMPLE by

The execution pipeline is embedded in the partition reading code.

So that works this way:

ClickHouse® does partition pruning based on WHERE conditions.
For every partition, it picks a columns ranges (aka ‘marks’ / ‘granulas’) based on primary key conditions.
Here the sampling logic is applied: a) in case of SAMPLE k (k in 0..1 range) it adds conditions WHERE sample_key < k * max_int_of_sample_key_type b) in case of SAMPLE k OFFSET m it adds conditions WHERE sample_key BETWEEN m * max_int_of_sample_key_type AND (m + k) * max_int_of_sample_key_typec) in case of SAMPLE N (N>1) if first estimates how many rows are inside the range we need to read and based on that convert it to 3a case (calculate k based on number of rows in ranges and desired number of rows)
on the data returned by those other conditions are applied (so here the number of rows can be decreased here)

Source Code

SAMPLE by

SAMPLE key Must be:

Included in the primary key.
Uniformly distributed in the domain of its data type:
- Bad: Timestamp;
- Good: intHash32(UserID);
Cheap to calculate:
- Bad: cityHash64(URL);
- Good: intHash32(UserID);
Not after high granular fields in primary key:
- Bad: ORDER BY (Timestamp, sample_key);
- Good: ORDER BY (CounterID, Date, sample_key).

Sampling is:

Deterministic
Works in a consistent way for different tables.
Allows reading less amount of data from disk.
- SAMPLE key, bonus
- SAMPLE 1/10
- Select data for 1/10 of all possible sample keys; SAMPLE 1000000
Select from about (not less than) 1 000 000 rows on each shard;
- You can use _sample_factor virtual column to determine the relative sample factor; SAMPLE 1/10 OFFSET 1/10
Select second 1/10 of all possible sample keys; SET max_parallel_replicas = 3
Select from multiple replicas of each shard in parallel;

SAMPLE emulation via WHERE condition

Sometimes, it’s easier to emulate sampling via conditions in WHERE clause instead of using SAMPLE key.

SELECT count() FROM table WHERE ... AND cityHash64(some_high_card_key) % 10 = 0; -- Deterministic
SELECT count() FROM table WHERE ... AND rand() % 10 = 0; -- Non-deterministic

ClickHouse will read more data from disk compared to an example with a good SAMPLE key, but it’s more universal and can be used if you can’t change table ORDER BY key. (To learn more about ClickHouse internals, Administrator Training for ClickHouse is available.)

2.37 - Sampling Example

The most important idea about sampling that the primary index must have LowCardinality. (For more information, see the Altinity Knowledge Base article on LowCardinality or a ClickHouse® user's lessons learned from LowCardinality ).

The following example demonstrates how sampling can be setup correctly, and an example if it being set up incorrectly as a comparison.

Sampling requires sample by expression . This ensures a range of sampled column types fit within a specified range, which ensures the requirement of low cardinality. In this example, I cannot use transaction_id because I can not ensure that the min value of transaction_id = 0 and max value = MAX_UINT64. Instead, I used cityHash64(transaction_id)to expand the range within the minimum and maximum values.

For example if all values of transaction_id are from 0 to 10000 sampling will be inefficient. But cityHash64(transaction_id) expands the range from 0 to 18446744073709551615:

SELECT cityHash64(10000)
┌────cityHash64(10000)─┐
│ 14845905981091347439 │
└──────────────────────┘

If I used transaction_id without knowing that they matched the allowable ranges, the results of sampled queries would be skewed. For example, when using sample 0.5, ClickHouse requests where sample_col >= 0 and sample_col <= MAX_UINT64/2.

Also you can include multiple columns into a hash function of the sampling expression to improve randomness of the distribution cityHash64(transaction_id, banner_id).

Sampling Friendly Table

CREATE TABLE table_one
( timestamp UInt64,
  transaction_id UInt64,
  banner_id UInt16,
  value UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(toDateTime(timestamp))
ORDER BY (banner_id,
          toStartOfHour(toDateTime(timestamp)),
          cityHash64(transaction_id))
SAMPLE BY cityHash64(transaction_id)
SETTINGS index_granularity = 8192

insert into table_one
select 1602809234+intDiv(number,100000),
       number,
       number%991,
       toUInt32(rand())
from numbers(10000000000);

I reduced the granularity of the timestamp column to one hour with toStartOfHour(toDateTime(timestamp)) , otherwise sampling will not work.

Verifying Sampling Works

The following shows that sampling works with the table and parameters described above. Notice the Elapsed time when invoking sampling:

-- Q1. No where filters.
-- The query is 10 times faster with SAMPLE 0.01
select banner_id, sum(value), count(value), max(value)
from table_one
group by banner_id format Null;

0 rows in set. Elapsed: 11.490 sec.
     Processed 10.00 billion rows, 60.00 GB (870.30 million rows/s., 5.22 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id format Null;

0 rows in set. Elapsed: 1.316 sec.
     Processed 452.67 million rows, 6.34 GB (343.85 million rows/s., 4.81 GB/s.)

-- Q2. Filter by the first column in index (banner_id = 42)
-- The query is 20 times faster with SAMPLE 0.01
-- reads 20 times less rows: 10.30 million rows VS Processed 696.32 thousand rows
select banner_id, sum(value), count(value), max(value)
from table_one
WHERE banner_id = 42
group by banner_id format Null;

0 rows in set. Elapsed: 0.020 sec.
     Processed 10.30 million rows, 61.78 MB (514.37 million rows/s., 3.09 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
WHERE banner_id = 42
group by banner_id format Null;

0 rows in set. Elapsed: 0.008 sec.
     Processed 696.32 thousand rows, 9.75 MB (92.49 million rows/s., 1.29 GB/s.)

-- Q3. No filters
-- The query is 10 times faster with SAMPLE 0.01
-- reads 20 times less rows.
select banner_id,
       toStartOfHour(toDateTime(timestamp)) hr,
       sum(value), count(value), max(value)
from table_one
group by banner_id, hr format Null;
0 rows in set. Elapsed: 36.660 sec.
     Processed 10.00 billion rows, 140.00 GB (272.77 million rows/s., 3.82 GB/s.)

select banner_id,
       toStartOfHour(toDateTime(timestamp)) hr,
       sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id, hr format Null;
0 rows in set. Elapsed: 3.741 sec.
     Processed 452.67 million rows, 9.96 GB (121.00 million rows/s., 2.66 GB/s.)

-- Q4. Filter by not indexed column
-- The query is 6 times faster with SAMPLE 0.01
-- reads 20 times less rows.
select count()
from table_one
where value = 666 format Null;
1 rows in set. Elapsed: 6.056 sec.
     Processed 10.00 billion rows, 40.00 GB (1.65 billion rows/s., 6.61 GB/s.)

select count()
from table_one  SAMPLE 0.01
where value = 666 format Null;
1 rows in set. Elapsed: 1.214 sec.
     Processed 452.67 million rows, 5.43 GB (372.88 million rows/s., 4.47 GB/s.)

Non-Sampling Friendly Table

CREATE TABLE table_one
( timestamp UInt64,
  transaction_id UInt64,
  banner_id UInt16,
  value UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(toDateTime(timestamp))
ORDER BY (banner_id,
          timestamp,
          cityHash64(transaction_id))
SAMPLE BY cityHash64(transaction_id)
SETTINGS index_granularity = 8192

insert into table_one
select 1602809234+intDiv(number,100000),
       number,
       number%991,
       toUInt32(rand())
from numbers(10000000000);

This is the same as our other table, BUT granularity of timestamp column is not reduced.

Verifying Sampling Does Not Work

The following tests shows that sampling is not working because of the lack of timestamp granularity. The Elapsed time is longer when sampling is used.

-- Q1. No where filters.
-- The query is 2 times SLOWER!!! with SAMPLE 0.01
-- Because it needs to read excessive column with sampling data!
select banner_id, sum(value), count(value), max(value)
from table_one
group by banner_id format Null;
0 rows in set. Elapsed: 11.196 sec.
     Processed 10.00 billion rows, 60.00 GB (893.15 million rows/s., 5.36 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id format Null;
0 rows in set. Elapsed: 24.378 sec.
     Processed 10.00 billion rows, 140.00 GB (410.21 million rows/s., 5.74 GB/s.)

-- Q2. Filter by the first column in index (banner_id = 42)
-- The query is SLOWER with SAMPLE 0.01
select banner_id, sum(value), count(value), max(value)
from table_one
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.022 sec.
     Processed 10.27 million rows, 61.64 MB (459.28 million rows/s., 2.76 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.037 sec.
     Processed 10.27 million rows, 143.82 MB (275.16 million rows/s., 3.85 GB/s.)

-- Q3. No filters
-- The query is SLOWER with SAMPLE 0.01
select banner_id,
       toStartOfHour(toDateTime(timestamp)) hr,
       sum(value), count(value), max(value)
from table_one
group by banner_id, hr format Null;
0 rows in set. Elapsed: 21.663 sec.
     Processed 10.00 billion rows, 140.00 GB (461.62 million rows/s., 6.46 GB/s.)

select banner_id,
       toStartOfHour(toDateTime(timestamp)) hr, sum(value),
       count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id, hr format Null;
0 rows in set. Elapsed: 26.697 sec.
     Processed 10.00 billion rows, 220.00 GB (374.57 million rows/s., 8.24 GB/s.)

-- Q4. Filter by not indexed column
-- The query is SLOWER with SAMPLE 0.01
select count()
from table_one
where value = 666 format Null;
0 rows in set. Elapsed: 7.679 sec.
     Processed 10.00 billion rows, 40.00 GB (1.30 billion rows/s., 5.21 GB/s.)

select count()
from table_one  SAMPLE 0.01
where value = 666 format Null;
0 rows in set. Elapsed: 21.668 sec.
     Processed 10.00 billion rows, 120.00 GB (461.51 million rows/s., 5.54 GB/s.)

2.38 - Simple aggregate functions & combinators

Simple aggregate functions & combinators

Q. What is SimpleAggregateFunction? Are there advantages to use it instead of AggregateFunction in AggregatingMergeTree?

The ClickHouse® SimpleAggregateFunction can be used for those aggregations when the function state is exactly the same as the resulting function value. Typical example is max function: it only requires storing the single value which is already maximum, and no extra steps needed to get the final value. In contrast avg need to store two numbers - sum & count, which should be divided to get the final value of aggregation (done by the -Merge step at the very end).

	SimpleAggregateFunction	AggregateFunction
inserting	accepts the value of underlying type OR a value of corresponding SimpleAggregateFunction type `CREATE TABLE saf_test ( x SimpleAggregateFunction(max, UInt64) ) ENGINE=AggregatingMergeTree ORDER BY tuple(); INSERT INTO saf_test VALUES (1); INSERT INTO saf_test SELECT max(number) FROM numbers(10); INSERT INTO saf_test SELECT maxSimpleState(number) FROM numbers(20);`	ONLY accepts the state of same aggregate function calculated using -State combinator
storing	Internally store just a value of underlying type	function-specific state
storage usage	typically is much better due to better compression/codecs	in very rare cases it can be more optimal than raw values adaptive granularity doesn't work for large states
reading raw value per row	you can access it directly	you need to use `finalizeAggregation` function
using aggregated value	just `select max(x) from test;`	you need to use `-Merge` combinator `select maxMerge(x) from test;`
memory usage	typically less memory needed (in some corner cases even 10 times)	typically uses more memory, as every state can be quite complex
performance	typically better, due to lower overhead	worse

See also:

Q. How maxSimpleState combinator result differs from plain max?

They produce the same result, but types differ (the first have SimpleAggregateFunction datatype). Both can be pushed to SimpleAggregateFunction or to the underlying type. So they are interchangeable.

Info

-SimpleState is useful for implicit Materialized View creation, like

CREATE MATERIALIZED VIEW mv ENGINE = AggregatingMergeTree ORDER BY date AS SELECT date, sumSimpleState(1) AS cnt, sumSimpleState(revenue) AS rev FROM table GROUP BY date

Warning

-SimpleState supported since 21.1. See https://github.com/ClickHouse/ClickHouse/pull/16853/

Q. Can I use -If combinator with SimpleAggregateFunction?

Something like SimpleAggregateFunction(maxIf, UInt64, UInt8) is NOT possible. But is 100% ok to push maxIf (or maxSimpleStateIf) into SimpleAggregateFunction(max, UInt64)

There is one problem with that approach: -SimpleStateIf Would produce 0 as result in case of no-match, and it can mess up some aggregate functions state. It wouldn’t affect functions like max/argMax/sum, but could affect functions like min/argMin/any/anyLast

SELECT
    minIfMerge(state_1),
    min(state_2)
FROM
(
    SELECT
        minIfState(number, number > 5) AS state_1,
        minSimpleStateIf(number, number > 5) AS state_2
    FROM numbers(5)
    UNION ALL
    SELECT
        minIfState(toUInt64(2), 2),
        minIf(2, 2)
)

┌─minIfMerge(state_1)─┬─min(state_2)─┐
│                   2 │            0 │
└─────────────────────┴──────────────┘

You can easily workaround that:

Using Nullable datatype.
Set result to some big number in case of no-match, which would be bigger than any possible value, so it would be safe to use. But it would work only for min/argMin

SELECT
    min(state_1),
    min(state_2)
FROM
(
    SELECT
        minSimpleState(if(number > 5, number, 1000)) AS state_1,
        minSimpleStateIf(toNullable(number), number > 5) AS state_2
    FROM numbers(5)
    UNION ALL
    SELECT
        minIf(2, 2),
        minIf(2, 2)
)

┌─min(state_1)─┬─min(state_2)─┐
│            2 │            2 │
└──────────────┴──────────────┘

Extra example

WITH
    minIfState(number, number > 5) AS state_1,
    minSimpleStateIf(number, number > 5) AS state_2
SELECT
    byteSize(state_1),
    toTypeName(state_1),
    byteSize(state_2),
    toTypeName(state_2)
FROM numbers(10)
FORMAT Vertical

-- For UInt64
Row 1:
──────
byteSize(state_1):   24
toTypeName(state_1): AggregateFunction(minIf, UInt64, UInt8)
byteSize(state_2):   8
toTypeName(state_2): SimpleAggregateFunction(min, UInt64)

-- For UInt32
──────
byteSize(state_1):   16
byteSize(state_2):   4

-- For UInt16
──────
byteSize(state_1):   12
byteSize(state_2):   2

-- For UInt8
──────
byteSize(state_1):   10
byteSize(state_2):   1

2.39 - Skip indexes

Skip indexes

ClickHouse® provides a type of index that in specific circumstances can significantly improve query speed. These structures are labeled “skip” indexes because they enable ClickHouse to skip reading significant chunks of data that are guaranteed to have no matching values.

2.39.1 - Example: minmax

Example: minmax

Use cases

Strong correlation between column from table ORDER BY / PARTITION BY key and other column which is regularly being used in WHERE condition

Good example is incremental ID which increasing with time.

CREATE TABLE skip_idx_corr
(
    `key` UInt32,
    `id` UInt32,
    `ts` DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (key, id);

INSERT INTO skip_idx_corr SELECT
    rand(),
    number,
    now() + intDiv(number, 10)
FROM numbers(100000000);

SELECT count()
FROM skip_idx_corr
WHERE id = 6000000

1 rows in set. Elapsed: 0.167 sec. Processed 100.00 million rows, 400.00 MB
(599.96 million rows/s., 2.40 GB/s.)


ALTER TABLE skip_idx_corr ADD INDEX id_idx id TYPE minmax GRANULARITY 10;
ALTER TABLE skip_idx_corr MATERIALIZE INDEX id_idx;


SELECT count()
FROM skip_idx_corr
WHERE id = 6000000

1 rows in set. Elapsed: 0.017 sec. Processed 6.29 million rows, 25.17 MB
(359.78 million rows/s., 1.44 GB/s.)

Multiple Date/DateTime columns can be used in WHERE conditions

Usually it could happen if you have separate Date and DateTime columns and different column being used in PARTITION BY expression and in WHERE condition. Another possible scenario when you have multiple DateTime columns which have pretty the same date or even time.

CREATE TABLE skip_idx_multiple
(
    `key` UInt32,
    `date` Date,
    `time` DateTime,
    `created_at` DateTime,
    `inserted_at` DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (key, time);

INSERT INTO skip_idx_multiple SELECT
    number,
    toDate(x),
    now() + intDiv(number, 10) AS x,
    x - (rand() % 100),
    x + (rand() % 100)
FROM numbers(100000000);


SELECT count()
FROM skip_idx_multiple
WHERE date > (now() + toIntervalDay(105));

1 rows in set. Elapsed: 0.048 sec. Processed 14.02 million rows, 28.04 MB
(290.96 million rows/s., 581.92 MB/s.)

SELECT count()
FROM skip_idx_multiple
WHERE time > (now() + toIntervalDay(105));

1 rows in set. Elapsed: 0.188 sec. Processed 100.00 million rows, 400.00 MB
(530.58 million rows/s., 2.12 GB/s.)

SELECT count()
FROM skip_idx_multiple
WHERE created_at > (now() + toIntervalDay(105));

1 rows in set. Elapsed: 0.400 sec. Processed 100.00 million rows, 400.00 MB
(250.28 million rows/s., 1.00 GB/s.)


ALTER TABLE skip_idx_multiple ADD INDEX time_idx time TYPE minmax GRANULARITY 1000;
ALTER TABLE skip_idx_multiple MATERIALIZE INDEX time_idx;

SELECT count()
FROM skip_idx_multiple
WHERE time > (now() + toIntervalDay(105));

1 rows in set. Elapsed: 0.036 sec. Processed 14.02 million rows, 56.08 MB
(391.99 million rows/s., 1.57 GB/s.)


ALTER TABLE skip_idx_multiple ADD INDEX created_at_idx created_at TYPE minmax GRANULARITY 1000;
ALTER TABLE skip_idx_multiple MATERIALIZE INDEX created_at_idx;

SELECT count()
FROM skip_idx_multiple
WHERE created_at > (now() + toIntervalDay(105));

1 rows in set. Elapsed: 0.076 sec. Processed 14.02 million rows, 56.08 MB
(184.90 million rows/s., 739.62 MB/s.)

Condition in query trying to filter outlier value

CREATE TABLE skip_idx_outlier
(
    `key` UInt32,
    `ts` DateTime,
    `value` UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (key, ts);

INSERT INTO skip_idx_outlier SELECT
    number,
    now(),
    rand() % 10
FROM numbers(10000000);

INSERT INTO skip_idx_outlier SELECT
    number,
    now(),
    20
FROM numbers(10);

SELECT count()
FROM skip_idx_outlier
WHERE value > 15;

1 rows in set. Elapsed: 0.059 sec. Processed 10.00 million rows, 40.00 MB
(170.64 million rows/s., 682.57 MB/s.)

ALTER TABLE skip_idx_outlier ADD INDEX value_idx value TYPE minmax GRANULARITY 10;
ALTER TABLE skip_idx_outlier MATERIALIZE INDEX value_idx;

SELECT count()
FROM skip_idx_outlier
WHERE value > 15;

1 rows in set. Elapsed: 0.004 sec.

2.39.2 - Skip index bloom_filter Example

tested with ClickHouse® 20.8.17.25

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes

Let’s create test data

create table bftest (k Int64, x Array(Int64))
Engine=MergeTree order by k;

insert into bftest select number,
    arrayMap(i->rand64()%565656, range(10)) from numbers(10000000);
insert into bftest select number,
    arrayMap(i->rand64()%565656, range(10)) from numbers(100000000);

Base point (no index)

select count() from bftest where has(x, 42);
┌─count()─┐
│     186 │
└─────────┘
1 rows in set. Elapsed: 0.495 sec.
    Processed 110.00 million rows, 9.68 GB (222.03 million rows/s., 19.54 GB/s.)

select count() from bftest where has(x, -42);
┌─count()─┐
│       0 │
└─────────┘
1 rows in set. Elapsed: 0.505 sec.
    Processed 110.00 million rows, 9.68 GB (217.69 million rows/s., 19.16 GB/s.)

As you can see ClickHouse read 110.00 million rows and the query elapsed Elapsed: 0.505 sec.

Let’s add an index

alter table bftest add index ix1(x) TYPE bloom_filter GRANULARITY 3;

-- GRANULARITY 3 means how many table granules will be in the one index granule
-- In our case 1 granule of skip index allows to check and skip 3*8192 rows.
-- Every dataset is unique sometimes GRANULARITY 1 is better, sometimes
-- GRANULARITY 10.
-- Need to test on the real data.

optimize table bftest final;
-- I need to optimize my table because an index is created for only
-- new parts (inserted or merged)
-- optimize table final re-writes all parts, but with an index.
-- probably in your production you don't need to optimize
-- because your data is rotated frequently.
-- optimize is a heavy operation, better never run optimize table final in a
-- production.

test bloom_filter GRANULARITY 3

select count() from bftest where has(x, 42);
┌─count()─┐
│     186 │
└─────────┘
1 rows in set. Elapsed: 0.063 sec.
    Processed 5.41 million rows, 475.79 MB (86.42 million rows/s., 7.60 GB/s.)

select count() from bftest where has(x, -42);
┌─count()─┐
│       0 │
└─────────┘
1 rows in set. Elapsed: 0.042 sec.
   Processed 1.13 million rows, 99.48 MB (26.79 million rows/s., 2.36 GB/s.)

As you can see I got 10 times boost.

Let’s try to reduce GRANULARITY to drop by 1 table granule

alter  table bftest drop index ix1;
alter table bftest add index ix1(x) TYPE bloom_filter GRANULARITY 1;
optimize table bftest final;

select count() from bftest where has(x, 42);
┌─count()─┐
│     186 │
└─────────┘
1 rows in set. Elapsed: 0.051 sec.
    Processed 3.64 million rows, 320.08 MB (71.63 million rows/s., 6.30 GB/s.)

select count() from bftest where has(x, -42);
┌─count()─┐
│       0 │
└─────────┘
1 rows in set. Elapsed: 0.050 sec.
    Processed 2.06 million rows, 181.67 MB (41.53 million rows/s., 3.65 GB/s.)

No improvement :(

Let’s try to change the false/true probability of the bloom_filter bloom_filter(0.05)

alter  table bftest drop index ix1;
alter table bftest add index ix1(x) TYPE bloom_filter(0.05) GRANULARITY 3;
optimize table bftest final;

select count() from bftest where has(x, 42);
┌─count()─┐
│     186 │
└─────────┘
1 rows in set. Elapsed: 0.079 sec.
    Processed 8.95 million rows, 787.22 MB (112.80 million rows/s., 9.93 GB/s.)

select count() from bftest where has(x, -42);
┌─count()─┐
│       0 │
└─────────┘
1 rows in set. Elapsed: 0.058 sec.
    Processed 3.86 million rows, 339.54 MB (66.83 million rows/s., 5.88 GB/s.)

No improvement.

bloom_filter(0.01)

alter  table bftest drop index ix1;
alter table bftest add index ix1(x) TYPE bloom_filter(0.01) GRANULARITY 3;
optimize table bftest final;

select count() from bftest where has(x, 42);
┌─count()─┐
│     186 │
└─────────┘
1 rows in set. Elapsed: 0.069 sec.
    Processed 5.26 million rows, 462.82 MB (76.32 million rows/s., 6.72 GB/s.)

select count() from bftest where has(x, -42);
┌─count()─┐
│       0 │
└─────────┘
1 rows in set. Elapsed: 0.047 sec.
    Processed 737.28 thousand rows, 64.88 MB (15.72 million rows/s., 1.38 GB/s.)

Also no improvement :(

Outcome: I would use TYPE bloom_filter GRANULARITY 3.

2.39.3 - Skip indexes examples

bloom_filter

create table bftest (k Int64, x Int64) Engine=MergeTree order by k;

insert into bftest select number, rand64()%565656 from numbers(10000000);
insert into bftest select number, rand64()%565656 from numbers(100000000);

select count() from bftest where x = 42;
┌─count()─┐
│     201 │
└─────────┘
1 rows in set. Elapsed: 0.243 sec. Processed 110.00 million rows


alter table bftest add index ix1(x) TYPE bloom_filter GRANULARITY 1;

alter table bftest materialize index ix1;


select count() from bftest where x = 42;
┌─count()─┐
│     201 │
└─────────┘
1 rows in set. Elapsed: 0.056 sec. Processed 3.68 million rows

minmax

create table bftest (k Int64, x Int64) Engine=MergeTree order by k;

-- data is in x column is correlated with the primary key
insert into bftest select number, number * 2 from numbers(100000000);

alter table bftest add index ix1(x) TYPE minmax GRANULARITY 1;
alter table bftest materialize index ix1;

select count() from bftest where x = 42;
1 rows in set. Elapsed: 0.004 sec. Processed 8.19 thousand rows

projection

create table bftest (k Int64, x Int64, S String) Engine=MergeTree order by k;
insert into bftest select number, rand64()%565656, '' from numbers(10000000);
insert into bftest select number, rand64()%565656, '' from numbers(100000000);
alter table bftest add projection p1 (select k,x order by x);
alter table bftest materialize projection p1 settings mutations_sync=1;
set allow_experimental_projection_optimization=1 ;

-- projection
select count() from bftest where x = 42;
1 rows in set. Elapsed: 0.002 sec. Processed 24.58 thousand rows

-- no projection
select * from bftest where x = 42 format Null;
0 rows in set. Elapsed: 0.432 sec. Processed 110.00 million rows

-- projection
select * from bftest where k in (select k from bftest where x = 42) format Null;
0 rows in set. Elapsed: 0.316 sec. Processed 1.50 million rows

2.40 - Time zones

Time zones

Important things to know:

DateTime inside ClickHouse® is actually UNIX timestamp always, i.e. number of seconds since 1970-01-01 00:00:00 GMT.
Conversion from that UNIX timestamp to a human-readable form and reverse can happen on the client (for native clients) and on the server (for HTTP clients, and for some type of queries, like toString(ts))
Depending on the place where that conversion happened rules of different timezones may be applied.
You can check server timezone using SELECT timezone()
clickhouse-client also by default tries to use server timezone (see also --use_client_time_zone flag)
If you want you can store the timezone name inside the data type, in that case, timestamp <-> human-readable time rules of that timezone will be applied.

SELECT
    timezone(),
    toDateTime(now()) AS t,
    toTypeName(t),
    toDateTime(now(), 'UTC') AS t_utc,
    toTypeName(t_utc),
    toUnixTimestamp(t),
    toUnixTimestamp(t_utc)

Row 1:
──────
timezone():                                Europe/Warsaw
t:                                         2021-07-16 12:50:28
toTypeName(toDateTime(now())):             DateTime
t_utc:                                     2021-07-16 10:50:28
toTypeName(toDateTime(now(), 'UTC')):      DateTime('UTC')
toUnixTimestamp(toDateTime(now())):        1626432628
toUnixTimestamp(toDateTime(now(), 'UTC')): 1626432628

Since version 20.4 ClickHouse uses embedded tzdata (see https://github.com/ClickHouse/ClickHouse/pull/10425 )

You get used tzdata version

SELECT *
FROM system.build_options
WHERE name = 'TZDATA_VERSION'

Query id: 0a9883f0-dadf-4fb1-8b42-8fe93f561430

┌─name───────────┬─value─┐
│ TZDATA_VERSION │ 2020e │
└────────────────┴───────┘

and list of available time zones

SELECT *
FROM system.time_zones
WHERE time_zone LIKE '%Anta%'

Query id: 855453d7-eccd-44cb-9631-f63bb02a273c

┌─time_zone─────────────────┐
│ Antarctica/Casey          │
│ Antarctica/Davis          │
│ Antarctica/DumontDUrville │
│ Antarctica/Macquarie      │
│ Antarctica/Mawson         │
│ Antarctica/McMurdo        │
│ Antarctica/Palmer         │
│ Antarctica/Rothera        │
│ Antarctica/South_Pole     │
│ Antarctica/Syowa          │
│ Antarctica/Troll          │
│ Antarctica/Vostok         │
│ Indian/Antananarivo       │
└───────────────────────────┘

13 rows in set. Elapsed: 0.002 sec.

ClickHouse uses system timezone info from tzdata package if it exists, and uses own builtin tzdata if it is missing in the system.

cd /usr/share/zoneinfo/Canada
ln -s ../America/Halifax A

TZ=Canada/A clickhouse-local -q 'select timezone()'
Canada/A

When the conversion using different rules happen

SELECT timezone()

┌─timezone()─┐
│ UTC        │
└────────────┘

create table t_with_dt_utc ( ts DateTime64(3,'Europe/Moscow') ) engine=Log;

create table x (ts String) engine=Null;

create materialized view x_mv to t_with_dt_utc as select parseDateTime64BestEffort(ts) as ts from x;

$ echo '2021-07-15T05:04:23.733' | clickhouse-client -q 'insert into t_with_dt_utc format CSV'
-- here client checks the type of the columns, see that it's 'Europe/Moscow' and use conversion according to moscow rules

$ echo '2021-07-15T05:04:23.733' | clickhouse-client -q 'insert into x format CSV'
-- here client check tha type of the columns (it is string), and pass string value to the server.
-- parseDateTime64BestEffort(ts) uses server default timezone (UTC in my case), and convert the value using UTC rules.
-- and the result is 2 different timestamps (when i selecting from that is shows both in 'desired' timezone, forced by column type, i.e. Moscow):

SELECT * FROM t_with_dt_utc
┌──────────────────────ts─┐
│ 2021-07-15 05:04:23.733 │
│ 2021-07-15 08:04:23.733 │
└─────────────────────────┘

Best practice here: use UTC timezone everywhere, OR use the same default timezone for ClickHouse server as used by your data

2.41 - Time-series alignment with interpolation

Time-series alignment with interpolation

This article demonstrates how to perform time-series data alignment with interpolation using window functions in ClickHouse. The goal is to align two different time-series (A and B) on the same timestamp axis and fill the missing values using linear interpolation.

Step-by-Step Implementation We begin by creating a table with test data that simulates two time-series (A and B) with randomly distributed timestamps and values. Then, we apply interpolation to fill missing values for each time-series based on the surrounding data points.

1. Drop Existing Table (if it exists)

DROP TABLE test_ts_interpolation;

This ensures that any previous versions of the table are removed.

2. Generate Test Data

In this step, we generate random time-series data with timestamps and values for series A and B. The values are calculated differently for each series:

CREATE TABLE test_ts_interpolation
ENGINE = Log AS
SELECT
    ((number * 100) + 50) - (rand() % 100) AS timestamp, -- random timestamp generation
    transform(rand() % 2, [0, 1], ['A', 'B'], '') AS ts, -- randomly assign series 'A' or 'B'
    if(ts = 'A', timestamp * 10, timestamp * 100) AS value -- different value generation for each series
FROM numbers(1000000);

Here, the timestamp is generated randomly and assigned to either series A or B using the transform() function. The value is calculated based on the series type (A or B), with different multipliers for each.

3. Preview the Generated Data

After generating the data, you can inspect it by running a simple SELECT query:

SELECT * FROM test_ts_interpolation;

This will show the randomly generated timestamps, series (A or B), and their corresponding values.

4. Perform Interpolation with Window Functions

To align the time-series and interpolate missing values, we use window functions in the following query:

SELECT 
    timestamp,
    if(
        ts = 'A',
        toFloat64(value), -- If the current series is 'A', keep the original value
        prev_a.2 + (timestamp - prev_a.1 ) * (next_a.2 - prev_a.2) / ( next_a.1 - prev_a.1) -- Interpolate for 'A'
    ) as a_value,
    if(
        ts = 'B',
        toFloat64(value), -- If the current series is 'B', keep the original value
        prev_b.2 + (timestamp - prev_b.1 ) * (next_b.2 - prev_b.2) / ( next_b.1 - prev_b.1) -- Interpolate for 'B'
    ) as b_value
FROM 
(
    SELECT 
        timestamp,
        ts,
        value,
        -- Find the previous and next values for series 'A'
        anyLastIf((timestamp,value), ts='A') OVER (ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_a,
        anyLastIf((timestamp,value), ts='A') OVER (ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS next_a,
        -- Find the previous and next values for series 'B'
        anyLastIf((timestamp,value), ts='B') OVER (ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_b,
        anyLastIf((timestamp,value), ts='B') OVER (ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS next_b
    FROM 
    test_ts_interpolation
)

Explanation:

Timestamp Alignment: We align the timestamps of both series (A and B) and handle missing data points.

Interpolation Logic: For each A-series timestamp, if the current series is not A, we calculate the interpolated value using the linear interpolation formula:

interpolated_value = prev_a.2 + ((timestamp - prev_a.1) / (next_a.1 - prev_a.1)) * (next_a.2 - prev_a.2)

Similarly, for the B series, interpolation is calculated between the previous (prev_b) and next (next_b) known values.

Window Functions: anyLastIf() is used to fetch the previous or next values for series A and B based on the timestamps. We use window functions to efficiently calculate these values over the ordered sequence of timestamps.

By using window functions and interpolation, we can align time-series data with irregular timestamps and fill in missing values based on nearby data points. This technique is useful in scenarios where data is recorded at different times or irregular intervals across multiple series.

2.42 - Top N & Remain

Top N & Remain

When working with large datasets, you may often need to compute the sum of values for the top N groups and aggregate the remainder separately. This article demonstrates several methods to achieve that in ClickHouse.

Dataset Setup We’ll start by creating a table top_with_rest and inserting data for demonstration purposes:

CREATE TABLE top_with_rest
(
    `k` String,
    `number` UInt64
)
ENGINE = Memory;

INSERT INTO top_with_rest SELECT
    toString(intDiv(number, 10)),
    number
FROM numbers_mt(10000);

This creates a table with 10,000 numbers, grouped by dividing the numbers into tens.

Method 1: Using UNION ALL

This approach retrieves the top 10 groups by sum and aggregates the remaining groups as a separate row.

SELECT *
FROM
(
    SELECT
        k,
        sum(number) AS res
    FROM top_with_rest
    GROUP BY k
    ORDER BY res DESC
    LIMIT 10
    UNION ALL
    SELECT
        NULL,
        sum(number) AS res
    FROM top_with_rest
    WHERE k NOT IN (
        SELECT k
        FROM top_with_rest
        GROUP BY k
        ORDER BY sum(number) DESC
        LIMIT 10
    )
)
ORDER BY res ASC

┌─k───┬───res─┐
│ 990 │ 99045 │
│ 991 │ 99145 │
│ 992 │ 99245 │
│ 993 │ 99345 │
│ 994 │ 99445 │
│ 995 │ 99545 │
│ 996 │ 99645 │
│ 997 │ 99745 │
│ 998 │ 99845 │
│ 999 │ 99945 │
└─────┴───────┘
┌─k────┬──────res─┐
│ null │ 49000050 │
└──────┴──────────┘

Method 2: Using Arrays

In this method, we push the top 10 groups into an array and add a special row for the remainder

WITH toUInt64(sumIf(sum, isNull(k)) - sumIf(sum, isNotNull(k))) AS total
SELECT
    (arrayJoin(arrayPushBack(groupArrayIf(10)((k, sum), isNotNull(k)), (NULL, total))) AS tpl).1 AS key,
    tpl.2 AS res
FROM
(
    SELECT
        toNullable(k) AS k,
        sum(number) AS sum
    FROM top_with_rest
    GROUP BY k
        WITH CUBE
    ORDER BY sum DESC
    LIMIT 11
)
ORDER BY res ASC

┌─key──┬──────res─┐
│ 990  │    99045 │
│ 991  │    99145 │
│ 992  │    99245 │
│ 993  │    99345 │
│ 994  │    99445 │
│ 995  │    99545 │
│ 996  │    99645 │
│ 997  │    99745 │
│ 998  │    99845 │
│ 999  │    99945 │
│ null │ 49000050 │
└──────┴──────────┘

Method 3: Using Window Functions

Window functions, available from ClickHouse version 21.1, provide an efficient way to calculate the sum for the top N rows and the remainder.

SET allow_experimental_window_functions = 1;

SELECT
    k AS key,
    If(isNotNull(key), sum, toUInt64(sum - wind)) AS res
FROM
(
    SELECT
        *,
        sumIf(sum, isNotNull(k)) OVER () AS wind
    FROM
    (
        SELECT
            toNullable(k) AS k,
            sum(number) AS sum
        FROM top_with_rest
        GROUP BY k
            WITH CUBE
        ORDER BY sum DESC
        LIMIT 11
    )
)
ORDER BY res ASC

┌─key──┬──────res─┐
│ 990  │    99045 │
│ 991  │    99145 │
│ 992  │    99245 │
│ 993  │    99345 │
│ 994  │    99445 │
│ 995  │    99545 │
│ 996  │    99645 │
│ 997  │    99745 │
│ 998  │    99845 │
│ 999  │    99945 │
│ null │ 49000050 │
└──────┴──────────┘

Window functions allow efficient summation of the total and top groups in one query.

Method 4: Using Row Number and Grouping

This approach calculates the row number (rn) for each group and replaces the remaining groups with NULL.

SELECT
    k,
    sum(sum) AS res
FROM
(
    SELECT
        if(rn > 10, NULL, k) AS k,
        sum
    FROM
    (
        SELECT
            k,
            sum,
            row_number() OVER () AS rn
        FROM
        (
            SELECT
                k,
                sum(number) AS sum
            FROM top_with_rest
            GROUP BY k
            ORDER BY sum DESC
        )
    )
)
GROUP BY k
ORDER BY res

┌─k────┬──────res─┐
│ 990  │    99045 │
│ 991  │    99145 │
│ 992  │    99245 │
│ 993  │    99345 │
│ 994  │    99445 │
│ 995  │    99545 │
│ 996  │    99645 │
│ 997  │    99745 │
│ 998  │    99845 │
│ 999  │    99945 │
│ null │ 49000050 │
└──────┴──────────┘

This method uses ROW_NUMBER() to segregate the top N from the rest.

Method 5: Using WITH TOTALS

This method includes totals for all groups, and you calculate the remainder on the application side.

SELECT
    k,
    sum(number) AS res
FROM top_with_rest
GROUP BY k
    WITH TOTALS
ORDER BY res DESC
LIMIT 10

┌─k───┬───res─┐
│ 999 │ 99945 │
│ 998 │ 99845 │
│ 997 │ 99745 │
│ 996 │ 99645 │
│ 995 │ 99545 │
│ 994 │ 99445 │
│ 993 │ 99345 │
│ 992 │ 99245 │
│ 991 │ 99145 │
│ 990 │ 99045 │
└─────┴───────┘

Totals:
┌─k─┬──────res─┐
│   │ 49995000 │
└───┴──────────┘

You would subtract the sum of the top rows from the totals in your application.

These methods offer different approaches for handling the Top N rows and aggregating the remainder in ClickHouse. Depending on your requirements—whether you prefer using UNION ALL, arrays, window functions, or totals—each method provides flexibility for efficient querying.

2.43 - Troubleshooting

Tips for ClickHouse® troubleshooting

Query Execution Logging

When troubleshooting query execution in ClickHouse®, one of the most useful tools is logging the query execution details. This can be controlled using the session-level setting send_logs_level. Here are the different log levels you can use: Possible values: 'trace', 'debug', 'information', 'warning', 'error', 'fatal', 'none'

This can be used with clickhouse-client in both interactive and non-interactive mode.

The logs provide detailed information about query execution, making it easier to identify issues or bottlenecks. You can use the following command to run a query with logging enabled:

$ clickhouse-client -mn --send_logs_level='trace' --query "SELECT sum(number) FROM numbers(1000)"

-- output -- 
[LAPTOP] 2021.04.29 00:05:31.425842 [ 25316 ] {14b0646d-8a6e-4b2f-9b13-52a218cf43ba} <Debug> executeQuery: (from 127.0.0.1:42590, using production parser) SELECT sum(number) FROM numbers(1000)
[LAPTOP] 2021.04.29 00:05:31.426281 [ 25316 ] {14b0646d-8a6e-4b2f-9b13-52a218cf43ba} <Trace> ContextAccess (default): Access granted: CREATE TEMPORARY TABLE ON *.*
[LAPTOP] 2021.04.29 00:05:31.426648 [ 25316 ] {14b0646d-8a6e-4b2f-9b13-52a218cf43ba} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
[LAPTOP] 2021.04.29 00:05:31.427132 [ 25448 ] {14b0646d-8a6e-4b2f-9b13-52a218cf43ba} <Trace> AggregatingTransform: Aggregating
[LAPTOP] 2021.04.29 00:05:31.427187 [ 25448 ] {14b0646d-8a6e-4b2f-9b13-52a218cf43ba} <Trace> Aggregator: Aggregation method: without_key
[LAPTOP] 2021.04.29 00:05:31.427220 [ 25448 ] {14b0646d-8a6e-4b2f-9b13-52a218cf43ba} <Debug> AggregatingTransform: Aggregated. 1000 to 1 rows (from 7.81 KiB) in 0.0004469 sec. (2237637.0552696353 rows/sec., 17.07 MiB/sec.)
[LAPTOP] 2021.04.29 00:05:31.427233 [ 25448 ] {14b0646d-8a6e-4b2f-9b13-52a218cf43ba} <Trace> Aggregator: Merging aggregated data
[LAPTOP] 2021.04.29 00:05:31.427875 [ 25316 ] {14b0646d-8a6e-4b2f-9b13-52a218cf43ba} <Information> executeQuery: Read 1000 rows, 7.81 KiB in 0.0019463 sec., 513795 rows/sec., 3.92 MiB/sec.
[LAPTOP] 2021.04.29 00:05:31.427898 [ 25316 ] {14b0646d-8a6e-4b2f-9b13-52a218cf43ba} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
499500

You can also redirect the logs to a file for further analysis:

$ clickhouse-client -mn --send_logs_level='trace' --query "SELECT sum(number) FROM numbers(1000)" 2> ./query.log

Analyzing Logs in System Tables

If you need to analyze the logs after executing a query, you can query the system tables to retrieve the execution details.

Query Log: You can fetch query logs from the system.query_log table:

LAPTOP.localdomain :) SET send_logs_level='trace';

SET send_logs_level = 'trace'

Query id: cbbffc02-283e-48ef-93e2-8b3baced6689

Ok.

0 rows in set. Elapsed: 0.003 sec.

LAPTOP.localdomain :) SELECT sum(number) FROM numbers(1000);

SELECT sum(number)
FROM numbers(1000)

Query id: d3db767b-34e9-4252-9f90-348cf958f822

[LAPTOP] 2021.04.29 00:06:51.673836 [ 25316 ] {d3db767b-34e9-4252-9f90-348cf958f822} <Debug> executeQuery: (from 127.0.0.1:43116, using production parser) SELECT sum(number) FROM numbers(1000);
[LAPTOP] 2021.04.29 00:06:51.674167 [ 25316 ] {d3db767b-34e9-4252-9f90-348cf958f822} <Trace> ContextAccess (default): Access granted: CREATE TEMPORARY TABLE ON *.*
[LAPTOP] 2021.04.29 00:06:51.674419 [ 25316 ] {d3db767b-34e9-4252-9f90-348cf958f822} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
[LAPTOP] 2021.04.29 00:06:51.674748 [ 25449 ] {d3db767b-34e9-4252-9f90-348cf958f822} <Trace> AggregatingTransform: Aggregating
[LAPTOP] 2021.04.29 00:06:51.674781 [ 25449 ] {d3db767b-34e9-4252-9f90-348cf958f822} <Trace> Aggregator: Aggregation method: without_key
[LAPTOP] 2021.04.29 00:06:51.674855 [ 25449 ] {d3db767b-34e9-4252-9f90-348cf958f822} <Debug> AggregatingTransform: Aggregated. 1000 to 1 rows (from 7.81 KiB) in 0.0003299 sec. (3031221.582297666 rows/sec., 23.13 MiB/sec.)
[LAPTOP] 2021.04.29 00:06:51.674883 [ 25449 ] {d3db767b-34e9-4252-9f90-348cf958f822} <Trace> Aggregator: Merging aggregated data
┌─sum(number)─┐
│      499500 │
└─────────────┘
[LAPTOP] 2021.04.29 00:06:51.675481 [ 25316 ] {d3db767b-34e9-4252-9f90-348cf958f822} <Information> executeQuery: Read 1000 rows, 7.81 KiB in 0.0015799 sec., 632951 rows/sec., 4.83 MiB/sec.
[LAPTOP] 2021.04.29 00:06:51.675508 [ 25316 ] {d3db767b-34e9-4252-9f90-348cf958f822} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.

1 rows in set. Elapsed: 0.007 sec. Processed 1.00 thousand rows, 8.00 KB (136.43 thousand rows/s., 1.09 MB/s.)

Analyzing Logs in System Tables

# Query Log: You can fetch query logs from the system.query_log table:

SELECT sum(number)
FROM numbers(1000);

Query id: 34c61093-3303-47d0-860b-0d644fa7264b

┌─sum(number)─┐
│      499500 │
└─────────────┘

1 row in set. Elapsed: 0.002 sec. Processed 1.00 thousand rows, 8.00 KB (461.45 thousand rows/s., 3.69 MB/s.)

SELECT *
FROM system.query_log
WHERE (event_date = today()) AND (query_id = '34c61093-3303-47d0-860b-0d644fa7264b');

# Query Thread Log: If thread-level logging is enabled (log_query_threads = 1), retrieve logs using:
# To capture detailed thread-level logs, enable log_query_threads: (SET log_query_threads = 1;)

SELECT *
FROM system.query_thread_log
WHERE (event_date = today()) AND (query_id = '34c61093-3303-47d0-860b-0d644fa7264b');

# OpenTelemetry Span Log: For detailed tracing with OpenTelemetry, if enabled (opentelemetry_start_trace_probability = 1), use:
# To enable OpenTelemetry tracing for queries, set: (SET opentelemetry_start_trace_probability = 1, opentelemetry_trace_processors = 1) 

SELECT *
FROM system.opentelemetry_span_log
WHERE (trace_id, finish_date) IN (
    SELECT
        trace_id,
        finish_date
    FROM system.opentelemetry_span_log
    WHERE ((attribute['clickhouse.query_id']) = '34c61093-3303-47d0-860b-0d644fa7264b') AND (finish_date = today())
);

Visualizing Query Performance with Flamegraphs

ClickHouse supports exporting query performance data in a format compatible with speedscope.app. This can help you visualize performance bottlenecks within queries. Example query to generate a flamegraph: https://www.speedscope.app/

WITH
    '95578e1c-1e93-463c-916c-a1a8cdd08198' AS query,
    min(min) AS start_value,
    max(max) AS end_value,
    groupUniqArrayArrayArray(trace_arr) AS uniq_frames,
    arrayMap((x, a, b) -> ('sampled', b, 'none', start_value, end_value, arrayMap(s -> reverse(arrayMap(y -> toUInt32(indexOf(uniq_frames, y) - 1), s)), x), a), groupArray(trace_arr), groupArray(weights), groupArray(trace_type)) AS samples
SELECT
    concat('clickhouse-server@', version()) AS exporter,
    'https://www.speedscope.app/file-format-schema.json' AS `$schema`,
    concat('ClickHouse query id: ', query) AS name,
    CAST(samples, 'Array(Tuple(type String, name String, unit String, startValue UInt64, endValue UInt64, samples Array(Array(UInt32)), weights Array(UInt32)))') AS profiles,
    CAST(tuple(arrayMap(x -> (demangle(addressToSymbol(x)), addressToLine(x)), uniq_frames)), 'Tuple(frames Array(Tuple(name String, line String)))') AS shared
FROM
(
    SELECT
        min(min_ns) AS min,
        trace_type,
        max(max_ns) AS max,
        groupArray(trace) AS trace_arr,
        groupArray(cnt) AS weights
    FROM
    (
        SELECT
            min(timestamp_ns) AS min_ns,
            max(timestamp_ns) AS max_ns,
            trace,
            trace_type,
            count() AS cnt
        FROM system.trace_log
        WHERE query_id = query
        GROUP BY
            trace_type,
            trace
    )
    GROUP BY trace_type
)
SETTINGS allow_introspection_functions = 1, output_format_json_named_tuples_as_objects = 1
FORMAT JSONEachRow

And query to generate traces per thread

WITH
    '8e7e0616-cfaf-43af-a139-d938ced7655a' AS query,
    min(min) AS start_value,
    max(max) AS end_value,
    groupUniqArrayArrayArray(trace_arr) AS uniq_frames,
    arrayMap((x, a, b, c, d) -> ('sampled', concat(b, ' - thread ', c.1, ' - traces ', c.2), 'nanoseconds', d.1 - start_value, d.2 - start_value, arrayMap(s -> reverse(arrayMap(y -> toUInt32(indexOf(uniq_frames, y) - 1), s)), x), a), groupArray(trace_arr), groupArray(weights), groupArray(trace_type), groupArray((thread_id, total)), groupArray((min, max))) AS samples
SELECT
    concat('clickhouse-server@', version()) AS exporter,
    'https://www.speedscope.app/file-format-schema.json' AS `$schema`,
    concat('ClickHouse query id: ', query) AS name,
    CAST(samples, 'Array(Tuple(type String, name String, unit String, startValue UInt64, endValue UInt64, samples Array(Array(UInt32)), weights Array(UInt32)))') AS profiles,
    CAST(tuple(arrayMap(x -> (demangle(addressToSymbol(x)), addressToLine(x)), uniq_frames)), 'Tuple(frames Array(Tuple(name String, line String)))') AS shared
FROM
(
    SELECT
        min(min_ns) AS min,
        trace_type,
        thread_id,
        max(max_ns) AS max,
        groupArray(trace) AS trace_arr,
        groupArray(cnt) AS weights,
        sum(cnt) as total
    FROM
    (
        SELECT
            min(timestamp_ns) AS min_ns,
            max(timestamp_ns) AS max_ns,
            trace,
            trace_type,
            thread_id,
            sum(if(trace_type IN ('Memory', 'MemoryPeak', 'MemorySample'), size, 1)) AS cnt
        FROM system.trace_log
        WHERE query_id = query
        GROUP BY
            trace_type,
            trace,
            thread_id
    )
    GROUP BY
        trace_type,
        thread_id
    ORDER BY
        trace_type ASC,
        total DESC
)
SETTINGS allow_introspection_functions = 1, output_format_json_named_tuples_as_objects = 1, output_format_json_quote_64bit_integers=1
FORMAT JSONEachRow

By enabling detailed logging and tracing, you can effectively diagnose issues and optimize query performance in ClickHouse.

2.44 - TTL

TTL

2.44.1 - MODIFY (ADD) TTL in ClickHouse®

What happens during a MODIFY or ADD TTL query

For a general overview of TTL, see the article Putting Things Where They Belong Using New TTL Moves .

ALTER TABLE tbl MODIFY (ADD) TTL:

It’s 2 step process:

ALTER TABLE tbl MODIFY (ADD) TTL ...

Update table metadata: schema .sql & metadata in ZK. It’s usually cheap and fast command. And any new INSERT after schema change will calculate TTL according to new rule.

ALTER TABLE tbl MATERIALIZE TTL

Recalculate TTL for already exist parts. It can be heavy operation, because ClickHouse® will read column data & recalculate TTL & apply TTL expression. You can disable this step completely by using materialize_ttl_after_modify user session setting (by default it’s 1, so materialization is enabled).

SET materialize_ttl_after_modify=0;
ALTER TABLE tbl MODIFY TTL

If you will disable materialization of TTL, it does mean that all old parts will be transformed according OLD TTL rules. MATERIALIZE TTL:

Recalculate TTL (Kinda cheap, it read only column participate in TTL)
Apply TTL (Rewrite of table data for all columns)

You also can only disable apply TTL substep via materialize_ttl_recalculate_only merge_tree setting (by default it’s 0, so clickhouse will apply TTL expression)

ALTER TABLE tbl MODIFY SETTING materialize_ttl_recalculate_only=1;

It does mean, that TTL rule will not be applied during ALTER TABLE tbl MODIFY (ADD) TTL ... query and data is now going to be rewritten.

After this you can apply TTL (MATERIALIZE) per partition manually (which will apply the TTL and rewrite data)

ALTER TABLE tbl MATERIALIZE TTL [IN PARTITION partition | IN PARTITION ID 'partition_id'];

The idea of materialize_ttl_after_modify = 0 and materialize_ttl_recalculate_only = 1 is to use ALTER TABLE tbl MATERIALIZE TTL IN PARTITION xxx; ALTER TABLE tbl MATERIALIZE TTL IN PARTITION yyy; and materialize TTL gently or drop/move partitions manually until the old data without/old TTL is processed.

MATERIALIZE TTL done via Mutation:

ClickHouse create new parts via hardlinks and write new ttl.txt file
ClickHouse remove old(inactive) parts after remove time (default is 8 minutes)

To stop materialization of TTL:

SELECT * FROM system.mutations WHERE is_done=0 AND table = 'tbl';
KILL MUTATION WHERE command LIKE '%MATERIALIZE TTL%' AND table = 'tbl'

MODIFY TTL MOVE

today: 2022-06-02

Table tbl

Daily partitioning by toYYYYMMDD(timestamp) -> 20220602

Increase of TTL

TTL timestamp + INTERVAL 30 DAY MOVE TO DISK s3 -> TTL timestamp + INTERVAL 60 DAY MOVE TO DISK s3

Idea: ClickHouse need to move data from s3 to local disk BACK
Actual: There is no rule that data earlier than 60 DAY should be on local disk

Table parts:

20220401    ttl: 20220501       disk: s3
20220416    ttl: 20220516       disk: s3
20220501    ttl: 20220531       disk: s3
20220502    ttl: 20220601       disk: local
20220516    ttl: 20220616       disk: local
20220601    ttl: 20220631       disk: local

ALTER TABLE tbl MODIFY TTL timestamp + INTERVAL 60 DAY MOVE TO DISK s3;

Table parts:

20220401    ttl: 20220601       disk: s3
20220416    ttl: 20220616       disk: s3
20220501    ttl: 20220631       disk: s3        (ClickHouse will not move this part to local disk, because there is no TTL rule for that)
20220502    ttl: 20220701       disk: local
20220516    ttl: 20220716       disk: local
20220601    ttl: 20220731       disk: local

Decrease of TTL

TTL timestamp + INTERVAL 30 DAY MOVE TO DISK s3 -> TTL timestamp + INTERVAL 14 DAY MOVE TO DISK s3

Table parts:

20220401    ttl: 20220401       disk: s3
20220416    ttl: 20220516       disk: s3
20220501    ttl: 20220531       disk: s3        
20220502    ttl: 20220601       disk: local     
20220516    ttl: 20220616       disk: local
20220601    ttl: 20220631       disk: local

ALTER TABLE tbl MODIFY TTL timestamp + INTERVAL 14 DAY MOVE TO DISK s3;

Table parts:

20220401    ttl: 20220415       disk: s3
20220416    ttl: 20220501       disk: s3
20220501    ttl: 20220515       disk: s3
20220502    ttl: 20220517       disk: local     (ClickHouse will move this part to disk s3 in background according to TTL rule)
20220516    ttl: 20220601       disk: local     (ClickHouse will move this part to disk s3 in background according to TTL rule)
20220601    ttl: 20220616       disk: local

Possible TTL Rules

TTL:

DELETE          (With enabled `ttl_only_drop_parts`, it's cheap operation, ClickHouse will drop the whole part)
MOVE
GROUP BY
WHERE
RECOMPRESS

Related settings:

Server settings:

background_move_processing_pool_thread_sleep_seconds                        |   10      |
background_move_processing_pool_thread_sleep_seconds_random_part            |   1.0     |
background_move_processing_pool_thread_sleep_seconds_if_nothing_to_do       |   0.1     |
background_move_processing_pool_task_sleep_seconds_when_no_work_min         |   10      |
background_move_processing_pool_task_sleep_seconds_when_no_work_max         |   600     |
background_move_processing_pool_task_sleep_seconds_when_no_work_multiplier  |   1.1     |
background_move_processing_pool_task_sleep_seconds_when_no_work_random_part |   1.0     |

MergeTree settings:

merge_with_ttl_timeout                      │   14400   │       0 │ Minimal time in seconds, when merge with delete TTL can be repeated.
merge_with_recompression_ttl_timeout        │   14400   │       0 │ Minimal time in seconds, when merge with recompression TTL can be repeated.
max_replicated_merges_with_ttl_in_queue     │   1       │       0 │ How many tasks of merging parts with TTL are allowed simultaneously in ReplicatedMergeTree queue.
max_number_of_merges_with_ttl_in_pool       │   2       │       0 │ When there is more than specified number of merges with TTL entries in pool, do not assign new merge with TTL. This is to leave free threads for regular merges and avoid "Too many parts"
ttl_only_drop_parts                         │   0       │       0 │ Only drop altogether the expired parts and not partially prune them.

Session settings:

materialize_ttl_after_modify                │   1       │       0 │ Apply TTL for old data, after ALTER MODIFY TTL query

2.44.2 - What are my TTL settings?

What are my TTL settings?

Using `SHOW CREATE TABLE`

If you just want to see the current TTL settings on a table, you can look at the schema definition.

SHOW CREATE TABLE events2_local
FORMAT Vertical

Query id: eba671e5-6b8c-4a81-a4d8-3e21e39fb76b

Row 1:
──────
statement: CREATE TABLE default.events2_local
(
    `EventDate` DateTime,
    `EventID` UInt32,
    `Value` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/default/events2_local', '{replica}')
PARTITION BY toYYYYMM(EventDate)
ORDER BY (EventID, EventDate)
TTL EventDate + toIntervalMonth(1)
SETTINGS index_granularity = 8192

This works even when there’s no data in the table. It does not tell you when the TTLs expire or anything specific to data in one or more of the table parts.

Using system.parts

If you want to see the actually TTL values for specific data, run a query on system.parts. There are columns listing all currently applicable TTL limits for each part. (It does not work if the table is empty because there aren’t any parts yet.)

SELECT *
FROM system.parts
WHERE (database = 'default') AND (table = 'events2_local')
FORMAT Vertical

Query id: 59106476-210f-4397-b843-9920745b6200

Row 1:
──────
partition:                             202203
name:                                  202203_0_0_0
...
database:                              default
table:                                 events2_local
...
delete_ttl_info_min:                   2022-04-27 21:26:30
delete_ttl_info_max:                   2022-04-27 21:26:30
move_ttl_info.expression:              []
move_ttl_info.min:                     []
move_ttl_info.max:                     []
default_compression_codec:             LZ4
recompression_ttl_info.expression:     []
recompression_ttl_info.min:            []
recompression_ttl_info.max:            []
group_by_ttl_info.expression:          []
group_by_ttl_info.min:                 []
group_by_ttl_info.max:                 []
rows_where_ttl_info.expression:        []
rows_where_ttl_info.min:               []
rows_where_ttl_info.max:               []

2.44.3 - TTL GROUP BY Examples

TTL GROUP BY Examples

Example with MergeTree table

CREATE TABLE test_ttl_group_by
(
    `key` UInt32,
    `ts` DateTime,
    `value` UInt32,
    `min_value` UInt32 DEFAULT value,
    `max_value` UInt32 DEFAULT value
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (key, toStartOfDay(ts))
TTL ts + interval 30 day 
    GROUP BY key, toStartOfDay(ts) 
    SET value = sum(value), 
    min_value = min(min_value), 
    max_value = max(max_value), 
    ts = min(toStartOfDay(ts));

During TTL merges ClickHouse® re-calculates values of columns in the SET section.

GROUP BY section should be a prefix of a table’s PRIMARY KEY (the same as ORDER BY, if no separate PRIMARY KEY defined).

-- stop merges to demonstrate data before / after 
-- a rolling up
SYSTEM STOP TTL MERGES test_ttl_group_by;
SYSTEM STOP MERGES test_ttl_group_by;

INSERT INTO test_ttl_group_by (key, ts, value)
SELECT
    number % 5,
    now() + number,
    1
FROM numbers(100);

INSERT INTO test_ttl_group_by (key, ts, value)
SELECT
    number % 5,
    now() - interval 60 day + number,
    2
FROM numbers(100);

SELECT
    toYYYYMM(ts) AS m,
    count(),
    sum(value),
    min(min_value),
    max(max_value)
FROM test_ttl_group_by
GROUP BY m;
┌──────m─┬─count()─┬─sum(value)─┬─min(min_value)─┬─max(max_value)─┐
│ 202102 │     100 │        200 │              2 │              2 │
│ 202104 │     100 │        100 │              1 │              1 │
└────────┴─────────┴────────────┴────────────────┴────────────────┘

SYSTEM START TTL MERGES test_ttl_group_by;
SYSTEM START MERGES test_ttl_group_by;
OPTIMIZE TABLE test_ttl_group_by FINAL;

SELECT
    toYYYYMM(ts) AS m,
    count(),
    sum(value),
    min(min_value),
    max(max_value)
FROM test_ttl_group_by
GROUP BY m;
┌──────m─┬─count()─┬─sum(value)─┬─min(min_value)─┬─max(max_value)─┐
│ 202102 │       5 │        200 │              2 │              2 │
│ 202104 │     100 │        100 │              1 │              1 │
└────────┴─────────┴────────────┴────────────────┴────────────────┘

As you can see 100 rows were rolled up into 5 rows (key has 5 values) for rows older than 30 days.

Example with SummingMergeTree table

CREATE TABLE test_ttl_group_by
(
    `key1` UInt32,
    `key2` UInt32,
    `ts` DateTime,
    `value` UInt32,
    `min_value` SimpleAggregateFunction(min, UInt32) 
                                       DEFAULT value,
    `max_value` SimpleAggregateFunction(max, UInt32) 
                                       DEFAULT value
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(ts)
PRIMARY KEY (key1, key2, toStartOfDay(ts))
ORDER BY (key1, key2, toStartOfDay(ts), ts)
TTL ts + interval 30 day 
    GROUP BY key1, key2, toStartOfDay(ts) 
    SET value = sum(value), 
    min_value = min(min_value), 
    max_value = max(max_value), 
    ts = min(toStartOfDay(ts));

-- stop merges to demonstrate data before / after 
-- a rolling up
SYSTEM STOP TTL MERGES test_ttl_group_by;
SYSTEM STOP MERGES test_ttl_group_by;

INSERT INTO test_ttl_group_by (key1, key2, ts, value)
SELECT
    1,
    1,
    toStartOfMinute(now() + number*60),
    1
FROM numbers(100);

INSERT INTO test_ttl_group_by (key1, key2, ts, value)
SELECT
    1,
    1,
    toStartOfMinute(now() + number*60),
    1
FROM numbers(100);

INSERT INTO test_ttl_group_by (key1, key2, ts, value)
SELECT
    1,
    1,
    toStartOfMinute(now() + number*60 - toIntervalDay(60)),
    2
FROM numbers(100);

INSERT INTO test_ttl_group_by (key1, key2, ts, value)
SELECT
    1,
    1,
    toStartOfMinute(now() + number*60 - toIntervalDay(60)),
    2
FROM numbers(100);

SELECT
    toYYYYMM(ts) AS m,
    count(),
    sum(value),
    min(min_value),
    max(max_value)
FROM test_ttl_group_by
GROUP BY m;

┌──────m─┬─count()─┬─sum(value)─┬─min(min_value)─┬─max(max_value)─┐
│ 202102 │     200 │        400 │              2 │              2 │
│ 202104 │     200 │        200 │              1 │              1 │
└────────┴─────────┴────────────┴────────────────┴────────────────┘

SYSTEM START TTL MERGES test_ttl_group_by;
SYSTEM START MERGES test_ttl_group_by;
OPTIMIZE TABLE test_ttl_group_by FINAL;

SELECT
    toYYYYMM(ts) AS m,
    count(),
    sum(value),
    min(min_value),
    max(max_value)
FROM test_ttl_group_by
GROUP BY m;

┌──────m─┬─count()─┬─sum(value)─┬─min(min_value)─┬─max(max_value)─┐
│ 202102 │       1 │        400 │              2 │              2 │
│ 202104 │     100 │        200 │              1 │              1 │
└────────┴─────────┴────────────┴────────────────┴────────────────┘

During merges ClickHouse re-calculates ts columns as min(toStartOfDay(ts)). It’s possible only for the last column of SummingMergeTree ORDER BY section ORDER BY (key1, key2, toStartOfDay(ts), ts) otherwise it will break the order of rows in the table.

Example with AggregatingMergeTree table

CREATE TABLE test_ttl_group_by_agg
(
    `key1` UInt32,
    `key2` UInt32,
    `ts` DateTime,
    `counter` AggregateFunction(count, UInt32)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(ts)
PRIMARY KEY (key1, key2, toStartOfDay(ts))
ORDER BY (key1, key2, toStartOfDay(ts), ts)
TTL ts + interval 30 day 
    GROUP BY key1, key2, toStartOfDay(ts) 
    SET counter = countMergeState(counter),
    ts = min(toStartOfDay(ts));

CREATE TABLE test_ttl_group_by_raw
(
    `key1` UInt32,
    `key2` UInt32,
    `ts` DateTime
) ENGINE = Null;

CREATE MATERIALIZED VIEW test_ttl_group_by_mv
    TO test_ttl_group_by_agg
AS
SELECT
    `key1`,
    `key2`,
    `ts`,
    countState() as counter
FROM test_ttl_group_by_raw
GROUP BY key1, key2, ts;

-- stop merges to demonstrate data before / after 
-- a rolling up
SYSTEM STOP TTL MERGES test_ttl_group_by_agg;
SYSTEM STOP MERGES test_ttl_group_by_agg;

INSERT INTO test_ttl_group_by_raw (key1, key2, ts)
SELECT
    1,
    1,
    toStartOfMinute(now() + number*60)
FROM numbers(100);

INSERT INTO test_ttl_group_by_raw (key1, key2, ts)
SELECT
    1,
    1,
    toStartOfMinute(now() + number*60)
FROM numbers(100);

INSERT INTO test_ttl_group_by_raw (key1, key2, ts)
SELECT
    1,
    1,
    toStartOfMinute(now() + number*60 - toIntervalDay(60))
FROM numbers(100);

INSERT INTO test_ttl_group_by_raw (key1, key2, ts)
SELECT
    1,
    1,
    toStartOfMinute(now() + number*60 - toIntervalDay(60))
FROM numbers(100);

SELECT
    toYYYYMM(ts) AS m,
    count(),
    countMerge(counter)
FROM test_ttl_group_by_agg
GROUP BY m;

┌──────m─┬─count()─┬─countMerge(counter)─┐
│ 202307 │     200 │                 200 │
│ 202309 │     200 │                 200 │
└────────┴─────────┴─────────────────────┘

SYSTEM START TTL MERGES test_ttl_group_by_agg;
SYSTEM START MERGES test_ttl_group_by_agg;
OPTIMIZE TABLE test_ttl_group_by_agg FINAL;

SELECT
    toYYYYMM(ts) AS m,
    count(),
    countMerge(counter)
FROM test_ttl_group_by_agg
GROUP BY m;

┌──────m─┬─count()─┬─countMerge(counter)─┐
│ 202307 │       1 │                 200 │
│ 202309 │     100 │                 200 │
└────────┴─────────┴─────────────────────┘

Multilevel TTL Group by

CREATE TABLE test_ttl_group_by
(
    `key` UInt32,
    `ts` DateTime,
    `value` UInt32,
    `min_value` UInt32 DEFAULT value,
    `max_value` UInt32 DEFAULT value
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (key, toStartOfWeek(ts), toStartOfDay(ts), toStartOfHour(ts))
TTL 
ts + interval 1 hour 
GROUP BY key, toStartOfWeek(ts), toStartOfDay(ts), toStartOfHour(ts) 
    SET value = sum(value), 
    min_value = min(min_value), 
    max_value = max(max_value), 
    ts = min(toStartOfHour(ts)),
ts + interval 1 day 
GROUP BY key, toStartOfWeek(ts), toStartOfDay(ts) 
    SET value = sum(value), 
    min_value = min(min_value), 
    max_value = max(max_value), 
    ts = min(toStartOfDay(ts)),
ts + interval 30 day 
GROUP BY key, toStartOfWeek(ts) 
    SET value = sum(value), 
    min_value = min(min_value), 
    max_value = max(max_value), 
    ts = min(toStartOfWeek(ts));
    
SYSTEM STOP TTL MERGES test_ttl_group_by;
SYSTEM STOP MERGES test_ttl_group_by;

INSERT INTO test_ttl_group_by (key, ts, value)
SELECT
    number % 5,
    now() + number,
    1
FROM numbers(100);

INSERT INTO test_ttl_group_by (key, ts, value)
SELECT
    number % 5,
    now() - interval 2 hour + number,
    2
FROM numbers(100);    

INSERT INTO test_ttl_group_by (key, ts, value)
SELECT
    number % 5,
    now() - interval 2 day + number,
    3
FROM numbers(100);    

INSERT INTO test_ttl_group_by (key, ts, value)
SELECT
    number % 5,
    now() - interval 2 month + number,
    4
FROM numbers(100); 

SELECT
    toYYYYMMDD(ts) AS d,
    count(),
    sum(value),
    min(min_value),
    max(max_value)
FROM test_ttl_group_by
GROUP BY d
ORDER BY d;

┌────────d─┬─count()─┬─sum(value)─┬─min(min_value)─┬─max(max_value)─┐
│ 20210616 │     100 │        400 │              4 │              4 │
│ 20210814 │     100 │        300 │              3 │              3 │
│ 20210816 │     200 │        300 │              1 │              2 │
└──────────┴─────────┴────────────┴────────────────┴────────────────┘

SYSTEM START TTL MERGES test_ttl_group_by;
SYSTEM START MERGES test_ttl_group_by;
OPTIMIZE TABLE test_ttl_group_by FINAL;

SELECT
    toYYYYMMDD(ts) AS d,
    count(),
    sum(value),
    min(min_value),
    max(max_value)
FROM test_ttl_group_by
GROUP BY d
ORDER BY d;

┌────────d─┬─count()─┬─sum(value)─┬─min(min_value)─┬─max(max_value)─┐
│ 20210613 │       5 │        400 │              4 │              4 │
│ 20210814 │       5 │        300 │              3 │              3 │
│ 20210816 │     105 │        300 │              1 │              2 │
└──────────┴─────────┴────────────┴────────────────┴────────────────┘

TTL GROUP BY + DELETE

CREATE TABLE test_ttl_group_by
(
    `key` UInt32,
    `ts` DateTime,
    `value` UInt32,
    `min_value` UInt32 DEFAULT value,
    `max_value` UInt32 DEFAULT value
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (key, toStartOfDay(ts))
TTL 
ts + interval 180 day,
ts + interval 30 day 
    GROUP BY key, toStartOfDay(ts) 
    SET value = sum(value), 
    min_value = min(min_value), 
    max_value = max(max_value), 
    ts = min(toStartOfDay(ts));

-- stop merges to demonstrate data before / after 
-- a rolling up
SYSTEM STOP TTL MERGES test_ttl_group_by;
SYSTEM STOP MERGES test_ttl_group_by;

INSERT INTO test_ttl_group_by (key, ts, value)
SELECT
    number % 5,
    now() + number,
    1
FROM numbers(100);

INSERT INTO test_ttl_group_by (key, ts, value)
SELECT
    number % 5,
    now() - interval 60 day + number,
    2
FROM numbers(100);    

INSERT INTO test_ttl_group_by (key, ts, value)
SELECT
    number % 5,
    now() - interval 200 day + number,
    3
FROM numbers(100);  

SELECT
    toYYYYMM(ts) AS m,
    count(),
    sum(value),
    min(min_value),
    max(max_value)
FROM test_ttl_group_by
GROUP BY m;

┌──────m─┬─count()─┬─sum(value)─┬─min(min_value)─┬─max(max_value)─┐
│ 202101 │     100 │        300 │              3 │              3 │
│ 202106 │     100 │        200 │              2 │              2 │
│ 202108 │     100 │        100 │              1 │              1 │
└────────┴─────────┴────────────┴────────────────┴────────────────┘

SYSTEM START TTL MERGES test_ttl_group_by;
SYSTEM START MERGES test_ttl_group_by;
OPTIMIZE TABLE test_ttl_group_by FINAL;

┌──────m─┬─count()─┬─sum(value)─┬─min(min_value)─┬─max(max_value)─┐
│ 202106 │       5 │        200 │              2 │              2 │
│ 202108 │     100 │        100 │              1 │              1 │
└────────┴─────────┴────────────┴────────────────┴────────────────┘

Also see the Altinity Knowledge Base pages on the MergeTree table engine family .

2.44.4 - TTL Recompress example

TTL Recompress example

Example how to create a table and define recompression rules

CREATE TABLE hits
(
    `banner_id` UInt64,
    `event_time` DateTime CODEC(Delta, Default),
    `c_name` String,
    `c_cost` Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (banner_id, event_time)
TTL event_time + toIntervalMonth(1) RECOMPRESS CODEC(ZSTD(1)),
    event_time + toIntervalMonth(6) RECOMPRESS CODEC(ZSTD(6);

Default compression is LZ4. See the ClickHouse® documentation for more information.

These TTL rules recompress data after 1 and 6 months.

CODEC(Delta, Default) – Default means to use default compression (LZ4 -> ZSTD1 -> ZSTD6) in this case.

Example how to define recompression rules for an existing table

CREATE TABLE hits
(
    `banner_id` UInt64,
    `event_time` DateTime CODEC(Delta, LZ4),
    `c_name` String,
    `c_cost` Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (banner_id, event_time);

ALTER TABLE hits 
  modify column event_time DateTime CODEC(Delta, Default),
  modify TTL event_time + toIntervalMonth(1) RECOMPRESS CODEC(ZSTD(1)),
       event_time + toIntervalMonth(6) RECOMPRESS CODEC(ZSTD(6));

All columns have implicit default compression from server config, except event_time, that’s why need to change to compression to Default for this column otherwise it won’t be recompressed.

2.45 - UPDATE via Dictionary

UPDATE via Dictionary

CREATE TABLE test_update
(
    `key` UInt32,
    `value` String
)
ENGINE = MergeTree
ORDER BY key;

INSERT INTO test_update SELECT
    number,
    concat('value ', toString(number))
FROM numbers(20);

SELECT *
FROM test_update;

┌─key─┬─value────┐
│   0 │ value 0  │
│   1 │ value 1  │
│   2 │ value 2  │
│   3 │ value 3  │
│   4 │ value 4  │
│   5 │ value 5  │
│   6 │ value 6  │
│   7 │ value 7  │
│   8 │ value 8  │
│   9 │ value 9  │
│  10 │ value 10 │
│  11 │ value 11 │
│  12 │ value 12 │
│  13 │ value 13 │
│  14 │ value 14 │
│  15 │ value 15 │
│  16 │ value 16 │
│  17 │ value 17 │
│  18 │ value 18 │
│  19 │ value 19 │
└─────┴──────────┘

CREATE TABLE test_update_source
(
    `key` UInt32,
    `value` String
)
ENGINE = MergeTree
ORDER BY key;

INSERT INTO test_update_source VALUES (1,'other value'), (10, 'new value');

CREATE DICTIONARY update_dict
(
    `key` UInt32,
    `value` String
)
PRIMARY KEY key
SOURCE(CLICKHOUSE(TABLE 'test_update_source'))
LIFETIME(MIN 0 MAX 10)
LAYOUT(FLAT);

SELECT dictGet('default.update_dict', 'value', toUInt64(1));

┌─dictGet('default.update_dict', 'value', toUInt64(1))─┐
│ other value                                          │
└──────────────────────────────────────────────────────┘

ALTER TABLE test_update
    UPDATE value = dictGet('default.update_dict', 'value', toUInt64(key)) WHERE dictHas('default.update_dict', toUInt64(key));

SELECT *
FROM test_update

┌─key─┬─value───────┐
│   0 │ value 0     │
│   1 │ other value │
│   2 │ value 2     │
│   3 │ value 3     │
│   4 │ value 4     │
│   5 │ value 5     │
│   6 │ value 6     │
│   7 │ value 7     │
│   8 │ value 8     │
│   9 │ value 9     │
│  10 │ new value   │
│  11 │ value 11    │
│  12 │ value 12    │
│  13 │ value 13    │
│  14 │ value 14    │
│  15 │ value 15    │
│  16 │ value 16    │
│  17 │ value 17    │
│  18 │ value 18    │
│  19 │ value 19    │
└─────┴─────────────┘

Info

In case of Replicated installation, Dictionary should be created on all nodes and source tables should use the ReplicatedMergeTree engine and be replicated across all nodes.

Info

Starting from 20.4, ClickHouse® forbid by default any potential non-deterministic mutations. This behavior controlled by setting allow_nondeterministic_mutations. You can append it to query like this ALTER TABLE xxx UPDATE ... WHERE ... SETTINGS allow_nondeterministic_mutations = 1; For ON CLUSTER queries, you would need to put this setting in default profile and restart ClickHouse servers.

2.46 - Values mapping

Values mapping

SELECT count()
FROM numbers_mt(1000000000)
WHERE NOT ignore(transform(number % 3, [0, 1, 2, 3], ['aa', 'ab', 'ad', 'af'], 'a0'))

1 rows in set. Elapsed: 4.668 sec. Processed 1.00 billion rows, 8.00 GB (214.21 million rows/s., 1.71 GB/s.)

SELECT count()
FROM numbers_mt(1000000000)
WHERE NOT ignore(multiIf((number % 3) = 0, 'aa', (number % 3) = 1, 'ab', (number % 3) = 2, 'ad', (number % 3) = 3, 'af', 'a0'))

1 rows in set. Elapsed: 7.333 sec. Processed 1.00 billion rows, 8.00 GB (136.37 million rows/s., 1.09 GB/s.)

SELECT count()
FROM numbers_mt(1000000000)
WHERE NOT ignore(CAST(number % 3 AS Enum('aa' = 0, 'ab' = 1, 'ad' = 2, 'af' = 3)'))

1 rows in set. Elapsed: 1.152 sec. Processed 1.00 billion rows, 8.00 GB (867.79 million rows/s., 6.94 GB/s.)

2.47 - Window functions

Window functions

Resources:

How Do I Simulate Window Functions Using Arrays on older versions of ClickHouse?

Group with groupArray.
Calculate the needed metrics.
Ungroup back using arrayJoin.

NTILE

SELECT intDiv((num - 1) - (cnt % 3), 3) AS ntile
FROM
(
    SELECT
        row_number() OVER (ORDER BY number ASC) AS num,
        count() OVER () AS cnt
    FROM numbers(11)
)

┌─ntile─┐
│     0 │
│     0 │
│     0 │
│     0 │
│     0 │
│     1 │
│     1 │
│     1 │
│     2 │
│     2 │
│     2 │
└───────┘

3 - Functions

Functions

3.1 - How to encode/decode quantileTDigest states from/to list of centroids

A way to export or import quantileTDigest states from/into ClickHouse®

quantileTDigestState

quantileTDigestState is stored in two parts: a count of centroids in LEB128 format + list of centroids without a delimiter. Each centroid is represented as two Float32 values: Mean & Count.

SELECT
    hex(quantileTDigestState(1)),
    hex(toFloat32(1))

┌─hex(quantileTDigestState(1))─┬─hex(toFloat32(1))─┐
│ 010000803F0000803F           │ 0000803F          │
└──────────────────────────────┴───────────────────┘
  01          0000803F      0000803F
  ^           ^             ^
  LEB128      Float32 Mean  Float32 Count

We need to make two helper UDF functions:

cat /etc/clickhouse-server/decodeTDigestState_function.xml
<yandex>
  <function>
    <type>executable</type>
    <execute_direct>0</execute_direct>
    <name>decodeTDigestState</name>
    <return_type>Array(Tuple(mean Float32, count Float32))</return_type>
    <argument>
      <type>AggregateFunction(quantileTDigest, UInt32)</type>
    </argument>
    <format>RowBinary</format>
    <command>cat</command>
    <send_chunk_header>0</send_chunk_header>
  </function>
</yandex>

cat /etc/clickhouse-server/encodeTDigestState_function.xml
<yandex>
  <function>
    <type>executable</type>
    <execute_direct>0</execute_direct>
    <name>encodeTDigestState</name>
    <return_type>AggregateFunction(quantileTDigest, UInt32)</return_type>
    <argument>
      <type>Array(Tuple(mean Float32, count Float32))</type>
    </argument>
    <format>RowBinary</format>
    <command>cat</command>
    <send_chunk_header>0</send_chunk_header>
  </function>
</yandex>

Those UDF – (encode/decode)TDigestState converts TDigestState to the Array(Tuple(Float32, Float32)) and back.

SELECT quantileTDigest(CAST(number, 'UInt32')) AS result
FROM numbers(10)

┌─result─┐
│      4 │
└────────┘

SELECT decodeTDigestState(quantileTDigestState(CAST(number, 'UInt32'))) AS state
FROM numbers(10)

┌─state─────────────────────────────────────────────────────────┐
│ [(0,1),(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1),(9,1)] │
└───────────────────────────────────────────────────────────────┘

SELECT finalizeAggregation(encodeTDigestState(CAST('[(0,1),(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1),(9,1)]', 'Array(Tuple(Float32, Float32))'))) AS result

┌─result─┐
│      4 │
└────────┘

3.2 - kurt & skew statistical functions in ClickHouse®

How to make them return the same result like python scipy

from scipy.stats import skew, kurtosis

# Creating a dataset

dataset = [10,17,71,6,55,38,27,61,48,46,21,38,2,67,35,77,29,31,27,67,81,82,75,81,31,38,68,95,37,34,65,59,81,28,82,80,35,3,97,42,66,28,85,98,45,15,41,61,24,53,97,86,5,65,84,18,9,32,46,52,69,44,78,98,61,64,26,11,3,19,0,90,28,72,47,8,0,74,38,63,88,43,81,61,34,24,37,53,79,72,5,77,58,3,61,56,1,3,5,61]

print(skew(dataset, axis=0, bias=True), skew(dataset))

# -0.05785361619432152 -0.05785361619432152

WITH arrayJoin([10,17,71,6,55,38,27,61,48,46,21,38,2,67,35,77,29,31,27,67,81,82,75,81,31,38,68,95,37,34,65,59,81,28,82,80,35,3,97,42,66,28,85,98,45,15,41,61,24,53,97,86,5,65,84,18,9,32,46,52,69,44,78,98,61,64,26,11,3,19,0,90,28,72,47,8,0,74,38,63,88,43,81,61,34,24,37,53,79,72,5,77,58,3,61,56,1,3,5,61]) AS value
SELECT skewPop(value) AS ex_1

┌──────────────────ex_1─┐
│ -0.057853616194321014 │
└───────────────────────┘

print(skew(dataset, bias=False))

# -0.05873838908626328

WITH arrayJoin([10, 17, 71, 6, 55, 38, 27, 61, 48, 46, 21, 38, 2, 67, 35, 77, 29, 31, 27, 67, 81, 82, 75, 81, 31, 38, 68, 95, 37, 34, 65, 59, 81, 28, 82, 80, 35, 3, 97, 42, 66, 28, 85, 98, 45, 15, 41, 61, 24, 53, 97, 86, 5, 65, 84, 18, 9, 32, 46, 52, 69, 44, 78, 98, 61, 64, 26, 11, 3, 19, 0, 90, 28, 72, 47, 8, 0, 74, 38, 63, 88, 43, 81, 61, 34, 24, 37, 53, 79, 72, 5, 77, 58, 3, 61, 56, 1, 3, 5, 61]) AS value
SELECT
    skewSamp(value) AS ex_1,
    (pow(count(), 2) * ex_1) / ((count() - 1) * (count() - 2)) AS G

┌─────────────────ex_1─┬────────────────────G─┐
│ -0.05698798509149213 │ -0.05873838908626276 │
└──────────────────────┴──────────────────────┘

print(kurtosis(dataset, bias=True, fisher=False), kurtosis(dataset, bias=True, fisher=True), kurtosis(dataset))

# 1.9020275610791184 -1.0979724389208816 -1.0979724389208816

WITH arrayJoin([10, 17, 71, 6, 55, 38, 27, 61, 48, 46, 21, 38, 2, 67, 35, 77, 29, 31, 27, 67, 81, 82, 75, 81, 31, 38, 68, 95, 37, 34, 65, 59, 81, 28, 82, 80, 35, 3, 97, 42, 66, 28, 85, 98, 45, 15, 41, 61, 24, 53, 97, 86, 5, 65, 84, 18, 9, 32, 46, 52, 69, 44, 78, 98, 61, 64, 26, 11, 3, 19, 0, 90, 28, 72, 47, 8, 0, 74, 38, 63, 88, 43, 81, 61, 34, 24, 37, 53, 79, 72, 5, 77, 58, 3, 61, 56, 1, 3, 5, 61]) AS value
SELECT
    kurtPop(value) AS pearson,
    pearson - 3 AS fisher

┌────────────pearson─┬──────────────fisher─┐
│ 1.9020275610791124 │ -1.0979724389208876 │
└────────────────────┴─────────────────────┘

print(kurtosis(dataset, bias=False))

# -1.0924286152713967

WITH arrayJoin([10, 17, 71, 6, 55, 38, 27, 61, 48, 46, 21, 38, 2, 67, 35, 77, 29, 31, 27, 67, 81, 82, 75, 81, 31, 38, 68, 95, 37, 34, 65, 59, 81, 28, 82, 80, 35, 3, 97, 42, 66, 28, 85, 98, 45, 15, 41, 61, 24, 53, 97, 86, 5, 65, 84, 18, 9, 32, 46, 52, 69, 44, 78, 98, 61, 64, 26, 11, 3, 19, 0, 90, 28, 72, 47, 8, 0, 74, 38, 63, 88, 43, 81, 61, 34, 24, 37, 53, 79, 72, 5, 77, 58, 3, 61, 56, 1, 3, 5, 61]) AS value
SELECT
    kurtSamp(value) AS ex_1,
    (((pow(count(), 2) * (count() + 1)) / (((count() - 1) * (count() - 2)) * (count() - 3))) * ex_1) - ((3 * pow(count() - 1, 2)) / ((count() - 2) * (count() - 3))) AS G

┌──────────────ex_1─┬───────────────────G─┐
│ 1.864177212613638 │ -1.0924286152714027 │
└───────────────────┴─────────────────────┘

Google Collab

3.3 - -Resample vs -If vs -Map vs Subquery

5 categories

SELECT sumResample(0, 5, 1)(number, number % 5) AS sum
FROM numbers_mt(1000000000)

┌─sum───────────────────────────────────────────────────────────────────────────────────────────┐
│ [99999999500000000,99999999700000000,99999999900000000,100000000100000000,100000000300000000] │
└───────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 1.010 sec. Processed 1.00 billion rows, 8.00 GB (990.20 million rows/s., 7.92 GB/s.)


SELECT sumMap([number % 5], [number]) AS sum
FROM numbers_mt(1000000000)

┌─sum─────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ([0,1,2,3,4],[99999999500000000,99999999700000000,99999999900000000,100000000100000000,100000000300000000]) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 5.730 sec. Processed 1.00 billion rows, 8.00 GB (174.51 million rows/s., 1.40 GB/s.)

SELECT sumMap(map(number % 5, number)) AS sum
FROM numbers_mt(1000000000)

┌─sum─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ {0:99999999500000000,1:99999999700000000,2:99999999900000000,3:100000000100000000,4:100000000300000000} │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 4.169 sec. Processed 1.00 billion rows, 8.00 GB (239.89 million rows/s., 1.92 GB/s.)

SELECT
    sumIf(number, (number % 5) = 0) AS sum_0,
    sumIf(number, (number % 5) = 1) AS sum_1,
    sumIf(number, (number % 5) = 2) AS sum_2,
    sumIf(number, (number % 5) = 3) AS sum_3,
    sumIf(number, (number % 5) = 4) AS sum_4
FROM numbers_mt(1000000000)

┌─────────────sum_0─┬─────────────sum_1─┬─────────────sum_2─┬──────────────sum_3─┬──────────────sum_4─┐
│ 99999999500000000 │ 99999999700000000 │ 99999999900000000 │ 100000000100000000 │ 100000000300000000 │
└───────────────────┴───────────────────┴───────────────────┴────────────────────┴────────────────────┘

1 rows in set. Elapsed: 0.762 sec. Processed 1.00 billion rows, 8.00 GB (1.31 billion rows/s., 10.50 GB/s.)

SELECT sumMap([id], [sum]) AS sum
FROM
(
    SELECT
        number % 5 AS id,
        sum(number) AS sum
    FROM numbers_mt(1000000000)
    GROUP BY id
)

┌─sum─────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ([0,1,2,3,4],[99999999500000000,99999999700000000,99999999900000000,100000000100000000,100000000300000000]) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.331 sec. Processed 1.00 billion rows, 8.00 GB (3.02 billion rows/s., 24.15 GB/s.)

20 categories

SELECT sumResample(0, 20, 1)(number, number % 20) AS sum
FROM numbers_mt(1000000000)

1 rows in set. Elapsed: 1.056 sec. Processed 1.00 billion rows, 8.00 GB (947.28 million rows/s., 7.58 GB/s.)

SELECT sumMap([number % 20], [number]) AS sum
FROM numbers_mt(1000000000)

1 rows in set. Elapsed: 6.410 sec. Processed 1.00 billion rows, 8.00 GB (156.00 million rows/s., 1.25 GB/s.)

SELECT sumMap(map(number % 20, number)) AS sum
FROM numbers_mt(1000000000)

┌─sum────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ {0:24999999500000000,1:24999999550000000,2:24999999600000000,3:24999999650000000,4:24999999700000000,5:24999999750000000,6:24999999800000000,7:24999999850000000,8:24999999900000000,9:24999999950000000,10:25000000000000000,11:25000000050000000,12:25000000100000000,13:25000000150000000,14:25000000200000000,15:25000000250000000,16:25000000300000000,17:25000000350000000,18:25000000400000000,19:25000000450000000} │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 4.629 sec. Processed 1.00 billion rows, 8.00 GB (216.04 million rows/s., 1.73 GB/s.)

SELECT
    sumIf(number, (number % 5) = 0) AS sum_0,
    sumIf(number, (number % 5) = 1) AS sum_1,
    sumIf(number, (number % 5) = 2) AS sum_2,
    sumIf(number, (number % 5) = 3) AS sum_3,
    sumIf(number, (number % 5) = 4) AS sum_4,
    sumIf(number, (number % 5) = 5) AS sum_5,
    sumIf(number, (number % 5) = 6) AS sum_6,
    sumIf(number, (number % 5) = 7) AS sum_7,
    sumIf(number, (number % 5) = 8) AS sum_8,
    sumIf(number, (number % 5) = 9) AS sum_9,
    sumIf(number, (number % 5) = 10) AS sum_10,
    sumIf(number, (number % 5) = 11) AS sum_11,
    sumIf(number, (number % 5) = 12) AS sum_12,
    sumIf(number, (number % 5) = 13) AS sum_13,
    sumIf(number, (number % 5) = 14) AS sum_14,
    sumIf(number, (number % 5) = 15) AS sum_15,
    sumIf(number, (number % 5) = 16) AS sum_16,
    sumIf(number, (number % 5) = 17) AS sum_17,
    sumIf(number, (number % 5) = 18) AS sum_18,
    sumIf(number, (number % 5) = 19) AS sum_19
FROM numbers_mt(1000000000)

1 rows in set. Elapsed: 5.282 sec. Processed 1.00 billion rows, 8.00 GB (189.30 million rows/s., 1.51 GB/s.)

SELECT sumMap([id], [sum]) AS sum
FROM
(
    SELECT
        number % 20 AS id,
        sum(number) AS sum
    FROM numbers_mt(1000000000)
    GROUP BY id
)

1 rows in set. Elapsed: 0.362 sec. Processed 1.00 billion rows, 8.00 GB (2.76 billion rows/s., 22.10 GB/s.)

SELECT sumMap(map(id, sum)) AS sum
FROM
(
    SELECT
        number % 20 AS id,
        sum(number) AS sum
    FROM numbers_mt(1000000000)
    GROUP BY id
)

sumMapResample

It’s also possible to combine them.

SELECT
    day,
    category_id,
    sales
FROM
(
    SELECT sumMapResample(1, 31, 1)([category_id], [sales], day) AS res
    FROM
    (
        SELECT
            number % 31 AS day,
            100 * (number % 11) AS category_id,
            number AS sales
        FROM numbers(10000)
    )
)
ARRAY JOIN
    res.1 AS category_id,
    res.2 AS sales,
    arrayEnumerate(res.1) AS day

┌─day─┬─category_id──────────────────────────────────┬─sales──────────────────────────────────────────────────────────────────────────┐
│   1 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [143869,148365,142970,147465,142071,146566,151155,145667,150225,144768,149295] │
│   2 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [149325,143898,148395,142999,147494,142100,146595,151185,145696,150255,144797] │
│   3 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [144826,149355,143927,148425,143028,147523,142129,146624,151215,145725,150285] │
│   4 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [150315,144855,149385,143956,148455,143057,147552,142158,146653,151245,145754] │
│   5 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [145783,150345,144884,149415,143985,148485,143086,147581,142187,146682,151275] │
│   6 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [151305,145812,150375,144913,149445,144014,148515,143115,147610,142216,146711] │
│   7 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [146740,151335,145841,150405,144942,149475,144043,148545,143144,147639,142245] │
│   8 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [142274,146769,151365,145870,150435,144971,149505,144072,148575,143173,147668] │
│   9 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [147697,142303,146798,151395,145899,150465,145000,149535,144101,148605,143202] │
│  10 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [143231,147726,142332,146827,151425,145928,150495,145029,149565,144130,148635] │
│  11 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [148665,143260,147755,142361,146856,151455,145957,150525,145058,149595,144159] │
│  12 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [144188,148695,143289,147784,142390,146885,151485,145986,150555,145087,149625] │
│  13 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [149655,144217,148725,143318,147813,142419,146914,151515,146015,150585,145116] │
│  14 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [145145,149685,144246,148755,143347,147842,142448,146943,151545,146044,150615] │
│  15 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [150645,145174,149715,144275,148785,143376,147871,142477,146972,151575,146073] │
│  16 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [146102,150675,145203,149745,144304,148815,143405,147900,142506,147001,151605] │
│  17 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [151635,146131,150705,145232,149775,144333,148845,143434,147929,142535,147030] │
│  18 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [147059,141665,146160,150735,145261,149805,144362,148875,143463,147958,142564] │
│  19 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [142593,147088,141694,146189,150765,145290,149835,144391,148905,143492,147987] │
│  20 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [148016,142622,147117,141723,146218,150795,145319,149865,144420,148935,143521] │
│  21 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [143550,148045,142651,147146,141752,146247,150825,145348,149895,144449,148965] │
│  22 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [148995,143579,148074,142680,147175,141781,146276,150855,145377,149925,144478] │
│  23 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [144507,149025,143608,148103,142709,147204,141810,146305,150885,145406,149955] │
│  24 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [149985,144536,149055,143637,148132,142738,147233,141839,146334,150915,145435] │
│  25 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [145464,150015,144565,149085,143666,148161,142767,147262,141868,146363,150945] │
│  26 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [150975,145493,150045,144594,149115,143695,148190,142796,147291,141897,146392] │
│  27 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [146421,151005,145522,150075,144623,149145,143724,148219,142825,147320,141926] │
│  28 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [141955,146450,151035,145551,150105,144652,149175,143753,148248,142854,147349] │
│  29 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [147378,141984,146479,151065,145580,150135,144681,149205,143782,148277,142883] │
│  30 │ [0,100,200,300,400,500,600,700,800,900,1000] │ [142912,147407,142013,146508,151095,145609,150165,144710,149235,143811,148306] │
└─────┴──────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────┘

3.4 - arrayFold

EWMA example

WITH
    [40, 45, 43, 31, 20] AS data,
    0.3 AS alpha
SELECT arrayFold((acc, x) -> arrayPushBack(acc, (alpha * x) + ((1 - alpha) * (acc[-1]))), arrayPopFront(data), [CAST(data[1], 'Float64')]) as ewma

┌─ewma─────────────────────────────────────────────────────────────┐
│ [40,41.5,41.949999999999996,38.66499999999999,33.06549999999999] │
└──────────────────────────────────────────────────────────────────┘

3.5 - arrayMap, arrayJoin or ARRAY JOIN memory usage

Why do arrayMap, arrayFilter, and arrayJoin use so much memory?

arrayMap-like functions memory usage calculation.

In order to calculate arrayMap or similar array* functions ClickHouse® temporarily does arrayJoin-like operation, which in certain conditions can lead to huge memory usage for big arrays.

So for example, you have 2 columns:

SELECT *
FROM
(
    SELECT
        [1, 2, 3, 4, 5] AS array_1,
        [1, 2, 3, 4, 5] AS array_2
)

┌─array_1─────┬─array_2─────┐
│ [1,2,3,4,5] │ [1,2,3,4,5] │
└─────────────┴─────────────┘

Let’s say we want to multiply array elements at corresponding positions.

SELECT arrayMap(x -> ((array_1[x]) * (array_2[x])), arrayEnumerate(array_1)) AS multi
FROM
(
    SELECT
        [1, 2, 3, 4, 5] AS array_1,
        [1, 2, 3, 4, 5] AS array_2
)

┌─multi─────────┐
│ [1,4,9,16,25] │
└───────────────┘

ClickHouse create temporary structure in memory like this:

SELECT
    array_1,
	array_2,
    x
FROM
(
    SELECT
        [1, 2, 3, 4, 5] AS array_1,
        [1, 2, 3, 4, 5] AS array_2
)
ARRAY JOIN arrayEnumerate(array_1) AS x

┌─array_1─────┬─array_2─────┬─x─┐
│ [1,2,3,4,5] │ [1,2,3,4,5] │ 1 │
│ [1,2,3,4,5] │ [1,2,3,4,5] │ 2 │
│ [1,2,3,4,5] │ [1,2,3,4,5] │ 3 │
│ [1,2,3,4,5] │ [1,2,3,4,5] │ 4 │
│ [1,2,3,4,5] │ [1,2,3,4,5] │ 5 │
└─────────────┴─────────────┴───┘

We can roughly estimate memory usage by multiplying the size of columns participating in the lambda function by the size of the unnested array.

And total memory usage will be 55 values (5(array size)*2(array count)*5(row count) + 5(unnested array size)), which is 5.5 times more than initial array size.

SELECT groupArray((array_1[x]) * (array_2[x])) AS multi
FROM
(
    SELECT
        array_1,
        array_2,
        x
    FROM
    (
        SELECT
            [1, 2, 3, 4, 5] AS array_1,
            [1, 2, 3, 4, 5] AS array_2
    )
ARRAY JOIN arrayEnumerate(array_1) AS x
)

┌─multi─────────┐
│ [1,4,9,16,25] │
└───────────────┘

But what if we write this function in a more logical way, so we wouldn’t use any unnested arrays in lambda.

SELECT arrayMap((x, y) -> (x * y), array_1, array_2) AS multi
FROM
(
    SELECT
        [1, 2, 3, 4, 5] AS array_1,
        [1, 2, 3, 4, 5] AS array_2
)

┌─multi─────────┐
│ [1,4,9,16,25] │
└───────────────┘

ClickHouse create temporary structure in memory like this:

SELECT
    x,
    y
FROM
(
    SELECT
        [1, 2, 3, 4, 5] AS array_1,
        [1, 2, 3, 4, 5] AS array_2
)
ARRAY JOIN
    array_1 AS x,
    array_2 AS y

┌─x─┬─y─┐
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 4 │
│ 5 │ 5 │
└───┴───┘

We have only 10 values, which is no more than what we have in initial arrays.

SELECT groupArray(x * y) AS multi
FROM
(
    SELECT
        x,
        y
    FROM
    (
        SELECT
            [1, 2, 3, 4, 5] AS array_1,
            [1, 2, 3, 4, 5] AS array_2
    )
ARRAY JOIN
        array_1 AS x,
        array_2 AS y
)

┌─multi─────────┐
│ [1,4,9,16,25] │
└───────────────┘

The same approach can be applied to other array* function with arrayMap-like capabilities to use lambda functions and ARRAY JOIN (arrayJoin).

Examples with bigger arrays:

SET max_threads=1;
SET send_logs_level='trace';

SELECT arrayMap(x -> ((array_1[x]) * (array_2[x])), arrayEnumerate(array_1)) AS multi
FROM
(
    WITH 100 AS size
    SELECT
        materialize(CAST(range(size), 'Array(UInt32)')) AS array_1,
        materialize(CAST(range(size), 'Array(UInt32)')) AS array_2
    FROM numbers(100000000)
)
FORMAT `Null`

<Debug> MemoryTracker: Current memory usage (for query): 8.13 GiB. 

size=100, (2*size)*size = 2*(size^2)

Elapsed: 24.879 sec. Processed 524.04 thousand rows, 4.19 MB (21.06 thousand rows/s., 168.51 KB/s.)

SELECT arrayMap(x -> ((array_1[x]) * (array_2[x])), arrayEnumerate(array_1)) AS multi
FROM
(
    WITH 100 AS size
    SELECT
        materialize(CAST(range(2*size), 'Array(UInt32)')) AS array_1,
        materialize(CAST(range(size), 'Array(UInt32)')) AS array_2
    FROM numbers(100000000)
)
FORMAT `Null`

<Debug> MemoryTracker: Current memory usage (for query): 24.28 GiB.

size=100, (3*size)*2*size = 6*(size^2)

Elapsed: 71.547 sec. Processed 524.04 thousand rows, 4.19 MB (7.32 thousand rows/s., 58.60 KB/s.)


SELECT arrayMap(x -> ((array_1[x]) * (array_2[x])), arrayEnumerate(array_1)) AS multi
FROM
(
    WITH 100 AS size
    SELECT
        materialize(CAST(range(size), 'Array(UInt32)')) AS array_1,
        materialize(CAST(range(2*size), 'Array(UInt32)')) AS array_2
    FROM numbers(100000000)
)
FORMAT `Null`


<Debug> MemoryTracker: Current memory usage (for query): 12.19 GiB.

size=100, (3*size)*size = 3*(size^2)

Elapsed: 36.777 sec. Processed 524.04 thousand rows, 4.19 MB (14.25 thousand rows/s., 113.99 KB/s.)

Which data types we have in those arrays?

WITH 100 AS size
SELECT
    toTypeName(materialize(CAST(range(size), 'Array(UInt32)'))) AS array_1,
    toTypeName(materialize(CAST(range(2 * size), 'Array(UInt32)'))) AS array_2,
    toTypeName(arrayEnumerate(materialize(CAST(range(size), 'Array(UInt32)')))) AS x

┌─array_1───────┬─array_2───────┬─x─────────────┐
│ Array(UInt32) │ Array(UInt32) │ Array(UInt32) │
└───────────────┴───────────────┴───────────────┘

So each value use 4 bytes.

By default ClickHouse execute query by blocks of 65515 rows (max_block_size setting value)

Lets estimate query total memory usage given previous calculations.

WITH
    100 AS size,
    4 AS value_size,
    65515 AS max_block_size
SELECT
    array_1_multiplier,
    array_2_multiplier,
    formatReadableSize(((value_size * max_block_size) * ((array_1_multiplier * size) + (array_2_multiplier * size))) * (array_1_multiplier * size) AS estimated_memory_usage_bytes) AS estimated_memory_usage,
    real_memory_usage,
    round(estimated_memory_usage_bytes / (real_memory_usage * 1073741824), 2) AS ratio
FROM
(
    WITH arrayJoin([(1, 1, 8.13), (2, 1, 24.28), (1, 2, 12.19)]) AS tpl
    SELECT
        tpl.1 AS array_1_multiplier,
        tpl.2 AS array_2_multiplier,
        tpl.3 AS real_memory_usage
)

┌─array_1_multiplier─┬─array_2_multiplier─┬─estimated_memory_usage─┬─real_memory_usage─┬─ratio─┐
│                  1 │                  1 │ 4.88 GiB               │              8.13 │   0.6 │
│                  2 │                  1 │ 14.64 GiB              │             24.28 │   0.6 │
│                  1 │                  2 │ 7.32 GiB               │             12.19 │   0.6 │
└────────────────────┴────────────────────┴────────────────────────┴───────────────────┴───────┘

Correlation is pretty clear.

What if we will reduce size of blocks used for query execution?

SET max_block_size = '16k';

SELECT arrayMap(x -> ((array_1[x]) * (array_2[x])), arrayEnumerate(array_1)) AS multi
FROM
(
    WITH 100 AS size
    SELECT
        materialize(CAST(range(size), 'Array(UInt32)')) AS array_1,
        materialize(CAST(range(2 * size), 'Array(UInt32)')) AS array_2
    FROM numbers(100000000)
)
FORMAT `Null`

<Debug> MemoryTracker: Current memory usage (for query): 3.05 GiB.

Elapsed: 35.935 sec. Processed 512.00 thousand rows, 4.10 MB (14.25 thousand rows/s., 113.98 KB/s.)

Memory usage down in 4 times, which has strong correlation with our change: 65k -> 16k ~ 4 times.

SELECT arrayMap((x, y) -> (x * y), array_1, array_2) AS multi
FROM
(
    WITH 100 AS size
    SELECT
        materialize(CAST(range(size), 'Array(UInt32)')) AS array_1,
        materialize(CAST(range(size), 'Array(UInt32)')) AS array_2
    FROM numbers(100000000)
)
FORMAT `Null`

<Debug> MemoryTracker: Peak memory usage (for query): 226.04 MiB.

Elapsed: 5.700 sec. Processed 11.53 million rows, 92.23 MB (2.02 million rows/s., 16.18 MB/s.)

Almost 100 times faster than first query!

3.6 - assumeNotNull and friends

assumeNotNull and friends

assumeNotNull result is implementation specific:

WITH CAST(NULL, 'Nullable(UInt8)') AS column
SELECT
    column,
    assumeNotNull(column + 999) AS x;

┌─column─┬─x─┐
│   null │ 0 │
└────────┴───┘

WITH CAST(NULL, 'Nullable(UInt8)') AS column
SELECT
    column,
    assumeNotNull(materialize(column) + 999) AS x;

┌─column─┬───x─┐
│   null │ 999 │
└────────┴─────┘

CREATE TABLE test_null
(
    `key` UInt32,
    `value` Nullable(String)
)
ENGINE = MergeTree
ORDER BY key;

INSERT INTO test_null SELECT
    number,
    concat('value ', toString(number))
FROM numbers(4);

SELECT *
FROM test_null;

┌─key─┬─value───┐
│   0 │ value 0 │
│   1 │ value 1 │
│   2 │ value 2 │
│   3 │ value 3 │
└─────┴─────────┘

ALTER TABLE test_null
    UPDATE value = NULL WHERE key = 3;

SELECT *
FROM test_null;

┌─key─┬─value───┐
│   0 │ value 0 │
│   1 │ value 1 │
│   2 │ value 2 │
│   3 │ null    │
└─────┴─────────┘

SELECT
    key,
    assumeNotNull(value)
FROM test_null;

┌─key─┬─assumeNotNull(value)─┐
│   0 │ value 0              │
│   1 │ value 1              │
│   2 │ value 2              │
│   3 │ value 3              │
└─────┴──────────────────────┘

WITH CAST(NULL, 'Nullable(Enum8(\'a\' = 1, \'b\' = 0))') AS test
SELECT assumeNotNull(test)

┌─assumeNotNull(test)─┐
│ b                   │
└─────────────────────┘

WITH CAST(NULL, 'Nullable(Enum8(\'a\' = 1))') AS test
SELECT assumeNotNull(test)

Error on processing query 'with CAST(null, 'Nullable(Enum8(\'a\' = 1))') as test
select assumeNotNull(test); ;':
Code: 36, e.displayText() = DB::Exception: Unexpected value 0 in enum, Stack trace (when copying this message, always include the lines below):

Info

Null values in ClickHouse® are stored in a separate dictionary: is this value Null. And for faster dispatch of functions there is no check on Null value while function execution, so functions like plus can modify internal column value (which has default value). In normal conditions it’s not a problem because on read attempt, ClickHouse first would check the Null dictionary and return value from column itself for non-Nulls only. And assumeNotNull function just ignores this Null dictionary. So it would return only column values, and in certain cases it’s possible to have unexpected results.

If it’s possible to have Null values, it’s better to use ifNull function instead.

SELECT count()
FROM numbers_mt(1000000000)
WHERE NOT ignore(ifNull(toNullable(number), 0))

┌────count()─┐
│ 1000000000 │
└────────────┘

1 rows in set. Elapsed: 0.705 sec. Processed 1.00 billion rows, 8.00 GB (1.42 billion rows/s., 11.35 GB/s.)

SELECT count()
FROM numbers_mt(1000000000)
WHERE NOT ignore(coalesce(toNullable(number), 0))

┌────count()─┐
│ 1000000000 │
└────────────┘

1 rows in set. Elapsed: 2.383 sec. Processed 1.00 billion rows, 8.00 GB (419.56 million rows/s., 3.36 GB/s.)

SELECT count()
FROM numbers_mt(1000000000)
WHERE NOT ignore(assumeNotNull(toNullable(number)))

┌────count()─┐
│ 1000000000 │
└────────────┘

1 rows in set. Elapsed: 0.051 sec. Processed 1.00 billion rows, 8.00 GB (19.62 billion rows/s., 156.98 GB/s.)

SELECT count()
FROM numbers_mt(1000000000)
WHERE NOT ignore(toNullable(number))

┌────count()─┐
│ 1000000000 │
└────────────┘

1 rows in set. Elapsed: 0.050 sec. Processed 1.00 billion rows, 8.00 GB (20.19 billion rows/s., 161.56 GB/s.)

Info

There is no overhead for assumeNotNull at all.

3.7 - Encrypt

WHERE over encrypted column

CREATE TABLE encrypt
(
    `key` UInt32,
    `value` FixedString(4)
)
ENGINE = MergeTree
ORDER BY key;

INSERT INTO encrypt SELECT
    number,
    encrypt('aes-256-ctr', reinterpretAsString(number + 0.3), 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 'xxxxxxxxxxxxxxxx')
FROM numbers(100000000);

SET max_threads = 1;

SELECT count()
FROM encrypt
WHERE value IN encrypt('aes-256-ctr', reinterpretAsString(toFloat32(1.3)), 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 'xxxxxxxxxxxxxxxx')

┌─count()─┐
│       1 │
└─────────┘

1 rows in set. Elapsed: 0.666 sec. Processed 100.00 million rows, 400.01 MB (150.23 million rows/s., 600.93 MB/s.)


SELECT count()
FROM encrypt
WHERE reinterpretAsFloat32(encrypt('aes-256-ctr', value, 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 'xxxxxxxxxxxxxxxx')) IN toFloat32(1.3)

┌─count()─┐
│       1 │
└─────────┘

1 rows in set. Elapsed: 8.395 sec. Processed 100.00 million rows, 400.01 MB (11.91 million rows/s., 47.65 MB/s.)

Info

Because encryption and decryption can be expensive due re-initialization of keys and iv, usually it make sense to use those functions over literal values instead of table column.

3.8 - sequenceMatch

sequenceMatch

Question

I expect the sequence here to only match once as a is only directly after a once - but it matches with gaps. Why is that?

SELECT sequenceCount('(?1)(?2)')(sequence, page ILIKE '%a%', page ILIKE '%a%') AS sequences
  FROM values('page String, sequence UInt16', ('a', 1), ('a', 2), ('b', 3), ('b', 4), ('a', 5), ('b', 6), ('a', 7))

2 # ??

Answer

sequenceMatch just ignores the events which don’t match the condition. Check that:

SELECT sequenceMatch('(?1)(?2)')(sequence,page='a',page='b') AS sequences　FROM values( 'page String, sequence UInt16' , ('a', 1), ('c',2), ('b', 3));
1 # ??

SELECT sequenceMatch('(?1).(?2)')(sequence,page='a',page='b') AS sequences　FROM values( 'page String, sequence UInt16' , ('a', 1), ('c',2), ('b', 3));
0 # ???

SELECT sequenceMatch('(?1)(?2)')(sequence,page='a',page='b', page NOT IN ('a','b')) AS sequences　from values( 'page String, sequence UInt16' , ('a', 1), ('c',2), ('b', 3));
0 # !

SELECT sequenceMatch('(?1).(?2)')(sequence,page='a',page='b', page NOT IN ('a','b')) AS sequences　from values( 'page String, sequence UInt16' , ('a', 1), ('c',2), ('b', 3));
1 #

So for your example - just introduce one more ’nothing matched’ condition:

SELECT sequenceCount('(?1)(?2)')(sequence, page ILIKE '%a%', page ILIKE '%a%', NOT (page ILIKE '%a%')) AS sequences
FROM values('page String, sequence UInt16', ('a', 1), ('a', 2), ('b', 3), ('b', 4), ('a', 5), ('b', 6), ('a', 7))

4 - Integrations

Learn how you can integrate cloud services, BI tools, kafka, MySQL, Spark, MindsDB, and more with ClickHouse®

4.1 - Altinity Cloud Access Management

Enabling access_management for Altinity.Cloud databases.

Organizations that want to enable administrative users in their Altinity.Cloud ClickHouse® servers can do so by enabling access_management manually. This allows for administrative users to be created on the specific ClickHouse Cluster.

WARNING

Modifying the ClickHouse cluster settings manually can lead to the cluster not loading or other issues. Change settings only with full consultation with an Altinity.Cloud support team member, and be ready to remove settings if they cause any disruption of service.

To add the access_management setting to an Altinity.Cloud ClickHouse Cluster:

Log into your Altinity.Cloud account.
For the cluster to modify, select Configure -> Settings.
Cluster setting configure
From the Settings page, select +ADD SETTING.
Add cluster setting
Set the following options:
Setting Type: Select users.d file.
Filename: access_management.xml

Contents: Enter the following to allow the clickhouse_operator that controls the cluster through the clickhouse-operator the ability to set administrative options:

<clickhouse>
    <users>
        <admin>
            <access_management>1</access_management>
        </admin>
        <clickhouse_operator>
            <access_management>1</access_management>
        </clickhouse_operator>
    </users>
</clickhouse>

access_management=1 means that users admin, clickhouse_operator are able to create users and grant them privileges using SQL.

Select OK. The cluster will restart, and users can now be created in the cluster that can be granted administrative access.
If you are running ClickHouse 21.9 and above you can enable storing access management in ZooKeeper. in this case it will be automatically propagated to the cluster. This requires yet another configuration file:
Setting Type: Select config.d file
Filename: user_directories.xml

Contents:

<clickhouse>
  <user_directories replace="replace">
    <users_xml>
      <path>/etc/clickhouse-server/users.xml</path>
    </users_xml>
    <replicated>
      <zookeeper_path>/clickhouse/access/</zookeeper_path>
    </replicated>
    <local_directory>
       <path>/var/lib/clickhouse/access/</path>
    </local_directory>
  </user_directories>
</clickhouse>

4.2 - ClickHouse® python drivers

Python main drivers/clients for ClickHouse®

There are two main python drivers that can be used with ClickHouse. They all have their different set of features and use cases:

ClickHouse driver AKA clickhouse-driver

The clickhouse-driver is a Python library used for interacting with ClickHouse. Here’s a summary of its features:

Connectivity: clickhouse-driver allows Python applications to connect to ClickHouse servers over TCP/IP Native Interface (9000/9440 ports) and also HTTP interface but it is experimental.
SQL Queries: It enables executing SQL queries against ClickHouse databases from Python scripts, including data manipulation (insertion, deletion, updating) and data retrieval (select queries).
Query Parameters: Supports parameterized queries, which helps in preventing SQL injection attacks and allows for more efficient execution of repeated queries with different parameter values.
Connection Pooling: Provides support for connection pooling, which helps manage connections efficiently, especially in high-concurrency applications, by reusing existing connections instead of creating new ones for each query.
Data Types: Handles conversion between Python data types and ClickHouse data types, ensuring compatibility and consistency when passing data between Python and ClickHouse.
Error Handling: Offers comprehensive error handling mechanisms, including exceptions and error codes, to facilitate graceful error recovery and handling in Python applications.
Asynchronous Support: Supports asynchronous execution of queries using asyncio, allowing for non-blocking query execution in asynchronous Python applications.
Customization: Provides options for customizing connection settings, query execution behavior, and other parameters to suit specific application requirements and performance considerations.
Compatibility: Works with various versions of ClickHouse, ensuring compatibility and support for different ClickHouse features and functionalities.
Documentation and Community: Offers comprehensive documentation and active community support, including examples, tutorials, and forums, to assist developers in effectively using the library and addressing any issues or questions they may have.
Supports multiple host on connection string https://clickhouse-driver.readthedocs.io/en/latest/features.html#multiple-hosts
Connection pooling (aiohttp)

Python ecosystem libs/modules:

Good Pandas/Numpy support: https://clickhouse-driver.readthedocs.io/en/latest/features.html#numpy-pandas-support
Good SQLALchemy support: https://pypi.org/project/clickhouse-sqlalchemy/

This was the first python driver for ClickHouse. It has a mature codebase. By default ClickHouse drivers uses synchronous code . There is a wrapper to convert code to asynchronous, https://github.com/long2ice/asynch

Here you can get a basic working example from Altinity repo for ingestion/selection using clickhouse-driver:

https://github.com/lesandie/clickhouse-tests/blob/main/scripts/test_ch_driver.py

ClickHouse-connect AKA clickhouse-connect

The ClickHouse Connect Python driver is the ClickHouse, Inc supported-official Python library. Here’s a summary of its key features:

Connectivity: allows Python applications to connect to ClickHouse servers over HTTP Interface (8123/8443 ports).
Compatibility: The driver is compatible with Python 3.x versions, ensuring that it can be used with modern Python applications without compatibility issues.
Performance: The driver is optimized for performance, allowing for efficient communication with ClickHouse databases to execute queries and retrieve results quickly, which is crucial for applications requiring low latency and high throughput.
Query Execution: Developers can use the driver to execute SQL queries against ClickHouse databases, including SELECT, INSERT, UPDATE, DELETE, and other SQL operations, enabling them to perform various data manipulation tasks from Python applications.
Parameterized Queries: The driver supports parameterized queries, allowing developers to safely pass parameters to SQL queries to prevent SQL injection attacks and improve query performance by reusing query execution plans.
Data Type Conversion: The driver automatically handles data type conversion between Python data types and ClickHouse data types, ensuring seamless integration between Python applications and ClickHouse databases without manual data type conversion.
Error Handling: The driver provides robust error handling mechanisms, including exceptions and error codes, to help developers handle errors gracefully and take appropriate actions based on the type of error encountered during query execution.
Limited Asynchronous Support: Some implementations of the driver offer asynchronous support, allowing developers to execute queries asynchronously to improve concurrency and scalability in asynchronous Python applications using asynchronous I/O frameworks like asyncio.
Configuration Options: The driver offers various configuration options, such as connection parameters, authentication methods, and connection pooling settings, allowing developers to customize the driver’s behavior to suit their specific requirements and environment.
Documentation and Community: Offers comprehensive documentation and active community support, including examples, tutorials, and forums, to assist developers in effectively using the library and addressing any issues or questions they may have. https://clickhouse.com/docs/en/integrations/language-clients/python/intro/
Multiple host on connection string not supported https://github.com/ClickHouse/clickhouse-connect/issues/74
Connection pooling (urllib3)

Python ecosystem libs/modules:

Good Pandas/Numpy support: https://clickhouse.com/docs/en/integrations/python#consuming-query-results-with-numpy-pandas-or-arrow
Decent SQLAlchemy 1.3 and 1.4 support (limited feature set)

It is the most recent driver with the latest feature set (query context and query streaming …. ), and in recent release asyncio wrapper

You can check multiple official examples here:

https://github.com/ClickHouse/clickhouse-connect/tree/457533df05fa685b2a1424359bea5654240ef971/examples

Also some Altinity examples from repo:

https://github.com/lesandie/clickhouse-tests/blob/main/scripts/test_ch_connect_asyncio_insert.py

You can clone the repo and use the helper files like DDL.sql to setup some tests.

Most common use cases:

Connection pooler:

Clickhouse-connect can use a connection pooler (based on urllib3) https://clickhouse.com/docs/en/integrations/python#customizing-the-http-connection-pool
Clickhouse-driver you can use aiohttp (https://docs.aiohttp.org/en/stable/client_advanced.html#limiting-connection-pool-size )

Managing ClickHouse `session_id`:

clickhouse-driver
- Because it is using the Native Interface session_id is managed internally by clickhouse, so it is very rare (unless using asyncio) to get:
Code: 373. DB::Exception: Session is locked by a concurrent client. (SESSION_IS_LOCKED) .
clickhouse-connect: How to use clickhouse-connect in a pythonic way and avoid getting SESSION_IS_LOCKED exceptions:
- https://clickhouse.com/docs/en/integrations/python#managing-clickhouse-session-ids
- If you want to specify a session_id per query you should be able to use the setting dictionary to pass a session_id for each query (note that ClickHouse will automatically generate a session_id if none is provided).
```
SETTINGS = {"session_id": "dagster-batch" + "-" + f"{time.time()}"}
client.query("INSERT INTO table ....", settings=SETTINGS)
```

Also in clickhouse documentation some explanation how to set session_id with another approach: https://clickhouse.com/docs/en/integrations/python#managing-clickhouse-session-ids

ClickHouse Connect Driver API | ClickHouse Docs

Best practices with flask · Issue #73 · ClickHouse/clickhouse-connect

Asyncio (asynchronous wrappers)

clickhouse-connect

New release with asyncio wrapper for clickhouse-connect

How the wrapper works: https://clickhouse.com/docs/en/integrations/python#asyncclient-wrapper

Wrapper and connection pooler example:

import clickhouse_connect
import asyncio
from clickhouse_connect.driver.httputil import get_pool_manager

async def main():
    client = await clickhouse_connect.get_async_client(host='localhost', port=8123, pool_mgr=get_pool_manager())
    for i in range(100):
        result = await client.query("SELECT name FROM system.databases")
        print(result.result_rows)

asyncio.run(main())

clickhouse-connect code is synchronous by default and running synchronous functions in an async application is a workaround and might not be as efficient as using a library/wrapper designed for asynchronous operations from the ground up.. So you can use the current wrapper or you can use another approach with asyncio and concurrent.futures and ThreadpoolExecutor or ProcessPoolExecutor. Python GIL has a mutex over Threads but not to Processes so if you need performance at the cost of using processes instead of threads (not much different for medium workloads) you can use ProcesspoolExecutor instead.

Some info about this from the tinybird guys https://www.tinybird.co/blog-posts/killing-the-processpoolexecutor

For clickhouse-connect :

import asyncio
from concurrent.futures import ProcessPoolExecutor
import clickhouse_connect

# Function to execute a query using clickhouse-connect synchronously
def execute_query_sync(query):
    client = clickhouse_connect.get_client()  # Adjust connection params as needed
    result = client.query(query)
    return result

# Asynchronous wrapper function to run the synchronous function in a process pool
async def execute_query_async(query):
    loop = asyncio.get_running_loop()
    # Use ProcessPoolExecutor to execute the synchronous function
    with ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, execute_query_sync, query)
        return result

async def main():
    query = "SELECT * FROM your_table LIMIT 10"  # Example query
    result = await execute_query_async(query)
    print(result)

# Run the async main function
if __name__ == '__main__':
    asyncio.run(main())

Clickhouse-driver

clickhouse-driver code is also synchronous and suffers the same problem as clickhouse-connect https://clickhouse-driver.readthedocs.io/en/latest/quickstart.html#async-and-multithreading

So to use asynchronous approach it is recommended to use a connection pool and some asyncio wrapper that can hide the complexity of using the ThreadPoolExecutor/ProcessPoolExecutor

To begin testing such environment aiohttp is a good approach. Here an example: https://github.com/lesandie/clickhouse-tests/blob/main/scripts/test_aiohttp_inserts.py This will use simply requests module and aiohttp (you can tune the connection pooler https://docs.aiohttp.org/en/stable/client_advanced.html#limiting-connection-pool-size )
Also aiochclient is another good wrapper https://github.com/maximdanilchenko/aiochclient for the HTTP interface
For the native interface you can try https://github.com/long2ice/asynch , asynch is an asyncio ClickHouse Python Driver with native (TCP) interface support, which reuse most of clickhouse-driver and comply with PEP249 .

4.3 - MySQL

Replication using MaterializeMySQL.

It reads mysql binlog directly and transform queries into something which ClickHouse® can support. Supports updates and deletes (under the hood implemented via something like ReplacingMergeTree with enforced FINAL and ‘deleted’ flag). Status is ’experimental’, there are quite a lot of known limitations and issues, but some people use it. The original author of that went to another project, and the main team don’t have a lot of resource to improve that for now (more important thing in the backlog)

The replication happens on the mysql database level.

Replication using debezium + Kafka (+ Altinity Sink Connector for ClickHouse)

Debezium can read the binlog and transform it to Kafka messages.

You can later capture the stream of message on ClickHouse side and process it as you like. Please remember that currently Kafka engine supports only at-least-once delivery guarantees. It’s used by several companies, quite nice & flexible. But initial setup may require some efforts.

Altinity Sink Connector for ClickHouse

Can handle transformation of debezium messages (with support for DELETEs and UPDATEs) and exactly-once delivery for you.

Links:

Same as above but using https://maxwells-daemon.io/ instead of debezium.

Have no experience / feedback there, but should be very similar to debezium.

Replication using clickhouse-mysql

See https://altinity.com/blog/2018/6/30/realtime-mysql-clickhouse-replication-in-practice

That was done long time ago in altinity for one use-case, and it seem like it was never used outside of that. It’s a python application with lot of switches which can copy a schema or read binlog from mysql and put it to ClickHouse. Not supported currently. But it’s just a python, so maybe can be adjusted to different needs.

Accessing MySQL data via integration engines from inside ClickHouse.

MySQL table engine / table function , or MySQL database engine - ClickHouse just connects to mysql server as a client, and can do normal selects.

We had webinar about that a year ago: https://www.youtube.com/watch?v=44kO3UzIDLI

Using that you can easily create some ETL script which will copy the data from mysql to ClickHouse regularly, i.e. something like

INSERT INTO clickhouse_table SELECT * FROM mysql_table WHERE id > ...

Works great if you have append only table in MySQL.

In newer ClickHouse versions you can query this was also sharded / replicated MySQL cluster - see ExternalDistributed

MySQL dictionaries

There are also MySQL dictionaries, which can be very nice alternative for storing some dimensions information in star schema.

4.4 - ODBC Driver for ClickHouse®

ODBC Driver for ClickHouse®

ODBC interface for ClickHouse® RDBMS.

Licensed under the Apache 2.0 .

Installation and usage

Windows

Download the latest release . On 64bit system you usually need both 32 bit and 64 bit drivers.
Install (usually you will need ANSI driver, but better to install both versions, see below).
Configure ClickHouse DSN.

Note: that install driver linked against MDAC (which is default for Windows), some non-windows native applications (cygwin / msys64 based) may require driver linked against unixodbc. Build section below.

MacOS

Install homebrew .
Install driver

brew install https://raw.githubusercontent.com/proller/homebrew-core/chodbc/Formula/clickhouse-odbc.rb

Add ClickHouse DSN configuration into ~/.odbc.ini file. (sample )

Note: that install driver linked against iodbc (which is default for Mac), some homebrew applications (like python) may require unixodbc driver to work properly. In that case see Build section below.

Linux

DEB/RPM packaging is not provided yet, please build & install the driver from sources.
Add ClickHouse DSN configuration into ~/.odbc.ini file. (sample )

Configuration

On Linux / Max you configure DSN by adding new desctions in ~/.odbc.ini (See sample file: https://github.com/ClickHouse/clickhouse-odbc/blob/fd74398b50201ab13b535cdfab57bca86e588b37/packaging/odbc.ini.sample )

On Windows you can create/edit DSN using GUI tool through Control Panel.

The list of DSN parameters recognized by the driver is as follows:

Parameter	Default value	Description
`Url`	empty	URL that points to a running ClickHouse instance, may include username, password, port, database, etc.
`Proto`	deduced from `Url`, or from `Port` and `SSLMode`: `https` if `443` or `8443` or `SSLMode` is not empty, `http` otherwise	Protocol, one of: `http`, `https`
`Server` or `Host`	deduced from `Url`	IP or hostname of a server with a running ClickHouse instance on it
`Port`	deduced from `Url`, or from `Proto`: `8443` if `https`, `8123` otherwise	Port on which the ClickHouse instance is listening
`Path`	`/query`	Path portion of the URL
`UID` or `Username`	`default`	User name
`PWD` or `Password`	empty	Password
`Database`	`default`	Database name to connect to
`Timeout`	`30`	Connection timeout
`SSLMode`	empty	Certificate verification method (used by TLS/SSL connections, ignored in Windows), one of: `allow`, `prefer`, `require`, use `allow` to enable SSL_VERIFY_PEER TLS/SSL certificate verification mode, SSL_VERIFY_PEER \| SSL_VERIFY_FAIL_IF_NO_PEER_CERT is used otherwise
`PrivateKeyFile`	empty	Path to private key file (used by TLS/SSL connections), can be empty if no private key file is used
`CertificateFile`	empty	Path to certificate file (used by TLS/SSL connections, ignored in Windows), if the private key and the certificate are stored in the same file, this can be empty if `PrivateKeyFile` is specified
`CALocation`	empty	Path to the file or directory containing the CA/root certificates (used by TLS/SSL connections, ignored in Windows)
`DriverLog`	`on` if `CMAKE_BUILD_TYPE` is `Debug`, `off` otherwise	Enable or disable the extended driver logging
`DriverLogFile`	`\temp\clickhouse-odbc-driver.log` on Windows, `/tmp/clickhouse-odbc-driver.log` otherwise	Path to the extended driver log file (used when `DriverLog` is `on`)

Troubleshooting & bug reporting

If some software doesn’t work properly with that driver, but works good with other drivers - we will be appropriate if you will be able to collect debug info.

To debug issues with the driver, first things that need to be done are:

enabling driver manager tracing. Links may contain some irrelevant vendor-specific details.
- on Windows/MDAC: 1 , 2 , 3
- on Mac/iODBC: 1 , 2
- on Linux/unixODBC: 1 , 2
enabling driver logging, see DriverLog and DriverLogFile DSN parameters above
making sure that the application is allowed to create and write these driver log and driver manager trace files
follow the steps leading to the issue.

Collected log files will help to diagnose & solve the issue.

Driver Managers

Note, that since ODBC drivers are not used directly by a user, but rather accessed through applications, which in their turn access the driver through ODBC driver manager, user have to install the driver for the same architecture (32- or 64-bit) as the application that is going to access the driver. Moreover, both the driver and the application must be compiled for (and actually use during run-time) the same ODBC driver manager implementation (we call them “ODBC providers” here). There are three supported ODBC providers:

ODBC driver manager associated with MDAC (Microsoft Data Access Components, sometimes referenced as WDAC, Windows Data Access Components) - the standard ODBC provider of Windows
UnixODBC - the most common ODBC provider in Unix-like systems. Theoretically, could be used in Cygwin or MSYS/MinGW environments in Windows too.
iODBC - less common ODBC provider, mainly used in Unix-like systems, however, it is the standard ODBC provider in macOS. Theoretically, could be used in Cygwin or MSYS/MinGW environments in Windows too.

If you don’t see a package that matches your platforms, or the version of your system is significantly different than those of the available packages, or maybe you want to try a bleeding edge version of the code that hasn’t been released yet, you can always build the driver manually from sources.

Note, that it is always a good idea to install the driver from the corresponding native package (.msi, etc., which you can also easily create if you are building from sources), than use the binaries that were manually copied to some folder.

Building from sources

The general requirements for building the driver from sources are as follows:

CMake 3.12 and later
C++17 and C11 capable compiler toolchain:
- Clang 4 and later
- GCC 7 and later
- Xcode 10 and later
- Microsoft Visual Studio 2017 and later
ODBC Driver manager (MDAC / unixodbc / iODBC)
SSL library (openssl)

Generic build scenario:

git clone --recursive git@github.com:ClickHouse/clickhouse-odbc.git
cd clickhouse-odbc
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
cmake --build . -C RelWithDebInfo

Additional requirements exist for each platform, which also depend on whether packaging and/or testing is performed.

Linux/macOS

Execute the following in the terminal to install needed dependencies:

# on Red Hat/CentOS (tested on CentOS 7)
sudo yum groupinstall "Development Tools"
sudo yum install centos-release-scl
sudo yum install devtoolset-8
sudo yum install git cmake openssl-devel unixODBC-devel # You may use libiodbc-devel INSTEAD of unixODBC-devel
scl enable devtoolset-8 -- bash # Enable Software collections for that terminal session, to use newer versions of complilers

# on Ubuntu (tested on Ubuntu 18.10, for older versions you may need to install newer c++ compiler and cmake versions)
sudo apt install build-essential git cmake libpoco-dev libssl-dev unixodbc-dev # You may use libiodbc-devel INSEAD of unixODBC-devel

# MacOS: 
# You will need Xcode 10 or later and Command Line Tools to be installed, as well as [Homebrew](https://brew.sh/).
brew install git cmake make poco openssl libiodbc # You may use unixodbc INSTEAD of libiodbc

Note: usually on Linux you use unixODBC driver manager, and on Mac - iODBC. In some (rare) cases you may need use other driver manager, please do it only if you clearly understand the differences. Driver should be used with the driver manager it was linked to.

Clone the repo with submodules:

git clone --recursive git@github.com:ClickHouse/clickhouse-odbc.git

Enter the cloned source tree, create a temporary build folder, and generate a Makefile for the project in it:

cd clickhouse-odbc
mkdir build
cd build

# Configuration options for the project can be specified in the next command in a form of '-Dopt=val'
# For MacOS: you may also add '-G Xcode' to the next command, in order to use Xcode as a build system or IDE, and generate the solution and project files instead of Makefile.
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..

Build the generated solution in-place:

cmake --build . -C RelWithDebInfo
cmake --build . -C RelWithDebInfo --target package

…and, optionally, run tests (note, that for non-unit tests, preconfigured driver and DSN entries must exist, that point to the binaries generated in this build folder):

cmake --build . -C RelWithDebInfo --target test

For MacOS: if you configured the project with ‘-G Xcode’ initially, open the IDE and build all, package, and test targets manually from there

cmake --open .

Windows

CMake bundled with the recent versions of Visual Studio can be used.

An SDK required for building the ODBC driver is included in Windows SDK, which in its turn is also bundled with Visual Studio.

You will need to install WiX toolset to be able to generate .msi packages. You can download and install it from WiX toolset home page .

All of the following commands have to be issued in Visual Studio Command Prompt:

use x86 Native Tools Command Prompt for VS 2019 or equivalent for 32-bit builds
use x64 Native Tools Command Prompt for VS 2019 or equivalent for 64-bit builds

Clone the repo with submodules:

git clone --recursive git@github.com:ClickHouse/clickhouse-odbc.git

Enter the cloned source tree, create a temporary build folder, and generate the solution and project files in it:

cd clickhouse-odbc
mkdir build
cd build

# Configuration options for the project can be specified in the next command in a form of '-Dopt=val'

# Use the following command for 32-bit build only.
cmake -A Win32 -DCMAKE_BUILD_TYPE=RelWithDebInfo ..

# Use the following command for 64-bit build only.
cmake -A x64 -DCMAKE_BUILD_TYPE=RelWithDebInfo ..

Build the generated solution in-place:

cmake --build . -C RelWithDebInfo
cmake --build . -C RelWithDebInfo --target package

…and, optionally, run tests (note, that for non-unit tests, preconfigured driver and DSN entries must exist, that point to the binaries generated in this build folder):

cmake --build . -C RelWithDebInfo --target test

…or open the IDE and build all, package, and test targets manually from there:

cmake --open .

cmake options

The list of configuration options recognized during the CMake generation step is as follows:

Option	Default value	Description
`CMAKE_BUILD_TYPE`	`RelWithDebInfo`	Build type, one of: `Debug`, `Release`, `RelWithDebInfo`
`CH_ODBC_ENABLE_SSL`	`ON`	Enable TLS/SSL (required for utilizing `https://` interface, etc.)
`CH_ODBC_ENABLE_INSTALL`	`ON`	Enable install targets (required for packaging)
`CH_ODBC_ENABLE_TESTING`	inherits value of `BUILD_TESTING`	Enable test targets
`CH_ODBC_PREFER_BUNDLED_THIRD_PARTIES`	`ON`	Prefer bundled over system variants of third party libraries
`CH_ODBC_PREFER_BUNDLED_POCO`	inherits value of `CH_ODBC_PREFER_BUNDLED_THIRD_PARTIES`	Prefer bundled over system variants of Poco library
`CH_ODBC_PREFER_BUNDLED_SSL`	inherits value of `CH_ODBC_PREFER_BUNDLED_POCO`	Prefer bundled over system variants of TLS/SSL library
`CH_ODBC_PREFER_BUNDLED_GOOGLETEST`	inherits value of `CH_ODBC_PREFER_BUNDLED_THIRD_PARTIES`	Prefer bundled over system variants of Google Test library
`CH_ODBC_PREFER_BUNDLED_NANODBC`	inherits value of `CH_ODBC_PREFER_BUNDLED_THIRD_PARTIES`	Prefer bundled over system variants of nanodbc library
`CH_ODBC_RUNTIME_LINK_STATIC`	`OFF`	Link with compiler and language runtime statically
`CH_ODBC_THIRD_PARTY_LINK_STATIC`	`ON`	Link with third party libraries statically
`CH_ODBC_DEFAULT_DSN_ANSI`	`ClickHouse DSN (ANSI)`	Default ANSI DSN name
`CH_ODBC_DEFAULT_DSN_UNICODE`	`ClickHouse DSN (Unicode)`	Default Unicode DSN name
`TEST_DSN`	inherits value of `CH_ODBC_DEFAULT_DSN_ANSI`	ANSI DSN name to use in tests
`TEST_DSN_W`	inherits value of `CH_ODBC_DEFAULT_DSN_UNICODE`	Unicode DSN name to use in tests

Packaging / redistributing the driver

You can just copy the library to another computer, in that case you need to

install run-time dependencies on target computer
- Windows:
  - MDAC driver manager (preinstalled on all modern Windows systems)
  - C++ Redistributable for Visual Studio 2017 or same for 2019, etc.
- Linux

# CentOS / RedHat
sudo yum install openssl unixODBC

# Debian/Ubuntu
sudo apt install openssl unixodbc

MacOS (assuming you have Homebrew installed):

brew install poco openssl libiodbc

All this involves modifying a dedicated registry keys in case of MDAC, or editing odbcinst.ini (for driver registration) and odbc.ini (for DSN definition) files for UnixODBC or iODBC, directly or indirectly.

This will be done automatically using some default values if you are installing the driver using native installers.

Otherwise, if you are configuring manually, or need to modify the default configuration created by the installer, please see the exact locations of files (or registry keys) that need to be modified.

4.5 - ClickHouse® + Spark

jdbc

The trivial & natural way to talk to ClickHouse from Spark is using jdbc. There are 2 jdbc drivers:

ClickHouse-Native-JDBC has some hints about integration with Spark even in the main README file.

‘Official’ driver does support some conversion of complex data types (Roaring bitmaps) for Spark-ClickHouse integration: https://github.com/ClickHouse/clickhouse-jdbc/pull/596

But proper partitioning of the data (to spark partitions) may be tricky with jdbc.

Some example snippets:

Connectors

https://github.com/DmitryBe/spark-clickhouse (looks dead)
https://github.com/VaBezruchko/spark-clickhouse-connector (is not actively maintained).
https://github.com/housepower/spark-clickhouse-connector (actively developing connector from housepower - same guys as authors of ClickHouse-Native-JDBC)

via Kafka

ClickHouse can produce / consume data from/to Kafka to exchange data with Spark.

via hdfs

You can load data into hadoop/hdfs using sequence of statements like INSERT INTO FUNCTION hdfs(...) SELECT ... FROM clickhouse_table later process the data from hdfs by spark and do the same in reverse direction.

via s3

Similar to above but using s3.

via shell calls

You can call other commands from Spark. Those commands can be clickhouse-client and/or clickhouse-local.

do you really need Spark? :)

In many cases you can do everything inside ClickHouse without Spark help :) Arrays, Higher-order functions, machine learning, integration with lot of different things including the possibility to run some external code using executable dictionaries or UDF.

More info + some unordered links (mostly in Chinese / Russian)

Spark + ClickHouse: not a fight, but a symbiosis (Russian) https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup28/spark_and_clickhouse.pdf (russian)
Using a bunch of ClickHouse and Spark in MFI Soft (Russian) https://www.youtube.com/watch?v=ID8eTnmag0s (russian)
Spark read and write ClickHouse (Chinese: Spark读写ClickHouse) https://yerias.github.io/2020/12/08/clickhouse/9/#Jdbc%E6%93%8D%E4%BD%9Cclickhouse
Spark JDBC write ClickHouse operation summary (Chinese: Spark JDBC 写 ClickHouse 操作总结) https://www.jianshu.com/p/43f78c8a025b?hmsr=toutiao.io&utm_campaign=toutiao.io&utm_medium=toutiao.io&utm_source=toutiao.io
Spark-sql is based on ClickHouse’s DataSourceV2 data source extension (Chinese: spark-sql基于ClickHouse的DataSourceV2数据源扩展) https://www.cnblogs.com/mengyao/p/4689866.html
Alibaba integration instructions (English) https://www.alibabacloud.com/help/doc-detail/191192.htm
Tencent integration instructions (English) https://intl.cloud.tencent.com/document/product/1026/35884
Yandex DataProc demo: loading files from S3 to ClickHouse with Spark (Russian) https://www.youtube.com/watch?v=N3bZW0_rRzI
ClickHouse official documentation_Spark JDBC writes some pits of ClickHouse (Chinese: ClickHouse官方文档_Spark JDBC写ClickHouse的一些坑) https://blog.csdn.net/weixin_39615984/article/details/111206050
ClickHouse data import: Flink, Spark, Kafka, MySQL, Hive (Chinese: 篇五|ClickHouse数据导入 Flink、Spark、Kafka、MySQL、Hive) https://zhuanlan.zhihu.com/p/299094269
SPARK-CLICKHOUSE-ES REAL-TIME PROJECT EIGHTH DAY-PRECISE ONE-TIME CONSUMPTION SAVE OFFSET. (Chinese: SPARK-CLICKHOUSE-ES实时项目第八天-精确一次性消费保存偏移量) https://www.freesion.com/article/71421322524/
HDFS+ClickHouse+Spark: A lightweight big data analysis system from 0 to 1. (Chinese: HDFS+ClickHouse+Spark：从0到1实现一款轻量级大数据分析系统) https://juejin.cn/post/6850418114962653198
ClickHouse Clustering for Spark Developer (English) http://blog.madhukaraphatak.com/clickouse-clustering-spark-developer/
«Иногда приходится заглядывать в код Spark»: Александр Морозов (SEMrush) об использовании Scala, Spark и ClickHouse. (Russian) https://habr.com/ru/company/jugru/blog/341288/

4.6 - BI Tools

Business Intelligence Tools

Superset: https://superset.apache.org/docs/databases/clickhouse
Metabase: https://github.com/enqueue/metabase-clickhouse-driver
Querybook: https://www.querybook.org/docs/setup_guide/connect_to_query_engines/#all-query-engines
Tableau: Altinity Tableau Connector for ClickHouse® support both JDBC & ODBC drivers
Looker: https://docs.looker.com/setup-and-management/database-config/clickhouse
Apache Zeppelin
SeekTable
ReDash
Mondrian: https://altinity.com/blog/accessing-clickhouse-from-excel-using-mondrian-rolap-engine
Grafana: Integrating Grafana with ClickHouse
Cumul.io
Tablum: https://tablum.io

4.7 - CatBoost / MindsDB / Fast.ai

CatBoost / MindsDB / Fast.ai

Info

Article is based on feedback provided by one of Altinity clients.

CatBoost:

It uses gradient boosting - a hard to use technique which can outperform neural networks. Gradient boosting is powerful but it’s easy to shoot yourself in the foot using it.
The documentation on how to use it is quite lacking. The only good source of information on how to properly configure a model to yield good results is this video: https://www.youtube.com/watch?v=usdEWSDisS0 . We had to dig around GitHub issues to find out how to make it work with ClickHouse®.
CatBoost is fast. Other libraries will take ~5X to ~10X as long to do what CatBoost does.
CatBoost will do preprocessing out of the box (fills nulls, apply standard scaling, encodes strings as numbers).
CatBoost has all functions you’d need (metrics, plotters, feature importance)

It makes sense to split what CatBoost does into 2 parts:

preprocessing (fills nulls, apply standard scaling, encodes strings as numbers)
number crunching (convert preprocessed numbers to another number - ex: revenue of impression)

Compared to Fast.ai , CatBoost pre-processing is as simple to use and produces results that can be as good as Fast.ai .

The number crunching part of Fast.ai is no-config. For CatBoost you need to configure it, a lot.

CatBoost won’t simplify or hide any complexity of the process. So you need to know data science terms and what it does (ex: if your model is underfitting you can use a smaller l2_reg parameter in the model constructor).

In the end both Fast.ai and CatBoost can yield comparable results.

Regarding deploying models, CatBoost is really good. The model runs fast, it has a simple binary format which can be loaded in ClickHouse, C, or Python and it will encapsulate pre-processing with the binary file. Deploying Fast.ai models at scale/speed is impossible out of the box (we have our custom solution to do it which is not simple).

TLDR: CatBoost is fast, produces awesome models, is super easy to deploy and it’s easy to use/train (after becoming familiar with it despite the bad documentation & if you know data science terms).

Regarding MindsDB

The project seems to be a good idea but it’s too young. I was using the GUI version and I’ve encountered some bugs, and none of those bugs have a good error message.

It won’t show data in preview.
The “download” button won’t work.
It’s trying to create and drop tables in ClickHouse without me asking it to.
Other than bugs:
- It will only use 1 core to do everything (training, analysis, download).
- Analysis will only run with a very small subset of data, if I use something like 1M rows it never finishes.
Training a model on 100k rows took 25 minutes - (CatBoost takes 90s to train with 1M rows)
The model trained on MindsDB is way worse. It had r-squared of 0.46 (CatBoost=0.58)
To me it seems that they are a plugin which connects ClickHouse to MySQL to run the model in Pytorch.
It’s too complex and hard to debug and understand. The resulting model is not good enough.
TLDR: Easy to use (if bugs are ignored), too slow to train & produces a bad model.

4.8 - Google S3 (GCS)

GCS with the table function - seems to work correctly for simple scenarios.

Essentially you can follow the steps from the Migrating from Amazon S3 to Cloud Storage .

Set up a GCS bucket.
This bucket must be set as part of the default project for the account. This configuration can be found in settings -> interoperability.
Generate a HMAC key for the account, can be done in settings -> interoperability, in the section for user account access keys.
In ClickHouse®, replace the S3 bucket endpoint with the GCS bucket endpoint This must be done with the path-style GCS endpoint: https://storage.googleapis.com/BUCKET_NAME/OBJECT_NAME.
Replace the aws access key id and aws secret access key with the corresponding parts of the HMAC key.

4.9 - Kafka engine

Kafka engine

librdkafka changelog

This changelog tracks the librdkafka version bundled with ClickHouse and notable related fixes.

git log -- contrib/librdkafka | git name-rev --stdin

ClickHouse® version	librdkafka version
25.3+ (#63697 )	2.8.0 + few fixes
21.10+ (#27883 )	1.6.1 + snappy fixes + boring ssl + illumos_build fixes + edenhill#3279 fix
21.6+ (#23874 )	1.6.1 + snappy fixes + boring ssl + illumos_build fixes
21.1+ (#18671 )	1.6.0-RC3 + snappy fixes + boring ssl
20.13+ (#18053 )	1.5.0 + msan fixes + snappy fixes + boring ssl
20.7+ (#12991 )	1.5.0 + msan fixes
20.5+ (#11256 )	1.4.2
20.2+ (#9000 )	1.3.0
19.11+ (#5872 )	1.1.0
19.5+ (#4799 )	1.0.0
19.1+ (#4025 )	1.0.0-RC5
v1.1.54382+ (#2276 )	0.11.4

4.9.1 - Fundamentals

Core Kafka engine behavior and query semantics in ClickHouse.

4.9.1.1 - Config by provider

Kafka engine configuration examples grouped by managed Kafka provider.

Sometimes the consumer group needs to be explicitly allowed in the broker UI config.

Read Adjusting librdkafka settings first, then apply the provider-specific settings below.

Amazon MSK | SASL/SCRAM

<yandex>
  <kafka>
    <security_protocol>sasl_ssl</security_protocol>
    <!-- Depending on your broker config you may need to uncomment below sasl_mechanism -->
    <!-- <sasl_mechanism>SCRAM-SHA-512</sasl_mechanism> -->
    <sasl_username>root</sasl_username>
    <sasl_password>toor</sasl_password>
  </kafka>
</yandex>

Broker ports detail
Read here more (Russian language)

on-prem / self-hosted Kafka broker

<yandex>
  <kafka>
    <security_protocol>sasl_ssl</security_protocol>
    <sasl_mechanism>SCRAM-SHA-512</sasl_mechanism>
    <sasl_username>root</sasl_username>
    <sasl_password>toor</sasl_password>
    <!-- fullchain cert here -->
    <ssl_ca_location>/path/to/cert/fullchain.pem</ssl_ca_location>
  </kafka>
</yandex>

Inline Kafka certs

To connect to some Kafka cloud services you may need to use certificates.

If needed they can be converted to pem format and inlined into ClickHouse® config.xml Example:

<kafka>
<ssl_key_pem><![CDATA[
  RSA Private-Key: (3072 bit, 2 primes)
    ....
-----BEGIN RSA PRIVATE KEY-----
...
-----END RSA PRIVATE KEY-----
]]></ssl_key_pem>
<ssl_certificate_pem><![CDATA[
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
]]></ssl_certificate_pem>
</kafka>

See

Azure Event Hub

See https://github.com/ClickHouse/ClickHouse/issues/12609

Confluent Cloud / Google Cloud

<yandex>
  <kafka>
    <auto_offset_reset>smallest</auto_offset_reset>
    <security_protocol>SASL_SSL</security_protocol>
    <!-- older broker versions may need this below, for newer versions ignore -->
    <!-- <ssl_endpoint_identification_algorithm>https</ssl_endpoint_identification_algorithm> -->
    <sasl_mechanism>PLAIN</sasl_mechanism>
    <sasl_username>username</sasl_username>
    <sasl_password>password</sasl_password>
    <!-- Same as above here ignore if newer broker version -->
    <!-- <ssl_ca_location>probe</ssl_ca_location> -->
  </kafka>
</yandex>

4.9.1.2 - Kafka engine Virtual columns

Kafka virtual columns

Kafka engine virtual columns (built-in)

From the Kafka engine docs , the supported virtual columns are:

_topic — Kafka topic (LowCardinality(String))
_key — message key (String)
_offset — message offset (UInt64)
_timestamp — message timestamp (Nullable(DateTime))
_timestamp_ms — timestamp with millisecond precision (Nullable(DateTime64(3)))
_partition — partition (UInt64)
_headers.name — header keys (Array(String))
_headers.value — header values (Array(String))

Extra virtual columns when you enable parse-error streaming:

If you set kafka_handle_error_mode='stream', ClickHouse adds:

_raw_message — the raw message that failed to parse (String)
_error — the exception message from parsing failure (String)

Note: _raw_message and _error are populated only when parsing fails; otherwise they’re empty.

We can use these columns in a materialized view like this for example:

4.9.1.3 - Adjusting librdkafka settings

Adjusting librdkafka settings

To set rdkafka options - add to <kafka> section in config.xml or preferably use a separate file in config.d/:
- https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md

Some random example using SSL certificates to authenticate:

<yandex>
    <kafka>
        <max_poll_interval_ms>60000</max_poll_interval_ms>
        <session_timeout_ms>60000</session_timeout_ms>
        <heartbeat_interval_ms>10000</heartbeat_interval_ms>
        <reconnect_backoff_ms>5000</reconnect_backoff_ms>
        <reconnect_backoff_max_ms>60000</reconnect_backoff_max_ms>
        <request_timeout_ms>20000</request_timeout_ms>
        <retry_backoff_ms>500</retry_backoff_ms>
        <message_max_bytes>20971520</message_max_bytes>
        <debug>all</debug><!-- only to get the errors -->
        <security_protocol>SSL</security_protocol>
        <ssl_ca_location>/etc/clickhouse-server/ssl/kafka-ca-qa.crt</ssl_ca_location>
        <ssl_certificate_location>/etc/clickhouse-server/ssl/client_clickhouse_client.pem</ssl_certificate_location>
        <ssl_key_location>/etc/clickhouse-server/ssl/client_clickhouse_client.key</ssl_key_location>
        <ssl_key_password>pass</ssl_key_password>
    </kafka>
</yandex>

Authentication / connectivity

Sometimes the consumer group needs to be explicitly allowed in the broker UI config.

Use general Kafka/librdkafka settings from this page first, then apply provider-specific options from Config by provider .

Kerberos

  <!-- Kerberos-aware Kafka -->
  <kafka>
    <security_protocol>SASL_PLAINTEXT</security_protocol>
    <sasl_kerberos_keytab>/home/kafkauser/kafkauser.keytab</sasl_kerberos_keytab>
    <sasl_kerberos_principal>kafkauser/kafkahost@EXAMPLE.COM</sasl_kerberos_principal>
  </kafka>

How to test connection settings

Use kafkacat utility - it internally uses same library to access Kafla as ClickHouse itself and allows easily to test different settings.

kafkacat -b my_broker:9092 -C -o -10 -t my_topic \ (Google cloud and on-prem use 9092 port)
   -X security.protocol=SASL_SSL  \
   -X sasl.mechanisms=PLAIN \
   -X sasl.username=uerName \
   -X sasl.password=Password

Different configurations for different tables?

Is there some more documentation how to use this multiconfiguration for Kafka ?

The whole logic is here: https://github.com/ClickHouse/ClickHouse/blob/da4856a2be035260708fe2ba3ffb9e437d9b7fef/src/Storages/Kafka/StorageKafka.cpp#L466-L475

So it load the main config first, after that it load (with overwrites) the configs for all topics, listed in kafka_topic_list of the table.

Also since v21.12 it’s possible to use more straightforward way using named_collections: https://github.com/ClickHouse/ClickHouse/pull/31691

So you can write a config file something like this:

<clickhouse>
 <named_collections>
  <kafka_preset1>
   <kafka_broker_list>kafka1:19092</kafka_broker_list>
   <kafka_topic_list>conf</kafka_topic_list>
   <kafka_group_name>conf</kafka_group_name>
  </kafka_preset1>
 </named_collections>
</clickhouse>


<clickhouse>
    <named_collections>
        <kafka_preset2>
            <kafka_broker_list>...</kafka_broker_list>
            <kafka_topic_list>foo.bar</kafka_topic_list>
            <kafka_group_name>foo.bar.group</kafka_group_name>
            <kafka>
                <security_protocol>...</security_protocol>
                <sasl_mechanism>...</sasl_mechanism>
                <sasl_username>...</sasl_username>
                <sasl_password>...</sasl_password>
                <auto_offset_reset>smallest</auto_offset_reset>
                <ssl_endpoint_identification_algorithm>https</ssl_endpoint_identification_algorithm>
                <ssl_ca_location>probe</ssl_ca_location>
            </kafka>
        </kafka_preset2>
    </named_collections>
</clickhouse>

And after execute:

CREATE TABLE test.kafka (key UInt64, value UInt64) ENGINE = Kafka(kafka_preset1, kafka_format='CSV');

The same named collections can be created with SQL from v24.2+:

CREATE NAMED COLLECTION kafka_preset1 AS
    kafka_broker_list = 'kafka1:19092',
    kafka_topic_list = 'conf',
    kafka_group_name = 'conf';

CREATE NAMED COLLECTION kafka_preset2 AS
    kafka_broker_list = '...',
    kafka_topic_list = 'foo.bar',
    kafka_group_name = 'foo.bar.group',
    kafka.security_protocol = 'SASL_SSL',
    kafka.sasl_mechanism = 'PLAIN',
    kafka.sasl_username = '...',
    kafka.sasl_password = '...',
    kafka.auto_offset_reset = 'smallest',
    kafka.ssl_endpoint_identification_algorithm = 'https',
    kafka.ssl_ca_location = 'probe';

You can verify SQL-created named collections via:

SELECT
    name,
    source,
    create_query
FROM system.named_collections
WHERE name IN ('kafka_preset1', 'kafka_preset2');

and remove them with:

DROP NAMED COLLECTION kafka_preset1;
DROP NAMED COLLECTION kafka_preset2;

The same fragment of code in newer versions:

https://github.com/ClickHouse/ClickHouse/blob/d19e24f530c30f002488bc136da78f5fb55aedab/src/Storages/Kafka/StorageKafka.cpp#L474-L496

4.9.1.4 - Kafka main parsing loop

Kafka main parsing loop

One of the threads from scheduled_pool (pre ClickHouse® 20.9) / background_message_broker_schedule_pool (after 20.9) do that in infinite loop:

Batch poll (time limit: kafka_poll_timeout_ms 500ms, messages limit: kafka_poll_max_batch_size 65536)
Parse messages.
If we don’t have enough data (rows limit: kafka_max_block_size 1048576) or time limit reached (kafka_flush_interval_ms 7500ms) - continue polling (goto p.1)
Write a collected block of data to MV
Do commit (commit after write = at-least-once).

On any error, during that process, Kafka client is restarted (leading to rebalancing - leave the group and get back in few seconds).

Kafka batching

Important settings

These usually should not be adjusted:

kafka_poll_max_batch_size = max_block_size (65536)
kafka_poll_timeout_ms = stream_poll_timeout_ms (500ms)

You may want to adjust those depending on your scenario:

kafka_flush_interval_ms = stream_flush_interval_ms (7500ms)
kafka_max_block_size = max_insert_block_size / kafka_num_consumers (for the single consumer: 1048576)

Disable at-least-once delivery

kafka_commit_every_batch = 1 will change the loop logic mentioned above. Consumed batch committed to the Kafka and the block of rows send to Materialized Views only after that. It could be resembled as at-most-once delivery mode as prevent duplicate creation but allow loss of data in case of failures.

4.9.1.5 - SELECTs from engine=Kafka

SELECTs from engine=Kafka

Question

What will happen, if we would run SELECT query from working Kafka table with MV attached? Would data showed in SELECT query appear later in MV destination table?

Answer

Most likely SELECT query would show nothing.
If you lucky enough and something would show up, those rows wouldn’t appear in MV destination table.

So it’s not recommended to run SELECT queries on working Kafka tables.

In case of debug it’s possible to use another Kafka table with different consumer_group, so it wouldn’t affect your main pipeline.

4.9.2 - Consumption Patterns

Message consumption models, replay patterns, and delivery semantics.

4.9.2.1 - Exactly once semantics

Exactly once semantics

EOS consumer (isolation.level=read_committed) is enabled by default since librdkafka 1.2.0, so for ClickHouse® - since 20.2

See:

BUT: while EOS semantics will guarantee you that no duplicates will happen on the Kafka side (i.e. even if you produce the same messages few times it will be consumed once), but ClickHouse as a Kafka client can currently guarantee only at-least-once. And in some corner cases (connection lost etc) you can get duplicates.

We need to have something like transactions on ClickHouse side to be able to avoid that. Adding something like simple transactions is in plans for Y2022.

block-aggregator by eBay

Block Aggregator is a data loader that subscribes to Kafka topics, aggregates the Kafka messages into blocks that follow the ClickHouse’s table schemas, and then inserts the blocks into ClickHouse. Block Aggregator provides exactly-once delivery guarantee to load data from Kafka to ClickHouse. Block Aggregator utilizes Kafka’s metadata to keep track of blocks that are intended to send to ClickHouse, and later uses this metadata information to deterministically re-produce ClickHouse blocks for re-tries in case of failures. The identical blocks are guaranteed to be deduplicated by ClickHouse.

eBay/block-aggregator

4.9.2.2 - Kafka parallel consuming

Kafka parallel consuming

For very large topics when you need more parallelism (especially on the insert side) you may use several tables with the same pipeline (pre ClickHouse® 20.9) or enable kafka_thread_per_consumer (after 20.9).

kafka_num_consumers = N,
kafka_thread_per_consumer=1

Notes:

the inserts will happen in parallel (without that setting inserts happen linearly)
enough partitions are needed.
kafka_num_consumers is limited by number of physical cores (half of vCPUs). kafka_disable_num_consumers_limit can be used to override the limit.
background_message_broker_schedule_pool_size is 16 by default, you may need to increase if using more than 16 consumers

Before increasing kafka_num_consumers with keeping kafka_thread_per_consumer=0 may improve consumption & parsing speed, but flushing & committing still happens by a single thread there (so inserts are linear).

4.9.2.3 - Multiple MVs attached to Kafka table

How Multiple MVs attached to Kafka table consume and how they are affected by kafka_num_consumers/kafka_thread_per_consumer

Kafka Consumer is a thread inside the Kafka Engine table that is visible by Kafka monitoring tools like kafka-consumer-groups and in Clickhouse in system.kafka_consumers table.

Having multiple consumers increases ingesting parallelism and can significantly speed up event processing. However, it comes with a trade-off: it’s a CPU-intensive task, especially under high event load and/or complicated parsing of incoming data. Therefore, it’s crucial to create as many consumers as you really need and ensure you have enough CPU cores to handle them. We don’t recommend creating too many Kafka Engines per server because it could lead to uncontrolled CPU usage in situations like bulk data upload or catching up a huge kafka lag due to excessive parallelism of the ingesting process.

kafka_thread_per_consumer meaning

Consider a basic pipeline depicted as a Kafka table with 2 MVs attached. The Kafka broker has 2 topics and 4 partitions.

kafka_thread_per_consumer = 0

Kafka engine table will act as 2 consumers, but only 1 insert thread for both of them. It is important to note that the topic needs to have as many partitions as consumers. For this scenario, we use these settings:

kafka_num_consumers = 2
kafka_thread_per_consumer = 0

The same Kafka engine will create 2 streams, 1 for each consumer, and will join them in a union stream. And it will use 1 thread for inserting [ 2385 ] This is how we can see it in the logs:

2022.11.09 17:49:34.282077 [ 2385 ] {} <Debug> StorageKafka (kafka_table): Started streaming to 2 attached views

How ClickHouse® calculates the number of threads depending on the thread_per_consumer setting:

  auto stream_count = thread_per_consumer ? 1 : num_created_consumers;
      sources.reserve(stream_count);
      pipes.reserve(stream_count);
      for (size_t i = 0; i < stream_count; ++i)
      {
         ......
      }

Details:

https://github.com/ClickHouse/ClickHouse/blob/1b49463bd297ade7472abffbc931c4bb9bf213d0/src/Storages/Kafka/StorageKafka.cpp#L834

Also, a detailed graph of the pipeline:

thread_per_consumer0

With this approach, even if the number of consumers increased, the Kafka engine will still use only 1 thread to flush. The consuming/processing rate will probably increase a bit, but not linearly. For example, 5 consumers will not consume 5 times faster. Also, a good property of this approach is the linearization of INSERTS, which means that the order of the inserts is preserved and sequential. This option is good for small/medium Kafka topics.

kafka_thread_per_consumer = 1

Kafka engine table will act as 2 consumers and 1 thread per consumer. For this scenario, we use these settings:

kafka_num_consumers = 2
kafka_thread_per_consumer = 1

Here, the pipeline works like this:

thread_per_consumer1

With this approach, the number of consumers remains the same, but each consumer will use their own insert/flush thread, and the consuming/processing rate should increase.

Background Pool

In Clickhouse there is a special thread pool for background processes, such as streaming engines. Its size is controlled by the background_message_broker_schedule_pool_size setting and is 16 by default. If you exceed this limit across all tables on the server, you’ll likely encounter continuous Kafka rebalances, which will slow down processing considerably. For a server with a lot of CPU cores, you can increase that limit to a higher value, like 20 or even 40. background_message_broker_schedule_pool_size = 20 allows you to create 5 Kafka Engine tables with 4 consumers each of them has its own insert thread. This option is good for large Kafka topics with millions of messages per second.

Multiple Materialized Views

Attaching multiple Materialized Views (MVs) to a Kafka Engine table can be used when you need to apply different transformations to the same topic and store the resulting data in different tables.

(This approach also applies to the other streaming engines - RabbitMQ, s3queue, etc).

All streaming engines begin processing data (reading from the source and producing insert blocks) only after at least one Materialized View is attached to the engine. Multiple Materialized Views can be connected to distribute data across various tables with different transformations. But how does it work when the server starts?

Once the first Materialized View (MV) is loaded, started, and attached to the Kafka/s3queue table, data consumption begins immediately—data is read from the source, pushed to the destination, and the pointers advance to the next position. However, any other MVs that haven’t started yet will miss the data consumed by the first MV, leading to some data loss.

This issue worsens with asynchronous table loading. Tables are only loaded upon first access, and the loading process takes time. When multiple MVs direct the data stream to different tables, some tables might be ready sooner than others. As soon as the first table becomes ready, data consumption starts, and any tables still loading will miss the data consumed during that interval, resulting in further data loss for those tables.

That means when you make a design with Multiple MVs async_load_databases should be switched off:

<async_load_databases>false</async_load_databases>

Also, you have to prevent starting to consume until all MVs are loaded and started. For that, you can add an additional Null table to the MV pipeline, so the Kafka table will pass the block to a single Null table first, and only then many MVs start their own transformations to many dest tables:

KafkaTable → dummy_MV -> NullTable -> [MV1, MV2, ….] → [Table1, Table2, …]

create table NullTable Engine=Null as KafkaTable;
create materialized view dummy_MV to NullTable
select * from KafkaTable
--WHERE NOT ignore(throwIf(if((uptime() < 120), 1 , 0)))
WHERE NOT ignore(throwIf(if((uptime() < 120), 1 + sleep(3), 0)))

120 seconds should be enough for loading all MVs.

Using an intermediate Null table is also preferable because it’s easier to make any changes with MVs:

drop the dummy_MV to stop consuming
make any changes to transforming MVs by drop/recreate
create dummy_MV again to resume consuming

The fix for correctly starting multiple MVs will be available from 25.5 version - https://github.com/ClickHouse/ClickHouse/pull/72123

4.9.2.4 - Rewind / fast-forward / replay

Rewind / fast-forward / replay

Step 1: Detach Kafka tables in ClickHouse®

DETACH TABLE db.kafka_table_name ON CLUSTER '{cluster}';

Step 2: kafka-consumer-groups.sh --bootstrap-server kafka:9092 --topic topic:0,1,2 --group id1 --reset-offsets --to-latest --execute
- More samples: https://gist.github.com/filimonov/1646259d18b911d7a1e8745d6411c0cc

Step 3: Attach Kafka tables back

ATTACH TABLE db.kafka_table_name ON CLUSTER '{cluster}';

About Offset Consuming

When a consumer joins the consumer group, the broker will check if it has a committed offset. If that is the case, then it will start from the latest offset. Both ClickHouse and librdKafka documentation state that the default value for auto_offset_reset is largest (or latest in new Kafka versions) but it is not, if the consumer is new:

https://github.com/ClickHouse/ClickHouse/blob/f171ad93bcb903e636c9f38812b6aaf0ab045b04/src/Storages/Kafka/StorageKafka.cpp#L506

conf.set("auto.offset.reset", "earliest"); // If no offset stored for this group, read all messages from the start

If there is no offset stored or it is out of range, for that particular consumer group, the consumer will start consuming from the beginning (earliest), and if there is some offset stored then it should use the latest. The log retention policy influences which offset values correspond to the earliest and latest configurations. Consider a scenario where a topic has a retention policy set to 1 hour. Initially, you produce 5 messages, and then, after an hour, you publish 5 more messages. In this case, the latest offset will remain unchanged from the previous example. However, due to Kafka removing the earlier messages, the earliest available offset will not be 0; instead, it will be 5.

4.9.3 - Schema and Formats

Schema inference and format-specific integration details.

4.9.3.1 - Inferring Schema from AvroConfluent Messages in Kafka for ClickHouse®

Learn how to define Kafka table structures in ClickHouse® by using Avro’s schema registry & sample message.

To consume messages from Kafka within ClickHouse®, you need to define the ENGINE=Kafka table structure with all the column names and types. This task can be particularly challenging when dealing with complex Avro messages, as manually determining the exact schema for ClickHouse is both tricky and time-consuming. This complexity is particularly frustrating in the case of Avro formats, where the column names and their types are already clearly defined in the schema registry.

Although ClickHouse supports schema inference for files, it does not natively support this for Kafka streams.

Here’s a workaround to infer the schema using AvroConfluent messages:

Step 1: Capture and Store a Raw Kafka Message

First, create a table in ClickHouse to consume a raw message from Kafka and store it as a file:

CREATE TABLE test_kafka (raw String) ENGINE = Kafka 
SETTINGS kafka_broker_list = 'localhost:29092', 
         kafka_topic_list = 'movies-raw', 
         kafka_format = 'RawBLOB', -- Don't try to parse the message, return it 'as is'
         kafka_group_name = 'tmp_test'; -- Using some dummy consumer group here.

INSERT INTO FUNCTION file('./avro_raw_sample.avro', 'RawBLOB') 
SELECT * FROM test_kafka LIMIT 1 
SETTINGS max_block_size=1, stream_like_engine_allow_direct_select=1;

DROP TABLE test_kafka;

Step 2: Infer Schema Using the Stored File

Using the stored raw message, let ClickHouse infer the schema based on the AvroConfluent format and a specified schema registry URL:

CREATE TEMPORARY TABLE test AS 
SELECT * FROM file('./avro_raw_sample.avro', 'AvroConfluent') 
SETTINGS format_avro_schema_registry_url='http://localhost:8085';

SHOW CREATE TEMPORARY TABLE test\G;

The output from the SHOW CREATE command will display the inferred schema, for example:

Row 1:
──────
statement: CREATE TEMPORARY TABLE test
(
    `movie_id` Int64,
    `title` String,
    `release_year` Int64
)
ENGINE = Memory

Step 3: Create the Kafka Table with the Inferred Schema

Now, use the inferred schema to create the Kafka table:

CREATE TABLE movies_kafka
(
    `movie_id` Int64,
    `title` String,
    `release_year` Int64
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:29092',
         kafka_topic_list = 'movies-raw',
         kafka_format = 'AvroConfluent',
         kafka_group_name = 'movies',
         kafka_schema_registry_url = 'http://localhost:8085';

This approach reduces manual schema definition efforts and enhances data integration workflows by utilizing the schema inference capabilities of ClickHouse for AvroConfluent messages.

Appendix

Avro is a binary serialization format used within Apache Kafka for efficiently serializing data with a compact binary format. It relies on schemas, which define the structure of the serialized data, to ensure robust data compatibility and type safety.

Schema Registry is a service that provides a centralized repository for Avro schemas. It helps manage and enforce schemas across applications, ensuring that the data exchanged between producers and consumers adheres to a predefined format, and facilitates schema evolution in a safe manner.

In ClickHouse, the Avro format is used for data that contains the schema embedded directly within the file or message. This means the structure of the data is defined and included with the data itself, allowing for self-describing messages. However, embedding the schema within every message is not optimal for streaming large volumes of data, as it increases the workload and network overhead. Repeatedly passing the same schema with each message can be inefficient, particularly in high-throughput environments.

On the other hand, the AvroConfluent format in ClickHouse is specifically designed to work with the Confluent Schema Registry. This format expects the schema to be managed externally in a schema registry rather than being embedded within each message. It retrieves schema information from the Schema Registry, which allows for centralized schema management and versioning, facilitating easier schema evolution and enforcement across different applications using Kafka.

4.9.4 - Operations and Troubleshooting

Runtime tuning, resource settings, and error diagnostics.

4.9.4.1 - Setting the background message broker schedule pool size

Guide to managing the background_message_broker_schedule_pool_size setting for Kafka, RabbitMQ, and NATS table engines in your database.

Overview

When using Kafka, RabbitMQ, or NATS table engines in ClickHouse®, you may encounter issues related to a saturated background thread pool. One common symptom is a warning similar to the following:

2025.03.14 08:44:26.725868 [ 344 ] {} <Warning> StorageKafka (events_kafka): [rdk:MAXPOLL] [thrd:main]: Application maximum poll interval (60000ms) exceeded by 159ms (adjust max.poll.interval.ms for long-running message processing): leaving group

This warning typically appears not because ClickHouse fails to poll, but because there are no available threads in the background pool to handle the polling in time. In rare cases, the same error might also be caused by long flushing operations to Materialized Views (MVs), especially if their logic is complex or chained.

To resolve this, you should monitor and, if needed, increase the value of the background_message_broker_schedule_pool_size setting.

Step 1: Check Thread Pool Utilization

Run the following SQL query to inspect the current status of your background message broker thread pool:

SELECT
    (
        SELECT value
        FROM system.metrics
        WHERE metric = 'BackgroundMessageBrokerSchedulePoolTask'
    ) AS tasks,
    (
        SELECT value
        FROM system.metrics
        WHERE metric = 'BackgroundMessageBrokerSchedulePoolSize'
    ) AS pool_size,
    pool_size - tasks AS free_threads

If you have metric_log enabled, you can also monitor the minimum number of free threads over the day:

SELECT min(CurrentMetric_BackgroundMessageBrokerSchedulePoolSize - CurrentMetric_BackgroundMessageBrokerSchedulePoolTask) AS min_free_threads
FROM system.metric_log
WHERE event_date = today()

If free_threads is close to zero or negative, it means your thread pool is saturated and should be increased.

Step 2: Estimate Required Pool Size

To estimate a reasonable value for background_message_broker_schedule_pool_size, run the following query:

WITH
    toUInt32OrDefault(extract(engine_full, 'kafka_num_consumers\s*=\s*(\d+)')) as kafka_num_consumers,
    extract(engine_full, 'kafka_thread_per_consumer\s*=\s*(\d+|\'true\')') not in ('', '0') as kafka_thread_per_consumer,
    multiIf(
        engine = 'Kafka',  
            if(kafka_thread_per_consumer AND kafka_num_consumers > 0, kafka_num_consumers, 1),
        engine = 'RabbitMQ',
            3,
        engine = 'NATS',
            3,
        0 /* should not happen */
    ) as threads_needed
SELECT 
    ceil(sum(threads_needed) * 1.25)
FROM 
    system.tables
WHERE 
    engine in ('Kafka', 'RabbitMQ', 'NATS')

This will return an estimate that includes a 25% buffer to accommodate spikes in load.

Step 3: Apply the New Setting

Create or update the following configuration file:
Path: /etc/clickhouse-server/config.d/background_message_broker_schedule_pool_size.xml
Content:
```
<yandex>
    <background_message_broker_schedule_pool_size>120</background_message_broker_schedule_pool_size>
</yandex>
```
Replace 120 with the value recommended from Step 2 (rounded up if needed).

(Only for ClickHouse versions 23.8 and older)

Add the same setting to the default user profile:

Path: /etc/clickhouse-server/users.d/background_message_broker_schedule_pool_size.xml

Content:

<yandex>
    <profiles>
        <default>
            <background_message_broker_schedule_pool_size>120</background_message_broker_schedule_pool_size>
        </default>
    </profiles>
</yandex>

Step 4: Restart ClickHouse

After applying the configuration, restart ClickHouse to apply the changes:

sudo systemctl restart clickhouse-server

Summary

A saturated background message broker thread pool can lead to missed Kafka polls and consumer group dropouts. Monitoring your metrics and adjusting background_message_broker_schedule_pool_size accordingly ensures stable operation of Kafka, RabbitMQ, and NATS integrations.

If the problem persists even after increasing the pool size, consider investigating slow MV chains or flushing logic as a potential bottleneck.

4.9.4.2 - Error handling

Error handling

Pre 21.6

There are couple options:

Certain formats which has schema in built in them (like JSONEachRow) could silently skip any unexpected fields after enabling setting input_format_skip_unknown_fields

It’s also possible to skip up to N malformed messages for each block, with used setting kafka_skip_broken_messages but it’s also does not support all possible formats.

After 21.6

It’s possible to stream messages which could not be parsed, this behavior could be enabled via setting: kafka_handle_error_mode='stream' and ClickHouse® wil write error and message from Kafka itself to two new virtual columns: _error, _raw_message.

So you can create another Materialized View which would collect to a separate table all errors happening while parsing with all important information like offset and content of message.

CREATE TABLE default.kafka_engine
(
    `i` Int64,
    `s` String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092'
kafka_topic_list = 'topic',
kafka_group_name = 'clickhouse',
kafka_format = 'JSONEachRow',
kafka_handle_error_mode='stream';

CREATE TABLE default.kafka_errors
(
    `topic` String,
    `partition` Int64,
    `offset` Int64,
    `raw` String,
    `error` String
)
ENGINE = MergeTree
ORDER BY (topic, partition, offset)
SETTINGS index_granularity = 8192


CREATE MATERIALIZED VIEW default.kafka_errors_mv TO default.kafka_errors
AS
SELECT
    _topic AS topic,
    _partition AS partition,
    _offset AS offset,
    _raw_message AS raw,
    _error AS error
FROM default.kafka_engine
WHERE length(_error) > 0

https://github.com/ClickHouse/ClickHouse/pull/20249

https://github.com/ClickHouse/ClickHouse/pull/21850

https://altinity.com/blog/clickhouse-kafka-engine-faq

Since 25.8

dead letter queue can be used via setting: kafka_handle_error_mode='dead_letter_queue' https://github.com/ClickHouse/ClickHouse/pull/68873

and error related data will be saved in system.dead_letter_queue table.

Table connections

4.10 - RabbitMQ

RabbitMQ engine in ClickHouse® 24.3+

Settings

Basic RabbitMQ settings and use cases: https://clickhouse.com/docs/en/engines/table-engines/integrations/rabbitmq

Latest improvements/fixes

(v23.10+)

Allow to save unparsed records and errors in RabbitMQ: NATS and FileLog engines. Add virtual columns _error and _raw_message (for NATS and RabbitMQ), _raw_record (for FileLog) that are filled when ClickHouse fails to parse new record. The behaviour is controlled under storage settings nats_handle_error_mode for NATS, rabbitmq_handle_error_mode for RabbitMQ, handle_error_mode for FileLog similar to kafka_handle_error_mode. If it’s set to default, en exception will be thrown when ClickHouse fails to parse a record, if it’s set to stream, error and raw record will be saved into virtual columns. Closes #36035 and #55477

(v24+)

4.10.1 - RabbitMQ Error handling

Error handling for RabbitMQ table engine

Same approach as in Kafka but virtual columns are different. Check https://clickhouse.com/docs/en/engines/table-engines/integrations/rabbitmq#virtual-columns

CREATE TABLE IF NOT EXISTS rabbitmq.broker_errors_queue
(
  exchange_name String,
  channel_id String,
  delivery_tag UInt64,
  redelivered UInt8,
  message_id String,
  timestamp UInt64
)
engine = RabbitMQ
SETTINGS
    rabbitmq_host_port = 'localhost:5672',
    rabbitmq_exchange_name = 'exchange-test', -- required parameter even though this is done via the rabbitmq config
    rabbitmq_queue_consume = true,
    rabbitmq_queue_base = 'test-errors',
    rabbitmq_format = 'JSONEachRow',
    rabbitmq_username = 'guest',
    rabbitmq_password = 'guest',
    rabbitmq_handle_error_mode = 'stream';

CREATE MATERIALIZED VIEW IF NOT EXISTS rabbitmq.broker_errors_mv
(
  exchange_name String,
  channel_id String,
  delivery_tag UInt64,
  redelivered UInt8,
  message_id String,
  timestamp UInt64
  raw_message String,
  error String
)
ENGINE = MergeTree
ORDER BY (error)
SETTINGS index_granularity = 8192 AS
SELECT
  _exchange_name AS exchange_name,
  _channel_id AS channel_id,
  _delivery_tag AS delivery_tag,
  _redelivered AS redelivered,
  _message_id AS message_id,
  _timestamp AS timestamp,
  _raw_message AS raw_message,
  _error AS error
FROM rabbitmq.broker_errors_queue
WHERE length(_error) > 0

5 - Setup & maintenance

Learn how to set up, deploy, monitor, and backup ClickHouse® with step-by-step guides.

5.1 - S3 & object storage

S3 & object storage

5.1.1 - AWS S3 Recipes

AWS S3 Recipes

Using AWS IAM — Identity and Access Management roles

For EC2 instance, there is an option to configure an IAM role:

Role shall contain a policy with permissions like:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allow-put-and-get",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME/test_s3_disk/*"
        }
    ]
}

Corresponding configuration of ClickHouse®:

<clickhouse>
    <storage_configuration>
        <disks>
            <disk_s3>
                <type>s3</type>
                <endpoint>http://s3.us-east-1.amazonaws.com/BUCKET_NAME/test_s3_disk/</endpoint>
                <use_environment_credentials>true</use_environment_credentials>
            </disk_s3>
        </disks>
        <policies>
            <policy_s3_only>
                <volumes>
                    <volume_s3>
                        <disk>disk_s3</disk>
                    </volume_s3>
                </volumes>
            </policy_s3_only>
        </policies>
    </storage_configuration>
</clickhouse>

Small check:

CREATE TABLE table_s3 (number Int64) ENGINE=MergeTree() ORDER BY tuple() PARTITION BY tuple() SETTINGS storage_policy='policy_s3_only';
INSERT INTO table_s3 SELECT * FROM system.numbers LIMIT 100000000;
SELECT * FROM table_s3;
DROP TABLE table_s3;

How to use AWS IRSA and IAM in the Altinity Kubernetes Operator for ClickHouse to allow S3 backup without Explicit credentials

Install clickhouse-operator https://github.com/Altinity/clickhouse-operator/tree/master/docs/operator_installation_details.md

Create Role and IAM Policy, look details in https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-enable-IAM.html

Create service account with annotations

apiVersion: v1
kind: ServiceAccount
metadata:
  name: <SERVICE ACOUNT NAME>
  namespace: <NAMESPACE>
  annotations:
     eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>

Link service account to podTemplate it will create AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: <NAME>
  namespace: <NAMESPACE>
spec:
  defaults:
     templates:
       podTemplate: <POD_TEMPLATE_NAME>
  templates:
    podTemplates:
      - name: <POD_TEMPLATE_NAME>
        spec:
          serviceAccountName: <SERVICE ACCOUNT NAME>
          containers:
            - name: clickhouse-backup

For EC2 instances the same environment variables should be created:

AWS_ROLE_ARN=arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token

5.1.2 - Clean up orphaned objects on s3

Clean up orphaned objects left in an S3-backed ClickHouse tiered‐storage

Problems

TRUNCATE and DROP TABLE remove metadata only.
Long-running queries, merges or other replicas may still reference parts, so ClickHouse delays removal.
There are bugs in Clickhouse that leave orphaned files, especially after failures.

Solutions

use our utility for garbage collection - https://github.com/Altinity/s3gc
or create a separate path in the bucket for every table and every replica and remove the whole path in AWS console
you can also use clickhouse-disk utility to delete s3 data:

clickhouse-disks --disk s3 --query "remove /cluster/database/table/replica1"

5.1.3 - How much data are written to S3 during mutations

Example of how much data ClickHouse® reads and writes to s3 during mutations.

Configuration

S3 disk with disabled merges

<clickhouse>
    <storage_configuration>
        <disks>
            <s3disk>
                <type>s3</type>
                <endpoint>https://s3.us-east-1.amazonaws.com/mybucket/test/test/</endpoint>
                <use_environment_credentials>1</use_environment_credentials>  <!-- use IAM AWS role -->
                    <!--access_key_id>xxxx</access_key_id>
                    <secret_access_key>xxx</secret_access_key-->
            </s3disk>
        </disks>
        <policies>
          <s3tiered>
              <volumes>
                  <default>
                      <disk>default</disk>
                  </default>
                  <s3disk>
                      <disk>s3disk</disk>  
                      <prefer_not_to_merge>true</prefer_not_to_merge>
                  </s3disk>
              </volumes>
          </s3tiered>
        </policies>
    </storage_configuration>
</clickhouse>

Let’s create a table and load some synthetic data.

CREATE TABLE test_s3
(
    `A` Int64,
    `S` String,
    `D` Date
)
ENGINE = MergeTree
PARTITION BY D
ORDER BY A
SETTINGS storage_policy = 's3tiered';

insert into test_s3 select number, number, today() - intDiv(number, 10000000) from numbers(7e8);
0 rows in set. Elapsed: 98.091 sec. Processed 700.36 million rows, 5.60 GB (7.14 million rows/s., 57.12 MB/s.)


select disk_name, partition, sum(rows), formatReadableSize(sum(bytes_on_disk)) size, count() part_count 
from system.parts where table= 'test_s3' and active 
group by disk_name, partition
order by partition;

┌─disk_name─┬─partition──┬─sum(rows)─┬─size──────┬─part_count─┐
│ default   │ 2023-05-06 │  10000000 │ 78.23 MiB │          5 │
│ default   │ 2023-05-07 │  10000000 │ 78.31 MiB │          6 │
│ default   │ 2023-05-08 │  10000000 │ 78.16 MiB │          5 │
....
│ default   │ 2023-07-12 │  10000000 │ 78.21 MiB │          5 │
│ default   │ 2023-07-13 │  10000000 │ 78.23 MiB │          6 │
│ default   │ 2023-07-14 │  10000000 │ 77.39 MiB │          5 │
└───────────┴────────────┴───────────┴───────────┴────────────┘
70 rows in set. Elapsed: 0.023 sec.

Performance of mutations for a local EBS (throughput: 500 MB/s)

select * from test_s3 where A=490000000;
1 row in set. Elapsed: 0.020 sec. Processed 8.19 thousand rows, 92.67 KB (419.17 thousand rows/s., 4.74 MB/s.)

select * from test_s3 where S='490000000';
1 row in set. Elapsed: 14.117 sec. Processed 700.00 million rows, 12.49 GB (49.59 million rows/s., 884.68 MB/s.)

delete from test_s3 where S = '490000000';
0 rows in set. Elapsed: 22.192 sec.

delete from test_s3 where A = '490000001';
0 rows in set. Elapsed: 2.243 sec.

alter table test_s3 delete where S = 590000000 settings mutations_sync=2;
0 rows in set. Elapsed: 21.387 sec.

alter table test_s3 delete where A = '590000001' settings mutations_sync=2;
0 rows in set. Elapsed: 3.372 sec.

alter table test_s3 update S='' where S = '690000000' settings mutations_sync=2;
0 rows in set. Elapsed: 20.265 sec.

alter table test_s3 update S='' where A = '690000001' settings mutations_sync=2;
0 rows in set. Elapsed: 1.979 sec.

Let’s move data to S3

alter table test_s3 modify TTL D + interval 10 day to disk 's3disk';

-- 10 minutes later
┌─disk_name─┬─partition──┬─sum(rows)─┬─size──────┬─part_count─┐
│ s3disk    │ 2023-05-06 │  10000000 │ 78.23 MiB │          5 │
│ s3disk    │ 2023-05-07 │  10000000 │ 78.31 MiB │          6 │
│ s3disk    │ 2023-05-08 │  10000000 │ 78.16 MiB │          5 │
│ s3disk    │ 2023-05-09 │  10000000 │ 78.21 MiB │          6 │
│ s3disk    │ 2023-05-10 │  10000000 │ 78.21 MiB │          6 │
...
│ s3disk    │ 2023-07-02 │  10000000 │ 78.22 MiB │          5 │
...
│ default   │ 2023-07-11 │  10000000 │ 78.20 MiB │          6 │
│ default   │ 2023-07-12 │  10000000 │ 78.21 MiB │          5 │
│ default   │ 2023-07-13 │  10000000 │ 78.23 MiB │          6 │
│ default   │ 2023-07-14 │  10000000 │ 77.40 MiB │          5 │
└───────────┴────────────┴───────────┴───────────┴────────────┘
70 rows in set. Elapsed: 0.007 sec.

Sizes of a table on S3 and a size of each column

select sum(rows), formatReadableSize(sum(bytes_on_disk)) size 
from system.parts where table= 'test_s3' and active and disk_name = 's3disk';
┌─sum(rows)─┬─size─────┐
│ 600000000 │ 4.58 GiB │
└───────────┴──────────┘

SELECT
    database,
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes) AS size) AS compressed
FROM system.parts_columns
WHERE (active = 1) AND (database LIKE '%') AND (table LIKE 'test_s3') AND (disk_name = 's3disk')
GROUP BY
    database,
    table,
    column
ORDER BY column ASC

┌─database─┬─table───┬─column─┬─compressed─┐
│ default  │ test_s3 │ A      │ 2.22 GiB   │
│ default  │ test_s3 │ D      │ 5.09 MiB   │
│ default  │ test_s3 │ S      │ 2.33 GiB   │
└──────────┴─────────┴────────┴────────────┘

S3 Statistics of selects

select *, _part from test_s3 where A=100000000;
┌─────────A─┬─S─────────┬──────────D─┬─_part──────────────────┐
│ 100000000 │ 100000000 │ 2023-07-08 │ 20230708_106_111_1_738 │
└───────────┴───────────┴────────────┴────────────────────────┘
1 row in set. Elapsed: 0.104 sec. Processed 8.19 thousand rows, 65.56 KB (79.11 thousand rows/s., 633.07 KB/s.)

┌─S3GetObject─┬─S3PutObject─┬─ReadBufferFromS3─┬─WriteBufferFromS3─┐
│           6 │           0 │ 70.58 KiB        │ 0.00 B            │
└─────────────┴─────────────┴──────────────────┴───────────────────┘

Select by primary key read only 70.58 KiB from S3

Size of this part

SELECT
    database, table, column,
    formatReadableSize(sum(column_data_compressed_bytes) AS size) AS compressed
FROM system.parts_columns
WHERE (active = 1) AND (database LIKE '%') AND (table LIKE 'test_s3') AND (disk_name = 's3disk')
    and name = '20230708_106_111_1_738'
GROUP BY database, table, column ORDER BY column ASC

┌─database─┬─table───┬─column─┬─compressed─┐
│ default  │ test_s3 │ A      │ 22.51 MiB  │
│ default  │ test_s3 │ D      │ 51.47 KiB  │
│ default  │ test_s3 │ S      │ 23.52 MiB  │
└──────────┴─────────┴────────┴────────────┘

select * from test_s3 where S='100000000';
┌─────────A─┬─S─────────┬──────────D─┐
│ 100000000 │ 100000000 │ 2023-07-08 │
└───────────┴───────────┴────────────┘
1 row in set. Elapsed: 86.745 sec. Processed 700.00 million rows, 12.49 GB (8.07 million rows/s., 144.04 MB/s.)

┌─S3GetObject─┬─S3PutObject─┬─ReadBufferFromS3─┬─WriteBufferFromS3─┐
│         947 │           0 │ 2.36 GiB         │ 0.00 B            │
└─────────────┴─────────────┴──────────────────┴───────────────────┘

Select using fullscan of S column read only 2.36 GiB from S3, the whole S column (2.33 GiB) plus parts of A and D.


delete from test_s3 where A=100000000;
0 rows in set. Elapsed: 17.429 sec.

┌─q──┬─S3GetObject─┬─S3PutObject─┬─ReadBufferFromS3─┬─WriteBufferFromS3─┐
│ Q3 │        2981 │           6 │ 23.06 MiB        │ 27.25 KiB         │
└────┴─────────────┴─────────────┴──────────────────┴───────────────────┘

insert into test select 'Q3' q, event,value  from system.events where event like '%S3%';


delete from test_s3 where S='100000001';
0 rows in set. Elapsed: 31.417 sec.
┌─q──┬─S3GetObject─┬─S3PutObject─┬─ReadBufferFromS3─┬─WriteBufferFromS3─┐
│ Q4 │        4209 │           6 │ 2.39 GiB         │ 27.25 KiB         │
└────┴─────────────┴─────────────┴──────────────────┴───────────────────┘
insert into test select 'Q4' q, event,value  from system.events where event like '%S3%';



alter table test_s3 delete where A=110000000 settings mutations_sync=2;
0 rows in set. Elapsed: 19.521 sec.

┌─q──┬─S3GetObject─┬─S3PutObject─┬─ReadBufferFromS3─┬─WriteBufferFromS3─┐
│ Q5 │        2986 │          15 │ 42.27 MiB        │ 41.72 MiB         │
└────┴─────────────┴─────────────┴──────────────────┴───────────────────┘
insert into test select 'Q5' q, event,value  from system.events where event like '%S3%';


alter table test_s3 delete where S='110000001' settings mutations_sync=2;
0 rows in set. Elapsed: 29.650 sec.

┌─q──┬─S3GetObject─┬─S3PutObject─┬─ReadBufferFromS3─┬─WriteBufferFromS3─┐
│ Q6 │        4212 │          15 │ 2.42 GiB         │ 41.72 MiB         │
└────┴─────────────┴─────────────┴──────────────────┴───────────────────┘
insert into test select 'Q6' q, event,value  from system.events where event like '%S3%';

5.1.4 - Example of the table at s3 with cache

s3 disk and s3 cache.

Storage configuration

cat /etc/clickhouse-server/config.d/s3.xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3disk>
                <type>s3</type>
                <endpoint>https://s3.us-east-1.amazonaws.com/mybucket/test/s3cached/</endpoint>
                <use_environment_credentials>1</use_environment_credentials>  <!-- use IAM AWS role -->
                    <!--access_key_id>xxxx</access_key_id>
                    <secret_access_key>xxx</secret_access_key-->
            </s3disk>
            <cache>
                <type>cache</type>
                <disk>s3disk</disk>
                <path>/var/lib/clickhouse/disks/s3_cache/</path>
                <max_size>50Gi</max_size>  <!-- 50GB local cache to cache remote data -->
            </cache>
        </disks>
        <policies>
          <s3tiered>
              <volumes>
                  <default>
                      <disk>default</disk>
                      <max_data_part_size_bytes>50000000000</max_data_part_size_bytes>   <!-- only for parts less than 50GB after they moved to s3 during merges -->         
                  </default>
                  <s3cached>
                      <disk>cache</disk>  <!-- sandwich cache plus s3disk -->
                      <!-- prefer_not_to_merge>true</prefer_not_to_merge>
                      <perform_ttl_move_on_insert>false</perform_ttl_move_on_insert-->
                  </s3cached>
              </volumes>
          </s3tiered>
        </policies>
    </storage_configuration>
</clickhouse>

select * from system.disks
┌─name────┬─path──────────────────────────────┬───────────free_space─┬──────────total_space─┬
│ cache   │ /var/lib/clickhouse/disks/s3disk/ │ 18446744073709551615 │ 18446744073709551615 │
│ default │ /var/lib/clickhouse/              │         149113987072 │         207907635200 │
│ s3disk  │ /var/lib/clickhouse/disks/s3disk/ │ 18446744073709551615 │ 18446744073709551615 │
└─────────┴───────────────────────────────────┴──────────────────────┴──────────────────────┴

select * from system.storage_policies;
┌─policy_name─┬─volume_name─┬─volume_priority─┬─disks───────┬─volume_type─┬─max_data_part_size─┬─move_factor─┬─prefer_not_to_merge─┐
│ default     │ default     │               1 │ ['default'] │ JBOD        │                  0 │           0 │                   0 │
│ s3tiered    │ default     │               1 │ ['default'] │ JBOD        │        50000000000 │         0.1 │                   0 │
│ s3tiered    │ s3cached    │               2 │ ['s3disk']  │ JBOD        │                  0 │         0.1 │                   0 │
└─────────────┴─────────────┴─────────────────┴─────────────┴─────────────┴────────────────────┴─────────────┴─────────────────────┘

example with a new table

CREATE TABLE test_s3
(
    `A` Int64,
    `S` String,
    `D` Date
)
ENGINE = MergeTree
PARTITION BY D
ORDER BY A
SETTINGS storage_policy = 's3tiered';

insert into test_s3 select number, number, '2023-01-01' from numbers(1e9);

0 rows in set. Elapsed: 270.285 sec. Processed 1.00 billion rows, 8.00 GB (3.70 million rows/s., 29.60 MB/s.)

Table size is 7.65 GiB and it at the default disk (EBS):

select disk_name, partition, sum(rows), formatReadableSize(sum(bytes_on_disk)) size, count() part_count 
from system.parts where table= 'test_s3' and active 
group by disk_name, partition;
┌─disk_name─┬─partition──┬──sum(rows)─┬─size─────┬─part_count─┐
│ default   │ 2023-01-01 │ 1000000000 │ 7.65 GiB │          8 │
└───────────┴────────────┴────────────┴──────────┴────────────┘

It seems my EBS write speed is slower than S3 write speed:

alter table test_s3 move partition '2023-01-01' to volume 's3cached';
0 rows in set. Elapsed: 98.979 sec.

alter table test_s3 move partition '2023-01-01' to volume 'default';
0 rows in set. Elapsed: 127.741 sec.

Queries performance against EBS:

select * from test_s3 where A = 443;
1 row in set. Elapsed: 0.002 sec. Processed 8.19 thousand rows, 71.64 KB (3.36 million rows/s., 29.40 MB/s.)

select uniq(A) from test_s3;
1 row in set. Elapsed: 11.439 sec. Processed 1.00 billion rows, 8.00 GB (87.42 million rows/s., 699.33 MB/s.)

select count() from test_s3 where S like '%4422%'
1 row in set. Elapsed: 17.484 sec. Processed 1.00 billion rows, 17.89 GB (57.20 million rows/s., 1.02 GB/s.)

Let’s move data to S3

alter table test_s3 move partition '2023-01-01' to volume 's3cached';
0 rows in set. Elapsed: 81.068 sec.

select disk_name, partition, sum(rows), formatReadableSize(sum(bytes_on_disk)) size, count() part_count 
from system.parts where table= 'test_s3' and active 
group by disk_name, partition;
┌─disk_name─┬─partition──┬──sum(rows)─┬─size─────┬─part_count─┐
│ s3disk    │ 2023-01-01 │ 1000000000 │ 7.65 GiB │          8 │
└───────────┴────────────┴────────────┴──────────┴────────────┘

The first query execution against S3, the second against the cache (local EBS):

select * from test_s3 where A = 443;
1 row in set. Elapsed: 0.458 sec. Processed 8.19 thousand rows, 71.64 KB (17.88 thousand rows/s., 156.35 KB/s.)
1 row in set. Elapsed: 0.003 sec. Processed 8.19 thousand rows, 71.64 KB (3.24 million rows/s., 28.32 MB/s.)

select uniq(A) from test_s3;
1 row in set. Elapsed: 26.601 sec. Processed 1.00 billion rows, 8.00 GB (37.59 million rows/s., 300.74 MB/s.)
1 row in set. Elapsed: 8.675 sec. Processed 1.00 billion rows, 8.00 GB (115.27 million rows/s., 922.15 MB/s.)

select count() from test_s3 where S like '%4422%'
1 row in set. Elapsed: 33.586 sec. Processed 1.00 billion rows, 17.89 GB (29.77 million rows/s., 532.63 MB/s.)
1 row in set. Elapsed: 16.551 sec. Processed 1.00 billion rows, 17.89 GB (60.42 million rows/s., 1.08 GB/s.)

Cache introspection

select cache_base_path, formatReadableSize(sum(size)) from system.filesystem_cache group by 1;
┌─cache_base_path─────────────────────┬─formatReadableSize(sum(size))─┐
│ /var/lib/clickhouse/disks/s3_cache/ │ 7.64 GiB                      │
└─────────────────────────────────────┴───────────────────────────────┘

system drop FILESYSTEM cache;

select cache_base_path, formatReadableSize(sum(size)) from system.filesystem_cache group by 1;
0 rows in set. Elapsed: 0.005 sec.

select * from test_s3 where A = 443;
1 row in set. Elapsed: 0.221 sec. Processed 8.19 thousand rows, 71.64 KB (37.10 thousand rows/s., 324.47 KB/s.)

select cache_base_path, formatReadableSize(sum(size)) from system.filesystem_cache group by 1;
┌─cache_base_path─────────────────────┬─formatReadableSize(sum(size))─┐
│ /var/lib/clickhouse/disks/s3_cache/ │ 105.95 KiB                    │
└─────────────────────────────────────┴───────────────────────────────┘

No data is stored locally (except system log tables).

select name, formatReadableSize(free_space) free_space, formatReadableSize(total_space) total_space from system.disks;
┌─name────┬─free_space─┬─total_space─┐
│ cache   │ 16.00 EiB  │ 16.00 EiB   │
│ default │ 48.97 GiB  │ 49.09 GiB   │
│ s3disk  │ 16.00 EiB  │ 16.00 EiB   │
└─────────┴────────────┴─────────────┘

example with an existing table

The mydata table is created without the explicitly defined storage_policy, it means that implicitly storage_policy=default / volume=default / disk=default.

select disk_name, partition, sum(rows), formatReadableSize(sum(bytes_on_disk)) size, count() part_count 
from system.parts where table='mydata' and active 
group by disk_name, partition
order by partition;
┌─disk_name─┬─partition─┬─sum(rows)─┬─size───────┬─part_count─┐
│ default   │ 202201    │ 516666677 │ 4.01 GiB   │         13 │
│ default   │ 202202    │ 466666657 │ 3.64 GiB   │         13 │
│ default   │ 202203    │  16666666 │ 138.36 MiB │         10 │
│ default   │ 202301    │ 516666677 │ 4.01 GiB   │         10 │
│ default   │ 202302    │ 466666657 │ 3.64 GiB   │         10 │
│ default   │ 202303    │  16666666 │ 138.36 MiB │         10 │
└───────────┴───────────┴───────────┴────────────┴────────────┘

-- Let's change the storage policy, this command instant and changes only metadata of the table, and possible because the new storage policy and the old has the volume `default`.

alter table mydata modify setting storage_policy = 's3tiered';

0 rows in set. Elapsed: 0.057 sec.

straightforward (heavy) approach

-- Let's add TTL, it's a heavy command and takes a lot time and creates the performance impact, because it reads `D` column and moves parts to s3.
alter table mydata modify TTL D + interval 1 year to volume 's3cached';

0 rows in set. Elapsed: 140.661 sec.

┌─disk_name─┬─partition─┬─sum(rows)─┬─size───────┬─part_count─┐
│ s3disk    │ 202201    │ 516666677 │ 4.01 GiB   │         13 │
│ s3disk    │ 202202    │ 466666657 │ 3.64 GiB   │         13 │
│ s3disk    │ 202203    │  16666666 │ 138.36 MiB │         10 │
│ default   │ 202301    │ 516666677 │ 4.01 GiB   │         10 │
│ default   │ 202302    │ 466666657 │ 3.64 GiB   │         10 │
│ default   │ 202303    │  16666666 │ 138.36 MiB │         10 │
└───────────┴───────────┴───────────┴────────────┴────────────┘

gentle (manual) approach

-- alter modify TTL changes only metadata of the table and applied to only newly insterted data.
set materialize_ttl_after_modify=0;
alter table mydata modify TTL D + interval 1 year to volume 's3cached';
0 rows in set. Elapsed: 0.049 sec.

-- move data slowly partition by partition

alter table mydata move partition id '202201' to volume 's3cached';
0 rows in set. Elapsed: 49.410 sec.

alter table mydata move partition id '202202' to volume 's3cached';
0 rows in set. Elapsed: 36.952 sec.

alter table mydata move partition id '202203' to volume 's3cached';
0 rows in set. Elapsed: 4.808 sec.

-- data can be optimized to reduce number of parts before moving it to s3
optimize table mydata partition id '202301' final;
0 rows in set. Elapsed: 66.551 sec.

alter table mydata move partition id '202301' to volume 's3cached';
0 rows in set. Elapsed: 33.332 sec.

┌─disk_name─┬─partition─┬─sum(rows)─┬─size───────┬─part_count─┐
│ s3disk    │ 202201    │ 516666677 │ 4.01 GiB   │         13 │
│ s3disk    │ 202202    │ 466666657 │ 3.64 GiB   │         13 │
│ s3disk    │ 202203    │  16666666 │ 138.36 MiB │         10 │
│ s3disk    │ 202301    │ 516666677 │ 4.01 GiB   │          1 │ -- optimized partition
│ default   │ 202302    │ 466666657 │ 3.64 GiB   │         13 │
│ default   │ 202303    │  16666666 │ 138.36 MiB │         10 │
└───────────┴───────────┴───────────┴────────────┴────────────┘

S3 and ClickHouse® start time

Let’s create a table with 1000 parts and move them to s3.

CREATE TABLE test_s3( A Int64, S String, D Date)
ENGINE = MergeTree PARTITION BY D ORDER BY A
SETTINGS storage_policy = 's3tiered';

insert into test_s3 select number, number, toDate('2000-01-01') + intDiv(number,1e6) from numbers(1e9);
optimize table test_s3 final settings optimize_skip_merged_partitions = 1;

select disk_name, sum(rows), formatReadableSize(sum(bytes_on_disk)) size, count() part_count 
from system.parts where table= 'test_s3' and active group by disk_name;
┌─disk_name─┬──sum(rows)─┬─size─────┬─part_count─┐
│ default   │ 1000000000 │ 7.64 GiB │       1000 │
└───────────┴────────────┴──────────┴────────────┘

alter table test_s3 modify ttl D + interval 1 year to disk 's3disk';

select disk_name, sum(rows), formatReadableSize(sum(bytes_on_disk)) size, count() part_count 
from system.parts where table= 'test_s3' and active 
group by disk_name;
┌─disk_name─┬─sum(rows)─┬─size─────┬─part_count─┐
│ default   │ 755000000 │ 5.77 GiB │        755 │
│ s3disk    │ 245000000 │ 1.87 GiB │        245 │
└───────────┴───────────┴──────────┴────────────┘

----  several minutes later ----

┌─disk_name─┬──sum(rows)─┬─size─────┬─part_count─┐
│ s3disk    │ 1000000000 │ 7.64 GiB │       1000 │
└───────────┴────────────┴──────────┴────────────┘

start time

:) select name, value from system.merge_tree_settings where name = 'max_part_loading_threads';
┌─name─────────────────────┬─value─────┐
│ max_part_loading_threads │ 'auto(4)' │
└──────────────────────────┴───────────┘

# systemctl stop clickhouse-server
# time systemctl start clickhouse-server  / real	4m26.766s
# systemctl stop clickhouse-server
# time systemctl start clickhouse-server  / real	4m24.263s

# cat /etc/clickhouse-server/config.d/max_part_loading_threads.xml
<?xml version="1.0"?>
<clickhouse>
    <merge_tree>
       <max_part_loading_threads>128</max_part_loading_threads>
    </merge_tree>
</clickhouse>

# systemctl stop clickhouse-server
# time systemctl start clickhouse-server / real	0m11.225s
# systemctl stop clickhouse-server
# time systemctl start clickhouse-server / real	0m10.797s

       <max_part_loading_threads>256</max_part_loading_threads>

# systemctl stop clickhouse-server
# time systemctl start clickhouse-server / real	0m8.474s
# systemctl stop clickhouse-server
# time systemctl start clickhouse-server / real	0m8.130s

5.1.5 - S3Disk

Settings

<clickhouse>
  <storage_configuration>
    <disks>
      <s3>
        <type>s3</type>
        <endpoint>http://s3.us-east-1.amazonaws.com/BUCKET_NAME/test_s3_disk/</endpoint>
        <access_key_id>ACCESS_KEY_ID</access_key_id>
        <secret_access_key>SECRET_ACCESS_KEY</secret_access_key>
        <skip_access_check>true</skip_access_check>
        <send_metadata>true</send_metadata>
      </s3>
    </disks>
  </storage_configuration>
</clickhouse>

skip_access_check — if true, it’s possible to use read only credentials with regular MergeTree table. But you would need to disable merges (prefer_not_to_merge setting) on s3 volume as well.
send_metadata — if true, ClickHouse® will populate s3 object with initial part & file path, which allow you to recover metadata from s3 and make debug easier.

Restore metadata from S3

Default

Limitations:

ClickHouse need RW access to this bucket

In order to restore metadata, you would need to create restore file in metadata_path/_s3_disk_name_ directory:

touch /var/lib/clickhouse/disks/_s3_disk_name_/restore

In that case ClickHouse would restore to the same bucket and path and update only metadata files in s3 bucket.

Custom

Limitations:

ClickHouse needs RO access to the old bucket and RW to the new.
ClickHouse will copy objects in case of restoring to a different bucket or path.

If you would like to change bucket or path, you need to populate restore file with settings in key=value format:

cat /var/lib/clickhouse/disks/_s3_disk_name_/restore

source_bucket=s3disk
source_path=vol1/

5.2 - AggregateFunction(uniq, UUID) doubled after ClickHouse® upgrade

What happened

After ClickHouse® upgrade from version pre 21.6 to version after 21.6, count of unique UUID in AggregatingMergeTree tables nearly doubled in case of merging of data which was generated in different ClickHouse versions.

Why happened

In pull request which changed the internal representation of big integers data types (and UUID). SipHash64 hash-function used for uniq aggregation function for UUID data type was replaced with intHash64, which leads to different result for the same UUID value across different ClickHouse versions. Therefore, it results in doubling of counts, when uniqState created by different ClickHouse versions being merged together.

Related issue .

Solution

You need to replace any occurrence of uniqState(uuid) in MATERIALIZED VIEWs with uniqState(sipHash64(uuid)) and change data type for already saved data from AggregateFunction(uniq, UUID) to AggregateFunction(uniq, UInt64), because result data type of sipHash64 is UInt64.

-- On ClickHouse version 21.3

CREATE TABLE uniq_state
(
    `key` UInt32,
    `value` AggregateFunction(uniq, UUID)
)
ENGINE = MergeTree
ORDER BY key

INSERT INTO uniq_state SELECT
    number % 10000 AS key,
    uniqState(reinterpretAsUUID(number))
FROM numbers(1000000)
GROUP BY key

Ok.

0 rows in set. Elapsed: 0.404 sec. Processed 1.05 million rows, 8.38 MB (2.59 million rows/s., 20.74 MB/s.)

SELECT
    key % 20,
    uniqMerge(value)
FROM uniq_state
GROUP BY key % 20

┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│               0 │            50000 │
│               1 │            50000 │
│               2 │            50000 │
│               3 │            50000 │
│               4 │            50000 │
│               5 │            50000 │
│               6 │            49999 │
│               7 │            50000 │
│               8 │            49999 │
│               9 │            50000 │
│              10 │            50000 │
│              11 │            50000 │
│              12 │            50000 │
│              13 │            50000 │
│              14 │            50000 │
│              15 │            50000 │
│              16 │            50000 │
│              17 │            50000 │
│              18 │            50000 │
│              19 │            50000 │
└─────────────────┴──────────────────┘


-- After upgrade of ClickHouse to 21.8

SELECT
    key % 20,
    uniqMerge(value)
FROM uniq_state
GROUP BY key % 20


┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│               0 │            50000 │
│               1 │            50000 │
│               2 │            50000 │
│               3 │            50000 │
│               4 │            50000 │
│               5 │            50000 │
│               6 │            49999 │
│               7 │            50000 │
│               8 │            49999 │
│               9 │            50000 │
│              10 │            50000 │
│              11 │            50000 │
│              12 │            50000 │
│              13 │            50000 │
│              14 │            50000 │
│              15 │            50000 │
│              16 │            50000 │
│              17 │            50000 │
│              18 │            50000 │
│              19 │            50000 │
└─────────────────┴──────────────────┘

20 rows in set. Elapsed: 0.240 sec. Processed 10.00 thousand rows, 1.16 MB (41.72 thousand rows/s., 4.86 MB/s.)


CREATE TABLE uniq_state_2
ENGINE = MergeTree
ORDER BY key AS
SELECT *
FROM uniq_state

Ok.

0 rows in set. Elapsed: 0.128 sec. Processed 10.00 thousand rows, 1.16 MB (78.30 thousand rows/s., 9.12 MB/s.)


INSERT INTO uniq_state_2 SELECT
    number % 10000 AS key,
    uniqState(reinterpretAsUUID(number))
FROM numbers(1000000)
GROUP BY key

Ok.

0 rows in set. Elapsed: 0.266 sec. Processed 1.05 million rows, 8.38 MB (3.93 million rows/s., 31.48 MB/s.)


SELECT
    key % 20,
    uniqMerge(value)
FROM uniq_state_2
GROUP BY key % 20

┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│               0 │            99834 │ <- Count of unique values nearly doubled.
│               1 │           100219 │
│               2 │           100128 │
│               3 │           100457 │
│               4 │           100272 │
│               5 │           100279 │
│               6 │            99372 │
│               7 │            99450 │
│               8 │            99974 │
│               9 │            99632 │
│              10 │            99562 │
│              11 │           100660 │
│              12 │           100439 │
│              13 │           100252 │
│              14 │           100650 │
│              15 │            99320 │
│              16 │           100095 │
│              17 │            99632 │
│              18 │            99540 │
│              19 │           100098 │
└─────────────────┴──────────────────┘

20 rows in set. Elapsed: 0.356 sec. Processed 20.00 thousand rows, 2.33 MB (56.18 thousand rows/s., 6.54 MB/s.)


CREATE TABLE uniq_state_3
ENGINE = MergeTree
ORDER BY key AS
SELECT *
FROM uniq_state

0 rows in set. Elapsed: 0.126 sec. Processed 10.00 thousand rows, 1.16 MB (79.33 thousand rows/s., 9.24 MB/s.)

-- Option 1, create separate column

ALTER TABLE uniq_state_3
    ADD COLUMN `value_2` AggregateFunction(uniq, UInt64) DEFAULT unhex(hex(value));
	
	
ALTER TABLE uniq_state_3
    UPDATE value_2 = value_2 WHERE 1;
	
	
SELECT *
FROM system.mutations
WHERE is_done = 0;


Ok.

0 rows in set. Elapsed: 0.008 sec.


INSERT INTO uniq_state_3 (key, value_2) SELECT
    number % 10000 AS key,
    uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key

Ok.

0 rows in set. Elapsed: 0.337 sec. Processed 1.05 million rows, 8.38 MB (3.11 million rows/s., 24.89 MB/s.)


SELECT
    key % 20,
    uniqMerge(value),
    uniqMerge(value_2)
FROM uniq_state_3
GROUP BY key % 20

┌─modulo(key, 20)─┬─uniqMerge(value)─┬─uniqMerge(value_2)─┐
│               0 │            50000 │              50000 │
│               1 │            50000 │              50000 │
│               2 │            50000 │              50000 │
│               3 │            50000 │              50000 │
│               4 │            50000 │              50000 │
│               5 │            50000 │              50000 │
│               6 │            49999 │              49999 │
│               7 │            50000 │              50000 │
│               8 │            49999 │              49999 │
│               9 │            50000 │              50000 │
│              10 │            50000 │              50000 │
│              11 │            50000 │              50000 │
│              12 │            50000 │              50000 │
│              13 │            50000 │              50000 │
│              14 │            50000 │              50000 │
│              15 │            50000 │              50000 │
│              16 │            50000 │              50000 │
│              17 │            50000 │              50000 │
│              18 │            50000 │              50000 │
│              19 │            50000 │              50000 │
└─────────────────┴──────────────────┴────────────────────┘

20 rows in set. Elapsed: 0.768 sec. Processed 20.00 thousand rows, 4.58 MB (26.03 thousand rows/s., 5.96 MB/s.)

-- Option 2, modify column in-place with String as intermediate data type. 

ALTER TABLE uniq_state_3
    MODIFY COLUMN `value` String

Ok.

0 rows in set. Elapsed: 0.280 sec.


ALTER TABLE uniq_state_3
    MODIFY COLUMN `value` AggregateFunction(uniq, UInt64)

Ok.

0 rows in set. Elapsed: 0.254 sec.


INSERT INTO uniq_state_3 (key, value) SELECT
    number % 10000 AS key,
    uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key

Ok.

0 rows in set. Elapsed: 0.554 sec. Processed 1.05 million rows, 8.38 MB (1.89 million rows/s., 15.15 MB/s.)


SELECT
    key % 20,
    uniqMerge(value),
    uniqMerge(value_2)
FROM uniq_state_3
GROUP BY key % 20

┌─modulo(key, 20)─┬─uniqMerge(value)─┬─uniqMerge(value_2)─┐
│               0 │            50000 │              50000 │
│               1 │            50000 │              50000 │
│               2 │            50000 │              50000 │
│               3 │            50000 │              50000 │
│               4 │            50000 │              50000 │
│               5 │            50000 │              50000 │
│               6 │            49999 │              49999 │
│               7 │            50000 │              50000 │
│               8 │            49999 │              49999 │
│               9 │            50000 │              50000 │
│              10 │            50000 │              50000 │
│              11 │            50000 │              50000 │
│              12 │            50000 │              50000 │
│              13 │            50000 │              50000 │
│              14 │            50000 │              50000 │
│              15 │            50000 │              50000 │
│              16 │            50000 │              50000 │
│              17 │            50000 │              50000 │
│              18 │            50000 │              50000 │
│              19 │            50000 │              50000 │
└─────────────────┴──────────────────┴────────────────────┘

20 rows in set. Elapsed: 0.589 sec. Processed 30.00 thousand rows, 6.87 MB (50.93 thousand rows/s., 11.66 MB/s.)

SHOW CREATE TABLE uniq_state_3;

CREATE TABLE default.uniq_state_3
(
    `key` UInt32,
    `value` AggregateFunction(uniq, UInt64),
    `value_2` AggregateFunction(uniq, UInt64) DEFAULT unhex(hex(value))
)
ENGINE = MergeTree
ORDER BY key
SETTINGS index_granularity = 8192

-- Option 3, CAST uniqState(UInt64) to String.

CREATE TABLE uniq_state_4
ENGINE = MergeTree
ORDER BY key AS
SELECT *
FROM uniq_state

Ok.

0 rows in set. Elapsed: 0.146 sec. Processed 10.00 thousand rows, 1.16 MB (68.50 thousand rows/s., 7.98 MB/s.)

INSERT INTO uniq_state_4 (key, value) SELECT
    number % 10000 AS key,
    CAST(uniqState(sipHash64(reinterpretAsUUID(number))), 'String')
FROM numbers(1000000)
GROUP BY key

Ok.

0 rows in set. Elapsed: 0.476 sec. Processed 1.05 million rows, 8.38 MB (2.20 million rows/s., 17.63 MB/s.)

SELECT
    key % 20,
    uniqMerge(value)
FROM uniq_state_4
GROUP BY key % 20

┌─modulo(key, 20)─┬─uniqMerge(value)─┐
│               0 │            50000 │
│               1 │            50000 │
│               2 │            50000 │
│               3 │            50000 │
│               4 │            50000 │
│               5 │            50000 │
│               6 │            49999 │
│               7 │            50000 │
│               8 │            49999 │
│               9 │            50000 │
│              10 │            50000 │
│              11 │            50000 │
│              12 │            50000 │
│              13 │            50000 │
│              14 │            50000 │
│              15 │            50000 │
│              16 │            50000 │
│              17 │            50000 │
│              18 │            50000 │
│              19 │            50000 │
└─────────────────┴──────────────────┘

20 rows in set. Elapsed: 0.281 sec. Processed 20.00 thousand rows, 2.33 MB (71.04 thousand rows/s., 8.27 MB/s.)

SHOW CREATE TABLE uniq_state_4;

CREATE TABLE default.uniq_state_4
(
    `key` UInt32,
    `value` AggregateFunction(uniq, UUID)
)
ENGINE = MergeTree
ORDER BY key
SETTINGS index_granularity = 8192

5.3 - Can not connect to my ClickHouse® server

Can not connect to my ClickHouse® server.

Can not connect to my ClickHouse® server

Errors like “Connection reset by peer, while reading from socket”

Ensure that the clickhouse-server is running
```
systemctl status clickhouse-server
```
If server was restarted recently and don’t accept the connections after the restart - most probably it still just starting. During the startup sequence it need to iterate over all data folders in /var/lib/clickhouse-server In case if you have a very high number of folders there (usually caused by a wrong partitioning, or a very high number of tables / databases) that startup time can take a lot of time (same can happen if disk is very slow, for example NFS).
You can check that by looking for ‘Ready for connections’ line in /var/log/clickhouse-server/clickhouse-server.log (Information log level needed)
Ensure you use the proper port ip / interface?
Ensure you’re not trying to connect to secure port without tls / https or vice versa.
For clickhouse-client - pay attention on host / port / secure flags.
Ensure the interface you’re connecting to is the one which ClickHouse listens (by default ClickHouse listens only localhost).
Note: If you uncomment line <listen_host>0.0.0.0</listen_host> only - ClickHouse will listen only ipv4 interfaces, while the localhost (used by clickhouse-client) may be resolved to ipv6 address. And clickhouse-client may be failing to connect.
How to check which interfaces / ports do ClickHouse listen?
```
sudo lsof -i -P -n | grep LISTEN

echo listen_host
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=listen_host
echo tcp_port
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=tcp_port
echo tcp_port_secure
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=tcp_port_secure
echo http_port
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=http_port
echo https_port
sudo clickhouse-extract-from-config --config=/etc/clickhouse-server/config.xml --key=https_port
```
For secure connection:
- ensure that server uses some certificate which can be validated by the client
- OR disable certificate checks on the client (UNSECURE)
Check for errors in /var/log/clickhouse-server/clickhouse-server.err.log ?
Is ClickHouse able to serve some trivial tcp / http requests from localhost?
```
curl 127.0.0.1:9200
curl 127.0.0.1:8123
```

Check number of sockets opened by ClickHouse

sudo lsof -i -a -p $(pidof clickhouse-server)

# or (adjust 9000 / 8123 ports if needed)
netstat -tn 2>/dev/null | tail -n +3 | awk '{ printf("%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $5, $6) }' | clickhouse-local -S "Proto String, RecvQ Int64, SendQ Int64, LocalAddress String, ForeignAddress String, State LowCardinality(String)" --query="SELECT * FROM table WHERE LocalAddress like '%:9000' FORMAT PrettyCompact"

netstat -tn 2>/dev/null | tail -n +3 | awk '{ printf("%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $5, $6) }' | clickhouse-local -S "Proto String, RecvQ Int64, SendQ Int64, LocalAddress String, ForeignAddress String, State LowCardinality(String)" --query="SELECT * FROM table WHERE LocalAddress like '%:8123' FORMAT PrettyCompact"

ClickHouse has a limit of number of open connections (4000 by default).

Check also:

# system overall support limited number of connections it can handle
netstat

# you can also be reaching of of the process ulimits (Max open files)
cat /proc/$(pidof -s clickhouse-server)/limits

Check firewall / selinux rules (if used)

5.4 - cgroups and kubernetes cloud providers

cgroups and kubernetes cloud providers.

Why my ClickHouse® is slow after upgrade to version 22.2 and higher?

The probable reason is that ClickHouse 22.2 started to respect cgroups (Respect cgroups limits in max_threads autodetection. #33342 (JaySon ).

You can observe that max_threads = 1

SELECT
    name,
    value
FROM system.settings
WHERE name = 'max_threads'

┌─name────────┬─value─────┐
│ max_threads │ 'auto(1)' │
└─────────────┴───────────┘

This makes ClickHouse to execute all queries with a single thread (normal behavior is half of available CPU cores, cores = 64, then ‘auto(32)’).

We observe this cgroups behavior with AWS EKS (Kubernetes) environment and Altinity ClickHouse Operator in case if requests.cpu and limits.cpu are not set for a resource.

Workaround

We suggest to set requests.cpu = half of available CPU cores, and limits.cpu = CPU cores.

For example in case of 16 CPU cores:

          resources:
            requests:
              memory: ...
              cpu: 8
            limits:
              memory: ....
              cpu: 16

Then you should get a new result:

SELECT
    name,
    value
FROM system.settings
WHERE name = 'max_threads'

┌─name────────┬─value─────┐
│ max_threads │ 'auto(8)' │
└─────────────┴───────────┘

in depth

For some reason AWS EKS sets cgroup kernel parameters in case of empty requests.cpu & limits.cpu into these:

# cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1

# cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000

# cat /sys/fs/cgroup/cpu/cpu.shares
2

This makes ClickHouse to set max_threads = 1 because of

cgroup_share = /sys/fs/cgroup/cpu/cpu.shares (2)
PER_CPU_SHARES = 1024
share_count = ceil( cgroup_share / PER_CPU_SHARES ) ---> ceil(2 / 1024) ---> 1

Fix

Incorrect calculation was fixed in https://github.com/ClickHouse/ClickHouse/pull/35815 and will work correctly on newer releases.

5.5 - Transforming ClickHouse logs to ndjson using Vector.dev

Transforming ClickHouse logs to ndjson using Vector.dev

ClickHouse 22.8

Starting from 22.8 version, ClickHouse support writing logs in JSON format:

<?xml version="1.0"?>
<clickhouse>
    <logger>
        <!-- Structured log formatting:
        You can specify log format(for now, JSON only). In that case, the console log will be printed
        in specified format like JSON.
        For example, as below:
        {"date_time":"1650918987.180175","thread_name":"#1","thread_id":"254545","level":"Trace","query_id":"","logger_name":"BaseDaemon","message":"Received signal 2","source_file":"../base/daemon/BaseDaemon.cpp; virtual void SignalListener::run()","source_line":"192"}
        To enable JSON logging support, just uncomment <formatting> tag below.
        -->
        <formatting>json</formatting>
    </logger>
</clickhouse>

Transforming ClickHouse logs to ndjson using Vector.dev"

Installation of vector.dev

# arm64
wget https://packages.timber.io/vector/0.15.2/vector_0.15.2-1_arm64.deb

# amd64
wget https://packages.timber.io/vector/0.15.2/vector_0.15.2-1_amd64.deb

dpkg -i vector_0.15.2-1_*.deb

systemctl stop vector

mkdir /var/log/clickhouse-server-json

chown vector.vector /var/log/clickhouse-server-json

usermod -a -G clickhouse vector

vector config

# cat /etc/vector/vector.toml
data_dir = "/var/lib/vector"

[sources.clickhouse-log]
  type                          = "file"
  include                       = [ "/var/log/clickhouse-server/clickhouse-server.log" ]
  fingerprinting.strategy       = "device_and_inode"
  message_start_indicator = '^\d+\.\d+\.\d+ \d+:\d+:\d+'
  multi_line_timeout = 1000


[transforms.clickhouse-log-text]
  inputs                        = [ "clickhouse-log" ]
  type                          = "remap"
  source = '''
     . |= parse_regex!(.message, r'^(?P<timestamp>\d+\.\d+\.\d+ \d+:\d+:\d+\.\d+) \[\s?(?P<thread_id>\d+)\s?\] \{(?P<query_id>.*)\} <(?P<severity>\w+)> (?s)(?P<message>.*$)')
  '''

[sinks.emit-clickhouse-log-json]
  type = "file"
  inputs = [ "clickhouse-log-text" ]
  compression = "none"
  path = "/var/log/clickhouse-server-json/clickhouse-server.%Y-%m-%d.ndjson"
  encoding.only_fields = ["timestamp", "thread_id", "query_id", "severity", "message" ]
  encoding.codec = "ndjson"

start

systemctl start vector

tail /var/log/clickhouse-server-json/clickhouse-server.2022-04-21.ndjson
{"message":"DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 166.80 GiB.","query_id":"","severity":"Debug","thread_id":"283239","timestamp":"2022.04.21 13:43:21.164660"}
{"message":"MergedBlockOutputStream: filled checksums 202204_67118_67118_0 (state Temporary)","query_id":"","severity":"Trace","thread_id":"283239","timestamp":"2022.04.21 13:43:21.166810"}
{"message":"system.metric_log (e3365172-4c9b-441b-b803-756ae030e741): Renaming temporary part tmp_insert_202204_67118_67118_0 to 202204_171703_171703_0.","query_id":"","severity":"Trace","thread_id":"283239","timestamp":"2022.04.21 13:43:21.167226"}
....

sink logs into ClickHouse table

Be careful with logging ClickHouse messages into the same ClickHouse instance, it will cause endless recursive self-logging.

create table default.clickhouse_logs(
  timestamp DateTime64(3),
  host LowCardinality(String),
  thread_id LowCardinality(String),
  severity LowCardinality(String),
  query_id String,
  message String)
Engine = MergeTree 
Partition by toYYYYMM(timestamp)
Order by (toStartOfHour(timestamp), host, severity, query_id);

create user vector identified  by 'vector1234';
grant insert on default.clickhouse_logs to vector;
create settings profile or replace profile_vector settings log_queries=0 readonly TO vector;

[sinks.clickhouse-output-clickhouse]
    inputs   = ["clickhouse-log-text"]
    type     = "clickhouse"

    host = "http://localhost:8123"
    database = "default"
    auth.strategy = "basic"
    auth.user = "vector"
    auth.password = "vector1234"
    healthcheck = true
    table = "clickhouse_logs"

    encoding.timestamp_format = "unix"

    buffer.type = "disk"
    buffer.max_size = 104900000
    buffer.when_full = "block"

    request.in_flight_limit = 20

    encoding.only_fields =  ["host", "timestamp", "thread_id", "query_id", "severity", "message"]

select * from default.clickhouse_logs limit 10;
┌───────────────timestamp─┬─host───────┬─thread_id─┬─severity─┬─query_id─┬─message─────────────────────────────────────────────────────
│ 2022-04-21 19:08:13.443 │ clickhouse │ 283155    │ Debug    │          │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Authentic
│ 2022-04-21 19:08:13.443 │ clickhouse │ 283155    │ Debug    │          │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Authentic
│ 2022-04-21 19:08:13.443 │ clickhouse │ 283155    │ Debug    │          │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Creating
│ 2022-04-21 19:08:13.447 │ clickhouse │ 283155    │ Debug    │          │ MemoryTracker: Peak memory usage (for query): 4.00 MiB.
│ 2022-04-21 19:08:13.447 │ clickhouse │ 283155    │ Debug    │          │ HTTP-Session: 13e87050-7824-46b0-9bd5-29469a1b102f Destroyin
│ 2022-04-21 19:08:13.495 │ clickhouse │ 283155    │ Debug    │          │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Authentic
│ 2022-04-21 19:08:13.495 │ clickhouse │ 283155    │ Debug    │          │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Authentic
│ 2022-04-21 19:08:13.495 │ clickhouse │ 283155    │ Debug    │          │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Creating
│ 2022-04-21 19:08:13.496 │ clickhouse │ 283155    │ Debug    │          │ MemoryTracker: Peak memory usage (for query): 4.00 MiB.
│ 2022-04-21 19:08:13.496 │ clickhouse │ 283155    │ Debug    │          │ HTTP-Session: f7eb829f-7b3a-4c43-8a41-a2e6676177fb Destroyin
└─────────────────────────┴────────────┴───────────┴──────────┴──────────┴─────────────────────────────────────────────────────────────

5.6 - Altinity Kubernetes Operator For ClickHouse®

Altinity Kubernetes Operator For ClickHouse®

Altinity Kubernetes Operator for ClickHouse® Documentation

https://github.com/Altinity/clickhouse-operator/blob/master/docs/README.md

5.7 - ClickHouse® and different filesystems

ClickHouse® and different filesystems.

In general ClickHouse® should work with any POSIX-compatible filesystem.

hard links and soft links support is mandatory.
ClickHouse can use O_DIRECT mode to bypass the cache (and async io)
ClickHouse can use renameat2 command for some atomic operations (not all the filesystems support that).
depending on the schema and details of the usage the filesystem load can vary between the setup. The most natural load - is high throughput, with low or moderate IOPS.
data is compressed in ClickHouse (LZ4 by default), while indexes / marks / metadata files - no. Enabling disk-level compression can sometimes improve the compression, but can affect read / write speed.

ext4

no issues, fully supported.

The minimum kernel version required is 3.15 (newer are recommended)

XFS

Performance issues reported by users, use on own risk. Old kernels are not recommended (4.0 or newer is recommended).

According to the users’ feedback, XFS behaves worse with ClickHouse under heavy load. We don’t have real proofs/benchmarks though, example reports:

In GitHub there are complaints about XFS from Cloudflare.
Recently my colleague discovered that two of ClickHouse servers perform worse in a cluster than others and they found that they accidentally set up those servers with XFS instead of Ext4.
in the system journal you can sometimes see reports like ’task XYZ blocked for more than 120 seconds’ and stack trace pointing to XFS code (example: https://gist.github.com/filimonov/85b894268f978c2ccc18ea69bae5adbd )
system goes to 99% io kernel under load sometimes.
we have XFS, sometimes ClickHouse goes to “sleep” because XFS daemon is doing smth unknown

Maybe the above problem can be workaround by some tuning/settings, but so far we do not have a working and confirmed way to do this.

ZFS

Limitations exist, extra tuning may be needed, and having more RAM is recommended. Old kernels are not recommended.

Memory usage control - ZFS adaptive replacement cache (ARC) can take a lot of RAM. It can be the reason of out-of-memory issues when memory is also requested by the ClickHouse.

It seems that the most important thing is zfs_arc_max - you just need to limit the maximum size of the ARC so that the sum of the maximum size of the arc + the CH itself does not exceed the size of the available RAM. For example, we set a limit of 80% RAM for ClickHouse and 10% for ARC. 10% will remain for the system and other applications

Tuning:

another potentially interesting setting is primarycache=metadata, see benchmark example: https://www.ikus-soft.com/en/blog/2018-05-23-proxmox-primarycache-all-metadata/
examples of tuning ZFS for MySQL https://wiki.freebsd.org/ZFSTuningGuide - perhaps some of this can also be useful (atime, recordsize) but everything needs to be carefully checked with benchmarks (I have no way).
best practices: https://efim360.ru/zfs-best-practices-guide/

important note: In versions before 2.2 ZFS does not support the renameat2 command, which is used by the Atomic database engine, and therefore some of the Atomic functionality will not be available.

In old versions of ClickHouse, you can face issues with the O_DIRECT mode.

Also there is a well-known (and controversial) Linus Torvalds opinion: “Don’t Use ZFS on Linux” [1] , [2] , [3] .

BTRFS

Not enough information. Some users report performance improvement for their use case.

ReiserFS

Not enough information.

Lustre

There are reports that some people successfully use it in their setups. A fast network is required.

There were some reports about data damage on the disks on older ClickHouse versions, which could be caused by the issues with O_DIRECT or async io support on Lustre.

NFS (and EFS)

According to the reports - it works, throughput depends a lot on the network speed. IOPS / number of file operations per seconds can be super low (due to the locking mechanism).

https://github.com/ClickHouse/ClickHouse/issues/31113

MooseFS

There are installations using that. No extra info.

GlusterFS

There are installations using that. No extra info.

Ceph

There are installations using that. Some information: https://github.com/ClickHouse/ClickHouse/issues/8315

5.8 - ClickHouse® Access Control and Account Management (RBAC)

Access Control and Account Management (RBAC).

Documentation https://clickhouse.com/docs/en/operations/access-rights/

Enable ClickHouse® RBAC and create admin user

Create an admin user like (root in MySQL or postgres in PostgreSQL) to do the DBA/admin ops in the user.xml file and set the access management property for the admin user

<clickhouse>
<users>
  <default>
  ....
  </default>
  <admin>
      <!--    
        Password could be specified in plaintext or in SHA256 (in hex format).

        If you want to specify password in plaintext (not recommended), place it in 'password' element.
        Example: <password>qwerty</password>.
        Password could be empty.

        If you want to specify SHA256, place it in 'password_sha256_hex' element.
        Example: <password_sha256_hex>65e84be33532fb784c48129675f9eff3a682b27168c0ea744b2cf58ee02337c5</password_sha256_hex>
        Restrictions of SHA256: impossibility to connect to ClickHouse using MySQL JS client (as of July 2019).

        If you want to specify double SHA1, place it in 'password_double_sha1_hex' element.
        Example: <password_double_sha1_hex>e395796d6546b1b65db9d665cd43f0e858dd4303</password_double_sha1_hex>
      -->
      <password></password> 
      <networks>
          <ip>::/0</ip>
      </networks>
      <!-- Settings profile for user. -->
      <profile>default</profile>
      <!-- Quota for user. -->
      <quota>default</quota>
      <!-- Set This parameter to Enable RBAC
      Admin user can create other users and grant rights to them. -->
      <access_management>1</access_management>
  </admin>
...
</clickhouse>

default user

As default is used for many internal and background operations, so it is not convenient to set it up with a password, because you would have to change it in many configs/parts. Best way to secure the default user is only allow localhost or trusted network connections like this in users.xml:

<clickhouse>
<users>
    <default>
    ......    
        <networks>
            <ip>127.0.0.1/8</ip>
            <ip>10.10.10.0/24</ip>
        </networks>
    
    ......
    </default>
</clickhouse>

replication user

The replication user is defined by interserver_http_credential tag. It does not relate to a ClickHouse client credentials configuration. If this tag is ommited then authentication is not used during replication. Ports 9009 and 9010(tls) provide low-level data access between servers. This ports should not be accessible from untrusted networks. You can specify credentials for authentication between replicas. This is required when interserver_https_port is accessible from untrusted networks. You can do so by defining user and password to the interserver credentials. Then replication protocol will use basic access authentication when connecting by HTTP/HTTPS to other replicas:

  <interserver_http_credentials>
      <user>replication</user>
      <password>password</password>
  </interserver_http_credentials>

Create users and roles

Now we can setup users/roles using a generic best-practice approach for RBAC from other databases, like using roles, granting permissions to roles, creating users for different applications, etc…

see User Hardening article

Example: 3 roles (dba, dashboard_ro, ingester_rw)

create role dba on cluster '{cluster}';
grant all on *.* to dba on cluster '{cluster}';
create user `user1` identified  by 'pass1234' on cluster '{cluster}';
grant dba to user1 on cluster '{cluster}';


create role dashboard_ro on cluster '{cluster}';
grant select on default.* to dashboard_ro on cluster '{cluster}';
grant dictGet on *.*  to dashboard_ro on cluster '{cluster}';

create settings profile or replace profile_dashboard_ro on cluster '{cluster}'
settings max_concurrent_queries_for_user = 10 READONLY, 
         max_threads = 16 READONLY, 
         max_memory_usage_for_user = '30G' READONLY,
         max_memory_usage = '30G' READONLY,
         max_execution_time = 60 READONLY,
         max_rows_to_read = 1000000000 READONLY,
         max_bytes_to_read = '5000G' READONLY
TO dashboard_ro;

create user `dash1` identified  by 'pass1234' on cluster '{cluster}';

grant dashboard_ro to dash1 on cluster '{cluster}';

create role ingester_rw on cluster '{cluster}';
grant select,insert on default.* to ingester_rw on cluster '{cluster}';

create settings profile or replace profile_ingester_rw on cluster '{cluster}'
settings max_concurrent_queries_for_user = 40 READONLY,    -- user can run 40 queries (select, insert ...) simultaneously  
         max_threads = 10 READONLY,                        -- each query can use up to 10 cpu (READONLY means user cannot override a value)
         max_memory_usage_for_user = '30G' READONLY,       -- all queries of the user can use up to 30G RAM
         max_memory_usage = '25G' READONLY,                -- each query can use up to 25G RAM
         max_execution_time = 200 READONLY,                -- each query can executes no longer 200 seconds
         max_rows_to_read = 1000000000 READONLY,           -- each query can read up to 1 billion rows
         max_bytes_to_read = '5000G' READONLY              -- each query can read up to 5 TB from a MergeTree
TO ingester_rw;

create user `ingester_app1` identified  by 'pass1234'　on cluster '{cluster}';

grant ingester_rw to ingester_app1 on cluster '{cluster}';

check

$ clickhouse-client -u dash1 --password pass1234

create table test ( A Int64) Engine=Log;
   DB::Exception: dash1: Not enough privileges
   
   
$ clickhouse-client -u user1 --password pass1234

create table test ( A Int64) Engine=Log;
Ok.

drop table test;
Ok.


$ clickhouse-client -u ingester_app1 --password pass1234

select count() from system.numbers limit 1000000000000;
   DB::Exception: Received from localhost:9000. DB::Exception: Limit for rows or bytes to read exceeded, max rows: 1.00 billion

clean up

show profiles;
┌─name─────────────────┐
│ default              │
│ profile_dashboard_ro │
│ profile_ingester_rw  │
│ readonly             │
└──────────────────────┘

drop profile if exists readonly on cluster '{cluster}';
drop profile if exists profile_dashboard_ro on cluster '{cluster}';
drop profile if exists profile_ingester_rw on cluster '{cluster}';


show roles;
┌─name─────────┐
│ dashboard_ro │
│ dba          │
│ ingester_rw  │
└──────────────┘

drop role if exists dba on cluster '{cluster}';
drop role if exists dashboard_ro on cluster '{cluster}';
drop role if exists ingester_rw on cluster '{cluster}';


show users;
┌─name──────────┐
│ dash1         │
│ default       │
│ ingester_app1 │
│ user1         │
└───────────────┘

drop user if exists ingester_app1 on cluster '{cluster}';
drop user if exists user1 on cluster '{cluster}';
drop user if exists dash1 on cluster '{cluster}';

5.9 - Client Timeouts

How to prevent connection errors.

Timeout settings are related to the client, server, and network. They can be tuned to solve sporadic timeout issues.

It’s important to understand that network devices (routers, NATs, load balancers ) could have their own timeouts. Sometimes, they won’t respect TCP keep-alive and close the session due to inactivity. Only application-level keepalives could prevent TCP sessions from closing.

Below are the settings that will work only if you set them in the default user profile. The problem is that they should be applied before the connection happens. And if you send them with a query/connection, it’s already too late:

SETTINGS
        receive_timeout = 3600,
        send_timeout = 3600,
        http_receive_timeout = 3600,
        http_send_timeout = 3600,
        http_connection_timeout = 2

Those can be set on the query level (but in the profile, too):

SETTINGS
    tcp_keep_alive_timeout = 3600,
    --!!!send_progress_in_http_headers = 1,
    http_headers_progress_interval_ms = 10000,
    http_wait_end_of_query = 1,
    max_execution_time = 3600

https://clickhouse.com/docs/en/integrations/language-clients/javascript#keep-alive-nodejs-only

send_progress_in_http_headers will not be applied in this way because here we can configure the JDBC driver’s client options only (this ), but there is an option called custom_settings (this ) that will apply custom ch query settings for every query before the actual connection is created. The correct JDBC connection string will look like this:

jdbc:clickhouse://"${clickhouse.host}"/"${clickhouse.db}"?ssl=true&socket_timeout=3600000&socket_keepalive=true&custom_settings=send_progress_in_http_headers=1

Description

http_send_timeout & send_timeout: The timeout for sending data to the socket. If the server takes longer than this value to send data, the connection will be terminated (i.e., when the server pushes data to the client, and the client is not reading that for some reason).
http_receive_timeout & receive_timeout: The timeout for receiving data from the socket. If the server takes longer than this value to receive the entire request from the client, the connection will be terminated. This setting ensures that the server is not kept waiting indefinitely for slow or unresponsive clients (i.e., the server tries to get some data from the client, but the client does not send anything).
http_connection_timeout & connect_timeout: Defines how long ClickHouse should wait when it connects to another server. If the connection cannot be established within this time frame, it will be terminated. This does not impact the clients which connect to ClickHouse using HTTP (it only matters when ClickHouse works as a TCP/HTTP client).
keep_alive_timeout: This is for ‘Connection: keep-alive’ in HTTP 1.1, only for HTTP. It defines how long ClickHouse can wait for the next request in the same connection to arrive after serving the previous one. It does not lead to any SOCKET_TIMEOUT exception, just closes the socket if the client doesn’t start a new request after that time.

sync_request_timeout – timeout for server ping. Defaults to 5 seconds.

In some cases, if the data sync request time out, it may be caused by many different reasons, basically it shouldn’t take more than 5 seconds for synchronous request-result protocol call (like Ping or TableStatus) in most of the normal circumstances, thus if time out setting too long, eg. 5 minutes or longer than that, then you will run into more overall performance issues. This is not good for any application on the server.

How to check the current timeouts:

SELECT
    name,
    value,
    changed,
    description
FROM system.settings
WHERE (name ILIKE '%send_timeout%') OR (name ILIKE '%receive_timeout%') OR (name ILIKE '%keep_alive%') OR (name ILIKE '%_http_headers') OR (name ILIKE 'http_headers_progres_%') OR (name ILIKE 'http_connection_%')

5.10 - Compatibility layer for the Altinity Kubernetes Operator for ClickHouse®

Page description for heading and indexes.

It’s possible to expose clickhouse-server metrics in the style used by the Altinity Kubernetes Operator for ClickHouse®. It’s for the clickhouse-operator grafana dashboard.

CREATE VIEW system.operator_compatible_metrics
(
    `name` String,
    `value` Float64,
    `help` String,
    `labels` Map(String, String),
    `type` String
) AS
SELECT
    concat('chi_clickhouse_event_', event) AS name,
    CAST(value, 'Float64') AS value,
    description AS help,
    map('hostname', hostName()) AS labels,
    'counter' AS type
FROM system.events
UNION ALL
SELECT
    concat('chi_clickhouse_metric_', metric) AS name,
    CAST(value, 'Float64') AS value,
    description AS help,
    map('hostname', hostName()) AS labels,
    'gauge' AS type
FROM system.metrics
UNION ALL
SELECT
    concat('chi_clickhouse_metric_', metric) AS name,
    value,
    '' AS help,
    map('hostname', hostName()) AS labels,
    'gauge' AS type
FROM system.asynchronous_metrics
UNION ALL
SELECT
    'chi_clickhouse_metric_MemoryDictionaryBytesAllocated' AS name,
    CAST(sum(bytes_allocated), 'Float64') AS value,
    'Memory size allocated for dictionaries' AS help,
    map('hostname', hostName()) AS labels,
    'gauge' AS type
FROM system.dictionaries
UNION ALL
SELECT
    'chi_clickhouse_metric_LongestRunningQuery' AS name,
    CAST(max(elapsed), 'Float64') AS value,
    'Longest running query time' AS help,
    map('hostname', hostName()) AS labels,
    'gauge' AS type
FROM system.processes
UNION ALL
WITH
    ['chi_clickhouse_table_partitions', 'chi_clickhouse_table_parts', 'chi_clickhouse_table_parts_bytes', 'chi_clickhouse_table_parts_bytes_uncompressed', 'chi_clickhouse_table_parts_rows', 'chi_clickhouse_metric_DiskDataBytes', 'chi_clickhouse_metric_MemoryPrimaryKeyBytesAllocated'] AS names,
    [uniq(partition), count(), sum(bytes), sum(data_uncompressed_bytes), sum(rows), sum(bytes_on_disk), sum(primary_key_bytes_in_memory_allocated)] AS values,
    arrayJoin(arrayZip(names, values)) AS tpl
SELECT
    tpl.1 AS name,
    CAST(tpl.2, 'Float64') AS value,
    '' AS help,
    map('database', database, 'table', table, 'active', toString(active), 'hostname', hostName()) AS labels,
    'gauge' AS type
FROM system.parts
GROUP BY
    active,
    database,
    table
UNION ALL
WITH
    ['chi_clickhouse_table_mutations', 'chi_clickhouse_table_mutations_parts_to_do'] AS names,
    [CAST(count(), 'Float64'), CAST(sum(parts_to_do), 'Float64')] AS values,
    arrayJoin(arrayZip(names, values)) AS tpl
SELECT
    tpl.1 AS name,
    tpl.2 AS value,
    '' AS help,
    map('database', database, 'table', table, 'hostname', hostName()) AS labels,
    'gauge' AS type
FROM system.mutations
WHERE is_done = 0
GROUP BY
    database,
    table
UNION ALL
WITH if(coalesce(reason, 'unknown') = '', 'detached_by_user', coalesce(reason, 'unknown')) AS detach_reason
SELECT
    'chi_clickhouse_metric_DetachedParts' AS name,
    CAST(count(), 'Float64') AS value,
    '' AS help,
    map('database', database, 'table', table, 'disk', disk, 'hostname', hostName()) AS labels,
    'gauge' AS type
FROM system.detached_parts
GROUP BY
    database,
    table,
    disk,
    reason
ORDER BY name ASC

nano /etc/clickhouse-server/config.d/operator_metrics.xml
<clickhouse>
    <http_handlers>
        <rule>
            <url>/metrics</url>
            <methods>POST,GET</methods>
            <handler>
                <type>predefined_query_handler</type>
                <query>SELECT * FROM system.operator_compatible_metrics FORMAT Prometheus</query>
                <content_type>text/plain; charset=utf-8</content_type>
            </handler>
        </rule>
        <defaults/>
        <rule>
            <url>/</url>
            <methods>POST,GET</methods>
            <headers><pragma>no-cache</pragma></headers>
            <handler>
                <type>dynamic_query_handler</type>
                <query_param_name>query</query_param_name>
            </handler>
        </rule>
    </http_handlers>    
</clickhouse>

curl http://localhost:8123/metrics
# HELP chi_clickhouse_metric_Query Number of executing queries
# TYPE chi_clickhouse_metric_Query gauge
chi_clickhouse_metric_Query{hostname="LAPTOP"} 1

# HELP chi_clickhouse_metric_Merge Number of executing background merges
# TYPE chi_clickhouse_metric_Merge gauge
chi_clickhouse_metric_Merge{hostname="LAPTOP"} 0

# HELP chi_clickhouse_metric_PartMutation Number of mutations (ALTER DELETE/UPDATE)
# TYPE chi_clickhouse_metric_PartMutation gauge
chi_clickhouse_metric_PartMutation{hostname="LAPTOP"} 0

5.11 - How to convert uniqExact states to approximate uniq functions states

A way to convert to uniqExactState to other uniqStates (like uniqCombinedState) in ClickHouse®

uniqExactState

uniqExactState is stored in two parts: a count of values in LEB128 format + list values without a delimiter. Depending on the orignial datatype of the values to count, the datatype of the list values differ.

Numeric Values

In case of numeric values like UInt8, UInt64 etc. the representation of uniqExactState is just a simple array of the unique values encountered. Therefore it is easy to recover the values from the state which have appeared:

┌─hex(uniqExactState(arrayJoin([1, 3])))─┐
│ 020103                                 │
└────────────────────────────────────────┘
  02        01             03   
  ^         ^              ^
  LEB128    hex(1::UInt8)  hex(3::UInt8)


┌─finalizeAggregation(CAST(unhex('020103'), 'AggregateFunction(groupArray, UInt8)'))─┐
│ [1,3]                                                                              │
└────────────────────────────────────────────────────────────────────────────────────┘

String Values

Internal Representation

In case of values of data type String, ClickHouse® applies a hashing algorithm before storing the values into the internal array, otherwise the amount of space needed could get enormous.

┌─hex(uniqExactState(toString(arrayJoin([1]))))─┐
│ 01E2756D8F7A583CA23016E03447724DE7            │
└───────────────────────────────────────────────┘
  01         E2756D8F7A583CA23016E03447724DE7
  ^          ^
  LEB128     hash of '1'


┌─hex(uniqExactState(toString(arrayJoin([1, 2]))))───────────────────┐
│ 024809CB4528E00621CF626BE9FA14E2BFE2756D8F7A583CA23016E03447724DE7 │
└────────────────────────────────────────────────────────────────────┘
  02     4809CB4528E00621CF626BE9FA14E2BF E2756D8F7A583CA23016E03447724DE7
  ^        ^                                ^
  LEB128 hash of '2'                      hash of '1'

So, our task is to find how we can generate such values by ourself, speak what hash function is used. In case of String data type, it is just the simple sipHash128 function.

┌─hex(sipHash128(toString(2)))─────┬─hex(sipHash128(toString(1)))─────┐
│ 4809CB4528E00621CF626BE9FA14E2BF │ E2756D8F7A583CA23016E03447724DE7 │
└──────────────────────────────────┴──────────────────────────────────┘

Getting the Hash Values

The second task: now that we know how the state is formed, how can we demangle it and convert it into an Array of values. Unfortunatelly it is not possible to get the original values back, as sipHash128 is a one way conversion, but at least we can try to get an Array of hashes. Luckily for us, ClickHouse® use the exact same serialization (LEB128 + list of values) for Arrays (in this case if uniqExactState and Array are serialized into RowBinary format).

One way to “convert” the uniqExactState to an Array of hashes would be via an external helper UDF function to do that conversion:

cat /etc/clickhouse-server/pipe_function.xml
<clickhouse>
  <function>
    <type>executable</type>
    <execute_direct>0</execute_direct>
    <name>pipe</name>
    <return_type>Array(FixedString(16))</return_type>
    <argument>
      <type>String</type>
    </argument>
    <format>RowBinary</format>
    <command>cat</command>
    <send_chunk_header>0</send_chunk_header>
  </function>
</clickhouse>

This UDF – pipe converts uniqExactState to the Array(FixedString(16)):

┌─arrayMap(x -> hex(x), pipe(uniqExactState(toString(arrayJoin([1, 2])))))──────────────┐
│ ['4809CB4528E00621CF626BE9FA14E2BF','E2756D8F7A583CA23016E03447724DE7']               │
└───────────────────────────────────────────────────────────────────────────────────────┘

This way only works if you have direct access to your ClickHouse® installation. However if you are on a managed platform like Altinity.Cloud installing executable UDFs is typically not supported for security reasons. Luckily we know that the internal representation of sipHash128 is FixedString(16) which has exactly 128 bit. UInt128 also takes up exactly 128 bit. Therefore we can consider the uniqExactState(String) as a representation of Array(UInt128).

Again, we can therefore convert our state to an Array:

┌─arrayMap(lambda(tuple(x), hex(reinterpretAsFixedString(x))), finalizeAggregation(CAST(unhex(hex(uniqExactState(arrayJoin(['1', '2'])))), 'AggregateFunction(groupArray, UInt128)')))─┐
│ ['4809CB4528E00621CF626BE9FA14E2BF','E2756D8F7A583CA23016E03447724DE7']                                                                                                              │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

As you can see the Array is identical to the one we created with the pipe function.

Full Example of Conversion

And here is the full example, how you can convert uniqExactState(string) to any approximate uniq function like uniqState(string) or uniqCombinedState(string) by reinterpret and arrayReduce('func', [..]).

-- Generate demo with random data, uniqs are stored as heavy uniqExact
CREATE TABLE aggregates
(
    `id` UInt32,
    `uniqExact` AggregateFunction(uniqExact, String)
)
ENGINE = AggregatingMergeTree
ORDER BY id as
SELECT
    number % 10000 AS id,
    uniqExactState(toString(number))
FROM numbers(10000000)
GROUP BY id;

0 rows in set. Elapsed: 2.042 sec. Processed 10.01 million rows, 80.06 MB (4.90 million rows/s., 39.21 MB/s.)

-- Let's add a new columns to store optimized, approximate uniq & uniqCombined
ALTER TABLE aggregates
    ADD COLUMN `uniq` AggregateFunction(uniq, FixedString(16)), 
    ADD COLUMN `uniqCombined` AggregateFunction(uniqCombined, FixedString(16)); 

-- Materialize values in the new columns
ALTER TABLE aggregates 
UPDATE 
  uniqCombined = arrayReduce('uniqCombinedState', arrayMap(x -> reinterpretAsFixedString(x), finalizeAggregation(unhex(hex(uniqExact))::AggregateFunction(groupArray, UInt128)))), 
  uniq = arrayReduce('uniqState', arrayMap(x -> reinterpretAsFixedString(x), finalizeAggregation(unhex(hex(uniqExact))::AggregateFunction(groupArray, UInt128)))) 
WHERE 1 
SETTINGS mutations_sync=2;

-- Check results, results are slighty different, because uniq & uniqCombined are approximate functions
SELECT
    id % 20 AS key,
    uniqExactMerge(uniqExact),
    uniqCombinedMerge(uniqCombined),
    uniqMerge(uniq)
FROM aggregates
GROUP BY key

┌─key─┬─uniqExactMerge(uniqExact)─┬─uniqCombinedMerge(uniqCombined)─┬─uniqMerge(uniq)─┐
│   0 │                    500000 │                          500195 │          500455 │
│   1 │                    500000 │                          502599 │          501549 │
│   2 │                    500000 │                          498058 │          504428 │
│   3 │                    500000 │                          499748 │          500195 │
│   4 │                    500000 │                          500791 │          500836 │
│   5 │                    500000 │                          502430 │          497558 │
│   6 │                    500000 │                          500262 │          501785 │
│   7 │                    500000 │                          501514 │          495758 │
│   8 │                    500000 │                          500121 │          498597 │
│   9 │                    500000 │                          502173 │          500455 │
│  10 │                    500000 │                          499144 │          498386 │
│  11 │                    500000 │                          500525 │          503139 │
│  12 │                    500000 │                          503624 │          497103 │
│  13 │                    500000 │                          499986 │          497992 │
│  14 │                    500000 │                          502027 │          494833 │
│  15 │                    500000 │                          498831 │          500983 │
│  16 │                    500000 │                          501103 │          500836 │
│  17 │                    500000 │                          499409 │          496791 │
│  18 │                    500000 │                          501641 │          502991 │
│  19 │                    500000 │                          500648 │          500881 │
└─────┴───────────────────────────┴─────────────────────────────────┴─────────────────┘

20 rows in set. Elapsed: 2.312 sec. Processed 10.00 thousand rows, 7.61 MB (4.33 thousand rows/s., 3.29 MB/s.)

Now, lets repeat the same insert, but in that case we will also populate uniq & uniqCombined with values converted via sipHash128 function. If we did everything right, uniq counts will not change, because we inserted the exact same values.

INSERT INTO aggregates SELECT
    number % 10000 AS id,
    uniqExactState(toString(number)),
    uniqState(sipHash128(toString(number))),
    uniqCombinedState(sipHash128(toString(number)))
FROM numbers(10000000)
GROUP BY id;

0 rows in set. Elapsed: 5.386 sec. Processed 10.01 million rows, 80.06 MB (1.86 million rows/s., 14.86 MB/s.)


SELECT
    id % 20 AS key,
    uniqExactMerge(uniqExact),
    uniqCombinedMerge(uniqCombined),
    uniqMerge(uniq)
FROM aggregates
GROUP BY key

┌─key─┬─uniqExactMerge(uniqExact)─┬─uniqCombinedMerge(uniqCombined)─┬─uniqMerge(uniq)─┐
│   0 │                    500000 │                          500195 │          500455 │
│   1 │                    500000 │                          502599 │          501549 │
│   2 │                    500000 │                          498058 │          504428 │
│   3 │                    500000 │                          499748 │          500195 │
│   4 │                    500000 │                          500791 │          500836 │
│   5 │                    500000 │                          502430 │          497558 │
│   6 │                    500000 │                          500262 │          501785 │
│   7 │                    500000 │                          501514 │          495758 │
│   8 │                    500000 │                          500121 │          498597 │
│   9 │                    500000 │                          502173 │          500455 │
│  10 │                    500000 │                          499144 │          498386 │
│  11 │                    500000 │                          500525 │          503139 │
│  12 │                    500000 │                          503624 │          497103 │
│  13 │                    500000 │                          499986 │          497992 │
│  14 │                    500000 │                          502027 │          494833 │
│  15 │                    500000 │                          498831 │          500983 │
│  16 │                    500000 │                          501103 │          500836 │
│  17 │                    500000 │                          499409 │          496791 │
│  18 │                    500000 │                          501641 │          502991 │
│  19 │                    500000 │                          500648 │          500881 │
└─────┴───────────────────────────┴─────────────────────────────────┴─────────────────┘

20 rows in set. Elapsed: 3.318 sec. Processed 20.00 thousand rows, 11.02 MB (6.03 thousand rows/s., 3.32 MB/s.)

Let’s compare the data size, uniq won in this case, but check this article Functions to count uniqs , mileage may vary.

optimize table aggregates final;

SELECT
    column,
    formatReadableSize(sum(column_data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes) AS usize) AS uncompressed
FROM system.parts_columns
WHERE (active = 1)  AND (table LIKE 'aggregates') and column like '%uniq%'
GROUP BY column
ORDER BY size DESC;

┌─column───────┬─compressed─┬─uncompressed─┐
│ uniqExact    │ 153.21 MiB │ 152.61 MiB   │
│ uniqCombined │ 76.62 MiB  │ 76.32 MiB    │
│ uniq         │ 38.33 MiB  │ 38.18 MiB    │
└──────────────┴────────────┴──────────────┘

5.12 - Custom Settings

Using custom settings

Using custom settings in config

You can not use the custom settings in config file ‘as is’, because ClickHouse® don’t know which datatype should be used to parse it.

cat /etc/clickhouse-server/users.d/default_profile.xml 
<?xml version="1.0"?>
<yandex>
    <profiles>
        <default>
     	     <custom_data_version>1</custom_data_version> <!-- will not work! see below -->
        </default>
    </profiles>
</yandex>

That will end up with the following error:

2021.09.24 12:50:37.369259 [ 264905 ] {} <Error> ConfigReloader: Error updating configuration from '/etc/clickhouse-server/users.xml' config.: Code: 536. DB::Exception: Couldn't restore Field from dump: 1: while parsing value '1' for setting 'custom_data_version'. (CANNOT_RESTORE_FROM_FIELD_DUMP), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x9440eba in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
1. DB::Field::restoreFromDump(std::__1::basic_string_view<char, std::__1::char_traits<char> > const&)::$_4::operator()() const @ 0x10449da0 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
2. DB::Field::restoreFromDump(std::__1::basic_string_view<char, std::__1::char_traits<char> > const&) @ 0x10449bf1 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
3. DB::BaseSettings<DB::SettingsTraits>::stringToValueUtil(std::__1::basic_string_view<char, std::__1::char_traits<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x1042e2bf in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
4. DB::UsersConfigAccessStorage::parseFromConfig(Poco::Util::AbstractConfiguration const&) @ 0x1041a097 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
5. void std::__1::__function::__policy_invoker<void (Poco::AutoPtr<Poco::Util::AbstractConfiguration>, bool)>::__call_impl<std::__1::__function::__default_alloc_func<DB::UsersConfigAccessStorage::load(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<std::__1::shared_ptr<zkutil::ZooKeeper> ()> const&)::$_0, void (Poco::AutoPtr<Poco::Util::AbstractConfiguration>, bool)> >(std::__1::__function::__policy_storage const*, Poco::AutoPtr<Poco::Util::AbstractConfiguration>&&, bool) @ 0x1042e7ff in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
6. DB::ConfigReloader::reloadIfNewer(bool, bool, bool, bool) @ 0x11caf54e in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
7. DB::ConfigReloader::run() @ 0x11cb0f8f in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
8. ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::ConfigReloader::*)(), DB::ConfigReloader*>(void (DB::ConfigReloader::*&&)(), DB::ConfigReloader*&&)::'lambda'()::operator()() @ 0x11cb19f1 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
9. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x9481f5f in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
10. void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda0'()> >(void*) @ 0x9485843 in /usr/lib/debug/.build-id/ba/25f6646c3be7aa95f452ec85461e96178aa365.debug
11. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
12. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
 (version 21.10.1.8002 (official build))


2021.09.29 11:36:07.722213 [ 2090 ] {} <Error> Application: DB::Exception: Couldn't restore Field from dump: 1: while parsing value '1' for setting 'custom_data_version'

To make it work you need to change it an the following way:

cat /etc/clickhouse-server/users.d/default_profile.xml 
<?xml version="1.0"?>
<yandex>
    <profiles>
        <default>
            <custom_data_version>UInt64_1</custom_data_version>
        </default>
    </profiles>
</yandex>

cat /etc/clickhouse-server/users.d/default_profile.xml 
<?xml version="1.0"?>
<yandex>
    <profiles>
        <default>
            <custom_data_version>'1'</custom_data_version>
        </default>
    </profiles>
</yandex>

The list of recognized prefixes is in the sources: https://github.com/ClickHouse/ClickHouse/blob/ea13a8b562edbc422c07b5b4ecce353f79b6cb63/src/Core/Field.cpp#L253-L270

5.13 - Description of asynchronous_metrics

Description of asynchronous_metrics

CompiledExpressionCacheCount    -- number or compiled cached expression (if CompiledExpressionCache is enabled)

jemalloc -- parameters of jemalloc allocator, they are not very useful, and not interesting

MarkCacheBytes / MarkCacheFiles  -- there are cache for .mrk files (default size is 5GB), you can see is it use all 5GB or not

MemoryCode  -- how much memory allocated for ClickHouse® executable 

MemoryDataAndStack -- virtual memory allocated for data and stack

MemoryResident  -- real memory used by ClickHouse ( the same as top RES/RSS)

MemoryShared   -- shared memory used by ClickHouse

MemoryVirtual  -- virtual memory used by ClickHouse ( the same as top VIRT)

NumberOfDatabases

NumberOfTables

ReplicasMaxAbsoluteDelay -- important parameter - replica max absolute delay in seconds

ReplicasMaxRelativeDelay -- replica max relative delay (from other replicas) in seconds

ReplicasMaxInsertsInQueue  -- max number of parts to fetch for a single Replicated table

ReplicasSumInsertsInQueue  -- sum of parts to fetch for all Replicated tables

ReplicasMaxMergesInQueue  -- max number of merges in queue for a single Replicated table

ReplicasSumMergesInQueue  -- total number of merges in queue for all Replicated tables

ReplicasMaxQueueSize -- max number of tasks  for a single Replicated table 

ReplicasSumQueueSize -- total number of tasks in replication queue

UncompressedCacheBytes/UncompressedCacheCells  -- allocated memory for uncompressed cache (disabled by default)

Uptime     -- uptime seconds

5.14 - ClickHouse® data/disk encryption (at rest)

Example how to encrypt data in tables using storage policies.

Create folder

mkdir /data/clickhouse_encrypted
chown clickhouse.clickhouse /data/clickhouse_encrypted

Configure encrypted disk and storage

cat /etc/clickhouse-server/config.d/encrypted_storage.xml
<clickhouse>
    <storage_configuration>
        <disks>
            <disk1>
                <type>local</type>
                <path>/data/clickhouse_encrypted/</path>
            </disk1>
            <encrypted_disk>
                <type>encrypted</type>
                <disk>disk1</disk>
                <path>encrypted/</path>
                <algorithm>AES_128_CTR</algorithm>
                <key_hex id="0">00112233445566778899aabbccddeeff</key_hex>
                <current_key_id>0</current_key_id>
            </encrypted_disk>
        </disks>
        <policies>
            <encrypted>
                <volumes>
                    <encrypted_volume>
                        <disk>encrypted_disk</disk>
                    </encrypted_volume>
                </volumes>
            </encrypted>
        </policies>
    </storage_configuration>
</clickhouse>

systemctl restart clickhouse-server

select name, path, type, is_encrypted from system.disks;
┌─name───────────┬─path──────────────────────────────────┬─type──┬─is_encrypted─┐
│ default        │ /var/lib/clickhouse/                  │ local │            0 │
│ disk1          │ /data/clickhouse_encrypted/           │ local │            0 │
│ encrypted_disk │ /data/clickhouse_encrypted/encrypted/ │ local │            1 │
└────────────────┴───────────────────────────────────────┴───────┴──────────────┘

select * from system.storage_policies;
┌─policy_name─┬─volume_name──────┬─volume_priority─┬─disks──────────────┬─volume_type─┬─max_data_part_size─┬─move_factor─┬─prefer_not_to_merge─┐
│ default     │ default          │               1 │ ['default']        │ JBOD        │                  0 │           0 │                   0 │
│ encrypted   │ encrypted_volume │               1 │ ['encrypted_disk'] │ JBOD        │                  0 │           0 │                   0 │
└─────────────┴──────────────────┴─────────────────┴────────────────────┴─────────────┴────────────────────┴─────────────┴─────────────────────┘

Create table

CREATE TABLE bench_encrypted(c_int Int64, c_str varchar(255), c_float Float64) 
engine=MergeTree order by c_int
settings storage_policy = 'encrypted';

cat /data/clickhouse_encrypted/encrypted/store/906/9061167e-d5f7-45ea-8e54-eb6ba3b678dc/format_version.txt
ENC�AdruM�˪h"��^�

Compare performance of encrypted and not encrypted tables

CREATE TABLE bench_encrypted(c_int Int64, c_str varchar(255), c_float Float64) 
engine=MergeTree order by c_int
settings storage_policy = 'encrypted';

insert into bench_encrypted
select toInt64(cityHash64(number)), lower(hex(MD5(toString(number)))), number/cityHash64(number)*10000000 
from numbers_mt(100000000);

0 rows in set. Elapsed: 33.357 sec. Processed 100.66 million rows, 805.28 MB (3.02 million rows/s., 24.14 MB/s.)


CREATE TABLE bench_unencrypted(c_int Int64, c_str varchar(255), c_float Float64) 
engine=MergeTree order by c_int;

insert into bench_unencrypted
select toInt64(cityHash64(number)), lower(hex(MD5(toString(number)))), number/cityHash64(number)*10000000 
from numbers_mt(100000000);

0 rows in set. Elapsed: 31.175 sec. Processed 100.66 million rows, 805.28 MB (3.23 million rows/s., 25.83 MB/s.)


select avg(c_float) from bench_encrypted;
1 row in set. Elapsed: 0.195 sec. Processed 100.00 million rows, 800.00 MB (511.66 million rows/s., 4.09 GB/s.)

select avg(c_float) from bench_unencrypted;
1 row in set. Elapsed: 0.150 sec. Processed 100.00 million rows, 800.00 MB (668.71 million rows/s., 5.35 GB/s.)


select sum(c_int) from bench_encrypted;
1 row in set. Elapsed: 0.281 sec. Processed 100.00 million rows, 800.00 MB (355.74 million rows/s., 2.85 GB/s.)

select sum(c_int) from bench_unencrypted;
1 row in set. Elapsed: 0.193 sec. Processed 100.00 million rows, 800.00 MB (518.88 million rows/s., 4.15 GB/s.)


set max_threads=1;

select avg(c_float) from bench_encrypted;
1 row in set. Elapsed: 0.934 sec. Processed 100.00 million rows, 800.00 MB (107.03 million rows/s., 856.23 MB/s.)

select avg(c_float) from bench_unencrypted;
1 row in set. Elapsed: 0.874 sec. Processed 100.00 million rows, 800.00 MB (114.42 million rows/s., 915.39 MB/s.)

read key_hex from environment variable

cat /etc/clickhouse-server/config.d/encrypted_storage.xml
<clickhouse>
    <storage_configuration>
        <disks>
            <disk1>
                <type>local</type>
                <path>/data/clickhouse_encrypted/</path>
            </disk1>
            <encrypted_disk>
                <type>encrypted</type>
                <disk>disk1</disk>
                <path>encrypted/</path>
                <algorithm>AES_128_CTR</algorithm>
                <key_hex from_env="DiskKey"/>
            </encrypted_disk>
        </disks>
        <policies>
            <encrypted>
                <volumes>
                    <encrypted_volume>
                        <disk>encrypted_disk</disk>
                    </encrypted_volume>
                </volumes>
            </encrypted>
        </policies>
    </storage_configuration>
</clickhouse>

cat /etc/default/clickhouse-server
DiskKey=00112233445566778899aabbccddeeff

systemctl restart clickhouse-server

5.15 - DR two DC

Disaster Recovery configuration between two data centers

Clickhouse uses Keeper (or ZooKeeper) to inform other cluster nodes about changes. Clickhouse nodes then fetch new parts directly from other nodes in the cluster. The Keeper cluster is a key for building a DR schema. You can consider Keeper a “true” cluster while clickhouse-server nodes as storage access instruments.

To implement a disaster recovery (DR) setup for ClickHouse across two physically separated data centers (A and B), with only one side active at a time, you can create a single ClickHouse cluster spanning both data centers. This setup will address data synchronization, replication, and coordination needs.

Cluster Configuration

Create a single ClickHouse cluster with nodes in both data centers.
Configure the appropriate number of replicas and shards based on your performance and redundancy requirements.
Use ClickHouse Keeper or ZooKeeper for cluster coordination (see Keeper flavors discussion below).

Data Synchronization and Replication

ClickHouse replicas operate in a master-master configuration, eliminating the need for a separate slave approach.
Configure replicas across both data centers to ensure data synchronization.
While both DCs have active replicas, consider DC B replicas as “passive” from the application’s perspective.

Example Configuration:

<remote_servers>
    <company_cluster>
        <shard>
            <replica>
                <host>ch1.dc-a.company.com</host>
            </replica>
            <replica>
                <host>ch2.dc-a.company.com</host>
            </replica>
            <replica>
                <host>ch1.dc-b.company.com</host>
            </replica>
            <replica>
                <host>ch2.dc-b.company.com</host>
            </replica>
        </shard>
<!-- Add more shards as needed -->
    </company_cluster>
</remote_servers>

Keeper Setup

In the active data center (DC A):
- Deploy 3 active Keeper nodes
In the passive data center (DC B):
- Deploy 1 Keeper node in observer role

Failover Process:

In case of a failover:

Shut down the ClickHouse cluster in DC A completely
Manually switch Keeper in DC B from observer to active participant (restart needed).
Create two additional Keeper nodes (they will replicate the state automatically).
Add two additional Keeper nodes to clickhouse configs

ClickHouse Keeper vs. ZooKeeper

While ClickHouse Keeper is generally preferable for very high-load scenarios, ZooKeeper remains a viable option for many deployments.

Considerations:

ClickHouse Keeper is optimized for ClickHouse operations and can handle higher loads.
ZooKeeper is well-established and works well for many clients.

The choice between ClickHouse Keeper and ZooKeeper is more about the overall system architecture and load patterns.

Configuration Synchronization

To keep configurations in sync:

Use ON CLUSTER clause for DDL statements
Store RBAC objects in Keeper
Implement a configuration management system (e.g., Ansible, Puppet) to simultaneously apply changes to clickhouse configuration files in config.d

5.16 - How ALTERs work in ClickHouse®

How ALTERs work in ClickHouse®:

ADD (COLUMN/INDEX/PROJECTION)

Lightweight, will only change table metadata. So new entity will be added in case of creation of new parts during INSERT’s OR during merges of old parts.

In case of COLUMN, ClickHouse will calculate column value on fly in query context.

CREATE TABLE test_materialization
(
    `key` UInt32,
    `value` UInt32
)
ENGINE = MergeTree
ORDER BY key;

INSERT INTO test_materialization(key, value) SELECT 1, 1;
INSERT INTO test_materialization(key, value) SELECT 2, 2;

ALTER TABLE test_materialization ADD COLUMN inserted_at DateTime DEFAULT now();

SELECT key, inserted_at FROM test_materialization;

┌─key─┬─────────inserted_at─┐
│   1 │ 2022-09-01 03:28:58 │
└─────┴─────────────────────┘
┌─key─┬─────────inserted_at─┐
│   2 │ 2022-09-01 03:28:58 │
└─────┴─────────────────────┘

SELECT key, inserted_at FROM test_materialization;

┌─key─┬─────────inserted_at─┐
│   1 │ 2022-09-01 03:29:11 │
└─────┴─────────────────────┘
┌─key─┬─────────inserted_at─┐
│   2 │ 2022-09-01 03:29:11 │
└─────┴─────────────────────┘

Each query will return different inserted_at value, because each time now() function being executed. 


INSERT INTO test_materialization(key, value) SELECT 3, 3;

SELECT key, inserted_at FROM test_materialization;

┌─key─┬─────────inserted_at─┐
│   3 │ 2022-09-01 03:29:36 │   -- < This value was materialized during ingestion, that's why it's smaller than value for keys 1 & 2
└─────┴─────────────────────┘
┌─key─┬─────────inserted_at─┐
│   1 │ 2022-09-01 03:29:53 │
└─────┴─────────────────────┘
┌─key─┬─────────inserted_at─┐
│   2 │ 2022-09-01 03:29:53 │
└─────┴─────────────────────┘

OPTIMIZE TABLE test_materialization FINAL;

SELECT key, inserted_at FROM test_materialization;

┌─key─┬─────────inserted_at─┐
│   1 │ 2022-09-01 03:30:52 │
│   2 │ 2022-09-01 03:30:52 │
│   3 │ 2022-09-01 03:29:36 │
└─────┴─────────────────────┘

SELECT key, inserted_at FROM test_materialization;

┌─key─┬─────────inserted_at─┐
│   1 │ 2022-09-01 03:30:52 │
│   2 │ 2022-09-01 03:30:52 │
│   3 │ 2022-09-01 03:29:36 │
└─────┴─────────────────────┘

So, data inserted after addition of column can have lower inserted_at value then old data without materialization.

If you want to backpopulate data for old parts, you have multiple options:

MATERIALIZE (COLUMN/INDEX/PROJECTION) (PART[ITION ID] ‘’)

Will materialize this entity.

OPTIMIZE TABLE xxxx (PART[ITION ID] ‘’) (FINAL)

Will trigger merge, which will lead to materialization of all entities in affected parts.

ALTER TABLE xxxx UPDATE column_name = column_name WHERE 1;

Will trigger mutation, which will materialize this column.

DROP (COLUMN/INDEX/PROJECTION)

Lightweight, it’s only about changing of table metadata and removing corresponding files from filesystem. For Compact parts it will trigger merge, which can be heavy. issue

DROP DETACHED command

The DROP DETACHED command in ClickHouse® is used to remove parts or partitions that have previously been detached (i.e., moved to the detached directory and forgotten by the server). The syntax is:

Warning

Be careful before dropping any detached part or partition. Validate that data is no longer needed and keep a backup before running destructive commands.

ALTER TABLE table_name [ON CLUSTER cluster] DROP DETACHED PARTITION|PART ALL|partition_expr

MODIFY COLUMN (DATE TYPE)

Change column type in table schema.
Schedule mutation to change type for old parts.

Mutations

Affected parts - parts with rows matching condition.

ALTER TABLE xxxxx DELETE WHERE column_1 = 1;

Will overwrite all column data in affected parts.
For all part(ition)s will create new directories on disk and write new data to them or create hardlinks if they untouched.
Register new parts names in ZooKeeper.

ALTER TABLE xxxxx DELETE IN PARTITION ID ’’ WHERE column_1 = 1;

Will do the same but only for specific partition.

ALTER TABLE xxxxx UPDATE SET column_2 = column_2, column_3 = column_3 WHERE column_1 = 1;

Will overwrite column_2, column_3 data in affected parts.
For all part(ition)s will create new directories on disk and write new data to them or create hardlinks if they untouched.
Register new parts names in ZooKeeper.

DELETE FROM xxxxx WHERE column_1 = 1;

Will create & populate hidden boolean column in affected parts. (_row_exists column)
For all part(ition)s will create new directories on disk and write new data to them or create hardlinks if they untouched.
Register new parts names in ZooKeeper.

Despite that LWD mutations will not rewrite all columns, steps 2 & 3 in case of big tables can take significant time.

5.17 - How to recreate a table in case of total corruption of the replication queue

How to recreate a table in case of total corruption of the replication queue.

How to fix a replication using hard-reset way

Find the best replica (replica with the most fresh/consistent) data.
Backup the table alter table mydatabase.mybadtable freeze;
Stop all applications!!! Stop ingestion. Stop queries - table will be empty for some time.
Check that detached folder is empty or clean it.

SELECT concat('alter table ', database, '.', table, ' drop detached part \'', name, '\' settings allow_drop_detached=1;')
FROM system.detached_parts
WHERE (database = 'mydatabase') AND (table = 'mybadtable')
FORMAT TSVRaw;

Make sure that detached folder is empty select count() from system.detached_parts where database='mydatabase' and table ='mybadtable';
Detach all parts (table will became empty)

SELECT concat('alter table ', database, '.', table, ' detach partition id \'', partition_id, '\';') AS detach
FROM system.parts
WHERE (active = 1) AND (database = 'mydatabase') AND (table = 'mybadtable')
GROUP BY detach
ORDER BY detach ASC
FORMAT TSVRaw;

Make sure that table is empty select count() from mydatabase.mybadtable;
Attach all parts back

SELECT concat('alter table ', database, '.', table, ' attach part \'', a.name, '\';')
FROM system.detached_parts AS a
WHERE (database = 'mydatabase') AND (table = 'mybadtable')
FORMAT TSVRaw;

Make sure that data is consistent at all replicas

SELECT
    formatReadableSize(sum(bytes)) AS size,
    sum(rows),
    count() AS part_count,
    uniqExact(partition) AS partition_count
FROM system.parts
WHERE (active = 1) AND (database = 'mydatabase') AND (table = 'mybadtable');

5.18 - http handler example

http handler example

http handler example (how to disable /play)

# cat /etc/clickhouse-server/config.d/play_disable.xml
<?xml version="1.0" ?>
<yandex>
     <http_handlers>
        <rule>
            <url>/play</url>
            <methods>GET</methods>
            <handler>
                <type>static</type>
                <status>403</status>
                <content_type>text/plain; charset=UTF-8</content_type>
                <response_content></response_content>
            </handler>
        </rule>
        <defaults/>         <!-- handler to save default handlers ?query / ping -->
    </http_handlers>
</yandex>

5.19 - Jemalloc heap profiling

Example of .xml config to enable remote pprof style access

Config

<!-- cat config.d/jemalloc_dict.xml -->
<clickhouse>
	<dictionaries_config>/etc/clickhouse-server/config.d/*_dict.xml</dictionaries_config>
	<http_handlers>
		<rule>
			<url>/pprof/heap</url>
			<methods>GET,POST</methods>
			<handler>
				<type>static</type>
				<response_content>file://jemalloc_clickhouse.heap</response_content>
			</handler>
		</rule>
		<rule>
			<url>/pprof/cmdline</url>
			<methods>GET</methods>
			<handler>
				<type>predefined_query_handler</type>
				<query>SELECT '/var/lib/clickhouse' FORMAT TSVRaw</query>
			</handler>
		</rule>
		<rule>
			<url>/pprof/symbol</url>
			<methods>GET</methods>
			<handler>
				<type>predefined_query_handler</type>
				<query>SELECT 'num_symbols: ' || count() FROM system.symbols FORMAT TSVRaw SETTINGS allow_introspection_functions = 1</query>
			</handler>
		</rule>
		<rule>
			<url>/pprof/symbol</url>
			<methods>POST</methods>
			<handler>
				<type>predefined_query_handler</type>
				<query>WITH arrayJoin(splitByChar('+', {_request_body:String})) as addr SELECT addr || '    ' || demangle(addressToSymbol(reinterpretAsUInt64(reverse(substr(unhex(addr),2))))) SETTINGS allow_introspection_functions = 1 FORMAT TSVRaw</query>
			</handler>
		</rule>
		<defaults/>
	</http_handlers>
	<dictionary>
		<name>jemalloc_ls</name>
		<structure>
			<key>
				<attribute>
					<name>id</name>
					<type>String</type>
				</attribute>
			</key>
			<attribute>
				<name>file</name>
				<type>String</type>
				<null_value />
			</attribute>
			<attribute>
				<name>size</name>
				<type>UInt32</type>
				<null_value />
			</attribute>
			<attribute>
				<name>time</name>
				<type>DateTime</type>
				<null_value />
			</attribute>
		</structure>
		<source>
			<executable>
				<command>for f in /tmp/jemalloc_clickhouse.*; do [ -f &quot;$f&quot; ] || continue; echo -e &quot;$(basename &quot;$f&quot; | cut -d. -f2-3)\t$f\t$(stat -c%s &quot;$f&quot;)\t$(stat -c%Y &quot;$f&quot;)&quot;; done</command>
				<execute_direct>false</execute_direct>
				<format>TSV</format>
			</executable>
		</source>
		<layout>
			<complex_key_direct/>
		</layout>
		<lifetime>300</lifetime>
	</dictionary>
	<dictionary>
		<name>jemalloc_cp</name>
		<structure>
			<id>
				<name>id</name>
				<type>UInt32</type>
			</id>
			<attribute>
				<name>status</name>
				<type>UInt32</type>
				<null_value />
			</attribute>
		</structure>
		<source>
			<executable>
				<command>ver=${1:-$(head -n1 | tr -d &quot;[:space:]&quot;)}; file=$(ls -t -- /tmp/jemalloc_clickhouse.*.&quot;$ver&quot;.heap 2&gt;/dev/null | head -n1); if [ -n &quot;$file&quot; ] &amp;&amp; cp -- &quot;$file&quot; /var/lib/clickhouse/user_files/jemalloc_clickhouse.heap; then printf &apos;1\t\n&apos;; else printf &apos;0\t\n&apos;; fi</command>
				<execute_direct>false</execute_direct>
				<format>TSV</format>
			</executable>
		</source>
		<layout>
			<direct/>
		</layout>
		<lifetime>300</lifetime>
	</dictionary>
</clickhouse>

$ curl https://user:password@cluster.env.altinity.cloud:8443/pprof/cmdline
/var/lib/clickhouse

$ curl https://user:password@cluster.env.altinity.cloud:8443/pprof/symbol
num_symbols: 702648

$ curl -d '0x0F99B044+0x008512D0' https://user:password@cluster.env.altinity.cloud:8443/pprof/symbol
0x0F99B044    DB::StorageSystemFilesystemCache::getColumnsDescription()
0x008512D0    icudt75_dat

cluster :) SYSTEM JEMALLOC ENABLE PROFILE;

SYSTEM JEMALLOC ENABLE PROFILE

Ok.

0 rows in set. Elapsed: 0.270 sec.

cluster :) SELECT uniqExact(number) FROM numbers_mt(1000000000);

SELECT uniqExact(number)
FROM numbers_mt(1000000000)

┌─uniqExact(number)─┐
│        1000000000 │ -- 1.00 billion
└───────────────────┘

1 row in set. Elapsed: 6.585 sec. Processed 1.00 billion rows, 8.00 GB (151.86 million rows/s., 1.21 GB/s.)
Peak memory usage: 25.19 GiB.

cluster :) SYSTEM JEMALLOC FLUSH PROFILE;

SYSTEM JEMALLOC FLUSH PROFILE

Ok.

0 rows in set. Elapsed: 0.272 sec.

cluster :) SELECT * FROM dictionary('jemalloc_ls');

SELECT *
FROM dictionary('jemalloc_ls')

┌─id─────┬─file──────────────────────────────┬───size─┬────────────────time─┐
│        │                                   │      0 │ 1970-01-01 00:00:00 │
│ -e 8.0 │ /tmp/jemalloc_clickhouse.8.0.heap │ 108004 │ 2025-09-01 00:44:13 │
│ -e 8.1 │ /tmp/jemalloc_clickhouse.8.1.heap │ 111115 │ 2025-09-01 00:46:46 │
│ -e 8.2 │ /tmp/jemalloc_clickhouse.8.2.heap │ 128098 │ 2025-09-01 00:47:07 │
│ -e 8.3 │ /tmp/jemalloc_clickhouse.8.3.heap │ 123980 │ 2025-09-01 00:48:14 │
│ -e 8.4 │ /tmp/jemalloc_clickhouse.8.4.heap │ 124230 │ 2025-09-01 00:48:15 │
│ -e 8.5 │ /tmp/jemalloc_clickhouse.8.5.heap │ 117733 │ 2025-09-01 12:18:53 │
└────────┴───────────────────────────────────┴────────┴─────────────────────┘

7 rows in set. Elapsed: 0.021 sec.

cluster :) SELECT dictGet('jemalloc_cp', 'status', 4);

SELECT dictGet('jemalloc_cp', 'status', 4)

┌─dictGet('jem⋯status', 4)─┐
│                        0 │
└──────────────────────────┘

1 row in set. Elapsed: 0.014 sec.

$ jeprof --svg https://user:password@cluster.env.altinity.cloud:8443/pprof/heap > ./mem.svg
Fetching /pprof/heap profile from https://user:password@cluster.env.altinity.cloud:8443/pprof/heap to
  /home/user/jeprof/clickhouse.1756728952.user.pprof.heap
Wrote profile to /home/user/jeprof/clickhouse.1756728952.user.pprof.heap
Dropping nodes with <= 90.7 MB; edges with <= 18.1 abs(MB)

cluster :) SELECT dictGet('jemalloc_cp', 'status', 5);

SELECT dictGet('jemalloc_cp', 'status', 5)

┌─dictGet('jem⋯status', 5)─┐
│                        0 │
└──────────────────────────┘

1 row in set. Elapsed: 0.014 sec.

$ jeprof --svg https://user:password@cluster.env.altinity.cloud:8443/pprof/heap --base /home/user/jeprof/clickhouse.1756728952.user.pprof.heap > ./mem_diff.svg
Fetching /pprof/heap profile from https://user:password@cluster.env.altinity.cloud:8443/pprof/heap to
  /home/user/jeprof/clickhouse.1756729237.user.pprof.heap
Wrote profile to /home/user/jeprof/clickhouse.1756729237.user.pprof.heap

cluster :) SYSTEM JEMALLOC DISABLE PROFILE;

SYSTEM JEMALLOC DISABLE PROFILE

Ok.

0 rows in set. Elapsed: 0.271 sec.

5.20 - Keeper-Dependent Features in ClickHouse

Keeper-Dependent Features in ClickHouse

This is a consolidated list of features that depend on ClickHouse Keeper (or ZooKeeper-compatible API).

Keeper below means either ClickHouse Keeper or Apache ZooKeeper, depending on deployment.

#	Feature	How configured	Default path / example	Keeper structure (brief)
1	Replicated tables (`ReplicatedMergeTree` family)	Use `ENGINE = ReplicatedMergeTree(...)`, or omit args and rely on server defaults `default_replica_path`, `default_replica_name`.	Default path: `/clickhouse/tables/{uuid}/{shard}`; replica: `{replica}`.	`<table_path>/metadata`, `columns`, `log`, `blocks`, `async_blocks`, `deduplication_hashes`, `block_numbers`, `leader_election`, `replicas/<replica>/queue, parts, flags, ...`, `mutations`, `quorum`.
2	`S3Queue`	Table setting `keeper_path`; if omitted CH builds path from `s3queue_default_zookeeper_path` + DB UUID + table UUID.	Default prefix: `/clickhouse/s3queue/`.	`<keeper_path>/metadata`, `processed`, `failed`, `processing`, `persistent_processing`, `registry`; ordered mode may also use `buckets/...` subtrees.
3	`Kafka` Keeper-offset mode (`StorageKafka2`, experimental)	Enable setting `allow_experimental_kafka_offsets_storage_in_keeper=1` and set both `kafka_keeper_path`, `kafka_replica_name`.	No default keeper path (must be set), docs example: `/clickhouse/{database}/{uuid}`.	`<kafka_keeper_path>/topics/<topic>/partitions`, `topic_partition_locks`, `replicas/<replica>`, temporary `dropped` for drop coordination.
4	Distributed DDL queue (`ON CLUSTER`)	Server settings: `distributed_ddl.path`, `distributed_ddl.replicas_path`.	Defaults: `/clickhouse/task_queue/ddl/` and `/clickhouse/task_queue/replicas/`.	Queue entries as `query-XXXXXXXXXX` nodes. Each entry has status dirs: `active/<host_id>`, `finished/<host_id>`, optional `synced/<host_id>`, `shards/<shard>/...`. Replicas liveness is tracked under `<replicas_path>/<host_id>/active` (ephemeral).
5	`KeeperMap` table engine	Server config must define `keeper_map_path_prefix`; table uses engine arg `root_path`.	No built-in default (disabled if prefix is absent). Common example: `/keeper_map_tables`.	`<prefix>/<root_path>/metadata`, `metadata/tables/<table_unique_id>`, `data/<serialized_key>`, drop/cleanup coordination nodes under `metadata/...`.
6	Replicated databases (`ENGINE=Replicated`)	`ENGINE = Replicated(zoo_path, shard, replica)` or omit args and use DB-replicated defaults.	Default DB path: `/clickhouse/databases/{uuid}`.	`<db_path>/log/query-`, `replicas/<full_replica_name>/log_ptr, digest, replica_group, ...`, `metadata/<table_name>`, `counter/cnt-`, `max_log_ptr`, `logs_to_keep`.
7	Replicated access entities (users/roles/grants/quotas/policies)	Configure `user_directories` with `<replicated><zookeeper_path>...</zookeeper_path></replicated>`.	No mandatory default; common example: `/clickhouse/access`.	`<path>/uuid/<entity_uuid>` stores entity payload. Type maps: `U` (users), `R` (roles), `S` (settings profiles), `P` (row policies), `Q` (quotas), `M` (masking policies), each mapping name -> UUID.
8	Replicated SQL UDFs (`CREATE FUNCTION`)	Set server config `user_defined_zookeeper_path` (otherwise disk storage is used).	No default keeper path; common example: `/clickhouse/udf`.	Root node plus one znode per function, e.g. `function_<escaped_name>.sql` containing CREATE FUNCTION text.
9	Named collections in Keeper	Configure `named_collections_storage.type = keeper	zookeeper`(or encrypted variants) and set`named_collections_storage.path`.	Default storage type is `local`; no default keeper path when keeper mode is selected.
10	Workload scheduler definitions in Keeper (`CREATE WORKLOAD`, `CREATE RESOURCE`)	Set `workload_zookeeper_path` (if absent, disk `workload_path` is used).	No default keeper path; docs example: `/clickhouse/workload/definitions.sql`.	Single watched znode at the configured path, content is a serialized list of workload/resource CREATE statements.
11	Cluster discovery (experimental)	Enable `allow_experimental_cluster_discovery=1`; configure `<remote_servers>...<discovery><path>...`.	No default path; examples use `/clickhouse/discovery/<cluster>`.	Discovery root contains `shards/<server_uuid>` ephemeral nodes with JSON payload (`address`, `shard_id`, version).
12	`BACKUP/RESTORE ... ON CLUSTER` coordination	Server config `backups.zookeeper_path`.	Default: `/clickhouse/backups`.	Operation roots like `backup-<uuid>` / `restore-<uuid>` with coordination subtrees: stage sync, replicated objects acquisition, file mapping, keeper-map coordination, etc.
13	`AzureQueue`	Same object-storage queue keeper model as `S3Queue`, with `keeper_path` setting.	Uses the same queue metadata path logic (`s3queue_default_zookeeper_path` prefix if no explicit path).	Same pattern as `S3Queue`: `metadata`, `processed/failed/processing`, `registry`, optional `buckets` in ordered mode.
14	`generateSerialID()` function	Server setting `series_keeper_path`.	Default: `/clickhouse/series`.	One node per series: `<series_keeper_path>/<series_name>`, value is current counter.
15	Experimental transactions	Configure `transaction_log.zookeeper_path` (and enable related experimental transaction settings).	Default: `/clickhouse/txn`.	`<path>/tail_ptr` and `<path>/log/csn-*` sequential nodes storing commit sequence and transaction IDs.
16	`Shared` database engine (Cloud)	ClickHouse Cloud managed behavior (not typically user-configured in self-managed OSS).	Internal/cloud-managed.	Shared catalog is Keeper-backed; low-level path layout is internal and not documented as a stable public contract.

How ON CLUSTER fits in (important)

ON CLUSTER itself relies on a Keeper-backed distributed DDL queue (see row 4). Some of the features listed above already replicate via Keeper, which can make ON CLUSTER redundant:

Replicated database DDLs (row 6).
Replicated access entities (row 7).
Replicated UDFs (row 8).
Keeper-backed named collections (row 9).

ClickHouse has dedicated settings like ignore_on_cluster_for_replicated_* to control this behavior.

Notes

For many features, path names can be redirected to auxiliary Keeper clusters using <auxiliary_zookeepers> and the cluster_name:/path notation, where supported.
The metadata structure shown above reflects the stable conceptual layout from the current source tree; some minor subnodes may vary by version.

5.21 - Logging

Logging configuration and issues

Q. I get errors:

File not found: /var/log/clickhouse-server/clickhouse-server.log.0.
File not found: /var/log/clickhouse-server/clickhouse-server.log.8.gz.

...

 File not found: /var/log/clickhouse-server/clickhouse-server.err.log.0, Stack trace (when copying this message, always include the lines below):
0. Poco::FileImpl::handleLastErrorImpl(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x11c2b345 in /usr/bin/clickhouse
1. Poco::PurgeOneFileStrategy::purge(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x11c84618 in /usr/bin/clickhouse
2. Poco::FileChannel::log(Poco::Message const&) @ 0x11c314cc in /usr/bin/clickhouse
3. DB::OwnFormattingChannel::logExtended(DB::ExtendedLogMessage const&) @ 0x8681402 in /usr/bin/clickhouse
4. DB::OwnSplitChannel::logSplit(Poco::Message const&) @ 0x8682fa8 in /usr/bin/clickhouse
5. DB::OwnSplitChannel::log(Poco::Message const&) @ 0x8682e41 in /usr/bin/clickhouse

A. Check if you have proper permission to a log files folder, and enough disk space (& inode numbers) on the block device used for logging.

ls -la /var/log/clickhouse-server/
df -Th
df -Thi

Q. How to configure logging in ClickHouse®?

A. See https://github.com/ClickHouse/ClickHouse/blob/ceaf6d57b7f00e1925b85754298cf958a278289a/programs/server/config.xml#L9-L62

5.22 - High Memory Usage During Merge in system.metric_log

Resolving excessive memory consumption during merges in the ClickHouse® system.metric_log table.

Overview

In recent versions of ClickHouse®, the merge process (part compaction) in the system.metric_log table can consume a large amount of memory. The issue arises due to an unfortunate combination of settings, where:

the merge is already large enough to produce wide parts,
but not yet large enough to enable vertical merges.

This problem has become more pronounced in newer ClickHouse® versions because the system.metric_log table has expanded significantly — many new metrics were added, increasing the total number of columns.

Wide vs Compact — storage formats for table parts:
Wide — each column is stored in a separate file (more efficient for large datasets).
Compact — all data is stored in a single file (more efficient for small inserts).
Horizontal vs Vertical merge — algorithms for combining data during merges:
Horizontal merge reads and merges all columns at once — meaning all files are opened simultaneously, and buffers are allocated for each column and each part.
Vertical merge processes columns in batches — first merging only columns from ORDER BY, then the rest one by one. This approach significantly reduces memory usage.

The most memory-intensive scenario is a horizontal merge of wide parts in a table with a large number of columns.

Demonstrating the Problem

The issue can be reproduced easily by adjusting a few settings:

ALTER TABLE system.metric_log MODIFY SETTING min_bytes_for_wide_part = 100;
OPTIMIZE TABLE system.metric_log FINAL;

Example log output:

[c9d66aa9f9d1] 2025.11.10 10:04:59.091067 [97] <Debug> MemoryTracker: Background process (mutate/merge) peak memory usage: 6.00 GiB.

The merge consumed 6 GB of memory — far too much for this table.

Vertical Merges Are Not Affected

If you explicitly force vertical merges, memory consumption normalizes, although the process becomes slightly slower:

ALTER TABLE system.metric_log MODIFY SETTING 
    min_bytes_for_wide_part = 100,
    vertical_merge_algorithm_min_rows_to_activate = 1;

OPTIMIZE TABLE system.metric_log FINAL;

Example log output:

[c9d66aa9f9d1] 2025.11.10 10:06:14.575832 [97] <Debug> MemoryTracker: Background process (mutate/merge) peak memory usage: 13.98 MiB.

Now memory usage drops from 6 GB to only 14 MB.

Root Cause

The problem stems from the fact that:

the threshold for enabling wide parts is configured in bytes (min_bytes_for_wide_part);
while the threshold for enabling vertical merges is configured in rows (vertical_merge_algorithm_min_rows_to_activate).

When a table contains very wide rows (many lightweight columns), this mismatch causes wide parts to appear too early, while vertical merges are triggered much later.

Default Settings

Parameter	Value
`vertical_merge_algorithm_min_rows_to_activate`	131072
`vertical_merge_algorithm_min_bytes_to_activate`	0
`min_bytes_for_wide_part`	10485760 (10 MB)
`min_rows_for_wide_part`	0

The average row size in metric_log is approximately 2.8 KB, meaning wide parts are created after roughly:

10485760 / 2800 ≈ 3744 rows

Meanwhile, the vertical merge algorithm activates only after 131 072 rows — much later.

Possible Solutions

Increase min_bytes_for_wide_part For example, set it to at least 2800 * 131072 ≈ 350 MB. This delays the switch to the wide format until vertical merges can also be used.
Switch to a row-based threshold Use min_rows_for_wide_part instead of min_bytes_for_wide_part.
Lower the threshold for vertical merges Reduce vertical_merge_algorithm_min_rows_to_activate, or add a value for vertical_merge_algorithm_min_bytes_to_activate.

Example Local Fix for `metric_log`

Apply the configuration below, then restart ClickHouse® and drop the metric_log table (so it will be recreated with the updated settings):

<metric_log replace="1">
    <database>system</database>
    <table>metric_log</table>
    <engine>
        ENGINE = MergeTree
        PARTITION BY (event_date)
        ORDER BY (event_time)
        TTL event_date + INTERVAL 14 DAY DELETE
        SETTINGS min_bytes_for_wide_part = 536870912;
    </engine>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</metric_log>

This configuration increases the threshold for wide parts to 512 MB, preventing premature switching to the wide format and reducing memory usage during merges.

The PR #89811 introduces a similar improvement.

Global Fix (All Tables)

In addition to metric_log, other tables may also be affected — particularly those with average row sizes greater than ~80 bytes and hundreds of columns.

<clickhouse>
  <merge_tree>
    <min_bytes_for_wide_part>0</min_bytes_for_wide_part>    <!-- disable size based threshold for wide part -->
    <min_rows_for_wide_part>131072</min_rows_for_wide_part> <!-- use row based instread, same value as vertical_merge_algorithm_min_rows_to_activate -->
  </merge_tree>
</clickhouse>

These settings tell ClickHouse® to keep using compact parts longer and to enable the vertical merge algorithm simultaneously with the switch to the wide format, preventing sudden spikes in memory usage.

Caution: the vertical merge directly from compact parts to wide part can be VERY slow.

⚠️ Potential Risks and Trade-offs

Raising min_bytes_for_wide_part globally keeps more data in compact parts, which can both help and hurt depending on workload. Compact parts store all columns in a single data.bin file — this makes inserts much faster, especially for tables with many columns, since fewer files are created per part. It’s also a big advantage when storing data on S3 or other object storage, where every extra file adds latency and increases API call counts.

The trade-off is that this layout makes reads less efficient for column-selective queries. Reading one or two columns from a large compact part means scanning and decompressing shared blocks instead of isolated files. It can also reduce cache locality, slightly worsen compression (different columns compressed together), and make mutations or ALTERs more expensive because each change rewrites the entire part.

Lowering thresholds for vertical merges further decreases merge memory but may make the first merges slower, as they process columns sequentially. This configuration works best for wide, append-only tables or S3-based storage, while analytical tables with frequent updates or narrow schemas may perform better with defaults. If merge memory or S3 request overhead is your main concern, applying it globally is reasonable — otherwise, start with specific wide tables like system.metric_log, verify performance improvements, and expand gradually.

Additionally the the vertical merge directly from compact parts to wide part can be VERY slow.

✅ Summary

The root issue is a mismatch between byte-based and row-based thresholds for wide parts and vertical merges. Aligning these values — by adjusting one or both parameters — stabilizes memory usage and prevents excessive RAM consumption during merges in system.metric_log and similar tables.

5.23 - Precreate parts using clickhouse-local

Precreate parts using clickhouse-local.

Precreate parts using clickhouse-local

the code below were testes on 23.3

## 1. Imagine we want to process this file:

cat <<EOF > /tmp/data.csv
1,2020-01-01,"String"
2,2020-02-02,"Another string"
3,2020-03-03,"One more string"
4,2020-01-02,"String for first partition"
EOF

rm -rf /tmp/precreate_parts
mkdir -p /tmp/precreate_parts
cd /tmp/precreate_parts

## 2. that is the metadata for the table we want to fill
## schema should match the schema of the table from server
## (the easiest way is just to copy it from the server)

## I've added sleepEachRow(0.5) here just to mimic slow insert

clickhouse-local --path=. --query="CREATE DATABASE local"
clickhouse-local --path=. --query="CREATE TABLE local.test (id UInt64, d Date, s String, x MATERIALIZED sleepEachRow(0.5)) Engine=MergeTree ORDER BY id PARTITION BY toYYYYMM(d);"

## 3. we can insert the input file into that table in different manners:

## a) just plain insert
cat /tmp/data.csv | clickhouse-local --path=. --query="INSERT INTO local.test FORMAT CSV"

## b) use File on the top of stdin (allows to tune the types)
clickhouse-local --path=. --query="CREATE TABLE local.stdin (id UInt64, d Date, s String) Engine=File(CSV, stdin)"
cat /tmp/data.csv | clickhouse-local --path=. --query="INSERT INTO local.test SELECT * FROM local.stdin"

## c) Instead of stdin you can use file engine 
clickhouse-local --path=. --query "CREATE TABLE local.data_csv (id UInt64, d Date, s String) Engine=File(CSV, '/tmp/data.csv')"
clickhouse-local --path=. --query "INSERT INTO local.test SELECT * FROM local.data_csv" 

# 4. now we have already parts created
clickhouse-local --path=. --query "SELECT _part,* FROM local.test ORDER BY id"
ls -la data/local/test/

# if needed we can even preprocess them more agressively - by doing OPTIMIZE ON that 
clickhouse-local --path=. --query "OPTIMIZE TABLE local.test FINAL"

# that works, but clickhouse will keep inactive parts (those 'unmerged') in place.
ls -la data/local/test/

# we can use a bit hacky way to force it to remove inactive parts them
clickhouse-local --path=. --query "ALTER TABLE local.test MODIFY SETTING old_parts_lifetime=0, cleanup_delay_period=0, cleanup_delay_period_random_add=0"

## needed to give background threads time to clean inactive parts (max_block_size allows to stop that quickly if needed)
clickhouse-local --path=. --query "SELECT count() FROM numbers(100) WHERE sleepEachRow(0.1) SETTINGS max_block_size=1"

ls -la data/local/test/
clickhouse-local --path=. --query "SELECT _part,* FROM local.test ORDER BY id"

5.24 - Recovery after complete data loss

When disaster strikes

Atomic & Ordinary databases.

srv1 – good replica

srv2 – lost replica / we will restore it from srv1

test data (3 tables (atomic & ordinary databases))

srv1

create database testatomic on cluster '{cluster}' engine=Atomic;
create table testatomic.test on cluster '{cluster}' (A Int64, D Date, s String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
partition by toYYYYMM(D)
order by A;
insert into testatomic.test select number, today(), '' from numbers(1000000);


create database testordinary on cluster '{cluster}' engine=Ordinary;
create table testordinary.test on cluster '{cluster}' (A Int64, D Date, s String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
partition by toYYYYMM(D)
order by A;
insert into testordinary.test select number, today(), '' from numbers(1000000);


create table default.test on cluster '{cluster}' (A Int64, D Date, s String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
partition by toYYYYMM(D)
order by A;
insert into default.test select number, today(), '' from numbers(1000000);

destroy srv2

srv2

/etc/init.d/clickhouse-server stop
rm -rf /var/lib/clickhouse/*

generate script to re-create databases (create_database.sql).

srv1

$ cat /home/ubuntu/generate_schema.sql
SELECT concat('CREATE DATABASE "', name, '" ENGINE = ', engine, ' COMMENT \'', comment, '\';')
FROM system.databases
WHERE name NOT IN ('INFORMATION_SCHEMA', 'information_schema', 'system', 'default');

clickhouse-client < /home/ubuntu/generate_schema.sql > create_database.sql

check the result

$ cat create_database.sql
CREATE DATABASE "testatomic" ENGINE = Atomic COMMENT '';
CREATE DATABASE "testordinary" ENGINE = Ordinary COMMENT '';

transfer this create_database.sql to srv2 (scp / rsync)

make a copy of schema sql files (metadata_schema.tar)

srv1

cd /var/lib/clickhouse/
tar -cvhf /home/ubuntu/metadata_schema.tar metadata

-h - is important! (-h, –dereference Follow symlinks; archive and dump the files they point to.)

transfer this metadata_schema.tar to srv2 (scp / rsync)

create databases at srv2

srv2

/etc/init.d/clickhouse-server start
clickhouse-client < create_database.sql
/etc/init.d/clickhouse-server stop

create tables at srv2

srv2

cd /var/lib/clickhouse/
tar xkfv /home/ubuntu/metadata_schema.tar
sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data
/etc/init.d/clickhouse-server start

tar xkfv -k is important! To save folders/symlinks created with create database ( -k, –keep-old-files Don’t replace existing files when extracting )

check a recovery

srv2

SELECT count() FROM testatomic.test;
┌─count()─┐
│ 1000000 │
└─────────┘

SELECT count() FROM testordinary.test;
┌─count()─┐
│ 1000000 │
└─────────┘

SELECT count() FROM default.test;
┌─count()─┐
│ 1000000 │
└─────────┘

5.25 - How to Replicate ClickHouse RBAC Users and Grants with ZooKeeper/Keeper

Practical guide to configure Keeper-backed RBAC replication for users, roles, grants, policies, quotas, and profiles across ClickHouse nodes, including migration and troubleshooting.

How can I replicate CREATE USER and other RBAC commands automatically between servers?

This KB explains how to make SQL RBAC changes (CREATE USER, CREATE ROLE, GRANT, row policies, quotas, settings profiles, masking policies) automatically appear on all servers by storing access entities in ZooKeeper/ClickHouse Keeper.

Keeper below means either ClickHouse Keeper or ZooKeeper.

TL;DR:

By default, SQL RBAC changes (CREATE USER, GRANT, etc.) are local to each server.
Replicated access storage keeps RBAC entities in ZooKeeper/ClickHouse Keeper so changes automatically appear on all nodes.
This guide shows how to configure replicated RBAC, validate it, and migrate existing users safely.

Before diving into the details, the core concept is:

ClickHouse stores access entities in access storages configured by user_directories.
By default, following the shared-nothing concept, SQL RBAC objects are local (local_directory), so changes done on one node do not automatically appear on another node unless you run ... ON CLUSTER ....
With user_directories.replicated, ClickHouse stores the RBAC model in Keeper under a configured path (for example /clickhouse/access) and every node watches that path.
Each node maintains a local in-memory cache of replicated access entities and updates it via Keeper watch callbacks. As a result, access checks are fast and performed locally in memory, while RBAC modifications depend on Keeper availability and propagation.

The flow of this article:

Why this model helps.
How to configure it on a new cluster.
How to validate and operate it.
How to migrate existing RBAC safely.
Advanced troubleshooting and internals.

1. ON CLUSTER vs Keeper-backed RBAC: when to use which

ON CLUSTER executes DDL on hosts that exist at execution time. In practice, it fans out the query through the distributed DDL queue (also Keeper/ZooKeeper-dependent) to currently known cluster nodes. It does not automatically replay old RBAC DDL for replicas/shards added later.

Keeper-backed RBAC differences:

one shared RBAC state for the cluster;
new servers read the same RBAC state when they join;
no need to remember ON CLUSTER for every RBAC statement.

Mental model: Keeper-backed RBAC replicates access state, while ON CLUSTER fans out DDL to currently known nodes.

1.1 Pros and Cons of Keeper-backed RBAC

Pros:

Single source of truth for RBAC across nodes.
No manual file sync of users.xml / local access files.
Fast propagation through Keeper watch-driven refresh.
Natural SQL RBAC workflow (CREATE USER, GRANT, REVOKE, etc.).
Integrates with access-entity backup/restore.

Cons:

Writes depend on Keeper availability. CREATE/ALTER/DROP USER/ROLE and GRANT/REVOKE fail if Keeper is unavailable, while existing authentication/authorization may continue from already loaded cache until restart.
Operational complexity increases (Keeper health directly affects RBAC operations).
Keeper data loss or accidental Keeper path damage can remove replicated RBAC state, and users may lose access; keep regular RBAC backups and test restore procedures.
Can conflict with ON CLUSTER if both mechanisms are used without guard settings.
Invalid/corrupted payload in Keeper can be skipped or be startup-fatal, depending on throw_on_invalid_replicated_access_entities.
Very large RBAC sets (thousands of users/roles or very complex grants) can increase Keeper/watch pressure.
If Keeper is unavailable during server startup and replicated RBAC storage is configured, the server may fail to start.

2. Configure Keeper-backed RBAC on a new cluster

user_directories is the ClickHouse server configuration section that defines:

where access entities are read from (users.xml, local SQL access files, Keeper, LDAP, etc.),
and in which order those sources are checked (precedence).

In short: it is the access-storage routing configuration for users/roles/policies/profiles/quotas.

Apply on every ClickHouse node:

<clickhouse>
  <user_directories replace="replace">
    <users_xml>
      <path>/etc/clickhouse-server/users.xml</path>
    </users_xml>
    <replicated>
      <zookeeper_path>/clickhouse/access/</zookeeper_path>
    </replicated>
  </user_directories>
</clickhouse>

Why replace="replace" matters:

without replace="replace", your fragment can be merged with defaults;
defaults include local_directory, so SQL RBAC may still be written locally;
this can cause mixed behavior (some entities in Keeper, some in local files).

Recommended configuration for clusters using replicated RBAC:

users_xml: bootstrap/break-glass admin users and static defaults.
replicated: all SQL RBAC objects (CREATE USER, CREATE ROLE, GRANT, policies, profiles, quotas).
avoid local_directory as an active writable SQL RBAC storage to prevent mixed write behavior.

2.1 Understand `user_directories`: defaults, precedence, coexistence

What can be configured in user_directories:

users_xml (read-only config users),
local_directory (SQL users/roles in local files),
replicated (SQL users/roles in Keeper),
memory,
ldap (read-only remote auth source).

Defaults if user_directories is not specified:

ClickHouse uses legacy settings (users_config and access_control_path).
In typical default deployments this means users_xml + local_directory.

If user_directories is specified:

ClickHouse uses storages from this section and ignores users_config / access_control_path paths for access storages.
Order in user_directories defines precedence for lookup/auth.

When several storages coexist:

reads/auth checks storages by precedence order;
CREATE USER/ROLE/... without explicit IN ... goes to the first writable target by that order (and may conflict with entities found in higher-precedence storages).

There is special syntax to target a storage explicitly:

CREATE USER my_user IDENTIFIED BY '***' IN replicated;

This is supported, but for access control we usually do not recommend mixing storages intentionally. For sensitive access rights, a single source of truth (typically replicated) is safer and easier to operate.

3. Altinity Operator (CHI) configuration example

apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: rbac-replicated
spec:
  configuration:
    files:
      config.d/user_directories.xml: |
        <clickhouse>
          <user_directories replace="replace">
            <users_xml>
              <path>/etc/clickhouse-server/users.xml</path>
            </users_xml>
            <replicated>
              <zookeeper_path>/clickhouse/access/</zookeeper_path>
            </replicated>
          </user_directories>
        </clickhouse>

4. Validate the setup quickly

Check active storages and precedence:

SELECT name, type, params, precedence
FROM system.user_directories
ORDER BY precedence;

Expected result (values can vary by version/config; precedence values are relative and order matters):

name        type        precedence
users_xml   users_xml   0
replicated  replicated  1

Check where users are stored:

SELECT name, storage
FROM system.users
ORDER BY name;

Expected result for a SQL-created user:

name         storage
kb_test      replicated

Smoke test:

On node A: CREATE USER kb_test IDENTIFIED WITH no_password;
On node B: SHOW CREATE USER kb_test;
On either node: DROP USER kb_test;

RBAC changes usually propagate within milliseconds to seconds, depending on Keeper latency and cluster load.

Check Keeper data exists:

SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/access';

5. Handle existing `ON CLUSTER` RBAC scripts safely

There are two independent propagation mechanisms:

Replicated access storage: Keeper-based replication of RBAC entities.
ON CLUSTER: query fan-out through the distributed DDL queue (also Keeper/ZooKeeper-dependent).

When replicated access storage is enabled, combining both can be redundant or problematic.

Recommended practice:

Prefer RBAC SQL without ON CLUSTER, or enable ignore mode:

SET ignore_on_cluster_for_replicated_access_entities_queries = 1;

With this setting, existing RBAC scripts containing ON CLUSTER can still be used safely: the clause is rewritten away for replicated-access queries.

For production, prefer configuring this in a profile (for example default in users.xml) rather than relying on session-level SET:

<clickhouse>
  <profiles>
    <default>
      <ignore_on_cluster_for_replicated_access_entities_queries>1</ignore_on_cluster_for_replicated_access_entities_queries>
    </default>
  </profiles>
</clickhouse>

6. Migrate existing clusters/users

Switching to Keeper-backed RBAC should be treated as a storage migration..

Important: replay/restore RBAC on one node only. Objects are written to Keeper and then reflected on all nodes.

Key facts before migration:

Changing user_directories storage or changing zookeeper_path does not move existing SQL RBAC objects automatically.
If the path changes, old users and roles are not deleted but become effectively hidden from the new storage path.
zookeeper_path cannot be changed at runtime via SQL.

Recommended high-level steps:

Export/backup RBAC.
Apply the new user_directories config on all nodes.
Restart/reload as needed.
Restore/replay RBAC.
Validate from multiple nodes.

6.1 SQL-only migration (export/import RBAC DDL)

This path is useful when:

RBAC DDL is already versioned in your repo, or
you want to dump/replay access entities using SQL only.
Replaying SHOW ACCESS output is idempotent only if you handle IF NOT EXISTS/cleanup; otherwise prefer restoring into an empty RBAC namespace.

Recommended SQL-only flow:

On the source, check where the entities are stored (local vs. replicated):

SELECT name, storage FROM system.users ORDER BY name;
SELECT name, storage FROM system.roles ORDER BY name;
SELECT name, storage FROM system.settings_profiles ORDER BY name;
SELECT name, storage FROM system.quotas ORDER BY name;
SELECT name, storage FROM system.row_policies ORDER BY name;
SELECT name, storage FROM system.masking_policies ORDER BY name;

Export RBAC DDL from the source:

simplest full dump:

SHOW ACCESS;

Save the output as SQL (for example rbac_dump.sql) in your repo/artifacts.

You can also export individual objects with SHOW CREATE USER/ROLE/... when needed.

Switch the configuration to replicated user_directories on the target cluster and restart/reload.
Replay the exported SQL on one node (without ON CLUSTER in replicated mode).
Validate from another node (SHOW CREATE USER ..., SHOW GRANTS FOR ...).

6.2 Migration with `clickhouse-backup` (`--rbac-only`)

# backup local RBAC users/roles/etc.
clickhouse-backup create --rbac --rbac-only users_bkp_20260304

# restore (on node configured with replicated user directory)
clickhouse-backup restore --rbac-only users_bkp_20260304

Important:

this applies to SQL/RBAC users (created with CREATE USER ..., CREATE ROLE ..., etc.);
if your users are in users.xml, those are config-based (--configs) and this is not an automatic local->replicated RBAC conversion.
run restore on one node only; entities will be replicated through Keeper.
If clickhouse-backup is configured with use_embedded_backup_restore: true, it delegates to SQL BACKUP/RESTORE and follows embedded rules. (see below).

6.3 Migration with embedded SQL `BACKUP/RESTORE`

BACKUP
    TABLE system.users,
    TABLE system.roles,
    TABLE system.row_policies,
    TABLE system.quotas,
    TABLE system.settings_profiles,
    TABLE system.masking_policies
TO <backup_destination>;

-- after switching config
RESTORE
    TABLE system.users,
    TABLE system.roles,
    TABLE system.row_policies,
    TABLE system.quotas,
    TABLE system.settings_profiles,
    TABLE system.masking_policies
FROM <backup_destination>;

allow_backup behavior for embedded SQL backup/restore:

Storage-level flag in user_directories (<replicated>, <local_directory>, <users_xml>) controls whether that storage participates in backup/restore.
Entity-level setting allow_backup (for users/roles/settings profiles) can exclude specific RBAC objects from backup.

Defaults in ClickHouse code:

users_xml: allow_backup = false by default.
local_directory: allow_backup = true by default.
replicated: allow_backup = true by default.

Operational implication:

If you disable allow_backup for replicated storage, embedded BACKUP TABLE system.users ... may skip those entities (or fail if no backup-allowed access storage remains).

7. Troubleshooting: common support issues

Symptom	Typical root cause	What to do
User created on node A is missing on node B	RBAC still stored in `local_directory`	Verify `system.user_directories`; ensure `replicated` is configured on all nodes and active
RBAC objects “disappeared” after config change/restart	`zookeeper_path` or storage source changed	Restore from backup or recreate RBAC in the new storage; keep path stable
New replica has no historical users/roles	Team used only `... ON CLUSTER ...` before scaling	Enable Keeper-backed RBAC so new nodes load shared state
`CREATE USER ... ON CLUSTER` throws “already exists in replicated”	Query fan-out + replicated storage both applied	Remove `ON CLUSTER` for RBAC or enable `ignore_on_cluster_for_replicated_access_entities_queries`
`CREATE USER`/`GRANT` fails with Keeper/ZooKeeper error	Keeper unavailable or connection lost	Check `system.zookeeper_connection`, `system.zookeeper_connection_log`, and server logs
RBAC writes still go to `local_directory` even though `replicated` is configured	`local_directory` remains the first writable storage	Use `user_directories replace="replace"` and avoid writable local SQL storage in front of `replicated`
Server does not start when Keeper is down; no one can log in	Replicated access storage needs Keeper during initialization	Restore Keeper first, then restart; if needed use a temporary fallback config and keep a break-glass `users.xml` admin
Startup fails (or users are skipped) because of invalid RBAC payload in Keeper	Corrupted/invalid replicated entity and strict validation mode	Use `throw_on_invalid_replicated_access_entities` deliberately: `true` fail-fast, `false` skip+log; fix bad Keeper payload before re-enabling strict mode
Two independent clusters unexpectedly share the same users/roles	Both clusters point to the same Keeper ensemble and the same `zookeeper_path`	Use unique RBAC paths per cluster (recommended), or isolate with Keeper chroot (requires Keeper metadata repopulation/migration)
Cannot change RBAC keeper path with SQL at runtime	Not supported by design	Change config + controlled migration/restore
Trying to “sync” RBAC between independent clusters by pointing to another path	Wrong migration model	Use backup/restore or SQL export/import, not ad hoc path switching
Authentication errors from app/job, but local tests work	Network/IP/user mismatch, not replication itself	Check `system.query_log` and source IP; verify user host restrictions
Short window where user seems present/absent via load balancer	Propagation + node routing timing	Validate directly on each node; avoid assuming LB view is instantly consistent
Server fails after aggressive `user_directories` replacement	Required base users/profiles missing in config	Keep `users_xml` (or equivalent base definitions) intact

8. Operational guardrails for production

Keep the same user_directories config on all nodes.
Keep zookeeper_path unique per cluster/tenant.
Use a dedicated admin user for provisioning; avoid using default for automation.
Track configuration rollouts (who/when/what) to avoid hidden behavior changes.
Treat Keeper health as part of access-management SLO.
Plan RBAC backup/restore before changing storage path or cluster topology.

9. Observability and debugging signals

9.1 Check Keeper connectivity

SELECT * FROM system.zookeeper_connection;
SELECT * FROM system.zookeeper_connection_log ORDER BY event_time DESC LIMIT 100;
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/access';

9.2 Relevant server log patterns

You can find feature-related lines in the log, by those patterns:

Access(replicated)
ZooKeeperReplicator
Can't have Replicated access without ZooKeeper
ON CLUSTER clause was ignored for query

9.3 Force RBAC reload

Force access reload:

SYSTEM RELOAD USERS;

10. Keeper path structure and semantics (advanced)

The following details are useful for advanced debugging or when inspecting Keeper paths manually.

If zookeeper_path=/clickhouse/access:

/clickhouse/access
  /uuid/<entity_uuid>   -> serialized ATTACH statements for one entity
  /U/<escaped_name>     -> user name -> UUID
  /R/<escaped_name>     -> role name -> UUID
  /S/<escaped_name>     -> settings profile name -> UUID
  /P/<escaped_name>     -> row policy name -> UUID
  /Q/<escaped_name>     -> quota name -> UUID
  /M/<escaped_name>     -> masking policy name -> UUID

When these paths are accessed:

startup/reconnect: ClickHouse syncs Keeper, creates roots if missing, loads all entities;
CREATE/ALTER/DROP RBAC SQL: updates uuid and type/name index nodes in Keeper transactions;
runtime: watch callbacks refresh changed entities into local in-memory mirror.

11. Low-level internals

Advanced note:

each ClickHouse node keeps a local in-memory cache of all replicated access entities;
cache is updated from Keeper watch notifications (list/entity watches), so auth/lookup paths use local memory and not direct Keeper reads on each request.
watch patterns used:
- list watch on /uuid children for create/delete detection;
- per-entity watch on /uuid/<id> for payload changes.
thread model:
- dedicated watcher thread (runWatchingThread);
- on errors: reset cached Keeper client, sleep, retry;
- after refresh: send AccessChangesNotifier notifications.
cache layers:
- primary cache: MemoryAccessStorage inside replicated access storage;
- higher-level caches in AccessControl (RoleCache, RowPolicyCache, QuotaCache, SettingsProfilesCache) are updated/invalidated via access change notifications.
Read path is memory-backed (MemoryAccessStorage mirror), not direct Keeper reads per query.
Write path requires Keeper availability; if Keeper is down, RBAC writes fail while some reads can continue from loaded state.
Insert target is selected by storage order and writeability in MultipleAccessStorage; this is why leftover local_directory can hijack SQL user creation.
ignore_on_cluster_for_replicated_access_entities_queries is implemented as AST rewrite that removes ON CLUSTER for access queries when replicated access storage is enabled.

12. Version and history highlights

Date	Change	Why it matters
2021-07-21	`ReplicatedAccessStorage` introduced (`e33a2bf7bc9`, PR #27426)	First Keeper-backed RBAC replication
2023-08-18	Ignore `ON CLUSTER` for replicated access entities (`14590305ad0`, PR #52975)	Reduced duplicate/overlap behavior
2023-12-12	Extended ignore behavior to `GRANT/REVOKE` (`b33f1245559`, PR #57538)	Fixed common operational conflict with grants
2025-06-03	Keeper replication logic extracted to `ZooKeeperReplicator` (`39eb90b73ef`, PR #81245)	Cleaner architecture, shared replication core
2026-01-24	Optional strict mode on invalid replicated entities (`3d654b79853`)	Lets operators fail fast on corrupted Keeper payloads

13. Code references for deep dives

src/Access/AccessControl.cpp
src/Access/MultipleAccessStorage.cpp
src/Access/ReplicatedAccessStorage.cpp
src/Access/ZooKeeperReplicator.cpp
src/Interpreters/removeOnClusterClauseIfNeeded.cpp
src/Access/IAccessStorage.cpp
src/Backups/BackupCoordinationOnCluster.cpp
src/Backups/RestoreCoordinationOnCluster.cpp
tests/integration/test_replicated_users/test.py
tests/integration/test_replicated_access/test_invalid_entity.py

5.26 - Replication: Can not resolve host of another ClickHouse® server

Symptom

When configuring Replication the ClickHouse® cluster nodes are experiencing communication issues, and an error message appears in the log that states that the ClickHouse host cannot be resolved.

<Error> DNSResolver: Cannot resolve host (xxxxx), error 0: DNS error.
 auto DB::StorageReplicatedMergeTree::processQueueEntry(ReplicatedMergeTreeQueue::SelectedEntryPtr)::(anonymous class)::operator()(DB::StorageReplicatedMergeTree::LogEntryPtr &) const: Code: 198. DB::Exception: Not found address of host: xxxx. (DNS_ERROR),

Cause:

The error message indicates that the host name of the one of the nodes of the cluster cannot be resolved by other cluster members, causing communication issues between the nodes.

Each node in the replication setup pushes its Fully Qualified Domain Name (FQDN) to Zookeeper, and if other nodes cannot access it using its FQDN, this can cause issues.

Action:

There are two possible solutions to this problem:

Change the FQDN to allow other nodes to access it. This solution can also help to keep the environment more organized. To do this, use the following command to edit the hostname file:

sudo vim /etc/hostname

Or use the following command to change the hostname:

sudo hostnamectl set-hostname ...

Use the configuration parameter <interserver_http_host> to specify the IP address or hostname that the nodes can use to communicate with each other. This solution can have some issues, such as the one described in this link: https://github.com/ClickHouse/ClickHouse/issues/2154 . To configure this parameter, refer to the documentation for more information: https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#interserver-http-host .

5.27 - source parts size is greater than the current maximum

source parts size (…) is greater than the current maximum (…)

Symptom

I see messages like: source parts size (...) is greater than the current maximum (...) in the logs and/or inside system.replication_queue

Cause

Usually that means that there are already few big merges running. You can see the running merges using the query:

SELECT * FROM system.merges

That logic is needed to prevent picking a log of huge merges simultaneously (otherwise they will take all available slots and ClickHouse® will not be able to do smaller merges, which usually are important for keeping the number of parts stable).

Action

It is normal to see those messages on some stale replicas. And it should be resolved automatically after some time. So just wait & monitor system.merges & system.replication_queue tables, it should be resolved by it’s own.

If it happens often or don’t resolves by it’s own during some longer period of time, it could be caused by:

increased insert pressure
disk issues / high load (it works slow, not enough space etc.)
high CPU load (not enough CPU power to catch up with merges)
issue with table schemas leading to high merges pressure (high / increased number of tables / partitions / etc.)

Start from checking dmesg / system journals / ClickHouse monitoring to find the anomalies.

5.28 - Successful ClickHouse® deployment plan

Successful ClickHouse® deployment plan

Stage 0. Build POC

Install single node ClickHouse
Start with creating a single table (the biggest one), use MergeTree engine. Create ‘some’ schema (most probably it will be far from optimal). Prefer denormalized approach for all immutable dimensions, for mutable dimensions - consider dictionaries.
Load some amount of data (at least 5 Gb, and 10 mln rows) - preferable the real one, or as close to real as possible. Usually the simplest options are either through CSV / TSV files (or insert into clickhouse_table select * FROM mysql(...) where ...)
Create several representative queries.
Check the columns cardinality, and appropriate types, use minimal needed type
Review the partition by and order by. https://kb.altinity.com/engines/mergetree-table-engine-family/pick-keys/
Create the schema(s) with better/promising order by / partitioning, load data in. Pick the best schema.
consider different improvements of particular columns (codecs / better data types etc.) https://kb.altinity.com/altinity-kb-schema-design/codecs/altinity-kb-how-to-test-different-compression-codecs/
If the performance of certain queries is not enough - consider using PREWHERE / skipping indexes
Repeat 2-9 for next big table(s). Avoid scenarios when you need to join big tables.
Pick the clients library for you programming language (the most mature are python / golang / java / c++), build some pipeline - for inserts (low QPS, lot of rows in singe insert, check acknowledgements & retry the same block on failures), ETLs if needed, some reporting layer (https://kb.altinity.com/altinity-kb-integrations/bi-tools/ )

Stage 1. Planning the production setup

Collect more data / estimate insert speed, estimate the column sizes per day / month.
Measure the speed of queries
Consider improvement using materialized views / projections / dictionaries.
Collect requirements (ha / number of simultaneous queries / insert pressure / ’exactly once’ etc)
Do a cluster sizing estimation, plan the hardware
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/cluster-production-configuration-guide/hardware-requirements/
- https://blog.cloudflare.com/clickhouse-capacity-estimation-framework/
plan the network, if needed - consider using LoadBalancers etc.
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/cluster-production-configuration-guide/network-configuration/
If you need sharding - consider different sharding approaches.

Stage 2. Preprod setup & development

Install ClickHouse in cluster - several nodes / VMs + zookeeper
Create good config & automate config / os / restarts (ansible / puppet etc)
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-settings-to-adjust/
- for docker: https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-clickhouse-in-docker/
- for k8s, use the Altinity Kubernetes Operator for ClickHouse OR https://kb.altinity.com/altinity-kb-kubernetes/altinity-kb-possible-issues-with-running-clickhouse-in-k8s/
Set up monitoring / log processing / alerts etc.
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-monitoring/#build-your-own-monitoring
Set up users.
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/rbac/
Think of schema management. Deploy the schema.
- https://kb.altinity.com/altinity-kb-setup-and-maintenance/schema-migration-tools/
Design backup / failover strategies:
- https://clickhouse.com/docs/en/operations/backup/
- https://github.com/Altinity/clickhouse-backup
Develop pipelines / queries, create test suite, CI/CD
Do benchmark / stress tests
Test configuration changes / server restarts / failovers / version upgrades
Review the security topics (tls, limits / restrictions, network, passwords)
Document the solution for operations

Stage 3. Production setup

Deploy the production setup (consider also canary / blue-greed deployments etc)
Schedule ClickHouse upgrades every 6 to 12 months (if possible)

5.29 - sysall database (system tables on a cluster level)

sysall database (system tables on a cluster level)

Requirements

The idea is that you have a macros cluster with cluster name.

For example you have a cluster named production and this cluster includes all ClickHouse® nodes.

$ cat /etc/clickhouse-server/config.d/clusters.xml
<?xml version="1.0" ?>
<yandex>
    <remote_servers>
        <production>
          <shard>
...

And you need to have a macro cluster set to production:

cat /etc/clickhouse-server/config.d/macros.xml
<?xml version="1.0" ?>
<yandex>
    <macros>
        <cluster>production</cluster>
        <replica>....</replica>
        ....
    </macros>
</yandex>

Now you should be able to query all nodes using clusterAllReplicas:

SELECT
    hostName(),
    FQDN(),
    materialize(uptime()) AS uptime
FROM clusterAllReplicas('{cluster}', system.one)
SETTINGS skip_unavailable_shards = 1

┌─hostName()─┬─FQDN()──────────────┬──uptime─┐
│ chhost1    │ chhost1.localdomain │ 1071574 │
│ chhost2    │ chhost2.localdomain │ 1071517 │
└────────────┴─────────────────────┴─────────┘

skip_unavailable_shards is necessary to query a system with some nodes are down.

Script to create DB objects

clickhouse-client -q 'show tables from system'> list
for i in `cat list`; do echo "CREATE OR REPLACE VIEW sysall."$i" as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system."$i") SETTINGS skip_unavailable_shards = 1;"; done;

CREATE DATABASE sysall;

CREATE OR REPLACE VIEW sysall.cluster_state AS
SELECT
    shard_num,
    replica_num,
    host_name,
    host_address,
    port,
    errors_count,
    uptime,
    if(uptime > 0, 'UP', 'DOWN') AS node_state
FROM system.clusters
LEFT JOIN
(
    SELECT
        replaceRegexpOne(hostName(),'-(\d+)-0$','-\1') AS host_name,  -- remove trailing 0
        FQDN() AS fqdn,
        materialize(uptime()) AS uptime
    FROM clusterAllReplicas('{cluster}', system.one)
) as hosts_info USING (host_name)
WHERE cluster = getMacro('cluster')
SETTINGS skip_unavailable_shards = 1;

CREATE OR REPLACE VIEW sysall.asynchronous_inserts as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.asynchronous_inserts) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.asynchronous_metrics as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.asynchronous_metrics) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.backups as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.backups) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.clusters as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.clusters) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.columns as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.columns) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.current_roles as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.current_roles) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.data_skipping_indices as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.data_skipping_indices) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.databases as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.databases) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.detached_parts as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.detached_parts) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.dictionaries as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.dictionaries) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.disks as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.disks) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.distributed_ddl_queue as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.distributed_ddl_queue) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.distribution_queue as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.distribution_queue) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.dropped_tables as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.dropped_tables) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.enabled_roles as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.enabled_roles) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.errors as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.errors) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.events as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.events) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.filesystem_cache as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.filesystem_cache) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.grants as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.grants) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.jemalloc_bins as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.jemalloc_bins) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.macros as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.macros) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.merge_tree_settings as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.merge_tree_settings) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.merges as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.merges) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.metrics as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.metrics) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.moves as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.moves) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.mutations as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.mutations) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.named_collections as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.named_collections) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.parts as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.parts) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.parts_columns as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.parts_columns) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.privileges as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.privileges) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.processes as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.processes) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.projection_parts as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.projection_parts) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.projection_parts_columns as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.projection_parts_columns) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.query_cache as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.query_cache) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.query_log as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.query_log) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.quota_limits as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.quota_limits) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.quota_usage as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.quota_usage) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.quotas as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.quotas) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.quotas_usage as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.quotas_usage) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.replicas as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.replicas) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.replicated_fetches as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.replicated_fetches) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.replicated_merge_tree_settings as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.replicated_merge_tree_settings) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.replication_queue as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.replication_queue) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.role_grants as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.role_grants) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.roles as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.roles) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.row_policies as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.row_policies) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.server_settings as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.server_settings) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.settings as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.settings) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.settings_profile_elements as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.settings_profile_elements) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.settings_profiles as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.settings_profiles) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.storage_policies as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.storage_policies) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.tables as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.tables) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.user_directories as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.user_directories) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.user_processes as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.user_processes) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.users as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.users) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.warnings as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.warnings) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.zookeeper as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.zookeeper) SETTINGS skip_unavailable_shards = 1;
CREATE OR REPLACE VIEW sysall.zookeeper_connection as select hostName() nodeHost, FQDN() nodeFQDN, * from clusterAllReplicas('{cluster}', system.zookeeper_connection) SETTINGS skip_unavailable_shards = 1;

Some examples

select * from sysall.cluster_state;
┌─shard_num─┬─replica_num─┬─host_name───────────┬─host_address─┬─port─┬─errors_count─┬──uptime─┬─node_state─┐
│         1 │           1 │ chhost1.localdomain │ 10.253.86.2  │ 9000 │            0 │ 1071788 │ UP         │
│         2 │           1 │ chhost2.localdomain │ 10.253.215.2 │ 9000 │            0 │ 1071731 │ UP         │
│         3 │           1 │ chhost3.localdomain │ 10.252.83.8  │ 9999 │            0 │       0 │ DOWN       │
└───────────┴─────────────┴─────────────────────┴──────────────┴──────┴──────────────┴─────────┴────────────┘


SELECT
    nodeFQDN,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total
FROM sysall.disks
┌─nodeFQDN────────────┬─path─────────────────┬─free───────┬─total──────┐
│ chhost1.localdomain │ /var/lib/clickhouse/ │ 511.04 GiB │ 937.54 GiB │
│ chhost2.localdomain │ /var/lib/clickhouse/ │ 495.77 GiB │ 937.54 GiB │
└─────────────────────┴──────────────────────┴────────────┴────────────┘

5.30 - Timeouts during OPTIMIZE FINAL

Timeout exceeded ... or executing longer than distributed_ddl_task_timeout during OPTIMIZE FINAL.

`Timeout exceeded ...` or `executing longer than distributed_ddl_task_timeout` during `OPTIMIZE FINAL`

Timeout may occur

due to the fact that the client reach timeout interval.
- in case of TCP / native clients - you can change send_timeout / receive_timeout + tcp_keep_alive_timeout + driver timeout settings
- in case of HTTP clients - you can change http_send_timeout / http_receive_timeout + tcp_keep_alive_timeout + driver timeout settings
(in the case of ON CLUSTER queries) due to the fact that the timeout for query execution by shards ends
- see setting distributed_ddl_task_timeout

In the first case you additionally may get the misleading messages: Cancelling query. ... Query was cancelled.

In both cases, this does NOT stop the execution of the OPTIMIZE command. It continues to work even after the client is disconnected. You can see the progress of that in system.processes / show processlist / system.merges / system.query_log.

The same applies to queries like:

INSERT ... SELECT
CREATE TABLE ... AS SELECT
CREATE MATERIALIZED VIEW ... POPULATE ...

It is possible to run a query with some special query_id and then poll the status from the processlist (in the case of a cluster, it can be a bit more complicated).

5.31 - Use an executable dictionary as cron task

If you need to execute scheduled tasks, you can use an executable dictionary like it was a cron task.

Rationale

Imagine that we need to restart clickhouse-server every saturday at 10:00 AM. We can use an executable dictionary to do this. Here is the approach and code necessary to do this. It can be used for other operations like INSERT into tables or execute some other imaginative tasks that need an scheduled execution.

Let’s create a simple table to register all the restarts scheduled by this dictionary:

CREATE TABLE restart_table
(
    restart_datetime DateTime
)
ENGINE = TinyLog

Configuration

This is the ClickHouse configuration file we will be using for executable dictionaries. The dictionary is a dummy one (ignore the format and other stuff, we need format in the dict definition because if not it will fail loading), we don’t need it to do anything, just execute a script that has all the logic. The scheduled time is defined in the LIFETIME property of the dictionary (every 5 minutes dictionary will be refreshed and subsequently the script executed). Also for this case we need to load it on startup time setting lazy loading of dicts to false.

<!-- cat restart_dict.xml -->
<clickhouse>
    <dictionaries_config>/etc/clickhouse-server/config.d/*_dict.xml</dictionaries_config>
    <dictionaries_lazy_load>false</dictionaries_lazy_load>
    <dictionary>
        <name>restart_dict</name>
        <structure>
            <id>
                <name>restart_id</name>
                <type>UInt64</type>
            </id>
        </structure>
        <source>
            <executable>
                <command>restart_dict.sh</command>
                <execute_direct>true</execute_direct>
                <format>CSV</format>
            </executable>
        </source>
        <layout>
            <flat/>
        </layout>
        <lifetime>300</lifetime>
    </dictionary>
</clickhouse>

Action

Now the restart logic (which can be different for other needs). In this case it will do nothing until the restart windows comes. During the restart window, we check if there has been a restart in the same window timeframe (if window is an hour the condition should be 1h). The script will issue a SYSTEM SHUTDOWN command to restart the server. The script will also insert a record in the restart_table to register the restart time.

#!/bin/bash

CLICKHOUSE_USER="admin"
CLICKHOUSE_PASSWORD="xxxxxxxxx"

# Check if today is Saturday and the time is 10:00 AM CET or later
# Get current day of week (1-7, where 7 is Sunday)
# reload time for dict is 300 secs / 10 mins
current_day=$(date +%u)
# Get current time in hours and minutes
current_time=$(date +%H%M)

# Check if today is Saturday (6) and the time is between 10:00 AM and 11:00 AM
if [[ $current_day -eq 6 && $current_time -ge 1000 && $current_time -lt 1100 ]]; then
    # Get current date and time as timestamp
    current_timestamp=$(date +%s)
    last_restart_timestamp=$(clickhouse-client --user $CLICKHOUSE_USER --password $CLICKHOUSE_PASSWORD --query "SELECT max(toUnixTimestamp(restart_datetime)) FROM restart_table")
    # Check if the last restart timestamp is within last hour, if not then restart
    if [[ $(( current_timestamp - last_restart_timestamp )) -ge 3600 ]]; then
        # Push data to log table and restart
        echo $current_timestamp | clickhouse-client --user $CLICKHOUSE_USER --password $CLICKHOUSE_PASSWORD --query "INSERT INTO restart_table FORMAT TSVRaw"
        clickhouse-client --user $CLICKHOUSE_USER --password $CLICKHOUSE_PASSWORD --query "SYSTEM SHUTDOWN"
    fi
fi

Improvements

If the dictionary has a high frecuency refresh time, then clickhouse could end up executing that script multiple times using a lot of resources and creating processes that can look like ‘stuck’ ones. To overcome this we can use the executable pool setting: https://clickhouse.com/docs/sql-reference/dictionaries#executable-pool

Executable pool will spawn a pool of processes (similar as a pool of connections) with the specified command and keep them running until they exit, which is useful for heavy scripts/python and reduces the initialization impact of those on clickhouse.

5.32 - Useful settings to turn on/Defaults that should be reconsidered

Useful settings to turn on.

Useful settings to turn on/Defaults that should be reconsidered

Some setting that are not enabled by default.

ttl_only_drop_parts

Enables or disables complete dropping of data parts where all rows are expired in MergeTree tables.

When ttl_only_drop_parts is disabled (by default), the ClickHouse® server only deletes expired rows according to their TTL.

When ttl_only_drop_parts is enabled, the ClickHouse server drops a whole part when all rows in it are expired.

Dropping whole parts instead of partial cleaning TTL-d rows allows having shorter merge_with_ttl_timeout times and lower impact on system performance.

join_use_nulls

Might be you not expect that join will be filled with default values for missing columns (instead of classic NULLs) during JOIN.

Sets the type of JOIN behaviour. When merging tables, empty cells may appear. ClickHouse fills them differently based on this setting.

Possible values:

0 — The empty cells are filled with the default value of the corresponding field type. 1 — JOIN behaves the same way as in standard SQL. The type of the corresponding field is converted to Nullable, and empty cells are filled with NULL.

aggregate_functions_null_for_empty

Default behaviour is not compatible with ANSI SQL (ClickHouse avoids Nullable types by performance reasons)

select sum(x), avg(x) from (select 1 x where 0);
┌─sum(x)─┬─avg(x)─┐
│      0 │    nan │
└────────┴────────┘

set aggregate_functions_null_for_empty=1;

select sum(x), avg(x) from (select 1 x where 0);
┌─sumOrNull(x)─┬─avgOrNull(x)─┐
│         ᴺᵁᴸᴸ │         ᴺᵁᴸᴸ │
└──────────────┴──────────────┘

5.33 - Who ate my CPU

Queries to find which subsytem of ClickHouse® is using the most of CPU.

Merges

SELECT
    table,
    round((elapsed * (1 / progress)) - elapsed, 2) AS estimate,
    elapsed,
    progress,
    is_mutation,
    formatReadableSize(total_size_bytes_compressed) AS size,
    formatReadableSize(memory_usage) AS mem
FROM system.merges
ORDER BY elapsed DESC

Mutations

SELECT
    database,
    table,
    substr(command, 1, 30) AS command,
    sum(parts_to_do) AS parts_to_do,
    anyIf(latest_fail_reason, latest_fail_reason != '')
FROM system.mutations
WHERE NOT is_done
GROUP BY
    database,
    table,
    command

Current Processes

select elapsed, query from system.processes where is_initial_query and elapsed > 2

Processes retrospectively

SELECT
    normalizedQueryHash(query) hash,
    current_database,
    sum(ProfileEvents['UserTimeMicroseconds'] as userCPUq)/1000 AS userCPUms,
    count(),
    sum(query_duration_ms) query_duration_ms,
    userCPUms/query_duration_ms cpu_per_sec, 
    argMax(query, userCPUq) heaviest_query
FROM system.query_log
WHERE (type = 2) AND (event_date >= today())
GROUP BY
    current_database,
    hash
ORDER BY userCPUms DESC
LIMIT 10
FORMAT Vertical;

5.34 - Zookeeper session has expired

Zookeeper session has expired

Q. I get “Zookeeper session has expired” once. What should i do? Should I worry?

Getting exceptions or lack of acknowledgement in distributed system from time to time is a normal situation. Your client should do the retry. If that happened once and your client do retries correctly - nothing to worry about.

It it happens often, or with every retry - it may be a sign of some misconfiguration / issue in cluster (see below).

Q. we see a lot of these: Zookeeper session has expired. Switching to a new session

A. There is a single Zookeeper session per server. But there are many threads that can use Zookeeper simultaneously. So the same event (we lose the single Zookeeper session we had), will be reported by all the threads/queries which were using that Zookeeper session.

Usually after loosing the Zookeeper session that exception is printed by all the thread which watch Zookeeper replication queues, and all the threads which had some in-flight Zookeeper operations (for example inserts, ON CLUSTER commands etc).

If you see a lot of those simultaneously - that just means you have a lot of threads talking to Zookeeper simultaneously (or may be you have many replicated tables?).

BTW: every Replicated table comes with its own cost, so you can’t scale the number of replicated tables indefinitely .

Typically after several hundreds (sometimes thousands) of replicated tables, the ClickHouse® server becomes unusable: it can’t do any other work, but only keeping replication housekeeping tasks. ‘ClickHouse-way’ is to have a few (maybe dozens) of very huge tables instead of having thousands of tiny tables. (Side note: the number of not-replicated tables can be scaled much better).

So again if during short period of time you see lot of those exceptions and that don’t happen anymore for a while - nothing to worry about. Just ensure your client is doing retries properly.

Q. We are wondering what is causing that session to “timeout” as the default looks like 30 seconds, and there’s certainly stuff happening much more frequently than every 30 seconds.

Typically that has nothing with an expiration/timeout - even if you do nothing there are heartbeat events in the Zookeeper protocol.

So internally inside ClickHouse:

we have a ‘zookeeper client’ which in practice is a single Zookeeper connection (TCP socket), with 2 threads - one serving reads, the seconds serving writes, and some API around.
while everything is ok Zookeeper client keeps a single logical ‘zookeeper session’ (also by sending heartbeats etc).
we may have hundreds of ‘users’ of that Zookeeper client - those are threads that do some housekeeping, serve queries etc.
Zookeeper client normally have dozen ‘in-flight’ requests (asked by different threads). And if something bad happens with that (disconnect, some issue with Zookeeper server, some other failure), Zookeeper client needs to re-establish the connection and switch to the new session so all those ‘in-flight’ requests will be terminated with a ‘session expired’ exception.

Q. That problem happens very often (all the time, every X minutes / hours / days).

Sometimes the real issue can be visible somewhere close to the first ‘session expired’ exception in the log. (i.e. Zookeeper client thread can know & print to logs the real reason, while all ‘user’ threads just get ‘session expired’).

Also Zookeeper logs may ofter have a clue to that was the real problem.

Known issues which can lead to session termination by Zookeeper:

connectivity / network issues.
jute.maxbuffer overrun. If you need to pass too much data in a single Zookeeper transaction. (often happens if you need to do ALTER table UPDATE or other mutation on the table with big number of parts). The fix is adjusting JVM setting: -Djute.maxbuffer=8388608. See https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/jvm-sizes-and-garbage-collector-settings/
XID overflow. XID is a transaction counter in Zookeeper, if you do too many transactions the counter reaches maxint32, and to restart the counter Zookeeper closes all the connections. Usually, that happens rarely, and is not avoidable in Zookeeper (well in clickhouse-keeper that problem solved). There are some corner cases / some schemas which may end up with that XID overflow happening quite often. (a worst case we saw was once per 3 weeks).

Q. “Zookeeper session has expired” happens every time I try to start the mutation / do other ALTER on Replicated table.

During ALTERing replicated table ClickHouse need to create a record in Zookeeper listing all the parts which should be mutated (that usually means = list names of all parts of the table). If the size of list of parts exceeds maximum buffer size - Zookeeper drops the connection.

Parts name length can be different for different tables. In average with default jute.maxbuffer (1Mb) mutations start to fail for tables which have more than 5000 parts.

Solutions:

rethink partitioning, high number of parts in table is usually not recommended
increase jute.maxbuffer on Zookeeper side to values about 8M
use IN PARTITION clause for mutations (where applicable) - since 20.12
switch to clickhouse-keeper

Q. “Zookeeper session has expired and also Operation timeout” happens when reading blocks from Zookeeper:

2024.02.22 07:20:39.222171 [ 1047 ] {} <Error> ZooKeeperClient: Code: 999. Coordination::Exception: Operation timeout (no response) for request List for path: 
/clickhouse/tables/github_events/block_numbers/20240205105000 (Operation timeout). (KEEPER_EXCEPTION), 
2024.02.22 07:20:39.223293 [ 246 ] {} <Error> default.github_events : void DB::StorageReplicatedMergeTree::mergeSelectingTask(): 
Code: 999. Coordination::Exception: /clickhouse/tables/github_events/block_numbers/20240205105000 (Connection loss).

Sometimes these Session expired and operation timeout are common, because of merges that read all the blocks in Zookeeper for a table and if there are many blocks (and partitions) read time can be longer than the 10 secs default operation timeout . When dropping a partition, ClickHouse never drops old block numbers from Zookeeper, so the list grows indefinitely. It is done as a precaution against race between DROP PARTITION and INSERT. It is safe to clean those old blocks manually

This is being addressed in #59507 Add <code>FORGET PARTITION</code> query to remove old partition nodes from

Solutions: Manually remove old/forgotten blocks https://kb.altinity.com/altinity-kb-useful-queries/remove_unneeded_block_numbers/

Related issues:

5.35 - Server configuration files

How to organize configuration files in ClickHouse® and how to manage changes

Сonfig management (recommended structure)

ClickHouse® server config consists of two parts server settings (config.xml) and users settings (users.xml).

By default they are stored in the folder /etc/clickhouse-server/ in two files config.xml & users.xml.

We suggest never change vendor config files and place your changes into separate .xml files in sub-folders. This way is easier to maintain and ease ClickHouse upgrades.

/etc/clickhouse-server/users.d – sub-folder for user settings (derived from users.xml filename).

/etc/clickhouse-server/config.d – sub-folder for server settings (derived from config.xml filename).

/etc/clickhouse-server/conf.d – sub-folder for any (both) settings.

If the root config (xml or yaml) has a different name, such as keeper_config.xml or config_instance_66.xml, then the keeper_config.d and config_instance_66.d folders will be used. But conf.d is always used and processed last.

File names of your xml files can be arbitrary but they are applied in alphabetical order.

Examples:

$ cat /etc/clickhouse-server/config.d/listen_host.xml
<?xml version="1.0" ?>
<clickhouse>
  <listen_host>::</listen_host>
</clickhouse>


$ cat /etc/clickhouse-server/config.d/macros.xml
<?xml version="1.0" ?>
<clickhouse>
  <macros>
    <cluster>test</cluster>
    <replica>host22</replica>
    <shard>0</shard>
    <server_id>41295</server_id>
    <server_name>host22.server.com</server_name>
  </macros>
</clickhouse>

cat /etc/clickhouse-server/config.d/zoo.xml
<?xml version="1.0" ?>
<clickhouse>
  <zookeeper>
    <node>
      <host>localhost</host>
      <port>2181</port>
    </node>
  </zookeeper>
  <distributed_ddl>
    <path>/clickhouse/test/task_queue/ddl</path>
  </distributed_ddl>
</clickhouse>

cat /etc/clickhouse-server/users.d/enable_access_management_for_user_default.xml
<?xml version="1.0" ?>
<clickhouse>
  <users>
    <default>
      <access_management>1</access_management>
    </default>
  </users>
</clickhouse>

cat /etc/clickhouse-server/users.d/memory_usage.xml
<?xml version="1.0" ?>
<clickhouse>
    <profiles>
        <default>
            <max_bytes_before_external_group_by>25290221568</max_bytes_before_external_group_by>
            <max_memory_usage>50580443136</max_memory_usage>
        </default>
    </profiles>
</clickhouse>

BTW, you can define any macro in your configuration and use them in Zookeeper paths

 ReplicatedMergeTree('/clickhouse/{cluster}/tables/my_table','{replica}')

or in your code using function getMacro:

CREATE OR REPLACE VIEW srv_server_info
SELECT (SELECT getMacro('shard')) AS shard_num,
       (SELECT getMacro('server_name')) AS server_name,
       (SELECT getMacro('server_id')) AS server_key

Settings can be appended to an XML tree (default behaviour) or replaced or removed.

Example how to delete tcp_port & http_port defined on higher level in the main config.xml (it disables open tcp & http ports if you configured secure ssl):

cat /etc/clickhouse-server/config.d/disable_open_network.xml
<?xml version="1.0"?>
<clickhouse>
  <http_port remove="1"/>
  <tcp_port remove="1"/>
</clickhouse>

Example how to replace remote_servers section defined on higher level in the main config.xml (it allows to remove default test clusters.

<?xml version="1.0" ?>
<clickhouse>
  <remote_servers replace="1">
    <mycluster>
      ....
    </mycluster>
  </remote_servers>
</clickhouse>

Settings & restart

General ‘rule of thumb’:

server settings (config.xml and config.d) changes require restart;
user settings (users.xml and users.d) changes don’t require restart.

But there are exceptions from those rules (see below).

Server config (config.xml) sections which don’t require restart

<max_server_memory_usage>
<max_server_memory_usage_to_ram_ratio>
<max_table_size_to_drop> (since 19.12)
<max_partition_size_to_drop> (since 19.12)
<max_concurrent_queries> (since 21.11, also for versions older than v24 system tables are not updated with the new config values)
<macros>
<remote_servers>
<dictionaries_config>
<user_defined_executable_functions_config>
<models_config>
<keeper_server>
<zookeeper> (but reconnect don’t happen automatically)
<storage_configuration> – only if you add a new entity (disk/volume/policy), to modify these enitities restart is mandatory.
<user_directories>
<access_control_path>
<encryption_codecs>
<logger> (since 21.11)

Those sections (live in separate files):

<dictionaries>
<functions>
<models>

User settings which require restart.

Most of user setting changes don’t require restart, but they get applied at the connect time, so existing connection may still use old user-level settings. That means that that new setting will be applied to new sessions / after reconnect.

The list of user setting which require server restart:

<background_buffer_flush_schedule_pool_size>
<background_pool_size>
<background_merges_mutations_concurrency_ratio>
<background_move_pool_size>
<background_fetches_pool_size>
<background_common_pool_size>
<background_schedule_pool_size>
<background_message_broker_schedule_pool_size>
<background_distributed_schedule_pool_size>
<max_replicated_fetches_network_bandwidth_for_server>
<max_replicated_sends_network_bandwidth_for_server>

See also select * from system.settings where description ilike '%start%'

Also there are several ’long-running’ user sessions which are almost never restarted and can keep the setting from the server start (it’s DDLWorker, Kafka , and some other service things).

Dictionaries

We suggest to store each dictionary description in a separate (own) file in a /etc/clickhouse-server/dict sub-folder.

$ cat /etc/clickhouse-server/dict/country.xml
<?xml version="1.0"?>
<dictionaries>
  <dictionary>
    <name>country</name>
    <source>
      <http>
      ...
  </dictionary>
</dictionaries>

and add to the configuration

$ cat /etc/clickhouse-server/config.d/dictionaries.xml
<?xml version="1.0"?>
<clickhouse>
  <dictionaries_config>dict/*.xml</dictionaries_config>
  <dictionaries_lazy_load>true</dictionaries_lazy_load>
</clickhouse>

dict/*.xml – relative path, servers seeks files in the folder /etc/clickhouse-server/dict. More info in Multiple ClickHouse instances .

incl attribute & metrica.xml

incl attribute allows to include some XML section from a special include file multiple times.

By default include file is /etc/metrika.xml. You can use many include files for each XML section.

For example to avoid repetition of user/password for each dictionary you can create an XML file:

$ cat /etc/clickhouse-server/dict_sources.xml
<?xml version="1.0"?>
<clickhouse>
  <mysql_config>
      <port>3306</port>
      <user>user</user>
      <password>123</password>
      <replica>
        <host>mysql_host</host>
        <priority>1</priority>
      </replica>
      <db>my_database</db>
  </mysql_config>
</clickhouse>

Include this file:

$ cat /etc/clickhouse-server/config.d/dictionaries.xml
<?xml version="1.0"?>
<clickhouse>
  ...
  <include_from>/etc/clickhouse-server/dict_sources.xml</include_from>
</clickhouse>

And use in dictionary descriptions (incl=“mysql_config”):

$ cat /etc/clickhouse-server/dict/country.xml
<?xml version="1.0"?>
<dictionaries>
  <dictionary>
    <name>country</name>
    <source>
        <mysql incl="mysql_config">
            <table>my_table</table>
            <invalidate_query>select max(id) from my_table</invalidate_query>
        </mysql>
    </source>
      ...
  </dictionary>
</dictionaries>

Multiple ClickHouse instances at one host

By default ClickHouse server configs are in /etc/clickhouse-server/ because clickhouse-server runs with a parameter –config-file /etc/clickhouse-server/config.xml

config-file is defined in startup scripts:

/etc/init.d/clickhouse-server – init-V
/etc/systemd/system/clickhouse-server.service – systemd

ClickHouse uses the path from config-file parameter as base folder and seeks for other configs by relative path. All sub-folders users.d / config.d are relative.

You can start multiple clickhouse-server each with own –config-file.

For example:

/usr/bin/clickhouse-server --config-file /etc/clickhouse-server-node1/config.xml
  /etc/clickhouse-server-node1/  config.xml ... users.xml
  /etc/clickhouse-server-node1/config.d/disable_open_network.xml
  /etc/clickhouse-server-node1/users.d/....

/usr/bin/clickhouse-server --config-file /etc/clickhouse-server-node2/config.xml
  /etc/clickhouse-server-node2/   config.xml ... users.xml
  /etc/clickhouse-server-node2/config.d/disable_open_network.xml
  /etc/clickhouse-server-node2/users.d/....

If you need to run multiple servers for CI purposes you can combine all settings in a single fat XML file and start ClickHouse without config folders/sub-folders.

/usr/bin/clickhouse-server --config-file /tmp/ch1.xml
/usr/bin/clickhouse-server --config-file /tmp/ch2.xml
/usr/bin/clickhouse-server --config-file /tmp/ch3.xml

Each ClickHouse instance must work with own data-folder and tmp-folder.

By default ClickHouse uses /var/lib/clickhouse/. It can be overridden in path settings

<path>/data/clickhouse-ch1/</path>

<tmp_path>/data/clickhouse-ch1/tmp/</tmp_path>

<user_files_path>/data/clickhouse-ch1/user_files/</user_files_path>
  <local_directory>
    <path>/data/clickhouse-ch1/access/</path>
  </local_directory>

<format_schema_path>/data/clickhouse-ch1/format_schemas/</format_schema_path>

preprocessed_configs

ClickHouse server watches config files and folders. When you change, add or remove XML files ClickHouse immediately assembles XML files into a combined file. These combined files are stored in /var/lib/clickhouse/preprocessed_configs/ folders.

You can verify that your changes are valid by checking /var/lib/clickhouse/preprocessed_configs/config.xml, /var/lib/clickhouse/preprocessed_configs/users.xml.

If something wrong with with your settings e.g. unclosed XML element or typo you can see alerts about this mistakes in /var/log/clickhouse-server/clickhouse-server.log

If you see your changes in preprocessed_configs it does not mean that changes are applied on running server, check Settings and restart.

5.36 - Aggressive merges

Aggressive merges

Q: Is there any way I can dedicate more resources to the merging process when running ClickHouse® on pretty beefy machines (like 36 cores, 1TB of RAM, and large NVMe disks)?

A: Such things are done by increasing the level of parallelism:

1. background_pool_size - how many threads will actually be doing merges and mutations. If you can push most server resources toward merges, for example, in a controlled backlog-clearing window with little foreground traffic, you can raise it aggressively. If you use replicated tables, review max_replicated_merges_in_queue together with it.

2. background_merges_mutations_concurrency_ratio - how many merges and mutations may be assigned relative to background_pool_size. Sometimes the default (2) may work against you by favoring more smaller tasks, which is useful for continuous real-time inserts but less useful when you want a backlog-clearing merge window. In that case, trying 1 is reasonable.

number_of_free_entries_in_pool_to_lower_max_size_of_merge (merge_tree setting) should be changed together with background_pool_size (50-90% of that). “When there is less than a specified number of free entries in the pool (or replicated queue), start to lower the maximum size of the merge to process (or to put in the queue). This is to allow small merges to process - not filling the pool with long-running merges.” To make it really aggressive, try 90-95% of background_pool_size, for ex. 34 (so you will have 34 huge merges and 2 small ones).

Runtime vs restart semantics

background_pool_size and background_merges_mutations_concurrency_ratio can be increased at runtime, but lowering them requires a restart.

Merge scheduling tradeoffs

background_merges_mutations_scheduling_policy is an adjacent knob worth considering:

shortest_task_first helps clear small parts quickly, but can starve large merges if inserts keep producing small parts.
round_robin is safer when starvation of large merges is a concern.

Other settings to consider

control how large target parts may become via max_bytes_to_merge_at_max_space_in_pool if the backlog is dominated by many medium parts instead of tiny fragments.
review min_merge_bytes_to_use_direct_io if you suspect page-cache churn during very large merges. Direct I/O is workload-dependent, so benchmark it instead of assuming it is always better or worse.
on replicated tables with slow merges and a fast network, consider execute_merges_on_single_replica_time_threshold so one replica performs the merge and others can fetch the merged part instead of repeating the same work.
analyze whether Vertical or Horizontal merge is better for your schema. Vertical merges typically use less RAM and keep fewer files open, while Horizontal merges may be simpler and faster for some layouts.
if you have a lot of tables, review scheduler capacity as well: background_schedule_pool_size and background_common_pool_size.
review the schema, especially codecs/compression, because they reduce size but can materially change merge speed.
try to form bigger parts during inserts with min_insert_block_size_bytes, min_insert_block_size_rows, and max_insert_block_size.
check whether Wide or Compact parts are being created (system.parts). Part format is controlled by min_bytes_for_wide_part and min_rows_for_wide_part, so inspect those settings for your version instead of assuming a fixed default cutoff.
consider using recent ClickHouse releases, because mark compression improvements can reduce I/O overhead in merge-heavy workloads.

How to validate changes

All adjustments should be validated with a reproducible benchmark or controlled backlog-clearing test. Compare the before/after trend for merge backlog or part counts, then watch whether the system clears the backlog faster without harming foreground workload. Also monitor how system resources are used or saturated during the test, especially CPU, disk I/O, and for replicated tables network plus ClickHouse Keeper / ZooKeeper load.

Monitor or plot pool usage:

select * from system.metrics where metric like '%PoolTask'

If the relevant pool task counters stay near saturation while backlog does not improve, you are likely limited by another bottleneck such as disk bandwidth, network fetches, or insert shape rather than by merge thread count alone.

Do not use this template when…

the same nodes must sustain low-latency reads and writes continuously, with little room for merge-heavy maintenance windows;
the cluster is already constrained by disk bandwidth rather than merge thread count;
the workload is dominated by mutations, where number_of_free_entries_in_pool_to_execute_mutation may need separate treatment.

Server config example

cat /etc/clickhouse-server/config.d/aggresive_merges.xml
<clickhouse>
 <background_pool_size>36</background_pool_size>
 <background_schedule_pool_size>128</background_schedule_pool_size>
 <background_common_pool_size>8</background_common_pool_size>
 <background_merges_mutations_concurrency_ratio>1</background_merges_mutations_concurrency_ratio>
 <merge_tree>
  <number_of_free_entries_in_pool_to_lower_max_size_of_merge>32</number_of_free_entries_in_pool_to_lower_max_size_of_merge>
  <max_replicated_merges_in_queue>36</max_replicated_merges_in_queue>
  <max_bytes_to_merge_at_max_space_in_pool>161061273600</max_bytes_to_merge_at_max_space_in_pool>
  <min_merge_bytes_to_use_direct_io>10737418240</min_merge_bytes_to_use_direct_io> <!-- 0 to disable -->
 </merge_tree>
</clickhouse>

Legacy profile-style example

Only use the default profile layout if you are intentionally keeping an older configuration style or a compatibility path. See the version notes at the end of this article before copying it.

cat /etc/clickhouse-server/users.d/aggresive_merges.xml
<clickhouse>
<profiles>
  <default>
    <background_pool_size>36</background_pool_size>
    <background_merges_mutations_concurrency_ratio>1</background_merges_mutations_concurrency_ratio>
  </default>
</profiles>
</clickhouse>

Version notes

Through 23.2.x, ClickHouse read these pool settings from the main config and also fell back to profiles.default.* in Context.cpp. That older path covered not only background_pool_size and background_merges_mutations_concurrency_ratio, but also settings such as background_schedule_pool_size and background_common_pool_size.
Starting with 23.3.1.2823-lts, ClickHouse changed this area in PR #48055 (“Refactor reading the pool setting & from server config”). From that release forward, these settings were documented as server settings and the source marked background_pool_size and background_merges_mutations_concurrency_ratio as moved to server config.
For 23.3.1.2823-lts and later, prefer server config (config.xml / config.d) for background_* settings. This is the layout shown in the main example above.
background_pool_size and background_merges_mutations_concurrency_ratio still keep a backward-compatibility path from the default profile at server startup in current upstream source and docs. That is why the legacy profile-style example above is limited to those two settings.
This article intentionally does not show background_schedule_pool_size or background_common_pool_size in users.d. Older versions accepted that pattern, but current upstream docs do not document those settings as profile-based compatibility knobs. For current versions, keep them in server config.

5.37 - Altinity Backup for ClickHouse®

Altinity Backup for ClickHouse® + backblaze

Installation and configuration

Download the latest clickhouse-backup.tar.gz from assets from https://github.com/Altinity/clickhouse-backup/releases

This tar.gz contains a single binary of clickhouse-backup and an example of config file.

Backblaze has s3 compatible API but requires empty acl parameter acl: "".

https://www.backblaze.com/ has 15 days and free 10Gb S3 trial.

$ mkdir clickhouse-backup
$ cd clickhouse-backup
$ wget https://github.com/Altinity/clickhouse-backup/releases/download/v2.5.20/clickhouse-backup.tar.gz
$ tar zxf clickhouse-backup.tar.gz
$ rm clickhouse-backup.tar.gz
$ cat config.yml

general:
  remote_storage: s3
  disable_progress_bar: false
  backups_to_keep_local: 0
  backups_to_keep_remote: 0
  log_level: info
  allow_empty_backups: false
clickhouse:
  username: default
  password: ""
  host: localhost
  port: 9000
  disk_mapping: {}
  skip_tables:
  - system.*
  timeout: 5m
  freeze_by_part: false
  secure: false
  skip_verify: false
  sync_replicated_tables: true
  log_sql_queries: false
s3:
  access_key: 0****1
  secret_key: K****1
  bucket: "mybucket"
  endpoint: https://s3.us-west-000.backblazeb2.com
  region: us-west-000
  acl: ""
  force_path_style: false
  path: clickhouse-backup
  disable_ssl: false
  part_size: 536870912
  compression_level: 1
  compression_format: tar
  sse: ""
  disable_cert_verification: false
  storage_class: STANDARD

I have a database test with table test

select count() from test.test;

┌─count()─┐
│  400000 │
└─────────┘

clickhouse-backup list should work without errors (it scans local and remote (s3) folders):

$ sudo ./clickhouse-backup list -c config.yml
$

Backup

create a local backup of database test
upload this backup to remote
remove the local backup
drop the source database

$ sudo ./clickhouse-backup create --tables='test.*' bkp01 -c config.yml
2021/05/31 23:11:13  info done   backup=bkp01 operation=create table=test.test
2021/05/31 23:11:13  info done   backup=bkp01 operation=create

$ sudo ./clickhouse-backup upload bkp01 -c config.yml
 1.44 MiB / 1.44 MiB [=====================] 100.00% 2s
2021/05/31 23:12:13  info done   backup=bkp01 operation=upload table=test.test
2021/05/31 23:12:17  info done   backup=bkp01 operation=upload

$ sudo ./clickhouse-backup list -c config.yml
bkp01   1.44MiB   31/05/2021 23:11:13   local
bkp01   1.44MiB   31/05/2021 23:11:13   remote      tar

$ sudo ./clickhouse-backup delete local bkp01 -c config.yml
2021/05/31 23:13:29  info delete 'bkp01'

DROP DATABASE test;

Restore

download the remote backup
restore database

$ sudo ./clickhouse-backup list -c config.yml
bkp01   1.44MiB   31/05/2021 23:11:13   remote      tar

$ sudo ./clickhouse-backup download bkp01 -c config.yml
2021/05/31 23:14:41  info done    backup=bkp01 operation=download table=test.test
 1.47 MiB / 1.47 MiB [=====================] 100.00% 0s
2021/05/31 23:14:43  info done    backup=bkp01 operation=download table=test.test
2021/05/31 23:14:43  info done    backup=bkp01 operation=download

$ sudo ./clickhouse-backup restore bkp01 -c config.yml
2021/05/31 23:16:04  info done    backup=bkp01 operation=restore table=test.test
2021/05/31 23:16:04  info done    backup=bkp01 operation=restore

SELECT count() FROM test.test;
┌─count()─┐
│  400000 │
└─────────┘

Delete backups

$ sudo ./clickhouse-backup delete local bkp01 -c config.yml
2021/05/31 23:17:05  info delete 'bkp01'

$ sudo ./clickhouse-backup delete remote bkp01 -c config.yml

5.38 - Altinity packaging compatibility >21.x and earlier

Altinity packaging compatibility >21.x and earlier

Working with Altinity & Yandex packaging together

Since ClickHouse® version 21.1 Altinity switches to the same packaging as used by Yandex. That is needed for syncing things and introduces several improvements (like adding systemd service file).

Unfortunately, that change leads to compatibility issues - automatic dependencies resolution gets confused by the conflicting package names: both when you update ClickHouse to the new version (the one which uses older packaging) and when you want to install older altinity packages (20.8 and older).

Installing old ClickHouse version (with old packaging schema)

When you try to install versions 20.8 or older from Altinity repo -

version=20.8.12.2-1.el7
yum install clickhouse-client-${version} clickhouse-server-${version}

yum outputs something like

yum install clickhouse-client-${version} clickhouse-server-${version}
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
 * base: centos.hitme.net.pl
 * extras: centos1.hti.pl
 * updates: centos1.hti.pl
Altinity_clickhouse-altinity-stable/x86_64/signature                                                                                                                                |  833 B  00:00:00
Altinity_clickhouse-altinity-stable/x86_64/signature                                                                                                                                | 1.0 kB  00:00:01 !!!
Altinity_clickhouse-altinity-stable-source/signature                                                                                                                                |  833 B  00:00:00
Altinity_clickhouse-altinity-stable-source/signature                                                                                                                                |  951 B  00:00:00 !!!
Resolving Dependencies
--> Running transaction check
---> Package clickhouse-client.x86_64 0:20.8.12.2-1.el7 will be installed
---> Package clickhouse-server.x86_64 0:20.8.12.2-1.el7 will be installed
--> Processing Dependency: clickhouse-server-common = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
Package clickhouse-server-common is obsoleted by clickhouse-server, but obsoleting package does not provide for requirements
--> Processing Dependency: clickhouse-common-static = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
--> Running transaction check
---> Package clickhouse-common-static.x86_64 0:20.8.12.2-1.el7 will be installed
---> Package clickhouse-server.x86_64 0:20.8.12.2-1.el7 will be installed
--> Processing Dependency: clickhouse-server-common = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
Package clickhouse-server-common is obsoleted by clickhouse-server, but obsoleting package does not provide for requirements
--> Finished Dependency Resolution
Error: Package: clickhouse-server-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
           Requires: clickhouse-server-common = 20.8.12.2-1.el7
           Available: clickhouse-server-common-1.1.54370-2.x86_64 (clickhouse-stable)
               clickhouse-server-common = 1.1.54370-2
           Available: clickhouse-server-common-1.1.54378-2.x86_64 (clickhouse-stable)
               clickhouse-server-common = 1.1.54378-2
...
           Available: clickhouse-server-common-20.8.11.17-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
               clickhouse-server-common = 20.8.11.17-1.el7
           Available: clickhouse-server-common-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
               clickhouse-server-common = 20.8.12.2-1.el7
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

As you can see yum has an issue with resolving clickhouse-server-common dependency, which marked as obsoleted by newer packages.

Solution with Old Packaging Scheme

add --setopt=obsoletes=0 flag to the yum call.

version=20.8.12.2-1.el7
yum install --setopt=obsoletes=0 clickhouse-client-${version} clickhouse-server-${version}
---
title: "installation succeeded"
linkTitle: "installation succeeded"
description: >
    installation succeeded
---

Alternatively, you can add obsoletes=0 into /etc/yum.conf.

To update to new ClickHouse version (from old packaging schema to new packaging schema)

version=21.1.7.1-2
yum install clickhouse-client-${version} clickhouse-server-${version}

Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
 * base: centos.hitme.net.pl
 * extras: centos1.hti.pl
 * updates: centos1.hti.pl
Altinity_clickhouse-altinity-stable/x86_64/signature                                                                                                                                |  833 B  00:00:00
Altinity_clickhouse-altinity-stable/x86_64/signature                                                                                                                                | 1.0 kB  00:00:01 !!!
Altinity_clickhouse-altinity-stable-source/signature                                                                                                                                |  833 B  00:00:00
Altinity_clickhouse-altinity-stable-source/signature                                                                                                                                |  951 B  00:00:00 !!!
Nothing to do

It is caused by wrong dependencies resolution.

Solution with New Package Scheme

To update to the latest available version - just add clickhouse-server-common:

yum install clickhouse-client clickhouse-server clickhouse-server-common

This way the latest available version will be installed (even if you will request some other version explicitly).

To install some specific version remove old packages first, then install new ones.

yum erase clickhouse-client clickhouse-server clickhouse-server-common clickhouse-common-static
version=21.1.7.1
yum install clickhouse-client-${version} clickhouse-server-${version}

Downgrade from new version to old one

version=20.8.12.2-1.el7
yum downgrade  clickhouse-client-${version} clickhouse-server-${version}

will not work:

Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
 * base: ftp.agh.edu.pl
 * extras: ftp.agh.edu.pl
 * updates: centos.wielun.net
Resolving Dependencies
--> Running transaction check
---> Package clickhouse-client.x86_64 0:20.8.12.2-1.el7 will be a downgrade
---> Package clickhouse-client.noarch 0:21.1.7.1-2 will be erased
---> Package clickhouse-server.x86_64 0:20.8.12.2-1.el7 will be a downgrade
--> Processing Dependency: clickhouse-server-common = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
Package clickhouse-server-common-20.8.12.2-1.el7.x86_64 is obsoleted by clickhouse-server-21.1.7.1-2.noarch which is already installed
--> Processing Dependency: clickhouse-common-static = 20.8.12.2-1.el7 for package: clickhouse-server-20.8.12.2-1.el7.x86_64
---> Package clickhouse-server.noarch 0:21.1.7.1-2 will be erased
--> Finished Dependency Resolution
Error: Package: clickhouse-server-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
           Requires: clickhouse-common-static = 20.8.12.2-1.el7
           Installed: clickhouse-common-static-21.1.7.1-2.x86_64 (@clickhouse-stable)
               clickhouse-common-static = 21.1.7.1-2
           Available: clickhouse-common-static-1.1.54378-2.x86_64 (clickhouse-stable)
               clickhouse-common-static = 1.1.54378-2
Error: Package: clickhouse-server-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
...
           Available: clickhouse-server-common-20.8.12.2-1.el7.x86_64 (Altinity_clickhouse-altinity-stable)
               clickhouse-server-common = 20.8.12.2-1.el7
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

Solution With Downgrading

Remove packages first, then install older versions:

yum erase clickhouse-client clickhouse-server clickhouse-server-common clickhouse-common-static
version=20.8.12.2-1.el7
yum install --setopt=obsoletes=0 clickhouse-client-${version} clickhouse-server-${version}

5.39 - AWS EC2 Storage

AWS EBS, EFS, FSx, Lustre

EBS

Most native choose for ClickHouse® as fast storage, because it usually guarantees best throughput, IOPS, latency for reasonable price.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html

General Purpose SSD volumes

In usual conditions ClickHouse being limited by throughput of volumes and amount of provided IOPS doesn’t make any big difference for performance starting from a certain number. So the most native choice for ClickHouse is gp3 and gp2 volumes.

‌EC2 instances also have an EBS throughput limit, it depends on the size of the EC2 instance. That means if you would attach multiple volumes which would have high potential throughput, you would be limited by your EC2 instance, so usually there is no reason to have more than 1-3 GP3 volume or 4-5 GP2 volume per node.

It’s pretty straightforward to set up a ClickHouse for using multiple EBS volumes with jbod storage_policies.

general purpose

Volume type	gp3	gp2
Max throughput per volume	1000 MiB/s	250 MiB/s
Price	$0.08/GB-month 3,000 IOPS free and $0.005/provisioned IOPS-month over 3,000; 125 MB/s free and $0.04/provisioned MB/s-month over 125	$0.10/GB-month

Volume type

gp3

gp2

Max throughput per volume

1000 MiB/s

250 MiB/s

Price

$0.08/GB-month

3,000 IOPS free and

$0.005/provisioned IOPS-month over 3,000;

125 MB/s free and

$0.04/provisioned MB/s-month over 125

$0.10/GB-month

GP3

It’s recommended option, as it allow you to have only one volume, for instances which have less than 10 Gbps EBS Bandwidth (nodes =<32 VCPU usually) and still have maximum performance. For bigger instances, it make sense to look into option of having several GP3 volumes.

It’s a new type of volume, which is 20% cheaper than gp2 per GB-month and has lower free throughput: only 125 MiB/s vs 250 MiB/s. But you can buy additional throughput and IOPS for volume. It also works better if most of your queries read only one or several parts, because in that case you are not being limited by performance of a single EBS disk, as parts can be located only on one disk at once.

Because, you need to have less GP3 volumes compared to GP2 option, it’s suggested approach for now.

For best performance, it’s suggested to buy:

7000 IOPS
Throughput up to the limit of your EC2 instance (1000 MiB/s is safe option)

GP2

‌GP2 volumes have a hard limit of 250 MiB/s per volume (for volumes bigger than 334 GB), it usually makes sense to split one big volume in multiple smaller volumes larger than 334GB in order to have maximum possible throughput.

Throughput Optimized HDD volumes

ST1

Looks like a good candidate for cheap cold storage for old data with decent maximum throughput 500 MiB/s. But it achieved only for big volumes >5 TiB.

Throughput credits and burst performance

Provisioned IOPS SSD volumes

IO2 Block Express, IO2, IO1

In 99.99% cases doesn’t give any benefit for ClickHouse compared to GP3 option and perform worse because maximum throughput is limited to 500 MiB/s per volume if you buy less than 32 000 IOPS, which is really expensive (compared to other options) and unneeded for ClickHouse. And if you have spare money, it’s better to spend them on better EC2 instance.

S3

Best option for cold data, it can give considerably good throughput and really good price, but latencies and IOPS much worse than EBS option. Another interesting point is, for EC2 instance throughput limit for EBS and S3 calculated separately, so if you access your data both from EBS and S3, you can get double throughput.

It’s stated in AWS documentation, that S3 can fully utilize network capacity of EC2 instance. (up to 100 Gb/s) Latencies or (first-byte-out) estimated to be 100-200 milliseconds withing single region.

It also recommended to enable gateway endpoint for s3 , it can push throughput even further (up to 800 Gb/s)

S3 best practices

EFS

Works over NFSv4.1 version. We have clients, which run their ClickHouse installations over NFS. It works considerably well as cold storage, so it’s recommended to have EBS disks for hot data. A fast network is required.

ClickHouse doesn’t have any native option to reuse the same data on durable network disk via several replicas. You either need to store the same data twice or build custom tooling around ClickHouse and use it without Replicated*MergeTree tables.

FSx

Lustre

We have several clients, who use Lustre (some of them use AWS FSx Lustre, another is self managed Lustre) without any big issue. Fast network is required. There were known problems with data damage on older versions caused by issues with O_DIRECT or async IO support on Lustre.

https://altinity.com/blog/2019/11/27/amplifying-clickhouse-capacity-with-multi-volume-storage-part-1

https://altinity.com/blog/2019/11/29/amplifying-clickhouse-capacity-with-multi-volume-storage-part-2

https://calculator.aws/#/createCalculator/EBS?nc2=h_ql_pr_calc

5.40 - ClickHouse® in Docker

ClickHouse® in Docker

Do you have documentation on Docker deployments?

Check

Important things:

use concrete version tag (avoid using latest)
if possible use --network=host (due to performance reasons)
you need to mount the folder /var/lib/clickhouse to have persistency.
you MAY also mount the folder /var/log/clickhouse-server to have logs accessible outside of the container.
Also, you may mount in some files or folders in the configuration folder:
- /etc/clickhouse-server/config.d/listen_ports.xml
--ulimit nofile=262144:262144
You can also set on some linux capabilities to enable some of extra features of ClickHouse® (not obligatory): SYS_PTRACE NET_ADMIN IPC_LOCK SYS_NICE
you may also mount in the folder /docker-entrypoint-initdb.d/ - all SQL or bash scripts there will be executed during container startup.
if you use cgroup limits - it may misbehave https://github.com/ClickHouse/ClickHouse/issues/2261 (set up <max_server_memory_usage> manually)
there are several ENV switches, see: https://github.com/ClickHouse/ClickHouse/blob/master/docker/server/entrypoint.sh

TLDR version: use it as a starting point:

docker run -d \
   --name some-clickhouse-server \
   --ulimit nofile=262144:262144 \
   --volume=$(pwd)/data:/var/lib/clickhouse \
   --volume=$(pwd)/logs:/var/log/clickhouse-server \
   --volume=$(pwd)/configs/memory_adjustment.xml:/etc/clickhouse-server/config.d/memory_adjustment.xml \
   --cap-add=SYS_NICE \
   --cap-add=NET_ADMIN \
   --cap-add=IPC_LOCK \
   --cap-add=SYS_PTRACE \
   --network=host \
   clickhouse/clickhouse-server:latest

docker exec -it some-clickhouse-server clickhouse-client
docker exec -it some-clickhouse-server bash

5.41 - ClickHouse® Monitoring

Tracking potential issues in your cluster before they cause a critical error

What to read / watch on the subject:

Altinity webinar “ClickHouse Monitoring 101: What to monitor and how”. Watch the video or download the slides .
The ClickHouse docs

What should be monitored

The following metrics should be collected / monitored

For Host Machine:
- CPU
- Memory
- Network (bytes/packets)
- Storage (iops)
- Disk Space (free / used)
For ClickHouse:
- Connections (Number of queries running)
- DDL queue length
- RWLocks
- Read / Write / Return (bytes/rows)
- Merges (queue length, memory used)
- Mutations
- Query duration (optional)
- Replication queue length and lag
- Read only tables
- ZooKeeper latencies
- Zookeeper operations (count)
- S3 errors (if used)
For Zookeeper:
- See separate article

ClickHouse monitoring tools

Prometheus (embedded exporter) + Grafana

Enable embedded exporter
Grafana dashboards https://grafana.com/grafana/dashboards/14192 or https://grafana.com/grafana/dashboards/13500

Prometheus (embedded http handler with Altinity Kubernetes Operator for ClickHouse style metrics) + Grafana

Enable http handler
Useful, if you want to use the dashboard from the Altinity Kubernetes Operator for ClickHouse, but do not run ClickHouse in k8s.

Prometheus (embedded exporter in the Altinity Kubernetes Operator for ClickHouse) + Grafana

exporter is included in the Altinity Kubernetes Operator for ClickHouse, and enabled automatically
see instructions of Prometheus and Grafana installation (if you don’t have one)
Grafana dashboard https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard
Prometheus alerts https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml

Prometheus (ClickHouse external exporter) + Grafana

clickhouse-exporter
Dashboard: https://grafana.com/grafana/dashboards/882

(unmaintained)

Dashboards querying ClickHouse directly via vertamedia / Altinity plugin

Overview: https://grafana.com/grafana/dashboards/13606
Queries dashboard (analyzing system.query_log) https://grafana.com/grafana/dashboards/2515

Dashboard querying ClickHouse directly via Grafana plugin

Zabbix

Graphite

Use the embedded exporter. See docs and config.xml

InfluxDB

You can use embedded exporter, plus Telegraf. For more information, see Graphite protocol support in InfluxDB .

Nagios/Icinga

https://github.com/exogroup/check_clickhouse/

Commercial solution

“Build your own” ClickHouse monitoring

ClickHouse allows to access lots of internals using system tables. The main tables to access monitoring data are:

system.metrics
system.asynchronous_metrics
system.events

Minimum necessary set of checks

Check Name	`Shell or SQL command`	`Severity`
ClickHouse status	`$ curl 'http://localhost:8123/'` `Ok.`	`Critical`
Too many simultaneous queries. Maximum: 100 (by default)	`select value from system.metrics` `where metric='Query'`	`Critical`
Replication status	`$ curl 'http://localhost:8123/replicas_status'` `Ok.`	`High`
Read only replicas (reflected by `replicas_status` as well)	`select value from system.metrics` `where metric='ReadonlyReplica'`	`High`
Some replication tasks are stuck	`select count()` `from system.replication_queue` `where num_tries > 100 or num_postponed > 1000`	`High`
ZooKeeper is available	`select count() from system.zookeeper` `where path='/'`	`Critical for writes`
ZooKeeper exceptions	`select value from system.events` `where event='ZooKeeperHardwareExceptions'`	`Medium`
Other CH nodes are available	$ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done \| sort -u `Ok.`	`High`
All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries)	for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done	`Critical`
There are files in 'detached' folders	`$ find /var/lib/clickhouse/data///detached/* -type d \| wc -l; \ 19.8+` `select count() from system.detached_parts`	`Medium`
Too many parts: \ Number of parts is growing; \ Inserts are being delayed; \ Inserts are being rejected	`select value from system.asynchronous_metrics` `where metric='MaxPartCountForPartition';` `select value from system.events/system.metrics` `where event/metric='DelayedInserts'; \ select value from system.events` `where event='RejectedInserts'`	`Critical`
Dictionaries: exception	`select concat(name,': ',last_exception)` `from system.dictionaries` `where last_exception != ''`	`Medium`
ClickHouse has been restarted	`select uptime();` `select value from system.asynchronous_metrics` `where metric='Uptime'`
DistributedFilesToInsert should not be always increasing	`select value from system.metrics` `where metric='DistributedFilesToInsert'`	`Medium`
A data part was lost	`select value from system.events` `where event='ReplicatedDataLoss'`	`High`
Data parts are not the same on different replicas	`select value from system.events where event='DataAfterMergeDiffersFromReplica'; \ select value from system.events where event='DataAfterMutationDiffersFromReplica'`	`Medium`

The following queries are recommended to be included in monitoring:

SELECT * FROM system.replicas
- For more information, see the ClickHouse guide on System Tables
SELECT * FROM system.merges
- Checks on the speed and progress of currently executed merges.
SELECT * FROM system.mutations
- This is the source of information on the speed and progress of currently executed merges.

Monitoring ClickHouse logs

ClickHouse logs can be another important source of information. There are 2 logs enabled by default

/var/log/clickhouse-server/clickhouse-server.err.log (error & warning, you may want to keep an eye on that or send it to some monitoring system)
/var/log/clickhouse-server/clickhouse-server.log (trace logs, very detailed, useful for debugging, usually too verbose to monitor).

You can additionally enable system.text_log table to have an access to the logs from clickhouse sql queries (ensure that you will not expose some information to the users who should not see it).

$ cat /etc/clickhouse-server/config.d/text_log.xml
<yandex>
    <text_log>
        <database>system</database>
        <table>text_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <level>warning</level>
    </text_log>
</yandex>

OpenTelemetry support

See https://clickhouse.com/docs/en/operations/opentelemetry/

Other sources

5.42 - ClickHouse® versions

ClickHouse® versions

ClickHouse® versioning schema

ClickHouse Version Breakdown

Example:

21.3.10.1-lts

21 is the year of release.
3 indicates a Feature Release. This is an increment where features are delivered.
10 is the bugfix / maintenance version. When that version is incremented it means that some bugs was fixed comparing to 21.3.9.
1 - build number, means nothing for end users.
lts - type of release. (long time support).

What is Altinity Stable version?

It is one of general / public version of ClickHouse which has passed some extra testings, the upgrade path and changelog was analyzed, known issues are documented, and at least few big companies use it on production. All those things take some time, so usually that means that Altinity Stable is always a ‘behind’ the main releases.

Altinity version - is an option for conservative users, who prefer bit older but better known things.

Usually there is no reason to use version older than Altinity Stable. If you see that new Altinity Version arrived and you still use some older version - you should for sure consider an upgrade.

Additionally for Altinity client we provide extra support for those version for a longer time (and we also support newer versions).

Which version should I use?

We recommend the following approach:

When you start using ClickHouse and before you go on production - pick the latest stable version.
If you already have ClickHouse running on production:
1. Check all the new queries / schemas on the staging first, especially if some new ClickHouse features are used.
2. Do minor (bugfix) upgrades regularly: monitor new maintenance releases of the feature release you use.
3. When considering upgrade - check Altinity Stable release docs , if you want to use newer release - analyze changelog and known issues.
4. Check latest stable or test versions of ClickHouse on your staging environment regularly and pass the feedback to us or on the official ClickHouse github .
5. Consider blue/green or canary upgrades.

How do I upgrade?

Follow this KB article for ClickHouse version upgrade

Bugs?

ClickHouse development process goes in a very high pace and has already thousands of features. CI system doing tens of thousands of tests (including tests with different sanitizers) against every commit.

All core features are well-tested, and very stable, and code is high-quality. But as with any other software bad things may happen. Usually the most of bugs happens in the new, freshly added functionality, and in some complex combination of several features (of course all possible combinations of features just physically can’t be tested). Usually new features are adopted by the community and stabilize quickly.

What should I do if I found a bug in ClickHouse?

First of all: try to upgrade to the latest bugfix release Example: if you use v21.3.5.42-lts but you know that v21.3.10.1-lts already exists - start with upgrade to that. Upgrades to latest maintenance releases are smooth and safe.
Look for similar issues in github. Maybe the fix is on the way.
If you can reproduce the bug: try to isolate it - remove some pieces of query one-by-one / simplify the scenario until the issue still reproduces. This way you can figure out which part is responsible for that bug, and you can try to create minimal reproducible example
Once you have minimal reproducible example:
1. report it to github (or to Altinity Support)
2. check if it reproduces on newer ClickHouse versions

5.43 - Configure ClickHouse® for low memory environments

Configure ClickHouse® for low memory environments

While Clickhouse® it’s typically deployed on powerful servers with ample memory and CPU, it can be deployed in resource-constrained environments like a Raspberry Pi. Whether you’re working on edge computing, IoT data collection, or simply experimenting with ClickHouse in a small-scale setup, running it efficiently on low-memory hardware can be a rewarding challenge.

TLDR;

<!-- config.xml -->
<!-- These settinsg should allow to run clickhouse in nodes with 4GB/8GB RAM -->
<clickhouse>
  <!-- disable some optional components/tables -->
  <mysql_port remove="1" />
  <postgresql_port remove="1" />  
  <query_thread_log remove="1" />
  <opentelemetry_span_log remove="1" />
  <processors_profile_log remove="1" />   

  <!-- disable mlock, allowing binary pages to be unloaded from RAM, relying on Linux defaults -->
  <mlock_executable>false</mlock_executable> 

  <!-- decrease the cache sizes -->
  <mark_cache_size>268435456</mark_cache_size> <!-- 256 MB -->
  <index_mark_cache_size>67108864</index_mark_cache_size> <!-- 64 MB -->
  <uncompressed_cache_size>16777216</uncompressed_cache_size> <!-- 16 MB -->

  <!-- control the concurrency -->
  <max_thread_pool_size>2000</max_thread_pool_size>
  <max_connections>64</max_connections>
  <max_concurrent_queries>8</max_concurrent_queries>
  <max_server_memory_usage_to_ram_ratio>0.75</max_server_memory_usage_to_ram_ratio> <!-- 75% of the RAM, leave more for the system -->
  <max_server_memory_usage>0</max_server_memory_usage> <!-- We leave the overcommiter to manage available ram for queries-->

  <!-- reconfigure the main pool to limit the merges (those can create problems if the insert pressure is high) -->
  <background_pool_size>2</background_pool_size>
  <background_merges_mutations_concurrency_ratio>2</background_merges_mutations_concurrency_ratio>
  <merge_tree>
    <merge_max_block_size>1024</merge_max_block_size>
    <max_bytes_to_merge_at_max_space_in_pool>1073741824</max_bytes_to_merge_at_max_space_in_pool> <!-- 1 GB max part-->
    <number_of_free_entries_in_pool_to_lower_max_size_of_merge>2</number_of_free_entries_in_pool_to_lower_max_size_of_merge>
    <number_of_free_entries_in_pool_to_execute_mutation>2</number_of_free_entries_in_pool_to_execute_mutation>
    <number_of_free_entries_in_pool_to_execute_optimize_entire_partition>2</number_of_free_entries_in_pool_to_execute_optimize_entire_partition>
    <!-- Reduces memory usage during merges in system.metric_log table (enabled by default) by setting min_bytes_for_wide_part and vertical_merge_algorithm_min_bytes_to_activate to 128MB -->
    <min_bytes_for_wide_part>134217728</min_bytes_for_wide_part>
    <vertical_merge_algorithm_min_bytes_to_activate>134217728</vertical_merge_algorithm_min_bytes_to_activate>
  </merge_tree>

  <!-- shrink all pools to minimum-->
  <background_buffer_flush_schedule_pool_size>1</background_buffer_flush_schedule_pool_size>
  <background_merges_mutations_scheduling_policy>round_robin</background_merges_mutations_scheduling_policy>
  <background_move_pool_size>1</background_move_pool_size>
  <background_fetches_pool_size>1</background_fetches_pool_size>
  <background_common_pool_size>2</background_common_pool_size>
  <background_schedule_pool_size>8</background_schedule_pool_size>
  <background_message_broker_schedule_pool_size>1</background_message_broker_schedule_pool_size>
  <background_distributed_schedule_pool_size>1</background_distributed_schedule_pool_size>
  <tables_loader_foreground_pool_size>0</tables_loader_foreground_pool_size>
  <tables_loader_background_pool_size>0</tables_loader_background_pool_size>   
</clickhouse>

<!-- users.xml -->
<clickhouse>
  <profiles>
    <default>
      <max_threads>2</max_threads>
      <max_block_size>8192</max_block_size>
      <queue_max_wait_ms>1000</queue_max_wait_ms>
      <max_execution_time>600</max_execution_time>
      <input_format_parallel_parsing>0</input_format_parallel_parsing>
      <output_format_parallel_formatting>0</output_format_parallel_formatting>
      <max_bytes_before_external_group_by>3221225472</max_bytes_before_external_group_by> <!-- 3 GB -->
      <max_bytes_before_external_sort>3221225472</max_bytes_before_external_sort> <!-- 3 GB -->
    </default>
  </profiles>
</clickhouse>

Some interesting settings to explain:

Disabling both postgres/mysql interfaces will release some CPU/memory resources.
Disabling some system tables like processor_profile_log, opentelemetry_span_log, or query_thread_log will help reducing write amplification. Those tables write a lot of data very frequently. In a Raspi4 with 4 GB of RAM and a simple USB3.1 storage they can spend some needed resources.
Decrease mark caches. Defaults are 5GB and they are loaded into RAM (in newer versions this behavior of loading them completely in RAM can be tuned with a prewarm setting https://github.com/ClickHouse/ClickHouse/pull/71053 ) so better to reserve a reasonable amount of space in line with the total amount of RAM. For example for 4/8GB 256MB is a good value.
Tune server memory and leave 25% for OS ops (max_server_memory_usage_to_ram_ratio)
Tune the thread pools and queues for merges and mutations:
- merge_max_block_size will reduce the number of rows per block when merging. Default is 8192 and this will reduce the memory usage of merges.
- The number_of_free_entries_in_pool settings are very nice to tune how much concurrent merges are allowed in the queue. When there is less than specified number of free entries in pool , start to lower maximum size of merge to process (or to put in queue) or do not execute part mutations to leave free threads for regular merges . This is to allow small merges to process - not filling the pool with long running merges or multiple mutations. You can check clickhouse documentation to get more insights.
Reduce the background pools and be conservative. In a Raspi4 with 4 cores and 4 GB or ram, background pool should be not bigger than the number of cores and even less if possible.
Tune some profile settings to enable disk spilling (max_bytes_before_external_group_by and max_bytes_before_external_sort) and reduce the number of threads per query plus enable queuing of queries (queue_max_wait_ms) if the max_concurrent_queries limit is exceeded. Also max_block_size is not usually touched but in this case we can lower it ro reduce RAM usage.

5.44 - Converting MergeTree to Replicated

Adding replication to a table

To enable replication in a table that uses the MergeTree engine, you need to convert the engine to ReplicatedMergeTree. Options here are:

UseINSERT INTO foo_replicated SELECT * FROM foo. (suitable for small tables)
Create table aside and attach all partition from the existing table then drop original table (uses hard links don’t require extra disk space). ALTER TABLE foo_replicated ATTACH PARTITION ID 'bar' FROM 'foo' You can easily auto generate those commands using a query like: SELECT DISTINCT 'ALTER TABLE foo_replicated ATTACH PARTITION ID \'' || partition_id || '\' FROM foo;' from system.parts WHERE table = 'foo'; See the example below for details.
Do it ‘in place’ using some file manipulation. see the procedure described here: https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replication/#converting-from-mergetree-to-replicatedmergetree
Do a backup of MergeTree and recover as ReplicatedMergeTree. https://github.com/Altinity/clickhouse-backup/blob/master/Examples.md#how-to-convert-mergetree-to-replicatedmegretree
Embedded command for recent Clickhouse versions - https://clickhouse.com/docs/en/sql-reference/statements/attach#attach-mergetree-table-as-replicatedmergetree

Example for option 2 above

Note: ATTACH PARTITION ID 'bar' FROM 'foo' is practically free from a compute and disk space perspective. This feature utilizes filesystem hard-links and the fact that files are immutable in ClickHouse® (it’s the core of the ClickHouse design, filesystem hard-links and such file manipulations are widely used).

create table foo( A Int64, D Date, S String ) 
Engine MergeTree 
partition by toYYYYMM(D) order by A;

insert into foo select number, today(), '' from numbers(1e8);
insert into foo select number, today()-60, '' from numbers(1e8);

select count() from foo;
┌───count()─┐
│ 200000000 │
└───────────┘

create table foo_replicated as foo 
Engine ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}/{shard}','{replica}')
partition by toYYYYMM(D) order by A;

SYSTEM STOP MERGES;

SELECT DISTINCT 'ALTER TABLE foo_replicated ATTACH PARTITION ID \'' || partition_id || '\' FROM foo;' from system.parts WHERE table = 'foo' AND active;
┌─concat('ALTER TABLE foo_replicated ATTACH PARTITION ID \'', partition_id, '\' FROM foo;')─┐
│ ALTER TABLE foo_replicated ATTACH PARTITION ID '202111' FROM foo;                         │
│ ALTER TABLE foo_replicated ATTACH PARTITION ID '202201' FROM foo;                         │
└───────────────────────────────────────────────────────────────────────────────────────────┘

clickhouse-client -q "SELECT DISTINCT 'ALTER TABLE foo_replicated ATTACH PARTITION ID \'' || partition_id || '\' FROM foo;' from system.parts WHERE table = 'foo' format TabSeparatedRaw" |clickhouse-client -mn

SYSTEM START MERGES;

SELECT count() FROM foo_replicated;
┌───count()─┐
│ 200000000 │
└───────────┘

rename table foo to foo_old, foo_replicated to foo;

-- you can drop foo_old any time later, it's kinda a cheap backup, 
-- it cost nothing until you insert a lot of additional data into foo_replicated

5.45 - Data Migration

Data Migration

Export & Import into common data formats

Pros:

Data can be inserted into any DBMS.

Cons:

Decoding & encoding of common data formats may be slower / require more CPU
The data size is usually bigger than ClickHouse® formats.
Some of the common data formats have limitations.

Info

The best approach to do that is using clickhouse-client, in that case, encoding/decoding of format happens client-side, while client and server speak clickhouse Native format (columnar & compressed).

In contrast: when you use HTTP protocol, the server do encoding/decoding and more data is passed between client and server.

remote/remoteSecure or cluster/Distributed table

Pros:

Simple to run.
It’s possible to change the schema and distribution of data between shards.
It’s possible to copy only some subset of data.
Needs only access to ClickHouse TCP port.

Cons:

Uses CPU / RAM (mostly on the receiver side)

See details of both approaches in:

remote-table-function.md

distributed-table-cluster.md

Manual parts moving: freeze / rsync / attach

Pros:

Low CPU / RAM usage.

Cons:

Table schema should be the same.
A lot of manual operations/scripting.

Info

With some additional care and scripting, it’s possible to do cheap re-sharding on parts level.

See details in:

rsync.md

clickhouse-backup

Pros:

Low CPU / RAM usage.
Suitable to recover both schema & data for all tables at once.

Cons:

Table schema should be the same.

Just create the backup on server 1, upload it to server 2, and restore the backup.

See https://github.com/Altinity/clickhouse-backup

https://altinity.com/blog/introduction-to-clickhouse-backups-and-clickhouse-backup

Fetch from zookeeper path

Pros:

Low CPU / RAM usage.

Cons:

Table schema should be the same.
Works only when the source and the destination ClickHouse servers share the same zookeeper (without chroot)
Needs to access zookeeper and ClickHouse replication ports: (interserver_http_port or interserver_https_port)

ALTER TABLE table_name FETCH PARTITION partition_expr FROM 'path-in-zookeeper'

alter table fetch detail

Using the replication protocol by adding a new replica

Just make one more replica in another place.

Pros:

Simple to setup
Data is consistent all the time automatically.
Low CPU and network usage should be tuned.

Cons:

Needs to reach both zookeeper client (2181) and ClickHouse replication ports: (interserver_http_port or interserver_https_port)
In case of cluster migration, zookeeper need’s to be migrated too.
Replication works both ways so new replica should be outside the main cluster.

Check the details in:

Add a replica to a Cluster

5.45.1 - MSSQL bcp pipe to clickhouse-client

Export from MSSQL to ClickHouse®

How to pipe data to ClickHouse® from bcp export tool for MSSQL database

Prepare tables

LAPTOP.localdomain :) CREATE TABLE tbl(key UInt32) ENGINE=MergeTree ORDER BY key;

root@LAPTOP:/home/user# sqlcmd -U sa -P Password78
1> WITH t0(i) AS (SELECT 0 UNION ALL SELECT 0), t1(i) AS (SELECT 0 FROM t0 a, t0 b), t2(i) AS (SELECT 0 FROM t1 a, t1 b), t3(i) AS (SELECT 0 FROM t2 a, t2 b), t4(i) AS (SELECT 0 FROM t3 a, t3 b), t5(i) AS (SELECT 0 FROM t4 a, t3 b),n(i) AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) FROM t5) SELECT i INTO tbl FROM n WHERE i BETWEEN 1 AND 16777216
2> GO

(16777216 rows affected)

root@LAPTOP:/home/user# sqlcmd -U sa -P Password78 -Q "SELECT count(*) FROM tbl"

-----------
   16777216

(1 rows affected)

Piping

root@LAPTOP:/home/user# mkfifo import_pipe
root@LAPTOP:/home/user# bcp "SELECT * FROM tbl" queryout import_pipe -t, -c -b 200000 -U sa -P Password78 -S localhost &
[1] 6038
root@LAPTOP:/home/user#
Starting copy...
1000 rows successfully bulk-copied to host-file. Total received: 1000
1000 rows successfully bulk-copied to host-file. Total received: 2000
1000 rows successfully bulk-copied to host-file. Total received: 3000
1000 rows successfully bulk-copied to host-file. Total received: 4000
1000 rows successfully bulk-copied to host-file. Total received: 5000
1000 rows successfully bulk-copied to host-file. Total received: 6000
1000 rows successfully bulk-copied to host-file. Total received: 7000
1000 rows successfully bulk-copied to host-file. Total received: 8000
1000 rows successfully bulk-copied to host-file. Total received: 9000
1000 rows successfully bulk-copied to host-file. Total received: 10000
1000 rows successfully bulk-copied to host-file. Total received: 11000
1000 rows successfully bulk-copied to host-file. Total received: 12000
1000 rows successfully bulk-copied to host-file. Total received: 13000
1000 rows successfully bulk-copied to host-file. Total received: 14000
1000 rows successfully bulk-copied to host-file. Total received: 15000
1000 rows successfully bulk-copied to host-file. Total received: 16000
1000 rows successfully bulk-copied to host-file. Total received: 17000
1000 rows successfully bulk-copied to host-file. Total received: 18000
1000 rows successfully bulk-copied to host-file. Total received: 19000
1000 rows successfully bulk-copied to host-file. Total received: 20000
1000 rows successfully bulk-copied to host-file. Total received: 21000
1000 rows successfully bulk-copied to host-file. Total received: 22000
1000 rows successfully bulk-copied to host-file. Total received: 23000
-- Enter
root@LAPTOP:/home/user# cat import_pipe | clickhouse-client --query "INSERT INTO tbl FORMAT CSV" &
...
1000 rows successfully bulk-copied to host-file. Total received: 16769000
1000 rows successfully bulk-copied to host-file. Total received: 16770000
1000 rows successfully bulk-copied to host-file. Total received: 16771000
1000 rows successfully bulk-copied to host-file. Total received: 16772000
1000 rows successfully bulk-copied to host-file. Total received: 16773000
1000 rows successfully bulk-copied to host-file. Total received: 16774000
1000 rows successfully bulk-copied to host-file. Total received: 16775000
1000 rows successfully bulk-copied to host-file. Total received: 16776000
1000 rows successfully bulk-copied to host-file. Total received: 16777000
16777216 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total     : 11540  Average : (1453831.5 rows per sec.)

[1]-  Done                    bcp "SELECT * FROM tbl" queryout import_pipe -t, -c -b 200000 -U sa -P Password78 -S localhost
[2]+  Done                    cat import_pipe | clickhouse-client --query "INSERT INTO tbl FORMAT CSV"

Another shell

root@LAPTOP:/home/user# for i in `seq 1 600`; do clickhouse-client -q "select count() from tbl";sleep 1;  done
0
0
0
0
0
0
1048545
4194180
6291270
9436905
11533995
13631085
16777216
16777216
16777216
16777216

5.45.2 - Add/Remove a new replica to a ClickHouse® cluster

How to add/remove a new ClickHouse replica manually and using clickhouse-backup

ADD nodes/replicas to a ClickHouse® cluster

To add some ClickHouse® replicas to an existing cluster if -30TB then better to use replication:

don’t add the remote_servers.xml until replication is done.
Add these files and restart to limit bandwidth and avoid saturation (70% total bandwidth):

Core Settings | ClickHouse Docs

💡 Do the Gbps to Bps math correctly. For 10G —> 1250MB/s —> 1250000000 B/s. Change the max_replicated_* settings accordingly and add them to a file in /etc/clickhouse-server/config.d/ (e.g., config.d/replication-limits.xml) and restart ClickHouse:

Nodes replicating from:

<clickhouse>
  <max_replicated_sends_network_bandwidth_for_server>50000</max_replicated_sends_network_bandwidth_for_server>
</clickhouse>

Nodes replicating to:

<clickhouse>
  <max_replicated_fetches_network_bandwidth_for_server>50000</max_replicated_fetches_network_bandwidth_for_server>
</clickhouse>

Manual method (DDL)

Create tables manually and be sure macros in all replicas are aligned with the ZK path. If zk path uses {cluster} then this method won’t work. ZK path should use {shard} and {replica} or {uuid} (if databases are Atomic) only.

-- DDL for Databases
SELECT concat('CREATE DATABASE "', name, '" ENGINE = ', engine_full, ';') 
FROM system.databases WHERE name NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA')
INTO OUTFILE '/tmp/databases.sql' 
FORMAT TSVRaw;
-- DDL for tables and views
SELECT
    replaceRegexpOne(replaceOne(concat(create_table_query, ';'), '(', 'ON CLUSTER \'{cluster}\' ('), 'CREATE (TABLE|DICTIONARY|VIEW|LIVE VIEW|WINDOW VIEW)', 'CREATE \\1 IF NOT EXISTS')
FROM
    system.tables
WHERE engine != 'MaterializedView' and
    database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') AND
    create_table_query != '' AND
    name NOT LIKE '.inner.%%' AND
    name NOT LIKE '.inner_id.%%'
INTO OUTFILE '/tmp/schema.sql' AND STDOUT
FORMAT TSVRaw
SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1;
--- DDL only for materialized views
SELECT
    replaceRegexpOne(replaceOne(concat(create_table_query, ';'), 'TO', 'ON CLUSTER \'{cluster}\' TO'), '(CREATE MATERIALIZED VIEW)', '\\1 IF NOT EXISTS')
FROM
    system.tables
WHERE engine = 'MaterializedView' and
    database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') AND
    create_table_query != '' AND
    name NOT LIKE '.inner.%%' AND
    name NOT LIKE '.inner_id.%%' AND
		as_select != ''
INTO OUTFILE '/tmp/schema.sql' APPEND AND STDOUT
FORMAT TSVRaw
SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1;

This will generate the UUIDs in the CREATE TABLE definition, something like this:

CREATE TABLE IF NOT EXISTS default.insert_test UUID '51b41170-5192-4947-b13b-d4094c511f06' ON CLUSTER '{cluster}' (`id_order` UInt16, `id_plat` UInt32, `id_warehouse` UInt64, `id_product` UInt16, `order_type` UInt16, `order_status` String, `datetime_order` DateTime, `units` Int16, `total` Float32) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') PARTITION BY tuple() ORDER BY (id_order, id_plat, id_warehouse) SETTINGS index_granularity = 8192;

Copy both SQL to destination replica and execute

clickhouse-client --host localhost --port 9000 -mn < databases.sql
clickhouse-client --host localhost --port 9000 -mn < schema.sql

Using `clickhouse-backup`

Before proceeding: check if you have restore_schema_on_cluster set; if it is, this procedure will drop tables with ON CLUSTER, which is not its intention! To verify:

$ clickhouse-backup print-config|grep restore_schema_on_cluster
    restore_schema_on_cluster: ""

Using clickhouse-backup to copy the schema of a replica to another is also convenient, and if using Atomic database with {uuid} macros in ReplicatedMergeTree engines .

$ sudo -u clickhouse clickhouse-backup create --schema --rbac --named-collections rbac_and_schema
# From the destination replica do this in 2 steps (for safety, keep --env=RESTORE_SCHEMA_ON_CLUSTER=):
$ sudo -u clickhouse clickhouse-backup restore --env=RESTORE_SCHEMA_ON_CLUSTER= --rbac-only rbac_and_schema
$ sudo -u clickhouse clickhouse-backup restore --env=RESTORE_SCHEMA_ON_CLUSTER= --schema --named-collections rbac_and_schema

Using `altinity operator`

If there is at least one alive replica in the shard, you can remove PVCs and STS for affected nodes and trigger reconciliation. The operator will try to copy the schema from other replicas.

Check that schema migration was successful and node is replicating

To check that the schema migration has been successful query system.replicas:

SELECT DISTINCT database,table,replica_is_active FROM system.replicas FORMAT Vertical

Check how the replication process is performing using https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-replication-queue/

If there are many postponed tasks with the message:

Not executing fetch of part 7_22719661_22719661_0 because 16 fetches already executing, max 16.                                                                                                      │ 2023-09-25 17:03:06 │            │

then it is ok, the maximum replication slots are being used. Exceptions are not OK and should be investigated

If migration was successful and replication is working, then wait until the replication is finished. It may take some days, depending on how much data is being replicated. After this edit, the cluster configuration xml file for all replicas (remote_servers.xml), and add the new replica to the cluster.

Possible problems

Exception `REPLICA_ALREADY_EXISTS`

Code: 253. DB::Exception: Received from localhost:9000. 
DB::Exception: There was an error on [dl-ny2-vm-09.internal.io:9000]: 
Code: 253. DB::Exception: Replica /clickhouse/tables/3c3503c3-ed3c-443b-9cb3-ef41b3aed0a8/1/replicas/dl-ny2-vm-09.internal.io 
already exists. (REPLICA_ALREADY_EXISTS) (version 23.5.3.24 (official build)). (REPLICA_ALREADY_EXISTS)
(query: CREATE TABLE IF NOT EXISTS xxxx.yyyy UUID '3c3503c3-ed3c-443b-9cb3-ef41b3aed0a8'

The DDLs have been executed and some tables have been created and after that dropped but some left overs are left in ZK:

If databases can be dropped then use DROP DATABASE xxxxx SYNC
If databases cannot be dropped use SYSTEM DROP REPLICA ‘replica_name’ FROM db.table

Exception `TABLE_ALREADY_EXISTS`

Code: 57. DB::Exception: Received from localhost:9000. 
DB::Exception: There was an error on [dl-ny2-vm-09.internal.io:9000]: 
Code: 57. DB::Exception: Directory for table data store/3c3/3c3503c3-ed3c-443b-9cb3-ef41b3aed0a8/ already exists. 
(TABLE_ALREADY_EXISTS) (version 23.5.3.24 (official build)). (TABLE_ALREADY_EXISTS)
(query: CREATE TABLE IF NOT EXISTS xxxx.yyyy UUID '3c3503c3-ed3c-443b-9cb3-ef41b3aed0a8' ON CLUSTER '{cluster}'

Tables have not been dropped correctly:

If databases can be dropped then use DROP DATABASE xxxxx SYNC
If databases cannot be dropped use:

SELECT concat('DROP TABLE ', database, '.', name, ' SYNC;') 
FROM system.tables 
WHERE database NOT IN ('system', 'information_schema', 'INFORMATION_SCHEMA') 
INTO OUTFILE '/tmp/drop_tables.sql' 
FORMAT TSVRaw;

Tuning

Sometimes replication goes very fast and if you have a tiered storage hot/cold you could run out of space, so for that it is interesting to:
- reduce fetches from 8 to 4
- increase moves from 8 to 16

Add these settings to a file in /etc/clickhouse-server/config.d/ (e.g., config.d/replication-limits.xml) and restart ClickHouse:

<clickhouse>
  <max_replicated_fetches_network_bandwidth_for_server>625000000</max_replicated_fetches_network_bandwidth_for_server>
  <background_fetches_pool_size>4</background_fetches_pool_size>
  <background_move_pool_size>16</background_move_pool_size>
</clickhouse>

Also to monitor this with:

SELECT *
FROM system.metrics
WHERE metric LIKE '%Move%'

Query id: 5050155b-af4a-474f-a07a-f2f7e95fb395

┌─metric─────────────────┬─value─┬─description──────────────────────────────────────────────────┐
│ BackgroundMovePoolTask │     0 │ Number of active tasks in BackgroundProcessingPool for moves │
└────────────────────────┴───────┴──────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.164 sec. 

dnieto-test :) SELECT * FROM system.metrics WHERE metric LIKE '%Fetch%';

SELECT *
FROM system.metrics
WHERE metric LIKE '%Fetch%'

Query id: 992cae2a-fb58-4150-a088-83273805d0c4

┌─metric────────────────────┬─value─┬─description───────────────────────────────────────────────┐
│ ReplicatedFetch           │     0 │ Number of data parts being fetched from replica           │
│ BackgroundFetchesPoolTask │     0 │ Number of active fetches in an associated background pool │
└───────────────────────────┴───────┴───────────────────────────────────────────────────────────┘

2 rows in set. Elapsed: 0.163 sec.

There are new tables in v23 system.replicated_fetches and system.moves check it out for more info.
if needed just stop replication using SYSTEM STOP FETCHES from the replicating nodes

REMOVE nodes/Replicas from a Cluster

It is important to know which replica/node you want to remove to avoid problems. To check it you need to connect to a different replica/node that the one you want to remove. For instance we want to remove arg_t04, so we connected to replica arg_t01:

SELECT DISTINCT arrayJoin(mapKeys(replica_is_active)) AS replica_name
FROM system.replicas

┌─replica_name─┐
│ arg_t01      │
│ arg_t02      │
│ arg_t03      │
│ arg_t04      │
└──────────────┘

After that (make sure you’re connected to a replica different from the one that you want to remove, arg_tg01) and execute:

SYSTEM DROP REPLICA 'arg_t04'

If by any chance you’re connected to the same replica you want to remove then SYSTEM DROP REPLICA will not work.
BTW SYSTEM DROP REPLICA does not drop any tables and does not remove any data or metadata from disk, it will only remove metadata from Zookeeper/Keeper

-- What happens if executing system drop replica in the local replica to remove.
SYSTEM DROP REPLICA 'arg_t04'

Elapsed: 0.017 sec. 

Received exception from server (version 23.8.6):
Code: 305. DB::Exception: Received from dnieto-zenbook.lan:9440. DB::Exception: We can't drop local replica, please use `DROP TABLE` if you want to clean the data and drop this replica. (TABLE_WAS_NOT_DROPPED)

After DROP REPLICA, we need to check that the replica is gone from the list or replicas:

SELECT DISTINCT arrayJoin(mapKeys(replica_is_active)) AS replica_name
FROM system.replicas

┌─replica_name─┐
│ arg_t01      │
│ arg_t02      │
│ arg_t03      │
└──────────────┘

-- We should see there is no replica arg_t04

Delete the replica in the cluster configuration: remote_servers.xml and shutdown the node/replica removed.

5.45.3 - clickhouse-copier

clickhouse-copier

The description of the utility and its parameters, as well as examples of the config files that you need to create for the copier are in the official repo for the ClickHouse® copier utility

The steps to run a task:

Create a config file for clickhouse-copier (zookeeper.xml)
Create a config file for the task (task1.xml)
Create the task in ZooKeeper and start an instance of clickhouse-copier
clickhouse-copier --daemon --base-dir=/opt/clickhouse-copier --config=/opt/clickhouse-copier/zookeeper.xml --task-path=/clickhouse/copier/task1 --task-file=/opt/clickhouse-copier/task1.xml
If the node in ZooKeeper already exists and you want to change it, you need to add the task-upload-force parameter:
clickhouse-copier --daemon --base-dir=/opt/clickhouse-copier --config=/opt/clickhouse-copier/zookeeper.xml --task-path=/clickhouse/copier/task1 --task-file=/opt/clickhouse-copier/task1.xml --task-upload-force=1
If you want to run another instance of clickhouse-copier for the same task, you need to copy the config file (zookeeper.xml) to another server, and run this command:
clickhouse-copier --daemon --base-dir=/opt/clickhouse-copier --config=/opt/clickhouse-copier/zookeeper.xml --task-path=/clickhouse/copier/task1

The number of simultaneously running instances is controlled be the max_workers parameter in your task configuration file. If you run more workers superfluous workers will sleep and log messages like this:

<Debug> ClusterCopier: Too many workers (1, maximum 1). Postpone processing

5.45.3.1 - clickhouse-copier 20.3 and earlier

clickhouse-copier 20.3 and earlier

clickhouse-copier was created to move data between clusters. It runs simple INSERT…SELECT queries and can copy data between tables with different engine parameters and between clusters with different number of shards. In the task configuration file you need to describe the layout of the source and the target cluster, and list the tables that you need to copy. You can copy whole tables or specific partitions. clickhouse-copier uses temporary distributed tables to select from the source cluster and insert into the target cluster.

The process is as follows

Process the configuration files.
Discover the list of partitions if not provided in the config.
Copy partitions one by one.
1. Drop the partition from the target table if it’s not empty
2. Copy data from source shards one by one.
  1. Check if there is data for the partition on a source shard.
  2. Check the status of the task in ZooKeeper.
  3. Create target tables on all shards of the target cluster.
  4. Insert the partition of data into the target table.
3. Mark the partition as completed in ZooKeeper.

If there are several workers running simultaneously, they will assign themselves to different source shards. If a worker was interrupted, another worker can be started to continue the task. The next worker will drop incomplete partitions and resume the copying.

Configuring the engine of the target table

clickhouse-copier uses the engine from the task configuration file for these purposes:

to create target tables if they don’t exist.
PARTITION BY: to SELECT a partition of data from the source table, to DROP existing partitions from target tables.

clickhouse-copier does not support the old MergeTree format. However, you can create the target tables manually and specify the engine in the task configuration file in the new format so that clickhouse-copier can parse it for its SELECT queries.

How to monitor the status of running tasks

clickhouse-copier uses ZooKeeper to keep track of the progress and to communicate between workers. Here is a list of queries that you can use to see what’s happening.

--task-path /clickhouse/copier/task1

-- The task config
select * from system.zookeeper
where path='<task-path>'
name                        | ctime               | mtime           
----------------------------+---------------------+--------------------
description                 | 2019-10-18 15:40:00 | 2020-09-11 16:01:14
task_active_workers_version | 2019-10-18 16:00:09 | 2020-09-11 16:07:08
tables                      | 2019-10-18 16:00:25 | 2019-10-18 16:00:25
task_active_workers         | 2019-10-18 16:00:09 | 2019-10-18 16:00:09


-- Running workers
select * from system.zookeeper
where path='<task-path>/task_active_workers'


-- The list of processed tables
select * from system.zookeeper
where path='<task-path>/tables'


-- The list of processed partitions
select * from system.zookeeper
where path='<task-path>/tables/<table>'
name   | ctime           
-------+--------------------
201909 | 2019-10-18 18:24:18


-- The status of a partition
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>'
name                     | ctime           
-------------------------+--------------------
shards                   | 2019-10-18 18:24:18
partition_active_workers | 2019-10-18 18:24:18


-- The status of source shards
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>/shards'
name | ctime               | mtime           
-----+---------------------+--------------------
1    | 2019-10-18 22:37:48 | 2019-10-18 22:49:29

5.45.3.2 - clickhouse-copier 20.4 - 21.6

clickhouse-copier 20.4 - 21.6

clickhouse-copier was created to move data between clusters. It runs simple INSERT…SELECT queries and can copy data between tables with different engine parameters and between clusters with different number of shards. In the task configuration file you need to describe the layout of the source and the target cluster, and list the tables that you need to copy. You can copy whole tables or specific partitions. clickhouse-copier uses temporary distributed tables to select from the source cluster and insert into the target cluster.

The behavior of clickhouse-copier was changed in 20.4:

Now clickhouse-copier inserts data into intermediate tables, and after the insert finishes successfully clickhouse-copier attaches the completed partition into the target table. This allows for incremental data copying, because the data in the target table is intact during the process. Important note: ATTACH PARTITION respects the max_partition_size_to_drop limit. Make sure the max_partition_size_to_drop limit is big enough (or set to zero) in the destination cluster. If clickhouse-copier is unable to attach a partition because of the limit, it will proceed to the next partition, and it will drop the intermediate table when the task is finished (if the intermediate table is less than the max_table_size_to_drop limit). Another important note: ATTACH PARTITION is replicated. The attached partition will need to be downloaded by the other replicas. This can create significant network traffic between ClickHouse nodes. If an attach takes a long time, clickhouse-copier will log a timeout and will proceed to the next step.
Now clickhouse-copier splits the source data into chunks and copies them one by one. This is useful for big source tables, when inserting one partition of data can take hours. If there is an error during the insert clickhouse-copier has to drop the whole partition and start again. The number_of_splits parameter lets you split your data into chunks so that in case of an exception clickhouse-copier has to re-insert only one chunk of the data.
Now clickhouse-copier runs OPTIMIZE target_table PARTITION ... DEDUPLICATE for non-Replicated MergeTree tables. Important note: This is a very strange feature that can do more harm than good. We recommend to disable it by configuring the engine of the target table as Replicated in the task configuration file, and create the target tables manually if they are not supposed to be replicated. Intermediate tables are always created as plain MergeTree.

The process is as follows

Process the configuration files.
Discover the list of partitions if not provided in the config.
Copy partitions one by one ** The metadata in ZooKeeper suggests the order described here.**
1. Copy chunks of data one by one.
  1. Copy data from source shards one by one.
    1. Create intermediate tables on all shards of the target cluster.
    2. Check the status of the chunk in ZooKeeper.
    3. Drop the partition from the intermediate table if the previous attempt was interrupted.
    4. Insert the chunk of data into the intermediate tables.
    5. Mark the shard as completed in ZooKeeper
2. Attach the chunks of the completed partition into the target table one by one
  1. Attach a chunk into the target table.
  2. non-Replicated: Run OPTIMIZE target_table DEDUPLICATE for the partition on the target table.
Drop intermediate tables (may not succeed if the tables are bigger than max_table_size_to_drop).

Configuring the engine of the target table

clickhouse-copier uses the engine from the task configuration file for these purposes:

to create target and intermediate tables if they don’t exist.
PARTITION BY: to SELECT a partition of data from the source table, to ATTACH partitions into target tables, to DROP incomplete partitions from intermediate tables, to OPTIMIZE partitions after they are attached to the target.
ORDER BY: to SELECT a chunk of data from the source table.

Here is an example of SELECT that clickhouse-copier runs to get the sixth of ten chunks of data:

WHERE (<the PARTITION BY clause> = (<a value of the PARTITION BY expression> AS partition_key))
  AND (cityHash64(<the ORDER BY clause>) % 10 = 6 )

clickhouse-copier does not support the old MergeTree format. However, you can create the intermediate tables manually with the same engine as the target tables (otherwise ATTACH will not work), and specify the engine in the task configuration file in the new format so that clickhouse-copier can parse it for SELECT, ATTACH PARTITION and DROP PARTITION queries.

Important note: always configure engine as Replicated to disable OPTIMIZE … DEDUPLICATE (unless you know why you need clickhouse-copier to run OPTIMIZE … DEDUPLICATE).

How to configure the number of chunks

The default value for number_of_splits is 10. You can change this parameter in the table section of the task configuration file. We recommend setting it to 1 for smaller tables.

<cluster_push>target_cluster</cluster_push>
<database_push>target_database</database_push>
<table_push>target_table</table_push>
<number_of_splits>1</number_of_splits>
<engine>Engine=Replicated...<engine>

How to monitor the status of running tasks

clickhouse-copier uses ZooKeeper to keep track of the progress and to communicate between workers. Here is a list of queries that you can use to see what’s happening.

--task-path=/clickhouse/copier/task1

-- The task config
select * from system.zookeeper
where path='<task-path>'
name                        | ctime               | mtime           
----------------------------+---------------------+--------------------
description                 | 2021-03-22 13:15:48 | 2021-03-22 13:25:28
status                      | 2021-03-22 13:15:48 | 2021-03-22 13:25:28
task_active_workers_version | 2021-03-22 13:15:48 | 2021-03-22 20:32:09
tables                      | 2021-03-22 13:16:47 | 2021-03-22 13:16:47
task_active_workers         | 2021-03-22 13:15:48 | 2021-03-22 13:15:48


-- Status
select * from system.zookeeper
where path='<task-path>/status'


-- Running workers
select * from system.zookeeper
where path='<task-path>/task_active_workers'


-- The list of processed tables
select * from system.zookeeper
where path='<task-path>/tables'


-- The list of processed partitions
select * from system.zookeeper
where path='<task-path>/tables/<table>'
name   | ctime           
-------+--------------------
202103 | 2021-03-22 13:16:47
202102 | 2021-03-22 13:18:31
202101 | 2021-03-22 13:27:36
202012 | 2021-03-22 14:05:08


-- The status of a partition
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>'
name           | ctime           
---------------+--------------------
piece_0        | 2021-03-22 13:18:31
attach_is_done | 2021-03-22 14:05:05


-- The status of a piece
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>/piece_N'
name                           | ctime           
-------------------------------+--------------------
shards                         | 2021-03-22 13:18:31
is_dirty                       | 2021-03-22 13:26:51
partition_piece_active_workers | 2021-03-22 13:26:54
clean_start                    | 2021-03-22 13:26:54


-- The status of source shards
select * from system.zookeeper
where path='<task-path>/tables/<table>/<partition>/piece_N/shards'
name | ctime               | mtime           
-----+---------------------+--------------------
1    | 2021-03-22 13:26:54 | 2021-03-22 14:05:05

5.45.3.3 - Kubernetes job for clickhouse-copier

Kubernetes job for clickhouse-copier

`clickhouse-copier` deployment in kubernetes

clickhouse-copier can be deployed in a kubernetes environment to automate some simple backups or copy fresh data between clusters.

Some documentation to read:

Deployment

Use a kubernetes job is recommended but a simple pod can be used if you only want to execute the copy one time.

Just edit/change all the yaml files to your needs.

1) Create the PVC:

First create a namespace in which all the pods and resources are going to be deployed

kubectl create namespace clickhouse-copier

Then create the PVC using a storageClass gp2-encrypted class or use any other storageClass from other providers:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: copier-logs
  namespace: clickhouse-copier
spec:
  storageClassName: gp2-encrypted
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Mi

and deploy:

kubectl -n clickhouse-copier create -f ./kubernetes/copier-pvc.yaml

2) Create the configmap:

The configmap has both files zookeeper.xml and task01.xml with the zookeeper node listing and the parameters for the task respectively.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: copier-config
  namespace: clickhouse-copier
data:
    task01.xml: |
        <clickhouse>
            <logger>
                <console>true</console>
                <log remove="remove"/>
                <errorlog remove="remove"/>
                <level>trace</level>
            </logger>
            <remote_servers>
                <all-replicated>
                    <shard>
                        <replica>
                            <host>clickhouse01.svc.cluster.local</host>
                            <port>9000</port>
                            <user>chcopier</user>
                            <password>pass</password>
                        </replica>
                        <replica>
                            <host>clickhouse02.svc.cluster.local</host>
                            <port>9000</port>
                            <user>chcopier</user>
                            <password>pass</password>
                        </replica>
                    </shard>
                </all-replicated>
                <all-sharded>
                    <!-- <secret></secret> -->
                    <shard>
                        <replica>
                            <host>clickhouse03.svc.cluster.local</host>
                            <port>9000</port>
                            <user>chcopier</user>
                            <password>pass</password>
                        </replica>
                    </shard>
                    <shard>
                        <replica>
                            <host>clickhouse03.svc.cluster.local</host>
                            <port>9000</port>
                            <user>chcopier</user>
                            <password>pass</password>
                        </replica>
                    </shard>
                </all-sharded>
            </remote_servers>
            <max_workers>1</max_workers>
            <settings_pull>
                <readonly>1</readonly>
            </settings_pull>
            <settings_push>
                <readonly>0</readonly>
            </settings_push>
            <settings>
                <connect_timeout>3</connect_timeout>
                <insert_distributed_sync>1</insert_distributed_sync>
            </settings>
            <tables>
                <table_sales>
                    <cluster_pull>all-replicated</cluster_pull>
                    <database_pull>default</database_pull>
                    <table_pull>fact_sales_event</table_pull>
                    <cluster_push>all-sharded</cluster_push>
                    <database_push>default</database_push>
                    <table_push>fact_sales_event</table_push>
                    <engine>
                        Engine=ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/fact_sales_event', '{replica}')
                        PARTITION BY toYYYYMM(timestamp)
                        ORDER BY (channel_id, product_id)
                        SETTINGS index_granularity = 8192
                    </engine>
                    <sharding_key>rand()</sharding_key>
                </table_ventas>
            </tables>
        </clickhouse>        
    zookeeper.xml: |
        <clickhouse>
            <logger>
                <level>trace</level>
                <size>100M</size>
                <count>3</count>
            </logger>
            <zookeeper>
                <node>
                    <host>zookeeper1.svc.cluster.local</host>
                    <port>2181</port>
                </node>
                <node>
                    <host>zookeeper2.svc.cluster.local</host>
                    <port>2181</port>
                </node>
                <node>
                    <host>zookeeper3.svc.cluster.local</host>
                    <port>2181</port>
                </node>
            </zookeeper>
        </clickhouse>

and deploy:

kubectl -n clickhouse-copier create -f ./kubernetes/copier-configmap.yaml

The task01.xml file has many parameters to take into account explained in the repo for clickhouse-copier . Important to note that it is needed a FQDN for the Zookeeper nodes and ClickHouse® server that are valid for the cluster. As the deployment creates a new namespace, it is recommended to use a FQDN linked to a service. For example zookeeper01.svc.cluster.local. This file should be adapted to both clusters topologies and to the needs of the user.

The zookeeper.xml file is pretty straightforward with a simple 3 node ensemble configuration.

3) Create the job:

Basically the job will download the official ClickHouse image and will create a pod with 2 containers:

clickhouse-copier: This container will run the clickhouse-copier utility.
sidecar-logging: This container will be used to read the logs of the clickhouse-copier container for different runs (this part can be improved):

---
apiVersion: batch/v1
kind: Job
metadata:
  name: clickhouse-copier-test
  namespace: clickhouse-copier
spec:
  # only for kubernetes 1.23
  # ttlSecondsAfterFinished: 86400
  template:
    spec:
      containers:
        - name: clickhouse-copier
          image: clickhouse/clickhouse-server:21.8
          command:
            - clickhouse-copier
            - --task-upload-force=1
            - --config-file=$(CH_COPIER_CONFIG)
            - --task-path=$(CH_COPIER_TASKPATH)
            - --task-file=$(CH_COPIER_TASKFILE)
            - --base-dir=$(CH_COPIER_BASEDIR)
          env:
            - name: CH_COPIER_CONFIG
              value: "/var/lib/clickhouse/tmp/zookeeper.xml"
            - name: CH_COPIER_TASKPATH
              value: "/clickhouse/copier/tasks/task01"
            - name: CH_COPIER_TASKFILE
              value: "/var/lib/clickhouse/tmp/task01.xml"
            - name: CH_COPIER_BASEDIR
              value: "/var/lib/clickhouse/tmp"
          resources:
            limits:
              cpu: "1"
              memory: 2048Mi
          volumeMounts:
            - name: copier-config
              mountPath: /var/lib/clickhouse/tmp/zookeeper.xml
              subPath: zookeeper.xml
            - name: copier-config
              mountPath: /var/lib/clickhouse/tmp/task01.xml
              subPath: task01.xml
            - name: copier-logs
              mountPath: /var/lib/clickhouse/tmp
        - name: sidecar-logger
          image: busybox:1.35
          command: ['/bin/sh', '-c', 'tail', '-n', '1000', '-f', '/tmp/copier-logs/clickhouse-copier*/*.log']
          resources:
            limits:
              cpu: "1"
              memory: 512Mi
          volumeMounts:
            - name: copier-logs
              mountPath: /tmp/copier-logs
      volumes:
        - name: copier-config
          configMap:
            name: copier-config
            items:
              - key: zookeeper.xml
                path: zookeeper.xml
              - key: task01.xml
                path: task01.xml
        - name: copier-logs
          persistentVolumeClaim:
            claimName: copier-logs
      restartPolicy: Never
  backoffLimit: 3

Deploy and watch progress checking the logs:

kubectl -n clickhouse-copier logs <podname> sidecar-logging

5.45.4 - Distributed table to ClickHouse® Cluster

Shifting INSERTs to a standby cluster

In order to shift INSERTS to a standby cluster (for example increase zone availability or disaster recovery ) some ClickHouse® features can be used.

Basically we need to create a distributed table, a MV, rewrite the remote_servers.xml config file and tune some parameters.

Distributed engine information and parameters: https://clickhouse.com/docs/en/engines/table-engines/special/distributed/

Steps

Create a Distributed table in the source cluster

For example, we should have a ReplicatedMergeTree table in which all inserts are falling. This table is the first step in our pipeline:

CREATE TABLE db.inserts_source ON CLUSTER 'source'
(
    column1 String
    column2 DateTime
    .....
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/inserts_source', '{replica}')
PARTITION BY toYYYYMM(column2)
ORDER BY (column1, column2)

This table lives in the source cluster and all INSERTS go there. In order to shift all INSERTS in the source cluster to destination cluster we can create a Distributed table that points to another ReplicatedMergeTree in the destination cluster:

CREATE TABLE db.inserts_source_dist ON CLUSTER 'source'
(
    column1 String
    column2 DateTime
    .....
)
ENGINE = Distributed('destination', db, inserts_destination)

Create a Materialized View to shift INSERTS to destination cluster:

CREATE MATERIALIZED VIEW shift_inserts ON CLUSTER 'source'
TO db.inserts_source_dist AS
SELECT * FROM db.inserts_source

Create a ReplicatedMergeTree table in the destination cluster:

This is the table in the destination cluster that is pointed by the distributed table in the source cluster

CREATE TABLE db.inserts_destination ON CLUSTER 'destination'
(
    column1 String
    column2 DateTime
    .....
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/inserts_destination', '{replica}')
PARTITION BY toYYYYMM(column2)
ORDER BY (column1, column2)

Rewrite remote_servers.xml:

All the hostnames/FQDN from each replica/node must be accessible from both clusters. Also the remote_servers.xml from the source cluster should read like this:

<clickhouse>
    <remote_servers>
        <source>   
            <shard>
                <replica>
                    <host>host03</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>host04</host>
                    <port>9000</port>
                </replica>
            </shard>
        </source>
        <destination>   
            <shard>
                <replica>
                    <host>host01</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>host02</host>
                    <port>9000</port>
                </replica>
            </shard>
        </destination>
        <!-- If using a LB to shift inserts you need to use user and password and create MT destination table in an all-replicated cluster config -->
        <destination_with_lb>   
            <shard>
                <replica>
                    <host>load_balancer.xxxx.com</host>
                    <port>9440</port>
                    <secure>1</secure>
                    <username>user</username>
                    <password>pass</password>
                </replica>
            </shard>
        </destination_with_lb>
   </remote_servers>
</clickhouse>

Configuration settings

Depending on your use case you can set the the distributed INSERTs to sync or async mode . This example is for async mode: Put this config settings on the default profile. Check for more info about the possible modes:

https://clickhouse.com/docs/en/operations/settings/settings#insert_distributed_sync

<clickhouse>
    ....
    <profiles>
        <default>
            <!-- StorageDistributed DirectoryMonitors try to batch individual inserts into bigger ones to increase performance -->
            <distributed_directory_monitor_batch_inserts>1</distributed_directory_monitor_batch_inserts>
            <!-- StorageDistributed DirectoryMonitors try to split batch into smaller in case of failures -->
            <distributed_directory_monitor_split_batch_on_failure>1</distributed_directory_monitor_split_batch_on_failure>
        </default>
    .....
    </profiles>
</clickhouse>

5.45.5 - Fetch Alter Table

Fetch Alter Table

FETCH Parts from Zookeeper

This is a detailed explanation on how to move data by fetching partitions or parts between replicas

Get partitions by database and table:

SELECT
    hostName() AS host,
    database,
    table
    partition_id,
    name as part_id
FROM cluster('{cluster}', system.parts)
WHERE database IN ('db1','db2' ... 'dbn') AND active

This query will return all the partitions and parts stored in this node for the databases and their tables.

Fetch the partitions:

Prior starting with the fetching process it is recommended to check the system.detached_parts table of the destination node. There is a chance that detached folders already contain some old parts, and you will have to remove them all before starting moving data. Otherwise you will attach those old parts together with the fetched parts. Also you could run into issues if there are detached folders with the same names as the ones you are fetching (not very probable, put possible). Simply delete the detached parts and continue with the process.

To fetch a partition:

ALTER TABLE <tablename> FETCH PARTITION <partition_id> FROM '/clickhouse/{cluster}/tables/{shard}/{table}'

The FROM path is from the zookeeper node and you have to specify the shard from you’re fetching the partition . Next executing the DDL query:

ALTER TABLE <tablename> ATTACH PARTITION <partition_id>

will attach the partitions to a table. Again and because the process is manual, it is recommended to check that the fetched partitions are attached correctly and that there are no detached parts left. Check both system.parts and system.detached_parts tables.

Detach tables and delete replicas:

If needed, after moving the data and checking that everything is sound, you can detach the tables and delete the replicas.

-- Required for DROP REPLICA
DETACH TABLE <table_name>;  

-- This will remove everything from /table_path_in_z/replicas/replica_name
-- but not the data. You could reattach the table again and
-- restore the replica if needed. Get the zookeeper_path and replica_name from system.replicas

SYSTEM DROP REPLICA 'replica_name' FROM ZKPATH '/table_path_in_zk/';

Query to generate all the DDL:

With this query you can generate the DDL script that will do the fetch and attach operations for each table and partition.

SELECT
    DISTINCT
    'alter table '||database||'.'||table||' FETCH PARTITION '''||partition_id||''' FROM '''||zookeeper_path||'''; '
    ||'alter table '||database||'.'||table||' ATTACH PARTITION '''||partition_id||''';'
FROM system.parts INNER JOIN system.replicas USING (database, table)
WHERE database IN ('db1','db2' ... 'dbn') AND active

You could add an ORDER BY to manually make the list in the order you need, or use ORDER BY rand() to randomize it. You will then need to split the commands between the shards.

5.45.6 - Remote table function

Remote table function

remote(…) table function

Suitable for moving up to hundreds of gigabytes of data.

With bigger tables recommended approach is to slice the original data by some WHERE condition, ideally - apply the condition on partitioning key, to avoid writing data to many partitions at once.

INSERT INTO staging_table SELECT * FROM remote(...) WHERE date='2021-04-13';
INSERT INTO staging_table SELECT * FROM remote(...) WHERE date='2021-04-12';
INSERT INTO staging_table SELECT * FROM remote(...) WHERE date='2021-04-11';
....

OR 

INSERT INTO FUNCTION remote(...) SELECT * FROM staging_table WHERE date='2021-04-11';
....

Q. Can it create a bigger load on the source system?

Yes, it may use disk read & network write bandwidth. But typically write speed is worse than the read speed, so most probably the receiver side will be a bottleneck, and the sender side will not be overloaded.

While of course it should be checked, every case is different.

Q. Can I tune INSERT speed to make it faster?

Yes, by the cost of extra memory usage (on the receiver side).

ClickHouse® tries to form blocks of data in memory and while one of limit: min_insert_block_size_rows or min_insert_block_size_bytes being hit, ClickHouse dump this block on disk. If ClickHouse tries to execute insert in parallel (max_insert_threads > 1), it would form multiple blocks at one time.
So maximum memory usage can be calculated like this: max_insert_threads * first(min_insert_block_size_rows OR min_insert_block_size_bytes)

Default values:

┌─name────────────────────────┬─value─────┐
│ min_insert_block_size_rows  │ 1048545   │
│ min_insert_block_size_bytes │ 268427520 │
│ max_insert_threads          │ 0         │ <- Values 0 or 1 means that INSERT SELECT is not run in parallel.
└─────────────────────────────┴───────────┘

Tune those settings depending on your table average row size and amount of memory which are safe to occupy by INSERT SELECT query.

Q. I’ve got the error “All connection tries failed”

SELECT count()
FROM remote('server.from.remote.dc:9440', 'default.table', 'admin', 'password')
Received exception from server (version 20.8.11):
Code: 519. DB::Exception: Received from localhost:9000. DB::Exception: All attempts to get table structure failed. Log:
Code: 279, e.displayText() = DB::NetException: All connection tries failed. Log:
Code: 209, e.displayText() = DB::NetException: Timeout: connect timed out: 192.0.2.1:9440 (server.from.remote.dc:9440) (version 20.8.11.17 (official build))
Code: 209, e.displayText() = DB::NetException: Timeout: connect timed out: 192.0.2.1:9440 (server.from.remote.dc:9440) (version 20.8.11.17 (official build))
Code: 209, e.displayText() = DB::NetException: Timeout: connect timed out: 192.0.2.1:9440 (server.from.remote.dc:9440) (version 20.8.11.17 (official build))

Using remote(…) table function with secure TCP port (default values is 9440). There is remoteSecure() function for that.
High (>50ms) ping between servers, values for connect_timeout_with_failover_ms, connect_timeout_with_failover_secure_ms need’s to be adjusted accordingly.

Default values:

┌─name────────────────────────────────────┬─value─┐
│ connect_timeout_with_failover_ms        │ 50    │
│ connect_timeout_with_failover_secure_ms │ 100   │
└─────────────────────────────────────────┴───────┘

Example

#!/bin/bash

table='...'
database='bvt'
local='...'
remote='...'
CH="clickhouse-client"   # you may add auth here 
settings="  max_insert_threads=20, 
            max_threads=20, 
            min_insert_block_size_bytes = 536870912, 
            min_insert_block_size_rows = 16777216, 
            max_insert_block_size = 16777216,
            optimize_on_insert=0";

# need it to create temp table with same structure (suitable for attach)
params=$($CH -h $remote -q "select partition_key,sorting_key,primary_key from system.tables where table='$table' and database = '$database' " -f TSV)
IFS=$'\t' read -r partition_key sorting_key primary_key <<< $params

$CH -h $local \  # get list of source partitions
-q "select distinct partition from system.parts where table='$table' and database = '$database' "

while read -r partition; do
# check that the partition is already copied
  if [ `$CH -h $remote -q " select count() from system.parts table='$table' and database = '$database' and partition='$partition'"` -eq 0 ] ; then
      $CH -n -h $remote -q "
        create temporary table temp as $database.$table engine=MergeTree -- 23.3 required for temporary table
           partition by ($partition_key) primary key ($primary_key)  order by ($sorting_key);
        -- SYSTEM STOP MERGES temp; -- maybe....
        set $settings;
        insert into temp select * from remote($local,$database.$table) where _partition='$partition'
        -- order by ($sorting_key) -- maybe....
        ;
        alter table $database.$table attach partition $partition from temp
  "
  fi
done

5.45.7 - Moving ClickHouse to Another Server

Copying Multi-Terabyte Live ClickHouse to Another Server

When migrating a large, live ClickHouse cluster (multi-terabyte scale) to a new server or cluster, the goal is to minimize downtime while ensuring data consistency. A practical method is to use incremental rsync in multiple passes, combined with ClickHouse’s replication features.

Prepare the new cluster
- Ensure the new cluster is set up with its own ZooKeeper (or Keeper).
- Configure ClickHouse but keep it stopped initially.
- For clickhouse-operator instances, you can stop all pods by CHI definition:

spec:
  stop: "true"

and attach volumes (PVC) to a service pod.

Initial data sync
Run a full recursive sync of the data directory from the old server to the new one:
```
rsync -ravlW --delete /var/lib/clickhouse/ user@new_host:/var/lib/clickhouse/
```
Explanation of flags:
- r: recursive, includes all subdirectories.
- a: archive mode (preserves symlinks, permissions, timestamps, ownership, devices).
- v: verbose, shows progress.
- l: copy symlinks as symlinks.
- W: copy whole files instead of using rsync’s delta algorithm (faster for large DB files).
- –delete: remove files from the destination that don’t exist on the source.
If you plan to run several replicas on a new cluster, rsync data to all of them. To save the performance of production servers, you can copy data to 1 new replica and then use it as a source for others. You can start with a single replica and add more after switching, but it will take more time afterward, as additional replicas need to pull all the data.
Add –bwlimit=100000 to preserve the performance of the production cluster while copying a lot of data.
Consider shards as independent clusters.
Incremental re-syncs
- Repeat the rsync step multiple times while the old cluster is live.
- Each subsequent run will copy only changes and reduce the final sync time.
Restore replication metadata
- Start the new ClickHouse node(s).
- Run SYSTEM RESTORE REPLICA table_name to rebuild replication metadata in ZooKeeper.
Test the application
- Point your test environment to the new cluster.
- Validate queries, schema consistency, and application behavior.
Final sync and switchover
- Stop ClickHouse on the old cluster.
- Immediately run a final incremental rsync to catch last-minute changes.
- Reinitialize ZooKeeper/Keeper database (stop/clear snapshots/start).
- Run SYSTEM RESTORE REPLICA table_name to rebuild replication metadata in ZooKeeper again.
- Start ClickHouse on the new cluster and switch production traffic.
- add more replicas as needed

NOTES:

To restore metadata on all cluster nodes by a single command, use ON CLUSTER modifier for the RESTORE REPLICA command.
You can build a script to run restore replica commands over all replicated tables by query:

select 'SYSTEM RESTORE REPLICA ' || database || '.' || table || ' ON CLUSTER {cluster} ;'
from system.tables
where engine ilike 'Replicated%'

If you are using a mount point that differs from /var/lib/clickhouse/data, adjust the rsync command accordingly to point to the correct location. For example, suppose you reconfigure the storage path as follows in /etc/clickhouse-server/config.d/config.xml.

<clickhouse>
    <!-- Path to data directory, with trailing slash. -->
    <path>/data1/clickhouse/</path>
    ...
</clickhouse>

You’ll need to use /data1/clickhouse instead of /var/lib/clickhouse in the rsync paths.

ClickHouse Docker container image does not have rsync installed. Add it using apt-get or run sidecar in k8s or run a service pod with volumes attached.
If you running rsync to multiple replicas or planning to use same (Zoo)Keeper ensemble for source and destination ClickHouse servers, you need to remove server uuid file after syncing data with rsync.

rm /var/lib/clickhouse/uuid

Otherwise, it can lead to hard-to-debug replication issues. Replicas will break each other’s sessions with (Zoo)Keeper.

5.46 - DDLWorker and DDL queue problems

Finding and troubleshooting problems in the distributed_ddl_queue

DDLWorker is a subprocess (thread) of clickhouse-server that executes ON CLUSTER tasks at the node.

When you execute a DDL query with ON CLUSTER mycluster section, the query executor at the current node reads the cluster mycluster definition (remote_servers / system.clusters) and places tasks into Zookeeper znode task_queue/ddl/... for members of the cluster mycluster.

DDLWorker at all ClickHouse® nodes constantly check this task_queue for their tasks, executes them locally, and reports about the results back into task_queue.

The common issue is the different hostnames/IPAddresses in the cluster definition and locally.

So if the initiator node puts tasks for a host named Host1. But the Host1 thinks about own name as localhost or xdgt634678d (internal docker hostname) and never sees tasks for the Host1 because is looking tasks for xdgt634678d. The same with internal VS external IP addresses.

DDLWorker thread crashed

That causes ClickHouse to stop executing ON CLUSTER tasks.

Check that DDLWorker is alive:

ps -eL|grep DDL
18829 18876 ?        00:00:00 DDLWorkerClnr
18829 18879 ?        00:00:00 DDLWorker

ps -ef|grep 18829|grep -v grep
clickho+ 18829 18828  1 Feb09 ?        00:55:00 /usr/bin/clickhouse-server --con...

As you can see there are two threads: DDLWorker and DDLWorkerClnr.

The second thread – DDLWorkerCleaner cleans old tasks from task_queue. You can configure how many recent tasks to store:

config.xml
<yandex>
    <distributed_ddl>
        <path>/clickhouse/task_queue/ddl</path>
        <pool_size>1</pool_size>
        <max_tasks_in_queue>1000</max_tasks_in_queue>
        <task_max_lifetime>604800</task_max_lifetime>
        <cleanup_delay_period>60</cleanup_delay_period>
    </distributed_ddl>
</yandex>

Default values:

cleanup_delay_period = 60 seconds – Sets how often to start cleanup to remove outdated data.

task_max_lifetime = 7 * 24 * 60 * 60 (in seconds = week) – Delete task if its age is greater than that.

max_tasks_in_queue = 1000 – How many tasks could be in the queue.

pool_size = 1 - How many ON CLUSTER queries can be run simultaneously.

Too intensive stream of ON CLUSTER command

Generally, it’s a bad design, but you can increase pool_size setting

Stuck DDL tasks in the distributed_ddl_queue

Sometimes DDL tasks (the ones that use ON CLUSTER) can get stuck in the distributed_ddl_queue because the replicas can overload if multiple DDLs (thousands of CREATE/DROP/ALTER) are executed at the same time. This is very normal in heavy ETL jobs.This can be detected by checking the distributed_ddl_queue table and see if there are tasks that are not moving or are stuck for a long time.

If these DDLs are completed in some replicas but failed in others, the simplest way to solve this is to execute the failed command in the missed replicas without ON CLUSTER. If most of the DDLs failed, then check the number of unfinished records in distributed_ddl_queue on the other nodes, because most probably it will be as high as thousands.

First, backup the distributed_ddl_queue into a table so you will have a snapshot of the table with the states of the tasks. You can do this with the following command:

CREATE TABLE default.system_distributed_ddl_queue AS SELECT * FROM system.distributed_ddl_queue;

After this, we need to check from the backup table which tasks are not finished and execute them manually in the missed replicas, and review the pipeline which do ON CLUSTER command and does not abuse them. There is a new CREATE TEMPORARY TABLE command that can be used to avoid the ON CLUSTER command in some cases, where you need an intermediate table to do some operations and after that you can INSERT INTO the final table or do ALTER TABLE final ATTACH PARTITION FROM TABLE temp and this temp table will be dropped automatically after the session is closed.

5.46.1 - There are N unfinished hosts (0 of them are currently active).

There are N unfinished hosts (0 of them are currently active).

Sometimes your Distributed DDL queries are being stuck, and not executing on all or subset of nodes, there are a lot of possible reasons for that kind of behavior, so it would take some time and effort to investigate.

Possible reasons

ClickHouse® node can’t recognize itself

SELECT * FROM system.clusters; -- check is_local column, it should have 1 for itself

getent hosts clickhouse.local.net # or other name which should be local
hostname --fqdn

cat /etc/hosts
cat /etc/hostname

Debian / Ubuntu

There is an issue in Debian based images, when hostname being mapped to 127.0.1.1 address which doesn’t literally match network interface and ClickHouse fails to detect this address as local.

https://github.com/ClickHouse/ClickHouse/issues/23504

Previous task is being executed and taking some time

It’s usually some heavy operations like merges, mutations, alter columns, so it make sense to check those tables:

SHOW PROCESSLIST;
SELECT * FROM system.merges;
SELECT * FROM system.mutations;

In that case, you can just wait completion of previous task.

Previous task is stuck because of some error

In that case, the first step is to understand which exact task is stuck and why. There are some queries which can help with that.

-- list of all distributed ddl queries, path can be different in your installation
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl/';

-- information about specific task.
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/';
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl/' AND name = 'query-0000001000';
-- 22.3
SELECT * FROM system.zookeeper WHERE path like '/clickhouse/task_queue/ddl/query-0000001000/%' 
ORDER BY ctime, path SETTINGS allow_unrestricted_reads_from_keeper='true'
-- 22.6
SELECT path, name, value, ctime, mtime 
FROM system.zookeeper WHERE path like '/clickhouse/task_queue/ddl/query-0000001000/%' 
ORDER BY ctime, path SETTINGS allow_unrestricted_reads_from_keeper='true'

-- How many nodes executed this task
SELECT name, numChildren as finished_nodes FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/' AND name = 'finished';

┌─name─────┬─finished_nodes─┐
│ finished │              0 │
└──────────┴────────────────┘

-- The nodes that are running the task
SELECT name, value, ctime, mtime FROM system.zookeeper 
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/active/';

-- What was the result for the finished nodes 
SELECT name, value, ctime, mtime FROM system.zookeeper 
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/finished/';

-- Latest successfull executed tasks from query_log.
SELECT query FROM system.query_log WHERE query LIKE '%ddl_entry%' AND type = 2 ORDER BY event_time DESC LIMIT 5;

SELECT
    FQDN(),
    *
FROM clusterAllReplicas('cluster', system.metrics)
WHERE metric LIKE '%MaxDDLEntryID%'

┌─FQDN()───────────────────┬─metric────────┬─value─┬─description───────────────────────────┐
│ chi-ab.svc.cluster.local │ MaxDDLEntryID │  1468 │ Max processed DDL entry of DDLWorker. │
└──────────────────────────┴───────────────┴───────┴───────────────────────────────────────┘
┌─FQDN()───────────────────┬─metric────────┬─value─┬─description───────────────────────────┐
│ chi-ab.svc.cluster.local │ MaxDDLEntryID │  1468 │ Max processed DDL entry of DDLWorker. │
└──────────────────────────┴───────────────┴───────┴───────────────────────────────────────┘
┌─FQDN()───────────────────┬─metric────────┬─value─┬─description───────────────────────────┐
│ chi-ab.svc.cluster.local │ MaxDDLEntryID │  1468 │ Max processed DDL entry of DDLWorker. │
└──────────────────────────┴───────────────┴───────┴───────────────────────────────────────┘


-- Information about task execution from logs.
grep -C 40 "ddl_entry" /var/log/clickhouse-server/clickhouse-server*.log

Issues that can prevent task execution

Obsolete Replicas

Obsolete replicas left in zookeeper.

SELECT database, table, zookeeper_path, replica_path zookeeper FROM system.replicas WHERE total_replicas != active_replicas;

SELECT * FROM system.zookeeper WHERE path = '/clickhouse/cluster/tables/01/database/table/replicas';

SYSTEM DROP REPLICA 'replica_name';

SYSTEM STOP REPLICATION QUEUES;
SYSTEM START REPLICATION QUEUES;

https://clickhouse.tech/docs/en/sql-reference/statements/system/#query_language-system-drop-replica

Tasks manually removed from DDL queue

Task were removed from DDL queue, but left in Replicated*MergeTree table queue.

grep -C 40 "ddl_entry" /var/log/clickhouse-server/clickhouse-server*.log

/var/log/clickhouse-server/clickhouse-server.log:2021.05.04 12:41:28.956888 [ 599 ] {} <Debug> DDLWorker: Processing task query-0000211211 (ALTER TABLE db.table_local ON CLUSTER `all-replicated` DELETE WHERE id = 1)
/var/log/clickhouse-server/clickhouse-server.log:2021.05.04 12:41:29.053555 [ 599 ] {} <Error> DDLWorker: ZooKeeper error: Code: 999, e.displayText() = Coordination::Exception: No node, Stack trace (when copying this message, always include the lines below):
/var/log/clickhouse-server/clickhouse-server.log-
/var/log/clickhouse-server/clickhouse-server.log-0. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error, int) @ 0xfb2f6b3 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-1. Coordination::Exception::Exception(Coordination::Error) @ 0xfb2fb56 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log:2. DB::DDLWorker::createStatusDirs(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<zkutil::ZooKeeper> const&) @ 0xeb3127a in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log:3. DB::DDLWorker::processTask(DB::DDLTask&) @ 0xeb36c96 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log:4. DB::DDLWorker::enqueueTask(std::__1::unique_ptr<DB::DDLTask, std::__1::default_delete<DB::DDLTask> >) @ 0xeb35f22 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-5. ? @ 0xeb47aed in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-6. ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) @ 0x8633bcd in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-7. ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()>(void&&, void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&&...)::'lambda'()::operator()() @ 0x863612f in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-8. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x8630ffd in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-9. ? @ 0x8634bb3 in /usr/bin/clickhouse
/var/log/clickhouse-server/clickhouse-server.log-10. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
/var/log/clickhouse-server/clickhouse-server.log-11. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
/var/log/clickhouse-server/clickhouse-server.log- (version 21.1.8.30 (official build))
/var/log/clickhouse-server/clickhouse-server.log:2021.05.04 12:41:29.053951 [ 599 ] {} <Debug> DDLWorker: Processing task query-0000211211 (ALTER TABLE db.table_local ON CLUSTER `all-replicated` DELETE WHERE id = 1)

Context of this problem is:

Constant pressure of cheap ON CLUSTER DELETE queries.
One replica was down for a long amount of time (multiple days).
Because of pressure on the DDL queue, it purged old records due to the task_max_lifetime setting.
When a lagging replica comes up, it’s fail’s execute old queries from DDL queue, because at this point they were purged from it.

Solution:

Reload/Restore this replica from scratch.

DDL path was changed in Zookeeper without restarting ClickHouse

Changing the DDL queue path in Zookeeper without restarting ClickHouse will make ClickHouse confused. If you need to do this ensure that you restart ClickHouse before submitting additional distributed DDL commands. Here’s an example.

-- Path before change:
SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/clickhouse101/task_queue'

┌─name─┬─value─┬─path─────────────────────────────────┐
│ ddl  │       │ /clickhouse/clickhouse101/task_queue │
└──────┴───────┴──────────────────────────────────────┘

-- Path after change
SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/clickhouse101/task_queue'

┌─name─┬─value─┬─path─────────────────────────────────┐
│ ddl2 │       │ /clickhouse/clickhouse101/task_queue │
└──────┴───────┴──────────────────────────────────────┘

The reason is that ClickHouse will not “see” this change and will continue to look for tasks in the old path. Altering paths in Zookeeper should be avoided if at all possible. If necessary it must be done very carefully.

5.47 - Merge Shards

Marge many Shards to one

(draft, not tested)

ClickHouse migration plan: merge 11 shards into 1 using `clickhouse-backup`

Your migration approach is workable with one important pattern:

restore schema once
restore local-table data shard by shard into detached
run ALTER TABLE ... ATTACH PART to attach restored parts
recreate or adjust Distributed tables for the new 1-shard topology

This plan assumes:

all 11 shards use schema-compatible local tables
all backups are taken from a consistent point in time
the target cluster is already built as a 1-shard environment
Distributed tables are treated as routing/query objects, not as the physical data source

Relevant references:

clickhouse-backup README: https://github.com/Altinity/clickhouse-backup/blob/master/ReadMe.md
clickhouse-backup changelog: https://github.com/Altinity/clickhouse-backup/blob/master/ChangeLog.md
Replication docs: https://clickhouse.com/docs/engines/table-engines/mergetree-family/replication
Distributed engine docs: https://clickhouse.com/docs/engines/table-engines/special/distributed
Detached parts docs: https://clickhouse.com/docs/operations/system-tables/detached_parts

Diagnosis

The safest migration pattern is:

take one backup per shard
build the new 1-shard target cluster
restore schema once from a single shard backup
restore only local-table data from each shard backup using --replicated-copy-to-detached
attach detached parts after each shard restore
recreate or validate Distributed tables for the new cluster layout
validate row counts, parts, and detached leftovers

I would not restore all 11 shard backups first and attach later. It is safer to process one shard backup at a time:

restore to detached
attach parts
validate
continue with the next shard

Migration sequence

1) Take backups on all 11 source shards

Use one backup per shard and keep shard identity in the backup name.

Examples:

shard01_20260319_full
shard02_20260319_full
...
shard11_20260319_full

Example commands:

clickhouse-backup create_remote shard01_20260319_full
clickhouse-backup create_remote shard02_20260319_full
clickhouse-backup create_remote shard03_20260319_full

Notes:

run clickhouse-backup on the same host or pod as ClickHouse, because it needs filesystem access
keep writes stopped or otherwise guarantee a consistent backup window across all shards

2) Prepare the new single-shard target

Before restoring anything:

create the new cluster definition
set correct macros for the new topology
verify Keeper paths for replicated tables
verify storage policies and disk layout

For Replicated*MergeTree, Keeper paths must be correct for the new 1-shard layout.

3) Restore schema once

Restore schema from one shard backup only.

Example:

clickhouse-backup restore_remote --schema shard01_20260319_full

You should restore schema only once because the table definitions are expected to be identical across shards.

Practical recommendation:

restore databases and local tables once
then recreate Distributed tables later so they point to the new 1-shard cluster

4) Restore local-table data shard by shard into `detached`

Use --replicated-copy-to-detached so the restore copies data into detached instead of trying to attach parts automatically.

Example for all local tables in both databases:

clickhouse-backup restore_remote \
  --data \
  --tables="db1.*_local,db2.*_local" \
  --replicated-copy-to-detached \
  shard01_20260319_full

Example for a smaller test subset:

clickhouse-backup restore_remote \
  --data \
  --tables="db1.events_local,db1.sessions_local,db2.fact_local" \
  --replicated-copy-to-detached \
  shard01_20260319_full

Notes:

restore local tables only
do not rely on Distributed tables for the data merge
process one shard backup at a time

5) Attach detached parts

After each shard restore, inspect system.detached_parts and attach the parts into the target local tables.

Attach a known part:

ALTER TABLE `db1`.`events_local` ATTACH PART '202603_12_12_0';

Generate attach statements for all detached parts in the two databases:

SELECT concat(
    'ALTER TABLE `', database, '`.`', table,
    '` ATTACH PART ', quoteString(name), ';'
) AS attach_sql
FROM system.detached_parts
WHERE database IN ('db1', 'db2')
  AND ifNull(reason, '') = ''
ORDER BY database, table, partition_id, min_block_number, max_block_number, name;

Inventory detached parts before and after attach:

SELECT
    database,
    table,
    reason,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS total_bytes
FROM system.detached_parts
WHERE database IN ('db1', 'db2')
GROUP BY database, table, reason
ORDER BY database, table, reason;

Validate active data after attach:

SELECT
    database,
    table,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS total_bytes
FROM system.parts
WHERE active
  AND database IN ('db1', 'db2')
GROUP BY database, table
ORDER BY database, table;

6) Recreate `Distributed` tables for the new 1-shard cluster

After all local-table data is loaded, recreate or adjust Distributed tables so they point to the new cluster layout.

Example:

DROP TABLE IF EXISTS `db1`.`events`;

CREATE TABLE `db1`.`events` AS `db1`.`events_local`
ENGINE = Distributed('cluster_1shard', 'db1', 'events_local', cityHash64(user_id));

This step is important because Distributed tables are query-routing objects, not the physical source of merged shard data.

7) Validation checklist

Before opening writes on the new cluster:

compare row counts by table
compare bytes on disk by table
inspect system.detached_parts for leftovers
inspect replication health if tables remain replicated
validate that all Distributed tables point to the new cluster definition
run smoke-test queries against both databases

Recommended operating pattern

For your case with two databases and around 50 tables total:

separate local tables from Distributed tables
restore schema once
restore local data shard by shard
attach parts after each shard
recreate Distributed tables last

That is the most predictable way to merge 11 shards into 1 with clickhouse-backup.

Important caveats

do not restore all shard backups to detached first and postpone all attaches until the end
do not restore schema 11 times
verify Keeper paths and macros carefully when moving from 11 shards to 1
test the full flow on a few representative large tables before running the complete migration
treat any remaining entries in system.detached_parts as something to review explicitly

Minimal command examples

Create backup:

clickhouse-backup create_remote shard01_20260319_full

Restore schema once:

clickhouse-backup restore_remote --schema shard01_20260319_full

Restore local-table data to detached:

clickhouse-backup restore_remote \
  --data \
  --tables="db1.*_local,db2.*_local" \
  --replicated-copy-to-detached \
  shard01_20260319_full

Attach one detached part:

ALTER TABLE `db1`.`events_local` ATTACH PART '202603_12_12_0';

Generate all attach commands:

SELECT concat(
    'ALTER TABLE `', database, '`.`', table,
    '` ATTACH PART ', quoteString(name), ';'
) AS attach_sql
FROM system.detached_parts
WHERE database IN ('db1', 'db2')
  AND ifNull(reason, '') = ''
ORDER BY database, table, partition_id, min_block_number, max_block_number, name;

Bash script template

This is a production-style skeleton you can adapt.

#!/usr/bin/env bash
set -euo pipefail

CH_CLIENT="${CH_CLIENT:-clickhouse-client --multiquery}"
CH_BACKUP="${CH_BACKUP:-clickhouse-backup}"

# Backups from 11 source shards
BACKUPS=(
  shard01_20260319_full
  shard02_20260319_full
  shard03_20260319_full
  shard04_20260319_full
  shard05_20260319_full
  shard06_20260319_full
  shard07_20260319_full
  shard08_20260319_full
  shard09_20260319_full
  shard10_20260319_full
  shard11_20260319_full
)

# Databases to migrate
DATABASES=(
  db1
  db2
)

# Local tables only.
# Keep Distributed tables out of this list.
LOCAL_TABLE_PATTERNS=(
  "db1.*_local"
  "db2.*_local"
)

join_by_comma() {
  local IFS=","
  echo "$*"
}

LOCAL_TABLES_CSV="$(join_by_comma "${LOCAL_TABLE_PATTERNS[@]}")"

echo "== Step 1: restore schema once from first shard backup =="
${CH_BACKUP} restore_remote --schema "${BACKUPS[0]}"

echo "== Step 2: process shard backups one by one =="
for backup in "${BACKUPS[@]}"; do
  echo "---- restoring data to detached from backup: ${backup}"
  ${CH_BACKUP} restore_remote \
    --data \
    --tables="${LOCAL_TABLES_CSV}" \
    --replicated-copy-to-detached \
    "${backup}"

  echo "---- attaching detached parts created by ${backup}"
  ${CH_CLIENT} --query "
    SELECT concat(
      'ALTER TABLE `', database, '`.`', table,
      '` ATTACH PART ', quoteString(name), ';'
    )
    FROM system.detached_parts
    WHERE database IN ('db1', 'db2')
      AND ifNull(reason, '') = ''
    ORDER BY database, table, partition_id, min_block_number, max_block_number, name
    FORMAT TSVRaw
  " | while IFS= read -r stmt; do
      echo "${stmt}"
      ${CH_CLIENT} --query "${stmt}"
    done

  echo "---- post-attach detached inventory"
  ${CH_CLIENT} --query "
    SELECT
      database,
      table,
      reason,
      count() AS parts
    FROM system.detached_parts
    WHERE database IN ('db1', 'db2')
    GROUP BY database, table, reason
    ORDER BY database, table, reason
  "
done

echo "== Step 3: final validation =="
${CH_CLIENT} --query "
  SELECT database, table, sum(rows) AS rows, formatReadableSize(sum(bytes_on_disk)) AS bytes
  FROM system.parts
  WHERE active
    AND database IN ('db1', 'db2')
  GROUP BY database, table
  ORDER BY database, table
"

echo "Migration load phase completed."

5.48 - differential backups using clickhouse-backup

differential backups using clickhouse-backup

Download the latest version of Altinity Backup for ClickHouse®: https://github.com/Altinity/clickhouse-backup/releases

# ubuntu / debian

wget https://github.com/Altinity/clickhouse-backup/releases/download/v2.5.20/clickhouse-backup_2.5.20_amd64.deb 
sudo dpkg -i clickhouse-backup_2.5.20_amd64.deb 

# centos / redhat / fedora 

sudo yum install https://github.com/Altinity/clickhouse-backup/releases/download/v2.5.20/clickhouse-backup-2.5.20-1.x86_64.rpm

# other platforms
wget https://github.com/Altinity/clickhouse-backup/releases/download/v2.5.20/clickhouse-backup.tar.gz
sudo mkdir /etc/clickhouse-backup/
sudo mv clickhouse-backup/config.yml /etc/clickhouse-backup/config.yml.example
sudo mv clickhouse-backup/clickhouse-backup /usr/bin/
rm -rf clickhouse-backup clickhouse-backup.tar.gz

Create a runner script for the crontab

mkdir /opt/clickhouse-backup-diff/

cat << 'END' > /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh

#!/bin/bash
set +x
command_line_argument=$1

backup_name=$(date +%Y-%M-%d-%H-%M-%S)

echo "Creating local backup '${backup_name}' (full, using hardlinks)..."
clickhouse-backup create "${backup_name}"

if [[ "run_diff" == "${command_line_argument}" && "2" -le "$(clickhouse-backup list local | wc -l)" ]]; then
  prev_backup_name="$(clickhouse-backup list local | tail -n 2 | head -n 1 | cut -d " " -f 1)"
  echo "Uploading the backup '${backup_name}' as diff from the previous backup ('${prev_backup_name}')"
  clickhouse-backup upload --diff-from "${prev_backup_name}" "${backup_name}"
elif [[ "" == "${command_line_argument}" ]]; then
  echo "Uploading the backup '${backup_name}, and removing old unneeded backups"
  KEEP_BACKUPS_LOCAL=1 KEEP_BACKUPS_REMOTE=1 clickhouse-backup upload "${backup_name}"
fi
END

chmod +x /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh

Create configuration for clickhouse-backup

# Check the example: /etc/clickhouse-backup/config.yml.example 
vim /etc/clickhouse-backup/config.yml

Edit the crontab

crontab -e

# full backup at 0:00 Monday
0 0 * * 1 clickhouse /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh
# differential backup every hour (except of 00:00) Monday 
0 1-23 * * 1 clickhouse /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh run_diff
# differential backup every hour Sunday, Tuesday-Saturday
0 */1 * * 0,2-6 clickhouse /opt/clickhouse-backup-diff/clickhouse-backup-cron.sh run_diff

Recover the last backup:

last_remote_backup="$(clickhouse-backup list remote | tail -n 1 | cut -d " " -f 1)"
clickhouse-backup download "${last_remote_backup}"
clickhouse-backup restore --rm "${last_remote_backup}"

5.49 - High CPU usage in ClickHouse®

Getting CPU usage under control

In general, it is a NORMAL situation for ClickHouse® that while processing a huge dataset it can use a lot of (or all of) the server resources. It is ‘by design’ - just to make the answers faster.

The main directions to reduce the CPU usage is to review the schema / queries to limit the amount of the data which need to be processed, and to plan the resources in a way when single running query will not impact the others.

Any attempts to reduce the CPU usage will end up with slower queries!

How to slow down queries to reduce the CPU usage

If it is acceptable for you - please check the following options for limiting the CPU usage:

setting max_threads: reducing the number of threads that are allowed to use one request. Fewer threads = more free cores for other requests. By default, it’s allowed to take half of the available CPU cores, adjust only when needed. So if if you have 10 cores then max_threads = 10 will work about twice faster than max_threads=5, but will take 100% or CPU. (max_threads=5 will use half of CPUs so 50%).
setting os_thread_priority: increasing niceness for selected requests. In this case, the operating system, when choosing which of the running processes to allocate processor time, will prefer processes with lower niceness. 0 is the default niceness. The higher the niceness, the lower the priority of the process. The maximum niceness value is 19.

These are custom settings that can be tweaked in several ways:

by specifying them when connecting a client, for example

clickhouse-client --os_thread_priority=19 -q 'SELECT max (number) from numbers (100000000)'

echo 'SELECT max(number) from numbers(100000000)' | curl 'http://localhost:8123/?os_thread_priority=19' --data-binary @-

via dedicated API / connection parameters in client libraries

using the SQL command SET (works only within the session)

SET os_thread_priority = 19;
SELECT max(number) from numbers(100000000)

using different profiles of settings for different users. Something like

<?xml version="1.0"?>
<yandex>
    <profiles>
        <default>
        ...
        </default>

        <lowcpu>
            <os_thread_priority>19</os_thread_priority>
            <max_threads>4</max_threads>
        </lowcpu>
    </profiles>

    <!-- Users and ACL. -->
    <users>
        <!-- If user name was not specified, 'default' user is used. -->
        <limited_user>
            <password>123</password>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>lowcpu</profile>

            <!-- Quota for user. -->
            <quota>default</quota>
        </limited_user>
    </users>

</yandex>

There are also plans to introduce a system of more flexible control over the assignment of resources to different requests.

Also, if these are manually created queries, then you can try to discipline users by adding quotas to them (they can be formulated as “you can read no more than 100GB of data per hour” or “no more than 10 queries”, etc.)

If these are automatically generated queries, it may make sense to check if there is no way to write them in a more efficient way.

5.50 - Load balancers

Load balancers

In general - one of the simplest option to do load balancing is to implement it on the client side.

I.e. list several endpoints for ClickHouse® connections and add some logic to pick one of the nodes.

Many client libraries support that.

ClickHouse native protocol (port 9000)

Currently there are no protocol-aware proxies for ClickHouse protocol, so the proxy / load balancer can work only on TCP level.

One of the best option for TCP load balancer is haproxy, also nginx can work in that mode.

Haproxy will pick one upstream when connection is established, and after that it will keep it connected to the same server until the client or server will disconnect (or some timeout will happen).

It can’t send different queries coming via a single connection to different servers, as he knows nothing about ClickHouse protocol and doesn’t know when one query ends and another start, it just sees the binary stream.

So for native protocol, there are only 3 possibilities:

close connection after each query client-side
close connection after each query server-side (currently there is only one setting for that - idle_connection_timeout=0, which is not exact what you need, but similar).
use a ClickHouse server with Distributed table as a proxy.

HTTP protocol (port 8123)

There are many more options and you can use haproxy / nginx / chproxy, etc. chproxy give some extra ClickHouse-specific features, you can find a list of them at https://chproxy.org

5.51 - memory configuration settings

memory configuration settings

max_memory_usage. Single query memory usage

max_memory_usage - the maximum amount of memory allowed for a single query to take. By default, it’s 10Gb. The default value is good, don’t adjust it in advance.

There are scenarios when you need to relax the limit for particular queries (if you hit ‘Memory limit (for query) exceeded’), or use a lower limit if you need to discipline the users or increase the number of simultaneous queries.

Server memory usage

Server memory usage = constant memory footprint (used by different caches, dictionaries, etc) + sum of memory temporary used by running queries (a theoretical limit is a number of simultaneous queries multiplied by max_memory_usage).

Since 20.4 you can set up a global limit using the max_server_memory_usage setting. If something will hit that limit you will see ‘Memory limit (total) exceeded’ in random places.

By default it 90% of the physical RAM of the server. https://clickhouse.tech/docs/en/operations/server-configuration-parameters/settings/#max_server_memory_usage https://github.com/ClickHouse/ClickHouse/blob/e5b96bd93b53d2c1130a249769be1049141ef386/programs/server/config.xml#L239-L250

You can decrease that in some scenarios (like you need to leave more free RAM for page cache or to some other software).

Limits?

select metric, formatReadableSize(value) from system.asynchronous_metrics where metric ilike '%MemoryTotal%'
union all 
select name, formatReadableSize(toUInt64(value)) from system.server_settings where name='max_server_memory_usage'
FORMAT PrettyCompactMonoBlock

How to check what is using my RAM?

altinity-kb-who-ate-my-memory.md

Mark cache

https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup39/mark-cache.pdf

5.52 - Memory Overcommiter

Enable Memory overcommiter instead of ussing max_memory_usage per query

Memory Overcommiter

From version 22.2+ ClickHouse® was updated with enhanced Memory overcommit capabilities . In the past, queries were constrained by the max_memory_usage setting, imposing a rigid limitation. Users had the option to increase this limit, but it came at the potential expense of impacting other users during a single query. With the introduction of Memory overcommit, more memory-intensive queries can now execute, granted there are ample resources available. When the server reaches its maximum memory limit , ClickHouse identifies the most overcommitted queries and attempts to terminate them. It’s important to note that the terminated query might not be the one causing the condition. If it’s not, the query will undergo a waiting period to allow the termination of the high-memory query before resuming its execution. This setup ensures that low-memory queries always have the opportunity to run, while more resource-intensive queries can execute during server idle times when resources are abundant. Users have the flexibility to fine-tune this behavior at both the server and user levels.

If the memory overcommitter is not being used you’ll get something like this:

Received exception from server (version 22.8.20):
Code: 241. DB::Exception: Received from altinity.cloud:9440. DB::Exception: Received from chi-replica1-2-0:9000. DB::Exception: Memory limit (for query) exceeded: would use 5.00 GiB (attempt to allocate chunk of 4196736 bytes), maximum: 5.00 GiB. OvercommitTracker decision: Memory overcommit isn't used. OvercommitTracker isn't set.: (avg_value_size_hint = 0, avg_chars_size = 1, limit = 8192): while receiving packet from chi-replica1-1-0:9000: While executing Remote. (MEMORY_LIMIT_EXCEEDED)

So to enable Memory Overcommit you need to get rid of the max_memory_usage and max_memory_usage_for_user (set them to 0) and configure overcommit specific settings (usually defaults are ok, so read carefully the documentation)

memory_overcommit_ratio_denominator: It represents soft memory limit on the user level. This value is used to compute query overcommit ratio.
memory_overcommit_ratio_denominator_for_user: It represents soft memory limit on the global level. This value is used to compute query overcommit ratio.
memory_usage_overcommit_max_wait_microseconds: Maximum time thread will wait for memory to be freed in the case of memory overcommit. If timeout is reached and memory is not freed, exception is thrown

Please check https://clickhouse.com/docs/en/operations/settings/memory-overcommit

Also you will check/need to configure global memory server setting. These are by default:

<clickhouse>
   <!-- when max_server_memory_usage is set to non-zero, max_server_memory_usage_to_ram_ratio is ignored-->
    <max_server_memory_usage>0</max_server_memory_usage>
    <max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio> 
</clickhouse>

With these set, now if you execute some queries with bigger memory needs than your max_server_memory_usage you’ll get something like this:

Received exception from server (version 22.8.20):
Code: 241. DB::Exception: Received from altinity.cloud:9440. DB::Exception: Received from chi-test1-2-0:9000. DB::Exception: Memory limit (total) exceeded: would use 12.60 GiB (attempt to allocate chunk of 4280448 bytes), maximum: 12.60 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker.: while receiving packet from chi-replica1-2-0:9000: While executing Remote. (MEMORY_LIMIT_EXCEEDED)

This will allow you to know that the Overcommit memory tracker is set and working.

Also to note that maybe you don’t need the Memory Overcommit system because with max_memory_usage per query you’re ok.

The good thing about memory overcommit is that you let ClickHouse handle the memory limitations instead of doing it manually, but there may be some scenarios where you don’t want to use it and using max_memory_usage or max_memory_usage_for_user is a better fit. For example, if your workload has a lot of small/medium queries that are not memory intensive and you need to run few memory intensive queries for some users with a fixed memory limit. This is a common scenario for dbt or other ETL tools that usually run big memory intensive queries.

5.53 - Moving a table to another device

Moving a table to another device.

Suppose we mount a new device at path /mnt/disk_1 and want to move table_4 to it.

Create directory on new device for ClickHouse® data. /in shell mkdir /mnt/disk_1/clickhouse
Change ownership of created directory to ClickHouse user. /in shell chown -R clickhouse:clickhouse /mnt/disk_1/clickhouse
Create a special storage policy which should include both disks: old and new. /in shell

nano /etc/clickhouse-server/config.d/storage.xml
###################/etc/clickhouse-server/config.d/storage.xml###########################
<yandex>
  <storage_configuration>
    <disks>
      <!--
          default disk is special, it always
          exists even if not explicitly
          configured here, but you can't change
          it's path here (you should use <path>
          on top level config instead)
      -->
      <default>
         <!--
             You can reserve some amount of free space
             on any disk (including default) by adding
             keep_free_space_bytes tag
         -->
      </default>
      <disk_1> <!-- disk name -->
          <path>/mnt/disk_1/clickhouse/</path>
      </disk_1>
    </disks>
    <policies>
      <move_from_default_to_disk_1> <!-- name for new storage policy -->
        <volumes>
          <default>
            <disk>default</disk>
            <max_data_part_size_bytes>10000000</max_data_part_size_bytes>
          </default>
          <disk_1_vol> <!-- name of volume -->
            <!--
                we have only one disk in that volume
                and we reference here the name of disk
                as configured above in <disks> section
            -->
            <disk>disk_1</disk>
          </disk_1_vol>
        </volumes>
        <move_factor>0.99</move_factor>
      </move_from_default_to_disk_1>
    </policies>
  </storage_configuration>
</yandex>
#########################################################################################

Update storage_policy setting of tables to new policy.

ALTER TABLE table_4 MODIFY SETTING storage_policy='move_from_default_to_disk_1';

Wait till all parts of tables change their disk_name to new disk.

SELECT name,disk_name, path from system.parts WHERE table='table_4' and active;
SELECT disk_name, path, sum(rows), sum(bytes_on_disk), uniq(partition), count() FROM system.parts WHERE table='table_4' and active GROUP BY disk_name, path ORDER BY disk_name, path;

Remove ‘default’ disk from new storage policy. In server shell:

nano /etc/clickhouse-server/config.d/storage.xml
###################/etc/clickhouse-server/config.d/storage.xml###########################
<yandex>
  <storage_configuration>
    <disks>
      <!--
          default disk is special, it always
          exists even if not explicitly
          configured here, but you can't change
          it's path here (you should use <path>
          on top level config instead)
      -->
      <default>
         <!--
             You can reserve some amount of free space
             on any disk (including default) by adding
             keep_free_space_bytes tag
         -->
      </default>
      <disk_1> <!-- disk name -->
          <path>/mnt/disk_1/clickhouse/</path>
      </disk_1>
    </disks>
    <policies>
      <move_from_default_to_disk_1> <!-- name for new storage policy -->
        <volumes>
          <disk_1_vol> <!-- name of volume -->
            <!--
                we have only one disk in that volume
                and we reference here the name of disk
                as configured above in <disks> section
            -->
            <disk>disk_1</disk>
          </disk_1_vol>
        </volumes>
        <move_factor>0.99</move_factor>
      </move_from_default_to_disk_1>
    </policies>
  </storage_configuration>
</yandex>
#########################################################################################

ClickHouse wouldn’t auto reload config, because we removed some disks from storage policy, so we need to restart it by hand.

Restart ClickHouse server.
Make sure that storage policy uses the right disks.

SELECT * FROM system.storage_policies WHERE policy_name='move_from_default_to_disk_1';

5.54 - MultiDisk (JBOD) Balancing

ClickHouse provides two options to balance an insert across disks in a volume with more than one disk: round_robin and least_used .

Round Robin (Default):

ClickHouse selects the next disk in a round robin manner to write a part.

This is the default setting and is most effective when parts created on insert are roughly the same size.

Drawbacks: may lead to disk skew

Least Used:

ClickHouse selects the disk with the most available space and writes to that disk.

Changing to least_used when even disk space consumption is desirable or when you have a JBOD volume with differing disk sizes. To prevent hot-spots, it is best to set this policy on a fresh volume or on a volume that has already been (re)balanced.

Drawbacks: may lead to hot-spots

Configurations

Configurations that can affect disk selected:

storage policy volume configuration: least_used_ttl_ms. Only applies to least_used policy, 60s default.
disk setting: keep_free_space_bytes , keep_free_space_ratio

Configuration to assist rebalancing:

The MergeTree setting min_bytes_to_rebalance_partition_over_jbod does not control where data is written during inserts. Instead, it governs how parts are redistributed across disks within the same volume during merge operations.

Note: setting min_bytes_to_rebalance_partition_over_jbod does not guarantee balanced partitions and balanced disk usage.

Example of least_used policy:

<clickhouse>
  <storage_configuration>
    <disks>
     <default>
       <path>/var/lib/clickhouse/</path>
        <keep_free_space_bytes>10737418240</keep_free_space_bytes>
      </disk1>
      <disk1>
        <path>/mnt/disk1/</path>
        <keep_free_space_bytes>10737418240</keep_free_space_bytes>
      </disk1>
      <disk2>
        <path>/mnt/disk2/</path>
        <keep_free_space_bytes>10737418240</keep_free_space_bytes>
      </disk2>
    </disks>
    <policies>
      <hot>
        <volumes>
          <default>
            <disk>disk1</disk>
            <disk>disk2</disk>
            <load_balancing>least_used</load_balancing>
            <least_used_ttl_ms>60000</least_used_ttl_ms> <!-- 60s -->
          </default>
        </volumes>
      </hot>
    </policies>
  </storage_configuration>
</clickhouse>

Manual Rebalancing Parts over JBOD Disks

Following query will select large parts in target_tables and target_databases that can be candidates to move to another disk. Disk chosen should comply with the following requirements:

Should only select valid moves for the same storage_policy used by that table
storage_policy must be JBODs type
moves to other disks in the same volume
select a different disk, i.e not the same disk as the one that part is in
select the disk to move the part to by order of largest free_space on that disk

Set target_tables and target_databases based on requirements.

WITH
    '%' AS target_tables,
    '%' AS target_databases
SELECT sub.q FROM 
( 
    SELECT
        'ALTER TABLE ' || parts.database || '.' || parts.`table` || ' MOVE PART \'' || parts.name ||'\' TO DISK \'' || other_disk_candidate || '\';' as q,
        parts.database as db,
        parts.`table` as t,
        parts.name as part_name,
        parts.disk_name as part_disk_name,
        parts.bytes_on_disk AS part_bytes_on_disk,
        sp.storage_policy as part_storage_policy,
        arrayJoin(arrayRemove(v.disks, parts.disk_name)) AS other_disk_candidate,
        candidate_disks.free_space AS candidate_disk_free_space
    FROM system.parts AS parts
    INNER JOIN ( SELECT database, `table`, storage_policy FROM system.tables where (name LIKE target_tables) AND (database LIKE target_databases) group by 1, 2, 3 ) AS sp ON sp.`table` = parts.`table` AND sp.database = parts.database 
    INNER JOIN ( SELECT policy_name, volume_name, disks AS disks FROM system.storage_policies WHERE volume_type = 0 ) AS v ON sp.storage_policy = v.policy_name
    INNER JOIN ( SELECT name, free_space FROM system.disks ORDER BY free_space DESC ) AS candidate_disks ON candidate_disks.name = other_disk_candidate
    WHERE parts.active = 1 
        AND (parts.bytes_on_disk >= 10737418240) --10GB prioritize larger parts
        AND (parts.`table` LIKE target_tables) 
        AND (parts.database LIKE target_databases)
        AND candidate_disks.free_space > parts.bytes_on_disk*2 -- 2x buffer
    ORDER BY parts.bytes_on_disk DESC, candidate_disk_free_space DESC
    LIMIT 1 BY db, t, part_name
) as sub
FORMAT TSVRaw

5.55 - Object consistency in a cluster

Object consistency in a cluster

List of missing tables

WITH (
     SELECT groupArray(FQDN()) FROM clusterAllReplicas('{cluster}',system,one)
     ) AS hosts
SELECT database,
       table,
       arrayFilter( i-> NOT has(groupArray(host),i), hosts) miss_table
FROM (
        SELECT FQDN() host, database, name table
        FROM clusterAllReplicas('{cluster}',system,tables)
        WHERE engine NOT IN ('Log','Memory','TinyLog')
     )
GROUP BY database, table
HAVING miss_table <> []
SETTINGS skip_unavailable_shards=1;

┌─database─┬─table─┬─miss_table────────────────┐
│ default  │ test  │ ['host366.mynetwork.net'] │
└──────────┴───────┴───────────────────────────┘

List of inconsistent tables

SELECT database, name, engine, uniqExact(create_table_query) AS ddl
FROM clusterAllReplicas('{cluster}',system.tables)
GROUP BY database, name, engine HAVING ddl > 1

List of inconsistent columns

WITH (
     SELECT groupArray(FQDN()) FROM clusterAllReplicas('{cluster}',system,one)
     ) AS hosts
SELECT database,
       table,
       column,
       arrayStringConcat(arrayMap( i -> i.2 ||': '|| i.1,
                                 (groupArray( (type,host) ) AS g)),', ') diff
FROM (
        SELECT FQDN() host, database, table, name column, type
        FROM clusterAllReplicas('{cluster}',system,columns)
     )
GROUP BY database, table, column
HAVING length(arrayDistinct(g.1)) > 1 OR length(g.1) <> length(hosts)
SETTINGS skip_unavailable_shards=1;

┌─database─┬─table───┬─column────┬─diff────────────────────────────────┐
│ default  │ z       │ A         │ ch-host22: Int64, ch-host21: String │
└──────────┴─────────┴───────────┴─────────────────────────────────────┘

List of inconsistent dictionaries

WITH (
     SELECT groupArray(FQDN()) FROM clusterAllReplicas('{cluster}',system,one)
     ) AS hosts
SELECT database,
       dictionary,
       arrayFilter( i-> NOT has(groupArray(host),i), hosts) miss_dict,
       arrayReduce('min', (groupArray((element_count, host)) AS ec).1) min,
       arrayReduce('max', (groupArray((element_count, host)) AS ec).1) max
FROM (
        SELECT FQDN() host, database, name dictionary, element_count
        FROM clusterAllReplicas('{cluster}',system,dictionaries)
     )
GROUP BY database, dictionary
HAVING miss_dict <> [] or min <> max
SETTINGS skip_unavailable_shards=1;
;

5.56 - Production Cluster Configuration Guide

Production Cluster Configuration Guide

Moving from a single ClickHouse® server to a clustered format provides several benefits:

Replication guarantees data integrity.
Provides redundancy.
Failover by being able to restart half of the nodes without encountering downtime.

Moving from an unsharded ClickHouse environment to a sharded cluster requires redesign of schema and queries. Starting with a sharded cluster from the beginning makes it easier in the future to scale the cluster up.

Setting up a ClickHouse cluster for a production environment requires the following stages:

Hardware Requirements
Network Configuration
Create Host Names
Monitoring Considerations
Configuration Steps
Setting Up Backups
Staging Plans
Upgrading The Cluster

5.56.1 - Backups

Backups

ClickHouse® is currently at the design stage of creating some universal backup solution. Some custom backup strategies are:

Each shard is backed up separately.
FREEZE the table/partition. For more information, see Alter Freeze Partition .
1. This creates hard links in shadow subdirectory.
rsync that directory to a backup location, then remove that subfolder from shadow.
1. Cloud users are recommended to use Rclone .
Always add the full contents of the metadata subfolder that contains the current DB schema and ClickHouse configs to your backup.
For a second replica, it’s enough to copy metadata and configuration.
Data in ClickHouse is already compressed with lz4, backup can be compressed bit better, but avoid using cpu-heavy compression algorithms like gzip, use something like zstd instead.

The tool automating that process: Altinity Backup for ClickHouse .

5.56.2 - Cluster Configuration FAQ

Cluster Configuration FAQ

ClickHouse® does not start, some other unexpected behavior happening

Check ClickHouse logs, they are your friends:

tail -n 1000 /var/log/clickhouse-server/clickhouse-server.err.log | less tail -n 10000 /var/log/clickhouse-server/clickhouse-server.log | less

How Do I Restrict Memory Usage?

See our knowledge base article and official documentation for more information.

ClickHouse died during big query execution

Misconfigured ClickHouse can try to allocate more RAM than is available on the system.

In that case an OS component called oomkiller can kill the ClickHouse process.

That event leaves traces inside system logs (can be checked by running dmesg command).

How Do I make huge ‘Group By’ queries use less RAM?

Enable on disk GROUP BY (it is slower, so is disabled by default)

Set max_bytes_before_external_group_by to a value about 70-80% of your max_memory_usage value.

Data returned in chunks by clickhouse-client

See altinity-kb-clickhouse-client

I Can’t Connect From Other Hosts. What do I do?

Check the settings in config.xml. Verify that the connection can connect on both IPV4 and IPV6.

5.56.3 - Cluster Configuration Process

Cluster Configuration Process

So you set up 3 nodes with zookeeper (zookeeper1, zookeeper2, zookeeper3 - How to install zookeeper? ), and and 4 nodes with ClickHouse® (clickhouse-sh1r1,clickhouse-sh1r2,clickhouse-sh2r1,clickhouse-sh2r2 - how to install ClickHouse? ). Now we need to make them work together.

Use ansible/puppet/salt or other systems to control the servers’ configurations.

Configure ClickHouse access to Zookeeper by adding the file zookeeper.xml in /etc/clickhouse-server/config.d/ folder. This file must be placed on all ClickHouse servers.

<yandex>
    <zookeeper>
        <node>
            <host>zookeeper1</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper2</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper3</host>
            <port>2181</port>
        </node>
    </zookeeper>
</yandex>

On each server put the file macros.xml in /etc/clickhouse-server/config.d/ folder.

<yandex>
    <!--
        That macros are defined per server,
        and they can be used in DDL, to make the DB schema cluster/server neutral
    -->
    <macros>
        <cluster>prod_cluster</cluster>
        <shard>01</shard>
        <replica>clickhouse-sh1r1</replica> <!-- better - use the same as hostname  -->
    </macros>
</yandex>

On each server place the file cluster.xml in /etc/clickhouse-server/config.d/ folder. Before 20.10 ClickHouse will use default user to connect to other nodes (configurable, other users can be used), since 20.10 we recommend to use passwordless intercluster authentication based on common secret (HMAC auth)

<yandex>
    <remote_servers>
        <prod_cluster> <!-- you need to give a some name for a cluster -->

            <!--
                <secret>some_random_string, same on all cluster nodes, keep it safe</secret>
            -->
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse-sh1r1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-sh1r2</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse-sh2r1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-sh2r2</host>
                    <port>9000</port>
                </replica>
            </shard>
        </prod_cluster>
    </remote_servers>
</yandex>

A good practice is to create 2 additional cluster configurations similar to prod_cluster above with the following distinction: but listing all nodes of single shard (all are replicas) and as nodes of 6 different shards (no replicas)
1. all-replicated: All nodes are listed as replicas in a single shard.
2. all-sharded: All nodes are listed as separate shards with no replicas.

Once this is complete, other queries that span nodes can be performed. For example:

CREATE TABLE test_table_local ON CLUSTER '{cluster}'
(
  id UInt8
)
Engine=ReplicatedMergeTree('/clickhouse/tables/{database}/{table}/{shard}', '{replica}')
ORDER BY (id);

That will create a table on all servers in the cluster. You can insert data into this table and it will be replicated automatically to the other shards.To store the data or read the data from all shards at the same time, create a Distributed table that links to the replicatedMergeTree table.

CREATE TABLE test_table ON CLUSTER '{cluster}'
Engine=Distributed('{cluster}', 'default', '

Hardening ClickHouse Security

See https://docs.altinity.com/operationsguide/security/

Additional Settings

See altinity-kb-settings-to-adjust

Users

Disable or add password for the default users default and readonly if your server is accessible from non-trusted networks.

If you add password to the default user, you will need to adjust cluster configuration, since the other servers need to know the default user’s should know the default user’s to connect to each other.

If you’re inside a trusted network, you can leave default user set to nothing to allow the ClickHouse nodes to communicate with each other.

Engines & ClickHouse building blocks

For general explanations of roles of different engines - check the post Distributed vs Shard vs Replicated ahhh, help me!!! .

Zookeeper Paths

Use conventions for zookeeper paths. For example, use:

ReplicatedMergeTree(’/clickhouse/{cluster}/tables/{shard}/table_name’, ‘{replica}’)

for:

SELECT * FROM system.zookeeper WHERE path=’/ …';

Configuration Best Practices

Attribution

Modified by a post [on GitHub by Mikhail Filimonov](https://github.com/ClickHouse/ClickHouse/issues/3607#issuecomment-440235298).

The following are recommended Best Practices when it comes to setting up a ClickHouse Cluster with Zookeeper:

Don’t edit/overwrite default configuration files. Sometimes a newer version of ClickHouse introduces some new settings or changes the defaults in config.xml and users.xml.
1. Set configurations via the extra files in conf.d directory. For example, to overwrite the interface save the file config.d/listen.xml, with the following:

<?xml version="1.0"?>
<yandex>
    <listen_host replace="replace">::</listen_host>
</yandex>

The same is true for users. For example, change the default profile by putting the file in users.d/profile_default.xml:

<?xml version="1.0"?>
<yandex>
    <profiles>
        <default replace="replace">
            <max_memory_usage>15000000000</max_memory_usage>
            <max_bytes_before_external_group_by>12000000000</max_bytes_before_external_group_by>
            <max_bytes_before_external_sort>12000000000</max_bytes_before_external_sort>
            <distributed_aggregation_memory_efficient>1</distributed_aggregation_memory_efficient>
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <load_balancing>random</load_balancing>
            <log_queries>1</log_queries>
            <max_execution_time>600</max_execution_time>
        </default>
    </profiles>
</yandex>

Or you can create a user by putting a file users.d/user_xxx.xml (since 20.5 you can also use CREATE USER)

<?xml version="1.0"?>
<yandex>
    <users>
        <xxx>
            <!-- PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-' -->
            <password_sha256_hex>...</password_sha256_hex>
            <networks incl="networks" />
            <profile>readonly</profile>
            <quota>default</quota>
            <allow_databases incl="allowed_databases" />
        </xxx>
    </users>
</yandex>

Some parts of configuration will contain repeated elements (like allowed ips for all the users). To avoid repeating that - use substitutions file. By default its /etc/metrika.xml, but you can change it for example to /etc/clickhouse-server/substitutions.xml with the <include_from> section of the main config. Put the repeated parts into substitutions file, like this:

<?xml version="1.0"?>
<yandex>
    <networks>
        <ip>::1</ip>
        <ip>127.0.0.1</ip>
        <ip>10.42.0.0/16</ip>
        <ip>192.168.0.0/24</ip>
    </networks>
</yandex>

These files can be common for all the servers inside the cluster or can be individualized per server. If you choose to use one substitutions file per cluster, not per node, you will also need to generate the file with macros, if macros are used.

This way you have full flexibility; you’re not limited to the settings described in the template. You can change any settings per server or data center just by assigning files with some settings to that server or server group. It becomes easy to navigate, edit, and assign files.

Other Configuration Recommendations

Other configurations that should be evaluated:

in config.xml: Determines which IP addresses and ports the ClickHouse servers listen for incoming communications.
<max_memory_..> and <max_bytes_before_external_…> in users.xml. These are part of the profile .
<max_execution_time>
<log_queries>

The following extra debug logs should be considered:

part_log
text_log

Understanding The Configuration

ClickHouse configuration stores most of its information in two files:

config.xml: Stores Server configuration parameters . They are server wide, some are hierarchical , and most of them can’t be changed in runtime. The list of settings to apply without a restart changes from version to version. Some settings can be verified using system tables, for example:
- macros (system.macros)
- remote_servers (system.clusters)
users.xml: Configure users, and user level / session level settings .
- Each user can change these during their session by:
  - Using parameter in http query
  - By using parameter for clickhouse-client
  - Sending query like set allow_experimental_data_skipping_indices=1.
- Those settings and their current values are visible in system.settings. You can make some settings global by editing default profile in users.xml, which does not need restart.
- You can forbid users to change their settings by using readonly=2 for that user, or using setting constraints .
- Changes in users.xml are applied w/o restart.

For both config.xml and users.xml, it’s preferable to put adjustments in the config.d and users.d subfolders instead of editing config.xml and users.xml directly.

You can check if the config file was reread by checking /var/lib/clickhouse/preprocessed_configs/ folder.

5.56.4 - Hardware Requirements

Hardware Requirements

ClickHouse®

ClickHouse will use all available hardware to maximize performance. So the more hardware - the better. As of this publication, the hardware requirements are:

Minimum Hardware: 4-core CPU with support of SSE4.2, 16 Gb RAM, 1Tb HDD.
- Recommended for development and staging environments.
- SSE4.2 is required, and going below 4 Gb of RAM is not recommended.
Recommended Hardware: >=16-cores, >=64Gb RAM, HDD-raid or SSD.
- For processing up to hundreds of millions / billions of rows.

For clouds: disk throughput is the more important factor compared to IOPS. Be aware of burst / baseline disk speed difference.

Zookeeper

Zookeeper requires separate servers from those used for ClickHouse. Zookeeper has poor performance when installed on the same node as ClickHouse.

Hardware Requirements for Zookeeper:

Fast disk speed (ideally NVMe, 128Gb should be enough).
Any modern CPU (one core, better 2)
4Gb of RAM

For clouds - be careful with burstable network disks (like gp2 on aws): you may need up to 1000 IOPs on the disk for on a long run, so gp3 with 3000 IOPs baseline is a better choice.

The number of Zookeeper instances depends on the environment:

Production: 3 is an optimal number of zookeeper instances.
Development and Staging: 1 zookeeper instance is sufficient.

See also:

ClickHouse Hardware Configuration

Configure the servers according to those recommendations on the ClickHouse Usage Recommendations .

Test Your Hardware

Be sure to test the following:

RAM speed.
Network speed.
Storage speed.

It’s better to find any performance issues before installing ClickHouse.

5.56.5 - Network Configuration

Network Configuration

Networking And Server Room Planning

The network used for your ClickHouse® cluster should be a fast network, ideally 10 Gbit or more. ClickHouse nodes generate a lot of traffic to exchange the data between nodes (port 9009 for replication, and 9000 for distributed queries). Zookeeper traffic in normal circumstances is moderate, but in some special cases can also be very significant.

For the zookeeper low latency is more important than bandwidth.

Keep the replicas isolated on the hardware level. This allows for cluster failover from possible outages.

For Physical Environments: Avoid placing 2 ClickHouse replicas on the same server rack. Ideally, they should be on isolated network switches and an isolated power supply.
For Clouds Environments: Use different availability zones between the ClickHouse replicas when possible (but be aware of the interzone traffic costs)

These considerations are the same as the Zookeeper nodes.

For example:

Rack	Server	Server	Server	Server
Rack 1	CH_SHARD1_R1	CH_SHARD2_R1	CH_SHARD3_R1	ZOO_1
Rack 2	CH_SHARD1_R2	CH_SHARD2_R2	CH_SHARD3_R2	ZOO_2
Rack 3	ZOO3

Network Ports And Firewall

ClickHouse listens the following ports:

9000: clickhouse-client, native clients, other clickhouse-servers connect to here.
8123: HTTP clients
9009: Other replicas will connect here to download data.

For more information, see CLICKHOUSE NETWORKING, PART 1 .

Zookeeper listens the following ports:

2181: Client connections.
2888: Inter-ensemble connections.
3888: Leader election.

Outbound traffic from ClickHouse connects to the following ports:

ZooKeeper: On port 2181.
Other CH nodes in the cluster: On port 9000 and 9009.
Dictionary sources: Depending on what was configured such as HTTP, MySQL, Mongo, etc.
Kafka or Hadoop: If those integrations were enabled.

SSL

For non-trusted networks enable SSL/HTTPS. If acceptable, it is better to keep interserver communications unencrypted for performance reasons.

Naming Schema

The best time to start creating a naming schema for the servers is before they’re created and configured.

There are a few features based on good server naming in ClickHouse:

clickhouse-client prompts: Allows a different prompt for clickhouse-client per server hostname.
Nearest hostname load balancing: For more information, see Nearest Hostname .

A good option is to use the following:

{datacenter}-{serverroom}-{rack identifier}-{clickhouse cluster identifier}-{shard number or server number}.

Other examples:

rxv-olap-ch-master-sh01-r01:
- rxv - location (rack#15)
- olap - product name
- ch = clickhouse
- master = stage
- sh01 = shard 1
- r01 = replica 1
hetnzerde1-ch-prod-01.local:
- hetnzerde1 - location (also replica id)
- ch = clickhouse
- prod = stage
- 01 - server number / shard number in that DC
sh01.ch-front.dev.aws-east1a.example.com:
- sh01 - shard 01
- ch-front - cluster name
- dev = stage
- aws = cloud provider
- east1a = region and availability zone

Host Name References

Additional Hostname Tips

Hostnames configured on the server should not change. If you do need to change the host name, one reference to use is How to Change Hostname on Ubuntu 18.04 .
The server should be accessible to other servers in the cluster via it’s hostname. Otherwise you will need to configure interserver_hostname in your config.
Ensure that hostname --fqdn and getent hosts $(hostname --fqdn) return the correct name and ip.

5.57 - System tables ate my disk

When the ClickHouse® SYSTEM database gets out of hand

Note 1: System database stores virtual tables (parts, tables, columns, etc.) and *_log tables.
Virtual tables do not persist on disk. They reflect ClickHouse® memory (c++ structures). They cannot be changed or removed.
Log tables are named with postfix *_log and have the MergeTree engine . ClickHouse does not use information stored in these tables, this data is for you only.
You can drop / rename / truncate *_log tables at any time. ClickHouse will recreate them in about 7 seconds (flush period).

Note 2: Log tables with numeric postfixes (_1 / 2 / 3 …) query_log_1 query_thread_log_3 are results of ClickHouse upgrades (or other changes of schemas of these tables). When a new version of ClickHouse starts and discovers that a system log table’s schema is incompatible with a new schema, then ClickHouse renames the old *_log table to the name with the prefix and creates a table with the new schema. You can drop such tables if you don’t need such historic data.

You can disable all / any of them

Do not create log tables at all (a restart is needed for these changes to take effect).

$ cat /etc/clickhouse-server/config.d/z_log_disable.xml
<?xml version="1.0"?>
<clickhouse>
    <asynchronous_metric_log remove="1"/>
    <asynchronous_insert_log remove="1"/>
    <backup_log remove="1"/>
    <error_log remove="1"/>
    <metric_log remove="1"/>
    <query_metric_log remove="1"/>
    <query_thread_log remove="1" />  
    <query_log remove="1" />
    <query_views_log remove="1" />
    <part_log remove="1"/>
    <session_log remove="1"/>
    <text_log remove="1" />
    <trace_log remove="1"/>
    <crash_log remove="1"/>
    <opentelemetry_span_log remove="1"/>
    <zookeeper_log remove="1"/>
    <processors_profile_log remove="1"/>
    <latency_log remove="1"/>
    <background_schedule_pool_log remove="1"/>
    <aggregated_zookeeper_log remove="1"/>
    <zookeeper_connection_log remove="1"/>
</clickhouse>

Hint: z_log_disable.xml is named with z_ in the beginning, it means this config will be applied the last and will override all other config files with these sections (config are applied in alphabetical order).

We do not recommend removing query_log as it has very useful information for debugging, and logging can be easily turned off without a restart through user profiles:

$ cat /etc/clickhouse-server/users.d/z_log_queries.xml
<clickhouse>
    <profiles>
        <default>
            <log_queries>0</log_queries> <!-- normally it's better to keep it turned on! -->
        </default>
    </profiles>
</clickhouse>

You can also configure these settings to reduce the amount of data in the system.query_log table:

name                              | value       | description                                                                                                                                                       
----------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
log_queries_min_type              | QUERY_START | Minimal type in query_log to log, possible values (from low to high): QUERY_START, QUERY_FINISH, EXCEPTION_BEFORE_START, EXCEPTION_WHILE_PROCESSING.
log_queries_min_query_duration_ms | 0           | Minimal time for the query to run, to get to the query_log/query_thread_log.
log_queries_cut_to_length         | 100000      | If query length is greater than specified threshold (in bytes), then cut query when writing to query log. Also limit length of printed query in ordinary text log.
log_profile_events                | 1           | Log query performance statistics into the query_log and query_thread_log.
log_query_settings                | 1           | Log query settings into the query_log.
log_queries_probability           | 1           | Log queries with the specified probabality.

The other system log tables that can be disabled in profiles are:

query_views_log
query_thread_log
processors_profile_log
query_metric_log
trace_log (https://clickhouse.com/docs/operations/system-tables/trace_log )

$ cat /etc/clickhouse-server/users.d/z_log_tables.xml
<clickhouse>
    <profiles>
        <default>
            <log_query_views>0</log_query_views>
            <log_query_threads>0</log_query_threads>
            <log_processors_profiles>0</log_processors_profiles>
            <query_metric_log_interval>0</query_metric_log_interval>
        </default>
    </profiles>
</clickhouse>

You can configure TTL

Example for query_log. It drops partitions with data older than 14 days:

$ cat /etc/clickhouse-server/config.d/query_log_ttl.xml
<?xml version="1.0"?>
<clickhouse>
    <query_log replace="1">
        <database>system</database>
        <table>query_log</table>
        <engine>ENGINE = MergeTree PARTITION BY (event_date)
                ORDER BY (event_time)
                TTL event_date + INTERVAL 14 DAY DELETE
        </engine>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </query_log>
</clickhouse>

After that you need to restart ClickHouse and if using old clickhouse versions like 20 or less, drop or rename the existing system.query_log table and then CH creates a new table with these settings. This is automatically done in newer versions 21+.

RENAME TABLE system.query_log TO system.query_log_1;

Important part here is a daily partitioning PARTITION BY (event_date) in this case TTL expression event_date + INTERVAL 14 DAY DELETE expires all rows at the same time. In this case ClickHouse drops whole partitions. Dropping of partitions is very easy operation for CPU / Disk I/O.

Usual TTL processing (when table partitioned by toYYYYMM and TTL by day) is heavy CPU / Disk I/O consuming operation which re-writes data parts without expired rows.

You can add TTL without ClickHouse restart (and table dropping or renaming):

ALTER TABLE system.query_log MODIFY TTL event_date + INTERVAL 14 DAY;

But in this case ClickHouse will drop only whole monthly partitions (will store data older than 14 days).

One more way to configure TTL for system tables

This way just adds TTL to a table and leaves monthly (default) partitioning (will store data older than 14 days).

$ cat /etc/clickhouse-server/config.d/query_log_ttl.xml
<?xml version="1.0"?>
<clickhouse>
    <query_log>
        <database>system</database>
        <table>query_log</table>
        <ttl>event_date + INTERVAL 30 DAY DELETE</ttl>
    </query_log>
</clickhouse>

💡 For the clickhouse-operator , the above method of using only the <engine> tag without <ttl> or <partition> is recommended, because of possible configuration clashes.

5.58 - ClickHouse® Replication problems

Finding and troubleshooting problems in the replication_queue

Common problems & solutions

If the replication queue does not have any Exceptions only postponed reasons without exceptions just leave ClickHouse® do Merges/Mutations and it will eventually catch up and reduce the number of tasks in replication_queue. Number of concurrent merges and fetches can be tuned but if it is done without an analysis of your workload then you may end up in a worse situation. If Delay in queue is going up actions may be needed:
First simplest approach: try to SYSTEM RESTART REPLICA db.table (This will DETACH/ATTACH table internally)

How to check for replication problems

Check system.replicas first, cluster-wide. It allows to check if the problem is local to some replica or global, and allows to see the exception. allows to answer the following questions:
- Are there any ReadOnly replicas?
- Is there the connection to zookeeper active?
- Is there the exception during table init? (Code: 999. Coordination::Exception: Transaction failed (No node): Op #1)
Check system.replication_queue.
- How many tasks there / are they moving / are there some very old tasks there? (check created_time column, if tasks are 24h old, it is a sign of a problem):
- You can use this qkb article query: https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-replication-queue/
- Check if there are tasks with a high number of num_tries or num_postponed and postponed_reason this is a sign of stuck tasks.
- Check the problematic parts affecting the stuck tasks. You can use columns new_part_name or parts_to_merge
- Check which type is the task. If it is MUTATE_PART then it is a mutation task. If it is MERGE_PARTS then it is a merge task. These tasks can be deleted from the replication queue but GET_PARTS should not be deleted.
Check system.errors
Check system.mutations:
- You can check that in the replication queue are stuck tasks of type MUTATE_PART, and that those mutations are still executing system.mutations using column is_done
Find the moment when the problem started and collect/analyze / preserve logs from that moment. It is usually during the first steps of a restart/crash
Use part_log and system.parts to gather information of the parts related with the stuck tasks in the replication queue:
- Check if those parts exist and are active from system.parts (use partition_id, name as part and active columns to filter)
- Extract the part history from system.part_log
- Example query from part_log:

SELECT hostName(), * FROM 
cluster('all-sharded',system.part_log)
WHERE
    hostName() IN ('chi-prod-live-2-0-0','chi-prod-live-2-2-0','chi-prod-live-2-1-0')
    AND table = 'sessions_local'
    AND database = 'analytics'
    AND part_name in ('20230411_33631_33654_3')

If there are no errors, just everything get slower - check the load (usual system metrics)

Some stuck replication task for a partition that was already removed or has no data

This can be easily detected because some exceptions will be in the replication queue that reference a part from a partition that do not exist. Here the most probable scenario is that the partition was dropped and some tasks were left in the queue.
drop the partition manually once again (it should remove the task)
If the partition exists but the part is missing (maybe because it is superseded by a newer merged part) then you can try to DETACH/ATTACH the partition.
Below DML generates the ALTER commands to do this:

WITH 
    extract(new_part_name, '^[^_]+')  as partition_id
SELECT
    '/* count: ' || count() || ' */\n' ||
    'ALTER TABLE ' || database || '.' || table || ' DETACH PARTITION ID \''|| partition_id || '\';\n' ||
    'ALTER TABLE ' || database || '.' || table || ' ATTACH PARTITION ID \''|| partition_id || '\';\n'
FROM 
    system.replication_queue as rq
GROUP BY
    database, table, partition_id
HAVING sum(num_tries) > 1000 OR count() > 100
ORDER BY count() DESC, sum(num_tries) DESC
FORMAT TSVRaw;

Problem with mutation stuck in the queue

This can happen if the mutation is finished and, for some reason, the task is not removed from the queue. This can be detected by checking system.mutations table and seeing if the mutation is done, but the task is still in the queue.
kill the mutation (again)

Replica is not starting because local set of files differs too much

First try increase the thresholds or set flag force_restore_data flag and restarting clickhouse/pod https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication#recovery-after-complete-data-loss

Replica is in Read-Only MODE

Sometimes, due to crashes, zookeeper unavailability, slowness, or other reasons, some of the tables can be in Read-Only mode. This allows SELECTS but not INSERTS. So we need to do DROP / RESTORE replica procedure.

Just to be clear, this procedure will not delete any data, it will just re-create the metadata in zookeeper with the current state of the ClickHouse replica .

How it works:

ALTER TABLE table_name DROP DETACHED PARTITION ALL  -- clean detached folder before operation. PARTITION ALL works only for the fresh clickhouse versions
DETACH TABLE table_name;  -- Required for DROP REPLICA
-- Use the zookeeper_path and replica_name from system.replicas. 
SYSTEM DROP REPLICA 'replica_name' FROM ZKPATH '/table_path_in_zk'; -- It will remove everything from the /table_path_in_zk/replicas/replica_name
ATTACH TABLE table_name;  -- Table will be in readonly mode, because there is no metadata in ZK and after that execute
SYSTEM RESTORE REPLICA table_name;  -- It will detach all partitions, re-create metadata in ZK (like it's new empty table), and then attach all partitions back
SYSTEM SYNC REPLICA table_name; -- Not mandatory. It will Wait for replicas to synchronize parts. Also it's recommended to check `system.detached_parts` on all replicas after recovery is finished.
SELECT name FROM system.detached_parts WHERE table = 'table_name'; -- check for leftovers. See the potential problems here https://altinity.com/blog/understanding-detached-parts-in-clickhouse

Starting from version 23, it’s possible to use syntax SYSTEM DROP REPLICA 'replica_name' FROM TABLE db.table instead of the ZKPATH variant, but you need to execute the above command from a different replica than the one you want to drop, which is not convenient sometimes. We recommend using the above method because it works with any version and is more reliable.

Procedure to restore multiple tables in Read-Only mode per replica

It is better to make an approach per replica, because restoring a replica using ON CLUSTER could lead to race conditions that would cause errors and a big stress in zookeeper/keeper

SELECT 
    '-- Table ' || toString(row_num) || '\n' ||
    'DETACH TABLE `' || database || '`.`' || table || '`;\n' ||
    'SYSTEM DROP REPLICA ''' || replica_name || ''' FROM ZKPATH ''' || zookeeper_path || ''';\n' ||
    'ATTACH TABLE `' || database || '`.`' || table || '`;\n' ||
    'SYSTEM RESTORE REPLICA `' || database || '`.`' || table || '`;\n'
FROM (
    SELECT 
        *,
        rowNumberInAllBlocks() + 1 as row_num
    FROM (
        SELECT 
            database,
            table,
            any(replica_name) as replica_name,
            any(zookeeper_path) as zookeeper_path
        FROM system.replicas
        WHERE is_readonly
        GROUP BY database, table
        ORDER BY database, table
    )
    ORDER BY database, table
) 
FORMAT TSVRaw;

This will generate the DDL statements to be executed per replica and generate an ouput that can be saved as an SQL file . It is important to execute the commands per replica in the sequence generated by the above DDL:

DETACH the table
DROP REPLICA
ATTACH the table
RESTORE REPLICA

If we do this in parallel a table could still be attaching while another query is dropping/restoring the replica in zookeeper, causing errors.

The following bash script will read the generated SQL file and execute the commands sequentially, asking for user input in case of errors. Simply save the generated SQL to a file (e.g. recovery_commands.sql) and run the script below (that you can name as clickhouse_replica_recovery.sh):

$ clickhouse_replica_recovery.sh recovery_commands.sql

Here the script:

#!/bin/bash

# ClickHouse Replica Recovery Script
# This script executes DETACH, DROP REPLICA, ATTACH, and RESTORE REPLICA commands sequentially

# Configuration
CLICKHOUSE_HOST="${CLICKHOUSE_HOST:-localhost}"
CLICKHOUSE_PORT="${CLICKHOUSE_PORT:-9000}"
CLICKHOUSE_USER="${CLICKHOUSE_USER:-clickhouse_operator}"
CLICKHOUSE_PASSWORD="${CLICKHOUSE_PASSWORD:-xxxxxxxxx}"
COMMANDS_FILE="${1:-recovery_commands.sql}"
LOG_FILE="recovery_$(date +%Y%m%d_%H%M%S).log"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
MAGENTA='\033[0;35m'
NC='\033[0m' # No Color

# Function to log messages
log() {
    echo -e "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Function to execute a SQL statement with retry logic
execute_sql() {
    local sql="$1"
    local table_num="$2"
    local step_name="$3"
    
    while true; do
        log "${YELLOW}Executing command for Table $table_num - $step_name:${NC}"
        log "$sql"
        
        # Build clickhouse-client command
        local ch_cmd="clickhouse-client --host=$CLICKHOUSE_HOST --port=$CLICKHOUSE_PORT --user=$CLICKHOUSE_USER"
        
        if [ -n "$CLICKHOUSE_PASSWORD" ]; then
            ch_cmd="$ch_cmd --password=$CLICKHOUSE_PASSWORD"
        fi
        
        # Execute the command and capture output and exit code
        local output
        local exit_code
        output=$(echo "$sql" | $ch_cmd 2>&1)
        exit_code=$?
        
        # Log the output
        echo "$output" | tee -a "$LOG_FILE"
        
        if [ $exit_code -eq 0 ]; then
            log "${GREEN}✓ Successfully executed${NC}"
            return 0
        else
            log "${RED}✗ Failed to execute (Exit code: $exit_code)${NC}"
            log "${RED}Error output: $output${NC}"
            
            # Ask user what to do
            while true; do
                echo ""
                log "${MAGENTA}========================================${NC}"
                log "${MAGENTA}Error occurred! Choose an option:${NC}"
                log "${MAGENTA}========================================${NC}"
                echo -e "${YELLOW}[R]${NC} - Retry this command"
                echo -e "${YELLOW}[I]${NC} - Ignore this error and continue to next command in this table"
                echo -e "${YELLOW}[S]${NC} - Skip this entire table and move to next table"
                echo -e "${YELLOW}[A]${NC} - Abort script execution"
                echo ""
                echo -n "Enter your choice (R/I/S/A): "
                
                # Read from /dev/tty to get user input from terminal
                read -r response < /dev/tty
                
                case "${response^^}" in
                    R|RETRY)
                        log "${BLUE}Retrying command...${NC}"
                        break  # Break inner loop to retry
                        ;;
                    I|IGNORE)
                        log "${YELLOW}Ignoring error and continuing to next command...${NC}"
                        return 1  # Return error but continue
                        ;;
                    S|SKIP)
                        log "${YELLOW}Skipping entire table $table_num...${NC}"
                        return 2  # Return special code to skip table
                        ;;
                    A|ABORT)
                        log "${RED}Aborting script execution...${NC}"
                        exit 1
                        ;;
                    *)
                        echo -e "${RED}Invalid option '$response'. Please enter R, I, S, or A.${NC}"
                        ;;
                esac
            done
        fi
    done
}

# Main execution function
main() {
    log "${BLUE}========================================${NC}"
    log "${BLUE}ClickHouse Replica Recovery Script${NC}"
    log "${BLUE}========================================${NC}"
    log "Host: $CLICKHOUSE_HOST:$CLICKHOUSE_PORT"
    log "User: $CLICKHOUSE_USER"
    log "Commands file: $COMMANDS_FILE"
    log "Log file: $LOG_FILE"
    echo ""
    
    # Check if commands file exists
    if [ ! -f "$COMMANDS_FILE" ]; then
        log "${RED}Error: Commands file '$COMMANDS_FILE' not found!${NC}"
        echo ""
        echo "Usage: $0 [commands_file]"
        echo "  commands_file: Path to SQL commands file (default: recovery_commands.sql)"
        echo ""
        echo "Example: $0 my_commands.sql"
        exit 1
    fi
    
    # Process SQL commands from file
    local current_sql=""
    local table_counter=0
    local step_in_table=0
    local failed_count=0
    local success_count=0
    local ignored_count=0
    local skipped_tables=()
    local skip_current_table=false
    
    while IFS= read -r line || [ -n "$line" ]; do
        # Skip empty lines
        if [[ -z "$line" ]] || [[ "$line" =~ ^[[:space:]]*$ ]]; then
            continue
        fi
        
        # Check if this is a comment line indicating a new table
        if [[ "$line" =~ ^[[:space:]]*--[[:space:]]*Table[[:space:]]+([0-9]+) ]]; then
            table_counter="${BASH_REMATCH[1]}"
            step_in_table=0
            skip_current_table=false
            log ""
            log "${BLUE}========================================${NC}"
            log "${BLUE}Processing Table $table_counter${NC}"
            log "${BLUE}========================================${NC}"
            continue
        elif [[ "$line" =~ ^[[:space:]]*-- ]]; then
            # Skip other comment lines
            continue
        fi
        
        # Skip if we're skipping this table
        if [ "$skip_current_table" = true ]; then
            # Check if line ends with semicolon to count statements
            if [[ "$line" =~ \;[[:space:]]*$ ]]; then
                step_in_table=$((step_in_table + 1))
            fi
            continue
        fi
        
        # Accumulate the SQL statement
        current_sql+="$line "
        
        # Check if we have a complete statement (ends with semicolon)
        if [[ "$line" =~ \;[[:space:]]*$ ]]; then
            step_in_table=$((step_in_table + 1))
            
            # Determine the step name
            local step_name=""
            if [[ "$current_sql" =~ ^[[:space:]]*DETACH ]]; then
                step_name="DETACH"
            elif [[ "$current_sql" =~ ^[[:space:]]*SYSTEM[[:space:]]+DROP[[:space:]]+REPLICA ]]; then
                step_name="DROP REPLICA"
            elif [[ "$current_sql" =~ ^[[:space:]]*ATTACH ]]; then
                step_name="ATTACH"
            elif [[ "$current_sql" =~ ^[[:space:]]*SYSTEM[[:space:]]+RESTORE[[:space:]]+REPLICA ]]; then
                step_name="RESTORE REPLICA"
            fi
            
            log ""
            log "Step $step_in_table/4: $step_name"
            
            # Execute the statement
            local result
            execute_sql "$current_sql" "$table_counter" "$step_name"
            result=$?
            
            if [ $result -eq 0 ]; then
                success_count=$((success_count + 1))
                sleep 1  # Small delay between commands
            elif [ $result -eq 1 ]; then
                # User chose to ignore this error
                failed_count=$((failed_count + 1))
                ignored_count=$((ignored_count + 1))
                sleep 1
            elif [ $result -eq 2 ]; then
                # User chose to skip this table
                skip_current_table=true
                skipped_tables+=("$table_counter")
                log "${YELLOW}Skipping remaining commands for Table $table_counter${NC}"
            fi
            
            # Reset current_sql for next statement
            current_sql=""
        fi
    done < "$COMMANDS_FILE"
    
    # Summary
    log ""
    log "${BLUE}========================================${NC}"
    log "${BLUE}Execution Summary${NC}"
    log "${BLUE}========================================${NC}"
    log "Total successful commands: ${GREEN}$success_count${NC}"
    log "Total failed commands: ${RED}$failed_count${NC}"
    log "Total ignored errors: ${YELLOW}$ignored_count${NC}"
    log "Total tables processed: $table_counter"
    
    if [ ${#skipped_tables[@]} -gt 0 ]; then
        log "Skipped tables: ${YELLOW}${skipped_tables[*]}${NC}"
    fi
    
    log "Log file: $LOG_FILE"
    
    if [ $failed_count -eq 0 ]; then
        log "${GREEN}All commands executed successfully!${NC}"
        exit 0
    else
        log "${YELLOW}Some commands failed or were ignored. Please check the log file.${NC}"
        exit 1
    fi
}

# Run the main function
main

5.59 - Replication queue

Replication queue

SELECT
    database,
    table,
    type,
    max(last_exception),
    max(postpone_reason),
    min(create_time),
    max(last_attempt_time),
    max(last_postpone_time),
    max(num_postponed) AS max_postponed,
    max(num_tries) AS max_tries,
    min(num_tries) AS min_tries,
    countIf(last_exception != '') AS count_err,
    countIf(num_postponed > 0) AS count_postponed,
    countIf(is_currently_executing) AS count_executing,
    count() AS count_all
FROM system.replication_queue
GROUP BY
    database,
    table,
    type
ORDER BY count_all DESC

5.60 - Schema migration tools for ClickHouse®

Schema migration tools for ClickHouse®

atlas
- https://atlasgo.io/guides/clickhouse
golang-migrate tool - see golang-migrate
liquibase
- https://github.com/mediarithmics/liquibase-clickhouse
- https://johntipper.org/how-to-execute-liquibase-changesets-against-clickhouse/
HousePlant
- New CLI migration tool (Dec2024) for ClickHouse developed by June
- Documentation https://houseplant.readthedocs.io/en/latest/index.html
- Github https://github.com/juneHQ/houseplant
ClickSuite
- developed by GameBeast
- A robust CLI tool for managing ClickHouse database migrations with environment-specific configurations and TypeScript support.
- Github https://github.com/GamebeastGG/clicksuite
Flyway
- Official community supported plugin git https://github.com/flyway/flyway-community-db-support
- Old pull requests (latest at the top):
  - https://github.com/flyway/flyway/pull/3333 СlickHouse support
  - https://github.com/flyway/flyway/pull/3134 СlickHouse support
  - https://github.com/flyway/flyway/pull/3133 Add support ClickHouse
  - https://github.com/flyway/flyway/pull/2981 ClickHouse replicated
  - https://github.com/flyway/flyway/pull/2640 Yet another ClickHouse support
  - https://github.com/flyway/flyway/pull/2166 ClickHouse support (#1772)
  - https://github.com/flyway/flyway/pull/1773 Fixed #1772: Add support for ClickHouse (https://clickhouse.yandex/ )
alembic
- see https://clickhouse-sqlalchemy.readthedocs.io/en/latest/migrations.html
bytebase
- https://bytebase.com
custom tool for ClickHouse for python
phpMigrations
- https://github.com/smi2/phpMigrationsClickHouse
- https://habrahabr.ru/company/smi2/blog/317682/
dbmate
- https://github.com/amacneil/dbmate#clickhouse

Know more?

https://clickhouse.com/docs/knowledgebase/schema_migration_tools

Article on migrations in ClickHouse https://posthog.com/blog/async-migrations

5.60.1 - golang-migrate

golang-migrate

`migrate`

migrate is a simple schema migration tool written in golang. No external dependencies are required (like interpreter, jre), only one platform-specific executable. golang-migrate/migrate

migrate supports several databases, including ClickHouse® (support was introduced by @kshvakov ).

To store information about migrations state migrate creates one additional table in target database, by default that table is called schema_migrations.

Install

download the migrate executable for your platform and put it to the folder listed in your %PATH.

#wget https://github.com/golang-migrate/migrate/releases/download/v3.2.0/migrate.linux-amd64.tar.gz
wget https://github.com/golang-migrate/migrate/releases/download/v4.14.1/migrate.linux-amd64.tar.gz
tar -xzf migrate.linux-amd64.tar.gz
mkdir -p ~/bin
mv migrate.linux-amd64 ~/bin/migrate
rm migrate.linux-amd64.tar.gz

Sample usage

mkdir migrations
echo 'create table test(id UInt8) Engine = Memory;' > migrations/000001_my_database_init.up.sql
echo 'DROP TABLE test;' > migrations/000001_my_database_init.down.sql

# you can also auto-create file with new migrations with automatic numbering like that:
migrate create -dir migrations -seq -digits 6 -ext sql my_database_init

edit migrations/000001_my_database_init.up.sql & migrations/000001_my_database_init.down.sql

migrate -database 'clickhouse://localhost:9000' -path ./migrations up
1/u my_database_init (6.502974ms)

migrate -database 'clickhouse://localhost:9000' -path ./migrations down
1/d my_database_init (2.164394ms)

# clears the database (use carefully - will not ask any confirmations)
➜ migrate -database 'clickhouse://localhost:9000' -path ./migrations drop

Connection string format

clickhouse://host:port?username=user&password=qwerty&database=clicks

URL Query	Description
`x-migrations-table`	Name of the migrations table
`x-migrations-table-engine`	Engine to use for the migrations table, defaults to TinyLog
`x-cluster-name`	Name of cluster for creating table cluster wide
`database`	The name of the database to connect to
`username`	The user to sign in as
`password`	The user’s password
`host`	The host to connect to.
`port`	The port to bind to.
`secure`	to use a secure connection (for self-signed also add `skip_verify=1`)

Replicated / Distributed / Cluster environments

golang-migrate supports a clustered ClickHouse environment since v4.15.0.

If you provide x-cluster-name query param, it will create the table to store migration data on the passed cluster.

Known issues

could not load time location: unknown time zone Europe/Moscow in line 0:

It’s happens due of missing tzdata package in migrate/migrate docker image of golang-migrate. There is 2 possible solutions:

You can build your own golang-migrate image from official with tzdata package.
If you using it as part of your CI you can add installing tzdata package as one of step in CI before using golang-migrate.

Using database name in x-migrations-table

Creates table with database.table
When running migrations migrate actually uses database from query settings and encapsulate database.table as table name: ``other_database.`database.table```

5.61 - Settings to adjust

Settings to adjust

query_log and other _log tables - set up TTL, or some other cleanup procedures.

cat /etc/clickhouse-server/config.d/query_log.xml
<clickhouse>
    <query_log replace="1">
        <database>system</database>
        <table>query_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <engine>
ENGINE = MergeTree
PARTITION BY event_date
ORDER BY (event_time)
TTL event_date + interval 90 day
SETTINGS ttl_only_drop_parts=1
        </engine>
    </query_log>
</clickhouse>

query_thread_log - typically is not too useful for end users, you can disable it (or set up TTL). We do not recommend removing this table completely as you might need it for debug one day and the threads’ logging can be easily disabled/enabled without a restart through user profiles:
```
 $ cat /etc/clickhouse-server/users.d/z_log_queries.xml
 <clickhouse>
     <profiles>
         <default>
             <log_query_threads>0</log_query_threads>
         </default>
     </profiles>
 </clickhouse>
```

If you have a good monitoring outside ClickHouse® you don’t need to store the history of metrics in ClickHouse

cat /etc/clickhouse-server/config.d/disable_metric_logs.xml
<clickhouse>
    <metric_log remove="1" />
    <asynchronous_metric_log remove="1" />
</clickhouse>

part_log - may be nice, especially at the beginning / during system tuning/analyze.

cat /etc/clickhouse-server/config.d/part_log.xml
<clickhouse>
    <part_log replace="1">
        <database>system</database>
        <table>part_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <engine>
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_time)
TTL toStartOfMonth(event_date) + INTERVAL 3 MONTH
SETTINGS ttl_only_drop_parts=1
        </engine>
    </part_log>
</clickhouse>

on older versions log_queries is disabled by default, it’s worth having it enabled always.

$ cat /etc/clickhouse-server/users.d/log_queries.xml
<clickhouse>
    <profiles>
        <default>
            <log_queries>1</log_queries>
        </default>
    </profiles>
</clickhouse>

quite often you want to have on-disk group by / order by enabled (both disabled by default).

cat /etc/clickhouse-server/users.d/enable_on_disk_operations.xml
<clickhouse>
    <profiles>
        <default>
           <max_bytes_before_external_group_by>2000000000</max_bytes_before_external_group_by>
           <max_bytes_before_external_sort>2000000000</max_bytes_before_external_sort>
        </default>
    </profiles>
</clickhouse>

quite often you want to create more users with different limitations. The most typical is <max_execution_time> It’s actually also not a way to plan/share existing resources better, but it at least disciplines users.
Also introducing some restrictions on query complexity can be a good option to discipline users.
You can find the preset example here . Also, force_index_by_date + force_primary_key can be a nice idea to avoid queries that ‘accidentally’ do full scans, max_concurrent_queries_for_user
merge_tree settings: max_bytes_to_merge_at_max_space_in_pool (may be reduced in some scenarios), inactive_parts_to_throw_insert - can be enabled, replicated_deduplication_window - can be extended if single insert create lot of parts , merge_with_ttl_timeout - when you use ttl
insert_distributed_sync - for small clusters you may sometimes want to enable it
when the durability is the main requirement (or server / storage is not stable) - you may want to enable fsync_* setting (impacts the write performance significantly!!), and insert_quorum
If you use FINAL queries - usually you want to enable do_not_merge_across_partitions_select_final
memory usage per server / query / user: memory configuration settings
if you use async_inserts - you often may want to increase max_concurrent_queries

<clickhouse>
    <max_concurrent_queries>500</max_concurrent_queries>
    <max_concurrent_insert_queries>400</max_concurrent_insert_queries>
    <max_concurrent_select_queries>100</max_concurrent_select_queries>
</clickhouse>

materialize_ttl_after_modify=0
access_management=1
secret in <remote_servers>

5.62 - Shutting down a node

Shutting down a node

It’s possible to shutdown server on fly, but that would lead to failure of some queries.

More safer way:

Remove server (which is going to be disabled) from remote_server section of config.xml on all servers.
- avoid removing the last replica of the shard (that can lead to incorrect data placement if you use non-random distribution)
Remove server from load balancer, so new queries wouldn’t hit it.
Detach Kafka / Rabbit / Buffer tables (if used), and Materialized* databases.
Wait until all already running queries would finish execution on it. It’s possible to check it via query:
```
SHOW PROCESSLIST;
```

Ensure there is no pending data in distributed tables

SELECT * FROM system.distribution_queue;
SYSTEM FLUSH DISTRIBUTED <table_name>;

Run sync replica query in related shard replicas (others than the one you remove) via query:
```
SYSTEM SYNC REPLICA db.table;
```
Shutdown server.

SYSTEM SHUTDOWN query by default doesn’t wait until query completion and tries to kill all queries immediately after receiving signal, if you want to change this behavior, you need to enable setting shutdown_wait_unfinished_queries.

https://github.com/ClickHouse/ClickHouse/blob/d705f8ead4bdc837b8305131844f558ec002becc/programs/server/Server.cpp#L1682

5.63 - SSL connection unexpectedly closed

SSL connection unexpectedly closed

ClickHouse doesn’t probe CA path which is default on CentOS and Amazon Linux.

ClickHouse client

cat /etc/clickhouse-client/conf.d/openssl-ca.xml
<config>
    <openSSL>
        <client> <!-- Used for connection to server's secure tcp port -->
            <caConfig>/etc/ssl/certs</caConfig>
        </client>
    </openSSL>
</config>

ClickHouse server

cat /etc/clickhouse-server/conf.d/openssl-ca.xml
<config>
    <openSSL>
        <server>  <!-- Used for https server AND secure tcp port -->
            <caConfig>/etc/ssl/certs</caConfig>
        </server>
        <client>  <!-- Used for connecting to https dictionary source and secured Zookeeper communication -->
            <caConfig>/etc/ssl/certs</caConfig>
        </client>
    </openSSL>
</config>

https://github.com/ClickHouse/ClickHouse/issues/17803

https://github.com/ClickHouse/ClickHouse/issues/18869

5.64 - Suspiciously many broken parts

Debugging a common error message

Symptom:

clickhouse fails to start with a message DB::Exception: Suspiciously many broken parts to remove.

Cause:

That exception is just a safeguard check/circuit breaker, triggered when clickhouse detects a lot of broken parts during server startup.

Parts are considered broken if they have bad checksums or some files are missing or malformed. Usually, that means the data was corrupted on the disk.

Why data could be corrupted?

the most often reason is a hard restart of the system, leading to a loss of the data which was not fully flushed to disk from the system page cache. Please be aware that by default ClickHouse doesn’t do fsync, so data is considered inserted after it was passed to the Linux page cache. See fsync-related settings in ClickHouse.
it can also be caused by disk failures, maybe there are bad blocks on hard disk, or logical problems, or some raid issue. Check system journals, use fsck / mdadm and other standard tools to diagnose the disk problem.
other reasons: manual intervention/bugs etc, for example, the data files or folders are removed by mistake or moved to another folder.

Action:

If you are ok to accept the data loss : set up force_restore_data flag and clickhouse will move the parts to detached. Data loss is possible if the issue is a result of misconfiguration (i.e. someone accidentally has fixed xml configs with incorrect shard/replica macros , data will be moved to detached folder and can be recovered).
```
sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data
```
then restart clickhouse. the table will be attached, and the broken parts will be detached, which means the data from those parts will not be available for the selects. You can see the list of those parts in the system.detached_parts table and drop them if needed using ALTER TABLE ... DROP DETACHED PART ... commands.
If you are ok to tolerate bigger losses automatically you can change that safeguard configuration to be less sensitive by increasing max_suspicious_broken_parts setting:
```
cat /etc/clickhouse-server/config.d/max_suspicious_broken_parts.xml
<?xml version="1.0"?>
<clickhouse>
     <merge_tree>
         <max_suspicious_broken_parts>50</max_suspicious_broken_parts>
     </merge_tree>
</clickhouse>
```
this limit is set to 100 by default in recent releases. We can set a bigger value (250 or more), but the data will be lost because of the corruption.
Check out also a similar setting max_suspicious_broken_parts_bytes.
See https://clickhouse.com/docs/en/operations/settings/merge-tree-settings/
If you can’t accept the data loss - you should recover data from backups / re-insert it once again etc.
If you don’t want to tolerate automatic detaching of broken parts, you can set max_suspicious_broken_parts_bytes and max_suspicious_broken_parts to 0.

Scenario illustrating / testing

Create table

create table t111(A UInt32) Engine=MergeTree order by A settings max_suspicious_broken_parts=1;
insert into t111 select number from numbers(100000);

Detach the table and make Data corruption

detach table t111;

cd /var/lib/clickhouse/data/default/t111/all_*** make data file corruption:

> data.bin

repeat for 2 or more data files.

Attach the table:

attach table t111;
 
Received exception from server (version 21.12.3):
Code: 231. DB::Exception: Received from localhost:9000. DB::Exception: Suspiciously many (2) broken parts to remove.. (TOO_MANY_UNEXPEC

setup force_restore_data flag

sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data
sudo service clickhouse-server restart

then the table t111 will be attached, losing the corrupted data.

5.65 - Threads

Threads

Count threads used by clickhouse-server

cat /proc/$(pidof -s clickhouse-server)/status | grep Threads
Threads: 103

ps hH $(pidof -s clickhouse-server) | wc -l
103

ps hH -AF | grep clickhouse | wc -l
116

Thread counts by type (using ps & clickhouse-local)

ps H -o 'tid comm' $(pidof -s clickhouse-server) |  tail -n +2 | awk '{ printf("%s\t%s\n", $1, $2) }' | clickhouse-local -S "threadid UInt16, name String" -q "SELECT name, count() FROM table GROUP BY name WITH TOTALS ORDER BY count() DESC FORMAT PrettyCompact"

Threads used by running queries:

SELECT query, length(thread_ids) AS threads_count FROM system.processes ORDER BY threads_count;

Thread pools limits & usage

SELECT
    name,
    value
FROM system.settings
WHERE name LIKE '%pool%'

┌─name─────────────────────────────────────────┬─value─┐
│ connection_pool_max_wait_ms                  │ 0     │
│ distributed_connections_pool_size            │ 1024  │
│ background_buffer_flush_schedule_pool_size   │ 16    │
│ background_pool_size                         │ 16    │
│ background_move_pool_size                    │ 8     │
│ background_fetches_pool_size                 │ 8     │
│ background_schedule_pool_size                │ 16    │
│ background_message_broker_schedule_pool_size │ 16    │
│ background_distributed_schedule_pool_size    │ 16    │
│ postgresql_connection_pool_size              │ 16    │
│ postgresql_connection_pool_wait_timeout      │ -1    │
│ odbc_bridge_connection_pool_size             │ 16    │
└──────────────────────────────────────────────┴───────┘

SELECT
    metric,
    value
FROM system.metrics
WHERE metric LIKE 'Background%'

┌─metric──────────────────────────────────┬─value─┐
│ BackgroundPoolTask                      │     0 │
│ BackgroundFetchesPoolTask               │     0 │
│ BackgroundMovePoolTask                  │     0 │
│ BackgroundSchedulePoolTask              │     0 │
│ BackgroundBufferFlushSchedulePoolTask   │     0 │
│ BackgroundDistributedSchedulePoolTask   │     0 │
│ BackgroundMessageBrokerSchedulePoolTask │     0 │
└─────────────────────────────────────────┴───────┘


SELECT *
FROM system.asynchronous_metrics
WHERE lower(metric) LIKE '%thread%'
ORDER BY metric ASC

┌─metric───────────────────────────────────┬─value─┐
│ HTTPThreads                              │     0 │
│ InterserverThreads                       │     0 │
│ MySQLThreads                             │     0 │
│ OSThreadsRunnable                        │     2 │
│ OSThreadsTotal                           │  2910 │
│ PostgreSQLThreads                        │     0 │
│ TCPThreads                               │     1 │
│ jemalloc.background_thread.num_runs      │     0 │
│ jemalloc.background_thread.num_threads   │     0 │
│ jemalloc.background_thread.run_intervals │     0 │
└──────────────────────────────────────────┴───────┘


SELECT *
FROM system.metrics
WHERE lower(metric) LIKE '%thread%'
ORDER BY metric ASC

Query id: 6acbb596-e28f-4f89-94b2-27dccfe88ee9

┌─metric─────────────┬─value─┬─description───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ GlobalThread       │   151 │ Number of threads in global thread pool.                                                                          │
│ GlobalThreadActive │   144 │ Number of threads in global thread pool running a task.                                                           │
│ LocalThread        │     0 │ Number of threads in local thread pools. The threads in local thread pools are taken from the global thread pool. │
│ LocalThreadActive  │     0 │ Number of threads in local thread pools running a task.                                                           │
│ QueryThread        │     0 │ Number of query processing threads                                                                                │
└────────────────────┴───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Stack traces of the working threads from the pools

SET allow_introspection_functions = 1;

WITH arrayMap(x -> demangle(addressToSymbol(x)), trace) AS all
SELECT
    thread_id,
    query_id,
    arrayStringConcat(all, '\n') AS res
FROM system.stack_trace
WHERE res ILIKE '%Pool%'
FORMAT Vertical;

5.66 - Who ate my ClickHouse® memory?

“It was here a few minutes ago…”

SYSTEM JEMALLOC PURGE;

SELECT 'OS' as group, metric as name, toInt64(value) as val FROM system.asynchronous_metrics WHERE metric like 'OSMemory%'
    UNION ALL
SELECT 'Caches' as group, metric as name, toInt64(value) FROM system.asynchronous_metrics WHERE metric LIKE '%CacheBytes'
    UNION ALL
SELECT 'Caches' as group, metric as name, toInt64(value) FROM system.metrics WHERE metric LIKE '%CacheBytes'
    UNION ALL
SELECT 'MMaps' as group, metric as name, toInt64(value) FROM system.metrics WHERE metric LIKE 'MMappedFileBytes'
    UNION ALL
SELECT 'Process' as group, metric as name, toInt64(value) FROM system.asynchronous_metrics WHERE metric LIKE 'Memory%'
    UNION ALL
SELECT 'MemoryTable', engine as name, toInt64(sum(total_bytes)) FROM system.tables WHERE engine IN ('Join','Memory','Buffer','Set') GROUP BY engine
    UNION ALL
SELECT 'StorageBuffer' as group, metric as name, toInt64(value) FROM system.metrics WHERE metric='StorageBufferBytes'
    UNION ALL
SELECT 'Queries' as group, left(query,7) as name, toInt64(sum(memory_usage)) FROM system.processes GROUP BY name
    UNION ALL
SELECT 'Dictionaries' as group, type as name, toInt64(sum(bytes_allocated)) FROM system.dictionaries GROUP BY name
    UNION ALL
SELECT 'PrimaryKeys' as group, 'db:'||database as name, toInt64(sum(primary_key_bytes_in_memory_allocated)) FROM system.parts GROUP BY name
    UNION ALL
SELECT 'Merges' as group, 'db:'||database as name, toInt64(sum(memory_usage)) FROM system.merges GROUP BY name
    UNION ALL
SELECT 'InMemoryParts' as group, 'db:'||database as name, toInt64(sum(data_uncompressed_bytes)) FROM system.parts WHERE part_type = 'InMemory' GROUP BY name
    UNION ALL
SELECT 'AsyncInserts' as group, 'db:'||database as name, toInt64(sum(total_bytes)) FROM system.asynchronous_inserts GROUP BY name
    UNION ALL
SELECT 'FileBuffersVirtual' as group, metric as name, toInt64(value * 2*1024*1024) FROM system.metrics WHERE metric like 'OpenFileFor%'
    UNION ALL
SELECT 'ThreadStacksVirual' as group, metric as name, toInt64(value * 8*1024*1024) FROM system.metrics WHERE metric = 'GlobalThread'
    UNION ALL
SELECT 'UserMemoryTracking' as group, user as name, toInt64(memory_usage) FROM system.user_processes
    UNION ALL
select 'QueryCacheBytes' as group, '', toInt64(sum(result_size)) FROM system.query_cache
    UNION ALL
SELECT 'MemoryTracking' as group, 'total' as name, toInt64(value) FROM system.metrics WHERE metric = 'MemoryTracking'

SELECT *, formatReadableSize(value) 
FROM system.metrics 
WHERE (metric ilike '%Cach%' or metric ilike '%Mem%') and value != 0
order by metric format PrettyCompactMonoBlock;

SELECT *, formatReadableSize(value) 
FROM system.asynchronous_metrics 
WHERE metric like '%Cach%' or metric like '%Mem%' 
order by metric format PrettyCompactMonoBlock;

SELECT event_time, metric, value, formatReadableSize(value) 
FROM system.asynchronous_metric_log 
WHERE event_time > now() - 600 and (metric like '%Cach%' or metric like '%Mem%') and value <> 0 
order by metric, event_time format PrettyCompactMonoBlock;

SELECT formatReadableSize(sum(bytes_allocated)) FROM system.dictionaries;

SELECT
    database,
    name,
    formatReadableSize(total_bytes)
FROM system.tables
WHERE engine IN ('Memory','Set','Join');

SELECT
    sumIf(data_uncompressed_bytes, part_type = 'InMemory') as memory_parts,
    formatReadableSize(sum(primary_key_bytes_in_memory)) AS primary_key_bytes_in_memory,
    formatReadableSize(sum(primary_key_bytes_in_memory_allocated)) AS primary_key_bytes_in_memory_allocated
FROM system.parts;

SELECT formatReadableSize(sum(memory_usage)) FROM system.merges;

SELECT formatReadableSize(sum(memory_usage)) FROM system.processes;

select formatReadableSize(sum(result_size)) FROM system.query_cache;

SELECT
    initial_query_id,
    elapsed,
    formatReadableSize(memory_usage),
    formatReadableSize(peak_memory_usage),
    query
FROM system.processes
ORDER BY peak_memory_usage DESC
LIMIT 10;

SELECT
    type,
    event_time,
    initial_query_id,
    formatReadableSize(memory_usage),
    query
FROM system.query_log
WHERE (event_date >= today()) AND (event_time >= (now() - 7200))
ORDER BY memory_usage DESC
LIMIT 10;

for i in `seq 1 600`; do clickhouse-client --empty_result_for_aggregation_by_empty_set=0 -q "select (select 'Merges: \
'||formatReadableSize(sum(memory_usage)) from system.merges), (select \
'Processes: '||formatReadableSize(sum(memory_usage)) from system.processes)";\
sleep 3;  done 

Merges: 96.57 MiB	Processes: 41.98 MiB
Merges: 82.24 MiB	Processes: 41.91 MiB
Merges: 66.33 MiB	Processes: 41.91 MiB
Merges: 66.49 MiB	Processes: 37.13 MiB
Merges: 67.78 MiB	Processes: 37.13 MiB

echo "         Merges      Processes       PrimaryK       TempTabs          Dicts"; \
for i in `seq 1 600`; do clickhouse-client --empty_result_for_aggregation_by_empty_set=0  -q "select \
(select leftPad(formatReadableSize(sum(memory_usage)),15, ' ') from system.merges)||
(select leftPad(formatReadableSize(sum(memory_usage)),15, ' ') from system.processes)||
(select leftPad(formatReadableSize(sum(primary_key_bytes_in_memory_allocated)),15, ' ') from system.parts)|| \
(select leftPad(formatReadableSize(sum(total_bytes)),15, ' ') from system.tables \
 WHERE engine IN ('Memory','Set','Join'))||
(select leftPad(formatReadableSize(sum(bytes_allocated)),15, ' ') FROM system.dictionaries)
"; sleep 3;  done 

         Merges      Processes       PrimaryK       TempTabs          Dicts
         0.00 B         0.00 B      21.36 MiB       1.58 GiB     911.07 MiB
         0.00 B         0.00 B      21.36 MiB       1.58 GiB     911.07 MiB
         0.00 B         0.00 B      21.35 MiB       1.58 GiB     911.07 MiB
         0.00 B         0.00 B      21.36 MiB       1.58 GiB     911.07 MiB

retrospection analysis of the RAM usage based on query_log and part_log (shows peaks)

WITH 
    now() - INTERVAL 24 HOUR AS min_time,  -- you can adjust that
    now() AS max_time,   -- you can adjust that
    INTERVAL 1 HOUR as time_frame_size
SELECT 
    toStartOfInterval(event_timestamp, time_frame_size) as timeframe,
    formatReadableSize(max(mem_overall)) as peak_ram,
    formatReadableSize(maxIf(mem_by_type, event_type='Insert'))     as inserts_ram,
    formatReadableSize(maxIf(mem_by_type, event_type='Select'))     as selects_ram,
    formatReadableSize(maxIf(mem_by_type, event_type='MergeParts')) as merge_ram,
    formatReadableSize(maxIf(mem_by_type, event_type='MutatePart')) as mutate_ram,
    formatReadableSize(maxIf(mem_by_type, event_type='Alter'))      as alter_ram,
    formatReadableSize(maxIf(mem_by_type, event_type='Create'))     as create_ram,
    formatReadableSize(maxIf(mem_by_type, event_type not IN ('Insert', 'Select', 'MergeParts','MutatePart', 'Alter', 'Create') )) as other_types_ram,
    groupUniqArrayIf(event_type, event_type not IN ('Insert', 'Select', 'MergeParts','MutatePart', 'Alter', 'Create') ) as other_types
FROM (
    SELECT 
        toDateTime( toUInt32(ts) ) as event_timestamp,
        t as event_type,
        SUM(mem) OVER (PARTITION BY t ORDER BY ts) as mem_by_type,
        SUM(mem) OVER (ORDER BY ts) as mem_overall
    FROM 
    (
        WITH arrayJoin([(toFloat64(event_time_microseconds) - (duration_ms / 1000), toInt64(peak_memory_usage)), (toFloat64(event_time_microseconds), -peak_memory_usage)]) AS data
        SELECT
        CAST(event_type,'LowCardinality(String)') as t,
        data.1 as ts,
        data.2 as mem
        FROM system.part_log
        WHERE event_time BETWEEN min_time AND max_time AND peak_memory_usage != 0

        UNION ALL 

        WITH arrayJoin([(toFloat64(query_start_time_microseconds), toInt64(memory_usage)), (toFloat64(event_time_microseconds), -memory_usage)]) AS data
        SELECT 
        query_kind,
        data.1 as ts,
        data.2 as mem
        FROM system.query_log
        WHERE event_time BETWEEN min_time AND max_time AND memory_usage != 0

        UNION ALL 

        WITH 
        arrayJoin([(toFloat64(event_time_microseconds) - (view_duration_ms / 1000), toInt64(peak_memory_usage)), (toFloat64(event_time_microseconds), -peak_memory_usage)]) AS data
        SELECT
        CAST(toString(view_type)||'View','LowCardinality(String)') as t,
        data.1 as ts,
        data.2 as mem
        FROM system.query_views_log
        WHERE event_time BETWEEN min_time AND max_time AND peak_memory_usage != 0
)
)
GROUP BY timeframe
ORDER BY timeframe
FORMAT PrettyCompactMonoBlock;

retrospection analysis of trace_log

WITH 
    now() - INTERVAL 24 HOUR AS min_time,  -- you can adjust that
    now() AS max_time   -- you can adjust that
SELECT
    trace_type,
    count(),
    topK(20)(query_id)
FROM system.trace_log
WHERE event_time BETWEEN min_time AND max_time
GROUP BY trace_type;

SELECT
    t,
    count() AS queries,
    formatReadableSize(sum(peak_size)) AS sum_of_peaks,
    formatReadableSize(max(peak_size)) AS biggest_query_peak,
    argMax(query_id, peak_size) AS query
FROM
(
    SELECT
        toStartOfInterval(event_time, toIntervalMinute(5)) AS t,
        query_id,
        max(size) AS peak_size
    FROM system.trace_log
    WHERE (trace_type = 'MemoryPeak') AND (event_time > (now() - toIntervalHour(24)))
    GROUP BY
        t,
        query_id
)
GROUP BY t
ORDER BY t ASC;

-- later on you can check particular query_ids in query_log

analysis of the server text logs

grep MemoryTracker /var/log/clickhouse-server.log
zgrep MemoryTracker /var/log/clickhouse-server.log.*.gz

5.67 - X rows of Y total rows in filesystem are suspicious

X rows of Y total rows in filesystem are suspicious

Warning

The local set of parts of table doesn’t look like the set of parts in ZooKeeper. 100.00 rows of 150.00 total rows in filesystem are suspicious. There are 1 unexpected parts with 100 rows (1 of them is not just-written with 100 rows), 0 missing parts (with 0 blocks).: Cannot attach table.

ClickHouse has a registry of parts in ZooKeeper.

And during the start ClickHouse compares that list of parts on a local disk is consistent with a list in ZooKeeper. If the lists are too different ClickHouse denies to start because it could be an issue with settings, wrong Shard or wrong Replica macros. But this safe-limiter throws an exception if the difference is more 50% (in rows).

In your case the table is very small and the difference >50% ( 100.00 vs 150.00 ) is only a single part mismatch, which can be the result of hard restart.

SELECT * FROM system.merge_tree_settings WHERE name = 'replicated_max_ratio_of_wrong_parts'

┌─name────────────────────────────────┬─value─┬─changed─┬─description──────────────────────────────────────────────────────────────────────────┬─type──┐
│ replicated_max_ratio_of_wrong_parts │ 0.5   │       0 │ If ratio of wrong parts to total number of parts is less than this - allow to start. │ Float │
└─────────────────────────────────────┴───────┴─────────┴──────────────────────────────────────────────────────────────────────────────────────┴───────┘

You can set another value of replicated_max_ratio_of_wrong_parts for all MergeTree tables or per table.

https://clickhouse.tech/docs/en/operations/settings/merge-tree-settings

After manipulation with storage_policies and disks

When storage policy changes (one disk was removed from it), ClickHouse compared parts on disk and this replica state in ZooKeeper and found out that a lot of parts (from removed disk) disappeared. So ClickHouse removed them from the replica state in ZooKeeper and scheduled to fetch them from other replicas.

After we add the removed disk to storage_policy back, ClickHouse finds missing parts, but at this moment they are not registered for that replica. ClickHouse produce error message like this:

Warning

Application: DB::Exception: The local set of parts of table default.tbl doesn’t look like the set of parts in ZooKeeper: 14.96 billion rows of 16.24 billion total rows in filesystem are suspicious. There are 45 unexpected parts with 14960302620 rows (43 of them is not just-written with 14959824636 rows), 0 missing parts (with 0 blocks).: Cannot attach table default.tbl from metadata file /var/lib/clickhouse/metadata/default/tbl.sql from query ATTACH TABLE default.tbl … ENGINE=ReplicatedMergeTree(’/clickhouse/tables/0/default/tbl’, ‘replica-0’)… SETTINGS index_granularity = 1024, storage_policy = ’ebs_hot_and_cold’: while loading database default from path /var/lib/clickhouse/metadata/data

At this point, it’s possible to either tune setting replicated_max_ratio_of_wrong_parts or do force restore, but it will end up downloading all “missing” parts from other replicas, which can take a lot of time for big tables.

ClickHouse 21.7+

Rename table SQL attach script in order to prevent ClickHouse from attaching it at startup.

mv /var/lib/clickhouse/metadata/default/tbl.sql /var/lib/clickhouse/metadata/default/tbl.sql.bak

Start ClickHouse server.
Remove metadata for this replica from ZooKeeper.

SYSTEM DROP REPLICA 'replica-0' FROM ZKPATH '/clickhouse/tables/0/default/tbl';

SELECT * FROM system.zookeeper WHERE path = '/clickhouse/tables/0/default/tbl/replicas';

Rename table SQL attach script back to normal name.

mv /var/lib/clickhouse/metadata/default/tbl.sql.bak /var/lib/clickhouse/metadata/default/tbl.sql

Attach table to ClickHouse server, because there is no metadata in ZooKeeper, ClickHouse will attach it in read only state.

ATTACH TABLE default.tbl;

Run SYSTEM RESTORE REPLICA in order to sync state on disk and in ZooKeeper.

SYSTEM RESTORE REPLICA default.tbl;

Run SYSTEM SYNC REPLICA to download missing parts from other replicas.

SYSTEM SYNC REPLICA default.tbl;

5.68 - ZooKeeper

ZooKeeper

Requirements

TLDR version:

USE DEDICATED FAST DISKS for the transaction log! (crucial for performance due to write-ahead-log, NVMe is preferred for heavy load setup).
use 3 nodes (more nodes = slower quorum, less = no HA).
low network latency between zookeeper nodes is very important (latency, not bandwidth).
have at least 4Gb of RAM, disable swap, tune JVM sizes, and garbage collector settings
ensure that zookeeper will not be CPU-starved by some other processes
monitor zookeeper .

Side note: in many cases, the slowness of the zookeeper is actually a symptom of some issue with ClickHouse® schema/usage pattern (the most typical issues: an enormous number of partitions/tables/databases with real-time inserts, tiny & frequent inserts).

How to install

Random links on best practices

Cite from https://zookeeper.apache.org/doc/r3.5.7/zookeeperAdmin.html#sc_commonProblems :

Things to Avoid
Here are some common problems you can avoid by configuring ZooKeeper correctly:
inconsistent lists of servers : The list of ZooKeeper servers used by the clients must match the list of ZooKeeper servers that each ZooKeeper server has. Things work okay if the client list is a subset of the real list, but things will really act strange if clients have a list of ZooKeeper servers that are in different ZooKeeper clusters. Also, the server lists in each Zookeeper server configuration file should be consistent with one another.
incorrect placement of transaction log : The most performance critical part of ZooKeeper is the transaction log. ZooKeeper syncs transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, increase the snapCount so that snapshot files are generated less often; it does not eliminate the problem, but it makes more resources available for the transaction log.
incorrect Java heap size : You should take special care to set your Java max heap size correctly. In particular, you should not create a situation in which ZooKeeper swaps to disk. The disk is death to ZooKeeper. Everything is ordered, so if processing one request swaps the disk, all other queued requests will probably do the same. the disk. DON’T SWAP. Be conservative in your estimates: if you have 4G of RAM, do not set the Java max heap size to 6G or even 4G. For example, it is more likely you would use a 3G heap for a 4G machine, as the operating system and the cache also need memory. The best and only recommend practice for estimating the heap size your system needs is to run load tests, and then make sure you are well below the usage limit that would cause the system to swap.
Publicly accessible deployment : A ZooKeeper ensemble is expected to operate in a trusted computing environment. It is thus recommended to deploy ZooKeeper behind a firewall.

How to check number of followers:

echo mntr | nc zookeeper 2187 | grep foll
zk_synced_followers    2
zk_synced_non_voting_followers    0
zk_avg_follower_sync_time    0.0
zk_min_follower_sync_time    0
zk_max_follower_sync_time    0
zk_cnt_follower_sync_time    0
zk_sum_follower_sync_time    0

Tools

https://github.com/apache/zookeeper/blob/master/zookeeper-docs/src/main/resources/markdown/zookeeperTools.md

Alternative for zkCli

https://github.com/go-zkcli/zkcli

Web UI

5.68.1 - clickhouse-keeper-initd

clickhouse-keeper-initd

An init.d script for clickhouse-keeper. This example is based on zkServer.sh

#!/bin/bash
### BEGIN INIT INFO
# Provides:          clickhouse-keeper
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Required-Start:
# Required-Stop:
# Short-Description: Start keeper daemon
# Description: Start keeper daemon
### END INIT INFO

NAME=clickhouse-keeper
ZOOCFGDIR=/etc/$NAME
ZOOCFG="$ZOOCFGDIR/keeper.xml"
ZOO_LOG_DIR=/var/log/$NAME
USER=clickhouse
GROUP=clickhouse
ZOOPIDDIR=/var/run/$NAME
ZOOPIDFILE=$ZOOPIDDIR/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME

#echo "Using config: $ZOOCFG" >&2
ZOOCMD="clickhouse-keeper -C ${ZOOCFG} start --daemon"

# ensure PIDDIR exists, otw stop will fail
mkdir -p "$(dirname "$ZOOPIDFILE")"

if [ ! -w "$ZOO_LOG_DIR" ] ; then
mkdir -p "$ZOO_LOG_DIR"
fi

case $1 in
start)
    echo -n "Starting keeper ... "
    if [ -f "$ZOOPIDFILE" ]; then
      if kill -0 `cat "$ZOOPIDFILE"` > /dev/null 2>&1; then
         echo already running as process `cat "$ZOOPIDFILE"`.
         exit 0
      fi
    fi
    sudo -u clickhouse `echo "$ZOOCMD"`
    if [ $? -eq 0 ]
    then
      pgrep -f "$ZOOCMD" > "$ZOOPIDFILE"
      echo "PID:" `cat $ZOOPIDFILE`
      if [ $? -eq 0 ];
      then
        sleep 1
        echo STARTED
      else
        echo FAILED TO WRITE PID
        exit 1
      fi
    else
      echo SERVER DID NOT START
      exit 1
    fi
    ;;
start-foreground)
    sudo -u clickhouse clickhouse-keeper -C "$ZOOCFG" start
    ;;
print-cmd)
    echo "sudo -u clickhouse ${ZOOCMD}"
    ;;
stop)
    echo -n "Stopping keeper ... "
    if [ ! -f "$ZOOPIDFILE" ]
    then
      echo "no keeper to stop (could not find file $ZOOPIDFILE)"
    else
      ZOOPID=$(cat "$ZOOPIDFILE")
      echo $ZOOPID
      kill $ZOOPID
      while true; do
         sleep 3
         if kill -0 $ZOOPID > /dev/null 2>&1; then
            echo $ZOOPID is still running
         else
            break
         fi
      done
      rm "$ZOOPIDFILE"
      echo STOPPED
    fi
    exit 0
    ;;
restart)
    shift
    "$0" stop ${@}
    sleep 3
    "$0" start ${@}
    ;;
status)
    clientPortAddress="localhost"
    clientPort=2181
    STAT=`echo srvr | nc $clientPortAddress $clientPort 2> /dev/null | grep Mode`
    if [ "x$STAT" = "x" ]
    then
        echo "Error contacting service. It is probably not running."
        exit 1
    else
        echo $STAT
        exit 0
    fi
    ;;
*)
    echo "Usage: $0 {start|start-foreground|stop|restart|status|print-cmd}" >&2

esac

5.68.2 - clickhouse-keeper-service

clickhouse-keeper-service

installation

Need to install clickhouse-common-static + clickhouse-keeper OR clickhouse-common-static + clickhouse-server. Both OK, use the first if you don’t need ClickHouse® server locally.

dpkg -i clickhouse-common-static_{%version}.deb clickhouse-keeper_{%version}.deb

dpkg -i clickhouse-common-static_{%version}.deb clickhouse-server_{%version}.deb clickhouse-client_{%version}.deb

Create directories

mkdir -p /etc/clickhouse-keeper/config.d
mkdir -p /var/log/clickhouse-keeper
mkdir -p /var/lib/clickhouse-keeper/coordination/log
mkdir -p /var/lib/clickhouse-keeper/coordination/snapshots
mkdir -p /var/lib/clickhouse-keeper/cores

chown -R clickhouse.clickhouse /etc/clickhouse-keeper /var/log/clickhouse-keeper /var/lib/clickhouse-keeper

config

cat /etc/clickhouse-keeper/config.xml

<?xml version="1.0"?>
<clickhouse>
    <logger>
        <!-- Possible levels [1]:

          - none (turns off logging)
          - fatal
          - critical
          - error
          - warning
          - notice
          - information
          - debug
          - trace
          - test (not for production usage)

            [1]: https://github.com/pocoproject/poco/blob/poco-1.9.4-release/Foundation/include/Poco/Logger.h#L105-L114
        -->
        <level>trace</level>
        <log>/var/log/clickhouse-keeper/clickhouse-keeper.log</log>
        <errorlog>/var/log/clickhouse-keeper/clickhouse-keeper.err.log</errorlog>
        <!-- Rotation policy
             See https://github.com/pocoproject/poco/blob/poco-1.9.4-release/Foundation/include/Poco/FileChannel.h#L54-L85
          -->
        <size>1000M</size>
        <count>10</count>
        <!-- <console>1</console> --> <!-- Default behavior is autodetection (log to console if not daemon mode and is tty) -->

        <!-- Per level overrides (legacy):

        For example to suppress logging of the ConfigReloader you can use:
        NOTE: levels.logger is reserved, see below.
        -->
        <!--
        <levels>
          <ConfigReloader>none</ConfigReloader>
        </levels>
        -->

        <!-- Per level overrides:

        For example to suppress logging of the RBAC for default user you can use:
        (But please note that the logger name maybe changed from version to version, even after minor upgrade)
        -->
        <!--
        <levels>
          <logger>
            <name>ContextAccess (default)</name>
            <level>none</level>
          </logger>
          <logger>
            <name>DatabaseOrdinary (test)</name>
            <level>none</level>
          </logger>
        </levels>
        -->
        <!-- Structured log formatting:
        You can specify log format(for now, JSON only). In that case, the console log will be printed
        in specified format like JSON.
        For example, as below:
        {"date_time":"1650918987.180175","thread_name":"#1","thread_id":"254545","level":"Trace","query_id":"","logger_name":"BaseDaemon","message":"Received signal 2","source_file":"../base/daemon/BaseDaemon.cpp; virtual void SignalListener::run()","source_line":"192"}
        To enable JSON logging support, just uncomment <formatting> tag below.
        -->
        <!-- <formatting>json</formatting> -->
    </logger>

    <!-- Listen specified address.
     Use :: (wildcard IPv6 address), if you want to accept connections both with IPv4 and IPv6 from everywhere.
     Notes:
     If you open connections from wildcard address, make sure that at least one of the following measures applied:
     - server is protected by firewall and not accessible from untrusted networks;
     - all users are restricted to subset of network addresses (see users.xml);
     - all users have strong passwords, only secure (TLS) interfaces are accessible, or connections are only made via TLS interfaces.
     - users without password have readonly access.
     See also: https://www.shodan.io/search?query=clickhouse
    -->
    <!-- <listen_host>::</listen_host> -->


    <!-- Same for hosts without support for IPv6: -->
    <!-- <listen_host>0.0.0.0</listen_host> -->

    <!-- Default values - try listen localhost on IPv4 and IPv6. -->
    <!--
    <listen_host>::1</listen_host>
    <listen_host>127.0.0.1</listen_host>
    -->

    <!-- <interserver_listen_host>::</interserver_listen_host> -->
    <!-- Listen host for communication between replicas. Used for data exchange -->
    <!-- Default values - equal to listen_host -->

    <!-- Don't exit if IPv6 or IPv4 networks are unavailable while trying to listen. -->
    <!-- <listen_try>0</listen_try> -->

    <!-- Allow multiple servers to listen on the same address:port. This is not recommended.
    -->
    <!-- <listen_reuse_port>0</listen_reuse_port> -->
    <!-- <listen_backlog>4096</listen_backlog> -->

    <path>/var/lib/clickhouse-keeper/</path>
    <core_path>/var/lib/clickhouse-keeper/cores</core_path>

    <keeper_server>
	    <tcp_port>2181</tcp_port>
	    <server_id>1</server_id>
	    <log_storage_path>/var/lib/clickhouse-keeper/coordination/log</log_storage_path>
	    <snapshot_storage_path>/var/lib/clickhouse-keeper/coordination/snapshots</snapshot_storage_path>

	    <coordination_settings>
        	<operation_timeout_ms>10000</operation_timeout_ms>
	        <session_timeout_ms>30000</session_timeout_ms>
	        <raft_logs_level>trace</raft_logs_level>
	        <rotate_log_storage_interval>10000</rotate_log_storage_interval>
	    </coordination_settings>

            <raft_configuration>
	              <server>
                   <id>1</id>
                   <hostname>localhost</hostname>
                   <port>9444</port>
                </server>
           </raft_configuration>
    </keeper_server>
</clickhouse>

cat /etc/clickhouse-keeper/config.d/keeper.xml
<?xml version="1.0"?>
<clickhouse>
    <listen_host>::</listen_host>
    <keeper_server>
            <tcp_port>2181</tcp_port>
            <server_id>1</server_id>
            <raft_configuration>
                <server>
                   <id>1</id>
       	           <hostname>keeper-host-1</hostname>
                   <port>9444</port>
                </server>
                <server>
                   <id>2</id>
                   <hostname>keeper-host-2</hostname>
                   <port>9444</port>
                </server>
                <server>
                   <id>3</id>
                   <hostname>keeper-host-3</hostname>
                   <port>9444</port>
                </server>                
           </raft_configuration>
    </keeper_server>
</clickhouse>

systemd service

cat /lib/systemd/system/clickhouse-keeper.service
[Unit]
Description=ClickHouse Keeper (analytic DBMS for big data)
Requires=network-online.target
# NOTE: that After/Wants=time-sync.target is not enough, you need to ensure
# that the time was adjusted already, if you use systemd-timesyncd you are
# safe, but if you use ntp or some other daemon, you should configure it
# additionaly.
After=time-sync.target network-online.target
Wants=time-sync.target

[Service]
Type=simple
User=clickhouse
Group=clickhouse
Restart=always
RestartSec=30
RuntimeDirectory=clickhouse-keeper
ExecStart=/usr/bin/clickhouse-keeper --config=/etc/clickhouse-keeper/config.xml --pid-file=/run/clickhouse-keeper/clickhouse-keeper.pid
# Minus means that this file is optional.
EnvironmentFile=-/etc/default/clickhouse
LimitCORE=infinity
LimitNOFILE=500000
CapabilityBoundingSet=CAP_NET_ADMIN CAP_IPC_LOCK CAP_SYS_NICE CAP_NET_BIND_SERVICE

[Install]
# ClickHouse should not start from the rescue shell (rescue.target).
WantedBy=multi-user.target

systemctl daemon-reload

systemctl status clickhouse-keeper

systemctl start clickhouse-keeper

debug start without service (as foreground application)

sudo -u clickhouse /usr/bin/clickhouse-keeper --config=/etc/clickhouse-keeper/config.xml

5.68.3 - Install standalone Zookeeper for ClickHouse® on Ubuntu / Debian

Install standalone Zookeeper for ClickHouse® on Ubuntu / Debian.

Reference script to install standalone Zookeeper for Ubuntu / Debian

Tested on Ubuntu 20.

# install java runtime environment
sudo apt-get update
sudo apt install default-jre

# prepare folders, logs folder should be on the low-latency disk.
sudo mkdir -p /var/lib/zookeeper/data /var/lib/zookeeper/logs /etc/zookeeper /var/log/zookeeper /opt 

# download and install files 
export ZOOKEEPER_VERSION=3.6.3
wget https://dlcdn.apache.org/zookeeper/zookeeper-${ZOOKEEPER_VERSION}/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz -O /tmp/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz
sudo tar -xvf /tmp/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz -C /opt
rm -rf /tmp/apache-zookeeper-${ZOOKEEPER_VERSION}-bin.tar.gz

# create the user 
sudo groupadd -r zookeeper
sudo useradd -r -g zookeeper --home-dir=/var/lib/zookeeper --shell=/bin/false zookeeper

# symlink pointing to the used version of zookeeper distibution
sudo ln -s /opt/apache-zookeeper-${ZOOKEEPER_VERSION}-bin /opt/zookeeper 
sudo chown -R zookeeper:zookeeper /var/lib/zookeeper /var/log/zookeeper /etc/zookeeper /opt/apache-zookeeper-${ZOOKEEPER_VERSION}-bin
sudo chown -h zookeeper:zookeeper /opt/zookeeper

# shortcuts in /usr/local/bin/
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkCli.sh "$@"'             | sudo tee /usr/local/bin/zkCli
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkServer.sh "$@"'          | sudo tee /usr/local/bin/zkServer
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkCleanup.sh "$@"'         | sudo tee /usr/local/bin/zkCleanup
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkSnapShotToolkit.sh "$@"' | sudo tee /usr/local/bin/zkSnapShotToolkit
echo -e '#!/usr/bin/env bash\n/opt/zookeeper/bin/zkTxnLogToolkit.sh "$@"'   | sudo tee /usr/local/bin/zkTxnLogToolkit
sudo chmod +x /usr/local/bin/zkCli /usr/local/bin/zkServer /usr/local/bin/zkCleanup /usr/local/bin/zkSnapShotToolkit /usr/local/bin/zkTxnLogToolkit

# put in the config
sudo cp opt/zookeeper/conf/* /etc/zookeeper
cat <<EOF | sudo tee /etc/zookeeper/zoo.cfg
initLimit=20
syncLimit=10
maxSessionTimeout=60000000
maxClientCnxns=2000
preAllocSize=131072
snapCount=3000000
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/logs # use low-latency disk!
clientPort=2181
#clientPortAddress=nthk-zoo1.localdomain
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
4lw.commands.whitelist=*
EOF
sudo chown -R zookeeper:zookeeper /etc/zookeeper

# create systemd service file
cat <<EOF | sudo tee /etc/systemd/system/zookeeper.service
[Unit]
Description=Zookeeper Daemon
Documentation=http://zookeeper.apache.org
Requires=network.target
After=network.target

[Service]
Type=forking
WorkingDirectory=/var/lib/zookeeper
User=zookeeper
Group=zookeeper
Environment=ZK_SERVER_HEAP=1536 # in megabytes, adjust to ~ 80-90% of avaliable RAM (more than 8Gb is rather overkill)
Environment=SERVER_JVMFLAGS="-Xms256m -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"
Environment=ZOO_LOG_DIR=/var/log/zookeeper
ExecStart=/opt/zookeeper/bin/zkServer.sh start /etc/zookeeper/zoo.cfg
ExecStop=/opt/zookeeper/bin/zkServer.sh stop /etc/zookeeper/zoo.cfg
ExecReload=/opt/zookeeper/bin/zkServer.sh restart /etc/zookeeper/zoo.cfg
TimeoutSec=30
Restart=on-failure

[Install]
WantedBy=default.target
EOF

# start zookeeper
sudo systemctl daemon-reload
sudo systemctl start zookeeper.service 

# check status etc.
echo stat | nc localhost 2181
echo ruok | nc localhost 2181
echo mntr | nc localhost 2181

5.68.4 - How to check the list of watches

How to check the list of watches

Zookeeper use watches to notify a client on znode changes. This article explains how to check watches set by ZooKeeper servers and how it is used.

Solution:

Zookeeper uses the 'wchc' command to list all watches set on the Zookeeper server.

# echo wchc | nc zookeeper 2181

Reference

https://zookeeper.apache.org/doc/r3.4.12/zookeeperAdmin.html

The wchp and wchc commands are not enabled by default because of their known DOS vulnerability. For more information, see ZOOKEEPER-2693 and Zookeeper 3.5.2 - Denial of Service .

By default those commands are disabled, they can be enabled via Java system property:

-Dzookeeper.4lw.commands.whitelist=*

on in zookeeper config: 4lw.commands.whitelist=*\

5.68.5 - JVM sizes and garbage collector settings

JVM sizes and garbage collector settings

TLDR version

use fresh Java version (11 or newer), disable swap and set up (for 4 Gb node):

JAVA_OPTS="-Xms512m -Xmx3G -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"

If you have a node with more RAM - change it accordingly, for example for 8Gb node:

JAVA_OPTS="-Xms512m -Xmx7G -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608 -XX:MaxGCPauseMillis=50"

Details

ZooKeeper runs as in JVM. Depending on version different garbage collectors are available.
Recent JVM versions (starting from 10) use G1 garbage collector by default (should work fine). On JVM 13-14 using ZGC or Shenandoah garbage collector may reduce pauses. On older JVM version (before 10) you may want to make some tuning to decrease pauses, ParNew + CMS garbage collectors (like in Yandex config) is one of the best options.
One of the most important setting for JVM application is heap size. A heap size of >1 GB is recommended for most use cases and monitoring heap usage to ensure no delays are caused by garbage collection. We recommend to use at least 4Gb of RAM for zookeeper nodes (8Gb is better, that will make difference only when zookeeper is heavily loaded).

Set the Java heap size smaller than available RAM size on the node. This is very important to avoid swapping, which will seriously degrade ZooKeeper performance. Be conservative - use a maximum heap size of 3GB for a 4GB machine.

Add XX:+AlwaysPreTouch flag as well to load the memory pages into memory at the start of the zookeeper.
Set min (Xms) heap size to the values like 512Mb, or even to the same value as max (Xmx) to avoid resizing and returning the RAM to OS. Add XX:+AlwaysPreTouch flag as well to load the memory pages into memory at the start of the zookeeper.
MaxGCPauseMillis=50 (by default 200) - the ’target’ acceptable pause for garbage collection (milliseconds)
jute.maxbuffer limits the maximum size of znode content. By default it’s 1Mb. In some usecases (lot of partitions in table) ClickHouse® may need to create bigger znodes.
(optional) enable GC logs: -Xloggc:/path_to/gc.log

Zookeeper configuration used by Yandex Metrika (from 2017)

The configuration used by Yandex ( https://clickhouse.com/docs/en/operations/tips#zookeeper ) - they use older JVM version (with UseParNewGC garbage collector), and tune GC logs heavily:

JAVA_OPTS="-Xms{{ cluster.get('xms','128M') }} \
    -Xmx{{ cluster.get('xmx','1G') }} \
    -Xloggc:/var/log/$NAME/zookeeper-gc.log \
    -XX:+UseGCLogFileRotation \
    -XX:NumberOfGCLogFiles=16 \
    -XX:GCLogFileSize=16M \
    -verbose:gc \
    -XX:+PrintGCTimeStamps \
    -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails
    -XX:+PrintTenuringDistribution \
    -XX:+PrintGCApplicationStoppedTime \
    -XX:+PrintGCApplicationConcurrentTime \
    -XX:+PrintSafepointStatistics \
    -XX:+UseParNewGC \
    -XX:+UseConcMarkSweepGC \
    -XX:+CMSParallelRemarkEnabled"

5.68.6 - Proper setup

Proper setup

Main docs article

https://docs.altinity.com/operationsguide/clickhouse-zookeeper/zookeeper-installation/

Hardware requirements

TLDR version:

USE DEDICATED FAST DISKS for the transaction log! (crucial for performance due to write-ahead-log, NVMe is preferred for heavy load setup).
use 3 nodes (more nodes = slower quorum, less = no HA).
low network latency between zookeeper nodes is very important (latency, not bandwidth).
have at least 4Gb of RAM, disable swap, tune JVM sizes, and garbage collector settings.
ensure that zookeeper will not be CPU-starved by some other processes
monitor zookeeper.

Some doc about that subject:

Cite from https://zookeeper.apache.org/doc/r3.5.7/zookeeperAdmin.html#sc_commonProblems :

Things to Avoid
Here are some common problems you can avoid by configuring ZooKeeper correctly:
inconsistent lists of servers : The list of ZooKeeper servers used by the clients must match the list of ZooKeeper servers that each ZooKeeper server has. Things work okay if the client list is a subset of the real list, but things will really act strange if clients have a list of ZooKeeper servers that are in different ZooKeeper clusters. Also, the server lists in each Zookeeper server configuration file should be consistent with one another.
incorrect placement of transaction log : The most performance critical part of ZooKeeper is the transaction log. ZooKeeper syncs transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, increase the snapCount so that snapshot files are generated less often; it does not eliminate the problem, but it makes more resources available for the transaction log.
incorrect Java heap size : You should take special care to set your Java max heap size correctly. In particular, you should not create a situation in which ZooKeeper swaps to disk. The disk is death to ZooKeeper. Everything is ordered, so if processing one request swaps the disk, all other queued requests will probably do the same. the disk. DON’T SWAP. Be conservative in your estimates: if you have 4G of RAM, do not set the Java max heap size to 6G or even 4G. For example, it is more likely you would use a 3G heap for a 4G machine, as the operating system and the cache also need memory. The best and only recommend practice for estimating the heap size your system needs is to run load tests, and then make sure you are well below the usage limit that would cause the system to swap.
Publicly accessible deployment : A ZooKeeper ensemble is expected to operate in a trusted computing environment. It is thus recommended to deploy ZooKeeper behind a firewall.

5.68.7 - Recovering from complete metadata loss in ZooKeeper

Recovering from complete metadata loss in ZooKeeper

Problem

Every ClickHouse® user experienced a loss of ZooKeeper one day. While the data is available and replicas respond to queries, inserts are no longer possible. ClickHouse uses ZooKeeper in order to store the reference version of the table structure and part of data, and when it is not available can not guarantee data consistency anymore. Replicated tables turn to the read-only mode. In this article we describe step-by-step instructions of how to restore ZooKeeper metadata and bring ClickHouse cluster back to normal operation.

In order to restore ZooKeeper we have to solve two tasks. First, we need to restore table metadata in ZooKeeper. Currently, the only way to do it is to recreate the table with the CREATE TABLE DDL statement.

CREATE TABLE table_name ... ENGINE=ReplicatedMergeTree('zookeeper_path','replica_name');

The second and more difficult task is to populate zookeeper with information of ClickHouse data parts. As mentioned above, ClickHouse stores the reference data about all parts of replicated tables in ZooKeeper, so we have to traverse all partitions and re-attach them to the recovered replicated table in order to fix that.

Info

Starting from ClickHouse version 21.7 there is SYSTEM RESTORE REPLICA command

https://altinity.com/blog/a-new-way-to-restore-clickhouse-after-zookeeper-metadata-is-lost

Test case

Let’s say we have replicated table table_repl.

CREATE TABLE table_repl 
(
   `number` UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/table_repl','{replica}')
PARTITION BY intDiv(number, 1000)
ORDER BY number;

And populate it with some data

SELECT * FROM system.zookeeper WHERE path='/clickhouse/cluster_1/tables/01/';

INSERT INTO table_repl SELECT * FROM numbers(1000,2000);

SELECT partition, sum(rows) AS rows, count() FROM system.parts WHERE table='table_repl' AND active GROUP BY partition;

Now let’s remove metadata in zookeeper using ZkCli.sh at ZooKeeper host:

deleteall  /clickhouse/cluster_1/tables/01/table_repl

And try to resync ClickHouse replica state with zookeeper:

SYSTEM RESTART REPLICA table_repl;

If we try to insert some data in the table, error happens:

INSERT INTO table_repl SELECT number AS number FROM numbers(1000,2000) WHERE number % 2 = 0;

And now we have an exception that we lost all metadata in zookeeper. It is time to recover!

Current Solution

Detach replicated table.
```
DETACH TABLE table_repl;
```

Save the table’s attach script and change engine of replicated table to non-replicated *mergetree analogue. Table definition is located in the ‘metadata’ folder, ‘/var/lib/clickhouse/metadata/default/table_repl.sql’ in our example. Please make a backup copy and modify the file as follows:

ATTACH TABLE table_repl
(
   `number` UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/table_repl', '{replica}')
PARTITION BY intDiv(number, 1000)
ORDER BY number
SETTINGS index_granularity = 8192

Needs to be replaced with this:

ATTACH TABLE table_repl
(
   `number` UInt32
)
ENGINE = MergeTree()
PARTITION BY intDiv(number, 1000)
ORDER BY number
SETTINGS index_granularity = 8192

Attach non-replicated table.
```
ATTACH TABLE table_repl;
```

Rename non-replicated table.

RENAME TABLE table_repl TO table_repl_old;

Create a new replicated table. Take the saved attach script and replace ATTACH with CREATE, and run it.

CREATE TABLE table_repl
(
   `number` UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/table_repl', '{replica}')
PARTITION BY intDiv(number, 1000)
ORDER BY number
SETTINGS index_granularity = 8192

Attach parts from old table to new.

ALTER TABLE table_repl ATTACH PARTITION 1 FROM table_repl_old;

ALTER TABLE table_repl ATTACH PARTITION 2 FROM table_repl_old;

If the table has many partitions, it may require some shell script to make it easier.

Automated approach

For a large number of tables, you can use script https://github.com/Altinity/clickhouse-zookeeper-recovery which partially automates the above approach.

5.68.8 - Using clickhouse-keeper

Moving to the ClickHouse® alternative to Zookeeper

Since 2021 the development of built-in ClickHouse® alternative for Zookeeper is happening, whose goal is to address several design pitfalls, and get rid of extra dependency.

See slides: https://presentations.clickhouse.com/meetup54/keeper.pdf and video https://youtu.be/IfgtdU1Mrm0?t=2682

Current status (last updated: March 2026)

ClickHouse Keeper is the recommended choice for new installations. It yields better performance in many cases due to the new features, like async replication or multi read. Some ClickHouse server features cannot be used without Keeper, for example the S3Queue.

Use the latest Keeper version available in your supported upgrade path whenever possible.
The Keeper version doesn’t need to match the ClickHouse server version
Modern Keeper usually performs better than older versions because the codebase has matured significantly, new protocol feature flags have been added, and internal replication has improved.

For existing systems that currently use Apache Zookeeper, you can consider upgrading to clickhouse-keeper especially if you will upgrade ClickHouse also.

Warning

Before upgrading ClickHouse Keeper from version older than 23.9 please check Upgrade caveat for async_replication Upgrade caveat for async_replication

How does clickhouse-keeper differ from Zookeeper?

Keeper is optimized for ClickHouse workloads and written in C++ (and can be used as single-binary), so it don’t need any external dependencies. It uses the same client protocol but both are implementing different consensus protocol: Zookeeper is using ZAB, while ClickHouse Keeper implements eBay NuRAFT GitHub - eBay/NuRaft: C++ implementation of Raft core logic as a replication library which improves stability and performance of base RAFT protocol.

ClickHouse Keeper can also run in embedded mode, operating as a separate thread within the ClickHouse server process, which may be suitable for testing purposes or smaller instances where some performance can be sacrificed for simplicity

Migration and upgrade guide

A mixed ZooKeeper / ClickHouse Keeper quorum is not supported. Those are different consensus protocols.
ZooKeeper snapshots and transaction logs are not format-compatible with Keeper. For data migration use clickhouse-keeper-converter.
If the above is too complex you can switch to new, empty Keeper ensemble and recreate the Keeper metadata using SYSTEM RESTORE REPLICA calls. This method takes longer time but it is suitable for smaller clusters. Check procedure to restore multiple tables in RO mode article
Keep in mind that some metadata is available in ZooKeeper only and will be lost if you don’t migrate with clickhouse-keeper-converter using above guide. For example: Distributed DDL queue, RBAC data (if configured), etc. Check Keeper depended features for more information.

Upgrade caveat for `async_replication`

async_replication is an internal Keeper optimization for RAFT replication and it’s turned on by default starting from 25.10 . It does not change ClickHouse replicated table semantics, but it can improve Keeper performance.

If you upgrade directly from a version older than 23.9 to 25.10+:

either upgrade Keeper to 23.9+ first, and then continue to 25.10+
or temporarily set keeper_server.coordination_settings.async_replication=0 during the upgrade and enable it after the upgrade is finished

Keeper in kubernetes

If you run ClickHouse on Kubernetes with Altinity operator, Keeper can be managed as a dedicated ClickHouseKeeperInstallation resource (often abbreviated as CHK). That is usually the cleanest way to run and upgrade a separate Keeper ensemble on Kubernetes. Please check examples here .

systemd service file

See https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/clickhouse-keeper-service/

init.d script

See https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/clickhouse-keeper-initd/

More than 3 Keeper nodes

The main issue with a larger Keeper ensemble is that it takes more time to re-elect a leader, and commits take longer, which can slow down insertions and DDL queries.

It should be fine, but we don’t recommend running more than three Keeper nodes (excluding observers).

Increasing the number of nodes offers no significant advantages (unless you need to tolerate the simultaneous failure of two Keeper nodes). In terms of performance, it doesn’t perform better—and may even perform worse—and it consumes additional resources (ZooKeeper requires fast, dedicated disks to perform well, as well as some RAM and CPU).

clickhouse-keeper-client

In clickhouse-keeper-client, paths are now parsed more strictly and must be passed as string literals. In practice, this means using single quotes around paths—for example, ls '/' instead of ls /, and get '/clickhouse/path' instead of get /clickhouse/path.

Embedded Keeper

To use the embedded ClickHouse Keeper, add the <keeper_server> section to the ClickHouse server configuration. In this setup, a separate client-side <keeper> section is not required. If your ClickHouse servers use an external ClickHouse Keeper or ZooKeeper ensemble instead, see the section below.

Example of a simple cluster

The Keeper ensemble size must be odd because it requires a majority (50% + 1 nodes) to form a quorum. A 2-node Keeper setup will lose quorum after a single node failure, so the recommended number of Keeper replicas is 3.

hostname1

$ cat /etc/clickhouse-server/config.d/keeper.xml

<?xml version="1.0" ?>
<clickhouse>
    <keeper_server>
        <tcp_port>2181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>trace</raft_logs_level>
            <rotate_log_storage_interval>10000</rotate_log_storage_interval>
        </coordination_settings>

        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>hostname1</hostname>
                <port>9444</port>
            </server>
            <server>
                <id>2</id>
                <hostname>hostname2</hostname>
                <port>9444</port>
            </server>
            <server>
                <id>3</id>
                <hostname>hostname3</hostname>
                <port>9444</port>
            </server>
        </raft_configuration>
    </keeper_server>

    <distributed_ddl>
        <path>/clickhouse/testcluster/task_queue/ddl</path>
    </distributed_ddl>
</clickhouse>

$ cat /etc/clickhouse-server/config.d/macros.xml

<?xml version="1.0" ?>
<clickhouse>
    <macros>
        <cluster>testcluster</cluster>
        <replica>replica1</replica>
        <shard>1</shard>
    </macros>
</clickhouse>

hostname2

$ cat /etc/clickhouse-server/config.d/keeper.xml

<?xml version="1.0" ?>
<clickhouse>
    <keeper_server>
        <tcp_port>2181</tcp_port>
        <server_id>2</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>trace</raft_logs_level>
            <rotate_log_storage_interval>10000</rotate_log_storage_interval>
        </coordination_settings>

        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>hostname1</hostname>
                <port>9444</port>
            </server>
            <server>
                <id>2</id>
                <hostname>hostname2</hostname>
                <port>9444</port>
            </server>
            <server>
                <id>3</id>
                <hostname>hostname3</hostname>
                <port>9444</port>
            </server>
        </raft_configuration>
    </keeper_server>

    <distributed_ddl>
        <path>/clickhouse/testcluster/task_queue/ddl</path>
    </distributed_ddl>
</clickhouse>

$ cat /etc/clickhouse-server/config.d/macros.xml

<?xml version="1.0" ?>
<clickhouse>
    <macros>
        <cluster>testcluster</cluster>
        <replica>replica2</replica>
        <shard>1</shard>
    </macros>
</clickhouse>

hostname3

$ cat /etc/clickhouse-keeper/keeper_config.xml

<?xml version="1.0" ?>
<clickhouse>
    <keeper_server>
        <tcp_port>2181</tcp_port>
        <server_id>3</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>trace</raft_logs_level>
            <rotate_log_storage_interval>10000</rotate_log_storage_interval>
        </coordination_settings>

        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>hostname1</hostname>
                <port>9444</port>
            </server>
            <server>
                <id>2</id>
                <hostname>hostname2</hostname>
                <port>9444</port>
            </server>
            <server>
                <id>3</id>
                <hostname>hostname3</hostname>
                <port>9444</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>

$ clickhouse-keeper --config /etc/clickhouse-keeper/keeper_config.xml

on both ClickHouse nodes

$ cat /etc/clickhouse-server/config.d/clusters.xml

<?xml version="1.0" ?>
<clickhouse>
    <remote_servers>
        <testcluster>
            <shard>
                <replica>
                    <host>hostname1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>hostname2</host>
                    <port>9000</port>
                </replica>
            </shard>
        </testcluster>
    </remote_servers>
</clickhouse>

Then create a table

create table test on cluster '{cluster}'   ( A Int64, S String)
Engine = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{database}/{table}','{replica}')
Order by A;

insert into test select number, '' from numbers(100000000);

-- on both nodes:
select count() from test;

Useful references

Official Keeper guide: https://clickhouse.com/docs/en/guides/sre/keeper/clickhouse-keeper/
clickhouse-keeper-client: https://clickhouse.com/docs/en/operations/utilities/clickhouse-keeper-client
Keeper HTTP API and dashboard (26.1+): https://clickhouse.com/docs/operations/utilities/clickhouse-keeper-http-api
system.zookeeper: https://clickhouse.com/docs/operations/system-tables/zookeeper
system.zookeeper_connection: https://clickhouse.com/docs/operations/system-tables/zookeeper_connection
system.zookeeper_connection_log: https://clickhouse.com/docs/operations/system-tables/zookeeper_connection_log
system.zookeeper_info (26.1+): https://clickhouse.com/docs/operations/system-tables/zookeeper_info
system.zookeeper_log: https://clickhouse.com/docs/operations/system-tables/zookeeper_log
aggregated_zookeeper_log upstream PR: resubmit https://github.com/ClickHouse/ClickHouse/pull/87208
Altinity operator CHK examples: https://github.com/Altinity/clickhouse-operator/tree/master/docs/chk-examples
Altinity operator Keeper dashboard JSON: https://github.com/Altinity/clickhouse-operator/blob/master/grafana-dashboard/ClickHouseKeeper_dashboard.json
Altinity operator Keeper alert rules: https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-chkeeper.yaml

5.68.9 - ZooKeeper backup

ZooKeeper backup

Question: Do I need to backup Zookeeper Database, because it’s pretty important for ClickHouse®?

TLDR answer: NO, just backup ClickHouse data itself, and do SYSTEM RESTORE REPLICA during recovery to recreate zookeeper data

Details:

Zookeeper does not store any data, it stores the STATE of the distributed system (“that replica have those parts”, “still need 2 merges to do”, “alter is being applied” etc). That state always changes, and you can not capture / backup / and recover that state in a safe manner. So even backup from few seconds ago is representing some ‘old state from the past’ which is INCONSISTENT with actual state of the data.

In other words - if ClickHouse is working - then the state of distributed system always changes, and it’s almost impossible to collect the current state of zookeeper (while you collecting it it will change many times). The only exception is ‘stop-the-world’ scenario - i.e. shutdown all ClickHouse nodes, with all other zookeeper clients, then shutdown all the zookeeper, and only then take the backups, in that scenario and backups of zookeeper & ClickHouse will be consistent. In that case restoring the backup is as simple (and is equal to) as starting all the nodes which was stopped before. But usually that scenario is very non-practical because it requires huge downtime.

So what to do instead? It’s enough if you will backup ClickHouse data itself, and to recover the state of zookeeper you can just run the command SYSTEM RESTORE REPLICA command AFTER restoring the ClickHouse data itself. That will recreate the state of the replica in the zookeeper as it exists on the filesystem after backup recovery.

Normally Zookeeper ensemble consists of 3 nodes, which is enough to survive hardware failures.

On older version (which don’t have SYSTEM RESTORE REPLICA command - it can be done manually, using instruction https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication/#converting-from-mergetree-to-replicatedmergetree) , on scale you can try https://github.com/Altinity/clickhouse-zookeeper-recovery

5.68.10 - ZooKeeper cluster migration

ZooKeeper cluster migration

Here is a plan for ZK 3.4.9 (no dynamic reconfiguration):

Add the 3 new ZK nodes to the old cluster. No changes needed for the 3 old ZK nodes at this time.
1. Configure one of the new ZK nodes as a cluster of 4 nodes (3 old + 1 new), start it.
2. Configure the other two new ZK nodes as a cluster of 6 nodes (3 old + 3 new), start them.
Make sure the 3 new ZK nodes connected to the old ZK cluster as followers (run echo stat | nc localhost 2181 on the 3 new ZK nodes)
Confirm that the leader has 5 synced followers (run echo mntr | nc localhost 2181 on the leader, look for zk_synced_followers)
Stop data ingestion in CH (this is to minimize errors when CH loses ZK).
Change the zookeeper section in the configs on the CH nodes (remove the 3 old ZK servers, add the 3 new ZK servers)
Make sure that there are no connections from CH to the 3 old ZK nodes (run echo stat | nc localhost 2181 on the 3 old nodes, check their Clients section). Restart all CH nodes if necessary (In some cases CH can reconnect to different ZK servers without a restart).
Remove the 3 old ZK nodes from zoo.cfg on the 3 new ZK nodes.
Restart the 3 new ZK nodes. They should form a cluster of 3 nodes.
When CH reconnects to ZK, start data loading.
Turn off the 3 old ZK nodes.

This plan works, but it is not the only way to do this, it can be changed if needed.

5.68.11 - ZooKeeper cluster migration when using K8s node local storage

ZooKeeper cluster migration when using K8s node local storage

Describes how to migrate a ZooKeeper cluster when using K8s node-local storage such as static PV, local-path, TopoLVM.

Requires HA setup (3+ pods).

This solution is more risky than migration by adding followers because it reduces the number of active consensus members but is operationally simpler. When running with clickhouse-keeper, it can be performed gracefully so that quorum is maintained during the whole operation.

Find the leader pod and note its name
1. To detect leader run echo stat | nc 127.0.0.1 2181 | grep leader inside pods
Make sure the ZK cluster is healthy and all nodes are in sync
1. (run on leader) echo mntr | nc 127.0.0.1 2181 | grep zk_synced_followers should be N-1 for N member cluster
Pick the first non-leader pod and delete its PVC,
1. kubectl delete --wait=false pvc clickhouse-keeper-data-0 -> status should be Terminating
2. Also delete PV if your StorageClass reclaim policy is set to Retain
If you are using dynamic volume provisioning make adjustments based on your k8s infrastructure (such as moving labels and taints or cordoning node) so that after pod delete the new one will be scheduled on the planned node
1. kubectl label node planned-node dedicated=zookeeper
2. kubectl label node this-pod-node dedicated-
3. kubectl taint node planned-node dedicated=zookeeper:NoSchedule
4. kubectl taint node this-pod-node dedicated=zookeeper:NoSchedule-
For manual volume provisioning wait till a new PVC is created and then provision volume on the planned node
Delete the first non-leader pod and wait for its PV to be deleted
1. kubectl delete pod clickhouse-keeper-0
2. kubectl wait --for=delete pv/pvc-0a823311-616f-4b7e-9b96-0c059c62ab3b --timeout=120s
Wait for the new pod to be scheduled and volume provisioned (or provision manual volume per instructions above)
Ensure new member joined and synced
1. (run on leader) echo mntr | nc 127.0.0.1 2181 | grep zk_synced_followers should be N-1 for N member cluster
Repeat for all other non-leader pods
(ClickHouse® Keeper only), for Zookeeper you will need to force an election by stopping the leader
1. Ask the current leader to yield leadership
2. echo ydld | nc 127.0.0.1 2181 -> should print something like Sent yield leadership request to ...
3. - Make sure a different leader was elected by finding your new leader
Finally repeat for the leader pod

5.68.12 - ZooKeeper Monitoring

ZooKeeper Monitoring

ZooKeeper

scrape metrics

embedded exporter since version 3.6.0
- https://zookeeper.apache.org/doc/r3.6.2/zookeeperMonitor.html
standalone exporter
- https://github.com/dabealu/zookeeper-exporter

Install dashboards

embedded exporter https://grafana.com/grafana/dashboards/10465
dabealu exporter https://grafana.com/grafana/dashboards/11442

setup alert rules

embedded exporter link

5.68.13 - ZooKeeper schema

ZooKeeper schema

/metadata

Table schema.

date column -> legacy MergeTree partition expression.
sampling expression -> SAMPLE BY
index granularity -> index_granularity
mode -> type of MergeTree table
sign column -> sign - CollapsingMergeTree / VersionedCollapsingMergeTree
primary key -> ORDER BY key if PRIMARY KEY not defined.
sorting key -> ORDER BY key if PRIMARY KEY defined.
data format version -> 1
partition key -> PARTITION BY
granularity bytes -> index_granularity_bytes

types of MergeTree tables:
Ordinary            = 0
Collapsing          = 1
Summing             = 2
Aggregating         = 3
Replacing           = 5
Graphite            = 6
VersionedCollapsing = 7

/mutations

Log of latest mutations

/columns

List of columns for latest (reference) table version. Replicas would try to reach this state.

/log

Log of latest actions with table.

Related settings:

┌─name────────────────────────┬─value─┬─changed─┬─description────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─type───┐
│ max_replicated_logs_to_keep │ 1000  │       0 │ How many records may be in log, if there is inactive replica. Inactive replica becomes lost when when this number exceed.                                                  │ UInt64 │
│ min_replicated_logs_to_keep │ 10    │       0 │ Keep about this number of last records in ZooKeeper log, even if they are obsolete. It doesn't affect work of tables: used only to diagnose ZooKeeper log before cleaning. │ UInt64 │
└─────────────────────────────┴───────┴─────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘

/replicas

List of table replicas.

/replicas/replica_name/

/replicas/replica_name/mutation_pointer

Pointer to the latest mutation executed by replica

/replicas/replica_name/log_pointer

Pointer to the latest task from replication_queue executed by replica

/replicas/replica_name/max_processed_insert_time

/replica/replica_name/metadata

Table schema of specific replica

/replica/replica_name/columns

Columns list of specific replica.

/quorum

Used for quorum inserts.

6 - Useful queries

Access useful ClickHouse® queries, from finding database size, missing blocks, checking table metadata in Zookeeper, and more.

6.1 - Check table metadata in zookeeper

Check table metadata in zookeeper.

Compare table metadata of different replicas in zookeeper

Check if a table is consistent across all zookeeper replicas. From each replica, returns metdadata, columns, and is_active nodes. Checks whether each replica’s value matches the previous replica’s value, and flags any mismatches (looks_good = 0).

SELECT
    *,
    if(
        prev_name = name AND name != 'is_active',
        prev_value = value,
        1
    ) AS looks_good
FROM (
    SELECT
        name,
        path,
        ctime,
        mtime,
        value,
        lagInFrame(name)  OVER w AS prev_name,
        lagInFrame(value) OVER w AS prev_value
    FROM system.zookeeper
    WHERE (path IN (
        SELECT arrayJoin(groupUniqArray(if(path LIKE '%/replicas', concat(path, '/', name), path)))
        FROM system.zookeeper
        WHERE path IN (
            SELECT arrayJoin([zookeeper_path, concat(zookeeper_path, '/replicas')])
            FROM system.replicas
            WHERE table = 'test_repl'
        )
    )) AND (name IN ('metadata', 'columns', 'is_active'))
    WINDOW w AS (ORDER BY name = 'is_active', name ASC, path ASC
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
)

Returns a table’s create_table_query, and the last time the table’s metadata was modified

SELECT metadata_modification_time, create_table_query
FROM system.tables
WHERE name = 'test_repl'

6.2 - Compare query_log for 2 intervals

Compare query performance across different time periods

Looks at unique query shapes (by normalized_query_hash) which occurred within two different time intervals (“before” and “after”), and returns performance metrics for each query pattern which performed worse in the “after” interval.

WITH 
    toStartOfInterval(event_time, INTERVAL 5 MINUTE) = '2023-06-30 13:00:00' as before,
    toStartOfInterval(event_time, INTERVAL 5 MINUTE) = '2023-06-30 15:00:00' as after
SELECT
    normalized_query_hash,
    anyIf(query, before) AS QueryBefore,
    anyIf(query, after) AS QueryAfter,
    countIf(before) as CountBefore,
    sumIf(query_duration_ms, before) / 1000 AS QueriesDurationBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'RealTimeMicroseconds')], before) / 1000000 AS RealTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')], before) / 1000000 AS UserTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')], before) / 1000000 AS SystemTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'DiskReadElapsedMicroseconds')], before) / 1000000 AS DiskReadTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'DiskWriteElapsedMicroseconds')], before) / 1000000 AS DiskWriteTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'NetworkSendElapsedMicroseconds')], before) / 1000000 AS NetworkSendTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'NetworkReceiveElapsedMicroseconds')], before) / 1000000 AS NetworkReceiveTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'ZooKeeperWaitMicroseconds')], before) / 1000000 AS ZooKeeperWaitTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSIOWaitMicroseconds')], before) / 1000000 AS OSIOWaitTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSCPUWaitMicroseconds')], before) / 1000000 AS OSCPUWaitTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSCPUVirtualTimeMicroseconds')], before) / 1000000 AS OSCPUVirtualTimeBefore,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SelectedBytes')], before)  AS SelectedBytesBefore, 
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SelectedRanges')], before)  AS SelectedRangesBefore,
    sumIf(read_rows, before) AS ReadRowsBefore,
    formatReadableSize(sumIf(read_bytes, before) AS ReadBytesBefore),
    sumIf(written_rows, before) AS WrittenTowsBefore,
    formatReadableSize(sumIf(written_bytes, before)) AS WrittenBytesBefore,
    sumIf(result_rows, before) AS ResultRowsBefore,
    formatReadableSize(sumIf(result_bytes, before)) AS ResultBytesBefore,

    countIf(after) as CountAfter,
    sumIf(query_duration_ms, after) / 1000 AS QueriesDurationAfter,
   sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'RealTimeMicroseconds')], after) / 1000000 AS RealTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')], after) / 1000000 AS UserTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')], after) / 1000000 AS SystemTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'DiskReadElapsedMicroseconds')], after) / 1000000 AS DiskReadTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'DiskWriteElapsedMicroseconds')], after) / 1000000 AS DiskWriteTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'NetworkSendElapsedMicroseconds')], after) / 1000000 AS NetworkSendTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'NetworkReceiveElapsedMicroseconds')], after) / 1000000 AS NetworkReceiveTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'ZooKeeperWaitMicroseconds')], after) / 1000000 AS ZooKeeperWaitTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSIOWaitMicroseconds')], after) / 1000000 AS OSIOWaitTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSCPUWaitMicroseconds')], after) / 1000000 AS OSCPUWaitTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSCPUVirtualTimeMicroseconds')], after) / 1000000 AS OSCPUVirtualTimeAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SelectedBytes')], after)  AS SelectedBytesAfter,
    sumIf(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SelectedRanges')], after)  AS SelectedRangesAfter,

    sumIf(read_rows, after) AS ReadRowsAfter,
    formatReadableSize(sumIf(read_bytes, after) AS ReadBytesAfter),
    sumIf(written_rows, after) AS WrittenTowsAfter,
    formatReadableSize(sumIf(written_bytes, after)) AS WrittenBytesAfter,
    sumIf(result_rows, after) AS ResultRowsAfter,
    formatReadableSize(sumIf(result_bytes, after)) AS ResultBytesAfter

FROM system.query_log
WHERE (before OR after) AND type in (2,4) -- QueryFinish, ExceptionWhileProcessing
GROUP BY normalized_query_hash
    WITH TOTALS
ORDER BY SelectedRangesAfter- SelectedRangesBefore DESC
LIMIT 10
FORMAT Vertical

Looks at the system.query_log in a window (in this case, 3 days) prior to and following a specified timestamp of interest. Returns performance metrics for each query pattern which performed worse after that timestamp.

WITH 
    toDateTime('2024-02-09 00:00:00') as timestamp_of_issue,
    event_time < timestamp_of_issue as before,
    event_time >= timestamp_of_issue as after
select
    normalized_query_hash as h,
    any(query) as query_sample,
    round(quantileIf(0.9)(query_duration_ms, before)) as duration_q90_before,
    round(quantileIf(0.9)(query_duration_ms, after))  as duration_q90_after,
    countIf(before) as cnt_before,
    countIf(after) as cnt_after,
    sumIf(query_duration_ms,before) as duration_sum_before,
    sumIf(query_duration_ms,after) as duration_sum_after,
    sumIf(ProfileEvents['UserTimeMicroseconds'], before) as usertime_sum_before,
    sumIf(ProfileEvents['UserTimeMicroseconds'], after) as usertime_sum_after,
    sumIf(read_bytes,before) as sum_read_bytes_before,
    sumIf(read_bytes,after) as sum_read_bytes_after
from system.query_log
where event_time between timestamp_of_issue - INTERVAL 3 DAY and timestamp_of_issue + INTERVAL 3 DAY
group by h
HAVING cnt_after > 1.1 * cnt_before OR sum_read_bytes_after > 1.2 * sum_read_bytes_before OR usertime_sum_after > 1.2 * usertime_sum_before
ORDER BY sum_read_bytes_after - sum_read_bytes_before 
FORMAT Vertical

6.3 - Debug hanging thing

Debug hanging / freezing things

If ClickHouse® is busy with something and you don’t know what’s happening, you can easily check the stacktraces of all the thread which are working

SELECT
 arrayStringConcat(arrayMap(x -> concat('0x', lower(hex(x)), '\t', demangle(addressToSymbol(x))), trace), '\n') as trace_functions,
 count()
FROM system.stack_trace
GROUP BY trace_functions
ORDER BY count()
DESC
SETTINGS allow_introspection_functions=1
FORMAT Vertical;

If you can’t start any queries, but you have access to the node, you can sent a signal

# older versions
for i in $(ls -1 /proc/$(pidof clickhouse-server)/task/); do kill -TSTP $i; done
# even older versions
for i in $(ls -1 /proc/$(pidof clickhouse-server)/task/); do kill -SIGPROF $i; done

6.4 - Handy queries for system.query_log

Useful queries for analyzing query performance, resource usage, and overall query statistics

Most resource-intensive queries

For each query (cluster-wide, grouped by query hash and ordered by time), reports:

Latency-related metrics: CPU time categories, disk read and write time, network send and receive time, Zookeeper wait time
Data size-related metrics: counts of bytes and rows read/written, parts/ranges/marks read, files opened, and memory used
Cache hit performance

SELECT 
    hostName() as host,
    normalized_query_hash,
    min(event_time),
    max(event_time),
    replace(substr(argMax(query, utime), 1, 80), '\n', ' ') AS query,
    argMax(query_id, utime) AS sample_query_id,
    count(),
    sum(query_duration_ms) / 1000 AS QueriesDuration, /* wall clock */
    sum(ProfileEvents['RealTimeMicroseconds']) / 1000000 AS RealTime,  /* same as above but x number of thread */
    sum(ProfileEvents['UserTimeMicroseconds'] as utime) / 1000000 AS UserTime,  /* time when our query was doin some cpu-insense work, creating cpu load */
    sum(ProfileEvents['SystemTimeMicroseconds']) / 1000000 AS SystemTime, /* time spend on waiting for some system operations */
    sum(ProfileEvents['DiskReadElapsedMicroseconds']) / 1000000 AS DiskReadTime,
    sum(ProfileEvents['DiskWriteElapsedMicroseconds']) / 1000000 AS DiskWriteTime,
    sum(ProfileEvents['NetworkSendElapsedMicroseconds']) / 1000000 AS NetworkSendTime, /* check the other side of the network! */
    sum(ProfileEvents['NetworkReceiveElapsedMicroseconds']) / 1000000 AS NetworkReceiveTime, /* check the other side of the network! */
    sum(ProfileEvents['ZooKeeperWaitMicroseconds']) / 1000000 AS ZooKeeperWaitTime,
    sum(ProfileEvents['OSIOWaitMicroseconds']) / 1000000 AS OSIOWaitTime, /* IO waits, usually disks - that metric is 'orthogonal' to other */ 
    sum(ProfileEvents['OSCPUWaitMicroseconds']) / 1000000 AS OSCPUWaitTime, /* waiting for a 'free' CPU - usually high when the other load on the server creates a lot of contention for cpu */ 
    sum(ProfileEvents['OSCPUVirtualTimeMicroseconds']) / 1000000 AS OSCPUVirtualTime, /* similar to usertime + system time */
    formatReadableSize(sum(ProfileEvents['NetworkReceiveBytes']) as network_receive_bytes) AS NetworkReceiveBytes,
    formatReadableSize(sum(ProfileEvents['NetworkSendBytes']) as network_send_bytes) AS NetworkSendBytes,
    sum(ProfileEvents['SelectedParts']) as SelectedParts,
    sum(ProfileEvents['SelectedRanges']) as SelectedRanges,
    sum(ProfileEvents['SelectedMarks']) as SelectedMarks,
    sum(ProfileEvents['SelectedRows']) as SelectedRows,  /* those may different from read_rows - here the number or rows potentially matching the where conditions, not neccessary all will be read */
    sum(ProfileEvents['SelectedBytes']) as SelectedBytes,
    sum(ProfileEvents['FileOpen']) as FileOpen,
    sum(ProfileEvents['ZooKeeperTransactions']) as ZooKeeperTransactions,
    formatReadableSize(sum(ProfileEvents['OSReadBytes'] ) as os_read_bytes ) as OSReadBytesExcludePageCache,
    formatReadableSize(sum(ProfileEvents['OSWriteBytes'] ) as os_write_bytes ) as OSWriteBytesExcludePageCache,
    formatReadableSize(sum(ProfileEvents['OSReadChars'] ) as os_read_chars ) as OSReadCharsIncludePageCache,
    formatReadableSize(sum(ProfileEvents['OSWriteChars'] ) as os_write_chars ) as OSWriteCharsIncludePageCache,
    formatReadableSize(quantile(0.97)(memory_usage) as memory_usage_q97) as MemoryUsageQ97 ,
    sum(read_rows) AS ReadRows,
    formatReadableSize(sum(read_bytes) as read_bytes_sum) AS ReadBytes,
    sum(written_rows) AS WrittenRows,
    formatReadableSize(sum(written_bytes) as written_bytes_sum) AS WrittenBytes, /* */
    sum(result_rows) AS ResultRows,
    formatReadableSize(sum(result_bytes) as result_bytes_sum) AS ResultBytes
FROM clusterAllReplicas('{cluster}', system.query_log)
WHERE event_date >= today() AND type in (2,4)-- QueryFinish, ExceptionWhileProcessing
GROUP BY
    GROUPING SETS (
        (normalized_query_hash, host),
        (host),
        ())
ORDER BY OSCPUVirtualTime DESC
LIMIT 30
FORMAT Vertical;

Similar to above, for older ClickHouse versions (pre-22.4). Returns the slowest queries from a single host along with elements of latency.

SELECT
    normalized_query_hash,
    any(query),
    count(),
    sum(query_duration_ms) / 1000 AS QueriesDuration,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'RealTimeMicroseconds')]) / 1000000 AS RealTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) / 1000000 AS UserTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) / 1000000 AS SystemTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'DiskReadElapsedMicroseconds')]) / 1000000 AS DiskReadTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'DiskWriteElapsedMicroseconds')]) / 1000000 AS DiskWriteTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'NetworkSendElapsedMicroseconds')]) / 1000000 AS NetworkSendTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'NetworkReceiveElapsedMicroseconds')]) / 1000000 AS NetworkReceiveTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'ZooKeeperWaitMicroseconds')]) / 1000000 AS ZooKeeperWaitTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSIOWaitMicroseconds')]) / 1000000 AS OSIOWaitTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSCPUWaitMicroseconds')]) / 1000000 AS OSCPUWaitTime,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'OSCPUVirtualTimeMicroseconds')]) / 1000000 AS OSCPUVirtualTime,
    sum(read_rows) AS ReadRows,
    formatReadableSize(sum(read_bytes)) AS ReadBytes,
    sum(written_rows) AS WrittenTows,
    formatReadableSize(sum(written_bytes)) AS WrittenBytes,
    sum(result_rows) AS ResultRows,
    formatReadableSize(sum(result_bytes)) AS ResultBytes
FROM system.query_log
WHERE (event_date >= today()) AND (event_time > (now() - 3600)) AND type in (2,4) -- QueryFinish, ExceptionWhileProcessing
GROUP BY normalized_query_hash
    WITH TOTALS
ORDER BY UserTime DESC
LIMIT 30
FORMAT Vertical

A/B tests of the same query

Runs cluster-wide, returns a side-by-side comparison of performance metrics, ordered by relative difference

WITH
	query_id='8c050082-428e-4523-847a-caf29511d6ba' AS first,
	query_id='618e0c55-e21d-4630-97e7-5f82e2475c32' AS second,
	arrayConcat(mapKeys(ProfileEvents), ['query_duration_ms', 'read_rows', 'read_bytes', 'written_rows', 'written_bytes', 'result_rows', 'result_bytes', 'memory_usage', 'normalized_query_hash', 'peak_threads_usage', 'query_cache_usage']) AS metrics,
	arrayConcat(mapValues(ProfileEvents), [query_duration_ms, read_rows, read_bytes, written_rows, written_bytes, result_rows, result_bytes, memory_usage, normalized_query_hash, peak_threads_usage, toUInt64(query_cache_usage)]) AS metrics_values
SELECT
	metrics[i] AS metric,
	anyIf(metrics_values[i], first) AS v1,
	anyIf(metrics_values[i], second) AS v2,
	formatReadableQuantity(v1 - v2)
FROM clusterAllReplicas(default, system.query_log)
ARRAY JOIN arrayEnumerate(metrics) AS i
WHERE (first OR second) AND (type = 2)
GROUP BY metric
HAVING v1 != v2
ORDER BY
	(v2 - v1) / (v1 + v2) DESC,
	v2 DESC,
	metric ASC

Compares two queries run on the same host in the past day, returning the metrics highlighting the most significant performance differences between the faster and slower query

WITH
    'd18fb820-4075-49bf-8fa3-cd7e53b9d523' AS fast_query_id,
    '22ffbcc0-c62a-4895-8105-ee9d7447a643' AS slow_query_id,
    faster AS
    (
        SELECT pe.1 AS event_name, pe.2 AS event_value
        FROM
        (
            SELECT ProfileEvents.Names, ProfileEvents.Values
            FROM system.query_log
            WHERE (query_id = fast_query_id ) AND (type = 'QueryFinish') AND (event_date = today())
        )
        ARRAY JOIN arrayZip(ProfileEvents.Names, ProfileEvents.Values) AS pe
    ),
    slower AS
    (
        SELECT pe.1 AS event_name, pe.2 AS event_value
        FROM
        (
            SELECT ProfileEvents.Names, ProfileEvents.Values
            FROM system.query_log
            WHERE (query_id = slow_query_id) AND (type = 'QueryFinish') AND (event_date = today())
        )
        ARRAY JOIN arrayZip(ProfileEvents.Names, ProfileEvents.Values) AS pe
    )
SELECT
    event_name,
    formatReadableQuantity(slower.event_value) AS slower_value,
    formatReadableQuantity(faster.event_value) AS faster_value,
    round((slower.event_value - faster.event_value) / slower.event_value, 2) AS diff_q
FROM faster
LEFT JOIN slower USING (event_name)
WHERE diff_q > 0.05
ORDER BY event_name ASC
SETTINGS join_use_nulls = 1

Queries which did not complete within specified timeframe

For a given time range, returns queries which either did not complete, or did not complete within a configurable timeframe (100 seconds)

SELECT
  query_id,
  min(event_time) t,
  any(query)
FROM system.query_log
WHERE event_date = today() AND event_time > '2021-11-25 02:29:12'
GROUP BY query_id
HAVING countIf(type='QueryFinish') = 0 OR sum(query_duration_ms) > 100000
ORDER BY t;

Returns queries which started within a specified timeframe but did not complete successfully (still running, crashed, threw exception)

SELECT
     query_id,
     any(query)
FROM system.query_log
WHERE event_time BETWEEN '2021-09-24 07:00:00' AND '2021-09-24 09:00:00'
GROUP BY query_id HAVING countIf(type=1) <> countIf(type!=1)

Columns used in WHERE clauses

Returns a list of columns which are used as filters against a table. Replace %target_table% with the actual table name (or pattern) you want to inspect.

WITH
    any(query) AS q,
    any(tables) AS _tables,
    arrayJoin(extractAll(query, '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT$)')) AS w,
    any(columns) AS cols,
    arrayFilter(x -> (position(w, extract(x, '\\.(`[^`]+`|[^\\.]+)$')) > 0), columns) AS c,
    arrayJoin(c) AS c2
SELECT
    c2,
    count()
FROM system.query_log
WHERE (event_time >= (now() - toIntervalDay(1)))
  AND arrayExists(x -> (x LIKE '%target_table%'), tables)
  AND (query ILIKE 'SELECT%')
GROUP BY c2
ORDER BY count() ASC;

Most‑selected columns

Over the past week, which columns have been accessed the most frequently in SELECT queries

SELECT
    col AS column,
    count() AS hits
FROM system.query_log
ARRAY JOIN columns AS col          -- expand the column list first
WHERE type = 'QueryFinish'
  AND query_kind = 'Select'
  AND event_time >= now() - INTERVAL 7 DAY
  AND notEmpty(columns)
GROUP BY col
ORDER BY hits DESC
LIMIT 50;

Most‑used functions

Over the past week, which functions have been used the most

SELECT
    f AS function,
    count() AS hits
FROM system.query_log
ARRAY JOIN used_functions AS f  -- used_aggregate_functions, used_aggregate_function_combinators
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 7 DAY
  AND notEmpty(used_functions)
GROUP BY f
ORDER BY hits DESC
LIMIT 50;

“Worst offender” query ranks

Over a specified time range, returns the query shapes which appear to be the worst performing based on a range of ranked criteria

SELECT *
FROM 
(
SELECT 
    *,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY cnt DESC) as rank_by_cnt,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY QueriesDuration DESC) as rank_by_duration,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY RealTime DESC) as rank_by_real_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY UserTime DESC) as rank_by_user_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY SystemTime DESC) as rank_by_system_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY DiskReadTime DESC) as rank_by_disk_read_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY DiskWriteTime DESC) as rank_by_disk_write_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY NetworkSendTime DESC) as rank_by_network_send_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY NetworkReceiveTime DESC) as rank_by_network_receive_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY OSIOWaitTime DESC) as rank_by_os_io_wait_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY OSCPUWaitTime DESC) as rank_by_os_cpu_wait_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY OSCPUVirtualTime DESC) as rank_by_os_cpu_virtual_time,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY NetworkReceiveBytes DESC) as rank_by_network_receive_bytes,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY NetworkSendBytes DESC) as rank_by_network_send_bytes,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY SelectedParts DESC) as rank_by_selected_parts,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY SelectedRanges DESC) as rank_by_selected_ranges,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY SelectedMarks DESC) as rank_by_selected_marks,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY SelectedRows DESC) as rank_by_selected_rows,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY SelectedBytes DESC) as rank_by_selected_bytes,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY FileOpen DESC) as rank_by_file_open,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY ZooKeeperTransactions DESC) as rank_by_zookeeper_transactions,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY OSReadBytesExcludePageCache DESC) as rank_by_os_read_bytes_exclude_page_cache,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY OSWriteBytesExcludePageCache DESC) as rank_by_os_write_bytes_exclude_page_cache,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY OSReadBytesIncludePageCache DESC) as rank_by_os_read_bytes_include_page_cache,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY OSWriteCharsIncludePageCache DESC) as rank_by_os_write_chars_include_page_cache,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY MemoryUsageQ97 DESC) as rank_by_memory_usage_q97,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY ReadRows DESC) as rank_by_read_rows,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY ReadBytes DESC) as rank_by_read_bytes,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY WrittenRows DESC) as rank_by_written_rows,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY WrittenBytes DESC) as rank_by_written_bytes,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY ResultRows DESC) as rank_by_result_rows,
    DENSE_RANK() OVER (PARTITION BY host ORDER BY ResultBytes DESC) as rank_by_result_bytes
FROM
(
SELECT 
    hostName() as host,
    normalized_query_hash,
    min(event_time) as min_event_time,
    max(event_time) as max_event_time,
    replace(substr(argMax(query, utime), 1, 80), '\n', ' ') AS query,
    argMax(query_id, utime) AS sample_query_id,
    count() as cnt,
    sum(query_duration_ms) / 1000 AS QueriesDuration, /* wall clock */
    sum(ProfileEvents['RealTimeMicroseconds']) / 1000000 AS RealTime,  /* same as above but x number of thread */
    sum(ProfileEvents['UserTimeMicroseconds'] as utime) / 1000000 AS UserTime,  /* time when our query was doin some cpu-insense work, creating cpu load */
    sum(ProfileEvents['SystemTimeMicroseconds']) / 1000000 AS SystemTime, /* time spend on waiting for some system operations */
    sum(ProfileEvents['DiskReadElapsedMicroseconds']) / 1000000 AS DiskReadTime,
    sum(ProfileEvents['DiskWriteElapsedMicroseconds']) / 1000000 AS DiskWriteTime,
    sum(ProfileEvents['NetworkSendElapsedMicroseconds']) / 1000000 AS NetworkSendTime, /* check the other side of the network! */
    sum(ProfileEvents['NetworkReceiveElapsedMicroseconds']) / 1000000 AS NetworkReceiveTime, /* check the other side of the network! */
    sum(ProfileEvents['OSIOWaitMicroseconds']) / 1000000 AS OSIOWaitTime, /* IO waits, usually disks - that metric is 'orthogonal' to other */ 
    sum(ProfileEvents['OSCPUWaitMicroseconds']) / 1000000 AS OSCPUWaitTime, /* waiting for a 'free' CPU - usually high when the other load on the server creates a lot of contention for cpu */ 
    sum(ProfileEvents['OSCPUVirtualTimeMicroseconds']) / 1000000 AS OSCPUVirtualTime, /* similar to usertime + system time */
    sum(ProfileEvents['NetworkReceiveBytes']) AS NetworkReceiveBytes,
    sum(ProfileEvents['NetworkSendBytes']) AS NetworkSendBytes,
    sum(ProfileEvents['SelectedParts']) as SelectedParts,
    sum(ProfileEvents['SelectedRanges']) as SelectedRanges,
    sum(ProfileEvents['SelectedMarks']) as SelectedMarks,
    sum(ProfileEvents['SelectedRows']) as SelectedRows,  /* those may different from read_rows - here the number or rows potentially matching the where conditions, not neccessary all will be read */
    sum(ProfileEvents['SelectedBytes']) as SelectedBytes,
    sum(ProfileEvents['FileOpen']) as FileOpen,
    sum(ProfileEvents['ZooKeeperTransactions']) as ZooKeeperTransactions,
    sum(ProfileEvents['OSReadBytes'] )  as OSReadBytesExcludePageCache,
    sum(ProfileEvents['OSWriteBytes'] )  as OSWriteBytesExcludePageCache,
    sum(ProfileEvents['OSReadChars'] )  as OSReadBytesIncludePageCache,
    sum(ProfileEvents['OSWriteChars'] )  as OSWriteCharsIncludePageCache,
    quantile(0.97)(memory_usage)  as MemoryUsageQ97 ,
    sum(read_rows) AS ReadRows,
    sum(read_bytes) AS ReadBytes,
    sum(written_rows) AS WrittenRows,
    sum(written_bytes) AS WrittenBytes, /* */
    sum(result_rows) AS ResultRows,
    sum(result_bytes) AS ResultBytes
FROM clusterAllReplicas('{cluster}', system.query_log)
WHERE event_time BETWEEN '2024-04-04 11:31:10' and '2024-04-04 12:36:50' AND type in (2,4)-- QueryFinish, ExceptionWhileProcessing
GROUP BY normalized_query_hash, host
)
)
WHERE 
(rank_by_cnt <= 20 and cnt > 10)
OR (rank_by_duration <= 20 and QueriesDuration > 60)
OR (rank_by_real_time <= 20 and RealTime > 60)
OR (rank_by_user_time <= 20 and UserTime > 60)
OR (rank_by_system_time <= 20 and SystemTime > 60)
OR (rank_by_disk_read_time <= 20 and DiskReadTime > 60)
OR (rank_by_disk_write_time <= 20 and DiskWriteTime > 60)
OR (rank_by_network_send_time <= 20 and NetworkSendTime > 60)
OR (rank_by_network_receive_time <= 20 and NetworkReceiveTime > 60)
OR (rank_by_os_io_wait_time <= 20 and OSIOWaitTime > 60)
OR (rank_by_os_cpu_wait_time <= 20 and OSCPUWaitTime > 60)
OR (rank_by_os_cpu_virtual_time <= 20 and OSCPUVirtualTime > 60)
OR (rank_by_network_receive_bytes <= 20 and NetworkReceiveBytes > 500000000)
OR (rank_by_network_send_bytes <= 20 and NetworkSendBytes > 500000000)
OR (rank_by_selected_parts <= 20 and SelectedParts > 1000)
OR (rank_by_selected_ranges <= 20 and SelectedRanges > 1000)
OR (rank_by_selected_marks <= 20 and SelectedMarks > 1000)
OR (rank_by_selected_rows <= 20 and SelectedRows > 1000000)
OR (rank_by_selected_bytes <= 20 and SelectedBytes > 500000000)
OR (rank_by_file_open <= 20 and FileOpen > 1000)
OR (rank_by_zookeeper_transactions <= 20 and ZooKeeperTransactions > 10)
OR (rank_by_os_read_bytes_exclude_page_cache <= 20 and OSReadBytesExcludePageCache > 500000000)
OR (rank_by_os_write_bytes_exclude_page_cache <= 20 and OSWriteBytesExcludePageCache > 500000000)
OR (rank_by_os_read_bytes_include_page_cache <= 20 and OSReadBytesIncludePageCache > 500000000)
OR (rank_by_os_write_chars_include_page_cache <= 20 and OSWriteCharsIncludePageCache > 500000000)
OR (rank_by_memory_usage_q97 <= 20 and MemoryUsageQ97 > 500000000)
OR (rank_by_read_rows <= 20 and ReadRows > 100000)
OR (rank_by_read_bytes <= 20 and ReadBytes > 500000000)
OR (rank_by_written_rows <= 20 and WrittenRows > 100000)
OR (rank_by_written_bytes <= 20 and WrittenBytes > 500000000)
OR (rank_by_result_rows <= 20 and ResultRows > 100000)
OR (rank_by_result_bytes <= 20 and ResultBytes > 100000000)
ORDER BY rank_by_cnt*10 + rank_by_duration*10 + rank_by_real_time*10 + rank_by_user_time*10 + rank_by_system_time*10 + rank_by_disk_read_time*10 + rank_by_disk_write_time*5 + rank_by_network_send_time + rank_by_network_receive_time + rank_by_os_io_wait_time + rank_by_os_cpu_wait_time + rank_by_os_cpu_virtual_time*10 + rank_by_network_receive_bytes*8 + rank_by_network_send_bytes*8 + rank_by_selected_parts*5 + rank_by_selected_ranges*5 + rank_by_selected_marks*5 + rank_by_selected_rows*5 + rank_by_selected_bytes*5 + rank_by_file_open*5 + rank_by_zookeeper_transactions*5 + rank_by_os_read_bytes_exclude_page_cache*5 + rank_by_os_write_bytes_exclude_page_cache*5 + rank_by_os_read_bytes_include_page_cache*5 + rank_by_os_write_chars_include_page_cache*5 + rank_by_memory_usage_q97*10 + rank_by_read_rows*10 + rank_by_read_bytes*10 + rank_by_written_rows*8 + rank_by_written_bytes*8 + rank_by_result_rows*8 + rank_by_result_bytes*8 DESC

Other resources

6.5 - Ingestion metrics from system.part_log

Query to gather information about ingestion rate from system.part_log.

Insert rate

Returns aggregated insert metrics, per table, for the current day (by default), including parts per insert, rows/bytes per insert, and rows/bytes per part.

select database, table, time_bucket,
       max(number_of_parts_per_insert) max_parts_pi,
       median(number_of_parts_per_insert) median_parts_pi,
       min(min_rows_per_part) min_rows_pp, 
       max(max_rows_per_part) max_rows_pp, 
       median(median_rows_per_part) median_rows_pp,
       min(rows_per_insert) min_rows_pi, 
       median(rows_per_insert) median_rows_pi, 
       max(rows_per_insert) max_rows_pi, 
       sum(rows_per_insert) rows_inserted,
       sum(seconds_per_insert) parts_creation_seconds,
       count() inserts,
       sum(number_of_parts_per_insert) new_parts,
       max(last_part_pi) - min(first_part_pi) as insert_period,
       inserts*60/insert_period as inserts_per_minute
from
(SELECT 
	database, 
	table,
	toStartOfDay(event_time) AS time_bucket, 
	count() AS number_of_parts_per_insert,
	min(rows) AS min_rows_per_part,
	max(rows) AS max_rows_per_part, 
	median(rows) AS median_rows_per_part,
	sum(rows) AS rows_per_insert, 
	min(size_in_bytes) AS min_bytes_per_part, 
	max(size_in_bytes) AS max_bytes_per_part,
	median(size_in_bytes) AS median_bytes_per_part, 
	sum(size_in_bytes) AS bytes_per_insert, 
	median_bytes_per_part / median_rows_per_part AS avg_row_size,
	sum(duration_ms)/1000 as seconds_per_insert,
	max(event_time) as last_part_pi,  min(event_time) as first_part_pi
FROM 
	system.part_log
WHERE 
  -- Enum8('NewPart' = 1, 'MergeParts' = 2, 'DownloadPart' = 3, 'RemovePart' = 4, 'MutatePart' = 5, 'MovePart' = 6)
	event_type = 1 
	AND 
  -- change if another time period is desired
	event_date >= today()
GROUP BY query_id, database, table, time_bucket
)
GROUP BY database, table, time_bucket
ORDER BY time_bucket, database, table ASC

New parts per partition

Returns new part counts and average rows per table for the current day (by default)

select database, table, event_type, partition_id, count() c, round(avg(rows)) 
from system.part_log where event_date >= today() and event_type = 'NewPart'
group by database, table, event_type, partition_id
order by c desc

Too fast inserts

Returns new part counts and average rows by minute by table

Should not be more often than 1 new part per table per second (60 inserts per minute) One insert can create several parts because of partitioning and materialized views attached.

select toStartOfMinute(event_time) t, database, table, count() c, round(avg(rows)) 
from system.part_log
where event_date >= today()
  and event_type = 'NewPart'
  --and event_time > now() - 3600
group by database, table, t
order by t

6.6 - Remove block numbers from zookeeper for removed partitions

Remove block numbers from zookeeper for removed partitions

SELECT distinct concat('delete ', zk.block_numbers_path, zk.partition_id) FROM
(
    SELECT r.database, r.table, zk.block_numbers_path, zk.partition_id, p.partition_id
    FROM 
    (
        SELECT path as block_numbers_path, name as partition_id
        FROM system.zookeeper
        WHERE path IN (
            SELECT concat(zookeeper_path, '/block_numbers/') as block_numbers_path
            FROM clusterAllReplicas('{cluster}',system.replicas)
        )
    ) as zk 
    LEFT JOIN 
    (
            SELECT database, table, concat(zookeeper_path, '/block_numbers/') as block_numbers_path
            FROM clusterAllReplicas('{cluster}',system.replicas)
    )
    as r ON (r.block_numbers_path = zk.block_numbers_path) 
    LEFT JOIN 
    (
        SELECT DISTINCT partition_id, database, table
        FROM clusterAllReplicas('{cluster}',system.parts)
    )
    as p ON (p.partition_id = zk.partition_id AND p.database = r.database AND p.table = r.table)
    WHERE p.partition_id = ''  AND zk.partition_id <> 'all'
    ORDER BY r.database, r.table, zk.block_numbers_path, zk.partition_id, p.partition_id
) t
FORMAT TSVRaw;

After 24.3

WITH 
  now() - INTERVAL 120 DAY as retain_old_partitions,
  replicas AS (SELECT DISTINCT database, table, zookeeper_path || '/block_numbers' AS block_numbers_path FROM system.replicas),
  zk_data AS (SELECT DISTINCT name as partition_id, path as block_numbers_path FROM system.zookeeper WHERE path IN (SELECT block_numbers_path FROM replicas) AND mtime < retain_old_partitions AND partition_id <> 'all'),
  zk_partitions AS (SELECT DISTINCT database, table, partition_id FROM replicas JOIN zk_data USING block_numbers_path),
  partitions AS (SELECT DISTINCT database, table, partition_id FROM system.parts)
SELECT 
  format('ALTER TABLE `{}`.`{}` {};',database, table, arrayStringConcat( arraySort(groupArray('FORGET PARTITION ID \'' || partition_id || '\'')), ', ')) AS query
FROM zk_partitions
WHERE (database, table, partition_id) NOT IN (SELECT * FROM partitions) 
GROUP BY database, table
ORDER BY database, table
FORMAT TSVRaw;

After fixing https://github.com/ClickHouse/ClickHouse/issues/72807

WITH 
  now() - INTERVAL 120 DAY as retain_old_partitions,
  replicas AS (SELECT DISTINCT database, table, zookeeper_path || '/block_numbers' AS block_numbers_path FROM clusterAllReplicas('{cluster}',system.replicas)),
  zk_data AS (SELECT DISTINCT name as partition_id, path as block_numbers_path FROM system.zookeeper WHERE path IN (SELECT block_numbers_path FROM replicas) AND mtime < retain_old_partitions AND partition_id <> 'all'),
  zk_partitions AS (SELECT DISTINCT database, table, partition_id FROM replicas JOIN zk_data USING block_numbers_path),
  partitions AS (SELECT DISTINCT database, table, partition_id FROM clusterAllReplicas('{cluster}',system.parts))
SELECT 
  format('ALTER TABLE `{}`.`{}` ON CLUSTER \'{{cluster}}\' {};',database, table, arrayStringConcat( arraySort(groupArray('FORGET PARTITION ID \'' || partition_id || '\'')), ', ')) AS query
FROM zk_partitions
WHERE (database, table, partition_id) NOT IN (SELECT * FROM partitions) 
GROUP BY database, table
ORDER BY database, table
FORMAT TSVRaw;

6.7 - Removing tasks in the replication queue related to empty partitions

Removing tasks in the replication queue related to empty partitions

SELECT 'ALTER TABLE ' || database || '.' || table || ' DROP PARTITION ID \''|| partition_id || '\';'  FROM 
(SELECT DISTINCT database, table, extract(new_part_name, '^[^_]+')  as partition_id FROM clusterAllReplicas('{cluster}', system.replication_queue) ) as rq
LEFT JOIN 
(SELECT database, table, partition_id, sum(rows) as rows_count, count() as part_count 
FROM clusterAllReplicas('{cluster}', system.parts)
WHERE active GROUP BY database, table, partition_id
)  as p
USING (database, table, partition_id)
WHERE p.rows_count = 0 AND p.part_count = 0
FORMAT TSVRaw;

6.8 - Can detached parts in ClickHouse® be dropped?

Cleaning up detached parts without data loss

Overview

This article explains detached parts in ClickHouse®: why they appear, what detached reasons mean, and how to clean up safely.

Use it when investigating:

You can perform two main operations with detached parts:

Recovery: If you’re missing data due to misconfiguration or an error (such as connecting to the wrong ZooKeeper), check the detached parts. The missing data might be recoverable through manual intervention.
Cleanup: Otherwise, clean up the detached parts periodically to free disk space.

Version Scope

Primary scope: ClickHouse 23.10+.

Compatibility note:

In 22.6-23.9, there was optional timeout-based auto-removal for some detached reasons.
In 23.10+, this option was removed; detached-part cleanup is intentionally manual.

Important distinction for ReplicatedMergeTree: ClickHouse® tracks expected parts from ZooKeeper and unexpected parts found locally:

Broken expected parts increment the max_suspicious_broken_parts counter (can block startup).
Broken unexpected parts use a separate counter and do not block startup.

Detailed actions based on the `status` of detached parts:

Safe to delete (after validation):
- ignored
- clone.
Temporary, do not delete while in progress:
- attaching
- deleting
- tmp-fetch.
Investigate before deleting:
- broken
- broken-on-start
- broken-from-backup
- covered-by-broken
- noquorum
- merge-not-byte-identical
- mutate-not-byte-identical

Monitoring of detached parts

You can find information in clickhouse-server.log, for what happened when the parts were detached during startup. If clickhouse-server.log is lost it might be impossible to figure out what happened and why the parts were detached.

Another good source of information is system.part_log table, which can be used to investigate the history/timeline of specific parts involved in the detaching process:

SELECT
    event_time,
    event_type,
    database,
    `table`,
    part_name,
    partition_id,
    rows,
    size_in_bytes,
    merged_from,
    error,
    exception
FROM system.part_log
WHERE part_name IN ('all_1_5_0', 'all_6_10_1') -- example part names, replace with actual part names from detached_parts or clickhouse-server.log
ORDER BY
    part_name ASC,
    event_time ASC

Also system.detached_parts table contains useful information:

SELECT
    database,
    table,
    reason,
    count() AS parts
FROM system.detached_parts
GROUP BY database, table, reason
ORDER BY database ASC, table ASC, reason ASC

It is important to monitor for detached parts and act quickly when they appear. You can use system.asynchronous_metric/metric_log to track some metrics.

Use system.asynchronous_metrics for current values:

SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric IN ('NumberOfDetachedParts', 'NumberOfDetachedByUserParts')
ORDER BY metric;

Use system.asynchronous_metric_log for history/trends:

SELECT
    event_time,
    metric,
    value
FROM system.asynchronous_metric_log
WHERE metric IN ('NumberOfDetachedParts', 'NumberOfDetachedByUserParts')
  AND event_time > now() - INTERVAL 24 HOUR
ORDER BY event_time DESC, metric;

DROP DETACHED command

The DROP DETACHED command in ClickHouse® is used to remove parts or partitions that have previously been detached (i.e., moved to the detached directory and forgotten by the server). The syntax is:

Warning

Be careful before dropping any detached part or partition. Validate that data is no longer needed and keep a backup before running destructive commands.

ALTER TABLE table_name [ON CLUSTER cluster] DROP DETACHED PARTITION|PART ALL|partition_expr

This command removes the specified part or all parts of the specified partition from the detached directory. For more details on how to specify the partition expression, see the documentation on how to set the partition expression DROP DETACHED PARTITION|PART.

Note: You must have the allow_drop_detached setting enabled to use this command.

DROP DML

Warning

Review generated DROP DETACHED commands carefully before executing them. They can cause data loss if used incorrectly. Ensure you have a valid backup before destructive operations. Treat generated commands as candidates for manual review, not as commands to run blindly.

Here is a query that can help with investigations. It looks for active parts containing the same data blocks as the detached parts and generates commands to drop the detached parts.

SELECT a.*,
    concat('ALTER TABLE ',a.database,'.',a.table,' DROP DETACHED PART ''',a.name,''' SETTINGS allow_drop_detached=1;',
           ' -- db=',a.database,' table=',a.table,' reason=',a.reason,' partition_id=',a.partition_id,
           ' min_block=',toString(a.min_block_number),' max_block=',toString(a.max_block_number)) AS drop_command
FROM system.detached_parts AS a
LEFT JOIN (
    SELECT database, table, partition_id, name, active, min_block_number, max_block_number
    FROM system.parts
    WHERE active
) b
ON a.database = b.database AND a.table = b.table AND a.partition_id = b.partition_id
WHERE a.min_block_number IS NOT NULL
  AND a.max_block_number IS NOT NULL
  AND a.min_block_number >= b.min_block_number
  AND a.max_block_number <= b.max_block_number
ORDER BY a.table, a.min_block_number, a.max_block_number
SETTINGS join_use_nulls=1

The list of DETACH_REASONS: MergeTreePartInfo.h#L163

Rare but Important Edge Cases

Invalid detached part names with _tryN suffixes can produce NULL parsing metadata in system.detached_parts; treat these as a separate cleanup track.
Older versions had DROP DETACHED issues on ReplicatedMergeTree over S3 (without zero-copy); this was fixed in 2023.
Startup handling of unexpected parts was improved to restore closer ancestors instead of random covered parts.
Downgrade workflows may fail to ATTACH broken-on-start_* directly in some versions. Workaround is manual rename then attach:

SELECT
    concat('mv ', path, ' ', replace(path, 'broken-on-start_', '')) AS mv_cmd
FROM system.detached_parts
WHERE startsWith(name, 'broken-on-start_')

Detached part type	Source code reference
`broken`	StorageReplicatedMergeTree.cpp
`unexpected`	MergeTreeData.cpp
`ignored`	MergeTreeSettings.cpp
`noquorum`	ReplicatedMergeTreeRestartingThread.cpp
`broken-on-start`	MergeTreeData.cpp
`clone`	StorageReplicatedMergeTree.cpp
`attaching`	MergeTreeData.cpp
`deleting`	MergeTreeData.cpp
`tmp-fetch`	DataPartsExchange.cpp
`covered-by-broken`	StorageReplicatedMergeTree.cpp
`merge-not-byte-identical`	MergeFromLogEntryTask.cpp
`mutate-not-byte-identical`	MutateFromLogEntryTask.cpp
`broken-from-backup`	MergeTreeData.cpp

6.9 - Database Size - Table - Column size

Queries to analyze size and compression rates across tables and columns

Tables

Table size

Returns table size, compression rates, and row and part counts, by table

SELECT
    database,
    table,
    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_rate,
    sum(rows) AS rows,
    count() AS part_count
FROM system.parts
WHERE (active = 1) AND (database LIKE '%') AND (table LIKE '%')
GROUP BY
    database,
    table
ORDER BY size DESC;

Table size + inner MatView (Atomic)

As above, but resolves Materialized View inner table names (for Materialized Views created using implicit inner table)

SELECT
      p.database,
      if(t.name = '', p.table, p.table||' ('||t.name||')') tbl,
      formatReadableSize(sum(p.data_compressed_bytes) AS size) AS compressed,
      formatReadableSize(sum(p.data_uncompressed_bytes) AS usize) AS uncompressed,
      round(usize / size, 2) AS compr_rate,
      sum(p.rows) AS rows,
      count() AS part_count
FROM system.parts p left join system.tables t on p.database = t.database and p.table = '.inner_id.'||toString(t.uuid)
WHERE (active = 1) AND (tbl LIKE '%') AND (database LIKE '%')
GROUP BY
    p.database,
    tbl
ORDER BY size DESC;

Column size

Returns size, compression rate, row counts, and average row size for each column (by db and table)

SELECT
    database,
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_ratio,
    sum(rows) rows_cnt,
    round(usize / rows_cnt, 2) avg_row_size
FROM system.parts_columns
WHERE (active = 1) AND (database LIKE '%') AND (table LIKE '%')
GROUP BY
    database,
    table,
    column
ORDER BY size DESC;

Projections

Projection size

Returns size, compression rate, row counts, and average row size for each projection (“name”), by db and table

SELECT
    database,
    table,
    name,
    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_rate,
    sum(rows) AS rows,
    count() AS part_count
FROM system.projection_parts
WHERE (table = 'ptest') AND active
GROUP BY
    database,
    table,
    name
ORDER BY size DESC;

Projection column size

Returns size, compression rate, row counts, and average row size for each projection (“name”), by db and table, and column

SELECT
    database,
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_rate
FROM system.projection_parts_columns
WHERE (active = 1) AND (table LIKE 'ptest')
GROUP BY
    database,
    table,
    column
ORDER BY size DESC;

Understanding the columns data properties:

For each column in a table, unique value counts, min/max, and top 5 most frequent values

SELECT
   count(),
   * APPLY (uniq),
   * APPLY (max),
   * APPLY (min),
   * APPLY(topK(5))
FROM table_name 
FORMAT Vertical;

-- also you can add * APPLY (entropy) to show entropy (i.e. 'randomness' of the column).
-- if the table is huge add some WHERE condition to slice some 'representative' data range, for example single month / week / day of data.

Understanding the ingest pattern:

For parts which are recently created and are unmerged, returns row, size, and count information by db and table.

High count, low rows: lots of small parts
High countif(NOT active) relative to count(): merges are keeping up
Low countIf(NOT active) relative to count(): merges may be falling behind
uniqExact(partition): how many partitions are being written to

SELECT
    database,
    table,
    median(rows),
    median(bytes_on_disk),
    sum(rows),
    max(bytes_on_disk),
    min(bytes_on_disk),
    round(quantile(0.95)(bytes_on_disk), 0),
    sum(bytes_on_disk),
    count(),
    countIf(NOT active),
    uniqExact(partition)
FROM system.parts
WHERE (modification_time > (now() - 480)) AND (level = 0)
GROUP BY
    database,
    table
ORDER BY count() DESC

part_log

For the past day, returns per-second part lifecycle metrics over 30 minute buckets

WITH 30 * 60 AS frame_size
SELECT
    toStartOfInterval(event_time, toIntervalSecond(frame_size)) AS m,
    database,
    table,
    ROUND(countIf(event_type = 'NewPart') / frame_size, 2) AS new,
    ROUND(countIf(event_type = 'MergeParts') / frame_size, 2) AS merge,
    ROUND(countIf(event_type = 'DownloadPart') / frame_size, 2) AS dl,
    ROUND(countIf(event_type = 'RemovePart') / frame_size, 2) AS rm,
    ROUND(countIf(event_type = 'MutatePart') / frame_size, 2) AS mut,
    ROUND(countIf(event_type = 'MovePart') / frame_size, 2) AS mv
FROM system.part_log
WHERE event_time > (now() - toIntervalHour(24))
GROUP BY
    m,
    database,
    table
ORDER BY
    database ASC,
    table ASC,
    m ASC

For the past day, returns per-second insert throughput metrics, by db and table, over 30 minute buckets

WITH 30 * 60 AS frame_size
SELECT
    toStartOfInterval(event_time, toIntervalSecond(frame_size)) AS m,
    database,
    table,
    ROUND(countIf(event_type = 'NewPart') / frame_size, 2) AS inserts_per_sec,
    ROUND(sumIf(rows, event_type = 'NewPart') / frame_size, 2) AS rows_per_sec,
    ROUND(sumIf(size_in_bytes, event_type = 'NewPart') / frame_size, 2) AS bytes_per_sec
FROM system.part_log
WHERE event_time > (now() - toIntervalHour(24))
GROUP BY
    m,
    database,
    table
ORDER BY
    database ASC,
    table ASC,
    m ASC

Understanding partitioning

Partition distribution analysis, aggregating system.parts metrics by partition. The quantiles results can indicate whether there is skewed distribution of data between partitions.

SELECT
    database,
    table,
    count(),
    topK(5)(partition),
    COLUMNS('metric.*') APPLY(quantiles(0.005, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 0.95, 0.995))
FROM
(
    SELECT
        database,
        table,
        partition,
        sum(bytes_on_disk) AS metric_bytes,
        sum(data_uncompressed_bytes) AS metric_uncompressed_bytes,
        sum(rows) AS metric_rows,
        sum(primary_key_bytes_in_memory) AS metric_pk_size,
        count() AS metric_count,
        countIf(part_type = 'Wide') AS metric_wide_count,
        countIf(part_type = 'Compact') AS metric_compact_count,
        countIf(part_type = 'Memory') AS metric_memory_count
    FROM system.parts
    GROUP BY
        database,
        table,
        partition
)
GROUP BY
    database,
    table
FORMAT Vertical

Subcolumns sizes

Returns column-level storage metrics, including subcolumns (JSON, tuples, maps, etc - if present)

WITH 
     if(
          length(subcolumns.names) > 0, 
          arrayMap( (sc_n,sc_t,sc_s, sc_bod, sc_dcb, sc_dub) -> tuple(sc_n,sc_t,sc_s, sc_bod, sc_dcb, sc_dub), subcolumns.names, subcolumns.types, subcolumns.serializations, subcolumns.bytes_on_disk, subcolumns.data_compressed_bytes, subcolumns.data_uncompressed_bytes),
          [tuple('',type,serialization_kind,column_bytes_on_disk,column_data_compressed_bytes,column_data_uncompressed_bytes)]) as _subcolumns_data,
     arrayJoin(_subcolumns_data) as _subcolumn,
     _subcolumn.1 as _sc_name, 
     _subcolumn.2 as _sc_type,
     _subcolumn.3 as _sc_serialization,
     _subcolumn.4 as _sc_bytes_on_disk,
     _subcolumn.5 as _sc_data_compressed_bytes,
     _subcolumn.6 as _sc_uncompressed_bytes
SELECT
    database || '.' || table as table_,
    column as colunm_, 
    _sc_name as subcolumn_,
    any(_sc_type),
    formatReadableSize(sum(_sc_data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(_sc_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_ratio,
    sum(rows) AS rows_cnt,
    round(usize / rows_cnt, 2) AS avg_row_size
FROM system.parts_columns
WHERE (active = 1) AND (database LIKE '%') AND (`table` LIKE '%')
GROUP BY  
     table_,
     colunm_,
     subcolumn_
ORDER BY size DESC ;

6.10 - Notes on Various Errors with respect to replication and distributed connections

Notes on errors related to replication and distributed connections

`ClickHouseDistributedConnectionExceptions`

This alert usually indicates that one of the nodes isn’t responding or that there’s an interconnectivity issue. Debug steps:

1. Check Cluster Connectivity

Verify connectivity inside the cluster by running:

SELECT count() FROM clusterAllReplicas('{cluster}', cluster('{cluster}', system.one))

2. Check for Errors

Run the following queries to see if any nodes report errors:

SELECT hostName(), * FROM clusterAllReplicas('{cluster}', system.clusters) WHERE errors_count > 0;
SELECT hostName(), * FROM clusterAllReplicas('{cluster}', system.errors) WHERE last_error_time > now() - 3600 ORDER BY value;

Depending on the results, ensure that the affected node is up and responding to queries. Also, verify that connectivity (DNS, routes, delays) is functioning correctly.

`ClickHouseReplicatedPartChecksFailed` & `ClickHouseReplicatedPartFailedFetches`

Unless you’re seeing huge numbers, these alerts can generally be ignored. They’re often a sign of temporary replication issues that ClickHouse resolves on its own. However, if the issue persists or increases rapidly, follow the steps to debug replication issues:

Check the replication status using tables such as system.replicas and system.replication_queue.
Examine server logs, system.errors, and system load for any clues.
Try to restart the replica (SYSTEM RESTART REPLICA db_name.table_name command) and, if necessary, contact Altinity support.

6.11 - Number of active parts in a partition

Number of active parts in a partition

Q: Why do I have several active parts in a partition? Why ClickHouse® does not merge them immediately?

A: CH does not merge parts by time

Merge scheduler selects parts by own algorithm based on the current node workload / number of parts / size of parts.

CH merge scheduler balances between a big number of parts and a wasting resources on merges.

Merges are CPU/DISK IO expensive. If CH will merge every new part then all resources will be spend on merges and will no resources remain on queries (selects ).

CH will not merge parts with a combined size greater than 150 GB max_bytes_to_merge_at_max_space_in_pool .

SELECT
    database,
    table,
    partition,
    sum(rows) AS rows,
    count() AS part_count
FROM system.parts
WHERE (active = 1) AND (table LIKE '%') AND (database LIKE '%')
GROUP BY
    database,
    table,
    partition
ORDER BY part_count DESC
limit 20

6.12 - Parts consistency

Check if there are blocks missing

SELECT
    database,
    table,
    partition_id,
    ranges.1 AS previous_part,
    ranges.2 AS next_part,
    ranges.3 AS previous_block_number,
    ranges.4 AS next_block_number,
    range(toUInt64(previous_block_number + 1), toUInt64(next_block_number)) AS missing_block_numbers
FROM
(
    WITH
        arrayPopFront(groupArray(min_block_number) AS min) AS min_adj,
        arrayPopBack(groupArray(max_block_number) AS max) AS max_adj,
        arrayFilter((x, y, z) -> (y != (z + 1)), arrayZip(arrayPopBack(groupArray(name) AS name_arr), arrayPopFront(name_arr), max_adj, min_adj), min_adj, max_adj) AS missing_ranges
    SELECT
        database,
        table,
        partition_id,
        missing_ranges
    FROM
    (
        SELECT *
        FROM system.parts
        WHERE active AND (table = 'query_thread_log') AND (partition_id = '202108') AND active
        ORDER BY min_block_number ASC
    )
    GROUP BY
        database,
        table,
        partition_id
)
ARRAY JOIN missing_ranges AS ranges

┌─database─┬─table────────────┬─partition_id─┬─previous_part───────┬─next_part──────────┬─previous_block_number─┬─next_block_number─┬─missing_block_numbers─┐
│ system   │ query_thread_log │ 202108       │ 202108_864_1637_556 │ 202108_1639_1639_0 │                  1637 │              1639 │ [1638]                │
└──────────┴──────────────────┴──────────────┴─────────────────────┴────────────────────┴───────────────────────┴───────────────────┴───────────────────────┘

Find the number of blocks in a table

SELECT
    database,
    table,
    partition_id,
    sum(max_block_number - min_block_number) AS blocks_count
FROM system.parts
WHERE active AND (table = 'query_thread_log') AND (partition_id = '202108') AND active
GROUP BY
    database,
    table,
    partition_id

┌─database─┬─table────────────┬─partition_id─┬─blocks_count─┐
│ system   │ query_thread_log │ 202108       │         1635 │
└──────────┴──────────────────┴──────────────┴──────────────┘

Compare the list of parts in ZooKeeper with the list of parts on disk

select zoo.p_path as part_zoo, zoo.ctime, zoo.mtime, disk.p_path as part_disk
from
(
  select concat(path,'/',name) as p_path, ctime, mtime
  from system.zookeeper where path in (select concat(replica_path,'/parts') from system.replicas)
) zoo
left join 
(
  select concat(replica_path,'/parts/',name) as p_path
  from system.parts inner join system.replicas using (database, table)
) disk on zoo.p_path = disk.p_path
where part_disk='' and zoo.mtime <= now() - interval 1 hour
order by part_zoo;

You can clean that orphan zk records (need to execute using delete in zkCli, rm in zk-shell):

select 'delete '||part_zoo
from (
select zoo.p_path as part_zoo, zoo.ctime, zoo.mtime, disk.p_path as part_disk
from
(
  select concat(path,'/',name) as p_path, ctime, mtime
  from system.zookeeper where path in (select concat(replica_path,'/parts') from system.replicas)
) zoo
left join 
(
  select concat(replica_path,'/parts/',name) as p_path
  from system.parts inner join system.replicas using (database, table)
) disk on zoo.p_path = disk.p_path
where part_disk='' and zoo.mtime <= now() - interval 1 day
order by part_zoo) format TSVRaw;

7 - Schema design

All you need to know about ClickHouse® schema design, including materialized view, limitations, lowcardinality, codecs.

7.1 - ClickHouse® limitations

How much is too much?

In most of the cases ClickHouse® doesn’t have any hard limits. But obviously there there are some practical limitation / barriers for different things - often they are caused by some system / network / filesystem limitation.

So after reaching some limits you can get different kind of problems, usually it never a failures / errors, but different kinds of degradations (slower queries / high cpu/memory usage, extra load on the network / zookeeper etc).

While those numbers can vary a lot depending on your hardware & settings there is some safe ‘Goldilocks’ zone where ClickHouse work the best with default settings & usual hardware.

Number of tables (system-wide, across all databases)

non-replicated MergeTree-family tables = few thousands is still acceptable, if you don’t do realtime inserts in more that few dozens of them. See #32259
ReplicatedXXXMergeTree = few hundreds is still acceptable, if you don’t do realtime inserts in more that few dozens of them. Every Replicated table comes with it’s own cost (need to do housekeeping operations, monitoring replication queues etc). See #31919
Log family table = even dozens of thousands is still ok, especially if database engine = Lazy is used.

Number of databases

Fewer than number of tables (above). Dozens / hundreds is usually still acceptable.

Number of inserts per seconds

For usual (non async) inserts - dozen is enough. Every insert creates a part, if you will create parts too often, ClickHouse will not be able to merge them and you will be getting ’too many parts’.

Number of columns in the table

Up to a few hundreds. With thousands of columns the inserts / background merges may become slower / require more of RAM. See for example https://github.com/ClickHouse/ClickHouse/issues/6943 https://github.com/ClickHouse/ClickHouse/issues/27502

ClickHouse instances on a single node / VM

One is enough. Single ClickHouse can use resources of the node very efficiently, and it may require some complicated tuning to run several instances on a single node.

Number of parts / partitions (system-wide, across all databases)

More than several dozens thousands may lead to performance degradation: slow starts (see https://github.com/ClickHouse/ClickHouse/issues/10087 ).

Number of tables & partitions touched by a single insert

If you have realtime / frequent inserts no more than few.

For the inserts are rare - up to couple of dozens.

Number of parts in the single table

More than ~ 5 thousands may lead to issues with alters in Replicated tables (caused by jute.maxbuffer overrun, see details ), and query speed degradation.

Disk size per shard

Less than 10TB of compressed data per server. Bigger disk are harder to replace / resync.

Number of shards

Dozens is still ok. More may require having more complex (non-flat) routing.

Number of replicas in a single shard

2 is minimum for HA. 3 is a ‘golden standard’. Up to 6-8 is still ok. If you have more with realtime inserts - it can impact the zookeeper traffic.

Number of Zookeeper nodes in the ensemble

3 (Three) for most of the cases is enough (you can loose one node). Using more nodes allows to scale up read throughput for zookeeper, but doesn’t improve writes at all.

Number of materialized views attached to a single table.

Up to few. The less the better if the table is getting realtime inserts. (no matter if MV are chained or all are fed from the same source table).

The more you have the more costly your inserts are, and the bigger risks to get some inconsistencies between some MV (inserts to MV and main table are not atomic).

If the table doesn’t have realtime inserts you can have more MV.

Number of projections inside a single table.

Up to few. Similar to MV above.

Number of secondary indexes a single table.

One to about a dozen. Different types of indexes has different penalty, bloom_filter is 100 times heavier than min_max index At some point your inserts will slow down. Try to create possible minimum of indexes. You can combine many columns into a single index and this index will work for any predicate but create less impact.

Number of Kafka tables / consumers inside

High number of Kafka tables maybe quite expensive (every consumer = very expensive librdkafka object with several threads inside). Usually alternative approaches are preferable (mixing several datastreams in one topic, denormalizing, consuming several topics of identical structure with a single Kafka table, etc).

If you really need a lot of Kafka tables you may need more ram / CPU on the node and increase background_message_broker_schedule_pool_size (default is 16) to the number of Kafka tables.

7.2 - ClickHouse® row-level deduplication

ClickHouse® row-level deduplication.

(This article is about row-level deduplication of already ingested data. For insert/block-level deduplication and insert idempotency, see Insert Deduplication / Insert Idempotency . For materialized-view retry semantics, see Idempotent inserts into a materialized view .)

There is quite common requirement to do deduplication on a record level in ClickHouse.

Sometimes duplicates are appear naturally on collector side.
Sometime they appear due the the fact that message queue system (Kafka/Rabbit/etc) offers at-least-once guarantees.
Sometimes you just expect insert idempotency on row level.

For the general case, ClickHouse does not provide a cheap built-in way to enforce arbitrary row-level uniqueness across an already large table. That is a different problem from retry-safe insert deduplication, which ClickHouse supports separately for MergeTree family inserts.

The reason is simple: to check if the row already exists you need a lookup that is closer to a key-value access pattern (which is not what ClickHouse is optimized for), in general case - across the whole huge table (which can be terabyte/petabyte size).

But there many usecases when you can achieve something like row-level deduplication in ClickHouse:

Approach 0. Make deduplication before ingesting data to ClickHouse

Pros:

you have full control
clean and simple schema and selects in ClickHouse

Cons:

extra coding and ‘moving parts’, storing some ids somewhere
check if row exists in ClickHouse before insert can give non-satisfying results if you use ClickHouse cluster (i.e. Replicated / Distributed tables) - due to eventual consistency.

Approach 1. Allow duplicates during ingestion.

Remove them on SELECT level (by things like GROUP BY)

Pros:

simple inserts

Cons:

complicates selects
all selects will be significantly slower

Approach 2. Eventual deduplication using ReplacingMergeTree

Pros:

simple

Cons:

can force you to use suboptimal ORDER BY (which will guarantee record uniqueness)
deduplication is eventual - you never know when it will happen, and you will get some duplicates if you don’t use FINAL
selects with FINAL (select * from table_name FINAL) add overhead and should be benchmarked
- older versions often needed manual optimization https://github.com/ClickHouse/ClickHouse/issues/31411
- performance has improved significantly in recent releases, so FINAL is often acceptable in production workloads https://clickhouse.com/blog/common-getting-started-issues-with-clickhouse
- additional tuning notes: https://kb.altinity.com/altinity-kb-queries-and-syntax/altinity-kb-final-clause-speed/

Approach 3. Eventual deduplication using CollapsingMergeTree

Pros:

you can make the proper aggregations of last state w/o FINAL (bookkeeping-alike sums, counts etc)

Cons:

complicated
can force you to use suboptimal ORDER BY (which will guarantee record uniqueness)
you need to store previous state of the record somewhere, or extract it before ingestion from ClickHouse
deduplication is eventual (same as with Replacing)

Approach 4. Eventual deduplication using Summing/Aggregating/CoalescingMergeTree

use SimpleAggregateFunction( anyLast, …) or AggregateFunction with argMax for Summing/AggregatingMT. CoalescingMergeTree implies anyLast by default

Pros:

you can finish deduplication with GROUP BY instead of FINAL (it’s faster)

Cons:

quite complicated
can force you to use suboptimal ORDER BY (which will guarantee record uniqueness)
deduplication is eventual (same as with ReplacingMergeTree)

Example: keep the latest version of each row in an AggregatingMergeTree table and read the finalized state with GROUP BY:

create table Example4Raw
(
    id UInt64,
    version UInt64,
    metric UInt64
)
engine = MergeTree
order by (id, version);

create table Example4Agg
(
    id UInt64,
    metric_state AggregateFunction(argMax, UInt64, UInt64)
)
engine = AggregatingMergeTree
order by id;

create materialized view Example4AggMV to Example4Agg as
select id, argMaxState(metric, version) as metric_state
from Example4Raw
group by id;

In that example the result contains id = 1, metric = 20 and id = 2, metric = 30.

create table Example4Raw
(
    id UInt64,
    version UInt64,
    metric UInt64
)
engine = MergeTree
order by (id, version);

create table Example4Agg
(
    id UInt64,
    metric_state Nullable(UInt64)
)
engine = CoalescingTree
order by id;

create materialized view Example4AggMV to Example4Agg as
select id, metric as metric_state
from Example4Raw;

Approach 5. Keep data fragments where duplicates are possible to isolate.

Usually you can expect the duplicates only in some time window (like 5 minutes, or one hour, or something like that).

You can put that ‘dirty’ data in separate place, and put it to final MergeTree table after deduplication window timeout. For example - you insert data in some tiny tables (Engine=StripeLog) with minute suffix, and move data from tinytable older that X minutes to target MergeTree (with some external queries). In the meanwhile you can see realtime data using Engine=Merge / VIEWs etc.

Pros:

good control
no duplicated in target table
perfect ingestion speed

Cons:

quite complicated

Approach 6. Deduplication using MV pipeline.

You insert into some temporary table (even with Engine=Null) and MV do join or subselect (which will check the existence of arrived rows in some time frame of target table) and copy new only rows to destination table.

Pros:

don’t impact the select speed

Cons:

complicated
for clusters can be inaccurate due to eventual consistency
slows down inserts significantly (every insert will need to do lookup in target table first)

create table Example1 (id Int64, metric UInt64) 
engine = MergeTree order by id;

create table Example1Null engine = Null as Example1;

create materialized view __Example1 to Example1 as
select * from Example1Null 
where id not in (
   select id from Example1 where id in (
      select id from Example1Null
   )
)

In all case: due to eventual consistency of ClickHouse replication you can still get duplicates if you insert into different replicas/shards.

7.3 - Column backfilling with alter/update using a dictionary

Column backfilling with alter/update using a dictionary

Column backfilling

Sometimes you need to add a column into a huge table and backfill it with a data from another source, without reingesting all data.

Replicated setup

In case of a replicated / sharded setup you need to have the dictionary and source table (dict_table / item_dict) on all nodes and they have to all have EXACTLY the same data. The easiest way to do this is to make dict_table replicated.

In this case, you will need to set the setting allow_nondeterministic_mutations=1 on the user that runs the ALTER TABLE. See the ClickHouse® docs for more information about this setting.

Here is an example.

create database test;
use test;

-- table with an existing data, we need to backfill / update S column

create table fact ( key1 UInt64, key2 String, key3 String, D Date, S String)
Engine MergeTree partition by D order by (key1, key2, key3);

-- example data
insert into fact select number, toString(number%103), toString(number%13), today(), toString(number) from numbers(1e9);
0 rows in set. Elapsed: 155.066 sec. Processed 1.00 billion rows, 8.00 GB (6.45 million rows/s., 51.61 MB/s.)

insert into fact select number, toString(number%103), toString(number%13), today() - 30, toString(number)　from numbers(1e9);
0 rows in set. Elapsed: 141.594 sec. Processed 1.00 billion rows, 8.00 GB (7.06 million rows/s., 56.52 MB/s.)

insert into fact select number, toString(number%103), toString(number%13), today() - 60, toString(number)　from numbers(1e10);
0 rows in set. Elapsed: 1585.549 sec. Processed 10.00 billion rows, 80.01 GB (6.31 million rows/s., 50.46 MB/s.)

select count() from fact;
12000000000                          -- 12 billions rows.


-- table - source of the info to update
create table dict_table ( key1 UInt64, key2 String, key3 String, S String)
Engine MergeTree order by (key1, key2, key3);

-- example data
insert into dict_table select number, toString(number%103), toString(number%13), 
toString(number)||'xxx'　from numbers(1e10);
0 rows in set. Elapsed: 1390.121 sec. Processed 10.00 billion rows, 80.01 GB (7.19 million rows/s., 57.55 MB/s.)

-- DICTIONARY witch will be the source for update / we cannot query dict_table directly
CREATE DICTIONARY item_dict ( key1 UInt64, key2 String, key3 String, S String ) 
PRIMARY KEY key1,key2,key3 SOURCE(CLICKHOUSE(TABLE dict_table DB 'test' USER 'default')) 
LAYOUT(complex_key_cache(size_in_cells 50000000))
Lifetime(60000);



-- let's test that the dictionary is working

select dictGetString('item_dict', 'S', tuple(toUInt64(1),'1','1'));
┌─dictGetString('item_dict', 'S', tuple(toUInt64(1), '1', '1'))─┐
│ 1xxx                                                          │
└───────────────────────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.080 sec.

SELECT dictGetString('item_dict', 'S', (toUInt64(1111111), '50', '1'))
┌─dictGetString('item_dict', 'S', tuple(toUInt64(1111111), '50', '1'))─┐
│ 1111111xxx                                                           │
└──────────────────────────────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.004 sec.


-- Now let's lower number of simultaneous updates/mutations

select value from system.settings where name like '%background_pool_size%';
┌─value─┐
│ 16    │
└───────┘

alter table fact modify setting number_of_free_entries_in_pool_to_execute_mutation=15; -- only one mutation is possible per time / 16 - 15 = 1


-- the mutation itself
alter table test.fact update S = dictGetString('test.item_dict', 'S', tuple(key1,key2,key3)) where 1;

-- mutation took 26 hours and item_dict used bytes_allocated: 8187277280


select * from system.mutations where not is_done \G

Row 1:
──────
database:                   test
table:                      fact
mutation_id:                mutation_11452.txt
command:                    UPDATE S = dictGetString('test.item_dict', 'S', (key1, key2, key3)) WHERE 1
create_time:                2022-01-29 20:21:00
block_numbers.partition_id: ['']
block_numbers.number:       [11452]
parts_to_do_names:          ['20220128_1_954_4','20211230_955_1148_3','20211230_1149_1320_3','20211230_1321_1525_3','20211230_1526_1718_3','20211230_1719_1823_3','20211230_1824_1859_2','20211230_1860_1895_2','20211230_1896_1900_1','20211230_1901_1906_1','20211230_1907_1907_0','20211230_1908_1908_0','20211130_2998_9023_5','20211130_9024_10177_4','20211130_10178_11416_4','20211130_11417_11445_2','20211130_11446_11446_0']
parts_to_do:                17
is_done:                    0
latest_failed_part:
latest_fail_time:           1970-01-01 00:00:00
latest_fail_reason:


SELECT
    table,
    (elapsed * (1 / progress)) - elapsed,
    elapsed,
    progress,
    is_mutation,
    formatReadableSize(total_size_bytes_compressed) AS size,
    formatReadableSize(memory_usage) AS mem
FROM system.merges
ORDER BY progress DESC

┌─table────────────────────────┬─minus(multiply(elapsed, divide(1, progress)), elapsed)─┬─────────elapsed─┬────────────progress─┬─is_mutation─┬─size───────┬─mem───────┐
│ fact                         │                                      7259.920140111059 │  8631.476589565 │  0.5431540560211632 │           1 │ 1.89 GiB   │ 0.00 B    │
│ fact                         │                                      60929.22808705666 │ 23985.610558929 │ 0.28246665649246827 │           1 │ 9.86 GiB   │ 4.25 MiB  │
└──────────────────────────────┴────────────────────────────────────────────────────────┴─────────────────┴─────────────────────┴─────────────┴────────────┴───────────┘


SELECT *　FROM system.dictionaries　WHERE name = 'item_dict'　\G

Row 1:
──────
database:                    test
name:                        item_dict
uuid:                        28fda092-260f-430f-a8fd-a092260f330f
status:                      LOADED
origin:                      28fda092-260f-430f-a8fd-a092260f330f
type:                        ComplexKeyCache
key.names:                   ['key1','key2','key3']
key.types:                   ['UInt64','String','String']
attribute.names:             ['S']
attribute.types:             ['String']
bytes_allocated:             8187277280
query_count:                 12000000000
hit_rate:                    1.6666666666666666e-10
found_rate:                  1
element_count:               67108864
load_factor:                 1
source:                      ClickHouse: test.dict_table
lifetime_min:                0
lifetime_max:                60000
loading_start_time:          2022-01-29 20:20:50
last_successful_update_time: 2022-01-29 20:20:51
loading_duration:            0.829
last_exception:


-- Check that data is updated 

SELECT *
FROM test.fact
WHERE key1 = 11111

┌──key1─┬─key2─┬─key3─┬──────────D─┬─S────────┐
│ 11111 │ 90   │ 9    │ 2021-12-30 │ 11111xxx │
│ 11111 │ 90   │ 9    │ 2022-01-28 │ 11111xxx │
│ 11111 │ 90   │ 9    │ 2021-11-30 │ 11111xxx │
└───────┴──────┴──────┴────────────┴──────────┘

7.4 - Functions to count uniqs

Functions to count uniqs.

Functions to count uniqs

Function	Function(State)	StateSize	Result	QPS
uniqExact	uniqExactState	1600003	100000	59.23
uniq	uniqState	200804	100315	85.55
uniqCombined	uniqCombinedState	98505	100314	108.09
uniqCombined(12)	uniqCombinedState(12)	3291	98160	151.64
uniqCombined(15)	uniqCombinedState(15)	24783	100768	110.18
uniqCombined(18)	uniqCombinedState(18)	196805	100332	101.56
uniqCombined(20)	uniqCombinedState(20)	786621	100088	65.05
uniqCombined64(12)	uniqCombined64State(12)	3291	98160	164.96
uniqCombined64(15)	uniqCombined64State(15)	24783	100768	133.96
uniqCombined64(18)	uniqCombined64State(18)	196805	100332	110.85
uniqCombined64(20)	uniqCombined64State(20)	786621	100088	66.48
uniqHLL12	uniqHLL12State	2651	101344	177.91
uniqTheta	uniqThetaState	32795	98045	144.05
uniqUpTo(100)	uniqUpToState(100)	1	101	222.93

Stats collected via script below on 22.2

funcname=( "uniqExact" "uniq" "uniqCombined" "uniqCombined(12)" "uniqCombined(15)" "uniqCombined(18)" "uniqCombined(20)" "uniqCombined64(12)" "uniqCombined64(15)" "uniqCombined64(18)" "uniqCombined64(20)" "uniqHLL12" "uniqTheta" "uniqUpTo(100)")
funcname2=( "uniqExactState" "uniqState" "uniqCombinedState" "uniqCombinedState(12)" "uniqCombinedState(15)" "uniqCombinedState(18)" "uniqCombinedState(20)" "uniqCombined64State(12)" "uniqCombined64State(15)" "uniqCombined64State(18)" "uniqCombined64State(20)" "uniqHLL12State" "uniqThetaState" "uniqUpToState(100)")

length=${#funcname[@]}
 

for (( j=0; j<length; j++ ));
do
  f1="${funcname[$j]}"
  f2="${funcname2[$j]}"
  size=$( clickhouse-client -q "select ${f2}(toString(number)) from numbers_mt(100000) FORMAT RowBinary" | wc -c )
  result="$( clickhouse-client -q "select ${f1}(toString(number)) from numbers_mt(100000)" )"
  time=$( rm /tmp/clickhouse-benchmark.json; echo "select ${f1}(toString(number)) from numbers_mt(100000)" | clickhouse-benchmark -i200 --json=/tmp/clickhouse-benchmark.json &>/dev/null; cat /tmp/clickhouse-benchmark.json | grep QPS  )

  printf "|%s|%s,%s,%s,%s\n" "$f1" "$f2" "$size" "$result" "$time"
done

groupBitmap

Use Roaring Bitmaps underneath. Return amount of uniq values.

Can be used with Int* types Works really great when your values quite similar. (Low memory usage / state size)

Example with blockchain data, block_number is monotonically increasing number.

SELECT groupBitmap(block_number) FROM blockchain;

┌─groupBitmap(block_number)─┐
│                  48478157 │
└───────────────────────────┘

MemoryTracker: Peak memory usage (for query): 64.44 MiB.
1 row in set. Elapsed: 32.044 sec. Processed 4.77 billion rows, 38.14 GB (148.77 million rows/s., 1.19 GB/s.)

SELECT uniqExact(block_number) FROM blockchain;

┌─uniqExact(block_number)─┐
│                48478157 │
└─────────────────────────┘

MemoryTracker: Peak memory usage (for query): 4.27 GiB.
1 row in set. Elapsed: 70.058 sec. Processed 4.77 billion rows, 38.14 GB (68.05 million rows/s., 544.38 MB/s.)

7.5 - How to change ORDER BY

How to change ORDER BY.

Create a new table and copy data through an intermediate table. Step by step procedure.

We want to add column3 to the ORDER BY in this table:

CREATE TABLE example_table
(
  date Date,
  column1 String,
  column2 String,
  column3 String,
  column4 String
)
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/default/example_table', '{replica}')
PARTITION BY toYYYYMM(date)
ORDER BY (column1, column2)

Stop publishing/INSERT into example_table.
Rename table example_table to example_table_old
Create the new table with the old name. This will preserve all dependencies like materialized views.

CREATE TABLE example_table as example_table_old 
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/default/example_table_new', '{replica}')
PARTITION BY toYYYYMM(date)
ORDER BY (column1, column2, column3)

Copy data from example_table_old into example_table_temp

a. Use this query to generate a list of INSERT statements

-- old Clickhouse versions before a support of `where _partition_id`
select concat('insert into example_table_temp select * from example_table_old where toYYYYMM(date)=',partition) as cmd, 
database, table, partition, sum(rows), sum(bytes_on_disk), count()
from system.parts
where database='default' and table='example_table_old'
group by database, table, partition
order by partition

-- newer Clickhouse versions with a support of `where _partition_id`
select concat('insert into example_table_temp select * from ', table,' where _partition_id = \'',partition_id, '\';') as cmd, 
database, table, partition, sum(rows), sum(bytes_on_disk), count()
from system.parts
where database='default' and table='example_table_old'
group by database, table, partition_id, partition
order by partition_id

b. Create an intermediate table

CREATE TABLE example_table_temp as example_table_old 
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (column1, column2, column3)

c. Run the queries one by one

After each query compare the number of rows in both tables. If the INSERT statement was interrupted and failed to copy data, drop the partition in example_table and repeat the INSERT. If a partition was copied successfully, proceed to the next partition.

Here is a query to compare the tables:

select database, table, partition, sum(rows), sum(bytes_on_disk), count()
from system.parts
where database='default' and table like 'example_table%'
group by database, table, partition
order by partition

Attach data from the intermediate table to example_table

a. Use this query to generate a list of ATTACH statements

select concat('alter table example_table attach partition id ''',partition,''' from example_table_temp') as cmd, 
database, table, partition, sum(rows), sum(bytes_on_disk), count()
from system.parts
where database='default' and table='example_table_temp'
group by database, table, partition
order by partition

b. Run the queries one by one

Here is a query to compare the tables:

select hostName(), database, table, partition, sum(rows), sum(bytes_on_disk), count()
from clusterAllReplicas('my-cluster',system.parts)
where database='default' and table like 'example_table%'
group by hostName(), database, table, partition
order by partition

Drop example_table_old and example_table_temp

7.6 - Ingestion of AggregateFunction

ClickHouse® - How to insert AggregateFunction data

How to insert AggregateFunction data

Ephemeral column

CREATE TABLE users (
  uid Int16, 
  updated SimpleAggregateFunction(max, DateTime),
  name_stub String Ephemeral,
  name AggregateFunction(argMax, String, DateTime) 
       default arrayReduce('argMaxState', [name_stub], [updated])
) ENGINE=AggregatingMergeTree order by uid;

INSERT INTO users(uid, updated, name_stub) VALUES (1231, '2020-01-02 00:00:00', 'Jane');

INSERT INTO users(uid, updated, name_stub) VALUES (1231, '2020-01-01 00:00:00', 'John');

SELECT
    uid,
    max(updated) AS updated,
    argMaxMerge(name)
FROM users
GROUP BY uid
┌──uid─┬─────────────updated─┬─argMaxMerge(name)─┐
│ 1231 │ 2020-01-02 00:00:00 │ Jane              │
└──────┴─────────────────────┴───────────────────┘

Input function

CREATE TABLE users (
  uid Int16, 
  updated SimpleAggregateFunction(max, DateTime),
  name AggregateFunction(argMax, String, DateTime)
) ENGINE=AggregatingMergeTree order by uid;

INSERT INTO users
SELECT uid, updated, arrayReduce('argMaxState', [name], [updated])
FROM input('uid Int16, updated DateTime, name String') FORMAT Values (1231, '2020-01-02 00:00:00', 'Jane');

INSERT INTO users
SELECT uid, updated, arrayReduce('argMaxState', [name], [updated])
FROM input('uid Int16, updated DateTime, name String') FORMAT Values (1231, '2020-01-01 00:00:00', 'John');

SELECT
    uid,
    max(updated) AS updated,
    argMaxMerge(name)
FROM users
GROUP BY uid;
┌──uid─┬─────────────updated─┬─argMaxMerge(name)─┐
│ 1231 │ 2020-01-02 00:00:00 │ Jane              │
└──────┴─────────────────────┴───────────────────┘

Materialized View And Null Engine

CREATE TABLE users (
  uid Int16, 
  updated SimpleAggregateFunction(max, DateTime),
  name AggregateFunction(argMax, String, DateTime)
) ENGINE=AggregatingMergeTree order by uid;

CREATE TABLE users_null (
  uid Int16, 
  updated DateTime,
  name String
) ENGINE=Null;

CREATE MATERIALIZED VIEW users_mv TO users AS
SELECT uid, updated, arrayReduce('argMaxState', [name], [updated]) name
FROM users_null;

INSERT INTO users_null Values (1231, '2020-01-02 00:00:00', 'Jane');

INSERT INTO users_null Values (1231, '2020-01-01 00:00:00', 'John');

SELECT
    uid,
    max(updated) AS updated,
    argMaxMerge(name) 
FROM users
GROUP BY uid;
┌──uid─┬─────────────updated─┬─argMaxMerge(name)─┐
│ 1231 │ 2020-01-02 00:00:00 │ Jane              │
└──────┴─────────────────────┴───────────────────┘

7.7 - Insert Deduplication / Insert Idempotency

Using ClickHouse® features to avoid duplicate data

Replicated tables have a special feature insert deduplication (enabled by default).

Documentation: Data blocks are deduplicated. For multiple writes of the same data block (data blocks of the same size containing the same rows in the same order), the block is only written once. The reason for this is in case of network failures when the client application does not know if the data was written to the DB, so the INSERT query can simply be repeated. It does not matter which replica INSERTs were sent to with identical data. INSERTs are idempotent. Deduplication parameters are controlled by merge_tree server settings.

Example

create table test_insert ( A Int64 ) 
Engine=ReplicatedMergeTree('/clickhouse/cluster_test/tables/{table}','{replica}') 
order by A;
 
insert into test_insert values(1);
insert into test_insert values(1);
insert into test_insert values(1);
insert into test_insert values(1);

select * from test_insert;
┌─A─┐
│ 1 │                                       -- only one row has been inserted, the other rows were deduplicated
└───┘

alter table test_insert delete where 1;    -- that single row was removed

insert into test_insert values(1);

select * from test_insert;
0 rows in set. Elapsed: 0.001 sec.         -- the last insert was deduplicated again, 
                                           -- because `alter ... delete` does not clear deduplication checksums
                                           -- only `alter table drop partition` and `truncate` clear checksums

In clickhouse-server.log you may see trace messages Block with ID ... already exists locally as part ... ignoring it

# cat /var/log/clickhouse-server/clickhouse-server.log|grep test_insert|grep Block
..17:52:45.064974.. Block with ID all_7615936253566048997_747463735222236827 already exists locally as part all_0_0_0; ignoring it.
..17:52:45.068979.. Block with ID all_7615936253566048997_747463735222236827 already exists locally as part all_0_0_0; ignoring it.
..17:52:45.072883.. Block with ID all_7615936253566048997_747463735222236827 already exists locally as part all_0_0_0; ignoring it.
..17:52:45.076738.. Block with ID all_7615936253566048997_747463735222236827 already exists locally as part all_0_0_0; ignoring it.

Deduplication checksums are stored in Zookeeper in /blocks table’s znode for each partition separately, so when you drop partition, they could be identified and removed for this partition. (during alter table delete it’s impossible to match checksums, that’s why checksums stay in Zookeeper).

SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/cluster_test/tables/test_insert/blocks'
┌─name───────────────────────────────────────┬─value─────┐
│ all_7615936253566048997_747463735222236827 │ all_0_0_0 │
└────────────────────────────────────────────┴───────────┘

insert_deduplicate setting

Insert deduplication is controlled by the insert_deduplicate setting

Let’s disable it:

set insert_deduplicate = 0;              -- insert_deduplicate is now disabled in this session

insert into test_insert values(1);
insert into test_insert values(1);
insert into test_insert values(1);

select * from test_insert format PrettyCompactMonoBlock;
┌─A─┐
│ 1 │
│ 1 │
│ 1 │                                   -- all 3 insterted rows are in the table
└───┘

alter table test_insert delete where 1;

insert into test_insert values(1);
insert into test_insert values(1);

select * from test_insert format PrettyCompactMonoBlock;
┌─A─┐
│ 1 │
│ 1 │
└───┘

Insert deduplication is a user-level setting, it can be disabled in a session or in a user’s profile (insert_deduplicate=0).

clickhouse-client --insert_deduplicate=0 ....

How to disable insert_deduplicate by default for all queries:

# cat /etc/clickhouse-server/users.d/insert_deduplicate.xml
<?xml version="1.0"?>
<yandex>
    <profiles>
        <default>
            <insert_deduplicate>0</insert_deduplicate>
        </default>
    </profile
</yandex>

More info: https://github.com/ClickHouse/ClickHouse/issues/16037 https://github.com/ClickHouse/ClickHouse/issues/3322

Non-replicated MergeTree tables

By default insert deduplication is disabled for non-replicated tables (for backward compatibility).

It can be enabled by the merge_tree setting non_replicated_deduplication_window .

Example:

create table test_insert ( A Int64 ) 
Engine=MergeTree 
order by A
settings non_replicated_deduplication_window = 100;          -- 100 - how many latest checksums to store
 
insert into test_insert values(1);
insert into test_insert values(1);
insert into test_insert values(1);

insert into test_insert values(2);
insert into test_insert values(2);

select * from test_insert format PrettyCompactMonoBlock;
┌─A─┐
│ 2 │
│ 1 │
└───┘

In case of non-replicated tables deduplication checksums are stored in files in the table’s folder:

cat /var/lib/clickhouse/data/default/test_insert/deduplication_logs/deduplication_log_1.txt
1	all_1_1_0	all_7615936253566048997_747463735222236827
1	all_4_4_0	all_636943575226146954_4277555262323907666

Checksums calculation

Checksums are calculated not from the inserted data but from formed parts.

Insert data is separated to parts by table’s partitioning.

Parts contain rows sorted by the table’s order by and all values of functions (i.e. now()) or Default/Materialized columns are expanded.

Example with partial insertion because of partitioning:

create table test_insert ( A Int64, B Int64 ) 
Engine=MergeTree 
partition by B 
order by A
settings non_replicated_deduplication_window = 100;  


insert into test_insert values (1,1);
insert into test_insert values (1,1)(1,2);

select * from test_insert format PrettyCompactMonoBlock;
┌─A─┬─B─┐
│ 1 │ 1 │
│ 1 │ 2 │                                   -- the second insert was skipped for only one partition!!!
└───┴───┘

Example with deduplication despite the rows order:

drop table test_insert;

create table test_insert ( A Int64, B Int64 ) 
Engine=MergeTree 
order by (A, B)
settings non_replicated_deduplication_window = 100;  

insert into test_insert values (1,1)(1,2);
insert into test_insert values (1,2)(1,1);  -- the order of rows is not equal with the first insert

select * from test_insert format PrettyCompactMonoBlock;
┌─A─┬─B─┐
│ 1 │ 1 │
│ 1 │ 2 │
└───┴───┘
2 rows in set. Elapsed: 0.001 sec.          -- the second insert was skipped despite the rows order

Example to demonstrate how Default/Materialize columns are expanded:

drop table test_insert;

create table test_insert ( A Int64, B Int64 Default rand() ) 
Engine=MergeTree 
order by A
settings non_replicated_deduplication_window = 100;

insert into test_insert(A) values (1);                 -- B calculated as  rand()
insert into test_insert(A) values (1);                 -- B calculated as  rand()

select * from test_insert format PrettyCompactMonoBlock;
┌─A─┬──────────B─┐
│ 1 │ 3467561058 │
│ 1 │ 3981927391 │
└───┴────────────┘

insert into test_insert(A, B) values (1, 3467561058);  -- B is not calculated / will be deduplicated

select * from test_insert format PrettyCompactMonoBlock;
┌─A─┬──────────B─┐
│ 1 │ 3981927391 │
│ 1 │ 3467561058 │
└───┴────────────┘

Example to demonstrate how functions are expanded:

drop table test_insert;
create table test_insert ( A Int64, B DateTime64 ) 
Engine=MergeTree 
order by A
settings non_replicated_deduplication_window = 100;

insert into test_insert values (1, now64());
....
insert into test_insert values (1, now64());

select * from test_insert format PrettyCompactMonoBlock;
┌─A─┬───────────────────────B─┐
│ 1 │ 2022-01-31 15:43:45.364 │
│ 1 │ 2022-01-31 15:43:41.944 │
└───┴─────────────────────────┘

insert_deduplication_token

Since ClickHouse® 22.2 there is a new setting insert_deduplication_token . This setting allows you to define an explicit token that will be used for deduplication instead of calculating a checksum from the inserted data.

CREATE TABLE test_table
( A Int64 )
ENGINE = MergeTree
ORDER BY A
SETTINGS non_replicated_deduplication_window = 100;

INSERT INTO test_table SETTINGS insert_deduplication_token = 'test' VALUES (1);

-- the next insert won't be deduplicated because insert_deduplication_token is different
INSERT INTO test_table SETTINGS insert_deduplication_token = 'test1' VALUES (1);

-- the next insert will be deduplicated because insert_deduplication_token 
-- is the same as one of the previous
INSERT INTO test_table SETTINGS insert_deduplication_token = 'test' VALUES (2);
SELECT * FROM test_table
┌─A─┐
│ 1 │
└───┘
┌─A─┐
│ 1 │
└───┘

7.8 - JSONEachRow, Tuples, Maps and Materialized Views

How to use Tuple() and Map() with nested JSON messages in MVs

Using JSONEachRow with Tuple() in Materialized views

Sometimes we can have a nested json message with a fixed size structure like this:

{"s": "val1", "t": {"i": 42, "d": "2023-09-01 12:23:34.231"}}

Values can be NULL but the structure should be fixed. In this case we can use Tuple() to parse the JSON message:

CREATE TABLE tests.nest_tuple_source
(
    `s` String,
    `t` Tuple(`i` UInt8, `d` DateTime64(3))
)
ENGINE = Null

We can use the above table as a source for a materialized view, like it was a Kafka table and in case our message has unexpected keys we make the Kafka table ignore them with the setting (23.3+):

input_format_json_ignore_unknown_keys_in_named_tuple = 1

CREATE MATERIALIZED VIEW tests.mv_nest_tuple TO tests.nest_tuple_destination
AS
SELECT
    s AS s,
    t.1 AS i,
    t.2 AS d
FROM tests.nest_tuple_source

Also, we need a destination table with an adapted structure as the source table:

CREATE TABLE tests.nest_tuple_destination
(
    `s` String,
    `i` UInt8, 
    `d` DateTime64(3)
)
ENGINE = MergeTree
ORDER BY tuple()

INSERT INTO tests.nest_tuple_source FORMAT JSONEachRow {"s": "val1", "t": {"i": 42, "d": "2023-09-01 12:23:34.231"}}


SELECT *
FROM nest_tuple_destination

┌─s────┬──i─┬───────────────────────d─┐
│ val1 │ 42 │ 2023-09-01 12:23:34.231 │
└──────┴────┴─────────────────────────┘

Some hints:

💡 Beware of column names in ClickHouse® they are Case sensitive. If a JSON message has the key names in Capitals, the Kafka/Source table should have the same column names in Capitals.
💡 Also this Tuple() approach is not for Dynamic json schemas as explained above. In the case of having a dynamic schema, use the classic approach using JSONExtract set of functions. If the schema is fixed, you can use Tuple() for JSONEachRow format but you need to use classic tuple notation (using index reference) inside the MV, because using named tuples inside the MV won’t work:
💡 tuple.1 AS column1, tuple.2 AS column2 CORRECT!
💡 tuple.column1 AS column1, tuple.column2 AS column2 WRONG!
💡 use AS (alias) for aggregated columns or columns affected by functions because MV do not work by positional arguments like SELECTs,they work by names**

Example:

parseDateTime32BestEffort(t_date) WRONG!
parseDateTime32BestEffort(t_date) AS t_date CORRECT!

Using JSONEachRow with Map() in Materialized views

Sometimes we can have a nested json message with a dynamic size like these and all elements inside the nested json must be of the same type:

{"k": "val1", "st": {"a": 42, "b": 1.877363}}

{"k": "val2", "st": {"a": 43, "b": 2.3343, "c": 34.4434}}

{"k": "val3", "st": {"a": 66743}}

In this case we can use Map() to parse the JSON message:


CREATE TABLE tests.nest_map_source
(
    `k` String,
    `st` Map(String, Float64)
)
Engine = Null 

CREATE MATERIALIZED VIEW tests.mv_nest_map TO tests.nest_map_destination
AS
SELECT
    k AS k,
    st['a'] AS st_a,
    st['b'] AS st_b,
    st['c'] AS st_c
FROM tests.nest_map_source 


CREATE TABLE tests.nest_map_destination
(
    `k` String,
    `st_a` Float64,
    `st_b` Float64,
    `st_c` Float64
)
ENGINE = MergeTree
ORDER BY tuple()

By default, ClickHouse will ignore unknown keys in the Map() but if you want to fail the insert if there are unknown keys then use the setting:

input_format_skip_unknown_fields = 0

INSERT INTO tests.nest_map_source FORMAT JSONEachRow {"k": "val1", "st": {"a": 42, "b": 1.877363}}
INSERT INTO tests.nest_map_source FORMAT JSONEachRow {"k": "val2", "st": {"a": 43, "b": 2.3343, "c": 34.4434}}
INSERT INTO tests.nest_map_source FORMAT JSONEachRow {"k": "val3", "st": {"a": 66743}}


SELECT *
FROM tests.nest_map_destination

┌─k────┬─st_a─┬─────st_b─┬─st_c─┐
│ val1 │   42 │ 1.877363 │    0 │
└──────┴──────┴──────────┴──────┘
┌─k────┬──st_a─┬─st_b─┬─st_c─┐
│ val3 │ 66743 │    0 │    0 │
└──────┴───────┴──────┴──────┘
┌─k────┬─st_a─┬───st_b─┬────st_c─┐
│ val2 │   43 │ 2.3343 │ 34.4434 │
└──────┴──────┴────────┴─────────┘

See also:

7.9 - Pre-Aggregation approaches

ETL vs Materialized Views vs Projections in ClickHouse®

Pre-Aggregation approaches: ETL vs Materialized Views vs Projections

	ETL	MV	Projections
Realtime	no	yes	yes
How complex queries can be used to build the preaggregaton	any	complex	very simple
Impacts the insert speed	no	yes	yes
Are inconsistancies possible	Depends on ETL. If it process the errors properly - no.	yes (no transactions / atomicity)	no
Lifetime of aggregation	any	any	Same as the raw data
Requirements	need external tools/scripting	is a part of database schema	is a part of table schema
How complex to use in queries	Depends on aggregation, usually simple, quering a separate table	Depends on aggregation, sometimes quite complex, quering a separate table	Very simple, quering the main table
Can work correctly with ReplacingMergeTree as a source	Yes	No	No
Can work correctly with CollapsingMergeTree as a source	Yes	For simple aggregations	For simple aggregations
Can be chained	Yes (Usually with DAGs / special scripts)	Yes (but may be not straightforward, and often is a bad idea)	No
Resources needed to calculate the increment	May be significant	Usually tiny	Usually tiny

7.10 - SnowflakeID for Efficient Primary Keys

SnowflakeID for Efficient Primary Keys

In data warehousing (DWH) environments, the choice of primary key (PK) can significantly impact performance, particularly in terms of RAM usage and query speed. This is where SnowflakeID comes into play, providing a robust solution for PK management. Here’s a deep dive into why and how Snowflake IDs are beneficial and practical implementation examples.

Why Snowflake ID?

Natural IDs Suck: Natural keys derived from business data can lead to various issues like complexity and instability. Surrogate keys, on the other hand, are system-generated and stable.
Surrogate keys simplify joins and indexing, which is crucial for performance in large-scale data warehousing.
Monotonic or sequential IDs help maintain the order of entries, which is essential for performance tuning and efficient range queries.
Having both a timestamp and a unique ID in the same column allows for fast filtering of rows during SELECT operations. This is particularly useful for time-series data.

Building Snowflake IDs

There are two primary methods to construct the lower bits of a Snowflake ID:

Hash of Important Columns:
Using a hash function on significant columns ensures uniqueness and distribution.
Row Number in insert batch
Utilizing the row number within data blocks provides a straightforward approach to generating unique identifiers.

Implementation as UDF

Here’s how to implement Snowflake IDs using standard SQL functions while utilizing second and millisecond timestamps.

Pack hash to lower 22 bits for DateTime64 and 32bits for DateTime

create function toSnowflake64 as (dt,ch) ->
  bitOr(dateTime64ToSnowflakeID(dt),
   bitAnd(bitAnd(ch,0x3FFFFF)+
      bitAnd(bitShiftRight(ch, 20),0x3FFFFF)+
      bitAnd(bitShiftRight(ch, 40),0x3FFFFF),
      0x3FFFFF) 
  );

create function toSnowflake as (dt,ch) ->
  bitOr(dateTimeToSnowflakeID(dt),
   bitAnd(bitAnd(ch,0xFFFFFFFF)+
      bitAnd(bitShiftRight(ch, 32),0xFFFFFFFF),
      0xFFFFFFFF) 
  );
    
with cityHash64('asdfsdnfs;n') as ch,
  now64() as dt
select dt,
  hex(toSnowflake64(dt,ch) as sn) ,
  snowflakeIDToDateTime64(sn);

with cityHash64('asdfsdnfs;n') as ch,
  now() as dt
select dt,
  hex(toSnowflake(dt,ch) as sn) ,
  snowflakeIDToDateTime(sn);

Creating Tables with Snowflake ID

Using Materialized Columns and hash

create table XX 
(
  id Int64 materialized toSnowflake(now(),cityHash64(oldID)),
  oldID  String,
  data String
) engine=MergeTree order by id;

Note: Using User-Defined Functions (UDFs) in CREATE TABLE statements is not always useful, as they expand to create table DDL, and changing them is inconvenient.

Using a Null Table, Materialized View, and rowNumberInAllBlocks

A more efficient approach involves using a Null table and materialized views.

create table XX 
(
  id Int64,
  data String
) engine=MergeTree order by id;

create table Null (data String) engine=Null;
create materialized view _XX to XX as
select toSnowflake(now(),rowNumberInAllBlocks()) is id, data
from Null;

Converting from UUID to SnowFlakeID for subsequent events

Consider that your event stream only has a UUID column identifying a particular user. Registration time that can be used as a base for SnowFlakeID is presented only in the first ‘register’ event, but not in subsequent events. It’s easy to generate SnowFlakeID for the register event, but next, we need to get it from some other table without disturbing the ingestion process too much. Using Hash JOINs in Materialized Views is not recommended, so we need some “nested loop join” to get data fast. In Clickhouse, the “nested loop join” is still not supported, but Direct Dictionary can work around it.

CREATE TABLE UUID2ID_store (user_id UUID, id UInt64) 
ENGINE = MergeTree() -- EmbeddedRocksDB can be used instead
ORDER BY user_id
settings index_granularity=256;

CREATE DICTIONARY UUID2ID_dict (user_id UUID, id UInt64) 
PRIMARY KEY user_id
LAYOUT ( DIRECT ())
SOURCE(CLICKHOUSE(TABLE 'UUID2ID_store'));

CREATE OR REPLACE FUNCTION UUID2ID AS (uuid) -> dictGet('UUID2ID_dict',id,uuid);

CREATE MATERIALIZED VIEW _toUUID_store TO UUID2ID_store AS
select user_id, toSnowflake64(event_time, cityHash64(user_id)) as id
from Actions;

Conclusion

Snowflake IDs provide an efficient mechanism for generating unique, monotonic primary keys, which are essential for optimizing query performance in data warehousing environments. By combining timestamps and unique identifiers, snowflake IDs facilitate faster row filtering and ensure stable, surrogate key generation. Implementing these IDs using SQL functions and materialized views ensures that your data warehouse remains performant and scalable.

7.11 - Two columns indexing

How to create ORDER BY suitable for filtering over two different columns in two different queries

Suppose we have telecom CDR data in which A party calls B party. Each data row consists of A party details: event_timestamp, A MSISDN , A IMEI, A IMSI , A start location, A end location , B MSISDN, B IMEI, B IMSI , B start location, B end location, and some other metadata.

Searches will use one of the A or B fields, for example, A IMSI, within the start and end time window.

A msisdn, A imsi, A imei values are tightly coupled as users rarely change their phones.

The queries will be:

select * from X where A = '0123456789' and ts between ...;
select * from X where B = '0123456789' and ts between ...;

and both A & B are high-cardinality values

ClickHouse® primary skip index (ORDER BY/PRIMARY KEY) works great when you always include leading ORDER BY columns in the WHERE filter. There are exceptions for low-cardinality columns and high-correlated values, but here is another case. A & B both have high cardinality, and it seems that their correlation is at a medium level.

Various solutions exist, and their effectiveness largely depends on the correlation of different column data. Testing all solutions on actual data is necessary to select the best one.

ORDER BY + additional Skip Index

create table X (
    A UInt32,
    B UInt32,
    ts DateTime,
    ....
    INDEX ix_B (B) type minmax GRANULARITY 3
) engine = MergeTree
partition by toYYYYMM(ts)
order by (toStartOfDay(ts),A,B);

bloom_filter index type instead of min_max could work fine in some situations.

Inverted index as a projection

create table X (
    A UInt32,
    B UInt32,
    ts DateTime,
    ....
    PROJECTION ix_B  (
        select A, B,ts ORDER BY B, ts
    )
) engine = MergeTree
partition by toYYYYMM(ts)
order by (toStartOfDay(ts),A,B);

select * from X 
where A in (select A from X where B='....' and ts between ...)
  and B='...' and ts between ... ;

The number of rows the subquery returns should not be very high. 1M rows seems to be a suitable limit.
A separate table with a Materialized View can also be used similarly.
accessing pattern for the main table will “point”, so better to lower index_granularity to 256. That will increase RAM usage by Primary Key

mortonEncode

(available from 23.10)

Do not prioritize either A or B, but distribute indexing efficiency between them.

create table X (
    A UInt32,
    B UInt32,
    ts DateTime,
    ....
) engine = MergeTree
partition by toYYYYMM(ts)
order by (toStartOfDay(ts),mortonEncode(A,B));
select * from X where A = '0123456789' and ts between ...;
select * from X where B = '0123456789' and ts between ...;

mortonEncode with non-UInt columns

mortonEncode function requires UInt columns, but sometimes different column types are needed (like String or ipv6). In such a case, the cityHash64() function can be used both for inserting and querying:

create table X (
    A IPv6,
    B IPv6,
    AA alias cityHash64(A),
    BB alias cityHash64(B),
    ts DateTime materialized now()
) engine = MergeTree
partition by toYYYYMM(ts)
order by 
(toStartOfDay(ts),mortonEncode(cityHash64(A),cityHash64(B)))
;

insert into X values ('fd7a:115c:a1e0:ab12:4843:cd96:624c:9a17','fd7a:115c:a1e0:ab12:4843:cd96:624c:9a17')

select * from X where cityHash64(toIPv6('fd7a:115c:a1e0:ab12:4843:cd96:624c:9a17')) =  AA;

hilbertEncode as alternative

(available from 24.6)

hilbertEncode can be used instead of mortonEncode. On some data it allows better results than mortonEncode.

7.12 - Best schema for storing many metrics registered from the single source

Best schema for storing many metrics registered from the single source

Picking the best schema for storing many metrics registered from single source is quite a common problem.

1 One row per metric

i.e.: timestamp, sourceid, metric_name, metric_value

Pros and cons:

Pros:
- simple
- well normalized schema
- easy to extend
- that is quite typical pattern for timeseries databases
Cons
- different metrics values stored in same columns (worse compression)
- to use values of different datatype you need to cast everything to string or introduce few columns for values of different types.
- not always nice as you need to repeat all ‘common’ fields for each row
- if you need to select all data for one time point you need to scan several ranges of data.

2 Each measurement (with lot of metrics) in it’s own row

In that way you need to put all the metrics in one row (i.e.: timestamp, sourceid, ….) That approach is usually a source of debates about how to put all the metrics in one row.

2a Every metric in it’s own column

i.e.: timestamp, sourceid, metric1_value, … , metricN_value

Pros and cons:

Pros
- simple
- really easy to access / scan for rows with particular metric
- specialized and well adjusted datatypes for every metric.
- good for dense recording (each time point can have almost 100% of all the possible metrics)
Cons
- adding new metric = changing the schema (adding new column). not suitable when set of metric changes dynamically
- not applicable when there are too many metrics (when you have more than 100-200)
- when each timepoint have only small subset of metrics recorded - if will create a lot of sparse filled columns.
- you need to store ’lack of value’ somehow (NULLs or default values)
- to read full row - you need to read a lot of column files.

2b Using arrays / Nested / Map

i.e.: timestamp, sourceid, array_of_metric_names, array_of_metric_values

Pros and cons:

Pros
- easy to extend, you can have very dynamic / huge number of metrics.
- you can use Array(LowCardinality(String)) for storing metric names efficiently
- good for sparse recording (each time point can have only 1% of all the possible metrics)
Cons
- you need to extract all metrics for row to reach a single metric
- not very handy / complicated non-standard syntax
- different metrics values stored in single array (bad compression)
- to use values of different datatype you need to cast everything to string or introduce few arrays for values of different types.

2c Using JSON

i.e.: timestamp, sourceid, metrics_data_json

Pros and cons:

Pros
- easy to extend, you can have very dynamic / huge number of metrics.
- the only option to store hierarchical / complicated data structures, also with arrays etc. inside.
- good for sparse recording (each time point can have only 1% of all the possible metrics)
- ClickHouse® has efficient API to work with JSON
- nice if your data originally came in JSON (don’t need to reformat)
Cons
- uses storage non efficiently
- different metrics values stored in single array (bad compression)
- you need to extract whole JSON field to reach single metric
- slower than arrays

2d Using querystring-format URLs

i.e.: timestamp, sourceid, metrics_querystring Same pros/cons as raw JSON, but usually bit more compact than JSON

Pros and cons:

Pros
- ClickHouse has efficient API to work with URLs (extractURLParameter etc)
- can have sense if you data came in such format (i.e. you can store GET / POST request data directly w/o reprocessing)
Cons
- slower than arrays

2e Several ‘baskets’ of arrays

i.e.: timestamp, sourceid, metric_names_basket1, metric_values_basket1, …, metric_names_basketN, metric_values_basketN The same as 2b, but there are several key-value arrays (‘basket’), and metric go to one particular basket depending on metric name (and optionally by metric type)

Pros and cons:

Pros
- address some disadvantages of 2b (you need to read only single, smaller basket for reaching a value, better compression - less unrelated metrics are mixed together)
Cons
- complex

2f Combined approach

In real life Pareto principle is correct for many fields.

For that particular case: usually you need only about 20% of metrics 80% of the time. So you can pick the metrics which are used intensively, and which have a high density, and extract them into separate columns (like in option 2a), leaving the rest in a common ’trash bin’ (like in variants 2b-2e).

With that approach you can have as many metrics as you need and they can be very dynamic. At the same time the most used metrics are stored in special, fine-tuned columns.

At any time you can decide to move one more metric to a separate column ALTER TABLE ... ADD COLUMN metricX Float64 MATERIALIZED metrics.value[indexOf(metrics.names,'metricX')];

3 json type

https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse

7.13 - Codecs

Codecs

Codec Name	Recommended Data Types	Performance Notes
LZ4	Any	Used by default. Extremely fast; good compression; balanced speed and efficiency
ZSTD(level)	Any	Good compression; pretty fast; best for high compression needs. Don’t use levels higher than 3.
LZ4HC(level)	Any	LZ4 High Compression algorithm with configurable level; slower but better compression than LZ4, but decompression is still fast.
Delta	Integer Types, Time Series Data, Timestamps	Preprocessor (should be followed by some compression codec). Stores difference between neighboring values; good for monotonically increasing data.
DoubleDelta	Integer Types, Time Series Data	Stores difference between neighboring delta values; suitable for time series data
Gorilla	Floating Point Types	Calculates XOR between current and previous value; suitable for slowly changing numbers
T64	Integer, Time Series Data, Timestamps	Preprocessor (should be followed by some compression codec). Crops unused high bits; puts them into a 64x64 bit matrix; optimized for 64-bit data types
GCD	Integer Numbers	Preprocessor (should be followed by some compression codec). Greatest common divisor compression; divides values by a common divisor; effective for divisible integer sequences
FPC	Floating Point Numbers	Designed for Float64; Algorithm detailed in FPC paper , ClickHouse® PR #37553
ZSTD_QAT	Any	Requires hardware support for QuickAssist Technology (QAT) hardware; provides accelerated compression tasks
DEFLATE_QPL	Any	Requires hardware support for Intel’s QuickAssist Technology for DEFLATE compression; enhanced performance for specific hardware
LowCardinality	String	It’s not a codec, but a datatype modifier. Reduces representation size; effective for columns with low cardinality
NONE	Non-compressable data with very high entropy, like some random string, or some AggregateFunction states	No compression at all. Can be used on the columns that can not be compressed anyway.

See

How to test different compression codecs

https://altinity.com/blog/2019/7/new-encodings-to-improve-clickhouse

https://www.percona.com/sites/default/files/ple19-slides/day1-pm/clickhouse-for-timeseries.pdf

7.13.1 - Codecs on array columns

Codecs on array columns

Info

Supported since 20.10 (PR #15089 ). On older versions you will get exception: DB::Exception: Codec Delta is not applicable for Array(UInt64) because the data type is not of fixed size.

DROP TABLE IF EXISTS array_codec_test SYNC

create table array_codec_test( number UInt64, arr Array(UInt64) ) Engine=MergeTree ORDER BY number;
INSERT INTO array_codec_test SELECT number,  arrayMap(i -> number + i, range(100)) from numbers(10000000);


/****  Default LZ4  *****/

OPTIMIZE TABLE array_codec_test FINAL;
--- Elapsed: 3.386 sec.


SELECT * FROM system.columns WHERE (table = 'array_codec_test') AND (name = 'arr')
/*
Row 1:
──────
database:                default
table:                   array_codec_test
name:                    arr
type:                    Array(UInt64)
position:                2
default_kind:         
default_expression:   
data_compressed_bytes:   173866750
data_uncompressed_bytes: 8080000000
marks_bytes:             58656
comment:              
is_in_partition_key:     0
is_in_sorting_key:       0
is_in_primary_key:       0
is_in_sampling_key:      0
compression_codec:    
*/



/****** Delta, LZ4 ******/

ALTER TABLE array_codec_test MODIFY COLUMN arr Array(UInt64) CODEC (Delta, LZ4);

OPTIMIZE TABLE array_codec_test FINAL
--0 rows in set. Elapsed: 4.577 sec.

SELECT * FROM system.columns WHERE (table = 'array_codec_test') AND (name = 'arr')

/*
Row 1:
──────
database:                default
table:                   array_codec_test
name:                    arr
type:                    Array(UInt64)
position:                2
default_kind:         
default_expression:   
data_compressed_bytes:   32458310
data_uncompressed_bytes: 8080000000
marks_bytes:             58656
comment:              
is_in_partition_key:     0
is_in_sorting_key:       0
is_in_primary_key:       0
is_in_sampling_key:      0
compression_codec:       CODEC(Delta(8), LZ4)
*/

7.13.2 - Codecs speed

Codecs speed

create table test_codec_speed engine=MergeTree
ORDER BY tuple()
as select cast(now() + rand()%2000 + number, 'DateTime') as x from numbers(1000000000);

option 1: CODEC(LZ4) (same as default)
option 2: CODEC(DoubleDelta) (`alter table test_codec_speed modify column x DateTime CODEC(DoubleDelta)`);
option 3: CODEC(T64, LZ4) (`alter table test_codec_speed modify column x DateTime CODEC(T64, LZ4)`)
option 4: CODEC(Delta, LZ4) (`alter table test_codec_speed modify column x DateTime CODEC(Delta, LZ4)`)
option 5: CODEC(ZSTD(1)) (`alter table test_codec_speed modify column x DateTime CODEC(ZSTD(1))`)
option 6: CODEC(T64, ZSTD(1)) (`alter table test_codec_speed modify column x DateTime CODEC(T64, ZSTD(1))`)
option 7: CODEC(Delta, ZSTD(1)) (`alter table test_codec_speed modify column x DateTime CODEC(Delta, ZSTD(1))`)
option 8: CODEC(T64, LZ4HC(1)) (`alter table test_codec_speed modify column x DateTime CODEC(T64, LZ4HC(1))`)
option 9: CODEC(Gorilla) (`alter table test_codec_speed modify column x DateTime CODEC(Gorilla)`)

Result may be not 100% reliable (checked on my laptop, need to be repeated in lab environment)


OPTIMIZE TABLE test_codec_speed FINAL (second run - i.e. read + write the same data)
1) 17 sec.
2) 30 sec.
3) 16 sec
4) 17 sec
5) 29 sec
6) 24 sec
7) 31 sec
8) 35 sec
9) 19 sec

compressed size
1) 3181376881
2) 2333793699
3) 1862660307
4) 3408502757
5) 2393078266
6) 1765556173
7) 2176080497
8) 1810471247
9) 2109640716

select max(x) from test_codec_speed
1) 0.597
2) 2.756 :(
3) 1.168
4) 0.752
5) 1.362
6) 1.364
7) 1.752
8) 1.270
9) 1.607

7.13.3 - How to test different compression codecs

How to test different compression codecs

Example

Create test_table based on the source table.

CREATE TABLE test_table AS source_table ENGINE=MergeTree() PARTITION BY ...;

If the source table has Replicated*MergeTree engine, you would need to change it to non-replicated.

Attach one partition with data from the source table to test_table.

ALTER TABLE test_table ATTACH PARTITION ID '20210120' FROM source_table;

You can modify the column or create a new one based on the old column value.

ALTER TABLE test_table MODIFY COLUMN column_a CODEC(ZSTD(2));
ALTER TABLE test_table ADD COLUMN column_new UInt32
                         DEFAULT toUInt32OrZero(column_old) CODEC(T64,LZ4);

After that, you would need to populate changed columns with data.

ALTER TABLE test_table UPDATE column_a=column_a, column_new=column_new WHERE 1;

You can look status of mutation via the system.mutations table

SELECT * FROM system.mutations;

And it’s also possible to kill mutation if there are some problems with it.

KILL MUTATION WHERE ...

Useful queries

SELECT
    database,
    table,
    count() AS parts,
    uniqExact(partition_id) AS partition_cnt,
    sum(rows),
    formatReadableSize(sum(data_compressed_bytes) AS comp_bytes) AS comp,
    formatReadableSize(sum(data_uncompressed_bytes) AS uncomp_bytes) AS uncomp,
    uncomp_bytes / comp_bytes AS ratio
FROM system.parts
WHERE active
GROUP BY
    database,
    table
ORDER BY comp_bytes DESC

SELECT
  database,
  table,
  column,
  type,
  sum(rows) AS rows,
  sum(column_data_compressed_bytes) AS compressed_bytes,
  formatReadableSize(compressed_bytes) AS compressed,
  formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
  sum(column_data_uncompressed_bytes) / compressed_bytes AS ratio,
  any(compression_codec) AS codec
FROM system.parts_columns AS pc
LEFT JOIN system.columns AS c
ON (pc.database = c.database) AND (c.table = pc.table) AND (c.name = pc.column)
WHERE (database LIKE '%') AND (table LIKE '%') AND active
GROUP BY
  database,
  table,
  column,
  type
ORDER BY database, table, sum(column_data_compressed_bytes) DESC

7.14 - Dictionaries vs LowCardinality

Dictionaries vs LowCardinality

Q. I think I’m still trying to understand how de-normalized is okay - with my relational mindset, I want to move repeated string fields into their own table, but I’m not sure to what extent this is necessary

I will look at LowCardinality in more detail - I think it may work well here

A. If it’s a simple repetition, which you don’t need to manipulate/change in future - LowCardinality works great, and you usually don’t need to increase the system complexity by introducing dicts.

For example: name of team ‘Manchester United’ will rather not be changed, and even if it will you can keep the historical records with historical name. So normalization here (with some dicts) is very optional, and de-normalized approach with LowCardinality is good & simpler alternative.

From the other hand: if data can be changed in future, and that change should impact the reports, then normalization may be a big advantage.

For example if you need to change the used currency rare every day- it would be quite stupid to update all historical records to apply the newest exchange rate. And putting it to dict will allow to do calculations with latest exchange rate at select time.

For dictionary it’s possible to mark some of the attributes as injective. An attribute is called injective if different attribute values correspond to different keys. It would allow ClickHouse® to replace dictGet call in GROUP BY with cheap dict key.

7.15 - Flattened table

Flattened table

It’s possible to use dictionaries for populating columns of fact table.

CREATE TABLE customer
(
    `customer_id` UInt32,
    `first_name` String,
    `birth_date` Date,
    `sex` Enum('M' = 1, 'F' = 2)
)
ENGINE = MergeTree
ORDER BY customer_id

CREATE TABLE order
(
    `order_id` UInt32,
    `order_date` DateTime DEFAULT now(),
    `cust_id` UInt32,
    `amount` Decimal(12, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_date, cust_id, order_id)

INSERT INTO customer VALUES(1, 'Mike', now() - INTERVAL 30 YEAR, 'M');
INSERT INTO customer VALUES(2, 'Boris', now() - INTERVAL 40 YEAR, 'M');
INSERT INTO customer VALUES(3, 'Sofie', now() - INTERVAL 24 YEAR, 'F');

INSERT INTO order (order_id, cust_id, amount) VALUES(50, 1, 15);
INSERT INTO order (order_id, cust_id, amount) VALUES(30, 1, 10);

SELECT * EXCEPT 'order_date'
FROM order

┌─order_id─┬─cust_id─┬─amount─┐
│       30 │       1 │  10.00 │
│       50 │       1 │  15.00 │
└──────────┴─────────┴────────┘

CREATE DICTIONARY customer_dict
(
    `customer_id` UInt32,
    `first_name` String,
    `birth_date` Date,
    `sex` UInt8
)
PRIMARY KEY customer_id
SOURCE(CLICKHOUSE(TABLE 'customer'))
LIFETIME(MIN 0 MAX 300)
LAYOUT(FLAT)

ALTER TABLE order
  ADD COLUMN `cust_first_name` String DEFAULT dictGetString('default.customer_dict', 'first_name', toUInt64(cust_id)),
  ADD COLUMN `cust_sex` Enum('M' = 1, 'F' = 2) DEFAULT dictGetUInt8('default.customer_dict', 'sex', toUInt64(cust_id)),
    ADD COLUMN `cust_birth_date` Date DEFAULT dictGetDate('default.customer_dict', 'birth_date', toUInt64(cust_id));

INSERT INTO order (order_id, cust_id, amount) VALUES(10, 3, 30);
INSERT INTO order (order_id, cust_id, amount) VALUES(20, 3, 60);
INSERT INTO order (order_id, cust_id, amount) VALUES(40, 2, 20);

SELECT * EXCEPT 'order_date'
FROM order
FORMAT PrettyCompactMonoBlock

┌─order_id─┬─cust_id─┬─amount─┬─cust_first_name─┬─cust_sex─┬─cust_birth_date─┐
│       30 │       1 │  10.00 │ Mike            │ M        │      1991-08-05 │
│       50 │       1 │  15.00 │ Mike            │ M        │      1991-08-05 │
│       10 │       3 │  30.00 │ Sofie           │ F        │      1997-08-05 │
│       40 │       2 │  20.00 │ Boris           │ M        │      1981-08-05 │
│       20 │       3 │  60.00 │ Sofie           │ F        │      1997-08-05 │
└──────────┴─────────┴────────┴─────────────────┴──────────┴─────────────────┘

ALTER TABLE customer UPDATE birth_date = now() - INTERVAL 35 YEAR WHERE customer_id=2;

SYSTEM RELOAD DICTIONARY customer_dict;

ALTER TABLE order
    UPDATE cust_birth_date = dictGetDate('default.customer_dict', 'birth_date', toUInt64(cust_id)) WHERE 1
-- or if you do have track of changes it's possible to lower amount of dict calls
--  UPDATE cust_birth_date = dictGetDate('default.customer_dict', 'birth_date', toUInt64(cust_id)) WHERE customer_id = 2


SELECT * EXCEPT 'order_date'
FROM order
FORMAT PrettyCompactMonoBlock

┌─order_id─┬─cust_id─┬─amount─┬─cust_first_name─┬─cust_sex─┬─cust_birth_date─┐
│       30 │       1 │  10.00 │ Mike            │ M        │      1991-08-05 │
│       50 │       1 │  15.00 │ Mike            │ M        │      1991-08-05 │
│       10 │       3 │  30.00 │ Sofie           │ F        │      1997-08-05 │
│       40 │       2 │  20.00 │ Boris           │ M        │      1986-08-05 │
│       20 │       3 │  60.00 │ Sofie           │ F        │      1997-08-05 │
└──────────┴─────────┴────────┴─────────────────┴──────────┴─────────────────┘

ALTER TABLE order UPDATE would completely overwrite this column in table, so it’s not recommended to run it often.

7.16 - Floats vs Decimals

Floats vs Decimals

Float arithmetics is not accurate: https://floating-point-gui.de/

In case you need accurate calculations you should use Decimal datatypes.

Operations on floats are not associative

SELECT (toFloat64(100000000000000000.) + toFloat64(7.5)) - toFloat64(100000000000000000.) AS res

┌─res─┐
│   0 │
└─────┘


SELECT (toFloat64(100000000000000000.) - toFloat64(100000000000000000.)) + toFloat64(7.5) AS res

┌─res─┐
│ 7.5 │
└─────┘

No problem with Decimals:

SELECT (toDecimal64(100000000000000000., 1) + toDecimal64(7.5, 1)) - toDecimal64(100000000000000000., 1) AS res

┌─res─┐
│ 7.5 │
└─────┘


SELECT (toDecimal64(100000000000000000., 1) - toDecimal64(100000000000000000., 1)) + toDecimal64(7.5, 1) AS res

┌─res─┐
│ 7.5 │
└─────┘

Warning

Because ClickHouse® uses MPP order of execution of a single query can vary on each run, and you can get slightly different results from the float column every time you run the query.

Usually, this deviation is small, but it can be significant when some kind of arithmetic operation is performed on very large and very small numbers at the same time.

Some decimal numbers has no accurate float representation

SELECT sum(toFloat64(0.45)) AS res
FROM numbers(10000)

┌───────────────res─┐
│ 4499.999999999948 │
└───────────────────┘


SELECT sumKahan(toFloat64(0.45)) AS res
FROM numbers(10000)

┌──res─┐
│ 4500 │
└──────┘


SELECT toFloat32(0.6) * 6 AS res

┌────────────────res─┐
│ 3.6000001430511475 │
└────────────────────┘

No problem with Decimal:

SELECT sum(toDecimal64(0.45, 2)) AS res
FROM numbers(10000)

┌──res─┐
│ 4500 │
└──────┘


SELECT toDecimal32(0.6, 1) * 6 AS res

┌─res─┐
│ 3.6 │
└─────┘

Direct comparisons of floats may be impossible

The same number can have several floating-point representations and because of that you should not compare Floats directly

SELECT (toFloat32(0.1) * 10) = (toFloat32(0.01) * 100) AS res

┌─res─┐
│   0 │
└─────┘


SELECT
    sumIf(0.1, number < 10) AS a,
    sumIf(0.01, number < 100) AS b,
    a = b AS a_eq_b
FROM numbers(100)

┌──────────────────a─┬──────────────────b─┬─a_eq_b─┐
│ 0.9999999999999999 │ 1.0000000000000004 │      0 │
└────────────────────┴────────────────────┴────────┘

7.17 - Ingestion performance and formats

clickhouse-client -q 'select toString(number) s, number n, number/1000 f from numbers(100000000) format TSV' > speed.tsv
clickhouse-client -q 'select toString(number) s, number n, number/1000 f from numbers(100000000) format RowBinary' > speed.RowBinary
clickhouse-client -q 'select toString(number) s, number n, number/1000 f from numbers(100000000) format Native' > speed.Native
clickhouse-client -q 'select toString(number) s, number n, number/1000 f from numbers(100000000) format CSV' > speed.csv
clickhouse-client -q 'select toString(number) s, number n, number/1000 f from numbers(100000000) format JSONEachRow' > speed.JSONEachRow
clickhouse-client -q 'select toString(number) s, number n, number/1000 f from numbers(100000000) format Parquet' > speed.parquet
clickhouse-client -q 'select toString(number) s, number n, number/1000 f from numbers(100000000) format Avro' > speed.avro

-- Engine=Null does not have I/O / sorting overhead
-- we test only formats parsing performance.

create table n (s String, n UInt64, f Float64) Engine=Null


-- clickhouse-client parses formats itself
-- it allows to see user CPU time -- time is used in a multithreaded application
-- another option is to disable parallelism `--input_format_parallel_parsing=0`
-- real -- wall / clock time.

time clickhouse-client -t -q 'insert into n format TSV' < speed.tsv
2.693  real  0m2.728s   user  0m14.066s

time clickhouse-client -t -q 'insert into n format RowBinary' < speed.RowBinary
3.744  real  0m3.773s   user  0m4.245s

time clickhouse-client -t -q 'insert into n format Native' < speed.Native
2.359  real  0m2.382s   user  0m1.945s

time clickhouse-client -t -q 'insert into n format CSV' < speed.csv
3.296  real  0m3.328s  user  0m18.145s

time clickhouse-client -t -q 'insert into n format JSONEachRow' < speed.JSONEachRow
8.872  real  0m8.899s  user  0m30.235s

time clickhouse-client -t -q 'insert into n format Parquet' < speed.parquet
4.905  real  0m4.929s   user  0m5.478s

time clickhouse-client -t -q 'insert into n format Avro' < speed.avro
11.491  real  0m11.519s  user  0m12.166s

As you can see the JSONEachRow is the worst format (user 0m30.235s) for this synthetic dataset. Native is the best (user 0m1.945s). TSV / CSV are good in wall time but spend a lot of CPU (user time).

7.18 - IPs/masks

IPs/masks

How do I Store IPv4 and IPv6 Address In One Field?

There is a clean and simple solution for that. Any IPv4 has its unique IPv6 mapping:

IPv4 IP address: 191.239.213.197
IPv4-mapped IPv6 address: ::ffff:191.239.213.197

Find IPs matching CIDR/network mask (IPv4)

WITH IPv4CIDRToRange( toIPv4('10.0.0.1'), 8 ) as range
SELECT
  *
FROM values('ip IPv4',
               toIPv4('10.2.3.4'),
               toIPv4('192.0.2.1'),
               toIPv4('8.8.8.8'))
WHERE
   ip BETWEEN range.1 AND range.2;

Find IPs matching CIDR/network mask (IPv6)

WITH IPv6CIDRToRange
     (
       toIPv6('2001:0db8:0000:85a3:0000:0000:ac1f:8001'),
       32
      ) as range
SELECT
  *
FROM values('ip IPv6',
               toIPv6('2001:db8::8a2e:370:7334'),
               toIPv6('::ffff:192.0.2.1'),
               toIPv6('::'))
WHERE
   ip BETWEEN range.1 AND range.2;

7.19 - JSONAsString and Mat. View as JSON parser

JSONAsString and Mat. View as JSON parser

Tables with engine Null don’t store data but can be used as a source for materialized views.

JSONAsString a special input format which allows to ingest JSONs into a String column. If the input has several JSON objects (comma separated) they will be interpreted as separate rows. JSON can be multiline.

create table entrypoint(J String) Engine=Null;
create table datastore(a String, i Int64, f Float64) Engine=MergeTree order by a;

create materialized view jsonConverter to datastore
as select (JSONExtract(J, 'Tuple(String,Tuple(Int64,Float64))') as x),
        x.1 as a,
        x.2.1 as i,
        x.2.2 as f
from entrypoint;

$ echo '{"s": "val1", "b2": {"i": 42, "f": 0.1}}' | \
    clickhouse-client -q "insert into entrypoint format JSONAsString"

$ echo '{"s": "val1","b2": {"i": 33, "f": 0.2}},{"s": "val1","b2": {"i": 34, "f": 0.2}}' | \
   clickhouse-client -q "insert into entrypoint format JSONAsString"

SELECT * FROM datastore;
┌─a────┬──i─┬───f─┐
│ val1 │ 42 │ 0.1 │
└──────┴────┴─────┘
┌─a────┬──i─┬───f─┐
│ val1 │ 33 │ 0.2 │
│ val1 │ 34 │ 0.2 │
└──────┴────┴─────┘

7.20 - LowCardinality

LowCardinality

Settings

allow_suspicious_low_cardinality_types

In CREATE TABLE statement allows specifying LowCardinality modifier for types of small fixed size (8 or less). Enabling this may increase merge times and memory consumption.

low_cardinality_max_dictionary_size

default - 8192

Maximum size (in rows) of shared global dictionary for LowCardinality type.

low_cardinality_use_single_dictionary_for_part

LowCardinality type serialization setting. If is true, than will use additional keys when global dictionary overflows. Otherwise, will create several shared dictionaries.

low_cardinality_allow_in_native_format

Use LowCardinality type in Native format. Otherwise, convert LowCardinality columns to ordinary for select query, and convert ordinary columns to required LowCardinality for insert query.

output_format_arrow_low_cardinality_as_dictionary

Enable output LowCardinality type as Dictionary Arrow type

7.21 - ClickHouse® Materialized Views

Making the most of this powerful ClickHouse® feature

ClickHouse® MATERIALIZED VIEWs behave like AFTER INSERT TRIGGER to the left-most table listed in their SELECT statement and never read data from disk. Only rows that are placed to the RAM buffer by INSERT are read.

Useful links

ClickHouse Materialized Views Illuminated, Part 1:
- Blog post
- Webinar recording
ClickHouse Materialized Views Illuminated, Part 2:
Everything you should know about Materialized Views:
- Video
- Annotated slides

Best practices

Use MATERIALIZED VIEW with TO syntax (explicit storage table)
First you create the table which will store the data calculated by MV explicitly, and after that create materialized view itself with TO syntax.
```
CREATE TABLE target ( ... ) Engine=[Replicated][Replacing/Summing/...]MergeTree ...;

CREATE MATERIALIZED VIEW mv_source2target TO target
AS SELECT ... FROM source;
```
That way it’s bit simpler to do schema migrations or build more complicated pipelines when one table is filled by several MV.
With engine=Atomic it hard to map underlying table with the MV.
Avoid using POPULATE when creating MATERIALIZED VIEW on big tables.
Use manual backfilling (with the same query) instead.
- With POPULATE the data ingested to the source table during MV populating will not appear in MV.
- POPULATE doesn’t work with TO syntax.
- With manual backfilling, you have much better control on the process - you can do it in parts, adjust settings etc.
- In case of some failure ‘in the middle (for example due to timeouts), it’s hard to understand the state of the MV.
```
CREATE MATERIALIZED VIEW mv_source2target TO target
AS SELECT ... FROM source WHERE cond > ...

INSERT INTO target SELECT ... FROM source WHERE cond < ...
```
This way you have full control backfilling process (you can backfill in smaller parts to avoid timeouts, do some cross-checks / integrity-checks, change some settings, etc.)

FAQ

Q. Can I attach MATERIALIZED VIEW to the VIEW, or engine=Merge, or engine=MySQL, etc.?

Since MATERIALIZED VIEWs are updated on every INSERT to the underlying table and you can not insert anything to the usual VIEW, the materialized view update will never be triggered.

Normally, you should build MATERIALIZED VIEWs on the top of the table with the MergeTree engine family.

Q. I’ve created a materialized error with some error, and since it’s reading from Kafka, I don’t understand where the error is

Look into system.query_views_log table or server logs, or system.text_log table. Also, see the next question.

Q. How to debug misbehaving MATERIALIZED VIEW?

You can also attach the same MV to a dummy table with engine=Null and do manual inserts to debug the behavior. In a similar way (as the Materialized view often contains some pieces of the application’s business logic), you can create tests for your schema.

Warning

Always test MATERIALIZED VIEWs first on staging or testing environments

Possible test scenario:

create a copy of the original table CREATE TABLE src_copy ... AS src
create MV on that copy CREATE MATERIALIZED VIEW ... AS SELECT ... FROM src_copy
check if inserts to src_copy work properly, and mv is properly filled. INSERT INTO src_copy SELECT * FROM src LIMIT 100
cleanup the temp stuff and recreate MV on real table.

Q. Can I use subqueries / joins in MV?

It is possible but it is a very bad idea for most of the use cases**.**

So it will most probably work not as you expect and will hit insert performance significantly.

The MV will be attached (as AFTER INSERT TRIGGER) to the left-most table in the MV SELECT statement, and it will ‘see’ only freshly inserted rows there. It will ‘see’ the whole set of rows of other tables, and the query will be executed EVERY TIME you do the insert to the left-most table. That will impact the performance speed there significantly. If you really need to update the MV with the left-most table, not impacting the performance so much you can consider using dictionary / engine=Join / engine=Set for right-hand table / subqueries (that way it will be always in memory, ready to use).

Q. How are MVs executed sequentially or in parallel?

By default, the execution is sequential and alphabetical. It can be switched by parallel_view_processing .

Parallel processing could be helpful if you have a lot of spare CPU power (cores) and want to utilize it. Add the setting to the insert statement or to the user profile. New blocks created by MVs will also follow the squashing logic similar to the one used in the insert, but they will use the min_insert_block_size_rows_for_materialized_views and min_insert_block_size_bytes_for_materialized_views settings.

Q. How to alter MV implicit storage (w/o TO syntax)

take the existing MV definition
```
SHOW CREATE TABLE dbname.mvname;
```
Adjust the query in the following manner:
- replace ‘CREATE MATERIALIZED VIEW’ to ‘ATTACH MATERIALIZED VIEW’
- add needed columns;

Detach materialized view with the command:

DETACH TABLE dbname.mvname ON CLUSTER cluster_name;

Add the needed column to the underlying ReplicatedAggregatingMergeTree table

-- if the Materialized view was created without TO keyword
ALTER TABLE dbname.`.inner.mvname` ON CLUSTER cluster_name add column tokens AggregateFunction(uniq, UInt64);
-- othewise just alter the target table used in `CREATE MATERIALIZED VIEW ...`  `TO ...` clause

attach MV back using the query you create at p. 1.

ATTACH MATERIALIZED VIEW dbname.mvname ON CLUSTER cluster_name
(
    /* ... */
    `tokens` AggregateFunction(uniq, UInt64)
)
ENGINE = ReplicatedAggregatingMergeTree(...)
ORDER BY ...
AS
SELECT
    /* ... */
    uniqState(rand64()) as tokens
FROM /* ... */
GROUP BY /* ... */

As you can see that operation is NOT atomic, so the safe way is to stop data ingestion during that procedure.

If you have version 19.16.13 or newer you can change the order of step 2 and 3 making the period when MV is detached and not working shorter (related issue https://github.com/ClickHouse/ClickHouse/issues/7878 ).

See also:

7.21.1 - Idempotent inserts into a materialized view

How to make idempotent inserts into a materialized view".

Why inserts into materialized views are not idempotent?

ClickHouse® still does not have transactions. They were to be implemented around 2022Q2 but still not in the roadmap.

Because of ClickHouse materialized view is a trigger. And an insert into a table and an insert into a subordinate materialized view it’s two different inserts so they are not atomic altogether.

And insert into a materialized view may fail after the successful insert into the table. In case of any failure a client gets the error about failed insertion. You may enable insert_deduplication (it’s enabled by default for Replicated engines) and repeat the insert with an idea to archive idempotate insertion, and insertion will be skipped into the source table because of deduplication but it will be skipped for materialized view as well because by default materialized view inherits deduplication from the source table. It’s controlled by a parameter deduplicate_blocks_in_dependent_materialized_views https://clickhouse.com/docs/en/operations/settings/settings/#settings-deduplicate-blocks-in-dependent-materialized-views

If your materialized view is wide enough and always has enough data for consistent deduplication then you can enable deduplicate_blocks_in_dependent_materialized_views. Or you may add information for deduplication (some unique information / insert identifier).

Example 1. Inconsistency with deduplicate_blocks_in_dependent_materialized_views 0

create table test (A Int64, D Date) 
Engine =  ReplicatedMergeTree('/clickhouse/{cluster}/tables/{table}','{replica}') 
partition by toYYYYMM(D) order by A;

create materialized view test_mv 
Engine = ReplicatedSummingMergeTree('/clickhouse/{cluster}/tables/{table}','{replica}') 
partition by D order by D as 
select D, count() CNT from test group by D;

set max_partitions_per_insert_block=1; -- trick to fail insert into MV.

insert into test select number, today()+number%3 from numbers(100);
   DB::Exception: Received from localhost:9000. DB::Exception: Too many partitions 

select count() from test;
┌─count()─┐
│     100 │  -- Insert was successful into the test table
└─────────┘

select sum(CNT) from test_mv;
0 rows in set. Elapsed: 0.001 sec.   -- Insert was unsuccessful into the test_mv table (DB::Exception)

-- Let's try to retry insertion
set max_partitions_per_insert_block=100; -- disable trick
insert into test select number, today()+number%3 from numbers(100); -- insert retry / No error
 
select count() from test;
┌─count()─┐
│     100 │  -- insert was deduplicated
└─────────┘
 
select sum(CNT) from test_mv;
0 rows in set. Elapsed: 0.001 sec.    -- Inconsistency! Unfortunatly insert into MV was deduplicated as well

That is another example - https://github.com/ClickHouse/ClickHouse/issues/56642

Example 2. Inconsistency with deduplicate_blocks_in_dependent_materialized_views 1

create table test (A Int64, D Date) 
Engine =  ReplicatedMergeTree('/clickhouse/{cluster}/tables/{table}','{replica}') 
partition by toYYYYMM(D) order by A;

create materialized view test_mv 
Engine = ReplicatedSummingMergeTree('/clickhouse/{cluster}/tables/{table}','{replica}') 
partition by D order by D as 
select D, count() CNT from test group by D;

set deduplicate_blocks_in_dependent_materialized_views=1;

insert into test select number, today() from numbers(100);      -- insert 100 rows
insert into test select number, today() from numbers(100,100);  -- insert another 100 rows


select count() from test;
┌─count()─┐
│     200 │  -- 200 rows in the source test table
└─────────┘

select sum(CNT) from test_mv;
┌─sum(CNT)─┐
│      100 │ -- Inconsistency! The second insert was falsely deduplicated because count() was = 100 both times 
└──────────┘

Example 3. Solution: no inconsistency with deduplicate_blocks_in_dependent_materialized_views 1

Let’s add some artificial insert_id generated by the source of inserts:

create table test (A Int64, D Date, insert_id Int64) 
Engine =  ReplicatedMergeTree('/clickhouse/{cluster}/tables/{table}','{replica}') 
partition by toYYYYMM(D) order by A;

create materialized view test_mv 
Engine = ReplicatedSummingMergeTree('/clickhouse/{cluster}/tables/{table}','{replica}') 
partition by D order by D as 
select D, count() CNT, any(insert_id) insert_id from test group by D;

set deduplicate_blocks_in_dependent_materialized_views=1;

insert into test select number, today(), 333 from numbers(100);
insert into test select number, today(), 444 from numbers(100,100);


select count() from test;
┌─count()─┐
│     200 │
└─────────┘

select sum(CNT) from test_mv;
┌─sum(CNT)─┐
│      200 │ -- no inconsistency, the second (100) was not deduplicated because 333<>444
└──────────┘

set max_partitions_per_insert_block=1;   -- trick to fail insert into MV.
insert into test select number, today()+number%3, 555 from numbers(100);  
    DB::Exception: Too many partitions for single INSERT block (more than 1)


select count() from test;
┌─count()─┐
│     300 │  -- insert is successful into the test table
└─────────┘

select sum(CNT) from test_mv;
┌─sum(CNT)─┐
│      200 │  -- insert was unsuccessful into the test_mv table
└──────────┘

set max_partitions_per_insert_block=100;
insert into test select number, today()+number%3, 555 from numbers(100);   -- insert retry


select count() from test;
┌─count()─┐
│     300 │  -- insert was deduplicated
└─────────┘

select sum(CNT) from test_mv;
┌─sum(CNT)─┐
│      300 │  -- No inconsistency! Insert was not deduplicated.
└──────────┘

Idea how to fix it in ClickHouse source code https://github.com/ClickHouse/ClickHouse/issues/30240

Fake (unused) metric to add uniqueness.

create materialized view test_mv
   Engine = ReplicatedSummingMergeTree('/clickhouse/{cluster}/tables/{table}','{replica}') 
   partition by D
   order by D
  as
   select
     D,
     count() CNT,
     sum( cityHash(*) ) insert_id
  from test group by D;

7.21.2 - Backfill/populate MV in a controlled manner

Backfill/populate MV in a controlled manner

Q. How to populate MV create with TO syntax? INSERT INTO mv SELECT * FROM huge_table? Will it work if the source table has billions of rows?

A. single huge insert ... select ... actually will work, but it will take A LOT of time, and during that time lot of bad things can happen (lack of disk space, hard restart etc). Because of that, it’s better to do such backfill in a more controlled manner and in smaller pieces.

One of the best options is to fill one partition at a time, and if it breaks you can drop the partition and refill it.

If you need to construct a single partition from several sources - then the following approach may be the best.

CREATE TABLE mv_import AS mv;
INSERT INTO mv_import SELECT * FROM huge_table WHERE toYYYYMM(ts) = 202105;
/* or other partition expression*/

/* that insert select may take a lot of time, if something bad will happen
  during that - just truncate mv_import and restart the process */

/* after successful loading of mv_import do*/
ALTER TABLE mv ATTACH PARTITION ID '202105' FROM  mv_import;

Q. I still do not have enough RAM to GROUP BY the whole partition.

A. Push aggregating to the background during MERGES

There is a modified version of MergeTree Engine, called AggregatingMergeTree . That engine has additional logic that is applied to rows with the same set of values in columns that are specified in the table’s ORDER BY expression. All such rows are aggregated to only one rows using the aggregating functions defined in the column definitions. There are two “special” column types, designed specifically for that purpose:

INSERT … SELECT operating over the very large partition will create data parts by 1M rows (min_insert_block_size_rows), those parts will be aggregated during the merge process the same way as GROUP BY do it, but the number of rows will be much less than the total rows in the partition and RAM usage too. Merge combined with GROUP BY will create a new part with a much less number of rows. That data part possibly will be merged again with other data, but the number of rows will be not too big.

CREATE TABLE mv_import (
  id UInt64,
  ts SimpleAggregatingFunction(max,DateTime),         -- most fresh
  v1 SimpleAggregatingFunction(sum,UInt64),           -- just sum
  v2 SimpleAggregatingFunction(max,String),           -- some not empty string
  v3 AggregatingFunction(argMax,String,ts)            -- last value
) ENGINE = AggregatingMergeTree()
ORDER BY id;

INSERT INTO mv_import
SELECT id,                              -- ORDER BY column
   ts,v1,v2,                            -- state for SimpleAggregatingFunction the same as value
   initializeAggregation('argMaxState',v3,ts)  -- we need to convert from values to States for columns with AggregatingFunction type
FROM huge_table
WHERE toYYYYMM(ts) = 202105;

Actually, the first GROUP BY run will happen just before 1M rows will be stored on disk as a data part. You may disable that behavior by switching off optimize_on_insert setting if you have heavy calculations during aggregation.

You may attach such a table (with AggregatingFunction columns) to the main table as in the example above, but if you don’t like having States in the Materialized Table, data should be finalized and converted back to normal values. In that case, you have to move data by INSERT … SELECT again:

INSERT INTO MV
SELECT id,ts,v1,v2,  -- nothing special for SimpleAggregatingFunction columns
  finalizeAggregation(v3)
from mv_import FINAL

The last run of GROUP BY will happen during FINAL execution and AggregatingFunction types converted back to normal values. To simplify retries after failures an additional temporary table and the same trick with ATTACH could be applied.

8 - Using the Altinity Kubernetes Operator for ClickHouse®

Run ClickHouse® in Kubernetes without any issues.

Useful links

The Altinity Kubernetes Operator for ClickHouse® repo has very useful documentation:

ClickHouse Operator ip filter

In the current version of operator default user is limited to IP addresses of the cluster pods. We plan to have a password option for 0.20.0 and use a ‘secret’ authentication for distributed queries

Start/Stop cluster

Don’t delete the operator using:

kubectl delete -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml

kubectl delete chi cluster-name # chi is the name of the CRD clickhouseInstallation

DELETE PVCs

https://altinity.com/blog/preventing-clickhouse-storage-deletion-with-the-altinity-kubernetes-operator-reclaimpolicy

Scaling

Best way is to scale down the deployments to 0 replicas, after that reboot the node and scale up again:

first check that all your PVCs have the retain policy:

kubectl get pv -o=custom-columns=PV:.metadata.name,NAME:.spec.claimRef.name,POLICY:.spec.persistentVolumeReclaimPolicy
# Patch it if you need
kubectl patch pv <pv_id> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

spec:
 templates:
    volumeClaimTemplates:
       - name: XXX
         reclaimPolicy: Retain

After that just create a stop.yaml and kubectl apply -f stop.yaml

kind: ClickHouseInstallation
spec:
 stop: yes

Reboot kubernetes node
Scale up deployment changing the stop property to no and do an kubectl apply -f stop.yml

kind: ClickHouseInstallation
spec:
 stop: no

Check where pods are executing

kubectl get pod -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName -n zk
# Check which hosts in which AZs
kubectl get node -o=custom-columns=NODE:.metadata.name,ZONE:.metadata.labels.'failure-domain\.beta\.kubernetes\.io/zone'

Check node instance types:

kubectl get nodes -o json|jq -Cjr '.items[] | .metadata.name," ",.metadata.labels."beta.kubernetes.io/instance-type"," ",.metadata.labels."beta.kubernetes.io/arch", "\n"'|sort -k3 -r

ip-10-3-9-2.eu-central-1.compute.internal t4g.large arm64
ip-10-3-9-236.eu-central-1.compute.internal t4g.large arm64
ip-10-3-9-190.eu-central-1.compute.internal t4g.large arm64
ip-10-3-9-138.eu-central-1.compute.internal t4g.large arm64
ip-10-3-9-110.eu-central-1.compute.internal t4g.large arm64
ip-10-3-8-39.eu-central-1.compute.internal t4g.large arm64
ip-10-3-8-219.eu-central-1.compute.internal t4g.large arm64
ip-10-3-8-189.eu-central-1.compute.internal t4g.large arm64
ip-10-3-13-40.eu-central-1.compute.internal t4g.large arm64
ip-10-3-12-248.eu-central-1.compute.internal t4g.large arm64
ip-10-3-12-216.eu-central-1.compute.internal t4g.large arm64
ip-10-3-12-170.eu-central-1.compute.internal t4g.large arm64
ip-10-3-11-229.eu-central-1.compute.internal t4g.large arm64
ip-10-3-11-188.eu-central-1.compute.internal t4g.large arm64
ip-10-3-11-175.eu-central-1.compute.internal t4g.large arm64
ip-10-3-10-218.eu-central-1.compute.internal t4g.large arm64
ip-10-3-10-160.eu-central-1.compute.internal t4g.large arm64
ip-10-3-10-145.eu-central-1.compute.internal t4g.large arm64
ip-10-3-9-57.eu-central-1.compute.internal m5.large amd64
ip-10-3-8-146.eu-central-1.compute.internal m5.large amd64
ip-10-3-13-1.eu-central-1.compute.internal m5.xlarge amd64
ip-10-3-11-52.eu-central-1.compute.internal m5.xlarge amd64
ip-10-3-11-187.eu-central-1.compute.internal m5.xlarge amd64
ip-10-3-10-217.eu-central-1.compute.internal m5.xlarge amd64

Search for missing affinity rules:

kubectl get pods -o json -n zk |\
jq -r "[.items[] | {name: .metadata.name,\
 affinity: .spec.affinity}]"
[
  {
    "name": "zookeeper-0",
    "affinity": null
  },
  . . .
]

Storage classes

kubectl get pvc -o=custom-columns=NAME:.metadata.name,SIZE:.spec.resources.requests.storage,CLASS:.spec.storageClassName,VOLUME:.spec.volumeName
...
NAME                         SIZE   CLASS   VOLUME
datadir-volume-zookeeper-0   25Gi   gp2     pvc-9a3...9ee

kubectl get storageclass/gp2 
...
NAME            PROVISIONER       RECLAIMPOLICY...   
gp2 (default)   ebs.csi.aws.com   Delete

Using CSI driver to protect storage:

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-protected
parameters:
  encrypted: "true"
  type: gp2
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Enable Resize of Volumes

Operator does not delete volumes, so those were probably deleted by Kubernetes. In some new versions there is a feature flag that deletes PVCs attached to STS when STS is deleted.

Please try do the following: Use operator 0.20.3. Add the following to the defaults: ``

 defaults:
    storageManagement:
      provisioner: Operator

That enables storage management by operator, instead of STS. It allows to extend volumes without re-creating STS, and us increase Volume size without restart of clickhouse statefulset pods for CSI drivers which support allowVolumeExpansion in storage classes because statefulset template don’t change and we don’t need delete/create statefulset

Change server settings:

https://github.com/Altinity/clickhouse-operator/issues/828

kind: ClickHouseInstallation
spec:
  configuration:
    settings:
        max_concurrent_queries: 150

Or edit ClickHouseInstallation:

kubectl -n <namespace> get chi

NAME              CLUSTERS   HOSTS   STATUS      HOSTS-COMPLETED   AGE
dnieto-test       1          4       Completed                     211d
mbak-test         1          1       Completed                     44d
rory-backupmar8   1          4       Completed                     42h

kubectl -n <namespace> edit ClickHouseInstallation dnieto-test

Clickhouse-backup for CHOP

Examples for use clickhouse-backup + clickhouse-operator for EKS cluster which not managed by altinity.cloud

Main idea: second container in clickhouse pod + CronJob which will insert and poll system.backup_actions commands to execute clickhouse-backup commands

https://github.com/AlexAkulov/clickhouse-backup/blob/master/Examples.md#how-to-use-clickhouse-backup-in-kubernetes

Configurations:

How to modify yaml configs:

https://github.com/Altinity/clickhouse-operator/blob/dc6cdc6f2f61fc333248bb78a8f8efe792d14ca2/tests/e2e/manifests/chi/test-016-settings-04.yaml#L26

clickhouse-operator install Example:

use latest release if possible https://github.com/Altinity/clickhouse-operator/releases

No. Nodes/replicas: 2 to 3 nodes with 500GB per node minimum
Zookeeper: 3 node ensemble
Type of instances: m6i.x4large to start with and you can go up to m6i.16xlarge
Persistent Storage/volumes: EBS gp2 for data and logs and gp3 for zookeeper

Install operator in namespace

#!/bin/bash

# Namespace to install operator into
OPERATOR_NAMESPACE="${OPERATOR_NAMESPACE:-dnieto-test-chop}"
# Namespace to install metrics-exporter into
METRICS_EXPORTER_NAMESPACE="${OPERATOR_NAMESPACE}"
# Operator's docker image
OPERATOR_IMAGE="${OPERATOR_IMAGE:-altinity/clickhouse-operator:latest}"
# Metrics exporter's docker image
METRICS_EXPORTER_IMAGE="${METRICS_EXPORTER_IMAGE:-altinity/metrics-exporter:latest}"

# Setup clickhouse-operator into specified namespace
kubectl apply --namespace="${OPERATOR_NAMESPACE}" -f <( \
    curl -s https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-template.yaml | \
        OPERATOR_IMAGE="${OPERATOR_IMAGE}" \
        OPERATOR_NAMESPACE="${OPERATOR_NAMESPACE}" \
        METRICS_EXPORTER_IMAGE="${METRICS_EXPORTER_IMAGE}" \
        METRICS_EXPORTER_NAMESPACE="${METRICS_EXPORTER_NAMESPACE}" \
        envsubst \
)

Install zookeeper ensemble

zookeepers will be named like zookeeper-0.zoons

> kubectl create ns zoo3ns
> kubectl -n zoo3ns apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/zookeeper/quick-start-persistent-volume/zookeeper-3-nodes-1GB-for-tests-only.yaml

# check names they should be like:
# zookeeper.zoo3ns if using a new namespace
# If using the same namespace zookeeper.<localnamespace>
# zookeeper must be accessed using the service like service_name.namespace

Deploy test cluster

> kubectl -n dnieto-test-chop apply -f dnieto-test-chop.yaml

# dnieto-test-chop.yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "dnieto-dev"
spec:
  configuration:
    settings:
	    max_concurrent_queries: "200"
			merge_tree/ttl_only_drop_parts: "1"
		profiles:
	    default/queue_max_wait_ms: "10000"
			readonly/readonly: "1"
		users:
      admin/networks/ip:
        - 0.0.0.0/0
        - '::/0'
			admin/password_sha256_hex: ""
      admin/profile: default
      admin/access_management: 1
	  zookeeper:  
			nodes:
        - host: zookeeper.dnieto-test-chop
          port: 2181
		clusters:
      - name: dnieto-dev
        templates:
          podTemplate: pod-template-with-volumes
          serviceTemplate: chi-service-template
        layout:
          shardsCount: 1
          # put the number of desired nodes 3 by default
          replicasCount: 2
  templates:
    podTemplates:
      - name: pod-template-with-volumes
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:22.3
              # separate data from logs 
              volumeMounts:
                - name: data-storage-vc-template
                  mountPath: /var/lib/clickhouse
                - name: log-storage-vc-template
                  mountPath: /var/log/clickhouse-server
    serviceTemplates:
      - name: chi-service-template
        generateName: "service-{chi}"
        # type ObjectMeta struct from k8s.io/meta/v1
        metadata:
          annotations:
						# https://kubernetes.io/docs/concepts/services-networking/service/#internal-load-balancer
						# this tags for elb load balancer
            #service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
				    #service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
						#https://kubernetes.io/docs/concepts/services-networking/service/#aws-nlb-support
				    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
					  service.beta.kubernetes.io/aws-load-balancer-type: nlb
				spec:
          ports:
            - name: http
              port: 8123
            - name: client
              port: 9000
          type: LoadBalancer
    volumeClaimTemplates:
      - name: data-storage-vc-template
        spec:
        # no storageClassName - means use default storageClassName
        # storageClassName: default
        # here if you have a storageClassName defined for gp2 you can use it.
        # kubectl get storageclass
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi
        reclaimPolicy: Retain
      - name: log-storage-vc-template
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi

Install monitoring:

In order to setup prometheus as a backend for all the asynchronous_metric_log / metric_log tables and also set up grafana dashboards:

Extra configs

There is an admin user by default in the deployment that is used to admin stuff

KUBECTL chi basic comands:

*> kubectl get crd*

NAME                                                       CREATED AT
clickhouseinstallations.clickhouse.altinity.com            2021-10-11T13:46:43Z
clickhouseinstallationtemplates.clickhouse.altinity.com    2021-10-11T13:46:44Z
clickhouseoperatorconfigurations.clickhouse.altinity.com   2021-10-11T13:46:44Z
eniconfigs.crd.k8s.amazonaws.com                           2021-10-11T13:41:23Z
grafanadashboards.integreatly.org                          2021-10-11T13:54:37Z
grafanadatasources.integreatly.org                         2021-10-11T13:54:38Z
grafananotificationchannels.integreatly.org                2022-05-17T14:27:48Z
grafanas.integreatly.org                                   2021-10-11T13:54:37Z
provisioners.karpenter.sh                                  2022-05-17T14:27:49Z
securitygrouppolicies.vpcresources.k8s.aws                 2021-10-11T13:41:27Z
volumesnapshotclasses.snapshot.storage.k8s.io              2022-04-22T13:34:20Z
volumesnapshotcontents.snapshot.storage.k8s.io             2022-04-22T13:34:20Z
volumesnapshots.snapshot.storage.k8s.io                    2022-04-22T13:34:20Z

> *kubectl -n test-clickhouse-operator-dnieto2 get chi*
NAME        CLUSTERS   HOSTS   STATUS   HOSTS-COMPLETED   AGE
simple-01                                                 70m

> *kubectl -n test-clickhouse-operator-dnieto2 describe chi simple-01*
Name:         simple-01
Namespace:    test-clickhouse-operator-dnieto2
Labels:       <none>
Annotations:  <none>
API Version:  clickhouse.altinity.com/v1
Kind:         ClickHouseInstallation
Metadata:
  Creation Timestamp:  2023-01-09T20:38:06Z
  Generation:          1
  Managed Fields:
    API Version:  clickhouse.altinity.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:configuration:
          .:
          f:clusters:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2023-01-09T20:38:06Z
  Resource Version:  267483138
  UID:               d7018efa-2b13-42fd-b1c5-b798fc6d0098
Spec:
  Configuration:
    Clusters:
      Name:  simple
Events:      <none>

> *kubectl get chi --all-namespaces*

NAMESPACE                          NAME                             CLUSTERS   HOSTS   STATUS      HOSTS-COMPLETED   AGE
andrey-dev                         source                           1          1       Completed                     38d
eu                                 chi-dnieto-test-common-configd   1          1       Completed                     161d
eu                                 dnieto-test                      1          4       Completed                     151d
laszlo-dev                         node-rescale-2                   1          4       Completed                     5d13h
laszlo-dev                         single                           1          1       Completed                     5d13h
laszlo-dev2                        zk2                              1          1       Completed                     52d
test-clickhouse-operator-dnieto2   simple-01

> *kubectl -n test-clickhouse-operator-dnieto2 edit clickhouseinstallations.clickhouse.altinity.com simple-01

# Troubleshoot operator stuff
> kubectl -n test-clickhouse-operator-ns edit chi
> kubectl -n test-clickhouse-operator describe chi
> kubectl -n test-clickhouse-operator get chi -o yaml

# Check operator logs usually located in kube-system or specific namespace
> kubectl -n test-ns logs chi-operator-pod -f

# Check output to yaml
> kubectl -n test-ns get services -o yaml*

Problem with DELETE finalizers:

https://github.com/Altinity/clickhouse-operator/issues/830

There’s a problem with stuck finalizers that can cause old CHI installations to hang. The sequence of operations looks like this.

You delete the existing ClickHouse operator using kubectl delete -f operator-installation.yaml with running CHI clusters.
You then drop the namespace where the CHI clusters are running, e.g., kubectl delete ns my-namespace
This hangs. You run kubectl get ns my-namespace -o yaml and you’ll see a message like the following: “message: ‘Some content in the namespace has finalizers remaining: finalizer.clickhouseinstallation.altinity.com ”

That means the CHI can’t delete because its finalizer was deleted out from under it.

The fix is to figure out the chi name which should still be visible and edit it to remove the finalizer reference.

kubectl -n my-namespace get chi
kubectl -n my-namespace edit [clickhouseinstallations.clickhouse.altinity.com](http://clickhouseinstallations.clickhouse.altinity.com/) my-clickhouse-cluster

Remove the finalizer from the spec, save it, and everything will delete properly.

TIP: if you delete the ns too and there is no ns just create it and apply the above method

Karpenter scaler

> kubectl -n karpenter get all
NAME                             READY   STATUS    RESTARTS   AGE
pod/karpenter-75c8b7667b-vbmj4   1/1     Running   0          16d
pod/karpenter-75c8b7667b-wszxt   1/1     Running   0          16d

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
service/karpenter   ClusterIP   172.20.129.188   <none>        8080/TCP,443/TCP   16d

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/karpenter   2/2     2            2           16d

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/karpenter-75c8b7667b   2         2         2       16d

> kubectl -n karpenter logs pod/karpenter-75c8b7667b-vbmj4

2023-02-06T06:33:44.269Z	DEBUG	Successfully created the logger.
2023-02-06T06:33:44.269Z	DEBUG	Logging level set to: debug
{"level":"info","ts":1675665224.2755454,"logger":"fallback","caller":"injection/injection.go:63","msg":"Starting informers..."}
2023-02-06T06:33:44.376Z	DEBUG	controller	waiting for configmaps	{"commit": "f60dacd", "configmaps": ["karpenter-global-settings"]}
2023-02-06T06:33:44.881Z	DEBUG	controller	karpenter-global-settings config "karpenter-global-settings" config was added or updated: settings.Settings{BatchMaxDuration:v1.Duration{Duration:10000000000}, BatchIdleDuration:v1.Duration{Duration:1000000000}}	{"commit": "f60dacd"}
2023-02-06T06:33:44.881Z	DEBUG	controller	karpenter-global-settings config "karpenter-global-settings" config was added or updated: settings.Settings{ClusterName:"eu", ClusterEndpoint:"https://79974769E264251E43B18AF4CA31CE8C.gr7.eu-central-1.eks.amazonaws.com", DefaultInstanceProfile:"KarpenterNodeInstanceProfile-eu", EnablePodENI:false, EnableENILimitedPodDensity:true, IsolatedVPC:false, NodeNameConvention:"ip-name", VMMemoryOverheadPercent:0.075, InterruptionQueueName:"Karpenter-eu", Tags:map[string]string{}}	{"commit": "f60dacd"}
2023-02-06T06:33:45.001Z	DEBUG	controller.aws	discovered region	{"commit": "f60dacd", "region": "eu-central-1"}
2023-02-06T06:33:45.003Z	DEBUG	controller.aws	unable to detect the IP of the kube-dns service, services "kube-dns" is forbidden: User "system:serviceaccount:karpenter:karpenter" cannot get resource "services" in API group "" in the namespace "kube-system"	{"commit": "f60dacd"}
2023/02/06 06:33:45 Registering 2 clients
2023/02/06 06:33:45 Registering 2 informer factories
2023/02/06 06:33:45 Registering 3 informers
2023/02/06 06:33:45 Registering 6 controllers
2023-02-06T06:33:45.080Z	DEBUG	controller.aws	discovered version	{"commit": "f60dacd", "version": "v0.20.0"}
2023-02-06T06:33:45.082Z	INFO	controller	Starting server	{"commit": "f60dacd", "path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
2023-02-06T06:33:45.082Z	INFO	controller	Starting server	{"commit": "f60dacd", "kind": "health probe", "addr": "[::]:8081"}
I0206 06:33:45.182600       1 leaderelection.go:248] attempting to acquire leader lease karpenter/karpenter-leader-election...
2023-02-06T06:33:45.226Z	INFO	controller	Starting informers...	{"commit": "f60dacd"}
2023-02-06T06:33:45.417Z	INFO	controller.aws.pricing	updated spot pricing with instance types and offerings	{"commit": "f60dacd", "instance-type-count": 607, "offering-count": 1400}
2023-02-06T06:33:47.670Z	INFO	controller.aws.pricing	updated on-demand pricing	{"commit": "f60dacd", "instance-type-count": 505}

Operator Affinities:

Screenshot from 2023-02-21 11-26-36.png

Deploy operator with clickhouse-keeper

https://github.com/Altinity/clickhouse-operator/issues/959 setup-example.yaml

Possible issues with running ClickHouse in K8s

The biggest problem with running ClickHouse® in K8s, happens when clickhouse-server can’t start for some reason and pod is falling in CrashloopBackOff, so you can’t easily get in the pod and check/fix/restart ClickHouse.

There is multiple possible reasons for this, some of them can be fixed without manual intervention in pod:

Wrong configuration files Fix: Check templates which are being used for config file generation and fix them.
While upgrade some backward incompatible changes prevents ClickHouse from start. Fix: Downgrade and check backward incompatible changes for all versions in between.

Next reasons would require to have manual intervention in pod/volume. There is two ways, how you can get access to data:

Change entry point of ClickHouse pod to something else, so pod wouldn’t be terminated due ClickHouse error.
Attach ClickHouse data volume to some generic pod (like Ubuntu).
Unclear restart which produced broken files and/or state on disk is differs too much from state in zookeeper for replicated tables. Fix: Create force_restore_data flag.
Wrong file permission for ClickHouse files in pod. Fix: Use chown to set right ownership for files and directories.
Errors in ClickHouse table schema prevents ClickHouse from start. Fix: Rename problematic table.sql scripts to table.sql.bak
Occasional failure of distributed queries because of wrong user/password. Due nature of k8s with dynamic ip allocations, it’s possible that ClickHouse would cache wrong ip-> hostname combination and disallow connections because of mismatched hostname. Fix: run SYSTEM DROP DNS CACHE; <disable_internal_dns_cache>1</disable_internal_dns_cache> in config.xml.

Caveats:

Not all configuration/state folders are being covered by persistent volumes. (geobases )
Page cache belongs to k8s node and pv are being mounted to pod, in case of fast shutdown there is possibility to loss some data(needs to be clarified)
Some cloud providers (GKE) can have slow unlink command, which is important for ClickHouse because it’s needed for parts management. (max_part_removal_threads setting)

Useful commands:

kubectl logs chi-chcluster-2-1-0 -c clickhouse-pod -n chcluster --previous
kubectl describe pod chi-chcluster-2-1-0 -n chcluster

Q. ClickHouse is caching the Kafka pod’s IP and trying to connect to the same ip even when there is a new Kafka pod running and the old one is deprecated. Is there some setting where we could refresh the connection

<disable_internal_dns_cache>1</disable_internal_dns_cache> in config.xml

ClickHouse init process failed

It’s due to low value for env CLICKHOUSE_INIT_TIMEOUT value. Consider increasing it up to 1 min. https://github.com/ClickHouse/ClickHouse/blob/9f5cd35a6963cc556a51218b46b0754dcac7306a/docker/server/entrypoint.sh#L120

8.1 - Istio Issues

Working with the popular service mesh

What is Istio?

Per documentation on Istio Project's website , Istio is “an open source service mesh that layers transparently onto existing distributed applications. Istio’s powerful features provide a uniform and more efficient way to secure, connect, and monitor services. Istio is the path to load balancing, service-to-service authentication, and monitoring – with few or no service code changes.”

Istio works quite well at providing this functionality, and does so through controlling service-to-service communication in a Cluster, find-grained control of traffic behavior, routing rules, load-balancing, a policy layer and configuration API supporting access controls, rate limiting, etc.

It also provides metrics about all traffic in a cluster. One can get an amazing amount of metrics from it. Datadog even has a provider that when turned on is a bit like a firehose of information.

Istio essentially uses a proxy to intercapt all network traffic and provides the ability to configured for providing a appliction-aware features.

ClickHouse and Istio

The implications for ClickHouse need to be taken into consideration however, and this page will attempt to address this from real-life scenarios that Altinity devops, infrastructural, and support engineers have had to solve.

Operator High Level Description

The Altinity ClickHouse Operator, when installed using a deployment, also creates four custom resources:

clickhouseinstallations.clickhouse.altinity.com (chi)
clickhousekeeperinstallations.clickhouse-keeper.altinity.com (chk)
clickhouseinstallationtemplates.clickhouse.altinity.com (chit)
clickhouseoperatorconfigurations.clickhouse.altinity.com (chopconf)

For the first two, it uses StatefullSets to run both Keeper and and ClickHouse clusters. For Keeper, it manages how many replicas specified, and for ClickHouse, it manages both how many replicas and shards are specified.

In managing ClickHouseInstallations, it requires that the operator can interact with the database running on clusters it creates using a specific clickhouse_operator user and needs network access rules that allow connection to the ClickHouse pods.

Many of the issues with Istio can pertain to issues where this can be a problem, particularly in the case where the IP address of the Operator pod changes and no longer is allowed to connect to it’s ClickHouse clusters that it manages.

Issue: Authentication error of clickhouse-operator

This was a ClickHouse cluster running in a Kubernetes setup with Istio.

The clickhouse operator was unable to query the clickhouse pods because of authentication errors. After a period of time, the operator gave up yet the ClickHouse cluster (ClickHouseInstallation) worked normally.
Errors showed AUTHENTICATION_FAILED and connections from :ffff:127.0.0.6 are not allowed as well as IP_ADDRESS_NOT_ALLOWED
Also, the clickhouse_operator user correctly configured
There was a recent issue that on the surface looked similar to a recent issue with https://altinity.com/blog/deepseek-clickhouse-and-the-altinity-kubernetes-operator (disabled network access for default user due to issue with DeepSeek) and one idea seemed as if upgrading the operator (which would fix the issue if it were default user).
However, the key to this issue is that the problem was with the clickhouse_operator user, not default user, hence not due to the aforementioned issue.
More consiration was given in light of how Istio effects what services can connect which made it more obvious that it was an issue with using Istio in the operator vs. operator version
The suggestion was given to remove istio from the clickhouse operator ClickHouseInstallation and references this issue https://github.com/Altinity/clickhouse-operator/issues/1261#issuecomment-1797895080
The change required would be something of the sort:

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: clickhouse-operator
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"

---

apiVersion: [clickouse.altinity.com/v1](http://clickouse.altinity.com/v1)
kind: ClickHouseInstallation
metadata:
  name: your-chi
  annotations:
    sidecar.istio.io/inject: "false"

9 - Interfaces

Frequent questions users have about clickhouse-client

9.1 - clickhouse-client

ClickHouse® client

Q. How can I input multi-line SQL code? can you guys give me an example?

A. Just run clickhouse-client with -m switch, and it starts executing only after you finish the line with a semicolon.

Q. How can i use pager with clickhouse-client

A. Here is an example: clickhouse-client --pager 'less -RS'

Q. Data is returned in chunks / several tables.

A. Data get streamed from the server in blocks, every block is formatted individually when the default PrettyCompact format is used. You can use PrettyCompactMonoBlock format instead, using one of the options:

start clickhouse-client with an extra flag: clickhouse-client --format=PrettyCompactMonoBlock
add FORMAT PrettyCompactMonoBlock at the end of your query.
change clickhouse-client default format in the config. See https://github.com/ClickHouse/ClickHouse/blob/976dbe8077f9076387528e2f40b6174f6d8a8b90/programs/client/clickhouse-client.xml#L42

Q. Сustomize client config

A. you can change it globally (for all users of the workstation)

nano /etc/clickhouse-client/conf.d/user.xml

<config>
    <user>default1</user>
    <password>default1</password>
    <host></host>
    <multiline>true</multiline>
    <multiquery>true</multiquery>
</config>
See also https://github.com/ClickHouse/ClickHouse/blob/976dbe8077f9076387528e2f40b6174f6d8a8b90/programs/client/clickhouse-client.xml#L42

or for particular users - by adjusting one of.

./clickhouse-client.xml
~/.clickhouse-client/config.xml

Also, it’s possible to have several client config files and pass one of them to clickhouse-client command explicitly

References:

https://clickhouse.com/docs/en/interfaces/cli

10 - Upgrade

Upgrade notes.

ClickHouse® Version Upgrade Procedure

Step-by-Step Guide:

Normally the upgrade procedure looks like that:

Pick the release to upgrade
- If you upgrade the existing installation with a lot of legacy queries, please pick mature versions with extended lifetime for upgrade (use Altinity Stable Builds or LTS releases from the upstream).
Review Release Notes/Changelog
- Compare the release notes/changelog between your current release and the target release.
- For Altinity Stable Builds: check the release notes of the release you do upgrade to (if you going from some older release - you may need to read several of them for every release in between (for example to upgrade from 22.3 to 23.8 you will need to check 22.8 , 23.3 , 23.8 etc.)
- For upstream releases check the changelog
- Also ensure that no configuration changes are needed.
  - Sometimes, you may need to adjust configuration settings for better compatibility.
  - or to opt-out some new features you don’t need (maybe needed to to make the downgrade path possible, or to make it possible for 2 versions to work together)
Prepare Upgrade Checklist
- Upgrade the package (note that this does not trigger an automatic restart of the clickhouse-server).
- Restart the clickhouse-server service.
- Check health checks and logs.
- Repeat the process on other nodes.
Prepare “Canary” Update Checklist
- Mixing several versions in the same cluster can lead to different degradations. It is usually not recommended to have a significant delay between upgrading different nodes in the same cluster.
- (If needed / depends on use case) stop ingestion into odd replicas / remove them for load-balancer etc.
- Perform the upgrade on the odd replicas first. Once they are back online, repeat same on the even replicas.
- Test and verify that everything works properly. Check for any errors in the log files.
Upgrade Dev/Staging Environment
- Follow 3rd and 4th checklist and perform Upgrade the Dev/Staging environment.
- Ensure your schema/queries work properly in the Dev/staging environment.
- Perform testing before plan for production upgrade.
- Also worth to test the downgrade (to have plan B on upgrade failure)
Upgrade Production
- Once the Dev/Staging environment is verified, proceed with the production upgrade.

Note: Prepare and test downgrade procedures on staging so the server can be returned to the previous version if necessary.

In some upgrade scenarios (depending on which version you are upgrading from and to), when different replicas use different ClickHouse versions, you may encounter the following issues:

Replication doesn’t work at all, and delays grow.
Errors about ‘checksum mismatch’ occur, and traffic between replicas increases as they need to resync merge results. Both problems will be resolved once all replicas are upgraded.

To know more you can Download our free upgrade guide here : https://altinity.com/clickhouse-upgrade-overview/

10.1 - Vulnerabilities

Vulnerabilities

2022-03-15: 7 vulnerabilities in ClickHouse® were published.

See the details https://jfrog.com/blog/7-rce-and-dos-vulnerabilities-found-in-clickhouse-dbms/

Those vulnerabilities were fixed by 2 PRs:

All releases starting from v21.10.2.15 have that problem fixed.

Also, the fix was backported to 21.3 and 21.8 branches - versions v21.8.11.4-lts and v21.3.19.1-lts accordingly have the problem fixed (and all newer releases in those branches).

The latest Altinity stable releases also contain the bugfix.

If you use some older version we recommend upgrading.

Before the upgrade - please ensure that ports 9000 and 8123 are not exposed to the internet, so external clients who can try to exploit those vulnerabilities can not access your clickhouse node.

10.2 - ClickHouse® Function/Engines/Settings Report

Report on ClickHouse® functions, table functions, table engines, system and MergeTree settings, with availability information.

Follow this link for a complete report on ClickHouse® features with their availability: https://github.com/anselmodadams/ChMisc/blob/main/report/report.md . It is frequently updated (at least once a month).

10.3 - Removing empty parts

Removing empty parts

Removing of empty parts is a new feature introduced in ClickHouse® 20.12. Earlier versions leave empty parts (with 0 rows) if TTL removes all rows from a part (https://github.com/ClickHouse/ClickHouse/issues/5491 ). If you set up TTL for your data it is likely that there are quite many empty parts in your system.

The new version notices empty parts and tries to remove all of them immediately. This is a one-time operation which runs right after an upgrade. After that TTL will remove empty parts on its own.

There is a problem when different replicas of the same table start to remove empty parts at the same time. Because of the bug they can block each other (https://github.com/ClickHouse/ClickHouse/issues/23292 ).

What we can do to avoid this problem during an upgrade:

Drop empty partitions before upgrading to decrease the number of empty parts in the system.

SELECT concat('alter table ',database, '.', table, ' drop partition id ''', partition_id, ''';')
FROM system.parts WHERE active
GROUP BY database, table, partition_id
HAVING count() = countIf(rows=0)

Upgrade/restart one replica (in a shard) at a time. If only one replica is cleaning empty parts there will be no deadlock because of replicas waiting for one another. Restart one replica, wait for replication queue to process, then restart the next one.

Removing of empty parts can be disabled by adding remove_empty_parts=0 to the default profile.

$ cat /etc/clickhouse-server/users.d/remove_empty_parts.xml
<clickhouse>
    <profiles>
        <default>
            <remove_empty_parts>0</remove_empty_parts>
        </default>
    </profiles>
</clickhouse>

10.4 - Removing lost parts

Removing lost parts

There might be parts left in ZooKeeper that don’t exist on disk

The explanation is here https://github.com/ClickHouse/ClickHouse/pull/26716

The problem is introduced in ClickHouse® 20.1.

The problem is fixed in 21.8 and backported to 21.3.16, 21.6.9, 21.7.6.

Regarding the procedure to reproduce the issue:

The procedure was not confirmed, but I think it should work.

Wait for a merge on a particular partition (or run an OPTIMIZE to trigger one) At this point you can collect the names of parts participating in the merge from the system.merges table, or the system.parts table.
When the merge finishes, stop one of the replicas before the inactive parts are dropped (or detach the table).
Bring the replica back up (or attach the table). Check that there are no inactive parts in system.parts, but they stayed in ZooKeeper. Also check that the inactive parts got removed from ZooKeeper for another replica. Here is the query to check ZooKeeper:

select name, ctime from system.zookeeper
where path='<table_zpath>/replicas/<replica_name>/parts/'
  and name like '<put an expression for the parts that were merged>'

Drop the partition on the replica that DOES NOT have those extra parts in ZooKeeper. Check the list of parts in ZooKeeper. We hope that after this the parts on disk will be removed on all replicas, but one of the replicas will still have some parts left in ZooKeeper. If this happens, then we think that after a restart of the replica with extra parts in ZooKeeper it will try to download them from another replica.

A query to find ‘forgotten’ parts

https://kb.altinity.com/altinity-kb-useful-queries/parts-consistency/#compare-the-list-of-parts-in-zookeeper-with-the-list-of-parts-on-disk

A query to drop empty partitions with failing replication tasks

select 'alter table '||database||'.'||table||' drop partition id '''||partition_id||''';' 
from (
select database, table, splitByChar('_',new_part_name)[1] partition_id
from system.replication_queue
where type='GET_PART' and not is_currently_executing and create_time < toStartOfDay(yesterday())
group by database, table, partition_id) q
left join 
(select database, table, partition_id, countIf(active) cnt_active, count() cnt_total
from system.parts group by  database, table, partition_id
) p using database, table, partition_id
where cnt_active=0

11 - Dictionaries

All you need to know about creating and using ClickHouse® dictionaries.

For more information on ClickHouse® Dictionaries, see

the presentation https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup34/clickhouse_integration.pdf , slides 82-95, video https://youtu.be/728Yywcd5ys?t=10642

We have also couple of articles about dictionaries in our blog: https://altinity.com/blog/dictionaries-explained https://altinity.com/blog/2020/5/19/clickhouse-dictionaries-reloaded

And some videos: https://www.youtube.com/watch?v=FsVrFbcyb84

11.1 - Dictionaries & arrays

Dictionaries & arrays

Dictionary with ClickHouse® table as a source

Test data

DROP TABLE IF EXISTS arr_src;
CREATE TABLE arr_src
(
    key UInt64,
    array_int Array(Int64),
    array_str Array(String)
) ENGINE = MergeTree order by key;

INSERT INTO arr_src SELECT
    number,
    arrayMap(i -> (number * i), range(5)),
    arrayMap(i -> concat('str', toString(number * i)), range(5))
FROM numbers(1000);

Dictionary

DROP DICTIONARY IF EXISTS arr_dict;
CREATE DICTIONARY arr_dict
(
    key UInt64,
    array_int Array(Int64) DEFAULT [1,2,3],
    array_str Array(String) DEFAULT ['1','2','3']
)
PRIMARY KEY key
SOURCE(CLICKHOUSE(DATABASE 'default' TABLE 'arr_src'))
LIFETIME(120)
LAYOUT(HASHED());

SELECT
    dictGet('arr_dict', 'array_int', toUInt64(42)) AS res_int,
    dictGetOrDefault('arr_dict', 'array_str', toUInt64(424242), ['none']) AS res_str

┌─res_int───────────┬─res_str──┐
│ [0,42,84,126,168] │ ['none'] │
└───────────────────┴──────────┘

Dictionary with PostgreSQL as a source

Test data in PG

create user ch;
create database ch;
GRANT ALL PRIVILEGES ON DATABASE ch TO ch;
ALTER USER ch WITH PASSWORD 'chch';

CREATE TABLE arr_src (
    key int,
    array_int   integer[],
    array_str   text[]
);

INSERT INTO arr_src VALUES
  (42, '{0,42,84,126,168}','{"str0","str42","str84","str126","str168"}'),
  (66, '{0,66,132,198,264}','{"str0","str66","str132","str198","str264"}');

Dictionary Example

CREATE DICTIONARY pg_arr_dict
(
    key UInt64,
    array_int Array(Int64) DEFAULT [1,2,3],
    array_str Array(String) DEFAULT ['1','2','3']
)
PRIMARY KEY key
SOURCE(POSTGRESQL(PORT 5432 HOST 'pg-host'
         user 'ch' password 'chch'  DATABASE  'ch' TABLE 'arr_src'))
LIFETIME(120)
LAYOUT(HASHED());

select * from pg_arr_dict;
┌─key─┬─array_int──────────┬─array_str───────────────────────────────────┐
│  66 │ [0,66,132,198,264] │ ['str0','str66','str132','str198','str264'] │
│  42 │ [0,42,84,126,168]  │ ['str0','str42','str84','str126','str168']  │
└─────┴────────────────────┴─────────────────────────────────────────────┘

SELECT
    dictGet('pg_arr_dict', 'array_int', toUInt64(42)) AS res_int,
    dictGetOrDefault('pg_arr_dict', 'array_str', toUInt64(424242), ['none']) AS res_str

┌─res_int───────────┬─res_str──┐
│ [0,42,84,126,168] │ ['none'] │
└───────────────────┴──────────┘

Dictionary with MySQL as a source

Test data in MySQL

-- casted into CH Arrays

create table arr_src(
   _key bigint(20) NOT NULL, 
   _array_int text, 
   _array_str text, 
   PRIMARY KEY(_key)
);

INSERT INTO arr_src VALUES 
  (42, '[0,42,84,126,168]','[''str0'',''str42'',''str84'',''str126'',''str168'']'),
  (66, '[0,66,132,198,264]','[''str0'',''str66'',''str132'',''str198'',''str264'']');

Dictionary in MySQL

-- supporting table to cast data
CREATE TABLE arr_src
(
    `_key` UInt8,
    `_array_int` String,
    `array_int` Array(Int32) ALIAS cast(_array_int, 'Array(Int32)'),
    `_array_str` String,
    `array_str` Array(String) ALIAS cast(_array_str, 'Array(String)')
)
ENGINE = MySQL('mysql_host', 'ch', 'arr_src', 'ch', 'pass');

-- dictionary fetches data from the supporting table
CREATE DICTIONARY mysql_arr_dict
(
    _key UInt64,
    array_int Array(Int64) DEFAULT [1,2,3],
    array_str Array(String) DEFAULT ['1','2','3']
)
PRIMARY KEY _key
SOURCE(CLICKHOUSE(DATABASE 'default' TABLE 'arr_src'))
LIFETIME(120)
LAYOUT(HASHED());


select * from mysql_arr_dict;
┌─_key─┬─array_int──────────┬─array_str───────────────────────────────────┐
│   66 │ [0,66,132,198,264] │ ['str0','str66','str132','str198','str264'] │
│   42 │ [0,42,84,126,168]  │ ['str0','str42','str84','str126','str168']  │
└──────┴────────────────────┴─────────────────────────────────────────────┘


SELECT
    dictGet('mysql_arr_dict', 'array_int', toUInt64(42)) AS res_int,
    dictGetOrDefault('mysql_arr_dict', 'array_str', toUInt64(424242), ['none']) AS res_str

┌─res_int───────────┬─res_str──┐
│ [0,42,84,126,168] │ ['none'] │
└───────────────────┴──────────┘

SELECT
    dictGet('mysql_arr_dict', 'array_int', toUInt64(66)) AS res_int,
    dictGetOrDefault('mysql_arr_dict', 'array_str', toUInt64(66), ['none']) AS res_str

┌─res_int────────────┬─res_str─────────────────────────────────────┐
│ [0,66,132,198,264] │ ['str0','str66','str132','str198','str264'] │
└────────────────────┴─────────────────────────────────────────────┘

11.2 - Dictionary on the top of several tables using VIEW

Dictionary on the top of several tables using VIEW


DROP TABLE IF EXISTS dictionary_source_en;
DROP TABLE IF EXISTS dictionary_source_ru;
DROP TABLE IF EXISTS dictionary_source_view;
DROP DICTIONARY IF EXISTS flat_dictionary;

CREATE TABLE dictionary_source_en
(
    id UInt64,
    value String
) ENGINE = TinyLog;

INSERT INTO dictionary_source_en VALUES (1, 'One'), (2,'Two'), (3, 'Three');

CREATE TABLE dictionary_source_ru
(
    id UInt64,
    value String
) ENGINE = TinyLog;

INSERT INTO dictionary_source_ru VALUES (1, 'Один'), (2,'Два'), (3, 'Три');

CREATE VIEW dictionary_source_view AS  SELECT id, dictionary_source_en.value as value_en, dictionary_source_ru.value as value_ru  FROM  dictionary_source_en LEFT JOIN dictionary_source_ru USING (id);

select * from dictionary_source_view;

CREATE DICTIONARY flat_dictionary
(
    id UInt64,
    value_en String,
    value_ru String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' PASSWORD '' TABLE 'dictionary_source_view'))
LIFETIME(MIN 1 MAX 1000)
LAYOUT(FLAT());

SELECT
    dictGet(concat(currentDatabase(), '.flat_dictionary'), 'value_en', number + 1),
    dictGet(concat(currentDatabase(), '.flat_dictionary'), 'value_ru', number + 1)
FROM numbers(3);

11.3 - Dimension table design

Dimension table design

Dimension table design considerations

Choosing storage Engine

To optimize the performance of reporting queries, dimensional tables should be loaded into RAM as ClickHouse Dictionaries whenever feasible. It’s becoming increasingly common to allocate 100-200GB of RAM per server specifically for these Dictionaries. Implementing sharding by tenant can further reduce the size of these dimension tables, enabling a greater portion of them to be stored in RAM and thus enhancing query speed.

Different Dictionary Layouts can take more or less RAM (in trade for speed).

The cached dictionary layout is ideal for minimizing the amount of RAM required to store dimensional data when the hit ratio is high. This layout allows frequently accessed data to be kept in RAM while less frequently accessed data is stored on disk, thereby optimizing memory usage without sacrificing performance.
HASHED_ARRAY or SPARSE_HASHED dictionary layouts take less RAM than HASHED. See tests here .
Normalization techniques can be used to lower RAM usage (see below)

If the amount of data is so high that it does not fit in the RAM even after suitable sharding, a disk-based table with an appropriate engine and its parameters can be used for accessing dimensional data in report queries.

MergeTree engines (including Replacing or Aggregating) are not tuned by default for point queries due to the high index granularity (8192) and the necessity of using FINAL (or GROUP BY) when accessing mutated data.

When using the MergeTree engine for Dimensions, the table’s index granularity should be lowered to 256. More RAM will be used for PK, but it’s a reasonable price for reading less data from the disk and making report queries faster, and that amount can be lowered by lightweight PK design (see below).

The EmbeddedRocksDB engine could be used as an alternative. It performs much better than ReplacingMergeTree for highly mutated data, as it is tuned by design for random point queries and high-frequency updates. However, EmbeddedRocksDB does not support Replication, so INSERTing data to such tables should be done over a Distributed table with internal_replication set to false, which is vulnerable to different desync problems. Some “sync” procedures should be designed, developed, and applied after serious data ingesting incidents (like ETL crashes).

When the Dimension table is built on several incoming event streams, AggregatingMergeTree is preferable to ReplacingMergeTree, as it allows putting data from different event streams without external ETL processes:

CREATE TABLE table_C (
    id      UInt64,
    colA    SimpleAggregatingFunction(any,Nullable(UInt32)),
    colB    SimpleAggregatingFunction(max, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY intDiv(id, 0x800000000000000) /* 32 bucket*/
ORDER BY id;

CREATE MATERIALIZED VIEW mv_A TO table_C AS SELECT id,colA FROM Kafka_A;
CREATE MATERIALIZED VIEW mv_B TO table_C AS SELECT id,colB FROM Kafka_B;

EmbeddedRocksDB natively supports UPDATEs without any complications with AggregatingFunctions.

For dimensions where some “start date” column is used in filtering, the Range_Hashed dictionary layout can be used if it is acceptable for RAM usage. For MergeTree variants, ASOF JOIN in queries is needed. Such types of dimensions are the first candidates for placement into RAM.

EmbeddedRocksDB is not suitable here.

Primary Key

To increase query performance, I recommend using a single UInt64 (not String) column for PK, where the upper 32 bits are reserved for tenant_id (shop_id) and the lower 32 bits for actual object_id (like customer_id, product_id, etc.)

That benefits both EmbeddedRocksDB Engine (it can have only one Primary Key column) and ReplacingMergeTree, as FINAL processing will work much faster with a light ORDER BY column of a single UInt64 value.

Direct Dictionary and UDFs

To make the SQL code of report queries more readable and manageable, I recommend always using Dictionaries to access dimensions. A direct dictionary layout should be used for disk-stored dimensions (EmbeddedRocksDB or *MergeTree).

When Clickhouse builds a query to Direct Dictionary, it automatically creates a filter with a list of all needed ID values. There is no need to write code to filter necessary dimension rows to reduce the hash table for the right join table.

Another trick for code manageability is creating an interface function for every dimension to place here all the complexity of managing IDs by packing several values into a single PK value:

create or replace function getCustomer as (shop, id, attr) ->
    dictGetOrNull('dict_Customers', attr, bitOr((bitShiftLeft(toUInt64(shop),32)),id));

It also allows the flexibility of changing dictionary names when testing different types of Engines or can be used to spread dimensional data to several dictionaries. F.e. most active tenants can be served by expensive in-RAM dictionary, while others (not active) tenants will be served from disk.

create or replace function getCustomer as (shop, id, attr) ->
    dictGetOrDefault('dict_Customers_RAM', attr, bitOr((bitShiftLeft(toUInt64(shop),32)),id) as key,
    dictGetOrNull('dict_Customers_MT', attr, key));

We always recommended DENORMALIZATION for Fact tables. However, NORMALIZATION is still a usable approach for taking less RAM for Dimension data stored as dictionaries.

Example of storing a long company name (String) in a separate dictionary:

create or replace function getCustomer as (shop, id, attr) ->
    if(attr='company_name', 
        dictGetOrDefault('dict_Company_name', 'name',
         dictGetOrNull('dict_Customers', 'company_id', 
            bitOr((bitShiftLeft(toUInt64(shop),32)),id)) as key),
        dictGetOrNull('dict_Customers', attr, key)
    );

Example of combining Hash and Direct Dictionaries. Allows to increase lifetime without losing consistency.

CREATE OR REPLACE FUNCTION getProduct AS (product_id, attr) ->
    dictGetOrDefault('hashed_dictionary', attr,(shop_id, product_id),
        dictGet('direct_dictionary',attr,(shop_id, product_id) )
    );

Tests/Examples

EmbeddedRocksDB

CREATE TABLE Dim_Customers (
    id UInt64,
    name String,
    new_or_returning bool
) ENGINE = EmbeddedRocksDB()
PRIMARY KEY (id);

INSERT INTO Dim_Customers
SELECT bitShiftLeft(3648061509::UInt64,32)+number,
  ['Customer A', 'Customer B', 'Customer C', 'Customer D', 'Customer E'][number % 5 + 1],
  number % 2 = 0
FROM numbers(100);

CREATE DICTIONARY dict_Customers
(
    id UInt64,
    name String,
    new_or_returning bool
)
PRIMARY KEY id
LAYOUT(DIRECT())
SOURCE(CLICKHOUSE(TABLE 'Dim_Customers'));

select dictGetOrNull('dict_Customers', 'name', 
  bitOr((bitShiftLeft(toUInt64(shop_id),32)),customer_id));

ReplacingMergeTree

CREATE TABLE Dim_Customers (
    id UInt64, 
    name String,
    new_or_returning bool
) ENGINE = ReplacingMergeTree()
ORDER BY id
PARTITION BY intDiv(id, 0x800000000000000) /* 32 buckets by shop_id */
settings index_granularity=256;

CREATE DICTIONARY dict_Customers
(
    id UInt64,
    name String,
    new_or_returning bool
)
PRIMARY KEY id
LAYOUT(DIRECT())
SOURCE(CLICKHOUSE(query 'select * from Dim_Customers FINAL'));

set do_not_merge_across_partitions_select_final=1; -- or place it to profile
select dictGet('dict_Customers','name',bitShiftLeft(3648061509::UInt64,32)+1);

Tests 1M random reads over 10M entries per shop_id in the Dimension table

EmbeddedRocksDB - 0.003s
ReplacingMergeTree - 0.003s

There is no difference in SELECT on that synthetic test with all MergeTree optimizations applied. The test must be rerun on actual data with the expected update volume. The difference could be seen on a table with high-volume real-time updates.

11.4 - Example of PostgreSQL dictionary

Example of PostgreSQL dictionary

CREATE DICTIONARY postgres_dict
(
    id UInt32,
    value String
)
PRIMARY KEY id
SOURCE(
    POSTGRESQL(
        port 5432
        host 'postgres1'
        user  'postgres'
        password 'mysecretpassword'
        db 'clickhouse'
        table 'test_schema.test_table'
    )
)
LIFETIME(MIN 300 MAX 600)
LAYOUT(HASHED());

and later do

SELECT dictGetString(postgres_dict, 'value', toUInt64(1))

11.5 - MySQL8 source for dictionaries

MySQL8 source for dictionaries

Authorization

MySQL8 used default authorization plugin caching_sha2_password. Unfortunately, libmysql which currently used (21.4-) in ClickHouse® is not.

You can fix it during create custom user with mysql_native_password authentication plugin.

CREATE USER IF NOT EXISTS 'clickhouse'@'%'
IDENTIFIED WITH mysql_native_password BY 'clickhouse_user_password';

CREATE DATABASE IF NOT EXISTS test;

GRANT ALL PRIVILEGES ON test.* TO 'clickhouse'@'%';

Table schema changes

As an example, in ClickHouse, run SHOW TABLE STATUS LIKE 'table_name' and try to figure out was table schema changed or not from MySQL response field Update_time.

By default, to properly data loading from MySQL8 source to dictionaries, please turn off the information_schema cache.

You can change default behavior with create /etc/mysql/conf.d/information_schema_cache.cnfwith following content:

[mysqld]
information_schema_stats_expiry=0

Or setup it via SQL query:

SET GLOBAL information_schema_stats_expiry=0;

11.6 - Partial updates

Partial updates

ClickHouse® is able to fetch from a source only updated rows. You need to define update_field section.

As an example, We have a table in an external source MySQL, PG, HTTP, … defined with the following code sample:

CREATE TABLE cities
(
    `polygon` Array(Tuple(Float64, Float64)),
    `city` String,
    `updated_at` DateTime DEFAULT now()
)
ENGINE = MergeTree ORDER BY city

When you add new row and update some rows in this table you should update updated_at with the new timestamp.

-- fetch updated rows every 30 seconds

CREATE DICTIONARY cities_dict (
    polygon Array(Tuple(Float64, Float64)),
    city String
)
PRIMARY KEY polygon
SOURCE(CLICKHOUSE( TABLE cities DB 'default'
                    update_field 'updated_at'))
LAYOUT(POLYGON())
LIFETIME(MIN 30 MAX 30)

A dictionary with update_field updated_at will fetch only updated rows. A dictionary saves the current time (now) time of the last successful update and queries the source where updated_at >= previous_update - 1 (shift = 1 sec.).

In case of HTTP source ClickHouse will send get requests with update_field as an URL parameter &updated_at=2020-01-01%2000:01:01

11.7 - range_hashed example - open intervals

range_hashed example - open intervals

The following example shows a range_hashed example at open intervals.

DROP TABLE IF EXISTS rates;

DROP DICTIONARY IF EXISTS rates_dict;

CREATE TABLE rates (
  id UInt64,
  date_start Nullable(Date),
  date_end Nullable(Date),
  rate Decimal64(4)
) engine=Log;

INSERT INTO rates VALUES (1, Null, '2021-03-13',99), (1, '2021-03-14','2021-03-16',100), (1, '2021-03-17', Null, 101), (2, '2021-03-14', Null, 200), (3, Null, '2021-03-14', 300), (4, '2021-03-14', '2021-03-14', 400);

CREATE DICTIONARY rates_dict
(
  id UInt64,
  date_start Date,
  date_end Date,
  rate Decimal64(4)
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' TABLE 'rates'))
LIFETIME(MIN 1 MAX 1000)
LAYOUT(RANGE_HASHED())
RANGE(MIN date_start MAX date_end);

SELECT * FROM rates_dict order by id, date_start;

┌─id─┬─date_start─┬───date_end─┬─────rate─┐
│  1 │ 1970-01-01 │ 2021-03-13 │  99.0000 │
│  1 │ 2021-03-14 │ 2021-03-16 │ 100.0000 │
│  1 │ 2021-03-17 │ 1970-01-01 │ 101.0000 │
│  2 │ 2021-03-14 │ 1970-01-01 │ 200.0000 │
│  3 │ 1970-01-01 │ 2021-03-14 │ 300.0000 │
│  4 │ 2021-03-14 │ 2021-03-14 │ 400.0000 │
└────┴────────────┴────────────┴──────────┘

WITH
  toDate('2021-03-10') + INTERVAL number DAY as date
select
  date,
  dictGet(currentDatabase() || '.rates_dict', 'rate', toUInt64(1), date) as rate1,
  dictGet(currentDatabase() || '.rates_dict', 'rate', toUInt64(2), date) as rate2,
  dictGet(currentDatabase() || '.rates_dict', 'rate', toUInt64(3), date) as rate3,
  dictGet(currentDatabase() || '.rates_dict', 'rate', toUInt64(4), date) as rate4
FROM numbers(10);

┌───────date─┬────rate1─┬────rate2─┬────rate3─┬────rate4─┐
│ 2021-03-10 │  99.0000 │   0.0000 │ 300.0000 │   0.0000 │
│ 2021-03-11 │  99.0000 │   0.0000 │ 300.0000 │   0.0000 │
│ 2021-03-12 │  99.0000 │   0.0000 │ 300.0000 │   0.0000 │
│ 2021-03-13 │  99.0000 │   0.0000 │ 300.0000 │   0.0000 │
│ 2021-03-14 │ 100.0000 │ 200.0000 │ 300.0000 │ 400.0000 │
│ 2021-03-15 │ 100.0000 │ 200.0000 │   0.0000 │   0.0000 │
│ 2021-03-16 │ 100.0000 │ 200.0000 │   0.0000 │   0.0000 │
│ 2021-03-17 │ 101.0000 │ 200.0000 │   0.0000 │   0.0000 │
│ 2021-03-18 │ 101.0000 │ 200.0000 │   0.0000 │   0.0000 │
│ 2021-03-19 │ 101.0000 │ 200.0000 │   0.0000 │   0.0000 │
└────────────┴──────────┴──────────┴──────────┴──────────┘

11.8 - Security named collections

Security named collections

Dictionary with ClickHouse® table as a source with named collections

Data for connecting to external sources can be stored in named collections

<clickhouse>
    <named_collections>
        <local_host>
            <host>localhost</host>
            <port>9000</port>
            <database>default</database>
            <user>ch_dict</user>
            <password>mypass</password>
        </local_host>
    </named_collections>
</clickhouse>

Dictionary

DROP DICTIONARY IF EXISTS named_coll_dict;
CREATE DICTIONARY named_coll_dict
(
    key UInt64,
    val String
)
PRIMARY KEY key
SOURCE(CLICKHOUSE(NAME local_host TABLE my_table DB default))
LIFETIME(MIN 1 MAX 2)
LAYOUT(HASHED());

INSERT INTO my_table(key, val) VALUES(1, 'first row');

SELECT dictGet('named_coll_dict', 'b', 1);
┌─dictGet('named_coll_dict', 'b', 1)─┐
│ first row                          │
└────────────────────────────────────┘

11.9 - SPARSE_HASHED VS HASHED vs HASHED_ARRAY

SPARSE_HASHED VS HASHED VS HASHED_ARRAY

Sparse_hashed and hashed_array layouts are supposed to save memory but has some downsides. We can test it with the following:

create table orders(id UInt64, price Float64)
Engine = MergeTree() order by id;

insert into orders select number, 0 from numbers(5000000);

CREATE DICTIONARY orders_hashed (id UInt64, price Float64)
PRIMARY KEY id SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000
TABLE orders DB 'default' USER 'default'))
LIFETIME(MIN 0 MAX 0) LAYOUT(HASHED());

CREATE DICTIONARY orders_sparse (id UInt64, price Float64)
PRIMARY KEY id SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000
TABLE orders DB 'default' USER 'default'))
LIFETIME(MIN 0 MAX 0) LAYOUT(SPARSE_HASHED());

CREATE DICTIONARY orders_hashed_array (id UInt64, price Float64)
PRIMARY KEY id SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000
TABLE orders DB 'default' USER 'default'))
LIFETIME(MIN 0 MAX 0) LAYOUT(HASHED_ARRAY());

SELECT
    name,
    type,
    status,
    element_count,
    formatReadableSize(bytes_allocated) AS RAM
FROM system.dictionaries
WHERE name LIKE 'orders%'
┌─name────────────────┬─type─────────┬─status─┬─element_count─┬─RAM────────┐
│ orders_hashed_array │ HashedArray  │ LOADED │       5000000 │ 68.77 MiB  │
│ orders_sparse       │ SparseHashed │ LOADED │       5000000 │ 76.30 MiB  │
│ orders_hashed       │ Hashed       │ LOADED │       5000000 │ 256.00 MiB │
└─────────────────────┴──────────────┴────────┴───────────────┴────────────┘

SELECT sum(dictGet('default.orders_hashed', 'price', toUInt64(number))) AS res
FROM numbers(10000000)
┌─res─┐
│   0 │
└─────┘
1 rows in set. Elapsed: 0.546 sec. Processed 10.01 million rows ...

SELECT sum(dictGet('default.orders_sparse', 'price', toUInt64(number))) AS res
FROM numbers(10000000)
┌─res─┐
│   0 │
└─────┘
1 rows in set. Elapsed: 1.422 sec. Processed 10.01 million rows ...

SELECT sum(dictGet('default.orders_hashed_array', 'price', toUInt64(number))) AS res
FROM numbers(10000000)
┌─res─┐
│   0 │
└─────┘
1 rows in set. Elapsed: 0.558 sec. Processed 10.01 million rows ...

As you can see SPARSE_HASHED is memory efficient and use about 3 times less memory (!!!) but is almost 3 times slower as well. On the other side HASHED_ARRAY is even more efficient in terms of memory usage and maintains almost the same performance as HASHED layout.

12 - Using This Knowledge Base

Add pages, make updates, and contribute to this ClickHouse® knowledge base.

The Altinity Knowledge Base is built on GitHub Pages, using Hugo and Docsy. This guide provides a brief description on how to make updates and add to this knowledge base.

Page and Section Basics

The knowledge base is structured in a simple directory format, with the content of the Knowledge Base stored under the directory content/en.

Each section is a directory, with the file _index.md that provides that sections information. For example, the Upgrade section has the following layout:

├── upgrade
│   ├── _index.md
│   └── removing-empty-parts.md

Each Markdown file provides the section’s information and the title that is displayed on the left navigation panel, with the file _index.md providing the top level information on the section.

Each page is set in the following format that sets the page attributes:

---
title: "Using This Knowledge Base"
linkTitle: "Using This Knowledge Base"
description: >
    How to add pages, make updates, and expand this knowledge base.    
weight: 11
---
The content of the page in Markdown format.

The attributes are as follows:

title: The title of the page displayed at the top of the page.
linkTitle: The title used in the left navigation panel.
description: A short description of the page listed under the title.
weight: The placement of the page in the hierarchy in the left navigation panel. The higher the weight, the higher in the display order it will be. For example, the file engines/_index.md has a weight of 1, pushing its display to the top of the list.

Create Pages and Sections

Create or Edit A Page

To create a new page or edit an existing one in the knowledge base:

From the page to start from:
1. To create a new page, select Create child page.
2. To edit an existing page, select Edit this page.
This will open the page’s location in the GitHub repository. Update the page using Markdown. See the Docsy Formatting Options section below for tips and details.
1. View how the page will look Preview. The GitHub Preview is not 100% the same as the page will be displayed on the knowledge base, but it is a close enough approximation.
Saving the file will depend on your role.
1. For those who have been granted Knowledgebase Contributor status, select Commit New File. The changes will be automatically applied to the GitHub repository, and the additions will be displayed to the knowledge base within 1-5 minutes.
2. For those who have not been granted Knowledgebase Contributor status, they will have to fork the changes and then create a new pull request through the following process:
  1. When editing is complete, select Propose New File. This will being you to the GitHub Pull Request page.
  2. Verify the new file is accurate, then select Create Pull Request.
  3. Name the Pull Request, then select Create pull request.
  4. First time contributors will be required to review and sign the Contributor License Agreement(CLA) . To signify they agree with the CLA, the following comment must be left as part of the pull request:
```
I have read the CLA Document and I hereby sign the CLA
```
  This signature will be stored as part of the GitHub repository indicating the GitHub username, the date of the agreement, and the pull request where the signer indicated their consent with the CLA.
  1. The Pull Request will be reviewed and if approved, the changes will be applied to the Knowledge Base.

Create a New Section

To create a new section in the knowledge base, add a new directory under content/en from either the GitHub Repository or through some other GitHub related method., and add the file index.md. The same submission process will be followed as outlined in Create or Edit A Page .

Docsy Formatting Options

Docsy uses Markdown , providing a simple method of formatting documents. Refer to the Markdown documentation for how to edit pages and achieve the display results.

The following guide recommendations should be followed:

Code should should be code segments, which uses three back tics to start and end a code section, with the type of code used. For example, if the code segment is regarding SQL then the section would start with ```sql` .
Display text should be in bold. For example, when requesting someone click Create New Page on a page, Create New Page is in bold.

Adding Images

New images and other static files are stored in the directory static, with the following categories:

Images are stored under static/assets.
Pdf files are stored under static/assets

12.1 - Mermaid Example

A short example of using the Mermaid library to add charts.

This Knowledge Base now supports Mermaid , a handy way to create charts from text. The following example shows a very simple chart, and the code to use.

To add a Mermaid chart, encase the Mermaid code between {{< mermaid >}}, as follows:

{{<mermaid>}}
graph TD;
  A-->B;
  A-->C;
  B-->D;
  C-->D;
{{</mermaid>}}

Get in touch with ClickHouse experts.

Altinity® Knowledge Base for ClickHouse®

Welcome to the Altinity® Knowledge Base (KB) for ClickHouse®

1 - Engines

Short summary

More

1.1 - ClickHouse® Atomic Database Engine

FAQ

Q. Data is not removed immediately

Q. I cannot reuse zookeeper path after dropping the table.

Q. Should I use Atomic or Ordinary for new setups?

Using Ordinary by default instead of Atomic

Other sources

1.1.1 - How to Convert Ordinary to Atomic

New, official way

1.1.2 - How to Convert Atomic to Ordinary

Warning

Schemas with Materialized VIEW

1.2 - EmbeddedRocksDB & dictionary

Direct queries to tables to request 10000 rows by a random key

dictGet – 100.00 thousand random rows

dictGet – 1million random rows

dictGet – 1million random rows from Hashed

dictGet – 1million random rows from SparseHashed

1.3 - MergeTree table engine family

1.3.1 - CollapsingMergeTree vs ReplacingMergeTree

CollapsingMergeTree vs ReplacingMergeTree

1.3.2 - Part names & MVCC

Part names & multiversion concurrency control

1.3.3 - How to pick an ORDER BY / PRIMARY KEY / PARTITION BY for the MergeTree family table

For Summing / Aggregating

For Replacing / Collapsing

ORDER BY example

PARTITION BY

See also

1.3.4 - ClickHouse® AggregatingMergeTree

Last non-null value for each column

Merge two data streams

1.3.5 - index & column files

1.3.6 - Merge performance and OPTIMIZE FINAL

Merge Performance

OPTIMIZE TABLE example FINAL DEDUPLICATE BY expr

1.3.7 - Nulls in order by

1.3.8 - ReplacingMergeTree

General Operations

Engine Parameters

DML operations

FINAL

Deleting the data

Use cases

Last state

Single key

Multiple keys

Full table

1.3.8.1 - ReplacingMergeTree does not collapse duplicates

1.3.9 - Skip index

Warning

1.3.10 - SummingMergeTree

Nested structures

1.3.11 - UPSERT by VersionedCollapsingMergeTree

Challenges with mutated data

Collapsing

Replace data in another partition

Row deduplication

Getting old row

UPSERT by Collapsing

Speed Test

Adding projections

DELETEs inaccuracy

Combine old and new

Complex Primary Key

Sharding

2 - Queries & Syntax

2.1 - GROUP BY

Internal implementation

Two-Level

GROUP BY in external memory

optimize_aggregation_in_order GROUP BY

Last item cache

StringHashMap