Queries & Syntax

Learn about ClickHouse queries & syntax, including Joins & Window Functions.

1 - GROUP BY

Learn about the GROUP BY clause in ClickHouse.

Internal implementation

Code

ClickHouse does not aggregate into a single shared, lock-protected hash table; instead, each thread has at least one hash table of its own.

This makes it easy to avoid synchronization between threads, but it has disadvantages:

  1. Bigger memory usage.
  2. The per-thread hash tables have to be merged afterwards.

Because the second step can be a bottleneck in case of a really big GROUP BY with a lot of distinct keys, another solution was introduced.

Two-Level

https://youtu.be/SrucFOs8Y6c?t=2132
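
The thresholds that control when two-level aggregation kicks in are regular settings, so you can inspect them in system.settings. A query along these lines (an illustrative sketch) returns the rows shown below:

SELECT name, value, changed, description, min, max, readonly, type
FROM system.settings
WHERE name LIKE 'group_by_two_level_threshold%'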

┌─name───────────────────────────────┬─value────┬─changed─┬─description────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─min──┬─max──┬─readonly─┬─type───┐
│ group_by_two_level_threshold       │ 100000   │       0 │ From what number of keys, a two-level aggregation starts. 0 - the threshold is not set.                                                                                                                    │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ UInt64 │
│ group_by_two_level_threshold_bytes │ 50000000 │       0 │ From what size of the aggregation state in bytes, a two-level aggregation begins to be used. 0 - the threshold is not set. Two-level aggregation is used when at least one of the thresholds is triggered. │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │        0 │ UInt64 │
└────────────────────────────────────┴──────────┴─────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────┴──────┴──────────┴────────┘

In order to parallelize the merging of the hash tables, i.e. execute that merge via multiple threads, ClickHouse uses a two-level approach:

In the first step, each thread splits its hash table into 256 buckets (the bucket is determined by one byte of the hash). In the second step, ClickHouse can merge those 256 buckets independently, using multiple threads.

https://github.com/ClickHouse/ClickHouse/blob/1ea637d996715d2a047f8cd209b478e946bdbfb0/src/Common/HashTable/TwoLevelHashTable.h#L6

GROUP BY in external memory

It utilizes two-level GROUP BY and dumps those buckets to disk. At the last stage ClickHouse reads the buckets from disk one by one and merges them, so you only need enough RAM to hold one bucket (roughly 1/256 of the whole GROUP BY state).

https://clickhouse.com/docs/en/sql-reference/statements/select/group-by/#select-group-by-in-external-memory
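
A minimal sketch of enabling it (the byte values here are arbitrary; the query reuses the sessions table defined in the tricks section below):

SET max_bytes_before_external_group_by = 10000000000; -- start spilling aggregation state to disk above ~10 GB
SET max_memory_usage = 20000000000;                   -- keep the hard memory limit above the spill threshold

SELECT
    user_id,
    uniqExact(session_id)
FROM sessions
GROUP BY user_id
FORMAT `Null`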

optimize_aggregation_in_order GROUP BY

Usually it works slower than a regular GROUP BY, because ClickHouse needs to read and process the data in a specific order, which makes it much harder to parallelize reading and aggregating.

But it uses much less memory, because ClickHouse can stream the result set and does not need to keep the whole aggregation state in memory.
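
A minimal sketch, assuming the GROUP BY keys form a prefix of the table's ORDER BY (as app, user_id do for the sessions table defined in the tricks section below):

SET optimize_aggregation_in_order = 1;

SELECT
    app,
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    app,
    user_id
FORMAT `Null`

-- the trace log then shows AggregatingInOrderTransform instead of the regular Aggregator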

Last item cache

ClickHouse saves the result of the previous hash calculation, in case the next value turns out to be the same.

https://github.com/ClickHouse/ClickHouse/pull/5417 https://github.com/ClickHouse/ClickHouse/blob/808d9afd0f8110faba5ae027051bf0a64e506da3/src/Common/ColumnsHashingImpl.h#L40

StringHashMap

Actually uses 5 different hash tables:

  1. For empty strings
  2. For strings of 1-8 bytes
  3. For strings of 9-16 bytes
  4. For strings of 17-24 bytes
  5. For strings longer than 24 bytes

SELECT count()
FROM
(
    SELECT materialize('1234567890123456') AS key           -- length(key) = 16
    FROM numbers(1000000000)
)
GROUP BY key

Aggregator: Aggregation method: key_string

Elapsed: 8.888 sec. Processed 1.00 billion rows, 8.00 GB (112.51 million rows/s., 900.11 MB/s.)

SELECT count()
FROM
(
    SELECT materialize('12345678901234567') AS key          -- length(key) = 17
    FROM numbers(1000000000)
)
GROUP BY key

Aggregator: Aggregation method: key_string

Elapsed: 9.089 sec. Processed 1.00 billion rows, 8.00 GB (110.03 million rows/s., 880.22 MB/s.)

SELECT count()
FROM
(
    SELECT materialize('123456789012345678901234') AS key   -- length(key) = 24
    FROM numbers(1000000000)
)
GROUP BY key

Aggregator: Aggregation method: key_string

Elapsed: 9.134 sec. Processed 1.00 billion rows, 8.00 GB (109.49 million rows/s., 875.94 MB/s.)

SELECT count()
FROM
(
    SELECT materialize('1234567890123456789012345') AS key  -- length(key) = 25
    FROM numbers(1000000000)
)
GROUP BY key

Aggregator: Aggregation method: key_string

Elapsed: 12.566 sec. Processed 1.00 billion rows, 8.00 GB (79.58 million rows/s., 636.67 MB/s.)

key length, bytes    elapsed, sec
16                   8.89
17                   9.09
24                   9.13
25                   12.57

What a GROUP BY statement uses memory for

  1. Hash tables

They will grow with:

  • the number of unique combinations of keys participating in GROUP BY
  • the size of the keys participating in GROUP BY

  2. States of aggregation functions:

Be careful with functions whose state can use an unrestricted amount of memory and grow indefinitely (see the sketch after this list):

  • groupArray (groupArray(1000)())
  • uniqExact (uniq,uniqCombined)
  • quantileExact (medianExact) (quantile,quantileTDigest)
  • windowFunnel
  • groupBitmap
  • sequenceCount (sequenceMatch)
  • *Map
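
For example (a sketch using the sessions table from the tricks section below), replacing uniqExact with an approximate counterpart keeps the per-key state bounded:

-- exact distinct count: the state stores every distinct session_id per user
SELECT user_id, uniqExact(session_id) FROM sessions GROUP BY user_id FORMAT `Null`

-- approximate distinct count: the state is a bounded sketch
SELECT user_id, uniqCombined(session_id) FROM sessions GROUP BY user_id FORMAT `Null`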

Why does my GROUP BY eat all the RAM?

  1. Run your query with SET send_logs_level='trace'.

  2. Remove all aggregation functions from the query to see how much memory a plain GROUP BY takes.

  3. One by one, remove the aggregation functions from the query to understand which one takes the most memory (see the sketch below).
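
A sketch of that workflow (the table and aggregate functions are placeholders; the MemoryTracker peak lines appear in the client output once send_logs_level is set to trace):

SET send_logs_level = 'trace';

-- step 2: bare GROUP BY without aggregate functions
SELECT user_id
FROM sessions
GROUP BY user_id
FORMAT `Null`

-- step 3: add the aggregate functions back one at a time and compare the
-- "MemoryTracker: Peak memory usage (for query)" lines between runs
SELECT
    user_id,
    uniqExact(session_id)
FROM sessions
GROUP BY user_id
FORMAT `Null`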

1.1 - GROUP BY tricks

Tricks for GROUP BY memory usage optimization

Tricks

Testing dataset

CREATE TABLE sessions
(
    `app` LowCardinality(String),
    `user_id` String,
    `created_at` DateTime,
    `platform` LowCardinality(String),
    `clicks` UInt32,
    `session_id` UUID
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (app, user_id, session_id, created_at)

INSERT INTO sessions WITH
    CAST(number % 4, 'Enum8(\'Orange\' = 0, \'Melon\' = 1, \'Red\' = 2, \'Blue\' = 3)') AS app,
    concat('UID: ', leftPad(toString(number % 20000000), 8, '0')) AS user_id,
    toDateTime('2021-10-01 10:11:12') + (number / 300) AS created_at,
    CAST((number + 14) % 3, 'Enum8(\'Bat\' = 0, \'Mice\' = 1, \'Rat\' = 2)') AS platform,
    number % 17 AS clicks,
    generateUUIDv4() AS session_id
SELECT
    app,
    user_id,
    created_at,
    platform,
    clicks,
    session_id
FROM numbers_mt(1000000000)

0 rows in set. Elapsed: 46.078 sec. Processed 1.00 billion rows, 8.00 GB (21.70 million rows/s., 173.62 MB/s.)

┌─database─┬─table────┬─column─────┬─type───────────────────┬───────rows─┬─compressed_bytes─┬─compressed─┬─uncompressed─┬──────────────ratio─┬─codec─┐
 default   sessions  session_id  UUID                    1000000000       16065918103  14.96 GiB   14.90 GiB     0.9958970223439835        
 default   sessions  user_id     String                  1000000000        3056977462  2.85 GiB    13.04 GiB       4.57968701896828        
 default   sessions  clicks      UInt32                  1000000000        1859359032  1.73 GiB    3.73 GiB       2.151278979023993        
 default   sessions  created_at  DateTime                1000000000        1332089630  1.24 GiB    3.73 GiB      3.0028009451586226        
 default   sessions  platform    LowCardinality(String)  1000000000         329702248  314.43 MiB  956.63 MiB     3.042446801879252        
 default   sessions  app         LowCardinality(String)  1000000000           4825544  4.60 MiB    956.63 MiB    207.87333386660654        
└──────────┴──────────┴────────────┴────────────────────────┴────────────┴──────────────────┴────────────┴──────────────┴────────────────────┴───────┘

Every query and dataset is unique, so in different situations different tricks could work better or worse.

PreFilter values before GROUP BY

SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE created_at > '2021-11-01 00:00:00'
GROUP BY user_id
HAVING (argMax(clicks, created_at) = 16) AND (argMax(platform, created_at) = 'Rat')
FORMAT `Null`


<Debug> MemoryTracker: Peak memory usage (for query): 18.36 GiB.

SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE user_id IN (
    SELECT user_id
    FROM sessions
    WHERE (platform = 'Rat') AND (clicks = 16) AND (created_at > '2021-11-01 00:00:00') -- So we will select user_id which could potentially match our HAVING clause in OUTER query.
) AND (created_at > '2021-11-01 00:00:00')
GROUP BY user_id
HAVING (argMax(clicks, created_at) = 16) AND (argMax(platform, created_at) = 'Rat')
FORMAT `Null`

<Debug> MemoryTracker: Peak memory usage (for query): 4.43 GiB.

Use Fixed-width data types instead of String

For example, you have a String column whose values have a special form like this:

'ABX 1412312312313'

You can just remove the first 4 characters and convert the rest to UInt64:

toUInt64(substr('ABX 1412312312313', 5))

And you have packed 17 bytes into 8, more than halving the size!

SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    user_id,
    platform
FORMAT `Null`

Aggregator: Aggregation method: serialized

<Debug> MemoryTracker: Peak memory usage (for query): 28.19 GiB.

Elapsed: 7.375 sec. Processed 1.00 billion rows, 27.00 GB (135.60 million rows/s., 3.66 GB/s.)

WITH
    CAST(user_id, 'FixedString(14)') AS user_fx,
    CAST(platform, 'FixedString(4)') AS platform_fx
SELECT
    user_fx,
    sum(clicks)
FROM sessions
GROUP BY
    user_fx,
    platform_fx
FORMAT `Null`

Aggregator: Aggregation method: keys256

MemoryTracker: Peak memory usage (for query): 22.24 GiB.

Elapsed: 6.637 sec. Processed 1.00 billion rows, 27.00 GB (150.67 million rows/s., 4.07 GB/s.)

WITH
    CAST(user_id, 'FixedString(14)') AS user_fx,
    CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 0)') AS platform_enum
SELECT
    user_fx,
    sum(clicks)
FROM sessions
GROUP BY
    user_fx,
    platform_enum
FORMAT `Null`

Aggregator: Aggregation method: keys128

MemoryTracker: Peak memory usage (for query): 14.14 GiB.

Elapsed: 5.335 sec. Processed 1.00 billion rows, 27.00 GB (187.43 million rows/s., 5.06 GB/s.)

WITH
    toUInt32OrZero(trim( LEADING '0' FROM substr(user_id,6))) AS user_int,
    CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 0)') AS platform_enum
SELECT
    user_int,
    sum(clicks)
FROM sessions
GROUP BY
    user_int,
    platform_enum
FORMAT `Null`

Aggregator: Aggregation method: keys64

MemoryTracker: Peak memory usage (for query): 10.14 GiB.

Elapsed: 8.549 sec. Processed 1.00 billion rows, 27.00 GB (116.97 million rows/s., 3.16 GB/s.)


WITH
    toUInt32('1' || substr(user_id,6)) AS user_int,
    CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 0)') AS platform_enum
SELECT
    user_int,
    sum(clicks)
FROM sessions
GROUP BY
    user_int,
    platform_enum
FORMAT `Null`

Aggregator: Aggregation method: keys64

Peak memory usage (for query): 10.14 GiB.

Elapsed: 6.247 sec. Processed 1.00 billion rows, 27.00 GB (160.09 million rows/s., 4.32 GB/s.)

It can be especially useful when you try to do GROUP BY lc_column_1, lc_column_2 and ClickHouse falls back to the serialized algorithm.

Two LowCardinality Columns in GROUP BY

SELECT
    app,
    sum(clicks)
FROM sessions
GROUP BY app
FORMAT `Null`

Aggregator: Aggregation method: low_cardinality_key_string

MemoryTracker: Peak memory usage (for query): 43.81 MiB.

Elapsed: 0.545 sec. Processed 1.00 billion rows, 5.00 GB (1.83 billion rows/s., 9.17 GB/s.)

SELECT
    app,
    platform,
    sum(clicks)
FROM sessions
GROUP BY
    app,
    platform
FORMAT `Null`

Aggregator: Aggregation method: serialized -- Slowest method!

MemoryTracker: Peak memory usage (for query): 222.86 MiB.

Elapsed: 2.923 sec. Processed 1.00 billion rows, 6.00 GB (342.11 million rows/s., 2.05 GB/s.)

SELECT
    CAST(app, 'FixedString(6)') AS app_fx,
    CAST(platform, 'FixedString(4)') AS platform_fx,
    sum(clicks)
FROM sessions
GROUP BY
    app_fx,
    platform_fx
FORMAT `Null`

Aggregator: Aggregation method: keys128

MemoryTracker: Peak memory usage (for query): 160.23 MiB.

Elapsed: 1.632 sec. Processed 1.00 billion rows, 6.00 GB (612.63 million rows/s., 3.68 GB/s.)

Split your query into multiple smaller queries and execute them one by one

SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    user_id,
    platform
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 28.19 GiB.

Elapsed: 7.375 sec. Processed 1.00 billion rows, 27.00 GB (135.60 million rows/s., 3.66 GB/s.)


SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE (cityHash64(user_id) % 4) = 0
GROUP BY
    user_id,
    platform
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 8.16 GiB.

Elapsed: 2.910 sec. Processed 1.00 billion rows, 27.00 GB (343.64 million rows/s., 9.28 GB/s.)

Shard your data by one of the common high-cardinality GROUP BY keys

So each shard will hold roughly 1/N of all unique key combinations, which results in smaller hash tables.

Let’s create 2 distributed tables with different sharding: by rand() and by cityHash64(user_id)

CREATE TABLE sessions_distributed AS sessions
ENGINE = Distributed('distr-groupby', default, sessions, rand());

INSERT INTO sessions_distributed WITH
    CAST(number % 4, 'Enum8(\'Orange\' = 0, \'Melon\' = 1, \'Red\' = 2, \'Blue\' = 3)') AS app,
    concat('UID: ', leftPad(toString(number % 20000000), 8, '0')) AS user_id,
    toDateTime('2021-10-01 10:11:12') + (number / 300) AS created_at,
    CAST((number + 14) % 3, 'Enum8(\'Bat\' = 0, \'Mice\' = 1, \'Rat\' = 2)') AS platform,
    number % 17 AS clicks,
    generateUUIDv4() AS session_id
SELECT
    app,
    user_id,
    created_at,
    platform,
    clicks,
    session_id
FROM numbers_mt(1000000000);

CREATE TABLE sessions_2 ON CLUSTER 'distr-groupby'
(
    `app` LowCardinality(String),
    `user_id` String,
    `created_at` DateTime,
    `platform` LowCardinality(String),
    `clicks` UInt32,
    `session_id` UUID
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (app, user_id, session_id, created_at);

CREATE TABLE sessions_distributed_2 AS sessions
ENGINE = Distributed('distr-groupby', default, sessions_2, cityHash64(user_id));

INSERT INTO sessions_distributed_2 WITH
    CAST(number % 4, 'Enum8(\'Orange\' = 0, \'Melon\' = 1, \'Red\' = 2, \'Blue\' = 3)') AS app,
    concat('UID: ', leftPad(toString(number % 20000000), 8, '0')) AS user_id,
    toDateTime('2021-10-01 10:11:12') + (number / 300) AS created_at,
    CAST((number + 14) % 3, 'Enum8(\'Bat\' = 0, \'Mice\' = 1, \'Rat\' = 2)') AS platform,
    number % 17 AS clicks,
    generateUUIDv4() AS session_id
SELECT
    app,
    user_id,
    created_at,
    platform,
    clicks,
    session_id
FROM numbers_mt(1000000000);
SELECT
    app,
    platform,
    sum(clicks)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS clicks
    FROM sessions_distributed
    GROUP BY user_id
)
GROUP BY
    app,
    platform;

[chi-distr-groupby-distr-groupby-0-0-0] MemoryTracker: Current memory usage (for query): 12.02 GiB.
[chi-distr-groupby-distr-groupby-1-0-0] MemoryTracker: Current memory usage (for query): 12.05 GiB.
[chi-distr-groupby-distr-groupby-2-0-0] MemoryTracker: Current memory usage (for query): 12.05 GiB.

MemoryTracker: Peak memory usage (for query): 12.20 GiB.

12 rows in set. Elapsed: 28.345 sec. Processed 1.00 billion rows, 32.00 GB (35.28 million rows/s., 1.13 GB/s.)

SELECT
    app,
    platform,
    sum(clicks)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS clicks
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform;

[chi-distr-groupby-distr-groupby-0-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-1-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-2-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.

MemoryTracker: Peak memory usage (for query): 5.61 GiB.

12 rows in set. Elapsed: 11.952 sec. Processed 1.00 billion rows, 32.00 GB (83.66 million rows/s., 2.68 GB/s.)

SELECT
    app,
    platform,
    sum(clicks)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS clicks
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS optimize_distributed_group_by_sharding_key = 1

[chi-distr-groupby-distr-groupby-0-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-1-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-2-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
MemoryTracker: Peak memory usage (for query): 5.61 GiB.

12 rows in set. Elapsed: 11.916 sec. Processed 1.00 billion rows, 32.00 GB (83.92 million rows/s., 2.69 GB/s.)


SELECT
    app,
    platform,
    sum(clicks)
FROM cluster('distr-groupby', view(
    SELECT
        app,
        platform,
        sum(clicks) as clicks
    FROM
    (
        SELECT
            argMax(app, created_at) AS app,
            argMax(platform, created_at) AS platform,
            user_id,
            argMax(clicks, created_at) AS clicks
        FROM sessions_2
        GROUP BY user_id
    )
    GROUP BY
        app,
        platform
))
GROUP BY
    app,
    platform;

[chi-distr-groupby-distr-groupby-0-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-1-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.
[chi-distr-groupby-distr-groupby-2-0-0] MemoryTracker: Current memory usage (for query): 5.05 GiB.

MemoryTracker: Peak memory usage (for query): 5.55 GiB.

12 rows in set. Elapsed: 9.491 sec. Processed 1.00 billion rows, 32.00 GB (105.36 million rows/s., 3.37 GB/s.)

Query with bigger state:


SELECT
    app,
    platform,
    sum(last_click) as sum,
    max(max_clicks) as max,
    min(min_clicks) as min,
    max(max_time) as max_time,
    min(min_time) as min_time
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS last_click,
        max(clicks) AS max_clicks,
        min(clicks) AS min_clicks,
        max(created_at) AS max_time,
        min(created_at) AS min_time
    FROM sessions_distributed
    GROUP BY user_id
)
GROUP BY
    app,
    platform;

MemoryTracker: Peak memory usage (for query): 19.95 GiB.
12 rows in set. Elapsed: 34.339 sec. Processed 1.00 billion rows, 32.00 GB (29.12 million rows/s., 932.03 MB/s.)

SELECT
    app,
    platform,
    sum(last_click) as sum,
    max(max_clicks) as max,
    min(min_clicks) as min,
    max(max_time) as max_time,
    min(min_time) as min_time
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS last_click,
        max(clicks) AS max_clicks,
        min(clicks) AS min_clicks,
        max(created_at) AS max_time,
        min(created_at) AS min_time
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform;


MemoryTracker: Peak memory usage (for query): 10.09 GiB.

12 rows in set. Elapsed: 13.220 sec. Processed 1.00 billion rows, 32.00 GB (75.64 million rows/s., 2.42 GB/s.)

SELECT
    app,
    platform,
    sum(last_click) AS sum,
    max(max_clicks) AS max,
    min(min_clicks) AS min,
    max(max_time) AS max_time,
    min(min_time) AS min_time
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS last_click,
        max(clicks) AS max_clicks,
        min(clicks) AS min_clicks,
        max(created_at) AS max_time,
        min(created_at) AS min_time
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS optimize_distributed_group_by_sharding_key = 1;

MemoryTracker: Peak memory usage (for query): 10.09 GiB.

12 rows in set. Elapsed: 13.361 sec. Processed 1.00 billion rows, 32.00 GB (74.85 million rows/s., 2.40 GB/s.)

SELECT
    app,
    platform,
    sum(last_click) AS sum,
    max(max_clicks) AS max,
    min(min_clicks) AS min,
    max(max_time) AS max_time,
    min(min_time) AS min_time
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        argMax(clicks, created_at) AS last_click,
        max(clicks) AS max_clicks,
        min(clicks) AS min_clicks,
        max(created_at) AS max_time,
        min(created_at) AS min_time
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS distributed_group_by_no_merge=2;

MemoryTracker: Peak memory usage (for query): 10.02 GiB.

12 rows in set. Elapsed: 9.789 sec. Processed 1.00 billion rows, 32.00 GB (102.15 million rows/s., 3.27 GB/s.)

SELECT
    app,
    platform,
    sum(sum),
    max(max),
    min(min),
    max(max_time) AS max_time,
    min(min_time) AS min_time
FROM cluster('distr-groupby', view(
    SELECT
        app,
        platform,
        sum(last_click) AS sum,
        max(max_clicks) AS max,
        min(min_clicks) AS min,
        max(max_time) AS max_time,
        min(min_time) AS min_time
    FROM
    (
        SELECT
            argMax(app, created_at) AS app,
            argMax(platform, created_at) AS platform,
            user_id,
            argMax(clicks, created_at) AS last_click,
            max(clicks) AS max_clicks,
            min(clicks) AS min_clicks,
            max(created_at) AS max_time,
            min(created_at) AS min_time
        FROM sessions_2
        GROUP BY user_id
    )
    GROUP BY
        app,
        platform
))
GROUP BY
    app,
    platform;

MemoryTracker: Peak memory usage (for query): 10.09 GiB.

12 rows in set. Elapsed: 9.525 sec. Processed 1.00 billion rows, 32.00 GB (104.98 million rows/s., 3.36 GB/s.)

SELECT
    app,
    platform,
    sum(sessions)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        uniq(session_id) as sessions
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform

MemoryTracker: Peak memory usage (for query): 14.57 GiB.
12 rows in set. Elapsed: 37.730 sec. Processed 1.00 billion rows, 44.01 GB (26.50 million rows/s., 1.17 GB/s.)


SELECT
    app,
    platform,
    sum(sessions)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        uniq(session_id) as sessions
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS optimize_distributed_group_by_sharding_key = 1;

MemoryTracker: Peak memory usage (for query): 14.56 GiB.

12 rows in set. Elapsed: 37.792 sec. Processed 1.00 billion rows, 44.01 GB (26.46 million rows/s., 1.16 GB/s.)

SELECT
    app,
    platform,
    sum(sessions)
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        uniq(session_id) as sessions
    FROM sessions_distributed_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform
SETTINGS distributed_group_by_no_merge = 2;

MemoryTracker: Peak memory usage (for query): 14.54 GiB.
12 rows in set. Elapsed: 17.762 sec. Processed 1.00 billion rows, 44.01 GB (56.30 million rows/s., 2.48 GB/s.)

SELECT
    app,
    platform,
    sum(sessions)
FROM cluster('distr-groupby', view(
SELECT
    app,
    platform,
    sum(sessions) as sessions
FROM
(
    SELECT
        argMax(app, created_at) AS app,
        argMax(platform, created_at) AS platform,
        user_id,
        uniq(session_id) as sessions
    FROM sessions_2
    GROUP BY user_id
)
GROUP BY
    app,
    platform))
GROUP BY
    app,
    platform   

MemoryTracker: Peak memory usage (for query): 14.55 GiB.

12 rows in set. Elapsed: 17.574 sec. Processed 1.00 billion rows, 44.01 GB (56.90 million rows/s., 2.50 GB/s.)

Reduce number of threads

Because each thread uses an independent hash table, lowering the number of threads also reduces the number of hash tables and thus memory usage, at the cost of slower query execution.


SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    user_id,
    platform
FORMAT `Null`


MemoryTracker: Peak memory usage (for query): 28.19 GiB.

Elapsed: 7.375 sec. Processed 1.00 billion rows, 27.00 GB (135.60 million rows/s., 3.66 GB/s.)

SET max_threads = 2;

SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    user_id,
    platform
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 13.26 GiB.

Elapsed: 62.014 sec. Processed 1.00 billion rows, 27.00 GB (16.13 million rows/s., 435.41 MB/s.)

UNION ALL


SELECT
    user_id,
    sum(clicks)
FROM sessions
GROUP BY
    app,
    user_id
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 24.19 GiB.

Elapsed: 5.043 sec. Processed 1.00 billion rows, 27.00 GB (198.29 million rows/s., 5.35 GB/s.)


SELECT
    user_id,
    sum(clicks)
FROM sessions WHERE app = 'Orange'
GROUP BY
    user_id
UNION ALL
SELECT
    user_id,
    sum(clicks)
FROM sessions WHERE app = 'Red'
GROUP BY
    user_id
UNION ALL
SELECT
    user_id,
    sum(clicks)
FROM sessions WHERE app = 'Melon'
GROUP BY
    user_id
UNION ALL
SELECT
    user_id,
    sum(clicks)
FROM sessions WHERE app = 'Blue'
GROUP BY
    user_id
FORMAT Null

MemoryTracker: Peak memory usage (for query): 7.99 GiB.

Elapsed: 2.852 sec. Processed 1.00 billion rows, 27.01 GB (350.74 million rows/s., 9.47 GB/s.)

aggregation_in_order

SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE app = 'Orange'
GROUP BY user_id
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 969.33 MiB.

Elapsed: 2.494 sec. Processed 250.09 million rows, 6.75 GB (100.27 million rows/s., 2.71 GB/s.)



SET optimize_aggregation_in_order = 1;

SELECT
    user_id,
    sum(clicks)
FROM sessions
WHERE app = 'Orange'
GROUP BY
    app,
    user_id
FORMAT `Null`

AggregatingInOrderTransform: Aggregating in order

MemoryTracker: Peak memory usage (for query): 169.24 MiB.

Elapsed: 4.925 sec. Processed 250.09 million rows, 6.75 GB (50.78 million rows/s., 1.37 GB/s.)

Reduce dimensions from GROUP BY with functions like sumMap, *Resample

One

SELECT
    user_id,
    toDate(created_at) AS day,
    sum(clicks)
FROM sessions
WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange', 'Red', 'Blue'))
GROUP BY
    user_id,
    day
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 50.74 GiB.

Elapsed: 22.671 sec. Processed 594.39 million rows, 18.46 GB (26.22 million rows/s., 814.41 MB/s.)


SELECT
    user_id,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks
FROM
(
    SELECT
        user_id,
        sumResample(0, 31, 1)(clicks, toDate(created_at) - toDate('2021-10-01')) AS clicks_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange', 'Red', 'Blue'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks,
    arrayEnumerate(clicks_arr) AS date_diff
FORMAT `Null`

Peak memory usage (for query): 8.24 GiB.

Elapsed: 5.191 sec. Processed 594.39 million rows, 18.46 GB (114.50 million rows/s., 3.56 GB/s.)

Multiple


SELECT
    user_id,
    platform,
    toDate(created_at) AS day,
    sum(clicks)
FROM sessions
WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange')) AND user_id ='UID: 08525196'
GROUP BY
    user_id,
    platform,
    day
ORDER BY user_id,
    day,
    platform
FORMAT `Null`

Peak memory usage (for query): 29.50 GiB.

Elapsed: 8.181 sec. Processed 198.14 million rows, 6.34 GB (24.22 million rows/s., 775.14 MB/s.)

WITH arrayJoin(arrayZip(clicks_arr_lvl_2, range(3))) AS clicks_res
SELECT
    user_id,
    CAST(clicks_res.2 + 1, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)') AS platform,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks_res.1 AS clicks
FROM
(
    SELECT
        user_id,
        sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01')) AS clicks_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks_arr_lvl_2,
    range(31) AS date_diff
FORMAT `Null`

Peak memory usage (for query): 9.92 GiB.

Elapsed: 4.170 sec. Processed 198.14 million rows, 6.34 GB (47.52 million rows/s., 1.52 GB/s.)


WITH arrayJoin(arrayZip(clicks_arr_lvl_2, range(3))) AS clicks_res
SELECT
    user_id,
    CAST(clicks_res.2 + 1, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)') AS platform,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks_res.1 AS clicks
FROM
(
    SELECT
        user_id,
        sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01')) AS clicks_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks_arr_lvl_2,
    range(31) AS date_diff
WHERE clicks > 0
FORMAT `Null`

Peak memory usage (for query): 10.14 GiB.

Elapsed: 9.533 sec. Processed 198.14 million rows, 6.34 GB (20.78 million rows/s., 665.20 MB/s.)

SELECT
    user_id,
    CAST(plat + 1, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)') AS platform,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks
FROM
(
    WITH
        (SELECT flatten(arrayMap(x -> range(3) AS platforms, range(31) as days))) AS platform_arr,
        (SELECT flatten(arrayMap(x -> [x, x, x], range(31) as days))) AS days_arr
    SELECT
        user_id,
        flatten(sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01'))) AS clicks_arr,
        platform_arr,
        days_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks,
    platform_arr AS plat,
    days_arr AS date_diff
FORMAT `Null`

Peak memory usage (for query): 9.95 GiB.

Elapsed: 3.095 sec. Processed 198.14 million rows, 6.34 GB (64.02 million rows/s., 2.05 GB/s.)

SELECT
    user_id,
    CAST(plat + 1, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)') AS platform,
    (toDate('2021-10-01') + date_diff) - 1 AS day,
    clicks
FROM
(
    WITH
        (SELECT flatten(arrayMap(x -> range(3) AS platforms, range(31) as days))) AS platform_arr,
        (SELECT flatten(arrayMap(x -> [x, x, x], range(31) as days))) AS days_arr
    SELECT
        user_id,
        sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01')) AS clicks_arr,
        arrayFilter(x -> ((x.1) > 0), arrayZip(flatten(clicks_arr), platform_arr, days_arr)) AS result
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    result.1 AS clicks,
    result.2 AS plat,
    result.3 AS date_diff
FORMAT `Null`

Peak memory usage (for query): 9.93 GiB.

Elapsed: 4.717 sec. Processed 198.14 million rows, 6.34 GB (42.00 million rows/s., 1.34 GB/s.)

SELECT
    user_id,
    CAST(range % 3, 'Enum8(\'Rat\' = 0, \'Mice\' = 1, \'Bat\' = 2)') AS platform,
    toDate('2021-10-01') + intDiv(range, 3) AS day,
    clicks
FROM
(
    WITH (
            SELECT range(93)
        ) AS range_arr
    SELECT
        user_id,
        sumResample(0, 93, 1)(clicks, ((toDate(created_at) - toDate('2021-10-01')) * 3) + toUInt8(CAST(platform, 'Enum8(\'Rat\' = 0, \'Mice\' = 1, \'Bat\' = 2)'))) AS clicks_arr,
        range_arr
    FROM sessions
    WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
    GROUP BY user_id
)
ARRAY JOIN
    clicks_arr AS clicks,
    range_arr AS range
FORMAT `Null`

Peak memory usage (for query): 8.24 GiB.

Elapsed: 4.838 sec. Processed 198.14 million rows, 6.36 GB (40.95 million rows/s., 1.31 GB/s.)

SELECT
    user_id,
    sumResampleResample(1, 4, 1, 0, 31, 1)(clicks, CAST(platform, 'Enum8(\'Rat\' = 1, \'Mice\' = 2, \'Bat\' = 3)'), toDate(created_at) - toDate('2021-10-01')) AS clicks_arr
FROM sessions
WHERE (created_at >= toDate('2021-10-01')) AND (created_at < toDate('2021-11-01')) AND (app IN ('Orange'))
GROUP BY user_id
FORMAT `Null`

Peak memory usage (for query): 5.19 GiB.

0 rows in set. Elapsed: 1.160 sec. Processed 198.14 million rows, 6.34 GB (170.87 million rows/s., 5.47 GB/s.) 

ARRAY JOIN can be expensive

https://kb.altinity.com/altinity-kb-functions/array-like-memory-usage/

sumMap, *Resample

https://kb.altinity.com/altinity-kb-functions/resample-vs-if-vs-map-vs-subquery/

Play with two-level

Disable:

SET group_by_two_level_threshold = 0, group_by_two_level_threshold_bytes = 0; 

Since version 22.4 ClickHouse can predict when it makes sense to initialize the aggregation as two-level from the start, instead of rehashing on the fly. This can improve query time. https://github.com/ClickHouse/ClickHouse/pull/33439

GROUP BY in external memory

Slow!

Use hash function for GROUP BY keys

GROUP BY cityHash64('xxxx')

This can lead to incorrect results, as hash functions are not a 1-to-1 mapping.
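
A sketch of the trade-off using the sessions table from the tricks above: grouping by the hash shrinks the key to 8 bytes, but any hash collision silently merges different users into one group.

SELECT
    cityHash64(user_id) AS user_key,  -- 8-byte key instead of a 14-byte string
    sum(clicks)
FROM sessions
GROUP BY user_key
FORMAT `Null`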

Performance bugs

https://github.com/ClickHouse/ClickHouse/issues/15005

https://github.com/ClickHouse/ClickHouse/issues/29131

https://github.com/ClickHouse/ClickHouse/issues/31120

https://github.com/ClickHouse/ClickHouse/issues/35096 Fixed in 22.7

2 - DateTime64

DateTime64 data type

Subtract fractional seconds

WITH toDateTime64('2021-09-07 13:41:50.926', 3) AS time
SELECT
    time - 1,
    time - 0.1 AS no_affect,
    time - toDecimal64(0.1, 3) AS uncorrect_result,
    time - toIntervalMillisecond(100) AS correct_result -- from 22.4

Query id: 696722bd-3c22-4270-babe-c6b124fee97f

┌──────────minus(time, 1)─┬───────────────no_affect─┬────────uncorrect_result─┬──────────correct_result─┐
 2021-09-07 13:41:49.926  2021-09-07 13:41:50.926  1970-01-01 00:00:00.000  2021-09-07 13:41:50.826 
└─────────────────────────┴─────────────────────────┴─────────────────────────┴─────────────────────────┘


WITH
    toDateTime64('2021-03-03 09:30:00.100', 3) AS time,
    fromUnixTimestamp64Milli(toInt64(toUnixTimestamp64Milli(time) + (1.25 * 1000))) AS first,
    toDateTime64(toDecimal64(time, 3) + toDecimal64('1.25', 3), 3) AS second,
    reinterpret(reinterpret(time, 'Decimal64(3)') + toDecimal64('1.25', 3), 'DateTime64(3)') AS third,
    time + toIntervalMillisecond(1250) AS fourth, -- from 22.4
    addMilliseconds(time, 1250) AS fifth          -- from 22.4
SELECT
    first,
    second,
    third,
    fourth,
    fifth

Query id: 176cd2e7-68bf-4e26-a492-63e0b5a87cc5

┌───────────────────first─┬──────────────────second─┬───────────────────third─┬──────────────────fourth─┬───────────────────fifth─┐
 2021-03-03 09:30:01.350  2021-03-03 09:30:01.350  2021-03-03 09:30:01.350  2021-03-03 09:30:01.350  2021-03-03 09:30:01.350 
└─────────────────────────┴─────────────────────────┴─────────────────────────┴─────────────────────────┴─────────────────────────┘

SET max_threads=1;

Starting from 22.4

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    time + toIntervalMillisecond(1250) AS fourth
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(fourth)

1 rows in set. Elapsed: 0.215 sec. Processed 100.03 million rows, 800.21 MB (464.27 million rows/s., 3.71 GB/s.)

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    addMilliseconds(time, 1250) AS fifth
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(fifth)

1 rows in set. Elapsed: 0.208 sec. Processed 100.03 million rows, 800.21 MB (481.04 million rows/s., 3.85 GB/s.)


WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    fromUnixTimestamp64Milli(reinterpretAsInt64(toUnixTimestamp64Milli(time) + (1.25 * 1000))) AS first
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(first)

1 rows in set. Elapsed: 0.370 sec. Processed 100.03 million rows, 800.21 MB (270.31 million rows/s., 2.16 GB/s.)

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    fromUnixTimestamp64Milli(toUnixTimestamp64Milli(time) + toInt16(1.25 * 1000)) AS first
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(first)

1 rows in set. Elapsed: 0.256 sec. Processed 100.03 million rows, 800.21 MB (391.06 million rows/s., 3.13 GB/s.)


WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    toDateTime64(toDecimal64(time, 3) + toDecimal64('1.25', 3), 3) AS second
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(second)

1 rows in set. Elapsed: 2.240 sec. Processed 100.03 million rows, 800.21 MB (44.65 million rows/s., 357.17 MB/s.)

SET decimal_check_overflow=0;

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    toDateTime64(toDecimal64(time, 3) + toDecimal64('1.25', 3), 3) AS second
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(second)

1 rows in set. Elapsed: 1.991 sec. Processed 100.03 million rows, 800.21 MB (50.23 million rows/s., 401.81 MB/s.)


WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    reinterpret(reinterpret(time, 'Decimal64(3)') + toDecimal64('1.25', 3), 'DateTime64(3)') AS third
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(third)

1 rows in set. Elapsed: 0.515 sec. Processed 100.03 million rows, 800.21 MB (194.39 million rows/s., 1.56 GB/s.)

SET decimal_check_overflow=0;

WITH
    materialize(toDateTime64('2021-03-03 09:30:00.100', 3)) AS time,
    reinterpret(reinterpret(time, 'Decimal64(3)') + toDecimal64('1.25', 3), 'DateTime64(3)') AS third
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(third)

1 rows in set. Elapsed: 0.281 sec. Processed 100.03 million rows, 800.21 MB (356.21 million rows/s., 2.85 GB/s.)

3 - DISTINCT & GROUP BY & LIMIT 1 BY: what's the difference


DISTINCT


SELECT DISTINCT number
FROM numbers_mt(100000000)
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 4.00 GiB.

0 rows in set. Elapsed: 18.720 sec. Processed 100.03 million rows, 800.21 MB (5.34 million rows/s., 42.75 MB/s.)

SELECT DISTINCT number
FROM numbers_mt(100000000)
SETTINGS max_threads = 1
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 4.00 GiB.

0 rows in set. Elapsed: 18.349 sec. Processed 100.03 million rows, 800.21 MB (5.45 million rows/s., 43.61 MB/s.)

SELECT DISTINCT number
FROM numbers_mt(100000000)
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 21.56 MiB.

0 rows in set. Elapsed: 0.014 sec. Processed 589.54 thousand rows, 4.72 MB (43.08 million rows/s., 344.61 MB/s.)



SELECT DISTINCT number % 1000
FROM numbers_mt(1000000000)
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 1.80 MiB.

0 rows in set. Elapsed: 0.005 sec. Processed 589.54 thousand rows, 4.72 MB (127.23 million rows/s., 1.02 GB/s.)

SELECT DISTINCT number % 1000
FROM numbers(1000000000)
LIMIT 1001
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 847.05 KiB.

0 rows in set. Elapsed: 0.448 sec. Processed 1.00 billion rows, 8.00 GB (2.23 billion rows/s., 17.88 GB/s.)

  • The final DISTINCT step is single-threaded
  • Streams the result set

GROUP BY


SELECT number
FROM numbers_mt(100000000)
GROUP BY number
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 4.04 GiB.

0 rows in set. Elapsed: 8.212 sec. Processed 100.00 million rows, 800.00 MB (12.18 million rows/s., 97.42 MB/s.)

SELECT number
FROM numbers_mt(100000000)
GROUP BY number
SETTINGS max_threads = 1
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 6.00 GiB.

0 rows in set. Elapsed: 19.206 sec. Processed 100.03 million rows, 800.21 MB (5.21 million rows/s., 41.66 MB/s.)

SELECT number
FROM numbers_mt(100000000)
GROUP BY number
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 4.05 GiB.

0 rows in set. Elapsed: 4.852 sec. Processed 100.00 million rows, 800.00 MB (20.61 million rows/s., 164.88 MB/s.)

This query is faster than the first one, because ClickHouse doesn't need to merge states for all keys, only for the first 1000 (based on the LIMIT).


SELECT number % 1000 AS key
FROM numbers_mt(1000000000)
GROUP BY key
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 3.15 MiB.

0 rows in set. Elapsed: 0.770 sec. Processed 1.00 billion rows, 8.00 GB (1.30 billion rows/s., 10.40 GB/s.)

SELECT number % 1000 AS key
FROM numbers_mt(1000000000)
GROUP BY key
LIMIT 1001
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 3.77 MiB.

0 rows in set. Elapsed: 0.770 sec. Processed 1.00 billion rows, 8.00 GB (1.30 billion rows/s., 10.40 GB/s.)

  • Multi-threaded
  • Returns the result only after completion of the aggregation

LIMIT BY

SELECT number
FROM numbers_mt(100000000)
LIMIT 1 BY number
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 6.00 GiB.

0 rows in set. Elapsed: 39.541 sec. Processed 100.00 million rows, 800.00 MB (2.53 million rows/s., 20.23 MB/s.)

SELECT number
FROM numbers_mt(100000000)
LIMIT 1 BY number
SETTINGS max_threads = 1
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 6.01 GiB.

0 rows in set. Elapsed: 36.773 sec. Processed 100.03 million rows, 800.21 MB (2.72 million rows/s., 21.76 MB/s.)

SELECT number
FROM numbers_mt(100000000)
LIMIT 1 BY number
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 10.56 MiB.

0 rows in set. Elapsed: 0.019 sec. Processed 589.54 thousand rows, 4.72 MB (30.52 million rows/s., 244.20 MB/s.)



SELECT number % 1000 AS key
FROM numbers_mt(1000000000)
LIMIT 1 BY key
LIMIT 1000
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 5.14 MiB.

0 rows in set. Elapsed: 0.008 sec. Processed 589.54 thousand rows, 4.72 MB (71.27 million rows/s., 570.16 MB/s.)

SELECT number % 1000 AS key
FROM numbers_mt(1000000000)
LIMIT 1 BY key
LIMIT 1001
FORMAT `Null`

MemoryTracker: Peak memory usage (for query): 3.23 MiB.

0 rows in set. Elapsed: 36.027 sec. Processed 1.00 billion rows, 8.00 GB (27.76 million rows/s., 222.06 MB/s.)

  • Single-threaded
  • Streams the result set
  • Can return an arbitrary number of rows per key

4 - Imprecise parsing of literal Decimal or Float64

Imprecise parsing of literal Decimal or Float64

Decimal

SELECT
    9.2::Decimal64(2) AS postgresql_cast,
    toDecimal64(9.2, 2) AS to_function,
    CAST(9.2, 'Decimal64(2)') AS cast_float_literal,
    CAST('9.2', 'Decimal64(2)') AS cast_string_literal

┌─postgresql_cast─┬─to_function─┬─cast_float_literal─┬─cast_string_literal─┐
             9.2         9.19                9.19                  9.2 
└─────────────────┴─────────────┴────────────────────┴─────────────────────┘

When we try to cast 64.32 to Decimal128(2), the resulting value is 64.31.

When the parser sees a number with a decimal separator, it interprets it as a Float64 literal (64.32 has no accurate Float64 representation; you actually get something like 64.319999999999999999), and later that Float is cast to Decimal by cutting off the extra precision.

The workaround is very simple: wrap the number in quotes (then the query parser treats it as a string literal and converts it to Decimal directly), or use PostgreSQL-like casting syntax:

select cast(64.32,'Decimal128(2)') a, cast('64.32','Decimal128(2)') b, 64.32::Decimal128(2) c;

┌─────a─┬─────b─┬─────c─┐
 64.31  64.32  64.32 
└───────┴───────┴───────┘

Float64

SELECT
    toFloat64(15008753.) AS to_func,
    toFloat64('1.5008753E7') AS to_func_scientific,
    CAST('1.5008753E7', 'Float64') AS cast_scientific

┌──to_func─┬─to_func_scientific─┬────cast_scientific─┐
 15008753  15008753.000000002  15008753.000000002 
└──────────┴────────────────────┴────────────────────┘

5 - Multiple aligned date columns in PARTITION BY expression

How to put multiple correlated date-like columns in the partition key without generating a lot of partitions when they do not match each other exactly.

This is an alternative to doing the same with a minmax skip index.

CREATE TABLE part_key_multiple_dates
(
    `key` UInt32,
    `date` Date,
    `time` DateTime,
    `created_at` DateTime,
    `inserted_at` DateTime
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(date), ignore(created_at, inserted_at))
ORDER BY (key, time);


INSERT INTO part_key_multiple_dates SELECT
    number,
    toDate(x),
    now() + intDiv(number, 10) AS x,
    x - (rand() % 100),
    x + (rand() % 100)
FROM numbers(100000000);

SELECT count()
FROM part_key_multiple_dates
WHERE date > (now() + toIntervalDay(105));

┌─count()─┐
 8434210 
└─────────┘

1 rows in set. Elapsed: 0.022 sec. Processed 11.03 million rows, 22.05 MB (501.94 million rows/s., 1.00 GB/s.)

SELECT count()
FROM part_key_multiple_dates
WHERE inserted_at > (now() + toIntervalDay(105));

┌─count()─┐
 9279818 
└─────────┘

1 rows in set. Elapsed: 0.046 sec. Processed 11.03 million rows, 44.10 MB (237.64 million rows/s., 950.57 MB/s.)

SELECT count()
FROM part_key_multiple_dates
WHERE created_at > (now() + toIntervalDay(105));

┌─count()─┐
 9279139 
└─────────┘

1 rows in set. Elapsed: 0.043 sec. Processed 11.03 million rows, 44.10 MB (258.22 million rows/s., 1.03 GB/s.)

6 - Why is simple `SELECT count()` Slow in ClickHouse?

Why is simple SELECT count() Slow in ClickHouse?

ClickHouse is a columnar database that provides excellent performance for analytical queries. However, in some cases, a simple count query can be slow. In this article, we’ll explore the reasons why this can happen and how to optimize the query.

Three Strategies for Counting Rows in ClickHouse

There are three ways to count rows in a table in ClickHouse:

  1. optimize_trivial_count_query: This strategy extracts the number of rows from the table metadata. It’s the fastest and most efficient way to count rows, but it only works for simple count queries.

  2. allow_experimental_projection_optimization: This strategy uses a virtual projection called _minmax_count_projection to count rows. It’s faster than scanning the table but slower than the trivial count query.

  3. Scanning the smallest column in the table and reading rows from that. This is the slowest strategy and is only used when the other two strategies can’t be used.
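
To see the difference between the first and the last strategy, you can disable the trivial count optimization and compare timings (a sketch; events stands for any MergeTree table):

-- answered from table metadata, returns almost instantly
SELECT count() FROM events;

-- forces scanning the smallest column of the table
SELECT count() FROM events SETTINGS optimize_trivial_count_query = 0;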

Why Does ClickHouse Sometimes Choose the Slowest Counting Strategy?

In some cases, ClickHouse may choose the slowest counting strategy even when there are faster options available. Here are some possible reasons why this can happen:

  1. Row policies are used on the table: If row policies are used, ClickHouse needs to filter rows to give the proper count. You can check if row policies are used by selecting from system.row_policies.

  2. Experimental light-weight delete feature was used on the table: If the experimental light-weight delete feature was used, ClickHouse may use the slowest counting strategy. You can check this by looking into parts_columns for the column named _row_exists. To do this, run the following query:

SELECT DISTINCT database, table FROM system.parts_columns WHERE column = '_row_exists';

You can also refer to this issue on GitHub for more information: https://github.com/ClickHouse/ClickHouse/issues/47930.
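
The row-policy check mentioned in the first item is just as simple:

SELECT * FROM system.row_policies;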

  3. SELECT FINAL or the final=1 setting is used.

  4. max_parallel_replicas > 1 is used.

  5. Sampling is used.

  6. Some other features like allow_experimental_query_deduplication or empty_result_for_aggregation_by_empty_set are used.

7 - Collecting query execution flamegraphs using system.trace_log

Collecting query execution flamegraph using trace_log

Collecting query execution flamegraph using system.trace_log

ClickHouse has embedded functionality to analyze the details of query performance.

It’s system.trace_log table.

By default it collects information only about queries that run longer than 1 second (and collects stack traces every second).

You can adjust that per query using settings query_profiler_real_time_period_ns & query_profiler_cpu_time_period_ns.

Both work very similarly (at the desired interval they dump the stack traces of all the threads which execute the query). The real timer allows you to ‘see’ the situations when the CPU was not doing much work but time was spent, for example, on IO. The CPU timer allows you to see the ‘hot’ points in the calculations more accurately (IO time is skipped).

Collecting stack traces with a frequency higher than a few kHz is usually not possible.

To check where most of the RAM is used you can collect stacktraces during memory allocations / deallocation, by using the setting memory_profiler_sample_probability.

clickhouse-speedscope

# install 
wget https://github.com/laplab/clickhouse-speedscope/archive/refs/heads/master.tar.gz -O clickhouse-speedscope.tar.gz
tar -xvzf clickhouse-speedscope.tar.gz
cd clickhouse-speedscope-master/
pip3 install -r requirements.txt

For debugging particular query:

clickhouse-client 

SET query_profiler_cpu_time_period_ns=1000000; -- 1000 times per 'cpu' sec
-- or SET query_profiler_real_time_period_ns=2000000; -- 500 times per 'real' sec.
-- or SET memory_profiler_sample_probability=0.1; -- to debug the memory allocations

SELECT ... <your select>

SYSTEM FLUSH LOGS;

-- get the query_id from the clickhouse-client output or from system.query_log (also pay attention on query_id vs initial_query_id for distributed queries).

Now let’s process that:

python3 main.py &  # start the proxy in background
python3 main.py --query-id 908952ee-71a8-48a4-84d5-f4db92d45a5d # process the stacktraces
fg # get the proxy from background 
Ctrl + C  # stop it.

To access ClickHouse with other username / password etc. - see the sources of https://github.com/laplab/clickhouse-speedscope/blob/master/main.py

clickhouse-flamegraph

Installation & usage instructions: https://github.com/Slach/clickhouse-flamegraph

pure flamegraph.pl examples

git clone https://github.com/brendangregg/FlameGraph /opt/flamegraph

clickhouse-client -q "SELECT  arrayStringConcat(arrayReverse(arrayMap(x -> concat( addressToLine(x), '#', demangle(addressToSymbol(x)) ), trace)), ';') AS stack, count() AS samples FROM system.trace_log WHERE event_time >= subtractMinutes(now(),10) GROUP BY trace FORMAT TabSeparated" | /opt/flamegraph/flamegraph.pl > flamegraph.svg

clickhouse-client -q "SELECT  arrayStringConcat((arrayMap(x -> concat(splitByChar('/', addressToLine(x))[-1], '#', demangle(addressToSymbol(x)) ), trace)), ';') AS stack, sum(abs(size)) AS samples FROM system.trace_log where trace_type = 'Memory' and event_date = today() group by trace order by samples desc FORMAT TabSeparated" | /opt/flamegraph/flamegraph.pl > allocs.svg
clickhouse-client -q "SELECT  arrayStringConcat(arrayReverse(arrayMap(x -> concat(splitByChar('/', addressToLine(x))[-1], '#', demangle(addressToSymbol(x)) ), trace)), ';') AS stack, count() AS samples FROM system.trace_log where trace_type = 'Memory' group by trace FORMAT TabSeparated SETTINGS allow_introspection_functions=1" | /opt/flamegraph/flamegraph.pl > ~/mem1.svg

similar using perf

apt-get update -y 
apt-get install -y linux-tools-common linux-tools-generic linux-tools-`uname -r` git
apt-get install -y clickhouse-common-static-dbg clickhouse-common-dbg
mkdir -p /opt/flamegraph
git clone https://github.com/brendangregg/FlameGraph /opt/flamegraph

perf record -F 99 -p $(pidof clickhouse) -G
perf script > /tmp/out.perf
/opt/flamegraph/stackcollapse-perf.pl /tmp/out.perf | /opt/flamegraph/flamegraph.pl > /tmp/flamegraph.svg

also

https://kb.altinity.com/altinity-kb-queries-and-syntax/troubleshooting/#flamegraph

https://github.com/samber/grafana-flamegraph-panel/pull/2

8 - Using array functions to mimic window-function-like behavior

Using array functions to mimic window-function-like behavior.

Using array functions to mimic window-function-like behavior

There are some use cases when you may want to mimic window functions using arrays: as an optimization step, to control memory better / use on-disk spilling, or simply because you are on an old ClickHouse version.

Running difference sample

DROP TABLE IF EXISTS test_running_difference

CREATE TABLE test_running_difference
ENGINE = Log AS
SELECT 
    number % 20 AS id, 
    toDateTime('2010-01-01 00:00:00') + (intDiv(number, 20) * 15) AS ts, 
    (number * round(xxHash32(number % 20) / 1000000)) - round(rand() / 1000000) AS val
FROM numbers(100)


SELECT * FROM test_running_difference
┌─id─┬──────────────────ts─┬────val─┐
│  0 │ 2010-01-01 00:00:00 │  -1209 │
│  1 │ 2010-01-01 00:00:00 │     43 │
│  2 │ 2010-01-01 00:00:00 │   4322 │
│  3 │ 2010-01-01 00:00:00 │    -25 │
│  4 │ 2010-01-01 00:00:00 │  13720 │
│  5 │ 2010-01-01 00:00:00 │    903 │
│  6 │ 2010-01-01 00:00:00 │  18062 │
│  7 │ 2010-01-01 00:00:00 │  -2873 │
│  8 │ 2010-01-01 00:00:00 │   6286 │
│  9 │ 2010-01-01 00:00:00 │  13399 │
│ 10 │ 2010-01-01 00:00:00 │  18320 │
│ 11 │ 2010-01-01 00:00:00 │  11731 │
│ 12 │ 2010-01-01 00:00:00 │    857 │
│ 13 │ 2010-01-01 00:00:00 │   8752 │
│ 14 │ 2010-01-01 00:00:00 │  23060 │
│ 15 │ 2010-01-01 00:00:00 │  41902 │
│ 16 │ 2010-01-01 00:00:00 │  39406 │
│ 17 │ 2010-01-01 00:00:00 │  50010 │
│ 18 │ 2010-01-01 00:00:00 │  57673 │
│ 19 │ 2010-01-01 00:00:00 │  51389 │
│  0 │ 2010-01-01 00:00:15 │  66839 │
│  1 │ 2010-01-01 00:00:15 │  19440 │
│  2 │ 2010-01-01 00:00:15 │  74513 │
│  3 │ 2010-01-01 00:00:15 │  10542 │
│  4 │ 2010-01-01 00:00:15 │  94245 │
│  5 │ 2010-01-01 00:00:15 │   8230 │
│  6 │ 2010-01-01 00:00:15 │  87823 │
│  7 │ 2010-01-01 00:00:15 │   -128 │
│  8 │ 2010-01-01 00:00:15 │  30101 │
│  9 │ 2010-01-01 00:00:15 │  54321 │
│ 10 │ 2010-01-01 00:00:15 │  64078 │
│ 11 │ 2010-01-01 00:00:15 │  31886 │
│ 12 │ 2010-01-01 00:00:15 │   8749 │
│ 13 │ 2010-01-01 00:00:15 │  28982 │
│ 14 │ 2010-01-01 00:00:15 │  61299 │
│ 15 │ 2010-01-01 00:00:15 │  95867 │
│ 16 │ 2010-01-01 00:00:15 │  93667 │
│ 17 │ 2010-01-01 00:00:15 │ 114072 │
│ 18 │ 2010-01-01 00:00:15 │ 124279 │
│ 19 │ 2010-01-01 00:00:15 │ 109605 │
│  0 │ 2010-01-01 00:00:30 │ 135082 │
│  1 │ 2010-01-01 00:00:30 │  37345 │
│  2 │ 2010-01-01 00:00:30 │ 148744 │
│  3 │ 2010-01-01 00:00:30 │  21607 │
│  4 │ 2010-01-01 00:00:30 │ 171744 │
│  5 │ 2010-01-01 00:00:30 │  14736 │
│  6 │ 2010-01-01 00:00:30 │ 155349 │
│  7 │ 2010-01-01 00:00:30 │  -3901 │
│  8 │ 2010-01-01 00:00:30 │  54303 │
│  9 │ 2010-01-01 00:00:30 │  89629 │
│ 10 │ 2010-01-01 00:00:30 │ 106595 │
│ 11 │ 2010-01-01 00:00:30 │  54545 │
│ 12 │ 2010-01-01 00:00:30 │  18903 │
│ 13 │ 2010-01-01 00:00:30 │  48023 │
│ 14 │ 2010-01-01 00:00:30 │  97930 │
│ 15 │ 2010-01-01 00:00:30 │ 152165 │
│ 16 │ 2010-01-01 00:00:30 │ 146130 │
│ 17 │ 2010-01-01 00:00:30 │ 174854 │
│ 18 │ 2010-01-01 00:00:30 │ 189194 │
│ 19 │ 2010-01-01 00:00:30 │ 170134 │
│  0 │ 2010-01-01 00:00:45 │ 207471 │
│  1 │ 2010-01-01 00:00:45 │  54323 │
│  2 │ 2010-01-01 00:00:45 │ 217984 │
│  3 │ 2010-01-01 00:00:45 │  31835 │
│  4 │ 2010-01-01 00:00:45 │ 252709 │
│  5 │ 2010-01-01 00:00:45 │  21493 │
│  6 │ 2010-01-01 00:00:45 │ 221271 │
│  7 │ 2010-01-01 00:00:45 │   -488 │
│  8 │ 2010-01-01 00:00:45 │  76827 │
│  9 │ 2010-01-01 00:00:45 │ 131066 │
│ 10 │ 2010-01-01 00:00:45 │ 149087 │
│ 11 │ 2010-01-01 00:00:45 │  71934 │
│ 12 │ 2010-01-01 00:00:45 │  25125 │
│ 13 │ 2010-01-01 00:00:45 │  65274 │
│ 14 │ 2010-01-01 00:00:45 │ 135980 │
│ 15 │ 2010-01-01 00:00:45 │ 210910 │
│ 16 │ 2010-01-01 00:00:45 │ 200007 │
│ 17 │ 2010-01-01 00:00:45 │ 235872 │
│ 18 │ 2010-01-01 00:00:45 │ 256112 │
│ 19 │ 2010-01-01 00:00:45 │ 229371 │
│  0 │ 2010-01-01 00:01:00 │ 275331 │
│  1 │ 2010-01-01 00:01:00 │  72668 │
│  2 │ 2010-01-01 00:01:00 │ 290366 │
│  3 │ 2010-01-01 00:01:00 │  46074 │
│  4 │ 2010-01-01 00:01:00 │ 329207 │
│  5 │ 2010-01-01 00:01:00 │  26770 │
│  6 │ 2010-01-01 00:01:00 │ 287619 │
│  7 │ 2010-01-01 00:01:00 │  -2207 │
│  8 │ 2010-01-01 00:01:00 │ 100456 │
│  9 │ 2010-01-01 00:01:00 │ 165688 │
│ 10 │ 2010-01-01 00:01:00 │ 194136 │
│ 11 │ 2010-01-01 00:01:00 │  94113 │
│ 12 │ 2010-01-01 00:01:00 │  35810 │
│ 13 │ 2010-01-01 00:01:00 │  85081 │
│ 14 │ 2010-01-01 00:01:00 │ 170256 │
│ 15 │ 2010-01-01 00:01:00 │ 265445 │
│ 16 │ 2010-01-01 00:01:00 │ 254828 │
│ 17 │ 2010-01-01 00:01:00 │ 297238 │
│ 18 │ 2010-01-01 00:01:00 │ 323494 │
│ 19 │ 2010-01-01 00:01:00 │ 286252 │
└────┴─────────────────────┴────────┘

100 rows in set. Elapsed: 0.003 sec. 

runningDifference works only within a block, requires ordered data, and is problematic when the group changes

select id, val, runningDifference(val) from (select * from test_running_difference order by id, ts);
┌─id─┬────val─┬─runningDifference(val)─┐
│  0 │  -1209 │                      0 │
│  0 │  66839 │                  68048 │
│  0 │ 135082 │                  68243 │
│  0 │ 207471 │                  72389 │
│  0 │ 275331 │                  67860 │
│  1 │     43 │                -275288 │
│  1 │  19440 │                  19397 │
│  1 │  37345 │                  17905 │
│  1 │  54323 │                  16978 │
│  1 │  72668 │                  18345 │
│  2 │   4322 │                 -68346 │
│  2 │  74513 │                  70191 │
│  2 │ 148744 │                  74231 │
│  2 │ 217984 │                  69240 │
│  2 │ 290366 │                  72382 │
│  3 │    -25 │                -290391 │
│  3 │  10542 │                  10567 │
│  3 │  21607 │                  11065 │
│  3 │  31835 │                  10228 │
│  3 │  46074 │                  14239 │
│  4 │  13720 │                 -32354 │
│  4 │  94245 │                  80525 │
│  4 │ 171744 │                  77499 │
│  4 │ 252709 │                  80965 │
│  4 │ 329207 │                  76498 │
│  5 │    903 │                -328304 │
│  5 │   8230 │                   7327 │
│  5 │  14736 │                   6506 │
│  5 │  21493 │                   6757 │
│  5 │  26770 │                   5277 │
│  6 │  18062 │                  -8708 │
│  6 │  87823 │                  69761 │
│  6 │ 155349 │                  67526 │
│  6 │ 221271 │                  65922 │
│  6 │ 287619 │                  66348 │
│  7 │  -2873 │                -290492 │
│  7 │   -128 │                   2745 │
│  7 │  -3901 │                  -3773 │
│  7 │   -488 │                   3413 │
│  7 │  -2207 │                  -1719 │
│  8 │   6286 │                   8493 │
│  8 │  30101 │                  23815 │
│  8 │  54303 │                  24202 │
│  8 │  76827 │                  22524 │
│  8 │ 100456 │                  23629 │
│  9 │  13399 │                 -87057 │
│  9 │  54321 │                  40922 │
│  9 │  89629 │                  35308 │
│  9 │ 131066 │                  41437 │
│  9 │ 165688 │                  34622 │
│ 10 │  18320 │                -147368 │
│ 10 │  64078 │                  45758 │
│ 10 │ 106595 │                  42517 │
│ 10 │ 149087 │                  42492 │
│ 10 │ 194136 │                  45049 │
│ 11 │  11731 │                -182405 │
│ 11 │  31886 │                  20155 │
│ 11 │  54545 │                  22659 │
│ 11 │  71934 │                  17389 │
│ 11 │  94113 │                  22179 │
│ 12 │    857 │                 -93256 │
│ 12 │   8749 │                   7892 │
│ 12 │  18903 │                  10154 │
│ 12 │  25125 │                   6222 │
│ 12 │  35810 │                  10685 │
│ 13 │   8752 │                 -27058 │
│ 13 │  28982 │                  20230 │
│ 13 │  48023 │                  19041 │
│ 13 │  65274 │                  17251 │
│ 13 │  85081 │                  19807 │
│ 14 │  23060 │                 -62021 │
│ 14 │  61299 │                  38239 │
│ 14 │  97930 │                  36631 │
│ 14 │ 135980 │                  38050 │
│ 14 │ 170256 │                  34276 │
│ 15 │  41902 │                -128354 │
│ 15 │  95867 │                  53965 │
│ 15 │ 152165 │                  56298 │
│ 15 │ 210910 │                  58745 │
│ 15 │ 265445 │                  54535 │
│ 16 │  39406 │                -226039 │
│ 16 │  93667 │                  54261 │
│ 16 │ 146130 │                  52463 │
│ 16 │ 200007 │                  53877 │
│ 16 │ 254828 │                  54821 │
│ 17 │  50010 │                -204818 │
│ 17 │ 114072 │                  64062 │
│ 17 │ 174854 │                  60782 │
│ 17 │ 235872 │                  61018 │
│ 17 │ 297238 │                  61366 │
│ 18 │  57673 │                -239565 │
│ 18 │ 124279 │                  66606 │
│ 18 │ 189194 │                  64915 │
│ 18 │ 256112 │                  66918 │
│ 18 │ 323494 │                  67382 │
│ 19 │  51389 │                -272105 │
│ 19 │ 109605 │                  58216 │
│ 19 │ 170134 │                  60529 │
│ 19 │ 229371 │                  59237 │
│ 19 │ 286252 │                  56881 │
└────┴────────┴────────────────────────┘

100 rows in set. Elapsed: 0.005 sec. 
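
If you still want to use runningDifference, one common workaround for the group-change artifact is to detect the change of id inside the block and zero the difference out. A minimal sketch, assuming the same test_running_difference table (the block-size limitation still applies):

SELECT
    id,
    val,
    -- runningDifference(id) is non-zero exactly on the row where id changes,
    -- so the misleading cross-group difference is masked with 0
    if(runningDifference(id) = 0, runningDifference(val), 0) AS diff
FROM
(
    SELECT *
    FROM test_running_difference
    ORDER BY id, ts
)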

Arrays!

1. Group & collect the data into an array

You can collect several columns by building an array of tuples:

SELECT 
    id, 
    groupArray(tuple(ts, val))
FROM test_running_difference
GROUP BY id

┌─id─┬─groupArray(tuple(ts, val))──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  0 │ [('2010-01-01 00:00:00',-1209),('2010-01-01 00:00:15',66839),('2010-01-01 00:00:30',135082),('2010-01-01 00:00:45',207471),('2010-01-01 00:01:00',275331)]  │
│  1 │ [('2010-01-01 00:00:00',43),('2010-01-01 00:00:15',19440),('2010-01-01 00:00:30',37345),('2010-01-01 00:00:45',54323),('2010-01-01 00:01:00',72668)]        │
│  2 │ [('2010-01-01 00:00:00',4322),('2010-01-01 00:00:15',74513),('2010-01-01 00:00:30',148744),('2010-01-01 00:00:45',217984),('2010-01-01 00:01:00',290366)]   │
│  3 │ [('2010-01-01 00:00:00',-25),('2010-01-01 00:00:15',10542),('2010-01-01 00:00:30',21607),('2010-01-01 00:00:45',31835),('2010-01-01 00:01:00',46074)]       │
│  4 │ [('2010-01-01 00:00:00',13720),('2010-01-01 00:00:15',94245),('2010-01-01 00:00:30',171744),('2010-01-01 00:00:45',252709),('2010-01-01 00:01:00',329207)]  │
│  5 │ [('2010-01-01 00:00:00',903),('2010-01-01 00:00:15',8230),('2010-01-01 00:00:30',14736),('2010-01-01 00:00:45',21493),('2010-01-01 00:01:00',26770)]        │
│  6 │ [('2010-01-01 00:00:00',18062),('2010-01-01 00:00:15',87823),('2010-01-01 00:00:30',155349),('2010-01-01 00:00:45',221271),('2010-01-01 00:01:00',287619)]  │
│  7 │ [('2010-01-01 00:00:00',-2873),('2010-01-01 00:00:15',-128),('2010-01-01 00:00:30',-3901),('2010-01-01 00:00:45',-488),('2010-01-01 00:01:00',-2207)]       │
│  8 │ [('2010-01-01 00:00:00',6286),('2010-01-01 00:00:15',30101),('2010-01-01 00:00:30',54303),('2010-01-01 00:00:45',76827),('2010-01-01 00:01:00',100456)]     │
│  9 │ [('2010-01-01 00:00:00',13399),('2010-01-01 00:00:15',54321),('2010-01-01 00:00:30',89629),('2010-01-01 00:00:45',131066),('2010-01-01 00:01:00',165688)]   │
│ 10 │ [('2010-01-01 00:00:00',18320),('2010-01-01 00:00:15',64078),('2010-01-01 00:00:30',106595),('2010-01-01 00:00:45',149087),('2010-01-01 00:01:00',194136)]  │
│ 11 │ [('2010-01-01 00:00:00',11731),('2010-01-01 00:00:15',31886),('2010-01-01 00:00:30',54545),('2010-01-01 00:00:45',71934),('2010-01-01 00:01:00',94113)]     │
│ 12 │ [('2010-01-01 00:00:00',857),('2010-01-01 00:00:15',8749),('2010-01-01 00:00:30',18903),('2010-01-01 00:00:45',25125),('2010-01-01 00:01:00',35810)]        │
│ 13 │ [('2010-01-01 00:00:00',8752),('2010-01-01 00:00:15',28982),('2010-01-01 00:00:30',48023),('2010-01-01 00:00:45',65274),('2010-01-01 00:01:00',85081)]      │
│ 14 │ [('2010-01-01 00:00:00',23060),('2010-01-01 00:00:15',61299),('2010-01-01 00:00:30',97930),('2010-01-01 00:00:45',135980),('2010-01-01 00:01:00',170256)]   │
│ 15 │ [('2010-01-01 00:00:00',41902),('2010-01-01 00:00:15',95867),('2010-01-01 00:00:30',152165),('2010-01-01 00:00:45',210910),('2010-01-01 00:01:00',265445)]  │
│ 16 │ [('2010-01-01 00:00:00',39406),('2010-01-01 00:00:15',93667),('2010-01-01 00:00:30',146130),('2010-01-01 00:00:45',200007),('2010-01-01 00:01:00',254828)]  │
│ 17 │ [('2010-01-01 00:00:00',50010),('2010-01-01 00:00:15',114072),('2010-01-01 00:00:30',174854),('2010-01-01 00:00:45',235872),('2010-01-01 00:01:00',297238)] │
│ 18 │ [('2010-01-01 00:00:00',57673),('2010-01-01 00:00:15',124279),('2010-01-01 00:00:30',189194),('2010-01-01 00:00:45',256112),('2010-01-01 00:01:00',323494)] │
│ 19 │ [('2010-01-01 00:00:00',51389),('2010-01-01 00:00:15',109605),('2010-01-01 00:00:30',170134),('2010-01-01 00:00:45',229371),('2010-01-01 00:01:00',286252)] │
└────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

2. Do the needed ordering in each array

For example, by the second element of the tuple:

SELECT 
    id, 
    arraySort(x -> (x.2), groupArray((ts, val)))
FROM test_running_difference
GROUP BY id

┌─id─┬─arraySort(lambda(tuple(x), tupleElement(x, 2)), groupArray(tuple(ts, val)))─────────────────────────────────────────────────────────────────────────────────┐
│  0 │ [('2010-01-01 00:00:00',-1209),('2010-01-01 00:00:15',66839),('2010-01-01 00:00:30',135082),('2010-01-01 00:00:45',207471),('2010-01-01 00:01:00',275331)]  │
│  1 │ [('2010-01-01 00:00:00',43),('2010-01-01 00:00:15',19440),('2010-01-01 00:00:30',37345),('2010-01-01 00:00:45',54323),('2010-01-01 00:01:00',72668)]        │
│  2 │ [('2010-01-01 00:00:00',4322),('2010-01-01 00:00:15',74513),('2010-01-01 00:00:30',148744),('2010-01-01 00:00:45',217984),('2010-01-01 00:01:00',290366)]   │
│  3 │ [('2010-01-01 00:00:00',-25),('2010-01-01 00:00:15',10542),('2010-01-01 00:00:30',21607),('2010-01-01 00:00:45',31835),('2010-01-01 00:01:00',46074)]       │
│  4 │ [('2010-01-01 00:00:00',13720),('2010-01-01 00:00:15',94245),('2010-01-01 00:00:30',171744),('2010-01-01 00:00:45',252709),('2010-01-01 00:01:00',329207)]  │
│  5 │ [('2010-01-01 00:00:00',903),('2010-01-01 00:00:15',8230),('2010-01-01 00:00:30',14736),('2010-01-01 00:00:45',21493),('2010-01-01 00:01:00',26770)]        │
│  6 │ [('2010-01-01 00:00:00',18062),('2010-01-01 00:00:15',87823),('2010-01-01 00:00:30',155349),('2010-01-01 00:00:45',221271),('2010-01-01 00:01:00',287619)]  │
│  7 │ [('2010-01-01 00:00:30',-3901),('2010-01-01 00:00:00',-2873),('2010-01-01 00:01:00',-2207),('2010-01-01 00:00:45',-488),('2010-01-01 00:00:15',-128)]       │
│  8 │ [('2010-01-01 00:00:00',6286),('2010-01-01 00:00:15',30101),('2010-01-01 00:00:30',54303),('2010-01-01 00:00:45',76827),('2010-01-01 00:01:00',100456)]     │
│  9 │ [('2010-01-01 00:00:00',13399),('2010-01-01 00:00:15',54321),('2010-01-01 00:00:30',89629),('2010-01-01 00:00:45',131066),('2010-01-01 00:01:00',165688)]   │
│ 10 │ [('2010-01-01 00:00:00',18320),('2010-01-01 00:00:15',64078),('2010-01-01 00:00:30',106595),('2010-01-01 00:00:45',149087),('2010-01-01 00:01:00',194136)]  │
│ 11 │ [('2010-01-01 00:00:00',11731),('2010-01-01 00:00:15',31886),('2010-01-01 00:00:30',54545),('2010-01-01 00:00:45',71934),('2010-01-01 00:01:00',94113)]     │
│ 12 │ [('2010-01-01 00:00:00',857),('2010-01-01 00:00:15',8749),('2010-01-01 00:00:30',18903),('2010-01-01 00:00:45',25125),('2010-01-01 00:01:00',35810)]        │
│ 13 │ [('2010-01-01 00:00:00',8752),('2010-01-01 00:00:15',28982),('2010-01-01 00:00:30',48023),('2010-01-01 00:00:45',65274),('2010-01-01 00:01:00',85081)]      │
│ 14 │ [('2010-01-01 00:00:00',23060),('2010-01-01 00:00:15',61299),('2010-01-01 00:00:30',97930),('2010-01-01 00:00:45',135980),('2010-01-01 00:01:00',170256)]   │
│ 15 │ [('2010-01-01 00:00:00',41902),('2010-01-01 00:00:15',95867),('2010-01-01 00:00:30',152165),('2010-01-01 00:00:45',210910),('2010-01-01 00:01:00',265445)]  │
│ 16 │ [('2010-01-01 00:00:00',39406),('2010-01-01 00:00:15',93667),('2010-01-01 00:00:30',146130),('2010-01-01 00:00:45',200007),('2010-01-01 00:01:00',254828)]  │
│ 17 │ [('2010-01-01 00:00:00',50010),('2010-01-01 00:00:15',114072),('2010-01-01 00:00:30',174854),('2010-01-01 00:00:45',235872),('2010-01-01 00:01:00',297238)] │
│ 18 │ [('2010-01-01 00:00:00',57673),('2010-01-01 00:00:15',124279),('2010-01-01 00:00:30',189194),('2010-01-01 00:00:45',256112),('2010-01-01 00:01:00',323494)] │
│ 19 │ [('2010-01-01 00:00:00',51389),('2010-01-01 00:00:15',109605),('2010-01-01 00:00:30',170134),('2010-01-01 00:00:45',229371),('2010-01-01 00:01:00',286252)] │
└────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

20 rows in set. Elapsed: 0.004 sec. 

That can be rewritten like this:

WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows
SELECT 
    id, 
    sorted_window_rows
FROM test_running_difference
GROUP BY id

3. Apply the needed logic with arrayMap / arrayDifference, etc.

WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as sorted_window_rows_val_column,
    arrayDifference(sorted_window_rows_val_column) as sorted_window_rows_val_column_diff
SELECT 
    id, 
    sorted_window_rows_val_column_diff
FROM test_running_difference
GROUP BY id
┌─id─┬─sorted_window_rows_val_column_diff─┐
│  0 │ [0,68048,68243,72389,67860]        │
│  1 │ [0,19397,17905,16978,18345]        │
│  2 │ [0,70191,74231,69240,72382]        │
│  3 │ [0,10567,11065,10228,14239]        │
│  4 │ [0,80525,77499,80965,76498]        │
│  5 │ [0,7327,6506,6757,5277]            │
│  6 │ [0,69761,67526,65922,66348]        │
│  7 │ [0,2745,-3773,3413,-1719]          │
│  8 │ [0,23815,24202,22524,23629]        │
│  9 │ [0,40922,35308,41437,34622]        │
│ 10 │ [0,45758,42517,42492,45049]        │
│ 11 │ [0,20155,22659,17389,22179]        │
│ 12 │ [0,7892,10154,6222,10685]          │
│ 13 │ [0,20230,19041,17251,19807]        │
│ 14 │ [0,38239,36631,38050,34276]        │
│ 15 │ [0,53965,56298,58745,54535]        │
│ 16 │ [0,54261,52463,53877,54821]        │
│ 17 │ [0,64062,60782,61018,61366]        │
│ 18 │ [0,66606,64915,66918,67382]        │
│ 19 │ [0,58216,60529,59237,56881]        │
└────┴────────────────────────────────────┘

20 rows in set. Elapsed: 0.005 sec. 

You can also do a lot of magic with arrayEnumerate, accessing different values by their indexes (see the sketch below).
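
For instance, a minimal sketch that recomputes the per-group difference manually via arrayEnumerate instead of arrayDifference (same test_running_difference table as above):

WITH
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as vals
SELECT
    id,
    -- vals[i - 1] is the previous element; the first element has no predecessor,
    -- so 0 is emitted for it, mimicking arrayDifference
    arrayMap(i -> if(i = 1, 0, vals[i] - vals[i - 1]), arrayEnumerate(vals)) AS manual_diff
FROM test_running_difference
GROUP BY id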

4. Now you can return your arrays back to rows

Use arrayJoin

WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as sorted_window_rows_val_column,
    arrayDifference(sorted_window_rows_val_column) as sorted_window_rows_val_column_diff,
    arrayJoin(sorted_window_rows_val_column_diff) as diff
SELECT 
    id, 
    diff
FROM test_running_difference
GROUP BY id

or ARRAY JOIN

SELECT 
  id,
  diff,
  ts
FROM 
(
WITH 
    groupArray(tuple(ts, val)) as window_rows,
    arraySort(x -> x.1, window_rows) as sorted_window_rows,
    arrayMap(x -> x.2, sorted_window_rows) as sorted_window_rows_val_column
SELECT 
    id, 
    arrayDifference(sorted_window_rows_val_column) as sorted_window_rows_val_column_diff,
    arrayMap(x -> x.1, sorted_window_rows) as sorted_window_rows_ts_column
FROM test_running_difference
GROUP BY id
) as t1
ARRAY JOIN sorted_window_rows_val_column_diff as diff, sorted_window_rows_ts_column as ts

etc.

9 - -State & -Merge combinators

-State & -Merge combinators

The -State combinator doesn’t actually store information about the -If combinator, so aggregate functions with and without -If have the same serialized data.

$ clickhouse-local --query "SELECT maxIfState(number,number % 2) as x, maxState(number) as y FROM numbers(10) FORMAT RowBinary" | clickhouse-local --input-format RowBinary --structure="x AggregateFunction(max,UInt64), y AggregateFunction(max,UInt64)" --query "SELECT maxMerge(x), maxMerge(y) FROM table"
9       9
$ clickhouse-local --query "SELECT maxIfState(number,number % 2) as x, maxState(number) as y FROM numbers(11) FORMAT RowBinary" | clickhouse-local --input-format RowBinary --structure="x AggregateFunction(max,UInt64), y AggregateFunction(max,UInt64)" --query "SELECT maxMerge(x), maxMerge(y) FROM table"
9       10

The -State combinator produces the same serialized data footprint regardless of the parameters used in the definition of the aggregate function. That’s true for quantile* and sequenceMatch/sequenceCount functions.

$ clickhouse-local --query "SELECT quantilesTDigestIfState(0.1,0.9)(number,number % 2) FROM  numbers(1000000) FORMAT RowBinary" | clickhouse-local --input-format RowBinary --structure="x AggregateFunction(quantileTDigestWeighted(0.5),UInt64,UInt8)" --query "SELECT quantileTDigestWeightedMerge(0.4)(x) FROM table"
400000

$ clickhouse-local --query "SELECT quantilesTDigestIfState(0.1,0.9)(number,number % 2) FROM  numbers(1000000) FORMAT RowBinary" | clickhouse-local --input-format RowBinary --structure="x AggregateFunction(quantilesTDigestWeighted(0.5),UInt64,UInt8)" --query "SELECT quantilesTDigestWeightedMerge(0.4,0.8)(x) FROM table"
[400000,800000]

SELECT quantileMerge(0.9)(x)
FROM
(
    SELECT quantileState(0.1)(number) AS x
    FROM numbers(1000)
)

┌─quantileMerge(0.9)(x)─┐
│                 899.1 │
└───────────────────────┘
SELECT
    sequenceMatchMerge('(?2)(?3)')(x) AS `2_3`,
    sequenceMatchMerge('(?1)(?4)')(x) AS `1_4`,
    sequenceMatchMerge('(?1)(?2)(?3)')(x) AS `1_2_3`
FROM
(
    SELECT sequenceMatchState('(?1)(?2)(?3)')(number, number = 8, number = 5, number = 6, number = 9) AS x
    FROM numbers(10)
)

┌─2_3─┬─1_4─┬─1_2_3─┐
│   1 │   1 │     0 │
└─────┴─────┴───────┘

SELECT
    sequenceCountMerge('(?1)(?2)')(x) AS `2_3`,
    sequenceCountMerge('(?1)(?4)')(x) AS `1_4`,
    sequenceCountMerge('(?1)(?2)(?3)')(x) AS `1_2_3`
FROM
(
    WITH number % 4 AS cond
    SELECT sequenceCountState('(?1)(?2)(?3)')(number, cond = 1, cond = 2, cond = 3, cond = 5) AS x
    FROM numbers(11)
)

┌─2_3─┬─1_4─┬─1_2_3─┐
│   3 │   0 │     2 │
└─────┴─────┴───────┘
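
A typical place where -State and -Merge meet in practice is an AggregatingMergeTree table that stores intermediate states. A minimal sketch (visits_raw, visits_daily and their columns are made-up names for illustration):

-- visits_raw(ts DateTime, user_id UInt64) is a hypothetical source table
CREATE TABLE visits_daily
(
    `day` Date,
    `uniq_users_state` AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY day;

-- write intermediate states with -State
INSERT INTO visits_daily
SELECT
    toDate(ts) AS day,
    uniqState(user_id) AS uniq_users_state
FROM visits_raw
GROUP BY day;

-- read them back with -Merge
SELECT
    day,
    uniqMerge(uniq_users_state) AS uniq_users
FROM visits_daily
GROUP BY day;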

10 - ALTER MODIFY COLUMN is stuck, the column is inaccessible.

ALTER MODIFY COLUMN is stuck, the column is inaccessible.

Problem

You have a table:

CREATE TABLE modify_column(column_n String) ENGINE=MergeTree() ORDER BY tuple();

Populate it with data:

INSERT INTO modify_column VALUES ('key_a');
INSERT INTO modify_column VALUES ('key_b');
INSERT INTO modify_column VALUES ('key_c');

You tried to apply an ALTER TABLE query changing the column type:

ALTER TABLE modify_column MODIFY COLUMN column_n Enum8('key_a'=1, 'key_b'=2);

But it didn’t succeed, and you see an error in the system.mutations table:

SELECT *
FROM system.mutations
WHERE (table = 'modify_column') AND (is_done = 0)
FORMAT Vertical

Row 1:
──────
database:                   default
table:                      modify_column
mutation_id:                mutation_4.txt
command:                    MODIFY COLUMN `column_n` Enum8('key_a' = 1, 'key_b' = 2)
create_time:                2021-03-03 18:38:09
block_numbers.partition_id: ['']
block_numbers.number:       [4]
parts_to_do_names:          ['all_3_3_0']
parts_to_do:                1
is_done:                    0
latest_failed_part:         all_3_3_0
latest_fail_time:           2021-03-03 18:38:59
latest_fail_reason:         Code: 36, e.displayText() = DB::Exception: Unknown element 'key_c' for type Enum8('key_a' = 1, 'key_b' = 2): while executing 'FUNCTION CAST(column_n :: 0, 'Enum8(\'key_a\' = 1, \'key_b\' = 2)' :: 1) -> cast(column_n, 'Enum8(\'key_a\' = 1, \'key_b\' = 2)') Enum8('key_a' = 1, 'key_b' = 2) : 2': (while reading from part /var/lib/clickhouse/data/default/modify_column/all_3_3_0/): While executing MergeTree (version 21.3.1.6041)

And you can’t query that column anymore:

SELECT column_n
FROM modify_column

┌─column_n─┐
│ key_a    │
└──────────┘
┌─column_n─┐
│ key_b    │
└──────────┘
 Progress: 2.00 rows, 2.00 B (19.48 rows/s., 19.48 B/s.)
2 rows in set. Elapsed: 0.104 sec.

Received exception from server (version 21.3.1):
Code: 36. DB::Exception: Received from localhost:9000. DB::Exception: Unknown element 'key_c' for type Enum8('key_a' = 1, 'key_b' = 2): while executing 'FUNCTION CAST(column_n :: 0, 'Enum8(\'key_a\' = 1, \'key_b\' = 2)' :: 1) -> cast(column_n, 'Enum8(\'key_a\' = 1, \'key_b\' = 2)') Enum8('key_a' = 1, 'key_b' = 2) : 2': (while reading from part /var/lib/clickhouse/data/default/modify_column/all_3_3_0/): While executing MergeTreeThread.

Solution

You should do the following:

Check which mutation is stuck and kill it:

SELECT * FROM system.mutations WHERE table = 'modify_column' AND is_done=0 FORMAT Vertical;
KILL MUTATION WHERE table = 'modify_column' AND mutation_id = 'id_of_stuck_mutation';

Apply a reverting MODIFY COLUMN query to convert the table back to the previous column type:

ALTER TABLE modify_column MODIFY COLUMN column_n String;

Check that the column is accessible now:

SELECT column_n, count() FROM modify_column GROUP BY column_n;

Run the fixed ALTER MODIFY COLUMN query:

ALTER TABLE modify_column MODIFY COLUMN column_n Enum8('key_a'=1, 'key_b'=2, 'key_c'=3);

You can monitor the progress of the column type change via the system.mutations or system.parts_columns tables:

SELECT
    command,
    parts_to_do,
    is_done
FROM system.mutations
WHERE table = 'modify_column'

SELECT
    column,
    type,
    count() AS parts,
    sum(rows) AS rows,
    sum(bytes_on_disk) AS bytes
FROM system.parts_columns
WHERE (table = 'modify_column') AND (column = 'column_n') AND active
GROUP BY
    column,
    type

11 - ANSI SQL mode

ANSI SQL mode

It’s possible to tune some settings which make ClickHouse more ANSI SQL compatible (and slower):

SET join_use_nulls=1; -- introduced long ago
SET cast_keep_nullable=1; -- introduced in 20.5
SET union_default_mode='DISTINCT'; -- introduced in 21.1
SET allow_experimental_window_functions=1; -- introduced in 21.3
SET prefer_column_name_to_alias=1; -- introduced in 21.4;
SET group_by_use_nulls=1; -- introduced in 22.7;
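
As a quick illustration of the first of these settings, a minimal sketch of how join_use_nulls changes a non-matching LEFT JOIN (the inline subqueries are made up for the example):

SET join_use_nulls = 1;

SELECT x, y
FROM (SELECT 1 AS x) AS a
LEFT JOIN (SELECT 2 AS x, 10 AS y) AS b USING (x);

-- with join_use_nulls = 0 (default) the non-matched y is filled with the type's
-- default value (0); with join_use_nulls = 1 it becomes NULL, as ANSI SQL expects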

12 - Async INSERTs

Async INSERTs

Async INSERTs is a ClickHouse feature that enables batching data automatically and transparently on the server side. Although async inserts work, they still have issues, but they have been improved in the latest versions. We recommend batching at the app/ingestor level because you will have more control and you decouple this responsibility from ClickHouse. That being said, here are some insights about async inserts you should know:

  • Async inserts give an acknowledgment immediately after the data has been inserted into the buffer (wait_for_async_insert = 0) or, by default, after the data has been written to a part after flushing from the buffer (wait_for_async_insert = 1).
  • INSERT .. SELECT is NOT async insert.
  • Async inserts will do (idempotent) retries.
  • Async inserts can collect data for some offline remote clusters: Yandex self-driving cars collected metrics during the ride into ClickHouse installed on the car computer (a Distributed table with async inserts enabled), and the data was flushed to the cluster once the car was plugged into the network.
  • Async inserts can do batching, so multiple inserts can be squashed as a single insert (but in that case, retries are not idempotent anymore).
  • Async inserts can lose your data in case of a sudden restart (no fsyncs by default).
  • Async inserted data becomes available for selects not immediately after acknowledgment.
  • Async inserts generally have more moving parts: there are background threads monitoring new data to be sent and pushing it out.
  • Async inserts require extra monitoring from different system.tables (see system.part_log, system.query_log and system.asynchronous_inserts for 22.8). Previously such queries didn’t appear in the query log. Check: #33239.
  • It is important to use wait_for_async_insert = 1, because with any error you will lose data without knowing it. For example: the table is read-only -> data loss, out of disk space -> data loss, too many parts -> data loss. (See the sketch of these settings right after this list.)
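
A minimal sketch of how these settings are typically applied on the client side (metrics_raw is a made-up table name):

-- enable server-side batching for this session
SET async_insert = 1;
-- acknowledge only after the buffered data has been flushed into a part
SET wait_for_async_insert = 1;

INSERT INTO metrics_raw (ts, value) VALUES (now(), 42);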

22.10+ bugfixes/features

  • Fixed bug which could lead to deadlock while using asynchronous inserts. See #43233.
  • Async insert dedup: Support block deduplication for asynchronous inserts. Before this change, async inserts did not support deduplication, because multiple small inserts coexisted in one inserted batch. See #38075 and #43304.
  • Added system table asynchronous_insert_log. It contains information about asynchronous inserts (including results of queries in fire-and-forget mode. (with wait_for_async_insert=0)) for better introspection. See #42040.
  • Support async inserts in clickhouse-client for queries with inlined data. See #34267.

To improve observability / introspection

It is not possible to relate the part_log.query_id column to the asynchronous_insert_log.query_id column directly; we need to go through query_log.query_id:

system.asynchronous_insert_log shows the query_id and flush_query_id of each async insert. The query_id of an async insert shows up in system.query_log with type = 'QueryStart', but the same query_id does not show up in the query_id column of system.part_log, because query_id in part_log is the identifier of the INSERT query that created the data part, and it seems it is filled for sync INSERTs but not for async inserts.
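
A hedged sketch of that introspection path, assuming 22.8+ where system.asynchronous_insert_log exists:

SELECT
    ail.query_id,
    ail.flush_query_id,
    ql.query
FROM system.asynchronous_insert_log AS ail
INNER JOIN system.query_log AS ql ON ql.query_id = ail.query_id
WHERE ql.type = 'QueryStart'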

Metrics

SELECT *
FROM system.metrics
WHERE metric LIKE '%AsyncInsert%'

Query id: 7384b8c8-3d87-4059-b1c4-e9955e97232b

┌─metric───────────────┬─value─┬─description────────────────────────────────────────────────┐
│ PendingAsyncInsert   │     0 │ Number of asynchronous inserts that are waiting for flush.  │
│ AsyncInsertCacheSize │     0 │ Number of async insert hash id in cache                     │
└──────────────────────┴───────┴────────────────────────────────────────────────────────────┘

2 rows in set. Elapsed: 0.001 sec.

13 - Atomic insert

Atomic insert

An insert is atomic if it creates only one part.

An insert will create one part if:

  • Data is inserted directly into a MergeTree table
  • Data is inserted into a single partition.
  • For INSERT FORMAT:
    • Number of rows is less than max_insert_block_size (default is 1048545)
    • Parallel formatting is disabled (for TSV, TSKV, CSV, and JSONEachRow formats the setting input_format_parallel_parsing=0 is set).
  • For INSERT SELECT:
    • Number of rows is less than max_block_size
  • Smaller blocks are properly squashed up to the configured block size (min_insert_block_size_rows and min_insert_block_size_bytes)
  • The MergeTree table doesn’t have Materialized Views (there is no atomicity Table <> MV)

https://github.com/ClickHouse/ClickHouse/issues/9195#issuecomment-587500824 https://github.com/ClickHouse/ClickHouse/issues/5148#issuecomment-487757235

Example how to make a large insert atomically

Generate test data in Native and TSV format (100 million rows)

Text formats and the Native format require a different set of settings; here I want to find / demonstrate the mandatory minimum of settings for any case.

clickhouse-client -q \
     'SELECT toInt64(number) A, toString(number) S FROM numbers(100000000) FORMAT Native' > t.native
clickhouse-client -q \
     'SELECT toInt64(number) A, toString(number) S FROM numbers(100000000) FORMAT TSV' > t.tsv

Insert with default settings (not atomic)

DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

-- Load data in Native format
clickhouse-client  -q 'INSERT INTO trg FORMAT Native' <t.native

-- Check how many parts were created
SELECT 
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
│      90 │    890935 │   1113585 │ 100000000 │
└─────────┴───────────┴───────────┴───────────┘

--- 90 parts were created - not atomic



DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

-- Load data in TSV format
clickhouse-client  -q 'INSERT INTO trg FORMAT TSV' <t.tsv

-- Check how many parts were created
SELECT 
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
│      85 │    898207 │   1449610 │ 100000000 │
└─────────┴───────────┴───────────┴───────────┘

--- 85 parts were created - not atomic

Insert with adjusted settings (atomic)

An atomic insert uses more memory because it needs to hold all 100 million rows in memory.

DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

clickhouse-client --input_format_parallel_parsing=0 \
                  --min_insert_block_size_bytes=0 \
                  --min_insert_block_size_rows=1000000000 \
                  -q 'INSERT INTO trg FORMAT Native' <t.native

-- Check that only one part is created
SELECT
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
│       1 │ 100000000 │ 100000000 │ 100000000 │
└─────────┴───────────┴───────────┴───────────┘

-- 1 part, success.



DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

-- Load data in TSV format
clickhouse-client --input_format_parallel_parsing=0 \
                  --min_insert_block_size_bytes=0 \
                  --min_insert_block_size_rows=1000000000 \
                  -q 'INSERT INTO trg FORMAT TSV' <t.tsv

-- Check that only one part is created
SELECT 
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
│       1 │ 100000000 │ 100000000 │ 100000000 │
└─────────┴───────────┴───────────┴───────────┘

-- 1 part, success.
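
The examples above cover INSERT ... FORMAT from a client. For INSERT ... SELECT the same idea applies; a hedged sketch (the block-size settings follow the example above, and max_insert_threads = 1 is added on the assumption that parallel insert threads could otherwise write several parts):

SET min_insert_block_size_bytes = 0;
SET min_insert_block_size_rows = 1000000000;
SET max_insert_threads = 1;

DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

INSERT INTO trg
SELECT toInt64(number) AS A, toString(number) AS S
FROM numbers(100000000);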

14 - Cumulative Anything

Cumulative Anything

Sample data

CREATE TABLE events
(
    `ts` DateTime,
    `user_id` UInt32
)
ENGINE = Memory;

INSERT INTO events SELECT
    toDateTime('2021-04-29 10:10:10') + toIntervalHour(7 * number) AS ts,
    toDayOfWeek(ts) + (number % 2) AS user_id
FROM numbers(15);

Using arrays

WITH
    groupArray(_ts) AS ts_arr,
    groupArray(state) AS state_arr
SELECT
    arrayJoin(ts_arr) AS ts,
    arrayReduce('uniqExactMerge', arrayFilter((x, y) -> (y <= ts), state_arr, ts_arr)) AS uniq
FROM
(
    SELECT
        toStartOfDay(ts) AS _ts,
        uniqExactState(user_id) AS state
    FROM events
    GROUP BY _ts
)
ORDER BY ts ASC

┌──────────────────ts─┬─uniq─┐
│ 2021-04-29 00:00:00 │    2 │
│ 2021-04-30 00:00:00 │    3 │
│ 2021-05-01 00:00:00 │    4 │
│ 2021-05-02 00:00:00 │    5 │
│ 2021-05-03 00:00:00 │    7 │
└─────────────────────┴──────┘

WITH arrayJoin(range(toUInt32(_ts) AS int, least(int + toUInt32((3600 * 24) * 5), toUInt32(toDateTime('2021-05-04 00:00:00'))), 3600 * 24)) AS ts_expanded
SELECT
    toDateTime(ts_expanded) AS ts,
    uniqExactMerge(state) AS uniq
FROM
(
    SELECT
        toStartOfDay(ts) AS _ts,
        uniqExactState(user_id) AS state
    FROM events
    GROUP BY _ts
)
GROUP BY ts
ORDER BY ts ASC

┌──────────────────ts─┬─uniq─┐
│ 2021-04-29 00:00:00 │    2 │
│ 2021-04-30 00:00:00 │    3 │
│ 2021-05-01 00:00:00 │    4 │
│ 2021-05-02 00:00:00 │    5 │
│ 2021-05-03 00:00:00 │    7 │
└─────────────────────┴──────┘

Using window functions (starting from ClickHouse 21.3)

SELECT
    ts,
    uniqExactMerge(state) OVER (ORDER BY ts ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS uniq
FROM
(
    SELECT
        toStartOfDay(ts) AS ts,
        uniqExactState(user_id) AS state
    FROM events
    GROUP BY ts
)
ORDER BY ts ASC

┌──────────────────ts─┬─uniq─┐
│ 2021-04-29 00:00:00 │    2 │
│ 2021-04-30 00:00:00 │    3 │
│ 2021-05-01 00:00:00 │    4 │
│ 2021-05-02 00:00:00 │    5 │
│ 2021-05-03 00:00:00 │    7 │
└─────────────────────┴──────┘

Using runningAccumulate (incorrect result over blocks)

SELECT
    ts,
    runningAccumulate(state) AS uniq
FROM
(
    SELECT
        toStartOfDay(ts) AS ts,
        uniqExactState(user_id) AS state
    FROM events
    GROUP BY ts
    ORDER BY ts ASC
)
ORDER BY ts ASC

┌──────────────────ts─┬─uniq─┐
│ 2021-04-29 00:00:00 │    2 │
│ 2021-04-30 00:00:00 │    3 │
│ 2021-05-01 00:00:00 │    4 │
│ 2021-05-02 00:00:00 │    5 │
│ 2021-05-03 00:00:00 │    7 │
└─────────────────────┴──────┘

15 - Data types on disk and in RAM

Data types on disk and in RAM
DataType: String
    RAM size (=byteSize): string byte length + 9
        (string length: 64 bit integer, zero-byte terminator: 1 byte)
    Disk Size: string length prefix (varint) + the string itself:
        string shorter than 128       - string byte length + 1
        string shorter than 16384     - string byte length + 2
        string shorter than 2097152   - string byte length + 3
        string shorter than 268435456 - string byte length + 4

DataType: AggregateFunction(count, ...)
    Disk Size: varint
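
To inspect the in-RAM side of this table you can use the byteSize function; a minimal sketch:

SELECT
    byteSize('abc')       AS string_ram,  -- 12 = 3 bytes of data + 8-byte length + 1-byte terminator
    byteSize(toUInt64(1)) AS uint64_ram   -- 8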

See also https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup41/data_processing.pdf (slide 17-22)

16 - DELETE via tombstone column

DELETE via tombstone column
CREATE TABLE test_delete
(
    `key` UInt32,
    `ts` UInt32,
    `value_a` String,
    `value_b` String,
    `value_c` String,
    `is_active` UInt8 DEFAULT 1
)
ENGINE = MergeTree
ORDER BY