Atomic insert

An insert is atomic if it creates only one part.

An insert will create one part if:

  • Data is inserted directly into a MergeTree table
  • Data is inserted into a single partition.
  • Smaller blocks are properly squashed up to the configured block size (min_insert_block_size_rows and min_insert_block_size_bytes)
  • For INSERT FORMAT:
    • Number of rows is less than max_insert_block_size (default is 1048545)
    • Parallel formatting is disabled (For TSV, TSKV, CSV, and JSONEachRow formats setting input_format_parallel_parsing=0 is set).
  • For INSERT SELECT (including all variants with table functions), data for insert should be created fully deterministically.
    • non-deterministic functions there like rand() not used in SELECT
    • Number of rows/bytes is less than min_insert_block_size_rows and min_insert_block_size_bytes
    • And one of:
      • setting max_threads to 1
      • adding ORDER BY to the table’s DDL (not ordering by tuple)
      • There is some ORDER BY inside SELECT
    • See example
  • The MergeTree table doesn’t have Materialized Views (there is no atomicity Table <> MV)

https://github.com/ClickHouse/ClickHouse/issues/9195#issuecomment-587500824 https://github.com/ClickHouse/ClickHouse/issues/5148#issuecomment-487757235

Example how to make a large insert atomically

Generate test data in Native and TSV format ( 100 millions rows )

Text formats and Native format require different set of settings, here I want to find / demonstrate mandatory minimum of settings for any case.

clickhouse-client -q \
     'SELECT toInt64(number) A, toString(number) S FROM numbers(100000000) FORMAT Native' > t.native
clickhouse-client -q \
     'SELECT toInt64(number) A, toString(number) S FROM numbers(100000000) FORMAT TSV' > t.tsv

Insert with default settings (not atomic)

DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

-- Load data in Native format
clickhouse-client  -q 'INSERT INTO trg FORMAT Native' <t.native

-- Check how many parts is created
SELECT 
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
908909351113585100000000└─────────┴───────────┴───────────┴───────────┘

--- 90 parts! was created - not atomic



DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

-- Load data in TSV format
clickhouse-client  -q 'INSERT INTO trg FORMAT TSV' <t.tsv

-- Check how many parts is created
SELECT 
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
858982071449610100000000└─────────┴───────────┴───────────┴───────────┘

--- 85 parts! was created - not atomic

Insert with adjusted settings (atomic)

Atomic insert use more memory because it needs 100 millions rows in memory.

DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

clickhouse-client --input_format_parallel_parsing=0 \
                  --min_insert_block_size_bytes=0 \
                  --min_insert_block_size_rows=1000000000 \
                  -q 'INSERT INTO trg FORMAT Native' <t.native

-- Check that only one part is created
SELECT
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
1100000000100000000100000000└─────────┴───────────┴───────────┴───────────┘

-- 1 part, success.



DROP TABLE IF EXISTS trg;
CREATE TABLE trg(A Int64, S String) Engine=MergeTree ORDER BY A;

-- Load data in TSV format
clickhouse-client --input_format_parallel_parsing=0 \
                  --min_insert_block_size_bytes=0 \
                  --min_insert_block_size_rows=1000000000 \
                  -q 'INSERT INTO trg FORMAT TSV' <t.tsv

-- Check that only one part is created
SELECT 
    count(),
    min(rows),
    max(rows),
    sum(rows)
FROM system.parts
WHERE (level = 0) AND (table = 'trg');
┌─count()─┬─min(rows)─┬─max(rows)─┬─sum(rows)─┐
1100000000100000000100000000└─────────┴───────────┴───────────┴───────────┘

-- 1 part, success.
Last modified 2024.11.19: insert select clarifications (a4b71b0)