SAMPLE BY

The execution pipeline is embedded in the partition reading code.

It works this way:

  1. ClickHouse does partition pruning based on WHERE conditions.
  2. For every partition, it picks column ranges (a.k.a. ‘marks’ / ‘granules’) based on primary key conditions.
  3. The sampling logic is applied here:
     a) For SAMPLE k (k in the 0..1 range) it adds the condition WHERE sample_key < k * max_int_of_sample_key_type.
     b) For SAMPLE k OFFSET m it adds the condition WHERE sample_key BETWEEN m * max_int_of_sample_key_type AND (m + k) * max_int_of_sample_key_type.
     c) For SAMPLE N (N > 1) it first estimates how many rows fall inside the ranges to be read and, based on that, converts it to case (a), calculating k from the number of rows in the ranges and the desired number of rows.
  4. The remaining conditions are applied to the returned data, so the number of rows can decrease further at this step.
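The arithmetic in step 3 can be sketched as follows. This is an illustration only, assuming a UInt32 sample key; the function names are hypothetical and are not ClickHouse code:

```python
# Sketch of how sampling expressions from step 3 could be derived,
# assuming the sample key is UInt32. Illustrative only.

UINT32_MAX = 2**32 - 1  # max value of the sample key's type


def sample_condition(k: float) -> str:
    """Case (a), SAMPLE k (0 < k < 1): keep keys in the first k of the domain."""
    return f"sample_key < {int(k * UINT32_MAX)}"


def sample_offset_condition(k: float, m: float) -> str:
    """Case (b), SAMPLE k OFFSET m: keep keys in the [m, m + k) slice of the domain."""
    lo = int(m * UINT32_MAX)
    hi = int((m + k) * UINT32_MAX)
    return f"sample_key BETWEEN {lo} AND {hi}"


def rows_to_ratio(n_desired: int, n_estimated_in_ranges: int) -> float:
    """Case (c), SAMPLE N (N > 1): convert a desired row count to a ratio k."""
    return min(1.0, n_desired / n_estimated_in_ranges)


print(sample_condition(0.1))
print(sample_offset_condition(0.1, 0.1))
print(rows_to_ratio(1_000_000, 20_000_000))
```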


The SAMPLE key must be:

  • Included in the primary key.
  • Uniformly distributed in the domain of its data type:
    • Bad: Timestamp;
    • Good: intHash32(UserID);
  • Cheap to calculate:
    • Bad: cityHash64(URL);
    • Good: intHash32(UserID);
  • Not after high granular fields in primary key:
    • Bad: ORDER BY (Timestamp, sample_key);
    • Good: ORDER BY (CounterID, Date, sample_key).
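
A table following these rules might be declared like this (a sketch; the table and column names are illustrative):

```sql
CREATE TABLE hits
(
    CounterID UInt32,
    Date Date,
    UserID UInt64,
    URL String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(Date)
-- the sample key comes after low-granularity columns, not after a timestamp
ORDER BY (CounterID, Date, intHash32(UserID))
SAMPLE BY intHash32(UserID);
```

intHash32(UserID) satisfies all four points: it is in the primary key, spreads uniformly over the UInt32 domain, is cheap to compute, and sits after the coarse CounterID/Date columns.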

Sampling is:

  • Deterministic.
  • Works in a consistent way for different tables.
  • Allows reading less data from disk.

SAMPLE key bonus features:

  • SAMPLE 1/10: select data for 1/10 of all possible sample keys.
  • SAMPLE 1000000: select from about (not less than) 1,000,000 rows on each shard; you can use the _sample_factor virtual column to determine the relative sample factor.
  • SAMPLE 1/10 OFFSET 1/10: select the second 1/10 of all possible sample keys.
  • SET max_parallel_replicas = 3: select from multiple replicas of each shard in parallel.
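
As a usage sketch of the row-count form combined with _sample_factor (the table name is illustrative):

```sql
-- Read roughly 1,000,000 rows per shard and scale the count back up.
-- _sample_factor is the reciprocal of the effective sampling ratio.
SELECT count() * any(_sample_factor) AS approx_total_rows
FROM hits
SAMPLE 1000000;
```

With SAMPLE N the effective ratio is chosen at query time (step 3c above), so _sample_factor is the only way to know what fraction of the data was actually read.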
Last modified 2021.09.07 : Replaced HTML code artifacts. (465040d6)