How to convert uniqExact states to approximate uniq functions states
uniqExactState
uniqExactState is stored in two parts: a count of values in LEB128 format + list values without a delimiter.
Depending on the orignial datatype of the values to count, the datatype of the list values differ.
Numeric Values
In case of numeric values like UInt8, UInt64 etc. the representation of uniqExactState is just a simple array of the unique values encountered.
Therefore it is easy to recover the values from the state which have appeared:
┌─hex(uniqExactState(arrayJoin([1, 3])))─┐
│ 020103 │
└────────────────────────────────────────┘
02 01 03
^ ^ ^
LEB128 hex(1::UInt8) hex(3::UInt8)
┌─finalizeAggregation(CAST(unhex('020103'), 'AggregateFunction(groupArray, UInt8)'))─┐
│ [1,3] │
└────────────────────────────────────────────────────────────────────────────────────┘
String Values
Internal Representation
In case of values of data type String, ClickHouse® applies a hashing algorithm before storing the values into the internal array, otherwise the amount of space needed could get enormous.
┌─hex(uniqExactState(toString(arrayJoin([1]))))─┐
│ 01E2756D8F7A583CA23016E03447724DE7 │
└───────────────────────────────────────────────┘
01 E2756D8F7A583CA23016E03447724DE7
^ ^
LEB128 hash of '1'
┌─hex(uniqExactState(toString(arrayJoin([1, 2]))))───────────────────┐
│ 024809CB4528E00621CF626BE9FA14E2BFE2756D8F7A583CA23016E03447724DE7 │
└────────────────────────────────────────────────────────────────────┘
02 4809CB4528E00621CF626BE9FA14E2BF E2756D8F7A583CA23016E03447724DE7
^ ^ ^
LEB128 hash of '2' hash of '1'
So, our task is to find how we can generate such values by ourself, speak what hash function is used.
In case of String data type, it is just the simple sipHash128 function.
┌─hex(sipHash128(toString(2)))─────┬─hex(sipHash128(toString(1)))─────┐
│ 4809CB4528E00621CF626BE9FA14E2BF │ E2756D8F7A583CA23016E03447724DE7 │
└──────────────────────────────────┴──────────────────────────────────┘
Getting the Hash Values
The second task: now that we know how the state is formed, how can we demangle it and convert it into an Array of values.
Unfortunatelly it is not possible to get the original values back, as sipHash128 is a one way conversion, but at least we can try to get an Array of hashes.
Luckily for us, ClickHouse® use the exact same serialization (LEB128 + list of values) for Arrays (in this case if uniqExactState and Array are serialized into RowBinary format).
One way to “convert” the uniqExactState to an Array of hashes would be via an external helper
UDF function to do that conversion:
cat /etc/clickhouse-server/pipe_function.xml
<clickhouse>
<function>
<type>executable</type>
<execute_direct>0</execute_direct>
<name>pipe</name>
<return_type>Array(FixedString(16))</return_type>
<argument>
<type>String</type>
</argument>
<format>RowBinary</format>
<command>cat</command>
<send_chunk_header>0</send_chunk_header>
</function>
</clickhouse>
This UDF – pipe converts uniqExactState to the Array(FixedString(16)):
┌─arrayMap(x -> hex(x), pipe(uniqExactState(toString(arrayJoin([1, 2])))))──────────────┐
│ ['4809CB4528E00621CF626BE9FA14E2BF','E2756D8F7A583CA23016E03447724DE7'] │
└───────────────────────────────────────────────────────────────────────────────────────┘
This way only works if you have direct access to your ClickHouse® installation.
However if you are on a managed platform like Altinity.Cloud installing executable UDFs is typically not supported for security reasons.
Luckily we know that the internal representation of sipHash128 is FixedString(16) which has exactly 128 bit. UInt128 also takes up exactly 128 bit.
Therefore we can consider the uniqExactState(String) as a representation of Array(UInt128).
Again, we can therefore convert our state to an Array:
┌─arrayMap(lambda(tuple(x), hex(reinterpretAsFixedString(x))), finalizeAggregation(CAST(unhex(hex(uniqExactState(arrayJoin(['1', '2'])))), 'AggregateFunction(groupArray, UInt128)')))─┐
│ ['4809CB4528E00621CF626BE9FA14E2BF','E2756D8F7A583CA23016E03447724DE7'] │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
As you can see the Array is identical to the one we created with the pipe function.
Full Example of Conversion
And here is the full example, how you can convert uniqExactState(string) to any approximate uniq function like uniqState(string) or uniqCombinedState(string) by reinterpret and arrayReduce('func', [..]).
-- Generate demo with random data, uniqs are stored as heavy uniqExact
CREATE TABLE aggregates
(
`id` UInt32,
`uniqExact` AggregateFunction(uniqExact, String)
)
ENGINE = AggregatingMergeTree
ORDER BY id as
SELECT
number % 10000 AS id,
uniqExactState(toString(number))
FROM numbers(10000000)
GROUP BY id;
0 rows in set. Elapsed: 2.042 sec. Processed 10.01 million rows, 80.06 MB (4.90 million rows/s., 39.21 MB/s.)
-- Let's add a new columns to store optimized, approximate uniq & uniqCombined
ALTER TABLE aggregates
ADD COLUMN `uniq` AggregateFunction(uniq, FixedString(16)),
ADD COLUMN `uniqCombined` AggregateFunction(uniqCombined, FixedString(16));
-- Materialize values in the new columns
ALTER TABLE aggregates
UPDATE
uniqCombined = arrayReduce('uniqCombinedState', arrayMap(x -> reinterpretAsFixedString(x), finalizeAggregation(unhex(hex(uniqExact))::AggregateFunction(groupArray, UInt128)))),
uniq = arrayReduce('uniqState', arrayMap(x -> reinterpretAsFixedString(x), finalizeAggregation(unhex(hex(uniqExact))::AggregateFunction(groupArray, UInt128))))
WHERE 1
SETTINGS mutations_sync=2;
-- Check results, results are slighty different, because uniq & uniqCombined are approximate functions
SELECT
id % 20 AS key,
uniqExactMerge(uniqExact),
uniqCombinedMerge(uniqCombined),
uniqMerge(uniq)
FROM aggregates
GROUP BY key
┌─key─┬─uniqExactMerge(uniqExact)─┬─uniqCombinedMerge(uniqCombined)─┬─uniqMerge(uniq)─┐
│ 0 │ 500000 │ 500195 │ 500455 │
│ 1 │ 500000 │ 502599 │ 501549 │
│ 2 │ 500000 │ 498058 │ 504428 │
│ 3 │ 500000 │ 499748 │ 500195 │
│ 4 │ 500000 │ 500791 │ 500836 │
│ 5 │ 500000 │ 502430 │ 497558 │
│ 6 │ 500000 │ 500262 │ 501785 │
│ 7 │ 500000 │ 501514 │ 495758 │
│ 8 │ 500000 │ 500121 │ 498597 │
│ 9 │ 500000 │ 502173 │ 500455 │
│ 10 │ 500000 │ 499144 │ 498386 │
│ 11 │ 500000 │ 500525 │ 503139 │
│ 12 │ 500000 │ 503624 │ 497103 │
│ 13 │ 500000 │ 499986 │ 497992 │
│ 14 │ 500000 │ 502027 │ 494833 │
│ 15 │ 500000 │ 498831 │ 500983 │
│ 16 │ 500000 │ 501103 │ 500836 │
│ 17 │ 500000 │ 499409 │ 496791 │
│ 18 │ 500000 │ 501641 │ 502991 │
│ 19 │ 500000 │ 500648 │ 500881 │
└─────┴───────────────────────────┴─────────────────────────────────┴─────────────────┘
20 rows in set. Elapsed: 2.312 sec. Processed 10.00 thousand rows, 7.61 MB (4.33 thousand rows/s., 3.29 MB/s.)
Now, lets repeat the same insert, but in that case we will also populate uniq & uniqCombined with values converted via sipHash128 function.
If we did everything right, uniq counts will not change, because we inserted the exact same values.
INSERT INTO aggregates SELECT
number % 10000 AS id,
uniqExactState(toString(number)),
uniqState(sipHash128(toString(number))),
uniqCombinedState(sipHash128(toString(number)))
FROM numbers(10000000)
GROUP BY id;
0 rows in set. Elapsed: 5.386 sec. Processed 10.01 million rows, 80.06 MB (1.86 million rows/s., 14.86 MB/s.)
SELECT
id % 20 AS key,
uniqExactMerge(uniqExact),
uniqCombinedMerge(uniqCombined),
uniqMerge(uniq)
FROM aggregates
GROUP BY key
┌─key─┬─uniqExactMerge(uniqExact)─┬─uniqCombinedMerge(uniqCombined)─┬─uniqMerge(uniq)─┐
│ 0 │ 500000 │ 500195 │ 500455 │
│ 1 │ 500000 │ 502599 │ 501549 │
│ 2 │ 500000 │ 498058 │ 504428 │
│ 3 │ 500000 │ 499748 │ 500195 │
│ 4 │ 500000 │ 500791 │ 500836 │
│ 5 │ 500000 │ 502430 │ 497558 │
│ 6 │ 500000 │ 500262 │ 501785 │
│ 7 │ 500000 │ 501514 │ 495758 │
│ 8 │ 500000 │ 500121 │ 498597 │
│ 9 │ 500000 │ 502173 │ 500455 │
│ 10 │ 500000 │ 499144 │ 498386 │
│ 11 │ 500000 │ 500525 │ 503139 │
│ 12 │ 500000 │ 503624 │ 497103 │
│ 13 │ 500000 │ 499986 │ 497992 │
│ 14 │ 500000 │ 502027 │ 494833 │
│ 15 │ 500000 │ 498831 │ 500983 │
│ 16 │ 500000 │ 501103 │ 500836 │
│ 17 │ 500000 │ 499409 │ 496791 │
│ 18 │ 500000 │ 501641 │ 502991 │
│ 19 │ 500000 │ 500648 │ 500881 │
└─────┴───────────────────────────┴─────────────────────────────────┴─────────────────┘
20 rows in set. Elapsed: 3.318 sec. Processed 20.00 thousand rows, 11.02 MB (6.03 thousand rows/s., 3.32 MB/s.)
Let’s compare the data size, uniq won in this case, but check this article Functions to count uniqs
, mileage may vary.
optimize table aggregates final;
SELECT
column,
formatReadableSize(sum(column_data_compressed_bytes) AS size) AS compressed,
formatReadableSize(sum(column_data_uncompressed_bytes) AS usize) AS uncompressed
FROM system.parts_columns
WHERE (active = 1) AND (table LIKE 'aggregates') and column like '%uniq%'
GROUP BY column
ORDER BY size DESC;
┌─column───────┬─compressed─┬─uncompressed─┐
│ uniqExact │ 153.21 MiB │ 152.61 MiB │
│ uniqCombined │ 76.62 MiB │ 76.32 MiB │
│ uniq │ 38.33 MiB │ 38.18 MiB │
└──────────────┴────────────┴──────────────┘