How to encode/decode quantileTDigest states from/to list of centroids
A way to export or import quantileTDigest states from/into ClickHouse®
quantileTDigestState
quantileTDigestState is stored in two parts: a count of centroids in LEB128 format + list of centroids without a delimiter. Each centroid is represented as two Float32 values: Mean & Count.
SELECT
hex(quantileTDigestState(1)),
hex(toFloat32(1))
┌─hex(quantileTDigestState(1))─┬─hex(toFloat32(1))─┐
│ 010000803F0000803F │ 0000803F │
└──────────────────────────────┴───────────────────┘
01 0000803F 0000803F
^ ^ ^
LEB128 Float32 Mean Float32 Count
We need to make two helper UDF
functions:
cat /etc/clickhouse-server/decodeTDigestState_function.xml
<yandex>
<function>
<type>executable</type>
<execute_direct>0</execute_direct>
<name>decodeTDigestState</name>
<return_type>Array(Tuple(mean Float32, count Float32))</return_type>
<argument>
<type>AggregateFunction(quantileTDigest, UInt32)</type>
</argument>
<format>RowBinary</format>
<command>cat</command>
<send_chunk_header>0</send_chunk_header>
</function>
</yandex>
cat /etc/clickhouse-server/encodeTDigestState_function.xml
<yandex>
<function>
<type>executable</type>
<execute_direct>0</execute_direct>
<name>encodeTDigestState</name>
<return_type>AggregateFunction(quantileTDigest, UInt32)</return_type>
<argument>
<type>Array(Tuple(mean Float32, count Float32))</type>
</argument>
<format>RowBinary</format>
<command>cat</command>
<send_chunk_header>0</send_chunk_header>
</function>
</yandex>
Those UDF – (encode/decode)TDigestState
converts TDigestState
to the Array(Tuple(Float32, Float32))
and back.
SELECT quantileTDigest(CAST(number, 'UInt32')) AS result
FROM numbers(10)
┌─result─┐
│ 4 │
└────────┘
SELECT decodeTDigestState(quantileTDigestState(CAST(number, 'UInt32'))) AS state
FROM numbers(10)
┌─state─────────────────────────────────────────────────────────┐
│ [(0,1),(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1),(9,1)] │
└───────────────────────────────────────────────────────────────┘
SELECT finalizeAggregation(encodeTDigestState(CAST('[(0,1),(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1),(9,1)]', 'Array(Tuple(Float32, Float32))'))) AS result
┌─result─┐
│ 4 │
└────────┘