SPARSE_HASHED VS HASHED
Sparse_hashed layout is supposed to save memory but has some downsides. We can test how much slower SPARSE_HASHED than HASHED is with the following:
create table orders(id UInt64, price Float64)
Engine = MergeTree() order by id;
insert into orders select number, 0 from numbers(5000000);
CREATE DICTIONARY orders_hashed (id UInt64, price Float64)
PRIMARY KEY id SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000
TABLE orders DB 'default' USER 'default'))
LIFETIME(MIN 0 MAX 0) LAYOUT(HASHED());
CREATE DICTIONARY orders_sparse (id UInt64, price Float64)
PRIMARY KEY id SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000
TABLE orders DB 'default' USER 'default'))
LIFETIME(MIN 0 MAX 0) LAYOUT(SPARSE_HASHED());
SELECT
name,
type,
status,
element_count,
formatReadableSize(bytes_allocated) AS RAM
FROM system.dictionaries
WHERE name LIKE 'orders%'
┌─name──────────┬─type─────────┬─status─┬─element_count─┬─RAM────────┐
│ orders_sparse │ SparseHashed │ LOADED │ 5000000 │ 84.29 MiB │
│ orders_hashed │ Hashed │ LOADED │ 5000000 │ 256.00 MiB │
└───────────────┴──────────────┴────────┴───────────────┴────────────┘
SELECT sum(dictGet('default.orders_hashed', 'price', toUInt64(number))) AS res
FROM numbers(10000000)
┌─res─┐
│ 0 │
└─────┘
1 rows in set. Elapsed: 0.279 sec. Processed 10.02 million rows ...
SELECT sum(dictGet('default.orders_sparse', 'price', toUInt64(number))) AS res
FROM numbers(10000000)
┌─res─┐
│ 0 │
└─────┘
1 rows in set. Elapsed: 1.085 sec. Processed 10.02 million rows ...
As you can see SPARSE_HASHED is memory efficient and use about 3 times less memory (!!!) but is almost 4 times slower. But this is the ultimate case because this test does not read data from the disk (no MergeTree table involved).
We encourage you to test SPARSE_HASHED against your real queries, because it able to save a lot of memory and have larger (in rows) external dictionaries.
Last modified 2021.08.12: General corrections and updates. (5969b6886)