Async INSERTs is a ClickHouse feature tha enables batching data automatically and transparently on the server-side. Although async inserts work, they still have issues, but have been improved in latest versions. We recommend to batch at app/ingestor level because you will have more control and you decouple this responsibility from ClickHouse. Being said that here some insights about Async inserts you should now:
- Async inserts give acknowledgment immediately after the data got inserted into the buffer (wait_for_async_insert = 0) or by default, after the data got written to a part after flushing from buffer (wait_for_async_insert = 1).
- INSERT .. SELECT is NOT async insert.
- Async inserts will do (idempotent) retries.
- Async inserts can collect data for some offline remote clusters: Yandex self-driving cars were collecting the metrics data during the ride into ClickHouse installed on the car computer to a distributed table with Async inserts enabled, which were flushed to the cluster once the car was plugged to the network.
- Async inserts can do batching, so multiple inserts can be squashed as a single insert (but in that case, retries are not idempotent anymore).
- Async inserts can loose your data in case of sudden restart (no fsyncs by default).
- Async inserted data becomes available for selects not immediately after acknowledgment.
- Async inserts generally have more
moving partsthere are some background threads monitoring new data to be sent and pushing it out.
- Async inserts require extra monitoring from different system.tables (see
system.asynchronous_insertsfor 22.8). Previously such queries didn’t appear in the query log. Check: #33239.
- Important to use
wait_for_async_insert = 1because with any error you will loose data without knowing it. For example your table is read only -> losing data, out of disk space -> losing data, too many parts -> losing data.
- Fixed bug which could lead to deadlock while using asynchronous inserts. See #43233.
- Async insert dedup: Support block deduplication for asynchronous inserts. Before this change, async inserts did not support deduplication, because multiple small inserts coexisted in one inserted batch. See #38075 and #43304.
- Added system table
asynchronous_insert_log. It contains information about asynchronous inserts (including results of queries in fire-and-forget mode. (with wait_for_async_insert=0)) for better introspection. See #42040.
- Support async inserts in clickhouse-client for queries with inlined data. See #34267.
To improve observability / introspection
It is not possible to relate
part_log/query_id column with
asynchronous_insert_log/query_id column. We need to use
asynchronous_insert_log shows up the
flush_query_id of each async insert. The
query_id from async insert shows up in the
type = 'QueryStart' but the same
query_id does not show up in the
query_id column of the
system.part_log. Because the
query_id column in the
part_log is the identifier of the INSERT query that created a data part, and it seems it is for sync INSERTS but not for async inserts.
SELECT * FROM system.metrics WHERE metric LIKE '%AsyncInsert%' Query id: 7384b8c8-3d87-4059-b1c4-e9955e97232b ┌─metric───────────────┬─value─┬─description────────────────────────────────────────────────┐ │ PendingAsyncInsert │ 0 │ Number of asynchronous inserts that are waiting for flush. │ │ AsyncInsertCacheSize │ 0 │ Number of async insert hash id in cache │ └──────────────────────┴───────┴────────────────────────────────────────────────────────────┘ 2 rows in set. Elapsed: 0.001 sec.