Replication and DDL queue problems
replication_queue and distributed_ddl_queue and how to troubleshoot them.
How to check replication problems
Check system.replicas first, cluster-wide. It shows whether the problem is local to a single replica or global, and lets you see the exception. It answers the following questions:
- Are there any read-only replicas?
- Is the connection to zookeeper active?
- Was there an exception during table init? (Code: 999. Coordination::Exception: Transaction failed (No node): Op #1)
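A cluster-wide query sketch for this first check (the cluster name 'all-sharded' is an assumption taken from the examples further down; adjust it to your environment):
SELECT hostName(), database, table, is_readonly, is_session_expired, zookeeper_exception
FROM clusterAllReplicas('all-sharded', system.replicas)
WHERE is_readonly OR is_session_expired OR zookeeper_exception != '';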
Check system.replication_queue:
- How many tasks are there? Are they moving? Are there very old tasks? (Check the create_time column; tasks that are 24h old are a sign of a problem.) You can use this KB article query: https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-replication-queue/ or the sketch after this list.
- Check if there are tasks with a high num_tries or num_postponed and look at postpone_reason; this is a sign of stuck tasks.
- Check the problematic parts affecting the stuck tasks. You can use the columns new_part_name or parts_to_merge.
- Check the type of the task: MUTATE_PART is a mutation task and MERGE_PARTS is a merge task. These tasks can be deleted from the replication queue, but GET_PART tasks should not be deleted.
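A hedged sketch of such a check (again assuming the 'all-sharded' cluster name; the thresholds are illustrative):
SELECT hostName(), database, table, type, new_part_name, create_time,
       num_tries, num_postponed, postpone_reason, last_exception
FROM clusterAllReplicas('all-sharded', system.replication_queue)
WHERE create_time < now() - INTERVAL 1 DAY OR num_tries > 100
ORDER BY create_time
LIMIT 50;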
Check system.errors.
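For example, a minimal sketch:
SELECT name, code, value, last_error_time, last_error_message
FROM system.errors
ORDER BY last_error_time DESC
LIMIT 20;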
Check system.mutations:
- You can check whether the stuck tasks in the replication queue are of type MUTATE_PART, and whether those mutations are still executing, in system.mutations using the is_done column (see the sketch after this list).
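A hedged sketch of both checks on a single node (wrap with clusterAllReplicas as above if needed):
-- stuck mutation tasks in the queue
SELECT database, table, new_part_name, num_tries, postpone_reason
FROM system.replication_queue
WHERE type = 'MUTATE_PART';

-- mutations that are still executing
SELECT database, table, mutation_id, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE NOT is_done;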
Find the moment when the problem started and collect / analyze / preserve the logs from that moment. It usually happens during the first steps of a restart/crash.
Use system.part_log and system.parts to gather information about the parts related to the stuck tasks in the replication queue:
- Check if those parts exist and are active in system.parts (use the partition_id, name, and active columns to filter); a sketch follows the part_log example below.
- Extract the part history from system.part_log.
Example query for part_log:
SELECT hostName(), * FROM
cluster('all-sharded',system.part_log)
WHERE
hostName() IN ('chi-prod-live-2-0-0','chi-prod-live-2-2-0','chi-prod-live-2-1-0')
AND table = 'sessions_local'
AND database = 'analytics'
AND part_name in ('20230411_33631_33654_3')
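And a similar sketch for system.parts (the hosts, database, table, and part name are the same illustrative values as above):
SELECT hostName(), database, table, partition_id, name, active, modification_time
FROM cluster('all-sharded', system.parts)
WHERE hostName() IN ('chi-prod-live-2-0-0','chi-prod-live-2-2-0','chi-prod-live-2-1-0')
  AND database = 'analytics'
  AND table = 'sessions_local'
  AND name = '20230411_33631_33654_3';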
- If there are no errors and everything just gets slower, check the load (the usual system metrics).
Common problems & solutions
If the replication queue has no exceptions, only postpone reasons, just let ClickHouse® do its merges/mutations and it will eventually catch up and reduce the number of tasks in replication_queue. The number of concurrent merges and fetches can be tuned, but if that is done without analyzing your workload you may end up in a worse situation. If the delay in the queue keeps growing, action may be needed.
First, the simplest approach:
- Try SYSTEM RESTART REPLICA db.table (this will DETACH/ATTACH the table internally).
A stuck replication task for a partition which was already removed or has no data
This can be easily detected because the replication queue will contain exceptions that reference a part from a partition that does not exist. The most probable scenario is that the partition was dropped and some tasks were left in the queue.
Drop the partition manually once again (it should remove the task).
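A hypothetical example (database, table, and partition id are illustrative):
ALTER TABLE analytics.sessions_local DROP PARTITION ID '20230411';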
If the partition exists but the part is missing (maybe because it was superseded by a newer merged part), then you can try to DETACH/ATTACH the partition.
The query below generates the ALTER commands to do this:
WITH extract(new_part_name, '^[^_]+') AS partition_id
SELECT
    '/* count: ' || toString(count()) || ' */\n' ||
    'ALTER TABLE ' || database || '.' || table || ' DETACH PARTITION ID \'' || partition_id || '\';\n' ||
    'ALTER TABLE ' || database || '.' || table || ' ATTACH PARTITION ID \'' || partition_id || '\';\n'
FROM system.replication_queue AS rq
GROUP BY database, table, partition_id
HAVING sum(num_tries) > 1000 OR count() > 100
ORDER BY count() DESC, sum(num_tries) DESC
FORMAT TSVRaw;
Problem with a mutation stuck in the queue
This can happen when the mutation has finished but for some reason the task is not removed from the queue. It can be detected by checking the system.mutations table and seeing that the mutation is done while the task is still in the queue.
Kill the mutation (again).
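A hedged example; take the filter values from system.mutations (the ones below are hypothetical):
KILL MUTATION WHERE database = 'analytics' AND table = 'sessions_local' AND mutation_id = '0000000042';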
Replica is not starting because local set of files differs too much
- First try to increase the thresholds, or set the force_restore_data flag and restart the clickhouse server/pod: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication#recovery-after-complete-data-loss
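A sketch of setting the flag, assuming the default on-disk layout (the path may differ in containerized setups):
# create the flag file; it is consumed on the next server start
sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data
sudo systemctl restart clickhouse-server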
Replica is in read-only mode
Sometimes, due to crashes, a zookeeper split-brain problem, or other reasons, some tables can end up in read-only mode. This allows SELECTs but not INSERTs, so we need to do the DROP / RESTORE REPLICA procedure.
Just to be clear, this procedure will not delete any data; it only re-creates the metadata in zookeeper from the current state of the ClickHouse replica.
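The zookeeper_path and replica_name used below can be taken from system.replicas, for example (a sketch):
SELECT database, table, zookeeper_path, replica_name
FROM system.replicas
WHERE is_readonly;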
DETACH TABLE table_name; -- Required for DROP REPLICA
-- Use the zookeeper_path and replica_name from the above query.
SYSTEM DROP REPLICA 'replica_name' FROM ZKPATH '/table_path_in_zk'; -- It will remove everything from the /table_path_in_zk/replicas/replica_name
ATTACH TABLE table_name; -- Table will be in readonly mode, because there is no metadata in ZK yet; after that execute
SYSTEM RESTORE REPLICA table_name; -- It will detach all partitions, re-create metadata in ZK (like it's new empty table), and then attach all partitions back
SYSTEM SYNC REPLICA table_name; -- Wait for replicas to synchronize parts. Also it's recommended to check `system.detached_parts` on all replicas after recovery is finished.
In versions 23+ there is a variant of this procedure using SYSTEM DROP REPLICA 'replica_name' FROM TABLE db.table (https://clickhouse.com/docs/en/sql-reference/statements/system#drop-replica) instead of the ZKPATH variant, but you need to execute that command from a different replica than the one you want to drop, which is not always convenient. We recommend the method above because it works across versions from 21 to 24 and is more reliable.
- Procedure for many replicas, generating the DDL:
SELECT DISTINCT 'DETACH TABLE ' || database || '.' || table || ' ON CLUSTER \'data\';' FROM clusterAllReplicas('data',system.replicas) WHERE active_replicas < total_replicas FORMAT TSVRaw;
SELECT DISTINCT 'SYSTEM DROP REPLICA \'' || replica_name || '\' FROM ZKPATH \'' || zookeeper_path || '\';' FROM clusterAllReplicas('data',system.replicas) WHERE active_replicas < total_replicas FORMAT TSVRaw;
SELECT DISTINCT 'ATTACH TABLE ' || database || '.' || table || ' ON CLUSTER \'data\';' FROM clusterAllReplicas('data',system.replicas) WHERE active_replicas < total_replicas FORMAT TSVRaw;
SELECT DISTINCT 'SYSTEM RESTORE REPLICA ' || database || '.' || table || ' ON CLUSTER \'data\';' FROM clusterAllReplicas('data',system.replicas) WHERE active_replicas < total_replicas FORMAT TSVRaw;
-- check detached parts afterwards
SELECT * FROM clusterAllReplicas('data',system.detached_parts);
-- make clickhouse 'forget' about the table (data persisted on disk)
DETACH TABLE db.table ON CLUSTER '...';
-- remove the zookeeper data about that table in zookeeper
SYSTEM DROP REPLICA 'replica_name' FROM ZKPATH '/path/to/table/in/zk'; -- run the commands generated before.
-- register table in clickhouse again - it will be in readonly mode.
ATTACH TABLE db.table ON CLUSTER '...';
-- recreate the zookeeper data from the local state of the replica
SYSTEM RESTORE REPLICA db.table ON CLUSTER '...';
-- do restart replica where needed (the query below generates the commands)
SELECT DISTINCT 'clickhouse-client --host=' || left(hostName(),-2) || ' --query=\'SYSTEM RESTART REPLICA '||database || '.' || table|| '\'' FROM clusterAllReplicas('all-sharded', system.replication_queue) WHERE last_exception != '' and create_time > now() -130 FORMAT TSVRaw;
Here is a bash script that does the same as above, but tailored to a single replica. You can call it like: bash restore_replica.sh chi-clickhouse-cluster-main-cluster-1-3
#!/usr/bin/env bash
#Call like bash restore_replica.sh chi-clickhouse-cluster-main-cluster-1-3
set -o errexit # exit on fail
set -o pipefail # catch errors in pipelines
set -o nounset # exit on undeclared variable
set -o xtrace # trace execution
restore_replica() {
local chi_name=$1
# assumes `chi-...-cluster-<shard>-<replica>` naming; you can change this pattern to your needs
local shard=$(echo $chi_name |grep -oP '(?<=cluster-)\d+(?=-\d+$)')
while true; do
clickhouse-client --host=${chi_name} --user=admin --password=the_admin_password --query="select concat(database, '.\`', table, '\`', ' ', database, '/', table) FROM system.replicas WHERE is_readonly = 1 ORDER BY database, table" |
while read -r db_table zk_path; do
clickhouse-client --host=${chi_name} --user=admin --password=the_admin_password --query="DETACH TABLE ${db_table}"
clickhouse-client --host=${chi_name} --user=admin --password=the_admin_password --query="SYSTEM DROP REPLICA '"${chi_name}"' FROM ZKPATH '/clickhouse/tables/${shard}/${zk_path}'" || true
clickhouse-client --host=${chi_name} --user=admin --password=the_admin_password --query="ATTACH TABLE ${db_table}"
clickhouse-client --host=${chi_name} --user=admin --password=the_admin_password --query="SYSTEM RESTORE REPLICA ${db_table}"
done || true
sleep 5
done
}
restore_replica "$@"
Stuck DDL tasks in the distributed_ddl_queue
Sometimes DDL tasks (the ones that use ON CLUSTER) can get stuck in the distributed_ddl_queue, because replicas can get overloaded if multiple DDLs (thousands of CREATE/DROP/ALTER) are executed at the same time. This is quite normal in heavy ETL jobs.
This can be detected by checking the distributed_ddl_queue table and seeing whether there are tasks that are not moving or have been stuck for a long time.
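For example, something along these lines (a sketch; the status values are those of system.distributed_ddl_queue):
SELECT entry, initiator_host, host, status, query, query_create_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time
LIMIT 50;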
If these DDLs completed on some replicas but failed on others, the simplest way to solve this is to execute the failed command on the missing replicas without ON CLUSTER. If most of the DDLs failed, then check the number of unfinished records in distributed_ddl_queue on the other nodes, because it will most probably be in the thousands.
First, back up the distributed_ddl_queue into a table so you have a snapshot of the tasks and their states. You can do this with the following command:
CREATE TABLE default.system_distributed_ddl_queue AS SELECT * FROM system.distributed_ddl_queue;
After this, check in the backup table which tasks are not finished and execute them manually on the missing replicas, and review the pipelines that issue ON CLUSTER commands so they do not abuse them. There is a newer CREATE TEMPORARY TABLE feature that can be used to avoid ON CLUSTER in some cases: when you need an intermediate table for some operations, you can afterwards INSERT INTO the final table or do ALTER TABLE final ATTACH PARTITION FROM temp, and the temporary table is dropped automatically when the session is closed.
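A hedged sketch of that pattern (the table and column names are illustrative; temporary tables with an explicit MergeTree engine need a recent ClickHouse version, and ATTACH PARTITION ... FROM requires both tables to have the same structure, partition key, and order by key):
-- session-local staging table; no ON CLUSTER statements involved
CREATE TEMPORARY TABLE tmp_sessions
(
    id UInt64,              -- hypothetical columns, must match the final table
    event_date Date,
    payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_date)
ORDER BY id;

-- heavy intermediate processing goes into the temporary table
INSERT INTO tmp_sessions
SELECT id, event_date, payload FROM analytics.sessions_staging;  -- hypothetical source

-- move the prepared partition into the final replicated table
ALTER TABLE analytics.sessions_local ATTACH PARTITION 20230411 FROM tmp_sessions;
-- tmp_sessions is dropped automatically when the session closes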