Cluster version: 3.1.2, integrated storage-compute deployment
Table partition properties:
"enable_single_replica_compaction" = "false"
"dynamic_partition.time_unit" = "DAY"
"compaction_policy" = "time_series"
"dynamic_partition.replication_allocation" = "tag.location.b28: 1, tag.location.test: 1"
"dynamic_partition.storage_policy" = "3day"
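Problem description:
This is a log table partitioned by day with two replicas. Single-replica compaction is not enabled, and a cooldown storage policy is set. While a tablet is cooling down, the follower replica keeps reporting errors: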
I20260423 13:53:59.924453 550719 tablet.cpp:2216] try to follow cooldowned data. tablet_id=1776406642013 cooldown_replica_id=1771059700698 local replica=1771059700697
W20260423 13:53:59.930737 550719 olap_server.cpp:1343] failed to cooldown, tablet: 1776406642013 err: [INTERNAL_ERROR]cooldowned version is not aligned with version 15008
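Normally this recovers on its own once the leader replica finishes cooldown. In this case, however, the node hosting the leader replica failed during cooldown. The follower replica keeps reporting the error, and no new replica can be cloned. Dropping the failed node does not help either. The tablet clone task is scheduled periodically but fails every time. BE log: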
2cd546610f72777b580_0.dat. size(B): 166763, timeout(s): 300
I20260423 17:19:01.285480 238688 engine_clone_task.cpp:603] clone begin to download file from: http://172.30.196.116:8040/api/_tablet/_download?token=******&file=/data2/doris/data/snapshot/20260423171901.9535564.180/1776406642013/813585454/020000001d25bb06d2438a9863b422cd546610f72777b580_0.dat to: /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bb06d2438a9863b422cd546610f72777b580_0.dat. size(B): 165161, timeout(s): 300
I20260423 17:19:01.287786 238688 engine_clone_task.cpp:603] clone begin to download file from: http://172.30.196.116:8040/api/_tablet/_download?token=******&file=/data2/doris/data/snapshot/20260423171901.9535564.180/1776406642013/813585454/1776406642013.hdr to: /data2/doris_be02/data/data/305/1776406642013/813585454/1776406642013.hdr. size(B): 91852, timeout(s): 300
I20260423 17:19:01.289645 238688 engine_clone_task.cpp:642] succeed to copy tablet 1776406642013, total files: 16, total file size: 2710286 B, cost: 38 ms, rate: 71.3233 MB/s
I20260423 17:19:01.290928 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25baf8d2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.291128 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25baf9d2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.291316 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bafad2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.291493 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bafbd2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.291673 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bafcd2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.291848 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bafdd2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.292022 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bafed2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.292199 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25baffd2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.292393 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bb00d2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.292563 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bb01d2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.292737 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bb02d2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.292905 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bb03d2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.293093 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bb04d2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.293277 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bb05d2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.293450 238688 beta_rowset.cpp:218] deleting /data2/doris_be02/data/data/305/1776406642013/813585454/020000001d25bb06d2438a9863b422cd546610f72777b580_0.dat
I20260423 17:19:01.294862 238688 engine_clone_task.cpp:306] clone copy done. src_host: 172.30.196.116 src_file_path: /data2/doris/data/snapshot/20260423171901.9535564.180/
I20260423 17:19:01.294878 238688 tablet_manager.cpp:964] begin to load tablet from dir. tablet_id=1776406642013 schema_hash=813585454 path = /data2/doris_be02/data/data/305/1776406642013/813585454 force = 0 restore = 0
I20260423 17:19:01.298640 238688 tablet_manager.cpp:1060] begin to process report tablet info.tablet_id=1776406642013
W20260423 17:19:01.298659 238688 engine_clone_task.cpp:342] begin to drop the stale tablet. tablet_id:1776406642013, replica_id:1776441438992, schema_hash:813585454, signature:1776406642013, version:-1, expected_version: 2165
I20260423 17:19:01.298676 238688 tablet_manager.cpp:525] begin drop tablet. tablet_id=1776406642013, replica_id=1776441438992, is_drop_table_or_partition=0, keep_files=0
I20260423 17:19:01.298684 238688 tablet_manager.cpp:1370] add tablet_id= 1776406642013 to map, reason=drop tablet, lock times=2, thread_id_in_map=139768285955840
I20260423 17:19:01.298884 238688 tablet_manager.cpp:575] set tablet to shutdown state and remove it from memory. tablet_id=1776406642013, tablet_path=/data2/doris_be02/data/data/305/1776406642013/813585454
I20260423 17:19:01.299532 238688 tablet_manager.cpp:1394] erase tablet_id= 1776406642013 from map, reason=drop tablet, left=1, thread_id_in_map=139768285955840
W20260423 17:19:01.299556 238688 status.h:427] meet error status: [INTERNAL_ERROR]unexpected version. tablet version: -1, expected version: 2165
0# doris::EngineCloneTask::_set_tablet_info() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:0
1# doris::EngineCloneTask::_do_clone() at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:335
2# doris::EngineCloneTask::execute() at /home/zcp/repo_center/doris_release/doris/be/src/olap/task/engine_clone_task.cpp:165
3# doris::clone_callback(doris::StorageEngine&, doris::ClusterInfo const*, doris::TAgentTaskRequest const&) at /home/zcp/repo_center/doris_release/doris/be/../gensrc/build/gen_cpp/MasterService_types.h:343
4# doris::PriorTaskWorkerPool::normal_loop() at /home/zcp/repo_center/doris_release/doris/be/src/agent/task_worker_pool.cpp:691
5# doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
6# start_thread
7# clone
I20260423 17:19:01.299574 238688 tablet_manager.cpp:1397] erase tablet_id= 1776406642013 from map, reason=clone, thread_id_in_map=139768285955840
W20260423 17:19:01.299587 238688 task_worker_pool.cpp:2239] failed to clone tablet|signature=1776406642013|tablet_id=1776406642013|error=[INTERNAL_ERROR]unexpected version. tablet version: -1, expected version: 2165
0# doris::EngineCloneTask::_set_tablet_info() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:0
1# doris::EngineCloneTask::_do_clone() at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:335
2# doris::EngineCloneTask::execute() at /home/zcp/repo_center/doris_release/doris/be/src/olap/task/engine_clone_task.cpp:165
3# doris::clone_callback(doris::StorageEngine&, doris::ClusterInfo const*, doris::TAgentTaskRequest const&) at /home/zcp/repo_center/doris_release/doris/be/../gensrc/build/gen_cpp/MasterService_types.h:343
4# doris::PriorTaskWorkerPool::normal_loop() at /home/zcp/repo_center/doris_release/doris/be/src/agent/task_worker_pool.cpp:691
5# doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
6# start_thread
7# clone
FE log:
2026-04-23 17:30:11,124 INFO (tablet checker|156) [TabletScheduler.addTablet():301] Add tablet to pending queue, tablet id: 1776406642013, status: REPLICA_MISSING, state: PENDING, type: REPAIR, priority: VERY_HIGH, tablet size: 0, visible version: -1, committed version: -1
2026-04-23 17:30:16,291 INFO (tablet scheduler|156) [TabletSchedCtx.createCloneReplicaAndTask():1053] create clone task to repair replica, tabletId=1776406642013, replica=[replicaId=1776441511370, BackendId=1776433717754, version=-1, dataSize=-1, rowCount=-1, lastFailedVersion=2165, lastSuccessVersion=-1, lastFailedTimestamp=1776936616291, schemaHash=813585454, state=CLONE, isBad=false], visible version 2165, tablet status REPLICA_MISSING
2026-04-23 17:30:16,292 INFO (tablet scheduler|156) [TabletScheduler.schedulePendingTablets():478] add clone task to agent task queue: tablet id: 1776406642013, replica id: 1776441511370, schema hash: 813585454, storageMedium: SSD, visible version: 2165, src backend: 172.30.196.116, src path hash: -7089620660731652427, dest backend id: 1776433717754, dest backend: 172.30.195.181, dest path hash: -4937331539893838086
2026-04-23 17:30:16,338 WARN (thrift-server-pool-3349|156) [MasterImpl.finishTask():100] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:172.30.195.181, be_port:29060, http_port:28040, brpc_port:28060, id:1776433717754), task_type:CLONE, signature:1776406642013, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(172.30.195.181)[INTERNAL_ERROR]unexpected version. tablet version: -1, expected version: 2165]))
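To make the two failure modes in the logs above easier to compare, here is a minimal Python sketch that pulls the version numbers out of the cooldown warning and the clone error. The regular expressions are written only against the exact messages pasted in this post, so treating them as matching other Doris versions is an assumption; the function name `summarize` and the `sample` list are illustrative, not part of Doris.

```python
import re
from collections import Counter

# Clone failure text, as it appears in both the BE and FE logs in this post.
CLONE_ERR = re.compile(
    r"unexpected version\. tablet version: (-?\d+), expected version: (-?\d+)"
)
# Cooldown warning text, as logged on the follower replica in this post.
COOLDOWN_ERR = re.compile(
    r"failed to cooldown, tablet: (\d+) err: .*?not aligned with version (\d+)"
)


def summarize(lines):
    """Count occurrences of each distinct error found in the given log lines.

    Keys are ("clone", tablet_version, expected_version) or
    ("cooldown", tablet_id, version).
    """
    counts = Counter()
    for line in lines:
        m = CLONE_ERR.search(line)
        if m:
            counts[("clone", int(m.group(1)), int(m.group(2)))] += 1
            continue
        m = COOLDOWN_ERR.search(line)
        if m:
            counts[("cooldown", int(m.group(1)), int(m.group(2)))] += 1
    return counts


# Two lines copied from the logs in this post.
sample = [
    "W20260423 13:53:59.930737 550719 olap_server.cpp:1343] failed to cooldown, "
    "tablet: 1776406642013 err: [INTERNAL_ERROR]cooldowned version is not "
    "aligned with version 15008",
    "W20260423 17:19:01.299556 238688 status.h:427] meet error status: "
    "[INTERNAL_ERROR]unexpected version. tablet version: -1, "
    "expected version: 2165",
]

for key, n in summarize(sample).items():
    print(key, n)
```

Run against the full BE and FE logs, this shows at a glance that every clone attempt ends with the new replica at version -1 while FE expects 2165, and that the cooldown warning on the follower refers to a different version number (15008).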
Tablet status:

Could someone experienced help analyze this? How should this situation be handled?