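The cluster runs 20+ Flink jobs that write to Doris. Each job writes to 1~20 Doris tables, and 2PC is disabled on all of them. At one point in time, several of these jobs hit checkpoint timeouts.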
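- The checkpoint state of every job is small; before the timeout and after the restart, checkpoints complete normally and quickly.
- The checkpoint interval is 1 minute for almost all jobs, with a 10-minute checkpoint timeout.
- BE resource usage (CPU, memory, IO) was not particularly high.
- The amount of data written each time is not particularly large either.
- In the BE logs at the time of the checkpoint timeout there are many `abort transaction: TransactionState` entries. For example, a job that writes 10 tables has the corresponding 10 transactions aborted.
- At that time the Doris cluster had about 100 transactions in the prepare state, and max_running_txn_num_per_db is set to 1000.
- In the job's TaskManager log, at the start of the last checkpoint interval the 10 DorisSink instances in the job logged "start stream load", and after that nothing was logged at all for the 10 minutes until the timeout.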
The FE and BE logs for one of these transactions:
FE:
2026-04-01 11:00:32,827 INFO (thrift-server-pool-8235|5975562) [DatabaseTransactionMgr.beginTransaction():396] begin transaction: txn id 178277697 with label test_0_1775012432827 from coordinator BE: 192.168.1.1, listener id: -1
2026-04-01 11:10:32,818 INFO (thrift-server-pool-7832|5620194) [DatabaseTransactionMgr.abortTransaction():1644] abort transaction: TransactionState. transaction id: 178277697, label: test_0_1775012432827, db id: 370739, table id list: 5586149, callback id: -1, coordinator: BE: 192.168.1.1, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1775012432827, commit time: -1, finish time: 1775013032818, reason: [OK] successfully
2026-04-01 13:07:49,229 INFO (leaderCheckpointer|109) [DatabaseTransactionMgr.replayUpsertTransactionState():2264] replay a ABORTED transaction TransactionState. transaction id: 178277697, label: test_0_1775012432827, db id: 370739, table id list: 5586149, callback id: -1, coordinator: BE: 192.168.1.1, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1775012432827, commit time: -1, finish time: 1775013032818, reason: [OK]
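
As a quick cross-check of the FE lines above, the transaction lifetime can be computed from the `prepare time` and `finish time` fields of the abort entry. This is only a sketch: the two numbers are copied from the log for txn 178277697, and the UTC+8 offset is an assumption made so that the epoch value can be compared with the printed log timestamp.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the FE "abort transaction" line for txn 178277697.
prepare_ms = 1775012432827  # "prepare time" field, epoch milliseconds
finish_ms = 1775013032818   # "finish time" field, epoch milliseconds

# Lifetime of the transaction from begin to abort, in seconds.
elapsed_s = (finish_ms - prepare_ms) / 1000
print(elapsed_s)  # 599.991

# Assumption: the FE writes its log timestamps in UTC+8.
log_tz = timezone(timedelta(hours=8))
prepare_local = datetime.fromtimestamp(prepare_ms // 1000, log_tz)
print(prepare_local.strftime("%Y-%m-%d %H:%M:%S"))  # 2026-04-01 11:00:32
```

The transaction lived for about 599.991 s, which is almost exactly the 10-minute checkpoint timeout. That is consistent with the abort following the checkpoint timeout on the Flink side rather than a limit being reached inside Doris, although the logs alone do not prove it.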
BE:
I20260401 11:00:32.830299 26416 stream_load_executor.cpp:72] begin to execute stream load. label=test_0_1775012432827, txn_id=178277697, query_id=a6433bb6bc985de6-741ac909a1ee46bd
I20260401 11:00:32.830945 26416 stream_load.cpp:214] finished to handle HTTP header, id=a6433bb6bc985de6-741ac909a1ee46bd, job_id=-1, txn_id=178277697, label=test_0_1775012432827, elapse(s)=0
I20260401 11:00:32.832597 25349 tablets_channel.cpp:165] txn 178277697: TabletsChannel of index 5586150 init senders 1 with incremental off
W20260401 11:10:32.816349 24150 vtablet_writer.cpp:589] cancel node channel VNodeChannel[5586150-8521003], load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, node=192.168.147.182:18060, error message: [CANCELLED]cancelled: sender is gone. cur path:
W20260401 11:10:32.816439 24150 vtablet_writer.cpp:589] cancel node channel VNodeChannel[5586150-10004], load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, node=192.168.73.216:18060, error message: [CANCELLED]cancelled: sender is gone. cur path:
W20260401 11:10:32.816476 24150 vtablet_writer.cpp:589] cancel node channel VNodeChannel[5586150-10003], load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, node=192.168.73.217:18060, error message: [CANCELLED]cancelled: sender is gone. cur path:
W20260401 11:10:32.816514 24150 vtablet_writer.cpp:589] cancel node channel VNodeChannel[5586150-10002], load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, node=192.168.1.1:18060, error message: [CANCELLED]cancelled: sender is gone. cur path:
I20260401 11:10:32.816699 24150 vtablet_writer.cpp:1376] close olap table sink. load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, canceled all node channels due to error: [CANCELLED]cancelled: sender is gone. cur path:
W20260401 11:10:32.817793 24150 stream_load_executor.cpp:105] fragment execute failed, err_msg=[CANCELLED]cancelled: sender is gone. cur path: , id=a6433bb6bc985de6-741ac909a1ee46bd, job_id=-1, txn_id=178277697, label=test_0_1775012432827, elapse(s)=599
I20260401 11:10:32.818830 57938 task_worker_pool.cpp:337] successfully submit task|type=CLEAR_TRANSACTION_TASK|signature=178277697
I20260401 11:10:32.819159 25017 task_worker_pool.cpp:1641] get clear transaction task. signature=178277697, transaction_id=178277697, partition_id_size=0
I20260401 11:10:32.819168 25017 storage_engine.cpp:715] begin to clear transaction task. transaction_id=178277697
I20260401 11:10:32.819175 25017 storage_engine.cpp:743] finish to clear transaction task. transaction_id=178277697
I20260401 11:10:32.819180 25017 task_worker_pool.cpp:1657] finish to clear transaction task. signature=178277697, transaction_id=178277697
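
To locate the silent window in the BE log above, one option is to parse the glog timestamps and look for the largest gap between consecutive lines of the same transaction. The sketch below is illustrative only: `be_lines` holds abbreviated copies of five of the lines above (timestamps verbatim, messages shortened), and `parse_ts` and `largest_gap` are hypothetical helper names, not part of Doris.

```python
import re
from datetime import datetime

# Abbreviated copies of BE log lines for txn 178277697:
# timestamps are verbatim, message text is shortened for readability.
be_lines = [
    "I20260401 11:00:32.830299 26416 stream_load_executor.cpp:72] begin to execute stream load.",
    "I20260401 11:00:32.830945 26416 stream_load.cpp:214] finished to handle HTTP header",
    "I20260401 11:00:32.832597 25349 tablets_channel.cpp:165] txn 178277697: TabletsChannel ... init senders 1",
    "W20260401 11:10:32.816349 24150 vtablet_writer.cpp:589] cancel node channel ...",
    "W20260401 11:10:32.817793 24150 stream_load_executor.cpp:105] fragment execute failed ...",
]

# glog prefix: severity letter, yyyymmdd, then hh:mm:ss.uuuuuu
GLOG_TS = re.compile(r"^[IWEF](\d{8} \d{2}:\d{2}:\d{2}\.\d{6}) ")


def parse_ts(line: str) -> datetime:
    """Extract the timestamp from one glog-formatted line."""
    match = GLOG_TS.match(line)
    if match is None:
        raise ValueError(f"not a glog line: {line!r}")
    return datetime.strptime(match.group(1), "%Y%m%d %H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap in seconds, line before the gap, line after the gap)."""
    stamps = [parse_ts(line) for line in lines]
    gaps = [(later - earlier, i)
            for i, (earlier, later) in enumerate(zip(stamps, stamps[1:]))]
    gap, i = max(gaps)
    return gap.total_seconds(), lines[i], lines[i + 1]


gap_s, before, after = largest_gap(be_lines)
print(f"{gap_s:.6f} s of silence")
print("last line before the gap:", before)
print("first line after the gap:", after)
```

On these lines the gap is about 599.98 s, between the `tablets_channel.cpp` init line and the first `cancel node channel` warning. This matches the TaskManager observation that nothing is logged between "start stream load" and the timeout.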