Flink multi-job checkpoint timeouts, Doris aborts transactions

Question

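The cluster runs 20+ Flink jobs that write to Doris. Each job writes to 1~20 Doris tables, and 2PC is disabled on all of them. At one point in time, several of these jobs hit checkpoint timeouts.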

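Checkpoint state for each job is small, and checkpoints complete quickly both before the timeout and after a restart.
The checkpoint interval is one minute for most jobs, with a 10-minute timeout.
BE resource usage (CPU, memory, IO) was not particularly high.
The amount of data per write is not particularly large either.
At the time of the checkpoint timeouts, the BE log shows many abort transaction: TransactionState entries. For example, a job that writes 10 tables has 10 corresponding transactions.
At that time the Doris cluster had about 100 transactions in the prepare state, and max_running_txn_num_per_db is set to 1000.
In the job's TaskManager log, at the start of the last checkpoint interval the 10 DorisSink instances in the job logged start stream load, and then nothing was logged for the whole 10 minutes until the timeout.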

FE and BE logs for one of these transactions:

FE:
2026-04-01 11:00:32,827 INFO (thrift-server-pool-8235|5975562) [DatabaseTransactionMgr.beginTransaction():396] begin transaction: txn id 178277697 with label test_0_1775012432827 from coordinator BE: 192.168.1.1, listener id: -1
2026-04-01 11:10:32,818 INFO (thrift-server-pool-7832|5620194) [DatabaseTransactionMgr.abortTransaction():1644] abort transaction: TransactionState. transaction id: 178277697, label: test_0_1775012432827, db id: 370739, table id list: 5586149, callback id: -1, coordinator: BE: 192.168.1.1, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1775012432827, commit time: -1, finish time: 1775013032818, reason: [OK] successfully
2026-04-01 13:07:49,229 INFO (leaderCheckpointer|109) [DatabaseTransactionMgr.replayUpsertTransactionState():2264] replay a ABORTED transaction TransactionState. transaction id: 178277697, label: test_0_1775012432827, db id: 370739, table id list: 5586149, callback id: -1, coordinator: BE: 192.168.1.1, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1775012432827, commit time: -1, finish time: 1775013032818, reason: [OK]

BE:
I20260401 11:00:32.830299 26416 stream_load_executor.cpp:72] begin to execute stream load. label=test_0_1775012432827, txn_id=178277697, query_id=a6433bb6bc985de6-741ac909a1ee46bd
I20260401 11:00:32.830945 26416 stream_load.cpp:214] finished to handle HTTP header, id=a6433bb6bc985de6-741ac909a1ee46bd, job_id=-1, txn_id=178277697, label=test_0_1775012432827, elapse(s)=0
I20260401 11:00:32.832597 25349 tablets_channel.cpp:165] txn 178277697: TabletsChannel of index 5586150 init senders 1 with incremental off
W20260401 11:10:32.816349 24150 vtablet_writer.cpp:589] cancel node channel VNodeChannel[5586150-8521003], load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, node=192.168.147.182:18060, error message: [CANCELLED]cancelled: sender is gone. cur path: 
W20260401 11:10:32.816439 24150 vtablet_writer.cpp:589] cancel node channel VNodeChannel[5586150-10004], load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, node=192.168.73.216:18060, error message: [CANCELLED]cancelled: sender is gone. cur path: 
W20260401 11:10:32.816476 24150 vtablet_writer.cpp:589] cancel node channel VNodeChannel[5586150-10003], load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, node=192.168.73.217:18060, error message: [CANCELLED]cancelled: sender is gone. cur path: 
W20260401 11:10:32.816514 24150 vtablet_writer.cpp:589] cancel node channel VNodeChannel[5586150-10002], load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, node=192.168.1.1:18060, error message: [CANCELLED]cancelled: sender is gone. cur path: 
I20260401 11:10:32.816699 24150 vtablet_writer.cpp:1376] close olap table sink. load_id=a6433bb6bc985de6-741ac909a1ee46bd, txn_id=178277697, canceled all node channels due to error: [CANCELLED]cancelled: sender is gone. cur path: 
W20260401 11:10:32.817793 24150 stream_load_executor.cpp:105] fragment execute failed, err_msg=[CANCELLED]cancelled: sender is gone. cur path: , id=a6433bb6bc985de6-741ac909a1ee46bd, job_id=-1, txn_id=178277697, label=test_0_1775012432827, elapse(s)=599
I20260401 11:10:32.818830 57938 task_worker_pool.cpp:337] successfully submit task|type=CLEAR_TRANSACTION_TASK|signature=178277697
I20260401 11:10:32.819159 25017 task_worker_pool.cpp:1641] get clear transaction task. signature=178277697, transaction_id=178277697, partition_id_size=0
I20260401 11:10:32.819168 25017 storage_engine.cpp:715] begin to clear transaction task. transaction_id=178277697
I20260401 11:10:32.819175 25017 storage_engine.cpp:743] finish to clear transaction task. transaction_id=178277697
I20260401 11:10:32.819180 25017 task_worker_pool.cpp:1657] finish to clear transaction task. signature=178277697, transaction_id=178277697

阿渊@SelectDB (if I have not replied here, add me on WeChat via my profile) · Answer

The error is: sender is gone. This looks like the client actively closed the connection.
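The timestamps in the FE log line above can be used to check how long the transaction stayed open. The sketch below takes the prepare time and finish time values from that line and compares the difference with the 10-minute checkpoint timeout given in the question. The UTC+8 offset is an assumption made only so the printed times line up with the log timestamps, and the helper names are hypothetical. The lifetime works out to about 599.99 seconds, which agrees with elapse(s)=599 in the BE log. One possible reading, not a confirmed diagnosis, is that the Flink side gave up at the checkpoint timeout and closed the stream load connection.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the FE "abort transaction" log line in the question.
PREPARE_MS = 1775012432827
FINISH_MS = 1775013032818

# Checkpoint timeout of 10 minutes, as stated in the question.
CHECKPOINT_TIMEOUT_S = 10 * 60


def txn_lifetime_seconds(prepare_ms: int, finish_ms: int) -> float:
    """Return how long the transaction stayed open, in seconds."""
    return (finish_ms - prepare_ms) / 1000.0


def to_local(ms: int, utc_offset_hours: int = 8) -> datetime:
    """Convert epoch milliseconds to wall-clock time (UTC+8 is an assumption)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ms / 1000.0, tz=tz)


lifetime = txn_lifetime_seconds(PREPARE_MS, FINISH_MS)
print(f"prepared at {to_local(PREPARE_MS):%Y-%m-%d %H:%M:%S}")
print(f"finished at {to_local(FINISH_MS):%Y-%m-%d %H:%M:%S}")
print(f"lifetime {lifetime:.3f}s, timeout {CHECKPOINT_TIMEOUT_S}s")
```

If the lifetime were much shorter than the checkpoint timeout, a Flink crash or a server-side abort would be the more likely candidates.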

Has Flink crashed at any point? A Flink crash would also cause this problem.

There is also a parameter, abort_txn_after_lost_heartbeat_time_second: if the BE loses its heartbeat, the transaction is aborted after that many seconds. It is worth checking both.
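Because several jobs timed out together, it may help to check whether every failed load ran for roughly the same time before it was cancelled. The following is a minimal sketch that scans BE log lines for the fragment execute failed format shown in the question and extracts txn_id, label and elapse(s). The regular expression is written against that single sample line, so treat the pattern, the function name and the 590-second threshold as assumptions, not as a stable Doris log format.

```python
import re

# Pattern derived from the one "fragment execute failed" line in the question.
FAILED_RE = re.compile(
    r"fragment execute failed, err_msg=(?P<err>.*?), id=(?P<load_id>[0-9a-f-]+), "
    r"job_id=-?\d+, txn_id=(?P<txn_id>\d+), label=(?P<label>[^,\s]+), "
    r"elapse\(s\)=(?P<elapse>\d+)"
)


def failed_loads(lines):
    """Return one record per failed stream load found in the given log lines."""
    records = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match is None:
            continue
        records.append({
            "txn_id": int(match.group("txn_id")),
            "label": match.group("label"),
            "elapse_s": int(match.group("elapse")),
            "sender_gone": "sender is gone" in match.group("err"),
        })
    return records


sample = [
    "W20260401 11:10:32.817793 24150 stream_load_executor.cpp:105] fragment execute failed, "
    "err_msg=[CANCELLED]cancelled: sender is gone. cur path: , "
    "id=a6433bb6bc985de6-741ac909a1ee46bd, job_id=-1, txn_id=178277697, "
    "label=test_0_1775012432827, elapse(s)=599",
]

records = failed_loads(sample)
# Hypothetical threshold: loads that lived nearly the full 10-minute timeout.
near_timeout = [r for r in records if r["elapse_s"] >= 590 and r["sender_gone"]]
for r in near_timeout:
    print(r["txn_id"], r["label"], r["elapse_s"])
```

If most aborted loads cluster just under 600 seconds with sender is gone, that points toward the client side closing the connection at the checkpoint timeout. A wide spread of elapsed times would make the heartbeat parameter or a Flink crash worth a closer look.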

If you have further questions, message me on WeChat via my profile and we can look into it together.
