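The cluster runs 20+ Flink jobs that write to Doris. Each job writes to 1~20 Doris tables, and all of them have 2PC disabled. At one point in time, several of these jobs hit checkpoint timeouts.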
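- The checkpoint state of every job is small; both before the timeout and after the restart, checkpoints completed normally and quickly
- The checkpoint interval is 1 minute for nearly all jobs, with a 10-minute timeout
- BE resource usage (CPU, memory, IO) was not particularly high
- The amount of data per write was not particularly large either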
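- In the BE logs, around the time of the checkpoint timeout, there are many "abort transaction: TransactionState" entries; for example, a job that writes to 10 tables has 10 corresponding aborted transactions
- At that time, the Doris cluster had about 100 transactions in the prepare state, and max_running_txn_num_per_db was set to 1000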
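- In the job's TaskManager log, at the start of the last checkpoint interval, the 10 DorisSink instances in the job each started a stream load, and then nothing at all was logged for roughly 10 minutes until the timeout. Log excerpt below: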
2026-04-01 11:00:36,670 INFO org.apache.doris.flink.sink.writer.RecordBuffer [] - start buffer data, read queue size 0, write queue size 3
2026-04-01 11:00:36,670 INFO org.apache.doris.flink.sink.writer.DorisStreamLoad [] - stream load started for ods_desktop_flavor_inst_0_1775012436670
2026-04-01 11:00:36,670 INFO org.apache.doris.flink.sink.writer.DorisStreamLoad [] - start execute load
2026-04-01 11:10:32,813 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to cancel task Sink Sink(table=[test], fields=[ ]) (1/1)#3 (e343d3605d286b40ee9b0aa5277ae362).
2026-04-01 11:10:32,813 INFO org.apache.flink.runtime.taskmanager.Task [] - Sink Sink(table=[test], fields=[ ]) (1/1)#3 (e343d3605d286b40ee9b0aa5277ae362) switched from RUNNING to CANCELING.
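
To quantify the silent period in the excerpt above, here is a minimal Python sketch that parses the timestamps and reports the largest gap between consecutive log lines. The helper names (`parse_ts`, `largest_gap`) are hypothetical and are not part of Flink or the Doris connector; the sketch assumes every line starts with a timestamp in the `yyyy-MM-dd HH:mm:ss,SSS` layout shown above.

```python
from datetime import datetime

# Log lines copied verbatim from the TaskManager excerpt above.
LOG_LINES = [
    "2026-04-01 11:00:36,670 INFO org.apache.doris.flink.sink.writer.RecordBuffer [] - start buffer data, read queue size 0, write queue size 3",
    "2026-04-01 11:00:36,670 INFO org.apache.doris.flink.sink.writer.DorisStreamLoad [] - stream load started for ods_desktop_flavor_inst_0_1775012436670",
    "2026-04-01 11:00:36,670 INFO org.apache.doris.flink.sink.writer.DorisStreamLoad [] - start execute load",
    "2026-04-01 11:10:32,813 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to cancel task Sink Sink(table=[test], fields=[ ]) (1/1)#3 (e343d3605d286b40ee9b0aa5277ae362).",
    "2026-04-01 11:10:32,813 INFO org.apache.flink.runtime.taskmanager.Task [] - Sink Sink(table=[test], fields=[ ]) (1/1)#3 (e343d3605d286b40ee9b0aa5277ae362) switched from RUNNING to CANCELING.",
]

# Assumption: the timestamp occupies the first 23 characters of each line.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23


def parse_ts(line):
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)


def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the longest silence."""
    pairs = zip(lines, lines[1:])
    gaps = [
        ((parse_ts(after) - parse_ts(before)).total_seconds(), before, after)
        for before, after in pairs
    ]
    return max(gaps, key=lambda item: item[0])


if __name__ == "__main__":
    gap, before, after = largest_gap(LOG_LINES)
    print(f"longest silence: {gap:.3f} s")
    print(f"  last line before: {before}")
    print(f"  first line after: {after}")
```

On this excerpt the longest silence is about 596 seconds (9 min 56 s), between "start execute load" and the task cancellation. That is in line with the 10-minute checkpoint timeout and indicates that the sink logged nothing after starting the stream load.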