Doris: all 3 FEs crashed one after another during the initial insert into an auto-partition table, with no error messages in the logs


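The partitioned table is very simple: it is partitioned by day on a date column. About 400 partitions were inserted in a single load, but I cannot reproduce the problem in the test environment. How should a case like this be investigated?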

create table DWS.PMF_INV_BAL_RR_ASS_ACCOUNT_BATCH_SNAP_F(
    ExportDate date NULL,
    CompanyID varchar(100) NOT NULL,
    ……
) AUTO PARTITION BY LIST(ExportDate) ();

FE log:

RuntimeLogger 2025-03-18 21:02:56,800 INFO (replayer|13) [StreamLoadRecordMgr.replayFetchStreamLoadRecord():368] Replay stream load bdbje. backend: test-disaggregated-cluster-cg1-0.test-disaggregated-cluster-cg1.doris.svc.lygk8s04.local, last stream load time: 1742297940592
RuntimeLogger 2025-03-18 21:02:56,802 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d0310
RuntimeLogger 2025-03-18 21:02:56,811 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d112d0310
RuntimeLogger 2025-03-18 21:02:56,820 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d112d0410
RuntimeLogger 2025-03-18 21:02:56,829 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d032d1010
RuntimeLogger 2025-03-18 21:02:56,838 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d032d1110
RuntimeLogger 2025-03-18 21:02:56,846 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d032d1210
RuntimeLogger 2025-03-18 21:02:56,854 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d032d1310
RuntimeLogger 2025-03-18 21:02:56,863 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d032d0410
RuntimeLogger 2025-03-18 21:02:56,872 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d022d1210
RuntimeLogger 2025-03-18 21:02:56,881 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d032d0710
RuntimeLogger 2025-03-18 21:02:56,890 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d032d0810
RuntimeLogger 2025-03-18 21:02:56,899 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d112d0510
RuntimeLogger 2025-03-18 21:02:56,908 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d112d0610
RuntimeLogger 2025-03-18 21:02:56,918 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d1710
RuntimeLogger 2025-03-18 21:02:56,935 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d1810
RuntimeLogger 2025-03-18 21:02:56,945 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d2510
RuntimeLogger 2025-03-18 21:02:56,954 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d3010
RuntimeLogger 2025-03-18 21:02:56,962 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d2910
RuntimeLogger 2025-03-18 21:02:56,970 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d2810
RuntimeLogger 2025-03-18 21:02:56,979 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d2710
RuntimeLogger 2025-03-18 21:02:56,989 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d012d1310
RuntimeLogger 2025-03-18 21:02:56,998 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d3110
RuntimeLogger 2025-03-18 21:02:57,007 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d012d0110
RuntimeLogger 2025-03-18 21:02:57,017 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d012d0910
RuntimeLogger 2025-03-18 21:02:57,026 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20242d122d1610
RuntimeLogger 2025-03-18 21:02:57,034 INFO (replayer|13) [StreamLoadRecordMgr.replayFetchStreamLoadRecord():368] Replay stream load bdbje. backend: test-disaggregated-cluster-cg1-2.test-disaggregated-cluster-cg1.doris.svc.lygk8s04.local, last stream load time: 1742298000623
RuntimeLogger 2025-03-18 21:02:57,034 INFO (replayer|13) [StreamLoadRecordMgr.replayFetchStreamLoadRecord():368] Replay stream load bdbje. backend: test-disaggregated-cluster-cg1-1.test-disaggregated-cluster-cg1.doris.svc.lygk8s04.local, last stream load time: 1742298060723
RuntimeLogger 2025-03-18 21:02:57,034 INFO (replayer|13) [StreamLoadRecordMgr.replayFetchStreamLoadRecord():368] Replay stream load bdbje. backend: test-disaggregated-cluster-cg1-0.test-disaggregated-cluster-cg1.doris.svc.lygk8s04.local, last stream load time: 1742298000716
RuntimeLogger 2025-03-18 21:02:57,035 INFO (replayer|13) [RefreshManager.refreshCatalogInternal():80] refresh catalog ctl_ruiyun_hr with invalidCache true
RuntimeLogger 2025-03-18 21:02:57,036 INFO (replayer|13) [LoadJob.isExpired():1245] state FINISHED, expireTime 43200, currentTimeMs 1742302977036, finishTimestamp 1742298142401
RuntimeLogger 2025-03-18 21:02:57,036 INFO (replayer|13) [LoadManager.replayCreateLoadJob():186] LOAD_JOB=12899861, msg={replay create load job}
RuntimeLogger 2025-03-18 21:02:57,036 INFO (replayer|13) [LoadJob.isExpired():1245] state FINISHED, expireTime 43200, currentTimeMs 1742302977036, finishTimestamp 1742298142636
RuntimeLogger 2025-03-18 21:02:57,036 INFO (replayer|13) [LoadManager.replayCreateLoadJob():186] LOAD_JOB=12899863, msg={replay create load job}
RuntimeLogger 2025-03-18 21:02:57,036 INFO (replayer|13) [LoadJob.isExpired():1245] state FINISHED, expireTime 43200, currentTimeMs 1742302977036, finishTimestamp 1742298142854
RuntimeLogger 2025-03-18 21:02:57,036 INFO (replayer|13) [LoadManager.replayCreateLoadJob():186] LOAD_JOB=12899865, msg={replay create load job}
RuntimeLogger 2025-03-18 21:02:57,038 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d012d0210
RuntimeLogger 2025-03-18 21:02:57,046 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d022d1810
RuntimeLogger 2025-03-18 21:02:57,055 INFO (replayer|13) [EditLog.loadJournal():253] Begin to unprotect add partition. db = 38850 table = 12870844 partitionName = p20252d012d0310
RuntimeLogger 2025-03-18 21:02:57,064 INFO (replayer|13) [StreamLoadRecordMgr.replayFetchStreamLoadRecord():368] Replay stream load bdbje. backend: test-disaggregated-cluster-cg1-2.test-disaggregated-cluster-cg1.doris.svc.lygk8s04.local, last stream load time: 1742298000623
RuntimeLogger 2025-03-18 21:02:57,064 INFO (replayer|13) [StreamLoadRecordMgr.replayFetchStreamLoadRecord():368] Replay stream load bdbje. backend: test-disaggregated-cluster-cg1-1.test-disaggregated-cluster-cg1.doris.svc.lygk8s04.local, last stream load time: 1742298120631
RuntimeLogger 2025-03-18 21:02:57,064 INFO (replayer|13) [StreamLoadRecordMgr.replayFetchStreamLoadRecord():368] Replay stream load bdbje. backend: test-disaggregated-cluster-cg1-0.test-disaggregated-cluster-cg1.doris.svc.lygk8s04.local, last stream load time: 1742298180731
RuntimeLogger 2025-03-18 21:02:57,064 INFO (replayer|13) [LoadJob.isExpired():1245] state FINISHED, expireTime 43200, currentTimeMs 1742302977064, finishTimestamp 1742298246130
RuntimeLogger 2025-03-18 21:02:57,065 INFO (replayer|13) [LoadManager.replayCreateLoadJob():186] LOAD_JOB=12900027, msg={replay create load job}
RuntimeLogger 2025-03-18 21:02:57,067 ERROR (replayer|13) [EditLog.loadJournal():1251] replay Operation Type 210, log id: 3815795
java.lang.NullPointerException: null
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903) ~[guava-32.1.2-jre.jar:?]
	at org.apache.doris.catalog.OlapTable.checkPartition(OlapTable.java:2593) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.catalog.OlapTable.replaceTempPartitions(OlapTable.java:2559) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.catalog.Env.replayReplaceTempPartition(Env.java:6104) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:839) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.catalog.Env.replayJournal(Env.java:2999) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.catalog.Env$4.runOneCycle(Env.java:2761) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.common.util.Daemon.run(Daemon.java:119) ~[doris-fe.jar:1.2-SNAPSHOT]
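
For readers following the log, the partition names can be mapped back to dates. The sketch below is only an inference from the names in this excerpt, not a documented Doris API: it assumes each name is `p`, then the date with every `-` written as its hex code `2d`, then a two-digit suffix holding the length of the original value (10 for `2024-12-03`). The function name `decode_partition_name` is hypothetical.

```python
import re
from datetime import date

# Assumed layout (inferred from the log excerpt only):
#   "p" + YYYY + "2d" + MM + "2d" + DD + <length of the original value>
ENCODED_DATE = re.compile(r"(\d{4})2d(\d{2})2d(\d{2})")


def decode_partition_name(name: str) -> date:
    """Turn a name such as 'p20242d122d0310' into the date it appears to encode."""
    if not name.startswith("p"):
        raise ValueError(f"unexpected partition name: {name!r}")
    body = name[1:]
    # Assumption: the last two digits are the length of the original value.
    encoded, length = body[:-2], int(body[-2:])
    match = ENCODED_DATE.fullmatch(encoded)
    if match is None:
        raise ValueError(f"name does not follow the assumed layout: {name!r}")
    value = "-".join(match.groups())
    if len(value) != length:
        raise ValueError(f"length suffix mismatch in {name!r}")
    return date.fromisoformat(value)


if __name__ == "__main__":
    for partition in ("p20242d122d0310", "p20252d032d1010"):
        print(partition, decode_partition_name(partition))
```

Read this way, the partitions being replayed span dates from late 2024 through March 2025.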

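After roughly a few hours, all 3 FEs were down. During recovery, the FEs only started successfully after the last few FE metadata files were removed, but once they were up, writes to other tables failed with the error below. As a side complaint, the S3 backup feature is still unusable as of 3.0.4.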

Caused by: org.apache.seatunnel.connectors.doris.exception.DorisConnectorException: ErrorCode:[Doris-01], ErrorDescription:[stream load error] - stream load error: [CANCELLED]cancelled: [INTERNAL_ERROR][INTERNAL_ERROR]add row failed. VNodeChannel[7846320-10061], load_id=7d45dd4a21f162ea-124112d55e8c519a, txn_id=9811272430452736, node=<k8s address>, add batch req success but status isn't ok, err: [INTERNAL_ERROR]PStatus: (<k8s address>)[INTERNAL_ERROR]failed to get tablet 12851021
2 Answers

Were there any abnormal entries in fe.log at the time the FEs went down?
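
One way to check this without reading the whole file is to filter out the INFO noise. The following is a minimal sketch, assuming the entry header format seen in the excerpt above (an optional `RuntimeLogger ` prefix, a timestamp, then the level); the function name `find_problem_entries` and the default level list are hypothetical choices, not part of Doris.

```python
import re
from typing import Iterable, List, Sequence

# Header format as it appears in the excerpt, e.g.
# "RuntimeLogger 2025-03-18 21:02:57,067 ERROR (replayer|13) [...] message"
ENTRY_HEADER = re.compile(
    r"^(?:RuntimeLogger )?\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+) "
)


def find_problem_entries(
    lines: Iterable[str],
    levels: Sequence[str] = ("WARN", "ERROR", "FATAL"),
) -> List[str]:
    """Return each entry at one of `levels`, with its stack trace lines attached."""
    entries: List[List[str]] = []
    current = None  # lines of the entry being collected, or None while skipping
    for raw in lines:
        line = raw.rstrip("\n")
        header = ENTRY_HEADER.match(line)
        if header:
            # A new entry starts here; keep it only if its level is of interest.
            current = [line] if header.group(1) in levels else None
            if current is not None:
                entries.append(current)
        elif current is not None and line.strip():
            # Continuation line (exception message or "at ..." frame).
            current.append(line)
    return ["\n".join(entry) for entry in entries]


if __name__ == "__main__":
    # Assumption: fe.log is in the current directory; adjust the path as needed.
    with open("fe.log", encoding="utf-8", errors="replace") as log_file:
        for entry in find_problem_entries(log_file):
            print(entry)
            print("-" * 40)
```

Applied to the excerpt in the question, a filter like this would surface the `replay Operation Type 210` entry and its `NullPointerException` trace, which is easy to miss among the `add partition` lines.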

I did not find any abnormal messages, only the partition-insert section shown above.