Why do the BE nodes in Doris alternately show backendAlive=false?


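Doris version: 2.0.4
Cluster: 4 nodes, 3 FE + 4 BE; the 3 FEs are co-located with BEs
Problem description
During a stream load (3 replicas), one replica was written successfully, one replica failed to write (the log shows that BE with backendAlive=false), and one replica was written successfully but is missing the previous version.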
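
To make the failure condition concrete, here is a minimal sketch of the majority check that the error message in the log ("succ replica num 1 < quorum replica num 2") appears to describe. The function names `quorum` and `can_commit` are made up for illustration and are not Doris APIs. The sketch assumes that only replicas reported as "final succ" count toward the quorum, which matches the numbers in the log but is not taken from the Doris source.

```python
def quorum(replica_num: int) -> int:
    """Majority of replicas; for 3 replicas this gives 2, as in the log."""
    return replica_num // 2 + 1


def can_commit(final_succ: int, replica_num: int) -> bool:
    """Assumption: failed replicas and replicas that miss a previous
    version do not count as successful."""
    return final_succ >= quorum(replica_num)


# Case from this question: 1 final succ, 1 failed, 1 missing a previous version.
print(can_commit(final_succ=1, replica_num=3))
```

Under these assumptions, a single BE being marked dead at the wrong moment is enough to fail the commit once another replica is already one version behind.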

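Troubleshooting done so far
1. First I checked the status of every BE node; none of them had restarted.
2. Then I searched the logs of the other BE nodes around the time the problem occurred and found a large number of this same error; 3 BE nodes take turns being in the false state.
3. The log also contains the error below (but I think it is a consequence of the 3 BEs alternately going false, so the real question is why the 3 BEs take turns going false).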
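
One way to confirm the "taking turns" pattern is to tally which backendId is reported with backendAlive=false, and when. The sketch below is only an illustration: `dead_backend_timeline` is a hypothetical helper, and the regular expressions assume the glog-style prefix and the replica entry layout seen in the log in this post.

```python
import re
from collections import defaultdict

# One replica entry inside a "Failed to commit txn" message (assumed layout).
REPLICA_RE = re.compile(r"backendId=(\d+),\s*backendAlive=(true|false)")
# glog-style line prefix such as "W0521 08:28:17.161291" (assumed format).
TIME_RE = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2})")


def dead_backend_timeline(lines):
    """Return {backend_id: [timestamps]} for entries with backendAlive=false."""
    timeline = defaultdict(list)
    current_ts = None
    for line in lines:
        m = TIME_RE.match(line)
        if m:
            # Remember the latest timestamp; long messages wrap over several lines.
            current_ts = f"{m.group(1)} {m.group(2)}"
        for backend_id, alive in REPLICA_RE.findall(line):
            if alive == "false":
                timeline[backend_id].append(current_ts)
    return dict(timeline)
```

Running this over each BE's warning log (for example `dead_backend_timeline(open(path, encoding="utf-8", errors="replace"))`, with `path` pointing at the log file) shows whether the false periods of the three BEs overlap or strictly alternate.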


W0521 08:28:17.161291 45440 stream_load_executor.cpp:399] commit transaction failed, errmsg=[ANALYSIS_ERROR]TStatus: errCode = 2, detailMessage = Failed to commit txn 426765578, cause: tablet 179229911 succ replica num 1 < quorum replica num 2. table 49003, partition 179229910, this tablet detail: 1 replicas final succ: { [replicaId=179229913, backendId=122059323, backendAlive=true, version=463377, state=NORMAL] }; 1 replicas write data failed: { [replicaId=179229914, backendId=24479841, backendAlive=false, version=463377, state=NORMAL] }; 1 replicas write data succ but miss previous version: { [replicaId=179229912, backendId=2313862, backendAlive=true, version=463376, lastFailedVersion=463377, lastSuccessVersion=463376, lastFailedTimestamp=1747787206442, state=NORMAL]
0# doris::Status doris::Status::create<true>(doris::TStatus const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
1# doris::StreamLoadExecutor::commit_txn(doris::StreamLoadContext*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:442
2# doris::StreamLoadAction::_handle(std::shared_ptr<doris::StreamLoadContext>) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
3# doris::StreamLoadAction::handle(doris::HttpRequest*) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:[illegible]
4#
5#
bufferevent_run read_cb
6# ?
7# ?
8# ?
9# ?
10# std::_Function_handler<void (), doris::EvHttpServer::start()::$_0>::_M_invoke(std::_Any_data const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/atomicity.h:98
11# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/util/threadpool.cpp:0
12# doris::Thread::supervise_thread(void*) at /var/local/ldb_toolchain/bin/../usr/include/pthread.h:562
13# start_thread
14# clone
, id=4847b12441a5998f-1218f5976360e487, job_id=-1, txn_id=426765578, label=DTS-KAFKA-b361f19f-5c09-44a7-8e8c-32f62eb4a9b1, elapse(s)=0
1 Answer

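Yes. For this problem you need to look at the stack trace printed in be.out. That said, on version 2.0.4 this may well have been fixed long ago. I recommend upgrading to the latest 2.1 release.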