Multiple BE nodes crash unexpectedly

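Cluster configuration: 66 BE nodes (64 cores, 400 GB memory, 15 TB disk each) and 5 FE nodes (16 cores, 32 GB memory, 500 GB disk each). The CPUs are Phytium and the OS is Kylin V10 SP2.
Doris version: 2.1.4 (upgrading is not possible for now)
Problem description:
When executing SQL of the form INSERT INTO ... SELECT ..., some BE nodes occasionally crash.
The target table of the insert uses the Duplicate Key model with three replicas. Before each insert, the partition that is about to receive the data is dropped. The table uses automatic bucketing, and each partition currently has 2-3 buckets.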

The most recent incident took down three nodes. The be.out contents on each node are as follows:
Node 1

F20260613 17:05:28.752749 2381756 pending_rowset_helper.cpp:43] Check failed: !_pending_rowset_set || (_rowset_id == other._rowset_id && _pending_rowset_set == other._pending_rowset_set) 020000000002d99d4042160673a46718900de78cd8a3c6bc 020000000002d99d4042160673a46718900de78cd8a3c6bc 0xfffc034e19c0 0
*** Check failure stack trace: ***
    @     0xaaac34d7a0e8  google::LogMessage::SendToLog()
    @     0xaaac34d76f30  google::LogMessage::Flush()
    @     0xaaac34d7a7b0  google::LogMessageFatal::~LogMessageFatal()
    @     0xaaac2a6cfb24  doris::PendingRowsetGuard::operator=()
    @     0xaaac2b110b7c  doris::TxnManager::commit_txn()
    @     0xaaac2b110554  doris::TxnManager::commit_txn()
    @     0xaaac2b128cb4  doris::RowsetBuilder::commit_txn()
    @     0xaaac2b21c878  doris::LoadStreamWriter::close()
    @     0xaaac2b209124  std::_Function_handler<>::_M_invoke()
    @     0xaaac2b347784  doris::WorkThreadPool<>::work_thread()
    @     0xaaac3732ea7c  execute_native_thread_routine
    @     0xfffc07d487ac  (unknown)
    @     0xfffc07f95cec  (unknown)
    @              (nil)  (unknown)
*** Query id: 60bef42e04684c99-a2df3f4be96a4dca ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1781341529 (unix time) try "date -d @1781341529" if you are using GNU date ***
*** Current BE git commitid: 6ff0573991 ***
*** SIGABRT unknown detail explain (@0x5dd00245072) received by PID 2379890 (TID 2381756 OR 0xffff763c79890) from PID 2379890; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/signal_handler.h:421
 1# 0x0000FFFC080F07C0 in linux-vdso.so.1
 2# raise in /usr/lib64/libc.so.6
 3# abort in /usr/lib64/libc.so.6
 4# google::IsGoogleLoggingInitialized() in /app/doris/be/lib/doris_be
 5# google::base::GetLogger(int) in /app/doris/be/lib/doris_be
 6# google::LogMessage::SendToLog() in /app/doris/be/lib/doris_be
 7# google::LogMessage::Flush() in /app/doris/be/lib/doris_be
 8# google::LogMessageFatal::~LogMessageFatal() in /app/doris/be/lib/doris_be
 9# doris::PendingRowsetGuard::operator=(doris::PendingRowsetGuard&&) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset/pending_rowset_helper.cpp:42
10# doris::TxnManager::commit_txn(doris::OlapMeta*, long, long, long, doris::UniqueId, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) in /app/doris/be/lib/doris_be
11# doris::TxnManager::commit_txn(long, doris::Tablet const&, long, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/txn_manager.cpp:177
12# doris::RowsetBuilder::commit_txn() at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset_builder.cpp:317
13# doris::LoadStreamWriter::close() at /home/zcp/repo_center/doris_enterprise/doris/be/src/runtime/load_stream_writer.cpp:228
14# std::_Function_handler<void (), doris::TabletStream::close()::$_1>::_M_invoke(std::_Any_data const&) at /usr/local/bin/ldb-toolchain/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
15# doris::WorkThreadPool<false>::work_thread(int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/util/work_thread_pool.hpp:159
16# execute_native_thread_routine in /app/doris/be/lib/doris_be
17# 0x0000FFFC07D487AC in /usr/lib64/libpthread.so.0
18# 0x0000FFFC07F95CEC in /usr/lib64/libc.so.6
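
As a quick cross-check of the "Aborted at" line, the unix time can be converted to wall-clock time and compared with the glog timestamp on the fatal line. This is a minimal sketch that assumes the BE hosts run in UTC+8; the time zone is not stated in the log, so adjust `ASSUMED_TZ` if the hosts differ.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the BE hosts use UTC+8; the log itself does not say.
ASSUMED_TZ = timezone(timedelta(hours=8))


def abort_time(unix_seconds: int) -> str:
    """Render the 'Aborted at <n> (unix time)' value as an ISO-8601 string."""
    return datetime.fromtimestamp(unix_seconds, tz=ASSUMED_TZ).isoformat()


if __name__ == "__main__":
    # Value copied from the 'Aborted at' line in be.out.
    print(abort_time(1781341529))
```

Under that assumption, the result falls within one second of the `F20260613 17:05:28` fatal line.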

Node 2

[warn] evbuffer_file_segment_materialize: mmap(591, 0, 282) failed: No such device
F20260613 17:05:28.751812 1477793 pending_rowset_helper.cpp:43] Check failed: !_pending_rowset_set || (_rowset_id == other._rowset_id && _pending_rowset_set == other._pending_rowset_set) 020000000002d7a29841ecd8a028b98a4d638b3a408b8a0 020000000002d7a29841ecd8a028b98a4d638b3a408b8a0 0xfff8c37410bc 0
*** Check failure stack trace: ***
    @     0xaaac7379a0e8  google::LogMessage::SendToLog()
    @     0xaaac73796f30  google::LogMessage::Flush()
    @     0xaaac7379a7b0  google::LogMessageFatal::~LogMessageFatal()
    @     0xaaac690efb24  doris::PendingRowsetGuard::operator=()
    @     0xaaac69b30b7c  doris::TxnManager::commit_txn()
    @     0xaaac69b30554  doris::TxnManager::commit_txn()
    @     0xaaac69b48cb4  doris::RowsetBuilder::commit_txn()
    @     0xaaac69c3c878  doris::LoadStreamWriter::close()
    @     0xaaac69c29124  std::_Function_handler<>::_M_invoke()
    @     0xaaac69d67784  doris::WorkThreadPool<>::work_thread()
    @     0xaaac75d4ea7c  execute_native_thread_routine
    @     0xfffd8aef87ac  (unknown)
    @     0xfffd8b145cec  (unknown)
    @              (nil)  (unknown)
*** Query id: 60bef42e04684c99-a2df3f4be96a4dca ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1781341529 (unix time) try "date -d @1781341529" if you are using GNU date ***
*** Current BE git commitid: 6ff0573991 ***
*** SIGABRT unknown detail explain (@0x5dd0016847d) received by PID 1475709 (TID 1477793 OR 0xfff8c5ef9890) from PID 1475709; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/signal_handler.h:421
 1# 0x0000FFFD8B2A07C0 in linux-vdso.so.1
 2# raise in /usr/lib64/libc.so.6
 3# abort in /usr/lib64/libc.so.6
 4# google::IsGoogleLoggingInitialized() in /app/doris/be/lib/doris_be
 5# google::base::GetLogger(int) in /app/doris/be/lib/doris_be
 6# google::LogMessage::SendToLog() in /app/doris/be/lib/doris_be
 7# google::LogMessage::Flush() in /app/doris/be/lib/doris_be
 8# google::LogMessageFatal::~LogMessageFatal() in /app/doris/be/lib/doris_be
 9# doris::PendingRowsetGuard::operator=(doris::PendingRowsetGuard&&) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset/pending_rowset_helper.cpp:42
10# doris::TxnManager::commit_txn(doris::OlapMeta*, long, long, long, doris::UniqueId, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool    in /app/doris/be/lib/doris_be
11# doris::TxnManager::commit_txn(long, doris::Tablet const&, long, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) at /home/zcp    repo_center/doris_enterprise/doris/be/src/olap/txn_manager.cpp:177
12# doris::RowsetBuilder::commit_txn() at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset_builder.cpp:317
13# doris::LoadStreamWriter::close() at /home/zcp/repo_center/doris_enterprise/doris/be/src/runtime/load_stream_writer.cpp:228
14# std::_Function_handler<void (), doris::TabletStream::close()::$_1>::_M_invoke(std::_Any_data const&) at /usr/local/bin/ldb-toolchain/bin/../lib/gcc/aarch64-linux-gnu/1    /../../../../include/c++/11/bits/std_function.h:291
15# doris::WorkThreadPool<false>::work_thread(int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/util/work_thread_pool.hpp:159
16# execute_native_thread_routine in /app/doris/be/lib/doris_be
17# 0x0000FFFD8AEF87AC in /usr/lib64/libpthread.so.0
18# 0x0000FFFD8B145CEC in /usr/lib64/libc.so.6

Node 3

[warn] evbuffer_file_segment_materialize: mmap(675, 0, 282) failed: No such device
F20260613 17:05:28.756786 1927044 pending_rowset_helper.cpp:43] Check failed: !_pending_rowset_set || (_rowset_id == other._rowset_id && _pending_rowset_set == other._pending_rowset_set) 020000000002c87e5641b1308bc8aa37577b558ebbf13b8e 020000000002c87e5641b1308bc8aa37577b558ebbf13b8e 0xfffd6d6f23c0 0
*** Check failure stack trace: ***
    @     0xaaacdee0a0e8  google::LogMessage::SendToLog()
    @     0xaaacdee6f300  google::LogMessage::Flush()
    @     0xaaacdee7a7b0  google::LogMessageFatal::~LogMessageFatal()
    @     0xaaacde473fb2  doris::PendingRowsetGuard::operator=()
    @     0xaaacde5180b7  doris::TxnManager::commit_txn()
    @     0xaaacde518055  doris::TxnManager::commit_txn()
    @     0xaaacde5198cb  doris::RowsetBuilder::commit_txn()
    @     0xaaacde528c87  doris::LoadStreamWriter::close()
    @     0xaaacde527912  std::_Function_handler<>::_M_invoke()
    @     0xaaacde53b778  doris::WorkThreadPool<>::work_thread()
    @     0xaaacadf139ea  execute_native_thread_routine
    @     0xfffd71f387ac  (unknown)
    @     0xfffd72185cec  (unknown)
    @              (nil)  (unknown)
*** Query id: 60bef42e04684c99-a2df3f4be96a4dca ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1781341529 (unix time) try "date -d @1781341529" if you are using GNU date ***
*** Current BE git commitid: 6ff0573991 ***
*** SIGABRT unknown detail explain (@0x5dd001d6070) received by PID 1925232 (TID 1927044 OR 0xffff8ecfe9890) from PID 1925232; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/signal_handler.h:421
 1# 0x0000FFFD722E07C0 in linux-vdso.so.1
 2# raise in /usr/lib64/libc.so.6
 3# abort in /usr/lib64/libc.so.6
 4# google::IsGoogleLoggingInitialized() in /app/doris/be/lib/doris_be
 5# google::base::GetLogger(int) in /app/doris/be/lib/doris_be
 6# google::LogMessage::SendToLog() in /app/doris/be/lib/doris_be
 7# google::LogMessage::Flush() in /app/doris/be/lib/doris_be
 8# google::LogMessageFatal::~LogMessageFatal() in /app/doris/be/lib/doris_be
 9# doris::PendingRowsetGuard::operator=(doris::PendingRowsetGuard&&) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset/pending_rowset_helper.cpp:42
10# doris::TxnManager::commit_txn(doris::OlapMeta*, long, long, long, doris::UniqueId, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool    in /app/doris/be/lib/doris_be
11# doris::TxnManager::commit_txn(long, doris::Tablet const&, long, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) at /home/zcp    repo_center/doris_enterprise/doris/be/src/olap/txn_manager.cpp:177
12# doris::RowsetBuilder::commit_txn() at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset_builder.cpp:317
13# doris::LoadStreamWriter::close() at /home/zcp/repo_center/doris_enterprise/doris/be/src/runtime/load_stream_writer.cpp:228
14# std::_Function_handler<void (), doris::TabletStream::close()::$_1>::_M_invoke(std::_Any_data const&) at /usr/local/bin/ldb-toolchain/bin/../lib/gcc/aarch64-linux-gnu/1    /../../../../include/c++/11/bits/std_function.h:291
15# doris::WorkThreadPool<false>::work_thread(int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/util/work_thread_pool.hpp:159
16# execute_native_thread_routine in /app/doris/be/lib/doris_be
17# 0x0000FFFD71F387AC in /usr/lib64/libpthread.so.0
18# 0x0000FFFD72185CEC in /usr/lib64/libc.so.6
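
To compare the three "Check failed" lines, the values printed after the check expression can be split out with a small script. This is only a sketch: it assumes, from the order of the operands in the check expression, that the four trailing values are `_rowset_id`, `other._rowset_id`, `_pending_rowset_set`, and `other._pending_rowset_set`. That ordering has not been confirmed against the source of this build, and the field names in `FIELDS` are labels chosen here for illustration.

```python
# Assumed order of the four values printed after the check expression.
FIELDS = (
    "rowset_id",
    "other_rowset_id",
    "pending_rowset_set",
    "other_pending_rowset_set",
)


def parse_check_failure(line: str) -> dict:
    """Return the last four whitespace-separated tokens of a 'Check failed' line."""
    tokens = line.split()
    if "Check failed:" not in line or len(tokens) < 4:
        raise ValueError("not a 'Check failed' line with four trailing values")
    return dict(zip(FIELDS, tokens[-4:]))


def summarize(line: str) -> dict:
    """Report which of the two equality comparisons in the check would hold."""
    f = parse_check_failure(line)
    return {
        "same_rowset_id": f["rowset_id"] == f["other_rowset_id"],
        # Pointers are printed as hex ('0x...') or as a bare '0' for null.
        "same_pending_set": int(f["pending_rowset_set"], 16)
        == int(f["other_pending_rowset_set"], 16),
    }


if __name__ == "__main__":
    # Tail of the Node 1 fatal line, copied from be.out.
    node1 = (
        "pending_rowset_helper.cpp:43] Check failed: !_pending_rowset_set || "
        "(_rowset_id == other._rowset_id && "
        "_pending_rowset_set == other._pending_rowset_set) "
        "020000000002d99d4042160673a46718900de78cd8a3c6bc "
        "020000000002d99d4042160673a46718900de78cd8a3c6bc 0xfffc034e19c0 0"
    )
    print(summarize(node1))
```

If the assumed ordering is right, all three nodes show the same pattern: the two rowset IDs are identical, while the first pointer is non-null and the second is 0, so the pointer comparison is what makes the check fail.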