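Cluster configuration: 66 BE nodes (64 cores, 400 GB memory, 15 TB disk each) and 5 FE nodes (16 cores, 32 GB memory, 500 GB disk each). The CPUs are Phytium (飞腾) and the OS is Kylin V10 SP2.
Doris version: 2.1.4 (upgrading is not possible for now)
Problem description:
When executing INSERT INTO ... SELECT ... statements, some BE nodes occasionally crash.
The target table of the insert uses the duplicate key (detail) model with three replicas. Before every insert, the partition that is about to receive the data is dropped. The table uses the automatic bucketing strategy, and each partition currently has 2-3 buckets.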
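For readers trying to reproduce this, the reload cycle described above can be sketched roughly as follows. This is only an illustration: the helper `build_reload_statements`, the table name, the partition name, and the SELECT text are all hypothetical, and the post does not say how the dropped partition is recreated before the insert, so that step is left out.

```python
def build_reload_statements(table, partition, select_sql):
    """Return the SQL statements for one reload cycle, in execution order.

    All names passed in are placeholders; nothing here is taken from the
    real workload in the post.
    """
    return [
        # Step 1: drop the partition that is about to be reloaded.
        f"ALTER TABLE {table} DROP PARTITION IF EXISTS {partition}",
        # Step 2: load fresh rows with INSERT INTO ... SELECT.
        f"INSERT INTO {table} {select_sql}",
    ]


if __name__ == "__main__":
    for stmt in build_reload_statements(
        "demo_db.demo_tbl", "p20260613", "SELECT * FROM demo_db.src_tbl"
    ):
        print(stmt + ";")
```

If several such cycles run at the same time against the same table, it may be worth noting that in a reply, since the crash is reported as intermittent.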
In the most recent incident three nodes crashed. The corresponding be.out output from each node is shown below.
Node 1
F20260613 17:05:28.752749 2381756 pending_rowset_helper.cpp:43] Check failed: !_pending_rowset_set || (_rowset_id == other._rowset_id && _pending_rowset_set == other._pending_rowset_set) 020000000002d99d4042160673a46718900de78cd8a3c6bc 020000000002d99d4042160673a46718900de78cd8a3c6bc 0xfffc034e19c0 0
*** Check failure stack trace: ***
@ 0xaaac34d7a0e8 google::LogMessage::SendToLog()
@ 0xaaac34d76f30 google::LogMessage::Flush()
@ 0xaaac34d7a7b0 google::LogMessageFatal::~LogMessageFatal()
@ 0xaaac2a6cfb24 doris::PendingRowsetGuard::operator=()
@ 0xaaac2b110b7c doris::TxnManager::commit_txn()
@ 0xaaac2b110554 doris::TxnManager::commit_txn()
@ 0xaaac2b128cb4 doris::RowsetBuilder::commit_txn()
@ 0xaaac2b21c878 doris::LoadStreamWriter::close()
@ 0xaaac2b209124 std::_Function_handler<>::_M_invoke()
@ 0xaaac2b347784 doris::WorkThreadPool<>::work_thread()
@ 0xaaac3732ea7c execute_native_thread_routine
@ 0xfffc07d487ac (unknown)
@ 0xfffc07f95cec (unknown)
@ (nil) (unknown)
*** Query id: 60bef42e04684c99-a2df3f4be96a4dca ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1781341529 (unix time) try "date -d @1781341529" if you are using GNU date ***
*** Current BE git commitid: 6ff0573991 ***
*** SIGABRT unknown detail explain (@0x5dd00245072) received by PID 2379890 (TID 2381756 OR 0xffff763c79890) from PID 2379890; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/signal_handler.h:421
1# 0x0000FFFC080F07C0 in linux-vdso.so.1
2# raise in /usr/lib64/libc.so.6
3# abort in /usr/lib64/libc.so.6
4# google::IsGoogleLoggingInitialized() in /app/doris/be/lib/doris_be
5# google::base::GetLogger(int) in /app/doris/be/lib/doris_be
6# google::LogMessage::SendToLog() in /app/doris/be/lib/doris_be
7# google::LogMessage::Flush() in /app/doris/be/lib/doris_be
8# google::LogMessageFatal::~LogMessageFatal() in /app/doris/be/lib/doris_be
9# doris::PendingRowsetGuard::operator=(doris::PendingRowsetGuard&&) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset/pending_rowset_helper.cpp:42
10# doris::TxnManager::commit_txn(doris::OlapMeta*, long, long, long, doris::UniqueId, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) in /app/doris/be/lib/doris_be
11# doris::TxnManager::commit_txn(long, doris::Tablet const&, long, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/txn_manager.cpp:177
12# doris::RowsetBuilder::commit_txn() at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset_builder.cpp:317
13# doris::LoadStreamWriter::close() at /home/zcp/repo_center/doris_enterprise/doris/be/src/runtime/load_stream_writer.cpp:228
14# std::_Function_handler<void (), doris::TabletStream::close()::$_1>::_M_invoke(std::_Any_data const&) at /usr/local/bin/ldb-toolchain/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
15# doris::WorkThreadPool<false>::work_thread(int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/util/work_thread_pool.hpp:159
16# execute_native_thread_routine in /app/doris/be/lib/doris_be
17# 0x0000FFFC07D487AC in /usr/lib64/libpthread.so.0
18# 0x0000FFFC07F95CEC in /usr/lib64/libc.so.6
Node 2
[warn] evbuffer_file_segment_materialize: mmap(591, 0, 282) failed: No such device
F20260613 17:05:28.751812 1477793 pending_rowset_helper.cpp:43] Check failed: !_pending_rowset_set || (_rowset_id == other._rowset_id && _pending_rowset_set == other._pending_rowset_set) 020000000002d7a29841ecd8a028b98a4d638b3a408b8a0 020000000002d7a29841ecd8a028b98a4d638b3a408b8a0 0xfff8c37410bc 0
*** Check failure stack trace: ***
@ 0xaaac7379a0e8 google::LogMessage::SendToLog()
@ 0xaaac73796f30 google::LogMessage::Flush()
@ 0xaaac7379a7b0 google::LogMessageFatal::~LogMessageFatal()
@ 0xaaac690efb24 doris::PendingRowsetGuard::operator=()
@ 0xaaac69b30b7c doris::TxnManager::commit_txn()
@ 0xaaac69b30554 doris::TxnManager::commit_txn()
@ 0xaaac69b48cb4 doris::RowsetBuilder::commit_txn()
@ 0xaaac69c3c878 doris::LoadStreamWriter::close()
@ 0xaaac69c29124 std::_Function_handler<>::_M_invoke()
@ 0xaaac69d67784 doris::WorkThreadPool<>::work_thread()
@ 0xaaac75d4ea7c execute_native_thread_routine
@ 0xfffd8aef87ac (unknown)
@ 0xfffd8b145cec (unknown)
@ (nil) (unknown)
*** Query id: 60bef42e04684c99-a2df3f4be96a4dca ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1781341529 (unix time) try "date -d @1781341529" if you are using GNU date ***
*** Current BE git commitid: 6ff0573991 ***
*** SIGABRT unknown detail explain (@0x5dd0016847d) received by PID 1475709 (TID 1477793 OR 0xfff8c5ef9890) from PID 1475709; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/signal_handler.h:421
1# 0x0000FFFD8B2A07C0 in linux-vdso.so.1
2# raise in /usr/lib64/libc.so.6
3# abort in /usr/lib64/libc.so.6
4# google::IsGoogleLoggingInitialized() in /app/doris/be/lib/doris_be
5# google::base::GetLogger(int) in /app/doris/be/lib/doris_be
6# google::LogMessage::SendToLog() in /app/doris/be/lib/doris_be
7# google::LogMessage::Flush() in /app/doris/be/lib/doris_be
8# google::LogMessageFatal::~LogMessageFatal() in /app/doris/be/lib/doris_be
9# doris::PendingRowsetGuard::operator=(doris::PendingRowsetGuard&&) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset/pending_rowset_helper.cpp:42
10# doris::TxnManager::commit_txn(doris::OlapMeta*, long, long, long, doris::UniqueId, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) in /app/doris/be/lib/doris_be
11# doris::TxnManager::commit_txn(long, doris::Tablet const&, long, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/txn_manager.cpp:177
12# doris::RowsetBuilder::commit_txn() at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset_builder.cpp:317
13# doris::LoadStreamWriter::close() at /home/zcp/repo_center/doris_enterprise/doris/be/src/runtime/load_stream_writer.cpp:228
14# std::_Function_handler<void (), doris::TabletStream::close()::$_1>::_M_invoke(std::_Any_data const&) at /usr/local/bin/ldb-toolchain/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
15# doris::WorkThreadPool<false>::work_thread(int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/util/work_thread_pool.hpp:159
16# execute_native_thread_routine in /app/doris/be/lib/doris_be
17# 0x0000FFFD8AEF87AC in /usr/lib64/libpthread.so.0
18# 0x0000FFFD8B145CEC in /usr/lib64/libc.so.6
Node 3
[warn] evbuffer_file_segment_materialize: mmap(675, 0, 282) failed: No such device
F20260613 17:05:28.756786 1927044 pending_rowset_helper.cpp:43] Check failed: !_pending_rowset_set || (_rowset_id == other._rowset_id && _pending_rowset_set == other._pending_rowset_set) 020000000002c87e5641b1308bc8aa37577b558ebbf13b8e 020000000002c87e5641b1308bc8aa37577b558ebbf13b8e 0xfffd6d6f23c0 0
*** Check failure stack trace: ***
@ 0xaaacdee0a0e8 google::LogMessage::SendToLog()
@ 0xaaacdee6f300 google::LogMessage::Flush()
@ 0xaaacdee7a7b0 google::LogMessageFatal::~LogMessageFatal()
@ 0xaaacde473fb2 doris::PendingRowsetGuard::operator=()
@ 0xaaacde5180b7 doris::TxnManager::commit_txn()
@ 0xaaacde518055 doris::TxnManager::commit_txn()
@ 0xaaacde5198cb doris::RowsetBuilder::commit_txn()
@ 0xaaacde528c87 doris::LoadStreamWriter::close()
@ 0xaaacde527912 std::_Function_handler<>::_M_invoke()
@ 0xaaacde53b778 doris::WorkThreadPool<>::work_thread()
@ 0xaaacadf139ea execute_native_thread_routine
@ 0xfffd71f387ac (unknown)
@ 0xfffd72185cec (unknown)
@ (nil) (unknown)
*** Query id: 60bef42e04684c99-a2df3f4be96a4dca ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1781341529 (unix time) try "date -d @1781341529" if you are using GNU date ***
*** Current BE git commitid: 6ff0573991 ***
*** SIGABRT unknown detail explain (@0x5dd001d6070) received by PID 1925232 (TID 1927044 OR 0xffff8ecfe9890) from PID 1925232; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/signal_handler.h:421
1# 0x0000FFFD722E07C0 in linux-vdso.so.1
2# raise in /usr/lib64/libc.so.6
3# abort in /usr/lib64/libc.so.6
4# google::IsGoogleLoggingInitialized() in /app/doris/be/lib/doris_be
5# google::base::GetLogger(int) in /app/doris/be/lib/doris_be
6# google::LogMessage::SendToLog() in /app/doris/be/lib/doris_be
7# google::LogMessage::Flush() in /app/doris/be/lib/doris_be
8# google::LogMessageFatal::~LogMessageFatal() in /app/doris/be/lib/doris_be
9# doris::PendingRowsetGuard::operator=(doris::PendingRowsetGuard&&) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset/pending_rowset_helper.cpp:42
10# doris::TxnManager::commit_txn(doris::OlapMeta*, long, long, long, doris::UniqueId, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) in /app/doris/be/lib/doris_be
11# doris::TxnManager::commit_txn(long, doris::Tablet const&, long, doris::PUniqueId const&, std::shared_ptr<doris::Rowset> const&, doris::PendingRowsetGuard, bool) at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/txn_manager.cpp:177
12# doris::RowsetBuilder::commit_txn() at /home/zcp/repo_center/doris_enterprise/doris/be/src/olap/rowset_builder.cpp:317
13# doris::LoadStreamWriter::close() at /home/zcp/repo_center/doris_enterprise/doris/be/src/runtime/load_stream_writer.cpp:228
14# std::_Function_handler<void (), doris::TabletStream::close()::$_1>::_M_invoke(std::_Any_data const&) at /usr/local/bin/ldb-toolchain/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
15# doris::WorkThreadPool<false>::work_thread(int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/util/work_thread_pool.hpp:159
16# execute_native_thread_routine in /app/doris/be/lib/doris_be
17# 0x0000FFFD71F387AC in /usr/lib64/libpthread.so.0
18# 0x0000FFFD72185CEC in /usr/lib64/libc.so.6
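To make the three "Check failed" lines easier to compare, here is a small parsing sketch. It assumes, without confirmation from the Doris source, that the four values printed after the condition are, in order, `_rowset_id`, `other._rowset_id`, `_pending_rowset_set` and `other._pending_rowset_set`. The function name `summarize_check_failure` is made up for this example.

```python
import re

# Matches the glog FATAL line from pending_rowset_helper.cpp and captures the
# four trailing values. The meaning assigned to each value is an assumption.
CHECK_LINE = re.compile(
    r"pending_rowset_helper\.cpp:\d+\] Check failed: .*\)\s+"
    r"(?P<lhs_id>[0-9a-f]+)\s+(?P<rhs_id>[0-9a-f]+)\s+"
    r"(?P<lhs_set>0x[0-9a-f]+|0)\s+(?P<rhs_set>0x[0-9a-f]+|0)\s*$"
)

# The failure line from node 1, copied from the be.out output above.
NODE1_LINE = (
    "F20260613 17:05:28.752749 2381756 pending_rowset_helper.cpp:43] "
    "Check failed: !_pending_rowset_set || (_rowset_id == other._rowset_id "
    "&& _pending_rowset_set == other._pending_rowset_set) "
    "020000000002d99d4042160673a46718900de78cd8a3c6bc "
    "020000000002d99d4042160673a46718900de78cd8a3c6bc 0xfffc034e19c0 0"
)


def summarize_check_failure(line):
    """Return which parts of the failed condition held, or None if no match."""
    m = CHECK_LINE.search(line)
    if m is None:
        return None
    return {
        "same_rowset_id": m["lhs_id"] == m["rhs_id"],
        # int(..., 16) accepts both "0" and "0x..." forms.
        "lhs_set_is_null": int(m["lhs_set"], 16) == 0,
        "rhs_set_is_null": int(m["rhs_set"], 16) == 0,
    }


if __name__ == "__main__":
    print(summarize_check_failure(NODE1_LINE))
```

Under that assumption, the node 1 line shows two identical rowset IDs, a non-null pointer on the left side and a null pointer on the right, which would mean the check failed on the `_pending_rowset_set == other._pending_rowset_set` comparison rather than on the rowset ID comparison. The lines from nodes 2 and 3 have the same shape.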