Doris BE nodes crash intermittently around 02:00

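Doris version: 2.1.8
Cluster: 3 FE + 7 BE nodes on virtual machines, each with 16 cores, 32 GB of RAM, and a 10 TB disk. The 3 FE nodes are co-located with 3 of the BE nodes; the other 4 BE nodes are deployed on their own.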

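At first, 1-2 random BE nodes would crash at around 02:00 on a Saturday or Sunday night. Our data collection jobs are scheduled mostly between 01:30 and 03:00. be.out contained no crash information, and the INFO and WARNING logs showed no specific errors either. After we upgraded the cluster to 2.1.11 with doris-manager, 6 BE nodes all crashed and restarted at 02:00. The be.out log is as follows: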
*** Query id: 1c2e81dfe754dc3-aa6a44613e368e0b ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1764612023 (unix time) try "date -d @1764612023" if you are using GNU date ***
*** Current BE git commitID: 97b77e6cda ***
*** SIGSEGV invalid permissions for mapped object (@0x55aedb23afc8) received by PID 4102987 (TID 4105799 OR 0x7f3c04bb6640) from PID 18446744073091133384; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421
1# os::Linux::chained_handler(int, siginfo*, void*) in /data/sdv1/jdk1.8/jre/lib/amd64/server/libjvm.so
2# JVM_handle_linux_signal in /data/sdv1/jdk1.8/jre/lib/amd64/server/libjvm.so
3# signalHandler(int, siginfo*, void*) in /data/sdv1/jdk1.8/jre/lib/amd64/server/libjvm.so
4# 0x00007F401B00C520 in /lib/x86_64-linux-gnu/libc.so.6
5# doris::vectorized::VMergeIteratorContext::copy_rows(doris::vectorized::Block*, bool) at /home/zcp/repo_center/doris_release/doris/be/src/vec/olap/vgeneric_iterators.cpp:148
6# doris::Status doris::vectorized::VMergeIterator::_next_batch(doris::vectorized::Block*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/olap/vgeneric_iterators.h:242
7# doris::vectorized::VMergeIterator::next_batch(doris::vectorized::Block*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/olap/vgeneric_iterators.h:201
8# doris::BetaRowsetReader::next_block(doris::vectorized::Block*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/rowset/beta_rowset_reader.cpp:357
9# doris::vectorized::VCollectIterator::_topn_next(doris::vectorized::Block*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/olap/vcollect_iterator.cpp:292
10# doris::vectorized::VCollectIterator::next(doris::vectorized::Block*) in /data/sdv1/doris/be/lib/doris_be
11# doris::vectorized::BlockReader::_direct_next_block(doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/olap/block_reader.cpp:262
12# doris::vectorized::BlockReader::next_block_with_aggregation(doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/olap/block_reader.cpp:67
13# doris::vectorized::NewOlapScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/new_olap_scanner.cpp:508
14# doris::vectorized::VScanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) in /data/sdv1/doris/be/lib/doris_be
15# doris::vectorized::VScanner::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vscanner.cpp:102
16# doris::vectorized::ScannerScheduler::_scanner_scan(std::shared_ptr, std::shared_ptr) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/scanner_scheduler.cpp:280
17# std::_Function_handler<void (), doris::vectorized::ScannerScheduler::submit(std::shared_ptr, std::shared_ptr)::$_1::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
18# doris::ThreadPool::dispatch_thread() in /data/sdv1/doris/be/lib/doris_be
19# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:499
20# 0x00007F401B05EAC3 in /lib/x86_64-linux-gnu/libc.so.6
21# 0x00007F401B0F08C0 in /lib/x86_64-linux-gnu/libc.so.6
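
The trace header suggests running `date -d @1764612023` to decode the abort time. As a small sketch of the same conversion in C++, the hypothetical helper `format_epoch` below formats a Unix timestamp at a fixed UTC offset. The +8 hour offset in the usage note is an assumption (China Standard Time); the post does not state the cluster's time zone.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>
#include <string>

// Format a Unix timestamp as "YYYY-MM-DD HH:MM:SS" at a fixed UTC offset.
// Hypothetical helper for reading the "Aborted at ..." line; not part of Doris.
std::string format_epoch(std::int64_t epoch_seconds, int utc_offset_hours) {
    assert(utc_offset_hours >= -12 && utc_offset_hours <= 14);
    // Shift the instant by the offset, then break it down as if it were UTC.
    std::time_t shifted =
        static_cast<std::time_t>(epoch_seconds + utc_offset_hours * 3600LL);
    std::tm parts{};
    gmtime_r(&shifted, &parts);  // POSIX, thread-safe variant of gmtime
    char buf[32];
    std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &parts);
    return std::string(buf);
}
```

Under that assumption, `format_epoch(1764612023, 8)` yields `2025-12-02 02:00:23`, which matches the 02:00 crash window described in the question.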

2 Answers

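Is this problem reproducible every time? If it is convenient, add me on WeChat (see my profile page) and we can look into it together. We may need your help capturing a core dump file for the investigation.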

On 2.1.8 we hit a case where several BEs core dumped at the same time. This was on aarch64, with an almost identical call stack, always in VMergeIteratorContext::copy_rows (line 148).

It was triggered by an ORDER BY ... LIMIT ... statement, but we could not reproduce it.

*** Query id: a318954eeb7a43c1-ad34a0cff96fac5c ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1768883311 (unix time) try "date -d @1768883311" if you are using GNU date ***
*** Current BE git commitID: 018ba5371f ***
*** SIGBUS invalid address alignment (@0x100000009) received by PID 2544969 (TID 2551614 OR 0xffdef42b97f0) from PID 9; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/common/signal_handler.h:421
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/aarch64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/jdk64/current/jre/lib/aarch64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/aarch64/server/libjvm.so
 4# 0x0000FFFCFBDE07C0 in linux-vdso.so.1
 5# __aarch64_ldadd4_acq_rel in /usr/local/doris-be/lib/doris_be
 6# doris::vectorized::VMergeIteratorContext::copy_rows(doris::vectorized::Block*, bool) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/vec/olap/vgeneric_iterators.cpp:148
 7# doris::Status doris::vectorized::VMergeIterator::_next_batch<doris::vectorized::Block>(doris::vectorized::Block*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/vec/olap/vgeneric_iterators.h:242
 8# doris::BetaRowsetReader::next_block(doris::vectorized::Block*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/olap/rowset/beta_rowset_reader.cpp:357
 9# doris::vectorized::VCollectIterator::_topn_next(doris::vectorized::Block*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/vec/olap/vcollect_iterator.cpp:292
10# doris::vectorized::BlockReader::_direct_next_block(doris::vectorized::Block*, bool*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/vec/olap/block_reader.cpp:262
11# doris::vectorized::BlockReader::next_block_with_aggregation(doris::vectorized::Block*, bool*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/vec/olap/block_reader.cpp:67
12# doris::vectorized::NewOlapScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/vec/exec/scan/new_olap_scanner.cpp:507
13# doris::vectorized::VScanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/vec/exec/scan/vscanner.cpp:133
14# doris::vectorized::ScannerScheduler::_scanner_scan(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/vec/exec/scan/scanner_scheduler.cpp:280
15# std::_Function_handler<void (), doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_1::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/aarch64-linux-gnu/13/../../../../include/c++/13/bits/std_function.h:290
16# doris::ThreadPool::dispatch_thread() at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/util/threadpool.cpp:552
17# doris::Thread::supervise_thread(void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-arm-release/be/src/util/thread.cpp:499
18# 0x0000FFFCFBA587AC in /lib64/libpthread.so.0
19# 0x0000FFFCFBCA548C in /lib64/libc.so.6
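
One detail in the trace above can be checked mechanically. The signal line reports "invalid address alignment" at 0x100000009, and frame 5 is `__aarch64_ldadd4_acq_rel`, whose name indicates a 4-byte atomic add. Assuming that atomic requires a naturally aligned (4-byte) address, which is my reading and not something the post confirms, the sketch below shows that the reported address does not meet it. `is_aligned` is a hypothetical helper, not Doris code.

```cpp
#include <cassert>
#include <cstdint>

// True when `address` is a multiple of `alignment`.
// `alignment` must be a power of two for the mask trick to be valid.
constexpr bool is_aligned(std::uint64_t address, std::uint64_t alignment) {
    return (address & (alignment - 1)) == 0;
}

// Fault address taken from the SIGBUS line of the aarch64 trace.
constexpr std::uint64_t kFaultAddress = 0x100000009ULL;
```

`is_aligned(kFaultAddress, 4)` is false because the address is odd. That would be consistent with the atomic operating on a corrupted pointer, but this is only an inference from the trace; a core dump would be needed to confirm it.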

My analysis of the possible causes:

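  1. The index i goes out of bounds because _num_columns holds an abnormal negative value. (An optimization is involved here, see https://zhuanlan.zhihu.com/p/1989452704494395614. In this scenario the value should be 2, so this is unlikely to be the problem.)
  2. The column position start goes out of bounds (size_t start = _index_in_block - _cur_batch_num + 1 - advanced). If _index_in_block or _cur_batch_num holds an abnormal value, start becomes negative, which as a size_t is a very large number. (This is the most likely cause, but the code logic looks correct: _index_in_block and _cur_batch_num are updated together.)
  3. The memcpy called inside insert_range_from fails. Could porting x86 SSE code directly to ARM NEON be a problem? On x86, SSE offers unaligned stores such as _mm_storeu_si128 (the aligned form _mm_store_si128 requires a 16-byte-aligned address), so if the NEON instructions on ARM require alignment, this could fail. (However, in my own tests a build without the -mstrict-align compiler option showed no problem.)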
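
To make item 2 concrete, here is a minimal sketch of how the quoted expression wraps when its inputs are inconsistent. `compute_start` and `start_in_range` are hypothetical stand-alone helpers; the parameter names follow the post, but using `size_t` for every operand is an assumption and may not match the actual member types in Doris.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mirrors the expression quoted in item 2:
//   size_t start = _index_in_block - _cur_batch_num + 1 - advanced
// Unsigned arithmetic wraps modulo 2^N, so a "negative" result becomes huge.
std::size_t compute_start(std::size_t index_in_block,
                          std::size_t cur_batch_num,
                          std::size_t advanced) {
    return index_in_block - cur_batch_num + 1 - advanced;
}

// Hypothetical guard: does [start, start + count) fit inside a block that
// holds rows_in_block rows? Written so the check itself cannot overflow.
bool start_in_range(std::size_t start, std::size_t count,
                    std::size_t rows_in_block) {
    return start <= rows_in_block && count <= rows_in_block - start;
}
```

With consistent values, `compute_start(9, 4, 0)` is 6. If the two counters ever disagreed, for example `compute_start(3, 8, 0)`, the result would be `SIZE_MAX - 3`, and a range copy starting there would read far outside the column. A check like `start_in_range` in a debug build is one way such a state could be caught before the copy.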