BE coredump caused by get_tablet returning a null pointer


The version is 2.1.5. The call stack at the time of the core dump is as follows:

*** Query id: 0-0 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1772767549 (unix time) try "date -d @1772767549" if you are using GNU date ***
*** Current BE git commitID: 654acde ***
*** SIGSEGV address not mapped to object (@0x40) received by PID 79914 (TID 81118 OR 0x7fc5b7fc8700) from PID 64; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-x86-release/be/src/common/signal_handler.h:421
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/amd64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/jdk64/current/jre/lib/amd64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /usr/jdk64/current/jre/lib/amd64/server/libjvm.so
 4# 0x00007FC86CD24400 in /lib64/libc.so.6
 5# doris::create_tablet_callback(doris::StorageEngine&, doris::TAgentTaskRequest const&) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-x86-release/be/src/agent/task_worker_pool.cpp:1398
 6# std::_Function_handler<void (), doris::TaskWorkerPool::submit_task(doris::TAgentTaskRequest const&)::$_0::operator()<doris::TAgentTaskRequest const&>(doris::TAgentTaskRequest const&) const::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
 7# doris::ThreadPool::dispatch_thread() in /usr/local/doris-be/lib/doris_be
 8# doris::Thread::supervise_thread(void*) at /home/jenkins_agent/workspace/BigDataComponent_doris-unified-x86-release/be/src/util/thread.cpp:499
 9# start_thread in /lib64/libpthread.so.0
10# clone in /lib64/libc.so.6

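Here is my own analysis:
The direct cause is easy to identify: a null pointer dereference. The fault address in the trace (@0x40) is consistent with reading a member at a small offset from a null pointer.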


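The crash happens in the create_tablet_callback callback, which is triggered when FE sends a task of type TTaskType.CREATE to BE.
As far as I can tell, there are two situations in which get_tablet returns nullptr:

  • Another task running drop_tablet_callback deleted the tablet before get_tablet was called (was drop_tablet_callback executed in between?)
  • The tablet is in an unusable state (!tablet->is_used()), which breaks down into two cases:
    • The tablet is already marked bad (_is_bad == true)
    • The data directory hit an IO_ERROR during its health check (DataDir::health_check)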
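
To make the two cases above concrete, here is a minimal sketch of the failure mode and of a guarded lookup. It is not the Doris source: Tablet, TabletManager, the dir_healthy field and report_created_tablet are hypothetical stand-ins that I made up for illustration. The only behavior taken from the analysis above is that get_tablet returns nullptr when the tablet has been dropped or is not usable.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for illustration only; these are NOT the Doris classes.
struct Tablet {
    int64_t tablet_id = 0;
    bool is_bad = false;      // models the "_is_bad == true" case
    bool dir_healthy = true;  // models a data dir that failed its health check
    bool is_used() const { return !is_bad && dir_healthy; }
};

using TabletSharedPtr = std::shared_ptr<Tablet>;

class TabletManager {
public:
    void add_tablet(TabletSharedPtr tablet) {
        int64_t id = tablet->tablet_id;
        _tablets[id] = std::move(tablet);
    }

    // Models a concurrent drop task removing the tablet.
    void drop_tablet(int64_t tablet_id) { _tablets.erase(tablet_id); }

    // Returns nullptr when the tablet is missing or not usable.
    TabletSharedPtr get_tablet(int64_t tablet_id) const {
        auto it = _tablets.find(tablet_id);
        if (it == _tablets.end()) return nullptr;    // dropped in between
        if (!it->second->is_used()) return nullptr;  // bad tablet or unhealthy dir
        return it->second;
    }

private:
    std::unordered_map<int64_t, TabletSharedPtr> _tablets;
};

// Guarded lookup: check the pointer before touching any member, so a missing
// tablet becomes an error message instead of a SIGSEGV.
std::string report_created_tablet(const TabletManager& mgr, int64_t tablet_id) {
    TabletSharedPtr tablet = mgr.get_tablet(tablet_id);
    if (tablet == nullptr) {
        return "tablet " + std::to_string(tablet_id) + " not found or unusable";
    }
    return "tablet " + std::to_string(tablet->tablet_id) + " ok";
}
```

In this sketch, calling report_created_tablet after drop_tablet, or after setting is_bad to true, returns the error string rather than crashing. Whether the real callback should fail the task, retry, or do something else in that situation is a separate question that this sketch does not answer.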

Has anyone run into a similar problem?
