Abnormal system CPU usage on a BE node

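From time to time, the system CPU usage of one BE node rises sharply within a short period, usually to around 40% and occasionally to 80%, and then stays flat at that level for a long time with almost no change. Tasks submitted during this period take a very long time to run and eventually fail with errors.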

Running perf top -p doris_pid shows:

  72.47%  [kernel]              [k] native_queued_spin_lock_slowpath
   1.97%  libc.so.6             [.] pthread_mutex_lock@@GLIBC_2.2.5
   1.19%  doris_be              [.] doris::TabletManager::for_each_tablet(std::function<void (std::shared_ptr<doris::Tablet> c
   1.06%  libc.so.6             [.] __GI___pthread_mutex_unlock_usercnt
   1.03%  doris_be              [.] std::_Function_handler<void (std::shared_ptr<doris::Tablet> const&), doris::TabletManager:
   1.03%  [kernel]              [k] futex_wake
   0.89%  libc.so.6             [.] __GI___lll_lock_wait
   0.83%  doris_be              [.] doris::LRUCache::lookup(doris::CacheKey const&, unsigned int)
   0.81%  [kernel]              [k] futex_q_lock
   0.71%  doris_be              [.] memcpy
   0.63%  [kernel]              [k] __get_user_nocheck_4
   0.55%  [kernel]              [k] _raw_spin_lock
   0.55%  doris_be              [.] doris::ShardedLRUCache::lookup(doris::CacheKey const&)
   0.54%  doris_be              [.] doris::Tablet::should_skip_compaction(doris::CompactionType, long)
   0.54%  doris_be              [.] doris::vectorized::IndexChannel::_quorum_success(std::unordered_set<long, std::hash<long>,
   0.50%  doris_be              [.] jefree
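
To make the profile above easier to read at a glance, here is a rough Python sketch that sums the sampled percentages per module (kernel, libc, doris_be). The function name share_by_module and the regular expression are my own and are not part of perf; the input is simply the perf top text pasted above, including the lines that perf truncated.

```python
import re
from collections import defaultdict

# The perf top lines from above, pasted verbatim (truncated symbols stay truncated).
PERF_TOP = """\
  72.47%  [kernel]              [k] native_queued_spin_lock_slowpath
   1.97%  libc.so.6             [.] pthread_mutex_lock@@GLIBC_2.2.5
   1.19%  doris_be              [.] doris::TabletManager::for_each_tablet(std::function<void (std::shared_ptr<doris::Tablet> c
   1.06%  libc.so.6             [.] __GI___pthread_mutex_unlock_usercnt
   1.03%  doris_be              [.] std::_Function_handler<void (std::shared_ptr<doris::Tablet> const&), doris::TabletManager:
   1.03%  [kernel]              [k] futex_wake
   0.89%  libc.so.6             [.] __GI___lll_lock_wait
   0.83%  doris_be              [.] doris::LRUCache::lookup(doris::CacheKey const&, unsigned int)
   0.81%  [kernel]              [k] futex_q_lock
   0.71%  doris_be              [.] memcpy
   0.63%  [kernel]              [k] __get_user_nocheck_4
   0.55%  [kernel]              [k] _raw_spin_lock
   0.55%  doris_be              [.] doris::ShardedLRUCache::lookup(doris::CacheKey const&)
   0.54%  doris_be              [.] doris::Tablet::should_skip_compaction(doris::CompactionType, long)
   0.54%  doris_be              [.] doris::vectorized::IndexChannel::_quorum_success(std::unordered_set<long, std::hash<long>,
   0.50%  doris_be              [.] jefree
"""

# percentage, module (DSO), symbol-type marker such as [k] or [.], symbol name
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+\[(.)\]\s+(.*)$")


def share_by_module(text):
    """Sum the sampled percentage per module; lines that do not match are skipped."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            totals[match.group(2)] += float(match.group(1))
    return dict(totals)


if __name__ == "__main__":
    ranked = sorted(share_by_module(PERF_TOP).items(), key=lambda kv: -kv[1])
    for module, pct in ranked:
        print(f"{module:12s} {pct:6.2f}%")
```

On the 16 lines shown, the kernel entries add up to roughly 75%, almost all of it in native_queued_spin_lock_slowpath, while the listed doris_be and libc symbols together stay under 10%. This only covers the top of the profile, so the totals are a lower bound per module.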

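The BE log on the affected node contains errors like the following (the localized strings 主机关闭 and 连接被对方重设 in the log are the system messages for "Host is down" and "Connection reset by peer"):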

W20260624 09:34:36.150771 673495 brpc_client_cache.h:78] [NETWORK_ERROR]Failed to send brpc, error=主机关闭, error_text=[E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R1][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R2][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R3][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R4][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R5][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R6][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R7][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R8][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R9][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R10][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160, client: 192.168.11.64, latency = 53
W20260624 09:34:36.150776 870520 vtablet_writer.cpp:837] cancel node channel VNodeChannel[1776339357943-10025], load_id=cf48b12d0cb58a9b-213565c9918b359b, txn_id=21867095, node=192.168.11.53:8060, error message: [INTERNAL_ERROR]VNodeChannel[1776339357943-21067122], load_id=cf48b12d0cb58a9b-213565c9918b359b, txn_id=21867095, node=192.168.11.65:8060, open failed, err: [INTERNAL_ERROR]failed to open tablet writer, error=主机关闭, error_text=[E104]Fail to read from Socket{id=8590014160 fd=3365 addr=192.168.11.65:8060:58866} (0x0x7fe9acec9b80): 连接被对方重设 [R1][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R2][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R3][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R4][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R5][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R6][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R7][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R8][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R9][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R10][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160, info=VNodeChannel[1776339357943-21067122], load_id=cf48b12d0cb58a9b-213565c9918b359b, txn_id=21867095, node=192.168.11.65:8060, host: 192.168.11.65
W20260624 09:35:05.652308 867211 load_stream_stub.cpp:361] LoadStreamStub load_id=9063b087b1e04b99-969e2901468805bf, src_id=21066410, dst_id=21066410, stream_id=81194 is cancelled because of [CANCELLED]PStatus: insert timeout
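
The brpc warnings above are hard to read because each one repeats the same message for every retry. As a small aid, here is a Python sketch that collapses the [R1] to [R10] chain into a retry count and a tally of error codes. The helper name summarize_retries and the pattern are my own assumptions about the layout seen in these particular log lines, not a brpc API; the sample string is rebuilt from the message text in the first warning.

```python
import re
from collections import Counter

# Matches one retry entry such as "[R3][E112]" in the brpc error text.
RETRY_RE = re.compile(r"\[R(\d+)\]\[E(\d+)\]")

# Message text copied from the first warning above; the retry chain repeats it 10 times.
MSG = "[E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160"
SAMPLE = "error_text=" + MSG + "".join(f" [R{i}]{MSG}" for i in range(1, 11))


def summarize_retries(line):
    """Return (highest retry index, {error code: count}) for one log line."""
    pairs = RETRY_RE.findall(line)
    retries = max((int(idx) for idx, _ in pairs), default=0)
    codes = Counter(int(code) for _, code in pairs)
    return retries, dict(codes)


if __name__ == "__main__":
    retries, codes = summarize_retries(SAMPLE)
    print(f"retries={retries} codes={codes}")
```

For the first warning this reduces to ten retries that all failed with E112 against 192.168.11.65:8060. On Linux, errno 112 is EHOSTDOWN, which matches the "Host is down" text in the log, so the sender never managed to reconnect to that peer during the retry window.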


W20260624 09:51:49.911350 673423 vtablet_writer.cpp:837] cancel node channel VNodeChannel[1776339357943-21067122], load_id=104c70d34b15a383-b4e22d387f782a0, txn_id=21868238, node=192.168.11.65:8060, error message: [INTERNAL_ERROR]tablet error: [CORRUPTION]Failed to decode value at position 284

        0#  doris::segment_v2::BinaryPrefixPageDecoder::_read_next_value() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:0
        1#  doris::segment_v2::BinaryPrefixPageDecoder::seek_at_or_after_value(void const*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/rowset/segment_v2/binary_prefix_page.cpp:0
        2#  doris::segment_v2::IndexedColumnIterator::seek_at_or_after(void const*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:501
        3#  doris::segment_v2::Segment::lookup_row_key(doris::Slice const&, doris::TabletSchema const*, bool, bool, doris::RowLocation*, doris::OlapReaderStatistics*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
        4#  doris::BaseTablet::lookup_row_key(doris::Slice const&, doris::TabletSchema*, bool, std::vector<std::shared_ptr<doris::Rowset>, std::allocator<std::shared_ptr<doris::Rowset> > > const&, doris::RowLocation*, unsigned int, std::vector<std::unique_ptr<doris::SegmentCacheHandle, std::default_delete<doris::SegmentCacheHandle> >, std::allocator<std::unique_ptr<doris::SegmentCacheHandle, std::default_delete<doris::SegmentCacheHandle> > > >&, std::shared_ptr<doris::Rowset>*, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, doris::OlapReaderStatistics*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/base_tablet.cpp:540
        5#  doris::BaseTablet::calc_segment_delete_bitmap(std::shared_ptr<doris::Rowset>, std::shared_ptr<doris::segment_v2::Segment> const&, std::vector<std::shared_ptr<doris::Rowset>, std::allocator<std::shared_ptr<doris::Rowset> > > const&, std::shared_ptr<doris::DeleteBitmap>, long, doris::RowsetWriter*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
        6#  doris::BaseTablet::calc_delete_bitmap(std::shared_ptr<doris::BaseTablet> const&, std::shared_ptr<doris::Rowset>, std::vector<std::shared_ptr<doris::segment_v2::Segment>, std::allocator<std::shared_ptr<doris::segment_v2::Segment> > > const&, std::vector<std::shared_ptr<doris::Rowset>, std::allocator<std::shared_ptr<doris::Rowset> > > const&, std::shared_ptr<doris::DeleteBitmap>, long, doris::CalcDeleteBitmapToken*, doris::RowsetWriter*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/base_tablet.cpp:594
        7#  doris::BaseBetaRowsetWriter::_generate_delete_bitmap(int) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        8#  doris::BaseBetaRowsetWriter::add_segment(unsigned int, doris::SegmentStatistics const&, std::shared_ptr<doris::TabletSchema>) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
        9#  doris::BetaRowsetWriter::add_segment(unsigned int, doris::SegmentStatistics const&, std::shared_ptr<doris::TabletSchema>) at /home/zcp/repo_center/doris_release/doris/be/src/olap/rowset/beta_rowset_writer.cpp:0
        10# doris::SegmentCollectorT<doris::BaseBetaRowsetWriter>::add(unsigned int, doris::SegmentStatistics&, std::shared_ptr<doris::TabletSchema>) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        11# doris::SegmentFlusher::_flush_segment_writer(std::unique_ptr<doris::segment_v2::VerticalSegmentWriter, std::default_delete<doris::segment_v2::VerticalSegmentWriter> >&, std::shared_ptr<doris::TabletSchema>, long*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        12# doris::SegmentFlusher::flush_single_block(doris::vectorized::Block const*, int, long*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        13# doris::SegmentCreator::flush_single_block(doris::vectorized::Block const*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
        14# doris::BaseBetaRowsetWriter::flush_memtable(doris::vectorized::Block*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
        15# doris::FlushToken::_do_flush_memtable(doris::MemTable*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
        16# doris::FlushToken::_flush_memtable(std::shared_ptr<doris::MemTable>, int, long) at /home/zcp/repo_center/doris_release/doris/be/src/olap/memtable_flush_executor.cpp:0
        17# doris::MemtableFlushTask::run() at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        18# doris::ThreadPool::dispatch_thread() at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:730
        19# doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
        20# ?
        21# ?
, host: 192.168.11.63


Any ideas on how to troubleshoot this?
