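Intermittently, the system CPU usage of one BE node rises within a short time to around 40% (occasionally up to 80%), then stays flat at that level for a long time with almost no change. Tasks started during this period take a very long time to run and eventually fail with an error.
Output of perf top -p doris_pid: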
72.47% [kernel] [k] native_queued_spin_lock_slowpath
1.97% libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
1.19% doris_be [.] doris::TabletManager::for_each_tablet(std::function<void (std::shared_ptr<doris::Tablet> c
1.06% libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
1.03% doris_be [.] std::_Function_handler<void (std::shared_ptr<doris::Tablet> const&), doris::TabletManager:
1.03% [kernel] [k] futex_wake
0.89% libc.so.6 [.] __GI___lll_lock_wait
0.83% doris_be [.] doris::LRUCache::lookup(doris::CacheKey const&, unsigned int)
0.81% [kernel] [k] futex_q_lock
0.71% doris_be [.] memcpy
0.63% [kernel] [k] __get_user_nocheck_4
0.55% [kernel] [k] _raw_spin_lock
0.55% doris_be [.] doris::ShardedLRUCache::lookup(doris::CacheKey const&)
0.54% doris_be [.] doris::Tablet::should_skip_compaction(doris::CompactionType, long)
0.54% doris_be [.] doris::vectorized::IndexChannel::_quorum_success(std::unordered_set<long, std::hash<long>,
0.50% doris_be [.] jefree
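To make the split between kernel and user time easier to see, here is a minimal sketch that sums the perf overhead by module. The function name `overhead_by_module` and the `SAMPLE` excerpt are my own illustration (a few lines copied from the output above), not part of any Doris or perf tooling.

```python
import re
from collections import defaultdict

# Matches lines such as "72.47% [kernel] [k] native_queued_spin_lock_slowpath".
# Groups: overhead percentage, module (DSO), symbol type, symbol name.
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+\[(.)\]\s+(.*)$")


def overhead_by_module(perf_text):
    """Sum the sampled overhead per module; lines that do not match are skipped."""
    totals = defaultdict(float)
    for line in perf_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            totals[m.group(2)] += float(m.group(1))
    return dict(totals)


# Hypothetical excerpt: four lines taken from the perf top output above.
SAMPLE = """\
72.47% [kernel] [k] native_queued_spin_lock_slowpath
1.97% libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
1.03% [kernel] [k] futex_wake
0.83% doris_be [.] doris::LRUCache::lookup(doris::CacheKey const&, unsigned int)
"""

if __name__ == "__main__":
    ranked = sorted(overhead_by_module(SAMPLE).items(), key=lambda kv: -kv[1])
    for module, pct in ranked:
        print(f"{module:12s} {pct:6.2f}%")
```

Pasting the full perf output into `SAMPLE` gives the per-module totals; in the capture above, almost all sampled time is in the kernel, in native_queued_spin_lock_slowpath.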
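The BE log on the affected node contains errors like the following (in the localized messages, 主机关闭 is "Host is down" and 连接被对方重设 is "Connection reset by peer"):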
W20260624 09:34:36.150771 673495 brpc_client_cache.h:78] [NETWORK_ERROR]Failed to send brpc, error=主机关闭, error_text=[E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R1][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R2][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R3][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R4][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R5][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R6][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R7][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R8][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R9][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R10][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160, client: 192.168.11.64, latency = 53
W20260624 09:34:36.150776 870520 vtablet_writer.cpp:837] cancel node channel VNodeChannel[1776339357943-10025], load_id=cf48b12d0cb58a9b-213565c9918b359b, txn_id=21867095, node=192.168.11.53:8060, error message: [INTERNAL_ERROR]VNodeChannel[1776339357943-21067122], load_id=cf48b12d0cb58a9b-213565c9918b359b, txn_id=21867095, node=192.168.11.65:8060, open failed, err: [INTERNAL_ERROR]failed to open tablet writer, error=主机关闭, error_text=[E104]Fail to read from Socket{id=8590014160 fd=3365 addr=192.168.11.65:8060:58866} (0x0x7fe9acec9b80): 连接被对方重设 [R1][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R2][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R3][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R4][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R5][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R6][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R7][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R8][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R9][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160 [R10][E112]Not connected to 192.168.11.65:8060 yet, server_id=8590014160, info=VNodeChannel[1776339357943-21067122], load_id=cf48b12d0cb58a9b-213565c9918b359b, txn_id=21867095, node=192.168.11.65:8060, host: 192.168.11.65
W20260624 09:35:05.652308 867211 load_stream_stub.cpp:361] LoadStreamStub load_id=9063b087b1e04b99-969e2901468805bf, src_id=21066410, dst_id=21066410, stream_id=81194 is cancelled because of [CANCELLED]PStatus: insert timeout
W20260624 09:51:49.911350 673423 vtablet_writer.cpp:837] cancel node channel VNodeChannel[1776339357943-21067122], load_id=104c70d34b15a383-b4e22d387f782a0, txn_id=21868238, node=192.168.11.65:8060, error message: [INTERNAL_ERROR]tablet error: [CORRUPTION]Failed to decode value at position 284
0# doris::segment_v2::BinaryPrefixPageDecoder::_read_next_value() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:0
1# doris::segment_v2::BinaryPrefixPageDecoder::seek_at_or_after_value(void const*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/rowset/segment_v2/binary_prefix_page.cpp:0
2# doris::segment_v2::IndexedColumnIterator::seek_at_or_after(void const*, bool*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:501
3# doris::segment_v2::Segment::lookup_row_key(doris::Slice const&, doris::TabletSchema const*, bool, bool, doris::RowLocation*, doris::OlapReaderStatistics*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
4# doris::BaseTablet::lookup_row_key(doris::Slice const&, doris::TabletSchema*, bool, std::vector<std::shared_ptr<doris::Rowset>, std::allocator<std::shared_ptr<doris::Rowset> > > const&, doris::RowLocation*, unsigned int, std::vector<std::unique_ptr<doris::SegmentCacheHandle, std::default_delete<doris::SegmentCacheHandle> >, std::allocator<std::unique_ptr<doris::SegmentCacheHandle, std::default_delete<doris::SegmentCacheHandle> > > >&, std::shared_ptr<doris::Rowset>*, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, doris::OlapReaderStatistics*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/base_tablet.cpp:540
5# doris::BaseTablet::calc_segment_delete_bitmap(std::shared_ptr<doris::Rowset>, std::shared_ptr<doris::segment_v2::Segment> const&, std::vector<std::shared_ptr<doris::Rowset>, std::allocator<std::shared_ptr<doris::Rowset> > > const&, std::shared_ptr<doris::DeleteBitmap>, long, doris::RowsetWriter*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
6# doris::BaseTablet::calc_delete_bitmap(std::shared_ptr<doris::BaseTablet> const&, std::shared_ptr<doris::Rowset>, std::vector<std::shared_ptr<doris::segment_v2::Segment>, std::allocator<std::shared_ptr<doris::segment_v2::Segment> > > const&, std::vector<std::shared_ptr<doris::Rowset>, std::allocator<std::shared_ptr<doris::Rowset> > > const&, std::shared_ptr<doris::DeleteBitmap>, long, doris::CalcDeleteBitmapToken*, doris::RowsetWriter*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/base_tablet.cpp:594
7# doris::BaseBetaRowsetWriter::_generate_delete_bitmap(int) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
8# doris::BaseBetaRowsetWriter::add_segment(unsigned int, doris::SegmentStatistics const&, std::shared_ptr<doris::TabletSchema>) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
9# doris::BetaRowsetWriter::add_segment(unsigned int, doris::SegmentStatistics const&, std::shared_ptr<doris::TabletSchema>) at /home/zcp/repo_center/doris_release/doris/be/src/olap/rowset/beta_rowset_writer.cpp:0
10# doris::SegmentCollectorT<doris::BaseBetaRowsetWriter>::add(unsigned int, doris::SegmentStatistics&, std::shared_ptr<doris::TabletSchema>) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
11# doris::SegmentFlusher::_flush_segment_writer(std::unique_ptr<doris::segment_v2::VerticalSegmentWriter, std::default_delete<doris::segment_v2::VerticalSegmentWriter> >&, std::shared_ptr<doris::TabletSchema>, long*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
12# doris::SegmentFlusher::flush_single_block(doris::vectorized::Block const*, int, long*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
13# doris::SegmentCreator::flush_single_block(doris::vectorized::Block const*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
14# doris::BaseBetaRowsetWriter::flush_memtable(doris::vectorized::Block*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
15# doris::FlushToken::_do_flush_memtable(doris::MemTable*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:506
16# doris::FlushToken::_flush_memtable(std::shared_ptr<doris::MemTable>, int, long) at /home/zcp/repo_center/doris_release/doris/be/src/olap/memtable_flush_executor.cpp:0
17# doris::MemtableFlushTask::run() at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
18# doris::ThreadPool::dispatch_thread() at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:730
19# doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
20# ?
21# ?
, host: 192.168.11.63
Any ideas on how to troubleshoot this?