Doris 2.1.4: cluster recovers very slowly after a BE node crashes and restarts

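We run a Doris 2.1.4 cluster with 71 BE nodes on ARM architecture. Recently some BE nodes have been crashing frequently because of a Hive catalog import bug. After a crashed node is brought back up, it is usable for a few minutes (queries run normally). Then a large number of queries start timing out and failing, and about ten minutes in, every query on the whole cluster times out and fails. Roughly 20 minutes after that, everything recovers. be.out shows nothing abnormal; the relevant be.INFO output is below. Could anyone help take a look?
16:57 node restarted
16:59 excerpt from the node's be.INFO: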

W20260408 16:59:44.503226 3043455 task_worker_pool.cpp:1722] Failed to clone tablet[signature=412344806|tablet_id=412344806|error=[INTERNAL_ERROR]unexpected version. tablet version: 321034, expected version: 321040

#0  doris::EngineCloneTask::_set_tablet_info(bool)
#1  doris::EngineCloneTask::_do_clone()
#2  doris::EngineCloneTask::execute()
#3  doris::clone_callback(doris::StorageEngine&, doris::TMasterInfo const&, doris::TAgentTaskRequest const&)::<lambda()>::operator()(doris::TAgentTaskRequest const&) const::{lambda()#1}::M_invoke(std::_Any_data const&)
#4  std::_Function_handler<void (), doris::clone_callback(doris::StorageEngine&, doris::TMasterInfo const&, doris::TAgentTaskRequest const&)::<lambda()> >::_M_invoke(std::_Any_data const&)
#5  doris::ThreadPool::dispatch_thread()
#6  doris::Thread::supervise_thread(void*)
#7  ?
#8  ?

I20260408 16:59:44.506873 3043273 tablet_meta_manager.cpp:188] remove pending publish info, key:ppl_412344806_321035, res:[OK]
I20260408 16:59:44.507738 3043194 tablet.cpp:3088] calc segment delete bitmap tablet: 412344806 rowset: 02000000000000000000000000000000 add: 1, roset_ids to del: 0, cur version: 321035, transaction_id: 271784661, cost: 623(us), total rows: 2
I20260408 16:59:44.507922 3043251 tablet.cpp:3496] [Publish] construct delete bitmap tablet: 412344806, rowset_ids to add: 1, roset_ids to del: 0, cur version: 321035, transaction_id: 271784661, version=321035, num_rows=2, res:[OK], cost: 1448(us)
I20260408 16:59:44.527201 3043273 tablet_meta_manager.cpp:188] remove pending publish info, key:ppl_412344806_321036, res:[OK]
I20260408 16:59:44.530830 3043194 tablet.cpp:3088] calc segment delete bitmap tablet: 412344806 rowset: 02000000000000000000000000000000 add: 2, roset_ids to del: 0, cur version: 321036, transaction_id: 271784664, cost: 688(us), total rows: 155
I20260408 16:59:44.536714 3043251 tablet.cpp:3496] [Publish] construct delete bitmap tablet: 412344806, rowset_ids to add: 2, roset_ids to del: 0, cur version: 321036, transaction_id: 271784664, version=321036, num_rows=155, res:[OK], cost: 1446(us)
I20260408 16:59:44.546798 3043273 tablet_meta_manager.cpp:188] remove pending publish info, key:ppl_412344806_321037, res:[OK]
I20260408 16:59:44.548214 3043194 tablet.cpp:3088] calc segment delete bitmap tablet: 412344806 rowset: 02000000000000000000000000000000 add: 3, roset_ids to del: 0, cur version: 321037, transaction_id: 271784666, cost: 528(us), total rows: 2
I20260408 16:59:44.568701 3043259 tablet.cpp:3496] [Publish] construct delete bitmap tablet: 412344806, rowset_ids to add: 3, roset_ids to del: 0, cur version: 321037, transaction_id: 271784666, version=321037, num_rows=2, res:[OK], cost: 1327(us)
I20260408 16:59:44.577977 3043273 tablet_meta_manager.cpp:188] remove pending publish info, key:ppl_412344806_321038, res:[OK]
I20260408 16:59:44.597724 3043194 tablet.cpp:3088] calc segment delete bitmap tablet: 412344806 rowset: 02000000000000000000000000000000 add: 4, roset_ids to del: 0, cur version: 321038, transaction_id: 271784665, cost: 784(us), total rows: 161
I20260408 16:59:44.598915 3043259 tablet.cpp:3496] [Publish] construct delete bitmap tablet: 412344806, rowset_ids to add: 4, roset_ids to del: 0, cur version: 321038, transaction_id: 271784665, version=321038, num_rows=161, res:[OK], cost: 1486(us)
I20260408 16:59:44.599166 3043259 tablet.cpp:3496] [Publish] construct delete bitmap tablet: 412344806, rowset_ids to add: 5, roset_ids to del: 0, cur version: 321039, transaction_id: 271784517, cost: 558(us), total rows: 10
I20260408 16:59:44.626101 3043273 tablet_meta_manager.cpp:188] remove pending publish info, key:ppl_412344806_321039, res:[OK]
I20260408 16:59:44.626903 3043194 tablet.cpp:3088] calc segment delete bitmap tablet: 412344806 rowset: 02000000000000000000000000000000 add: 5, roset_ids to del: 0, cur version: 321039, transaction_id: 271784517, cost: 412(us)
I20260408 16:59:44.629722 3043272 tablet_meta_manager.cpp:188] remove pending publish info, key:ppl_412344806_321040, res:[OK]
I20260408 16:59:44.629735 3043272 tablet.cpp:3496] [Publish] construct delete bitmap tablet: 412344806, roset_ids to add: 5, roset_ids to del: 0, cur version: 321039, transaction_id: 271784517, version=321039, num_rows=10, res:[OK], cost: 1402(us)
I20260408 16:59:44.656894 3043462 tablet_meta_manager.cpp:1012] find expired transactions for 0 tablets
I20260408 16:59:44.659428 3043273 tablet_meta_manager.cpp:188] remove pending publish info, key:ppl_412344806_321040, res:[OK]
I20260408 16:59:44.659459 3043194 tablet.cpp:3088] calc segment delete bitmap tablet: 412344806 rowset: 02000000000000000000000000000000 add: 6, roset_ids to del: 0, cur version: 321040, transaction_id: 271784694, cost: 841(us), total rows: 131
I20260408 16:59:44.660637 3043241 tablet.cpp:3496] [Publish] construct delete bitmap tablet: 412344806, roset_ids to add: 6, roset_ids to del: 0, cur version: 321040, transaction_id: 271784694, version=321040, num_rows=131, res:[OK]

From this point on, queries gradually start timing out.
17:06 all queries are now timing out and failing

W20260408 17:06:18.000160 3043437 task_worker_pool.cpp:1565] Failed to publish version[signature=27178565][transaction_id=27178565][error_tablets_num=1][error=E-3115]version not continuous for row, tablet_id=412344806, tablet_max_version=321387, txn_version=321389
I20260408 17:06:18.830312 3043928 fragment_mgr.cpp:652] query_id: 6ebda1fc68524845-970158fc5ce2a, coord_addr: TheNetworkAddress(hostname=25.18.0.89, port=9020), total_fragment_num on current host: 5, fe process uid: 1768389755747, query type: SELECT, report audit
I20260408 17:06:18.830429 3043928 fragment_mgr.cpp:693] Query load_id: 6ebda1fc68524845-970158fc5ce2a, use workload group: TGId=1, name=normal, cpu_share=1024, memory_limit=68.01 GiB, enable_memory_overcommit=true, version=0, cpu_hard_limit=-1, scan
read_num=128, max_remote_scan_thread_num=640, min_remote_scan_thread_num=640, spill_low_watermark=80, spill_high_watermark=90, is_shutdown=false, query_num=11, is_pipeline=1, enable_group_soft_limit=1
I20260408 17:06:18.838451 3043920 query_load_memory_tracker.cpp:706] Register query load memory tracker, prepare query_id=6ebda1fc68524845-970158fc5ce2a limit: 0
I20260408 17:06:18.838473 3043920 pipeline_x_fragment_context.cpp:189] PipelineOfFragmentContext::prepare(query_id=6ebda1fc68524845-970158fc5ce2a|fragment_id=1|pthread_id=281446810239028
I20260408 17:06:18.838478 3043928 pipeline_x_fragment_context.cpp:189] PipelineOfFragmentContext::prepare(query_id=6ebda1fc68524845-970158fc5ce2a|fragment_id=0|pthread_id=281446810239028
I20260408 17:06:18.060467 3044628 fragment_mgr.cpp:607] Removing query 6ebda1fc68524845-970158fc5ce2a instance 6ebda1fc68524845-970158fc5ce2a7f, all done? false
I20260408 17:06:18.060914 3044628 fragment_mgr.cpp:607] Removing query 6ebda1fc68524845-970158fc5ce2a instance 6ebda1fc68524845-970158fc5ce2a79, all done? false
I20260408 17:06:18.060915 3044628 fragment_mgr.cpp:607] Removing query 6ebda1fc68524845-970158fc5ce2a instance 6ebda1fc68524845-970158fc5ce2a7a, all done? false
I20260408 17:06:18.060915 3044628 fragment_mgr.cpp:607] Removing query 6ebda1fc68524845-970158fc5ce2a instance 6ebda1fc68524845-970158fc5ce2a7b, all done? false
I20260408 17:06:18.910405 3043435 task_worker_pool.cpp:1524] wait for previous publish version task to be done, transaction_id=271785683
I20260408 17:06:18.932930 3043435 engine_publish_version_task.cpp:229] unique key with merge-on-write version not continuous, missed version=321388, it's transaction_id=-1, current publish version=321397, tablet_id=412344806, transaction_id=271785782
I20260408 17:06:18.938038 3043439 task_worker_pool.cpp:1529] task elapsed 0 seconds since it is inserted to queue, it is timeout
I20260408 17:06:19.000092 3043439 tablet_meta_manager.cpp:178] save pending publish rows, [key=pbl_412344806_321300 msize=12
W20260408 17:06:19.000161 3043439 olap_meta.cpp:495] add pending publish task, version_id=412344806 version=321390 tnx_id=271785628 is_recovery: 0
W20260408 17:06:19.176000 3043439 task_worker_pool.cpp:1565] Failed to publish version[signature=27178528][transaction_id=27178528][error_tablets_num=1][error=E-3115]version not continuous for row, tablet_id=412344806, tablet_max_version=321387, txn_version=321390
I20260408 17:06:19.200376 3043912 tablets_channel.cpp:136] open tablets channel of index -1, tablets_num: 16 timeout(s): 259200
I20260408 17:06:19.280524 3043912 tablets_channel.cpp:164] txn 271785843: TabletsChannel or index 201623691 init senders 1 with incremental off
I20260408 17:06:19.310889 3043579 tablets_channel.cpp:268] close tablets channel: (load_id=5c4db820c51cf782-6d82c7944044df29, index_id=201623691), sender id: 0, backend id: 4870972
I20260408 17:06:19.311306 3043185 vertical_segment_writer.cpp:721] add a single block 126
I20260408 17:06:19.371328 3043185 calc_segment_delete_bitmap.cpp:3088] calc segment delete bitmap, tablet: 412344806, rosette_id: 80, cur_max_version: 321387, transaction_id: 271785843, cost: 56571(us)
I20260408 17:06:19.371380 3043185 beta_rowset_writer.cpp:192] [Newtable Flush] construct delete bitmap tablet: 412344806, rosette_id: 80, cur_max_version: 321387, transaction_id: 271785843, cost: 56564(us), total_rows: 126
I20260408 17:06:19.371727 3043579 reset_builder.cpp:222] submit calc delete bitmap task to executor, tablet_id: 412344806, txn_id: 271785843
I20260408 17:06:19.371846 3043579 tablet.cpp:3107] skip to construct delete bitmap tablet: 412344806, rosette_id=0, cur_max_version=321387, transaction_id: 271785843, total_rows: 126
I20260408 17:06:19.371864 3043579 tablet.cpp:3318] [Before Commit] construct delete bitmap tablet: 412344806, rosette_ids to add: 0, rosete_ids to del: 0, cur_max_version: 321387, transaction_id: 271785843, total_rows: 126
I20260408 17:06:19.371923 3043579 rosette_builder.cpp:2260] Got result of calc delete bitmap task from executor, tablet_id: 412344806, txn_id: 271785843
I20260408 17:06:19.372052 3043579 load_channel.cpp:217] txn 271785843 closed tablets channel 201623691
I20260408 17:06:19.372166 3043579 load_channel.cpp:69] load channel removed load_id=5c4db820c51cf782-6d82c7944044df29, is_high_priority=0, sender_ip=25.18.0.154, index_id: 201623691, total_received_rows: 126, num_rows_filtered: 0
I20260408 17:06:19.383322 3043430 daemo.cpp:216] os physical memory 254.87 GB, process memory used: 22.06 GB, limit 229.38 GB, soft limit 206.44 GB, sys_available memory 208.66 GB, low water mark 1.60 GB, waiting water mark 3.20 GB. Refresh interval memory growth 0
I20260408 17:06:19.393605 3042888 wal_manager.cpp:481] Scheduled(every 10s) WAL info: [/doris/be/data/wal: limit 391527200883 bytes, used 0 bytes, estimated wal bytes 0 bytes, available 391527200883 bytes.][/doris/be/data/wal: limit 402568244428 bytes, used 0 bytes, estimated wal bytes 0 bytes, available 402568244428 Bytes, used 0 Bytes
I20260408 17:06:19.475767 3043448 engine_publish_version_task.cpp:227] unique key with merge-on-write version not continuous, missed version=321388, it's transaction_id=-1, current publish version=321382, tablet_id=412344806, transaction_id=271785843
I20260408 17:06:19.475756 3043699 task_worker_pool.cpp:332] successfully submit task[type=PUBLISH_VERSION][signature=271785843
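
To make the "version not continuous" warnings easier to compare across nodes, here is a minimal sketch that pulls the tablet id, `tablet_max_version`, and `txn_version` out of the `Failed to publish version` lines quoted above. The regex is inferred only from these two log lines and is an assumption about the format, not something taken from the Doris source; `find_publish_gaps` is a hypothetical helper name.

```python
import re

# Field layout assumed from the W-level "Failed to publish version" lines above.
PUBLISH_GAP = re.compile(
    r"tablet_id=(?P<tablet>\d+), "
    r"tablet_max_version=(?P<max>\d+), "
    r"txn_version=(?P<txn>\d+)"
)


def find_publish_gaps(lines):
    """Map tablet_id -> (tablet_max_version, highest blocked txn_version)."""
    gaps = {}
    for line in lines:
        m = PUBLISH_GAP.search(line)
        if m is None:
            continue  # not a "version not continuous" publish failure
        tablet = int(m["tablet"])
        max_v, txn_v = int(m["max"]), int(m["txn"])
        prev = gaps.get(tablet)
        if prev is None or txn_v > prev[1]:
            gaps[tablet] = (max_v, txn_v)
    return gaps


# The two warnings from the 17:06 excerpt, trimmed to the relevant tail.
sample = [
    "W20260408 17:06:18.000160 3043437 task_worker_pool.cpp:1565] Failed to publish "
    "version[signature=27178565][transaction_id=27178565][error_tablets_num=1]"
    "[error=E-3115]version not continuous for row, tablet_id=412344806, "
    "tablet_max_version=321387, txn_version=321389",
    "W20260408 17:06:19.176000 3043439 task_worker_pool.cpp:1565] Failed to publish "
    "version[signature=27178528][transaction_id=27178528][error_tablets_num=1]"
    "[error=E-3115]version not continuous for row, tablet_id=412344806, "
    "tablet_max_version=321387, txn_version=321390",
]

gaps = find_publish_gaps(sample)
for tablet, (max_v, txn_v) in gaps.items():
    # Versions between max_v and txn_v have not been published on this replica.
    print(f"tablet {tablet}: local max {max_v}, blocked up to {txn_v}")
```

In a real run the list would be replaced by the lines of be.INFO. In the excerpt above, the tablet stays at 321387 while publishes for 321389 and 321390 are rejected, which matches the `missed version=321388` message in the same log.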

Error message (error_message in the audit_log table):

errCode = 2, detailMessage = (25.12.0.144)[CANCELLED]failed to send brpc when exchange, error=Host is down, error_text=[E110]Fail to read from Socket(id=3164 fd=597 addr=25.0.4.133:8060:423760x0xfff5e5a10000): Connection timed out [R1][E112]Not connected to 25.0.4.133:8060 yet, server_id=3164 [R2][E112]Not connected to 25.0.4.133:8060 yet, server_id=3164 [R3][E112]Not connected to 25.0.4.133:8060 yet, server_id=3164 [R4][E112]Not connected to 25.0.4.133:8060 yet, server_id=3164 [R5][E112]Not connected to 25.0.4.133:8060 yet