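Doris version: 2.1.11
Cluster configuration: 3 BEs and 3 FEs co-deployed on 3 machines, each with 112 cores and 128 GB of memory; each server has a single SSD rated at 100K IOPS
Business scenario: Flink CDC syncs data into this cluster and creates primary-key tables automatically. The primary key and the hash (bucketing) column are both int-type IDs. Reports query the data in Doris, and the largest table in use holds only about 10 GB
Symptom: during report queries, load spikes on only one BE node while the other two stay very low, which slows down queries overall. P99 peaks at over 100 seconds; before the problem appeared it was normally under 10 s
Temporary mitigation: reduced the Flink CDC write frequency from every 5 s to every 30 s. With the problem BE stopped, report queries return within seconds; as soon as that BE is started again, queries become slow again
Log from the problem BE: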
W20251015 15:03:02.219172 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f74a
W20251015 15:03:02.218962 509224 fragment_mgr.cpp:628] report error status: to coordinator: TNetworkAddress(hostname=doris2, port=9020), query id: 7f771f94b1b4da6-830288f93e90f691, instance id: 0-0
W20251015 15:03:02.219291 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f74b
W20251015 15:03:02.219221 509194 fragment_mgr.cpp:628] report error status: to coordinator: TNetworkAddress(hostname=doris2, port=9020), query id: 7f771f94b1b4da6-830288f93e90f691, instance id: 0-0
W20251015 15:03:02.219475 510873 fragment_mgr.cpp:1297] Could not find the query id:7f771f94b1b4da6-830288f93e90f691 fragment id:3 to cancel
W20251015 15:03:02.219502 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f74c
W20251015 15:03:02.220273 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f768
W20251015 15:03:02.220278 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f769
W20251015 15:03:02.220338 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f76a
W20251015 15:03:02.220371 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f76b
W20251015 15:03:02.220377 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f76c
W20251015 15:03:02.220393 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f76d
W20251015 15:03:02.220415 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f76e
W20251015 15:03:02.220420 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f76f
W20251015 15:03:02.220526 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f770
W20251015 15:03:02.220539 510877 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: 7f771f94b1b4da6-830288f93e90f771
W20251015 15:03:02.661422 510861 fragment_mgr.cpp:1297] Could not find the query id:327104ecefe24ddb-ab36b7a438c65e6d fragment id:1 to cancel
W20251015 15:03:02.661945 510910 fragment_mgr.cpp:1297] Could not find the query id:327104ecefe24ddb-ab36b7a438c65e6d fragment id:0 to cancel
E20251015 15:03:05.932909 511609 task_group_inl.h:91] _rq is full, capacity=4096
W20251015 15:03:10.818768 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!
W20251015 15:03:13.469141 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!
W20251015 15:03:17.128377 511487 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.220:8060
W20251015 15:03:17.900588 511575 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.220:8060
W20251015 15:03:20.767459 511433 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.220:8060
W20251015 15:03:20.768615 511624 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.220:8060
W20251015 15:04:22.966709 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!
W20251015 15:13:53.997448 511459 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.221:8060
W20251015 15:13:54.231604 511574 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.221:8060
W20251015 15:13:54.233968 511499 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.221:8060
W20251015 15:13:54.817966 511505 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.221:8060
W20251015 15:15:26.536909 510748 fragment_mgr.cpp:1297] Could not find the query id:b374f8fe3dbf4d27-b3b2907af7bb5f8a fragment id:1 to cancel
W20251015 15:15:26.537175 510742 fragment_mgr.cpp:1297] Could not find the query id:b374f8fe3dbf4d27-b3b2907af7bb5f8a fragment id:6 to cancel
W20251015 15:15:26.539348 510754 fragment_mgr.cpp:1297] Could not find the query id:b374f8fe3dbf4d27-b3b2907af7bb5f8a fragment id:0 to cancel
W20251015 15:16:49.734885 511443 ref_count_closure.h:119] RPC meet error status: [INVALID_ARGUMENT]PStatus: (doris1)[INVALID_ARGUMENT]query-id: 9e453653f55e4a2c-adb3215ec67a2482
W20251015 15:17:24.731065 510525 fragment_mgr.cpp:1297] Could not find the query id:c075dbeecb524340-be554cd86a6c2a10 fragment id:1 to cancel
W20251015 15:17:24.732664 510518 fragment_mgr.cpp:1297] Could not find the query id:c075dbeecb524340-be554cd86a6c2a10 fragment id:2 to cancel
W20251015 15:17:24.733428 510540 fragment_mgr.cpp:1297] Could not find the query id:c075dbeecb524340-be554cd86a6c2a10 fragment id:0 to cancel
W20251015 15:18:07.335014 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!
W20251015 15:18:12.805512 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!
W20251015 15:20:09.854138 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!
W20251015 15:20:12.349632 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!
W20251015 15:40:14.533026 510531 fragment_mgr.cpp:1297] Could not find the query id:cbbb996f6f5b4db6-8ab1d34a7a393207 fragment id:2 to cancel
W20251015 15:40:14.533998 510525 fragment_mgr.cpp:1297] Could not find the query id:cbbb996f6f5b4db6-8ab1d34a7a393207 fragment id:3 to cancel
W20251015 15:40:14.534351 510518 fragment_mgr.cpp:1297] Could not find the query id:cbbb996f6f5b4db6-8ab1d34a7a393207 fragment id:4 to cancel
W20251015 15:40:14.546610 510546 fragment_mgr.cpp:1297] Could not find the query id:cbbb996f6f5b4db6-8ab1d34a7a393207 fragment id:0 to cancel
W20251015 15:40:16.737372 510601 fragment_mgr.cpp:1297] Could not find the query id:78e8cc5a22c24e80-9df8ef5b9713b38c fragment id:0 to cancel
W20251015 15:40:22.916581 511515 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.221:8060
W20251015 15:40:23.006196 511487 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.221:8060
W20251015 15:40:24.019788 511506 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.221:8060
W20251015 15:40:24.023370 511499 ref_count_closure.h:115] RPC meet failed: [E1008]Reached timeout=1000ms @10.10.40.221:8060
W20251015 15:40:25.558025 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!
Log from a low-load node:
W20251015 10:58:54.348802 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7167
W20251015 10:58:54.348814 882286 fragment_mgr.cpp:628] report error status: to coordinator: TNetworkAddress(hostname=doris2, port=9020), query id: f95728d41a8c43b1-938206a5459a629a, instance id: 0-0
W20251015 10:58:54.348800 882289 fragment_mgr.cpp:628] report error status: to coordinator: TNetworkAddress(hostname=doris2, port=9020), query id: f95728d41a8c43b1-938206a5459a629a, instance id: 0-0
W20251015 10:58:54.348838 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7169
W20251015 10:58:54.348873 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a716a
W20251015 10:58:54.348878 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7170
W20251015 10:58:54.348883 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7172
W20251015 10:58:54.348888 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7173
W20251015 10:58:54.348894 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7174
W20251015 10:58:54.348903 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7176
W20251015 10:58:54.348908 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7178
W20251015 10:58:54.348913 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a717a
W20251015 10:58:54.348918 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a717e
W20251015 10:58:54.348923 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7180
W20251015 10:58:54.348943 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7182
W20251015 10:58:54.348959 883681 pipeline_x_fragment_context.cpp:168] PipelineXFragmentContext cancel instance: f95728d41a8c43b1-938206a5459a7184
W20251015 10:58:54.349308 882288 fragment_mgr.cpp:628] report error status: to coordinator: TNetworkAddress(hostname=doris2, port=9020), query id: f95728d41a8c43b1-938206a5459a629a, instance id: 0-0
W20251015 10:58:54.351845 716742 fragment_mgr.cpp:628] report error status: to coordinator: TNetworkAddress(hostname=doris2, port=9020), query id: f95728d41a8c43b1-938206a5459a629a, instance id: 0-0
W20251015 10:58:54.352560 716733 fragment_mgr.cpp:628] report error status: to coordinator: TNetworkAddress(hostname=doris2, port=9020), query id: f95728d41a8c43b1-938206a5459a629a, instance id: 0-0
W20251015 11:13:25.533063 883890 fragment_mgr.cpp:1297] Could not find the query id:71aaa365a1e444f4-8a43f0ec6650fee8 fragment id:3 to cancel
W20251015 11:13:25.533385 883908 fragment_mgr.cpp:1297] Could not find the query id:71aaa365a1e444f4-8a43f0ec6650fee8 fragment id:4 to cancel
W20251015 11:13:25.541409 883957 fragment_mgr.cpp:1297] Could not find the query id:71aaa365a1e444f4-8a43f0ec6650fee8 fragment id:2 to cancel
W20251015 11:13:25.542622 883928 fragment_mgr.cpp:1297] Could not find the query id:71aaa365a1e444f4-8a43f0ec6650fee8 fragment id:0 to cancel
W20251015 11:13:26.496517 883540 fragment_mgr.cpp:1297] Could not find the query id:47a8f4a6695146ab-925610402b1032b3 fragment id:0 to cancel
W20251015 11:13:26.749213 883584 fragment_mgr.cpp:1297] Could not find the query id:d2b01caff22640d5-9a8493504da9b2f5 fragment id:0 to cancel
W20251015 11:54:54.440138 883965 fragment_mgr.cpp:1297] Could not find the query id:132b78bce86644b9-85932cffea54f733 fragment id:3 to cancel
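
To compare the two nodes' logs more easily, the entries can be tallied by log level and source location. The following is only a minimal sketch: it assumes the line layout seen in the logs above (level letter, date, time, thread id, `file:line]`, message), and `tally_glog` and `sample` are hypothetical names made up for illustration, not part of Doris.

```python
import re
from collections import Counter

# Assumed layout, inferred from the logs above:
# <level><yyyymmdd> <hh:mm:ss.micros> <thread id> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^([IWEF])(\d{8}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+) ([\w.]+):(\d+)\] (.*)$"
)

def tally_glog(lines):
    """Count entries per (level, 'file:line'); non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        m = GLOG_LINE.match(line.strip())
        if m:
            level, _date, _time, _tid, src, lineno, _msg = m.groups()
            counts[(level, f"{src}:{lineno}")] += 1
    return counts

# A few lines copied from the problem BE log, plus one wrapped fragment ("0")
# that does not match the layout and is therefore ignored.
sample = [
    "E20251015 15:03:05.932909 511609 task_group_inl.h:91] _rq is full, capacity=4096",
    "W20251015 15:03:10.818768 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!",
    "W20251015 15:03:13.469141 508464 sampler.cpp:194] bvar is busy at sampling for 2 seconds!",
    "0",
]

for (level, src), n in tally_glog(sample).most_common():
    print(level, src, n)
```

Running the same tally over each BE's warning log would show which message types (for example `_rq is full`, `bvar is busy`, or the RPC timeouts) appear only on the overloaded node.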