When executing an insert into select statement, the BE occasionally fails to connect to the FE: [THRIFT_RPC_ERROR]Couldn't open transport for 10.1.0.23:9020 (open() timed out)

Caused by: java.sql.SQLException: errCode = 2, detailMessage = (10.1.0.26)[INTERNAL_ERROR]query_id: 7ee752fd37824e0a-8c5d6ca2af647e21, couldn't get a client for TNetworkAddress(hostname=10.1.0.23, port=9020), reason is [THRIFT_RPC_ERROR]Couldn't open transport for 10.1.0.23:9020 (open() timed out)

	0#  doris::ThriftClientImpl::open() at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
	1#  doris::ThriftClientImpl::open_with_retry(int, int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/status.h:357
	2#  doris::ClientCacheHelper::_create_client(doris::TNetworkAddress const&, std::function<doris::ThriftClientImpl* (doris::TNetworkAddress const&, void**)>&, void**, int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/status.h:446
	3#  doris::ClientCacheHelper::get_client(doris::TNetworkAddress const&, std::function<doris::ThriftClientImpl* (doris::TNetworkAddress const&, void**)>&, void**, int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/status.h:446
	4#  doris::ClientConnection<doris::FrontendServiceClient>::ClientConnection(doris::ClientCache<doris::FrontendServiceClient>*, doris::TNetworkAddress const&, int, doris::Status*, int) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/status.h:357
	5#  doris::FragmentMgr::coordinator_callback(doris::ReportStatusRequest const&) at /home/zcp/repo_center/doris_enterprise/doris/be/src/common/status.h:446
	6#  doris::FragmentExecState::coordinator_callback(doris::Status const&, doris::RuntimeProfile*, doris::RuntimeProfile*, bool) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:244
	7#  doris::PlanFragmentExecutor::send_report(bool) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:360
	8#  doris::PlanFragmentExecutor::report_profile() at /home/zcp/repo_center/doris_enterprise/doris/be/src/runtime/plan_fragment_executor.cpp:421
	9#  std::_Function_handler<void (), doris::PlanFragmentExecutor::open()::$_0>::_M_invoke(std::_Any_data const&) at /home/zcp/repo_center/doris_enterprise/doris/be/src/runtime/plan_fragment_executor.cpp:256
	10# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_enterprise/doris/be/src/util/threadpool.cpp:0
	11# doris::Thread::supervise_thread(void*) at /var/local/ldb_toolchain/bin/../usr/include/pthread.h:562
	12# ?
	13# __clone

So far I have checked CPU, memory, and I/O. Load is low while the SQL runs, a normal run takes only 40s, and there is no performance bottleneck.

# FE thrift-related configuration
thrift_backlog_num=1024
thrift_client_timeout_ms=0
thrift_server_max_worker_threads=4096
thrift_server_type=THREAD_POOL

# BE thrift-related configuration
thrift_client_open_num_tries=1
thrift_client_retry_interval_ms=1000
thrift_connect_timeout_seconds=3
thrift_rpc_timeout_ms=60000
thrift_server_type_of_fe=THREAD_POOL

Each machine has a 3 TB SSD, 1 TB of memory, and 128 cores.

How should I investigate this further?

1 Answer

In be.conf, set thrift_connect_timeout_seconds=10 and test again. Also, what was the network load at the time of the error?
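Before or alongside changing the timeout, it may help to measure how long a plain TCP connect from the BE host to the FE thrift port actually takes. The following is a minimal sketch, not part of Doris: `probe_connect` and `probe_loop` are hypothetical helper names, and it only tests the TCP handshake, not the thrift protocol itself.

```python
import socket
import time


def probe_connect(host: str, port: int, timeout_s: float):
    """Attempt one TCP connect; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout_s):
            ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start


def probe_loop(host: str, port: int, timeout_s: float, count: int, interval_s: float):
    """Probe repeatedly; return a list of (unix_time, succeeded, elapsed_seconds)."""
    results = []
    for _ in range(count):
        ok, elapsed = probe_connect(host, port, timeout_s)
        results.append((time.time(), ok, elapsed))
        time.sleep(interval_s)
    return results
```

For example, running `probe_loop("10.1.0.23", 9020, 3.0, 600, 1.0)` on the BE host 10.1.0.26 while the insert into select job runs would show whether connects ever approach or exceed the current 3-second limit. If they do, the timestamps can be compared against network and connection-count monitoring.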
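I suspect a problem with the underlying network infrastructure, or too many connections on the local host or the peer. You can continuously monitor the number of network connections on the BE and FE and check whether the count is at a peak when the problem occurs.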
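One way to track the connection count is to sample the kernel's TCP table on each node. The sketch below assumes a Linux host and IPv4 connections listed in `/proc/net/tcp`; `count_states` and `read_proc_net_tcp` are hypothetical helper names, not Doris tools.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text: str, port: int) -> Counter:
    """Count sockets per TCP state whose local or remote port equals `port`."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # Addresses look like "0100007F:233C" (hex IP, then hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        if port in (local_port, remote_port):
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def read_proc_net_tcp(path: str = "/proc/net/tcp") -> str:
    """Read the kernel TCP table (Linux only)."""
    with open(path) as f:
        return f.read()
```

Logging `count_states(read_proc_net_tcp(), 9020)` with a timestamp every few seconds on both the BE and the FE gives a history to compare against the time of each error. A rising SYN_SENT count on the BE around a failure would be consistent with connects stalling before the FE accepts them.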

You can also add me on WeChat via my profile page, and we can look into it together.