manager-agent reports an error. Exception message: Agent heartbeat failed too many times


Basic checks all look normal:

[root@lhrdoris /]# netstat -tulnp | grep 8972
tcp6       0      0 :::8972                 :::*                    LISTEN      2663/agent          
[root@lhrdoris /]# ps -ef | grep 2663
root        2663       1  0 Aug12 ?        00:07:46 /soft/manager-agent/lib/agent --config.file /soft/manager-agent/conf/agent.yaml
root      209854  209679  0 08:57 pts/2    00:00:00 grep --color=auto 2663
[root@lhrdoris /]# 
[root@lhrdoris /]# telnet 127.0.0.1 8972
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.

HTTP/1.1 400 Bad Request
Content-Type: text/plain; charset=utf-8
Connection: close

400 Bad RequestConnection closed by foreign host.
[root@lhrdoris /]# 

[root@lhrdoris /]# curl http://127.0.0.1:8972/health
{
    "code": 0,
    "data": "ok"
}[root@lhrdoris /]# 

3 Answers

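I went through this with the user in a remote session, and the problem has been resolved.
Some background:
The user deployed a CentOS container and installed Doris and the Manager services inside it. As part of its periodic heartbeat request, Manager tries to collect host metrics such as CPU load, memory usage, I/O, and network throughput. Manager has not yet been adapted to container environments, so collecting these host metrics produced NaN errors and the heartbeat request could not return normally. A fix is expected in 25.1.1.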
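
To illustrate how a single NaN metric can keep a response from being produced, here is a minimal Python sketch. It is not Manager's or the agent's actual code: the function names, the metric names, and the idea that the NaN comes from a 0/0 division and surfaces during JSON encoding are all assumptions made for illustration.

```python
import json
import math


def cpu_usage_ratio(busy_delta: float, total_delta: float) -> float:
    """Hypothetical metric: share of CPU time spent busy between two samples."""
    if total_delta == 0:
        # Assumption: in a container the counters may not advance, so the
        # ratio is undefined. Model that as NaN.
        return math.nan
    return busy_delta / total_delta


def encode_strict(metrics: dict) -> str:
    """Encode as strict JSON; NaN and Infinity raise ValueError."""
    return json.dumps(metrics, allow_nan=False)


def sanitize(metrics: dict) -> dict:
    """Replace non-finite values with None so the payload stays valid JSON."""
    return {
        name: (value if math.isfinite(value) else None)
        for name, value in metrics.items()
    }


metrics = {"cpu_usage": cpu_usage_ratio(0.0, 0.0), "mem_usage": 0.42}

try:
    encode_strict(metrics)
except ValueError as exc:
    # The whole payload is rejected because of one NaN field.
    print("encoding failed:", exc)

# After sanitizing, the same payload encodes without error.
print(encode_strict(sanitize(metrics)))
```

The point of the sketch is only that one undefined metric can fail the whole response unless non-finite values are filtered out before encoding.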

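  1. Check whether the manager_id and agent_id files exist under the agent/bin directory.
  2. Check manager/webserver/manager_info.log, find the heartbeat log entries, and see whether they contain a specific error message.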

Both the manager_id and agent_id files exist. Running curl http://127.0.0.1:8972/heartbeat here returns 404.

[root@lhrdoris /]# cd /soft/manager-agent/bin
[root@lhrdoris bin]# ll
total 20
-rw-r--r--. 1 root root   24 Aug 12 15:14 agent_id
-rw-r--r--. 1 root root    7 Aug 13 14:02 agent.pid
-rw-r--r--. 1 root root   26 Aug 12 15:14 manager_id
-rwxr-xr-x. 1 root root 3766 Jun 16 10:33 start.sh
-rwxr-xr-x. 1 root root 2666 Jun 16 10:33 stop.sh
[root@lhrdoris bin]# more agent.pid 
285958
[root@lhrdoris bin]# more manager_id 
manager-r1070ht7n6jqw4lq6t
[root@lhrdoris bin]# more agent_id
agent-imb7uhjrhe8mryrjaj
[root@lhrdoris bin]#



2025-08-18 00:01:41.918 [SimpleAsyncTaskExecutor-239717] INFO  com.selectdb.enterprise.manager.service.component.agent.AgentHttpClient - connect to agent 1 heartbeat api, url:http://127.0.0.1:8972/heartbeat
2025-08-18 00:01:41.918 [SimpleAsyncTaskExecutor-239717] INFO  com.selectdb.enterprise.manager.service.component.agent.AgentHttpClient - heartbeat, serverIps: [192.92.0.15]
2025-08-18 00:01:41.918 [SimpleAsyncTaskExecutor-239717] INFO  com.selectdb.enterprise.manager.common.pool.HttpClientPoolManager - Post body is:{"agent_id":"agent-imb7uhjrhe8mryrjaj","credentials":[{"encrypted":false,"password":"","user":"root"}],"manager_id":"manager-r1070ht7n6jqw4lq6t","manager_version":"25.0.0","nodes":[{"cluster_id":1,"deploy_dir":"/usr/local/apache-doris/fe","edit_log_port":9010,"http_port":8030,"id":1,"ip":"127.0.0.1","is_master":true,"jdbc_port":9030,"jdk_version":"jdk17","keep_alive":true,"log_alert_dir":"/usr/local/apache-doris/fe/log","log_alert_keys":[],"log_dir":"/usr/local/apache-doris/fe/log","meta_dir":"/usr/local/apache-doris/fe/doris-meta","module_name":"Fe","priority_networks":"127.0.0.1/32","resource_node_id":1,"role":"Follower","rpc_port":9020,"status":"Running"},{"be_port":9060,"be_role":"mix","brpc_port":8060,"cluster_id":1,"deploy_dir":"/usr/local/apache-doris/be","heartbeat_port":9050,"id":2,"ip":"127.0.0.1","jdk_version":"jdk17","keep_alive":true,"log_alert_dir":"/usr/local/apache-doris/be/log","log_alert_keys":[],"log_dir":"/usr/local/apache-doris/be/log/","module_name":"Be","priority_networks":"127.0.0.1/32","resource_node_id":1,"status":"Running","storage_dir":"/data/doris,medium:SSD","webserver_port":8040}],"resource":{"ips":["192.92.0.15"],"package_md5_sum":"b2e8e56792a30137456853c2968aa676","package_name":"manager-agent-25.0.0-x64-bin.tar.gz","port":8004}}
2025-08-18 00:01:41.929 [SimpleAsyncTaskExecutor-239717] INFO  com.selectdb.enterprise.manager.common.pool.HttpClientPoolManager - execute request result:

2025-08-18 00:01:41.930 [SimpleAsyncTaskExecutor-239717] INFO  com.selectdb.enterprise.manager.service.impl.ResourceNodeServiceImpl - Update cluster 1 status start.
2025-08-18 00:01:41.931 [SimpleAsyncTaskExecutor-239717] INFO  com.selectdb.enterprise.manager.service.impl.ResourceNodeServiceImpl - Update cluster 1 status end.


[root@lhrdoris /]# curl http://127.0.0.1:8972/heartbeat
404 page not
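
One thing that may explain the 404: the manager log above shows the heartbeat being sent as a POST with a JSON body, whereas a bare curl sends a GET. Whether the agent answers a GET on /heartbeat with 404 is an assumption, not something confirmed in this thread. A minimal Python sketch that builds the equivalent POST request (the helper name build_heartbeat_request and the Content-Type header are hypothetical):

```python
import json
import urllib.request


def build_heartbeat_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to the agent's /heartbeat path."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/heartbeat",
        data=body,
        # Assumption: the agent expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder payload; the real one is the JSON after "Post body is:" in the log.
req = build_heartbeat_request("http://127.0.0.1:8972", {"agent_id": "agent-imb7uhjrhe8mryrjaj"})
print(req.get_method(), req.full_url)
```

To actually send it, pass the request to urllib.request.urlopen(req, timeout=5) with the full JSON body from the manager log as the payload, and compare the response with what the manager logs under "execute request result".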