FoundationDB failure prevents the FE service from running, plus follow-up operations questions

Description

  1. Cause: an FDB Pod reported the following error, and the service stopped running normally.
Error determining public address.
SIGSEGV: segmentation violation
PC=0x7f2951e1c13a m=3 sigcode=1 addr=0x40
signal arrived during cgo execution

Judging by this error, newer versions seem to be affected as well: https://github.com/apple/foundationdb/issues/11222

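  2. I decided to re-apply the FDB configuration, to rule out an occasional fault caused by FDB having run for a long time. (According to the ticket's description, this seemed workable.)
    After re-applying, things got worse: the pods were given new names, and so were the PVCs and PVs. As a result the correct data for the storage-compute separation deployment could no longer be mounted, and FE/MS in turn could not run normally.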

Questions

Question 01. Can the data be exported from the old PVs and then imported back into FDB?

 wdxxl@mac-air ~ % kubectl --context=sandbox get pv |grep doris |grep test-cluster
pvc-02df6150-03e1-4397-bb04-31a27e9f9522   15Gi       RWO            Retain           Released   doris/test-cluster-log-37687-data                                                       ebs-sc-retain   <unset>                          3h11m
pvc-0d87a9fd-a142-4dfc-a52c-f91211a5b76c   15Gi       RWO            Retain           Released   doris/test-cluster-log-98885-data                                                       ebs-sc-retain   <unset>                          31d
pvc-3ac75272-29d3-4cd0-acee-ee8bb0e3bf4f   15Gi       RWO            Retain           Bound      doris/test-cluster-log-22410-data                                                       ebs-sc-retain   <unset>                          145m
pvc-3b25e676-00bc-46b9-813b-c5adf0e051de   15Gi       RWO            Retain           Released   doris/test-cluster-storage-86505-data                                                   ebs-sc-retain   <unset>                          31d
pvc-4a0cf236-8028-4b57-81f2-c8ce553082d4   15Gi       RWO            Retain           Released   doris/test-cluster-storage-85646-data                                                   ebs-sc-retain   <unset>                          3h11m
pvc-56e7c475-0bde-4e86-9895-54f11b0b6525   15Gi       RWO            Retain           Bound      doris/test-cluster-log-54266-data                                                       ebs-sc-retain   <unset>                          145m
pvc-75fd25cb-a95a-4861-a383-fabb90e510b4   15Gi       RWO            Retain           Released   doris/test-cluster-log-86411-data                                                       ebs-sc-retain   <unset>                          3h11m
pvc-92780718-85cd-437e-b934-1066e5d20c19   15Gi       RWO            Retain           Bound      doris/test-cluster-storage-56816-data                                                   ebs-sc-retain   <unset>                          145m
pvc-bd481f05-671a-45fc-ba42-f9640eae6f12   15Gi       RWO            Retain           Released   doris/test-cluster-storage-81691-data                                                   ebs-sc-retain   <unset>                          3h11m
pvc-fa73fc1b-a609-498d-94dd-699bffd8f550   15Gi       RWO            Retain           Released   doris/test-cluster-storage-11661-data                                                   ebs-sc-retain   <unset>                          31d
pvc-ff392d40-dc42-419b-82e9-db65642ead8a   15Gi       RWO            Retain           Bound      doris/test-cluster-storage-32441-data                                                   ebs-sc-retain   <unset>                          145m
pvc-ffe84721-b870-404b-a3b7-3b2e89b776dc   15Gi       RWO            Retain           Released   doris/test-cluster-log-63196-data                                                       ebs-sc-retain   <unset>                          31d
 wdxxl@mac-air ~ % 
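
To make Question 01 concrete, here is a minimal sketch that groups the Released PVs from the listing above by FDB process class, keeping the age so the older (31d) and newer (3h11m) generations can be told apart. It only parses text in the column layout shown above. The sample rows are copied from that listing, and the function name `released_pvs` is a hypothetical helper, not part of kubectl or the FDB operator.

```python
import re
from collections import defaultdict

# Sample rows copied from the `kubectl get pv` listing above (whitespace-separated).
PV_LISTING = """\
pvc-02df6150-03e1-4397-bb04-31a27e9f9522   15Gi   RWO   Retain   Released   doris/test-cluster-log-37687-data       ebs-sc-retain   <unset>   3h11m
pvc-3ac75272-29d3-4cd0-acee-ee8bb0e3bf4f   15Gi   RWO   Retain   Bound      doris/test-cluster-log-22410-data       ebs-sc-retain   <unset>   145m
pvc-3b25e676-00bc-46b9-813b-c5adf0e051de   15Gi   RWO   Retain   Released   doris/test-cluster-storage-86505-data   ebs-sc-retain   <unset>   31d
pvc-4a0cf236-8028-4b57-81f2-c8ce553082d4   15Gi   RWO   Retain   Released   doris/test-cluster-storage-85646-data   ebs-sc-retain   <unset>   3h11m
"""

# Assumption: claim names end in "-<class>-<id>-data", as in the listing above.
CLAIM_RE = re.compile(r"-(log|storage)-(\d+)-data$")


def released_pvs(listing):
    """Return {process_class: [(pv_name, claim, age), ...]} for Released PVs only."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 9:  # skip headers, prompts, and rows with a REASON column
            continue
        name, _cap, _modes, _policy, status, claim, _sc, _vac, age = fields
        if status != "Released":
            continue
        match = CLAIM_RE.search(claim)
        if match is None:
            continue
        groups[match.group(1)].append((name, claim, age))
    return dict(groups)


if __name__ == "__main__":
    for process_class, pvs in sorted(released_pvs(PV_LISTING).items()):
        for name, claim, age in pvs:
            print(process_class, age, name, claim)
```

This only identifies which retained volumes belonged to which process class. Whether their contents can be re-imported into FDB is exactly what the question asks and is not answered by the sketch.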

Question 02. Can FDB Pods and PVs be named with incrementing ordinals, the way a StatefulSet does, to make later operations easier?

https://doris.apache.org/zh-CN/docs/3.0/install/deploy-on-kubernetes/separating-storage-compute/install-fdb

Reference YAML - https://raw.githubusercontent.com/foundationdb/fdb-kubernetes-operator/main/config/samples/cluster.yaml

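In a FoundationDB deployment, the pods and PVs are named after the process name. Can this be changed to incrementing ordinals, as with a StatefulSet?
The service went down and I wanted to restart and reconfigure it, but names based on the process ID made that cumbersome.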

 wdxxl@mac-air ~ % kubectl --context=sandbox get pod -n doris                                           
NAME                                                          READY   STATUS    RESTARTS   AGE
test-cluster-cluster-controller-48002                         2/2     Running   0          6m43s
test-cluster-log-22410                                        2/2     Running   0          6m43s
test-cluster-log-54266                                        2/2     Running   0          6m43s
test-cluster-storage-32441                                    2/2     Running   0          6m43s
test-cluster-storage-56816                                    2/2     Running   0          6m43s
 wdxxl@mac-air ~ % kubectl --context=sandbox get pvc -n doris
NAME                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    VOLUMEATTRIBUTESCLASS   AGE
test-cluster-log-22410-data                     Bound    pvc-3ac75272-29d3-4cd0-acee-ee8bb0e3bf4f   15Gi       RWO            ebs-sc-retain   <unset>                 7m14s
test-cluster-log-54266-data                     Bound    pvc-56e7c475-0bde-4e86-9895-54f11b0b6525   15Gi       RWO            ebs-sc-retain   <unset>                 7m14s
test-cluster-storage-32441-data                 Bound    pvc-ff392d40-dc42-419b-82e9-db65642ead8a   15Gi       RWO            ebs-sc-retain   <unset>                 7m14s
test-cluster-storage-56816-data                 Bound    pvc-92780718-85cd-437e-b934-1066e5d20c19   15Gi       RWO            ebs-sc-retain   <unset>                 7m14s
 wdxxl@mac-air ~ % 

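After deleting and restarting, the process IDs change, so the disks can no longer be matched to the correct pods. This in turn breaks the subsequent FE and MS startup checks.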

Question 03 - The downstream services can no longer start either

  1. FDB went down. After the restart the disks could not be mounted, and the metadata was lost.
  2. MS error:
    RuntimeLogger W20250523 07:39:03.721222 176 resource_manager.cpp:220] failed to check instance instance_id=1814801713, code=KeyNotFound, info=failed to get instance, instance_id=1814801713 err=KeyNotFound
  3. FE error:
    RuntimeLogger 2025-05-23 07:41:34,183 WARN (main|1) [CloudEnv.getLocalTypeFromMetaService():169] failed to get cloud cluster due to incomplete response, cloud_unique_id=1:1814801713:fe, clusterId=RESERVED_CLUSTER_ID_FOR_SQL_SERVER, response=status {
    code: INVALID_ARGUMENT
    msg: "empty instance_id"
    }

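I suspect that FE or MS relies on data it must exchange with FDB. For example, the instance_id=1814801713 in the errors above can no longer be found, so the services simply fail to start.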
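
A small sketch to check that both errors point at the same instance: it extracts the instance ID from the MS log line and from the `cloud_unique_id` in the FE log line. It assumes `cloud_unique_id` has three colon-separated parts with the instance ID in the middle, as the value `1:1814801713:fe` in the log above suggests. The function names are hypothetical helpers.

```python
import re

# Log fragments copied from the MS and FE errors above.
MS_LOG = ("failed to check instance instance_id=1814801713, code=KeyNotFound, "
          "info=failed to get instance, instance_id=1814801713 err=KeyNotFound")
FE_LOG = ("failed to get cloud cluster due to incomplete response, "
          "cloud_unique_id=1:1814801713:fe, clusterId=RESERVED_CLUSTER_ID_FOR_SQL_SERVER")


def instance_id_from_ms(line):
    """Return the first instance_id=<digits> value in an MS log line, or None."""
    match = re.search(r"instance_id=(\d+)", line)
    return match.group(1) if match else None


def instance_id_from_fe(line):
    """Return the middle field of cloud_unique_id in an FE log line, or None."""
    match = re.search(r"cloud_unique_id=([^,\s]+)", line)
    if match is None:
        return None
    # Assumption: the value looks like "<prefix>:<instance_id>:<suffix>".
    parts = match.group(1).split(":")
    return parts[1] if len(parts) == 3 else None


if __name__ == "__main__":
    ms_id = instance_id_from_ms(MS_LOG)
    fe_id = instance_id_from_fe(FE_LOG)
    print("MS instance_id:", ms_id)
    print("FE instance_id:", fe_id)
    print("same instance:", ms_id == fe_id)
```

If the two IDs match, both services are asking for the same instance record, which fits the suspicion that the record was lost from FDB rather than misconfigured on one side.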

Question 04 - Can the metadata be rebuilt from the files in S3 and written back into FDB?
