Description
- Cause: the FDB pod reported the errors below, and the service stopped running normally.
Error determining public address.
SIGSEGV: segmentation violation
PC=0x7f2951e1c13a m=3 sigcode=1 addr=0x40
signal arrived during cgo execution
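Judging by this error, newer versions appear to have the same problem: https://github.com/apple/foundationdb/issues/11222
- The plan was to re-apply the FDB configuration, to rule out the case where FDB occasionally misbehaves after running for a long time. (According to the ticket description, this seemed to be a workable fix.)
After re-applying, the problem got worse: the pods were given new names, and so were the PVCs and PVs. As a result, the correct data for the compute-storage decoupled deployment could no longer be mounted, and FE/MS could not start normally either.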
Questions
Question 01. Can the data be exported from the old PVs and then imported back into FDB?
wdxxl@mac-air ~ % kubectl --context=sandbox get pv |grep doris |grep test-cluster
pvc-02df6150-03e1-4397-bb04-31a27e9f9522 15Gi RWO Retain Released doris/test-cluster-log-37687-data ebs-sc-retain <unset> 3h11m
pvc-0d87a9fd-a142-4dfc-a52c-f91211a5b76c 15Gi RWO Retain Released doris/test-cluster-log-98885-data ebs-sc-retain <unset> 31d
pvc-3ac75272-29d3-4cd0-acee-ee8bb0e3bf4f 15Gi RWO Retain Bound doris/test-cluster-log-22410-data ebs-sc-retain <unset> 145m
pvc-3b25e676-00bc-46b9-813b-c5adf0e051de 15Gi RWO Retain Released doris/test-cluster-storage-86505-data ebs-sc-retain <unset> 31d
pvc-4a0cf236-8028-4b57-81f2-c8ce553082d4 15Gi RWO Retain Released doris/test-cluster-storage-85646-data ebs-sc-retain <unset> 3h11m
pvc-56e7c475-0bde-4e86-9895-54f11b0b6525 15Gi RWO Retain Bound doris/test-cluster-log-54266-data ebs-sc-retain <unset> 145m
pvc-75fd25cb-a95a-4861-a383-fabb90e510b4 15Gi RWO Retain Released doris/test-cluster-log-86411-data ebs-sc-retain <unset> 3h11m
pvc-92780718-85cd-437e-b934-1066e5d20c19 15Gi RWO Retain Bound doris/test-cluster-storage-56816-data ebs-sc-retain <unset> 145m
pvc-bd481f05-671a-45fc-ba42-f9640eae6f12 15Gi RWO Retain Released doris/test-cluster-storage-81691-data ebs-sc-retain <unset> 3h11m
pvc-fa73fc1b-a609-498d-94dd-699bffd8f550 15Gi RWO Retain Released doris/test-cluster-storage-11661-data ebs-sc-retain <unset> 31d
pvc-ff392d40-dc42-419b-82e9-db65642ead8a 15Gi RWO Retain Bound doris/test-cluster-storage-32441-data ebs-sc-retain <unset> 145m
pvc-ffe84721-b870-404b-a3b7-3b2e89b776dc 15Gi RWO Retain Released doris/test-cluster-log-63196-data ebs-sc-retain <unset> 31d
wdxxl@mac-air ~ %
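To narrow down which old volumes are candidates for recovery, the listing above can be grouped by status and age. The following is only a rough sketch: `parse_pv_lines` and `released_by_age` are hypothetical helper names, and the parser assumes every row has exactly the nine whitespace-separated columns shown in the listing.

```python
from collections import defaultdict

# Four rows copied verbatim from the `kubectl get pv` listing above.
SAMPLE = """\
pvc-02df6150-03e1-4397-bb04-31a27e9f9522 15Gi RWO Retain Released doris/test-cluster-log-37687-data ebs-sc-retain <unset> 3h11m
pvc-0d87a9fd-a142-4dfc-a52c-f91211a5b76c 15Gi RWO Retain Released doris/test-cluster-log-98885-data ebs-sc-retain <unset> 31d
pvc-3ac75272-29d3-4cd0-acee-ee8bb0e3bf4f 15Gi RWO Retain Bound doris/test-cluster-log-22410-data ebs-sc-retain <unset> 145m
pvc-3b25e676-00bc-46b9-813b-c5adf0e051de 15Gi RWO Retain Released doris/test-cluster-storage-86505-data ebs-sc-retain <unset> 31d
"""


def parse_pv_lines(text):
    """Parse header-less `kubectl get pv` rows into dicts.

    Assumes the nine-column layout seen in the listing:
    name, capacity, access mode, reclaim policy, status,
    claim, storage class, volume attributes class, age.
    """
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 9 or not fields[0].startswith("pvc-"):
            continue  # skip shell prompts and anything that is not a PV row
        rows.append({
            "pv": fields[0],
            "status": fields[4],
            "claim": fields[5],
            "age": fields[8],
        })
    return rows


def released_by_age(rows):
    """Group the former claims of Released PVs by the AGE column."""
    groups = defaultdict(list)
    for row in rows:
        if row["status"] == "Released":
            groups[row["age"]].append(row["claim"])
    return dict(groups)


if __name__ == "__main__":
    for age, claims in released_by_age(parse_pv_lines(SAMPLE)).items():
        print(age, claims)
```

Run against the full listing, this would separate the Released volumes into the `31d` and `3h11m` generations. With the `Retain` reclaim policy a Released PV still holds its data, but it stays tied to its old claim; whether FDB can pick up the old data directories once such a volume is mounted again is a separate question that this sketch does not answer.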
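Question 02. Can FDB pods and PVs be named with auto-incrementing IDs, the way a StatefulSet does, to make later operations easier?
In the FoundationDB deployment, both the pod names and the PV claims are derived from the process name. Can this be changed to auto-incrementing IDs, as in a StatefulSet?
Because the service was down, I wanted to restart and reconfigure it, and found that the process-ID-based naming makes this cumbersome.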
wdxxl@mac-air ~ % kubectl --context=sandbox get pod -n doris
NAME READY STATUS RESTARTS AGE
test-cluster-cluster-controller-48002 2/2 Running 0 6m43s
test-cluster-log-22410 2/2 Running 0 6m43s
test-cluster-log-54266 2/2 Running 0 6m43s
test-cluster-storage-32441 2/2 Running 0 6m43s
test-cluster-storage-56816 2/2 Running 0 6m43s
wdxxl@mac-air ~ % kubectl --context=sandbox get pvc -n doris
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
test-cluster-log-22410-data Bound pvc-3ac75272-29d3-4cd0-acee-ee8bb0e3bf4f 15Gi RWO ebs-sc-retain <unset> 7m14s
test-cluster-log-54266-data Bound pvc-56e7c475-0bde-4e86-9895-54f11b0b6525 15Gi RWO ebs-sc-retain <unset> 7m14s
test-cluster-storage-32441-data Bound pvc-ff392d40-dc42-419b-82e9-db65642ead8a 15Gi RWO ebs-sc-retain <unset> 7m14s
test-cluster-storage-56816-data Bound pvc-92780718-85cd-437e-b934-1066e5d20c19 15Gi RWO ebs-sc-retain <unset> 7m14s
wdxxl@mac-air ~ %
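After deleting and restarting, the process IDs change, so the old disks can no longer be matched to the right pods. This in turn breaks the startup checks of FE and MS.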
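The mismatch can be made concrete with a small sketch. It assumes, based only on the listings above, that a PVC is named after its pod plus a `-data` suffix and that the pod name ends in a process class (`log`, `storage`) followed by a numeric ID. `split_name` and `orphaned_claims` are hypothetical helper names.

```python
import re

# Pattern inferred from the names in the listings above; other process
# classes (for example cluster-controller) are deliberately not handled here.
NAME_RE = re.compile(r"^(?P<cluster>.+)-(?P<cls>log|storage)-(?P<id>\d+)(?:-data)?$")


def split_name(name):
    """Return (process class, numeric ID) for a pod or PVC name, or None."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return match.group("cls"), match.group("id")


def orphaned_claims(old_claims, current_pods):
    """Map each process class to the old claim IDs that no current pod uses."""
    live = {split_name(pod) for pod in current_pods}
    orphans = {}
    for claim in old_claims:
        key = split_name(claim)
        if key is not None and key not in live:
            orphans.setdefault(key[0], []).append(key[1])
    return orphans


if __name__ == "__main__":
    old = ["test-cluster-log-98885-data", "test-cluster-storage-86505-data"]
    pods = ["test-cluster-log-22410", "test-cluster-storage-32441"]
    print(orphaned_claims(old, pods))
```

With the names from this report, every old claim ends up orphaned, because none of the regenerated IDs coincide with the previous ones. With ordinal names such as a StatefulSet's `-0`, `-1`, the recreated pods would request the same claim names again.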
Question 03 - Downstream services no longer start either
- FDB went down; after the restart the disks could not be mounted, and the metadata was lost.
- MS error
RuntimeLogger W20250523 07:39:03.721222 176 resource_manager.cpp:220] failed to check instance instance_id=1814801713, code=KeyNotFound, info=failed to get instance, instance_id=1814801713 err=KeyNotFound
- FE error
RuntimeLogger 2025-05-23 07:41:34,183 WARN (main|1) [CloudEnv.getLocalTypeFromMetaService():169] failed to get cloud cluster due to incomplete response, cloud_unique_id=1:1814801713:fe, clusterId=RESERVED_CLUSTER_ID_FOR_SQL_SERVER, response=status {
code: INVALID_ARGUMENT
msg: "empty instance_id"
}
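My suspicion is that FE or MS depends on data it must look up in FDB, such as the instance_id=1814801713 in the errors above. Once that record can no longer be found, the services simply fail to start.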
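One quick way to check that both services are complaining about the same instance is to pull the IDs out of the two log excerpts. This is a minimal sketch: `instance_ids` is a hypothetical helper, and treating the middle field of `cloud_unique_id` as the instance ID is an assumption based solely on the value `1:1814801713:fe` seen in the FE log.

```python
import re

MS_LOG = (
    "RuntimeLogger W20250523 07:39:03.721222 176 resource_manager.cpp:220] "
    "failed to check instance instance_id=1814801713, code=KeyNotFound, "
    "info=failed to get instance, instance_id=1814801713 err=KeyNotFound"
)
FE_LOG = (
    "RuntimeLogger 2025-05-23 07:41:34,183 WARN (main|1) "
    "[CloudEnv.getLocalTypeFromMetaService():169] failed to get cloud cluster "
    "due to incomplete response, cloud_unique_id=1:1814801713:fe, "
    "clusterId=RESERVED_CLUSTER_ID_FOR_SQL_SERVER"
)


def instance_ids(log_text):
    """Collect every instance ID mentioned in a log excerpt."""
    ids = set(re.findall(r"instance_id=(\d+)", log_text))
    # Assumption: cloud_unique_id has the shape <n>:<instance_id>:<name>.
    ids.update(re.findall(r"cloud_unique_id=\d+:(\d+):", log_text))
    return ids


if __name__ == "__main__":
    print(instance_ids(MS_LOG) == instance_ids(FE_LOG))
```

If both sets contain the same single ID, that is consistent with the suspicion above: FE sends an instance ID that MS can no longer find in FDB after the metadata was lost.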