Doris CBO 导致的BE节点内存规律性波动问题

Viewed 2

腾讯云TChouse-D 基于 Doris-2.1.12
因排查其他业务查询问题,将业务A表同结构新建了一张临时表,并将业务表的数据通过insert into的方式导入到这张临时表里,大约导入1.2亿+数据,导入后此表再无数据写入

此时CBO开始对这张表进行统计,但发现一直处于全量统计的状态
(图中全是这张临时表的CBO统计sql)
screenshot-20260416-101417.png

screenshot-20260416-101744.png

screenshot-20260416-101915.png

这些统计sql导致内存规律性波动,CPU也有部分上涨和规律性波动,且这个现象已持续了好几天。

Show analyze 也没有任何记录,SHOW TABLE STATS 能看到 updateTime和lastAnalyzeTime有十来秒的差异,和每次的慢sql记录吻合,但是没有状态字段看到实际情况,想着去FE或BE查询日志,但是腾讯云的doris日志查询不太好用,没找到相关日志,可能也是人笨

目前怀疑是CBO的自动分析每次虽然都执行了,但是执行了十来秒就结束,感觉像是任务占用资源太多被杀掉了,然后下一次分析时判断还是需要全量分析,就这样循环下去了...

因为本来这张临时表是为了排查业务问题存在的,业务排查完成后就将该表删除了,之后问题消失,CPU和内存回归正常

观察一阵后发现有一张业务表,也有这种现象,但是情况相比那张临时表要好很多
screenshot-20260416-104232.png
虽然扫描的表数量更大,但是占用的内存只有几十到百来M,偶尔几十分钟到一两个小时会出现一次占用内存几千M的情况,所以感觉这张表的情况可能也有点不对

临时表的表结构(聚合模型、按日分区):

CREATE TABLE `ads_common_metric_i_s` (
  `stat_time` datetime NOT NULL COMMENT "统计时间,分区键 yyyy-MM-dd HH:mm:ss",
  `app_id` varchar(32) NOT NULL COMMENT "产品线ID",
  `tenant_id` varchar(32) NOT NULL COMMENT "租户ID",
  `metric_code_version` varchar(64) NOT NULL COMMENT "指标Code和Version 例如 recharge_amt::1 ",
  `metric_agg_id` bigint NOT NULL DEFAULT "0" COMMENT "指标聚合Hash值 ",
  `metric_agg_condition` varchar(3072) NOT NULL DEFAULT "" COMMENT "指标聚合列 例如 channel_id::H5&currency_type::RMB ",
  `metric_search_condition` varchar(3072) NOT NULL DEFAULT "" COMMENT "指标检索列 例如 channel_id::H5&currency_type::RMB ",
  `metric_value` decimal(32,4) SUM NULL DEFAULT "0.00" COMMENT "指标值(如果指标原始值是datetime,转化为时间戳)",
  `distinct_value` bitmap BITMAP_UNION NOT NULL DEFAULT BITMAP_EMPTY COMMENT "bitmap类型值",
  `meta_time` datetime(3) REPLACE NULL DEFAULT "1970-01-01 00:00:00.000" COMMENT "Kafka元数据消息时间 yyyy-MM-dd HH:mm:ss.SSS",
  `doris_time` datetime REPLACE NULL DEFAULT CURRENT_TIMESTAMP COMMENT "doris数据落表时间 yyyy-MM-dd HH:mm:ss",
  INDEX idx_metric_search_condition (`metric_search_condition`) USING INVERTED PROPERTIES("support_phrase" = "true", "ignore_above" = "3072", "parser" = "unicode", "lower_case" = "false") COMMENT '检索条件分词'
) ENGINE=OLAP
AGGREGATE KEY(`stat_time`, `app_id`, `tenant_id`, `metric_code_version`, `metric_agg_id`, `metric_agg_condition`, `metric_search_condition`)
COMMENT 'ADS通用指标纵表'
AUTO PARTITION BY RANGE (date_trunc(`stat_time`, 'day'))()
DISTRIBUTED BY HASH(`metric_code_version`) BUCKETS 15
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"min_load_replica_num" = "-1",
"bloom_filter_columns" = "metric_code_version",
"is_being_synced" = "false",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "day",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-1000",
"dynamic_partition.end" = "1",
"dynamic_partition.prefix" = "p",
"dynamic_partition.replication_allocation" = "tag.location.default: 3",
"dynamic_partition.buckets" = "15",
"dynamic_partition.create_history_partition" = "false",
"dynamic_partition.history_partition_num" = "-1",
"dynamic_partition.hot_partition_num" = "0",
"dynamic_partition.reserved_history_periods" = "[1970-01-01,1970-01-02]",
"dynamic_partition.storage_policy" = "",
"dynamic_partition.storage_medium" = "hdd",
"storage_medium" = "hdd",
"storage_format" = "V2",
"inverted_index_storage_format" = "V1",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false",
"group_commit_interval_ms" = "10000",
"group_commit_data_bytes" = "134217728"
);;

刚才提到的有类似情况但情况较好的业务表,和上边的表结构基本一致,相当于那张表的日志表,明细模型,按周分区

咨询过腾讯云,腾讯云说CBO底层就是全量采集,但是Doris官方文档说的不会全量采集,有点搞懵了,但不管底层是否是全量采集,这个临时表一直全量采集肯定是不正常的
知道可以通过手动分析或者使用采样的方式来暂时解决这个问题,但是就是想搞清楚这个一直全量采集到底是什么原因导致的,因为从操作过程来说,就是新建了一张表,导入了一些数据,就出现了这个问题

还请社区大佬能给出一些排查方向和解惑,谢谢

0 Answers