The problem is as stated in the title; details follow. Any help with a fix or workaround is appreciated.
The DDL for the Paimon table, as created from Spark, is:
CREATE TABLE test_table (
  id STRING NOT NULL COMMENT 'id',
  vector ARRAY<DOUBLE> COMMENT 'feature vector',
  timestamp BIGINT NOT NULL COMMENT 'timestamp',
  dt STRING NOT NULL COMMENT 'date partition',
  hour STRING NOT NULL COMMENT 'hour partition')
PARTITIONED BY (dt, hour)
TBLPROPERTIES (
  'bucket' = '1',
  'partition.expiration-check-interval' = '1 d',
  'partition.expiration-time' = '3650 d',
  'partition.timestamp-formatter' = 'yyyyMMdd',
  'primary-key' = 'dt,hour,id,timestamp'
);
Insert a single row from Spark:
INSERT INTO test_table PARTITION (dt='20260101', hour='11')
SELECT '1' AS id
     , array(1.1, 2.2) AS vector
     , 10000 AS timestamp;
The row is visible in both Spark and Doris, and a comparison expression in the SELECT list evaluates correctly:
mysql> select timestamp<50000,* from test_table;
+-----------------+------+------------+-----------+----------+------+
| timestamp<50000 | id | vector | timestamp | dt | hour |
+-----------------+------+------------+-----------+----------+------+
| 1 | 1 | [1.1, 2.2] | 10000 | 20260101 | 11 |
+-----------------+------+------------+-----------+----------+------+
1 row in set (0.40 sec)
But in Doris, adding the same condition to the WHERE clause returns no rows (Spark returns the row as expected):
mysql> select timestamp<50000,* from test_table where timestamp<50000;
Empty set (0.26 sec)
A few additional observations, which suggest the ARRAY column is the trigger and that a floating-point type conversion may be involved:
1. A table without the ARRAY column filters correctly.
2. After dropping the ARRAY column and rewriting the data in the partition (so the files are reorganized), filtering works correctly; if the column is only dropped without rewriting the data, the bug persists.
3. After re-adding the ARRAY column and rewriting the data in the partition, the bug reappears again.
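Observations 2 and 3 are consistent with the scanner decoding the `timestamp` column with the wrong type when the ARRAY<DOUBLE> column is present in the data files. This is purely a hypothetical illustration of that failure mode, not Doris's actual code path: if the 8 bytes of a BIGINT were written/read through a DOUBLE-typed reader, the predicate `timestamp < 50000` would be evaluated against a garbage value and silently filter the row out, while an un-pushed-down comparison in the SELECT list (computed on the correctly decoded value) would still print 1.

```python
import struct

stored = 10000  # the BIGINT value actually in the row

# Correct decoding: 8 bytes written as int64, read as int64.
ok = struct.unpack('<q', struct.pack('<q', stored))[0]
print(ok < 50000)       # True: the predicate should match

# Mis-typed decoding: the same logical value written as a double,
# then reinterpreted as int64 (bit pattern of 10000.0).
garbled = struct.unpack('<q', struct.pack('<d', float(stored)))[0]
print(garbled)          # 4666723172467343360
print(garbled < 50000)  # False: the row would be filtered out
```

If this hypothesis is right, it would also explain why rewriting the partition after dropping the ARRAY column fixes the filter (the new files no longer trip the mis-typed read path), while merely dropping the column in metadata does not.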