Data skew with hash bucketing on a UNIQUE KEY table

On Doris 2.1.9, my table uses UNIQUE KEY with Id bigint NOT NULL as the bucketing key, DISTRIBUTED BY HASH(Id) BUCKETS 8, yet the data is still skewed:
"BucketIdx","AvgRowCount","AvgDataSize","Graph","Percent"
"0","39470","3485395",>>>>>>>>>>>>>>>>,"16.88 %"
"1","30235","2690570",>>>>>>>>>>>>>,"13.03 %"
"2","29866","2658435",>>>>>>>>>>>>,"12.87 %"
"3","25049","2237446",>>>>>>>>>>,"10.84 %"
"4","26317","2349757",>>>>>>>>>>>,"11.38 %"
"5","28558","2545532",>>>>>>>>>>>>,"12.33 %"
"6","28685","2542321",>>>>>>>>>>>>,"12.31 %"
"7","24129","2140665",>>>>>>>>>>,"10.37 %"

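I looked up the bucketing strategy in the official documentation:
Hash bucketing: computes the crc32 hash of the bucketing column value and takes it modulo the number of buckets, distributing rows evenly across the tablets.
So I ran crc32(id)%8 myself and found that the data should be evenly distributed: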
"bucket_no","count(1)"
0,9549
1,9601
2,9486
3,9518
4,9452
5,9488
6,9567
7,9573
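
The check above can be reproduced outside the database. The following is a minimal sketch, assuming the crc32 is taken over the decimal text of each id; whether Doris hashes the column value the same way internally is not confirmed here. The names bucket_of and bucket_counts and the sample ids are hypothetical.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 8  # matches BUCKETS 8 in the question


def bucket_of(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Assumption: hash the decimal text of the key, then take it modulo
    # the bucket count. The engine's internal hashing may differ.
    return zlib.crc32(str(key).encode("ascii")) % num_buckets


def bucket_counts(keys, num_buckets: int = NUM_BUCKETS) -> Counter:
    # Start every bucket at zero so that empty buckets still show up.
    counts = Counter({b: 0 for b in range(num_buckets)})
    for key in keys:
        counts[bucket_of(key, num_buckets)] += 1
    return counts


if __name__ == "__main__":
    sample_ids = range(1, 80001)  # hypothetical ids, not the asker's data
    for bucket, n in sorted(bucket_counts(sample_ids).items()):
        print(bucket, n)
```

Note that a check like this counts each key once, so it describes how distinct keys spread across buckets, not how many stored row versions each bucket currently holds.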
Why does this happen? Is it a bug?

2 Answers

What is the disk usage on the BE nodes?

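In a Unique table, a row whose key already exists replaces the old row. Suppose the data has 8,000 rows and 1,000 of them fall into the first bucket, but some of those keys are repeated, so the bucket really ends up with only 500 rows, while some other buckets with no repeated keys keep their full 1,000 rows. Until then, the duplicates are still counted, because the statistic is the sum over every rowset and rowsets are not deduplicated against each other; the old versions are only truly deleted once compaction has run. After compaction, each bucket may have had a different number of duplicates, so you end up with the imbalance you are seeing. Automatic balancing works at the tablet level, not the rowset level. That is simply how Unique tables behave.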
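
The effect described above can be illustrated with a small simulation. This is only a sketch of the idea, not Doris code: a rowset is modelled as a list of keys from one load, key % num_buckets stands in for the real hash, and row_counts is a hypothetical helper. It compares the row count summed across rowsets (duplicates still counted) with the count left once duplicate keys are merged, as compaction would do.

```python
def row_counts(rowsets, num_buckets):
    """Return (raw, merged) per-bucket row counts.

    raw:    rows summed across all rowsets, duplicates included
    merged: distinct keys per bucket, i.e. what remains after merging
    """
    raw = [0] * num_buckets
    live_keys = [set() for _ in range(num_buckets)]
    for rowset in rowsets:
        for key in rowset:
            bucket = key % num_buckets  # placeholder for the real hash
            raw[bucket] += 1
            live_keys[bucket].add(key)
    merged = [len(keys) for keys in live_keys]
    return raw, merged


if __name__ == "__main__":
    first_load = list(range(8))  # keys 0..7, two per bucket
    second_load = [0, 4, 8]      # updates keys 0 and 4, adds key 8
    raw, merged = row_counts([first_load, second_load], num_buckets=4)
    print(raw)     # [5, 2, 2, 2]
    print(merged)  # [3, 2, 2, 2]
```

In this toy case bucket 0 reports five rows before merging although only three keys are live, and it still holds more rows than the other buckets afterwards because the second load happened to touch only that bucket.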

If you do not want SHOW DATA SKEW FROM [<db_name>.]<table_name> to report skew, you can empty the table and load the data again, but the skew may come back the next time duplicate data is loaded.

Thanks to 阿渊@SelectDB for the answer.