Huge memory allocation error when writing a single vector row


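I used the Azure (Microsoft cloud) embedding API to turn one or two hundred characters of text into a 3072-dimensional vector. When inserting it into a Doris table, the BE tried to allocate a huge amount of memory; the allocation failed, and so did the write. Here is the log: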

I20260511 17:51:28.560243 1077277 ann_index_writer.cpp:75] Create a new faiss index, index_type hnsw dim 3072 metric_type l2_distance max_degree 32, ef_construction 40, quantizer flat
I20260511 17:51:28.560523 1077277 allocator.cpp:136] Task:Memtable#Load#Id=ddd60424a35545ab-81458da98db346ea waiting for enough memory in thread id:22709418255936, maximum 1000ms, Allocator sys memory check failed: Cannot alloc:16.00 GB, consuming tracker:<Memtable#Load#Id=ddd60424a35545ab-81458da98db346ea>, peak used 24.89 KB, current used 24.89 KB, reserved 0, exec node:, sys physical memory 31.27 GB. process memory used 3.03 GB(= 3.03 GB[vm/rss] + 0[reserved] + 0B[waiting_refresh]), limit 28.14 GB, soft limit 25.33 GB. sys available memory 8.49 GB(= 8.49 GB[proc/available] - 0[reserved] - 0B[waiting_refresh]), low water mark 1.56 GB, warning water mark 3.13 GB.
Alloc Stacktrace:
0# doris::Allocator<false, false, false, doris::DefaultMemoryAllocator, true>::sys_memory_exceed(unsigned long, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) const at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/basic_string.h:2462
1# doris::Allocator<false, false, false, doris::DefaultMemoryAllocator, true>::sys_memory_check(unsigned long) const at /home/zcp/repo_center/doris_release/doris/be/src/core/allocator.cpp:132
2# doris::Allocator<false, false, false, doris::DefaultMemoryAllocator, true>::alloc(unsigned long, unsigned long) at /home/zcp/repo_center/doris_release/doris/be/src/core/allocator.cpp:247
3# doris::segment_v2::AnnIndexColumnWriter::init() at /home/zcp/repo_center/doris_release/doris/be/src/core/pod_array.h:164
4# doris::segment_v2::ArrayColumnWriter::init() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:526
5# doris::segment_v2::VerticalSegmentWriter::_create_column_writer(unsigned int, doris::TabletColumn const&, std::shared_ptr const&) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:526
6# doris::segment_v2::VerticalSegmentWriter::write_batch() at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:526
7# doris::SegmentFlusher::_add_rows(std::unique_ptr<doris::segment_v2::VerticalSegmentWriter, std::default_delete >&, doris::Block const*, unsigned long, unsigned long) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:526
8# doris::SegmentFlusher::flush_single_block(doris::Block const*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:526
9# doris::SegmentCreator::flush_single_block(doris::Block const*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:526
10# doris::BetaRowsetWriterV2::flush_memtable(doris::Block*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:526
11# doris::FlushToken::_do_flush_memtable(doris::MemTable*, int, long*) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:526
12# doris::FlushToken::_flush_memtable(std::shared_ptr, int, long) at /home/zcp/repo_center/doris_release/doris/be/src/load/memtable/memtable_flush_executor.cpp:0
13# doris::MemtableFlushTask::run() at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/shared_ptr_base.h:1068
14# doris::ThreadPool::dispatch_thread() at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/shared_ptr_base.h:1097
15# doris::Thread::supervise_thread(void*) at /usr/local/ldb-toolchain-v0.26/bin/../usr/include/pthread.h:562
16# ?
17# ?

1 Answer

This is a problem in the ANN index write path of 4.1.0; a single 3072-dimensional vector does not itself need 16 GB.

The root cause is in be/src/storage/index/ann/ann_index_writer.cpp:81:

size_t block_size = AnnIndexColumnWriter::chunk_size() * build_parameter.dim;
_float_array.reserve(block_size);

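chunk_size() returns the BE config ann_index_build_chunk_size defined at be/src/common/config.cpp:1718, whose default is 1000000. So with 3072 dimensions, initializing the ANN writer already reserves: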

1,000,000 * 3072 * sizeof(float) = 12,288,000,000 bytes ≈ 12.288 GB (about 11.44 GiB)

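PODArray::reserve() rounds the capacity up to the next power of two, so the actual request becomes 2^34 bytes = 16 GiB, which matches "Cannot alloc:16.00 GB" in the log above exactly. The stack trace also shows the allocation is triggered in AnnIndexColumnWriter::init(), not when the single vector is actually added.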
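To double-check the arithmetic, here is a small standalone C++ sketch that reproduces the requested size. It assumes the growth policy is a plain "next power of two" on the byte count; PODArray's optional left/right padding is ignored, and any such padding is only a few bytes, which would not change the result here. round_up_pow2 and reserved_bytes are hypothetical helpers, not Doris functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Doris function): smallest power of two >= n.
// Approximates PODArray's "round capacity up to a power of two" growth;
// padding is ignored.
std::uint64_t round_up_pow2(std::uint64_t n) {
    assert(n > 0);
    std::uint64_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Hypothetical helper: bytes that reserve(chunk_size * dim) on a float
// buffer would request after power-of-two rounding.
std::uint64_t reserved_bytes(std::uint64_t chunk_size, std::uint64_t dim) {
    const std::uint64_t raw = chunk_size * dim * sizeof(float);
    return round_up_pow2(raw);
}
```

With the default chunk size, reserved_bytes(1000000, 3072) gives 17179869184 bytes (2^34, i.e. 16 GiB), while reserved_bytes(10000, 3072) gives 134217728 bytes (128 MiB).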

Temporary workaround: lower the BE config item ann_index_build_chunk_size (in be.conf), for example:

ann_index_build_chunk_size = 10000

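With 3072 dimensions this is 10000 * 3072 * 4 = 122,880,000 bytes (about 117 MiB), which rounds up to 128 MiB. In memory-constrained environments, start with 10000 or smaller.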
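If you want to size ann_index_build_chunk_size from a memory budget rather than by trial, the same power-of-two assumption gives a closed form: a buffer of r bytes fits a budget B exactly when r is at most the largest power of two not exceeding B. The sketch below is a hypothetical helper, not a Doris API, and it only accounts for the float staging buffer, not the memory of the HNSW graph itself or other load memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Doris): largest power of two <= n.
std::uint64_t floor_pow2(std::uint64_t n) {
    assert(n > 0);
    std::uint64_t p = 1;
    while (p <= n / 2) p <<= 1;  // stop before exceeding n; no overflow
    return p;
}

// Hypothetical helper: largest chunk size whose float buffer, after
// power-of-two rounding, stays within budget_bytes for vectors of `dim`.
// Returns 0 if even one vector's buffer would not fit.
std::uint64_t max_chunk_size_for_budget(std::uint64_t dim, std::uint64_t budget_bytes) {
    assert(dim > 0 && budget_bytes > 0);
    const std::uint64_t bytes_per_vector = dim * sizeof(float);
    // A raw size r rounds up to a power of two <= budget iff r <= floor_pow2(budget).
    return floor_pow2(budget_bytes) / bytes_per_vector;
}
```

For example, a 128 MiB budget with 3072 dimensions allows a chunk size of up to 10922, so the suggested value of 10000 sits just below that limit.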

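Keep in mind that lowering ann_index_build_chunk_size may cost some recall performance: if the batches passed to add() are too small, the quality and efficiency of the incremental HNSW graph build may drop.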
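To make the tradeoff concrete, here is a simplified sketch of the chunked-add pattern suggested by the snippet above, assuming the writer buffers vectors and hands them to the index in batches of chunk_size. The class name ChunkedAdder and the add_batch callback are hypothetical; the callback stands in for adding a batch to the faiss HNSW index so the example compiles without faiss.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the ANN writer's buffering: vectors are appended
// to a float buffer and passed to add_batch(n, data) every chunk_size vectors.
class ChunkedAdder {
public:
    using AddFn = std::function<void(std::size_t n, const float* data)>;

    ChunkedAdder(std::size_t dim, std::size_t chunk_size, AddFn add_batch)
        : dim_(dim), chunk_size_(chunk_size), add_batch_(std::move(add_batch)) {
        assert(dim_ > 0 && chunk_size_ > 0);
        buffer_.reserve(chunk_size_ * dim_);  // up-front reservation, as in the snippet
    }

    // Append one vector of length dim_; flush once a full chunk is buffered.
    void add(const float* vec) {
        buffer_.insert(buffer_.end(), vec, vec + dim_);
        if (buffer_.size() == chunk_size_ * dim_) flush();
    }

    // Pass any remaining (possibly partial) chunk to the index.
    void flush() {
        if (buffer_.empty()) return;
        add_batch_(buffer_.size() / dim_, buffer_.data());
        buffer_.clear();  // keeps capacity, so the reservation is reused
    }

private:
    std::size_t dim_;
    std::size_t chunk_size_;
    AddFn add_batch_;
    std::vector<float> buffer_;
};
```

With 25 vectors and chunk_size 10, add_batch is called three times with 10, 10 and 5 vectors. A smaller chunk_size shrinks the up-front reservation but means the index receives more, smaller batches, which is where the possible recall and build-efficiency cost comes from.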