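Today I found that a Doris BE process had exited. Investigation showed that the mount point of one of the BE node's data disks had disappeared, which caused the process to exit. Digging further, the disk behind the missing mount point had bad sectors. After repairing it with xfs_repair -L /dev/sda, the disk could be mounted again and I verified that file writes worked normally. When I restarted the BE process and watched the logs, the following errors appeared: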
W20250719 14:10:53.846815 588984 storage_engine.cpp:114] open engine failed, error: [E-3005]rocksdb seek failed. reason: Corruption: Bad table magic number: expected 9863518390377041911, found 3761690078487983156 in /data01/doris/storage/meta/8392761.sst
0# doris::OlapMeta::iterate(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)
1# doris::TabletMetaManager::traverse_headers(doris::OlapMeta*, std::function<bool (long, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
2# doris::DataDir::_check_incompatible_old_format_tablet()
3# doris::DataDir::load()
4# std::thread::_State_impl<std::thread::_Invoker<std::tuple<doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_0, unsigned long> > >::_M_run()
5# execute_native_thread_routine
6# ?
7# ?
E20250719 14:10:53.846902 588984 exec_env_init.cpp:290] Fail to open StorageEngine, res=[E-3005]rocksdb seek failed. reason: Corruption: Bad table magic number: expected 9863518390377041911, found 3761690078487983156 in /data01/doris/storage/meta/8392761.sst
0# doris::OlapMeta::iterate(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)
1# doris::TabletMetaManager::traverse_headers(doris::OlapMeta*, std::function<bool (long, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
2# doris::DataDir::_check_incompatible_old_format_tablet()
3# doris::DataDir::load()
4# std::thread::_State_impl<std::thread::_Invoker<std::tuple<doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_0, unsigned long> > >::_M_run()
5# execute_native_thread_routine
6# ?
7# ?
E20250719 14:10:53.846917 588984 doris_main.cpp:526] failed to init doris storage engine, res=[E-3005]rocksdb seek failed. reason: Corruption: Bad table magic number: expected 9863518390377041911, found 3761690078487983156 in /data01/doris/storage/meta/8392761.sst
0# doris::OlapMeta::iterate(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<bool (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&)
1# doris::TabletMetaManager::traverse_headers(doris::OlapMeta*, std::function<bool (long, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
2# doris::DataDir::_check_incompatible_old_format_tablet()
3# doris::DataDir::load()
4# std::thread::_State_impl<std::thread::_Invoker<std::tuple<doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_0, unsigned long> > >::_M_run()
5# execute_native_thread_routine
6# ?
7# ?
I20250719 14:10:54.185720 590186 wal_manager.cpp:480] Sleep 1s to wait for storage engine init.
I20250719 14:10:55.185808 590186 wal_manager.cpp:480] Sleep 1s to wait for storage engine init.
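To see whether 8392761.sst is the only damaged file in the meta directory, the check that RocksDB reports in the log can be approximated by hand. The sketch below is not a Doris or RocksDB tool. It assumes the files are block-based table files whose last 8 bytes hold the magic number as a little-endian 64-bit integer, and it takes the expected value straight from the log above. The function names read_sst_magic and find_bad_sst are made up for this illustration. Note also that 8392761 comes from the file name; the log itself does not establish that it is a tablet ID.

```python
import struct
import sys
from pathlib import Path

# "expected" value copied from the BE log above (RocksDB block-based table magic).
EXPECTED_MAGIC = 9863518390377041911


def read_sst_magic(path):
    """Decode the last 8 bytes of the file as a little-endian uint64."""
    with open(path, "rb") as f:
        f.seek(0, 2)              # jump to end of file to learn its size
        if f.tell() < 8:
            raise ValueError("file shorter than 8 bytes")
        f.seek(-8, 2)             # last 8 bytes of the footer
        return struct.unpack("<Q", f.read(8))[0]


def find_bad_sst(meta_dir):
    """Return (file name, magic found) for every .sst whose magic does not match.

    The magic is None when the file could not be read or is too short.
    """
    bad = []
    for p in sorted(Path(meta_dir).glob("*.sst")):
        try:
            magic = read_sst_magic(p)
        except (OSError, ValueError):
            bad.append((p.name, None))
            continue
        if magic != EXPECTED_MAGIC:
            bad.append((p.name, magic))
    return bad


if __name__ == "__main__":
    # Default path is the meta directory named in the log; pass another one as argv[1].
    meta_dir = sys.argv[1] if len(sys.argv) > 1 else "/data01/doris/storage/meta"
    for name, magic in find_bad_sst(meta_dir):
        print(name, magic)
```

The script only reads files, so it can be run while the BE is stopped without changing anything on disk. A matching magic number does not prove a file is intact; it only rules out the specific failure shown in the log.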
However, checking the status of tablet 8392761 with the admin check tablet command reports nothing wrong:
mysql> admin check tablet (8392761) PROPERTIES("type" = "consistency");
Query OK, 0 rows affected (0.01 sec)
The show tablet result is as follows:
mysql> show tablet 8392761;
+--------+-----------+---------------+-----------+------+---------+-------------+---------+--------+-------+-----------+--------------------------------------------------+
| DbName | TableName | PartitionName | IndexName | DbId | TableId | PartitionId | IndexId | IsSync | Order | QueryHits | DetailCmd |
+--------+-----------+---------------+-----------+------+---------+-------------+---------+--------+-------+-----------+--------------------------------------------------+
| NULL | NULL | NULL | NULL | -1 | -1 | -1 | -1 | false | -1 | 0 | SHOW PROC '/dbs/-1/-1/partitions/-1/-1/8392761'; |
+--------+-----------+---------------+-----------+------+---------+-------------+---------+--------+-------+-----------+--------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW PROC '/dbs/-1/-1/partitions/-1/-1/8392761';
ERROR 1105 (HY000): errCode = 2, detailMessage = Database -1 does not exist
Could anyone help me figure out what is going on here? I searched online and could not find any leads for this problem.