doris-2.0.5-rc02, load balancing issue: data unevenly distributed across the disks of a single BE node


Cluster environment

  • Version: doris-2.0.5-rc02
  • 3 FE deployed in docker, 16 cores and 64 GB RAM each, JVM max 32 GB
  • 10 BE deployed in docker, 16 cores and 64 GB RAM each
  • Nodes where BE and FE are co-located use separate disks for each
  • BE disks: 6893G xfs (5893G xfs on the co-located nodes)
  • FE disk: 893G xfs

Symptoms

Operations noticed that the usage of disk 4 on BE node 5 kept climbing until it reached 80%, while the other disks sat at roughly 50%.

  1. Ran ADMIN CLEAN TRASH ON(BackendHost:BackendHeartBeatPort) against BE node 5; TrashUsedCapacity dropped to 0 (a consolidated sketch of the admin statements used here is given after this list)
  2. Ran ADMIN REBALANCE DISK against BE node 5 and waited more than ten minutes; nothing changed
  3. Ran ADMIN CLEAN TRASH on the whole cluster
  4. Adjusted the disk watermark settings on all BEs and FEs via the HTTP API:
     /api/update_config?storage_flood_stage_usage_percent=80&persist=true
     /api/update_config?storage_flood_stage_left_capacity_bytes=193273528320&persist=true
     /api/_set_config?storage_high_watermark_usage_percent=78&storage_min_left_capacity_bytes=211527139328&persist=true&reset_persist=false
     /api/_set_config?storage_flood_stage_usage_percent=80&storage_flood_stage_left_capacity_bytes=193273528320&persist=true&reset_persist=false
  5. Stream load started failing because disk 4 on BE node 5 hit the limit:
     disk /opt/apache-doris/be/storage3 on backend 15617 exceed limit usage, path hash: -8679819090117116242
  6. Raised storage_flood_stage_usage_percent on BE node 5 to 95
  7. Stream load recovered
  8. Usage of disk 4 on BE node 5 reached 95%
  9. Stream load failed again; disk 4 on BE node 5 hit the limit once more
  10. Took BE node 5 offline with ALTER SYSTEM DECOMMISSION BACKEND
  11. Stream load recovered
  12. Disk usage on node 8 reached 80% and stream load failed again:
      disk /opt/apache-doris/be/storage3 on backend 15642
  13. Stopped stream load
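
For reference, the admin statements behind steps 1–3 and 10 are collected below as a sketch; "be5_host:9050" is only a placeholder for the real BackendHost:BackendHeartBeatPort of BE node 5, and the watermark changes from step 4 stay on the HTTP endpoints listed above.

-- Sketch only: "be5_host:9050" stands in for the real BackendHost:BackendHeartBeatPort.
ADMIN CLEAN TRASH ON ("be5_host:9050");             -- step 1: empty the trash on BE node 5
ADMIN REBALANCE DISK ON ("be5_host:9050");          -- step 2: ask the scheduler to prioritize disk rebalancing on BE node 5
ADMIN CLEAN TRASH;                                  -- step 3: empty the trash on every backend
ALTER SYSTEM DECOMMISSION BACKEND "be5_host:9050";  -- step 10: take BE node 5 offline (its replicas are migrated away first)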

BE disk usage

Nodes 5, 6, and 7 are the ones that also host FE.
(screenshot: per-disk usage on each BE node)

Troubleshooting

  • Parameter check
    disable_balance, disable_disk_balance, disable_colocate_balance and
    disable_tablet_scheduler are all false
  • Load score check
    After running ADMIN CLEAN TRASH on the whole cluster, the check showed BE node 5 classified as LOW while its disk 4 was HIGH (see the proc-command sketch after this list)
  • Balance task check
    16 tasks were all RUNNING, 14 of them BALANCE; as far as I remember, SrcBe and DestBe were the same backend
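
The load classes and balance tasks above can be inspected through the tablet scheduler's proc interface. A minimal sketch, with proc paths as documented for Doris 2.0 (the tag/medium segments, location_default and HDD here, depend on the cluster):

-- Tablet scheduler overview: pending/running/history tablets and scheduling stats
SHOW PROC '/cluster_balance';

-- Per-BE load class (LOW / MID / HIGH) for the default tag on the HDD medium;
-- drilling into a backend id shows per-path (per-disk) statistics
SHOW PROC '/cluster_balance/cluster_load_stat/location_default/HDD';

-- Balance tasks currently executing, including SrcBe / DestBe and the path hashes
SHOW PROC '/cluster_balance/running_tablets';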

Error logs

fe.log

2024-05-12 11:59:38,344 INFO (thrift-server-pool-63|49686) [DatabaseTransactionMgr.abortTransaction():1405] abort transaction: TransactionState. transaction id: 332518, label: 8b0b5e45-d6e6-4f8e-8596-059cd58d3576, db id: 16009, table id list: 46171, callback id: -1, coordinator: BE: 10.161.71.111, transaction status: ABORTED, error replicas num: 0, replica ids: , prepare time: 1715515178342, commit time: -1, finish time: 1715515178343, reason: [ANALYSIS_ERROR]TStatus: errCode = 2, detailMessage = disk /opt/apache-doris/be/storage3 on backend 15617 exceed limit usage, path hash: -8679819090117116242

        0#  doris::Status doris::Status::create<true>(doris::TStatus const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
        1#  doris::StreamLoadAction::_process_put(doris::HttpRequest*, std::shared_ptr<doris::StreamLoadContext>) at /home/zcp/repo_center/doris_release/doris/be/src/common/status.h:445
        2#  doris::StreamLoadAction::_on_header(doris::HttpRequest*, std::shared_ptr<doris::StreamLoadContext>) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        3#  doris::StreamLoadAction::on_header(doris::HttpRequest*) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
        4#  doris::EvHttpServer::on_header(evhttp_request*) at /home/zcp/repo_center/doris_release/doris/be/src/http/ev_http_server.cpp:255
        5#  ?
        6#  bufferevent_run_readcb_
        7#  ?
        8#  ?
        9#  ?
        10# ?
        11# std::_Function_handler<void (), doris::EvHttpServer::start()::$_0>::_M_invoke(std::_Any_data const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/atomicity.h:98
        12# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/util/threadpool.cpp:0
        13# doris::Thread::supervise_thread(void*) at /var/local/ldb_toolchain/bin/../usr/include/pthread.h:562
        14# ?
        15# clone
 successfully

fe.warn.log

2024-05-12 10:57:11,792 WARN (thrift-server-pool-132|63851) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49799, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:11,856 WARN (thrift-server-pool-124|50415) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49815, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:12,806 WARN (thrift-server-pool-32|414) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49799, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:12,872 WARN (thrift-server-pool-106|50331) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49815, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:13,809 WARN (thrift-server-pool-132|63851) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49799, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:13,873 WARN (thrift-server-pool-124|50415) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49815, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:31,959 WARN (thrift-server-pool-30|412) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49815, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:32,019 WARN (thrift-server-pool-74|50095) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49799, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:32,961 WARN (thrift-server-pool-107|50332) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49815, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:33,029 WARN (thrift-server-pool-32|414) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49799, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:33,968 WARN (thrift-server-pool-106|50331) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49815, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:34,084 WARN (thrift-server-pool-132|63851) [MasterImpl.finishTask():94] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:114, be_port:9595, http_port:9596), task_type:CLONE, signature:49799, task_status:TStatus(status_code:INTERNAL_ERROR, error_msgs:[(114)[INTERNAL_ERROR]Disk reach capacity limit]), report_version:17151864112113)
2024-05-12 10:57:34,829 WARN (ForkJoinPool-1-worker-15|106673) [TabletInvertedIndex.lambda$null$0():190] replica 88559 of tablet 49803 on backend 15621 need recovery. replica in FE: [replicaId=88559, BackendId=15621, version=55525, dataSize=176756482189, rowCount=2528641921, lastFailedVersion=63872, lastSuccessVersion=55525, lastFailedTimestamp=1715511452350, schemaHash=971940348, state=NORMAL], report version 55525, report schema hash: 971940348, is bad: false, is version missing: true

Balance task entries found in fe.log

2024-05-12 10:37:12,732 INFO (tablet scheduler|40) [TabletScheduler.addTablet():272] Add tablet to pending queue, tablet id: 72486, state: PENDING, type: BALANCE, balance: BE_BALANCE, priority: LOW, tablet size: 0, visible version: -1, committed version: -1
2024-05-12 10:37:12,736 INFO (tablet scheduler|40) [TabletScheduler.removeTabletCtx():1589] remove the tablet tablet id: 72486, status: HEALTHY, state: PENDING, type: BALANCE, balance: BE_BALANCE, priority: LOW, tablet size: 0, from backend: 15558, src path hash: -3799794340428994182, visible version: 1, committed version: 1. err: unable to find low backend. because: unable to find low backend
2024-05-12 11:00:00,144 INFO (tablet scheduler|40) [BeLoadRebalancer.selectAlternativeTabletsForCluster():220] select alternative tablets, medium: HDD, num: 3, detail: [72486, 40764, 21717]
2024-05-12 11:00:00,144 INFO (tablet scheduler|40) [TabletScheduler.addTablet():272] Add tablet to pending queue, tablet id: 72486, state: PENDING, type: BALANCE, balance: BE_BALANCE, priority: LOW, tablet size: 0, visible version: -1, committed version: -1
2024-05-12 11:00:00,144 INFO (tablet scheduler|40) [TabletScheduler.removeTabletCtx():1589] remove the tablet tablet id: 72486, status: HEALTHY, state: PENDING, type: BALANCE, balance: BE_BALANCE, priority: LOW, tablet size: 0, from backend: 15558, src path hash: -3799794340428994182, visible version: 1, committed version: 1. err: unable to find low backend. because: unable to find low backend
2024-05-12 11:01:20,847 INFO (tablet scheduler|40) [BeLoadRebalancer.selectAlternativeTabletsForCluster():220] select alternative tablets, medium: HDD, num: 3, detail: [72486, 45971, 42265]
2024-05-12 11:01:20,848 INFO (tablet scheduler|40) [TabletScheduler.addTablet():272] Add tablet to pending queue, tablet id: 72486, state: PENDING, type: BALANCE, balance: BE_BALANCE, priority: LOW, tablet size: 0, visible version: -1, committed version: -1
2024-05-12 11:01:22,863 INFO (tablet scheduler|40) [TabletScheduler.removeTabletCtx():1589] remove the tablet tablet id: 72486, status: HEALTHY, state: PENDING, type: BALANCE, balance: BE_BALANCE, priority: LOW, tablet size: 0, from backend: 15558, src path hash: -3799794340428994182, visible version: 1, committed version: 1. err: unable to find low backend. because: unable to find low backend
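
The scheduler keeps re-selecting the same tablets (e.g. 72486) and then dropping them with "unable to find low backend", which suggests the BE-level balancer does not see any under-loaded backend; the skew is between the disks inside node 5. To see how the replicas of the large table are actually placed, something like the following could be run (statements from the Doris docs; the table is the 100301 table whose DDL is shown further down):

-- Replica count and data size per backend for the table
ADMIN SHOW REPLICA DISTRIBUTION FROM `100301`;

-- Per-tablet data size; tablets of tens of GB are expensive to clone or rebalance
SHOW TABLETS FROM `100301`;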

Application

Data distribution
(screenshot: data distribution across BE nodes)

SHOW PARTITIONS FROM 100301;

Partition     Data size
220240510     277.070 GB
220240511     6.648 TB
220240512     5.653 TB
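
As a rough check of what these sizes imply per tablet (assuming the DataSize shown by SHOW PARTITIONS is single-replica and the auto-bucket cap of 128 buckets applied):

-- Back-of-the-envelope tablet size for the 6.648 TB partition with 128 buckets
SELECT 6.648 * 1024 / 128 AS approx_gb_per_tablet;   -- roughly 53 GB per tablet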

CREATE TABLE statement

CREATE TABLE `100301` (
    in_time DATETIME, 
    pcode char(2), 
    s_logo CHAR(4), 
    d_flag BOOLEAN DEFAULT '0',
    s_id VARCHAR(10),
    dgi STRING,
    sri STRING,
    fid STRING,
    ofa STRING,
    cid STRING,
    usage_type STRING,
    file_category STRING,
    exception_file_id STRING,
    exception_cdr_id STRING,
    city_code STRING,
    sorting_action STRING,
    exception_type STRING,
    record_type STRING,
    phone_no VARCHAR(200),
    call_start_time STRING,
    call_end_time STRING,
    billed_duration BIGINT,
    charging_condition_changes STRING,
    data_traffic_list STRING,
    total_charging_volume STRING,
    service_code STRING,
    upstream_traffic_1 BIGINT,
    downstream_traffic_1 BIGINT,
    upstream_traffic_2 BIGINT,
    downstream_traffic_2 BIGINT,
    mobile_user_IMSI STRING,
    mobile_device_IMEI STRING,
    location_area_identification STRING,
    cell_number STRING,
    home_network_operator STRING,
    roaming_network_operator STRING,
    home_location_area_code STRING,
    visited_location_area_code STRING,
    home_province STRING,
    roaming_province STRING,
    SGSN_service_node_IP_address STRING,
    current_GGSN_PGW_IP_address STRING,
    network_initiated_PDP_context STRING,
    PDP_context_billing_Identifier STRING,
    PDP_type STRING,
    camel_related_to_PDP_context STRING,
    IMS_signaling_PDP_context_flag STRING,
    service_fee STRING,
    communication_charge STRING,
    charge_code STRING,
    partial_record_indicator STRING,
    mobile_network_capabilities STRING,
    routing_area_at_record_creation STRING,
    location_area_at_record_creation STRING,
    cell_identity_or_service_area_code STRING,
    APN_network_identifier STRING,
    APN_selection_mode STRING,
    APN_operational_identifier STRING,
    SGSN_change_flag STRING,
    used_SGSN_PLMN_identifier STRING,
    roaming_type STRING,
    user_type STRING,
    service_major_class STRING,
    record_closure_reason STRING,
    supplementary_fields STRING,
    user_data_charging_characteristics STRING,
    rat_type_value STRING,
    charging_features_selection_mode STRING,
    GSN_code STRING,
    camel_charging_information_set STRING,
    user_account STRING,
    access_point_info STRING,
    access_controller_IP_address STRING,
    nas_IP_address STRING,
    ap_ssid STRING,
    online_Charging_Session_Description STRING,
    pdn_connection_identifier STRING,
    user_csg_information STRING,
    ipv4_address STRING,
    ims_signaling STRING,
    p_gw_control_plane_IP_address STRING,
    pdn_type_ipv4_and_ipv6_dual_stack_info STRING,
    service_priority STRING,
    radio_resource_occupancy_priority STRING,
    uplink_bandwidth STRING,
    downlink_bandwidth STRING,
    guaranteed_bandwidth STRING,
    reserved_field_1 STRING,
    reserved_field_2 STRING,
    reserved_field_3 STRING,
    rg STRING,
    reserved_field_5 STRING,
    sgsn_ip_address_ipv6 STRING,
    current_ggsn_pgws_ip_address_ipv6 STRING,
    served_pdppdn_address_ipv6 STRING,
    reserved_field_6 STRING,
    reserved_field_7 STRING,
    reserved_field_8 STRING,
    reserved_field_9 STRING,
    reserved_field_10 STRING,
    served_pdppdn_address_ipv4 STRING,
    apn_operational_identifier_duplicate STRING,
    network_initiated_pdp_context_duplicate STRING,
    pdp_context_billing_identifier_duplicate STRING,
    roaming_city_code STRING,
    downstream_traffic_1_duplicate BIGINT,
    upstream_traffic_1_duplicate BIGINT,
    current_ggsn_pgws_ip_address_ipv6_duplicate STRING,
    served_pdppdn_address_ipv6_duplicate STRING,
    qci STRING,
    /* index */
    INDEX idx_phone_no (`phone_no`) USING INVERTED PROPERTIES("parser" = "english")
) 
ENGINE = OLAP
DUPLICATE KEY(in_time,pcode,s_logo,d_flag)
PARTITION BY RANGE (`in_time`) ()
DISTRIBUTED BY HASH(`phone_no`) BUCKETS AUTO
PROPERTIES (
  "compression" = "zstd",
  "replication_allocation" = "tag.location.default: 3",
  "dynamic_partition.enable" = "true",
  "dynamic_partition.time_unit" = "DAY",
  "dynamic_partition.start" = "-10",
  "dynamic_partition.end" = "3",
  "dynamic_partition.prefix" = "jc2",
  "dynamic_partition.create_history_partition" = "true",
  "dynamic_partition.history_partition_num" = "10",
  "bloom_filter_columns" = "phone_no"
);

Questions

  1. Is it because a single tablet holds too much data that balancing within the node cannot make progress?
  2. After a single disk on a BE node reaches the watermark, why does the stream load error show that writes are still directed at that disk? Shouldn't it try the other, less loaded disks instead?
1 Answer

First question: a single tablet should not hold too much data, and we don't recommend creating the table with auto bucket, because the maximum there is only 128 buckets. With several TB in a single partition now, each tablet cannot be small. You can take a look at that issue.
Second question: load balancing is something Doris does automatically; stream load currently cannot detect this situation on its own and pick where to write.
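
If the roughly 50 GB-per-tablet estimate above is about right, one possible follow-up (a sketch only, not verified on this cluster) is to switch future dynamic partitions from BUCKETS AUTO to a fixed bucket count sized for a few GB per tablet; existing partitions keep their old bucket count and would have to be rebuilt or reloaded:

-- Sketch: give newly created dynamic partitions a fixed bucket count.
-- 6.6 TB / ~10 GB per tablet ≈ 660; 640 is used here as a round number and
-- should be adjusted to the real single-replica partition size.
ALTER TABLE `100301` SET ("dynamic_partition.buckets" = "640");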