连续执行同一个hive表查询出现性能波动(Nereids Analysis Time)

Viewed 40

进行tpch性能测试时,使用query2进行hive表查询,发现通过webui手动连续执行数次会有1次查询耗时明显偏高,分析profile发现是“Nereids Analysis Time”指标导致。下面列出正常和异常的profile信息片段
异常:

Execution Summary:
   - Parse SQL Time: 1ms
   - Nereids Lock Table Time: 11ms
   **- Nereids Analysis Time: 4s602ms**
   - Nereids Rewrite Time: 6ms
   - Nereids Optimize Time: 3ms
   - Nereids Translate Time: 0ms
   - Workload Group: normal
   - Analysis Time: 4s612ms
   - Plan Time: 42ms
     - JoinReorder Time: N/A
     - CreateSingleNode Time: N/A
     - QueryDistributed Time: N/A
     - Init Scan Node Time: 0ms
     - Finalize Scan Node Time: 11ms
       - Get Splits Time: 7ms
         - Get Partitions Time: 0ms
         - Get Partition Files Time: 7ms
       - Create Scan Range Time: 4ms
   - Schedule Time: 28ms
     - Fragment Assign Time: 1ms
     - Fragment Serialize Time: 4ms
     - Fragment RPC Phase1 Time: 22ms
     - Fragment RPC Phase2 Time: 1ms
     - Fragment Compressed Size: 50.58 KB
     - Fragment RPC Count: 2
   - Schedule Time Of BE: {"phase1":{"158.2.100.39: 8060":{"RPC Work Time":"21ms","RPC Latency From FE To BE":"1ms","RPC Work Queue Time":"0ms","RPC Latency From BE To FE":"0ms"}},"phase2":{"158.2.100.39: 8060":{"RPC Work Time":"1ms","RPC Latency From FE To BE":"0ms","RPC Work Queue Time":"0ms","RPC Latency From BE To FE":"0ms"}}}
   - Wait and Fetch Result Time: 532ms
     - Fetch Result Time: 530ms
     - Write Result Time: 0ms
   - Doris Version: doris-2.1.8-1-834d802457
   - Is Nereids: Yes
   - Is Pipeline: Yes
   - Is Cached: No
   - Total Instances Num: 52
   - Instances Num Per BE: 158.2.100.39:8060:52
   - Parallel Fragment Exec Instance Num: 16
   - Trace ID: 716c05af-db0d-423d-bf9a-0a8e222eb51c
   - Transaction Commit Time: N/A
   - Executed By Frontend: N/A
   - Nereids GarbageCollect Time: 0ms
   - Nereids BeFoldConst Time: 0ms

正常:

Execution Summary:
   - Parse SQL Time: 1ms
   - Nereids Lock Table Time: 2ms
   - Nereids Analysis Time: 2ms
   - Nereids Rewrite Time: 5ms
   - Nereids Optimize Time: 3ms
   - Nereids Translate Time: 1ms
   - Workload Group: normal
   - Analysis Time: 3ms
   - Plan Time: 16ms
     - JoinReorder Time: N/A
     - CreateSingleNode Time: N/A
     - QueryDistributed Time: N/A
     - Init Scan Node Time: 0ms
     - Finalize Scan Node Time: 4ms
       - Get Splits Time: 0ms
         - Get Partitions Time: 0ms
         - Get Partition Files Time: 0ms
       - Create Scan Range Time: 4ms
   - Schedule Time: 27ms
     - Fragment Assign Time: 0ms
     - Fragment Serialize Time: 3ms
     - Fragment RPC Phase1 Time: 21ms
     - Fragment RPC Phase2 Time: 3ms
     - Fragment Compressed Size: 50.43 KB
     - Fragment RPC Count: 2
   - Schedule Time Of BE: {"phase1":{"158.2.100.39: 8060":{"RPC Work Time":"20ms","RPC Latency From FE To BE":"1ms","RPC Work Queue Time":"0ms","RPC Latency From BE To FE":"0ms"}},"phase2":{"158.2.100.39: 8060":{"RPC Work Time":"0ms","RPC Latency From FE To BE":"2ms","RPC Work Queue Time":"0ms","RPC Latency From BE To FE":"1ms"}}}
   - Wait and Fetch Result Time: 509ms
     - Fetch Result Time: 508ms
     - Write Result Time: 0ms
   - Doris Version: doris-2.1.8-1-834d802457
   - Is Nereids: Yes
   - Is Pipeline: Yes
   - Is Cached: No
   - Total Instances Num: 52
   - Instances Num Per BE: 158.2.100.39:8060:52
   - Parallel Fragment Exec Instance Num: 16
   - Trace ID: 376b8d76-8e49-496f-ac42-c185d15b871b
   - Transaction Commit Time: N/A
   - Executed By Frontend: N/A
   - Nereids GarbageCollect Time: 0ms
   - Nereids BeFoldConst Time: 0ms

补充:

  1. 进一步测试了3.0.4版本,依然有类似情况
  2. 暂停一小段时间,比如1分钟,再次执行也会出现性能明显变慢的情况
  3. 通过analyze database计算统计信息后,依然存在该情况
  4. 环境为1个FE+1个BE

结果:
经selectdb大佬指导,是由于hive catalog配置的刷新周期过于频繁导致"metadata_refresh_interval_sec" = "60",将其调大即可解决

1 Answers

请提供下 explain verbose sql 的结果