Doris告警 Image Write failed

Viewed 273

1,检查监控发现Image Write failed数量增长很快:1720149497243.jpg
2,我们有10个Routine load数据导入任务,最近一周半数都有OtherMsg信息:
2024-06-29 08:57:17:errCode = 2, detailMessage = failed to send task: Socket is closed by peer.
2024-07-04 18:00:08:errCode = 2, detailMessage = failed to send task: java.net.SocketException: Broken pipe (Write failed)
想得到帮助:
1,Image Write failed 快500了,什么原因?有什么影响,怎么修复?
2,routine load 作业的OtherMsg信息,是什么原因,怎么修复?

4 Answers

这几个问题:

  1. Image Write failed 这个结合FE JVM内存监控看下的,日志:"the memory used percent 72 exceed the checkpoint memory threshold: 70" ,内存使用超过 jvm heap 70% 不做checkpoint,导致生成image失败了。所以需要看看FE 内存是否有泄漏问题。如果导入任务比较多的话,可以尝试这样调整下:
    1. 观察profile是否是开启的,如果开启的话全局关闭

    2. 导入任务多的话,可能是label堆积,可以修改label的保留时间
      fe.conf
      label_keep_max_second = 14400;
      streaming_label_keep_max_second = 14400;

    3. 将FE JVM GC算法修改为G1
      CMS算法修改为G1 算法

JAVA_OPTS="-Djavax.security.auth.useSubjectCredsOnly=false -Xss4m -Xmx8192m -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:$DORIS_HOME/log/fe.gc.log.$CUR_DATE -Dlog4j2.formatMsgNoLookups=true"

-Xmx改成一致的,注意下CUR_DATE这个环境变量,在老版本可能叫DATE

2.routineload 导入任务报错问题,这个问题一般就是没连上或者是有网络抖动问题,这个应该是会重试的。

image.png

有没有那个时间点FE的日志?得看看有没有更详细的错误信息

周末Image Write failed数量又增加了,fe.log警告级别日志大概如下

2024-07-06 02:57:15,879 WARN (Routine load task scheduler|51) [RoutineLoadTaskScheduler.scheduleOneTask():223] failed to submit routine load task f9c9be259fb94810-a66e924c09639c8c to BE: 10072, error: errCode = 2, detailMessage = failed to send task: java.net.SocketException: Broken p
ipe (Write failed)

上面警告出现次数不少

2024-07-06 05:52:27,716 WARN (leaderCheckpointer|327) [Env.replayJournal():2591] replay journal cost too much time: 3693 replayedJournalId: 19941604

上面警告看到1次

2024-07-06 09:55:29,522 WARN (leaderCheckpointer|327) [Checkpoint.checkMemoryEnoughToDoCheckpoint():327] the memory used percent 72 exceed the checkpoint memory threshold: 70

Image Write failed增加的时间,总有这警告,内存超过70的限制了。YoungGC Old GC 都比较多

2024-07-06 16:30:09,418 WARN (mysql-nio-pool-26706|1148759) [ConnectProcessor.processOnce():843] Null packet received from network. remote: 10.188.100.31:38094
2024-07-06 16:30:09,418 WARN (mysql-nio-pool-26706|1148759) [ReadListener.lambda$handleEvent$0():60] Exception happened in one session(org.apache.doris.qe.ConnectContext@599ce481).
java.io.IOException: Error happened when receiving packet.
        at org.apache.doris.qe.ConnectProcessor.processOnce(ConnectProcessor.java:844) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.mysql.ReadListener.lambda$handleEvent$0(ReadListener.java:52) ~[doris-fe.jar:1.2-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_351]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_351]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_351]

还有这个警告

doris-2.0.15版本。修改fe.conf为以下内容。Image Write failed问题解决了。

27 DATE = `date +%Y%m%d-%H%M%S`
JAVA_OPTS="-Djavax.security.auth.useSubjectCredsOnly=false -Xss4m -Xmx8192m -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:$DORIS_HOME/log/fe.gc.log.$DATE -Dlog4j2.formatMsgNoLookups=true"
JAVA_OPTS_FOR_JDK_9="-Xmx8192m -Xss4m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xlog:gc*:$DORIS_HOME/log/fe.gc.log.$DATE:time -Dlog4j2.formatMsgNoLookups=true"

修改以后检查:

(base) [root@node117 conf]# jps -l | grep org.apache.doris
30861 org.apache.doris.DorisFE
(base) [root@node117 conf]# jmap -heap 30861
Attaching to process ID 30861, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.151-b12

using thread-local object allocation.
Garbage-First (G1) GC with 23 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 8589934592 (8192.0MB)
   NewSize                  = 1363144 (1.2999954223632812MB)
   MaxNewSize               = 5152702464 (4914.0MB)
   OldSize                  = 5452592 (5.1999969482421875MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 2097152 (2.0MB)

Heap Usage:
G1 Heap:
   regions  = 4096
   capacity = 8589934592 (8192.0MB)
   used     = 536966592 (512.0912475585938MB)
   free     = 8052968000 (7679.908752441406MB)
   6.25111386179924% used
G1 Young Generation:
Eden Space:
   regions  = 42
   capacity = 457179136 (436.0MB)
   used     = 88080384 (84.0MB)
   free     = 369098752 (352.0MB)
   19.26605504587156% used
Survivor Space:
   regions  = 27
   capacity = 56623104 (54.0MB)
   used     = 56623104 (54.0MB)
   free     = 0 (0.0MB)
   100.0% used
G1 Old Generation:
   regions  = 199
   capacity = 541065216 (516.0MB)
   used     = 390165952 (372.09124755859375MB)
   free     = 150899264 (143.90875244140625MB)
   72.11070689120034% used

46517 interned Strings occupying 4515176 bytes.