openhab-core: [automation] memory leak caused by bad rule in UI

Tested against S2207 but reported on https://community.openhab.org/t/openhab-3-runs-out-of-memory-java-heap-space-errors-cpu-100-after-a-few-hours/110924/105 as happening between 3.0 and 3.0.1

As identified by @mhilbush a rule similar to the rule below, but created on the new rules UI. This is 100% reproducible. I created an Ubuntu 20.04 VM (on ESXi) with 4GB ram. I only installed the systeminfo binding to monitor system use over time. I removed the rrd4j persistence as it was thought to be causing the problem at once point (it wasn’t). There is nothing on this system other than a single thing created as per the config below.

rule "Test One per Second"
when
    Time cron "* * * * * ? *"
then
    xxxxxxx
    var Number n = 1
end

Causes the following:

2021-02-15 03:25:52.728 [DEBUG] [e.automation.internal.RuleEngineImpl] - The rule 'aeefe05726' is executed.
2021-02-15 03:25:53.673 [DEBUG] [e.automation.internal.RuleEngineImpl] - The trigger '1' of rule 'aeefe05726' is triggered.
2021-02-15 03:25:53.735 [ERROR] [internal.handler.ScriptActionHandler] - Script execution of rule with UID 'aeefe05726' failed:      ___ xxxxxxx
    var Number n = 1

   The method or field xxxxxxx is undefined; line 1, column 4, length 7
javax.script.ScriptException:      ___ xxxxxxx
    var Number n = 1

   The method or field xxxxxxx is undefined; line 1, column 4, length 7
        at org.openhab.core.model.script.runtime.internal.engine.DSLScriptEngine.eval(DSLScriptEngine.java:127) ~[?:?]
        at org.openhab.core.automation.module.script.internal.handler.ScriptActionHandler.lambda$0(ScriptActionHandler.java:62) [bundleFile:?]
        at java.util.Optional.ifPresent(Optional.java:183) [?:?]
        at org.openhab.core.automation.module.script.internal.handler.ScriptActionHandler.execute(ScriptActionHandler.java:59) [bundleFile:?]
        at org.openhab.core.automation.internal.RuleEngineImpl.executeActions(RuleEngineImpl.java:1179) [bundleFile:?]
        at org.openhab.core.automation.internal.RuleEngineImpl.runRule(RuleEngineImpl.java:987) [bundleFile:?]
        at org.openhab.core.automation.internal.TriggerHandlerCallbackImpl$TriggerData.run(TriggerHandlerCallbackImpl.java:89) [bundleFile:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]

Over time, this failure seems to cause a memory leak leading to an eventual system crash due to OutOfMemory

2021-02-15 02:22:34.305 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception:
java.lang.OutOfMemoryError: Java heap space
2021-02-15 02:22:34.307 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception:
java.lang.OutOfMemoryError: Java heap space
2021-02-15 02:22:34.307 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception:
java.lang.OutOfMemoryError: Java heap space
2021-02-15 02:22:34.307 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception:
java.lang.OutOfMemoryError: Java heap space
2021-02-15 02:22:34.306 [WARN ] [e.jetty.util.thread.QueuedThreadPool] -
java.lang.OutOfMemoryError: Java heap space
2021-02-15 02:22:34.308 [ERROR] [org.apache.felix.fileinstall        ] - In main loop, we have serious trouble
java.lang.OutOfMemoryError: Java heap space
2021-02-15 02:26:05.103 [WARN ] [e.jetty.util.thread.QueuedThreadPool] -
java.lang.OutOfMemoryError: Java heap space
2021-02-15 02:26:05.103 [WARN ] [mmon.WrappedScheduledExecutorService] - Scheduled runnable ended with an exception:
java.lang.OutOfMemoryError: Java heap space

systeminfo thing and items created as

openhab@oh3-test-vm:~$ cat /etc/openhab/things/systeminfo.things
Thing systeminfo:computer:test [interval_high=1, interval_medium=1]

openhab@oh3-test-vm:~$ cat /etc/openhab/items/systeminfo.items
String CPU_Name                   "Name"                <none>           { channel="systeminfo:computer:test:cpu#name" }
String CPU_Description            "Description"         <none>           { channel="systeminfo:computer:test:cpu#description" }
Number CPU_Load1                  "Load (1 min)"        <none>           { channel="systeminfo:computer:test:cpu#load1" }
Number CPU_Load5                  "Load (5 min)"        <none>           { channel="systeminfo:computer:test:cpu#load5" }
Number CPU_Load15                 "Load (15 min)"       <none>           { channel="systeminfo:computer:test:cpu#load15" }
Number CPU_Threads                "Threads"             <none>           { channel="systeminfo:computer:test:cpu#threads" }
Number CPU_Uptime                 "Uptime"              <time>           { channel="systeminfo:computer:test:cpu#uptime" }
Number Memory_Available           "Available"           <none>           { channel="systeminfo:computer:test:memory#available" }
Number Memory_Used                "Used"                <none>           { channel="systeminfo:computer:test:memory#used" }
Number Memory_Total               "Total"               <none>           { channel="systeminfo:computer:test:memory#total" }
Number Memory_Available_Percent   "Available (%)"       <none>           { channel="systeminfo:computer:test:memory#availablePercent" }
Number Memory_Used_Percent        "Used (%)"            <none>           { channel="systeminfo:computer:test:memory#usedPercent" }

I attempted to fix this in https://github.com/openhab/openhab-core/pull/2182 thinking that this was because of the threadpools but that was not successful. I also attempted to back out https://github.com/openhab/openhab-core/pull/2057 and https://github.com/openhab/openhab-core/pull/1952 to see if either of those were the problem. This was also fruitless. I am at a loss on where to move further and can not identify where the leak is.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 47 (24 by maintainers)

Commits related to this issue

Most upvoted comments

Hey guys, sorry for not chiming in any earlier, but I have been sick this week and am only slowly recovering.

For the “invalid rule from UI” as per description of this issue, the fix https://github.com/openhab/openhab-core/compare/main...mhilbush:ui-dsl-mem-fix2 seems to be just the right thing to do. Clearly the created resources leak and must be deleted. I have tested it locally and it definitely looks fixed by that change.

So @mhilbush, not to steal you your credit for the fix, may I ask you to create a PR with that?

If you both feel that there are other leaks for valid rules, I’d suggest to open a new issue with a new description for that.

I think there’s still a memory leak even for valid rules. Every time a rule is saved or disabled/enabled, I believe memory will be lost.

I’ll defer to y’all here on the cause of the issue. I’ll still submit a PR for the context memort issue “because it was found and there’s no reason not to fix it”.

@morph166955 I haven’t had a chance to try your jar yet.

However, I believe the problem is in the call to parseScriptIntoXTextEObject, specifically the reuse of the resource set using the getResourceSet method here.

https://github.com/openhab/openhab-core/blob/main/bundles/org.openhab.core.model.script.runtime/src/org/openhab/core/model/script/runtime/internal/engine/ScriptEngineImpl.java#L126

It appears to me that the resource set will grow on every call to parseScriptIntoXTextEObject. The problem was masked when the parsed rule was cached. However, the cached rule only applies to valid rules. An invalid rule will result in a call to parseScriptIntoXTextEObject on every invocation of the rule, which is why we see the memory grow only for invalid rules. I believe that if you back out that caching commit, memory will grow continuously, even with a valid rule.

@kaikreuzer WDYT?

So I think I see the problem, I’m just not sure how to fix it. When you create a rule in the UI no context is created for it. I’m going to keep going through the code to see how contexts are created when text rules are added to see if I can add something to create a context if one doesn’t exist. I “think” that the fact that no context exists is causing the parsing attempt to be stored somewhere and every time the rule fires it attempts to parse and it just spins out of control eventually. Option B is to figure out how to dispose of it properly when it fails.