presto: Memory leak in `com.facebook.presto.memory.ClusterMemoryManager#changeListeners`

Example log: https://api.travis-ci.org/v3/job/389838962/log.txt
Commit: https://github.com/prestodb/presto/commit/48ea8f3eaa9a3047893d0acd5304035912ebb681

Example PR: https://github.com/prestodb/presto/pull/10808
Log: https://api.travis-ci.org/v3/job/390987547/log.txt

log tail:

2018-06-12T06:46:47.197+0545 INFO Major GC: application 12ms, stopped 6406ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.197+0545 INFO Major GC: application 12ms, stopped 6406ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.197+0545 INFO Major GC: application 12ms, stopped 6406ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:47.198+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 6ms, stopped 6222ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.067+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:53.068+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:46:59.581+0545 WARNING Node state update request to http://127.0.0.1:44031/v1/info/state has not returned in 18.21s
2018-06-12T06:46:59.582+0545 WARNING Error fetching node state from http://127.0.0.1:36641/v1/info/state: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
2018-06-12T06:46:59.584+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:37427/v1/service/presto/general
2018-06-12T06:47:49.816+0545 INFO Discovery server connect succeeded for refresh (presto/general)
2018-06-12T06:48:13.880+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 1.02m
2018-06-12T06:48:25.391+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:48:36.835+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 28.75s
2018-06-12T06:48:42.965+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:37427/v1/service/presto/general
2018-06-12T06:48:54.927+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:41456/v1/service/presto/general
2018-06-12T06:49:06.700+0545 SEVERE Cannot connect to discovery server for announce: Announcement failed for http://127.0.0.1:41456
2018-06-12T06:49:06.700+0545 SEVERE Service announcement failed after 2.62m. Next request will happen within 8000.00ms
2018-06-12T06:49:06.700+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:41456/v1/service/presto/general
2018-06-12T06:49:06.703+0545 INFO Discovery server connect succeeded for refresh (presto/general)
2018-06-12T06:49:06.703+0545 INFO Discovery server connect succeeded for refresh (presto/general)
2018-06-12T06:49:12.678+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 1.84m
2018-06-12T06:49:12.679+0545 SEVERE Cannot connect to discovery server for announce: Announcement failed for http://127.0.0.1:37427
2018-06-12T06:49:12.679+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:37427/v1/service/presto/general
2018-06-12T06:49:12.679+0545 SEVERE Service announcement failed after 2.72m. Next request will happen within 8000.00ms
2018-06-12T06:49:18.676+0545 SEVERE Service announcement failed after 5.19m. Next request will happen within 2.00ms
2018-06-12T06:49:24.589+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 1.94m
2018-06-12T06:49:24.591+0545 SEVERE Expected service announcement after 8000.00ms, but announcement was delayed 1.94m
2018-06-12T06:49:24.591+0545 SEVERE Cannot connect to discovery server for refresh (presto/general): Lookup of presto failed for http://127.0.0.1:37427/v1/service/presto/general
2018-06-12T06:49:24.593+0545 SEVERE Cannot connect to discovery server for announce: Announcement failed for http://127.0.0.1:37427
2018-06-12T06:49:30.664+0545 SEVERE Service announcement failed after 3.02m. Next request will happen within 8000.00ms
2018-06-12T06:49:54.974+0545 WARNING Error fetching memory info from http://127.0.0.1:36641/v1/memory: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
2018-06-12T06:50:00.882+0545 SEVERE Expected service announcement after 4.00ms, but announcement was delayed 42.21s
2018-06-12T06:50:07.174+0545 WARNING Node state update request to http://127.0.0.1:44031/v1/info/state has not returned in 205.80s
2018-06-12T06:50:07.175+0545 WARNING Error fetching node state from http://127.0.0.1:44031/v1/info/state: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
2018-06-12T06:50:07.175+0545 INFO Major GC: application 22ms, stopped 6435ms:: 2.12GB -> 2.12GB
2018-06-12T06:50:07.175+0545 INFO Major GC: application 7ms, stopped 5955ms:: 2.12GB -> 2.12GB

Should we increase the memory limit for tests? Could there be a memory leak?

CC: @electrum

Most upvoted comments

I see now. The 10M ClusterMemoryManager$$Lambda$7855 instances are queued executions in com.facebook.presto.memory.ClusterMemoryManager#listenerExecutor:

Class Name                                                                                                | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------------------------------------------------
listenerExecutor java.util.concurrent.Executors$FinalizableDelegatedExecutorService @ 0x75bae0cd8         |           16 |            16
|- e java.util.concurrent.ThreadPoolExecutor @ 0x75bae0ce8                                                |           80 |   856,403,752
|  |- workQueue java.util.concurrent.LinkedBlockingQueue @ 0x75bae0d48                                    |           48 |   856,403,288
|  |  |- head java.util.concurrent.LinkedBlockingQueue$Node @ 0x75cefaf20                                 |           24 |   856,402,792
|  |  |  |- next java.util.concurrent.LinkedBlockingQueue$Node @ 0x75cefaf50                              |           24 |   856,402,768
|  |  |  |  |- next java.util.concurrent.LinkedBlockingQueue$Node @ 0x75cefb050                           |           24 |   856,402,512
|  |  |  |  |- item com.facebook.presto.memory.ClusterMemoryManager$$Lambda$7855 @ 0x75cefaf68            |           24 |            24
|  |  |  |  |  |- arg$2 com.facebook.presto.spi.memory.MemoryPoolInfo @ 0x75cefaf80                       |           48 |           208
|  |  |  |  |  |- arg$1 com.facebook.presto.execution.MemoryAwareQueryExecution$$Lambda$7853 @ 0x75c12ab68|           16 |            16
-----------------------------------------------------------------------------------------------------------------------------------------

I didn't dig into why the executions queue up (it might be a throughput bottleneck, since the executor is single-threaded, or some kind of lock).
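To illustrate the mechanics, here is a minimal standalone sketch (not Presto code) of how a single-threaded executor over an unbounded queue accumulates tasks when notifications are submitted faster than the lone worker drains them; each queued lambda also pins whatever it captured, which is where the retained heap in the dump above would come from:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ListenerQueueGrowth
{
    public static void main(String[] args)
            throws InterruptedException
    {
        // One worker thread over an unbounded LinkedBlockingQueue -- the
        // same shape as ClusterMemoryManager's listenerExecutor
        // (Executors.newSingleThreadExecutor() delegates to exactly this).
        ThreadPoolExecutor listenerExecutor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        // Submit "listener notifications" faster than the single worker
        // can process them. The byte array stands in for the captured
        // MemoryPoolInfo seen in the heap dump.
        for (int i = 0; i < 1_000; i++) {
            byte[] capturedState = new byte[1024];
            listenerExecutor.execute(() -> slowListener(capturedState));
        }

        Thread.sleep(100);
        System.out.println("tasks still queued: " + listenerExecutor.getQueue().size());
        listenerExecutor.shutdownNow();
    }

    private static void slowListener(byte[] state)
    {
        try {
            // The listener runs slower than the submission rate, so the
            // queue only ever grows while submissions continue.
            Thread.sleep(10);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}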

Recently the problem became more severe and more easily reproducible locally. I am under the impression that MemoryAwareQueryExecution adds listeners in ClusterMemoryManager:

memoryManager.addChangeListener(GENERAL_POOL, none -> start());
memoryManager.addChangeListener(RESERVED_POOL, none -> start());

that are never going to be removed.

I am not entirely sure yet, since all the information I have so far is the 10M instances of com.facebook.presto.memory.ClusterMemoryManager$$Lambda$7855.
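If that hypothesis holds, a fix would need some way to deregister listeners. A minimal sketch of what such a registry could look like (ListenerRegistry, register(), and the returned handle are hypothetical names for illustration; ClusterMemoryManager only exposes addChangeListener, which is the crux of the leak):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical sketch, not Presto code: a listener registry whose
// register() hands back a removal handle so listeners added per query
// don't accumulate forever.
public class ListenerRegistry<T>
{
    private final List<Consumer<T>> listeners = new CopyOnWriteArrayList<>();

    // The caller keeps the handle and runs it once the listener is no
    // longer needed (e.g. after a MemoryAwareQueryExecution-style caller
    // has successfully started).
    public Runnable register(Consumer<T> listener)
    {
        listeners.add(listener);
        return () -> listeners.remove(listener);
    }

    public void fire(T event)
    {
        for (Consumer<T> listener : listeners) {
            listener.accept(event);
        }
    }
}

With something like this, each query execution would hold the handle and invoke it from its start/completion path, so its lambdas become unreachable once it no longer needs memory-pool updates.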

cc @nezihyigitbasi @raghavsethi