spinnaker: CloudDriver: java.lang.OutOfMemoryError: GC overhead limit exceeded
Issue Summary:
Spinnaker 1.8.0: CloudDriver fails with java.lang.OutOfMemoryError: GC overhead limit exceeded
Cloud Provider(s):
Kubernetes 1.9 (OpenShift 3.9)
Environment:
Spinnaker deployed in OpenShift and also monitoring two additional Kubernetes (OpenShift) clusters. Halyard installed on a local VM.
Feature Area:
Pipelines (?)
Description:
CloudDriver runs out of memory, causing all pipelines to fail until the pod is restarted. The logs report java.lang.OutOfMemoryError: GC overhead limit exceeded, followed by multiple jobs being cancelled.
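An interim mitigation, rather than a fix for whatever is growing, could be to give Clouddriver a larger heap through a Halyard service-settings override. A minimal sketch only, assuming the default Halyard home (~/.hal), the default deployment name, and that the service picks up JAVA_OPTS from its environment; the sizes are illustrative:

# ~/.hal/default/service-settings/clouddriver.yml
env:
  # Illustrative heap sizes; tune to what the node can actually give the pod.
  JAVA_OPTS: "-Xms2g -Xmx4g"

Running hal deploy apply afterwards should redeploy Clouddriver with the new settings.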
Steps to Reproduce:
In our environment, leaving Spinnaker running for > 1 day eventually results in this issue.
Additional Details:
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "WATCHDOG" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Default Executor" java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-07-06 07:21:12.397 ERROR 1 --- [gentScheduler-1] c.n.s.c.r.c.ClusteredAgentScheduler : Unable to run agents
com.netflix.spinnaker.kork.jedis.telemetry.InstrumentedJedisException: could not execute delegate function
at com.netflix.spinnaker.kork.jedis.telemetry.InstrumentedJedis.internalInstrumented(InstrumentedJedis.java:84) ~[kork-jedis-1.132.3.jar:1.132.3]
at com.netflix.spinnaker.kork.jedis.telemetry.InstrumentedJedis.instrumented(InstrumentedJedis.java:69) ~[kork-jedis-1.132.3.jar:1.132.3]
at com.netflix.spinnaker.kork.jedis.telemetry.InstrumentedJedis.set(InstrumentedJedis.java:120) ~[kork-jedis-1.132.3.jar:1.132.3]
at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.lambda$acquireRunKey$0(ClusteredAgentScheduler.java:164) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
at com.netflix.spinnaker.kork.jedis.JedisClientDelegate.withCommandsClient(JedisClientDelegate.java:47) ~[kork-jedis-1.132.3.jar:1.132.3]
at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.acquireRunKey(ClusteredAgentScheduler.java:163) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.acquire(ClusteredAgentScheduler.java:121) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.runAgents(ClusteredAgentScheduler.java:145) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.run(ClusteredAgentScheduler.java:138) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_171]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_171]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_171]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171]
Caused by: java.lang.ClassCastException: null
Exception in thread "Exec Default Executor" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-07-06 07:22:10.973 WARN 1 --- [tionAction-1999] c.n.s.c.cache.LoggingInstrumentation : kubernetes:openshift-dev/KubernetesUnregisteredCustomResourceCachingAgent[1/1] completed with one or more failures
java.lang.NullPointerException: null
2018-07-06 07:22:10.975 WARN 1 --- [tionAction-1990] c.n.s.c.cache.LoggingInstrumentation : kubernetes:openshift-prod/KubernetesUnregisteredCustomResourceCachingAgent[1/1] completed with one or more failures
java.lang.NullPointerException: null
2018-07-06 07:22:10.976 INFO 1 --- [tionAction-2000] k.v.c.a.KubernetesV2OnDemandCachingAgent : openshift-hq/KubernetesUnregisteredCustomResourceCachingAgent[1/1] is starting
Exception in thread "WATCHDOG" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-07-06 07:22:53.668 INFO 1 --- [tionAction-1999] c.n.s.c.k.v.c.a.KubernetesV2CachingAgent : openshift-prod/KubernetesNamespaceCachingAgent[1/1] is starting
2018-07-06 07:22:53.668 WARN 1 --- [tionAction-1994] c.n.s.c.cache.LoggingInstrumentation : kubernetes:openshift-dev/KubernetesNamespaceCachingAgent[1/1] completed with one or more failures
java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-07-06 07:22:55.861 INFO 1 --- [tionAction-1990] k.v.c.a.KubernetesV2OnDemandCachingAgent : openshift-dev/KubernetesUnregisteredCustomResourceCachingAgent[1/1] is starting
2018-07-06 07:22:57.939 INFO 1 --- [tionAction-1996] k.v.c.a.KubernetesV2OnDemandCachingAgent : openshift-prod/KubernetesUnregisteredCustomResourceCachingAgent[1/1] is starting
Exception in thread "WATCHDOG" java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-07-06 07:23:05.006 WARN 1 --- [tionAction-2000] c.n.s.c.cache.LoggingInstrumentation : kubernetes:openshift-hq/KubernetesUnregisteredCustomResourceCachingAgent[1/1] completed with one or more failures
java.lang.OutOfMemoryError: GC overhead limit exceeded
About this issue
- State: closed
- Created 6 years ago
- Comments: 15 (7 by maintainers)
👋 As @lwander mentioned, I’ve got a patch that should help with this. We’ve been running it in production for a few weeks now and have not seen an OOM in some time. I’ll work on getting the patch presentable for upstream review now that I’m back from the holidays.
I think a change proposed by @benjaminws to stream the kubectl output should help here.
Update from our testing: At the beginning of the week I changed the Hal config to limit the namespaces in each Kubernetes account, and CloudDriver has been much better behaved; we haven’t had to restart it in several days now (except today, when we upgraded to 1.8.1).
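For anyone wanting to try the same workaround: the per-account namespace restriction lives under the kubernetes provider in the Halyard config. A sketch only, reusing the openshift-dev account name from the logs above; the namespace names are hypothetical placeholders:

# ~/.hal/config (excerpt)
providers:
  kubernetes:
    accounts:
      - name: openshift-dev
        namespaces:   # hypothetical namespaces; list only what Spinnaker needs to see
          - team-a
          - team-b

The same should be achievable through the hal CLI (hal config provider kubernetes account edit openshift-dev --namespaces ...), followed by hal deploy apply.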
There aren’t any CRD objects in any of the namespaces Spinnaker is watching, which may or may not mean something.