spinnaker: CloudDriver: java.lang.OutOfMemoryError: GC overhead limit exceeded

Issue Summary:

Spinnaker 1.8.0: CloudDriver fails with java.lang.OutOfMemoryError: GC overhead limit exceeded

Cloud Provider(s):

Kubernetes 1.9 (OpenShift 3.9)

Environment:

Spinnaker is deployed in OpenShift and also monitors two additional Kubernetes (OpenShift) clusters. Halyard is installed on a local VM.

Feature Area:

Pipelines?

Description:

CloudDriver runs out of memory, causing all pipelines to fail until the pod is restarted. The logs report java.lang.OutOfMemoryError: GC overhead limit exceeded, followed by multiple jobs being cancelled.

Steps to Reproduce:

In our environment, leaving Spinnaker running for > 1 day eventually results in this issue.

Additional Details:

Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "WATCHDOG" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Default Executor" java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-07-06 07:21:12.397 ERROR 1 --- [gentScheduler-1] c.n.s.c.r.c.ClusteredAgentScheduler      : Unable to run agents

com.netflix.spinnaker.kork.jedis.telemetry.InstrumentedJedisException: could not execute delegate function
	at com.netflix.spinnaker.kork.jedis.telemetry.InstrumentedJedis.internalInstrumented(InstrumentedJedis.java:84) ~[kork-jedis-1.132.3.jar:1.132.3]
	at com.netflix.spinnaker.kork.jedis.telemetry.InstrumentedJedis.instrumented(InstrumentedJedis.java:69) ~[kork-jedis-1.132.3.jar:1.132.3]
	at com.netflix.spinnaker.kork.jedis.telemetry.InstrumentedJedis.set(InstrumentedJedis.java:120) ~[kork-jedis-1.132.3.jar:1.132.3]
	at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.lambda$acquireRunKey$0(ClusteredAgentScheduler.java:164) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
	at com.netflix.spinnaker.kork.jedis.JedisClientDelegate.withCommandsClient(JedisClientDelegate.java:47) ~[kork-jedis-1.132.3.jar:1.132.3]
	at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.acquireRunKey(ClusteredAgentScheduler.java:163) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
	at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.acquire(ClusteredAgentScheduler.java:121) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
	at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.runAgents(ClusteredAgentScheduler.java:145) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
	at com.netflix.spinnaker.cats.redis.cluster.ClusteredAgentScheduler.run(ClusteredAgentScheduler.java:138) ~[cats-redis-2.54.0-SNAPSHOT.jar:2.54.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_171]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_171]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_171]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_171]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_171]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_171]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171]
Caused by: java.lang.ClassCastException: null

Exception in thread "Exec Default Executor" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-07-06 07:22:10.973  WARN 1 --- [tionAction-1999] c.n.s.c.cache.LoggingInstrumentation     : kubernetes:openshift-dev/KubernetesUnregisteredCustomResourceCachingAgent[1/1] completed with one or more failures

java.lang.NullPointerException: null

2018-07-06 07:22:10.975  WARN 1 --- [tionAction-1990] c.n.s.c.cache.LoggingInstrumentation     : kubernetes:openshift-prod/KubernetesUnregisteredCustomResourceCachingAgent[1/1] completed with one or more failures

java.lang.NullPointerException: null

2018-07-06 07:22:10.976  INFO 1 --- [tionAction-2000] k.v.c.a.KubernetesV2OnDemandCachingAgent : openshift-hq/KubernetesUnregisteredCustomResourceCachingAgent[1/1] is starting
Exception in thread "WATCHDOG" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Exec Stream Pumper" java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-07-06 07:22:53.668  INFO 1 --- [tionAction-1999] c.n.s.c.k.v.c.a.KubernetesV2CachingAgent : openshift-prod/KubernetesNamespaceCachingAgent[1/1] is starting
2018-07-06 07:22:53.668  WARN 1 --- [tionAction-1994] c.n.s.c.cache.LoggingInstrumentation     : kubernetes:openshift-dev/KubernetesNamespaceCachingAgent[1/1] completed with one or more failures

java.lang.OutOfMemoryError: GC overhead limit exceeded

2018-07-06 07:22:55.861  INFO 1 --- [tionAction-1990] k.v.c.a.KubernetesV2OnDemandCachingAgent : openshift-dev/KubernetesUnregisteredCustomResourceCachingAgent[1/1] is starting
2018-07-06 07:22:57.939  INFO 1 --- [tionAction-1996] k.v.c.a.KubernetesV2OnDemandCachingAgent : openshift-prod/KubernetesUnregisteredCustomResourceCachingAgent[1/1] is starting
Exception in thread "WATCHDOG" java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-07-06 07:23:05.006  WARN 1 --- [tionAction-2000] c.n.s.c.cache.LoggingInstrumentation     : kubernetes:openshift-hq/KubernetesUnregisteredCustomResourceCachingAgent[1/1] completed with one or more failures

java.lang.OutOfMemoryError: GC overhead limit exceeded

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

👋 As @lwander mentioned, I’ve got a patch that should help with this. We’ve been running it in production for a few weeks now and have not seen an OOM in some time. I’ll work on getting the patch presentable for upstream review now that I’m back from the holidays.

I think a change proposed by @benjaminws to stream the kubectl output should help here.
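For anyone wondering what that change amounts to, here is a minimal, hypothetical sketch of the streaming idea (the class and method names are illustrative, not the actual clouddriver patch): rather than buffering the entire kubectl ... -o json response in memory before parsing it, the process's stdout is handed to an incremental JSON parser so only one resource is materialized at a time.

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.InputStream;

// Illustrative sketch only -- not the actual clouddriver patch.
class KubectlStreamingSketch {
  // Runs kubectl and walks its JSON output token by token instead of reading the
  // whole response into a String first, so heap usage stays flat for large lists.
  static void streamResources(String... kubectlCommand) throws Exception {
    Process process = new ProcessBuilder(kubectlCommand).start();
    try (InputStream stdout = process.getInputStream();
         JsonParser parser = new JsonFactory().createParser(stdout)) {
      while (parser.nextToken() != null) {
        // A kubectl list response carries its resources in an "items" array.
        if (parser.getCurrentToken() == JsonToken.FIELD_NAME
            && "items".equals(parser.getCurrentName())) {
          parser.nextToken(); // advance onto START_ARRAY
          while (parser.nextToken() == JsonToken.START_OBJECT) {
            // Hand the single item to whatever needs it (e.g. a caching agent),
            // then skip past it without retaining it in memory.
            parser.skipChildren();
          }
        }
      }
    }
    process.waitFor();
  }
}

Invoked with something like streamResources("kubectl", "get", "pods", "--all-namespaces", "-o", "json"), memory stays roughly constant regardless of how many objects the cluster returns, which is exactly where the buffered approach tends to fall over.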

Update from our testing: at the beginning of the week I changed the Hal config to limit the namespaces in each Kubernetes account, and CloudDriver has been much better behaved since; we haven’t had to restart it in several days (except today, when we upgraded to 1.8.1).
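For reference, the namespace-limiting workaround looks roughly like this with Halyard (the account and namespace names below are placeholders for your own):

# Placeholder account/namespace names -- substitute your own, then redeploy.
hal config provider kubernetes account edit openshift-dev \
    --namespaces team-a,team-b
hal deploy apply

Restricting each account to the namespaces Spinnaker actually needs shrinks the set of objects the caching agents index on every cycle, which is presumably why the heap pressure dropped.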

There aren’t any CRD objects in any of the namespaces Spinnaker is watching, which may or may not mean something.