nomad-autoscaler: Failed to get policy (EOF)

Hi folks,

Seeing some more errors and Nomad jobs that should definitely be triggering scaling are not happening.

https://gist.github.com/evandam/4956199b01cd4303873b69cf7d3779c1

My first thought is that the Nomad agent is having issues, but it seems okay and I’m able to hit the same endpoints in the logs directly on the host 🤷‍♂️

As a side note, is there a decent way to debug things? There’s policy_id in logs but I’m not sure how that maps back to the actual policy or Nomad job in question. It seems like it would be very helpful to be able to trace these back.

We’re running Nomad 1.1.2 for what it’s worth.

Any thoughts?

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 10
  • Comments: 16 (6 by maintainers)

Most upvoted comments

We just found 6 different instances of nomad-autoscaler v0.3.6 in 6 different Nomad clusters all in this same “failed to read policy in time” state. Like @hntrmrrs mentioned, all 6 instances seem to start with receiving a 500 from the Nomad API:

2022-06-07T20:29:21.146Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 500 (rpc error: EOF)"
2022-06-07T20:29:31.148Z [INFO]  policy_manager: starting policy source: source=nomad
2022-06-07T20:29:31.148Z [INFO]  policy_manager: starting policy source: source=file
2022-06-07T20:29:31.153Z [INFO]  file_policy_source: starting file policy monitor: file=/local/policies/aws-asg.hcl name=... policy_id=d7c916c9-4a0b-34db-d6ac-943abc74e1e7
2022-06-07T20:29:31.154Z [INFO]  file_policy_source: starting file policy monitor: file=/local/policies/aws-asg.hcl name=... policy_id=2830b7ec-4dfe-a6f5-1a8d-d012a7216233
2022-06-07T20:32:31.153Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=d7c916c9-4a0b-34db-d6ac-943abc74e1e7
2022-06-07T20:32:31.154Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=2830b7ec-4dfe-a6f5-1a8d-d012a7216233

I’m trying to confirm with some teammates, but I think this might have happened when we did an upgrade on the Nomad Servers.

If it happens again, could you restart the Autoscaler using a SIGABRT and send us the goroutine dump?

Here you go!
2022-07-19T18:44:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:44:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:47:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:47:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:50:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:50:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:53:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:53:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:56:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:56:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:59:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:59:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T19:02:52.666Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T19:02:52.666Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
SIGABRT: abort
PC=0x46aac1 m=0 sigcode=0
goroutine 0 [idle]:
runtime.futex()
	/usr/local/go/src/runtime/sys_linux_amd64.s:519 +0x21
runtime.futexsleep(0xffffffffffff8000, 0x2561be0, 0xc00004c800)
	/usr/local/go/src/runtime/os_linux.go:44 +0x36
runtime.notesleep(0x2561d30)
	/usr/local/go/src/runtime/lock_futex.go:160 +0x87
runtime.mPark()
	/usr/local/go/src/runtime/proc.go:1441 +0x2a
runtime.stoplockedm()
	/usr/local/go/src/runtime/proc.go:2602 +0x65
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:3299 +0x3d
runtime.park_m(0xc000520680)
	/usr/local/go/src/runtime/proc.go:3516 +0x14d
runtime.mcall()
	/usr/local/go/src/runtime/asm_amd64.s:307 +0x43
goroutine 1 [chan receive, 1262 minutes]:
github.com/hashicorp/nomad-autoscaler/agent.(*Agent).handleSignals(0xc0003cdb20)
	/home/circleci/project/project/agent/agent.go:270 +0xe5
github.com/hashicorp/nomad-autoscaler/agent.(*Agent).Run(0xc0003cdb20)
	/home/circleci/project/project/agent/agent.go:100 +0x446
github.com/hashicorp/nomad-autoscaler/command.(*AgentCommand).Run(0xc000219a40, {0xc00003a0a0, 0x203000, 0x0})
	/home/circleci/project/project/command/agent.go:333 +0x8e5
github.com/mitchellh/cli.(*CLI).Run(0xc0003a5400)
	/go/pkg/mod/github.com/mitchellh/cli@v1.1.2/cli.go:262 +0x5f8
main.main()
	/home/circleci/project/project/main.go:26 +0x24d
goroutine 7 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000f4980)
	/go/pkg/mod/go.opencensus.io@v0.22.5/stats/view/worker.go:276 +0xb9
created by go.opencensus.io/stats/view.init.0
	/go/pkg/mod/go.opencensus.io@v0.22.5/stats/view/worker.go:34 +0x92
goroutine 10 [IO wait]:
internal/poll.runtime_pollWait(0x7f21944c2e58, 0x72)
	/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc0000f5800, 0x416f46, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000f5800)
	/usr/local/go/src/internal/poll/fd_unix.go:402 +0x22c
net.(*netFD).accept(0xc0000f5800)
	/usr/local/go/src/net/fd_unix.go:173 +0x35
net.(*TCPListener).accept(0xc00039d170)
	/usr/local/go/src/net/tcpsock_posix.go:140 +0x28
net.(*TCPListener).Accept(0xc00039d170)
	/usr/local/go/src/net/tcpsock.go:262 +0x3d
net/http.(*Server).Serve(0xc0001e1340, {0x1a5bd18, 0xc00039d170})
	/usr/local/go/src/net/http/server.go:3002 +0x394
github.com/hashicorp/nomad-autoscaler/agent/http.(*Server).Start(0xc0002df360)
	/home/circleci/project/project/agent/http/server.go:124 +0xd2
created by github.com/hashicorp/nomad-autoscaler/command.(*AgentCommand).Run
	/home/circleci/project/project/command/agent.go:330 +0x896
goroutine 34 [chan receive]:
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).garbageCollectionLoop(0xc000469740)
	/home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:213 +0xc9
created by github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).SetConfig
	/home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:94 +0xef
goroutine 20 [syscall, 1263 minutes]:
os/signal.signal_recv()
	/usr/local/go/src/runtime/sigqueue.go:169 +0x98
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:24 +0x19
created by os/signal.Notify.func1.1
	/usr/local/go/src/os/signal/signal.go:151 +0x2c
goroutine 21 [select, 100962 minutes]:
github.com/armon/go-metrics.(*InmemSignal).run(0xc00053e240)
	/go/pkg/mod/github.com/armon/go-metrics@v0.3.7/inmem_signal.go:64 +0x6c
created by github.com/armon/go-metrics.NewInmemSignal
	/go/pkg/mod/github.com/armon/go-metrics@v0.3.7/inmem_signal.go:38 +0x174
goroutine 23 [sleep]:
time.Sleep(0x3b9aca00)
	/usr/local/go/src/runtime/time.go:193 +0x12e
github.com/armon/go-metrics.(*Metrics).collectStats(0xc0004f2000)
	/go/pkg/mod/github.com/armon/go-metrics@v0.3.7/metrics.go:230 +0x25
created by github.com/armon/go-metrics.New
	/go/pkg/mod/github.com/armon/go-metrics@v0.3.7/start.go:84 +0x17b
goroutine 25 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a1e0, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 26 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a230, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 27 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a280, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 28 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a2d0, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 29 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a320, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 30 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a370, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 31 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a3c0, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 32 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a410, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 49 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a460, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 50 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a4b0, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 51 [select, 59053 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a500, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 52 [select, 59052 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a550, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 53 [select, 59052 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a5a0, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 54 [select, 59051 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a5f0, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 55 [select, 59051 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a640, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 56 [select, 59055 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a690, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 57 [select, 59055 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a6e0, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 58 [select, 59054 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a730, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 59 [select, 59055 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a780, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 60 [select, 59053 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
	/home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a7d0, {0x1a6fb30, 0xc000400c00})
	/home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 61 [select, 59051 minutes]:
github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler(0xc0003cdb20, {0x1a6fb30, 0xc000400c00}, 0xc000366cc0)
	/home/circleci/project/project/agent/agent.go:106 +0xb2
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).Run
	/home/circleci/project/project/agent/agent.go:97 +0x439
goroutine 5811358 [select, 5 minutes]:
net/http.(*persistConn).roundTrip(0xc00053a000, 0xc0006f22c0)
	/usr/local/go/src/net/http/transport.go:2614 +0x97d
net/http.(*Transport).roundTrip(0xc0001e48c0, 0xc000191100)
	/usr/local/go/src/net/http/transport.go:594 +0x7d1
net/http.(*Transport).RoundTrip(0x30, 0x1a3c1e0)
	/usr/local/go/src/net/http/roundtrip.go:18 +0x19
net/http.send(0xc000191100, {0x1a3c1e0, 0xc0001e48c0}, {0x17ff560, 0x40ce01, 0x0})
	/usr/local/go/src/net/http/client.go:252 +0x5d8
net/http.(*Client).send(0xc000465e00, 0xc000191100, {0x203000, 0x203000, 0x0})
	/usr/local/go/src/net/http/client.go:176 +0x9b
net/http.(*Client).do(0xc000465e00, 0xc000191100)
	/usr/local/go/src/net/http/client.go:725 +0x908
net/http.(*Client).Do(...)
	/usr/local/go/src/net/http/client.go:593
github.com/hashicorp/nomad/api.(*Client).doRequest(0xc0003b21c0, 0xc0000e06c0)
	/go/pkg/mod/github.com/hashicorp/nomad/api@v0.0.0-20210503143957-4ccada7924cf/api.go:722 +0x65
github.com/hashicorp/nomad/api.(*Client).query(0xc0004c37b0, {0x184fced, 0xc0006920c0}, {0x1478d40, 0xc00043adc8}, 0x464a00)
	/go/pkg/mod/github.com/hashicorp/nomad/api@v0.0.0-20210503143957-4ccada7924cf/api.go:841 +0xb4
github.com/hashicorp/nomad/api.(*Scaling).ListPolicies(0xc0004c37b0, 0xc00043e760)
	/go/pkg/mod/github.com/hashicorp/nomad/api@v0.0.0-20210503143957-4ccada7924cf/scaling.go:20 +0x65
github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorIDs.func1()
	/home/circleci/project/project/policy/nomad/source.go:109 +0xca
created by github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorIDs
	/home/circleci/project/project/policy/nomad/source.go:102 +0x1e6
goroutine 5206364 [select]:
github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run(0xc00021de00, {0x1a6fb30, 0xc000400c00}, 0xc0002a8fd0)
	/home/circleci/project/project/policy/handler.go:118 +0x411
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run.func1({0xc0000406f0, 0x24})
	/home/circleci/project/project/policy/manager.go:116 +0x46
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
	/home/circleci/project/project/policy/manager.go:115 +0xed3
goroutine 5206401 [select, 1263 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run(0xc00036a140, {0x1a6fb30, 0xc000400c00}, 0xc000366cc0)
	/home/circleci/project/project/policy/manager.go:72 +0x475
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
	/home/circleci/project/project/policy/manager.go:154 +0x847
goroutine 5206403 [select, 1263 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/file.(*Source).MonitorIDs(0xc0004b6100, {0x1a6fb30, 0xc000400340}, {0xc00035e360, 0xc00035e1e0})
	/home/circleci/project/project/policy/file/source.go:84 +0x131
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
	/home/circleci/project/project/policy/manager.go:67 +0x1f3
goroutine 5206404 [select, 5 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorIDs(0xc00037a240, {0x1a6fb30, 0xc000400340}, {0xc00035e360, 0xc00035e1e0})
	/home/circleci/project/project/policy/nomad/source.go:113 +0x26d
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
	/home/circleci/project/project/policy/manager.go:67 +0x1f3
goroutine 5206410 [select, 1263 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/file.(*Source).MonitorPolicy(0xc0004b6100, {0x1a6fb30, 0xc0004005c0}, {{0xc0000406f0, 0x24}, 0xc00007e720, 0xc00007e840, 0xc00007e6c0})
	/home/circleci/project/project/policy/file/source.go:149 +0x7bc
created by github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run
	/home/circleci/project/project/policy/handler.go:115 +0x2e5
goroutine 5206402 [select]:
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).periodicMetricsReporter(0xc00036a140, {0x1a6fb30, 0xc000400340}, 0xc000366cc0)
	/home/circleci/project/project/policy/manager.go:224 +0xdf
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
	/home/circleci/project/project/policy/manager.go:61 +0x196
goroutine 5811386 [IO wait, 5 minutes]:
internal/poll.runtime_pollWait(0x7f21944c2c88, 0x72)
	/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000466f00, 0xc000131800, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000466f00, {0xc000131800, 0xb09, 0xb09})
	/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000466f00, {0xc000131800, 0xc00008d758, 0xc00008d758})
	/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0003720e0, {0xc000131800, 0xc000315800, 0xc00008d7f8})
	/usr/local/go/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc00043ae70, {0xc000131800, 0x0, 0x409b8d})
	/usr/local/go/src/crypto/tls/conn.go:777 +0x3d
bytes.(*Buffer).ReadFrom(0xc0001c4978, {0x1a39800, 0xc00043ae70})
	/usr/local/go/src/bytes/buffer.go:204 +0x98
crypto/tls.(*Conn).readFromUntil(0xc0001c4700, {0x1a3c0c0, 0xc0003720e0}, 0x183be26)
	/usr/local/go/src/crypto/tls/conn.go:799 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc0001c4700, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:606 +0x112
crypto/tls.(*Conn).readRecord(...)
	/usr/local/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc0001c4700, {0xc000356000, 0x1000, 0x10100c0004ad800})
	/usr/local/go/src/crypto/tls/conn.go:1277 +0x16f
net/http.(*persistConn).Read(0xc00053a000, {0xc000356000, 0x40505d, 0x60})
	/usr/local/go/src/net/http/transport.go:1926 +0x4e
bufio.(*Reader).fill(0xc0002b0a80)
	/usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc0002b0a80, 0x1)
	/usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*persistConn).readLoop(0xc00053a000)
	/usr/local/go/src/net/http/transport.go:2087 +0x1ac
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1747 +0x1e05
goroutine 5811243 [IO wait]:
internal/poll.runtime_pollWait(0x7f21944c28e8, 0x72)
	/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000466e80, 0xc000122000, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000466e80, {0xc000122000, 0x1000, 0x1000})
	/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000466e80, {0xc000122000, 0x4cee66, 0x2560c80})
	/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc00000ed40, {0xc000122000, 0x0, 0xc000732758})
	/usr/local/go/src/net/net.go:183 +0x45
net/http.(*connReader).Read(0xc000732750, {0xc000122000, 0x1000, 0x1000})
	/usr/local/go/src/net/http/server.go:780 +0x16d
bufio.(*Reader).fill(0xc000692060)
	/usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc000692060, 0x4)
	/usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*conn).serve(0xc000422000, {0x1a6fbd8, 0xc00042ede0})
	/usr/local/go/src/net/http/server.go:1955 +0xc36
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3034 +0x4e8
goroutine 5811387 [select, 5 minutes]:
net/http.(*persistConn).writeLoop(0xc00053a000)
	/usr/local/go/src/net/http/transport.go:2386 +0xfb
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1748 +0x1e65
goroutine 5206365 [select]:
github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run(0xc00021df40, {0x1a6fb30, 0xc000400c00}, 0x0)
	/home/circleci/project/project/policy/handler.go:118 +0x411
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run.func1({0xc000040720, 0x24})
	/home/circleci/project/project/policy/manager.go:116 +0x46
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
	/home/circleci/project/project/policy/manager.go:115 +0xed3
goroutine 5206366 [select, 1262 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/file.(*Source).MonitorPolicy(0xc0004b6100, {0x1a6fb30, 0xc0006f35c0}, {{0xc000040720, 0x24}, 0xc00007e900, 0xc00007ea80, 0xc00007e8a0})
	/home/circleci/project/project/policy/file/source.go:149 +0x7bc
created by github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run
	/home/circleci/project/project/policy/handler.go:115 +0x2e5
rax    0xca
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x2561d30
rsi    0x80
rbp    0x7fffd8ad9100
rsp    0x7fffd8ad90b8
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x43e700
r13    0x0
r14    0x2561400
r15    0x0
rip    0x46aac1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

Maybe it would make sense to have the health check start to fail when there’s any kind failure to read policies?

For us, this problem appears when the nomad server node restarts (for instance, manually restarting it with systemctl restart nomad). I was able to reproduce it multiple times and the outcome was always the following:

2023-07-10T13:36:54.319Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error=“failed to call the Nomad list policies API: Unexpected response code: 500 (rpc error: EOF)” 2023-07-10T13:37:04.320Z [INFO] policy_manager: starting policy source: source=file 2023-07-10T13:37:04.320Z [INFO] policy_manager: starting policy source: source=nomad 2023-07-10T13:37:04.321Z [INFO] file_policy_source: starting file policy monitor: file=local/policies/batch.hcl name=batch policy_id=xxx 2023-07-10T13:40:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx 2023-07-10T13:43:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx 2023-07-10T13:46:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx 2023-07-10T13:49:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx

A possible quick workaround was inspired by atulsingh0:

Disclaimer: I came out with this fast-and-furious solution to restart the autoscaler job but it’s far from perfect and it works only if the autoscaler logs are in the file autoscaler.stderr.0. Use it at your own risk

service {

        check {
          type    = "script"
          command = "/bin/sh"
          args    = ["-c", "! tail -n 1 alloc/logs/autoscaler.stderr.0 | grep -q 'policy_manager.policy_handler: timeout: failed to read policy in time'"]
          interval = "60s"
          timeout  = "10s"

          check_restart {
            limit = 3
            grace = "90s"
            ignore_warnings = false
          }
        }
      }

Hum…it’s the second time this is being reported, so there may be something wrong with our blocking query handling.

Do you happen to have anything sitting between Nomad and the Autoscaler? Like a load balancer or reverse proxy?

There’s policy_id in logs but I’m not sure how that maps back to the actual policy or Nomad job in question

Yeah, it’s not ideal but you can query Nomad for the scaling policies, either using the nomad scaling policy list command or the GET /v1/scaling/policies endpoint, to get the policy ID.