horizon: Horizon can restart gracefully but cannot stop gracefully
Hi!
I have an issue where I need to be able to stop Horizon gracefully.
Currently it works great for restarts (which you would want after a deploy), but stopping Horizon gracefully is an issue due to how Supervisord and Systemd work.
Restarting Works Great
Supervisord / Systemd is usually set to restart Horizon when it stops (even if it stops with a non-zero exit code). This works great for restarts, as we can use horizon:terminate and/or horizon:terminate --wait to have it gracefully stop workers, knowing the process monitor will start Horizon back up.
However, horizon:terminate is async. It returns immediately (before the workers are finished). This means that any follow-up command runs right away, while jobs may still be in progress.
This is an issue for stopping Horizon.
Stopping is Not Great™
Supervisord can only send signals to stop a process. A SIGTERM to Horizon stops everything immediately - it’s not graceful. So short of listening for an alternative signal, or changing Horizon so SIGTERM is graceful, Supervisord has no way to stop Horizon gracefully.
Systemd has an ExecStop option that could run artisan horizon:terminate. However, Systemd sends a SIGTERM as soon as the ExecStop command returns.
Since horizon:terminate is async and returns before the workers are finished, that SIGTERM arrives almost immediately and kills Horizon too soon.
Why?
This is tough on long running processes that can’t necessarily be killed and tried again (think ChipperCI running builds that can last anywhere from a few minutes to an hour).
The use case I have specifically is using AutoScaling to create and terminate servers dynamically. Terminating a server with a running job would definitely cut off someone’s build.
Another option, of course, is re-architecting how jobs are run so they’re split up into more idempotent tasks. We’ll cross that bridge if we need to, as it would be a pretty big task.
Possible Solutions
I have two ideas:
- The master process (the one that Supervisord/Systemd monitors) can listen for an alternative signal or change what SIGTERM does to shut down gracefully. This would work for any process monitor.
- The horizon:terminate command can include a --hang (or similar) flag to wait until workers are shut down (instead of being async and returning immediately). This would allow Systemd to use ExecStop.
I’m willing to help with a PR for any solution, but I’ll need some direction in where to find some code due to Horizon’s complexity. I’m guessing this may involve some polling to figure out the state of each worker, but I’m not 100% clear on that.
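To make the polling idea concrete, here is a minimal shell sketch of what a blocking stop could look like from the outside, without touching Horizon itself. The helper name wait_for_exit is hypothetical and not part of Horizon, and the sketch assumes the caller knows the PID of the Horizon master process and is allowed to signal it (for example because it runs as the same user). It only checks whether the master process still exists; it does not inspect the state of individual workers.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Horizon: block until a process has exited.

# wait_for_exit PID [TIMEOUT_SECONDS]
# Polls roughly once per second until the process with the given PID is gone.
# Returns 0 if the process exited, 1 if the timeout elapsed first.
wait_for_exit() {
    local pid="$1"
    local timeout="${2:-3600}"   # default mirrors the 1 hour timeout idea
    local waited=0

    # "kill -0" sends no signal; it only reports whether the PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use in a stop script (assumption: the process monitor exposes the
# master PID, as Systemd does via $MAINPID for ExecStop commands):
#   /usr/bin/php artisan horizon:terminate --wait
#   wait_for_exit "$MAINPID" 3600
```

Because the stop script would not return until the master process is gone (or the timeout passes), a process monitor that waits for its stop command to finish would hold back its own kill signal for that long.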
A Hacky Solution
If the above ideas are too complex (I can certainly see that being the case), I have a hacky solution that works in Systemd (there’s no equivalent in Supervisord):
File (in Ubuntu 18.04): /lib/systemd/system/myjobs.service.
[Unit]
Description=Long Running Job Worker
[Service]
User=forge
WorkingDirectory=/home/forge/myapp.com
# Restart "always" so we can run horizon:terminate during deploys
Restart=always
ExecStart=/usr/bin/php artisan horizon --environment=production
# Allow graceful stop/restart by running a command
# to stop, rather than sending a SIGTERM signal
ExecStop=/usr/bin/php artisan horizon:terminate --wait
# This signal tells Horizon to "unpause". We're using
# it as a hack so Systemd doesn't send a SIGTERM prematurely
KillSignal=SIGCONT
# A long (1 hour) timeout to give jobs time to finish
TimeoutStopSec=3600
[Install]
WantedBy=multi-user.target
Use the following to see it working:
sudo systemctl enable myjobs
sudo systemctl start myjobs
sudo systemctl status myjobs
# run a long-running job and half-way through, run:
sudo systemctl stop myjobs
# It should hang until the workers stop gracefully
Let me know…
If you have other ideas or think I’m misunderstanding something, definitely let me know!
About this issue
- State: closed
- Created 5 years ago
- Reactions: 11
- Comments: 16 (11 by maintainers)
We’re adding this to the docs 😃
@graemlourens I don’t have time to look at this myself atm but I’ll try to ping @themsaid.
Thanks everyone!
I think it’s not uncommon for developers to be introduced to Supervisor for the first time because of Horizon.
AFAICS the official docs at https://laravel.com/docs/6.x/horizon don’t mention this. I mean I get it, they can’t cover every possible case, but pointing towards stopwaitsecs might be useful there…
@driesvints would it be possible to get the ‘hint’ required by @fideloper to be able to continue analysis of this ticket? It seems to me that gracefully stopping should be an important part of Horizon, as queues are now the backbone of many crucial system parts.
Currently I’m not able to provide PRs, but I’m happy to invest time in testing scenarios and reliability if there is a proposal.