fugit: endless loop dealing with cron "seconds" string
@conet hijacked the closed issue gh-11, challenging the fix there, but provided nothing that supports his accusation.
In fact, there is a source issue report at: https://gitlab.com/gitlab-org/gitlab-ce/issues/59273
we had:
```ruby
gitlab_rails['ldap_sync_worker_cron'] = "5 * * * * *"
gitlab_rails['ldap_group_sync_worker_cron'] = "10 * * * * *"
```

instead of:

```ruby
gitlab_rails['ldap_sync_worker_cron'] = "5 * * * *"
gitlab_rails['ldap_group_sync_worker_cron'] = "10 * * * *"
```

Once the cron expressions are fixed, CPU usage goes back to normal.
OP adds:
invalid cron expressions should not cause such behavior.
```
irb(main):001:0> require 'Fugit'
=> true
irb(main):002:0> Fugit.parse('5 * * * * *')
=> #<Fugit::Cron:0x00007f91ba8d54c0
 @original="5 * * * * *", @cron_s=nil, @seconds=[5],
 @minutes=nil, @hours=nil, @monthdays=nil, @months=nil, @weekdays=nil,
 @zone=nil, @timezone=nil>
irb(main):003:0> Fugit.parse('10 * * * * *')
=> #<Fugit::Cron:0x00007f91ba844650
 @original="10 * * * * *", @cron_s=nil, @seconds=[10],
 @minutes=nil, @hours=nil, @monthdays=nil, @months=nil, @weekdays=nil,
 @zone=nil, @timezone=nil>
```
In other words: “5 * * * * *” means “every time the second hand hits the 5”.
Sidekiq-cron allows for second crons (see https://github.com/ondrejbartas/sidekiq-cron/pull/240). These are NOT invalid cron expressions.
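To make the field shift concrete, here is a small standalone sketch (the `label_cron_fields` helper is hypothetical, not part of fugit or sidekiq-cron) showing how the sixth field changes what the leading "5" means:

```ruby
# Hypothetical helper (not from fugit): name each field of a cron
# expression to show how a sixth field shifts the meaning of the
# leading "5" from "minute 5" to "second 5".
def label_cron_fields(expr)
  fields = expr.split
  names =
    case fields.size
    when 5 then %w[minutes hours monthdays months weekdays]
    when 6 then %w[seconds minutes hours monthdays months weekdays]
    else raise ArgumentError, "expected 5 or 6 fields, got #{fields.size}"
    end
  names.zip(fields).to_h
end

p label_cron_fields('5 * * * *')   # "5" lands in the minutes field
p label_cron_fields('5 * * * * *') # "5" now lands in the seconds field
```

With six fields, "5 * * * * *" means "at second 5 of every minute", which matches the `@seconds=[5]` seen in the irb session above.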
Questions to @conet:
- you mention
```
2019-03-20_18:36:54.79856 2019-03-20T18:36:54.798Z 2977 TID-ot8g7o9m9 WARN: Thread TID-ot89u8ea9 scheduler
2019-03-20_18:36:54.79857 2019-03-20T18:36:54.798Z 2977 TID-ot8g7o9m9 WARN: /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/tzinfo-1.2.5/lib/tzinfo/time_or_datetime.rb:321:in `wrap'
...
```
as an endless loop. Do you mean that this warn trace repeats endlessly in your log?
- Does the LDAP sync work that your settings seem to imply actually happen or not? Does it happen every 5 / 10 seconds? Does it happen but on a different schedule? Or is it NOT happening at all?
Thanks in advance and best regards.
About this issue
- State: closed
- Created 5 years ago
- Comments: 59 (27 by maintainers)
Commits related to this issue
- Conjure up worst cast cron#next_time for gh-15 [ci-skip] — committed to floraison/fugit by jmettraux 5 years ago
- Move to max iteration count breaker, gh-15 The previous breaker system was two-pronged: * break if the delta goes over 40~ years * break if the delta stays at 0 for more than 2 iteration After consi... — committed to floraison/fugit by jmettraux 5 years ago
- Prevent #previous_time endless loop, gh-15 ```ruby c = Fugit.parse('10 * * * * *') t = c.previous_time.to_f# + 0.123 #(some float x so that 0.0 <= x < 1.0) c.previous_time(Time.at(t)) # enters in ... — committed to floraison/fugit by jmettraux 5 years ago
- Prevent #previous_time endless loop, gh-15 (spec) — committed to floraison/fugit by jmettraux 5 years ago
Thank you all! I created a merge request to update Fugit in GitLab: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/26579
Hello @conet and @godfat,
I have released fugit 1.1.9.
Thank you very much! If there is anything wrong, please tell me!
Special thanks to @conet for fighting for his case and pushing fugit in the right direction!
After several hours running the above container with your fix, the endless loop hasn’t kicked in; I’d say the issue can be closed.
Hello @conet,
If you had the time to confirm the issue is fixed, that’d be great.
The fixed fugit has been running for 8 hours on my GitLab enterprise instance and I see no issues, no endless loops. (I just dropped the updated https://github.com/floraison/fugit/blob/b50bbfdb243f4ce4e8ce0d966eab25c6833ab594/lib/fugit/cron.rb file over the defective one).
If the fix is confirmed, I will release fugit 1.1.9.
Thanks in advance.
Wow, thank you both so much for getting to the bottom of this! Very impressive debugging process. Sorry I didn’t jump in earlier; apparently I didn’t exactly follow the reasoning, and I still don’t really understand what went wrong.
I can provide some general ideas though.
@jmettraux In general, though, I think breaking based on an iteration count is better than a time-based approach, because of what @conet said:

A time-based approach can become very unpredictable. My favorite example is an old game, The 7th Guest: it has a mini-game whose AI was pretty weak when the game was released. Twenty years later the AI became unbeatable, because it only stopped searching for the best move based on elapsed time, and on a computer twenty years more powerful it could find the absolute best move (the game wasn’t that complex anyway).

Besides, a timer requires another thread to watch it. If we have a global timer, that is fine resource-wise, but if we need a new timer for each calculation, that doubles the threads we need. We could also check the clock on each iteration, but then heavy load, or suspending and resuming execution, would interfere with it and might cause it to return too early, breaking the application under load. That would be very hard to debug unless we already know the code is time-sensitive.
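A minimal sketch of the iteration-count idea (an illustration, not fugit’s actual breaker code): bound the search loop by a fixed iteration budget, so the failure mode is deterministic regardless of machine speed or load.

```ruby
# Sketch of an iteration-count breaker (not fugit's real implementation).
# The loop's termination depends only on the iteration count, never on
# wall-clock time, so no extra timer thread is needed and behaviour is
# the same on a fast or a heavily loaded machine.
def bounded_search(max_iterations = 1024)
  i = 0
  loop do
    i += 1
    raise "no result after #{max_iterations} iterations" if i > max_iterations
    result = yield(i)
    return result if result
  end
end

p bounded_search { |i| i * i if i == 7 } # => 49
# bounded_search { nil } would raise instead of spinning forever
```

A search that can never succeed now fails fast with a diagnosable error instead of pegging a CPU core.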
Yes, if something like this can be done, that’ll be ideal. It’s like a sanity check based on complexity. I don’t know cron and time well enough to know whether this is feasible, though.
Is this because of floating point precision?
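For context, here is a small sketch of the kind of precision loss being asked about (illustrative values; whether this is the actual culprit here is the open question): converting a microsecond-precise `Time` to a `Float` and back does not round-trip exactly, which can leave a time cursor a fraction of a second away from the instant it started from.

```ruby
# Illustrative only: a Float cannot represent most decimal fractions of
# a second exactly, so Time -> Float -> Time loses sub-second precision.
t  = Time.at(1_553_106_954, 798_560) # microsecond-precise instant
f  = t.to_f                          # nearest representable double
t2 = Time.at(f)

puts t == t2        # generally false: the double is not exactly t
puts (t - t2).abs   # tiny non-zero residual, in seconds
```

A loop that feeds `Time.at(some_time.to_f)` back into itself can therefore keep landing slightly off the boundary it is testing against.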
@jmettraux Thank you so much for your dedication in getting to the bottom of this, and for being willing to install GitLab to try it out. I know that can be cumbersome. Please don’t stress yourself too much, though; it’s already much appreciated that you took a look.

@conet Thank you for providing all the logs and traces and for going through them patiently, especially when you’re not familiar with Ruby. I know it takes a lot of effort to spin up an instance, prepare the patch, run it, wait, and finally collect logs, especially when the issue might not reproduce at all.
Now that you mention the `match_second` part, I do recall seeing it at the top of the stack trace often. I’m glad we got to the bottom of it, thanks for your support.

Thank you very much.
Finally, here’s the current end, because it goes on and on:
and the beginning:
A few early observations:
- The endless loop is not the one exposed in gh-11 / gh-13. That loop (https://github.com/floraison/fugit/blob/f1e2d002a6f59ca01c87744f2f3c78dc1931f86a/lib/fugit/cron.rb#L186-L215) is happening in `#next_time`, while your case occurs around `#previous_time`.
- As suggested by @godfat, there is now a breaker in the `#next_time` loop. At the same time, a breaker was added in the `#previous_time` loop (https://github.com/floraison/fugit/blob/f1e2d002a6f59ca01c87744f2f3c78dc1931f86a/lib/fugit/cron.rb#L232-L254). That breaker never triggered for you. No re-opening of gh-11 / gh-13.
- The log proves that Sidekiq-Cron is calling `#should_enque` repeatedly, with the same `time` value 😦
- I’ve looked at Sidekiq-Cron (https://github.com/ondrejbartas/sidekiq-cron/blob/a9dfab248ca7/lib/sidekiq/cron/job.rb and https://github.com/ondrejbartas/sidekiq-cron/blob/a9dfab248ca7/lib/sidekiq/cron/poller.rb) and I don’t yet see how this loop comes to be.
- I think the impression you got that the problem comes from `#previous_time` stems from this method probably being the costliest of the ones used in `#should_enque`, so your kill dumps exposed it repeatedly.

It took a bit longer, but here it is:
I assume that only the last entry is important, if you want more please let me know.
Yes, that is sidekiq’s way of getting a stacktrace dumped to the logs, by running:
@conet I am in the middle of something else, but I can help with this quickly:
Run this to start a console: `sudo gitlab-rails console`, and then run the snippet, which should print the results.
@godfat you’re welcome! If it’s my code, I have to fix it 😃
Thank you for looking into this! (I can’t reproduce the issue by myself)