fugit: endless loop dealing with cron "seconds" string
@conet hijacked the closed issue gh-11, challenging the fix there, but provided nothing that supports his accusation.
In fact, there is a source issue report at: https://gitlab.com/gitlab-org/gitlab-ce/issues/59273
we had:
```ruby
gitlab_rails['ldap_sync_worker_cron'] = "5 * * * * *"
gitlab_rails['ldap_group_sync_worker_cron'] = "10 * * * * *"
```

instead of:

```ruby
gitlab_rails['ldap_sync_worker_cron'] = "5 * * * *"
gitlab_rails['ldap_group_sync_worker_cron'] = "10 * * * *"
```

Once the cron expressions are fixed, CPU usage goes back to normal.
OP adds:
invalid cron expressions should not cause such behavior.
```
irb(main):001:0> require 'Fugit'
=> true
irb(main):002:0> Fugit.parse('5 * * * * *')
=> #<Fugit::Cron:0x00007f91ba8d54c0
 @original="5 * * * * *", @cron_s=nil, @seconds=[5],
 @minutes=nil, @hours=nil, @monthdays=nil, @months=nil, @weekdays=nil,
 @zone=nil, @timezone=nil>
irb(main):003:0> Fugit.parse('10 * * * * *')
=> #<Fugit::Cron:0x00007f91ba844650
 @original="10 * * * * *", @cron_s=nil, @seconds=[10],
 @minutes=nil, @hours=nil, @monthdays=nil, @months=nil, @weekdays=nil,
 @zone=nil, @timezone=nil>
```
In other words: “5 * * * * *” means “every time the second hand hits the 5”.
Sidekiq-cron allows for second crons (see https://github.com/ondrejbartas/sidekiq-cron/pull/240). These are NOT invalid cron expressions.
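To make the field shift concrete, here is a small standalone sketch (the `label_cron_fields` helper is hypothetical, not part of fugit or sidekiq-cron) showing how the sixth field changes what the leading "5" means:

```ruby
# Hypothetical helper (not from fugit): name each field of a cron
# expression to show how a sixth field shifts the meaning of the
# leading "5" from "minute 5" to "second 5".
def label_cron_fields(expr)
  fields = expr.split
  names =
    case fields.size
    when 5 then %w[minutes hours monthdays months weekdays]
    when 6 then %w[seconds minutes hours monthdays months weekdays]
    else raise ArgumentError, "expected 5 or 6 fields, got #{fields.size}"
    end
  names.zip(fields).to_h
end

p label_cron_fields('5 * * * *')   # "5" lands in the minutes field
p label_cron_fields('5 * * * * *') # "5" now lands in the seconds field
```

With six fields, "5 * * * * *" means "at second 5 of every minute", which matches the `@seconds=[5]` seen in the irb session above.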
Questions to @conet:
- you mention
```
2019-03-20_18:36:54.79856 2019-03-20T18:36:54.798Z 2977 TID-ot8g7o9m9 WARN: Thread TID-ot89u8ea9 scheduler
2019-03-20_18:36:54.79857 2019-03-20T18:36:54.798Z 2977 TID-ot8g7o9m9 WARN: /opt/gitlab/embedded/lib/ruby/gems/2.5.0/gems/tzinfo-1.2.5/lib/tzinfo/time_or_datetime.rb:321:in `wrap'
...
```
as an endless loop. Do you mean that this warn trace repeats endlessly in your log?
- Does the LDAP sync work that your settings seem to imply actually happen or not? Does it happen every 5 / 10 seconds? Does it happen but on a different schedule? Or is it NOT happening at all?
Thanks in advance and best regards.
About this issue
- State: closed
- Created 5 years ago
- Comments: 59 (27 by maintainers)
Commits related to this issue
- Conjure up worst cast cron#next_time for gh-15 [ci-skip] — committed to floraison/fugit by jmettraux 5 years ago
- Move to max iteration count breaker, gh-15 The previous breaker system was two-pronged: * break if the delta goes over 40~ years * break if the delta stays at 0 for more than 2 iteration After consi... — committed to floraison/fugit by jmettraux 5 years ago
- Prevent #previous_time endless loop, gh-15 ```ruby c = Fugit.parse('10 * * * * *') t = c.previous_time.to_f# + 0.123 #(some float x so that 0.0 <= x < 1.0) c.previous_time(Time.at(t)) # enters in ... — committed to floraison/fugit by jmettraux 5 years ago
- Prevent #previous_time endless loop, gh-15 (spec) — committed to floraison/fugit by jmettraux 5 years ago
Thank you all! I created a merge request to update Fugit in GitLab: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/26579
Hello @conet and @godfat,
I have released fugit 1.1.9.
Thank you very much! If there is anything wrong, please tell me!
Special thanks to @conet for fighting for his case and pushing fugit in the right direction!
After several hours running the above container with your fix, the endless loop hasn’t kicked in; I’d say the issue can be closed.
Hello @conet,
If you had the time to confirm the issue is fixed, that’d be great.
The fixed fugit has been running for 8 hours on my GitLab enterprise instance and I see no issues, no endless loops. (I just dropped the updated https://github.com/floraison/fugit/blob/b50bbfdb243f4ce4e8ce0d966eab25c6833ab594/lib/fugit/cron.rb file over the defective one).
If the fix is confirmed, I will release fugit 1.1.9.
Thanks in advance.
Wow, thank you both so much for getting to the bottom of this! Very impressive debugging process. Sorry I didn’t jump in earlier; apparently I didn’t exactly follow the reasoning, and I still don’t really understand what went wrong.
I can provide some general ideas though.
@jmettraux In general, though, I think breaking based on an iteration count is better than a time-based approach, because of what @conet said:

A time-based approach can become very unpredictable. My favorite example is an old game, The 7th Guest: it has a mini-game whose AI was pretty weak when the game was released. Twenty years later the AI became unbeatable, because it only stopped searching for the best move based on elapsed time, and on a computer twenty years more powerful it could find the absolute best move (the game wasn’t that complex anyway).

Besides, a timer requires another thread to watch it. If we have a global timer, that is fine resource-wise, but if we need a new timer for each calculation, that doubles the threads we need. We could also check the clock on each iteration, but then heavy load, or suspending and resuming execution, would interfere with it and might cause it to return too early, breaking the application under load. That would be very hard to debug unless we already know the code is time-sensitive.
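A minimal sketch of the iteration-count idea (an illustration, not fugit’s actual breaker code): bound the search loop by a fixed iteration budget, so the failure mode is deterministic regardless of machine speed or load.

```ruby
# Sketch of an iteration-count breaker (not fugit's real implementation).
# The loop's termination depends only on the iteration count, never on
# wall-clock time, so no extra timer thread is needed and behaviour is
# the same on a fast or a heavily loaded machine.
def bounded_search(max_iterations = 1024)
  i = 0
  loop do
    i += 1
    raise "no result after #{max_iterations} iterations" if i > max_iterations
    result = yield(i)
    return result if result
  end
end

p bounded_search { |i| i * i if i == 7 } # => 49
# bounded_search { nil } would raise instead of spinning forever
```

A search that can never succeed now fails fast with a diagnosable error instead of pegging a CPU core.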
Yes, if something like this can be done, that’ll be ideal. It’s like a sanity check based on complexity. I don’t know cron and time well enough to know whether this is feasible, though.
Is this because of floating point precision?
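For context, here is a small sketch of the kind of precision loss being asked about (illustrative values; whether this is the actual culprit here is the open question): converting a microsecond-precise `Time` to a `Float` and back does not round-trip exactly, which can leave a time cursor a fraction of a second away from the instant it started from.

```ruby
# Illustrative only: a Float cannot represent most decimal fractions of
# a second exactly, so Time -> Float -> Time loses sub-second precision.
t  = Time.at(1_553_106_954, 798_560) # microsecond-precise instant
f  = t.to_f                          # nearest representable double
t2 = Time.at(f)

puts t == t2        # generally false: the double is not exactly t
puts (t - t2).abs   # tiny non-zero residual, in seconds
```

A loop that feeds `Time.at(some_time.to_f)` back into itself can therefore keep landing slightly off the boundary it is testing against.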
@jmettraux Thank you so much for your dedication in getting to the bottom of this, and for being willing to install GitLab to try it out. I know that can be cumbersome. Please don’t stress yourself too much, though; it’s already much appreciated that you took a look.

@conet Thank you for providing all the logs and traces and for going through them patiently, especially when you’re not familiar with Ruby. I know it takes a lot of effort to spin up an instance, prepare the patch, run it, wait, and finally collect logs, especially when the issue might not reproduce at all.
Now that you mention the `match_second` part, I do recall seeing it at the top of the stack trace often. I’m glad we got to the bottom of it, thanks for your support.

Thank you very much.
Finally, here’s the current end, because it goes on and on:
and the beginning:
A few early observations:
- The endless loop is not the one exposed in gh-11 / gh-13. That loop (https://github.com/floraison/fugit/blob/f1e2d002a6f59ca01c87744f2f3c78dc1931f86a/lib/fugit/cron.rb#L186-L215) is happening in `#next_time`, while your case occurs around `#previous_time`.
- As suggested by @godfat, there is now a breaker in the `#next_time` loop. At the same time, a breaker was added in the `#previous_time` loop (https://github.com/floraison/fugit/blob/f1e2d002a6f59ca01c87744f2f3c78dc1931f86a/lib/fugit/cron.rb#L232-L254). That breaker never triggered for you. No re-opening of gh-11 / gh-13.
- The log proves that Sidekiq-Cron is calling `#should_enque` repeatedly, with the same `time` value 😦
- I’ve looked at Sidekiq-Cron (https://github.com/ondrejbartas/sidekiq-cron/blob/a9dfab248ca7/lib/sidekiq/cron/job.rb and https://github.com/ondrejbartas/sidekiq-cron/blob/a9dfab248ca7/lib/sidekiq/cron/poller.rb) and I don’t yet see how this loop comes to be.
- I think the impression you got that the problem comes from `#previous_time` stems from this method probably being the costliest of the ones used in `#should_enque`, so your kill dumps exposed it repeatedly.

It took a bit longer, but here it is:
I assume that only the last entry is important, if you want more please let me know.
Yes, that is sidekiq’s way of getting a stacktrace dumped to the logs, by running:
@conet I am in the middle of something else, but I can help with this quickly:
Run this to start a console: `sudo gitlab-rails console`, and then run the snippet, which should print the results.
@godfat you’re welcome! If it’s my code, I have to fix it 😃
Thank you for looking into this! (I can’t reproduce the issue by myself)