go: time: Timer reset broken under heavy use since go1.16 timer optimizations added [1.16 backport]

@ianlancetaylor requested issue #47329 to be considered for backport to the next 1.16 minor release.

@gopherbot Please open backport to 1.16.

This bug also exists in 1.16. It can cause programs that use Timer.Reset to fail to run a timer when it is ready.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (12 by maintainers)

Most upvoted comments

Backport complete.

@lukestoward This issue is closed and as far as we can tell, it is fixed. The fix was in the Go 1.16.7 release. Although the fix may well have introduced the problem you describe, since the patch has been released we should treat this as a new problem. I encourage you to open a new issue, ideally with a case that we can use to reproduce the problem.

See also #47762.

FYI, I’m seeing increased CPU usage on various x64 and arm64 Linux nodes after rolling out ed8cbbc3ae96aef98d8f9e9e7003a99ed74992b5. I will roll back to 1.16.6 tomorrow to confirm and capture some CPU profiles.

EDIT: I confirmed reverting that single commit brought the cpu usage of my otherwise idle golang program from 100% back to 0%.

Here’s a 60s cpu profile with that commit included:

(pprof) top --cum
Showing nodes accounting for 49.47s, 82.42% of 60.02s total
Dropped 25 nodes (cum <= 0.30s)
Showing top 10 nodes out of 26
      flat  flat%   sum%        cum   cum%
     4.17s  6.95%  6.95%     60.01s   100%  runtime.findrunnable
         0     0%  6.95%     58.82s 98.00%  runtime.schedule
         0     0%  6.95%     58.76s 97.90%  runtime.mcall
         0     0%  6.95%     58.63s 97.68%  runtime.park_m
     3.91s  6.51% 13.46%     42.10s 70.14%  runtime.netpoll
    38.19s 63.63% 77.09%     38.19s 63.63%  runtime.epollwait
     1.90s  3.17% 80.26%      4.70s  7.83%  runtime.checkTimers
     1.25s  2.08% 82.34%      1.54s  2.57%  runtime.nanotime (inline)
     0.01s 0.017% 82.36%      1.42s  2.37%  runtime.unlock (inline)
     0.04s 0.067% 82.42%      1.41s  2.35%  runtime.unlockWithRank (inline)

and back to normal after I reverted it:

(pprof) top --cum
Showing nodes accounting for 30ms, 75.00% of 40ms total
Showing top 10 nodes out of 17
      flat  flat%   sum%        cum   cum%
         0     0%     0%       20ms 50.00%  runtime.gcBgMarkWorker
         0     0%     0%       20ms 50.00%  runtime.gcBgMarkWorker.func2
      10ms 25.00% 25.00%       20ms 50.00%  runtime.gcDrain
         0     0% 25.00%       20ms 50.00%  runtime.mcall
         0     0% 25.00%       20ms 50.00%  runtime.park_m
         0     0% 25.00%       20ms 50.00%  runtime.schedule
         0     0% 25.00%       20ms 50.00%  runtime.systemstack
      10ms 25.00% 50.00%       10ms 25.00%  runtime.epollwait
         0     0% 50.00%       10ms 25.00%  runtime.findrunnable
      10ms 25.00% 75.00%       10ms 25.00%  runtime.futex

Closed per Ian’s comment.

I’m uncertain whether CL 337309 should be cherry-picked to the 1.16 branch. The bug fix (CLs 336432, 336689) addressed a real bug that could cause timers to fail to fire. CL 337309 is a performance fix, and I’m mildly concerned it could introduce some other unforeseen performance problem. On the 1.16 release branch we needed the bug fix, but I don’t know how aggressively we want to chase performance issues on that branch.