go: time: make Timer/Ticker channels not receivable with old values after Stop or Reset returns

In https://github.com/golang/go/issues/14383#issuecomment-220657111, @rsc said:

There is an argument for making

tr := time.NewTimer(0)
tr.Stop()
// There may or may not be a value in the channel now. But definitely
// one should not be added after we receive it.
select {
        case <-tr.C:
        default:
}

work, but it doesn’t today, and we’re certainly not going to do that for Go 1.7. Maybe early in Go 1.8 if someone wants to file a different bug.

As far as I can tell, no different bug was filed: the documentation for (*Timer).Stop still says: https://github.com/golang/go/blob/7d2473dc81c659fba3f3b83bc6e93ca5fe37a898/src/time/sleep.go#L57-L66 and that behavior is still resulting in subtle bugs (#27169). (Note that @rsc himself assumed that a select should work in this way in https://github.com/golang/go/issues/14038#issuecomment-219909704.)

I think we should tighten up the invariants of (*Timer).Stop to eliminate this subtlety.

CC @empijei @matttproud @the80srobot @ianlancetaylor

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 11
  • Comments: 25 (25 by maintainers)

Most upvoted comments

For what it’s worth, seeing all that buggy code as evidence of people struggling to use the current t.Stop and t.Reset correctly in single-goroutine use cases only makes me that much more convinced we should try the proposed change.

Above, we identified that one possible way that the semantic change being discussed might break code is if the code used len to poll whether the timer had gone off. We would have to make len report zero all the time, simulating an unbuffered channel. Code that depended on len sometimes being non-zero could possibly break.

My intuition was that using len on a channel is racy and almost never better than a select with a default case, so there’s a decent chance that not too much code is using len. To test this hypothesis, I searched my local corpus of Go packages for length on timer channels and sifted through the results with @bcmills. We found that use of len on timer channels is (1) rare, (2) almost always incorrect, (3) when incorrect, always fixed by the new semantics.

The corpus I am using is a collection of the latest of each major version of each module listed in the Go index (index.golang.org) as of late March. There are 134,485 modules with distinct paths in my corpus. Of those, 17,021 mention time.NewTimer or time.NewTicker. Of those, 15 modules (slightly under 0.1% of the 17,021) contain code matching len\(.*\.C\) in which the len operation applies to a timer or ticker channel.

Of the 15 modules applying len to a timer or ticker channel: one is an awkward simulation of a non-blocking select, already removed in a later commit; one uses len in debug prints to monitor the ticker; and the remaining 13 are buggy.

The semantic change proposed in this issue would make all the buggy code correct, assuming we apply the same change to Reset as we’ve discussed for Stop. We should probably also apply the change to time.Ticker’s Reset and Stop.

Use of len to simulate non-blocking select

github.com/packetzoom/logslammer

github.com/packetzoom/logslammer version v0.0.3 output/elasticsearch/elasticsearch.go (April 7, 2016) contains the following code in output/elasticsearch/elasticsearch.go:

func readInputChannel(idx *Indexer, receiveChan chan *buffer.Event) {
	// Drain the channel only if we have room
	if idx.bulkService.NumberOfActions() < esSendBuffer {
		select {
		case ev := <-receiveChan:
			idx.index(ev)
		}
	} else {
		log.Printf("Internal Elasticsearch buffer is full, waiting")
		time.Sleep(1 * time.Second)
	}
}

...

// Loop events and publish to elasticsearch
tick := time.NewTicker(time.Duration(esFlushInterval) * time.Second)

for {
	readInputChannel(idx, receiveChan)

	if len(tick.C) > 0 || len(es.term) > 0 {
		select {
		case <-tick.C:
			idx.flush()
		case <-es.term:
			tick.Stop()
			log.Println("Elasticsearch received term signal")
			break
		}
	}
}

This code seems to be using the len checks as an alternative to a non-blocking select on those two channels. It’s also worth noting that when received events are rare, the tick receive and corresponding idx.flush may be arbitrarily delayed by the blocking receive in readInputChannel. It’s possible that select was meant to be non-blocking.

The semantic change proposed in this issue would make the code never process ticks, since len(tick.C) would always be zero. However, in the latest untagged commit (February 17, 2017), the test of len(tick.C) has been removed. Also, the lack of any commits in the last three years, nor even a tag of the updated code, suggests the code is unmaintained and no longer used.

Use of len for logging/debugging prints

github.com/uber/cherami-server

github.com/uber/cherami-server version v1.28.1 services/outputhost/messagecache.go (November 29, 2017) has a logPumpHealth method that logs information about the capacity and length of its ticker channel, among others:

func (msgCache *cgMsgCache) logPumpHealth() {
	logPumpHealthOnce.Do(func() {
		msgCache.lclLg.WithFields(bark.Fields{
			...
			`RedeliveryTickerCap`:         cap(msgCache.redeliveryTicker.C),
		}).Info(`cache pump health capacities`)
	})

	msgCache.lclLg.WithFields(bark.Fields{
		...
		`RedeliveryTickerLen`:         len(msgCache.redeliveryTicker.C),
	}).Info(`cache pump health`)

	lg := msgCache.lclLg
	now := common.Now()

	...
	if msgCache.lastRedeliveryTickerFull > 0 {
		lg = lg.WithField(`RedeliveryTicker`, common.UnixNanoTime(now-msgCache.lastRedeliveryTickerFull).ToSeconds())
	}

	lg.Info(`cache pump health last channel full times`)
}

And the `updatePumpHealth method does:

func (msgCache *cgMsgCache) updatePumpHealth() int {
	now := common.Now()
	fullCount := 0
	...
	if cap(msgCache.redeliveryTicker.C) <= len(msgCache.redeliveryTicker.C) {
		msgCache.lastRedeliveryTickerFull = now
		fullCount++
	}
	return fullCount
}

The logging would never trigger anymore, and the pump health update would register the channel as “full” for being unbuffered (length and capacity both 0).

However, the repository’s README says that the project is “deprecated and not maintained.”

Racy use of len to drain channel after Stop

github.com/Freezerburn/go-coroutine

github.com/Freezerburn/go-coroutine v1.0.1 src/github.com/Freezerburn/coroutine/embed.go:

// Guarantee the timer channel is drained.
// Despite the docs saying to only check the return value of Stop, it's possible for Stop to return false without
// there being an item in the channel, so we also double-check if there is an item in the channel before attempting
// to drain it.
if !e.waitTimer.Stop() && len(e.waitTimer.C) > 0 {
	<-e.waitTimer.C
}
e.waitTimer.Reset(duration)
<-e.waitTimer.C

This use of len is potentially racing with the pending send that Stop did not stop. The correct code today would be to track whether e.waitTimer.C has been received from already and use that state in place of the length check. In fact, since this code is the only use of e.waitTimer, the timer will have only already fired and not been received from on the first execution of the method containing this code. Every time the method completes, the timer has expired and been read from. The first execution is different because e.waitTimer is initialized to time.NewTimer(0).

The proposed semantic change to Stop would make this code correct: on the first call, e.waitTimer.Stop would return true—the Stop prevented the zero-duration timer event from being sent. On subsequent iterations, e.waitTimer.Stop would return false—the timer event was already received by the previous call—and len would be zero, bypassing the no-longer-necessary draining code.

There is another instance of the pattern later in the file:

if !e.receiveTimer.Stop() && len(e.receiveTimer.C) > 0 {
	<-e.receiveTimer.C
}

e.receiveTimer.Reset(duration)
select {
case <-e.receiver:
case <-e.receiveTimer.C:
}

Like e.waitTimer, e.receiveTimer is initialized to time.NewTimer(0). On top of that, the code does not record whether the select received from e.receiveTimer.C, as it must for proper interpretation of the Stop result. This lack of memory is partly compensated for by the racy (buggy) len check.

The same analysis applies: if the Stop guarantees to stop any future send, then either the receive has already happened, in which case Stop returns false and len is zero, so or the receive is prevented, in which case Stop returns true. Either way, the code would now bypass the no-longer-necessary draining code.

The current latest commit rewrites both of those code snippets to use a helper that is different (avoiding len entirely) but still buggy:

func resetTimer(t *time.Timer, d time.Duration) {
	if !t.Stop() {
		// I've seen cases where Stop returns false, but there isn't an item in the channel, causing the receive to
		// hang forever.
		select {
		case <-t.C:
		default:
		}
	}
	t.Reset(d)
}

The same analysis applies, since a non-blocking select is just a less racy “len plus receive” operation. Removing the race between the length check and the receive did not eliminate the race with the delayed timer send. But the proposed semantic change would.

github.com/MatrixAINetwork/go-matrix

github.com/MatrixAINetwork/go-matrix v1.1.7 p2p/buckets.go (May 20, 2020):

for {
	select {
	case <-timeoutTimer.C:
		...
		if !timeoutTimer.Stop() && len(timeoutTimer.C) > 0 {
			<-timeoutTimer.C
		}
		timeoutTimer.Reset(time.Second * 60)
	case h := <-b.blockChain:
		if !timeoutTimer.Stop() && len(timeoutTimer.C) > 0 {
			<-timeoutTimer.C
		}
		timeoutTimer.Reset(time.Second * 60)
		...
	}
}

This code is trying to make sure that each iteration of the select falls into the timeout case after 60 seconds. For the most part it succeeds, but a receive from b.blockChain right at the moment the timer went off would fail to drain timeoutTimer.C, causing the next iteration to fall into the timeout case immediately.

Same race as in previous example; proposed semantic change applies the same fix.

github.com/cobolbaby/log-agent

github.com/cobolbaby/log-agent 71f7f9f watchdog/watchdog.go:

timer := time.NewTimer(delay)
...
for {
	select {
	case e = <-eventChan:
		...
		// timer may be not active, and fired
		if !timer.Stop() && len(timer.C) > 0 {
			<-timer.C //ctry to drain from the channel
		}
		timer.Reset(delay)
	case <-timer.C:
		...
		return
	}
}

Same race; same fix.

github.com/dcarbone/go-cacheman

github.com/dcarbone/go-cacheman v1.1.2 key_manager.go:

if !lifespan.Stop() && len(lifespan.C) > 0 {
	<-lifespan.C
}
lifespan.Reset(km.idleTTL)

Same race; same fix (appears four times in file).

github.com/myENA/consultant/v2

github.com/myENA/consultant/v2 v2.1.1 service.go:

if !refreshTimer.Stop() && len(refreshTimer.C) > 0 {
	<-refreshTimer.C
}
refreshTimer.Reset(ms.refreshInterval)

Same race; same fix (appears three times in file).

github.com/smartcontractkit/chainlink

github.com/smartcontractkit/chainlink v0.7.8 core/services/fluxmonitor/flux_monitor.go:

// stopTimer stops and clears the timer as suggested by the documentation.
func stopTimer(arg *time.Timer) {
	if !arg.Stop() && len(arg.C) > 0 {
		// Residual events are the timer's channel and need to be cleared.
		//
		// Refer to timer.Stop's documentation or
		// https://developpaper.com/detailed-explanation-of-the-trap-of-timer-in-golang/
		<-arg.C
	}
}

Same race; same fix. This is of course not the code suggested by the documentation, although it does match the (incorrect) linked blog post.

This code was removed without comment in v0.8.0.

github.com/vntchain/go-vnt

github.com/vntchain/go-vnt v0.6.4-alpha.6 producer/worker.go:

// The stop command may be happen when the round timer is timeout but not deal with
// the timeout event, so cleaning the channel of roundTimer is needed.
if false == self.roundTimer.Stop() && len(self.roundTimer.C) > 0 {
	<-self.roundTimer.C
	log.Warn("worker.roundTimer.C still has a expired event, now has been cleaned")
}

Same race; same fix.

github.com/eBay/akutan

github.com/eBay/akutan 9a750f2 src/github.com/ebay/akutan/util/clocks/wall.go:

timer.Stop()
for len(timer.C) > 0 {
	<-timer.C
}
...
timer.Reset(wake.Sub(Wall.Now()))

Same race; same fix.

github.com/piotrnar/gocoin

github.com/piotrnar/gocoin 820d7ad client/main.go:

SaveBlockChain.Stop()
for len(SaveBlockChain.C) > 0 {
	<-SaveBlockChain.C
}
if common.BlockChainSynchronized {
	SaveBlockChain.Reset(SaveBlockChainAfter)
} else {
	SaveBlockChain.Reset(SaveBlockChainAfterNoSync)
}

Same race; same fix.

Racy use of len to drain channel before Reset

github.com/qlcchain/go-qlc

github.com/qlcchain/go-qlc v1.3.5 p2p/sync.go:

if len(ss.pullTimer.C) > 0 {
	<-ss.pullTimer.C
}
ss.pullTimer.Reset(pullReqTimeOut)

This is the usual pattern from the previous section, except there is no call to Stop, so the receives are racing against an unstopped timer instead of the dying gasps of a stopped timer. But the race is the same.

If we change Reset to have the same guarantee as Stop, namely that no send from the old timer setting (before the Reset) can complete after Reset returns, then a plain call to Reset without Stop like in this code becomes correct, even without the if statement. With the proposed change, the if body would never execute.

Racy use of len to drain channel after Reset

github.com/chenjie199234/Corelib

github.com/chenjie199234/Corelib 2d8c16542cbe logger/logger.go:

for {
	select {
	case <-tmer.C:
		...
		tmer.Reset(500 * time.Millisecond)
		if len(tmer.C) > 0 {
			<-tmer.C
		}
	case <-l.update:
		...
		tmer.Reset(500 * time.Millisecond)
		if len(tmer.C) > 0 {
			<-tmer.C
		}
	case <-l.refresh:
		...
	}
}

This is like the previous example but the draining happens after Reset, making it race with the newly reset timer. A bad scheduling pause that introduced a half-second delay in goroutine execution between the timer being reset and the draining would cause a legitimate timer event to be discarded.

This entire package was removed on April 20.

Racy use of len to drain channel without Stop/Reset

github.com/indeedsecurity/carbonbeat

github.com/indeedsecurity/carbonbeat v1.0.0 app/carbonbeat.go:

ticker := time.NewTicker(bt.config.Period)
for {
	select {
	case <-bt.done:
		...
		return nil
	case <-ticker.C:
		//Remove all ticks from the ticker channel, do not "catch up"
		for len(ticker.C) > 0 {
			<-ticker.C
		}
	}
	...
}

Strictly speaking, this code is not wrong, but it is also not useful. The ticker channel has a buffer of length 1. In the instant after the receive just completed, the buffer is almost certainly empty. The check is racing with the next send, but very unlikely to lose that race. Changing the len to always return 0 will not have any discernible impact on this code.

github.com/indeedsecurity/carbonbeat/v2

github.com/indeedsecurity/carbonbeat/v2 v2.0.2 app/carbonbeat.go contains the same code.

Discussion ongoing, so leaving this for another week, but it seems headed for likely accept. Please speak up if you think this is a bad idea.

In the current implementation, this can probably be done by changing runOneTimer to move the Cas away from timerRunning until after the invocation of f.

I think this is an excellent idea and I’d bet that I’ve written at least one piece of code which lacks, but needs, the fixup code.

Added to proposal process to resolve whether to do this, after mention in #38945.