dalli: Dalli sometimes returns incorrect values

Hi, thanks for all your work maintaining this repo! I am a security engineer investigating some incorrect Rails caching behavior observed after upgrading to the latest version from 2.7.x, and have identified a mechanism consistent with my findings:

There is a race condition in the pipelined getter implementation which allows for socket corruption. The key problem is that the connection lock (request_in_progress) is acquired only after sending the getkq requests; if anything goes wrong between sending those requests and acquiring the lock, the socket can be left in a state that is both corrupt and (superficially) ready to accept future requests.

In the corrupted state, subsequent operations that read from the socket behave incorrectly: they expect the next bytes on the socket to be the responses to their own requests, but the unread getkq responses precede them. The client can therefore return incorrect values for given keys because the writing and reading operations are out of sync, “off-by-x”, where x is the number of unread getkq responses.
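The desync can be modeled with a toy in-memory “socket” (none of these classes are Dalli’s; this is just a minimal sketch of the FIFO behavior described above):

```ruby
# Toy model of the desync: the "socket" is a FIFO where every request
# eventually produces one response in order. If a pipelined read is
# abandoned, the stale responses stay queued, and the next reader
# consumes them instead of its own.
class FakeSocket
  def initialize
    @responses = []
  end

  def write_get(key)
    @responses << "value-for-#{key}" # the server answers in order
  end

  def read_response
    @responses.shift
  end
end

sock = FakeSocket.new

# Pipelined getkq requests are written...
sock.write_get("test1")
sock.write_get("test2")
# ...but the process is interrupted before their responses are read.

# A later, unrelated get is now off by two:
sock.write_get("test10")
stale = sock.read_response # => "value-for-test1", not "value-for-test10"
```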

Suggested fix

I propose we should instead acquire a connection lock before writing the getkq requests to the socket. This means that if something happens after writing the getkq requests but before reading their responses, the connection will be left in the locked (request_in_progress) state and will refuse to accept additional requests (resetting instead). I am opening a PR which attempts to do this.
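To illustrate the intended behavior (class and method names below are illustrative only, not Dalli’s real API): with the lock taken before writing, an interruption anywhere in the pipeline leaves the connection flagged, and the next operation resets the socket instead of reading stale bytes.

```ruby
# Minimal sketch of the proposed ordering: lock BEFORE writing the
# pipelined requests, so any failure before the responses are read
# leaves the connection marked busy and forces a reset on next use.
class Connection
  attr_reader :resets

  def initialize
    @request_in_progress = false
    @resets = 0
  end

  def pipelined_get(keys)
    @request_in_progress = true   # lock first
    # (write getkq requests here)
    yield if block_given?         # stand-in for a mid-pipeline failure
    # (read responses here)
    @request_in_progress = false
  end

  def get(key)
    reset! if @request_in_progress # dirty socket: reset, don't read
    "value-for-#{key}"
  end

  private

  def reset!
    @resets += 1
    @request_in_progress = false
  end
end

conn = Connection.new
begin
  conn.pipelined_get(%w[a b]) { raise "interrupted mid-pipeline" }
rescue RuntimeError
end

conn.get("test10") # correct value; the dirty socket was reset first
conn.resets        # => 1
```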

Open question

Note that reading the 2.7.x code, it seems that it could have had the same issue here. I am unsure if something else changed to make this failure case more likely - or if it was just a coincidence. The time window for a timeout/failure leading to this bug is narrow, so this would certainly be a rare event with randomness at play; nonetheless I observed the incorrect behavior not long after upgrading to 3.2.3 which is suspicious and I would like to be able to explain what, if anything, changed.

Demo/POC

In a Rails console, I used the following steps to reproduce the issue:

# code to produce a timeout on the 3.x Dalli gem
# simulates a process timeout occurring between make_getkq_requests
# and finish_queries
Dalli::PipelinedGetter.class_eval do
  def setup_requests(keys)
    groups = groups_for_keys(keys)
    make_getkq_requests(groups)

    # sleep past the surrounding Timeout.timeout so it fires here
    sleep 10

    # TODO: How does this exit on a NetworkError
    finish_queries(groups.keys)
  end
end

c = ActiveSupport::Cache.lookup_store(:mem_cache_store)

# populate some cache entries by doing stuff like 
c.fetch("test10") { "test10result" }
c.fetch("test1") { "test1result1" }
c.fetch("test2") { "test1result2" }
c.fetch("test3") { "test1result3" }

# trigger timeout during read_multi
Timeout.timeout(5) do
  c.read_multi("test1", "test2", "test3", "test4")
end

# this should now output the wrong result
c.fetch("test10")

=> "test1result1"

About this issue

  • State: closed
  • Created a year ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

@petergoldstein If you’d like me to take this one and shepherd it through the PR process, I’d be happy to take it off your plate.

although I think it might be more complex here because of memcached pipelining (a naive port of the Redis code seems unlikely to work).

So I had a quick look, and indeed it will have to work a bit differently, but it should be doable. Unlike Redis, with memcached pipelining you don’t know how many responses to expect, but you do know when you are done.

So I think incrementing by one before sending the pipelined command, and then decrementing by one once the pipeline is done reading would work. But again, I just had a quick look.
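The increment/decrement idea above could be sketched roughly like this (all names here are hypothetical, not Dalli’s API): since the number of quiet getkq responses isn’t known in advance, the counter brackets the whole pipeline rather than counting replies.

```ruby
# Rough sketch: mark the connection busy for the whole pipeline.
# Increment before sending the pipelined commands, decrement once the
# terminating response has been read; a nonzero count means the socket
# may still carry unread responses and is unsafe to reuse.
class PipelineGuard
  def initialize
    @outstanding = 0
  end

  def begin_pipeline
    @outstanding += 1  # before sending the pipelined commands
  end

  def end_pipeline
    @outstanding -= 1  # once reading the pipeline is done
  end

  def clean?
    @outstanding.zero? # only then is the socket safe to reuse
  end
end

guard = PipelineGuard.new
guard.begin_pipeline
guard.clean?  # => false: reading was abandoned, socket is suspect
guard.end_pipeline
guard.clean?  # => true
```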

But if you want to take a swing at it, that’d be fine too.

I’m at Ruby Kaigi for about a week, so @ cornu-ammonis if you feel like trying this help yourself, otherwise I’ll try to have a stab at it once I’m back.

This seems like a valid issue to me. There could be many ways to corrupt a socket; even if Dalli is a memcached library where speed is critical, we still should value correctness over speed.

Honestly, none of the above makes much sense to me. So no.

Hi @petergoldstein, sorry that we got off on the wrong foot last week. I’ve come to better appreciate your points about timeouts and request/resource management. It’s clear to me now why Timeout.timeout is unsafe, and any request timeout which does not fully restart the process would be playing with fire, even if library/application code tries to make it safer. https://github.com/mperham/connection_pool/pull/75#issuecomment-91571249 was another interesting example of this. So you’re right that the library isn’t responsible for protecting against timeout misuse; the fundamental safety problem is the timeouts themselves.

I’m carrying that lesson into some other Rails codebases I’m involved with, so I’m grateful you took the time to engage with me and explain it.

However, I’m now wondering if my misguided timeout examples were just a red herring, because they don’t help to explain why I and others saw this issue occur right after upgrading. I found some other historical issues in this repo referencing the same or similar behavior, such as:

I spoke directly with an engineer at another company who mentioned seeing the same behavior a few years ago, and who shared that it happened again recently:

This seems to happen whenever we upgraded the gem. Recently we saw the same behavior during a rails version upgrade.

Unfortunately we did not have a solution but mitigated the issue by ensuring that the different versions of the app did not write to the same cache namespace.

I’m curious if you can think of any reason the version upgrade process itself might cause issues? E.g. subtle formatting differences between versions, such that interacting with entries set by the previous version might have unexpected behavior? Just a shot in the dark there. I’m following the advice above by putting the Dalli version in our cache namespace to effectively bust cache when upgrading, but nonetheless curious about what could be going on.

I also see that there have been past issues related to reading get_multi responses, which were addressed in #321; that seems similar enough to what I’ve seen to be worth mentioning.

Would any of those additional reports influence your evaluation of whether it’s worth exploring some simple logic in Dalli’s get implementation to mitigate the possibility of incorrect responses? As the comment on #123 says, debugging this kind of issue is near-impossible, but there might be a simple, general failsafe worth pursuing either way. Again, it’s really unclear whether there’s a bug in Dalli at all, or how we could find it if there were, so I’m not saying you or the library have done anything wrong; I’m just wondering whether we might nonetheless be able to mitigate this rare failure mode for other users of the library.

Thanks again for your insights!

@mperham No, I don’t quite agree that’s what’s going on here.

What was initially reported was a case of pure user error. A non-thread safe Dalli client was passed around after a Timeout. Unsurprisingly, it was corrupted. Feel free to read my comment above - I think it’s a correct summary. I don’t believe Dalli has any responsibility to be “safe” in that situation. And I don’t believe that the situation is trivial to detect and/or error on. Not impossible, but it would require some reasonable work. Perhaps this is where we disagree.

A solution was proposed that adds another method to the Client interface and replaces use of memcached’s get with getk and “checks” the key on return. I noted my general problems with the proposed implementation - I don’t think it’s coded well or architected well. I think breaking the abstractions in the gem is a really bad idea. It added a weird side method, implying that the standard method usage was somehow unsafe. And I don’t think the tradeoffs make any sense - guarding against this supposed problem becomes very expensive, both in terms of performance and maintainability.

Despite an agreement that the original problem was user error, the proposed solution was nonetheless viewed as a solution to some set of supposed corruption issues dug out of the depths of Dalli issues (3 over a decade of issues). And this one thing he heard from a guy he works with that they always change the cache key when they upgrade Dalli versions (which would have nothing to do with socket corruption, but hey…) so obviously it’s all the same problem. And so the code should obviously be merged…

Well, no.

Now, on to your comments. It is certainly possible for a socket to become corrupt. It can happen for reasons other than user error, and it’s nice to have tooling that mitigates that. It’s even nicer if that mitigation helps in cases of user error, assuming the tradeoffs aren’t too onerous. In my view, given the infrequency of the issue, the tradeoffs have to be pretty small in terms of maintainability, API coherency, and performance.

That said, that doesn’t mean a mitigating PR is not possible. It would just need to meet that tradeoff threshold.

As you note, in the memcached case the opaque field can be used for this purpose. So one approach is to use that. I’d absolutely consider a PR that:

  1. Adds the optional use of an opaque value across memcached get (and possibly other request types) without requiring new Client methods (I’d even be open to discussing signature changes, although that can be challenging with Rails dependencies)
  2. Works in a mostly consistent fashion across both the (now deprecated) binary and meta protocols.
  3. Integrates with the existing Dalli::Error scheme in a sensible way, and handles the underlying connection cleanly.
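The opaque check could look something like the sketch below (the structs and method are stand-ins, not Dalli types; the binary protocol echoes the client-chosen opaque field back verbatim in each response, so a tagged request can detect a response left over from an earlier, abandoned one):

```ruby
# Illustrative only: tag each request with an opaque value and verify
# the response echoes it back before returning the value to the caller.
Request  = Struct.new(:key, :opaque)
Response = Struct.new(:value, :opaque)

def verify!(request, response)
  unless request.opaque == response.opaque
    # in Dalli terms this would map onto a Dalli::Error and a reset
    raise "out-of-sync response: expected opaque #{request.opaque}, " \
          "got #{response.opaque}"
  end
  response.value
end

req = Request.new("test10", 42)
verify!(req, Response.new("test10result", 42)) # => "test10result"

# A stale getkq response left on the socket carries the wrong opaque:
begin
  verify!(req, Response.new("test1result1", 7))
rescue RuntimeError
  # detected before a wrong value is ever returned to the caller
end
```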

@casperisfine I’d also be open to something like you’re proposing, although I think it might be more complex here because of memcached pipelining (a naive port of the Redis code seems unlikely to work). But if you want to take a swing at it, that’d be fine too.