kombu: Table empty or key no longer exists

The issue is that the Redis key keeps getting evicted. I read an old issue link about this. I have confirmed that my Redis instance is not hacked; in fact, we are using Secured Redis.

OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",)

kombu==4.5.0 celery==4.3.0 redis==3.2.1

Is this some issue with redis?

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 51
  • Comments: 108 (36 by maintainers)

Most upvoted comments

Faced this same issue on the first queue whenever I started a second queue or more.

Fixed by downgrading from kombu==4.6.5 to kombu==4.5.0.

It had nothing to do with Redis itself; the missing key _kombu.binding.reply.celery.pidbox is simply never created, as you can see if you run redis-cli monitor.

I’ve been running Redis server 5.0.2 with Celery 3.1.25 and then upgraded to Celery 4.3.0, 4.4.0 and 4.4.2, and experienced this error on each 4.x release. Similar to @the01, this issue doesn’t reproduce reliably for us.

Unfortunately, I can’t upgrade the Redis server version we use, but I would be surprised if a patch update resolved this, especially since we did not encounter this with Celery 3.x.

You need to find out what your problem is.

Had the same issue. I fixed it by downgrading kombu from 4.6.5 to 4.6.3; I still had the bug in version 4.6.4.

I found the same issue, @danleyb2, did you figure out what the problem was with the current version?

Update: Downgrading to v4.5.0 solved the issue. Thanks @danleyb2

The original report talks about OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",). I’ve done some digging as I’m experiencing something similar. It looks like pidbox is “supposed” to handle the issue, but other mechanisms are in the way. This happens on the reply path for a celery control message; specifically, it runs through _publish_reply: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/pidbox.py#L272-L289

Which ultimately ends up in get_table in the redis transport to look up the reply destination: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/transport/redis.py#L834-L840

When it can’t find the reply destination it raises an InconsistencyError, which the _publish_reply method is specifically supposed to handle and ignore (because maybe the caller went away before the response could be sent). The problem is that by the time it tries to handle the exception in _publish_reply, the exception is no longer an InconsistencyError; it’s now an OperationalError. This is because on the way to get_table the code runs through connection.ensure: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/connection.py#L530 and ultimately the _reraise_as_library_errors context manager: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/connection.py#L448-L462

In there it looks up recoverable_connection_errors which for redis takes the fallback case https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/connection.py#L923-L937

And pulls in the channel_errors: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/connection.py#L956-L959

Which pulls in the transport channel_errors: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/transport/redis.py#L1048

Which calls this method on the transport: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/transport/redis.py#L1085-L1087

which gives us this giant list of errors: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/transport/redis.py#L74-L95

Which includes InconsistencyError, meaning it will be translated into an OperationalError and not be caught in the _publish_reply method.

For my project using celery this means that the exception is getting caught here: https://github.com/celery/celery/blob/5b86b35c81ea5a1fbfd439861f4fee6813148d16/celery/worker/pidbox.py#L48-L49

and causing a reset that ultimately means that we stop processing tasks.

I think that probably means that the InconsistencyError should not be included in the list of errors for Redis. I don’t think any other transport is going to suffer from the same issue because InconsistencyError only happens for Redis in get_table.
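To make the translation problem concrete, here is a minimal, self-contained sketch (not kombu’s actual code; the names below are simplified stand-ins for _reraise_as_library_errors, get_table and _publish_reply) showing how re-raising the transport error as a library error defeats the downstream except clause:

from contextlib import contextmanager


class InconsistencyError(Exception):
    """Stand-in for kombu.exceptions.InconsistencyError."""


class OperationalError(Exception):
    """Stand-in for kombu.exceptions.OperationalError."""


@contextmanager
def reraise_as_library_errors():
    # Roughly what connection._reraise_as_library_errors does: recoverable
    # transport errors are re-raised as a generic library error.
    try:
        yield
    except InconsistencyError as exc:
        raise OperationalError(str(exc)) from exc


def get_table(exchange):
    # Stand-in for the redis transport: the binding set has disappeared.
    raise InconsistencyError(f"no route for exchange {exchange!r}")


def publish_reply(exchange):
    # Stand-in for pidbox._publish_reply, which intends to silently ignore
    # a missing reply destination.
    try:
        with reraise_as_library_errors():  # the connection.ensure() path
            get_table(exchange)
    except InconsistencyError:
        pass  # never reached: the error now arrives as OperationalError


publish_reply("reply.celery.pidbox")  # raises OperationalError instead

Because InconsistencyError is included in the transport’s channel_errors list, the real code hits exactly this path.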

Downgrading from 4.6.5 to 4.6.4 worked for us @auvipy when using celery 4.4.0rc3 (with https://github.com/celery/celery/commit/8e34a67bdb95009df759d45c7c0d725c9c46e0f4 cherry picked on top to address a different issue)

OK, I did some investigation and here are the results:

  1. Set keys in Redis work in such a way that when you remove the last record from a set key, the set key itself is removed from Redis.
  2. The _kombu.binding.reply.celery.pidbox key in Redis is of type set and contains the queues bound to the virtual exchange.
  3. The celery.control.inspect().ping() method is not synchronous. It does not wait for a response from the Celery workers; if it does not get a response “immediately” it returns None.
  4. The method celery.control.inspect().ping() creates a new queue at the beginning and puts it into _kombu.binding.reply.celery.pidbox. After the method’s logic has executed, the queue is deleted and hence removed from the set.

Hence, in 99.99% of cases we do not see anything because the workers are fast enough to write a response before celery.control.inspect().ping() returns. But by artificially introducing a delay (using sleep()) in the worker’s ping reply, we get the corner case where the worker responds after celery.control.inspect().ping() has returned. Due to 4., the queue no longer exists and has been removed from the _kombu.binding.reply.celery.pidbox set. Moreover, if that queue was the only one present, the set itself is deleted from Redis due to 1. by the time the worker tries to write the reply to the queue via the virtual exchange. And this is causing the exception we are seeing.
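Point 1 above is easy to verify with redis-py (a quick illustration, assuming a local Redis server and the redis package; demo.binding is just an example key name):

import redis

r = redis.Redis()  # assumes redis://localhost:6379/0
r.delete("demo.binding")

r.sadd("demo.binding", "reply-queue-1")
print(r.exists("demo.binding"))   # 1 -> the set key exists

r.srem("demo.binding", "reply-queue-1")
print(r.exists("demo.binding"))   # 0 -> last member gone, Redis removed the key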

@matusvalo we use Redis as a broker and had this issue every day. Yesterday I installed celery 5.1.2 and kombu from git, rev 7230665e5cd82c3e1b17dc9f5e16dce085994673, and so far have had no issues. We’ll keep an eye on it for a few days, but so far it looks like it does fix the issue, thanks a lot!

Why is this issue closed when it’s clearly still happening to people, and still happening to people on new versions of Redis?

update redis server to v5+

Looks like the reason is #1087. The bug showed up last week, after 4.6.4 -> 4.6.5 migration.

Thank you, with 4.6.4 it works!

Hello @matusvalo, I can confirm that the fix #1404 works for us. We had the issue every few hours on our pipeline (2 servers with a dozen workers each) since we changed our Redis server (from version 5.0.1 to 5.0.3):

celery[redis]==4.4.6 kombu==4.6.11 redis==3.5.3 (server 5.0.3)

For those who do not want to upgrade, you can patch kombu.transport.redis:

import kombu.transport.redis
from kombu.exceptions import InconsistencyError
from kombu.transport import TRANSPORT_ALIASES


class FixedChannel(kombu.transport.redis.Channel):
    def get_table(self, exchange):
        try:
            return super().get_table(exchange)
        except InconsistencyError:  # pragma: no cover
            # The table does not exist since all queues bound to the exchange
            # were deleted. We just need to return an empty list.
            return []


class FixedTransport(kombu.transport.redis.Transport):
    Channel = FixedChannel

# Hack to override redis transport impl
TRANSPORT_ALIASES["redis"] = "$PATH_THIS_FILE:FixedTransport"
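(Note on the snippet above: the "$PATH_THIS_FILE:FixedTransport" string is a placeholder for the import path of the module containing FixedTransport; kombu resolves transport aliases from "package.module:ClassName" strings. Hypothetically, if the patch lived in myproject/transport_patch.py, the alias would be "myproject.transport_patch:FixedTransport".)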

I think I am able to fix the issue. The problem is caused by multiple workers sharing the same oid (which is used to create keys in _kombu.binding.reply.celery.pidbox). This means that sometimes one worker removes the key even when another is still using it. The oid value must be unique per worker, otherwise it will cause the following issue. The fix is simple: the following method should be “uncached”:

https://github.com/celery/kombu/blob/5ef5e22638035a6c412e949a2fbc5d44b7b088b2/kombu/pidbox.py#L407-L413

The fix is to rewrite the property as follows:

    @property
    def oid(self):
        return oid_from(self)

This change alone seems to fix the issue. I have executed multiple runs of the aforementioned reproducer and was not able to reproduce the crash anymore. I will provide a PR with the fix, but I would like to ask everyone to test it.

Note: I tried just using @property instead of @cached_property, but that alone did not help. Additionally removing the self._tls cache attribute fixed the issue.
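The root cause described above (an oid cached once and then shared by every forked worker) can be illustrated with a minimal sketch. This is not kombu’s actual code, just a demonstration of cached_property versus a plain property across a fork:

import os
import uuid
from functools import cached_property
from multiprocessing import get_context


class Mailbox:
    @cached_property
    def cached_oid(self):
        # Computed once in the parent and stored on the instance;
        # forked children inherit the stored value.
        return uuid.uuid3(uuid.NAMESPACE_OID, str(os.getpid()))

    @property
    def fresh_oid(self):
        # Recomputed on every access, so each process derives its own value.
        return uuid.uuid3(uuid.NAMESPACE_OID, str(os.getpid()))


def report(box):
    print(os.getpid(), "cached:", box.cached_oid, "fresh:", box.fresh_oid)


if __name__ == "__main__":
    box = Mailbox()
    _ = box.cached_oid  # populate the cache in the parent before forking
    ctx = get_context("fork")  # Unix only
    workers = [ctx.Process(target=report, args=(box,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Both children print the parent's cached oid (shared, as in the bug),
    # but different fresh oids (unique per worker, as in the fix).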

@auvipy why was this issue closed?

We are already on redis-server v6.0.1 and are still facing the same issue.

I can confirm the bug. I also checked it before #1394 and it is still occurring, so it was not introduced by that fix 🎉. Hence, this bug has a different root cause than the bug fixed by #1394. I have checked this bug and it still occurs even when concurrency is set to 1.

I agree that it’s confusing that this issue is closed, although no reliable solution has been proposed and this is still manifesting. We’re seeing this in Redis 5.0.3 and Celery 4.3.0, but it seems that the specific versions are not very helpful in this case.


I’m still seeing this issue with 4.6.7.

celery==4.4.0 hiredis==1.0.1 kombu==4.6.7 redis==3.4.1


Edit: I’ve ensured timeout is 0 and the memory policy is noeviction. I’ve also set my workers with --without-heartbeat --without-mingle --without-gossip and we’re still seeing the errors. The only thing that comes to mind is that if that particular set becomes empty, the key gets deleted regardless of settings, as per the Redis spec: https://redis.io/topics/data-types-intro#automatic-creation-and-removal-of-keys.
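For anyone wanting to double-check those two settings programmatically, here is a hedged example with redis-py (assumes direct access to the Redis instance; CONFIG may be disabled on some managed services):

import redis

r = redis.Redis(decode_responses=True)  # adjust host/port/db for your broker
print(r.config_get("timeout"))           # e.g. {'timeout': '0'}
print(r.config_get("maxmemory-policy"))  # e.g. {'maxmemory-policy': 'noeviction'}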

Hi @matusvalo, I added your fix to my container’s kombu v5.1.0 manually, but I’m still facing this issue:

kombu.exceptions.OperationalError:
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

The problem is that on my local PC everything is fine, but when I run the Celery container on a VM this issue appears… So I can’t say that the solution is not working, but for some reason it is still happening for me.

I confirm that the Celery worker fails with the following message:

[2021-09-17 22:42:46,107: ERROR/MainProcess] Control command error: OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n")
Traceback (most recent call last):
  File "/home/matus/dev/kombu/kombu/connection.py", line 447, in _reraise_as_library_errors
    yield
  File "/home/matus/dev/kombu/kombu/connection.py", line 524, in _ensured
    return fun(*args, **kwargs)
  File "/home/matus/dev/kombu/kombu/messaging.py", line 199, in _publish
    return channel.basic_publish(
  File "/home/matus/dev/kombu/kombu/transport/virtual/base.py", line 600, in basic_publish
    return self.typeof(exchange).deliver(
  File "/home/matus/dev/kombu/kombu/transport/virtual/exchange.py", line 69, in deliver
    for queue in _lookup(exchange, routing_key):
  File "/home/matus/dev/kombu/kombu/transport/virtual/base.py", line 710, in _lookup
    self.get_table(exchange),
  File "/home/matus/dev/kombu/kombu/transport/redis.py", line 1001, in get_table
    raise InconsistencyError(NO_ROUTE_ERROR.format(exchange, key))
kombu.exceptions.InconsistencyError:
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/matus/dev/celery/celery/worker/pidbox.py", line 44, in on_message
    self.node.handle_message(body, message)
  File "/home/matus/dev/kombu/kombu/pidbox.py", line 142, in handle_message
    return self.dispatch(**body)
  File "/home/matus/dev/kombu/kombu/pidbox.py", line 109, in dispatch
    self.reply({self.hostname: reply},
  File "/home/matus/dev/kombu/kombu/pidbox.py", line 146, in reply
    self.mailbox._publish_reply(data, exchange, routing_key, ticket,
  File "/home/matus/dev/kombu/kombu/pidbox.py", line 277, in _publish_reply
    producer.publish(
  File "/home/matus/dev/kombu/kombu/messaging.py", line 177, in publish
    return _publish(
  File "/home/matus/dev/kombu/kombu/connection.py", line 557, in _ensured
    errback and errback(exc, 0)
  File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/matus/dev/kombu/kombu/connection.py", line 451, in _reraise_as_library_errors
    raise ConnectionError(str(exc)) from exc
kombu.exceptions.OperationalError:
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

Update: we don’t see this specific issue anymore. Thanks a lot!

Have the same issue on the latest versions:

>>> import celery; celery.__version__
'5.0.5'
>>> import kombu; kombu.__version__
'5.0.2'
>>> import redis; redis.__version__
'3.5.3'

InconsistencyError: Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists. Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

OperationalError(“\nCannot route message for exchange ‘reply.celery.pidbox’: Table empty or key no longer exists.\nProbably the key (‘_kombu.binding.reply.celery.pidbox’) has been removed from the Redis database.\n”)

This gets a bit more interesting/infuriating: it still happens after the downgrade, but in our case it’s much rarer and seems (“feels correlated”) to happen only under higher loads (more tasks running concurrently).

If I were to place my bet, or throw a dart, I’d speculate maybe this is some timeout issue, i.e. under some conditions it gives up with Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database. even though the key is there (just perhaps inaccessible at the time?).

@hsabiu that’s understandable. @auvipy, could you elaborate on why you believe upgrading Redis resolves the issue when others have stated that they are still facing it after upgrading to latest Redis? You closed this issue, so I’m simply still trying to find the resolution that you have found.

I started noticing this error after upgrading Celery to 4.4.2 and kombu to 4.6.8. I read through most of the suggestions in this thread to downgrade Kombu to previous versions but that did not work for me.

What eventually ended up working for me was upgrading the Redis server from version 3.2.11 to 5.0.8. Since the upgrade, I have not seen this error again and my Celery worker systemd service is not going into a failed state anymore.

@killthekitten It seems to be fixed; last month we stopped freezing kombu and it seems to be working with 4.6.6.

We use it with celery btw.

@auvipy why was this closed?


Potential fix created in PR #1394. Please test. For now I am marking it as a draft. It would be best to have multiple users confirm this fix.

In case it helps with debugging this: @drbig seems like he might be on to something regarding the key being inaccessible. We saw the “table empty” error mentioned here closely following a connection error.

Kombu 4.6.11 Celery 4.4.7

Jan 05 22:49:14 -redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.

Jan 05 22:50:49 - OperationalError(“\nCannot route message for exchange ‘reply.celery.pidbox’: Table empty or key no longer exists.\nProbably the key (‘_kombu.binding.reply.celery.pidbox’) has been removed from the Redis database.\n”,)

Got the issue with redis 5:5.0.3-4+deb10u1 and celery==4.4.6 on a Debian buster while running a status subcommand.

It resolved itself while I was reading this issue.

@msebbar I have changed to RabbitMQ for the broker and have stopped seeing this problem. There is clearly something very specific about our setup that causes the issue to manifest, but I can’t seem to figure out the source, so I just jumped ship.

We have upgraded to 5.0.6 as well and we’re still seeing this issue… @hsabiu can you clarify what was changed between Redis versions that caused the problem to go away? @auvipy closed the issue, so I must be missing something here.

@staticfox I’m not sure what changed between Redis versions. I’m merely stating what worked in my case. I tried downgrading to previous versions of Celery and Kombu, but that didn’t seem to fix the issue. Bumping Redis to 5.0.8 with Celery 4.4.2 and Kombu 4.6.8 is what worked for me.

😄

As a broker, RabbitMQ is definitely better than Redis in most cases!

Is it a kombu issue or your Redis conf? Can you dig deeper?

There’s nothing odd in our Redis conf based on everything I’ve reviewed from this thread and others: timeout is 0 and the memory policy is allkeys-lru. Although we have an LRU policy, we never come close to our peak memory capacity, so the LRU policy shouldn’t be invoked.

I’m assuming this is a kombu issue since the exception trace originates from kombu, but I have no evidence beyond that:

kombu.exceptions.OperationalError: 
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

Other notes on our configuration

This started happening after upgrading to Celery 4.x

We upgraded from Celery 3.1.25 to Celery 4.3, kombu 4.6.3 in December 2019 and noticed this error manifest 28 days after the upgrade.

We downgraded to Celery 4.2.1, kombu 4.5.0 and redis 3.2 and had this manifest multiple times.

We recently upgraded to Celery 4.4.0 and later to Celery 4.4.2, and each time the error occurred several more times.

We use autoscaling

We do use autoscaling, which various issue logs have said is pseudo-deprecated in Celery 4.x (maybe coming back in 4.5/4.6/5.x). This OperationalError exception tends to occur during peak periods when autoscaling scales us up, but this isn’t always the case.

Other than autoscaling, our configuration is fairly basic: 3 workers for ad-hoc jobs, --autoscale=25,5 and 3 workers processing periodic, scheduled jobs --autoscale=5,1 (6 total worker nodes) with low utilization outside of a few daily spikes.

I’ll continue investigating for patterns or anomalies.

This has reproduced twice since deploying celery 4.4.2 and kombu 4.6.8 for us. I’ll update here if I find more information.

We’re deploying celery==4.4.2 and kombu==4.6.8 today, but I don’t expect this will manifest right away (for us, it’s not reliably reproducible and usually takes some time).

We have also seen this with: celery==4.4.0 kombu==4.6.7 redis==3.4.1

and

kombu==4.5.0 celery==4.3.0 redis==3.2.1

Our experience has been that this runs successfully for a period of time (anywhere from ~6 days to 28 days) before a worker fails out and stops consuming tasks. We’ve ruled out the Redis config: timeout is 0 and the memory policy is allkeys-lru.


Today I was inspecting the "_kombu.binding.reply.celery.pidbox" key and noticed it is transient: it only seems to exist in Redis while workers are processing tasks. When it does exist, I observe it has no expiration and is a set:

> TTL "_kombu.binding.reply.celery.pidbox"
(integer) -1
> TYPE "_kombu.binding.reply.celery.pidbox"
set

This would suggest that the key is explicitly created and deleted OR, as @staticfox noted, the set is losing all members and being deleted by Redis, but Celery expects it to exist.

I also found this old issue, https://github.com/celery/kombu/issues/226, which pointed to fanout_prefix and fanout_patterns in broker_transport_options. I believe this only affects Redis instances shared by multiple Celery apps (we are the only tenant on ours)?

This does not appear to be set in our app when initializing via celery.config_from_object:

print(celery_ctx.celery.conf.humanize(with_defaults=True))
...
broker_transport_options: {
 }
...

@auvipy - should this be re-opened based on recent reports?

kombu==4.6.3 fixed it for me – had the same issue with Celery worker crashing.