kombu: Table empty or key no longer exists
The issue is that the redis key keeps getting evicted. I read an old linked issue and have confirmed that my Redis instance is not hacked; in fact, we are using Secured Redis.
OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",)
kombu==4.5.0 celery==4.3.0 redis==3.2.1
Is this some issue with redis?
About this issue
- State: closed
- Created 5 years ago
- Reactions: 51
- Comments: 108 (36 by maintainers)
Commits related to this issue
- wip downgrade kombu to fix redis issue https://github.com/celery/kombu/issues/1063 — committed to NewAcropolis/api by kenlt-uk 5 years ago
- fix(req.txt): Update kombu and django-extensions Django Extensions was failing on our version of Django and had to be updated. Kombu is a different, darker story. The reason it needs to be changed i... — committed to freelawproject/courtlistener by mlissner 5 years ago
- Downgrade kombu from 4.6.5 to 4.6.3 As suggested in https://github.com/celery/kombu/issues/1063 This should fixes the issue with redis key getting evicted every now and then and should mean we stop r... — committed to ministryofjustice/laa-legal-adviser-api by said-moj 5 years ago
- Downgrade kombu from 4.6.5 to 4.6.3 As suggested in celery/kombu#1063 This should fixes the issue with redis key getting evicted every now and then and should mean we stop receiving the following err... — committed to waldur/waldur-mastermind by AmbientLighter 4 years ago
- Use older kombu to avoid pidbox errors See https://github.com/celery/kombu/issues/1063 Signed-off-by: Michal Čihař <michal@cihar.com> — committed to WeblateOrg/docker by nijel 4 years ago
- Drop kombu version from 4.6.5 to 4.6.3 Fixes celery error: ``` kombu.exceptions.OperationalError: Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists. Probabl... — committed to ubc/ontask_b by andrew-gardener 4 years ago
- Downgraded kombu to fix https://github.com/celery/kombu/issues/1063 — committed to tough-dev-school/education-backend by f213 4 years ago
- Additional logging on 'pidbox' related error as documented here: https://github.com/celery/kombu/issues/1063#issuecomment-704405396 — committed to hsophie-sf/celery by hsophie-sf 4 years ago
- Fix for https://github.com/celery/kombu/issues/1063. InconsistencyError is not raised here. Instead OperationalError is. We inspect and ignore it if it's the NO_ROUTE_ERROR, otherwise let it pass. — committed to hsophie-sf/kombu by hsophie-sf 4 years ago
- Upgrade kombu to 4.6.3 This should fix celery workers breaking -https://github.com/celery/kombu/issues/1063 — committed to NewAcropolis/api by kenlt-uk 3 years ago
- Downgraded kombu to fix https://github.com/celery/kombu/issues/1063 — committed to adonis0302/Education_Platform_Backend by f213 4 years ago
faced this same issue on the first queue whenever i started a second or more queues

fixed by downgrading to kombu==4.5.0 from kombu==4.6.5

had nothing to do with redis. just the missing key `_kombu.binding.reply.celery.pidbox` that is never created, as you can see if you `redis-cli monitor`
you need to find out your problem.
Had the same issue. I fixed it by downgrading kombu from 4.6.5 to 4.6.3; I still had the bug in version 4.6.4.
I found the same issue. @danleyb2, did you figure out what the problem was with the current version?
Update: Downgrading to v4.5.0 solved the issue. Thanks @danleyb2
The original report talks about:

OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",)

I've done some digging as I'm experiencing something similar. It looks like pidbox is "supposed" to handle the issue but other mechanisms are in the way. This happens on the reply path for a celery control message; specifically, this runs through `_publish_reply`: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/pidbox.py#L272-L289

Which ultimately ends up in `get_table` in the redis transport to look up the reply destination: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/transport/redis.py#L834-L840

When it can't find the reply destination it raises an `InconsistencyError`, which the `_publish_reply` method is specifically supposed to handle and just ignore (because maybe the caller went away before the response could be sent). The problem is that by the time it's trying to handle the exception in `_publish_reply`, the exception is no longer an `InconsistencyError`; now it's an `OperationalError`. This is because on the way to `get_table` the code runs through `connection.ensure`: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/connection.py#L530 and ultimately the `_reraise_as_library_errors` context manager: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/connection.py#L448-L462

In there it looks up `recoverable_connection_errors`, which for redis takes the fallback case https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/connection.py#L923-L937 and pulls in the `channel_errors`: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/connection.py#L956-L959

Which pulls in the transport channel_errors: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/transport/redis.py#L1048

Which calls this method on the transport: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/transport/redis.py#L1085-L1087

Which gives us this giant list of errors: https://github.com/celery/kombu/blob/92b8c32717191e420bf30734247b7a3b2ef1af0f/kombu/transport/redis.py#L74-L95

That list includes `InconsistencyError`, which means it will be translated into an `OperationalError` and not be caught in the `_publish_reply` method. For my project using celery, this means the exception is getting caught here: https://github.com/celery/celery/blob/5b86b35c81ea5a1fbfd439861f4fee6813148d16/celery/worker/pidbox.py#L48-L49 and causing a reset that ultimately means we stop processing tasks.

I think that probably means that `InconsistencyError` should not be included in the list of errors for Redis. I don't think any other transport is going to suffer from the same issue because `InconsistencyError` only happens for Redis in `get_table`.
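To make that translation chain concrete, here is a small self-contained sketch (plain Python, not kombu's actual code; the exception classes stand in for the ones in `kombu.exceptions`) of why the `except InconsistencyError` in `_publish_reply` never fires:

```python
from contextlib import contextmanager

class InconsistencyError(Exception):   # stand-in for kombu.exceptions.InconsistencyError
    pass

class OperationalError(Exception):     # stand-in for kombu.exceptions.OperationalError
    pass

@contextmanager
def reraise_as_library_errors(recoverable):
    # Mimics Connection._reraise_as_library_errors: anything listed among the
    # "recoverable" errors is re-raised as OperationalError.
    try:
        yield
    except recoverable as exc:
        raise OperationalError(str(exc)) from exc

def get_table():
    # Mimics the redis transport when the binding set has disappeared.
    raise InconsistencyError("Table empty or key no longer exists.")

def publish_reply():
    # Mimics pidbox._publish_reply: it only expects InconsistencyError...
    try:
        # ...but because InconsistencyError is in the transport's channel_errors,
        # ensure() has already converted it by the time it propagates here.
        with reraise_as_library_errors((InconsistencyError,)):
            get_table()
    except InconsistencyError:
        pass  # the "caller went away" case this handler was written for

try:
    publish_reply()
except OperationalError as exc:
    print("escapes to the worker as:", type(exc).__name__, exc)
```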
Downgrading from 4.6.5 to 4.6.4 worked for us @auvipy when using celery 4.4.0rc3 (with https://github.com/celery/celery/commit/8e34a67bdb95009df759d45c7c0d725c9c46e0f4 cherry-picked on top to address a different issue).
OK I did an investigation and here are the results:

1. The `_kombu.binding.reply.celery.pidbox` key in redis is of type `set` and it contains the queues bound to the virtual exchange.
2. The `celery.control.inspect().ping()` method is not synchronous. It does not wait for a response from Celery workers; if it does not get a response "immediately" it returns `None`.
3. At the beginning, `celery.control.inspect().ping()` creates a new queue and puts it into `_kombu.binding.reply.celery.pidbox`.
4. After executing the logic of the method, the queue is deleted/removed and hence removed from the set.

Hence, in 99.99% of cases we are not able to see anything, because the workers are fast enough to write the response before `celery.control.inspect().ping()` returns. But by artificially introducing a delay with `sleep()` in the worker's reply to ping, we hit the corner case where the worker responds after `celery.control.inspect().ping()` has returned. Due to 4., the queue no longer exists and has been removed from the `_kombu.binding.reply.celery.pidbox` set. Moreover, if that queue was the only one present, the set itself is deleted from redis (due to 1.) by the time the worker tries to write the reply to the queue via the virtual exchange. And this is causing the exception we are seeing (see the sketch below).

@matusvalo we use Redis as a broker and had this issue every day. Yesterday I installed celery 5.1.2 and kombu from git, rev 7230665e5cd82c3e1b17dc9f5e16dce085994673, and so far have had no issues. We'll keep an eye on it for a few days; so far it looks like it does fix the issue, thanks a lot!
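A rough way to exercise the window matusvalo describes above (not the thread's exact reproducer, which added a `sleep()` inside the worker's ping handler): hammer `ping()` with a very short timeout so the caller keeps giving up, and removing its reply binding, before workers answer. The app name and broker URL below are placeholders, and a worker for this app must already be running:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

for _ in range(1000):
    # With a tiny timeout this usually returns None: the caller tears down its
    # reply queue (and its entry in _kombu.binding.reply.celery.pidbox) before
    # a slow worker publishes its reply, which is when the pidbox error can
    # surface on the worker side.
    app.control.inspect(timeout=0.05).ping()
```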
Why is this issue closed when it’s clearly still happening to people, and still happening to people on new versions of Redis?
update redis server to v5+
thank you, 4.6.4 works!

hello @matusvalo, I can confirm that the fix #1404 works for us. We had the issue every few hours on our pipeline (2 servers with a dozen workers each) since we changed our redis server (from version 5.0.1 to 5.0.3):
celery[redis]==4.4.6 kombu==4.6.11 redis==3.5.3 (server 5.0.3)
For those who do not want to upgrade, you can patch `kombu.transport.redis`:

I think I am able to fix the issue. The problem is caused by multiple workers sharing the same `oid` (which is used to create the keys in `_kombu.binding.reply.celery.pidbox`). This means that sometimes a different worker removes the key even while another is still using it. The `oid` value must be unique per worker, otherwise it will cause the following issue. The fix is simple: the following method should be "uncached": https://github.com/celery/kombu/blob/5ef5e22638035a6c412e949a2fbc5d44b7b088b2/kombu/pidbox.py#L407-L413
The fix is to rewrite property as follows:
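The rewritten property itself didn't survive extraction here. A minimal sketch of what the comment describes, with `Mailbox` and `oid_from` as stand-ins for the names at the linked pidbox.py lines (illustrative only; it may differ from the patch that was eventually merged):

```python
import os
import uuid

def oid_from(instance):
    # Stand-in for the helper used at the linked pidbox.py lines: derive a
    # stable id from node, process and instance, so it is unique per worker.
    return str(uuid.uuid5(uuid.NAMESPACE_OID,
                          f"{uuid.getnode()}-{os.getpid()}-{id(instance)}"))

class Mailbox:                 # stand-in for kombu.pidbox.Mailbox
    @property                  # previously a cached_property
    def oid(self):
        # Derive the value on access instead of caching it, so an oid computed
        # once before the worker processes fork is no longer shared (and
        # deleted out from under) another worker.
        return oid_from(self)
```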
This change alone seems to be fixing the issue. I have executed multiple runs of aforementioned reproducer and I was not able to reproduce crash anymore. I will provide the PR with the fix but I would like to ask everyone to test it.
@auvipy why was this issue closed?
We have redis-server v6.0.1 already, still facing the same issue
I can confirm the bug. I have checked it also before #1394 and it is still occurring, so it was not introduced by that fix 🎉. Hence, this bug has a different root cause than the bug fixed by #1394. I have checked this bug and it is still occurring even when concurrency is set to 1.
I agree that it’s confusing that this issue is closed, although no reliable solution has been proposed and this is still manifesting. We’re seeing this in Redis 5.0.3 and Celery 4.3.0, but it seems that the specific versions are not very helpful in this case.
We have upgraded to 5.0.6 as well and we’re still seeing this issue… @hsabiu can you clarify what was changed between Redis versions that caused the problem to go away? @auvipy closed the issue, so I must be missing something here.
I’m still seeing this issue with 4.6.7.
celery==4.4.0 hiredis==1.0.1 kombu==4.6.7 redis==3.4.1
Edit: I've ensured timeout is 0 and the memory policy is noeviction. I've also set my workers with `--without-heartbeat --without-mingle --without-gossip` and we're still seeing the errors. The only thing that comes to mind is that if that particular set is empty, the key gets deleted regardless of settings, as per the redis spec: https://redis.io/topics/data-types-intro#automatic-creation-and-removal-of-keys
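That automatic-removal behaviour is easy to see outside of kombu; a tiny redis-py sketch against a throwaway key (assumes a local Redis; the key name is just for illustration):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
key = "demo:_kombu.binding.reply.celery.pidbox"

r.sadd(key, "reply-queue-1")
print(r.exists(key))   # 1 -- the set exists while it has members
r.srem(key, "reply-queue-1")
print(r.exists(key))   # 0 -- removing the last member deletes the key itself
```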
Hi @matusvalo, I added your fix to my container's kombu v5.1.0 manually, but I'm still facing this issue. The problem is that on my local PC everything is fine, but when I'm running the Celery container on a VM the issue comes back… So I can't say that the solution is not working, but for some reason it is still happening for me.
I confirm that the Celery worker fails with the following message:
Update: we don’t see this specific issue anymore. Thanks a lot!
have the same issue on latest:

InconsistencyError: Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists. Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n")
This gets a bit more interesting/infuriating: it still happens after the downgrade, but in our case it’s much rarer and seems (“feels correlated”) to happen only under higher loads (more tasks running concurrently).
If I were to place my bet, or throw a dart, I'd speculate this is some timeout issue, i.e. under some conditions it gives up with "Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database." even though the key is there (just perhaps inaccessible at the time?).

@hsabiu that's understandable. @auvipy, could you elaborate on why you believe upgrading Redis resolves the issue when others have stated that they are still facing it after upgrading to the latest Redis? You closed this issue, so I'm simply still trying to find the resolution that you have found.
I started noticing this error after upgrading Celery to 4.4.2 and kombu to 4.6.8. I read through most of the suggestions in this thread to downgrade Kombu to previous versions but that did not work for me.
What eventually ended up working for me was upgrading the Redis server from version 3.2.11 to 5.0.8. Since the upgrade, I have not seen this error again and my celery worker systemd service is not going into a failed state anymore.
@killthekitten It seems to be fixed, last month we stopped freezing kombu and it seems to be working with 4.6.6.
We use it with celery btw.
@auvipy why was this closed?
Looks like the reason is #1087. The bug showed up last week, after the 4.6.4 -> 4.6.5 migration.

Potential fix created in PR #1394. Please test. For now I am marking it as a draft. It's best to have multiple users confirming this fix.
In case it helps debugging this, @drbig seems like he might be on to something regarding the key being inaccessible. We saw the "table empty" error mentioned here closely following a connection error.

Kombu 4.6.11, Celery 4.4.7

Jan 05 22:49:14 - redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.
Jan 05 22:50:49 - OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",)
Got the issue with redis 5:5.0.3-4+deb10u1 and celery==4.4.6 on Debian buster while running a `status` subcommand.

It resolved by itself while reading this issue.
@msebbar I have changed to rabbitmq for the broker and have stopped seeing this problem. There is clearly something very specific about our setup that causes the issue to manifest, but I can't seem to figure out the source, so I just jumped ship.
@staticfox I'm not sure what changed between Redis versions. I'm merely stating what worked in my case. I tried downgrading to previous versions of Celery and Kombu but that didn't seem to fix the issue. Bumping Redis to 5.0.8 with Celery 4.4.2 and Kombu 4.6.8 is what worked for me.
😄
As a broker, RabbitMQ is definitely better than Redis in most cases!
There’s nothing odd in our redis conf based on everything I’ve reviewed from this thread and others: timeout is 0 and memory policy is allkeys-lru. Although we have an LRU policy, we never come close to our peak memory capacity so the LRU policy shouldn’t be invoked.
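For anyone who wants to double-check the same two settings on their own broker, a quick redis-py sketch (host/port are placeholders):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
print(r.config_get("maxmemory-policy"))  # 'noeviction' is the safest choice for a broker
print(r.config_get("timeout"))           # 0 means idle clients are never disconnected
```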
I’m assuming this is a kombu issue since the exception trace originates from kombu, but I have no evidence beyond that:
Other notes on our configuration
This started happening after upgrading to Celery 4.x
We upgraded from Celery 3.1.25 to Celery 4.3, kombu 4.6.3 in December 2019 and noticed this error manifest 28 days after the upgrade.
We downgraded to Celery 4.2.1, kombu 4.5.0 and redis 3.2 and had this manifest multiple times.
We recently upgraded to Celery 4.4.0 and later Celery 4.4.2, and each time this occurred several times more.
We use autoscaling
We do use autoscaling, which various issue logs have said is pseudo-deprecated in Celery 4.x (maybe coming back in 4.5/4.6/5.x). This `OperationalError` exception tends to occur during peak periods when autoscaling scales us up, but this isn't always the case.

Other than autoscaling, our configuration is fairly basic: 3 workers for ad-hoc jobs with `--autoscale=25,5` and 3 workers processing periodic, scheduled jobs with `--autoscale=5,1` (6 total worker nodes), with low utilization outside of a few daily spikes.

I'll continue investigating for patterns or anomalies.
This has reproduced twice since deploying celery 4.4.2 and kombu 4.6.8 for us. I’ll update here if I find more information.
We’re deploying celery==4.4.2 and kombu==4.6.8 today, but I don’t expect this will manifest right away (for us, it’s not reliably reproducible and usually takes some time).
We have also seen this with: celery==4.4.0 kombu==4.6.7 redis==3.4.1
and
kombu==4.5.0 celery==4.3.0 redis==3.2.1
Our experience has been that this runs successfully for a period of time (anywhere from ~6 days to 28 days) before a worker fails out and stops consuming tasks. We've ruled out the configuration settings: timeout is 0 and the memory policy is allkeys-lru.
Today, I was inspecting the `_kombu.binding.reply.celery.pidbox` key and noticed it is transient and seems to only exist while workers are processing tasks (i.e. I only see it in Redis when workers are processing tasks). When it exists in Redis, I observe it has no expiration and is a set.

This would suggest that the key is explicitly created and deleted OR, as @staticfox noted, the set is losing all members and being deleted by Redis, but Celery expects it to exist.
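One way to repeat that inspection while workers are busy, as a small redis-py sketch (assumes the broker is db 0 on localhost):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
key = "_kombu.binding.reply.celery.pidbox"

print(r.exists(key))    # 1 only while a reply binding is live
print(r.type(key))      # b'set'
print(r.ttl(key))       # -1: no expiration on the key
print(r.smembers(key))  # the currently bound reply queues
```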
I also found this old issue log, https://github.com/celery/kombu/issues/226, which pointed to `fanout_prefix` and `fanout_patterns` in `broker_transport_options`. I believe this only affects shared Redis clusters hosting multiple Celery apps (we are the only tenant on ours)?

This does not appear to be set in our app when initializing via `celery.config_from_object`.
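For reference, a minimal sketch of how those transport options would be set if someone wanted to try them (app name and broker URL are placeholders):

```python
from celery import Celery

app = Celery("myapp", broker="redis://localhost:6379/0")
app.conf.broker_transport_options = {
    # Options pointed to by kombu issue #226; mainly relevant when several
    # Celery apps share one Redis instance.
    "fanout_prefix": True,
    "fanout_patterns": True,
}
```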
@auvipy - should this be re-opened based on recent reports?
kombu==4.6.3 fixed it for me – had the same issue with the Celery worker crashing.