postgresql-container: Redeployment unable to startup again
Updated the resource limits for a postgresql-persistent 9.5 deployment. On redeployment the new pod fails to start:

```
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".
done
server started
ERROR: tuple already updated by self
```
It seems the first pod did not shut down cleanly and left its PID in the /var/lib/pgsql/data/userdata/postmaster.pid file on the volume, preventing the container from starting up again without manual intervention.
Perhaps an edge case, as this is the first time I have seen this across many other postgresql deployments.
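The manual intervention amounts to removing the stale lock file before the server starts. A minimal sketch of such a pre-start check, assuming (as PostgreSQL documents) that the first line of `postmaster.pid` holds the server PID; the data-directory path is taken from the error above and the demo uses a scratch directory standing in for the volume:

```shell
#!/bin/sh
# Remove postmaster.pid only when the PID it records is no longer alive.
remove_stale_pidfile() {
    pidfile="$1/postmaster.pid"
    [ -f "$pidfile" ] || return 0
    pid=$(head -n1 "$pidfile")          # first line of postmaster.pid is the PID
    if kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid still running; leaving $pidfile alone"
    else
        echo "removing stale $pidfile (pid $pid is gone)"
        rm "$pidfile"
    fi
}

# Demo against a scratch directory standing in for the data volume:
dir=$(mktemp -d)
sleep 0 & dead_pid=$!; wait "$dead_pid"   # a PID that has already exited
echo "$dead_pid" > "$dir/postmaster.pid"
remove_stale_pidfile "$dir"               # removes the file
```

Note that `kill -0` only tests for process existence; in a container where the server is always PID 1 this check is not sufficient on its own (see the discussion below about the PID=1 guard).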
About this issue
- State: closed
- Created 7 years ago
- Reactions: 12
- Comments: 16 (8 by maintainers)
Hi, facing the same issue, using the Recreate strategy. Deleting the postmaster.pid also did not help, as I got the same error at the next pod startup. Any idea on how to fix or work around this?
Hi @martin123218,
The problem with the `Rolling` strategy is that it tells OpenShift to first create a new pod with the same data volume as the old one, and to shut down the original pod only once the new pod is up and running. Since, for a time, two pods are accessing (and presumably writing to) the same data volume, you can run into this issue. Please use the `Recreate` strategy instead. There will be some downtime, since the new pod is only started after the old pod has shut down, but you should not run into this issue anymore.

Hello, facing the same problem here: the container is unable to start, with the same output detailed above. Has anyone found a solution or workaround for this problem? I quickly tried removing the /var/lib/pgsql/data/userdata/postmaster.pid file, but when starting the container I get the same issue.
EDIT: I double checked, and in my case the output is:
This is an old issue, but I just hit the same thing with the Recreate strategy. The following articles explain how to revive the failing pod, and they helped me: https://pathfinder-faq-ocio-pathfinder-prod.pathfinder.gov.bc.ca/DB/PostgresqlCrashLoopTupleError.html https://serverfault.com/questions/942743/postgres-crash-loop-caused-by-a-tuple-concurrently-updated-error
We use only one database pod, so this may not solve the high-availability case, but at least the database works again with one pod. Maybe it will be useful for somebody.
Not with this trivial layout. This problem is equivalent to the non-container scenario where you run `dnf update postgresql-server`: you have to shut down the old server and then start the new one. That is, you cannot let two servers write into the same data directory.
Btw., the PostgreSQL server has a guard against the "multiple servers writing to the same data directory" situation, but unfortunately, in the container scenario the server has a deterministic PID (PID=1). So a concurrent PostgreSQL server (in a different container) checks the pid/lock file, compares the recorded PID with its own PID, and assumes "I'm PID=1, so the PID file is some leftover from a previous run". It then removes the PID file and continues modifying the data directory. This has disaster potential.
Our templates only support the Recreate strategy. The fact that Rolling "mostly" works is a matter of luck, i.e. that the old server happens not to be under heavy load.
That said, the zero-downtime problem needs to be solved at a higher logical layer.
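For reference, the Recreate strategy recommended in this thread is set in the DeploymentConfig spec. A minimal fragment, assuming the OpenShift `DeploymentConfig` API; "postgresql" is a placeholder name:

```yaml
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  name: postgresql        # placeholder; use your deployment's name
spec:
  replicas: 1
  strategy:
    type: Recreate        # stop the old pod before starting the new one
```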