stolon: One stolon-keeper fails to start

In a 3-machine synchronous cluster, one machine is now failing to start its stolon-keeper:

# sudo -u postgres /nix/store/ngy0lpk389vbn74yjhf2i5f6f192q1hv-stolon-0.6.0/bin/stolon-keeper --cluster-name test-stolon-cluster --store-backend=consul --uid node_1 --data-dir /var/lib/postgres/node_1 --pg-listen-address=10.0.0.1 --pg-bin-path=/nix/store/qq2fxv16cm5r6wshk5hbiwn284s16h1n-postgresql-9.6.3/bin --pg-su-username=postgres --pg-su-passwordfile=/etc/stolon/test-stolon-cluster.password --pg-repl-username=postgres --pg-repl-passwordfile=/etc/stolon/test-stolon-cluster.password

warning: superuser name and replication user name are the same. Different users are suggested.
[I] 2017-09-28T01:05:01Z keeper.go:1567: exclusive lock on data dir taken
[I] 2017-09-28T01:05:01Z keeper.go:408: keeper uid uid=node_1
[I] 2017-09-28T01:05:02Z postgresql.go:215: stopping database
pg_ctl: PID file "/var/lib/postgres/node_1/postgres/postmaster.pid" does not exist
Is server running?
[E] 2017-09-28T01:05:02Z keeper.go:526: cannot get configured pg parameters error=dial tcp [::1]:5432: getsockopt: connection refused
[E] 2017-09-28T01:05:02Z keeper.go:526: cannot get configured pg parameters error=dial tcp [::1]:5432: getsockopt: connection refused
[I] 2017-09-28T01:05:02Z keeper.go:839: our db boot UID is different than the cluster data one, waiting for it to be updated bootUUID=3daf23f4-2543-49ad-91e8-3466a07a0c20 clusterBootUUID=7e395ead-a25d-4612-ae09-828eae2b5194
[I] 2017-09-28T01:05:02Z postgresql.go:215: stopping database
pg_ctl: PID file "/var/lib/postgres/node_1/postgres/postmaster.pid" does not exist
Is server running?
[E] 2017-09-28T01:05:03Z keeper.go:526: cannot get configured pg parameters error=dial tcp [::1]:5432: getsockopt: connection refused
[E] 2017-09-28T01:05:03Z keeper.go:526: cannot get configured pg parameters error=dial tcp [::1]:5432: getsockopt: connection refused
[I] 2017-09-28T01:05:03Z keeper.go:839: our db boot UID is different than the cluster data one, waiting for it to be updated bootUUID=3daf23f4-2543-49ad-91e8-3466a07a0c20 clusterBootUUID=7e395ead-a25d-4612-ae09-828eae2b5194
[I] 2017-09-28T01:05:03Z postgresql.go:215: stopping database
...

There is no postgresql process.

What might be the issue here?

And how might I get that machine back to operating normally?

About this issue

State: closed
Created 7 years ago
Comments: 27 (23 by maintainers)

Most upvoted comments

To summarize this issue: when the filesystem was full, recovery.conf wasn’t written atomically and ended up as an empty file, so the keeper considered this node a primary and refused to continue, since the cluster data says that it should be a standby.

Fixed in #495

sgotti on May 29, 2018
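
For readers unfamiliar with the failure mode summarized above, here is a minimal sketch of an atomic file write: write to a temporary file in the same directory, fsync it, then rename it over the target. With this pattern a full filesystem makes the write fail before the rename, so the previous file is left in place instead of being replaced by an empty one. This is an illustration only, not stolon's code (stolon is written in Go, and the actual fix in #495 is not reproduced here). The function name write_file_atomic and the sample file content are hypothetical, and the sketch assumes a POSIX filesystem.

```python
import os
import tempfile


def write_file_atomic(path: str, data: bytes, mode: int = 0o600) -> None:
    """Replace `path` with `data` so readers see the old or the new content, never a partial file.

    Hypothetical helper for illustration; assumes a POSIX filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # The temporary file must live in the same directory (same filesystem)
    # as the target, otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Force the data to disk; on a full filesystem the error
            # surfaces here or in write(), before the target is touched.
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, mode)
        # Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file; the original target is untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # Persist the rename itself by syncing the containing directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    # Illustrative usage against a scratch directory, not a real data dir.
    scratch = tempfile.mkdtemp()
    target = os.path.join(scratch, "recovery.conf")
    write_file_atomic(target, b"standby_mode = 'on'\n")
    print(open(target, "rb").read())
```

The point of the design is that an empty or truncated target can only come from writing to the target path directly. Here every failure before os.replace leaves the existing file as it was.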