foundationdb: fdbrestore gets OutOfMemory error while waiting for agents to complete restore

When I try to restore the database from a backup on either 5.2.5 or 6.0.15, fdbrestore runs out of memory and fails. The command:

fdbrestore start -r back_dir -w -C cluster_file

The output:

Backup Description
URL: back_dir
Restorable: true
Snapshot:  startVersion=113213694260083 (2022-08-26 02:13:19)  endVersion=113213694816262 (2022-08-26 02:13:20)  totalBytes=119085195  restorable=true
SnapshotBytes: 119085195
MinLogBeginVersion:      113213694201866 (2022-08-26 02:13:19)
ContiguousLogEndVersion: 113213714201866 (2022-08-26 02:13:39)
MaxLogEndVersion:        113213714201866 (2022-08-26 02:13:39)
MinRestorableVersion:    113213694816262 (2022-08-26 02:13:20)
MaxRestorableVersion:    113213714201865 (2022-08-26 02:13:39)
Restoring backup to version: 113213714201865

ERROR: Out of memory

The only difference in the output between 5.2.5 and 6.0.15 is that 6.0.15 keeps printing the following line:

Tag: default  UID: xxxxxxxxxxxxxxxxx  State: queued  Blocks: 0/0  BlocksInProgress: 0  Files: 0  BytesWritten: 0  ApplyVersionLag: 0  LastError: None

On 5.2.5 nothing further is printed.

There is nothing obvious in the trace files or in syslog.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

To clarify what is going on a bit: fdbbackup and fdbrestore merely enqueue backup and restore jobs into the database (after a possibly lengthy init for restore). The database cluster's backup_agent processes then make progress on those jobs whenever they are running. Any cluster you want to back up data from or restore data into must have at least one backup agent running. You can start backups or restores when no agents are running, but no progress will be made on the resulting backup/restore jobs until agents are running. A default FDB installation will configure one backup_agent to be started by fdbmonitor.

So your fdbrestore command did not ignore the -C option, and there is no bug there. It enqueued your restore job into the cluster you specified, but there were no backup agents running on that cluster, so no actual restore work was done.
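As a rough illustration of how to spot this situation from the output alone, here is a minimal Python sketch that parses the status line quoted in the report and flags a restore that is queued with nothing in flight. The function names and the "unclaimed" heuristic are made up for this example and are not part of FDB; the only input it relies on is the status line format shown above.

```python
import re

# Matches "Name: value" pairs as they appear in the status line quoted above.
# Values are assumed to contain no spaces; a multi-word LastError would be cut
# off at its first word.
_FIELD = re.compile(r"(\w+):\s+(\S+)")


def parse_restore_status(line):
    """Split one 'Tag: ... State: ...' status line into a dict of strings."""
    return dict(_FIELD.findall(line))


def looks_unclaimed(status):
    """Heuristic only (an assumption, not FDB logic): the job is queued and
    no blocks, in-progress work, or written bytes have been reported."""
    return (
        status.get("State") == "queued"
        and status.get("Blocks") == "0/0"
        and status.get("BlocksInProgress") == "0"
        and status.get("BytesWritten") == "0"
    )


line = (
    "Tag: default  UID: xxxxxxxxxxxxxxxxx  State: queued  Blocks: 0/0  "
    "BlocksInProgress: 0  Files: 0  BytesWritten: 0  ApplyVersionLag: 0  "
    "LastError: None"
)
status = parse_restore_status(line)
if looks_unclaimed(status):
    print("restore is queued with no progress; check that a backup_agent is running")
```

A line that stays in this state indefinitely is consistent with no backup agent picking up the job, though it does not prove it; confirming that a backup_agent process is actually running against the target cluster is still the real check.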

I’m glad your restore issue is resolved. However, there is still the matter of the Out of Memory error you saw, which obviously should not happen. Can you give me any further details about that? How long was fdbrestore running before it OOM’d? Do you have a trace file from that run (the --log option will produce one)?