tensorflow: checkpoints file's max_to_keep not working

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

I could not find any related posts on the web.

Environment info

Operating System:

CentOS 6.5, TensorFlow 0.11.0, CPU only (no GPU)

If possible, provide a minimal reproducible example (We usually don’t have time to read hundreds of lines of your code)

saver = tf.train.Saver(max_to_keep=5)
then
saver.save(sess, checkpoint_file, global_step=step)

What other attempted solutions have you tried?

IMPORTANT: I ran the same code on my laptop and it works correctly, keeping only the 5 latest checkpoints. On the CentOS server, however, it generates an unlimited number of checkpoint files and the disk is almost full.

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

no error log

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 2
  • Comments: 16 (6 by maintainers)

Most upvoted comments

I observe similar behaviour: at some random point in time the saver starts to consistently ignore the max_to_keep parameter and no longer deletes older checkpoints.

I wish I could provide more info.

My guess is a quirk in the file system. Since the issue has been identified, I’m going to close this.

I found that when execution reaches saver.py line 1157, the “checkpoint_prefix” variable contains a double slash, for example “checkpoint/adagrad-lr0.01-fs200-b10-u128.64.64.32//checkpoint.ckpt-6700.index”. Simply replacing “//” with “/” made it work.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/saver.py#L1157
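A minimal sketch of that workaround, applied on the caller's side rather than inside saver.py. It assumes the double slash comes from joining a directory name that already ends in “/” with the checkpoint file name. The helper normalize_checkpoint_prefix is hypothetical and not part of TensorFlow; it only cleans the path before it would be passed to saver.save.

```python
import posixpath


def normalize_checkpoint_prefix(checkpoint_dir, name="checkpoint.ckpt"):
    """Join directory and file name, collapsing repeated slashes.

    Hypothetical helper (not a TensorFlow API). posixpath is used because
    the report concerns a Linux server; on Linux, os.path behaves the same.
    """
    # Naive concatenation yields "dir//name" when checkpoint_dir ends in "/".
    joined = checkpoint_dir + "/" + name
    # normpath collapses "//" into "/" and drops any trailing slash.
    return posixpath.normpath(joined)


# Directory name taken from the comment above, with a trailing slash.
checkpoint_file = normalize_checkpoint_prefix(
    "checkpoint/adagrad-lr0.01-fs200-b10-u128.64.64.32/")
print(checkpoint_file)
```

The resulting checkpoint_file can then be used in saver.save(sess, checkpoint_file, global_step=step). Whether this resolves the problem in every setup is not confirmed in this thread.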

From your two experiments, the problem seems possibly specific to CentOS 6.5, but not to the TF version. The max_to_keep feature is implemented in this file: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/saver.py in a function called _MaybeDeleteOldCheckpoints. It would be interesting to know whether this function is being called correctly and, if so, whether the recognition of old files or their deletion is failing. Could you add some print statements?
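Alongside print statements inside saver.py, one way to check from the outside whether old checkpoints are being deleted is to count the distinct steps present on disk after a few saves. The sketch below uses only the standard library. list_checkpoint_steps is a hypothetical helper, and the assumption that file names contain “.ckpt-<step>” is taken from the example path reported in this thread.

```python
import os
import re

# Assumed naming scheme: "<name>.ckpt-<step>" optionally followed by a suffix
# such as ".index" or ".meta" (based on the file name quoted in this thread).
_STEP_RE = re.compile(r"\.ckpt-(\d+)(?:\.|$)")


def list_checkpoint_steps(checkpoint_dir):
    """Return the sorted, distinct global steps found in checkpoint_dir."""
    steps = set()
    for name in os.listdir(checkpoint_dir):
        match = _STEP_RE.search(name)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)
```

If len(list_checkpoint_steps(checkpoint_dir)) keeps growing beyond the max_to_keep value during training, the older files are not being removed.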

You might look first in your logs to see whether there are any messages about this.