tensorflow: tf.contrib.learn.monitors.ValidationMonitor hangs when passed input_fn parameter

Hello! I have been working my way through the tf.contrib.learn tutorials and have been attempting to integrate the tf.contrib.learn.monitors.ValidationMonitor into the ‘deep’ classifier in wide_n_deep.py as shown below.

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem? I searched both Github and Stakeoverflow with the terms ‘tensorflow,’ ‘input_fn,’ and ‘validationmonitor’ but wasn’t able to find anyone else who reported similar issues.

Environment info

Operating System: I ran this on Ubuntu Server 16.04 on a physical I7 with a GTX1080 gpu when i noticed the problem. I know that i was using the GPU on the original physical box from previous tests, and because during the hang the nvidia_smi command showed considerable load on the GPU. I was able to replicate the problem with CPU on a 16.04 VM as well.

Installed version of CUDA and cuDNN:

/home/andersonjas/libcudnn5-dev_5.1.5-1+cuda8.0_amd64.deb
/home/andersonjas/libcudnn5_5.1.5-1+cuda8.0_amd64.deb

If installed from binary pip package, provide:

  1. A link to the pip package you installed: from Anaconda 2.7 64 bit package:
pip install tensorflow
  1. The output from python -c "import tensorflow; print(tensorflow.__version__)".
andersonjas@ubuntu:~$ python -c "import tensorflow; print(tensorflow.__version__)"
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
0.11.head

If installed from source, provide

  1. The commit hash (git rev-parse HEAD)
  2. The output of bazel version

If possible, provide a minimal reproducible example (We usually don’t have time to read hundreds of lines of your code)

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(input_fn=lambda:input_fn(df_test), 
                       every_n_steps=50)
m.fit(input_fn=lambda: input_fn(df_train), steps=151,monitors=[validation_monitor])

Doing this in a jupyter notebool causes the code to hang indefinitely. To make completely sure that i don’t have a bug in my own code i can make the following change:

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(input_fn=lambda:input_fn(df_test), 
                       every_n_steps=50)
m.fit(input_fn=lambda: input_fn(df_train), steps=151) #,monitors=[validation_monitor])

and then the code executes fine.

What other attempted solutions have you tried?

I also built an input_fn interface to the iris and boston housing price predictor code code, each showed similar ‘hangs’

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

As a noob, i’m learning that esoteric error messages are a luxury 😃, in this case the code just hangs indefinitely.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 3
  • Comments: 16 (9 by maintainers)

Most upvoted comments

I just found a workaround to what might be the same issue in Tensorflow 0.12.1. Maybe it will be helpful here.

ValidationMonitor did not support the x=…y=… input, and when using input_fn, I found it would loop forever the first time it tried to run the monitor during a fit() call. The log output looked like it was calculating the validation metrics over and over in an infinite loop.

After looking at the source for ValidationMonitor I found a parameter eval_steps that defaulted to None. In other places in the contrib.learn codebase ‘None’ implies ‘forever’, so I tried adding eval_steps=1 to the constructor, and now it works as advertised.

e.g.

        validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
            input_fn=lambda: input_function(test_data, test_labels),
            eval_steps=1,  # Try adding this
            metrics=validation_metrics,
            every_n_steps=10,
        )   

        estimator.fit(
            input_fn=lambda: input_function(training_data, training_labels),
            steps=training_epochs,
            monitors=[validation_monitor],
        ) 

Hope that helps!

@gmacleod Thanks eval_steps = 1 fixed the problem!

@lancerts I was able to trigger validation monitor multiple time by setting saving checkpoints.

the relevant code

classifier = tf.contrib.learn.DNNClassifier(
    ...,
    config = tf.contrib.learn.RunConfig(save_checkpoints_steps = 100, save_checkpoints_secs = None)
)

@gmacleod In tf1.1.0 , even adding eval_steps=1 to the constructor, it just output the validation log one time at the end of first n_steps, not every n_steps. Do you have the same problem?