wandb: watch() sometimes does not log the gradients

Describe the bug I am training a generative adversarial network. Without adversarial training, the gradients of the generator are logged. With adversarial training, the gradients of the generator are not logged.

To Reproduce Steps to reproduce the behavior:

  1. Go to https://github.com/AliaksandrSiarohin/first-order-model
  2. Add the following at line 71 of run.py (see the fuller sketch after this list)
wandb.watch(generator, log='all')
wandb.watch(discriminator, log='all')
wandb.watch(kp_detector, log='all')
  3. Run python run.py --config config/vox-256.yaml and we can see the generator and kp_detector gradients are logged.
  4. Run python run.py --config config/vox-adv-256.yaml, which is adversarial training, and we can see that only discriminator gradients are logged, not those of generator and kp_detector.
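
For context, a minimal sketch of how those calls sit in the training script, assuming the three models and a wandb run already exist in run.py; the project name is hypothetical, and log_freq is written out only to make the logging interval visible (its default depends on the wandb version):

import wandb

wandb.init(project='first-order-model')  # hypothetical project name

# Gradients are captured via backward hooks, so histograms only appear for
# modules whose parameters actually receive gradients during loss.backward();
# log_freq controls how often (in batches) they are recorded.
wandb.watch(generator, log='all', log_freq=100)
wandb.watch(discriminator, log='all', log_freq=100)
wandb.watch(kp_detector, log='all', log_freq=100)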

Expected behavior The gradients of all three networks should be logged for adversarial training.

Screenshots

Operating System

  • OS: Linux

Additional context conda list output: https://pastebin.com/sV920TQP

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 16 (6 by maintainers)

Most upvoted comments

It’s likely regular RAM that’s the issue.

It’s likely getting killed due to memory pressure. We have to load the gradients from the GPU, and if your model is really large your notebook may not have enough memory. Are you able to get a larger instance for your notebook?
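
If regular RAM is the limit, one way to cut the overhead of watch() (a sketch, reusing the three models from the report above) is to log only gradients and do so less often, so less data has to be pulled off the GPU and histogrammed per logging step:

# Log only gradients (no parameter histograms) and at a larger interval.
wandb.watch(generator, log='gradients', log_freq=1000)
wandb.watch(discriminator, log='gradients', log_freq=1000)
wandb.watch(kp_detector, log='gradients', log_freq=1000)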

Hi Steven, can you link us to any runs where all media / metrics failed to log? If you don’t want to post the link here you can reach me directly at aswanberg@wandb.com

For some reason, when I tried lowering log_freq to 10 or to 1, my gradient charts only show one step and not the entire distribution. The gradient charts used to work.

Please advise @cvphelps

Hi @kenmbkr thanks for writing in and including a reproducible script, that’s awesome! We are on Christmas holidays right now but will get back to this after vacation. One thing I’ve seen in the past is that some models don’t run for enough iterations: since the gradients aren’t logged on every step, the logging interval can be too large to ever pick up the gradients of the other networks.
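
One quick way to rule the interval in or out (a sketch, not a confirmed fix) is to lower log_freq so gradients are recorded on nearly every backward pass; if the generator and kp_detector histograms still never appear, the logging interval is not the cause:

# With log_freq=1, every optimization step that backpropagates through these
# modules should produce a gradient histogram, even in very short runs.
wandb.watch(generator, log='all', log_freq=1)
wandb.watch(kp_detector, log='all', log_freq=1)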