wandb: wandb: Network error (TransientError), entering retry loop.
Hi,
on our HPC cluster, some users were running some machine learning jobs that send the data to the wandb service to use the dashboard.
The nodes in our cluster don’t have direct internet access, therefore the connections were going through a proxy server. We now switched from using the proxy server to using Network Address Translation (NAT) to allow access from the compute nodes to
api.wandb.ai
Now many users are reporting that the connection to wandb service is working, but after some time their jobs get an error message
wandb: Network error (TransientError), entering retry loop.
Any help is appreciated. If you need more information, then I am happy to provide whatever is needed to fix this issue.
Best regards
Sam
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 9
- Comments: 19 (2 by maintainers)
Got the same issue yesterday. I guess the problem is releated to the network operator (in China). I solve this issue by changing the network from MOBILE (移动) to TELECOM (电信).