wandb: Network error (ReadTimeout), entering retry loop during sweep
- Weights and Biases version: 0.9.1
- Python version: 3.7.7
- Operating System: MacOS X / Debian
Description
I started a wandb sweep (grid search) and I am getting a lot of "Network error (ReadTimeout), entering retry loop." messages from various kinds of machines on different kinds of networks (i.e. this happens both at home on my laptop and on larger remote computers). Sometimes runs get through and sync, but then at seemingly random times the network error shows up. Sometimes it resolves itself within a minute or two; sometimes it stays down for half an hour or longer. I've posted the log below.
What I Did
2020-08-04 16:35:53,246 ERROR MainThread:13897 [retry.py:__call__():108] Retry attempt failed:
Traceback (most recent call last):
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/http/client.py", line 1344, in getresponse
response.begin()
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/http/client.py", line 306, in begin
version, status, reason = self._read_status()
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/http/client.py", line 267, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 725, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/util/retry.py", line 403, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 428, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=10)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/wandb/retry.py", line 95, in __call__
result = self._call_fn(*args, **kwargs)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/wandb/apis/internal.py", line 108, in execute
return self.client.execute(*args, **kwargs)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/gql/transport/requests.py", line 38, in execute
request = requests.post(self.url, **post_args)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/api.py", line 119, in post
return request('post', url, data=data, json=json, **kwargs)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=10)
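For readers unfamiliar with what a "retry loop" around a timed-out read looks like, here is a minimal sketch using only the standard library. It is not wandb's actual retry code; `call_with_retry`, its parameters, and the doubling delay schedule are assumptions made for illustration.

```python
import socket
import time


def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a socket timeout, wait and try again with doubling delays.

    Hypothetical helper for illustration only, not part of wandb.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except socket.timeout:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so the backoff schedule can be checked without actually waiting.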
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 21 (6 by maintainers)
Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.85. Please mark this comment with 👍 or 👎 to give our bot feedback! Links: app homepage, dashboard and code for this bot.
I’m also getting this issue. I’m using wandb 0.13.7 on linux.
@XikunZhang, I’m sorry about the error you experienced today. We had a database outage for ~40 minutes, which is likely when you saw this error. https://status.wandb.com/incidents/s70l1pfsyz6c
During this time, any running sweeps were likely reporting errors, but a sweep should continue to run and recover at the end of the outage.
Any training jobs already started by the sweep should also keep running during the outage; they will report failure messages but should recover once the outage ends.
When the backend is not responding, by default we do not let new training jobs start, because we want to make sure the data is saved. If you want to run offline and sync the data later, you can choose that with wandb offline, WANDB_MODE="offline", or wandb.init(mode="offline").
We have a project to improve the error messages so they are more user-readable. We included this in the 0.10.28 release for rate-limited requests and will be doing it for other types of network errors in the future.
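As a rough sketch of the WANDB_MODE route mentioned above, the snippet below builds an environment with offline mode switched on, which can then be passed to a training process. The helper name `offline_env` and the script name `train.py` are hypothetical; only the WANDB_MODE variable and its "offline" value come from the reply above.

```python
import os


def offline_env(base=None):
    """Return a copy of the environment with WANDB_MODE set to "offline".

    Hypothetical helper for illustration; `base` defaults to os.environ.
    """
    env = dict(os.environ if base is None else base)
    env["WANDB_MODE"] = "offline"
    return env


# Example use (train.py is a placeholder for your own training script):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=offline_env())
```

Runs started this way are stored locally and can be uploaded later, for example with the wandb sync command, once the backend is reachable again.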