aistore: ETL WebDataset connection timeout
I tried ETL by following this article: https://aiatscale.org/blog/2023/05/11/aisio-transforms-with-webdataset-pt-2
I keep getting a connection timeout when calling transform_object_inline:
https://github.com/NVIDIA/aistore/blob/5c5a2a3ecabcaa8a10b7ccd7f5b9bc800481abd6/docs/examples/aisio_webdataset/etl_webdataset.py#L100
$ aistore/docs/examples/aisio_webdataset# python etl_webdataset.py
{'Ais-Atime': '1687530497164671185', 'Ais-Bucket-Name': 'images', 'Ais-Bucket-Provider': 'ais', 'Ais-Checksum-Type': 'xxhash', 'Ais-Checksum-Value': 'a487f46d49561afd', 'Ais-Location': 't[rrViDbDG]:mp[/ais1, nvme0n2]', 'Ais-Mirror-Copies': '1', 'Ais-Mirror-Paths': '[/ais1]', 'Ais-Name': 'samples-00.tar', 'Ais-Present': 'true', 'Ais-Version': '2', 'Content-Length': '45895680', 'Date': 'Fri, 23 Jun 2023 14:34:36 GMT'}
http://<proxy-lb-public-address>/v1/objects/images/samples-00.tar?provider=ais&etl_name=wd-transform
Traceback (most recent call last):
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connection.py", line 200, in _new_conn
sock = connection.create_connection(
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 790, in urlopen
response = self._make_request(
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 496, in _make_request
conn.request(
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connection.py", line 388, in request
self.endheaders()
File "/root/miniconda3/lib/python3.9/http/client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/root/miniconda3/lib/python3.9/http/client.py", line 1010, in _send_output
self.send(msg)
File "/root/miniconda3/lib/python3.9/http/client.py", line 950, in send
self.connect()
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connection.py", line 236, in connect
self.sock = self._new_conn()
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connection.py", line 215, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f2253c3f730>: Failed to establish a new connection: [Errno 110] Connection timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 844, in urlopen
retries = retries.increment(
File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.0.6', port=30186): Max retries exceeded with url: /ais%2F@%23%2Fimages%2Fsamples-00.tar (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2253c3f730>: Failed to establish a new connection: [Errno 110] Connection timed out'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<path-to-workspace>/aistore/docs/examples/aisio_webdataset/etl_webdataset.py", line 131, in <module>
transform_object_inline()
File "<path-to-workspace>/aistore/docs/examples/aisio_webdataset/etl_webdataset.py", line 108, in transform_object_inline
processed_shard = single_object.get(etl_name=etl_name).read_all()
File "<path-to-workspace>/venv/lib/python3.9/site-packages/aistore/sdk/object.py", line 113, in get
resp = self._client.request(
File "<path-to-workspace>/venv/lib/python3.9/site-packages/aistore/sdk/request_client.py", line 91, in request
resp = self._session.request(
File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 725, in send
history = [resp for resp in gen]
File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 725, in <listcomp>
history = [resp for resp in gen]
File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 266, in resolve_redirects
resp = self.send(
File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.10.0.6', port=30186): Max retries exceeded with url: /ais%2F@%23%2Fimages%2Fsamples-00.tar (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2253c3f730>: Failed to establish a new connection: [Errno 110] Connection timed out'))
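I think the timeout happens while the ETL request is being processed. What puzzles me is why a request was sent to host='10.10.0.6', port=30186 at all. According to the traceback, it was issued while requests was following a redirect (resolve_redirects), starting from this call: https://github.com/NVIDIA/aistore/blob/5c5a2a3ecabcaa8a10b7ccd7f5b9bc800481abd6/python/aistore/sdk/object.py#L113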
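One way to see where the request is being sent is to issue the first GET without following the redirect and read the Location header. The snippet below is only a sketch using the Python standard library, not part of the aistore SDK; the helper names peek_redirect and target_of are made up for illustration, and it assumes a plain http:// endpoint.

```python
import http.client
from urllib.parse import urlsplit


def peek_redirect(url, timeout=5.0):
    """Send one GET and return (status, Location) without following redirects.

    Hypothetical helper for debugging; handles plain http:// URLs only.
    """
    parts = urlsplit(url)
    # http.client never follows redirects, so we see the first hop's answer.
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def target_of(location):
    """Split a Location URL into the (host, port) the client would connect to next."""
    parts = urlsplit(location)
    return parts.hostname, parts.port


# Example call (replace the placeholder with your own proxy address):
# status, location = peek_redirect(
#     "http://<proxy-lb-public-address>/v1/objects/images/samples-00.tar"
#     "?provider=ais&etl_name=wd-transform"
# )
# print(status, location, target_of(location))
```

If the Location points at an address such as 10.10.0.6:30186, that is the host and port the client has to be able to reach for the follow-up request to succeed.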
$ kubectl -n ais get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aistore-admin 1/1 Running 0 18h 10.72.2.36 gke-ais1-ais1-bbc650cb-6hck <none> <none>
aistore1-proxy-0 1/1 Running 0 18h 10.72.2.34 gke-ais1-ais1-bbc650cb-6hck <none> <none>
aistore1-proxy-1 1/1 Running 0 18h 10.72.4.30 gke-ais1-ais1-bbc650cb-97zb <none> <none>
aistore1-proxy-2 1/1 Running 0 18h 10.72.1.28 gke-ais1-ais1-bbc650cb-d8q1 <none> <none>
aistore1-target-0 1/1 Running 0 18h 10.72.4.31 gke-ais1-ais1-bbc650cb-97zb <none> <none>
aistore1-target-1 1/1 Running 0 18h 10.72.2.35 gke-ais1-ais1-bbc650cb-6hck <none> <none>
aistore1-target-2 1/1 Running 0 18h 10.72.1.29 gke-ais1-ais1-bbc650cb-d8q1 <none> <none>
transform-images-iumkgvzt 1/1 Running 0 68m 10.72.1.39 gke-ais1-ais1-bbc650cb-d8q1 <none> <none>
transform-images-rrvidbdg 1/1 Running 0 68m 10.72.4.41 gke-ais1-ais1-bbc650cb-97zb <none> <none>
transform-images-szsjmiqg 1/1 Running 0 68m 10.72.2.46 gke-ais1-ais1-bbc650cb-6hck <none> <none>
wd-transform-iumkgvzt 1/1 Running 0 42m 10.72.1.41 gke-ais1-ais1-bbc650cb-d8q1 <none> <none>
wd-transform-redirect-1-iumkgvzt 1/1 Running 0 135m 10.72.1.36 gke-ais1-ais1-bbc650cb-d8q1 <none> <none>
wd-transform-redirect-1-rrvidbdg 1/1 Running 0 135m 10.72.4.38 gke-ais1-ais1-bbc650cb-97zb <none> <none>
wd-transform-redirect-1-szsjmiqg 1/1 Running 0 135m 10.72.2.43 gke-ais1-ais1-bbc650cb-6hck <none> <none>
wd-transform-rrvidbdg 1/1 Running 0 42m 10.72.4.43 gke-ais1-ais1-bbc650cb-97zb <none> <none>
wd-transform-szsjmiqg 1/1 Running 0 42m 10.72.2.48 gke-ais1-ais1-bbc650cb-6hck <none> <none>
About this issue
- State: closed
- Created a year ago
- Comments: 16 (1 by maintainers)
@aaronnw It’s working fine now, except for the issue of not being able to retrieve results from outside the cluster.
I don’t have any other questions now. Thanks for all your answers and help.
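For the remaining "outside the cluster" problem, a quick check is whether the client machine can open a TCP connection to the address from the traceback at all. This is a minimal sketch with the standard library only; can_connect is a hypothetical helper, and the host and port in the usage comment are simply the ones from the error above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


# Example call from the machine that runs etl_webdataset.py:
# print(can_connect("10.10.0.6", 30186))
```

A False result from outside the cluster and True from a pod inside it would suggest a networking or firewall issue rather than a problem in the ETL itself.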
Hey @yingca1, I have also struggled with setting up ETLs. The way I usually debug is: I run the commands to set up the ETL, check which pods were spawned, and then run the logs command on each newly spawned pod to see whether anything went wrong during initialisation. 90% of the time you can figure out the issue from the logs of these pods.
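The log-checking routine described above can be scripted roughly like this. It is a sketch under assumptions: the pod naming pattern "<etl_name>-<suffix>" is inferred from the kubectl listing in this thread and is not a documented guarantee, the ais namespace is taken from that same listing, and etl_pods and log_command are hypothetical helper names.

```python
def etl_pods(pod_names, etl_name):
    """Pick pods named '<etl_name>-<suffix>' where the suffix contains no dash.

    Assumption: ETL pods follow the pattern seen in the listing above,
    e.g. 'wd-transform-iumkgvzt' for the ETL named 'wd-transform'.
    """
    prefix = etl_name + "-"
    return [
        name
        for name in pod_names
        if name.startswith(prefix) and "-" not in name[len(prefix):]
    ]


def log_command(pod, namespace="ais", tail=50):
    """Build the kubectl command that prints the last `tail` log lines of a pod."""
    return ["kubectl", "-n", namespace, "logs", pod, f"--tail={tail}"]


# Example with pod names copied from the listing above:
pods = [
    "aistore1-target-0",
    "wd-transform-iumkgvzt",
    "wd-transform-redirect-1-iumkgvzt",
    "wd-transform-rrvidbdg",
]
for pod in etl_pods(pods, "wd-transform"):
    print(" ".join(log_command(pod)))
```

Each command list can be passed to subprocess.run as-is, or the printed line can be pasted into a shell.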