aistore: ETL WebDataset connection timeout

I tried setting up ETL by following this article: https://aiatscale.org/blog/2023/05/11/aisio-transforms-with-webdataset-pt-2

I keep getting a timeout when calling transform_object_inline: https://github.com/NVIDIA/aistore/blob/5c5a2a3ecabcaa8a10b7ccd7f5b9bc800481abd6/docs/examples/aisio_webdataset/etl_webdataset.py#L100

$ aistore/docs/examples/aisio_webdataset# python etl_webdataset.py                                           

{'Ais-Atime': '1687530497164671185', 'Ais-Bucket-Name': 'images', 'Ais-Bucket-Provider': 'ais', 'Ais-Checksum-Type': 'xxhash', 'Ais-Checksum-Value': 'a487f46d49561afd', 'Ais-Location': 't[rrViDbDG]:mp[/ais1, nvme0n2]', 'Ais-Mirror-Copies': '1', 'Ais-Mirror-Paths': '[/ais1]', 'Ais-Name': 'samples-00.tar', 'Ais-Present': 'true', 'Ais-Version': '2', 'Content-Length': '45895680', 'Date': 'Fri, 23 Jun 2023 14:34:36 GMT'}


http://<proxy-lb-public-address>/v1/objects/images/samples-00.tar?provider=ais&etl_name=wd-transform


Traceback (most recent call last):
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connection.py", line 200, in _new_conn                                                
    sock = connection.create_connection(                                                                                                                     
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 73, in create_connection                                    
    sock.connect(sa)                                                                                                                                         
TimeoutError: [Errno 110] Connection timed out                                                                                                               
                                                                                                                                                             
The above exception was the direct cause of the following exception:
                                                                                                                                                             
Traceback (most recent call last):                                            
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 790, in urlopen                                              
    response = self._make_request(                                                                                                                           
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 496, in _make_request 
    conn.request(                                                                                                                                            
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connection.py", line 388, in request                                                  
    self.endheaders()                                                                                                                                        
  File "/root/miniconda3/lib/python3.9/http/client.py", line 1250, in endheaders                           
    self._send_output(message_body, encode_chunked=encode_chunked)                                                                                           
  File "/root/miniconda3/lib/python3.9/http/client.py", line 1010, in _send_output                                 
    self.send(msg)                                                                                                                                           
  File "/root/miniconda3/lib/python3.9/http/client.py", line 950, in send                                                                                    
    self.connect()                                                                                                                                           
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connection.py", line 236, in connect                                                  
    self.sock = self._new_conn()                                              
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connection.py", line 215, in _new_conn                                                
    raise NewConnectionError(                                                 
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f2253c3f730>: Failed to establish a new connection: [Errno 110] Connection timed out
                                                                                                                                                             
The above exception was the direct cause of the following exception:                                                                                         
                                                                                                                                                             
Traceback (most recent call last):                                                                                                                           
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/adapters.py", line 486, in send                                                      
    resp = conn.urlopen(                                                                                                                                     
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(    
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 515, in increment                                                
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]                                                                            
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.0.6', port=30186): Max retries exceeded with url: /ais%2F@%23%2Fimages%2Fsamples-00.tar (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2253c3f730>: Failed to establish a new connection: [Errno 110] Connection timed out'))
                                                                                                                                                             
During handling of the above exception, another exception occurred:           
                                                                                                                                                             
Traceback (most recent call last):                                                                                                                           
  File "<path-to-workspace>/aistore/docs/examples/aisio_webdataset/etl_webdataset.py", line 131, in <module>        
    transform_object_inline()                                                                                                                                
  File "<path-to-workspace>/aistore/docs/examples/aisio_webdataset/etl_webdataset.py", line 108, in transform_object_inline                                
    processed_shard = single_object.get(etl_name=etl_name).read_all()                                                                                        
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/aistore/sdk/object.py", line 113, in get    
    resp = self._client.request(                                                                                                                             
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/aistore/sdk/request_client.py", line 91, in request 
    resp = self._session.request(                                                                                                                            
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 589, in request                                                   
    resp = self.send(prep, **send_kwargs)                                                                                                                    
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 725, in send                                                      
    history = [resp for resp in gen]                                          
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 725, in <listcomp>                                                
    history = [resp for resp in gen]                                          
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 266, in resolve_redirects                                         
    resp = self.send(                                                         
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/sessions.py", line 703, in send                                                      
    r = adapter.send(request, **kwargs)                                                                                                                      
  File "<path-to-workspace>/venv/lib/python3.9/site-packages/requests/adapters.py", line 519, in send                                                      
    raise ConnectionError(e, request=request)                                                                                                                
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.10.0.6', port=30186): Max retries exceeded with url: /ais%2F@%23%2Fimages%2Fsamples-00.tar (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2253c3f730>: Failed to establish a new connection: [Errno 110] Connection timed out'))

I think the timeout happened while the ETL request was being processed. What puzzles me is why a request was made to host='10.10.0.6', port=30186 (see https://github.com/NVIDIA/aistore/blob/5c5a2a3ecabcaa8a10b7ccd7f5b9bc800481abd6/python/aistore/sdk/object.py#L113).

$ kubectl -n ais get po -owide
NAME                               READY   STATUS    RESTARTS   AGE    IP           NODE                          NOMINATED NODE   READINESS GATES
aistore-admin                      1/1     Running   0          18h    10.72.2.36   gke-ais1-ais1-bbc650cb-6hck   <none>           <none>
aistore1-proxy-0                   1/1     Running   0          18h    10.72.2.34   gke-ais1-ais1-bbc650cb-6hck   <none>           <none>
aistore1-proxy-1                   1/1     Running   0          18h    10.72.4.30   gke-ais1-ais1-bbc650cb-97zb   <none>           <none>
aistore1-proxy-2                   1/1     Running   0          18h    10.72.1.28   gke-ais1-ais1-bbc650cb-d8q1   <none>           <none>
aistore1-target-0                  1/1     Running   0          18h    10.72.4.31   gke-ais1-ais1-bbc650cb-97zb   <none>           <none>
aistore1-target-1                  1/1     Running   0          18h    10.72.2.35   gke-ais1-ais1-bbc650cb-6hck   <none>           <none>
aistore1-target-2                  1/1     Running   0          18h    10.72.1.29   gke-ais1-ais1-bbc650cb-d8q1   <none>           <none>
transform-images-iumkgvzt          1/1     Running   0          68m    10.72.1.39   gke-ais1-ais1-bbc650cb-d8q1   <none>           <none>
transform-images-rrvidbdg          1/1     Running   0          68m    10.72.4.41   gke-ais1-ais1-bbc650cb-97zb   <none>           <none>
transform-images-szsjmiqg          1/1     Running   0          68m    10.72.2.46   gke-ais1-ais1-bbc650cb-6hck   <none>           <none>
wd-transform-iumkgvzt              1/1     Running   0          42m    10.72.1.41   gke-ais1-ais1-bbc650cb-d8q1   <none>           <none>
wd-transform-redirect-1-iumkgvzt   1/1     Running   0          135m   10.72.1.36   gke-ais1-ais1-bbc650cb-d8q1   <none>           <none>
wd-transform-redirect-1-rrvidbdg   1/1     Running   0          135m   10.72.4.38   gke-ais1-ais1-bbc650cb-97zb   <none>           <none>
wd-transform-redirect-1-szsjmiqg   1/1     Running   0          135m   10.72.2.43   gke-ais1-ais1-bbc650cb-6hck   <none>           <none>
wd-transform-rrvidbdg              1/1     Running   0          42m    10.72.4.43   gke-ais1-ais1-bbc650cb-97zb   <none>           <none>
wd-transform-szsjmiqg              1/1     Running   0          42m    10.72.2.48   gke-ais1-ais1-bbc650cb-6hck   <none>           <none>
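
The last traceback passes through resolve_redirects in requests, so the request to 10.10.0.6:30186 was a followed redirect rather than the original call to the proxy address. That address also does not match any pod IP in the listing above (all pods are in 10.72.x.x). Below is a minimal Python sketch for inspecting such a redirect. The helper names describe_redirect and fetch_redirect_location are hypothetical and not part of the aistore SDK, and treating a private address as "probably unreachable from outside the cluster" is an assumption to verify against the actual network setup.

```python
import ipaddress
from urllib.parse import urlsplit


def describe_redirect(location):
    """Summarize the host and port of a redirect target URL.

    'private' is True/False for IP literals and None for DNS names.
    """
    parts = urlsplit(location)
    try:
        private = ipaddress.ip_address(parts.hostname).is_private
    except ValueError:
        private = None  # not an IP literal (or no host at all)
    return {"host": parts.hostname, "port": parts.port, "private": private}


def fetch_redirect_location(url, timeout=10):
    """Issue a GET without following redirects and return the Location header.

    Hypothetical helper: 'url' would be the object URL printed by the
    example script (proxy address plus etl_name query parameter).
    """
    import requests  # third-party; imported lazily so the parser works without it

    resp = requests.get(url, allow_redirects=False, timeout=timeout)
    return resp.headers.get("Location")
```

Calling describe_redirect on the value returned by fetch_redirect_location shows which address the client is being sent to. If it is a private address and the client runs outside the cluster, a connection timeout like the one above would be the expected symptom under that assumption.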

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16 (1 by maintainers)

Most upvoted comments

@aaronnw It’s working fine now, except for the issue of not being able to retrieve results from outside the cluster.

I don’t have any other questions now. Thanks for all your answers and help.

@aaronnw Thanks a lot for your reply, this info looks really helpful!

  1. Does the Python version of the ETL task client need to match the ETL runtime version?
  2. Is there an easy way to check if an ETL job is running and see how it’s doing?

Hey @yingca1, I have also struggled with setting up ETLs. To debug, I usually run the commands that set up the ETL and then check which pods were spawned. Then I run the logs command on each newly spawned pod to see whether anything went wrong during initialisation. About 90% of the time, the logs from these pods reveal the problem.
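
The debugging routine described above could be scripted roughly as follows. This is only a sketch: etl_pod_names and log_commands are hypothetical helpers, the "ais" namespace is taken from the kubectl command in the issue, and the rule "ETL pods are named <etl_name>-<suffix> with no further dash in the suffix" is an assumption inferred from the pod listing, not documented aistore behaviour.

```python
def etl_pod_names(listing, etl_name):
    """Pick the pods belonging to one ETL out of `kubectl get po` output.

    Assumes pod names look like '<etl_name>-<suffix>' where the suffix
    contains no dash (so 'wd-transform-redirect-1-...' is not matched
    when etl_name is 'wd-transform').
    """
    prefix = etl_name + "-"
    names = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        name = fields[0]
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix and "-" not in suffix:
            names.append(name)
    return names


def log_commands(pod_names, namespace="ais"):
    """Build one `kubectl logs` argument list per pod."""
    return [["kubectl", "-n", namespace, "logs", name] for name in pod_names]
```

Each command list can be passed to subprocess.run(cmd, capture_output=True, text=True) to collect the pod's logs, which is where initialisation errors would show up.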