s3fs: Dask cluster hangs while processing files using s3fs
I have a function which loads .nc files from S3 using s3fs, selects a specific region from each file, and then pushes the result back to a different S3 bucket. Here's my function. I'm doing everything in JupyterLab.
import pandas as pd
import s3fs
import xarray as xr

def get_records(rec):
    # rec[-1] is a timestamp-like ID: year, month, day, hour, minute slices
    d = [rec[-1][0:4], rec[-1][4:6], rec[-1][6:8], rec[-1][9:11], rec[-1][11:13]]
    yr, mo, da, hr, mn = d

    ps = s3fs.S3FileSystem(anon=True)

    period = pd.Period(yr + '-' + mo + '-' + da, freq='D')
    dy = period.dayofyear
    print(dy)

    cc = [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]  # look at the IR channels only for now
    dy = "{0:0=3d}".format(dy)

    # this loop is for 10 different channels
    for c in range(10):
        ch = "{0:0=2d}".format(cc[c])
        # hr is a string slice, so convert it before zero-padding
        prefix = ('s3://noaa-goes16/ABI-L1b-RadF/' + yr + '/' + dy + '/'
                  + "{0:0=2d}".format(int(hr)) + '/OR_ABI-L1b-RadF-M3C' + ch + '*')
        # opening 2 different time slices of the given record
        paths = ps.glob(prefix)
        F1 = xr.open_dataset(ps.open(paths[-2]))[['Rad']]
        F2 = xr.open_dataset(ps.open(paths[-1]))[['Rad']]
        # selecting data around the record's (x, y) location
        G1 = F1.where((F1.x >= (rec[0] - 0.005)) & (F1.x <= (rec[0] + 0.005)) &
                      (F1.y >= (rec[1] - 0.005)) & (F1.y <= (rec[1] + 0.005)), drop=True)
        G2 = F2.where((F2.x >= (rec[0] - 0.005)) & (F2.x <= (rec[0] + 0.005)) &
                      (F2.y >= (rec[1] - 0.005)) & (F2.y <= (rec[1] + 0.005)), drop=True)
        # concatenating the 2 time slices together
        G = xr.concat([G1, G2], dim='time')
        # concatenating the different channels
        if c == 0:
            T = G
        else:
            T = xr.concat([T, G], dim='channel')

    # saving into an nc file and uploading it to S3
    # (fs and bucket are a writable S3FileSystem and a target bucket path defined elsewhere)
    path = rec[-1] + '.nc'
    T.to_netcdf(path)
    fs.put(path, bucket + path)
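For reference, the slicing at the top of get_records assumes rec[-1] is a string of the form YYYYMMDD, a one-character separator, then HHMM (the exact separator is not shown in the report). A small sketch of that assumption with a made-up record ID:

import pandas as pd

# hypothetical record ID; the underscore separator at index 8 is an assumption
rec_id = "20200315_1430"
ts = pd.to_datetime(rec_id, format="%Y%m%d_%H%M")
print(ts.year, "{0:0=3d}".format(ts.dayofyear), "{0:0=2d}".format(ts.hour))  # 2020 075 14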
Now I want to use a Dask cluster to run this function in parallel, like this:
files = []
for i in range(0, 100):
    s3_ds = dask.delayed(get_records)(records[i])
    files.append(s3_ds)
files = dask.compute(*files)
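For context, a minimal sketch of how the snippet above can be pointed at a distributed cluster (the LocalCluster arguments are an assumption about the setup, not something stated in the report):

import dask
from dask.distributed import Client, LocalCluster

# assumed setup: 10 single-threaded worker processes on one machine
cluster = LocalCluster(n_workers=10, threads_per_worker=1)
client = Client(cluster)

# with a Client active, dask.compute() sends the delayed get_records()
# calls to the cluster's workers instead of running them locally
files = [dask.delayed(get_records)(records[i]) for i in range(100)]
files = dask.compute(*files)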
This was running perfectly fine last month, but now when I run it again the workers stop processing files after a while. For example, if I give 100 files to 10 workers, they process 60-70 of them and then just sit idle doing nothing, even though they still have memory left. If I give them only 50 files instead, they process around 30 and then sit idle, again with memory to spare. I don't know whether I'm doing something wrong or whether it's a bug in one of the libraries. I upgraded every library I was using (dask, s3fs and fsspec), but nothing is working.
The main point is that all of this was working perfectly fine a few weeks ago, and now the same code no longer works.
Environment:
- Dask version: 2021.2.0
- Python version: 3.8.8
- s3fs version: 0.6.0
- fsspec version: 0.9.0
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 19 (8 by maintainers)
Commits related to this issue
- backrev fsspec and s3fs due to hanging bug https://github.com/dask/s3fs/issues/464 — committed to stevengillard/amazon-asdi by stevengillard 3 years ago
- use spawn instead of fork method (#118) spawn is cleaner https://stackoverflow.com/a/66113051/1658314 work around an issue in s3fs fsspec/s3fs#464 — committed to rom1504/img2dataset by rom1504 2 years ago
But really the short story is: don’t use fork!
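In practice, avoiding fork here means starting worker processes with the spawn method, as the commit above does. A minimal sketch of that workaround, assuming the hang comes from forked processes inheriting s3fs/fsspec state (the Dask config key below is the standard one for the local multiprocessing scheduler, but treat this as a starting point rather than a guaranteed fix):

import multiprocessing
import dask

# make plain multiprocessing (and anything built on it) use spawn instead of fork
multiprocessing.set_start_method("spawn", force=True)

# tell Dask's local multiprocessing scheduler to spawn its worker processes too
dask.config.set({"multiprocessing.context": "spawn"})

Spawned processes start from a fresh interpreter, so they do not inherit locks or event-loop state from the parent process, which is the kind of inherited state that fork-related hangs like this usually come from.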