s3fs: S3FileSystem.exists throwing inaccurate FILENOTFOUND
I am still having the same issue when I try to create a Dask dataframe from S3. I made sure anon=False when creating the filesystem object. It is a sporadic problem, and I am not sure why. I have cross-checked my code with the boto3 S3 client and the S3 resource object: both agree that the file is on S3, but s3fs exists is returning ‘Filenotfound’.
import boto3
from botocore.exceptions import ClientError

client = boto3.client('s3')

def check(s3, bucket, key):
    # HEAD the object; treat any error other than 404 as "exists"
    try:
        s3.head_object(Bucket=bucket, Key=key)
    except ClientError as e:
        return int(e.response['Error']['Code']) != 404
    return True

check(client, bucket, key)
#########################################################
import botocore.exceptions

resource = boto3.resource('s3')

def check_resource(bucket, key):
    # GET the object; a 404 means it does not exist, anything else is re-raised
    # (wrapped in a function here; the original comment showed only the body)
    try:
        resource.Object(bucket, key).get()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            return False
        raise
    else:
        return True
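For comparison, here is a minimal sketch of the s3fs side of the same check, which sporadically disagrees with the two boto3 checks above (anon=False as in my setup; bucket and key are the same placeholders):

import s3fs

fs = s3fs.S3FileSystem(anon=False)
# Sporadically reports the file as missing, even though head_object above succeeds
print(fs.exists(f"{bucket}/{key}"))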
About this issue
- State: closed
- Created 5 years ago
- Comments: 52 (22 by maintainers)
Commits related to this issue
- Fix https://github.com/dask/s3fs/issues/253 _ls_from_cache was returning FileNotFound when checking existence of directory, if parent was previously listed, and path contained the "/" suffix. — committed to martindurant/filesystem_spec by deleted user 4 years ago
- Merge pull request #455 from martindurant/exists_prelisted_dir Fix https://github.com/dask/s3fs/issues/253 — committed to fsspec/filesystem_spec by martindurant 4 years ago
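A minimal sketch of the failure mode that commit describes (the bucket and directory names are placeholders, not taken from the issue):

import s3fs

fs = s3fs.S3FileSystem(anon=False)
fs.ls("bucket/data")              # listing the parent populates the directory cache
fs.exists("bucket/data/subdir/")  # with the trailing "/" and the cached parent listing,
                                  # this could incorrectly report the directory as missing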
We are running into this exact issue in our project too, @martindurant. We would greatly appreciate any updates/workarounds.
@TomAugspurger we were checking for the existence of a file in a dense folder. Stepping through the codebase with the debugger leads to exists -> info -> ls, where the result of ls is exactly 1000 files. The folder itself contains 4226 files, and the file in question is the last one (given the sorting by name). The fact that ls returns only the first 1000 files is a known issue; we always use a custom paginator to list directory contents.
OK, got it - this only happens if the parent directory has been listed first, and is actually coded in fsspec.AbstractFileSystem._ls_from_cache.
I want to echo an important thing raised by @mdwint. I am also getting an exists() failure on a path that ends with a “/”. This used to work, but now does not after freshly rebuilding a container. Note: these are files that have existed for months in S3.
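For reference, a custom paginator like the one mentioned above might look roughly like this (only a sketch using boto3's list_objects_v2 paginator; bucket and prefix are placeholders):

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

keys = []
# Walk every page instead of stopping at the first 1000 keys
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    keys.extend(obj['Key'] for obj in page.get('Contents', []))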
Thanks, @martindurant! I will try passing use_listings_cache=False to S3FileSystem.
This is part of a distributed application, which makes it tricky to reproduce. The code that caused the error is equivalent to dd.read_parquet("s3://bucket/dataset-directory/") (with s3:// and a trailing slash). This call is nested in a loop over many URLs, most of which succeed. The failure occurs in only a small percentage of cases, and the files surely exist on S3, but may have been recently created (seconds ago). I’m certain that the URLs are correct, and that s3fs is installed and configured.
@martindurant, sorry I forgot to follow up on this issue. I got the delayed-function piece, but I am not getting how I can read the partitions out of a big CSV without using read_csv with s3fs:
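Regarding the use_listings_cache=False suggestion a couple of comments up (separate from the CSV question), a minimal sketch of that workaround could look like this (the path is a placeholder):

import s3fs

# Disable the fsspec directory-listing cache so exists()/ls() always go back to S3
fs = s3fs.S3FileSystem(anon=False, use_listings_cache=False)
fs.exists("bucket/dataset-directory/")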
#####################################################################
I did call invalidate_cache, plus added a 2-second delay if the first access to the file got a FILENOTFOUND error. Do you want me to call invalidate_cache after the delay?
Although this reduces the number of FILENOTFOUND errors, it still happens when I do a series of uploads, about 10 files. It is a significant improvement compared to hitting it every 4 or 5 uploads to S3.
My current implementation is as below:
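(The original snippet is not shown here; the following is only a sketch of what such a retry could look like, assuming one retry with invalidate_cache() and a 2-second sleep; read_with_retry is a hypothetical name.)

import time
import s3fs

fs = s3fs.S3FileSystem(anon=False)

def read_with_retry(fs, path, delay=2):
    # First attempt; on FILENOTFOUND, invalidate the cache, wait, and try once more
    try:
        return fs.open(path, 'rb').read()
    except FileNotFoundError:
        print(f'FILENOTFOUND on first attempt for {path}, retrying after {delay}s')
        fs.invalidate_cache()
        time.sleep(delay)
        return fs.open(path, 'rb').read()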
With the above code, I can see the print statement in CloudWatch if the first attempt fails, and I see the FILENOTFOUND error twice if both attempts fail. I don’t want to increase the delay.
Thank you