s3fs: S3FileSystem.exists throwing inaccurate FILENOTFOUND
I am still having the same issue when I try to create a Dask dataframe from S3. I made sure anon=False when creating the filesystem object. It is a sporadic problem, and I am not sure why. I have cross-checked my code with the boto3 S3 client and the S3 resource object: both agree that the file is on S3, but s3fs exists is returning ‘Filenotfound’.
import boto3
from botocore.exceptions import ClientError

client = boto3.client('s3')

def check(s3, bucket, key):
    # HEAD the object; treat any error other than 404 as "exists"
    try:
        s3.head_object(Bucket=bucket, Key=key)
    except ClientError as e:
        return int(e.response['Error']['Code']) != 404
    return True

check(client, bucket, key)
#########################################################
import botocore.exceptions

resource = boto3.resource('s3')

def check_resource(bucket, key):
    # GET the object; a 404 means it does not exist, anything else is re-raised
    # (wrapped in a function here; the original comment showed only the body)
    try:
        resource.Object(bucket, key).get()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            return False
        raise
    else:
        return True
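For comparison, here is a minimal sketch of the s3fs side of the same check, which sporadically disagrees with the two boto3 checks above (anon=False as in my setup; bucket and key are the same placeholders):

import s3fs

fs = s3fs.S3FileSystem(anon=False)
# Sporadically reports the file as missing, even though head_object above succeeds
print(fs.exists(f"{bucket}/{key}"))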
About this issue
- State: closed
- Created 5 years ago
- Comments: 52 (22 by maintainers)
Commits related to this issue
- Fix https://github.com/dask/s3fs/issues/253 _ls_from_cache was returning FileNotFound when checking existence of directory, if parent was previously listed, and path contained the "/" suffix. — committed to martindurant/filesystem_spec by deleted user 4 years ago
- Merge pull request #455 from martindurant/exists_prelisted_dir Fix https://github.com/dask/s3fs/issues/253 — committed to fsspec/filesystem_spec by martindurant 4 years ago
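A minimal sketch of the failure mode that commit describes (the bucket and directory names are placeholders, not taken from the issue):

import s3fs

fs = s3fs.S3FileSystem(anon=False)
fs.ls("bucket/data")              # listing the parent populates the directory cache
fs.exists("bucket/data/subdir/")  # with the trailing "/" and the cached parent listing,
                                  # this could incorrectly report the directory as missing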
We are running into this exact issue in our project too, @martindurant. We would greatly appreciate any updates/workarounds.
@TomAugspurger we were checking for the existence of a file in a dense folder. Stepping through the codebase with the debugger leads to exists -> info -> ls, where the result of ls is exactly 1000 files. The folder itself contains 4226 files, and the file in question is the last one (given the sorting by name). The fact that ls returns only the first 1000 files is a known issue; we always use a custom paginator to list directory contents.
OK, got it - this only happens if the parent directory has been listed first, and is actually coded in fsspec.AbstractFileSystem._ls_from_cache.
I want to echo an important thing raised by @mdwint. I am also getting an exists() failure on a path that ends with a “/”. This used to work, but now does not after freshly rebuilding a container. Note: these are files that have existed for months in S3.
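For reference, a custom paginator like the one mentioned above might look roughly like this (only a sketch using boto3's list_objects_v2 paginator; bucket and prefix are placeholders):

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

keys = []
# Walk every page instead of stopping at the first 1000 keys
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    keys.extend(obj['Key'] for obj in page.get('Contents', []))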
Thanks, @martindurant! I will try passing use_listings_cache=False to S3FileSystem.
This is part of a distributed application, which makes it tricky to reproduce. The code that caused the error is equivalent to dd.read_parquet("s3://bucket/dataset-directory/") (with s3:// and a trailing slash). This call is nested in a loop over many URLs, most of which succeed. The failure occurs in only a small percentage of cases, and the files surely exist on S3, but may have been recently created (seconds ago). I’m certain that the URLs are correct, and that s3fs is installed and configured.
@martindurant, sorry I forgot to follow up on this issue. I got the delayed-function piece, but I am not getting how I can read the partitions out of a big CSV without using read_csv with s3fs:
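Regarding the use_listings_cache=False suggestion a couple of comments up (separate from the CSV question), a minimal sketch of that workaround could look like this (the path is a placeholder):

import s3fs

# Disable the fsspec directory-listing cache so exists()/ls() always go back to S3
fs = s3fs.S3FileSystem(anon=False, use_listings_cache=False)
fs.exists("bucket/dataset-directory/")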
#####################################################################
I did call invalidate_cache, plus added a 2-second delay if the first access to the file got a FILENOTFOUND error. Do you want me to call invalidate_cache after the delay?
Although this reduces the number of FILENOTFOUND errors, it still happens when I do a series of uploads, about 10 files. It is a significant improvement compared to hitting it every 4 or 5 uploads to S3.
My current implementation is as below:
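(The original snippet is not shown here; the following is only a sketch of what such a retry could look like, assuming one retry with invalidate_cache() and a 2-second sleep; read_with_retry is a hypothetical name.)

import time
import s3fs

fs = s3fs.S3FileSystem(anon=False)

def read_with_retry(fs, path, delay=2):
    # First attempt; on FILENOTFOUND, invalidate the cache, wait, and try once more
    try:
        return fs.open(path, 'rb').read()
    except FileNotFoundError:
        print(f'FILENOTFOUND on first attempt for {path}, retrying after {delay}s')
        fs.invalidate_cache()
        time.sleep(delay)
        return fs.open(path, 'rb').read()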
With the above code, I can see the print statement in CloudWatch if the first attempt fails, and I see the FILENOTFOUND error twice if both attempts fail. I don’t want to increase the delay.
Thank you