earthaccess: Document why signed S3 URLs might be giving 403s when called from inside us-west-2
Sometimes when you make a request to a URL behind earthdata login, after a series of redirects, you get sent to a signed S3 URL. This should be transparent to the client, as the URL itself contains all the authentication needed for access.
However, sometimes, in some clients, you get a generic 403 Forbidden here without much explanation. It has something to do with other auth being sent alongside the signed request (see https://github.com/nsidc/earthaccess/issues/187 for more, admittedly vague, info).
We should document what this is, and why you get the 403. This documentation would allow developing workarounds for various clients if needed.
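For illustration, here is one way this failure mode can surface, assuming the cause is the client re-sending its Earthdata credentials to the presigned S3 URL (the hostname below is illustrative, not a real endpoint):

```
# Following redirects while re-sending credentials to every host can trigger
# the error: the final hop is a presigned S3 URL, and S3 rejects a request
# that carries both query-string signing and an Authorization header.
curl -L --location-trusted -u "user:password" \
  "https://data.example-daac.earthdatacloud.nasa.gov/granule.nc" -o out.nc
```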
Just to note, aiohttp finally made a release! So `fsspec` now supports `netrc` correctly!

Just linking some benchmarks from @hrodmn comparing s3:// and https:// access for a year's worth of Harmonized Landsat Sentinel-2 (HLS) data from LP-DAAC on us-west-2 at https://hrodmn.dev/posts/nasa-s3/index.html. There's about a 0.25 second speed advantage (8.08s with `s3`, 7.74s with `https`), which is fairly small, but if `earthaccess` can handle switching between s3/https based on the compute region, that would be awesome!

@alexgleith just FYI, there are a few catches when we access HTTPS:// instead of S3://:
Finally, if we are running our code in us-west-2, we can use s3fs with the s3:// URLs, and we can use earthaccess to get authenticated sessions if we know the DAAC.
Here is an example of it working with xarray!
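A minimal sketch of that pattern, assuming you are running in us-west-2 with Earthdata Login credentials in `~/.netrc` (the bucket/key below is illustrative, not a real granule):

```python
# Minimal sketch: authenticated s3fs access via earthaccess, read with xarray.
import earthaccess
import xarray as xr

earthaccess.login()  # picks up ~/.netrc or prompts for EDL credentials

# Authenticated s3fs filesystem using the DAAC's temporary S3 credentials
fs = earthaccess.get_s3fs_session(daac="LPDAAC")

url = "s3://lp-prod-protected/EXAMPLE/granule.nc"  # illustrative
with fs.open(url) as f:
    ds = xr.open_dataset(f)
    print(ds)
```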
Thanks @betolink
For the work I’m doing, it’s exploratory so performance isn’t important yet. And I don’t think that for the NetCDF files chunking matters, since they’re not optimised for it. (Happy to be corrected there!)
I’m just doing a little project on the EMIT data and there’s enough complexity in the data itself that I’m happy with the HTTPS loading process. Thanks for your help!
Ok, I got it working using @yuvipanda’s code above.
Needs a little fix in a related project, which I’ve raised as a PR: https://github.com/nasa/EMIT-Data-Resources/pull/24
Hi Alex,
yes, you must have an EC2 instance running in the same region as the S3 bucket (`us-west-2` for NASA data) to “Directly Access” the data.

Andy Barrett
Bringing in @cisaacstern to maybe provide some extra feedback to Alexey’s last message.
May not be the right thread, but dropping a note here so it’s more permanent than Slack:
A while ago, I stumbled across these Twitter threads about some climate data stored in Zarr on OpenStorageNetwork’s S3 buckets with HTTP URLs. The example they show accesses Zarr directly via an HTTP URL.
https://twitter.com/charlesstern/status/1574497421245108224?s=20&t=rLvID-0c1j1NxHgy0JOCjQ https://twitter.com/charlesstern/status/1574499938465038336?s=20&t=rLvID-0c1j1NxHgy0JOCjQ
Here’s a direct link to the corresponding Pangeo feedstock (in case Twitter dies): https://pangeo-forge.org/dashboard/feedstock/79
From what I can tell, the underlying storage here is OpenStorageNetwork, which provides the S3 API via Ceph. How exactly all of this is wired and optimized is a bit beyond me, but the end result is compelling and may have some interesting lessons for how we do S3/HTTP.
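For context, reading such a store is essentially one line in xarray. A sketch, assuming a public (no-auth) Zarr store; the URL is illustrative of the OSN layout, not the actual feedstock path:

```python
# Sketch: open a public Zarr store over plain HTTP, as in the linked example.
# Requires zarr, fsspec, and aiohttp to be installed.
import xarray as xr

ds = xr.open_zarr("https://ncsa.osn.xsede.org/Pangeo/EXAMPLE/store.zarr")
print(ds)
```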
Writing this as a reminder: Apache requires a capital B in “Bearer”, which matters for on-premise files. This also works for cloud files, so you should use the following:
curl -H "Authorization: Bearer TOKEN" -L --url ‘URL’ >outThis issue has spawned off many different things, so here’s a quick summary:
1. Support using HTTPS + .netrc with xarray universally
Currently, it is not possible to use `.netrc` files with xarray if you are running from inside us-west-2. So if you are inside us-west-2 and want to access cloud hosted data with xarray, you must use S3 (not plain HTTPS). Once https://github.com/aio-libs/aiohttp/pull/7131 lands and a new release of `aiohttp` is made, this issue will go away, and code that uses HTTPS + `.netrc` will work universally, regardless of whether it runs in us-west-2 or elsewhere.

2. Support using EDL user tokens universally
For cloud access, EDL tokens already work with xarray (https://github.com/nsidc/earthaccess/issues/188#issuecomment-1363450269 has an example). However, it doesn't work universally - many on-prem servers don't support EDL tokens yet, although some do. @briannapagan (from inside NASA) and I (from outside) are pushing on this, getting EDL token support rolled out more universally. If you are inside a DAAC, we could use your help!
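For reference, a minimal sketch of that pattern, assuming an EDL user token generated at urs.earthdata.nasa.gov (the granule URL and token are illustrative placeholders):

```python
# Sketch: EDL bearer-token auth over plain HTTPS with fsspec + xarray.
import fsspec
import xarray as xr

token = "EDL_TOKEN"  # generate at https://urs.earthdata.nasa.gov
url = "https://data.lpdaac.earthdatacloud.nasa.gov/EXAMPLE/granule.nc"

# fsspec forwards extra kwargs like `headers` to aiohttp's request calls
fs = fsspec.filesystem("https", headers={"Authorization": f"Bearer {token}"})

with fs.open(url) as f:
    ds = xr.open_dataset(f)
    print(ds)
```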
3. Determine when to use s3:// protocol vs https:// protocol when inside us-west-2
`s3://` links only work from inside us-west-2, so we should have clear documentation on when users should use the s3:// protocol vs just https. From inside us-west-2, there could be a performance difference between the two, but my personal intuition is that it is not significant enough for many use cases, especially beginner cases. This is the part the least amount of work has been done on so far. We would need some test cases comparing s3 vs https from inside us-west-2 to establish this performance difference.
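A rough sketch of what such a test case could look like, assuming earthaccess's session helpers for auth (the URLs below are illustrative, not real granules):

```python
# Time reading the same granule via s3:// (s3fs) and https:// (fsspec/HTTP)
# from inside us-west-2.
import time

import earthaccess

earthaccess.login()

s3_url = "s3://lp-prod-protected/EXAMPLE/granule.nc"  # illustrative
https_url = "https://data.lpdaac.earthdatacloud.nasa.gov/EXAMPLE/granule.nc"  # illustrative

s3_fs = earthaccess.get_s3fs_session(daac="LPDAAC")
start = time.perf_counter()
s3_fs.cat(s3_url)  # read the whole object
print(f"s3://    {time.perf_counter() - start:.2f}s")

https_fs = earthaccess.get_fsspec_https_session()
start = time.perf_counter()
https_fs.cat(https_url)
print(f"https:// {time.perf_counter() - start:.2f}s")
```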
End goal for education

My intuitive end goal here is to be able to tell people to ‘use HTTPS links with token auth’ universally, regardless of where they are accessing the data from, with an addendum suggesting the `s3://` protocol under specific performance circumstances. A step along the way is to be able to tell people to ‘use HTTPS links with `.netrc`’ universally.

@briannapagan and I did another bit of deep diving here, and made some more progress.
There seem to be two primary packages supporting earthdata login on the server side: TEA (the thin-egress-app) and the Apache URS authentication module.
We have established that TEA already supports bearer tokens (https://github.com/asfadmin/thin-egress-app/blob/7b0f7110b1694f553af2b71594cc19e40c179ea9/lambda/app.py#L183). But what of the apache2 module?!
As of Sep 2021, it also supports bearer tokens! https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187 is the appropriate merge commit, and we discovered an internal JIRA ticket named `URSFOUR-1600` that also tracks this feature.

With some more sleuthing, we discovered https://forum.earthdata.nasa.gov/viewtopic.php?t=3290. We tracked that down by looking for URSFOUR-1858, mentioned in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/8c4796c0467a1d5dcb8740fb86f23474db8258e3. That merge was the only further activity on the apache module since the merge for token support. Looking through that earthdata forum post, we see that LPDAAC (which maintains the dataset talked about there) mentions deploying ‘some apache change’ to help with that. So the hypothesis I had was: LPDAAC has deployed the updated apache module, so their on-prem servers should now accept bearer tokens.
I tested this hypothesis by trying to send a token to https://e4ftl01.cr.usgs.gov/ASTT/AG5KMMOH.041/2001.04.01/ASTER_GEDv4.1_A2001091.h5 - a dataset hosted by LPDAAC. And behold, it works! So all data hosted by LPDAAC supports tokens 😃

So the pathway to using tokens everywhere, including on-prem, boils down to getting all the DAACs to use the latest version of the official earthdata apache2 module.
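The test presumably looked like the curl invocation from earlier in the thread, applied to that URL (the token placeholder is not a real value):

```
curl -H "Authorization: Bearer <EDL_TOKEN>" -L \
  "https://e4ftl01.cr.usgs.gov/ASTT/AG5KMMOH.041/2001.04.01/ASTER_GEDv4.1_A2001091.h5" \
  -o ASTER_GEDv4.1_A2001091.h5
```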
This is great news for many reasons.
ok, so current summary is:

- `.netrc` support was added to aiohttp, and hence to fsspec. This is needed for earthdata login access to work consistently in AWS us-west-2 with fsspec the same way it works elsewhere, while using an earthdata username / password to login.
- `fsspec` - just pass `headers` as a kwarg as shown in the comment above, rather than as a part of `client_kwargs`. yay! (See the sketch at the end of this summary.)

Unfortunately, there is a limit of only two tokens per user in earthdata login right now, so you can not just generate a token for each machine you would use it in, like with a GitHub Personal Access Token. However, the lack of need for specific files means this would also work with dask.
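A minimal sketch of that second point, contrasting the two ways of passing headers (the token is an illustrative placeholder):

```python
# Both forms work with current fsspec; the top-level kwarg is the simpler one.
import fsspec

headers = {"Authorization": "Bearer EDL_TOKEN"}  # illustrative token

# preferred: headers as a top-level storage option
fs = fsspec.filesystem("https", headers=headers)

# older pattern: headers nested inside aiohttp's ClientSession kwargs
fs_old = fsspec.filesystem("https", client_kwargs={"headers": headers})
```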