earthaccess: Document why signed S3 URLs might be giving 400s when called from inside us-west-2

Sometimes when you make a request to a URL behind earthdata login, after a series of redirects, you get sent to a signed S3 URL. This should be transparent to the client, as the URL itself contains all the authentication needed for access.

However, sometimes, in some clients, you get a generic 403 Forbidden here without much explanation. It has something to do with other auth being sent along with the request to the signed URL (see https://github.com/nsidc/earthaccess/issues/187 for more, admittedly vague, info).

We should document what this is, and why you get the 403. This documentation would allow developing workarounds for various clients if needed.

Most upvoted comments

Just to note, aiohttp finally made a release! So fsspec now supports netrc correctly!
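
That means something like the following should now work with Earthdata Login credentials stored in ~/.netrc. This is a minimal sketch, assuming aiohttp applies the netrc credentials when trust_env=True (my reading of that change); the granule URL is a placeholder.

import xarray as xr
from fsspec.implementations.http import HTTPFileSystem

# Placeholder for a cloud-hosted granule behind Earthdata Login
url = "https://data.nsidc.earthdatacloud.nasa.gov/path/to/granule.h5"

# assumes ~/.netrc has your Earthdata Login (urs.earthdata.nasa.gov) credentials;
# trust_env=True asks aiohttp to look at the environment, which with the new
# netrc support includes those credentials
fs = HTTPFileSystem(client_kwargs={"trust_env": True})

with fs.open(url) as f:
    ds = xr.open_dataset(f)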

Just linking some benchmarks from @hrodmn comparing s3:// and https:// access for a year’s worth of Harmonized Landsat Sentinel-2 (HLS) data from LP-DAAC on us-west-2 at https://hrodmn.dev/posts/nasa-s3/index.html. The difference is only a fraction of a second (8.08 s with s3, 7.74 s with https), which is fairly small, but if earthaccess can handle switching between s3/https based on the compute region, that would be awesome!

@alexgleith just FYI, there are a few catches when we access HTTPS:// instead of S3://:

  • Speed: when we use HTTPS we are going through NASA’s CloudFront proxy and opening a dataset could be slower than using the S3:// schema URLs. This is why earthaccess (this library) picks the right access pattern depending on where the code is running (us-west-2 or not).
  • Chunking affecting performance: related to the first point, if a file is chunked into hundreds of chunks, each chunk will result in a separate HTTPS request that has to go through the proxy, so some datasets will be slower to access than others because of this.

Finally, if we are running our code in us-west-2, we can use s3fs with the s3:// URLs, and we can use earthaccess to get us the authenticated sessions if we know the DAAC.

import earthaccess
import xarray as xr

earthaccess.login()
url = "s3://some_nasa_dataset"
fs = earthaccess.get_s3fs_session("LPDAAC")

# we open our granule with s3fs and work with it as usual
with fs.open(url) as file:
    dataset = xr.open_dataset(file)
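
For context on the “depending on where the code is running” part: one rough way to check for yourself whether you are in us-west-2 (not necessarily how earthaccess does it internally) is to ask the EC2 instance metadata service. A sketch:

import requests

def running_in_us_west_2(timeout=1):
    """Best-effort region check against the EC2 instance metadata service (IMDSv2)."""
    try:
        token = requests.put(
            "http://169.254.169.254/latest/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
            timeout=timeout,
        ).text
        region = requests.get(
            "http://169.254.169.254/latest/meta-data/placement/region",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=timeout,
        ).text
        return region == "us-west-2"
    except requests.RequestException:
        # metadata service not reachable, so we are not on EC2 (or it is blocked)
        return False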

Here is an example of it working with xarray!

from fsspec.implementations.http import HTTPFileSystem
import xarray as xr

url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

token = 'my-long-token'  # Earthdata Login (EDL) user token
fs = HTTPFileSystem(headers={
    "Authorization": f"Bearer {token}"
})
ds = xr.open_dataset(fs.open(url))
ds

Thanks @betolink

For the work I’m doing, it’s exploratory, so performance isn’t important yet. And I don’t think chunking matters for the NetCDF files, since they’re not optimised for it. (Happy to be corrected there!)

I’m just doing a little project on the EMIT data and there’s enough complexity in the data itself that I’m happy with the HTTPS loading process. Thanks for your help!

Ok, I got it working using @yuvipanda’s code above.

Needs a little fix in a related project, which I’ve raised as a PR: https://github.com/nasa/EMIT-Data-Resources/pull/24

Hi Alex,

yes, you must have an EC2 instance running in the same region as the S3 bucket (us-west-2 for NASA data) to “Directly Access” the data.

Andy Barrett

Bringing in @cisaacstern to maybe provide some extra feedback to Alexey’s last message.

May not be the right thread, but dropping a note here so it’s more permanent than Slack:

A while ago, I stumbled across these Twitter threads about some climate data stored in Zarr on OpenStorageNetwork’s S3 buckets with HTTP URLs. The example they show accesses Zarr directly via an HTTP URL.

https://twitter.com/charlesstern/status/1574497421245108224?s=20&t=rLvID-0c1j1NxHgy0JOCjQ https://twitter.com/charlesstern/status/1574499938465038336?s=20&t=rLvID-0c1j1NxHgy0JOCjQ

Here’s a direct link to the corresponding Pangeo feedstock (in case Twitter dies): https://pangeo-forge.org/dashboard/feedstock/79

From what I can tell, the underlying storage here is OpenStorageNetwork, which provides the S3 API via Ceph. How exactly all of this is wired and optimized is a bit beyond me, but the end result is compelling and may have some interesting lessons for how we do S3/HTTP.
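
For reference, accessing a Zarr store over plain HTTP like that looks roughly like this (the store URL below is a made-up placeholder):

import fsspec
import xarray as xr

# Placeholder URL for a public Zarr store served over plain HTTP (e.g. on OSN)
store_url = "https://example-osn-endpoint.org/some-bucket/dataset.zarr"

# map the HTTP "directory" to a key/value store and let xarray/zarr read it
mapper = fsspec.get_mapper(store_url)
ds = xr.open_zarr(mapper, consolidated=True)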

Writing this as a reminder: Apache requires a capital B in "Bearer", which matters for on-premise files. This also works for cloud-hosted files, so you should use the following: curl -H "Authorization: Bearer TOKEN" -L --url "URL" > out

This issue has spawned off many different things, so here’s a quick summary:

1. Support using HTTPS + .netrc with xarray universally

Currently, it is not possible to use .netrc files with xarray if you are running from inside us-west-2. So if you are inside us-west-2 and want to access cloud-hosted data with xarray, you must use S3 (not plain HTTPS). Once https://github.com/aio-libs/aiohttp/pull/7131 lands and a new release of aiohttp is made, this issue will go away, and code that uses HTTPS + .netrc will work universally, regardless of whether it is running in us-west-2 or elsewhere.
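
For context, the .netrc entry in question is the standard Earthdata Login one. A quick sketch for checking that it is present (the expected entry is shown in the comment):

import netrc
from pathlib import Path

# Expected ~/.netrc entry:
#   machine urs.earthdata.nasa.gov login <username> password <password>
creds = netrc.netrc(str(Path.home() / ".netrc")).authenticators("urs.earthdata.nasa.gov")
print("Earthdata Login entry found" if creds else "no urs.earthdata.nasa.gov entry in ~/.netrc")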

2. Support using EDL user tokens universally

For cloud access, EDL tokens already work with xarray (https://github.com/nsidc/earthaccess/issues/188#issuecomment-1363450269 has an example). However, they don’t work universally - many on-prem servers don’t support EDL tokens yet, although some do. @briannapagan (from inside NASA) and I (from outside) are pushing on this, getting EDL token support rolled out more universally. If you are inside a DAAC, we could use your help!

3. Determine when to use s3:// protocol vs https:// protocol when inside us-west-2

s3:// links only work from inside us-west-2, so we should have clear documentation on when users should use the s3:// protocol vs just https. From inside us-west-2, there could be a performance difference between these two, but my personal intuition is that it is not significant enough for most use cases, especially beginner cases. This is the part the least amount of work has been done on so far. We would need some test cases comparing s3 vs https from inside us-west-2 to establish this performance difference.
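
A rough sketch of what such a test case could look like from inside us-west-2 (the granule URLs and token are placeholders; get_s3fs_session is used as in the s3fs example earlier in the thread):

import time

import earthaccess
import xarray as xr
from fsspec.implementations.http import HTTPFileSystem

earthaccess.login()

# Placeholder URLs pointing at the same granule over both protocols
s3_url = "s3://lp-prod-protected/some/granule.h5"
https_url = "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/some/granule.h5"
token = "my-long-token"  # EDL user token

def time_open(fs, url):
    # time how long it takes to open and fully read the dataset
    start = time.perf_counter()
    with fs.open(url) as f:
        xr.open_dataset(f).load()
    return time.perf_counter() - start

s3_fs = earthaccess.get_s3fs_session("LPDAAC")
https_fs = HTTPFileSystem(headers={"Authorization": f"Bearer {token}"})

print("s3://   ", time_open(s3_fs, s3_url))
print("https://", time_open(https_fs, https_url))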

End goal for education

My intuitive end goal here is to be able to tell people to ‘use HTTPS links with token auth’ universally, regardless of where they are accessing the data from, with an addendum suggesting using the s3:// protocol under specific performance circumstances. A step along the way is to be able to tell people to use HTTPS links with netrc universally.

@briannapagan and I did another deep dive here, and made some more progress.

There seem to be two primary packages supporting earthdata login on the server side:

  • TEA, the Thin Egress App
  • the Apache URS authentication module

We have established that TEA already supports bearer tokens (https://github.com/asfadmin/thin-egress-app/blob/7b0f7110b1694f553af2b71594cc19e40c179ea9/lambda/app.py#L183). But what of the apache2 module?!

As of Sep 2021, it also supports bearer tokens! https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187 is the appropriate merge commit, and we discovered an internal JIRA ticket named URSFOUR-1600 that also tracks this feature.

With some more sleuthing, we discovered https://forum.earthdata.nasa.gov/viewtopic.php?t=3290. We tracked that down by looking for URSFOUR-1858, mentioned in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/8c4796c0467a1d5dcb8740fb86f23474db8258e3. That merge was the only further activity on the apache module since the merge for token support. Looking through that earthdata forum post, we see that LPDAAC (which maintains the dataset talked about there) mentions deploying 'some apache change' to help with that. So the hypothesis I had was:

  1. LPDAAC ran into some other unrelated issue,
  2. Which required code changes to the apache module, made via URSFOUR-1858
  3. They have deployed this change to their servers
  4. However, since this change was deployed, it is also likely that LPDAAC has included URSFOUR-1600 (user token support) in the deployment as well. Not necessarily explicitly, but just as a side effect of trying to deploy the more recent URSFOUR-1858.

I tested this hypothesis by trying to send a token to https://e4ftl01.cr.usgs.gov/ASTT/AG5KMMOH.041/2001.04.01/ASTER_GEDv4.1_A2001091.h5 - a dataset hosted by LPDAAC. And behold, it works! So all data hosted by LPDAAC supports tokens 😃
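
For anyone wanting to repeat that check against another DAAC, the test is essentially a token-authenticated GET where you look for a 200 on the data server rather than a redirect to the Earthdata Login page. A rough sketch:

import requests

token = "my-long-token"  # EDL user token
url = "https://e4ftl01.cr.usgs.gov/ASTT/AG5KMMOH.041/2001.04.01/ASTER_GEDv4.1_A2001091.h5"

r = requests.get(
    url,
    headers={"Authorization": f"Bearer {token}"},
    allow_redirects=True,
    stream=True,  # we only want the status, not the whole file
)
print(r.status_code, r.url)
# a 200 with a final URL on the data server (not urs.earthdata.nasa.gov)
# means the server accepted the token
r.close()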

So the pathway to using tokens everywhere, including on-prem, boils down to getting all the DAACs to use the latest version of the official earthdata apache2 module.

This is great news for many reasons:

  1. No new code needs to be written! This all is already done.
  2. LPDAAC already deployed this, so it isn’t a brand new deployment
  3. This is the official apache module that DAACs are already using, not some newfangled software.

ok, so current summary is:

  1. https://github.com/aio-libs/aiohttp/pull/7131 adds .netrc support to aiohttp, and hence to fsspec. This is needed for earthdata login access to work consistently in AWS us-west-2 with fsspec the same way it works elsewhere, while using earthdata username / password to login.
  2. However, I think we should recommend everyone use tokens for actually authenticating programmatically - https://urs.earthdata.nasa.gov/documentation/for_users/user_token. This already works with fsspec - just pass headers as a kwarg as shown in the comment above, rather than as part of client_kwargs (see the sketch below). yay!
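
To make the second point concrete, the difference is just where the header goes when constructing the filesystem. A minimal sketch (per the comments above, the first form is the one to use):

from fsspec.implementations.http import HTTPFileSystem

token = "my-long-token"  # EDL user token

# works: headers passed directly as a kwarg (applied to each request)
fs = HTTPFileSystem(headers={"Authorization": f"Bearer {token}"})

# not what we recommend here: headers buried in client_kwargs (applied when
# creating the underlying aiohttp session)
fs_session = HTTPFileSystem(client_kwargs={"headers": {"Authorization": f"Bearer {token}"}})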

Unfortunately, there is a limit of only two tokens per user in earthdata login right now, so you cannot just generate a token for each machine you would use it on, like you can with GitHub Personal Access Tokens. However, the lack of a need for specific files on disk means this would also work with dask.