sunpy: Fido download of EIS data stops at 80% resulting in corrupt files

Description

Fido download of Hinode EIS data stops at 80% resulting in corrupt files.

Expected vs Actual behavior

  1. Files should be downloaded in full, not just the first 80%.
  2. If that is not possible, an error should be raised instead of continuing as if the download had succeeded.

Steps to Reproduce

from sunpy.net import Fido, attrs
res = Fido.search(attrs.Instrument('EIS'), attrs.Time("2015-07-18T11:59:46.670", "2015-07-18T15:50:41.659"))
downloaded_files = Fido.fetch(res, path="eis_obs")  # the downloaded files are corrupt

System Details

  • SunPy Version: 2.0.5 / 2.0.7
  • Astropy Version: 4.0 / 4.2
  • Python Version: 3.7.2
  • OS information: Windows 10

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (19 by maintainers)

Most upvoted comments

You are correct Stuart. The 403 is coming from mod_security on the gateway machine:

ModSecurity: Access denied with code 403 (phase 2). String match "bytes=0-"

I’ve removed the rule from the gateway machine and now get a full, compressed EIS file that can be de-compressed.

Please test again and let me know if you still see issues.

Specify your own parfive downloader instance with max_splits=0.

I’ve removed the mod_security Rule #958291 that was causing the initial error in downloading EIS data. It turns out that only EIS and MEES data (that I know of) are gzipped at SDAC.

With Rule #958291 removed, one can see the Range request being processed, here for a MEES file:

63.225.X.Y - - [23/Apr/2021:13:24:24 -0400] "GET /data/mees/MCCD/Level_0.5/200702/MCIS_20070205_171714.scan.fits.gz HTTP/1.1" 206 4

63.225.X.Y - - [23/Apr/2021:13:24:22 -0400] "GET /data/mees/MCCD/Level_0.5/200702/MCIS_20070205_171714.scan.fits.gz HTTP/1.1" 206 33855579

63.225.X.Y - - [23/Apr/2021:13:24:22 -0400] "GET /data/mees/MCCD/Level_0.5/200702/MCIS_20070205_171714.scan.fits.gz HTTP/1.1" 206 33855579

63.225.X.Y - - [23/Apr/2021:13:24:22 -0400] "GET /data/mees/MCCD/Level_0.5/200702/MCIS_20070205_171714.scan.fits.gz HTTP/1.1" 206 33855579

63.225.X.Y - - [23/Apr/2021:13:24:22 -0400] "GET /data/mees/MCCD/Level_0.5/200702/MCIS_20070205_171714.scan.fits.gz HTTP/1.1" 206 33855579

63.225.X.Y - - [23/Apr/2021:13:24:22 -0400] "GET /data/mees/MCCD/Level_0.5/200702/MCIS_20070205_171714.scan.fits.gz HTTP/1.1" 206 33855579

63.225.X.Y - - [23/Apr/2021:13:44:55 -0400] "GET /data/mees/MCCD/Level_0.5/200702/MCIS_20070205_193304.scan.fits.gz HTTP/1.1" 200 178429659

Yet the downloaded EIS and MEES files, while complete in size, cannot be decompressed:

vi eis_l0_20150718_001226_fits.gz

"eis_l0_20150718_001226_fits.gz" [noeol] 104435L, 43329622C
Error detected while processing function gzip#read:
line   44:
Error: Could not read uncompressed file
Press ENTER or type command to continue

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /data/hinode/eis/level0/2015/07/18/eis_l0_20150718_001226.fits.gz
on this server.</p>
</body></html>

In the case of EIS, the file still has a 403 response embedded in it, so gunzip cannot read it, hence the error seen. I have of course checked the file permissions, and they are correct. I’ve also adjusted the Apache config to have

  AddEncoding x-compress .Z
  AddEncoding x-gzip .gz .tgz

so that Parfive should indeed receive a correct .gz file from the Range request. I’ve also tested the transfer of the same data files for EIS and MEES via the VSO web client, and indeed get the full expected file which can be decompressed and is readable.
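A download can be checked for this kind of corruption without opening it in an editor. The sketch below uses only the Python standard library and treats data as valid only if it decompresses as a complete gzip stream. `check_gzip_bytes` is a hypothetical helper name chosen for this example; it is not part of SunPy or parfive.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def check_gzip_bytes(data: bytes) -> int:
    """Return the decompressed size, or raise ValueError if `data` is not valid gzip."""
    if not data.startswith(GZIP_MAGIC):
        # An HTML error page saved under a .gz name fails here.
        raise ValueError("missing gzip magic bytes; not a gzip stream")
    try:
        # Full decompression also catches truncation and CRC mismatches.
        return len(gzip.decompress(data))
    except (OSError, EOFError, zlib.error) as exc:
        raise ValueError(f"gzip stream is corrupt or truncated: {exc}") from exc
```

For a file on disk, the bytes can be read with `pathlib.Path(name).read_bytes()` and passed to the helper. Note that this loads and decompresses the whole file in memory, which is acceptable for files of the sizes shown here but is a design shortcut.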

It looks like the final problem might be on the Python side of the request.

It turns out there are at least two known outstanding bugs in aiohttp concerning chunked transfers:

https://github.com/aio-libs/aiohttp/issues/4435

https://github.com/aio-libs/aiohttp/issues/4462

These seem to stem from the design of aiohttp itself, if I read the documentation correctly:

chunked (int) – Enable chunked transfer encoding. It is up to the developer to decide how to chunk data streams. If chunking is enabled, aiohttp encodes the provided chunks in the “Transfer-encoding: chunked” format. If chunked is set, then the Transfer-encoding and content-length headers are disallowed. None by default (optional).

https://docs.aiohttp.org/en/stable/client_reference.html
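For reference, the “Transfer-encoding: chunked” format named in that quote frames each chunk as a hexadecimal length, CRLF, the data, and CRLF, and ends the body with a zero-length chunk. The following is a minimal standard-library sketch of that framing; `encode_chunked` is a name made up for the example and is not an aiohttp function.

```python
from typing import Iterable


def encode_chunked(chunks: Iterable[bytes]) -> bytes:
    """Frame `chunks` as an HTTP/1.1 chunked message body."""
    out = bytearray()
    for chunk in chunks:
        if not chunk:
            continue  # an empty chunk would be read as the end of the body
        # Chunk size is written in hexadecimal, followed by CRLF.
        out += f"{len(chunk):x}".encode("ascii") + b"\r\n"
        out += chunk + b"\r\n"
    out += b"0\r\n\r\n"  # terminating zero-length chunk, no trailers
    return bytes(out)
```

This framing describes how a single message body is transmitted; it is a separate mechanism from the `Range:` requests that split one file across several responses.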

One possible solution is to override the Parfive method _get_http to use requests instead of aiohttp for Range requests of small numbers of individual .gz files. For “staged” tarball requests of data from SDAC, more coding in Parfive will most likely be needed, but I’m not sure that type of request is made very often.

Range requests from SunPy 2.1 for uncompressed data such as EUNIS or WISPR do indeed work:

63.225.X.Y - - [23/Apr/2021:12:23:59 -0400] "GET /data/psp/wispr/L1/20190410/psp_L1_wispr_20190410T021829_V1_1221.fits HTTP/1.1" 206 3150145

63.225.X.Y - - [23/Apr/2021:12:23:59 -0400] "GET /data/psp/wispr/L1/20190410/psp_L1_wispr_20190410T021829_V1_1221.fits HTTP/1.1" 206 3150145

63.225.X.Y - - [23/Apr/2021:12:23:59 -0400] "GET /data/psp/wispr/L1/20190410/psp_L1_wispr_20190410T021829_V1_1221.fits HTTP/1.1" 206 3150145

63.225.X.Y - - [23/Apr/2021:12:23:59 -0400] "GET /data/psp/wispr/L1/20190410/psp_L1_wispr_20190410T021829_V1_1221.fits HTTP/1.1" 200 15750720
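Access-log lines like the ones above can be tallied per file with a short script, which makes it easier to see how many partial (206) and full (200) responses were served. This is only a sketch: the regular expression assumes the request, status, and size layout shown in these lines, and the names `LOG_RE` and `summarize` are invented for the example.

```python
import re
from collections import defaultdict

# Matches: "GET <path> HTTP/x.y" <status> <bytes>
LOG_RE = re.compile(
    r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) (?P<size>\d+)'
)


def summarize(lines):
    """Map each requested path to {status code: [response sizes in bytes]}."""
    summary = defaultdict(lambda: defaultdict(list))
    for line in lines:
        match = LOG_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like GET entries
        status = int(match["status"])
        summary[match["path"]][status].append(int(match["size"]))
    # Convert nested defaultdicts to plain dicts for easier printing.
    return {path: dict(by_status) for path, by_status in summary.items()}
```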

The issue appears to be that parfive will attempt to download the file in chunks to speed up the transfer if the server's response to the GET request includes the Accept-Ranges: bytes header (which in this case it does). However, if we then request segments of the file with the Range: header present, the content becomes corrupted.
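To make the splitting concrete, here is a minimal sketch of how a file of known length can be divided into inclusive byte ranges for `Range:` headers. It is not parfive's actual implementation; the function name `split_byte_ranges` and the even-split policy are assumptions made for illustration.

```python
from typing import List


def split_byte_ranges(content_length: int, parts: int) -> List[str]:
    """Return Range header values that together cover bytes 0..content_length-1."""
    if content_length <= 0 or parts <= 0:
        raise ValueError("content_length and parts must be positive")
    parts = min(parts, content_length)  # never create empty ranges
    base, extra = divmod(content_length, parts)
    headers = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        end = start + size - 1  # HTTP byte ranges are inclusive at both ends
        headers.append(f"bytes={start}-{end}")
        start = end + 1
    return headers
```

For example, `split_byte_ranges(10, 3)` gives `['bytes=0-3', 'bytes=4-6', 'bytes=7-9']`. Any such split begins with a range of the form `bytes=0-...`, which contains the string that the ModSecurity message quoted earlier matched on.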

I will shortly open a PR to parfive with the branch I used to debug this, as I added some env vars and logging. However, this is an issue with the behaviour of the VSO server.

The mount containing the Hinode data is up and responding. I was able to download that 11:41 7/18/15 EIS tarball using both the VSO web client and IDL/SSW. The download did fail for that particular file using my SunPy test script. Other EIS files are downloaded successfully by my SunPy script.

Interestingly, that particular file was a lot slower downloading via IDL/SSW than others:

5 : https://sdac.virtualsolar.org/data/hinode/eis/level0/2015/07/18/eis_l0_20150718_114126.fits.gz % SOCK_GET_MAIN: 42947405 bytes of eis_l0_20150718_114126.fits.gz copied in 200.79 seconds.

vs,

9 : https://sdac.virtualsolar.org/data/hinode/eis/level0/2015/07/18/eis_l0_20150718_054203.fits.gz % SOCK_GET_MAIN: 1670876 bytes of eis_l0_20150718_054203.fits.gz copied in 8.49 seconds.

Not sure if the problem is in Fido, parfive, or elsewhere in SunPy. It does seem to be SunPy-specific.