uproot5: File access methods performance studies
Data provided by @chrisburr; discussion below
If I’m making a fundamental mistake in these file access methods, then I don’t want to design around it. I’ve been thinking about how I can adequately test that. Do you have a suggestion for a publicly available test file?
I default to using these three open data files for XRootD:
```
$ xrdfs root://eospublic.cern.ch/ ls -l /eos/opendata/lhcb/AntimatterMatters2017/data
-r-- 2017-03-07 13:53:05 666484974 /eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetDown.root
-r-- 2017-03-07 13:53:08 444723234 /eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetUp.root
-r-- 2017-03-07 13:53:08 2272072 /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
```
I’ve just copied them to CERN S3 so they can be accessed using standard HTTP(S) as well:
- https://scikit-hep-test-data.s3.cern.ch/B2HHH_MagnetDown.root
- https://scikit-hep-test-data.s3.cern.ch/B2HHH_MagnetUp.root
- https://scikit-hep-test-data.s3.cern.ch/PhaseSpaceSimulation.root
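For a quick self-contained timing check against both protocols, something along these lines should work (a minimal sketch: the tree name `DecayTree` and branch `H1_PX` are assumptions about these files' contents; inspect `file.keys()` and adjust):

```python
import time

import uproot  # reads both root:// (via the XRootD bindings) and https:// sources

urls = [
    "root://eospublic.cern.ch//eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetDown.root",
    "https://scikit-hep-test-data.s3.cern.ch/B2HHH_MagnetDown.root",
]

for url in urls:
    start = time.perf_counter()
    with uproot.open(url) as file:
        tree = file["DecayTree"]      # assumed tree name; check file.keys()
        data = tree["H1_PX"].array()  # assumed branch name; this forces the reads
    elapsed = time.perf_counter() - start
    print(f"{url.split('://')[0]}: {len(data)} entries in {elapsed:.2f} s")
```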
Would a test through a household ISP mean anything, given that most of these files would be used on university or lab computers or on the GRID?
It’s definitely useful: the quality of institute connections varies a lot in my experience, and a household connection is more likely to expose latency issues, especially since you’ll be accessing the files from the US. I think a good implementation should only be limited by bandwidth when reading large enough files.
If you put together a test I can easily try it at a few different institutes as well as pushing a test job to every LHCb grid site.
_Originally posted by @chrisburr in https://github.com/scikit-hep/uproot4/pull/1#issuecomment-627928856_
I’ve been doing a lot of work around laurelin’s I/O recently, so this is conveniently timed.
You can get a pretty good speedup from XRootD’s VectorRead call, particularly when the hosting server exposes its storage through an xrootd plugin (rather than a POSIX filesystem), since the vectored-ness of the read can be passed all the way down to the FS. If you’ve collected the offsets and lengths of all the TBaskets that need to be read in one place, it’s also worthwhile to coalesce adjacent reads, as in the sketch below: e.g. if two baskets are separated by just a TKey, issue one read and throw away the unneeded bytes instead of issuing two reads.
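To illustrate the coalescing idea, here is a hypothetical sketch (the function name and the `max_gap` threshold are inventions for illustration, not part of any library):

```python
def coalesce_reads(ranges, max_gap=512):
    """Merge (offset, length) read requests separated by at most max_gap bytes.

    Reading a small gap (e.g. an interleaving TKey) and discarding it is
    cheaper than issuing a separate request over the network.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_offset, _ = merged[-1]
            merged[-1] = (prev_offset, offset + length - prev_offset)
        else:
            merged.append((offset, length))
    return merged

# Two TBaskets separated by a small gap become one request:
print(coalesce_reads([(0, 1000), (1100, 1000), (100_000, 500)]))
# [(0, 2100), (100000, 500)]
```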
In XRootD’s case, you should additionally look at the async versions. In pyxrootd, this is implemented by providing a callback function that is called when the I/O completes. It’s not the best interface (the entire I/O has to complete before the callback is called, so you can’t progressively dispatch decompression as the bytes come in), but it lets you avoid dealing with threads on the I/O side.
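As a rough sketch of that callback style with the pyxrootd bindings (assuming the XRootD Python bindings are installed; the (offset, length) pairs are placeholders, and real code would take them from TBasket seek keys and check every status):

```python
import threading

from XRootD import client

done = threading.Event()

def on_complete(status, response, hostlist):
    # Called only once the *entire* vector read has finished, so decompression
    # can start no earlier than this point.
    if status.ok:
        for chunk in response.chunks:
            print(f"got {chunk.length} bytes at offset {chunk.offset}")
    done.set()

f = client.File()
status, _ = f.open(
    "root://eospublic.cern.ch//eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetDown.root"
)
assert status.ok, status.message

# Placeholder (offset, length) pairs; passing callback= makes the call async.
f.vector_read(chunks=[(0, 4096), (8192, 4096)], callback=on_complete)

done.wait()  # the caller could do other work here instead of blocking
f.close()
```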