uproot5: uproot4 unable to open ROOT file with colons in the name that uproot3 can
Hi. I’ve come across an issue where if a ROOT file has a colon in the name of it uproot3
can open the file but uproot4
fails. I can’t give you the file, but I can show you a minimal failing example and then a reproducible example with public files.
Minimal Failing Example
$ tree .
.
├── data-tree
│ └── data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root
├── issue.py
├── requirements.txt
1 directory, 3 files
$ cat requirements.txt
uproot
uproot4
$ docker run --rm -it -v $PWD:/data -w /data python:3.8 /bin/bash
root@510598a7f4e8:/data# python -m pip install --upgrade pip setuptools wheel
root@510598a7f4e8:/data# python -m pip install -r requirements.txt
root@510598a7f4e8:/data# python --version
Python 3.8.5
root@510598a7f4e8:/data# python -m pip list
Package Version
-------------- -------
awkward 0.13.0
cachetools 4.1.1
numpy 1.19.1
pip 20.2.2
setuptools 49.6.0
uproot 3.12.0
uproot-methods 0.7.4
uproot4 0.0.18
wheel 0.35.1
root@510598a7f4e8:/data# cp data-tree/data16_13TeV\:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root data-tree/renamed.root
root@510598a7f4e8:/data# python
Python 3.8.5 (default, Aug 5 2020, 08:22:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uproot as uproot3
>>> import uproot4
>>> from pathlib import Path
>>>
>>> uproot3_file = uproot3.open(
... "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
... )
>>> print(f"uproot3 opens file as {uproot3_file}")
uproot3 opens file as <ROOTDirectory b'/home/feickert/workarea/submitDir/data-tree//data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root' at 0x7fce6da5b5e0>
>>>
>>> # uproot4 fails with the ':' in the filename
>>> uproot4.open(
... "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
... )
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 74, in __init__
self._file = numpy.memmap(self._file_path, dtype=self._dtype, mode="r")
File "/usr/local/lib/python3.8/site-packages/numpy/core/memmap.py", line 225, in __new__
f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
FileNotFoundError: [Errno 2] No such file or directory: 'data-tree/data16_13TeV'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 78, in open
file = ReadOnlyFile(
File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 265, in __init__
self._source = Source(file_path, **self._options)
File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 80, in __init__
self._fallback = uproot4.source.file.FileSource(file_path, opts)
AttributeError: module 'uproot4.source.file' has no attribute 'FileSource'
>>> # even if that is inside a pathlib object
>>> uproot4.open(
... Path(
... "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
... )
... )
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 74, in __init__
self._file = numpy.memmap(self._file_path, dtype=self._dtype, mode="r")
File "/usr/local/lib/python3.8/site-packages/numpy/core/memmap.py", line 225, in __new__
f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
FileNotFoundError: [Errno 2] No such file or directory: 'data-tree/data16_13TeV'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 78, in open
file = ReadOnlyFile(
File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 265, in __init__
self._source = Source(file_path, **self._options)
File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 80, in __init__
self._fallback = uproot4.source.file.FileSource(file_path, opts)
AttributeError: module 'uproot4.source.file' has no attribute 'FileSource'
>>> # but the file itself is fine
>>> uproot4.open("data-tree/renamed.root")
<ReadOnlyDirectory '/' at 0x7fce725e7670>
Failing Reproducible Example
# issue.py
import uproot as uproot3
import uproot4
from pathlib import Path
def main():
# curl -sL https://github.com/scikit-hep/scikit-hep-testdata/raw/master/src/skhep_testdata/data/uproot-HZZ-lz4.root -o uproot-HZZ-lz4.root
uproot3.open("uproot-HZZ-lz4.root")
uproot3.open("uproot:HZZ-lz4.root")
uproot3.open(Path("uproot:HZZ-lz4.root"))
uproot4.open("uproot-HZZ-lz4.root")
uproot4.open("uproot:HZZ-lz4.root")
if __name__ == "__main__":
main()
root@510598a7f4e8:/data# curl -sL https://github.com/scikit-hep/scikit-hep-testdata/raw/master/src/skhep_testdata/data/uproot-HZZ-lz4.root -o uproot-HZZ-lz4.root
root@510598a7f4e8:/data# cp uproot-HZZ-lz4.root uproot:HZZ-lz4.root
root@510598a7f4e8:/data# python issue.py
Traceback (most recent call last):
File "issue.py", line 17, in <module>
main()
File "issue.py", line 9, in main
uproot3.open("uproot:HZZ-lz4.root")
File "/usr/local/lib/python3.8/site-packages/uproot/rootio.py", line 63, in open
raise ValueError("URI scheme not recognized: {0}".format(path))
ValueError: URI scheme not recognized: uproot:HZZ-lz4.root
Comments
I realize that this is probably because uproot4
’s open
’s path
is
and it isn’t a good idea to have a file with a colon in it in general. However, there are ATLAS files that do, and it would be nice if uproot3
behavior toward filenames could still be supported. If support for this is firmly out of scope it would be great if there could be some huge warning about this in the docs.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 38 (36 by maintainers)
Commits related to this issue
- Implemented conditional colon-parsing and full error message as described here: https://github.com/scikit-hep/uproot4/issues/79#issuecomment-682020092 — committed to scikit-hep/uproot5 by jpivarski 4 years ago
- Removed the colon-parsing and replaced it with dicts. (#81) * [skip ci] Removed the colon-parsing and replaced it with dicts. Tests will fail because they still assume you can use colons. * All te... — committed to scikit-hep/uproot5 by jpivarski 4 years ago
For files ending in
.root.1
, etc., I would have thought you’re out of the easy case and have to start using dicts. The error message would say that. Would that be a big imposition?I hadn’t considered having different behavior for
str
/bytes
than forpathlib.Path
. Would that be more expected?Also, on accepting an already open Python file-handle (loosely interpreted as anything with
read
andseek
, likeio.BytesIO
), Uproot has never had this capability, but it would be possible. It would exclude parallel reading of baskets, but you get what you pay for. That could become a new Source subclass, but it’s beyond the scope of this issue.Since there are no thumbs up on the previous proposal and two objections thereafter, how about the following:
.root
substring is found.str
orbytes
is passed, do the colon parsing as before (many people have said that they want it, though it can be a stumbling block, as in @matthewfeickert’s case that started this thread).pathlib.Path
is passed, do not do colon parsing, which is a good way of enforcing what you want inuproot4.open
, but would be limiting in functions that need to open many file-tree combinations, such asuproot4.iterate
,uproot4.concatenate
, anduproot4.lazy
.uproot4.open
is a length-1 dict.What does everybody think of this? Thumbs up or comment: thanks!
(I’ve actually been distracted by other work, so just getting back to this now.)
One difficulty with optimizing usability is that “what is natural” or “what a novice might try” is rather speculative. I would expect novices (including me, approaching a system I don’t know) to get tripped up by the requirement that the object path has to be expressed as absolute, i.e. that it must start with
:/blah
and not just:blah
, even though it is always going to be evaluated relative to/
. (This was confusing in the file URI scheme, for example.) I’m not of a strong opinion about this point, though, because I think the error message is where the wide array of users’ guesses as to how it works will get narrowed down to how it actually does work—assuming users read error messages. (Sometimes a problem, but not a deep one if they copy-paste the error message into the bug report for me to read.)The other difficulty with optimizing usability is that expectations among advanced users become necessities. I hadn’t known (I guessed right) that the colon syntax was an expected feature, and so losing it would be a stumbling block for users who expect it from ROOT.
But the thing that’s driving the dict syntax is neither of the above. It’s driven by the functions that access many file-object (actually, file-TTree) pairs. They need a way—not a convenient way but any way—to express combined filesystem + ROOT object paths for hundreds or thousands of files. If we could have used the colon syntax for that, then that would have been great. But some filesystem paths have colons in them. In principle, a directory name could end with a colon, which would get incorrectly identified as
:/
. So the many file-object functions need to have a backup syntax for when the string syntax isn’t expressive enough, and that’s what the dicts are. Only one other suggestion has been made that would be as unambiguous, but it’s pretty verbose and introduces more things to remember.So novice users do not need to know about the dict syntax; it only comes up if they can’t express a group of filenames + object paths because the filenames have colons in them, and then it comes up as an error message because when they tried to do the naive thing, it couldn’t find the file.
I’m going to implement the last proposal I made. I considered not porting the dict syntax from the functions that need it (
uproot4.iterate
,uproot4.concatenate
,uproot4.lazy
) to the function that doesn’t really need it (uproot4.open
, for which only a length-1 dict would be allowed) because the “file not found” error messages don’t know whether they’re from an attempt to open a single file or many files, and tracking that provenance through everything so that they can give context-dependent error messages doesn’t seem worthwhile. Making the same syntax apply to all functions equally, even if it’s not really needed for one of them, is simpler.I think the first and third would work; the second wouldn’t for the reason you mentioned. But the preferred way to do it, following @kratsg’s advice about
pathlib.Path
, isuproot4.open(pathlib.Path("foo:bar.root"))
No colon parsing would be performed on a
pathlib.Path
, and since this isuproot4.open
, there’s no reason to bring in any fancy dicts because you don’t have to specify a TTree path. Foruproot4.iterate
,uproot4.concatenate
,uproot4.lazy
, you do have to specify TTrees because you’re iterating over, concatenating, or building lazy arrays from TTrees, not files. There wouldn’t be a good reason to make the value in a dictNone
, though it would work, for consistency’s sake.The error message should not only say what’s possible, but should say what’s preferred. Now I’m finally going to go and implement that.
There’s still a problem with requiring
.root:
. Another thing that I often see (again from rucio) is filenames ending with a number like.root.1
. For a random public example, see https://twiki.cern.ch/twiki/bin/view/Sandbox/GridNotes.But then you’d have to remember the spellings of those keys:
"file_path"
and"object_path"
. They become part of the API that has to be remembered or looked up. If I’m in a LISPy mood, I’d be inclined to expect hyphens, rather than underscores, and maybe the whole second word is unnecessary. What you suggest would be great for a REST API, but it’s pretty verbose for Python interaction.Also, it prevents multiple files in one dict, since there can only be one
"file_path"
key.In the end, the error messages look like this:
Since the colon parsing is a convenience and gaps in many-file functions can be filled in with the dict syntax, how about if it only works in limited circumstances, such as after
.root
? (@tamasgal originally made this suggestion.) So,/path/to/filename.root:path/to/ttree
picks out an object atpath/to/ttree
(requiring.root
makes requiring a leading/
in the object path superfluous, since it’s already quite unambiguous, and it’s a potential “gotcha”)/path/to/file:name.root
would be interpreted as a file, not a “name.root
” object path within a file, since “file
” doesn’t end in “.root
”/path/to/file:name.root:path/to/ttree
also picks out an object atpath/to/ttree
from a file namedfile:name.root
; earlier colons are part of the filename{"/path/to/file.root:haha-this-is-the-filename.root": None}
won’t parse the colon and won’t pick out an object within the file because of theNone
value.root
, or you’re using wildcards that don’t end in.root
(a likely example:/path/to/directory/*
). This is for functions that need to find many files, such asuproot4.iterate
,uproot4.concatenate
, anduproot4.lazy
.The colon parsing will be much simpler, based on the regex
\.root\s*:\s*
, because I’ll never need to worry about Windows drive letters or URL schemas with the\.root
disambiguator.Thumbs-up this comment if you’re in favor, write another comment below if you’re not; I’m going to start coding up this solution and if we’re in general agreement, I’ll merge it.
Right, this is exactly what I’m familiar with in dealing with my own analysis files. I think it’s important for a filename like
foo:bar.root
to work in the standard string syntax. This is a very common case (for ATLAS at least). I like dropping the colon parsing entirely myself, but I think requiring:/
is probably still good enough.rucio relies on scopes for filenames, and scopes are separated with colons:
mc15_13TeV:{filename}
and when downloading, will include the scope as well. Dataset discovery relies on these scopes.I find the existing colon functionality very convenient. One use case I have in mind are frameworks for histogram production. Users need to specify where all the inputs to their histograms are found. The existing syntax allows to do that in just one string, users could e.g. write
and specify a list of sample types and systematic variations to fill in the placeholders. If the paths inside and outside the ROOT file need to be specified separately, two such strings would be needed. Having it all in one string also reinforces the idea that for the structure of the data it does not matter much whether something is in a directory or a
TDirectory
.If one takes the above example but puts everything into a big root file, the string would look like
and it is quickly evident that the structure of content is the same. This is harder to parse when split up.
Since it came up above: I’ve had to deal with file structures before where folder names ended with
.root
(and those folders contained ROOT files), but due to the name the folders mistakenly were interpreted as a ROOT file (this was with ROOT, notuproot
). Relying on this to parse might not always work.Since there have been a few differing suggestions about what to do here, I’m not going to merge PR #81 immediately. I’ll wait a day or two to let others voice their opinions, to see if there is a majority view. (@henryiii, do you have an idea?)
We don’t have a formal voting system, but I’ll take your suggestions into account (as I did with a lot of the names in Awkward1, where I personally preferred fewer underscores).