uproot5: uproot4 unable to open ROOT file with colons in the name that uproot3 can

Hi. I’ve come across an issue where if a ROOT file has a colon in the name of it uproot3 can open the file but uproot4 fails. I can’t give you the file, but I can show you a minimal failing example and then a reproducible example with public files.

Minimal Failing Example

$ tree .
.
├── data-tree
│   └──  data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root
├── issue.py
├── requirements.txt

1 directory, 3 files
$ cat requirements.txt 
uproot
uproot4
$ docker run --rm -it -v $PWD:/data -w /data python:3.8 /bin/bash
root@510598a7f4e8:/data# python -m pip install --upgrade pip setuptools wheel
root@510598a7f4e8:/data# python -m pip install -r requirements.txt
root@510598a7f4e8:/data# python --version
Python 3.8.5
root@510598a7f4e8:/data# python -m pip list
Package        Version
-------------- -------
awkward        0.13.0
cachetools     4.1.1
numpy          1.19.1
pip            20.2.2
setuptools     49.6.0
uproot         3.12.0
uproot-methods 0.7.4
uproot4        0.0.18
wheel          0.35.1
root@510598a7f4e8:/data# cp data-tree/data16_13TeV\:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root data-tree/renamed.root
root@510598a7f4e8:/data# python
Python 3.8.5 (default, Aug  5 2020, 08:22:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uproot as uproot3
>>> import uproot4
>>> from pathlib import Path
>>> 
>>> uproot3_file = uproot3.open(
...     "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
... )
>>> print(f"uproot3 opens file as {uproot3_file}")
uproot3 opens file as <ROOTDirectory b'/home/feickert/workarea/submitDir/data-tree//data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root' at 0x7fce6da5b5e0>
>>> 
>>> # uproot4 fails with the ':' in the filename
>>> uproot4.open(
...     "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
... )
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 74, in __init__
    self._file = numpy.memmap(self._file_path, dtype=self._dtype, mode="r")
  File "/usr/local/lib/python3.8/site-packages/numpy/core/memmap.py", line 225, in __new__
    f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
FileNotFoundError: [Errno 2] No such file or directory: 'data-tree/data16_13TeV'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 78, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 265, in __init__
    self._source = Source(file_path, **self._options)
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 80, in __init__
    self._fallback = uproot4.source.file.FileSource(file_path, opts)
AttributeError: module 'uproot4.source.file' has no attribute 'FileSource'
>>> # even if that is inside a pathlib object
>>> uproot4.open(
...     Path(
...         "data-tree/data16_13TeV:data16_13TeV.periodA.physics_Main.PhysCont.DAOD_JETM1.grp16_v01_p4061.root"
...     )
... )
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 74, in __init__
    self._file = numpy.memmap(self._file_path, dtype=self._dtype, mode="r")
  File "/usr/local/lib/python3.8/site-packages/numpy/core/memmap.py", line 225, in __new__
    f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
FileNotFoundError: [Errno 2] No such file or directory: 'data-tree/data16_13TeV'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 78, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.8/site-packages/uproot4/reading.py", line 265, in __init__
    self._source = Source(file_path, **self._options)
  File "/usr/local/lib/python3.8/site-packages/uproot4/source/file.py", line 80, in __init__
    self._fallback = uproot4.source.file.FileSource(file_path, opts)
AttributeError: module 'uproot4.source.file' has no attribute 'FileSource'
>>> # but the file itself is fine
>>> uproot4.open("data-tree/renamed.root")
<ReadOnlyDirectory '/' at 0x7fce725e7670>

Failing Reproducible Example

# issue.py
import uproot as uproot3
import uproot4
from pathlib import Path


def main():
    # curl -sL https://github.com/scikit-hep/scikit-hep-testdata/raw/master/src/skhep_testdata/data/uproot-HZZ-lz4.root -o uproot-HZZ-lz4.root
    uproot3.open("uproot-HZZ-lz4.root")
    uproot3.open("uproot:HZZ-lz4.root")
    uproot3.open(Path("uproot:HZZ-lz4.root"))

    uproot4.open("uproot-HZZ-lz4.root")
    uproot4.open("uproot:HZZ-lz4.root")


if __name__ == "__main__":
    main()
root@510598a7f4e8:/data# curl -sL https://github.com/scikit-hep/scikit-hep-testdata/raw/master/src/skhep_testdata/data/uproot-HZZ-lz4.root -o uproot-HZZ-lz4.root
root@510598a7f4e8:/data# cp uproot-HZZ-lz4.root uproot:HZZ-lz4.root 
root@510598a7f4e8:/data# python issue.py 
Traceback (most recent call last):
  File "issue.py", line 17, in <module>
    main()
  File "issue.py", line 9, in main
    uproot3.open("uproot:HZZ-lz4.root")
  File "/usr/local/lib/python3.8/site-packages/uproot/rootio.py", line 63, in open
    raise ValueError("URI scheme not recognized: {0}".format(path))
ValueError: URI scheme not recognized: uproot:HZZ-lz4.root

Comments

I realize that this is probably because uproot4’s open’s path is

https://github.com/scikit-hep/uproot4/blob/b8828069b9ae52c742fb704f07e1a4e179fe7b30/uproot4/reading.py#L43-L46

and it isn’t a good idea to have a file with a colon in it in general. However, there are ATLAS files that do, and it would be nice if uproot3 behavior toward filenames could still be supported. If support for this is firmly out of scope it would be great if there could be some huge warning about this in the docs.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 38 (36 by maintainers)

Commits related to this issue

Most upvoted comments

For files ending in .root.1, etc., I would have thought you’re out of the easy case and have to start using dicts. The error message would say that. Would that be a big imposition?

I hadn’t considered having different behavior for str/bytes than for pathlib.Path. Would that be more expected?

Also, on accepting an already open Python file-handle (loosely interpreted as anything with read and seek, like io.BytesIO), Uproot has never had this capability, but it would be possible. It would exclude parallel reading of baskets, but you get what you pay for. That could become a new Source subclass, but it’s beyond the scope of this issue.


Since there are no thumbs up on the previous proposal and two objections thereafter, how about the following:

  • No consideration for whether a .root substring is found.
  • If a str or bytes is passed, do the colon parsing as before (many people have said that they want it, though it can be a stumbling block, as in @matthewfeickert’s case that started this thread).
  • If a pathlib.Path is passed, do not do colon parsing, which is a good way of enforcing what you want in uproot4.open, but would be limiting in functions that need to open many file-tree combinations, such as uproot4.iterate, uproot4.concatenate, and uproot4.lazy.
  • Support the dict syntax, which also does not do colon parsing, since the key-value separation distinguishes between file path and object path. This is the most general way to express a file path and object path, particularly useful for expressing many flie-tree combinations, and its limit for uproot4.open is a length-1 dict.
  • All “file not found” errors explain the rules clearly with examples.

What does everybody think of this? Thumbs up or comment: thanks!

(I’ve actually been distracted by other work, so just getting back to this now.)

One difficulty with optimizing usability is that “what is natural” or “what a novice might try” is rather speculative. I would expect novices (including me, approaching a system I don’t know) to get tripped up by the requirement that the object path has to be expressed as absolute, i.e. that it must start with :/blah and not just :blah, even though it is always going to be evaluated relative to /. (This was confusing in the file URI scheme, for example.) I’m not of a strong opinion about this point, though, because I think the error message is where the wide array of users’ guesses as to how it works will get narrowed down to how it actually does work—assuming users read error messages. (Sometimes a problem, but not a deep one if they copy-paste the error message into the bug report for me to read.)

The other difficulty with optimizing usability is that expectations among advanced users become necessities. I hadn’t known (I guessed right) that the colon syntax was an expected feature, and so losing it would be a stumbling block for users who expect it from ROOT.

But the thing that’s driving the dict syntax is neither of the above. It’s driven by the functions that access many file-object (actually, file-TTree) pairs. They need a way—not a convenient way but any way—to express combined filesystem + ROOT object paths for hundreds or thousands of files. If we could have used the colon syntax for that, then that would have been great. But some filesystem paths have colons in them. In principle, a directory name could end with a colon, which would get incorrectly identified as :/. So the many file-object functions need to have a backup syntax for when the string syntax isn’t expressive enough, and that’s what the dicts are. Only one other suggestion has been made that would be as unambiguous, but it’s pretty verbose and introduces more things to remember.

So novice users do not need to know about the dict syntax; it only comes up if they can’t express a group of filenames + object paths because the filenames have colons in them, and then it comes up as an error message because when they tried to do the naive thing, it couldn’t find the file.

I’m going to implement the last proposal I made. I considered not porting the dict syntax from the functions that need it (uproot4.iterate, uproot4.concatenate, uproot4.lazy) to the function that doesn’t really need it (uproot4.open, for which only a length-1 dict would be allowed) because the “file not found” error messages don’t know whether they’re from an attempt to open a single file or many files, and tracking that provenance through everything so that they can give context-dependent error messages doesn’t seem worthwhile. Making the same syntax apply to all functions equally, even if it’s not really needed for one of them, is simpler.


In any case, I’m confused as to what the dict syntax is when you just want to open a file with a colon in the filename. Is it:

  • uproot4.open({'foo:bar.root': None})?
  • uproot4.open({'foo:bar.root': ''})?
  • uproot4.open({'foo:bar.root': '/'})?

or some/all of the above?

I think the first and third would work; the second wouldn’t for the reason you mentioned. But the preferred way to do it, following @kratsg’s advice about pathlib.Path, is

  • uproot4.open(pathlib.Path("foo:bar.root"))

No colon parsing would be performed on a pathlib.Path, and since this is uproot4.open, there’s no reason to bring in any fancy dicts because you don’t have to specify a TTree path. For uproot4.iterate, uproot4.concatenate, uproot4.lazy, you do have to specify TTrees because you’re iterating over, concatenating, or building lazy arrays from TTrees, not files. There wouldn’t be a good reason to make the value in a dict None, though it would work, for consistency’s sake.

The error message should not only say what’s possible, but should say what’s preferred. Now I’m finally going to go and implement that.

There’s still a problem with requiring .root:. Another thing that I often see (again from rucio) is filenames ending with a number like .root.1. For a random public example, see https://twiki.cern.ch/twiki/bin/view/Sandbox/GridNotes.

I don’t think using the filename as a key in a dict is so great. What about instead a list of dicts with two specific keys: [{"file_path": fname, "object_path": "tree;1"}, ...]

But then you’d have to remember the spellings of those keys: "file_path" and "object_path". They become part of the API that has to be remembered or looked up. If I’m in a LISPy mood, I’d be inclined to expect hyphens, rather than underscores, and maybe the whole second word is unnecessary. What you suggest would be great for a REST API, but it’s pretty verbose for Python interaction.

Also, it prevents multiple files in one dict, since there can only be one "file_path" key.

In the end, the error messages look like this:

FileNotFoundError: file not found

    '/really/long/path/that:in:the:end/is/not-a-file.root'

Files may be specified as:
   * str/bytes: relative or absolute filesystem path or URL, without any colons
         other than Windows drive letter or URL schema.
         Examples: "rel/file.root", "C:\abs\file.root", "http://where/what.root"
   * str/bytes: same with an object-within-ROOT path, separated by a colon.
         Example: "rel/file.root:tdirectory/ttree"
   * pathlib.Path: always interpreted as a filesystem path or URL only (no
         object-within-ROOT path), regardless of whether there are any colons.
         Examples: Path("rel:/file.root"), Path("/abs/path:stuff.root")

Functions that accept many files (uproot4.iterate, etc.) also allow:
   * glob syntax in str/bytes and pathlib.Path.
         Examples: Path("rel/*.root"), "/abs/*.root:tdirectory/ttree"
   * dict: keys are filesystem paths, values are objects-within-ROOT paths.
         Example: {"/data_v1/*.root": "ttree_v1", "/data_v2/*.root": "ttree_v2"}
   * already-open TTree objects.
   * iterables of the above.

Since the colon parsing is a convenience and gaps in many-file functions can be filled in with the dict syntax, how about if it only works in limited circumstances, such as after .root? (@tamasgal originally made this suggestion.) So,

  • /path/to/filename.root:path/to/ttree picks out an object at path/to/ttree (requiring .root makes requiring a leading / in the object path superfluous, since it’s already quite unambiguous, and it’s a potential “gotcha”)
  • /path/to/file:name.root would be interpreted as a file, not a “name.root” object path within a file, since “file” doesn’t end in “.root
  • /path/to/file:name.root:path/to/ttree also picks out an object at path/to/ttree from a file named file:name.root; earlier colons are part of the filename
  • use the dict syntax if you really need to ensure that colons are never parsed; {"/path/to/file.root:haha-this-is-the-filename.root": None} won’t parse the colon and won’t pick out an object within the file because of the None value
  • use the dict syntax if you need to pick out objects and your files do not end with .root, or you’re using wildcards that don’t end in .root (a likely example: /path/to/directory/*). This is for functions that need to find many files, such as uproot4.iterate, uproot4.concatenate, and uproot4.lazy.

The colon parsing will be much simpler, based on the regex \.root\s*:\s*, because I’ll never need to worry about Windows drive letters or URL schemas with the \.root disambiguator.

Thumbs-up this comment if you’re in favor, write another comment below if you’re not; I’m going to start coding up this solution and if we’re in general agreement, I’ll merge it.

Colon parsing by default, but not in the dict syntax. Assumes that colons in filenames are somewhat rare, which I expect they are?

rucio relies on scopes for filenames, and scopes are separated with colons: mc15_13TeV:{filename} and when downloading, will include the scope as well. Dataset discovery relies on these scopes.

Right, this is exactly what I’m familiar with in dealing with my own analysis files. I think it’s important for a filename like foo:bar.root to work in the standard string syntax. This is a very common case (for ATLAS at least). I like dropping the colon parsing entirely myself, but I think requiring :/ is probably still good enough.

Colon parsing by default, but not in the dict syntax. Assumes that colons in filenames are somewhat rare, which I expect they are?

rucio relies on scopes for filenames, and scopes are separated with colons: mc15_13TeV:{filename} and when downloading, will include the scope as well. Dataset discovery relies on these scopes.

I find the existing colon functionality very convenient. One use case I have in mind are frameworks for histogram production. Users need to specify where all the inputs to their histograms are found. The existing syntax allows to do that in just one string, users could e.g. write

mc_samples/{SAMPLE_TYPE}.root:{SYSTEMATIC_VARIATION}

and specify a list of sample types and systematic variations to fill in the placeholders. If the paths inside and outside the ROOT file need to be specified separately, two such strings would be needed. Having it all in one string also reinforces the idea that for the structure of the data it does not matter much whether something is in a directory or a TDirectory.

If one takes the above example but puts everything into a big root file, the string would look like

mc_samples.root:{SAMPLE_TYPE}/{SYSTEMATIC_VARIATION}

and it is quickly evident that the structure of content is the same. This is harder to parse when split up.

Since it came up above: I’ve had to deal with file structures before where folder names ended with .root (and those folders contained ROOT files), but due to the name the folders mistakenly were interpreted as a ROOT file (this was with ROOT, not uproot). Relying on this to parse might not always work.

Since there have been a few differing suggestions about what to do here, I’m not going to merge PR #81 immediately. I’ll wait a day or two to let others voice their opinions, to see if there is a majority view. (@henryiii, do you have an idea?)

We don’t have a formal voting system, but I’ll take your suggestions into account (as I did with a lot of the names in Awkward1, where I personally preferred fewer underscores).