xarray: bool(ds) should raise a "the truth value of a Dataset is ambiguous" error

Throwing this out there - happy to be shot down if people are opposed.

Current behavior / griping

Currently, coercing a dataset to a boolean invokes ds.__bool__, which in turn calls bool(ds.data_vars):

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
    ...
    def __bool__(self) -> bool:
        return bool(self.data_vars)

This has the unfortunate property of returning True as long as there is at least one data_variable, regardless of the contents.

Currently, the behavior of Dataset.__bool__ is, at least as far as I’ve seen, never helpful but frequently unhelpful. I’ve seen (and written) tests written for DataArrays being passed a Dataset and suddenly the tests are meaningless so many times. Conversely, I’ve never found a legitimate use case for bool(ds). As far as I can tell, this is essentially the same as len(ds.data_vars) > 0.

In fact, while testing out my proposed changes below on a fork, I found two tests in the xarray test suite that had succumbed to this issue: see https://github.com/pydata/xarray/pull/6122 and https://github.com/pydata/xarray/pull/6123.

This has been discussed before - see https://github.com/pydata/xarray/issues/4290. This discussion focused on the question “should bool(xr.Dataset({'a': False})) return False?”. I agree that it’s not clear when it should be false and picking a behavior which deviates from Mapping feels arbitrary and gross.

Proposed behavior

I’m proposing that the API be changed, so that bool(xr.Dataset({'a': False})) raise an error, similar to the implementation in pd.Series.

In this implementation in pandas, attempting to evaluate even a single-element series as a boolean raises an error:

In [14]: bool(pd.Series([False]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-b0ad7f4d9277> in <module>
----> 1 bool(pd.Series([False]))

~/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
   1532     @final
   1533     def __nonzero__(self):
-> 1534         raise ValueError(
   1535             f"The truth value of a {type(self).__name__} is ambiguous. "
   1536             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I understand hesitancy around changing the core API. That said, if anyone can find an important, correct use of bool(ds) in the wild I’ll eat my hat 😃

Implementation

This could be as simple as raising an error on ds.__bool__, something like:

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
    ...
    def __bool__(self) -> bool:
        raise ValueError(
            "The truth value of a Dataset is ambiguous. Reduce the data "
            "variables to a scalar value with any(ds.values()) or "
            "all(ds.values())."
        )

The only other change that would be needed is an assertion that directly calls bool(ds) in test_dataset::TestDataset.test_properties, which checks for the exact behavior I’m changing:

        assert bool(ds)

This would need to be changed to:

        with pytest.raises(ValueError):
            bool(ds)

If this sounds good, I can submit a PR with these changes.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

$0.02 from the peanut gallery is that my mental model of Dataset is that it’s a dictionary on steroids, so following the Mapping protocol (including __bool__) makes sense to me. I put “I don’t know” for @shoyer’s poll, but if the question was “should” then I would have said “number of variables”.

While I’m not going to sit here and argue that if ds: is a common operation and important feature (I don’t often open files that result in empty datasets), I’d personally want a truly compelling argument for breaking from the Mapping protocol. Right now it’s “Dataset works like dict plus some stuff”. It would change to be “Dataset works like dict for all things except __bool__, plus some stuff”. The latter to me requires more mental effort since I’m no longer strictly supplementing past experience.

I made a Twitter poll, let’s see what that says 😄

https://twitter.com/xarray_dev/status/1478776987925684224?s=20

I realize this may be a larger discussion, but the implementation is so easy I went ahead and filed a PR that issues a PendingDeprecationWarning in Dataset.__bool__.

The original intention here was definitely to honor the Mapping protocol exactly, but I agree that bool() is hard to use correctly on Dataset objects. I would support deprecating this.

Yeah… I do understand how it’s currently working and why, and the behavior is certainly intuitive to those who appreciate the mapping inheritance.

That said, I feel I have to make a last stand argument because this trips people up quite often (on my team and elsewhere). I haven’t yet come across an example of anyone using this correctly, but I see users misusing it all the time.

The examples and behavior you’re showing @Illviljan seem to me like more the natural result of an implementation detail than a critical principle of the dataset design. While it’s obvious why bool(ds) resolves to True or False in the current implementation, I still think it’s not used and is out of step with how most users think of a dataset. The fact that nothing in the package requires a change in order to pass tests (the broken tests and the pathological case aside) reaffirms for me that this is not a useful feature.

I don’t know much about the mapping protocol or how closely it must be followed. Is the idea here that packages building on xarray (or interoperability features in e.g. numpy or dask) depend on a strict adherence to the full spec?