array-api: Data dependent/unknown shapes

Some libraries, particularly those with a graph-based computational model (e.g., Dask and TensorFlow), have support for “unknown” or “data dependent” shapes, e.g., due to boolean indexing such as x[y > 0] (https://github.com/data-apis/array-api/issues/84). Other libraries (e.g., JAX and Dask in some cases) do not support some operations because they would produce such data dependent shapes.
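For example, with NumPy the shape of the result depends on the runtime values of the mask:

>>> import numpy as np
>>> y = np.array([-1.0, 2.0, 3.0])
>>> np.ones(3)[y > 0].shape   # two of the three mask entries are True
(2,)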

We should consider a standard way to represent these shapes in shape attributes, ideally some extension of the “tuple of integers” format used for fully known shapes. For example, TensorFlow and Dask currently use different representations:

  • TensorFlow uses a custom TensorShape object (which acts very similarly to a tuple), where some values may be None
  • Dask uses tuples, where some values may be nan instead of integers (see the sketch after this list)
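A minimal sketch of what these two representations look like in practice (assuming recent TensorFlow and Dask versions; outputs indicative):

>>> import tensorflow as tf
>>> tf.TensorShape([None, 3])        # rank known, first dimension unknown
TensorShape([None, 3])

>>> import dask.array as da
>>> x = da.ones((4, 3), chunks=2)
>>> x[x > 0].shape                   # boolean indexing: data dependent shape
(nan,)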

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (14 by maintainers)

Most upvoted comments

To summarize the above discussion and the discussions during the consortium meetings (e.g., 7 October 2021), the path forward seems to be as follows:

  • if rank is unknown, ndim should be None and shape should be None.
  • if rank is known but dimensions are unknown, ndim should be an int and shape should be a tuple where known shape dimensions should be ints and unknown shape dimensions should be None.
  • if rank is known and dimensions are known, ndim should be an int and shape should be a tuple whose dimensions should be ints.
  • in most cases, shape should be a tuple. For those use cases where a custom object is needed, the custom object must act like a tuple.
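To make the three cases above concrete, here is a minimal sketch of how a consumer of the standard could branch on them (describe_shape is a hypothetical helper, not part of the standard):

def describe_shape(x):
    # Case 1: rank unknown, so ndim is None and shape is None.
    if x.ndim is None:
        return "unknown rank"
    # Case 2: rank known, some dimensions unknown; shape is a tuple
    # containing None for the unknown dimensions.
    if any(dim is None for dim in x.shape):
        return f"rank {x.ndim}, partially known shape {x.shape}"
    # Case 3: rank and all dimensions known; shape is a tuple of ints.
    return f"fully known shape {x.shape}"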

We can consider adding a functional shape API that supports returning the (dynamic) shape of an array as an array. This would be similar to TensorFlow’s tf.shape and MXNet’s shape_array APIs. This API would allow returning the shape of an array as the result of delayed computation. We can push this decision to the 2022 revision of the specification.
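For reference, this is roughly what the dynamic-shape-as-array pattern looks like with TensorFlow’s tf.shape today (a sketch; output indicative):

>>> import tensorflow as tf
>>> @tf.function(input_signature=[tf.TensorSpec(shape=[None, 3])])
... def n_rows(x):
...     # x.shape is (None, 3) at tracing time; tf.shape(x) is itself a
...     # tensor whose value is only known when the computation runs.
...     return tf.shape(x)[0]
>>> n_rows(tf.ones((5, 3)))
<tf.Tensor: shape=(), dtype=int32, numpy=5>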

The above would not satisfy @jakirkham’s desire for behavior which poisons subsequent operations; however, as discussed here and elsewhere, NaN does not seem like the right abstraction (e.g., value equality, introducing floating-point semantics, etc). While shape arithmetic is a valid use case, we should consider ways to ensure that this is not a userland concern. For implementations, shape arithmetic can be necessary for allocation (e.g., concat, tiling, etc), but it would seem doable to mimic NaN poison behavior with None and straightforward helper functions.
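A minimal sketch of the kind of helper meant here, propagating None through shape arithmetic instead of relying on NaN (prod_dims is a hypothetical name):

def prod_dims(dims):
    # Any unknown (None) dimension poisons the result, mirroring NaN
    # propagation without introducing floating-point semantics.
    if any(dim is None for dim in dims):
        return None
    result = 1
    for dim in dims:
        result *= dim
    return result

prod_dims((3, 4))     # 12
prod_dims((3, None))  # None: the unknown dimension propagates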

A custom object as suggested by @shoyer would be doable; however, this abstraction does not currently exist in, nor is it used by, array libraries (although it does exist in dataframe libraries, such as pandas), and it would be specific to each implementing array library. The advantage of None and NaN is that they exist independently of any one array library.

Accordingly, I think the lowest common denominator is requiring that shape return a tuple (or something tuple-like) and using None for unknown dimensions. While not perfect, this seems to me the most straightforward path atm.

Alternatively, users could check whether ndarray.ndim is None, leaving ndarray.shape undefined for unknown-rank cases.

~I don’t think we included anything in the standard that could cause this.~ ~Indexing may make the size of one or more dimensions unknown, but ndim should always be known because for data-dependent operations there’s no squeezing:~

>>> import numpy as np
>>> x = np.ones((3, 2))
>>> x[:0, np.zeros(2, dtype=bool)]
array([], shape=(0, 0), dtype=float64)

After writing this: we did include squeeze(), so a combination of indexing and explicit squeezing could indeed cause this.

Leaving any property (like .shape) undefined seems unhealthy. I’d say in this case, use ndim = None and shape = None. If dimensionality is known but the exact shape isn’t, use None for the unknown dimensions (e.g., shape = (3, None, 100)).

Maybe we should say “shape is a tuple” and add a note that, if for backwards-compat reasons an implementation uses a custom object, it should make sure the object is a subtype of tuple so that user code annotating .shape with Tuple still works.
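A minimal sketch of such a backwards-compat custom object (CustomShape is a hypothetical name):

from typing import Optional, Tuple

class CustomShape(tuple):
    # Subclassing tuple keeps isinstance checks and Tuple[...] annotations
    # working for user code written against the standard.
    pass

def leading_dim(shape: Tuple[Optional[int], ...]) -> Optional[int]:
    return shape[0]

leading_dim(CustomShape((None, 3)))        # None: unknown leading dimension
isinstance(CustomShape((None, 3)), tuple)  # True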

I think the custom object is only useful when you want to deal with staged computation like @shoyer mentions above. TF2 stopped using the wrapped Dimension type and just has its TensorShape be a tuple of either integers or None to indicate shapes which we do not know at tracing time. So I think saying shapes are tuples here is probably a good idea.

IIRC, it’s because using Python objects (ints, tuples, etc.) rather than custom ones can force extra synchronizations. PyTorch also uses a custom object:

>>> import torch
>>> t = torch.tensor([[1, 2], [3, 4]])
>>> t.shape
torch.Size([2, 2])
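Note that torch.Size is a subclass of tuple, so it already satisfies the “tuple-like” requirement discussed above:

>>> isinstance(t.shape, tuple)
True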