array-api: Data dependent/unknown shapes
Some libraries, particularly those with a graph-based computational model (e.g., Dask and TensorFlow), have support for "unknown" or "data-dependent" shapes, e.g., due to boolean indexing such as `x[y > 0]` (https://github.com/data-apis/array-api/issues/84). Other libraries (e.g., JAX and Dask in some cases) do not support some operations because they would produce such data-dependent shapes.
We should consider a standard way to represent these shapes in shape attributes, ideally some extension of the “tuple of integer” format used for fully known shapes. For example, TensorFlow and Dask currently use different representations:
- TensorFlow uses a custom `TensorShape` object (which acts very similarly to `tuple`), where some values may be `None`
- Dask uses tuples, where some values may be `nan` instead of integers (see the sketch below)
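For instance, a minimal sketch of how boolean indexing produces an unknown dimension in Dask (assuming Dask is installed):

```python
import dask.array as da

x = da.arange(10, chunks=5)
masked = x[x > 4]        # boolean indexing: output length depends on the data
print(masked.shape)      # (nan,) -- Dask marks the unknown dimension with nan
print(masked.compute())  # [5 6 7 8 9]
```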
About this issue
- State: closed
- Created 4 years ago
- Comments: 18 (14 by maintainers)
Commits related to this issue
- Make axis keyword to squeeze() positional As suggested by @shoyer in https://github.com/data-apis/array-api/issues/97#issuecomment-747116655 This makes it possible to predict resulting rank of output... — committed to data-apis/array-api by rgommers 4 years ago
- Make axis keyword to squeeze() positional As suggested by @shoyer in https://github.com/data-apis/array-api/issues/97#issuecomment-747116655 This makes it possible to predict resulting rank of output... — committed to rgommers/array-api by rgommers 4 years ago
- Make axis keyword to squeeze() positional (#100) As suggested by @shoyer in https://github.com/data-apis/array-api/issues/97#issuecomment-747116655 This makes it possible to predict resulting rank o... — committed to data-apis/array-api by rgommers 3 years ago
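The motivation for these commits can be seen with a small NumPy illustration: with an explicit `axis`, the output rank of `squeeze()` is predictable from the input rank alone, while squeezing all size-1 dimensions makes the rank depend on the shape values themselves:

```python
import numpy as np

x = np.ones((3, 1, 5))
print(np.squeeze(x, axis=1).ndim)  # 2: always x.ndim - 1, data-independent
print(np.squeeze(x).ndim)          # 2 here, but depends on which dims equal 1
```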
To summarize the above discussion and the discussions during the consortium meetings (e.g., 7 October 2021), the path forward seems to be as follows:
- If an array's dimensionality is itself unknown, `ndim` should be `None` and `shape` should be `None`.
- If an array's dimensionality is known but its shape is only partially known, `ndim` should be an `int` and `shape` should be a `tuple` where known shape dimensions should be `int`s and unknown shape dimensions should be `None`.
- If an array's shape is fully known, `ndim` should be an `int` and `shape` should be a `tuple` whose dimensions should be `int`s.
- `shape` should be a `tuple`. For those use cases where a custom object is needed, the custom object must act like a `tuple`.
- We can consider adding a functional `shape` API that supports returning the (dynamic) shape of an array as an array. This would be similar to TensorFlow's `tf.shape` and MXNet's `shape_array` APIs. This API would allow returning the shape of an array as the result of a delayed computation. We can push this decision to the 2022 revision of the specification.

The above would not satisfy @jakirkham's desire for behavior which poisons subsequent operations; however, as discussed here and elsewhere, `NaN` does not seem like the right abstraction (e.g., value equality, introducing floating-point semantics, etc.). While shape arithmetic is a valid use case, we should consider ways to ensure that it is not a userland concern. For implementations, shape arithmetic can be necessary for allocation (e.g., `concat`, tiling, etc.), but it would seem doable to mimic `NaN`'s poisoning behavior with `None` and straightforward helper functions (see the sketch below).

A custom object as suggested by @shoyer would be doable; however, this is an abstraction which does not currently exist in, nor is used by, array libraries (although it does exist in dataframe libraries, such as pandas) and would be specific to each implementing array library. The advantage of `None` and `NaN` is that they exist independently of any one array library.

Accordingly, I think the lowest common denominator is requiring that `shape` return a `tuple` (or something tuple-like), using `None` for unknown dimensions. While not perfect, this seems to me the most straightforward path at the moment.
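As a rough illustration of those helper functions, here is a minimal sketch of `None`-based shape arithmetic that poisons results the way `NaN` would (the `Dim` alias and `add_dim` helper are hypothetical, not part of the standard):

```python
from typing import Optional, Tuple

Dim = Optional[int]  # None marks an unknown (data-dependent) dimension

def add_dim(a: Dim, b: Dim) -> Dim:
    # An unknown operand "poisons" the result, mimicking NaN arithmetic
    # without introducing floating-point semantics.
    return None if a is None or b is None else a + b

# Example: the output shape of concat along axis 0, where the concatenated
# axis adds and the remaining axes carry over.
s1: Tuple[Dim, ...] = (3, None, 100)
s2: Tuple[Dim, ...] = (5, None, 100)
print((add_dim(s1[0], s2[0]),) + s1[1:])  # (8, None, 100)
```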
~I don't think we included anything in the standard that could cause this.~ ~Indexing may make the size of one or more dimensions unknown, but `ndim` should always be known, because for data-dependent operations there's no squeezing.~ After writing this: we did include `squeeze()`, so a combination of indexing and explicit squeezing could indeed cause this.

Leaving any property (like `.shape`) undefined seems unhealthy. I'd say in this case, use `ndim = None` and `shape = None`. If dimensionality is known but the exact shape isn't, use `None` for the unknown dimensions (e.g., `shape = (3, None, 100)`).
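A hypothetical sketch of those two conventions on a lazy array object (`LazyArray` is illustrative only, not part of any library or the standard):

```python
from typing import Optional, Tuple

class LazyArray:
    """Illustrative stand-in for an array whose shape may be unknown."""
    def __init__(self, ndim: Optional[int],
                 shape: Optional[Tuple[Optional[int], ...]]):
        self.ndim = ndim    # None if even the dimensionality is unknown
        self.shape = shape  # None, or a tuple mixing ints and Nones

fully_unknown = LazyArray(ndim=None, shape=None)           # rank unknown
partially_known = LazyArray(ndim=3, shape=(3, None, 100))  # rank known
```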
Maybe we should say "shape is a tuple" and add a note that, if for backwards-compatibility reasons an implementation is using a custom object, then it should make sure that it is a subtype of `tuple`, so that it works when users annotate code using `.shape` with `Tuple`.

I think the custom object is only useful when you want to deal with staged computation like @shoyer mentions above. TF2 stopped using the wrapped `Dimension` type and just has its `TensorShape` be a tuple of either integers or `None` to indicate shapes which we do not know at tracing time. So I think saying shapes are tuples here is probably a good idea.
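For reference, a quick check of that tracing-time behavior (a minimal sketch, assuming TensorFlow 2.x; the function name `double` is illustrative):

```python
import tensorflow as tf

# Trace with an unknown (None) batch dimension in the input signature.
@tf.function(input_signature=[tf.TensorSpec(shape=(None, 100), dtype=tf.float32)])
def double(x):
    print(x.shape)            # (None, 100) -- rank known, batch size unknown
    print(x.shape.as_list())  # [None, 100] -- plain ints and Nones
    return x * 2

double(tf.ones((8, 100)))  # traces once, printing the symbolic shape
```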
IIRC because using Python objects (ints, tuples, etc.) rather than custom ones can force extra synchronizations. PyTorch also uses a custom object:
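PyTorch's custom object here is `torch.Size`, which subclasses `tuple` and so already satisfies the "acts like a `tuple`" requirement discussed above (a quick check, assuming PyTorch is installed):

```python
import torch

s = torch.ones(2, 3).shape
print(s)                     # torch.Size([2, 3])
print(isinstance(s, tuple))  # True -- torch.Size is a tuple subclass
```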