arrow-rs: Inconsistent Slicing of ArrayData for StructArray
Problem
ArrayData::Slice contains a special case for StructArray where it recurses the offset into its children. However, it preserves the offset on the parent ArrayData, in order for the validity buffer to work correctly.
The result is the offset has been pushed down to the child arrays, but is still present on the parent ArrayData. This makes it unclear if an offset should or should not be applied to the child_array.
Proposal
There are longer term suggestions around handling offsets in ArrayData differently, but until then I would like to propose:
- Remove the ArrayData::slice special-case
- Slice child data within
StructArray
when constructingboxed_fields
This is consistent with how we handle offsets elsewhere, with ArrayData a dumb container, and the Array
providing the logic to interpret the offset correctly.
Thoughts @nevi-me
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 18 (18 by maintainers)
Commits related to this issue
- Handle offset consistently for StructArray (#1750) — committed to tustvold/arrow-rs by tustvold 2 years ago
Difficult to say without a spec or comment describing the expected behavior. Someone might rely on the current behavior and acccess a StructArray using something like
a.column(0).downcast().value(a.offset() + i)
. I didn’t see any such usages in the arrow-rs codebase. Our codebase makes heavy use of slices, but we do not use StructArray. If we consider the current behavior a bug, then we should clearly hightlight the change in the release notes.Implement as this:
?
Hi @tustvold Here are some thoughts of mine:
offset
on parent ArrayData (the struct array) because of the validity buffer.I am afraid this cannot solve the problem. Because
After calling
array.slice(offset = 1, length = 3)
, we will get a struct array like thisWe still have 3
offset
s in parent array and child arrays. And it is not clear whether we have pushed down the new offset to children. 3. My suggestion (workaround)boxed_fields
. Because this will produce inconsistent offsetsarray.column(idx)
. This is also the way of getting a buffer of array now: Example: getting the offsets of binary array.Slice when getting the child array. (API change, because we own the
ArrayRef
now)offset
andlength
toBuffer
(Bitmap
).