arrow-rs: Inconsistent Slicing of ArrayData for StructArray

Problem

ArrayData::Slice contains a special case for StructArray where it recurses the offset into its children. However, it preserves the offset on the parent ArrayData, in order for the validity buffer to work correctly.

The result is the offset has been pushed down to the child arrays, but is still present on the parent ArrayData. This makes it unclear if an offset should or should not be applied to the child_array.

Proposal

There are longer term suggestions around handling offsets in ArrayData differently, but until then I would like to propose:

  • Remove the ArrayData::slice special-case
  • Slice child data within StructArray when constructing boxed_fields

This is consistent with how we handle offsets elsewhere, with ArrayData a dumb container, and the Array providing the logic to interpret the offset correctly.

Thoughts @nevi-me

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 18 (18 by maintainers)

Commits related to this issue

Most upvoted comments

which previously did not take the struct offset into account.

Is this not a bug?

Difficult to say without a spec or comment describing the expected behavior. Someone might rely on the current behavior and acccess a StructArray using something like a.column(0).downcast().value(a.offset() + i). I didn’t see any such usages in the arrow-rs codebase. Our codebase makes heavy use of slices, but we do not use StructArray. If we consider the current behavior a bug, then we should clearly hightlight the change in the release notes.

So in StructArray::from it would inspect the ArrayData, compute slices, and store those as members.

Implement as this:

    fn from(data: ArrayData) -> Self {
        let boxed_fields = data
            .child_data()
            .iter()
            .map(|cd| make_array(cd.slice(data.offset(), data.len())))
            .collect();

        Self { data, boxed_fields }
    }

?

Hi @tustvold Here are some thoughts of mine:

  1. (Agree with you) We have to preserve offset on parent ArrayData (the struct array) because of the validity buffer.
  2. Slice child data within StructArray when constructing boxed_fields If what you mean is to slice child arrays in StructArray::from:

    fn from(data: ArrayData) -> Self {
        let boxed_fields = data
            .child_data()
            .iter()
            .map(|cd| make_array(cd.slice(data.offset(), data.len())))
            .collect();

        Self { data, boxed_fields }
    }

I am afraid this cannot solve the problem. Because

  • Inconsistent offsets still exist. For example, given a struct array:
StructArray (offset = 0, length = 5, field0 = i32, field1 = bool, validity_buffer = ...)
-- Int32Array (offset = 1, length = 5)
-- BooleanArray (offset = 2, length = 5)

After calling array.slice(offset = 1, length = 3), we will get a struct array like this

StructArray (offset = 1, length = 3, field0 = i32, field1 = bool, validity_buffer = ...)
-- Int32Array (offset = 2, length = 3)
-- BooleanArray (offset = 3, length = 3)

We still have 3 offsets in parent array and child arrays. And it is not clear whether we have pushed down the new offset to children. 3. My suggestion (workaround)

  • Do not push down the offset to child arrays when slicing (aka, just update the offset and null_count of parent array)
  • Do not slicing child arrays when building boxed_fields. Because this will produce inconsistent offsets
  • Slice child arrays when you call array.column(idx). This is also the way of getting a buffer of array now: Example: getting the offsets of binary array.
    pub fn value_offsets(&self) -> &[OffsetSize] {
        unsafe {
            std::slice::from_raw_parts(
                self.value_offsets.as_ptr().add(self.data.offset()),
                self.len() + 1,
            )
        }
    }

Slice when getting the child array. (API change, because we own the ArrayRef now)

    pub fn column(&self, pos: usize) -> ArrayRef {
        self.boxed_fields[pos].slice(self.offset, self.length)
    }
  • Long term suggestion: push down offset and length to Buffer(Bitmap).