arrow: [Python] write_dataset does not preserve non-nullable columns in schema
When writing a table whose schema has non-nullable columns using write_dataset, the non-nullable information is not preserved.
To reproduce:
import pyarrow as pa
import pyarrow.parquet as pq
import datetime as dt
import pyarrow.dataset as ds
table = pa.Table.from_arrays(
    [[1, 2, 3], [None, 5, None], [dt.date(2023, 1, 1), dt.date(2023, 1, 2), dt.date(2023, 1, 3)]],
    schema=pa.schema([
        pa.field("x", pa.int64(), nullable=False),
        pa.field("y", pa.int64(), nullable=True),
        pa.field("date", pa.date32(), nullable=True),
    ]))
print(table.schema)
# schema shows column 'x' as not nullable
pq.write_to_dataset(table, "parquet_test1", partitioning=['date'], partitioning_flavor='hive')
dataset = ds.dataset("parquet_test1", format="parquet", partitioning="hive")
print(dataset.to_table().schema)
# column 'x' has become nullable: the flag was lost
pa.dataset.write_dataset(table, "parquet_test2", partitioning=['date'], partitioning_flavor='hive', format='parquet')
dataset = ds.dataset("parquet_test2", format="parquet", partitioning="hive")
print(dataset.to_table().schema)
# column 'x' has become nullable here as well
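For comparison, a plain single-file round trip through pyarrow.parquet appears to preserve the flag, which suggests the loss is specific to the dataset write path (a minimal sketch; the file name is illustrative):
pq.write_table(table, "single_file.parquet")
print(pq.read_table("single_file.parquet").schema)
# column 'x' is still reported as not null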
Component(s)
Python
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 22 (17 by maintainers)
Commits related to this issue
- GH-35730: [C++] Add the ability to specify custom schema on a dataset write (#35860) ### Rationale for this change The dataset write node previously allowed you to specify custom key/value metadata ... — committed to apache/arrow by westonpace a year ago
- GH-35730: [C++] Add the ability to specify custom schema on a dataset write (#35860) The dataset write node previously allowed you to specify custom key/value metadata on a write node. This was adde... — committed to dgreiss/arrow by westonpace a year ago
So here is the change that introduced this: https://github.com/apache/arrow/issues/31452
Before the change we used to require the schema to be specified on the write node options. This was an unnecessary burden when you didn’t care about any custom field information (since we’ve already calculated the schema).
I think there is still the problem that we largely ignore nullability. We can’t usually assume that all batches will have the same nullability. For example, imagine a scan node where we are scanning two different parquet files. One of the parquet files marks a column as nullable and the other does not. I suppose the correct answer, if Acero were nullability-aware and once evolution is a little more robust, would be to “evolve” the schema of the file with a nullable type to a non-nullable type so that we have a common input schema.
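As a minimal sketch of that scenario (file names hypothetical; this only demonstrates the situation, not how Acero resolves it):
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
t1 = pa.table([pa.array([1, 2])], schema=pa.schema([pa.field("x", pa.int64(), nullable=False)]))
t2 = pa.table([pa.array([3, 4])], schema=pa.schema([pa.field("x", pa.int64(), nullable=True)]))
pq.write_table(t1, "part-0.parquet")  # 'x' stored as a required column
pq.write_table(t2, "part-1.parquet")  # 'x' stored as an optional column
# the dataset must report a single common schema for 'x' across both files,
# so one of the two nullability settings has to win
print(ds.dataset(["part-0.parquet", "part-1.parquet"]).schema)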
In the meantime, the simplest quick fix for this regression is to allow the user to specify an output schema instead of just key/value metadata, as sketched below.
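From Python that could look roughly like this (a sketch, assuming the schema argument of ds.write_dataset is forwarded to the write node; "parquet_test3" is just an illustrative path):
ds.write_dataset(
    table, "parquet_test3", format="parquet",
    partitioning=["date"], partitioning_flavor="hive",
    schema=table.schema,  # explicitly carries nullable=False for column 'x'
)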
ok, I’ve finally realised this is the issue, not the PR 😃
@raulcd
The error is a bit of a red herring. It is not building Arrow-C++; instead it is downloading Arrow-C++. If you compare a passing build (e.g. from the nightly tests) with these failing builds, the difference is the version being looked up. The nightly test looks for 12.0.0.9000, which, of course, doesn’t exist, so it falls back to building from source. This is what we want. The failing build you’ve shared is looking for 12.0.0 (shouldn’t this be 12.0.1?). It finds that version, and then it doesn’t build Arrow-C++ from source.