arrow: [Python] write_dataset does not preserve non-nullable columns in schema

When writing a table whose schema has non-nullable columns using write_dataset, the non-nullable information is not preserved.

To reproduce:

import pyarrow as pa
import pyarrow.parquet as pq
import datetime as dt
import pyarrow.dataset as ds

table = pa.Table.from_arrays(
    [[1, 2, 3], [None, 5, None],
     [dt.date(2023, 1, 1), dt.date(2023, 1, 2), dt.date(2023, 1, 3)]],
    schema=pa.schema([
        pa.field("x", pa.int64(), nullable=False),
        pa.field("y", pa.int64(), nullable=True),
        pa.field("date", pa.date32(), nullable=True),
    ]))
print(table.schema)
# schema shows column 'x' as non-nullable

pq.write_to_dataset(table, "parquet_test1", partitioning=['date'], partitioning_flavor='hive')
dataset = ds.dataset("parquet_test1", format="parquet", partitioning="hive")
dataset.to_table().schema
# column 'x' is nullable

ds.write_dataset(table, "parquet_test2", partitioning=['date'], partitioning_flavor='hive', format='parquet')
dataset = ds.dataset("parquet_test2", format="parquet", partitioning="hive")
dataset.to_table().schema
# column 'x' is nullable
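
For contrast (an illustrative check on my part, not part of the original report), writing a single Parquet file directly appears to round-trip the flag, since Parquet stores each column as required or optional:

# Sketch, assuming pq.write_table / pq.read_table behave as documented:
# 'x' is written as a required Parquet column, so the non-nullable flag
# should survive the round trip.
pq.write_table(table, "single_file.parquet")
print(pq.read_table("single_file.parquet").schema)
# expected: column 'x' is still non-nullable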

Component(s)

Python

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (17 by maintainers)

Most upvoted comments

So here is the change that introduced this: https://github.com/apache/arrow/issues/31452

Before the change, we used to require the schema to be specified on the write node options. This was an unnecessary burden when you didn’t care about any custom field information (since we’ve already calculated the schema).

But for what we need to do about this: shouldn’t the ProjectNode just try to preserve this information for trivial field ref expressions?

I think there is still the problem that we largely ignore nullability. We can’t usually assume that all batches will have the same nullability. For example, imagine a scan node where we are scanning two different Parquet files: one file marks a column as nullable and the other does not. I suppose the correct answer, if Acero were nullability-aware and once schema evolution is a little more robust, would be to “evolve” the schema of the file with a nullable type to a non-nullable type so that we have a common input schema.
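
(For illustration, a minimal sketch of how merging currently widens nullability, using pa.unify_schemas; whether Acero would apply the same merge rule is my assumption.)

import pyarrow as pa

# Two file schemas that differ only in the nullability of "x"
s1 = pa.schema([pa.field("x", pa.int64(), nullable=False)])
s2 = pa.schema([pa.field("x", pa.int64(), nullable=True)])

# Merging widens to the nullable field, so the non-nullable
# information is the part that gets dropped
merged = pa.unify_schemas([s1, s2])
print(merged.field("x").nullable)  # True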

In the meantime, the quickest simple fix for this regression is to allow the user to specify an output schema instead of just key/value metadata.
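
(A possible user-side workaround until then, sketched under the assumption that an explicit schema passed to ds.dataset() is carried through to to_table(); I have not verified this restores the flag in every case.)

# Hypothetical workaround: hand the expected schema to ds.dataset()
# instead of relying on the schema inferred from the files.
expected = pa.schema([
    pa.field("x", pa.int64(), nullable=False),
    pa.field("y", pa.int64(), nullable=True),
    pa.field("date", pa.date32(), nullable=True),
])
dataset = ds.dataset("parquet_test2", format="parquet",
                     partitioning="hive", schema=expected)
print(dataset.to_table().schema)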

ok, I’ve finally realised this is the issue, not the PR 😃

@raulcd

The error is a bit of a red herring. It is not building Arrow-C++; instead, it is downloading a prebuilt Arrow-C++ binary. If you look at a passing build (e.g. from the nightly tests) you can see:

2023-05-30T01:07:19.3429074Z * installing *source* package ‘arrow’ ...
2023-05-30T01:07:19.3429654Z ** using staged installation
2023-05-30T01:07:19.3429994Z *** Found libcurl and OpenSSL >= 1.1
2023-05-30T01:07:19.3430691Z trying URL 'https://nightlies.apache.org/arrow/r/libarrow/bin/linux-openssl-1.1/arrow-12.0.0.9000.zip'
2023-05-30T01:07:19.3431226Z Error in download.file(from_url, to_file, quiet = hush) : 
2023-05-30T01:07:19.3431942Z   cannot open URL 'https://nightlies.apache.org/arrow/r/libarrow/bin/linux-openssl-1.1/arrow-12.0.0.9000.zip'
2023-05-30T01:07:19.3432612Z *** Downloading libarrow binary failed for version 12.0.0.9000 (linux-openssl-1.1)
2023-05-30T01:07:19.3433276Z     at https://nightlies.apache.org/arrow/r/libarrow/bin/linux-openssl-1.1/arrow-12.0.0.9000.zip
2023-05-30T01:07:19.3433789Z *** Found local C++ source: '/arrow/cpp'
2023-05-30T01:07:19.3434126Z *** Building libarrow from source
2023-05-30T01:07:19.3434552Z     For build options and troubleshooting, see the install guide:
2023-05-30T01:07:19.3435014Z     https://arrow.apache.org/docs/r/articles/install.html

On the other hand, if you look at these failing builds, you see:

2023-06-01T22:45:52.2820480Z * installing *source* package ‘arrow’ ...
2023-06-01T22:45:52.2820835Z ** using staged installation
2023-06-01T22:45:52.2826960Z **** pkg-config not installed, setting ARROW_DEPENDENCY_SOURCE=BUNDLED
2023-06-01T22:45:52.2827523Z *** Found libcurl and OpenSSL >= 1.1
2023-06-01T22:45:52.2830096Z trying URL 'https://apache.jfrog.io/artifactory/arrow/r/12.0.0/libarrow/bin/linux-openssl-1.1/arrow-12.0.0.zip'
2023-06-01T22:45:52.2830790Z Content type 'application/zip' length 40016664 bytes (38.2 MB)
2023-06-01T22:45:52.2831184Z ==================================================
2023-06-01T22:45:52.2835569Z downloaded 38.2 MB
2023-06-01T22:45:52.2835774Z 
2023-06-01T22:45:52.2836129Z *** Successfully retrieved C++ binaries (linux-openssl-1.1)

So the nightly test looks for 12.0.0.9000, which, of course, doesn’t exist. Then it falls back to building from source. This is what we want.

The test build you’ve shared is looking for 12.0.0 (shouldn’t this be 12.0.1?). It finds the binary, and so it doesn’t build Arrow-C++ from source.