iceberg: Cannot write nullable values to non-null column
In my Spark jobs I am reading JSON data and merging it into Iceberg tables, where I would like to have NOT NULL constraints. However, when loading data from JSON, Spark does not enforce the schema's nullability constraints, so the MERGE fails with the error in the title. To work around this I have found two alternatives:
from pyspark.sql.functions import col

# Round-trip through an RDD so the declared schema (and its nullability) is re-applied
input_dataset = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/").filter(col("my_key").isNotNull())
input_dataset = spark.createDataFrame(input_dataset.rdd, schema=my_schema).createOrReplaceTempView("source")
spark.sql("MERGE INTO my_table ...
or
--conf spark.sql.storeAssignmentPolicy=LEGACY
input_dataset = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/").filter(col("my_key").isNotNull()).createOrReplaceTempView("source")
spark.sql("MERGE INTO my_table ...
The first option is very slow, adding 40–60 minutes to the processing time of my Spark application. The second option seems too permissive. I have noticed that there is a configuration option named spark.sql.iceberg.check-nullability in the code. I would like to propose that this option be included in the AssignmentAlignmentTrait to allow writers to bypass NULL constraints while preserving the other compatibility checks.
Thanks for your consideration.
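If the proposal were accepted, the existing option could presumably be supplied the same way as the policy above (the key name is taken from the text above; assuming it accepts a boolean value):

--conf spark.sql.iceberg.check-nullability=false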
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (7 by maintainers)
From the comments: with coalesce, the nullable field can be changed to not null, e.g. coalesce(field, not_null).
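A rough PySpark sketch of that suggestion (the column my_key and the default value "missing" are placeholders; coalesce with a non-null literal is inferred by Spark as NOT NULL, so the schema no longer marks the field nullable):

from pyspark.sql import functions as F

# Replace NULLs with a non-null default instead of filtering them out;
# the coalesced column is inferred as nullable = false
src = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/")
src = src.withColumn("my_key", F.coalesce(F.col("my_key"), F.lit("missing")))
src.printSchema()  # my_key: ... (nullable = false)
src.createOrReplaceTempView("source")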