iceberg: Cannot write nullable values to non-null column
In my Spark jobs I am reading JSON data and merging it into Iceberg tables, where I would like to have NOT NULL constraints. However, when loading data from JSON, Spark does not enforce the schema's nullability constraints, so the MERGE fails with the error in the title. To work around this I have found two alternatives:
from pyspark.sql.functions import col

# Round-trip through an RDD so the declared schema (and its nullability) is re-applied
input_dataset = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/").filter(col("my_key").isNotNull())
input_dataset = spark.createDataFrame(input_dataset.rdd, schema=my_schema).createOrReplaceTempView("source")
spark.sql("MERGE INTO my_table ...
or
--conf spark.sql.storeAssignmentPolicy=LEGACY
input_dataset = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/").filter(col("my_key").isNotNull()).createOrReplaceTempView("source")
spark.sql("MERGE INTO my_table ...
The first option is very slow, adding 40–60 minutes to the processing time of my Spark application. The second option seems too permissive. I have noticed that there is a configuration option named spark.sql.iceberg.check-nullability in the code. I would like to propose that this option be included in the AssignmentAlignmentTrait to allow writers to bypass NULL constraints while preserving the other compatibility checks.
Thanks for your consideration.
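If the proposal were accepted, the existing option could presumably be supplied the same way as the policy above (the key name is taken from the text above; assuming it accepts a boolean value):

--conf spark.sql.iceberg.check-nullability=false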
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (7 by maintainers)
From the comments: with coalesce, the nullable field can be changed to not null, e.g. coalesce(field, not_null).
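A rough PySpark sketch of that suggestion (the column my_key and the default value "missing" are placeholders; coalesce with a non-null literal is inferred by Spark as NOT NULL, so the schema no longer marks the field nullable):

from pyspark.sql import functions as F

# Replace NULLs with a non-null default instead of filtering them out;
# the coalesced column is inferred as nullable = false
src = spark.read.schema(my_schema).json("s3://my_bucket/my_folder/")
src = src.withColumn("my_key", F.coalesce(F.col("my_key"), F.lit("missing")))
src.printSchema()  # my_key: ... (nullable = false)
src.createOrReplaceTempView("source")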