hudi: [SUPPORT] Why is schema evolution performed when hoodie.schema.on.read.enable is not set?
We have a Glue streaming job that writes to a Hudi table, and we are testing schema evolution. When we add a new column to any record, the write succeeds and the new column shows up when querying the table. However, we expected the schema not to evolve, because we did not set the config hoodie.schema.on.read.enable, and as we understand it this config defaults to false. Per the Hudi docs:
“Enables support for Schema Evolution feature Default Value: false (Optional) Config Param: SCHEMA_EVOLUTION_ENABLE”
So, since we did not define it in our config, it should not allow schema evolution and the addition of new columns, right? We even tried explicitly setting it to false in our connection options, but when we add a new column it is still shown in our table.
To Reproduce
Steps to reproduce the behavior:
- run the glue streaming job
- add a record with a new col/attribute (attribute in the case of DynamoDB)
- query the hudi table
Expected behavior
The added cols/attributes should not be shown, since we disabled schema evolution, and they should not exist in the schema of the table in the data lake either.
Environment Description
- Hudi version: 0.12
- Spark version: 3
- Storage (HDFS/S3/GCS…): S3
- Running on Docker? (yes/no): no
- Glue version: 4
Additional context
our connection options are:
hudiWriteConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.table.name': hudi_table_name,
    'hoodie.datasource.write.table.name': hudi_table_name,
    'hoodie.datasource.write.precombine.field': 'timestamp',
    'hoodie.datasource.write.recordkey.field': 'user_id',
    'hoodie.datasource.write.operation': 'upsert',
    # 'hoodie.compact.schedule.inline': 'true',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.compact.inline': 'true',
    'hoodie.compact.inline.max.delta.commits': '3',
    'hoodie.schema.on.read.enable': 'false',
    'hoodie.deltastreamer.schemaprovider.source.schema.file': 's3://hudi-test-table/menna/src.acsv',
    'hoodie.deltastreamer.schemaprovider.target.schema.file': 's3://hudi-test-table/menna/target.acsv',
    # 'hoodie.datasource.write.partitionpath.field': 'year:SIMPLE,month:SIMPLE,day:SIMPLE',
    # 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    # 'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
    # 'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-mm-dd',
    # 'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd'
}
hudiGlueConfig = {
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.sync_as_datasource': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': hudi_table_name,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.hive_style_partitioning': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    # 'hoodie.datasource.hive_sync.partition_fields': 'year,month,day'
}
commonConfig = {
    'path': s3_path_hudi_table
}
combinedConf = {
    **commonConfig,
    **hudiWriteConfig,
    **hudiGlueConfig
}
in glue streaming job we use:
glueContext.forEachBatch(
    frame=data_frame_DataSource0,
    batch_function=processBatch,
    options={
        "windowSize": window_size,
        "checkpointLocation": s3_path_spark_checkpoints
    }
)
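(For context: processBatch itself is not shown in the issue; only its name appears in the forEachBatch call above. A minimal sketch of what such a batch function typically looks like — the empty-batch guard and the body are assumptions, wired to the Hudi write quoted further below:)
def processBatch(data_frame, batch_id):
    # Skip empty micro-batches so we don't trigger needless Hudi commits.
    if data_frame.count() > 0:
        # Write the micro-batch to the Hudi table using the merged options.
        data_frame.write.format("hudi").options(**combinedConf).mode("append").save()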
The source data frame is created as:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)
and the way we write our hudi table is:
kinesis_data_frame.write.format("hudi").options(**combinedConf).mode("append").save()
sometimes we write it as follows but it gives the same behaviour:
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(kinesis_data_frame, glueContext, "evolved_kinesis_data_frame"),
    connection_type="custom.spark",
    connection_options=combinedConf
)
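(As a diagnostic, not part of the original report: one way to confirm which schema Hudi actually committed is to read the table back and print its schema; spark here is assumed to be the Glue job's Spark session.)
spark = glueContext.spark_session
# Load the Hudi table from S3 and print the committed schema; if a newly
# added column appears here, the write path evolved the table schema.
spark.read.format("hudi").load(s3_path_hudi_table).printSchema()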
About this issue
- State: open
- Created a year ago
- Comments: 16 (10 by maintainers)
I guess we need a clear doc to elaborate on the schema evolution details for 0.13.0.
+1 on @kazdy's notes above on ASR. Hudi has always supported some automatic schema evolution to deal with streaming data, similar to what the Kafka/Schema Registry model achieves. The reason is that users found it inconvenient to coordinate pausing pipelines and doing manual maintenance/backfills when, say, new columns were added. What we call full schema evolution/schema-on-read is orthogonal; it just allows more backwards-incompatible evolutions to go through as well.
Now on 0.13, I think the reconcile flag simply allows skipping some columns in the incoming write (partial-write scenarios), and Hudi reconciles this with the table schema while still respecting automatic schema evolution. I think this is what 0.13 changes. https://hudi.apache.org/releases/release-0.13.0#schema-handling-in-write-path
Note to @nfarah86 and @nsivabalan to cover this in the schema docs page that is being worked on now.
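(For illustration, the reconcile flag referred to above is hoodie.datasource.write.reconcile.schema; a sketch of enabling it on the datasource write path, where partial_batch_df is a hypothetical batch that is missing some of the table's existing columns:)
reconcileConf = {
    **combinedConf,
    # With reconcile enabled, a batch that omits some existing table columns
    # is reconciled against the table schema rather than narrowing it, while
    # compatible automatic evolution (e.g. new columns) still applies.
    'hoodie.datasource.write.reconcile.schema': 'true',
}
partial_batch_df.write.format("hudi").options(**reconcileConf).mode("append").save()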
Hi,
Hudi has an optional “schema on read” mode and “out of the box schema evolution”, which is the default. With out-of-the-box schema evolution, adding new columns is supported by default; see this link. Currently you cannot disable schema evolution entirely, as far as I know.
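(Given that, if the goal is to keep unexpected attributes out of the table, one workaround is to project each batch onto a fixed column list before writing, so new columns never reach Hudi. A sketch under the assumption that the expected schema is known up front; EXPECTED_COLUMNS, projectBatch, and the 'payload' column are names invented here for illustration:)
# Hypothetical fixed contract for the table; anything outside it is dropped.
EXPECTED_COLUMNS = ['user_id', 'timestamp', 'payload']

def projectBatch(data_frame):
    # Keep only the expected columns that are present in this batch,
    # silently dropping any newly appearing attributes from the stream.
    present = [c for c in EXPECTED_COLUMNS if c in data_frame.columns]
    return data_frame.select(*present)

projectBatch(kinesis_data_frame).write.format("hudi").options(**combinedConf).mode("append").save()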