hudi: [SUPPORT] Why is schema evolution performed when hoodie.schema.on.read.enable is not set?
We have a Glue streaming job that writes to a Hudi table, and we are testing schema evolution. When we add a new column to any record, the write succeeds and the new column shows up when querying the table. However, we expected the schema not to evolve, because we did not set the config hoodie.schema.on.read.enable, and as we understand it this config defaults to false. Per the Hudi docs:
“Enables support for Schema Evolution feature Default Value: false (Optional) Config Param: SCHEMA_EVOLUTION_ENABLE”
So, since we did not define it in our config, it should not allow schema evolution and the addition of new columns, right? We even tried explicitly setting it to false in our connection options, but when we add a new column it is still shown in our table.
To Reproduce
Steps to reproduce the behavior:
- run the glue streaming job
- add a record with a new col/attribute (attribute in the case of DynamoDB)
- query the hudi table
Expected behavior
The added cols/attributes should not be shown, since we disabled schema evolution, and they should not exist in the schema of the table in the data lake either.
Environment Description
- Hudi version: 0.12
- Spark version: 3
- Storage (HDFS/S3/GCS…): S3
- Running on Docker? (yes/no): no
- Glue version: 4
Additional context
our connection options are:
hudiWriteConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.table.name': hudi_table_name,
    'hoodie.datasource.write.table.name': hudi_table_name,
    'hoodie.datasource.write.precombine.field': 'timestamp',
    'hoodie.datasource.write.recordkey.field': 'user_id',
    'hoodie.datasource.write.operation': 'upsert',
    # 'hoodie.compact.schedule.inline': 'true',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.compact.inline': 'true',
    'hoodie.compact.inline.max.delta.commits': '3',
    'hoodie.schema.on.read.enable': 'false',
    'hoodie.deltastreamer.schemaprovider.source.schema.file': 's3://hudi-test-table/menna/src.acsv',
    'hoodie.deltastreamer.schemaprovider.target.schema.file': 's3://hudi-test-table/menna/target.acsv',
    # 'hoodie.datasource.write.partitionpath.field': 'year:SIMPLE,month:SIMPLE,day:SIMPLE',
    # 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    # 'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
    # 'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-mm-dd',
    # 'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd'
}
hudiGlueConfig = {
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.sync_as_datasource': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': hudi_table_name,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.hive_style_partitioning': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    # 'hoodie.datasource.hive_sync.partition_fields': 'year,month,day'
}
commonConfig = {
    'path': s3_path_hudi_table
}
combinedConf = {
    **commonConfig,
    **hudiWriteConfig,
    **hudiGlueConfig
}
in glue streaming job we use:
glueContext.forEachBatch(
    frame=data_frame_DataSource0,
    batch_function=processBatch,
    options={
        "windowSize": window_size,
        "checkpointLocation": s3_path_spark_checkpoints
    }
)
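(For context: processBatch itself is not shown in the issue; only its name appears in the forEachBatch call above. A minimal sketch of what such a batch function typically looks like — the empty-batch guard and the body are assumptions, wired to the Hudi write quoted further below:)
def processBatch(data_frame, batch_id):
    # Skip empty micro-batches so we don't trigger needless Hudi commits.
    if data_frame.count() > 0:
        # Write the micro-batch to the Hudi table using the merged options.
        data_frame.write.format("hudi").options(**combinedConf).mode("append").save()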
The source data frame is created as:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)
and the way we write our hudi table is:
kinesis_data_frame.write.format("hudi").options(**combinedConf).mode("append").save()
sometimes we write it as follows but it gives the same behaviour:
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(kinesis_data_frame, glueContext, "evolved_kinesis_data_frame"),
    connection_type="custom.spark",
    connection_options=combinedConf
)
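(As a diagnostic, not part of the original report: one way to confirm which schema Hudi actually committed is to read the table back and print its schema; spark here is assumed to be the Glue job's Spark session.)
spark = glueContext.spark_session
# Load the Hudi table from S3 and print the committed schema; if a newly
# added column appears here, the write path evolved the table schema.
spark.read.format("hudi").load(s3_path_hudi_table).printSchema()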
About this issue
- State: open
- Created a year ago
- Comments: 16 (10 by maintainers)
I guess we need a clear doc to elaborate on the schema evolution details for 0.13.0.
+1 on @kazdy's notes above on ASR. Hudi has always supported some automatic schema evolution to deal with streaming data, similar to what the Kafka/Schema Registry model achieves. The reason is that users found it inconvenient to coordinate pausing pipelines and doing manual maintenance/backfills when, say, new columns were added. What we call full schema evolution/schema-on-read is orthogonal; it just allows more backwards-incompatible evolutions to go through as well.
Now on 0.13, I think the reconcile flag simply allows skipping some columns in the incoming write (partial-write scenarios), and Hudi reconciles this with the table schema while still respecting automatic schema evolution. I think this is what 0.13 changes. https://hudi.apache.org/releases/release-0.13.0#schema-handling-in-write-path
Note to @nfarah86 and @nsivabalan to cover this in the schema docs page that is being worked on now.
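(For illustration, the reconcile flag referred to above is hoodie.datasource.write.reconcile.schema; a sketch of enabling it on the datasource write path, where partial_batch_df is a hypothetical batch that is missing some of the table's existing columns:)
reconcileConf = {
    **combinedConf,
    # With reconcile enabled, a batch that omits some existing table columns
    # is reconciled against the table schema rather than narrowing it, while
    # compatible automatic evolution (e.g. new columns) still applies.
    'hoodie.datasource.write.reconcile.schema': 'true',
}
partial_batch_df.write.format("hudi").options(**reconcileConf).mode("append").save()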
Hi,
Hudi has an optional “schema on read” mode and “out of the box schema evolution”, which is the default. With out-of-the-box schema evolution, adding new columns is supported by default; see this link. Currently you cannot disable schema evolution entirely, as far as I know.
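(Given that, if the goal is to keep unexpected attributes out of the table, one workaround is to project each batch onto a fixed column list before writing, so new columns never reach Hudi. A sketch under the assumption that the expected schema is known up front; EXPECTED_COLUMNS, projectBatch, and the 'payload' column are names invented here for illustration:)
# Hypothetical fixed contract for the table; anything outside it is dropped.
EXPECTED_COLUMNS = ['user_id', 'timestamp', 'payload']

def projectBatch(data_frame):
    # Keep only the expected columns that are present in this batch,
    # silently dropping any newly appearing attributes from the stream.
    present = [c for c in EXPECTED_COLUMNS if c in data_frame.columns]
    return data_frame.select(*present)

projectBatch(kinesis_data_frame).write.format("hudi").options(**combinedConf).mode("append").save()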