hudi: Hoodie clean is not deleting old files

I am trying to verify that the Hudi cleaner is triggering and cleaning up my files, but I do not see any clean action being performed on the old log files.

To Reproduce

  1. I am writing files to S3 with Hudi using the configuration below, multiple times (4-5 writes, to give the cleaner a chance to trigger).

My hudi config

```python
table_name = "demosr"
hudi_options_prodcode = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'key',
    'hoodie.datasource.write.partitionpath.field': 'range_partition',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.precombine.field': 'update_date',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.consistency.check.enabled': True,
    'hoodie.bloom.index.filter.type': 'dynamic_v0',
    'hoodie.bloom.index.bucketized.checking': False,
    'hoodie.memory.merge.max.size': '2004857600000',
    'hoodie.upsert.shuffle.parallelism': 500,
    'hoodie.insert.shuffle.parallelism': 500,
    'hoodie.bulkinsert.shuffle.parallelism': 500,
    'hoodie.parquet.small.file.limit': '204857600',
    'hoodie.parquet.max.file.size': '484402653184',
    'hoodie.memory.compaction.fraction': '384402653184',
    'hoodie.write.buffer.limit.bytes': str(128 * 1024 * 1024),
    'hoodie.compact.inline': True,
    'hoodie.compact.inline.max.delta.commits': 1,
    'hoodie.datasource.compaction.async.enable': False,
    'hoodie.parquet.compression.ratio': '0.35',
    'hoodie.logfile.max.size': '268435456',
    'hoodie.logfile.to.parquet.compression.ratio': '0.5',
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.keep.min.commits': 2,
    'hoodie.keep.max.commits': 3,
    'hoodie.copyonwrite.record.size.estimate': 32,
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.clean.automatic': True
}
```

Writing to S3:

```python
path_to_delta_table = "s3://testdataprocessing/hudi_clean_test1/"
df.write.format("org.apache.hudi").options(**hudi_options_prodcode).mode("append").save(path_to_delta_table)
```
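
For context, a minimal sketch of the repeated writes (purely illustrative; it reuses the same `df`, `hudi_options_prodcode`, and `path_to_delta_table` from above, whereas in practice each batch would carry new data):

```python
# Illustrative only: append several batches so the cleaner has enough
# commits to act on. `df`, `hudi_options_prodcode`, and
# `path_to_delta_table` are defined as in the snippets above.
for _ in range(5):
    (df.write.format("org.apache.hudi")
        .options(**hudi_options_prodcode)
        .mode("append")
        .save(path_to_delta_table))
```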

Expected behavior

As per my understanding, the old log files should be deleted once the max commits setting (3) is exceeded, and only one commit should be retained at a time.
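
One way to check whether cleaning actually ran is to look for completed clean instants in the table's `.hoodie` folder. Below is a hedged sketch using boto3 (bucket and prefix are taken from the S3 path above; the `.clean`/`.commit`/`.deltacommit` suffix matching assumes the usual Hudi timeline file naming):

```python
# Sketch: count timeline instants for the table on S3. Assumes AWS
# credentials are available to boto3 and that fewer than 1000 objects
# live under the .hoodie prefix (no pagination handling).
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="testdataprocessing",
    Prefix="hudi_clean_test1/.hoodie/",
)
keys = [obj["Key"] for obj in resp.get("Contents", [])]
cleans = [k for k in keys if k.endswith(".clean")]
commits = [k for k in keys if k.endswith(".commit") or k.endswith(".deltacommit")]
print(f"clean instants: {len(cleans)}, commit instants: {len(commits)}")
```

If no `*.clean` files show up after several writes, the cleaner never completed a clean action.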

Environment Description

  • Hudi version : 0.6.0

  • Spark version : 2.4

  • Hive version : 2.3.7

  • Hadoop version :

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : No

  • EMR : 5.31.0

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 34 (19 by maintainers)

Most upvoted comments

@vinothchandar Yes please! If you can recommend a next step, @mauropelucchi and I will provide whatever assistance we can.

Slightly unrelated comment: I guess you might have to fix your config value for hoodie.memory.compaction.fraction. It is expected to be a fraction like 0.3 or 0.5. From your description, it looks like you are setting some very large value for this config.
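
Following that suggestion, the fix on the writer side would presumably look something like the line below (0.6 is only an illustrative fraction, not a recommended value):

```python
# hoodie.memory.compaction.fraction expects a fraction of memory
# (e.g. 0.3, 0.5), not a byte count; 0.6 here is just a placeholder.
hudi_options_prodcode['hoodie.memory.compaction.fraction'] = '0.6'
```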