hudi: Hoodie clean is not deleting old files
I am trying to verify that Hudi's cleaner is triggering and cleaning my files, but I do not see any cleaning action being performed on the old log files.
To Reproduce
- I am writing some files to S3 with Hudi, using the configuration below, multiple times (4-5 writes, to give the cleaner a chance to trigger); see the loop sketch after the write snippet below.
My Hudi config:
```python
table_name = "demosr"
hudi_options_prodcode = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'key',
    'hoodie.datasource.write.partitionpath.field': 'range_partition',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.precombine.field': 'update_date',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.consistency.check.enabled': True,
    'hoodie.bloom.index.filter.type': 'dynamic_v0',
    'hoodie.bloom.index.bucketized.checking': False,
    'hoodie.memory.merge.max.size': '2004857600000',
    'hoodie.upsert.shuffle.parallelism': 500,
    'hoodie.insert.shuffle.parallelism': 500,
    'hoodie.bulkinsert.shuffle.parallelism': 500,
    'hoodie.parquet.small.file.limit': '204857600',
    'hoodie.parquet.max.file.size': '484402653184',
    'hoodie.memory.compaction.fraction': '384402653184',
    'hoodie.write.buffer.limit.bytes': str(128 * 1024 * 1024),
    'hoodie.compact.inline': True,
    'hoodie.compact.inline.max.delta.commits': 1,
    'hoodie.datasource.compaction.async.enable': False,
    'hoodie.parquet.compression.ratio': '0.35',
    'hoodie.logfile.max.size': '268435456',
    'hoodie.logfile.to.parquet.compression.ratio': '0.5',
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.keep.min.commits': 2,
    'hoodie.keep.max.commits': 3,
    'hoodie.copyonwrite.record.size.estimate': 32,
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.clean.automatic': True
}
```
Writing to S3:
```python
path_to_delta_table = "s3://testdataprocessing/hudi_clean_test1/"
df.write.format("org.apache.hudi").options(**hudi_options_prodcode).mode("append").save(path_to_delta_table)
```
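For the repeated writes mentioned in the reproduction step, a minimal sketch of the loop (not from the original report; it assumes the same `df` is appended each time, where in practice you might regenerate or mutate it between writes):
```python
# Append the dataset 5 times so enough delta commits accumulate
# for inline compaction and the cleaner to kick in.
for i in range(5):
    (df.write.format("org.apache.hudi")
        .options(**hudi_options_prodcode)
        .mode("append")
        .save(path_to_delta_table))
```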
Expected behavior
As per my understanding, the old log files should be deleted once the commit count exceeds the maximum (hoodie.keep.max.commits = 3), keeping only one commit's worth of files (hoodie.cleaner.commits.retained = 1).
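One way to check whether the cleaner ever ran is to look for completed clean instants on the timeline: Hudi writes each completed clean as a `<instant>.clean` file under `<base_path>/.hoodie`. A minimal sketch (not from the original report), assuming the bucket and prefix from the write path above and that boto3 credentials are configured:
```python
import boto3

# List the Hudi timeline under <base_path>/.hoodie; each completed clean
# leaves a "<instant>.clean" file there. No such files means the cleaner
# never ran (or never found anything eligible to delete).
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="testdataprocessing",
    Prefix="hudi_clean_test1/.hoodie/",
)
keys = [obj["Key"] for obj in resp.get("Contents", [])]
cleans = [k for k in keys if k.endswith(".clean")]
print(f"{len(cleans)} clean instant(s) found")
for k in cleans:
    print(" ", k)
```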
Environment Description
- Hudi version : 0.6.0
- Spark version : 2.4
- Hive version : 2.3.7
- Hadoop version :
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : No
- EMR : 5.31.0
About this issue
- State: closed
- Created 3 years ago
- Comments: 34 (19 by maintainers)
@vinothchandar Yes, please! If you can recommend a next step, @mauropelucchi and I will provide whatever assistance we can.
Slightly unrelated comment: I guess you might have to fix your config value for hoodie.memory.compaction.fraction. It is expected to be a fraction like 0.3 or 0.5. From your description, it looks like you are using some very large value for this config.
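For reference, a minimal sketch of that fix (0.5 is just one of the example values the comment mentions, not a recommendation from the thread):
```python
# Hedged fix per the comment above: hoodie.memory.compaction.fraction is a
# fraction of available memory (e.g. 0.3 or 0.5), not a byte count.
hudi_options_prodcode['hoodie.memory.compaction.fraction'] = '0.5'
```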