trino: hive connector io.trino.spi.TrinoException: Unsupported storage format

Hello,

We have Hive tables that use custom input formats and serdes. We noticed that, starting with Trino 423, we’re no longer able to query these tables.

Query 20230907_171018_00016_mrt64 failed: Unsupported storage format: foobar StorageFormat{serde=CUSTOM SERDE HERE, inputFormat=org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat=org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}
io.trino.spi.TrinoException: Unsupported storage format: foobar StorageFormat{serde=CUSTOM SERDE HERE, inputFormat=org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat=org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat}
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$4(BackgroundHiveSplitLoader.java:497)
 at java.base/java.util.Optional.orElseThrow(Optional.java:403)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:497)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:400)
 at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:314)
 at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
 at io.trino.$gen.Trino_426____20230907_160032_2.run(Unknown Source)
 at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:79)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
 at java.base/java.lang.Thread.run(Thread.java:833)

The issue seems to be a recent change to BackgroundHiveSplitLoader.java that introduced a call to getHiveStorageFormat, which fails when querying a table whose format is not defined in HiveStorageFormat.

We had to make changes to HiveStorageFormat.java to add our custom serde definitions (see the sketch below). This is a really concerning change for us. Why is the Hive connector all of a sudden limited to only those formats defined in HiveStorageFormat?
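
To make the behavior concrete, here is a minimal, self-contained sketch of the lookup that now happens during split loading. This is not Trino’s actual source; the class StorageFormatLookupSketch and the com.example.hive.serde.FoobarSerDe class name are placeholders for illustration. The point it demonstrates is the one above: the table’s serde and input/output format classes are matched against a fixed set of known formats, and anything unrecognized throws.

import java.util.Optional;

// Minimal sketch of the HiveStorageFormat lookup described above.
// Not Trino's actual source; the com.example serde class is a placeholder.
public class StorageFormatLookupSketch
{
    record StorageFormat(String serde, String inputFormat, String outputFormat) {}

    enum HiveStorageFormat
    {
        // Built-in SequenceFile entry, matching the classes in the error above
        SEQUENCEFILE(
                "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "org.apache.hadoop.mapred.SequenceFileInputFormat",
                "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat");

        private final String serde;
        private final String inputFormat;
        private final String outputFormat;

        HiveStorageFormat(String serde, String inputFormat, String outputFormat)
        {
            this.serde = serde;
            this.inputFormat = inputFormat;
            this.outputFormat = outputFormat;
        }

        // Only formats enumerated above are recognized; a table with a
        // custom serde yields Optional.empty(), which the split loader
        // turns into the "Unsupported storage format" TrinoException.
        static Optional<HiveStorageFormat> getHiveStorageFormat(StorageFormat format)
        {
            for (HiveStorageFormat candidate : values()) {
                if (candidate.serde.equals(format.serde())
                        && candidate.inputFormat.equals(format.inputFormat())
                        && candidate.outputFormat.equals(format.outputFormat())) {
                    return Optional.of(candidate);
                }
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args)
    {
        StorageFormat custom = new StorageFormat(
                "com.example.hive.serde.FoobarSerDe", // placeholder for our custom serde
                "org.apache.hadoop.mapred.SequenceFileInputFormat",
                "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat");

        // Throws, mirroring the query failure: the custom serde is not in the
        // enum, so our only workaround was to patch an entry for it into
        // HiveStorageFormat.java.
        HiveStorageFormat.getHiveStorageFormat(custom)
                .orElseThrow(() -> new IllegalStateException("Unsupported storage format: " + custom));
    }
}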

The documentation page does not reflect that only certain SequenceFile serdes are supported: https://trino.io/docs/current/connector/hive.html

Assuming this change was done by design, what does the roadmap look like for Hive support in Trino?

About this issue

  • State: open
  • Created 10 months ago
  • Reactions: 8
  • Comments: 15 (11 by maintainers)

Most upvoted comments

+1, observed the same issue

I understand you are upset. The decision to drop support for Hive SerDes and Hadoop compression codecs was not made lightly. The Hadoop and Hive codebases are difficult to work with and not well maintained. Additionally, the community has swiftly moved away from these Hadoop and Hive formats to Parquet and ORC, and it is pushing further with the switch to Iceberg, Delta Lake, and Hudi. I believe this is a self-reinforcing cycle that is unlikely to change.

Maintaining support for the full breadth of Hadoop/Hive features has been a herculean effort for the past ten years, which we happily undertook because of the vast usage of these systems. However, usage has been declining for years, while the effort to maintain support has not shrunk to match; it is actually growing as the Hadoop/Hive codebases become more difficult to work with.

This came to a head as we attempted to add new features like dynamic catalogs (#12709). The Hadoop/Hive codebases have critical design flaws that make them incompatible with these new features. The only reasonable way to add the features was to decouple from the Hadoop/Hive codebases. This is a massive effort, and again we happily did it, because we could finally reduce the effort required to maintain Hadoop/Hive support and actually add these new features.

So where do we go from here? For open source, popular, well-maintained formats, we will consider adding official support. We may be able to add interfaces to extend the Hive plugin with new file formats and compression codecs.

We have never supported extending the Hive plugin by adding jars to the plugin directory, but a few folks did, with varying degrees of success. If we do add extension points for this, they will be specific to Trino and will not use Hadoop/Hive APIs (or have them available on the classpath). This means you would need to adapt your custom format to Trino APIs (I assume that if you have a custom format, you have programmers on hand). That said, we would need to see broad community need before we consider adding this, since, again, extending the plugin is not something we have ever supported.

We have custom Protobuf and Parquet serdes, and we are heavily invested in Protobuf. For Protobuf, we have implemented some custom types for performance reasons, and since our schemas are encoded in Protobuf, we have written a Parquet serde that can infer the schema from a Protobuf definition.

It’ll be a heavy lift to move our infrastructure off of this.

+1. Without major custom changes, this is completely breaking and blocks any further upgrades to Trino for us.