iceberg: Unable to use GlueCatalog in Flink environments without Hadoop

When attempting to use the GlueCatalog implementation (or really any catalog implementation) in Flink, Hadoop is expected to be on the classpath.

The FlinkCatalogFactory always attempts to load the Hadoop configuration from Flink, but Flink does not guarantee that a valid Hadoop environment is present. In environments where Hadoop is not available (e.g. AWS Kinesis Data Analytics), this throws java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration.

Presently, most of the catalog implementations implement Configurable, and thus util functions like loadCatalog expect to be passed a Hadoop Configuration instance. In catalogs like GlueCatalog and DynamoCatalog, the only reason for the Configurable interface is to enable dynamic FileIO loading.
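
For illustration (this sketch is not from the issue; the catalog name and warehouse location are placeholders, and depending on the Iceberg version the last parameter of loadCatalog is typed as a Hadoop Configuration or a plain Object), this is roughly the call path that drags org.apache.hadoop.conf.Configuration onto the classpath even for a Hadoop-free catalog:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;

public class GlueCatalogLoadExample {
  public static void main(String[] args) {
    Map<String, String> properties = new HashMap<>();
    properties.put("warehouse", "s3://my-bucket/warehouse");         // placeholder bucket
    properties.put("io-impl", "org.apache.iceberg.aws.s3.S3FileIO"); // Hadoop-free FileIO

    // The Hadoop Configuration passed here is only used to satisfy
    // Configurable#setConf, but it forces Hadoop classes onto the classpath
    // even though GlueCatalog itself does not need them.
    Catalog glueCatalog = CatalogUtil.loadCatalog(
        "org.apache.iceberg.aws.glue.GlueCatalog",
        "glue_catalog",
        properties,
        new Configuration());

    System.out.println(glueCatalog.name());
  }
}
```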

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 23 (7 by maintainers)

Most upvoted comments

Thanks for the response, @matt-slalom. I did get something working, mainly following the comment here, though converting it to Maven. One important point is that we are using the Table/SQL API (also from PyFlink), so we are not explicitly instantiating the FlinkCatalog. As such, we are dependent on this code.
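
As a hedged illustration of that path (the catalog name and warehouse location below are placeholders, not taken from the issue), creating the catalog through the Table/SQL API looks roughly like this; it is this DDL route that goes through FlinkCatalogFactory and its Hadoop configuration lookup:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlCatalogExample {
  public static void main(String[] args) {
    TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

    // CREATE CATALOG is resolved by FlinkCatalogFactory, which is where the
    // cluster Hadoop configuration is loaded before the Iceberg catalog is built.
    tEnv.executeSql(
        "CREATE CATALOG glue_catalog WITH ("
            + " 'type'='iceberg',"
            + " 'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',"
            + " 'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',"
            + " 'warehouse'='s3://my-bucket/warehouse'" // placeholder bucket
            + ")");
  }
}
```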

A few things:

  • Important here was the relocation/shading, which, as I am not a Java programmer, took some time to get right.
  • I still had to include (some) Hadoop libraries, but I was able to drop 'org.apache.flink:flink-hadoop-fs' by writing my own HadoopUtils::getHadoopConfiguration that returns new Configuration(false); (see the sketch after this list). I think it should be possible to “hack” in a Configuration class to avoid pulling in Hadoop libraries, but I haven’t dug into it more just yet.
  • One important thing I ran across was that Flink does manipulate class loading (“child first” vs “parent first”) in general, but explicitly does not do this for Hadoop libraries (see here). This could have been the source of the problems that some other posters mentioned above.
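
A minimal sketch of that HadoopUtils replacement, under the assumption that the class resolved by FlinkCatalogFactory is org.apache.flink.runtime.util.HadoopUtils (verify the package and method signature against the Flink and Iceberg versions you actually build with); it simply short-circuits Hadoop environment discovery:

```java
// Drop-in replacement compiled into the application jar so that
// flink-hadoop-fs no longer has to be bundled.
package org.apache.flink.runtime.util;

import org.apache.hadoop.conf.Configuration;

public final class HadoopUtils {

  private HadoopUtils() {}

  // Return an empty Hadoop Configuration instead of scanning HADOOP_CONF_DIR,
  // HADOOP_HOME, etc. `new Configuration(false)` also skips loading the
  // core-default.xml / core-site.xml defaults.
  public static Configuration getHadoopConfiguration(
      org.apache.flink.configuration.Configuration flinkConfiguration) {
    return new Configuration(false);
  }
}
```

Since Flink keeps org.apache.flink and org.apache.hadoop classes on the parent-first list by default (per the last bullet above), a shim like this presumably only takes effect together with the relocation/shading mentioned in the first bullet.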

Anyway, I would still classify this as a workaround, but since I had to piece it together, I think it would make sense to document it until the Hadoop dependencies are fully removed. I will try to come back and update this comment once I have wrapped things up.

EDIT:

I’ve placed the relevant code here for those who want to take a look. The pom.xml is simple as is; I didn’t spend any time filtering out the things that aren’t needed. However, the relocation/shading configurations are basically all needed. Another note: we did need to explicitly instantiate the Catalog, which is why we introduced a light wrapper in Python calling through to the JVM. HTH.