iceberg: Unable to use GlueCatalog in flink environments without hadoop
When attempting to use the GlueCatalog implementation (or really any implementation) in flink, hadoop is expected to be in the classpath.
The FlinkCatalogFactory always attempts to load the hadoop config from flink but flink does not guarantee that there is a valid hadoop environment present. In environments where hadoop is not available (e.g. AWS Kinesis Data Analytics), this throws java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration.
Presently, most of the catalog implementations implement Configurable and thus the util functions like loadCatalog expect to be passed an instance of hadoopConf. In catalogs like GlueCatalog and DynamoCatalog, the only reason for the Configurable interface is to enable dynamic FileIO loading
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 23 (7 by maintainers)
Thanks for the response, @matt-slalom. I did get something working, mainly following the comment here, though converting it to maven. One important point is that we are using the Table/SQL API (also from pyFlink), so are not explicitly instantiating the FlinkCatalog. As such, we are dependent on this code.
A few things:
'org.apache.flink:flink-hadoop-fs'by writing my ownHadoopUtils::getHadoopConfigurationthat returnsnew Configuration(false);. I think it should be possible to “hack” in aConfigurationclass to avoid pulling in hadoop libraries, but I haven’t dug more just yet.Anyways, I would still classify this as workaround, but, since I had to piece this together, I think it would still somehow make sense to document this until the hadoop dependencies are fully removed. I will try to come back and update this comment once I have wrapped things up.
** EDIT: **
I’ve placed the relevant code here for those that want to take a look. The pom.xml file is simple as is, I didn’t spend any time to filter out the things that aren’t needed. However, the relocations/shading configurations are basically all needed. Another note is that we did need to explicitly instantiate the Catalog, which is why we introduced a light wrapper in python calling through to the jvm. HTH.