aws-glue-data-catalog-client-for-apache-hive-metastore: Not able to query glue/Athena views ['java.lang.IllegalArgumentException: Can not create a Path from an empty string;']

I’m running an EMR cluster with the ‘AWS Glue Data Catalog as the Metastore for Hive’ option enabled. Connecting through a Spark notebook works fine, e.g.

spark.sql("show databases")
spark.catalog.setCurrentDatabase(<databasename>)
spark.sql("""select * from <table> limit 10""").show()

All works as expected, but when querying a view I get the following error:

spark.sql("""select * from <view> limit 10""").show()

An error was encountered:
'java.lang.IllegalArgumentException: Can not create a Path from an empty string;'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.IllegalArgumentException: Can not create a Path from an empty string;'

I guess that since views are not stored on a physical path, I have to specify a temp path somewhere, but I cannot find out how.

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 6
  • Comments: 19

Most upvoted comments

So here is my generic workaround. Keep in mind that the query has to be both Spark and Presto compatible, so I suggest keeping the SQL of the views as simple as possible. How it works:

  • it first creates an Athena view from the query
  • gets the Presto schema in base64 from that view via boto3
  • deletes the Athena view
  • creates a Spark-based view for the same query
  • updates the Spark view with the Presto schema so Athena can read it as well
import boto3
import time


def execute_blocking_athena_query(query: str):
    """Run a query in Athena and block until it finishes.

    Note: no ResultConfiguration is passed, so the default workgroup
    must already have a query result location configured.
    """
    athena = boto3.client("athena")
    res = athena.start_query_execution(QueryString=query)
    execution_id = res["QueryExecutionId"]
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        state = res["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return
        if state in ["FAILED", "CANCELLED"]:
            raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
        time.sleep(1)


def create_cross_platform_view(db: str, table: str, query: str, spark_session):
    glue = boto3.client("glue")

    # Drop any previous version of the view (ignore the error on the first run,
    # when the view does not exist yet).
    try:
        glue.delete_table(DatabaseName=db, Name=table)
    except glue.exceptions.EntityNotFoundException:
        pass

    # 1) Create the view in Athena and grab its Presto schema (a base64-encoded
    #    blob stored in ViewOriginalText), then drop the Athena view again.
    create_view_sql = f"create view {db}.{table} as {query}"
    execute_blocking_athena_query(create_view_sql)
    presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"][
        "ViewOriginalText"
    ]
    glue.delete_table(DatabaseName=db, Name=table)

    # 2) Recreate the same view with Spark.
    spark_session.sql(create_view_sql).show()

    # 3) Patch the Spark view's catalog entry so Athena can read it too.
    #    Strip the read-only fields that update_table's TableInput rejects.
    spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
    for key in [
        "DatabaseName",
        "CreateTime",
        "UpdateTime",
        "CreatedBy",
        "IsRegisteredWithLakeFormation",
        "CatalogId",
    ]:
        if key in spark_view:
            del spark_view[key]
    spark_view["ViewOriginalText"] = presto_schema
    spark_view["Parameters"]["presto_view"] = "true"
    glue.update_table(DatabaseName=db, TableInput=spark_view)


spark_session = ...  # insert code to create the session
create_cross_platform_view(
    "YOUR_DB", "TEST_VIEW", "select * from YOUR_DB.YOUR_TABLE", spark_session
)

I found a solution for this problem! It is possible to query Athena views over a JDBC connection, e.g. through spark_read_jdbc with the appropriate options and configurations.
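
For reference, a minimal PySpark sketch of that JDBC route. The driver class, URL format, region, and staging bucket below are assumptions based on the Simba Athena JDBC driver, whose jar would have to be on the Spark classpath; `spark` is the notebook's SparkSession:

# Sketch: read an Athena view over JDBC, so Athena (not Spark) resolves it.
df = (
    spark.read.format("jdbc")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option(
        "url",
        "jdbc:awsathena://AwsRegion=eu-west-1;"
        "S3OutputLocation=s3://YOUR_STAGING_BUCKET/athena-results/;"
        "AwsCredentialsProviderClass="
        "com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    # The query runs inside Athena, which understands its own views.
    .option("query", "select * from YOUR_DB.YOUR_VIEW limit 10")
    .load()
)
df.show()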

The same issue happened for me when I created a view in Athena and then tried to query it in a Glue job via Spark. I think the issue is that the Glue Catalog handles views in a special way: they are tables, just without a path to a real location (on S3 or an alternative). I couldn’t find a workaround so far.
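
You can see this in the catalog itself. A small boto3 sketch (database and view names are placeholders; the commented values are what I would expect for an Athena view, based on the behaviour described in this thread):

import boto3

# Sketch: inspect how an Athena view is stored in the Glue catalog.
glue = boto3.client("glue")
view = glue.get_table(DatabaseName="YOUR_DB", Name="YOUR_VIEW")["Table"]

print(view["TableType"])                      # 'VIRTUAL_VIEW'
print(view["StorageDescriptor"]["Location"])  # empty -- the Path Spark cannot create
# The definition lives in ViewOriginalText as a base64-encoded Presto schema,
# which Spark does not understand:
print(view["ViewOriginalText"][:40])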

Hi @kironp, sorry for not replying sooner. Our Delta table setup is not a workaround; it’s our main approach for working with tables.

Our method is similar to what you said you already tried. We don’t consume Athena views from Spark at all. We use the same Glue catalog and create two table definitions and views: one for Delta ( https://github.com/delta-io/delta ) and one for Athena.

Both definitions are configured to use the same path by generating an Athena table from the Delta manifest ( https://docs.delta.io/0.7.0/presto-integration.html ), roughly as sketched below.
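
A sketch of that setup, following the linked Delta/Presto integration docs; the bucket, table, and column names are placeholders:

from delta.tables import DeltaTable

# 1) Spark side: generate a symlink manifest for the Delta table so
#    Presto/Athena can find the current Parquet files.
delta_table = DeltaTable.forPath(spark, "s3://YOUR_BUCKET/YOUR_TABLE")
delta_table.generate("symlink_format_manifest")

# 2) Athena side: a second table definition over the manifest directory
#    (DDL per the Delta docs; run it in Athena, e.g. with
#    execute_blocking_athena_query from the workaround above).
athena_ddl = """
CREATE EXTERNAL TABLE YOUR_DB.YOUR_TABLE (col1 string, col2 bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://YOUR_BUCKET/YOUR_TABLE/_symlink_format_manifest/'
"""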

I may be late to the party, but I hope this helps someone who runs into one of those cryptic errors (we encountered this during table creation, when the location was not defined properly in the catalog):

https://docs.databricks.com/data/metastores/aws-glue-metastore.html#troubleshooting

Accessing tables and views created by other systems, such as AWS Athena or Presto, may or may not work in Databricks Runtime or Spark, and is not supported. While some may work, such as when the table is Hive-compatible, others may fail with cryptic error messages. For example, accessing a view created by Athena from Databricks Runtime or Spark may throw an exception like: IllegalArgumentException: Can not create a Path from an empty string. That is because Athena and Presto store view metadata in a different format than what Databricks Runtime and Spark expect.

Personally, we create a Delta table over the same path for Spark/Spark SQL and use Athena for generic querying to circumvent this.