dbt-spark: Error using spark adapter in thrift mode

Moving https://github.com/fishtown-analytics/dbt-spark/pull/20#issuecomment-497518244 here:

I tried to use the Spark adapter with this pull request, but I get the following error:

2019-05-30 18:17:30,493 (MainThread): Encountered an error:
2019-05-30 18:17:30,493 (MainThread): not enough values to unpack (expected 3, got 1)
2019-05-30 18:17:30,532 (MainThread): Traceback (most recent call last):
  File "/home/paul/.local/lib/python3.6/site-packages/dbt/main.py", line 79, in main
    results, succeeded = handle_and_check(args)
  File "/home/paul/.local/lib/python3.6/site-packages/dbt/main.py", line 153, in handle_and_check
    task, res = run_from_args(parsed)
  File "/home/paul/.local/lib/python3.6/site-packages/dbt/main.py", line 209, in run_from_args
    results = run_from_task(task, cfg, parsed)
  File "/home/paul/.local/lib/python3.6/site-packages/dbt/main.py", line 217, in run_from_task
    result = task.run()
  File "/home/paul/.local/lib/python3.6/site-packages/dbt/task/runnable.py", line 256, in run
    self.before_run(adapter, selected_uids)
  File "/home/paul/.local/lib/python3.6/site-packages/dbt/task/run.py", line 85, in before_run
    self.populate_adapter_cache(adapter)
  File "/home/paul/.local/lib/python3.6/site-packages/dbt/task/run.py", line 23, in populate_adapter_cache
    adapter.set_relations_cache(self.manifest)
  File "/home/paul/.local/lib/python3.6/site-packages/dbt/adapters/base/impl.py", line 331, in set_relations_cache
    self._relations_cache_for_schemas(manifest)
  File "/home/paul/.local/lib/python3.6/site-packages/dbt/adapters/base/impl.py", line 313, in _relations_cache_for_schemas
    for relation in self.list_relations_without_caching(db, schema):
  File "/home/paul/dbt-spark/dbt/adapters/spark/impl.py", line 75, in list_relations_without_caching
    for _database, name, _ in results:
ValueError: not enough values to unpack (expected 3, got 1)

If I add a print(results[0]) right above that line, it looks like each row has a single value instead of three: <agate.Row: ('mytable')>
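
For illustration, here’s a minimal sketch of why that unpack fails; the row shapes are my assumption about what SHOW TABLES returns from each engine (see the port discussion further down in this thread):

# Spark's Thrift server answers SHOW TABLES with three columns:
spark_rows = [("default", "mytable", False)]
# HiveServer2 (which, it turns out, was what answered on port 10000 here)
# answers with a single tab_name column:
hive_rows = [("mytable",)]

for _database, name, _ in spark_rows:  # three values per row: unpack succeeds
    print(name)

for _database, name, _ in hive_rows:   # ValueError: not enough values to unpack (expected 3, got 1)
    print(name)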

I couldn’t get Spark to connect in HTTP mode (i.e. without this pull request), so I’m not sure whether the issue is with this pull request or something more general.

This is connecting to an EMR 5.20.0 cluster, and the Thrift server was started with sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

AWS support helped me figure out the issue:

On my EMR cluster, port 10000 is for Hive and 10001 is for Spark. When I changed to 10001 it worked (after running start-thriftserver.sh).
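
A quick way to check which engine answers on each port is to inspect the SHOW TABLES column names over PyHive, which (as I understand it) is what dbt-spark’s thrift method uses under the hood. This is just a sketch, with the host and ports from my setup:

from pyhive import hive

# Probe both ports and print each one's SHOW TABLES column layout.
for port in (10000, 10001):
    cursor = hive.connect(host="127.0.0.1", port=port).cursor()
    cursor.execute("SHOW TABLES")
    print(port, [col[0] for col in cursor.description])

# Spark's Thrift server reports (database, tableName, isTemporary), while
# Hive reports a single tab_name column, which is exactly what breaks the
# three-way unpack in list_relations_without_caching.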

@rhousewright Should we maybe mention this port difference in the docs as part of your PR? https://github.com/fishtown-analytics/dbt-spark/pull/20/files#diff-04c6e90faac2675aa89e2176d2eec7d8R22

Here’s my profile now:

default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: experiments
      host: 127.0.0.1
      port: 10001
      threads: 4
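
And for anyone hitting the same thing, here’s a rough sanity check against that profile: connect with PyHive the way the thrift method presumably does, and repeat the three-way unpack from the traceback (it succeeds on the Spark port):

from pyhive import hive

# Sketch only: host, port, and schema taken from the profile above.
cursor = hive.connect(host="127.0.0.1", port=10001, database="experiments").cursor()
cursor.execute("SHOW TABLES")
for _database, name, _is_temporary in cursor.fetchall():
    print(name)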

So this is super interesting. I tried running a similar thing against an EMR 5.21.0 cluster, with the following generated SQL for my model, and it worked just fine for me. So that’s weird?

create table dbt_test_db.my_first_dbt_model
    using parquet
    partitioned by (id)
    as
select 1 as id, 2 as not_id

There’s nothing in the 5.21.0 release notes (vs. 5.20.0) that indicates any relevant changes, and I’m not doing anything unusual in terms of cluster config (I am using the Glue catalog, in case that matters). The only thing I did differently than you, I think, was to start the Thrift server with sudo /usr/lib/spark/sbin/start-thriftserver.sh (without --master yarn-client).

I will note that I only have Spark installed on the cluster (I don’t have Hive installed). Do you have both installed? If so, is it possible that installing Hive in some way takes over the HiveServer2 connection from the Spark backend? I haven’t had the chance to test that theory yet, though. Config I’m using right now: [screenshot: Screen Shot 2019-06-13 at 4 50 54 PM]

In general, I’m hoping to get some dedicated time to work on dbt-spark soon; I’m trying to set aside time in an upcoming sprint to see if we can get a POC working in our space. Hopefully I’ll learn a lot, and possibly generate some pull requests, through that process!