dbt-spark: Error using spark adapter in thrift mode
Moving https://github.com/fishtown-analytics/dbt-spark/pull/20#issuecomment-497518244 here:
I tried using the spark adapter with this pull request, but I get the following error:
2019-05-30 18:17:30,493 (MainThread): Encountered an error:
2019-05-30 18:17:30,493 (MainThread): not enough values to unpack (expected 3, got 1)
2019-05-30 18:17:30,532 (MainThread): Traceback (most recent call last):
File "/home/paul/.local/lib/python3.6/site-packages/dbt/main.py", line 79, in main
results, succeeded = handle_and_check(args)
File "/home/paul/.local/lib/python3.6/site-packages/dbt/main.py", line 153, in handle_and_check
task, res = run_from_args(parsed)
File "/home/paul/.local/lib/python3.6/site-packages/dbt/main.py", line 209, in run_from_args
results = run_from_task(task, cfg, parsed)
File "/home/paul/.local/lib/python3.6/site-packages/dbt/main.py", line 217, in run_from_task
result = task.run()
File "/home/paul/.local/lib/python3.6/site-packages/dbt/task/runnable.py", line 256, in run
self.before_run(adapter, selected_uids)
File "/home/paul/.local/lib/python3.6/site-packages/dbt/task/run.py", line 85, in before_run
self.populate_adapter_cache(adapter)
File "/home/paul/.local/lib/python3.6/site-packages/dbt/task/run.py", line 23, in populate_adapter_cache
adapter.set_relations_cache(self.manifest)
File "/home/paul/.local/lib/python3.6/site-packages/dbt/adapters/base/impl.py", line 331, in set_relations_cache
self._relations_cache_for_schemas(manifest)
File "/home/paul/.local/lib/python3.6/site-packages/dbt/adapters/base/impl.py", line 313, in _relations_cache_for_schemas
for relation in self.list_relations_without_caching(db, schema):
File "/home/paul/dbt-spark/dbt/adapters/spark/impl.py", line 75, in list_relations_without_caching
for _database, name, _ in results:
ValueError: not enough values to unpack (expected 3, got 1)
If I add a print(results[0]) right above that line, it seems like each row in results has a single value instead of three:
<agate.Row: ('mytable')>
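The failure mode is easy to reproduce outside dbt: unpacking a one-field row into three names raises exactly this ValueError. One likely explanation (an assumption on my part, not confirmed in this thread) is that Spark's SHOW TABLES returns three columns (database, tableName, isTemporary) while Hive's returns only one (tab_name), so pointing the adapter at the Hive thrift port yields one-field rows. A minimal sketch using a plain tuple in place of the agate.Row:

```python
# The adapter expects each row to unpack into (database, name, information),
# but the rows coming back here carry a single field, e.g. ('mytable',).
results = [("mytable",)]  # stand-in for <agate.Row: ('mytable')>

try:
    for _database, name, _ in results:
        pass
except ValueError as e:
    print(e)  # not enough values to unpack (expected 3, got 1)
```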
I couldn’t get Spark connecting in http mode (i.e. without this pull request), so I’m not sure if the issue is with this pull request or something more general.
This is connecting to an EMR 5.20.0 cluster, and thrift was started with sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client.
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (8 by maintainers)
AWS support helped me figure out the issue:
On my EMR cluster, port 10000 is for Hive and 10001 is for Spark. When I changed to 10001 it worked (after running start-thriftserver.sh).
@rhousewright Should we maybe mention this port difference in the docs as part of your PR? https://github.com/fishtown-analytics/dbt-spark/pull/20/files#diff-04c6e90faac2675aa89e2176d2eec7d8R22
Here’s my profile now:
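(The actual profile did not survive this export. A sketch of what a working thrift-mode profile against the Spark port might look like, with the profile name, host, and schema being hypothetical placeholders:)

```yaml
# Hypothetical profiles.yml sketch: thrift mode pointed at the Spark
# thrift server on port 10001 (on this EMR setup, 10000 serves Hive).
my_spark_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: ec2-xx-xx-xx-xx.compute-1.amazonaws.com  # hypothetical EMR master
      port: 10001      # Spark thrift server; 10000 is Hive here
      schema: my_schema  # hypothetical
```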
So this is super interesting. I tried running a similar thing against an EMR cluster running EMR 5.21.0, with the following generated SQL for my model, and it worked just fine for me. So that’s weird?
There’s nothing in the 5.21.0 release notes that would indicate any relevant changes (vs 5.20.0), and I’m not doing anything unusual / relevant in terms of cluster config (I am using Glue catalog, in case that matters). The only thing I did differently, I think, than you did is to start the thrift server with
sudo /usr/lib/spark/sbin/start-thriftserver.sh (without the --master yarn-client). I will note that I only have Spark installed on the cluster (I don’t have Hive installed) - do you have both installed? If so, is it possible that installing Hive in some way overtakes the HiveServer2 connection to the Spark backend? I haven’t had the chance to test that theory yet, though. The config I’m using right now is:
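(This config was also dropped from the export. A hypothetical sketch of the setup described above, with all names as placeholders:)

```yaml
# Hypothetical sketch: Spark-only EMR 5.21.0 cluster (no Hive installed),
# thrift server started with the default start-thriftserver.sh, so the
# standard HiveServer2 port serves Spark directly.
default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: emr-master.example.internal  # hypothetical
      port: 10000    # default HiveServer2 port, backed by Spark here
      schema: dbt_test  # hypothetical
```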
In general, I’m hoping to get some dedicated time to work on dbt-spark stuff in the next little bit, trying to set aside some time in an upcoming sprint to see if we can get a POC working in our space. Hopefully will learn a lot, and possibly generate some pull requests, through that process!