dbt-core: Can't access BigQuery tables whose original sources are spreadsheets on Google Drive with the oauth method

Describe the bug

It is impossible to access BigQuery tables whose original sources are spreadsheets on Google Drive when using the oauth method, because dbt doesn’t pass any scopes parameter to the method that creates the BigQuery credentials. The https://www.googleapis.com/auth/drive scope is required to access such tables. Moreover, there is no way to grant scopes (as opposed to permissions) to an existing Google Cloud authentication; scope information can only be passed when the credentials are created.
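To make this concrete, here is a minimal sketch of the difference using google-auth directly (the scope list shown is illustrative):

import google.auth

# Without scopes (what the oauth path effectively does today), queries against
# Drive-backed external tables fail:
credentials, project = google.auth.default()

# The Drive scope has to be requested when the credentials are created:
credentials, project = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/drive",
    ]
)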

Steps To Reproduce

  1. Create a BigQuery table whose data source is a spreadsheet.
  2. Set up a profile that uses the oauth method for BigQuery (a minimal example follows this list).
  3. Create a dbt model which refers to the BigQuery table with the profile.
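For step 2, a minimal profiles.yml entry using the oauth method might look like this (the project and dataset names are placeholders):

my-bigquery-profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: my_dataset
      threads: 1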

Expected behavior

Even when using the oauth method in a BigQuery profile, we should be able to access BigQuery tables whose original data sources are spreadsheets on Google Drive, given the https://www.googleapis.com/auth/drive scope.

Screenshots and log output

$ dbt run
...
Database Error in model my_model (models/my_model.sql)
  Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.
  compiled SQL at target/run/my_model.sql

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery ✔
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.19.0
   latest version: 0.19.0

Up to date!

Plugins:
  - bigquery: 0.19.0

The operating system you’re using: macOS 10.15.7 (19H114)

The output of python --version: Python 3.8.5

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 14
  • Comments: 22 (15 by maintainers)

Most upvoted comments

Okay, I think I found the fix. It’s necessary to run this command to add the Drive scope to the application default credentials:

gcloud auth application-default login --scopes=https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery

(Note that you also have to add the bigquery scope and anything else needed by dbt).

Okay, I’m back, and I have good news. After digging through here and trying out some of the changes, I think I understand how this happened and how we can resolve the issue.

To start, the issue was first introduced in PR #2908, where this line, which passed scopes when using OAuth, was replaced with this line, which relies on the defaults configured with gcloud.

This is why, even after this change, you will not see the failure if you authenticate with the scopes added (using gcloud auth application-default login --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive.readonly), since your default credentials already have the scopes.

However, if you have a setup like we do over at Honey 😉, you will see the problem. We use a service account in an environment with no default credentials available, and authenticate by activating the service account with gcloud auth activate-service-account, which does not allow for passing scopes but still causes the dbt auth code to go down that (OAuth) path. This is a fairly standard auth flow for GCP, and a common approach for production authentication with service accounts.

Looking back now, this path is actually not mentioned on the BQ profile setup page (I can’t remember whether it was a year ago, when we set this up), which may be why it got hit by the bug.

Tested our setup with #3041 and it does resolve the issue (manually adding the four lines of changes makes a previously failing run succeed in the same environment, back-to-back).

I think we should merge that change, and consider updating the docs to cover the activate-service-account authentication path.

Let me know if you have any questions. I also support making these optional in some form for those who don’t need them.

As another data point, I reproduced the same error by running the bq command directly against an external BigQuery table, and was then able to fix it by running: gcloud auth login --enable-gdrive-access

However, even after running that gcloud command, dbt is still giving me the same error. The BigQuery documentation does suggest that this is an issue with not providing the appropriate scope, so I don’t know why it’s still failing even after patching in PR #3041.

https://cloud.google.com/bigquery/external-data-drive#python

Looking at that page for guidance, I’m able to reproduce the Drive error with just a small Python snippet:

from google.cloud import bigquery
import google.auth

# Request both the Drive and BigQuery scopes when the credentials are created.
credentials, project = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/drive",
        "https://www.googleapis.com/auth/bigquery",
    ]
)

# Query an external table backed by a Google Sheet using the scoped credentials.
client = bigquery.Client(credentials=credentials, project=project)
query_job = client.query(
    "SELECT count(some_field) AS c "
    "FROM my_dataset.some_external_table_linked_to_a_google_sheet"
)
print(list(query_job))

So this seems to me like an issue with the BigQuery client code itself. Still digging…

Just tried it out. Can confirm that the v0.19.1rc2 release resolves the issue for our setup. 🙌

Just in case this helps anyone trying to use the OAuth path within dbt from a GCP context: even without this regression (i.e. the credentials you create request the https://www.googleapis.com/auth/drive scope, as in 0.18 and 0.19.1), if the scopes you enable on your cluster (e.g. GKE) are a subset of https://www.googleapis.com/auth/drive, such as https://www.googleapis.com/auth/spreadsheets.readonly, then the request made by dbt may fail with the error described in this bug (Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.), even if you only practically use that subset.

I didn’t dig much further than expanding the scope on GKE from https://www.googleapis.com/auth/spreadsheets.readonly to https://www.googleapis.com/auth/drive and getting this to succeed on 0.18, so other configurations may work, but I’m doubtful that anything less than enabling the specific scope you’re requesting will allow a request to succeed.
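In case it helps others on GKE: node OAuth scopes are set when the cluster (or node pool) is created, for example (the cluster name is a placeholder):

$ gcloud container clusters create my-cluster \
    --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive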


I think dbt should consider making scopes a configurable profile / project config option for BigQuery (a sketch follows below). If you went down the local gcloud path and set your default scopes to include https://www.googleapis.com/auth/drive, etc., you wouldn’t need to set this option for local runs, so perhaps it could default to just bigquery. If you were running somewhere remote as a service account, you can’t set default scopes, so you might need to set this option. Making it configurable would also mean you could specify something narrower, like https://www.googleapis.com/auth/spreadsheets.readonly, if that were all you needed. And if you were running in that context and your request failed with insufficient scopes, a pointer to the dbt docs describing the config option would be extremely helpful.
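For illustration only, such an option might look like the following in profiles.yml; the scopes key is the hypothetical config being proposed here, not an existing setting, and the other values are placeholders:

my-bigquery-profile:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: analytics
      threads: 4
      # Hypothetical option: narrow the requested scopes to exactly what's needed.
      scopes:
        - https://www.googleapis.com/auth/bigquery
        - https://www.googleapis.com/auth/spreadsheets.readonly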

@walshie4 Heroic! Thanks for the detailed investigative work.

we use a service account in an environment with no default credentials available and authenticate by activating the service account with gcloud auth activate-service-account which does not allow for passing scopes, but will cause the dbt auth code to still go down that (OAuth) path

It all becomes clear. If using a service account to authenticate via the gcloud-style oauth method, then the scopes passed by dbt here go unused, and no scopes were specified by gcloud auth, so voilà: scopes are missing.
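For what it’s worth, google-auth does provide a way to attach scopes to already-created default credentials when the underlying credential type supports it. A minimal sketch of that pattern (not necessarily what #3041 does):

import google.auth
from google.auth.credentials import with_scopes_if_required

# Load whatever credentials gcloud has made available as application defaults.
credentials, project = google.auth.default()

# Attach the scopes dbt needs if this credential type requires explicit
# scoping (e.g. service accounts); it's a no-op for credentials that don't.
credentials = with_scopes_if_required(
    credentials,
    [
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/drive",
    ],
)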

Given that, I’m happy merging #3041 and sneaking this into v0.19.1. We’ll likely want to cut a second release candidate, and ask you folks to test it out.

I’d also welcome a PR here that updates “Local OAuth gcloud setup” to cover the Service Account OAuth gcloud setup path as well (sketched below).
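For reference, the flow in question looks something like this; the account and key file are placeholders, and note that, unlike application-default login, there is no --scopes flag here:

$ gcloud auth activate-service-account dbt-runner@my-project.iam.gserviceaccount.com \
    --key-file=/path/to/keyfile.json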

Can also confirm that this works once you re-authenticate with the required scopes! 🥳

@walshie4 We just merged the fix in #3041 and included it in v0.19.1rc2. When you get a chance, could you try installing the release candidate and authenticating via the magical service account / gcloud oauth combo?

Finally got around to testing #3041, and it appears not to solve the problem in my local reproduction setup. I added a print in here to log which scopes are being passed. It still errors out:

(env) adwalsh @ ~/honey/core-analytics {git:(v19*)} dbt run -m test_model                                                    [03/16/21 12:13:38]
Running with dbt=0.19.0
Found 577 models, 445 tests, 2 snapshots, 1 analysis, 540 macros, 0 operations, 10 seed files, 195 sources, 0 exposures

AUTHING WITH SCOPES: ('https://www.googleapis.com/auth/bigquery', 'https://www.googleapis.com/auth/cloud-platform', 'https://www.googleapis.com/auth/drive')
12:14:23 | Concurrency: 100 threads (target='test')
12:14:23 |
12:14:23 | 1 of 1 START table model adw.test_model.............................. [RUN]
12:14:24 | 1 of 1 ERROR creating table model adw.test_model..................... [ERROR in 1.26s]
12:14:24 |
12:14:24 | Finished running 1 table model in 5.29s.

Completed with 1 error and 0 warnings:

Database Error in model test_model (models/test_model.sql)
  Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.
  compiled SQL at target/run/honey_dbt/models/test_model.sql

Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
dbt run -m test_model  45.88s user 1.19s system 93% cpu 50.153 total

After running

gcloud auth application-default login \
  --scopes=https://www.googleapis.com/auth/bigquery,\
https://www.googleapis.com/auth/drive.readonly

the problem is fixed on both version 0.19 and the #3041 branch (which makes me question whether #3041 is actually changing anything).

I’m still looking into why this isn’t working when running as a service account that has been given access to both the document and BQ. Will report back if I find anything of value.

I’m not optimistic that #3041 will actually change anything, since the oauth method just uses the scopes that were passed to the gcloud auth application-default command. I support adding better error messages, as outlined in option 2 of the summary comment below.

To summarize from the conversation over in #2953:

  • If you’re connecting dbt to BigQuery via the service-account, service-account-json, or oauth-secrets methods, dbt will use the scopes defined here (reproduced just after this list).
  • If you’re using the oauth-based connection method (i.e. the gcloud SDK), dbt will not use those scopes, but will instead rely on the scopes defined in gcloud auth application-default.
  • It’s likely that dbt requires more expansive scopes today than it strictly needs. That’s the primary question raised by #2953, and we’ll continue that conversation over there.
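For reference, the scope set in question, as printed in the debug output earlier in this thread, is:

SCOPES = (
    "https://www.googleapis.com/auth/bigquery",
    "https://www.googleapis.com/auth/cloud-platform",
    "https://www.googleapis.com/auth/drive",
)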

Given the confusion this issue stirred up in the meantime, there’s definitely appetite to clarify this behavior, by documenting it better or making it more intuitive. Right now, the Drive scope is “optional” for users who authenticate and develop locally via gcloud/oauth. The original intent was to make it so folks didn’t have to give dbt access to all their personal Drive files if they aren’t actually using Drive as a data source. (From the docs: “If your dbt project does not transform data in Google Sheets, then you may omit the --scopes flag.”) Since gcloud/oauth is likely based on a personal email address, requesting the Drive scope there means granting access to a lot more than just GSheet data sources! That feels different from authentication via a dedicated service account, where it’s feasible to strictly control access to only the Drive files needed for a specific purpose, so dbt can bake in the Drive scope request every time.

Is making that distinction still desirable? I think there are two valid answers:

  1. No: It’s not worth the confusion. The get_bigquery_defaults() method should try to grab Drive scopes regardless (= what #3041 attempts).
  2. Yes: dbt shouldn’t over-eagerly grab Drive scopes without active user input. That said, we should better handle the exception Permission denied while getting Drive credentials. If a user is connecting via the oauth method, the error message should be wrapped in a bit of context and link out to the relevant docs.

Let’s use this issue to discuss which of these two options makes sense, and plan for a code change that will help us avoid this confusion in the future.

Hey @walshie4! Could you try installing the fix from #3041 in a virtualenv, and see if that removes the error? Something like:

$ git clone https://github.com/yu-iskw/dbt.git
$ cd dbt
$ git checkout issue-3040
$ python3 -m venv env
$ source env/bin/activate
$ pip install -e ./core -e ./plugins/bigquery

I haven’t yet been able to replicate this error myself, so hearing that the proposed fix resolves the issue is a good indication that it’s the right change.

We’re using BigQuery with dbt version 0.18.1.

We’re getting that same Database Error: Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.

then you can try installing dbt (in a virtual environment) from the branch with the proposed fix

@jtcohen6 we installed that patch branch and got the exact same error. We couldn’t locate anything in the logs either.