great_expectations: not able to connect to BigQuery instance

Describe the bug When connecting to a BigQuery instance, print(validator.head()) raises

for (metric_name, metric_configuration) in metrics.items()
KeyError: ('table.head', 'batch_id=7d60dd00fbc6ebeac60da5ea671dccb0', '04166707abe073177c1dd922d3584468')

To Reproduce I am trying to configure the great_expectations tool

https://greatexpectations.io/

and here is the documentation I followed:

https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/database/bigquery

and I was able to create the following script:

import os

from ruamel import yaml

import great_expectations as ge
from great_expectations.core.batch import BatchRequest, RuntimeBatchRequest
gcp_project = 'my-gcp-project'
bigquery_dataset = 'my-dataset'

CONNECTION_STRING = f"bigquery://{gcp_project}/{bigquery_dataset}"

context = ge.get_context()
datasource_config = {
    "name": "my_bigquery_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": "bigquery://my-gcp-project/my-dataset",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "name": "whole_table",
        },
    },
}

# overwrite the placeholder connection string with the real one
datasource_config["execution_engine"]["connection_string"] = CONNECTION_STRING

# sanity-check the config, then register the datasource with the context
context.test_yaml_config(yaml.dump(datasource_config))

context.add_datasource(**datasource_config)

batch_request = RuntimeBatchRequest(
    datasource_name="my_bigquery_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="json_data",  # this can be anything that identifies this data
    runtime_parameters={"query": "SELECT * from mytable where colA='someVal' LIMIT 10"},
    batch_identifiers={"default_identifier_name": "default_identifier"},
    batch_spec_passthrough={
        "bigquery_temp_table": "ge_temp"
    },  # this is the name of the table you would like to use as a 'temp_table'
)

context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())

assert isinstance(validator, ge.validator.validator.Validator)

However, when I run this I get:

Traceback (most recent call last):
  File "geBq.py", line 55, in <module>
    print(validator.head())
  File "/some/path/python3.7/site-packages/great_expectations/validator/validator.py", line 1620, in head
    "fetch_all": fetch_all,
  File "/some/path/python3.7/site-packages/great_expectations/validator/validator.py", line 364, in get_metric
    return self.get_metrics({"_": metric})["_"]
  File "/some/path/python3.7/site-packages/great_expectations/validator/validator.py", line 359, in get_metrics
    for (metric_name, metric_configuration) in metrics.items()
  File "/some/path/python3.7/site-packages/great_expectations/validator/validator.py", line 359, in <dictcomp>
    for (metric_name, metric_configuration) in metrics.items()
KeyError: ('table.head', 'batch_id=7d60dd00fbc6ebeac60da5ea671dccb0', '04166707abe073177c1dd922d3584468')

What does this error mean, and how can I resolve it? I just want to connect to my BigQuery table and then run some validations on it. By the way, I am able to query this table in the BigQuery console.

Expected behavior I should be able to successfully connect Great Expectations to my data in the BigQuery instance.

Environment (please complete the following information):

  • Operating System: Linux
  • Great Expectations Version: 0.13.37

Additional context Please let me know if I can add any other details.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 18 (7 by maintainers)

Most upvoted comments

@pmariac With BatchRequest it works, but with RuntimeBatchRequest it returns that error as well. I think it is because of the following (this is my code, which returns the same error as yours; I debugged it a little and came across this query error):

> runtime_batch_request = RuntimeBatchRequest(
>     datasource_name=datasource_name,
>     data_connector_name="default_runtime_data_connector_name",
>     data_asset_name="default_name",
>     runtime_parameters={"query": "SELECT * from DATASET_ID.TABLE_NAME LIMIT 10"},
>     batch_identifiers={"default_identifier_name": "default_identifier"},
>     batch_spec_passthrough={
>         "create_temp_table": False,
>         "table_name": "{}".format(table_name),
>         "schema_name": "{project_id}.{dataset_id}".format(project_id=project_id_spec_passthrough, dataset_id=dataset_id),
>     },
> )

This fails with:

> google.cloud.bigquery.dbapi.exceptions.DatabaseError: 400 Syntax error: Unexpected keyword SELECT at [2:6]
> -----Query Job SQL Follows-----
> | . | . | . | . | . |
> 1:SELECT `COLUMN_A` AS `unexpected_values`
> 2:FROM SELECT * from DATASET_ID.TABLE_NAME LIMIT 10
> 3:WHERE `COLUMN_A` IS NULL
> 4: LIMIT @`param_1`
> | . | . | . | . | . |

I have already tried different types of queries, with and without parentheses, etc., but it does not seem to work: the generated SQL expects a table after the FROM, and subqueries do not work either.
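For reference, a minimal sketch of the BatchRequest path that reportedly works, reusing the datasource, inferred connector, context, and suite from the original script above; the table name is a placeholder:

from great_expectations.core.batch import BatchRequest

# Sketch of the working BatchRequest variant. "mytable" is a placeholder
# for a real table in the dataset; the inferred connector resolves it.
batch_request = BatchRequest(
    datasource_name="my_bigquery_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="mytable",
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())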

@talagluck I think it’s worth reopening this issue.

Got the exact same error here.

I have the BigQuery Read Session User permission. I'm authenticating using:

gcloud auth login                                                                                                      
gcloud auth application-default login  
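(A minimal sketch, independent of Great Expectations, to confirm that the resulting Application Default Credentials can reach BigQuery at all; the project ID is a placeholder:)

# Sanity check that the ADC set by the gcloud commands above works.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project
print(list(client.query("SELECT 1 AS ok").result()))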

At first, everything works as expected:

  • The “temp” table is created and contains expected items
  • The datasource is successfully instantiated and saved
  • The Suite is created

But when calling the validator.head() method I get the same error as @abtpst:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_481/3293600047.py in <module>
----> 1 validator.head(n_rows=5, fetch_all=False)

~/.local/lib/python3.8/site-packages/great_expectations/validator/validator.py in head(self, n_rows, domain_kwargs, fetch_all)
   1632             }
   1633 
-> 1634         data: Any = self.get_metric(
   1635             metric=MetricConfiguration(
   1636                 metric_name="table.head",

~/.local/lib/python3.8/site-packages/great_expectations/validator/validator.py in get_metric(self, metric)
    365     def get_metric(self, metric: MetricConfiguration) -> Any:
    366         """return the value of the requested metric."""
--> 367         return self.get_metrics({"_": metric})["_"]
    368 
    369     def graph_validate(

~/.local/lib/python3.8/site-packages/great_expectations/validator/validator.py in get_metrics(self, metrics)
    358             metrics=resolved_metrics,
    359         )
--> 360         return {
    361             metric_name: resolved_metrics[metric_configuration.id]
    362             for (metric_name, metric_configuration) in metrics.items()

~/.local/lib/python3.8/site-packages/great_expectations/validator/validator.py in <dictcomp>(.0)
    359         )
    360         return {
--> 361             metric_name: resolved_metrics[metric_configuration.id]
    362             for (metric_name, metric_configuration) in metrics.items()
    363         }

KeyError: ('table.head', 'batch_id=58be3851fb1261d02a0f5918a9adef39', '04166707abe073177c1dd922d3584468')

Hi @abtpst! We’ve recently done some work around RuntimeBatchRequest. Would you please confirm whether this is still an issue?

Gotcha - thanks for the update, @FredRodrigues! We currently have a few other RuntimeBatchRequest issues in progress, so I was wondering if this was related, but this feels specific to BigQuery. I’m going to re-open this, and take a look over the next week.

I currently have the exact same issue. At first, it was a permissions problem, because I only have read-only access to that dataset; I solved it by setting create_temp_table to False. Now I am hitting the same error as you. I even reproduced the exact same code, only changing the project ID, and the error happens on the dataset where I have read-only access. In the project where I have full access, the code works. Hope this info helps.
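For later readers, a minimal sketch of that read-only workaround, assuming the datasource from the original report; the query, table, and asset names are placeholders:

from great_expectations.core.batch import RuntimeBatchRequest

# Sketch of the workaround: tell Great Expectations not to create a
# temp table in BigQuery (useful with read-only datasets).
# Datasource, query, and identifier values are placeholders.
batch_request = RuntimeBatchRequest(
    datasource_name="my_bigquery_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="readonly_data",
    runtime_parameters={"query": "SELECT * FROM `my-gcp-project.my-dataset.mytable` LIMIT 10"},
    batch_identifiers={"default_identifier_name": "default_identifier"},
    batch_spec_passthrough={"create_temp_table": False},
)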