great_expectations: Can't re-run a checkpoint that was previously saved

Describe the bug
We have created a Spark data source, a data asset, a batch request, an expectation suite, and a checkpoint, and when we run it the first time, it works.

When we try to load the checkpoint (instead of creating it) and run it again, it fails with the following error:

InvalidBatchSpecError: RuntimeDataBatchSpec batch_data cannot be None

To Reproduce


import os
import great_expectations as gx

table_name = 'your_table_name'

# GX names
dataframe_asset_name = 'my_asset'
datasource_name = 'my_datasource'
expectation_suite_name = "my_expectation_suite"
checkpoint_name = "my_checkpoint"

# Set the GX Cloud credentials and create the context
os.environ["GX_CLOUD_ACCESS_TOKEN"] = admin_ge_token
os.environ["GX_CLOUD_ORGANIZATION_ID"] = organization_id

context = gx.get_context()

# Read the spark dataframe
dataframe = spark.read.table(table_name)

# Create the datasource
dataframe_datasource = context.sources.add_or_update_spark(name=datasource_name)

# Create the asset
dataframe_asset = dataframe_datasource.add_dataframe_asset(name=dataframe_asset_name, dataframe=dataframe)

# Create the batch request
batch_request = dataframe_asset.build_batch_request()

# Create the validator
context.add_expectation_suite(expectation_suite_name=expectation_suite_name)
validator = context.get_validator(batch_request=batch_request, expectation_suite_name=expectation_suite_name)

# Add some rules
validator.expect_column_values_to_not_be_null(column="name")

# Retrieve the expectation suite
expectation_suite = validator.get_expectation_suite(expectation_suite_name)

checkpoint_config = {
    "name": checkpoint_name,
    "action_list": [],
    "validations": [{
        "expectation_suite_name": expectation_suite.expectation_suite_name,
        "expectation_suite_ge_cloud_id": expectation_suite.ge_cloud_id,
        "batch_request": {
            "datasource_name": dataframe_datasource.name,
            "data_asset_name": dataframe_asset.name,
        },
    }],
    "config_version": 1,
    "class_name": "Checkpoint",
}

checkpoint = context.add_or_update_checkpoint(**checkpoint_config)

At this point, everything should appear in the UI and you can run your checkpoint; it works fine. Now kill your notebook (or otherwise end the session), then re-load the checkpoint and run it as follows:

import os
import great_expectations as gx

os.environ["GX_CLOUD_ACCESS_TOKEN"] = admin_ge_token
os.environ["GX_CLOUD_ORGANIZATION_ID"] = organization_id

context = gx.get_context()
checkpoint = context.get_checkpoint('my_checkpoint')
checkpoint.run()

You should see: InvalidBatchSpecError: RuntimeDataBatchSpec batch_data cannot be None
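Our working hypothesis for the failure (illustrated below with toy classes, not actual GX internals): the checkpoint saved to GX Cloud records only identifiers for the dataframe asset, while the Spark dataframe itself lives only in the original Python process, so a checkpoint reloaded in a fresh session has no batch data to validate.

```python
# Toy illustration (hypothetical classes, not GX internals): serializing an
# in-memory dataframe asset keeps only its identifiers, so a fresh session
# that rebuilds the asset from the saved config has no dataframe to validate.
import json


class ToyDataFrameAsset:
    def __init__(self, name, dataframe=None):
        self.name = name
        self.dataframe = dataframe  # exists only in the live process

    def to_config(self):
        # Only identifiers survive serialization, never the dataframe itself
        return {"name": self.name}


# First session: the asset holds a real (here: toy) dataframe
asset = ToyDataFrameAsset("my_asset", dataframe=[{"name": "alice"}])
saved = json.dumps(asset.to_config())

# Second session: rebuild the asset from the saved config
reloaded = ToyDataFrameAsset(**json.loads(saved))
print(reloaded.dataframe)  # None -- analogous to "batch_data cannot be None"
```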

Note: we ran this code in a Databricks notebook with Great Expectations 0.17.22. We also tried 0.17.14 and got the same error.

Expected behavior
We should be able to re-run the checkpoint.
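Until then, one workaround consistent with this behavior (a sketch, untested here; the helper name is ours, and it assumes the 0.17.x fluent API where `DataFrameAsset.build_batch_request` accepts a `dataframe` argument) is to re-supply the in-memory dataframe at run time, since dataframe assets are not persisted to GX Cloud:

```python
# Hedged workaround sketch (untested): re-attach the in-memory dataframe
# before re-running a reloaded checkpoint. The function name and defaults
# are ours; the GX calls assume the 0.17.x fluent API.
def rerun_checkpoint_with_dataframe(context, dataframe,
                                    datasource_name='my_datasource',
                                    asset_name='my_asset',
                                    checkpoint_name='my_checkpoint'):
    """Reload a saved checkpoint and hand the dataframe back to its asset."""
    # Fetch the persisted asset from the cloud-backed context
    asset = context.get_datasource(datasource_name).get_asset(asset_name)
    # Re-attach the freshly read dataframe to build a usable batch request
    batch_request = asset.build_batch_request(dataframe=dataframe)
    # Run the saved checkpoint against that batch request
    checkpoint = context.get_checkpoint(checkpoint_name)
    return checkpoint.run(batch_request=batch_request)
```

In a fresh session this would be called as `rerun_checkpoint_with_dataframe(context, spark.read.table(table_name))` after `context = gx.get_context()`.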

Environment (please complete the following information):

  • Operating System: Notebook on Databricks using DBR 13.1
  • Great Expectations Version: 0.17.22
  • Data Source: Spark
  • Cloud environment: Databricks on AWS

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

Hi @babjiloganda, excellent! Thanks for confirming and glad we got it sorted - I’ll close the tickets.