great_expectations: Can't re-run a checkpoint that was previously saved
Describe the bug We created a Spark data source, a data asset, a batch request, an expectation suite, and a checkpoint. The first run works. But when we load the checkpoint (instead of creating it) and run it again, it fails with the following error:
```
InvalidBatchSpecError: RuntimeDataBatchSpec batch_data cannot be None
```
To Reproduce

```python
import os
import great_expectations as gx

table_name = "your_table_name"

# GX names
dataframe_asset_name = "my_asset"
datasource_name = "my_datasource"
expectation_suite_name = "my_expectation_suite"
checkpoint_name = "my_checkpoint"

# Create the context
os.environ["GX_CLOUD_ACCESS_TOKEN"] = admin_ge_token
os.environ["GX_CLOUD_ORGANIZATION_ID"] = organization_id
context = gx.get_context()

# Read the Spark dataframe
dataframe = spark.read.table(table_name)

# Create the datasource
dataframe_datasource = context.sources.add_or_update_spark(name=datasource_name)

# Create the asset
dataframe_asset = dataframe_datasource.add_dataframe_asset(
    name=dataframe_asset_name, dataframe=dataframe
)

# Create the batch request
batch_request = dataframe_asset.build_batch_request()

# Create the validator
context.add_expectation_suite(expectation_suite_name=expectation_suite_name)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name=expectation_suite_name
)

# Add some rules
validator.expect_column_values_to_not_be_null(column="name")

# Create the checkpoint
expectation_suite = validator.get_expectation_suite(expectation_suite_name)
checkpoint_config = {
    "name": checkpoint_name,
    "action_list": [],
    "validations": [{
        "expectation_suite_name": expectation_suite.expectation_suite_name,
        "expectation_suite_ge_cloud_id": expectation_suite.ge_cloud_id,
        "batch_request": {
            "datasource_name": dataframe_datasource.name,
            "data_asset_name": dataframe_asset.name,
        },
    }],
    "config_version": 1,
    "class_name": "Checkpoint",
}
checkpoint = context.add_or_update_checkpoint(**checkpoint_config)
```
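Within the same session, running the saved checkpoint needs no overrides, because the in-memory dataframe is still attached to the asset that the checkpoint's batch request references (a sketch using the `checkpoint` object created above):

```python
# First run, same session: the dataframe is still held in memory
result = checkpoint.run()
print(result["success"])
```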
At this point, everything should appear in the UI and you can run your checkpoint; it works fine. Now restart your notebook (or otherwise start a fresh session), then reload the checkpoint and run it as follows:
```python
import os
import great_expectations as gx

os.environ["GX_CLOUD_ACCESS_TOKEN"] = admin_ge_token
os.environ["GX_CLOUD_ORGANIZATION_ID"] = organization_id
context = gx.get_context()

checkpoint = context.get_checkpoint("my_checkpoint")
checkpoint.run()
```
This fails with:

```
InvalidBatchSpecError: RuntimeDataBatchSpec batch_data cannot be None
```
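The likely cause (our reading of the maintainers' responses in this thread): a dataframe asset holds runtime, in-memory data that is never persisted with the checkpoint, so in a fresh session there is no `batch_data` to validate. A sketch of a workaround, assuming the dataframe is re-registered in each new session before the checkpoint runs (names reuse the snippet above; `checkpoint.run(batch_request=...)` is used as a run-time override):

```python
import os
import great_expectations as gx

os.environ["GX_CLOUD_ACCESS_TOKEN"] = admin_ge_token
os.environ["GX_CLOUD_ORGANIZATION_ID"] = organization_id
context = gx.get_context()

# Re-read the data and re-attach it to the asset: in-memory dataframes
# are not saved with the checkpoint, so they must be supplied each session
dataframe = spark.read.table(table_name)
datasource = context.sources.add_or_update_spark(name="my_datasource")
asset = datasource.add_dataframe_asset(name="my_asset", dataframe=dataframe)

# Build a fresh batch request carrying the in-memory data and pass it
# to the saved checkpoint at run time
batch_request = asset.build_batch_request()
checkpoint = context.get_checkpoint("my_checkpoint")
result = checkpoint.run(batch_request=batch_request)
```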
Note: we ran this code in a Databricks notebook using Great Expectations 0.17.22. We also tried 0.17.14 and got the same error.
Expected behavior We should be able to re-run the checkpoint.
Environment (please complete the following information):
- Operating System: Notebook on Databricks using DBR 13.1
- Great Expectations Version: 0.17.22
- Data Source: Spark
- Cloud environment: Databricks on AWS
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 16 (8 by maintainers)
Hi @babjiloganda, excellent! Thanks for confirming and glad we got it sorted - I’ll close the tickets.