aws-sdk-pandas: InvalidSchemaConvergence with redshift.copy
AWS Data Wrangler 2.0.0; this used to work in 1.x.
Below is my code snippet, which eventually throws an InvalidSchemaConvergence error. The column scan_date is defined as a string in the Glue table, but as the name implies it contains timestamps or empty strings (and possibly more). The dataframe correctly lists scan_date as a string.
It seems some datatype inference is going on that decides the column contains both (incompatible) datatypes and then throws an error. How can I circumvent this behaviour?
df = wr.athena.read_sql_query(
    generate_sql(f"{sql}.sql", table, load_date), database=database,
)
path = f"{os.getenv('ATHENA_BUCKET')}prozesszeitenloader/"
con = wr.redshift.connect("reporting")
wr.redshift.copy(
    df=df,
    path=path,
    con=con,
    schema="public",
    table=sql,
    mode=mode,
    iam_role=os.getenv("IAM_ROLE"),
    primary_keys=["request_id"],
)
This is the traceback:
Traceback (most recent call last):
File "copy_from_s3_to_redshift.py", line 276, in <module>
handler(
File "copy_from_s3_to_redshift.py", line 78, in handler
copy_to_redshift(
File "copy_from_s3_to_redshift.py", line 61, in copy_to_redshift
wr.redshift.copy(
File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/redshift.py", line 1190, in copy
copy_from_files(
File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/redshift.py", line 1015, in copy_from_files
created_table, created_schema = _create_table(
File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/redshift.py", line 197, in _create_table
redshift_types = _redshift_types_from_path(
File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/redshift.py", line 137, in _redshift_types_from_path
athena_types, _ = s3.read_parquet_metadata(
File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/_config.py", line 361, in wrapper
return function(**args)
File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/s3/_read_parquet.py", line 803, in read_parquet_metadata
return _read_parquet_metadata(
File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/s3/_read_parquet.py", line 152, in _read_parquet_metadata
columns_types: Dict[str, str] = _merge_schemas(schemas=schemas)
File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/s3/_read_parquet.py", line 117, in _merge_schemas
raise exceptions.InvalidSchemaConvergence(
awswrangler.exceptions.InvalidSchemaConvergence: Was detect at least 2 different types in column scan_date (timestamp and string).
You don’t need to clean it up yourself, just leave the argument keep_files=False (the default) and all the staging files should be deleted automatically after the COPY. In the end you just need to provide a safe S3 prefix where Wrangler will not find old files. 😃
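For reference, a minimal sketch of that approach (the bucket and table names below are placeholders, not taken from the issue); adding a per-run uuid suffix keeps the staging prefix unique, so the schema merge only ever sees files written by the current run:

# Sketch only: bucket and table names are hypothetical.
import os
import uuid

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"request_id": [1], "scan_date": ["2020-12-01 08:00:00"]})

# Unique prefix per run, so no stale Parquet files can be picked up.
path = f"s3://my-staging-bucket/prozesszeitenloader/{uuid.uuid4()}/"
con = wr.redshift.connect("reporting")

wr.redshift.copy(
    df=df,
    path=path,
    con=con,
    schema="public",
    table="prozesszeiten",
    mode="overwrite",
    iam_role=os.getenv("IAM_ROLE"),
    keep_files=False,  # default: staging files are deleted after the COPY
)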
Will do!
Also, for version 2.1.0 I think we should include a wr.s3.delete_objects() call inside the function to automatically clean up the path before the COPY. What do you think?
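Until that change ships, a minimal workaround sketch along the same lines is to clear the staging prefix yourself before calling copy (this reuses df, sql, and mode from the snippet in the issue; only the delete_objects call is new):

import os

import awswrangler as wr

path = f"{os.getenv('ATHENA_BUCKET')}prozesszeitenloader/"

# Remove leftover Parquet files from earlier runs so that the schema
# merge in read_parquet_metadata cannot pick up stale column types.
wr.s3.delete_objects(path)

con = wr.redshift.connect("reporting")
wr.redshift.copy(
    df=df,  # dataframe from the Athena query in the issue
    path=path,
    con=con,
    schema="public",
    table=sql,
    mode=mode,
    iam_role=os.getenv("IAM_ROLE"),
    primary_keys=["request_id"],
)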