aws-sdk-pandas: InvalidSchemaConvergence with redshift.copy

AWS Data Wrangler 2.0.0; this used to work in 1.x.

Below is my code snippet, which eventually throws an InvalidSchemaConvergence. The column scan_date is defined as a string in the Glue table, but as the name implies it contains timestamps or empty strings (and possibly more). The dataframe correctly lists scan_date as a string.

It seems there is some datatype inference going on which decides that the column contains both (incompatible) datatypes and then throws an error. How can I circumvent this behaviour?


    import os

    import awswrangler as wr

    # sql, table, load_date, database and mode come from the surrounding handler
    df = wr.athena.read_sql_query(
        generate_sql(f"{sql}.sql", table, load_date), database=database,
    )

    # staging prefix for the parquet files that redshift.copy writes before the COPY
    path = f"{os.getenv('ATHENA_BUCKET')}prozesszeitenloader/"
    con = wr.redshift.connect("reporting")

    wr.redshift.copy(
        df=df,
        path=path,
        con=con,
        schema="public",
        table=sql,
        mode=mode,
        iam_role=os.getenv("IAM_ROLE"),
        primary_keys=["request_id"],
    )

This is the traceback:

Traceback (most recent call last):
  File "copy_from_s3_to_redshift.py", line 276, in <module>
    handler(
  File "copy_from_s3_to_redshift.py", line 78, in handler
    copy_to_redshift(
  File "copy_from_s3_to_redshift.py", line 61, in copy_to_redshift
    wr.redshift.copy(
  File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/redshift.py", line 1190, in copy
    copy_from_files(
  File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/redshift.py", line 1015, in copy_from_files
    created_table, created_schema = _create_table(
  File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/redshift.py", line 197, in _create_table
    redshift_types = _redshift_types_from_path(
  File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/redshift.py", line 137, in _redshift_types_from_path
    athena_types, _ = s3.read_parquet_metadata(
  File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/_config.py", line 361, in wrapper
    return function(**args)
  File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/s3/_read_parquet.py", line 803, in read_parquet_metadata
    return _read_parquet_metadata(
  File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/s3/_read_parquet.py", line 152, in _read_parquet_metadata
    columns_types: Dict[str, str] = _merge_schemas(schemas=schemas)
  File "/Users/dirk/Documents/Code/reports/venv/lib/python3.8/site-packages/awswrangler/s3/_read_parquet.py", line 117, in _merge_schemas
    raise exceptions.InvalidSchemaConvergence(
awswrangler.exceptions.InvalidSchemaConvergence: Was detect at least 2 different types in column scan_date (timestamp and string).
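
For what it is worth, the failing merge can be reproduced directly against the staging prefix. Here is a minimal diagnostic sketch (assuming the same path as in the snippet above) that reads the parquet metadata file by file, so a stale file with a different scan_date type shows up instead of the whole merge raising:

    import os

    import awswrangler as wr

    path = f"{os.getenv('ATHENA_BUCKET')}prozesszeitenloader/"

    # Inspect each staging file separately; a leftover file from an earlier run
    # is a likely source of the conflicting scan_date type.
    for key in wr.s3.list_objects(path):
        columns_types, _ = wr.s3.read_parquet_metadata(key)
        print(key, columns_types.get("scan_date"))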

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

A clean S3 prefix sounds like I should just generate one per run and then “garbage collect” it once in a while?

You don’t need to clean it up yourself; just leave the argument keep_files=False (the default) and all the staging files should be deleted automatically after the COPY.

In the end you just need to provide a safe S3 prefix where Wrangler will not find old files. 😃
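
For example, here is a minimal sketch of such a “safe” prefix, generating a fresh sub-prefix per run so the schema merge can only ever see this run’s files (the uuid suffix is just an illustration, not something the library requires; df, con, sql and mode are as in the snippet above):

    import os
    import uuid

    import awswrangler as wr

    # A unique staging sub-prefix per run: no old files can be picked up,
    # and keep_files=False (the default) deletes these files after the COPY.
    path = f"{os.getenv('ATHENA_BUCKET')}prozesszeitenloader/{uuid.uuid4()}/"

    wr.redshift.copy(
        df=df,
        path=path,
        con=con,
        schema="public",
        table=sql,
        mode=mode,
        iam_role=os.getenv("IAM_ROLE"),
        primary_keys=["request_id"],
        keep_files=False,  # default; staging files are removed after the COPY
    )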

Also, I think I have a hunch of what that means, but you should definitely add that to the copy tutorial.

Will do!


Also, for version 2.1.0 I think we should include this wr.s3.delete_objects() call inside the function to automatically clean up the path before the COPY. What do you think?
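
Until then, a minimal sketch of doing that cleanup by hand before the COPY (same path, df and con as in the snippet above):

    import os

    import awswrangler as wr

    # Remove any leftover staging files so the parquet schema merge
    # only sees the files written by this run.
    wr.s3.delete_objects(path)

    wr.redshift.copy(
        df=df,
        path=path,
        con=con,
        schema="public",
        table=sql,
        mode=mode,
        iam_role=os.getenv("IAM_ROLE"),
        primary_keys=["request_id"],
    )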