aws-sdk-pandas: timestamps on athena.read_sql_query wrap around to give wrong value in pandas

Describe the bug

When reading timestamp values that are out of bounds for Timestamp[ns], the values wrap around to different timestamps (as far as I can tell). This only seems to happen when you set ctas_approach=True in the athena.read_sql_query function; with ctas_approach=False the data is read in correctly. So I guess there is a mismatch between how Athena writes the data to parquet and how wrangler reads that parquet back in?

To Reproduce

Apologies, this is fairly long as I tried to make it fully reproducible. If you already have tables with columns containing out-of-bounds (for ns) timestamps, you can skip to the # post-setup comment in the code.

# Create a table with timestamps that are out of bounds Timestamp[ns]
import awswrangler as wr
import pandas as pd
import datetime

i = pd.Series([0,1,2])
d = pd.Series([datetime.date(1000, 1, 1), datetime.date(2020,1,1), datetime.date(3000, 1, 1)])
dt = pd.Series(["1000-01-01 00:00:00", "2020-01-01 00:00:00", "3000-01-01 00:00:00"])
df = pd.DataFrame({"i": i, "d":d, "dt": dt})

# Create the database
database = "isichei_test"
wr.catalog.create_database(database)
table_path = "<wherever you want to write the data>"
wr.s3.to_parquet(
    df=df,
    path=table_path,
    index=False,
    dataset=True,
    database=database,
    table='test_table',
    mode="overwrite",
    dtype={"i": "int", "d": "date", "dt": "string"} # cannot specify timestamp as wrangler errors out here (due to same pandas TS casting issues)
)

# post-setup
## If you already have parquet tables with timestamps stored as Timestamp[s] you can just query them (i.e. SELECT *) - the cast in the SQL query below is only needed for this example. In reality we discovered this when querying one of our tables that is written to parquet using Spark.

df_ctas = wr.athena.read_sql_query("SELECT *, cast(dt as timestamp) AS dt_timestamp FROM isichei_test.test_table", database="isichei_test")

df_ctas.dt_timestamp
# returns: 2169-02-08 23:09:07.419103232, 2020-01-01 00:00:00.000000000, 1830-11-23 00:50:52.580896768

Some further notes

In the example above, if you set ctas_approach=False you will get a pandas error from trying to apply the .dt accessor to something that isn't a datetime column. However, on our tables where the data is written to parquet by other means (Spark or pyarrow), querying those values with ctas_approach=False does not produce the timestamp wrapping error (see the sketch below). If you need it, I can try to create a reproducible example that writes parquet files to S3 using Arrow (or similar), but hopefully this is enough?
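
For reference, this is roughly how we query those Spark-written tables without CTAS (a minimal sketch; the database and table names below are placeholders, not our real ones):

import awswrangler as wr

# Bypass the CTAS path; on our Spark-written parquet tables the out-of-bounds
# timestamps come back correctly this way (placeholder database/table names).
df_no_ctas = wr.athena.read_sql_query(
    "SELECT * FROM my_database.my_spark_table",
    database="my_database",
    ctas_approach=False,
)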

For our ETL pipelines we now use Arrow to read in the data and then convert the timestamps to pandas object arrays to avoid the Timestamp[ns] limits (roughly as sketched below). I don't know if there is an option in this function or package to do the same (or point me in the direction to PR it)? I am not 100% sure whether this is actually a pandas read-in issue rather than something like a mismatch in the parquet metadata between the Athena engine and wrangler. Finally, dates are unaffected; the resulting output for the d column was correct.
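
For context, our workaround looks roughly like the sketch below (the path is a placeholder); pyarrow's to_pandas has a timestamp_as_object flag that keeps timestamps as Python datetime objects instead of coercing them to Timestamp[ns]:

import pyarrow.parquet as pq

# Read the parquet data with Arrow, then convert timestamps to Python
# datetime objects (object dtype) so out-of-bounds values are preserved.
table = pq.read_table("local_copy_of_table.parquet")  # placeholder path
df = table.to_pandas(timestamp_as_object=True)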

Also worth noting we are still using the Athena v1 engine.

env:

awswrangler         2.3.0
pandas              1.2.0
pyarrow             2.0.0


Most upvoted comments

Hi @jaidisido I might have time to try and apply a fix next week

Thank you for the clear explanation @isichei, please keep us posted on the outcome of the Arrow issue. Closing for now but feel free to reopen in the future

Apologies @jaidisido, you’re right. It looks like the issue is specifically how pyarrow deals with the Parquet INT96 type. Athena writes parquet data using Parquet format version 1.0, which uses the INT96 type for timestamps (deprecated in later Parquet versions (source)).

When Arrow reads a parquet file with a physical type of INT96 it will always infer that the column should be a timestamp[ns] Arrow type, and the wrap-around then happens on conversion to the Arrow table; this still occurs without error in pyarrow v2 and v3. Interestingly, the example in ARROW-6779 now throws an out-of-bounds error if run in pyarrow v3.0.0.

As far as I can tell, the conversion from parquet type to arrow type happens in the Cython code for the Parquet reader in the pyarrow library and is not exposed through pyarrow's Python APIs, so you can't override it yourself. I've been able to create a round-trip example using only pyarrow (roughly as below), so I will raise a ticket with them when I get the chance.
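
The round trip looks roughly like this (a sketch of what I'll put in the Arrow ticket; the file name is arbitrary):

import datetime
import pyarrow as pa
import pyarrow.parquet as pq

# Write a timestamp column using the deprecated INT96 physical type
# (what the Athena v1 engine / Parquet format 1.0 writer produces).
ts = [datetime.datetime(1000, 1, 1), datetime.datetime(3000, 1, 1)]
table = pa.table({"dt": pa.array(ts, type=pa.timestamp("ms"))})
pq.write_table(table, "int96_example.parquet", use_deprecated_int96_timestamps=True)

# On read, the INT96 column is inferred as timestamp[ns], so the
# out-of-range values wrap around rather than raising.
print(pq.read_table("int96_example.parquet").column("dt"))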

Thanks for your help!