aws-sdk-pandas: timestamps on athena.read_sql_query wrap around to give wrong value in pandas
Describe the bug
When reading timestamp values that are out of bounds for Timestamp[ns], the values get wrapped around to different timestamps (as far as I can tell). This only seems to happen when you set ctas_approach=True in the athena.read_sql_query function; if you set it to False the values are read in correctly. So I guess there is a mismatch between how Athena writes the data to parquet and how wrangler reads the parquet file back?
To Reproduce
Soz this is fairly long as I tried to make something reproducible. If you already have tables with columns containing out-of-bounds timestamps (for ns) then you can skip to my # post-setup comment in the code.
# Create a table with timestamps that are out of bounds Timestamp[ns]
import awswrangler as wr
import pandas as pd
import datetime
i = pd.Series([0, 1, 2])
d = pd.Series([datetime.date(1000, 1, 1), datetime.date(2020, 1, 1), datetime.date(3000, 1, 1)])
dt = pd.Series(["1000-01-01 00:00:00", "2020-01-01 00:00:00", "3000-01-01 00:00:00"])
df = pd.DataFrame({"i": i, "d": d, "dt": dt})
# Create the database
database = "isichei_test"
wr.catalog.create_database(database)
table_path = "<Wherever you wanna throw your data>"
wr.s3.to_parquet(
df=df,
path=table_path,
index=False,
dataset=True,
database=database,
table='test_table',
mode="overwrite",
dtype={"i": "int", "d": "date", "dt": "string"} # cannot specify timestamp as wrangler errors out here (due to same pandas TS casting issues)
)
# post-setup
## If you already have parquet tables with timestamp[s] columns you can just query them (i.e. SELECT *) - the cast in this SQL query is only needed for this example. In reality we discovered this when querying one of our tables that is written to parquet using Spark.
df_ctas = wr.athena.read_sql_query("SELECT *, cast(dt as timestamp) AS dt_timestamp FROM isichei_test.test_table", database="isichei_test")
df_ctas.dt_timestamp
# returns: 2169-02-08 23:09:07.419103232, 2020-01-01 00:00:00.000000000, 1830-11-23 00:50:52.580896768
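The wrapped values line up exactly with 64-bit integer overflow of nanoseconds-since-epoch, which is what makes me think this is a Timestamp[ns] casting problem rather than a data problem. A quick sanity check (my own reading of the mechanism, not taken from the wrangler code):
import datetime
import numpy as np
epoch = datetime.datetime(1970, 1, 1)
for target in (datetime.datetime(1000, 1, 1), datetime.datetime(3000, 1, 1)):
    ns = (target - epoch) // datetime.timedelta(microseconds=1) * 1000  # exact nanoseconds as a Python int
    wrapped = (ns + 2**63) % 2**64 - 2**63  # two's-complement wrap into the signed 64-bit range
    print(target.date(), "->", np.datetime64(wrapped, "ns"))
# 1000-01-01 -> 2169-02-08T23:09:07.419103232
# 3000-01-01 -> 1830-11-23T00:50:52.580896768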
Some further notes
In the example above, if you set ctas_approach=False you will get a pandas error from trying to apply the .dt accessor to something that isn't a datetime column. However, on our tables where the data is written to parquet via other means (Spark or pyarrow), querying those same values with ctas_approach=False does not produce the timestamp wrapping error. If you need it, I can try to create a reproducible example that writes parquet files to S3 using arrow (or something), but hopefully this is enough.
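For completeness, the non-CTAS call we used for the comparison (a minimal sketch against the example table above):
# With ctas_approach=False wrangler parses Athena's regular (CSV) query results
# instead of a CTAS parquet output, so the parquet INT96 read path is skipped.
# On our Spark/pyarrow-written tables this returns correct timestamps; on the
# synthetic example above it instead raises the pandas .dt accessor error noted above.
df_no_ctas = wr.athena.read_sql_query(
    "SELECT *, cast(dt as timestamp) AS dt_timestamp FROM isichei_test.test_table",
    database="isichei_test",
    ctas_approach=False,
)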
For our ETL pipelines we now use Arrow to read in the data and then strictly convert the timestamps to pandas object arrays to avoid Timestamp[ns] issues (sketched just below). I don't know if you have an option in this function or package to do the same (or can point me in the direction to PR it)? I am not 100% sure whether this is actually a pandas read-in issue rather than something like a mismatch in the parquet metadata between the Athena engine and wrangler. Finally, dates are unaffected: the resulting output for the d column was correct.
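Roughly what that workaround looks like (a minimal sketch; the S3 path is illustrative, and it assumes the parquet timestamps reach Arrow with their true values so the only risky step is the conversion to datetime64[ns]):
import pyarrow.parquet as pq
# Read the parquet data with Arrow, then ask Arrow to hand timestamp columns back
# to pandas as plain Python datetime objects instead of datetime64[ns], so values
# outside the ns range are kept intact rather than wrapped.
table = pq.read_table("s3://my-bucket/path/to/table/")  # illustrative path
df = table.to_pandas(timestamp_as_object=True)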
Also worth noting we are still using the Athena v1 engine.
env:
awswrangler 2.3.0
pandas 1.2.0
pyarrow 2.0.0
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 20 (17 by maintainers)
Commits related to this issue
- Fixes #592 - Allows user to pass `coerce_int96_timestamp_unit` in a pyarrow_additional_kwargs dict. This avoids integer overflow error when reading Parquet INT96 types into Arrow. - Parquet reader wi... — committed to isichei/aws-data-wrangler by isichei 3 years ago
- Backwards compatible solution to issue #592 — committed to isichei/aws-data-wrangler by isichei 3 years ago
- Issue 592 backward compat (#949) * Fixes #592 - Allows user to pass `coerce_int96_timestamp_unit` in a pyarrow_additional_kwargs dict. This avoids integer overflow error when reading Parquet INT96... — committed to aws/aws-sdk-pandas by isichei 3 years ago
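Based on the commit messages above, the fix exposes pyarrow's `coerce_int96_timestamp_unit` option through a `pyarrow_additional_kwargs` dict; usage would look roughly like this (a sketch of the API as described in the commit message, not necessarily the exact merged signature):
# Ask the parquet reader to interpret INT96 timestamps at, e.g., millisecond
# resolution instead of forcing them through timestamp[ns], avoiding the
# integer overflow / wrap-around when reading the CTAS output.
df = wr.athena.read_sql_query(
    "SELECT * FROM isichei_test.test_table",
    database="isichei_test",
    pyarrow_additional_kwargs={"coerce_int96_timestamp_unit": "ms"},
)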
Hi @jaidisido, I might have time to try and apply a fix next week.
Thank you for the clear explanation @isichei, please keep us posted on the outcome of the Arrow issue. Closing for now but feel free to reopen in the future
Apologies @jaidisido, you're right. Looks like the issue is specifically how pyarrow deals with Parquet INT96 types. Athena writes parquet data using version 1.0 of the parquet format, which uses INT96 for timestamps (deprecated in later Parquet versions (source)). When arrow reads a parquet file with a physical type of INT96 it will always infer that the column should be a timestamp[ns] arrow type, and the wrap-around then occurs on conversion to the arrow table; this still happens without error in pyarrow v2 and v3. Interestingly, the example in ARROW-6779 now throws an out-of-bounds error if run in pyarrow v3.0.0.

As far as I can tell, the conversion from parquet type to arrow type lives in the Cython code for the Parquet reader in the pyarrow library and is not exposed through the python APIs, so you can't set it yourself. I've been able to create a round trip example just using pyarrow, so I'll raise a ticket with them when I get the chance.
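For reference, the round trip can be reproduced with pyarrow alone (a minimal sketch; it writes with the deprecated INT96 representation to mimic what Athena engine v1 produces):
import datetime
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({"dt": [datetime.datetime(3000, 1, 1)]})
# Write parquet format version 1.0 with INT96 timestamps, like Athena engine v1.
pq.write_table(table, "int96_example.parquet", use_deprecated_int96_timestamps=True, version="1.0")
# On read, pyarrow infers timestamp[ns] for the INT96 column, so the out-of-bounds
# value wraps around (or, depending on the pyarrow version, raises an out-of-bounds error).
print(pq.read_table("int96_example.parquet")["dt"])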
Thanks for your help!