snowflake-connector-python: SNOW-259668: write_pandas fails for some datetime64 values using PyArrow backend

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using (python --version)?

Python 3.8.5

  2. What operating system and processor architecture are you using (python -c 'import platform; print(platform.platform())')?

macOS-10.16-x86_64-i386-64bit

  3. What are the component versions in the environment (pip freeze)?

asn1crypto==1.4.0
attrs==20.3.0
docopt==0.6.2
ecdsa==0.16.1
Flask==1.1.2
future==0.18.2
jmespath==0.10.0
jsonformatter==0.2.3
jsonschema==3.1.1
more-itertools==8.6.0
numpy==1.19.5
oscrypto==1.2.1
packaging==20.8
paho-mqtt==1.5.1
pandas==1.0.3
pandasql==0.7.3
pbr==5.5.1
pluggy==0.13.1
py==1.10.0
pyarrow==2.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycryptodome==3.9.9
pycryptodomex==3.9.9
PyJWT==1.7.1
pykwalify==1.7.0
pyOpenSSL==20.0.1
pytest==5.4.3
python-box==5.2.0
python-dateutil==2.8.1
python-dotenv==0.12.0
python-jose==3.1.0
pytz==2020.5
PyYAML==5.3.1
requests==2.23.0
rsa==4.7
six==1.15.0
snowflake-connector-python==2.3.7
stevedore==3.3.0
urllib3==1.25.11
wcwidth==0.2.5
Werkzeug==0.16.1

  4. What did you do?

I invoked snowflake.connector.pandas_tools.write_pandas on a DataFrame with a column of type datetime64[ns] (using PyArrow as the default backend for ParquetWriter).
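For illustration, a minimal sketch of this kind of call; the connection parameters, table name, and timestamp value below are placeholders, not details from the original report:

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Placeholder connection details
con = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    database="<database>",
    schema="<schema>",
    warehouse="<warehouse>",
)

# DataFrame with a datetime64[ns] column carrying sub-millisecond precision
df = pd.DataFrame({
    "ID": ["a"],
    "TS": [pd.Timestamp("2021-01-15 12:34:56.123456789")],
})

write_pandas(con, df, "MY_TABLE")  # fails as described below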

  5. What did you expect to see?

I expected the datetime data to be written to the database verbatim, with nanosecond precision.

  6. What did you see instead?

write_pandas fails when Arrow tries writing to Parquet using the default arguments: pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data


This problem, IIUC, is related to the PyArrow defaults: for compatibility reasons PyArrow defaults to Parquet version='1.0' behavior, which only supports ms timestamps, and it also defaults to allow_truncated_timestamps=False, which raises an exception whenever timestamp precision is lost during writing. The end result is that writing ns-precision timestamps always requires truncation to ms precision and therefore always raises an exception.

Since Snowflake supports ns-precision timestamps, I would expect defaults that allow them to be written from DataFrames without error. However, since I imagine it's messy supporting various Parquet backends, I think at a minimum write_pandas should accept kwargs to be passed to the Parquet writer, so users can tailor the behavior they want (and work around backend-specific problems like this one).
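To see that the failure comes from these PyArrow defaults rather than anything Snowflake-specific, here is a rough sketch (mine, assuming pyarrow 2.0.0; the file name is a placeholder) that reproduces the cast error with plain DataFrame.to_parquet and shows the writer keyword arguments that avoid it:

import pandas as pd

# A nanosecond-precision value that cannot be represented exactly at ms/us resolution
df = pd.DataFrame({"TS": [pd.Timestamp("2021-01-15 12:34:56.123456789")]})

# With the PyArrow defaults (Parquet version='1.0', allow_truncated_timestamps=False)
# this raises pyarrow.lib.ArrowInvalid, the same error write_pandas surfaces.
try:
    df.to_parquet("chunk.parquet")
except Exception as exc:
    print(type(exc).__name__, exc)

# Extra keyword arguments are forwarded to pyarrow.parquet.write_table, so either of
# these avoids the exception: the first allows the truncation, the second stores the
# values as INT96 timestamps and keeps the original precision.
df.to_parquet("chunk.parquet", allow_truncated_timestamps=True)
df.to_parquet("chunk.parquet", use_deprecated_int96_timestamps=True)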

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 25 (1 by maintainers)

Most upvoted comments

We are internally working on a more permanent solution and will provide an update next quarter.

Having similar issues with Snowpark write_pandas and createDataFrame: a pandas timestamp column

  • with a local time zone --> correct TIMESTAMP_NTZ in Snowflake
  • without a time zone --> NUMBER in Snowflake

Thanks for pointing this out!

The following snippet helped me:

df['Date'] = df['Date'].dt.tz_localize("UTC+01:00").dt.ceil(freq='ms')

I came across the same problem; I solved it by converting np.datetime64 to object: output['SUBMODEL_VALUE_DATE'] = output['SUBMODEL_VALUE_DATE'].dt.strftime("%Y-%m-%d")

Hi all, we continue to work on a long-term solution internally. We're sorry about any inconvenience this issue might be causing. As a temporary workaround, we suggest passing a timezone to any timestamp values when building your pandas dataframe. The following works for millisecond and microsecond timestamps. For example, instead of using:

ts = pd.Timestamp(1621291701002, unit="ms")
pd.DataFrame([('a', ts)], columns=['ID', 'DATE_FIELD'])

you can define the dataframe as:

ts = pd.Timestamp(1621291701002, unit="ms", tz="UTC")
pd.DataFrame([('a', ts)], columns=['ID', 'DATE_FIELD'])

When the dataframe is unloaded into a parquet file and COPY'ed into Snowflake, the value will be correctly parsed (screenshot omitted).

As for ns timestamps, specifically the issue mentioned by @willsthompson, using df.to_sql will avoid the error you are receiving as that converts nanosecond timestamps to microsecond timestamps (and then using the timezone suggestion above will ensure you see valid dates in Snowflake). Once again, we apologize for any inconvenience as a result of this issue.
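For completeness, a rough sketch of that to_sql route (assuming snowflake-sqlalchemy is installed; the connection URL and table name are placeholders, not values from this thread):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder Snowflake SQLAlchemy URL
engine = create_engine(
    "snowflake://<user>:<password>@<account>/<database>/<schema>?warehouse=<warehouse>"
)

# tz-aware, nanosecond-precision timestamp, per the suggestion above
ts = pd.Timestamp("2021-05-17 22:48:21.002000001", tz="UTC")
df = pd.DataFrame([("a", ts)], columns=["ID", "DATE_FIELD"])

# to_sql inserts through SQLAlchemy instead of staging a Parquet file, and the
# nanosecond values end up at microsecond precision, so the ArrowInvalid error
# from write_pandas is not triggered.
df.to_sql("date_field_demo", engine, index=False, if_exists="append")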

…(they just released v3 early this week)…

👀 #565

Adding the parameter use_deprecated_int96_timestamps=True to write_pandas() helped me. The previous snippet by @fredrike that rounds the timestamps also worked. This approach makes it possible to preserve the original timestamp values without truncating them:

write_pandas(con, df, "TABLE_NAME", use_deprecated_int96_timestamps=True)

Having similar issues with Snowpark write_pandas and createDataFrame: a pandas timestamp column

  • with a local time zone --> correct TIMESTAMP_NTZ in Snowflake
  • without a time zone --> NUMBER in Snowflake

I solved the problem by adding allow_truncated_timestamps=True to chunk.to_parquet. According to the pyarrow docs:

Some Parquet readers may only support timestamps stored in millisecond ('ms') or microsecond ('us') resolution. Since pandas uses nanoseconds to represent timestamps, this can occasionally be a nuisance. By default (when writing version 1.0 Parquet files), the nanoseconds will be cast to microseconds ('us'). If a cast to a lower resolution value may result in a loss of data, by default an exception will be raised. This can be suppressed by passing allow_truncated_timestamps=True.

In /snowflake/connector/pandas_tools.py line 159:

chunk.to_parquet(
    chunk_path,
    compression=compression,
    allow_truncated_timestamps=True,
)

or alternatively:

chunk.to_parquet(
    chunk_path,
    compression=compression,
    use_deprecated_int96_timestamps=True,
)

Does this solve your problem too ?

It seems like this issue still persists. Moreover, for [ns] timestamps it can also affect the pandas to_sql method, even with pd_writer. Is there a known combination of versions for which this is resolved?

This is a little out of scope for this issue, but if at all possible it would be helpful to support newer versions of PyArrow (they just released v3 early this week), for compatibility and bugfixes, but also because the early versions of PyArrow required by the Snowflake connector are enormous payloads, over 200MB IIRC. Newer releases are in the 50MB range. This can have a significant impact on deployments.

Okay, starting with pandas 1.1.0 it automatically uses Parquet 2.0, which is why I started receiving invalid dates, so I will open a new issue about supporting pandas 1.1.0 and Parquet 2.0.

@plotneishestvo Thanks, yes, we’re doing something similar. This is probably a little quicker for big tables:

df['col'] = df['col'].dt.ceil(freq='ms')
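A tiny illustration of what that does (my example values, not from the thread): ceil(freq='ms') rounds any sub-millisecond component up to the next millisecond, so the later cast to lower-resolution Parquet timestamps no longer loses data.

import pandas as pd

col = pd.Series(pd.to_datetime(["2021-01-15 12:34:56.123456789"]))
print(col.dt.ceil(freq="ms"))  # -> 2021-01-15 12:34:56.124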