pudl: Address parquet / pyarrow 1.0.0 vs. pandas tzinfo incompatibilities in timezone-aware columns
Something about the way we are specifying the arrow / parquet schema in epacems_to_parquet appears to be incompatible with Apache Arrow 1.0.0, though it works fine with Arrow 0.17.1. If you attempt to run an epacems_to_parquet conversion with arrow 1.0.0 installed, it fails while converting the operating_datetime_utc column from pandas to arrow. If the timezone is instead left out of the timestamp column's schema definition (letting it assume UTC, which is the default and correct in this case), it then fails on one of the int32 columns, complaining that a floating point value has been truncated.
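For orientation, the shape of the conversion at issue is roughly the following stripped-down sketch (made-up values, not the real CEMS schema, and not verified to trigger the failure on its own): the pandas timestamp column is parsed without a timezone, while the schema handed to Table.from_pandas declares one.

import pandas as pd
import pyarrow as pa

# Stripped-down sketch: a tz-naive pandas timestamp column paired with an
# arrow schema that declares tz="UTC" for the same column.
df = pd.DataFrame(
    {"operating_datetime_utc": pd.to_datetime(["2018-01-01 00:00", "2018-01-01 01:00"])}
)
schema = pa.schema([pa.field("operating_datetime_utc", pa.timestamp("ms", tz="UTC"))])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)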
It's not entirely clear what changed between the pre-1.0 and post-1.0 versions of Arrow to break this, but the Arrow version does seem to be the controlling factor in whether what we're doing works or fails.
See the Arrow 1.0.0 release announcement, which links to a complete changelog.
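Until the underlying change is pinned down, one stopgap (not something from the original issue, just a suggestion) is to hold the dependency at the last release known to work here, e.g. with a requirements constraint like:

# illustrative pin; 0.17.1 is the last version observed to work for this conversion
pyarrow>=0.17.1,<1.0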
To minimally recreate the bad behavior, given some pre-existing EPA CEMS datapackage outputs, you can use the following code:
import pathlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pudl

# The path to wherever your EPA CEMS outputs are:
epacems_dir = "./"
# The path to wherever you want it to create a Parquet dataset:
output_dir = "./"
year = 2018
state = "ID"

# Read one year-state of EPA CEMS data with the input dtypes that
# epacems_to_parquet expects, and add the year column used for partitioning:
df = (
    pd.read_csv(
        pathlib.Path(epacems_dir)
        / f"hourly_emissions_epacems_{year}_{state.lower()}.csv.gz",
        parse_dates=["operating_datetime_utc"],
        dtype=pudl.convert.epacems_to_parquet.create_in_dtypes(),
    )
    .assign(year=year)
)

# Convert to an Arrow Table using the explicit CEMS schema and write it out
# as a partitioned Parquet dataset; the Table.from_pandas call is where the
# conversion fails under pyarrow 1.0.0:
pq.write_to_dataset(
    pa.Table.from_pandas(
        df,
        preserve_index=False,
        schema=pudl.convert.epacems_to_parquet.create_cems_schema(),
    ),
    root_path=output_dir,
    partition_cols=["year", "state"],
    compression="snappy",
)
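Not part of the original report, but one quick way to narrow down which column the conversion chokes on is to let pyarrow infer a schema from the DataFrame and diff it against the explicit one; this rough sketch assumes the df built above:

# Compare pyarrow's inferred types with the explicit CEMS schema to see
# which columns differ (and therefore need to be cast during conversion).
inferred = pa.Table.from_pandas(df, preserve_index=False).schema
expected = pudl.convert.epacems_to_parquet.create_cems_schema()
for name in expected.names:
    if name in inferred.names and inferred.field(name).type != expected.field(name).type:
        print(name, inferred.field(name).type, "->", expected.field(name).type)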
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (12 by maintainers)
Thanks for opening the issue!
I made PRs to address both bugs: https://github.com/apache/arrow/pull/8624, https://github.com/apache/arrow/pull/8625
It's a good practice to enforce explicit schemas if you know what they're supposed to be.
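As a small illustration of that point (not from the thread itself, with a made-up column name): with an explicit schema, Table.from_pandas applies safe casting by default, so a value that does not fit the declared type raises at conversion time rather than being written with whatever type pandas happened to infer.

import pandas as pd
import pyarrow as pa

# Illustrative only: enforce int32 for a column instead of accepting pandas'
# inferred int64.
schema = pa.schema([pa.field("unit_id", pa.int32())])
table = pa.Table.from_pandas(
    pd.DataFrame({"unit_id": [1, 2]}), schema=schema, preserve_index=False
)
# A value outside the int32 range (e.g. 2**40) would instead raise an
# ArrowInvalid error during the safe cast.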
@karldw Thanks for opening the issue. The problem lies in some inconsistent metadata, I think caused by creating the Table from a pandas DataFrame with a tz-naive column but a schema with a timezone. That is certainly a case that should work, but at the moment the to_pandas conversion cannot deal with the inconsistent metadata it leaves behind. A temporary workaround for you could be to remove the pandas metadata from the arrow Table (of course, that can also lose other information, like which column should be set as the pandas index, but depending on your use case you might not need it):
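A minimal sketch of that workaround (not the exact snippet from the comment), assuming a Table built the same way as in the repro above; Table.replace_schema_metadata(None) returns a copy of the Table with its schema-level metadata, including the stored pandas metadata, removed:

import pyarrow as pa
import pudl

# Build the Table as in the repro above (df comes from the read_csv step):
table = pa.Table.from_pandas(
    df,
    preserve_index=False,
    schema=pudl.convert.epacems_to_parquet.create_cems_schema(),
)

# Strip the schema-level metadata; this discards the pandas metadata that
# records things like the index column and the original pandas dtypes.
table = table.replace_schema_metadata(None)

# With the inconsistent pandas metadata gone, converting back to pandas
# should no longer trip over it (per the comment above).
df_again = table.to_pandas()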
Will take a look at fixing it next week, unless someone else wants to give it a go.
https://issues.apache.org/jira/browse/ARROW-10511
Please definitely open a JIRA issue so that we can help get this resolved and appropriately unit tested.
cc @nealrichardson @jorisvandenbossche