pudl: Address parquet / pyarrow 1.0.0 vs. pandas tzinfo incompatibilities in timezone aware columns

Something about the way we are specifying the arrow / parquet schema in epacems_to_parquet appears to be incompatible with Apache Arrow 1.0.0 – though it works fine with Arrow 0.17.1. If you attempt to run an epacems_to_parquet conversion with arrow 1.0.0 installed, it fails while converting the operating_datetime_utc column from a pandas column to an arrow column. However, if the timezone is left out of the timestamp column’s schema definition (letting it assume UTC, which is the default and correct in this case), it instead fails on one of the int32 columns, saying that a floating point value has been truncated.
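
To see the second failure in isolation, here is a minimal sketch (not from the original report, column name made up) of the float-to-int32 truncation using only pandas and pyarrow. Integer columns with missing values come back from read_csv as float64, and pyarrow's default safe cast refuses any value that would be truncated; whether this exactly matches what epacems_to_parquet hits under 1.0.0 is part of what's unclear:

import pandas as pd
import pyarrow as pa

# A hypothetical column standing in for one of the int32 CEMS columns,
# holding float64 values as NA-padded CSV columns do:
df = pd.DataFrame({"some_count": [1.0, 2.5]})
schema = pa.schema([pa.field("some_count", pa.int32())])

# Raises pyarrow.lib.ArrowInvalid, complaining that the float value 2.5
# was truncated converting to int32:
pa.Table.from_pandas(df, schema=schema, preserve_index=False)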

It’s not entirely clear what changed between the pre-1.0 and post-1.0 versions of Arrow to break this, but the Arrow version seems to be the controlling factor in whether what we’re doing works or fails.

See the Arrow 1.0.0 release announcement, which links to a complete change log.

To minimally recreate the bad behavior, given some pre-existing EPA CEMS datapackage outputs, you can use the following code:

import pathlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

import pudl

# The path to wherever your EPA CEMS outputs are:
epacems_dir = "./"
# The path to wherever you want it to create a Parquet dataset:
output_dir = "./"
year = 2018
state = "ID"

df = (
    pd.read_csv(
        pathlib.Path(epacems_dir) / f"hourly_emissions_epacems_{year}_{state.lower()}.csv.gz",
        parse_dates=["operating_datetime_utc"],
        dtype=pudl.convert.epacems_to_parquet.create_in_dtypes()
    )
    .assign(year=year)
)
pq.write_to_dataset(
    pa.Table.from_pandas(
        df,
        preserve_index=False,
        schema=pudl.convert.epacems_to_parquet.create_cems_schema()),
    root_path=output_dir,
    partition_cols=["year", "state"],
    compression="snappy"
)

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

Thanks for opening the issue!

I made PRs to address both bugs: https://github.com/apache/arrow/pull/8624, https://github.com/apache/arrow/pull/8625

It’s a good practice to enforce explicit schemas if you know what they’re supposed to be.
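
For anyone following along, “enforcing an explicit schema” here means building the pyarrow schema by hand and passing it to Table.from_pandas() rather than relying on type inference. A minimal sketch with made-up columns:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"state": ["ID"], "year": [2018]})
# Hand-built schema; a column that can't be safely converted to its
# declared type fails loudly instead of silently getting an inferred type.
schema = pa.schema([
    pa.field("state", pa.string()),
    pa.field("year", pa.int32()),
])
tab = pa.Table.from_pandas(df, schema=schema, preserve_index=False)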

@karldw Thanks for opening the issue. The problem lies in some inconsistent metadata, I think caused by creating the Table from a pandas DataFrame with a tz-naive column but a schema with a timezone. That is certainly a case that should work, but at the moment the to_pandas conversion cannot deal with this inconsistent metadata.
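
Concretely, a minimal sketch of the case described above (and of the tab used in the workaround below), assuming tz-naive pandas values paired with a tz-aware arrow schema:

import pandas as pd
import pyarrow as pa

# tz-naive values in pandas, but an explicitly tz-aware arrow schema:
df = pd.DataFrame({"time": pd.to_datetime(["1970-01-01", "1970-01-01"])})
schema = pa.schema([pa.field("time", pa.timestamp("ns", tz="UTC"))])
tab = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

# Under pyarrow 1.0.0, tab.to_pandas() then fails on the inconsistent
# pandas metadata embedded in the schema.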

A temporary workaround for you could be to remove the pandas metadata from the arrow Table (of course, that can also lose other information, like a column to be set as the pandas Index, but depending on your use case, you might not need this information):

In [5]: tab.replace_schema_metadata().to_pandas()
Out[5]: 
                       time
0 1970-01-01 00:00:00+00:00
1 1970-01-01 00:00:00+00:00

Will take a look at fixing it next week, unless someone else wants to give it a go.

Please definitely open a JIRA issue so that we can help get this resolved and appropriately unit tested.

cc @nealrichardson @jorisvandenbossche