PyAthena: PandasCursor doesn't automatically convert int columns with NA's to floats
I’m querying a large athena table and can successfully run a query using the below code, however it’s really slow (for reasons covered in #46).
conn = pyathena.connect(**at.athena_creds)
df = pd.read_sql(sql, conn)
I would really like to take advantage of the performance boost that PandasCursor offers, however, when I run the code below, I get a value error.
conn = pyathena.connect(**at.athena_creds, cursor_class=PandasCursor)
cursor = at_con.cursor()
df = cursor.execute(sql).as_pandas()
>>> ValueError: Integer column has NA values in column 18
Now I understand why I’m getting this value error. I have a int column in my athena table which has NA values in it, which Pandas notoriously doesn’t handle well (NaN’s are floats in Pandas eyes, not ints). The pd.read_sql() seems to handle this gracefully. It recognizes there is an int column with NaN’s and converts it to a float column. It would be great if pyathena did the same thing.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 15 (9 by maintainers)
Commits related to this issue
- Update README: Add about ValueError of integer column in Dataframe. (close #60) — committed to laughingman7743/PyAthena by laughingman7743 5 years ago
- Support integer NA values in PnadasCursor (fix #60) — committed to laughingman7743/PyAthena by laughingman7743 5 years ago
- Merge pull request #80 from laughingman7743/support_integer_na_value_in_pandas_cursor Support integer NA values in PandasCursor (fix #60) — committed to laughingman7743/PyAthena by laughingman7743 5 years ago
Pandas 0.24+ has support for nullable ints, so I was able to keep my int columns as ints (rather than converting to double) by changing converter.py like so:
If you’re willing to set the minimum requirements to pandas >=0.24, I think this fix would be cleaner than converting to double.