PyAthena: PandasCursor doesn't automatically convert int columns with NA's to floats

I’m querying a large athena table and can successfully run a query using the below code, however it’s really slow (for reasons covered in #46).

conn = pyathena.connect(**at.athena_creds)
df = pd.read_sql(sql, conn)

I would really like to take advantage of the performance boost that PandasCursor offers, however, when I run the code below, I get a value error.

conn = pyathena.connect(**at.athena_creds, cursor_class=PandasCursor)
cursor = at_con.cursor()
df = cursor.execute(sql).as_pandas()

>>> ValueError: Integer column has NA values in column 18

Now I understand why I’m getting this value error. I have a int column in my athena table which has NA values in it, which Pandas notoriously doesn’t handle well (NaN’s are floats in Pandas eyes, not ints). The pd.read_sql() seems to handle this gracefully. It recognizes there is an int column with NaN’s and converts it to a float column. It would be great if pyathena did the same thing.

About this issue

Original URL
State: closed
Created 5 years ago
Reactions: 2
Comments: 15 (9 by maintainers)

Commits related to this issue

Update README: Add about ValueError of integer column in Dataframe. (close #60) — committed to laughingman7743/PyAthena by laughingman7743 5 years ago
Support integer NA values in PnadasCursor (fix #60) — committed to laughingman7743/PyAthena by laughingman7743 5 years ago
Merge pull request #80 from laughingman7743/support_integer_na_value_in_pandas_cursor Support integer NA values in PandasCursor (fix #60) — committed to laughingman7743/PyAthena by laughingman7743 5 years ago

Most upvoted comments

Pandas 0.24+ has support for nullable ints, so I was able to keep my int columns as ints (rather than converting to double) by changing converter.py like so:

import pandas as pd

PANDAS_DTYPES = {
    'boolean': bool,
    'tinyint': pd.Int64Dtype(),
    'smallint': pd.Int64Dtype(),
    'integer': pd.Int64Dtype(),
    'bigint': pd.Int64Dtype(),
    'float': float,
    'real': float,
    'double': float,
    'char': str,
    'varchar': str,
    'array': str,
    'map': str,
    'row': str,
}

If you’re willing to set the minimum requirements to pandas >=0.24, I think this fix would be cleaner than converting to double.

xinluo-gogovan on Apr 4, 2019