pandas: read_csv fails with `TypeError: object cannot be converted to an IntegerDtype` yet succeeds when reading chunks
Code Sample, a copy-pastable example if possible
Download this file upload.txt
import pandas as pd
from enum import Enum, IntEnum, auto
import argparse

# I attached the file in the github issue
filename = "upload.txt"

# this field is coded on 64 bits so 'UInt64' looks perfect.
column = "tcp.options.mptcp.sendkey"

with open(filename) as fd:
    print("READ CHUNK BY CHUNK")
    res = pd.read_csv(
        fd,
        comment='#',
        sep='|',
        dtype={column: 'UInt64'},
        usecols=[column],
        chunksize=1
    )
    for chunk in res:
        # print("chunk %d" % i)
        print(chunk)

    fd.seek(0)  # rewind
    print("READ THE WHOLE FILE AT ONCE ")
    res = pd.read_csv(
        fd,
        comment='#',
        sep='|',
        usecols=[column],
        dtype={"tcp.options.mptcp.sendkey": 'UInt64'}
    )
    print(res)
If I read in chunks, read_csv succeeds; if I try to read the whole column at once, I get:
Traceback (most recent call last):
File "test2.py", line 34, in <module>
dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
data = parser.read(nrows)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 900, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 992, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1124, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1155, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1235, in pandas._libs.parsers.TextReader._convert_with_dtype
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 308, in _from_sequence_of_strings
return cls._from_sequence(scalars, dtype, copy)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 303, in _from_sequence
return integer_array(scalars, dtype=dtype, copy=copy)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 111, in integer_array
values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 188, in coerce_to_array
values.dtype))
TypeError: object cannot be converted to an IntegerDtype
Expected Output
I would like the call to read_csv to succeed without having to read in chunks (which seems to have other side effects as well).
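Until the whole-file read works with a nullable integer dtype, one possible workaround is to read the column as strings and convert it afterwards. The sketch below is not taken from the original report: the inline sample data stands in for upload.txt, the helper name `to_uint64` is made up for illustration, and it assumes a pandas version in which `pd.array(..., dtype="UInt64")` accepts Python integers mixed with `None`.

```python
import io

import pandas as pd

# Hypothetical stand-in for upload.txt: '|' separated, '#' comment lines,
# and one missing value in the column of interest.
SAMPLE = """# comment line
frame|tcp.options.mptcp.sendkey
1|18446744073709551615
2|
3|42
"""

column = "tcp.options.mptcp.sendkey"


def to_uint64(series):
    """Convert a column of digit strings (with missing entries) to UInt64."""
    # Go through Python ints rather than floats so that 64-bit values
    # keep their full precision.
    values = [None if pd.isna(v) else int(v) for v in series]
    return pd.Series(
        pd.array(values, dtype="UInt64"),
        index=series.index,
        name=series.name,
    )


# Read the column as plain strings first; empty fields come back as NaN.
df = pd.read_csv(
    io.StringIO(SAMPLE),
    comment="#",
    sep="|",
    usecols=[column],
    dtype={column: str},
)
df[column] = to_uint64(df[column])
print(df)
```

Converting through Python integers is a deliberate choice here: a detour through `float64` would silently round keys larger than 2**53, which matters for a field that uses all 64 bits.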
Output of pd.show_versions()
pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
About this issue
- State: closed
- Created 5 years ago
- Comments: 20 (10 by maintainers)
Commits related to this issue
- list_mptcp_connections chose a wrong mptcpdest because it was comparing values of different types. For now I encode the failing fields as str instead of UInt64 (dsnraw seems concerned as well) see ht... — committed to teto/pymptcpanalyzer by teto 5 years ago
- BUG: permit str dtype -> IntegerDtype conversions Resolves #25472, resolves #25288. — committed to alexreg/pandas by alexreg 3 years ago
- BUG: permit str dtype -> IntegerDtype conversions Resolves #25472, #25288. — committed to alexreg/pandas by alexreg 3 years ago
- BUG: permit str dtype -> IntegerDtype conversions Resolves #25472. — committed to alexreg/pandas by alexreg 3 years ago
- TST: `read_csv` to nullable int dtype (#25472) — committed to alexreg/pandas by alexreg 3 years ago
Sorry that I have no time to properly debug this, but I hope I can contribute a little bit of knowledge.
I’m running into the same problem as OP when I read one of the sheets of an .xlsx file (pandas 0.24.2). There are NaN values, but from pandas 0.24 that should work when doing `.astype(pd.Int16Dtype())`, right?
This gave the same problem as OP:
However, ugly as it is, this seemed to work for me:
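The snippets from this comment are not reproduced here. As an independent sketch, and not necessarily what the commenter used, one common way to get from an object column with missing values to a nullable integer dtype is to parse it to floats first and cast afterwards. The sample Series and its name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column as it might come out of a spreadsheet: object dtype,
# whole numbers stored as strings, plus a missing value.
raw = pd.Series(["1", np.nan, "3"], dtype=object, name="count")

# Step 1: parse to float64; NaN marks the missing entries.
# Note that errors="coerce" also turns unparseable strings into NaN.
as_float = pd.to_numeric(raw, errors="coerce")

# Step 2: cast the float column to the nullable integer dtype.
as_int16 = as_float.astype("Int16")
print(as_int16)
```

This detour through floats is fine for small ranges such as `Int16`, but it would lose precision for integers above 2**53, so it is a poor fit for 64-bit fields.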
@alexreg you or anyone is welcome to submit a PR to patch and the core team can review
OK - that’s good to know. Gets a bit too into the internals for me to follow, but was interesting to see how you all talk about this kind of stuff. If anyone else stumbles across this the relevant issues are 33254, 32586 and 33607