pandas: Series of object/strings cannot be converted to Int64Dtype()
Edited to add information.
Code Sample, a copy-pastable example if possible
import pandas as pd

a = pd.Series(['123', '345', '456'])
a.astype(int)      # works
a.astype('Int64')  # raises TypeError (see traceback below)
Problem description
Currently, the conversion of object dtypes (containing strings) to Int64 doesn’t work, even though it should be able to. It produces a long error (see at the end).
Important to note: the above is trying to convert to Int64 with the capital I. Those are the new nullable-integer arrays that were added to pandas. pandas seems to support them, yet I think something inside astype wasn't updated to reflect that.
In essence, the above should work; there is no reason why it should fail, and it is quite simply a bug (in answer to some comments). Moreover, to_numeric is not a sufficient replacement here: it doesn't convert to Int64 when there are missing values; instead it converts to float automatically (this is actually a non-trivial problem when dealing with long integer identifiers, such as GAIA target identifiers).
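To make the to_numeric concern concrete, here is a minimal sketch. The identifier value is made up for illustration (it is 2**53 + 1, not a real GAIA ID), chosen because it is the smallest positive integer that a 64-bit float cannot represent exactly.

```python
import pandas as pd

# Hypothetical long identifier (2**53 + 1), plus one missing value.
ids = pd.Series(["9007199254740993", None])

converted = pd.to_numeric(ids)

# The missing value forces a float result instead of a nullable integer.
print(converted.dtype)

# The float cannot hold the identifier exactly, so the round trip changes it.
print(int(converted.iloc[0]) == 9007199254740993)
```

Once the values have passed through float64, casting to Int64 afterwards cannot recover the lost digits.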
Traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-5778d49bfa2e> in <module>()
2 a = pd.Series(['123', '345', '456'])
3 print(a.astype(int))
----> 4 print(a.astype('Int64'))
/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
5880 # else, only a single dtype is given
5881 new_data = self._data.astype(
-> 5882 dtype=dtype, copy=copy, errors=errors, **kwargs
5883 )
5884 return self._constructor(new_data).__finalize__(self)
/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
579
580 def astype(self, dtype, **kwargs):
--> 581 return self.apply("astype", dtype=dtype, **kwargs)
582
583 def convert(self, **kwargs):
/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
436 kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
437
--> 438 applied = getattr(b, f)(**kwargs)
439 result_blocks = _extend_blocks(applied, result_blocks)
440
/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
557
558 def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559 return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
560
561 def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
641 # _astype_nansafe works fine with 1-d only
642 vals1d = values.ravel()
--> 643 values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
644
645 # TODO(extension)
/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
648 # dispatch on extension dtype if needed
649 if is_extension_array_dtype(dtype):
--> 650 return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
651
652 if not isinstance(dtype, np.dtype):
/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
321 @classmethod
322 def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 323 return integer_array(scalars, dtype=dtype, copy=copy)
324
325 @classmethod
/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in integer_array(values, dtype, copy)
105 TypeError if incompatible types
106 """
--> 107 values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
108 return IntegerArray(values, mask)
109
/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
190 ]:
191 raise TypeError(
--> 192 "{} cannot be converted to an IntegerDtype".format(values.dtype)
193 )
194
TypeError: object cannot be converted to an IntegerDtype
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.16-041816-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 0.25.1
numpy : 1.14.3
pytz : 2018.4
dateutil : 2.7.3
pip : 19.1.1
setuptools : 39.1.0
Cython : 0.28.2
pytest : 3.5.1
hypothesis : None
sphinx : 1.7.4
blosc : None
feather : None
xlsxwriter : 1.0.4
lxml.etree : 4.2.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 6.4.0
pandas_datareader: None
bs4 : 4.6.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.2.1
matplotlib : 3.1.1
numexpr : 2.6.5
odfpy : None
openpyxl : 2.5.3
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.7
tables : 3.4.3
xarray : None
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : 1.0.4
About this issue
- State: closed
- Created 5 years ago
- Reactions: 5
- Comments: 18 (8 by maintainers)
What I mean is that to_numeric first converts to float if it detects missing values, and it doesn't seem to want to convert to Int64. Also, there is no reason astype shouldn't work here; the array of strings above can be converted to Int64.

Doesn't appear we have much appetite to support this. Thanks for the suggestion, but we'd recommend using to_numeric first. Closing.
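For reference, the recommended two-step route can be sketched as follows. This is only an illustration of "to_numeric first" applied to the Series from the code sample, not an official recipe, and it does not address the float round trip that happens when missing values are present.

```python
import pandas as pd

a = pd.Series(['123', '345', '456'])

# Step 1: parse the strings into a plain NumPy integer Series.
parsed = pd.to_numeric(a)

# Step 2: cast the integers to the nullable Int64 extension dtype.
result = parsed.astype('Int64')

print(result.dtype)
print(result.tolist())
```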