pandas: Series of object/strings cannot be converted to Int64Dtype()

Edited to add information.

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.Series(['123', '345', '456'])
a.astype(int)        # works
a.astype('Int64')    # raises TypeError (see traceback below)

Problem description

Currently, converting an object-dtype Series containing strings to Int64 fails, even though every value is a valid integer. It raises a TypeError with a long traceback (see below).

Important to note: the above is trying to convert to Int64 with a capital I. That is the new nullable-integer array type that was added to pandas. pandas supports it, yet I think something inside astype wasn’t updated to reflect that.

In essence, the above should work: every string is a valid integer, so I consider the failure a bug (in answer to some comments). Moreover, to_numeric is not a sufficient replacement here. When there are missing values it does not convert to Int64 but converts to float automatically, which is a non-trivial problem when dealing with long integer identifiers such as GAIA target identifiers.
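As a minimal sketch of one way to sidestep the float detour described above, each string can be parsed with Python's int() and the result handed to pd.array with dtype="Int64". The helper name to_nullable_int64 and the sample values are hypothetical, chosen only for illustration; this is not how astype or to_numeric work internally.

```python
import pandas as pd


def to_nullable_int64(values):
    """Hypothetical helper: parse integer strings without going through float."""
    # Keep missing entries as None; parse everything else with int(),
    # which handles arbitrarily long integers exactly.
    parsed = [None if pd.isna(v) else int(v) for v in values]
    return pd.array(parsed, dtype="Int64")


# 9007199254740993 is 2**53 + 1, which float64 cannot represent exactly.
ids = pd.Series(["9007199254740993", None, "456"])
result = pd.Series(to_nullable_int64(ids), index=ids.index)

print(result)
```

The element-wise loop is slower than a vectorised conversion, so this is only reasonable when exact identifiers matter more than speed.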

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-5778d49bfa2e> in <module>()
      2 a  = pd.Series(['123', '345', '456'])
      3 print(a.astype(int))
----> 4 print(a.astype('Int64'))

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    648     # dispatch on extension dtype if needed
    649     if is_extension_array_dtype(dtype):
--> 650         return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
    651 
    652     if not isinstance(dtype, np.dtype):

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
    321     @classmethod
    322     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 323         return integer_array(scalars, dtype=dtype, copy=copy)
    324 
    325     @classmethod

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in integer_array(values, dtype, copy)
    105     TypeError if incompatible types
    106     """
--> 107     values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
    108     return IntegerArray(values, mask)
    109 

/home/sestovic/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    190         ]:
    191             raise TypeError(
--> 192                 "{} cannot be converted to an IntegerDtype".format(values.dtype)
    193             )
    194 

TypeError: object cannot be converted to an IntegerDtype

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.16-041816-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.25.1
numpy : 1.14.3
pytz : 2018.4
dateutil : 2.7.3
pip : 19.1.1
setuptools : 39.1.0
Cython : 0.28.2
pytest : 3.5.1
hypothesis : None
sphinx : 1.7.4
blosc : None
feather : None
xlsxwriter : 1.0.4
lxml.etree : 4.2.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 6.4.0
pandas_datareader: None
bs4 : 4.6.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.2.1
matplotlib : 3.1.1
numexpr : 2.6.5
odfpy : None
openpyxl : 2.5.3
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.7
tables : 3.4.3
xarray : None
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : 1.0.4

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 18 (8 by maintainers)

Most upvoted comments

The issue is that, with missing data, to_numeric converts to float first, right? In that case, if very long integers are used as identifiers, the conversion to double precision rounds them and changes the last few digits. This is the problem I was dealing with, and why I started looking into Int64. It’s not as uncommon as it might seem: in astrophysics, for example, GAIA identifiers are long enough that floating-point conversion modifies them. Moreover, as far as I can see, shouldn’t .astype('Int64') and to_numeric handle these cases identically? I don’t know the details of what to_numeric does off the top of my head, but couldn’t .astype('Int64') follow the same rules for ambiguous cases?
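The precision concern in this comment can be illustrated with plain Python floats, which are IEEE 754 double precision like NumPy's float64. The 19-digit identifier below is a made-up value, not a real catalogue entry.

```python
# Hypothetical 19-digit identifier, chosen only for illustration.
identifier = 1234567890123456789

# Round-trip through double precision, as happens when a column is cast to float.
round_tripped = int(float(identifier))

print(identifier)
print(round_tripped)
print(round_tripped == identifier)  # False: the trailing digits were rounded

# Above 2**53 not every integer has an exact float64 representation.
limit = 2 ** 53
print(float(limit + 1) == float(limit))  # True: both map to the same float
```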

We don’t convert to float first.

to_numeric is the workhorse.

astype doesn’t take any options, so every value must be convertible, as in NumPy.

What I mean is that to_numeric first converts to float if it detects missing values, and it doesn’t seem to want to convert to Int64. Also, there is no reason astype shouldn’t work here: the array of strings above can be converted to Int64.

Doesn’t appear we have much appetite to support this. Thanks for the suggestion but we’d recommend using to_numeric first. Closing.
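For the example from the issue, which has no missing values, the recommended route can be sketched as follows. This is an illustration of the two-step conversion only; as discussed in the comments, it does not address the case where missing values push to_numeric to float.

```python
import pandas as pd

a = pd.Series(["123", "345", "456"])

# Step 1: parse the strings into a numeric dtype.
numeric = pd.to_numeric(a)

# Step 2: cast the already-numeric values to the nullable integer dtype.
nullable = numeric.astype("Int64")

print(nullable)
```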