astropy: Possible Bug? `parse_single_table` loses dtype info for `char(*)` type.

Description

With a VOTable like:

<?xml version="1.0" encoding="utf-8"?>
<!-- Produced with astropy.io.votable version 5.1
     http://www.astropy.org/ -->
<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.3 http://www.ivoa.net/xml/VOTable/VOTable-1.4.xsd">
 <RESOURCE type="results">
   ...
   <FIELD ID="original_ext_source_id" arraysize="80" datatype="char" name="original_ext_source_id" ucd="meta.id.cross"/>
   ...
   <DATA>
    <TABLEDATA>
     ...

The field original_ext_source_id is of type char[80].

Running:

>>> from astropy.io.votable import parse_single_table
>>> table = parse_single_table(<filename>)
>>> for row in table.array:
...     print(type(row))
...     for cell in row:
...         print(type(cell))
...         print(cell.dtype)
...
<class 'numpy.ma.core.mvoid'>
<class 'numpy.int64'>
int64
<class 'str'>
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[24], line 5
      3 for cell in row:
      4     print(type(cell))
----> 5     print(cell.dtype)

AttributeError: 'str' object has no attribute 'dtype'

The cell object is a built-in Python str, so the dtype information is lost.
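As a possible workaround, the dtype can be read from the structured array itself instead of from each cell, which avoids calling .dtype on a plain str. The sketch below uses a small hand-built NumPy array as a stand-in for table.array, and the helper name field_dtypes is hypothetical. Whether the column dtype still reports U80 for a char field in the affected astropy version is an assumption that would need checking.

```python
import numpy as np


def field_dtypes(array):
    """Map each field name of a structured array to its NumPy dtype."""
    return {name: array.dtype[name] for name in array.dtype.names}


# Stand-in for table.array (assumption: a structured array with named fields).
demo = np.array(
    [(1, "abc")],
    dtype=[("source_id", "i8"), ("original_ext_source_id", "U80")],
)

dtypes = field_dtypes(demo)
for name, dtype in dtypes.items():
    print(name, dtype)
```

This keeps the per-column type and length available even when an individual cell comes back as a built-in type.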

Expected behavior

I would expect the same behavior as with the unicodeChar type.

With the following VOTable file:

<?xml version="1.0" encoding="utf-8"?>
<!-- Produced with astropy.io.votable version 5.1
     http://www.astropy.org/ -->
<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.3 http://www.ivoa.net/xml/VOTable/VOTable-1.4.xsd">
 <RESOURCE type="results">
   ...
   <FIELD ID="original_ext_source_id" arraysize="80" datatype="unicodeChar" name="original_ext_source_id" ucd="meta.id.cross"/>
   ...
   <DATA>
    <TABLEDATA>
     ...

The field original_ext_source_id is now of type unicodeChar[80].

Running:

>>> from astropy.io.votable import parse_single_table
>>> table = parse_single_table(<filename>)
>>> for row in table.array:
...     print(type(row))
...     for cell in row:
...         print(type(cell))
...         print(cell.dtype)
...
<class 'numpy.ma.core.mvoid'>
<class 'numpy.int64'>
int64
<class 'numpy.str_'>
<U80

In this case the cell object is a NumPy string (numpy.str_), and the dtype information is available.

How to Reproduce

>>> from astropy.io.votable import parse_single_table
>>> table = parse_single_table(<filename>)
>>> for row in table.array:
...     print(type(row))
...     for cell in row:
...         print(type(cell))
...         print(cell.dtype)
...
<class 'numpy.ma.core.mvoid'>
<class 'numpy.int64'>
int64
<class 'str'>
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[24], line 5
      3 for cell in row:
      4     print(type(cell))
----> 5     print(cell.dtype)

AttributeError: 'str' object has no attribute 'dtype'

Versions

Linux-4.19.0-9-amd64-x86_64-with-glibc2.28
Python 3.9.13+ (heads/3.9:e8f2fe355b, Jun 13 2022, 10:51:14) 
[GCC 8.3.0]
astropy 5.1
Numpy 1.22.4
pyerfa 2.0.0.1

no scipy and no matplotlib

About this issue

  • State: open
  • Created a year ago
  • Comments: 18 (8 by maintainers)

Most upvoted comments

Dear @pllim, not really. I could change the title though, now that we have better identified the issue.

Sure! Sorry 😄 @tomdonaldson please see my comment above.

@pllim thank you for your fast answer. Indeed, I was not sure whether calling it a bug was accurate. I would say it is more of an inconsistency than a real bug.

The use case is the following: when parsing a VOTable you have access to extra metadata, namely the type and length of the array. With NumPy types it is possible to store and keep this metadata in the dtype attribute, and to use a clearer and richer denomination (U20, S20, >F16…) than Python's built-in types.

This is extremely useful and allows cleaner exports or interfaces to other types of services, such as databases, Spark…
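To illustrate the kind of export described above, here is a minimal sketch that derives SQL column types from NumPy dtypes. The function names sql_column_type and create_table_sql, the table name, and the type mapping are all hypothetical choices for illustration, not part of astropy or of any database driver.

```python
import numpy as np


def sql_column_type(dtype):
    """Translate a NumPy dtype into an illustrative SQL column type."""
    dtype = np.dtype(dtype)
    if dtype.kind == "U":
        # NumPy stores unicode as 4 bytes per character.
        return f"VARCHAR({dtype.itemsize // 4})"
    if dtype.kind == "S":
        return f"VARCHAR({dtype.itemsize})"
    if dtype.kind in "iu":
        return "BIGINT" if dtype.itemsize > 4 else "INTEGER"
    if dtype.kind == "f":
        return "DOUBLE PRECISION" if dtype.itemsize > 4 else "REAL"
    # Object dtype and anything else: the length is unknown.
    return "TEXT"


def create_table_sql(table_name, array_dtype):
    """Build a CREATE TABLE statement from a structured dtype."""
    columns = ", ".join(
        f"{name} {sql_column_type(array_dtype[name])}"
        for name in array_dtype.names
    )
    return f"CREATE TABLE {table_name} ({columns})"


demo_dtype = np.dtype([("source_id", "i8"), ("original_ext_source_id", "U80")])
statement = create_table_sql("demo", demo_dtype)
print(statement)
```

With a built-in str the length is unavailable, so such a mapping would have to fall back to an unbounded text type.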

Lastly, this used to work, so at some point in the past chars must have been interpreted as numpy.str_ or similar.

Therefore, my suggestion for our use case would be to always return values as NumPy types rather than built-in types. Since NumPy is a core dependency of astropy, I guess it would be legitimate to use NumPy objects systematically.

Again, this is a suggestion that would solve our problem; as an astropy non-expert, I am not sure about the implications.