astropy: BUG: ascii.read converts - to float -1.0

This was a tricky one and caused a bug being hidden in astroquery for a while (as it made parsing a mock table possible, while the one from the server was failing):

>>> from astropy.io import ascii
>>> ascii.read('1 | 2 | 3 \n 4 | 5 | 6 \n')                                                                                                                                                                                   
<Table length=2>
 col1  col2  col3
int64 int64 int64
----- ----- -----
    1     2     3
    4     5     6
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n') 
<Table length=2>
 col1   col2   col3
int64 float64 int64
----- ------- -----
    1     2.0     3
    4    -1.0     6

Changing the hyphen to a dash makes it work as expected, to return a string column.

>>> ascii.read('1 | 2 | 3 \n 4 | – | 6 \n') 
<Table length=2>
 col1 col2  col3
int64 str1 int64
----- ---- -----
    1    2     3
    4    –     6

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 30 (22 by maintainers)

Most upvoted comments

@pllim my pleasure,yeah it was hard to dig-it-out with out the docs

Thanks, I think you deserve at least as much credit for tracking this down; but I am probably a bit more familiar with the C part as I contributed some of that code myself! 😃

@bsipocz - very unscientific profiling on my new MBP with SSD, for a space-delimited ASCII table with 25 columns by 10000 rows, which is about 1 Mb in size:

In [11]: time out = Table.read('junk.dat', format='ascii.basic', guess=False, fast_reader=False)
CPU times: user 151 ms, sys: 6.07 ms, total: 157 ms
Wall time: 157 ms

In [12]: time out = Table.read('junk.dat', format='ascii.basic', guess=False, fast_reader=True)
CPU times: user 22.9 ms, sys: 1.28 ms, total: 24.2 ms
Wall time: 29.7 ms

One can roughly scale from there, but I agree you are most likely to be limited by server speeds in getting the text file into memory/disk before trying to parse into a table. So it may be a consideration to go with the pure Python parser.

Having said that, I cannot thank @dhomeier and @Harshil-C enough for trying to maintain the fast C parser!! I don’t understand that code for the most part, so having people ready to step in and investigate bugs is hugely important.

Whoever ends up taking this up as a PR, there is an option to give the other person credit via GitHub’s Co-Author feature. 😄 Thanks!

https://github.blog/2018-01-29-commit-together-with-co-authors/

iam not confident enough to write c yet,in the future i hope for it (tracking it was pretty fun:) )i learned many things

i think you own this pr brother, go ahead (i didnt make much changes as iam new to c,i just touched cython and c for this issue )

@Harshil-C you are most welcome to take care of the fixes - attaching a diff for the changes I have tested so far. Note that I had to change it a bit to accommodate floats without leading digits like -.25. lone-signs.patch.gz

The crucial point is that no CONVERSION_ERROR is raised in these cases, although neither a NaN nor inf are identified. So I think at https://github.com/astropy/astropy/blob/1cb6c81e874a785e2d7a3075a75a372188b0d844/astropy/io/ascii/src/tokenizer.c#L719

    else
    {
        self->code = CONVERSION_ERROR;
    }

is missing. But further, the use_fast_converter=True never jumps to the conversion_error as it xstrtod quietly skips through all the trailing spaces. Analogously to strtod - from the manpage:

     If endptr is not NULL, a pointer to the character after the last charac-
     ter used in the conversion is stored in the location referenced by
     endptr.

     If no conversion is performed, zero is returned and the value of nptr is
     stored in the location referenced by endptr.

I’d suggest to add this check after the sign has been parsed:

    // Handle optional sign
    negative = 0;
    switch (*p)
    {
    case '-': negative = 1; // Fall through to increment position
    case '+': p++;
    }

    // No data after the sign - make no conversion and jump back to str
    // (consistent with strtod)
    if (*p == '\0' || isspace(*p))
    {
        if (endptr) *endptr = (char *) str;
        return 0e0;
    }

Oh, I can get it with option+‘-’