cudf: [BUG] Groupby aggregation sums are more non-deterministic than expected with float32 values

Groupby sum aggregations are more non-deterministic than expected when the column being summed contains float32 values. Casting the column to float64 resolves the issue. The problem is hard to reproduce with only a few rows, so I'm attaching a CSV file of 200 rows.

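import cudf

# Read the three key columns as int32 and the value column as float32.
dtype_dict = {'0': 'int32',
              '1': 'int32',
              '2': 'int32',
              '3': 'float32'}

data = cudf.read_csv('small_clean_cols.txt', dtype=dtype_dict)

# Repeat the same aggregation and report any run whose first group sum
# differs from the previous run's.
old = None
for i in range(100):
    temp = data.groupby(by=['0', '1', '2']).agg({'3': 'sum'})
    current = float(temp['3'].iloc[0])
    # Compare against None explicitly so a legitimate sum of 0.0 is not skipped.
    if old is not None and abs(current - old) > .0001:
        print(old, current)
        print(i)
    old = current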

Output:

129200.9453125 129200.9609375
1
129200.9609375 129200.953125
2
129200.953125 129200.9453125
...
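
The successive sums above differ by 0.0078125 or 0.015625, which is one or two steps of float32 spacing at this magnitude. The report itself does not state a cause. A plausible explanation, offered here only as an assumption, is that floating-point addition is not associative, so a parallel reduction that adds the same values in a different order on each run can round differently. The CPU-only sketch below illustrates that effect with the standard library. The helpers `f32`, `sum_f32`, and `f32_spacing` are hypothetical names invented for this illustration and are not part of cudf.

```python
import math
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Sum left to right, rounding to float32 after every addition.

    This emulates a float32 accumulator on the CPU. It is an illustration
    only and says nothing about how cudf orders its additions.
    """
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def f32_spacing(x):
    """Gap between adjacent float32 values near x (x nonzero and normal)."""
    _, exponent = math.frexp(abs(x))
    return 2.0 ** (exponent - 24)  # float32 has a 24-bit significand


# The same three numbers, added in two different orders.
order_a = sum_f32([1e8, -1e8, 1.0])  # the large terms cancel first, the 1.0 survives
order_b = sum_f32([1e8, 1.0, -1e8])  # the 1.0 is lost when added to 1e8
print(order_a, order_b)

# In float64 the same reordering does not change the result here.
print(sum([1e8, -1e8, 1.0]), sum([1e8, 1.0, -1e8]))

# Spacing of float32 values near the sums shown in the report.
print(f32_spacing(129200.9453125))
```

Float64 is not immune to reordering in general. Its spacing is much finer, so at this magnitude any differences would fall far below the 0.0001 threshold used in the loop.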

With the value column read as float64, the same loop prints no differences:

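import cudf

# Same as above, but the value column is read as float64.
dtype_dict = {'0': 'int32',
              '1': 'int32',
              '2': 'int32',
              '3': 'float64'}

data = cudf.read_csv('small_clean_cols.txt', dtype=dtype_dict)

# Repeat the same aggregation and report any run whose first group sum
# differs from the previous run's.
old = None
for i in range(100):
    temp = data.groupby(by=['0', '1', '2']).agg({'3': 'sum'})
    current = float(temp['3'].iloc[0])
    # Compare against None explicitly so a legitimate sum of 0.0 is not skipped.
    if old is not None and abs(current - old) > .0001:
        print(old, current)
        print(i)
    old = current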
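
To check GPU results against a value that does not depend on summation order, one option is a CPU-side reference that accumulates in higher precision and rounds to float32 only once at the end. The sketch below uses `math.fsum` from the standard library, which returns a correctly rounded float64 sum. The helper names `f32` and `reference_sum_f32` are hypothetical, and this is not how cudf computes its sums.

```python
import itertools
import math
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def reference_sum_f32(values):
    """Order-independent reference: exact-rounded float64 sum, then one float32 rounding."""
    return f32(math.fsum(values))


# Every ordering of the same inputs yields the same reference value.
values = [1e8, 1.0, -1e8, 0.5]
results = {reference_sum_f32(p) for p in itertools.permutations(values)}
print(results)
```

A reference like this is practical only for small groups pulled back to the host, such as the first group in the 200-row file. It is meant for verification, not as a replacement for the GPU aggregation.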

Data: small_clean_cols.txt

About this issue

State: closed
Created 5 years ago
Comments: 18 (18 by maintainers)

Most upvoted comments

This may have been the most fascinating GitHub issue I’ve seen in a while 😃 Thank you team.

datametrician on Dec 4, 2019