cudf: [BUG] Groupby aggregation sums are more non-deterministic than expected with float32 values
Groupby sum aggregations are more non-deterministic than expected when the column being summed contains float32 values: repeated runs over the same data return slightly different sums. Casting the column to float64 resolves the issue. It's hard to reproduce with only a few rows, so I'm attaching a CSV file of 200 rows.
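import cudf

# Column '3' is read as float32; this is the case that shows the problem.
dtype_dict = {'0': 'int32',
              '1': 'int32',
              '2': 'int32',
              '3': 'float32'}
data = cudf.read_csv('small_clean_cols.txt', dtype=dtype_dict)

old = None
for i in range(100):
    temp = data.groupby(by=['0', '1', '2']).agg({'3': 'sum'})
    # Compare against None explicitly so a sum of 0.0 is not treated as "unset".
    if old is None:
        old = float(temp['3'].iloc[0])
        continue
    current = float(temp['3'].iloc[0])
    # Report any run whose first group sum differs from the previous run.
    if abs(current - old) > .0001:
        print(old, current)
        print(i)
    old = current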
129200.9453125 129200.9609375
1
129200.9609375 129200.953125
2
129200.953125 129200.9453125
...
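With the same column read as float64, the same loop prints no differences:
import cudf

# Identical to the script above except that column '3' is read as float64.
dtype_dict = {'0': 'int32',
              '1': 'int32',
              '2': 'int32',
              '3': 'float64'}
data = cudf.read_csv('small_clean_cols.txt', dtype=dtype_dict)

old = None
for i in range(100):
    temp = data.groupby(by=['0', '1', '2']).agg({'3': 'sum'})
    # Compare against None explicitly so a sum of 0.0 is not treated as "unset".
    if old is None:
        old = float(temp['3'].iloc[0])
        continue
    current = float(temp['3'].iloc[0])
    if abs(current - old) > .0001:
        print(old, current)
        print(i)
    old = current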
Data: small_clean_cols.txt
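The report above does not say why the float32 sums vary. One plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is that floating-point addition is not associative, so a sum whose order of additions changes from run to run can round differently each time. The sketch below illustrates only that arithmetic effect on the CPU, without cudf. The helper names `to_f32` and `sequential_sum` and the three sample values are hypothetical and are not taken from the attached file.

```python
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]


def sequential_sum(values, single_precision):
    """Add values left to right, optionally rounding to float32 after each step."""
    total = 0.0
    for v in values:
        total = total + v
        if single_precision:
            total = to_f32(total)
    return total


# The same three numbers in two different orders (illustrative values only).
values = [1e8, 1.0, -1e8]
reordered = [1e8, -1e8, 1.0]

f32_a = sequential_sum(values, single_precision=True)
f32_b = sequential_sum(reordered, single_precision=True)
f64_a = sequential_sum(values, single_precision=False)
f64_b = sequential_sum(reordered, single_precision=False)

print("float32:", f32_a, f32_b)  # the two orders disagree
print("float64:", f64_a, f64_b)  # the two orders agree
```

Near 1e8 adjacent float32 values are 8 apart, so adding 1.0 is rounded away unless the large terms cancel first. float64 has enough precision to absorb the same reordering here, which is consistent with the cast removing the visible differences; it narrows the rounding error rather than making the summation order fixed.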
About this issue
- State: closed
- Created 5 years ago
- Comments: 18 (18 by maintainers)
This may have been the most fascinating GitHub issue I’ve seen in a while 😃 Thank you team.