vaex: [BUG-REPORT] problem using data from parquet/pyarrow in correlation function
Dear Vaex developers,
Description
I would like to use the Vaex correlation function to calculate the correlation coefficient between two columns of a dataframe (one type int64, the other type str). I am loading the dataframe from a parquet file generated using the pyarrow engine. It seems that correlation may not work correctly with data in this format.
Here is a reproducer:
#use pandas to create and save dataframe as parquet
import pandas as pd
import vaex
data = [[0,'hello'], [1,'world'], [3,'this'], [5,'example'], [0,'hello']]
#let pandas infer the dtypes: 'a' becomes int64, 'b' holds strings
pdf = pd.DataFrame(data, columns=['a', 'b'])
pdf.to_parquet('test', engine='pyarrow')
#now load into vaex
vdf = vaex.open('test')
#try correlation without .values
out = vdf.correlation(vdf.a, vdf.b)
This fails with a traceback that ends in ValueError: could not convert string to float: 'hello'.
I thought I might need to explicitly request .values:
#try correlation with .values
out = vdf.correlation(vdf.a.values, vdf.b.values)
This fails with a traceback that ends in
ValueError: <pyarrow.lib.Int64Array object at 0x2aaae733da60>
[
0,
1,
3,
5,
0
] is not of string or Expression type, but <class 'pyarrow.lib.Int64Array'>
Software information
- Vaex version:
{'vaex-core': '4.1.0',
'vaex-viz': '0.5.0',
'vaex-hdf5': '0.7.0',
'vaex-server': '0.4.0',
'vaex-astro': '0.8.0',
'vaex-jupyter': '0.6.0',
'vaex-ml': '0.11.1'}
- Vaex was installed via: conda-forge
- OS: Ubuntu 20.04 (in Docker)
Apologies if I am using Vaex or this function incorrectly.
Thank you very much, Laurie
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 19 (10 by maintainers)
Hi
Regarding the usage of correlation (same goes for the mutual_information method): it is not a bug, it is a feature 😃
This is part of an… older part of vaex that has not been updated in a long time, and the API is a bit different. You can use it in two ways: one, as @lastephey found out, is by passing single x and y arguments which are expressions.
If you want to calculate more correlations at once, you need to pass a list of tuples to the x argument like this:
@kmcentush don’t bother with this, it touches a very, very old part of vaex that might be tricky to understand. I think @maartenbreddels has basically updated this to a more modern API, but it lives in a branch somewhere and needs a final bit of ironing out before it is merged. I expect it to happen soon-ish.
I’d be curious to see if label encoding in Vaex then correlating returns the same results as doing it in Pandas. Please let me know what you find!
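As a library-independent reference point for that comparison, here is a minimal pure-Python sketch of "label encode, then correlate" on the data from the reproducer. The helpers label_encode and pearson are hypothetical names written for this sketch, not Vaex or Pandas functions, and the codes are assigned in order of first appearance, which is an assumption.

```python
import math

def label_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient: cov[x, y] / (std[x] * std[y])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Same values as the reproducer at the top of the issue.
a = [0, 1, 3, 5, 0]
b = ['hello', 'world', 'this', 'example', 'hello']

b_codes = label_encode(b)  # [0, 1, 2, 3, 0]
r = pearson(a, b_codes)
print(b_codes, r)
```

Note that the coefficient depends on how the codes are assigned. An encoder that sorts the labels alphabetically would produce different codes and therefore a different value, so the Vaex and Pandas results are only comparable if both use the same mapping.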