vaex: [BUG-REPORT] problem using data from parquet/pyarrow in correlation function

Dear Vaex developers,

Description

I would like to use the Vaex correlation function to calculate the correlation coefficient between two columns of a dataframe (one type int64, the other type str). I am loading the dataframe from a parquet file generated using the pyarrow engine. It seems that correlation may not work correctly with data in this format.

Here is a reproducer:

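import pandas as pd
import vaex

#use pandas to create and save dataframe as parquet
data = [[0,'hello'], [1,'world'], [3,'this'], [5,'example'], [0,'hello']]
#dtype is left to inference (a: int64, b: str); dtype=int cannot apply to the string column
pdf = pd.DataFrame(data, columns=['a', 'b'])
pdf.to_parquet('test', engine='pyarrow')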

#now load into vaex
vdf = vaex.open('test')
#try correlation without .values
out = vdf.correlation(vdf.a, vdf.b)

This fails with a traceback that ends in ValueError: could not convert string to float: 'hello'.

I thought I might need to explicitly request .values:

#try correlation with .values
out = vdf.correlation(vdf.a.values, vdf.b.values)

This fails with a traceback that ends in

ValueError: <pyarrow.lib.Int64Array object at 0x2aaae733da60>
[
  0,
  1,
  3,
  5,
  0
] is not of string or Expression type, but <class 'pyarrow.lib.Int64Array'>

Software information

  • Vaex version:
{'vaex-core': '4.1.0',
'vaex-viz': '0.5.0',
'vaex-hdf5': '0.7.0',
'vaex-server': '0.4.0',
'vaex-astro': '0.8.0',
'vaex-jupyter': '0.6.0',
'vaex-ml': '0.11.1'}
  • Vaex was installed via: conda-forge
  • OS: Ubuntu 20.04 (in Docker)

Apologies if I am using Vaex or this function incorrectly.

Thank you very much, Laurie

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19 (10 by maintainers)

Most upvoted comments

Hi

Regarding the usage of correlation (same goes for the mutual_information method): it is not a bug, it is a feature 😃

This is part of an… older part of vaex that has not been updated in a long time, and its API is a bit different. You can use it in two ways. The first, as @lastephey found out, is to pass single x and y arguments, which are expressions.

If you want to calculate more correlations at once, you need to pass a list of tuples to the x argument like this:

# assume the dataframe has x, y, z columns:
df.correlation(x=[('x', 'y'), ('x', 'z'), ('y', 'z')])

@kmcentush don’t bother with this, it touches a very very old part of vaex that might be tricky to understand. I think @maartenbreddels has basically updated this to a more modern API, but it lives in a branch somewhere and needs a final bit of ironing out before it is merged. I expect it to happen soon-ish.

I’d be curious to see if label encoding in Vaex then correlating returns the same results as doing it in Pandas. Please let me know what you find!
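
For anyone who wants a baseline for that comparison, here is a minimal sketch of the pandas side only. It assumes "label encoding" means mapping each distinct string to an integer code, done here with pd.factorize, and that the correlation of interest is Pearson. The column name b_encoded is made up for illustration. The Vaex side is not shown.

```python
import pandas as pd

# Same toy data as the reproducer at the top of this issue.
pdf = pd.DataFrame({
    'a': [0, 1, 3, 5, 0],
    'b': ['hello', 'world', 'this', 'example', 'hello'],
})

# Label-encode the string column: each distinct string gets an integer code,
# assigned in order of first appearance.
codes, uniques = pd.factorize(pdf['b'])
pdf['b_encoded'] = codes  # hypothetical column name, for illustration only

# Pearson correlation between the integer column and the encoded column.
r = pdf['a'].corr(pdf['b_encoded'])
print(r)
```

Note that the integer codes are arbitrary: a different encoding order (for example, alphabetical instead of first appearance) can give a different coefficient. The two libraries would need to use the same string-to-code mapping for the results to be comparable.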