vaex: [BUG-REPORT] - `to_numpy` corrupts numpy array

Description

When a column's values are pyarrow arrays and you convert that column to numpy with `to_numpy`, the result is a numpy array of dtype object that cannot be used directly as a matrix.

from random import random

import numpy as np
import pandas as pd
import pyarrow as pa
import vaex

NUM_SAMPLES = 100
WIDTH = 20

# One pyarrow array of WIDTH random floats per row
vals = [pa.array([random() for _ in range(WIDTH)]) for _ in range(NUM_SAMPLES)]
ids = list(range(NUM_SAMPLES))
data = {
    'vals': vals,
    'id': ids
}
df = vaex.from_pandas(pd.DataFrame(data))

bad = df['vals'].to_numpy()  # object-dtype np.array, cannot become a (NUM_SAMPLES, WIDTH) matrix
good = np.array(vals)  # (NUM_SAMPLES, WIDTH) matrix

In the example above, `good` is the numpy matrix you'd expect, but `bad` is a numpy array of pyarrow arrays. For some reason numpy cannot convert `bad` to a matrix; it treats the elements as if they were of unequal length. The only ways I have found to do the conversion are to

  1. Cast it to a list, then back to numpy
  2. Convert each pa.array to a numpy array first, then call the np.array constructor

Both of those are very slow. Do you have a suggestion for a way to get around this? Thanks!

Software information

  • Vaex version (import vaex; vaex.__version__):
{'vaex': '4.5.0',
 'vaex-core': '4.5.1',
 'vaex-viz': '0.5.0',
 'vaex-hdf5': '0.10.0',
 'vaex-server': '0.6.1',
 'vaex-astro': '0.9.0',
 'vaex-jupyter': '0.6.0',
 'vaex-ml': '0.14.0'}
  • Vaex was installed via: pip
  • OS: macOS Big Sur

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 26 (26 by maintainers)

Most upvoted comments

Good detective work! A bit unexpected indeed… To be sure, maybe you can try exporting to feather via pandas to see if you get the same error (I imagine you would, but one never knows).