vaex: Kernel dies when using many columns in groupby
For example, if I have the table below:

```python
from itertools import product
import pandas as pd

listA = list(range(1, 14))
listB = list(range(1, 26))
listC = list(range(1, 23))
listD = list(range(1, 25))
listE = list(range(1, 11))

# Full cartesian product of the five lists
PDF = pd.DataFrame(
    list(product(listA, listB, listC, listD, listE)),
    columns=['listA', 'listB', 'listC', 'listD', 'listE'],
)
```

The above creates a pandas DataFrame with roughly 1.7 million rows.
Converting it into a vaex DataFrame:

```python
import vaex

VDF = vaex.from_pandas(PDF)
```
Now, grouping by progressively more columns:

```python
VDF.groupby(VDF.listA, agg='count')                                     # fast
VDF.groupby([VDF.listA, VDF.listB], agg='count')                        # still fast, but considerably slower
VDF.groupby([VDF.listA, VDF.listB, VDF.listC], agg='count')             # really slow
VDF.groupby([VDF.listA, VDF.listB, VDF.listC, VDF.listD], agg='count')  # kills the kernel
```
The funny thing is, dask and pandas handle this easily. Do you have any lead on this behaviour?
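For reference, a rough sketch of the equivalent pandas call (not my exact code): pandas only materializes the combinations that actually occur in the data, so this completes without issue.

```python
# Rough pandas equivalent of the 4-column count (a sketch)
pandas_counts = (
    PDF.groupby(['listA', 'listB', 'listC', 'listD'])
       .size()
       .reset_index(name='count')
)
print(len(pandas_counts))  # at most 13 * 25 * 22 * 24 = 171,600 rows
```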
Right now I'm using a loop to limit each groupby to only 2 columns, but it adds considerable time.
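Something along these lines (a simplified sketch, not the exact production code): iterate over the values of two of the columns, filter, group by the remaining two so each vaex groupby only ever sees 2 columns, then stitch the partial results back together in pandas.

```python
import pandas as pd

partial_results = []
for a in VDF.listA.unique():
    for b in VDF.listB.unique():
        # restrict to one (listA, listB) pair, then group by the remaining columns
        subset = VDF[(VDF.listA == a) & (VDF.listB == b)]
        counts = subset.groupby([subset.listC, subset.listD], agg='count').to_pandas_df()
        counts['listA'] = a
        counts['listB'] = b
        partial_results.append(counts)

result = pd.concat(partial_results, ignore_index=True)
```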
By the way, this comes from real financial data, but since I'm not able to release that data, this is an equivalent representation. The real data is ~50 times bigger, since each combination has several periods and multiple lines.
Thanks!
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (7 by maintainers)
This was addressed. Please reopen if you still have the same problems.
Hi @pietrolesci
Thank you very much for the report. Indeed, this is a known issue - if you check elsewhere on the issue board you will find a more verbose explanation of why this happens. In short, when doing a groupby on multiple columns, vaex computes the cartesian product of all unique values in the groupby columns, even if not all combinations exist in the data. The memory error happens because vaex expects that full grid to be the output dataframe. For the time being, you need to be mindful of this. We are looking to improve this in the near future.
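To give a sense of the scale involved, a quick back-of-the-envelope illustration (plain arithmetic, not vaex internals):

```python
# Size of the grid implied by the cartesian product of unique value counts.

# Toy example from the original report (listA..listD):
print(13 * 25 * 22 * 24)       # 171,600 cells - still manageable

# A single high-cardinality column (say, an id with ~10 million unique values)
# combined with two modest columns already implies billions of cells:
print(10_000_000 * 25 * 22)    # 5,500,000,000 cells
```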
Thanks, J.
Hi @JovanVeljanoski,
I did a few checks:
- To test my machine, I ran the taxi ride notebook as a benchmark.
- To test whether the problem is related to the fact that my yearly data are divided into monthly files, I created a single hdf5 file and retried the analysis. It failed again. As a side note (maybe it's a known issue): when I call `df.head()` after reading in the separate monthly files, I get the error `ValueError: array is of length 80000000, while the length of the DataFrame is 10` (note that 80,000,000 is the size of just one monthly file). I repeated the experiment with 100M rows and got the same problem; I had to downsample to 10M to get an output.
- Finally, I found the source of the problem. Following your suggestion, I took a look at the "nasty" variable causing the problem (in my case a rather important one: `customer_id`), and it has 9M unique values. So, to check whether the problem only appears when grouping by variables with many unique values, I tried grouping by other combinations of variables with fewer unique values. I managed to use 7-8 variables, with multiple aggregation functions, before getting the usual "kernel restart" error. Of course, as soon as I add grouping variables, the number of returned rows starts to grow exponentially. So, I think the problem is linked to the number of rows the grouping returns, as it ends up eating all the memory (I have 32 GB).
Also, to provide more context on the use case: I need to group by the "nasty" variable (and many more) because the first step in my analysis is deduplication. Unfortunately, at the moment I am not able to proceed with the actual analysis using Vaex.
As an aside, I started testing out the library a week ago and fell in love with it. The memory management is amazing (apart from this small issue with grouping). I hope this feedback helps to improve the library.
Best, Pietro