vaex: Kernel dies when using many columns in groupby
For example, if I have the table below:

```python
from itertools import product
import pandas as pd

listA = list(range(1, 14))
listB = list(range(1, 26))
listC = list(range(1, 23))
listD = list(range(1, 25))
listE = list(range(1, 11))

# Full cartesian product of the five lists
PDF = pd.DataFrame(
    list(product(listA, listB, listC, listD, listE)),
    columns=['listA', 'listB', 'listC', 'listD', 'listE'],
)
```

The above creates a pandas DataFrame with roughly 1.7 million rows.
Converting it into a vaex DataFrame:

```python
import vaex

VDF = vaex.from_pandas(PDF)
```
Now, grouping by progressively more columns:

```python
VDF.groupby(VDF.listA, agg='count')                                     # fast
VDF.groupby([VDF.listA, VDF.listB], agg='count')                        # still fast, but considerably slower
VDF.groupby([VDF.listA, VDF.listB, VDF.listC], agg='count')             # really slow
VDF.groupby([VDF.listA, VDF.listB, VDF.listC, VDF.listD], agg='count')  # kills the kernel
```
The funny thing is, dask and pandas handle this easily. Do you have any lead on this behaviour?
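For reference, a rough sketch of the equivalent pandas call (not my exact code): pandas only materializes the combinations that actually occur in the data, so this completes without issue.

```python
# Rough pandas equivalent of the 4-column count (a sketch)
pandas_counts = (
    PDF.groupby(['listA', 'listB', 'listC', 'listD'])
       .size()
       .reset_index(name='count')
)
print(len(pandas_counts))  # at most 13 * 25 * 22 * 24 = 171,600 rows
```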
Right now I'm using a loop to limit each groupby to only 2 columns, but it adds considerable time.
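Something along these lines (a simplified sketch, not the exact production code): iterate over the values of two of the columns, filter, group by the remaining two so each vaex groupby only ever sees 2 columns, then stitch the partial results back together in pandas.

```python
import pandas as pd

partial_results = []
for a in VDF.listA.unique():
    for b in VDF.listB.unique():
        # restrict to one (listA, listB) pair, then group by the remaining columns
        subset = VDF[(VDF.listA == a) & (VDF.listB == b)]
        counts = subset.groupby([subset.listC, subset.listD], agg='count').to_pandas_df()
        counts['listA'] = a
        counts['listB'] = b
        partial_results.append(counts)

result = pd.concat(partial_results, ignore_index=True)
```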
By the way, this comes from real financial data, but since I'm not able to release that data, this is an equivalent representation. The real data is ~50 times bigger, since each combination has several periods and multiple lines.
Thanks!
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (7 by maintainers)
This was addressed. Please reopen if you still have the same problems.
Hi @pietrolesci
Thank you very much for the report. Indeed, this is a known issue - if you check elsewhere on the issue board you will find a more verbose explanation of why this happens. In short, when doing a groupby on multiple columns, vaex computes the cartesian product of all unique values in the groupby columns, even if not all combinations exist in the data. The memory error happens because vaex expects that full grid to be the output dataframe. For the time being, you need to be mindful of this. We are looking to improve this in the near future.
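To give a sense of the scale involved, a quick back-of-the-envelope illustration (plain arithmetic, not vaex internals):

```python
# Size of the grid implied by the cartesian product of unique value counts.

# Toy example from the original report (listA..listD):
print(13 * 25 * 22 * 24)       # 171,600 cells - still manageable

# A single high-cardinality column (say, an id with ~10 million unique values)
# combined with two modest columns already implies billions of cells:
print(10_000_000 * 25 * 22)    # 5,500,000,000 cells
```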
Thanks, J.
Hi @JovanVeljanoski,
I did a few checks:
- To test my machine, I ran the taxi ride notebook as a benchmark.
- To test whether the problem is related to the fact that my yearly data are divided into monthly files, I created a single hdf5 file and retried the analysis. It failed again. As a side note (maybe it's a known issue): when I call `df.head()` after reading in the separate monthly files, I get the error `ValueError: array is of length 80000000, while the length of the DataFrame is 10` (note that 80,000,000 is the size of just one monthly file). I repeated the experiment with 100M rows and got the same problem; I had to downsample to 10M to get an output.
- Finally, I found the source of the problem. Following your suggestion, I took a look at the "nasty" variable causing the problem (in my case a rather important one: `customer_id`), and it has 9M unique values. So, to check whether the problem only appears when grouping by variables with many unique values, I tried grouping by other combinations of variables with fewer unique values. I managed to use 7-8 variables, with multiple aggregation functions, before getting the usual "kernel restart" error. Of course, as soon as I add grouping variables, the number of returned rows starts to grow exponentially. So, I think the problem is linked to the number of rows the grouping returns, as it ends up eating all the memory (I have 32 GB).
Also, to provide more context on the use case: I need to group by the "nasty" variable (and many more) because the first step in my analysis is deduplication. Unfortunately, at the moment I am not able to proceed with the actual analysis using Vaex.
As an aside, I started testing out the library a week ago and fell in love with it. The memory management is amazing (apart from this small issue with grouping). I hope this feedback helps to improve the library.
Best, Pietro