altair: MaxRowsError for pandas.df with > 5000 rows

Hey,

Thanks for the package, I’m very keen to try it out on my own data. When I try to create a simple histogram, though, Vega-Lite fails on dataframes with more than 5000 rows. Here’s a minimal reproducible example:

import altair as alt
import numpy as np
import pandas as pd

lengths = np.random.randint(0, 2000, 6000)
lengths_list = lengths.tolist()
labels = [str(i) for i in lengths_list]
peak_lengths = pd.DataFrame.from_dict({'coords': labels, 'length': lengths_list}, orient='columns')
alt.Chart(peak_lengths).mark_bar().encode(alt.X('length:Q', bin=True), y='count(*):Q')

Here’s the error:

---------------------------------------------------------------------------
MaxRowsError                              Traceback (most recent call last)
~/anaconda/envs/py3/lib/python3.5/site-packages/altair/vegalite/v2/api.py in to_dict(self, *args, **kwargs)
    259         copy = self.copy()
    260         original_data = getattr(copy, 'data', Undefined)
--> 261         copy._prepare_data()
    262 
    263         # We make use of two context markers:

~/anaconda/envs/py3/lib/python3.5/site-packages/altair/vegalite/v2/api.py in _prepare_data(self)
    251             pass
    252         elif isinstance(self.data, pd.DataFrame):
--> 253             self.data = pipe(self.data, data_transformers.get())
    254         elif isinstance(self.data, six.string_types):
    255             self.data = core.UrlData(self.data)

~/anaconda/envs/py3/lib/python3.5/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    550     """
    551     for func in funcs:
--> 552         data = func(data)
    553     return data
    554 

~/anaconda/envs/py3/lib/python3.5/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    281     def __call__(self, *args, **kwargs):
    282         try:
--> 283             return self._partial(*args, **kwargs)
    284         except TypeError as exc:
    285             if self._should_curry(args, kwargs, exc):

~/anaconda/envs/py3/lib/python3.5/site-packages/altair/vegalite/data.py in default_data_transformer(data)
    122 @curry
    123 def default_data_transformer(data):
--> 124     return pipe(data, limit_rows, to_values)
    125 
    126 

~/anaconda/envs/py3/lib/python3.5/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    550     """
    551     for func in funcs:
--> 552         data = func(data)
    553     return data
    554 

~/anaconda/envs/py3/lib/python3.5/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    281     def __call__(self, *args, **kwargs):
    282         try:
--> 283             return self._partial(*args, **kwargs)
    284         except TypeError as exc:
    285             if self._should_curry(args, kwargs, exc):

~/anaconda/envs/py3/lib/python3.5/site-packages/altair/vegalite/data.py in limit_rows(data, max_rows)
     47             return data
     48     if len(values) > max_rows:
---> 49         raise MaxRowsError('The number of rows in your dataset is greater than the max of {}'.format(max_rows))
     50     return data
     51 

MaxRowsError: The number of rows in your dataset is greater than the max of 5000

A quick search of the issues didn’t turn up any hits for MaxRowsError. There is a related issue (#287), but that one involved a dataset with >300k rows, and I have a measly 35k. Also, the FAQ link referenced in that issue now returns a 404. In the meantime, does the advice in #249 still apply?

Package info: Running on Altair 2.0.0rc1, JupyterLab 0.31.12-py35_1 conda-forge

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 39 (21 by maintainers)

Most upvoted comments

+1 to documenting how to work around these limits and including a link to the docs in the error message. I think that could go a long way toward addressing newbie frustration.

I just opened #672, which would allow users to run alt.data_transformers.enable('no_max_rows') and then be able to embed arbitrarily large datasets in the notebook, if that is what they wish to do.

@ellisonbg, do you think we should offer that option?

Here is the issue on JupyterLab tracking the fix on that side. We are planning a new release on April 9th; the fix should be in that release.

Hi all, sorry about the delay. There are two sides to this:

Max rows

The magic to increase the max rows is this:

import altair as alt
from altair import pipe, limit_rows, to_values

# Register and enable a data transformer that raises the row limit to 10,000.
t = lambda data: pipe(data, limit_rows(max_rows=10000), to_values)
alt.data_transformers.register('custom', t)
alt.data_transformers.enable('custom')
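
With that in place, a quick sanity check (reusing the peak_lengths dataframe from the example at the top of this issue):

# The 6,000-row dataframe now embeds without raising MaxRowsError,
# since the custom transformer allows up to 10,000 rows.
alt.Chart(peak_lengths).mark_bar().encode(
    alt.X('length:Q', bin=True),
    y='count(*):Q',
)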

I have thought that we probably want a higher-level API than this: it is powerful and flexible, but a bit too low-level for this simple use case. One idea is to have the data transformers take a dict of options that gets passed as **kwargs to the individual stages, but I’m not sure how well that would work.
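
Purely as an illustration of that idea (this is not an existing Altair API, just a sketch of what such an options dict could look like on top of the current transformers):

import altair as alt

# Hypothetical sketch only: per-stage options forwarded as **kwargs.
def transformer_with_options(data, options=None):
    options = options or {}
    return alt.pipe(
        data,
        alt.limit_rows(**options.get('limit_rows', {})),  # e.g. {'max_rows': 10000}
        alt.to_values,
    )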

CSV/JSON URL bugs

I too am seeing some bugs related to using the csv and json data transformers. The issue is that the actual URL path to the generated file depends on how the notebook server is set up. There are options in the underlying data transformers to customize that, but it is painful. I think the right solution will be to change the renderers in lab/notebook to resolve the URLs automatically. Here is some hackish code I am using to get around this:


import altair as alt

def custom(data):
    # Write the data out as JSON and reference it by URL instead of embedding it;
    # base_url must point at wherever the notebook server serves the file from.
    return alt.pipe(data, alt.to_json(
        base_url='<url to the base of the server for your user>'
    ))

alt.data_transformers.register('custom', custom)
alt.data_transformers.enable('custom')
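
For what it’s worth, a quick way to sanity-check the transformer is to call it directly on a dataframe (peak_lengths here is the dataframe from the example at the top of this issue); in Altair 2.x it should hand back a small URL-style data spec rather than the inline rows, though the exact keys may differ:

# Writes the dataframe to a .json file on disk and returns a reference to it,
# roughly of the form {'url': '<base_url>/...json', 'format': {'type': 'json'}}.
print(custom(peak_lengths))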

@domoritz is this the option we want to use for this in the renderers?

https://github.com/vega/vega-embed/blob/master/src/embed.ts#L20

Great, thanks! For anyone else who comes across this, you can find more information on the available options here. There are also the built-in functions altair.limit_rows() and altair.sample() for testing figures against subsets of large data.
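
For example, a hedged sketch of the sampling approach (the transformer name 'sampled' is just an example, and I’m assuming alt.sample takes an n= argument as described in those docs):

import altair as alt

# Register a transformer that plots a random subset of rows while iterating
# on a figure, mirroring the limit_rows/to_values pattern shown above.
alt.data_transformers.register(
    'sampled',
    lambda data: alt.pipe(data, alt.sample(n=5000), alt.to_values),
)
alt.data_transformers.enable('sampled')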

Update: if you use version 0.32 or newer of JupyterLab, this will work correctly; i.e. you can run

alt.data_transformers.enable('json')

and data will be saved as a file and loaded into the chart by URL.

This is a deliberate setting, intended to keep users from inadvertently generating notebooks so large and unwieldy that they crash the browser.

For large datasets, we’d suggest not embedding the data directly into the notebook (the default behavior), but rather saving it as a CSV or JSON file before rendering. You can do this by running

alt.data_transformers.enable('csv')

That said, I’m not able to get this to work when I try it; the resulting chart comes up empty.

@ellisonbg designed the data transformer interface; hopefully he can weigh in and recommend a solution that will work.