pandas: Memory leak in `df.to_json`

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np


df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

while True:
    body = df.T.to_json()
    print("HI")


Problem description

If we repeatedly call to_json() on a DataFrame, memory usage grows continuously:

[image: plot of process memory usage growing steadily over repeated to_json() calls]

Expected Output

I would expect memory usage to stay constant.
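One way to make the growth visible without an external profiler is to print the process memory high-water mark as the loop runs. A minimal sketch (Unix-only, standard-library resource module; the iteration counts are illustrative, not from the original report):

import resource

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

for i in range(200_000):
    body = df.T.to_json()
    if i % 20_000 == 0:
        # Peak resident set size so far (kilobytes on Linux, bytes on macOS).
        # On an affected pandas build this climbs steadily with i.
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)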

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: 0.25.2
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

About this issue

  • Original URL: https://github.com/pandas-dev/pandas/issues/24889
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Hi, I am facing the same issue with a memory leak in df.to_json().

When I use df.to_dict() and pass the result to Python's json.dumps, memory use is stable: [image: to-json-workaround]

But when I use df.to_json(), memory grows: [image: using_to_json]

Code Sample

import json

import pandas as pd


def list_to_df_json(data):
    point_classified = {}
    for i in data:
        if i['point_id'] not in point_classified:
            point_classified[i['point_id']] = {}
        point_classified[i['point_id']][i['timestamp']] = i['point_value']
    return point_classified


def boo(a):
    data = list_to_df_json(a)

    for point_id, point_value_of_that_id in data.items():
        # logging.info(f"pushing data from pointid : {point_id} ")
        df = pd.DataFrame.from_dict(point_value_of_that_id, orient='index', columns=[point_id])

        # dict_df = df.to_dict(orient='index')
        # workaround
        # json_df = json.dumps(dict_df)

        # memory leak
        json_df = df.to_json(orient='index')
    return json_df


while True:
    a = [{'point_id': 'a', 'point_value': 346.9, 'timestamp': '2019-12-01 08:15:00'},
         {'point_id': 'a', 'point_value': 247.2, 'timestamp': '2019-12-01 08:30:00'},
         {'point_id': 'a', 'point_value': 237.9, 'timestamp': '2019-12-01 08:45:00'},
         {'point_id': 'a', 'point_value': 215.2, 'timestamp': '2019-12-01 09:00:00'},
         {'point_id': 'b', 'point_value': 276.8, 'timestamp': '2019-12-01 09:15:00'},
         {'point_id': 'b', 'point_value': 296.1, 'timestamp': '2019-12-01 09:30:00'},
         {'point_id': 'b', 'point_value': 328.0, 'timestamp': '2019-12-01 09:45:00'}]

    print(boo(a))
    # pd.show_versions()
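Distilled down, the two serialization paths the comment compares look like this (a sketch; the sample DataFrame is illustrative). Note that json.dumps handles the float values here because numpy floats subclass Python's float; integer columns would need converting first:

import json

import pandas as pd

df = pd.DataFrame({'a': [346.9, 247.2]},
                  index=['2019-12-01 08:15:00', '2019-12-01 08:30:00'])

# Path that leaks on affected versions: pandas' built-in C serializer.
leaky = df.to_json(orient='index')

# Reported workaround: convert to a plain dict, then serialize with the
# standard-library json module.
stable = json.dumps(df.to_dict(orient='index'))

# Both round-trip to the same structure for this data.
assert json.loads(leaky) == json.loads(stable)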

INSTALLED VERSIONS

commit: None
python: 3.6.10.final.0
python-bits: 64
OS: Darwin
OS-release: 19.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: 5.4.3
pip: 19.3.1
setuptools: 44.0.0.post20200106
Cython: None
numpy: 1.19.1
scipy: 1.5.2
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2019.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.3.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.5.0
bs4: 4.8.2
html5lib: None
sqlalchemy: 1.3.18
pymysql: 0.9.3
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

It’s also worth isolating the to_json part from the df.T part.
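For example, hoisting the transpose out of the loop (a quick sketch, not from the thread) separates the two suspects:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
transposed = df.T  # transpose once, up front

while True:
    # If memory still grows with the transpose hoisted out, the leak
    # is in to_json itself rather than in the repeated df.T.
    body = transposed.to_json()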

On Thu, Jan 24, 2019 at 4:03 PM chris-b1 notifications@github.com wrote:

FWIW this seems to take a ton of iterations and doesn’t really leak much memory, but investigations are welcome.
