DataProfiler: diff_report = profile1.diff(profile2) -> Object of type int64 is not JSON serializable

General Information:

  • OS: OSX
  • Python version: python 3.8
  • Library version: current

Describe the bug:

Object of type int64 is not JSON serializable

Problem:

'conservative': {'df': xxxx, 'p-value': 0.0}, 'welch': {'df': xxxx, 'p-value': 0.0}}

To Reproduce:

diff_report = profile1.diff(profile2)
json.dumps(diff_report)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 29 (11 by maintainers)

Most upvoted comments

Excellent - thanks, both!

Fix incoming soon: @turtlemonvh, ty for identifying the issue!

I ran into this issue using the example in your README (https://github.com/capitalone/DataProfiler/blob/main/README.md), which uses

readable_report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(readable_report, indent=4))

So I think it would make sense to either

  • Make it so compact doesn’t hit this serialization error
  • Change your example to use serializable so people know to use that when dumping to JSON

@turtlemonvh The fix only alters the profile when serializable is requested, so that otherwise profile_schema still matches the dataframe that was originally input. Hence, either serializable or pretty must be set as the output_format option when the report is going to be dumped to JSON.

LMK if you think that it should be otherwise.

@turtlemonvh until the fix lands, a workaround would be to set the column headers or, if profile_schema contains keys that are np.int64, to convert them to integers.
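For anyone who needs the second workaround before the fix lands, here is a minimal sketch of that conversion. It assumes the report is a plain nested dict/list structure, and it relies on numpy integer and float scalars being registered with Python's numbers ABCs. to_native is a made-up helper name, not part of DataProfiler.

```python
import json
import numbers


def to_native(obj):
    """Recursively convert numpy-style scalars (dict keys and values) to built-in types."""
    if isinstance(obj, dict):
        # Keys are converted too, since json rejects np.int64 keys.
        return {to_native(k): to_native(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_native(v) for v in obj]
    # Leave None, bool and str alone (bool is checked before Integral on purpose).
    if obj is None or isinstance(obj, (bool, str)):
        return obj
    if isinstance(obj, numbers.Integral):
        return int(obj)
    if isinstance(obj, numbers.Real):
        return float(obj)
    # Anything else (e.g. arrays) is passed through unchanged.
    return obj
```

With a real report this would be used as json.dumps(to_native(readable_report)). Numpy arrays and np.bool_ are not handled by this sketch and would still need their own conversion.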

Yup, got the bug.

import json

import dataprofiler as dp


data = dp.Data("./dataprofiler/tests/data/csv/iris_no_header.csv")
profiler = dp.Profiler(data)
readable_report = profiler.report(report_options={"output_format": "serializable"})
json.dumps(readable_report)

@turtlemonvh which version of python?

EDIT: nvm i see py3.8?

@taylorfturner this might be dataset dependent, e.g. on whether the dataset itself has column headers.

Yeah that worked.

readable_report = profile.report(report_options={"output_format": "serializable"})

# Error
json.dumps(readable_report)

# OK
del readable_report['global_stats']['profile_schema']
json.dumps(readable_report)

Yep - I can’t share the data, but the code is pretty much straight out of your readme.

import json
from dataprofiler import Data, Profiler

data = Data("some-data.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc

# https://capitalone.github.io/DataProfiler/docs/0.7.10/html/profiler.html
for foption in ["pretty", "compact", "serializable", "flat"]:
    try:
        print(f"Trying option: {foption}")
        readable_report = profile.report(report_options={"output_format": foption})

        with open("capone_profile.json", "w+") as f:
            json.dump(readable_report, f, indent=4)
    except TypeError as e:
        # Report the failure and move on to the next output format.
        print(e)

I’ll take a look at the structure of the Profiler object now to see if anything pops out.

EDIT: I have profile.report() pulled up in an ipython shell now if there is anything in particular you want me to check in that output.
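One way to see whether anything pops out of the report structure is to walk it and list every dict key that json would reject. This is only a sketch: find_bad_keys is a hypothetical helper, and the sample dict is an invented shape rather than real profiler output. A tuple key stands in for np.int64 here so the snippet runs without numpy. np.int64 is not a subclass of Python's int, so it would be flagged the same way.

```python
def find_bad_keys(obj, path=()):
    """Yield (path, key_type_name) for every dict key json.dumps would reject."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            # json only accepts str, int, float, bool or None as keys
            # (bool is a subclass of int, so it is covered by the int check).
            if key is not None and not isinstance(key, (str, int, float)):
                yield path, type(key).__name__
            yield from find_bad_keys(value, path + (key,))
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            yield from find_bad_keys(value, path + (index,))


# Invented report shape for illustration only.
sample = {"global_stats": {"profile_schema": {(0,): [0], "name": [1]}}}
bad = list(find_bad_keys(sample))
print(bad)
```

Running list(find_bad_keys(profile.report())) in the ipython shell should then point at the sub-dict holding the offending keys.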

I’m hitting something similar on v0.7.10

Traceback (most recent call last):
  File "profile_capone.py", line 13, in <module>
    json.dump(readable_report, f, indent=4)
  File "/home/timothy/anaconda3/lib/python3.8/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/home/timothy/anaconda3/lib/python3.8/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/home/timothy/anaconda3/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/home/timothy/anaconda3/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/home/timothy/anaconda3/lib/python3.8/json/encoder.py", line 376, in _iterencode_dict
    raise TypeError(f'keys must be str, int, float, bool or None, '
TypeError: keys must be str, int, float, bool or None, not int64
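Note that this traceback complains about keys, while the issue title complains about a value. A small sketch of why the two need different handling: json's default hook is only consulted for values it cannot encode, never for dict keys. Int64Like is a hypothetical stand-in for np.int64 (which likewise is not an int subclass and exposes .item()), so the snippet runs without numpy. numpy_default and try_dump are made-up names too.

```python
import json


class Int64Like:
    """Hypothetical stand-in for np.int64: not an int subclass, but has .item()."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


def numpy_default(obj):
    """json `default` hook: unwrap scalar-like objects that expose .item()."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def try_dump(payload):
    """Return the JSON string, or the TypeError message if encoding fails."""
    try:
        return json.dumps(payload, default=numpy_default)
    except TypeError as exc:
        return f"TypeError: {exc}"


# As a value, the hook is called and the dump succeeds.
value_case = try_dump({"df": Int64Like(3)})
# As a key, the hook is never called and the dump still fails.
key_case = try_dump({Int64Like(0): "column"})
print(value_case)
print(key_case)
```

So a default hook alone would address the "Object of type int64 is not JSON serializable" case, but the keys have to be converted before calling json.dump.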

I tried 3 formatting options (from: https://capitalone.github.io/DataProfiler/docs/0.7.10/html/profiler.html#reporting-structure) and got the same error for each:

Trying option: pretty
keys must be str, int, float, bool or None, not int64
Trying option: compact
keys must be str, int, float, bool or None, not int64
Trying option: serializable
keys must be str, int, float, bool or None, not int64