jsonschema: performance regression from 3.2.0 to 4.0.1

Hi @Julian, first of all, thanks for this library and all the hard work. We rely heavily on it and are delighted to see Draft 2020-12 support, as it is required for OpenAPI 3.1.0.

When updating, I noticed our test suite ran noticeably slower. I compiled a quick reproduction for you. On my machine I get a 5x slowdown with the new version. Is this performance hit to be expected due to the new capabilities or is it a regression?

import jsonschema
import yaml
import json
import time

# https://raw.githubusercontent.com/tfranzel/drf-spectacular/master/tests/test_basic.yml
with open('tests/test_basic.yml') as fh:
    data = yaml.load(fh.read(), Loader=yaml.SafeLoader)

# https://raw.githubusercontent.com/tfranzel/drf-spectacular/master/drf_spectacular/validation/openapi3_schema.json
# which comes from:
# https://github.com/OAI/OpenAPI-Specification/blob/6d17b631fff35186c495b9e7d340222e19d60a71/schemas/v3.0/schema.json
with open('drf_spectacular/validation/openapi3_schema.json') as fh:
    openapi3_schema_spec = json.load(fh)

t_acc = 0

for i in range(500):
    t0 = time.time()
    jsonschema.validate(instance=data, schema=openapi3_schema_spec)
    t1 = time.time()
    t_acc += t1 - t0

print(f'{t_acc} sec')
✗ python --version; pip freeze | grep json; python test.py
Python 3.9.5
jsonschema==3.2.0
5.254251718521118 sec

✗ python --version; pip freeze | grep json; python test.py
Python 3.9.5
jsonschema==4.0.1
28.189855813980103 sec

✗ python --version; pip freeze | grep json; python test.py
Python 3.9.7
jsonschema==4.2.1
27.27832531929016 sec

✗ python --version; pip freeze | grep json; python test.py
Python 3.9.7
jsonschema==4.3.1
8.10183048248291 sec

EDIT: included measurement for 4.2.1 release
EDIT: included measurement for 4.3.1 release

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 9
  • Comments: 18 (9 by maintainers)

Most upvoted comments

Thanks all for the feedback and a big thanks again to @Stranger6667. Sounds like we can close this for now.

Seems like it improved a lot for the specific testcase I used, 80 or so times faster and only a few % slower than 3.2.0

jsonschema.__version__='4.3.1' completed in 0.83s
(image: v4.3)

v4.3.1 is out with great thanks to @Stranger6667 for putting in the time to make the fix. I haven’t fully tested myself against the examples above but please do share feedback.

This is definitely a reasonable assumption so sounds like such a patch would be very much appreciated. Thanks for the investigation.

Great! Happy to help 😃

After some more adjustments:

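from functools import lru_cache
from urllib.parse import urldefrag

# Excerpt of the adjusted methods; the rest of the class is elided.
class RefResolver: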
    ...
    @lru_cache()
    def _find_ids(self):
        return list(self._finditem(self.referrer, "$id"))

    @lru_cache()
    def _find_in_subschema(self, url):
        uri, fragment = urldefrag(url)
        for subschema in self._find_ids():
            target_uri = self._urljoin_cache(
                self.resolution_scope, subschema["$id"],
            )
            if target_uri.rstrip("/") == uri.rstrip("/"):
                if fragment:
                    subschema = self.resolve_fragment(subschema, fragment)
                return url, subschema
        return None

    def resolve(self, ref):
        """
        Resolve the given reference.
        """
        url = self._urljoin_cache(self.resolution_scope, ref).rstrip("/")

        match = self._find_in_subschema(url)
        if match is not None:
            return match

        return url, self._remote_cache(url)

Cuts the execution time to 11.305032968521118 sec

I’ll submit a patch shortly
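For readers unfamiliar with the caching idea used above, here is a minimal sketch on a toy class. `ToyResolver`, `_walk` and the `scans` counter are hypothetical names for illustration only and are not part of jsonschema. It shows that a method wrapped in `lru_cache` walks the schema once and serves later calls from the cache.

```python
from functools import lru_cache


class ToyResolver:
    """Hypothetical stand-in for illustration; not jsonschema's RefResolver."""

    def __init__(self, schema):
        self.schema = schema
        self.scans = 0  # counts how often the full schema walk actually runs

    def _walk(self, node):
        # Recursively yield every dict that carries an "$id" key.
        if isinstance(node, dict):
            if "$id" in node:
                yield node
            for value in node.values():
                yield from self._walk(value)
        elif isinstance(node, list):
            for value in node:
                yield from self._walk(value)

    @lru_cache()
    def find_ids(self):
        # The cache key is `self`, so each instance is scanned only once.
        self.scans += 1
        return list(self._walk(self.schema))


schema = {
    "$id": "root",
    "properties": {"a": {"$id": "child"}, "b": [{"type": "string"}]},
}
resolver = ToyResolver(schema)
first = resolver.find_ids()
second = resolver.find_ids()  # served from the cache, no second walk
ids = [sub["$id"] for sub in first]
```

One trade-off to keep in mind: `lru_cache` on a method holds a reference to the instance, and the cached result goes stale if the schema is mutated afterwards.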

It will take some digging, but there is indeed a “known” performance change in 4.0.0: https://github.com/Julian/jsonschema/blob/main/CHANGELOG.rst#v400

specifically:

False and 0 are now properly considered non-equal even recursively within a container (#686). As part of this change, uniqueItems validation may be slower in some cases. Please feel free to report any significant performance regressions, though in some cases they may be difficult to address given the specification requirement.
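To illustrate why such a change can cost time: in Python `0 == False` is true, so hash-based deduplication cannot tell the two apart, and a type-aware check may have to fall back to pairwise comparison. The following is a minimal sketch under that assumption. `strict_equal` and `is_unique_strict` are hypothetical helpers, not jsonschema's actual implementation.

```python
def strict_equal(a, b):
    """Equality where a bool never equals a number, applied recursively."""
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            strict_equal(x, y) for x, y in zip(a, b)
        )
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            strict_equal(a[key], b[key]) for key in a
        )
    return a == b


def is_unique_strict(items):
    """Pairwise uniqueness check: quadratic in len(items)."""
    return not any(
        strict_equal(x, y)
        for i, x in enumerate(items)
        for y in items[i + 1:]
    )


# Plain Python semantics collapse 0 and False into one set element.
naive_distinct = len({0, False})
```

The pairwise loop is what makes this kind of check slower than a set-based one on large arrays.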

If someone could confirm (or deny) whether that’s the culprit it’d be helpful, but yeah, this may need some investigating (something I probably won’t have time for for at least a few days).

CC @skamensky as well who I know was interested in doing some performance optimization – here’s another benchmark we may want to adopt or use to drive any change.

I noticed this too: the test suite for an application that uses jsonschema heavily now takes 60 minutes, versus 14 before.

Updated my measurements in the OP. Thank you guys for putting in the work! ❤️ It is a 3x improvement, but still a little slower than the 3.2 version. I think this is a manageable slowdown now, and from my side the ticket could be closed.

I wanted to have a quick look to see if I could spot something obviously wrong, but haven’t succeeded. I do have some nice flame charts to share, though:

jsonschema.__version__='3.2.0' completed in 0.77s
(flame chart: v3.2)

jsonschema.__version__='4.2.0' completed in 46.77s
(flame chart: v4.2)

Made using

from urllib.request import urlopen
import json
import jsonschema
from time import time

with urlopen('https://raw.githubusercontent.com/vega/schema/master/vega-lite/v4.8.1.json') as f:
  schema = json.load(f)
with urlopen('https://raw.githubusercontent.com/vega/vega-lite/master/examples/specs/bar.vl.json') as f:
  instance = json.load(f)

v = jsonschema.Draft7Validator(schema)

start = time()
for i in range(1000):
    v.validate(instance)

print(f"{jsonschema.__version__=} completed in {time() - start: 0.2f}s")
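The tool used for the flame charts is not named here. As a rough substitute, the standard library's `cProfile` can list the most expensive calls. This is a minimal sketch, assuming a hypothetical helper `top_calls` and a toy workload standing in for `v.validate(instance)`.

```python
import cProfile
import io
import pstats


def top_calls(func, limit=10):
    """Run func() under cProfile; return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


def toy_workload():
    # Stand-in for the validation loop in the script above.
    return sum(len(str(i)) for i in range(10_000))


report = top_calls(toy_workload)
print(report)
```

To profile the real case, pass something like `lambda: v.validate(instance)` instead of `toy_workload`.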

Another quick note here about the impact of this: in altair_viewer, with jsonschema<4.0, the test suite runs in 30 seconds. With jsonschema 4.0 or newer, the test suite times out after 6 hours. We fixed this in https://github.com/altair-viz/altair_viewer/pull/44 by pinning to jsonschema<4.0