pydantic: Unable to `cloudpickle` Pydantic model classes

Initial Checks

  • I confirm that I’m using Pydantic V2

Description

cloudpickle cannot serialize Pydantic model classes. It fails with a TypeError: cannot pickle 'pydantic_core._pydantic_core.SchemaSerializer' object exception.

Example Code

# bug.py

"""Cloudpickling Pydantic models raises an exception."""

from pydantic import BaseModel
import ray.cloudpickle as cloudpickle

class SimpleModel(BaseModel):
    val: int

cloudpickle.dumps(SimpleModel)

"""
Output:

% python bug.py

Traceback (most recent call last):
  File "/Users/shrekris/Desktop/scratch/dump4.py", line 9, in <module>
    cloudpickle.dumps(SimpleModel)
  File "/Users/shrekris/Desktop/ray/python/ray/cloudpickle/cloudpickle_fast.py", line 88, in dumps
    cp.dump(obj)
  File "/Users/shrekris/Desktop/ray/python/ray/cloudpickle/cloudpickle_fast.py", line 733, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'pydantic_core._pydantic_core.SchemaSerializer' object
"""

Python, Pydantic & OS Version

/Users/shrekris/miniforge3/envs/pydantic-fix/lib/python3.9/site-packages/pydantic/_migration.py:275: UserWarning: `pydantic.utils:version_info` has been moved to `pydantic.version:version_info`.
  warnings.warn(f'`{import_path}` has been moved to `{new_location}`.')
             pydantic version: 2.0.3
        pydantic-core version: 2.3.0 release build profile
                 install path: /Users/shrekris/miniforge3/envs/pydantic-fix/lib/python3.9/site-packages/pydantic
               python version: 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:38:11)  [Clang 14.0.6 ]
                     platform: macOS-11.4-arm64-arm-64bit
     optional deps. installed: ['typing-extensions']

Selected Assignee: @lig

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 31 (18 by maintainers)

Most upvoted comments

I believe we are going to release 2.5 beta today, with a view to a final 2.5 production release next week.

@edoakes sorry for the delay I was just able to test the updated packages, and I can confirm it worked perfectly for my use case 😃 Thanks a lot!

I don’t think SchemaSerializer needs to be picklable; you should just store the core schema and recreate the serializer.

@davidhewitt any update on when the next pydantic release is going to come out? This is a pretty big pain point for our users and I want to make sure we can unpin the dependency soon.

Reading from https://github.com/pydantic/pydantic/issues/8028 there seem to be a couple of items still pending before 2.5 comes out

No specific ETA yet; I would assume the next release will be 2.5.

Thanks for the help @davidhewitt 🚀! Do you know when the next release is scheduled and what its version tag will be?

Both PRs linked above are now merged. I’ve begun manually testing, and they appear to address the issue for all of my use cases. @jrapin, if you could test out your workflow with pydantic_core and pydantic installed from main, that would be helpful.

@davidhewitt Please let me know when a pydantic_core release can be made, so that I can add integration testing and catch the next pydantic release.

I have opened PRs for each of the above issues:

I will begin testing that these fixes are comprehensive using locally installed copies. Once these PRs have been merged and a new version of pydantic_core has been released, I will add integration/regression tests to the pydantic repo.

Yes, exactly. We can release pydantic_core shortly after your PR is merged, so there isn’t a long delay in getting this working on pydantic main.

@davidhewitt I’d prefer to merge the functionality into pydantic so that cloudpickle “just works” out of the box and folks don’t have to worry about patching things.

Storing the members and defining __reduce__ on SchemaSerializer itself would indeed be preferable; I’m just not sure how to accomplish that using PyO3 (I’m not familiar with the framework). I can try to get it working.

So then the plan of action would be:

  • PR 1 (against pydantic_core): Make SchemaSerializer directly cloudpickleable in pydantic_core by storing references to the constructor arguments.
  • PR 2 (against pydantic): Use a WeakRefWrapper similar to the above rather than subclassing weakref.ref directly.

With these two, we should be good to go. Given that these changes would be split across the repos, is there any special versioning story between them? Or does pydantic treat pydantic_core like any other Python package dependency? In PR 2 I’ll add a regression test but that will depend on using a version of pydantic_core that includes PR 1.
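
For reference, such a regression test could be as simple as the following sketch (the test body is illustrative, not the actual PR contents; plain cloudpickle is used here rather than ray.cloudpickle):

import cloudpickle
from pydantic import BaseModel

def test_cloudpickle_model_class():
    # Defining the model inside the test keeps it non-importable, which
    # forces cloudpickle's "by value" mode -- the case that used to fail.
    class Model(BaseModel):
        val: int

    restored = cloudpickle.loads(cloudpickle.dumps(Model))
    assert restored(val=1).model_dump() == {"val": 1}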

I also ran into the above issues:

  • SchemaSerializer not being cloudpickleable due to being a native type (written in Rust).
  • _PydanticWeakRef not being cloudpickleable due to inheriting from weakref.ref, which has known issues with serialization.

I was able to enable cloudpickling a wide variety of Pydantic model definitions with the following two patches:

  1. Wrap SchemaSerializer to save the Python arguments necessary to reconstruct it:
import pydantic
import pydantic._internal._dataclasses
import pydantic._internal._model_construction
import pydantic.type_adapter
from pydantic_core import SchemaSerializer

# Define a wrapper that saves the `schema` and `core_config` needed to reconstruct the native `SchemaSerializer`.
class CloudpickleableSchemaSerializer:
    def __init__(self, schema, core_config):
        self._schema = schema
        self._core_config = core_config
        self._schema_serializer = SchemaSerializer(self._schema, self._core_config)

    def __reduce__(self):
        # Reconstruct the wrapper (and therefore the native serializer) from the saved arguments.
        return CloudpickleableSchemaSerializer, (self._schema, self._core_config)

    def __getattr__(self, attr: str):
        # Delegate everything else to the wrapped native serializer.
        return getattr(self._schema_serializer, attr)

# Override all usages of `SchemaSerializer` (obviously not needed if we upstream the above wrapper):
pydantic._internal._model_construction.SchemaSerializer = CloudpickleableSchemaSerializer
pydantic._internal._dataclasses.SchemaSerializer = CloudpickleableSchemaSerializer
pydantic.type_adapter.SchemaSerializer = CloudpickleableSchemaSerializer

The __getattr__ bit is somewhat hacky. This is required because SchemaSerializer does not allow Python classes to subclass it. This can be cleaned up by adding the subclass parameter to the PyO3 #[pyclass] macro in pydantic_core. I’ve tested this as well, with the wrapper looking like:

# Requires `SchemaSerializer` to be subclassable (the `#[pyclass(subclass)]` change in pydantic_core described above).
from pydantic_core import SchemaSerializer

class CloudpickleableSchemaSerializer(SchemaSerializer):
    def __init__(self, schema, core_config):
        self._schema = schema
        self._core_config = core_config
        # No need for `super().__init__()` because `SchemaSerializer` initialization happens in `__new__`.

    def __reduce__(self):
        return CloudpickleableSchemaSerializer, (self._schema, self._core_config)
  2. Wrap weakref.ref instead of inheriting from it:
import weakref
from typing import Any

import pydantic._internal._model_construction

class WeakRefWrapper:
    def __init__(self, obj: Any):
        if obj is None:
            self._wr = None
        else:
            self._wr = weakref.ref(obj)

    def __reduce__(self):
        # Pickle the referent itself; unpickling re-wraps it in a fresh weakref.
        return WeakRefWrapper, (self(),)

    def __call__(self) -> Any:
        if self._wr is None:
            return None
        else:
            return self._wr()

# Override all usages of `_PydanticWeakRef` (obviously not needed if we upstream the above wrapper):
pydantic._internal._model_construction._PydanticWeakRef = WeakRefWrapper

AFAICT there’s no downside to this wrapper, and it gets around the strange ABC-related pickling error.
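
With both patches applied, a quick check along these lines round-trips the repro from this issue (a minimal sketch using plain cloudpickle; note the monkeypatches must run before any models are defined, since the serializer is created at class-creation time):

import cloudpickle
from pydantic import BaseModel

# Assumes the CloudpickleableSchemaSerializer and WeakRefWrapper
# monkeypatches above have already been applied.

class SimpleModel(BaseModel):
    val: int

restored = cloudpickle.loads(cloudpickle.dumps(SimpleModel))
print(restored(val=1))  # val=1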

@davidhewitt @dmontagu @lig I’m happy to contribute a patch if you think this is a reasonable direction. Let me know what you think. The only downside I can see is that the SchemaSerializer wrapper will hold a reference to the schema and core_config objects (though I imagine these are probably already referenced somewhere in one of the BaseModel or ModelMetaclass members).

One low-baggage alternative is to delete the serializer upon serialization and reconstruct it whenever it’s first called. The SchemaSerializer’s __reduce__ function could be:

def __reduce__(self):
    # Unpickles to `None`: the serializer is simply dropped and must be
    # rebuilt on first use. (cloudpickle can serialize the lambda even
    # though stdlib pickle cannot.)
    return lambda: None, tuple()

Then whenever the SchemaSerializer is first called, the Pydantic model can initialize it using the schema and config and cache it.

This should only affect users that are serializing the SchemaSerializer, and the only added cost is the initialization upon the first call.
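
Concretely, the model-side half of that could look something like this sketch (the accessor function is hypothetical; __pydantic_serializer__ and __pydantic_core_schema__ are the attributes pydantic stores on model classes, and the core config is omitted for brevity):

from pydantic_core import SchemaSerializer

def get_serializer(model_cls):
    # After unpickling, `__pydantic_serializer__` is None (via the
    # `__reduce__` above), so rebuild it from the stored core schema
    # and cache it on the class.
    if model_cls.__pydantic_serializer__ is None:
        model_cls.__pydantic_serializer__ = SchemaSerializer(
            model_cls.__pydantic_core_schema__
        )
    return model_cls.__pydantic_serializer__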

I did some digging, and it looks like cloudpickle switches between “by reference” and “by value” pickling modes according to whether your type is importable.

So in the repro discussed above, the class is __main__.SimpleModel, which is treated as not importable. In this case cloudpickle attempts to recreate a “skeleton” class that functions the same as the provided type. To support “by value” pickling, then, we need to support naive pickling for all the attributes of the class, as @shrekris-anyscale says. I can’t see a way to customise this behaviour; maybe the cloudpickle maintainers know of solutions.

On the other hand, if SimpleModel is moved into a module and imported from there (e.g. from foo import SimpleModel), then cloudpickle will use “by reference” pickling. This already works fine (the pickled data just contains the reference to the import path).

So @shrekris-anyscale, a possible workaround may be to move your model definitions out of __main__ files / entry points and into modules. Without knowing the full details of your application, I don’t know whether that’s actually viable.
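
For illustration, that workaround looks like this (foo.py is a hypothetical module name; plain cloudpickle is used here rather than ray.cloudpickle):

# foo.py -- the model lives in an importable module, not __main__
from pydantic import BaseModel

class SimpleModel(BaseModel):
    val: int

# main.py -- cloudpickle now pickles the class "by reference" (just the
# import path), so the native SchemaSerializer never has to be pickled.
import cloudpickle
from foo import SimpleModel

data = cloudpickle.dumps(SimpleModel)  # succeeds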