Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Equality doesn't work after deserialization of dataclass #500

Closed
debakarr opened this issue Jun 5, 2022 · 11 comments
Closed

Equality doesn't work after deserialization of dataclass #500

debakarr opened this issue Jun 5, 2022 · 11 comments
Labels
Milestone

Comments

@debakarr
Copy link

debakarr commented Jun 5, 2022

from dill import dumps, loads
from dataclasses import dataclass, asdict


@dataclass
class A:
    x: int
    y: str


@dataclass
class B:
    a: A


if __name__ == "__main__":
    a = A(1, "test")
    before = B(a)
    save = dumps(before)
    after = loads(save)
    # assert before == after  # Fails
    assert asdict(before) == asdict(after)  # Fails

Equality fails. Even asdict also does not returns the same dictionary.

@mmckerns
Copy link
Member

mmckerns commented Jun 5, 2022

I don't think this has anything to do with the class being nested...

Python 3.9.13 (main, May 21 2022, 02:35:37) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> class Foo:
...   pass
... 
>>> f = Foo()
>>> ff = dill.loads(dill.dumps(f))
>>> ff == f
False
>>> 
>>> class Bar(object):
...   pass
... 
>>> f = Bar()
>>> ff = dill.loads(dill.dumps(f))
>>> ff == f
False
>>> dill.__version__
'0.3.6.dev0'

What versions of dill and python are you using?

@anivegesana
Copy link
Contributor

I think this is expected behavior right? The default behavior is to serialize all classes in __main__ so the copied object will have a type that is a copy of B, not B itself. Same for the instance of A. Try dumping with byref=True as an additional parameter.

@mmckerns
Copy link
Member

mmckerns commented Jun 5, 2022

@anivegesana: Yes, this is behavior I'd expect, and consistent with:

 >>> f = Foo()
>>> ff = Foo()
>>> 
>>> f == ff
False

@Dibakarroy1997: I'd close this, except that you noted that asdict does not return the same dictionary -- which is true. I was thinking this issue either could be closed, or maybe should be converted to why there might be a difference in the return of asdict. What do you think?

@mmckerns
Copy link
Member

mmckerns commented Jun 5, 2022

>>> import dill
>>> from dataclasses import dataclass, asdict
>>> 
>>> @dataclass
... class A:
...   x: int
... 
>>> a = A(1)
>>> b = dill.copy(a)
>>> asdict(a)
{'x': 1}
>>> asdict(b)
{'x': 1}
>>> 
>>> @dataclass
... class B:
...   a: A
... 
>>> b = B(a)
>>> c = dill.copy(b)
>>> asdict(b)
{'a': {'x': 1}}
>>> asdict(c)
{'a': {}}

@anivegesana
Copy link
Contributor

Well there's the problem

>>> c.a.__dataclass_fields__['x']._field_type is dataclasses._FIELD
False

We should probably either monkey-patch this class with a __reduce__ function or add special code to serialize dataclasses. The first is a little hacky and won't update dataclasses with new Python features (like __match_args__), but will be much simpler.

@debakarr
Copy link
Author

debakarr commented Jun 6, 2022

This might be similar to cloudpipe/cloudpickle#386.

>>> import dill
>>> dill.__version__
'0.3.5.1'
>>> from dataclasses import dataclass
>>> import dataclasses
>>> @dataclass
... class Test:
...     x: int
>>> print(dataclasses.fields(Test))
(Field(name='x',type=<class 'int'>,default=<dataclasses._MISSING_TYPE object at 0x000001815533CCD0>,default_factory=<dataclasses._MISSING_TYPE object at 0x000001815533CCD0>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),_field_type=_FIELD),)
>>> deserialized_class = dill.loads(dill.dumps(Test))
>>> print(dataclasses.fields(deserialized_class))
()

@debakarr debakarr changed the title Equality doesn't work after deserialization of nested class Equality doesn't work after deserialization of dataclass Jun 6, 2022
@anivegesana
Copy link
Contributor

anivegesana commented Jun 6, 2022

Yep. Same issue. Here is a sloppy prototype that will be refined and might be added to dill. If not, a more elegant solution will.

import dataclasses

import dill
from pickle import GLOBAL

if hasattr(dataclasses, "_HAS_DEFAULT_FACTORY_CLASS"):
    @dill.register(dataclasses._HAS_DEFAULT_FACTORY_CLASS)
    def save_dataclasses_HAS_DEFAULT_FACTORY_CLASS(pickler, obj):
        pickler.write(GLOBAL + b"dataclasses\n_HAS_DEFAULT_FACTORY\n")

if hasattr(dataclasses, "MISSING"):
    @dill.register(type(dataclasses.MISSING))
    def save_dataclasses_MISSING_TYPE(pickler, obj):
        pickler.write(GLOBAL + b"dataclasses\nMISSING\n")

if hasattr(dataclasses, "KW_ONLY"):
    @dill.register(type(dataclasses.KW_ONLY))
    def save_dataclasses_KW_ONLY_TYPE(pickler, obj):
        pickler.write(GLOBAL + b"dataclasses\nKW_ONLY\n")

if hasattr(dataclasses, "_FIELD_BASE"):
    @dill.register(dataclasses._FIELD_BASE)
    def save_dataclasses_FIELD_BASE(pickler, obj):
        pickler.write(GLOBAL + b"dataclasses\n" + obj.name.encode() + b"\n")
>>> @dataclasses.dataclass
... class Test:
...     x: int
... 
>>> deserialized_class = dill.loads(dill.dumps(Test))
>>> print(dataclasses.fields(deserialized_class))
(Field(name='x',type=<class 'int'>,default=<dataclasses._MISSING_TYPE object at 0x103af0f70>,default_factory=<dataclasses._MISSING_TYPE object at 0x103af0f70>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),_field_type=_FIELD),)

I definitely believe that this should be a pull request that should also made in python/cpython as well because it doesn't make sense why the sentinel values are duplicated during serialization.

@mmckerns
Copy link
Member

mmckerns commented Jun 6, 2022

Thanks @anivegesana. Yeah, I think we should add code to handle dataclasses in dill.

@anivegesana
Copy link
Contributor

The most worthwhile solution would be to include this temporary fix for dill 0.3.6 and come back and create the correct pickling function later. Updating the save_type function would require adding another case to the code in #450 and I should probably create a reasonable but powerful way to handle pickling metaclasses and class decorators.

This solution will allow people to create mostly correct pickles of dataclasses that will remain compatible with future versions of dill while allowing me to consider corner cases that I haven't thought of yet.

anivegesana added a commit to anivegesana/dill that referenced this issue Jun 8, 2022
This quick fix will be removed when proper dataclass serialization 
support is added to dill. This is just here to allow for better support, 
at least for now. dataclasses pickled with this PR will be unpicklable 
by future versions of dill, but the future versions of dill will be able 
to be automatically use the newer features in dataclasses.py that were 
not available in older versions of Python. That forward compatibility 
features is not present in this PR.
anivegesana added a commit to anivegesana/dill that referenced this issue Jun 8, 2022
This quick fix will be removed when proper dataclass serialization 
support is added to dill. This is just here to allow for better support, 
at least for now. dataclasses pickled with this PR will be unpicklable 
by future versions of dill, but the future versions of dill will be able 
to be automatically use the newer features in dataclasses.py that were 
not available in older versions of Python. That forward compatibility 
features is not present in this PR.
@shaperilio
Copy link

At least for this simple case:

from dataclasses import dataclass, field, fields
import dill
import pickle


@dataclass
class Inner:
    member: int = 0


@dataclass
class Outer:
    inner: Inner = field(default_factory=Inner)


od = dill.loads(dill.dumps(Outer()))
# fields of `inner` are lost because "_FIELD is not _FIELD"
assert len(fields(od.inner)) == 0  # should be 1

op = pickle.loads(pickle.dumps(Outer()))
assert len(fields(op.inner)) == 1  # works correctly

pickle works just fine. Is there a way to maybe subclass dill.Unpickler.find_class to work around this issue? Or to use pickle only for dataclasses?

mmckerns pushed a commit that referenced this issue Jul 21, 2022
* A temporary quick fix for dataclass serialization (#500)

This quick fix will be removed when proper dataclass serialization 
support is added to dill. This is just here to allow for better support, 
at least for now. dataclasses pickled with this PR will be unpicklable 
by future versions of dill, but the future versions of dill will be able 
to be automatically use the newer features in dataclasses.py that were 
not available in older versions of Python. That forward compatibility 
features is not present in this PR.

* Fix bug in pickling MappingProxyType in PyPy 3.7+
@mmckerns
Copy link
Member

This should be fixed by: #503. Closing.

>>> from dill import dumps, loads
>>> from dataclasses import dataclass, asdict
>>> 
>>> @dataclass
... class A:
...     x: int
...     y: str
... 
>>> 
>>> @dataclass
... class B:
...     a: A
... 
>>> 
>>> a = A(1, "test")
>>> before = B(a)
>>> save = dumps(before)
>>> after = loads(save)
>>> assert asdict(before) == asdict(after)
>>> assert asdict(after.a) == asdict(before.a)

@mmckerns mmckerns added the bug label Jul 23, 2022
@mmckerns mmckerns added this to the dill-0.3.6 milestone Jul 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants