datasets: "Property couldn't be hashed properly" even though fully picklable
Describe the bug
I am trying to tokenize a dataset with spaCy. I found that no matter what I do, the spaCy language object (nlp
) prevents datasets
from pickling correctly - or so the warning says - even though manually pickling is no issue. It should not be an issue either, since spaCy objects are picklable.
Steps to reproduce the bug
Here is a colab but for some reason I cannot reproduce it there. That may have to do with logging/tqdm on Colab, or with running things in notebooks. I tried below code on Windows and Ubuntu as a Python script and getting the same issue (warning below).
import pickle
from datasets import load_dataset
import spacy
class Processor:
def __init__(self):
self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])
@staticmethod
def collate(batch):
return [d["en"] for d in batch]
def parse(self, batch):
batch = batch["translation"]
return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}
def process(self):
ds = load_dataset("wmt16", "de-en", split="train[:10%]")
ds = ds.map(self.parse, batched=True, num_proc=6)
if __name__ == '__main__':
pr = Processor()
# succeeds
with open("temp.pkl", "wb") as f:
pickle.dump(pr, f)
print("Successfully pickled!")
pr.process()
Here is a small change that includes Hasher.hash
that shows that the hasher cannot seem to successfully pickle parts form the NLP object.
from datasets.fingerprint import Hasher
import pickle
from datasets import load_dataset
import spacy
class Processor:
def __init__(self):
self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])
@staticmethod
def collate(batch):
return [d["en"] for d in batch]
def parse(self, batch):
batch = batch["translation"]
return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}
def process(self):
ds = load_dataset("wmt16", "de-en", split="train[:10]")
return ds.map(self.parse, batched=True)
if __name__ == '__main__':
pr = Processor()
# succeeds
with open("temp.pkl", "wb") as f:
pickle.dump(pr, f)
print("Successfully pickled class instance!")
# succeeds
with open("temp.pkl", "wb") as f:
pickle.dump(pr.nlp, f)
print("Successfully pickled nlp!")
# fails
print(Hasher.hash(pr.nlp))
pr.process()
Expected results
This to be picklable, working (fingerprinted), and no warning.
Actual results
In the first snippet, I get this warning Parameter ‘function’=<function Processor.parse at 0x7f44982247a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn’t be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won’t be showed.
In the second, I get this traceback which directs to the Hasher.hash
line.
Traceback (most recent call last):
File " \Python\Python36\lib\pickle.py", line 918, in save_global
obj2, parent = _getattribute(module, name)
File " \Python\Python36\lib\pickle.py", line 266, in _getattribute
.format(name, obj))
AttributeError: Can't get local attribute 'add_codes.<locals>.ErrorsWithCodes' on <function add_codes at 0x00000296FF606EA0>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File " scratch_4.py", line 40, in <module>
print(Hasher.hash(pr.nlp))
File " \lib\site-packages\datasets\fingerprint.py", line 191, in hash
return cls.hash_default(value)
File " \lib\site-packages\datasets\fingerprint.py", line 184, in hash_default
return cls.hash_bytes(dumps(value))
File " \lib\site-packages\datasets\utils\py_utils.py", line 345, in dumps
dump(obj, file)
File " \lib\site-packages\datasets\utils\py_utils.py", line 320, in dump
Pickler(file, recurse=True).dump(obj)
File " \lib\site-packages\dill\_dill.py", line 498, in dump
StockPickler.dump(self, obj)
File " \Python\Python36\lib\pickle.py", line 409, in dump
self.save(obj)
File " \Python\Python36\lib\pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
save(state)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
StockPickler.save_dict(pickler, obj)
File " \Python\Python36\lib\pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
save(v)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 781, in save_list
self._batch_appends(obj)
File " \Python\Python36\lib\pickle.py", line 805, in _batch_appends
save(x)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
save(state)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
StockPickler.save_dict(pickler, obj)
File " \Python\Python36\lib\pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
save(v)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 1176, in save_instancemethod0
pickler.save_reduce(MethodType, (obj.__func__, obj.__self__), obj=obj)
File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
save(args)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\datasets\utils\py_utils.py", line 523, in save_function
obj=obj,
File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
save(args)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \Python\Python36\lib\pickle.py", line 751, in save_tuple
save(element)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
StockPickler.save_dict(pickler, obj)
File " \Python\Python36\lib\pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
save(v)
File " \Python\Python36\lib\pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File " \Python\Python36\lib\pickle.py", line 605, in save_reduce
save(cls)
File " \Python\Python36\lib\pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File " \lib\site-packages\dill\_dill.py", line 1439, in save_type
StockPickler.save_global(pickler, obj, name=name)
File " \Python\Python36\lib\pickle.py", line 922, in save_global
(obj, module_name, name))
_pickle.PicklingError: Can't pickle <class 'spacy.errors.add_codes.<locals>.ErrorsWithCodes'>: it's not found as spacy.errors.add_codes.<locals>.ErrorsWithCodes
Environment info
Tried on both Linux and Windows
datasets
version: 1.14.0- Platform: Windows-10-10.0.19041-SP0 + Python 3.7.9; Linux-5.11.0-38-generic-x86_64-with-Ubuntu-20.04-focal + Python 3.7.12
- PyArrow version: 6.0.0
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 5
- Comments: 23 (18 by maintainers)
I’ve been having an issue that might be related to this when trying to pre-tokenize a corpus and caching it for using it later in the pre-training of a RoBERTa model. I always get the following warning:
And when I launch the pre-training the pre-tokenized corpus is not found and it is tokenized again, which makes me waste precious GPU hours.
For me, the workaround was downgrading
dill
andmultiprocess
to the following versions:It can be even simpler to hash the bytes of the pipeline instead
IMO we should integrate the custom hashing for spacy models into
datasets
(we use a custom Pickler for that). What could be done on Spacy’s side instead (if they think it’s nice to have) is to implement a custom pickling for these classes usingto_bytes
/from_bytes
to have deterministic pickle dumps.Finally I think it would be nice in the future to add an API to let
datasets
users control this kind of things. Something like being able to define your own hashing if you use complex objects.Hi ! If your function is not picklable, then the fingerprint of the resulting dataset can’t be computed. The fingerprint is a hash that is used by the cache to reload previously computed datasets: the dataset file is named
cache-<fingerprint>.arrow
in your dataset’s cache directory.As a workaround you can set the fingerprint that is going to be used by the cache:
Any future call to
map
with the samenew_fingerprint
will reload the result from the cache.Be careful using this though: if you change your
func
, be sure to change thenew_fingerprint
as well.Hi Matthew, thanks for chiming in! We are currently implementing exactly what you suggest:
to_bytes()
as a default before pickling - but we may preferto_dict
to avoid double dumping.datasets
uses pickle dumps (actually dill) to get unique representations of processing steps (a “fingerprint” or hash). So it never needs to re-load that dump - it just needs its value to create a hash. If a fingerprint is identical to a cached fingerprint, then the result can be retrieved from the on-disk cache. (@lhoestq or @mariosasko can correct me if I’m wrong.)I was experiencing the issue that parsing with spaCy gave me a different fingerprint on every run of the script and thus it could never load the processed dataset from cache. At first I thought the reason was that spaCy Language objects were not picklable with recursive dill, but even after adjusting for that the issue persisted. @lhoestq found that this is due to the changing
id
, which you discussed here. So yes, you are right. On the surface there simply seems to be an incompatibility betweendatasets
default caching functionality as it is currently implemented andspacy.Language
.The linked PR aims to remedy that, though. Up to now I have put some effort into making it easier to define your own “pickling” function for a given type (and optionally any of its subclasses). That allows us to tell
datasets
that instead of doingdill.save(nlp)
(non-deterministic), to usedill.save(nlp.to_bytes())
(deterministic). When I find some more time, the PR will be expanded to improve the user-experience a bit and add a built-in function to picklespacy.Language
as one of the defaults (usingto_bytes()
).