datasets: "Property couldn't be hashed properly" even though fully picklable

Describe the bug

I am trying to tokenize a dataset with spaCy. No matter what I do, the spaCy language object (nlp) prevents datasets from pickling it correctly - or so the warning says - even though pickling it manually works fine. It should not be an issue either, since spaCy objects are picklable.

Steps to reproduce the bug

Here is a colab, but for some reason I cannot reproduce the issue there. That may have to do with logging/tqdm on Colab, or with running things in notebooks. I ran the code below as a Python script on both Windows and Ubuntu and got the same issue (warning below).

import pickle

from datasets import load_dataset
import spacy


class Processor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])

    @staticmethod
    def collate(batch):
        return [d["en"] for d in batch]

    def parse(self, batch):
        batch = batch["translation"]
        return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}

    def process(self):
        ds = load_dataset("wmt16", "de-en", split="train[:10%]")
        ds = ds.map(self.parse, batched=True, num_proc=6)


if __name__ == '__main__':
    pr = Processor()

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr, f)
    print("Successfully pickled!")

    pr.process()


Here is a small variation that calls Hasher.hash directly, showing that the hasher cannot successfully pickle parts of the nlp object.


from datasets.fingerprint import Hasher
import pickle

from datasets import load_dataset
import spacy


class Processor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])

    @staticmethod
    def collate(batch):
        return [d["en"] for d in batch]

    def parse(self, batch):
        batch = batch["translation"]
        return {"translation_tok": [{"en_tok": " ".join([t.text for t in doc])} for doc in self.nlp.pipe(self.collate(batch))]}

    def process(self):
        ds = load_dataset("wmt16", "de-en", split="train[:10]")
        return ds.map(self.parse, batched=True)


if __name__ == '__main__':
    pr = Processor()

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr, f)
    print("Successfully pickled class instance!")

    # succeeds
    with open("temp.pkl", "wb") as f:
        pickle.dump(pr.nlp, f)
    print("Successfully pickled nlp!")

    # fails
    print(Hasher.hash(pr.nlp))
    pr.process()

Expected results

I expect this to be picklable, to work (i.e. be fingerprinted and cached), and to produce no warning.

Actual results

In the first snippet, I get this warning:

Parameter 'function'=<function Processor.parse at 0x7f44982247a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

In the second, I get the following traceback, which points to the Hasher.hash line.

Traceback (most recent call last):
  File " \Python\Python36\lib\pickle.py", line 918, in save_global
    obj2, parent = _getattribute(module, name)
  File " \Python\Python36\lib\pickle.py", line 266, in _getattribute
    .format(name, obj))
AttributeError: Can't get local attribute 'add_codes.<locals>.ErrorsWithCodes' on <function add_codes at 0x00000296FF606EA0>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File " scratch_4.py", line 40, in <module>
    print(Hasher.hash(pr.nlp))
  File " \lib\site-packages\datasets\fingerprint.py", line 191, in hash
    return cls.hash_default(value)
  File " \lib\site-packages\datasets\fingerprint.py", line 184, in hash_default
    return cls.hash_bytes(dumps(value))
  File " \lib\site-packages\datasets\utils\py_utils.py", line 345, in dumps
    dump(obj, file)
  File " \lib\site-packages\datasets\utils\py_utils.py", line 320, in dump
    Pickler(file, recurse=True).dump(obj)
  File " \lib\site-packages\dill\_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File " \Python\Python36\lib\pickle.py", line 409, in dump
    self.save(obj)
  File " \Python\Python36\lib\pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
    save(state)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File " \Python\Python36\lib\pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
    save(v)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 781, in save_list
    self._batch_appends(obj)
  File " \Python\Python36\lib\pickle.py", line 805, in _batch_appends
    save(x)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
    save(element)
  File " \Python\Python36\lib\pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File " \Python\Python36\lib\pickle.py", line 634, in save_reduce
    save(state)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
    save(element)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File " \Python\Python36\lib\pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
    save(v)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 1176, in save_instancemethod0
    pickler.save_reduce(MethodType, (obj.__func__, obj.__self__), obj=obj)
  File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
    save(args)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 736, in save_tuple
    save(element)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\datasets\utils\py_utils.py", line 523, in save_function
    obj=obj,
  File " \Python\Python36\lib\pickle.py", line 610, in save_reduce
    save(args)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \Python\Python36\lib\pickle.py", line 751, in save_tuple
    save(element)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File " \Python\Python36\lib\pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File " \Python\Python36\lib\pickle.py", line 847, in _batch_setitems
    save(v)
  File " \Python\Python36\lib\pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File " \Python\Python36\lib\pickle.py", line 605, in save_reduce
    save(cls)
  File " \Python\Python36\lib\pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File " \lib\site-packages\dill\_dill.py", line 1439, in save_type
    StockPickler.save_global(pickler, obj, name=name)
  File " \Python\Python36\lib\pickle.py", line 922, in save_global
    (obj, module_name, name))
_pickle.PicklingError: Can't pickle <class 'spacy.errors.add_codes.<locals>.ErrorsWithCodes'>: it's not found as spacy.errors.add_codes.<locals>.ErrorsWithCodes
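
For context on this error: spacy.errors.add_codes.<locals>.ErrorsWithCodes is a class defined inside a function, and the stock pickler can only pickle classes by reference, i.e. by looking them up under their qualified name in a module - which fails for local classes. A minimal sketch (not from the issue, using only the standard library and a hypothetical stand-in for add_codes) reproduces the same kind of PicklingError:

import pickle


def add_codes(cls):
    # The returned class only exists as add_codes.<locals>.ErrorsWithCodes,
    # so pickle cannot find it by its qualified name.
    class ErrorsWithCodes(cls):
        pass
    return ErrorsWithCodes


@add_codes
class Errors:
    pass


try:
    pickle.dumps(Errors)
except pickle.PicklingError as e:
    print(e)  # Can't pickle <class '__main__.add_codes.<locals>.ErrorsWithCodes'>: it's not found as ...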

Environment info

Tried on both Linux and Windows

  • datasets version: 1.14.0
  • Platform: Windows-10-10.0.19041-SP0 + Python 3.7.9; Linux-5.11.0-38-generic-x86_64-with-Ubuntu-20.04-focal + Python 3.7.12
  • PyArrow version: 6.0.0

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 23 (18 by maintainers)

Most upvoted comments

I’ve been having an issue that might be related to this when trying to pre-tokenize a corpus and cache it for later use in the pre-training of a RoBERTa model. I always get the following warning:

Dataset text downloaded and prepared to /gpfswork/rech/project/user/.cache/hf-datasets/text/default-1850886023af0077/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08. Subsequent calls will reuse this data.
Parameter 'function'=<function encode_dataset.<locals>.<lambda> at 0x14a92157b280> of the transform datasets.arrow_dataset.Dataset.filter@2.0.1 couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

And when I launch the pre-training, the pre-tokenized corpus is not found in the cache and gets tokenized again, which wastes precious GPU hours.

For me, the workaround was downgrading dill and multiprocess to the following versions:

dill             0.3.4
multiprocess     0.70.12.2             

It can be even simpler to hash the bytes of the pipeline instead

nlp1.to_bytes() == nlp2.to_bytes()  # True

IMO we should integrate the custom hashing for spaCy models into datasets (we use a custom Pickler for that). What could be done on spaCy’s side instead (if they think it’s nice to have) is to implement a custom pickling for these classes using to_bytes/from_bytes to have deterministic pickle dumps.

Finally, I think it would be nice in the future to add an API to let datasets users control this kind of thing - something like being able to define your own hashing if you use complex objects:

@datasets.register_hash(spacy.language.Language)
def hash_spacy_language(nlp):
    return Hasher.hash(nlp.to_bytes())

Hi! If your function is not picklable, then the fingerprint of the resulting dataset can’t be computed. The fingerprint is a hash that is used by the cache to reload previously computed datasets: the dataset file is named cache-<fingerprint>.arrow in your dataset’s cache directory.

As a workaround you can set the fingerprint that is going to be used by the cache:

result = my_dataset.map(func, new_fingerprint=new_fingerprint)

Any future call to map with the same new_fingerprint will reload the result from the cache.

Be careful using this though: if you change your func, be sure to change the new_fingerprint as well.
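
Putting that together with the to_bytes() idea above, a minimal sketch of this workaround for the spaCy reproduction might look like the following (same model and dataset as above; deriving new_fingerprint from nlp.to_bytes() is just one way to make it deterministic, and it does not track changes to the mapped function itself):

import spacy
from datasets import load_dataset
from datasets.fingerprint import Hasher

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner", "lemmatizer"])


def parse(batch):
    texts = [d["en"] for d in batch["translation"]]
    return {"translation_tok": [{"en_tok": " ".join(t.text for t in doc)} for doc in nlp.pipe(texts)]}


ds = load_dataset("wmt16", "de-en", split="train[:10]")

# Deterministic fingerprint derived from the serialized pipeline; hashing the
# Language object itself is what fails above.
new_fingerprint = Hasher.hash(nlp.to_bytes())

# With new_fingerprint set, map uses the provided fingerprint for caching
# instead of hashing the transform, so the result can be reloaded on the next
# run. Remember to change the fingerprint whenever parse() changes.
ds = ds.map(parse, batched=True, new_fingerprint=new_fingerprint)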

Hi Matthew, thanks for chiming in! We are currently implementing exactly what you suggest: to_bytes() as a default before pickling - but we may prefer to_dict to avoid double dumping.

datasets uses pickle dumps (actually dill) to get unique representations of processing steps (a “fingerprint” or hash). So it never needs to re-load that dump - it just needs its value to create a hash. If a fingerprint is identical to a cached fingerprint, then the result can be retrieved from the on-disk cache. (@lhoestq or @mariosasko can correct me if I’m wrong.)

I was experiencing the issue that parsing with spaCy gave me a different fingerprint on every run of the script, so it could never load the processed dataset from the cache. At first I thought the reason was that spaCy Language objects were not picklable with recursive dill, but even after adjusting for that the issue persisted. @lhoestq found that this is due to the changing id, which you discussed here. So yes, you are right: on the surface there simply seems to be an incompatibility between datasets’ default caching functionality as it is currently implemented and spacy.Language.

The linked PR aims to remedy that, though. Up to now I have put some effort into making it easier to define your own “pickling” function for a given type (and optionally any of its subclasses). That allows us to tell datasets to use dill.save(nlp.to_bytes()) (deterministic) instead of dill.save(nlp) (non-deterministic). When I find some more time, the PR will be expanded to improve the user experience a bit and add a built-in function to pickle spacy.Language as one of the defaults (using to_bytes()).
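
For illustration only, here is a rough sketch of what such a type-specific “pickling” registration could look like. It leans on datasets’ internal pklregister helper (the same mechanism its custom Pickler uses for its own dispatch entries), so treat the exact names and import path as an assumption about the mechanism rather than the PR’s final API. It also registers the concrete English class, since the dispatch table matches exact types and subclass handling is precisely what the PR adds:

from spacy.lang.en import English  # concrete class of an "en_core_web_sm" pipeline

from datasets.utils.py_utils import pklregister  # internal helper (assumption: available at this path)


@pklregister(English)
def _save_spacy_english(pickler, obj):
    # For fingerprinting we only need a deterministic byte stream, not a
    # round-trippable pickle, so dump the pipeline's to_bytes() payload
    # instead of the live Language object.
    pickler.save_reduce(bytes, (obj.to_bytes(),), obj=obj)

With a registration like this in place, Hasher.hash(pr.nlp) in the second snippet should return a stable value across runs instead of raising, which is what lets map hit the cache again.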