joblib: slow memory retrieval (significantly slower than simple pickle)

Hi,

I’m a little confused about why reading from and writing to a (file-based) “memory” takes such an enormous amount of time compared to bare pickling/unpickling.

In my case, func() is a tiny memoized function that takes a short string argument and returns a (short) dict with (long) lists of ~complex objects. For some reason, retrieving the result from the cache takes significantly more time than just unpickling the file. The resulting file is approximately 70 MB.

I observe the same thing with any other function.
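
Roughly, the setup looks like the sketch below; the class, the function body, and the cache directory are placeholders rather than my real code, just to show the shape of what gets cached:

import time

from joblib import Memory


class Item:
    # Stand-in for the "~complex" objects in the real workload.
    def __init__(self, i):
        self.idx = i
        self.name = "item-%d" % i


memory = Memory("./cachedir", verbose=0)   # placeholder cache location


@memory.cache
def func(key):
    # Placeholder body: a small dict holding long lists of objects,
    # roughly the shape of data that pickles to ~70 MB in my case.
    return {"items": [Item(i) for i in range(500000)]}


func("some_str")        # first call: compute and persist to the cache
t0 = time.time()
func("some_str")        # second call: retrieve from the cache
print("cache retrieval took %.2fs" % (time.time() - t0))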

%prun func(some_str)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.436   12.436   52.011   52.011 pickle.py:1014(load)
 41531482    7.665    0.000   11.931    0.000 pickle.py:226(read)
  1922386    5.547    0.000    7.339    0.000 pickle.py:1504(load_build)
 41531483    4.266    0.000    4.266    0.000 {method 'read' of '_io.BufferedReader' objects}
  6490284    3.753    0.000    6.666    0.000 pickle.py:1439(load_long_binput)
  2645763    2.666    0.000    4.764    0.000 pickle.py:1192(load_binunicode)
 30070039    2.403    0.000    2.403    0.000 {built-in method builtins.isinstance}
  4140172    1.870    0.000    3.225    0.000 pickle.py:1415(load_binget)
  1922386    1.369    0.000    2.049    0.000 pickle.py:1316(load_newobj)
  9196954    1.359    0.000    1.359    0.000 {built-in method _struct.unpack}
  1922386    1.114    0.000    8.724    0.000 numpy_pickle.py:319(load_build)
 10857316    0.962    0.000    0.962    0.000 {method 'pop' of 'list' objects}
 14536246    0.873    0.000    0.873    0.000 {method 'append' of 'list' objects}
  1922386    0.873    0.000    1.218    0.000 pickle.py:1472(load_setitem)
  1922393    0.816    0.000    0.816    0.000 {built-in method builtins.getattr}
   676815    0.765    0.000    1.384    0.000 pickle.py:1458(load_appends)
  1922387    0.730    0.000    0.832    0.000 pickle.py:1257(load_empty_dictionary)
        1    0.715    0.715   53.099   53.099 <string>:1(<module>)
  1245385    0.559    0.000    0.848    0.000 pickle.py:1451(load_append)
...

%prun len(pickle.load(open("..file..", 'rb')))
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    4.587    4.587    4.587    4.587 {built-in method _pickle.load}
        1    0.553    0.553    5.140    5.140 <string>:1(<module>)
        1    0.000    0.000    5.140    5.140 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

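
As far as I can tell from the two profiles, the cached retrieval goes through the pure-Python loader in pickle.py (pickle.py:1014(load), with joblib’s numpy_pickle.py:319(load_build) hooked in, presumably so load_build can be overridden for numpy arrays), while the bare call hits the C-accelerated {built-in method _pickle.load}. The same comparison can also be made directly on the cache file; the path below is a placeholder for the output.pkl that Memory wrote:

import pickle
import time

import joblib

# Placeholder path: the output.pkl written by Memory for func("some_str")
filename = "./cachedir/joblib/__main__/func/<args-hash>/output.pkl"

t0 = time.time()
joblib.load(filename)              # joblib's numpy-aware loader (numpy_pickle)
print("joblib.load:  %.2fs" % (time.time() - t0))

t0 = time.time()
with open(filename, "rb") as f:
    pickle.load(f)                 # plain C unpickler
print("pickle.load:  %.2fs" % (time.time() - t0))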
About this issue

  • State: open
  • Created 8 years ago
  • Comments: 37 (29 by maintainers)

Most upvoted comments

I think I got the following patch to memory.py to work:

index 14d7552..536826e 100644
--- a/../joblib/joblib/memory.py
+++ b/joblib/memory.py
@@ -34,7 +34,7 @@ from .func_inspect import format_call
 from .func_inspect import format_signature
 from ._memory_helpers import open_py_source
 from .logger import Logger, format_time, pformat
-from . import numpy_pickle
+import pickle
 from .disk import mkdirp, rm_subdirs, memstr_to_bytes
 from ._compat import _basestring, PY3_OR_LATER
 from .backports import concurrency_safe_rename
@@ -134,7 +134,7 @@ def _load_output(output_dir, func_name, timestamp=None, metadata=None,
         raise KeyError(
             "Non-existing cache value (may have been cleared).\n"
             "File %s does not exist" % filename)
-    result = numpy_pickle.load(filename, mmap_mode=mmap_mode)
+    result = pickle.load(open(filename, "rb"))
 
     return result
 
@@ -208,7 +208,7 @@ def concurrency_safe_write(to_write, filename, write_func):
     thread_id = id(threading.current_thread())
     temporary_filename = '{}.thread-{}-pid-{}'.format(
         filename, thread_id, os.getpid())
-    write_func(to_write, temporary_filename)
+    write_func(to_write, open(temporary_filename,"wb"))
     concurrency_safe_rename(temporary_filename, filename)
 
 
@@ -759,8 +759,7 @@ class MemorizedFunc(Logger):
         try:
             filename = os.path.join(dir, 'output.pkl')
             mkdirp(dir)
-            write_func = functools.partial(numpy_pickle.dump,
-                                           compress=self.compress)
+            write_func = pickle.dump
             concurrency_safe_write(output, filename, write_func)
             if self._verbose > 10:
                 print('Persisting in %s' % dir)

Of course it’s a huge hack that just bypasses everything. I wonder if it breaks anything.

Actually, thinking about it, maybe the cleanest thing to do is to add a use_joblib_pickling argument (for lack of a better name) to Memory, which would be True by default.
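
To make the idea concrete, here is a rough, self-contained sketch of the kind of switch I have in mind; this is not actual joblib code, and load_cached / use_joblib_pickling are made-up names for illustration:

import pickle

import joblib   # joblib.load is the public entry point to numpy_pickle.load


def load_cached(filename, use_joblib_pickling=True, mmap_mode=None):
    # Illustrative only: pick the serializer based on a flag.
    # True keeps today's behaviour (numpy-aware, supports mmap_mode);
    # False falls back to the much faster C unpickler for caches that
    # contain only plain Python objects.
    if use_joblib_pickling:
        return joblib.load(filename, mmap_mode=mmap_mode)
    with open(filename, "rb") as f:
        return pickle.load(f)

Memory would then just need to forward that flag down to _load_output and to the write_func setup touched in the diff above, instead of hard-coding numpy_pickle.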