dedupe: The dbm solution seems to make the blocking process extremely slow

I have about 30K records to match against, and with the default dbm approach it takes more than 10 minutes to match, for example, a set of just 2 records. During matching you can see output like this:

[2017-06-29 20:43:56,841: INFO/PoolWorker-1] 10000, 182.5327482 seconds
[2017-06-29 20:53:40,909: INFO/PoolWorker-1] 20000, 758.9884932 seconds

which I believe is the output from https://github.com/dedupeio/dedupe/blob/master/dedupe/blocking.py#L42

Since we have enough memory for now, I changed the code here so that blocking happens in an in-memory dictionary: https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L1072

Basically, instead of returning the shelf, return an empty Python dictionary:

import os
import pickle
import shelve
import tempfile

def _temp_shelve():
    fd, file_path = tempfile.mkstemp()
    os.close(fd)

    try:
        shelf = shelve.open(file_path, 'n',
                            protocol=pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        # Retry after removing the empty temp file if the db type cannot be determined.
        if 'db type could not be determined' in str(e):
            os.remove(file_path)
            shelf = shelve.open(file_path, 'n',
                                protocol=pickle.HIGHEST_PROTOCOL)
        else:
            raise

    return {}, file_path  # return a Python dictionary instead of the shelf

This makes the blocking and matching process use a lot of memory, but it can finish matching 2 records against 30K records in a few seconds.

Does this look normal?
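
To make the comparison concrete, here is a minimal sketch (not dedupe's actual blocking code) that builds the same block index once in a plain dict and once in a shelf and times both. The `build_blocks` helper, the key names, and the record counts are made up for illustration, and the timings will vary by machine and dbm backend.

```python
import os
import shelve
import tempfile
import time


def build_blocks(store, records):
    """Append each record id under its block key (hypothetical helper)."""
    for record_id, block_key in records:
        # Read-modify-write: on a shelf the whole list is unpickled and
        # re-pickled on every append; a dict only touches memory.
        store[block_key] = store.get(block_key, []) + [record_id]
    return store


# Made-up data: 2,000 record ids spread over 50 block keys.
records = [(i, "key%d" % (i % 50)) for i in range(2000)]

start = time.perf_counter()
in_memory = build_blocks({}, records)
dict_seconds = time.perf_counter() - start

with tempfile.TemporaryDirectory() as tmp_dir:
    shelf = shelve.open(os.path.join(tmp_dir, "blocks"), "n")
    start = time.perf_counter()
    build_blocks(shelf, records)
    shelf_seconds = time.perf_counter() - start
    on_disk = dict(shelf)  # copy out before closing so we can compare
    shelf.close()

print("dict: %.4fs, shelve: %.4fs" % (dict_seconds, shelf_seconds))
```

Both stores end up with identical contents; only the cost of each append differs.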


Also, dbm does not work for large data sets on macOS, because by default there is no gdbm available for Python 3 on macOS (not exactly sure why), and it causes an error like this:

HASH: Out of overflow pages.  Increase page size
Traceback (most recent call last):
  File "/Users/tendres/PycharmProjects/dedupe/tests/test_shelve.py", line 25, in <module>
    shelf[k] += [(i, record, ids)]
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/shelve.py", line 125, in __setitem__
    self.dict[key.encode(self.keyencoding)] = f.getvalue()
_dbm.error: cannot add item to database

Process finished with exit code 1

This is also mentioned here: https://github.com/dedupeio/csvdedupe/issues/67
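
A quick way to check which dbm backends a given Python installation can actually import is sketched below. It only uses the standard library's `dbm.gnu`, `dbm.ndbm`, and `dbm.dumb` submodules; the `available_dbm_backends` function name is my own.

```python
import importlib


def available_dbm_backends():
    """Return the names of the dbm submodules that can be imported here."""
    found = []
    for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
        try:
            importlib.import_module(name)
        except ImportError:
            # The underlying C library is not available in this build.
            continue
        found.append(name)
    return found


print(available_dbm_backends())
```

If `dbm.gnu` is missing from the printed list, shelve falls back to one of the other backends.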


It would also be nice to have an option in the matching API to choose whether to use shelve (or dbm), I suppose.
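
Something along these lines is what I have in mind. This is only a sketch of the idea: `open_block_store` and its `in_memory` flag are hypothetical and not part of dedupe's API.

```python
import os
import shelve
import tempfile


def open_block_store(in_memory=False):
    """Return (store, file_path); file_path is None for the in-memory case.

    Hypothetical helper: the caller picks between a dict and a shelf.
    """
    if in_memory:
        return {}, None
    # Use a fresh directory so the dbm backend creates its own file(s).
    tmp_dir = tempfile.mkdtemp()
    file_path = os.path.join(tmp_dir, "blocks")
    return shelve.open(file_path, "n"), file_path


# Usage: both stores support the same mapping operations.
store, path = open_block_store(in_memory=True)
store["block-a"] = [1, 2]
```

With the on-disk variant the caller would still need to close the shelf and remove the temporary directory afterwards.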

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 16 (6 by maintainers)

Most upvoted comments

@betocollin worked great! Thanks for the tip.

@fgregg looks like a bug in 1.7 on Mac OSX

If you do end up testing 1.7.0, have the larger dataset be the second one.

Yes, we are not moving away from dbm yet, but should have a fix for your problem.

On Tue, Jul 11, 2017 at 9:04 AM, Fuyang Liu notifications@github.com wrote:

No, I didn't. Does the newest 1.7.0 force me to use dbm?
