OpenRefine: Processing of reconciliation is slow?

As discussed on Gitter, I’m developing a reconciliation service and I noticed that the time it takes to do the fuzzy search is dominated by the time it takes to process the results.

I’m not sure if you have access to Python, but I’ve tried to make the instructions as straightforward as possible. It should work on any platform as long as you set the environment properly. Also, it’s currently set up so you can lower the log level just by pointing FLASK_APP at venv/lib/python3.7/site-packages/csv_reconcile instead of csv_reconcile.

I just threw this up so you could test it. The tool could use some more polishing.

You can load sample/progressives.tsv into OpenRefine and then reconcile the member column using the service. On my machine OpenRefine 3.4.1 sends queries in batches of 10. The service takes under 2 seconds to process each batch, but roughly 10 seconds pass between one batch being sent and the next.
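To measure the service on its own, without OpenRefine in the loop, a batch like the ones OpenRefine sends can be posted directly. This is only a minimal sketch, assuming the service accepts the standard reconciliation `queries` form parameter at the /reconcile endpoint that appears in the log below. The helper names and the values in `sample_names` are placeholders I made up, not values from sample/progressives.tsv.

```python
import json
import time
import urllib.parse
import urllib.request


def build_batch(names):
    """Encode names as one reconciliation batch: queries={"q0": {"query": ...}, ...}."""
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    return urllib.parse.urlencode({"queries": json.dumps(queries)}).encode("utf-8")


def time_batch(url, names):
    """POST one batch and return (elapsed seconds, decoded JSON response)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, data=build_batch(names)) as resp:
        body = json.load(resp)
    return time.perf_counter() - start, body


# Placeholder names (assumption); substitute ten values from the member column.
sample_names = [f"Example Member {i}" for i in range(10)]
payload = build_batch(sample_names)

# Uncomment with the service running locally:
# elapsed, results = time_batch("http://127.0.0.1:5000/reconcile", sample_names)
# print(f"{elapsed:.2f}s for {len(results)} results")
```

If the elapsed time here stays close to the service's own timer, the extra delay is being spent on the OpenRefine side rather than in the HTTP round trip.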

bash-5.0$ venv/bin/python -m flask run
 * Serving Flask app "venv/lib/python3.7/site-packages/csv_reconcile"
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
start timer
Elapsed: 1.5238202030000014
127.0.0.1 - - [13/Feb/2021 19:45:17] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.5245007290000103
127.0.0.1 - - [13/Feb/2021 19:52:42] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.5689725690000387
127.0.0.1 - - [13/Feb/2021 19:52:44] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.5700575550000053
127.0.0.1 - - [13/Feb/2021 19:52:52] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.7569187119999583
127.0.0.1 - - [13/Feb/2021 19:52:57] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.4609496090000107
127.0.0.1 - - [13/Feb/2021 19:53:09] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.8019815920000042
127.0.0.1 - - [13/Feb/2021 19:53:15] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.667001018999997
127.0.0.1 - - [13/Feb/2021 19:53:18] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.6668618590000506
127.0.0.1 - - [13/Feb/2021 19:53:29] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.7560411690000137
127.0.0.1 - - [13/Feb/2021 19:53:36] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 1.6570629999999937
127.0.0.1 - - [13/Feb/2021 19:53:39] "POST /reconcile HTTP/1.1" 200 -
start timer
Elapsed: 0.5511451229999693
127.0.0.1 - - [13/Feb/2021 19:53:50] "POST /reconcile HTTP/1.1" 200 -

Hmm… A few of the batches above arrived faster than I had been seeing. Not sure why the gaps wouldn’t be uniform. In any event, this is how I produced the results. Maybe it’s affected by other projects open in OpenRefine? I’m happy to provide more information if you need it.
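For anyone who wants to quantify the gaps rather than eyeball them, the request timestamps can be pulled out of the Flask log. A rough sketch, assuming the log format shown above and an English locale for the month abbreviation; `request_gaps` is just a name I picked. The sample uses five consecutive lines copied from the log.

```python
import re
from datetime import datetime

# Five consecutive access-log lines copied from the output above.
LOG = """\
127.0.0.1 - - [13/Feb/2021 19:52:42] "POST /reconcile HTTP/1.1" 200 -
127.0.0.1 - - [13/Feb/2021 19:52:44] "POST /reconcile HTTP/1.1" 200 -
127.0.0.1 - - [13/Feb/2021 19:52:52] "POST /reconcile HTTP/1.1" 200 -
127.0.0.1 - - [13/Feb/2021 19:52:57] "POST /reconcile HTTP/1.1" 200 -
127.0.0.1 - - [13/Feb/2021 19:53:09] "POST /reconcile HTTP/1.1" 200 -
"""

# Matches the bracketed timestamp, e.g. [13/Feb/2021 19:52:42]
STAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4} \d{2}:\d{2}:\d{2})\]")


def request_gaps(log_text):
    """Return the seconds between successive logged requests."""
    stamps = [
        datetime.strptime(m.group(1), "%d/%b/%Y %H:%M:%S")
        for m in STAMP.finditer(log_text)
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


gaps = request_gaps(LOG)
print(gaps)  # [2.0, 8.0, 5.0, 12.0]
```

The timestamps only have one-second resolution and are written when the response is logged, so these numbers are approximate.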

Regards,

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Makes sense! If you feel like it, I am sure PRs to the docs in this direction will be welcome 😃

Just had another simpler idea. Both Levenshtein distance and Word distance can be exposed as functions so that users can call them as they see fit. For instance, you could split out all the candidates returned from reconciliation and then apply Levenshtein distance with a GREL expression. Are these functions already available? The links to GREL controls, Clojure and Jython seem broken here.
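To illustrate the idea of scoring candidates after reconciliation, here is a plain Python sketch of Levenshtein distance applied to a list of candidate names. It is not OpenRefine's implementation and not a GREL function; `levenshtein` and `rank_candidates` are hypothetical helpers, and the names in the usage line are made up.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                         # deletion
                current[j - 1] + 1,                      # insertion
                previous[j - 1] + (char_a != char_b),    # substitution
            ))
        previous = current
    return previous[-1]


def rank_candidates(query, candidates):
    """Sort candidates by case-insensitive edit distance to the query."""
    return sorted(candidates, key=lambda c: levenshtein(query.lower(), c.lower()))


# Made-up example values (assumption), closest match first.
ranked = rank_candidates("Jon Smith", ["Jane Smythe", "John Smith"])
```

Exposed as a function, a user could apply this kind of scoring to the candidates of each cell and choose their own threshold.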