LokiJS: bulk remove very slow

I bulk remove items using the following code.

Removal is very fast when the collection holds fewer than roughly 50000 items, but it becomes extremely slow once the collection grows past roughly 60000 items.

50000 items: 146 ms
60000 items: 14274 ms

CPU: Intel Core2 Duo E8400 3 GHz
RAM: 8 GB
OS: Windows 7 x64
Browser: Chrome 67.0.3371.0 (Official Build) dev (64-bit)

// assumes LokiJS has already been loaded (e.g. const loki = require('lokijs')
// in Node, or the lokijs script included in the page)
const docs = [
  { name: 'mjolnir', owner: 'thor', maker: 'dwarves' },
  { name: 'gungnir', owner: 'odin', maker: 'elves' },
  { name: 'tyrfing', owner: 'Svafrlami', maker: 'dwarves' },
  { name: 'draupnir', owner: 'odin', maker: 'elves' },
]
var db = new loki('sandbox.db')
// Add a collection to the database
var items = db.addCollection('items')

function addLokiItems(items, title) {
  console.time(title)
  // insert 50000 docs; bump this to 60000 to reproduce the slow case
  for (var i = 0; i < 50000; i++) {
    items.insert({ ...docs[0], data: Math.random() })
  }
  console.timeEnd(title)
}

addLokiItems(items, 'loki-add mem')

function removeLokiItems(items, title) {
  console.time(title)
  // references kept only for inspecting the internal arrays while debugging
  var data = items.data
  var idIndex = items.idIndex
  // remove every doc whose $loki id is >= 10
  items.findAndRemove({ $loki: { $gte: 10 } })
  console.timeEnd(title)
}
removeLokiItems(items, 'loki-remove mem')

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 22

Most upvoted comments

Not entirely sure the fix I just committed will work with non-synthetic benchmarks, but assuming that your doc array is ordered similarly to how the docs are ordered in the data array, the above commit seems to help significantly. Since your example relies on a find filter to determine its ordering, and that find operation returns its results sorted on $loki, I assume this should help in most cases.

The main expense is the slice, as mentioned, and in the above case the reverse iteration… maybe we can force a sort on $loki in the outer (batch) remove() before reverse iterating and calling remove() on each doc.

(Update): Splitting the operation into find() and remove() calls and reversing the find results yields similarly bad performance to the original… this was to be expected. Adding a simple JavaScript sort on the $loki property for 60k+ docs added no significant performance penalty (maybe 50 ms), allowing a < 200 ms response regardless of the ordering of the docs… forcing the sort might not be a bad idea, or we could add it as a method option, but I am hesitant to alter function signatures at this point.
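
In the meantime, a rough sketch of that sort-then-remove workaround in user code might look like this (the helper name removeSortedByLoki is purely illustrative, and it only uses the public find()/remove() API):

// Sketch of the workaround discussed above, using only the public API.
// Sort the matched docs descending on $loki so each remove() happens near
// the end of the data array and shifts as few elements as possible.
function removeSortedByLoki(coll, query) {
  var matches = coll.find(query)
  matches.sort(function (a, b) { return b.$loki - a.$loki })
  for (var i = 0; i < matches.length; i++) {
    coll.remove(matches[i])
  }
}

removeSortedByLoki(items, { $loki: { $gte: 10 } })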

I have checked in an optimized algorithm for batch removes, which is currently tied into chained removes.

A new benchmark_remove.js benchmark has been added to the benchmarks folder; when removing ~50k docs from 70k total docs, it reduces the time from ~1111 ms to ~25 ms.

To utilize/test performance with that, you might do something like:

coll.chain().find({data: { $ne: 2 }}).remove();

I am employing an array intersection algorithm, similar to the Resultset simplesort() optimization made recently, which replaces array slicing for each document with an array filter() on the data and idIndex arrays.
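
For illustration, the gist of that filter-based approach looks roughly like this (a simplified sketch of the idea, not the actual committed code, and it skips the index/DynamicView maintenance mentioned below):

// Simplified sketch of the filter-based idea (not the actual LokiJS internals):
// collect the $loki ids to delete in a hash object, then rebuild data and
// idIndex with one filter() pass instead of splicing once per removed document.
function batchRemoveSketch(coll, docsToRemove) {
  var doomedIds = {}
  docsToRemove.forEach(function (doc) { doomedIds[doc.$loki] = true })

  coll.data = coll.data.filter(function (doc) { return !doomedIds[doc.$loki] })
  coll.idIndex = coll.idIndex.filter(function (id) { return !doomedIds[id] })

  // the real implementation would also notify DynamicViews and rebuild binary indices here
}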

If DynamicViews or binary indices are involved, they will be notified of all removals.

@Viatorus (and all): I am considering not emitting ‘delete’ events for any of the documents, and the current new code path does not emit them at the moment. If you do batch removes, you should notify yourself. If anyone disagrees, feel free to comment.
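
If you do need per-document notifications with the new path, one possible workaround (a sketch only, assuming the collection’s event emitter is accessible via emit()) is to snapshot the matches first and emit the events yourself:

// Sketch: capture the docs first, batch-remove them, then notify listeners manually.
var doomedDocs = coll.find({ data: { $ne: 2 } })
coll.chain().find({ data: { $ne: 2 } }).remove()
doomedDocs.forEach(function (doc) {
  coll.emit('delete', doc)   // manual 'delete' notification per removed doc
})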

I will next look into retrofitting this new code path into the collection.remove() function. I will need to test for arrays, and if the user passes in a (batch) array, I will convert that array to what the new internal method expects. TypeScript definitions should not need to be changed.
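
Once that retrofit is in place, the call site could look something like this (a sketch of the intended usage, not confirmed behavior):

// Sketch of the planned batch form of collection.remove():
// pass the whole array of matched docs instead of removing them one by one.
var matches = coll.find({ data: { $ne: 2 } })
coll.remove(matches)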

@z543, feel free to try out the above syntax and let me know if that resolves performance issues on large batch removals.

So I have done testing and discovered what I believe are further optimizations that I will look at implementing over the weekend.

I believe the previous commit does help, but mainly when you are removing contiguous elements near the end of the array. As more and more elements need to be shifted down for every individual remove, the cost builds up quadratically.
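
A tiny standalone micro-benchmark on plain arrays (nothing LokiJS-specific, names are illustrative) shows why that per-element shifting gets expensive compared with a single filter pass:

// Plain-array micro-benchmark (independent of LokiJS): one splice per removed
// element keeps shifting the trailing elements, while a single filter pass is linear.
var source = []
for (var i = 0; i < 70000; i++) source.push(i)

function shouldRemove(x) { return x < 60000 }   // remove most elements near the front

var a = source.slice()
console.time('splice per element')
for (var j = a.length - 1; j >= 0; j--) {
  if (shouldRemove(a[j])) a.splice(j, 1)        // each splice shifts the elements above j
}
console.timeEnd('splice per element')

var b = source.slice()
console.time('single filter')
b = b.filter(function (x) { return !shouldRemove(x) })
console.timeEnd('single filter')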

My current approach is to add removeBatch and use a hash object along with a single array filter() rather than many individual calls to array slice(). That seems to yield significant improvements; I will add all the requisite safety checks and dynamic view / binary index notifications and verify performance is still significantly better.