azure-functions-durable-extension: Entity Operation called too many times under load

Durable Functions 2.1, Consumption Plan, UK South


@cgillum I’m doing a load test of Entities and getting some unexpected results.

Given 3 event types: G, N, and D.

Each event is published to 10,000 entities using the IDurableEntityClient proxy, like so:

```csharp
client.SignalEntityAsync<IActor>(entityId, proxy => ...
```
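For illustration, the complete call might look like the following; the `Handle` operation is an assumed name, since the lambda body is elided above (it matches the sketch after the next paragraph):

```csharp
// Illustrative completion of the truncated call; Handle is an assumed operation name.
await client.SignalEntityAsync<IActor>(entityId, proxy => proxy.Handle(eventType));
```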

When handling an event inside the Entity, the last operation is to signal a singleton aggregation Entity, which keeps a running total of how many instances of each event type have been received.
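A minimal sketch of that setup, assuming hypothetical names (`IActor.Handle`, `Aggregator`, and the `"singleton"` key) since the full source isn't included here:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

public interface IActor
{
    void Handle(string eventType);
}

public interface IAggregator
{
    void Record(string eventType);
}

[JsonObject(MemberSerialization.OptIn)]
public class Actor : IActor
{
    public void Handle(string eventType)
    {
        // ...entity-specific work elided...

        // Last operation: signal the singleton aggregation entity.
        Entity.Current.SignalEntity<IAggregator>(
            new EntityId(nameof(Aggregator), "singleton"),
            proxy => proxy.Record(eventType));
    }

    [FunctionName(nameof(Actor))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<Actor>();
}

[JsonObject(MemberSerialization.OptIn)]
public class Aggregator : IAggregator
{
    // Running total per event type (e.g. G/N/D).
    [JsonProperty]
    public Dictionary<string, int> Counts { get; set; } = new Dictionary<string, int>();

    public void Record(string eventType)
        => Counts[eventType] = Counts.TryGetValue(eventType, out var n) ? n + 1 : 1;

    [FunctionName(nameof(Aggregator))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<Aggregator>();
}
```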

I expected that when I read back the state of the aggregation Entity, it would be:

G = 10,000 N = 10,000 D = 10,000

but what I’m actually seeing is:

G = 10,086 N = 10,111 D = 10,163


I’ve added App Insights telemetry to my Entities so I can observe how many times each entity processes an event, by tracking a customEvent as the last operation when handling an event.
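Tracking such a customEvent might look roughly like this (assuming a `TelemetryClient` from Microsoft.ApplicationInsights injected into the entity class; the class and event names are illustrative, not from the original code):

```csharp
using System.Collections.Generic;
using Microsoft.ApplicationInsights;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public class InstrumentedActor
{
    private readonly TelemetryClient telemetry;

    public InstrumentedActor(TelemetryClient telemetry) => this.telemetry = telemetry;

    public void Handle(string eventType)
    {
        // ...handle the event, signal the aggregator...

        // Last operation: record that this entity processed the event, so any
        // duplicate processing shows up as extra customEvents in App Insights.
        telemetry.TrackEvent("EntityProcessedEvent", new Dictionary<string, string>
        {
            ["entityId"] = Entity.Current.EntityId.ToString(),
            ["eventType"] = eventType,
        });
    }
}
```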

[screenshot: App Insights customEvents showing per-entity processing counts]

The results don’t add up.

My assumption is that the entity state is being persisted but the control queue message is not being completed, so the message may be replayed, running the method again and causing downstream duplication of signals. Notice how the events are duplicated on another host:

[screenshot: the same events being processed again on a different host]

Another thing I’ve noticed is that the Function App appears to simply stop processing for 10 minutes and then resume. There is no user code in the app that would explain this.

[screenshot: telemetry showing a roughly 10-minute gap in processing]


I’ve run the above test a few times now and I keep getting similar results, so it’s not a one-off.

My host.json:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MessagesTaskHub",
      "storageProvider": {
        "partitionCount": 4
      }
    }
  },
  "logging": {
    "logLevel": {
      "Function": "Warning",
      "Host": "Warning"
    }
  }
}
```

Example execution ID of an entity that generated too many events, between 12/30/2019 9:20:11.826 PM and 9:27:11.624 PM (UTC):

5c720ba8f91b4396912987555ac175db

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 29 (15 by maintainers)

Most upvoted comments

@sebastianburckhardt

It’s been over 3 years since I raised this issue, and I no longer work at the same org, so unfortunately I can’t provide any insight.

If you need a quick repro, just spin up a bunch of Orchestrations which send signals to a singleton Entity, where the Entity simply aggregates those signals into a count (don’t hand-roll any event de-dupe logic in the Entity itself; let the framework do its thing). A sketch of what I mean is below.
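Something along these lines (the `Counter` entity and the operation names are illustrative, not from the original code):

```csharp
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using System.Threading.Tasks;

public static class Repro
{
    // Each orchestration instance sends exactly one signal to the singleton.
    [FunctionName("SignalOnce")]
    public static Task SignalOnce([OrchestrationTrigger] IDurableOrchestrationContext ctx)
    {
        ctx.SignalEntity(new EntityId("Counter", "singleton"), "add", 1);
        return Task.CompletedTask;
    }

    // Function-based entity: no de-dupe logic, just a running count.
    [FunctionName("Counter")]
    public static void Counter([EntityTrigger] IDurableEntityContext ctx)
    {
        switch (ctx.OperationName)
        {
            case "add":
                ctx.SetState(ctx.GetState<int>() + ctx.GetInput<int>());
                break;
            case "get":
                ctx.Return(ctx.GetState<int>());
                break;
        }
    }
}
```

Start N instances of SignalOnce, then read the Counter state back; if the bug reproduces, the final count exceeds N.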


Personally, I would close it given how much time has passed, and given that Netherite was always positioned as a better fit for this kind of workload.

It looks like a lot of issues have the same root cause: the way leases are managed. The algorithm in the partition manager is really more of a load-balancing device than precise lease management; it uses (optimistic) lease stealing and does not prevent multiple nodes from believing they own the same control queue, at least for some time, until they try to renew or release the lease. I think this can be improved; we can use another set of (pessimistic) leases to hand off control queues without overlap.

I am planning to implement this algorithm for the new back-end anyway. It may be possible to backport the same logic to the current back-end.
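For illustration only (this is a sketch, not the actual partition-manager code): a pessimistic handoff could use an Azure Blob lease as the ownership token, assuming the Azure.Storage.Blobs client library:

```csharp
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

public static class PartitionOwnership
{
    // Acquire exclusive ownership of a control-queue partition before
    // dequeuing from it. While the lease is held, no other worker can acquire
    // it, which closes the split-brain window that optimistic stealing allows.
    public static async Task<BlobLeaseClient> TryAcquireAsync(BlobClient partitionBlob)
    {
        var lease = partitionBlob.GetBlobLeaseClient();
        try
        {
            // Finite lease: if this worker crashes, the lease expires and
            // another worker can take over without manual intervention.
            await lease.AcquireAsync(TimeSpan.FromSeconds(30));
            return lease;
        }
        catch (RequestFailedException ex) when (ex.Status == 409)
        {
            return null; // another worker currently owns this partition
        }
    }
}
```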

@olitomlinson I understand. It’s a very imperfect workaround.

Your comments about improved integrity when the app is warm are consistent with my theory about what’s going on. Once the app has warmed up, it has likely finished scaling out, and partition lease movement stops because everything is already balanced. At that point, the risk of split-brain drops to basically zero.

I think the two approaches we can take to permanently fix this are either a) fundamentally changing the way we handle partition movement, or b) changing the way we checkpoint state and side effects.