sdk-php: [Bug] Activity not scheduled in signal handler

What are you really trying to do?

Schedule/invoke an activity in a signal handler

Describe the bug

Having the following workflow code, where addName is a #[SignalMethod],


    public function addName(string $name): void
    {
        $this->greetingActivity->composeGreeting('Hello', $name);
    }

One customer reported that sometimes the activity is not scheduled after the signal is delivered to the workflow execution.
It does not make any difference if the activity invocation sync or async

I have tried to reproduce the issue with no luck.

Attaching two event histories following this same path. As can be noted, in one of them the activity is not scheduled after receiving the signal (event 15), which later causes an NDE on replay.

event-history-revised

event-history-revised-2

Minimal Reproduction

Environment/Versions

  • OS and processor: [e.g. M1 Mac, x86 Windows, Linux]
  • Temporal Version: [e.g. 1.14.0?] and/or SDK version
  • Are you using Docker or Kubernetes or building Temporal from source?

Additional context

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 23 (1 by maintainers)

Most upvoted comments

Hi @rustatian,

We temporarily moved away from using activities in signals, so we weren’t able to test whether the fix that was pushed as part of v2.5.3 and v.2.5.4 has addressed our issue. Since the bug is difficult to reproduce, as you saw yourself, it’s also difficult to verify that it has been fixed. That being said, we will come up with a plan on reintroducing the activities and monitor the situation.

We are expecting that cumulative update by v2.5.3 and v2.5.4 will fix possible Signal issues.

Thanks @dbakiu 👍🏻 Looks like we found the problem. On it.

Hey @dbakiu 👋🏻 Could you please check, what RR version you use? ./rr --version command output.

Hey, here you go: Roadrunner: rr version 2023.1.1 (build time: 2023-04-20T12:40:33+0000, go1.20.3) PHP: 8.1.18

Also, if that’s possible, could you please provide debug (would be super cool) or at least error logs before/after this happens?

We tried to reproduce the bug, no luck. Sending hundreds of signals to the dozens of workflows - no luck. I think that these failures happened because of the WF worker restarts due to out of memory or some other exception.

Since the issue is only happening in production and we’re unable to reproduce it locally, it is difficult to provide logs, apart from the error message that we got, which is a fairly generic one:

  "message": "nondeterministic workflow: extra replay command for ScheduleActivityTask: (ActivityId:19, ActivityType:(Name:WorkflowClassNameX.ActivityNameY.activityMethodZ), TaskQueue:(Name:default, Kind:Normal), Input:(Payloads:[len=1]), ScheduleToCloseTimeout:0s, ScheduleToStartTimeout:0s, StartToCloseTimeout:5s, HeartbeatTimeout:0s, RequestEagerExecution:true)",
  "source": "GoSDK",
  "stackTrace": "",
  "encodedAttributes": null,
  "cause": null,
  "applicationFailureInfo": {
    "type": "historyMismatchError",
    "nonRetryable": false,
    "details": null
  }
}

Regarding your OOM suspicion, I took a look at our monitoring systems, and the memory usage on the pods has been consistently low. On average, it’s around 10%, with a peak of 14% over the last 3 months. We’re also monitoring the sticky cache size, and during that same period, it averages at about 220, with a maximum of 590, which is well under the default limit of 10000. Additionally, we have increased the sticky cache size to 20000, which was one advice we got from the Temporal support crew after the last batch of errors appeared.

As for any other exceptions that may occur, there’s monitoring in place which should alert us if any erratic behavior is detected, and apart from the NDE-related exceptions, there have been no other exceptions in that period.

In your case the workflow may be closed before an activity from a signal method will be done

Some signal method activity counter might help:

class Test
{
    private mixed $arg1;
    private mixed $arg2;
    private ActivityProxy $activityAlpha;
    private ActivityProxy $activityBeta;
    private ActivityProxy $activityGamma;
    // FIX: Add signals counter
    private ActivityProxy $waitingSignals;

    public function handler(mixed $arg1, mixed $arg2)
    {
        $this->arg1 = $arg1;
        $this->arg2 = $arg2;

        yield $this->activityAlpha->execute();
        /*if version...*/ yield $this->activityBeta->execute();

        $result = [];
        for ($i = 3; $i > 0; $i--) {
            yield Workflow::timer('5 days');

            yield $this->activityGamma->execute();
            /*if version...*/ yield $this->activityBeta->execute();
        }
        yield Workflow::timer('5 days');

        /*if version...*/ yield $this->activityBeta->execute();

        // FIX: Wait signals completion
        // Use awaitWithTimeout to add timeout here (better option)
        /* if yet another version... */ yield Workflow::await(fn() => $this->waitingSignals === 0);
    }

    public function signalAlpha(string $arg)
    {
        // FIX
        ++$this->waitingSignals;
        try {
            yield $this->activityAlpha->execute($arg);
        } finally {
            --$this->waitingSignals;
        }
    }

    public function signalBeta(string $arg)
    {
        // FIX
        ++$this->waitingSignals;
        try {
            yield $this->activityBeta->execute($arg);
        } finally {
            --$this->waitingSignals;
        }
    }

    public function signalGamma(string $arg)
    {
        // FIX
        ++$this->waitingSignals;
        try {
            yield $this->activityGamma->execute($arg);
        } finally {
            --$this->waitingSignals;
        }
    }
}