sdk-php: [Bug] Activity not scheduled in signal handler
What are you really trying to do?
Schedule/invoke an activity in a signal handler
Describe the bug
Having the following workflow code, where addName is a #[SignalMethod],
public function addName(string $name): void
{
$this->greetingActivity->composeGreeting('Hello', $name);
}
One customer reported that sometimes the activity is not scheduled after the signal is delivered to the workflow execution.
It does not make any difference if the activity invocation sync or async
I have tried to reproduce the issue with no luck.
Attaching two event histories following this same path. As can be noted, in one of them the activity is not scheduled after receiving the signal (event 15), which later causes an NDE on replay.
Minimal Reproduction
Environment/Versions
- OS and processor: [e.g. M1 Mac, x86 Windows, Linux]
- Temporal Version: [e.g. 1.14.0?] and/or SDK version
- Are you using Docker or Kubernetes or building Temporal from source?
Additional context
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 23 (1 by maintainers)
Hi @rustatian,
We temporarily moved away from using activities in signals, so we weren’t able to test whether the fix that was pushed as part of v2.5.3 and v.2.5.4 has addressed our issue. Since the bug is difficult to reproduce, as you saw yourself, it’s also difficult to verify that it has been fixed. That being said, we will come up with a plan on reintroducing the activities and monitor the situation.
We are expecting that cumulative update by v2.5.3 and v2.5.4 will fix possible Signal issues.
Thanks @dbakiu 👍🏻 Looks like we found the problem. On it.
Hey, here you go: Roadrunner:
rr version 2023.1.1 (build time: 2023-04-20T12:40:33+0000, go1.20.3)PHP: 8.1.18Since the issue is only happening in production and we’re unable to reproduce it locally, it is difficult to provide logs, apart from the error message that we got, which is a fairly generic one:
Regarding your OOM suspicion, I took a look at our monitoring systems, and the memory usage on the pods has been consistently low. On average, it’s around 10%, with a peak of 14% over the last 3 months. We’re also monitoring the sticky cache size, and during that same period, it averages at about 220, with a maximum of 590, which is well under the default limit of 10000. Additionally, we have increased the sticky cache size to 20000, which was one advice we got from the Temporal support crew after the last batch of errors appeared.
As for any other exceptions that may occur, there’s monitoring in place which should alert us if any erratic behavior is detected, and apart from the NDE-related exceptions, there have been no other exceptions in that period.
In your case the workflow may be closed before an activity from a signal method will be done
Some signal method activity counter might help: