azure-functions-durable-extension: Not returning to suborchestrator after calling activity
Description
We’ve successfully used Durable Functions for a year or more, and just recently started hitting an issue where the suborchestrator doesn’t return from its activity call.
This was working fine when we had just one set of Durable Functions; we recently added two more, and I believe that’s when the problem started.
(UPDATE: I just re-ran our original Durable Function and realized it was showing the same behavior, but there’s no functional code that runs after the suborchestrator’s activity, so we never noticed it. I’m not seeing our log messages that say the activity was successful, and the orchestration is left in the Running state.)
I tried separating them out into separate function apps with separate storage accounts, thinking there was a problem with sharing the same storage account and task hub.
I’m totally stumped and blocked at this point.
I’ve trimmed my code down to the simple ‘ping’ case below.
Expected behavior
Orchestration, suborchestration and activity to complete as expected.
Actual behavior
I’m seeing these log messages in App Insights:
Attempting to create job [{job.Id}] entity workflow.
Sending ping [{ping}].
Received ping [{ping}].
But that’s all I get; I never see:
Sent ping [{ping}].
Created job [{job.Id}] processing workflow.
It appears that the Ping method completes properly, and I can see it marked TaskCompleted. After that, I see the events ExecutionStarted, OrchestrationStarted, and OrchestrationCompleted.
This reproduces exactly the same way every time. The serialized ‘Job’ is only about 5KB, so I don’t think it’s a memory issue.
Here’s the key part of the logs (below): the Ping activity finishes and, after processing TaskCompleted, the framework executes the ‘CreatePingWorkflow’ orchestration logic but leaves the instance in the Running state.
2022-04-10T07:33:41.422 [Information] fab972cedda44432a6441240e5a29dee:0: Task activity [Ping] completed successfully
2022-04-10T07:33:41.431 [Information] fab972cedda44432a6441240e5a29dee:0: Sending [TaskCompleted] message to knowledge-control-00 for instance 'fab972cedda44432a6441240e5a29dee:0'
2022-04-10T07:33:41.437 [Information] fab972cedda44432a6441240e5a29dee:0: Deleting [TaskScheduled] message from knowledge-workitems
2022-04-10T07:33:41.441 [Information] fab972cedda44432a6441240e5a29dee:0: Fetched [TaskCompleted] message from knowledge-control-00 (delay = 441ms)
2022-04-10T07:33:41.445 [Information] fab972cedda44432a6441240e5a29dee:0: Processing [TaskCompleted] (total delay = 445ms)
2022-04-10T07:33:41.445 [Debug] fab972cedda44432a6441240e5a29dee:0: Preparing to process a [TaskCompleted] message
2022-04-10T07:33:41.445 [Information] fab972cedda44432a6441240e5a29dee:0: Executing 'CreatePingWorkflow' orchestration logic
2022-04-10T07:33:41.447 [Debug] TaskActivityDispatcher-dfe70fa2702b4f908452067b7c5d7bb4-0: Finished processing work-item '7e3526a3-6612-4ecc-a3f3-81194cf0ebdc'
2022-04-10T07:33:41.455 [Information] Executing 'CreatePingWorkflow' (Reason='(null)', Id=7d9cf6a9-6075-4c34-bd78-42b50a141eb1)
2022-04-10T07:33:41.457 [Information] Executed 'CreatePingWorkflow' (Succeeded, Id=7d9cf6a9-6075-4c34-bd78-42b50a141eb1, Duration=8ms)
2022-04-10T07:33:41.457 [Information] fab972cedda44432a6441240e5a29dee:0: Function 'CreatePingWorkflow (Orchestrator)' awaited. IsReplay: False. State: Awaited. HubName: knowledge. AppName: unstruk-knowledge-perf-functions-jobs. SlotName: Production. ExtensionVersion: 2.6.1. SequenceNumber: 13.
2022-04-10T07:33:41.457 [Information] fab972cedda44432a6441240e5a29dee:0: Orchestration 'CreatePingWorkflow' awaited and scheduled 0 durable operation(s).
2022-04-10T07:33:41.467 [Information] fab972cedda44432a6441240e5a29dee:0: Appended 3 new events to the history table in 0ms
2022-04-10T07:33:41.473 [Information] fab972cedda44432a6441240e5a29dee:0: Updated Instances table and set the runtime status to 'Running'
2022-04-10T07:33:41.473 [Information] fab972cedda44432a6441240e5a29dee:0: Deleting [TaskCompleted] message from knowledge-control-00
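A status check like the following confirms the instance is stuck in Running (a minimal sketch; the function name and route here are illustrative, not our real code):

[FunctionName("GetJobStatus")]
public async Task<IActionResult> GetJobStatus(
    [HttpTrigger(AuthorizationLevel.Function, "get", Route = "jobs/{instanceId}/status")] HttpRequest req,
    [DurableClient] IDurableOrchestrationClient client,
    string instanceId)
{
    // GetStatusAsync returns null when no instance with this ID exists.
    var status = await client.GetStatusAsync(instanceId);
    if (status == null)
        return new NotFoundResult();

    // For the stuck instances, RuntimeStatus stays 'Running' indefinitely,
    // even though the Ping activity is logged as TaskCompleted.
    return new OkObjectResult(new { status.RuntimeStatus, status.LastUpdatedTime });
}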
Relevant source code snippets
[FunctionName(nameof(CreateJobOrchestration))]
public async Task CreateJobOrchestration(
    [OrchestrationTrigger] IDurableOrchestrationContext context)
{
    // Deterministic GUID, so the correlation ID is stable across replays.
    string correlationId = context.NewGuid().ToString();

    try
    {
        var job = context.GetInput<Job>();
        if (job == null)
            throw new InvalidOperationException("Invalid job.");

        if (!context.IsReplaying)
            Logger.LogDebug($"[{correlationId}] Attempting to create job [{job.Id}] entity workflow.");

        // This call never returns; the log lines below are never reached.
        await context.CallSubOrchestratorAsync(nameof(CreatePingWorkflow), "ping");

        if (!context.IsReplaying)
            Logger.LogDebug($"[{correlationId}] Created job [{job.Id}] processing workflow.");

        if (!context.IsReplaying)
            Logger.LogDebug($"[{correlationId}] Successfully created job [{job.Id}] orchestration.");
    }
    catch (Exception e)
    {
        throw new InvalidOperationException("Failed to create job orchestration.", e);
    }
}

[FunctionName(nameof(CreatePingWorkflow))]
public async Task CreatePingWorkflow(
    [OrchestrationTrigger] IDurableOrchestrationContext context)
{
    string correlationId = context.NewGuid().ToString();

    try
    {
        var ping = context.GetInput<string>();

        if (!context.IsReplaying)
            Logger.LogDebug($"[{correlationId}] Sending ping [{ping}].");

        // The activity itself completes (TaskCompleted appears in the logs),
        // but control never comes back here; "Sent ping" is never logged.
        ping = await context.CallActivityAsync<string>(nameof(Ping), ping);

        if (!context.IsReplaying)
            Logger.LogDebug($"[{correlationId}] Sent ping [{ping}].");
    }
    catch (Exception e)
    {
        throw new InvalidOperationException("Failed to create ping workflow.", e);
    }
}

[FunctionName(nameof(Ping))]
public async Task<string> Ping(
    [ActivityTrigger] IDurableActivityContext context)
{
    string correlationId = Guid.NewGuid().ToString();

    try
    {
        var ping = context.GetInput<string>();
        Logger.LogDebug($"[{correlationId}] Received ping [{ping}].");
        return ping;
    }
    catch (Exception e)
    {
        throw new InvalidOperationException("Failed to ping.", e);
    }
}
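For context, the orchestration is kicked off by a client function that isn’t part of the trimmed repro; a minimal sketch of a typical starter (the trigger, route, and serialization are illustrative, not our exact code):

[FunctionName("CreateJob")]
public async Task<IActionResult> CreateJob(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = "jobs")] HttpRequest req,
    [DurableClient] IDurableOrchestrationClient client)
{
    // Read and deserialize the Job payload from the request body.
    var json = await new StreamReader(req.Body).ReadToEndAsync();
    var job = JsonConvert.DeserializeObject<Job>(json);

    // Schedule CreateJobOrchestration; StartNewAsync returns the new instance ID.
    string instanceId = await client.StartNewAsync(nameof(CreateJobOrchestration), job);
    return new AcceptedResult($"/api/jobs/{instanceId}/status", instanceId);
}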
App Details
- Durable Functions extension version: 2.6.1
- Azure Functions runtime version: 4.0
- Programming language used: C#, .NET 6.0
If deployed to Azure
- Timeframe issue observed: 12:20 AM PT, 4/10/22 (but have been debugging this for the last several days)
- Function App name: unstruk-knowledge-perf-functions-jobs
- Function name(s): CreateJobOrchestration, CreatePingWorkflow, Ping
- Azure region: southcentralus
- Orchestration instance ID(s): d116bc25088547cd8c2b31bd349f763f
- Azure storage account name: unstrukknowledgejobsperf
About this issue
- State: closed
- Created 2 years ago
- Comments: 30 (20 by maintainers)
Yep, still working 100% on 2.6.0 so far.
@kirk-marple great find! Thanks for taking the time to try that out! It’s really appreciated!
Pinging @cgillum - looks like there might be a regression in 2.6.1.
@davidmrdavid Yep, this is the sequence that gets run.
You can see the code below; the next in line would be ProcessJobOrchestration, which never appears to run.
Interestingly, that time window appears to be exactly when I rolled out 2.7.2. I rolled back to 2.6.0 sometime after 2 PM PDT, which would match the end of that window.
I’d be happy to send the code over for the functions/entity. It uses a bunch of services via DI for the “real work”, which you could most likely just stub out, so these methods are actually pretty simple to read through.
I’ll go back to 2.7.2 and do some more testing with it, likely tomorrow PDT, and post back here when done.
I’d be happy to debug in real time (or, I’m Seattle-based, and would be happy to meet up to expedite) if helpful.
Also, here’s the host.json that’s been running today, if helpful:
@kirk-marple, could you hold fire on that? @davidmrdavid looks to be preparing a special build of the extension that includes more telemetry 😃
I take it everything is still working well for you on 2.6.0 so far?
@davidmrdavid All good so far; not a single problem with 2.6.0.
Unfortunately I’ve rebuilt that environment via Terraform, so I don’t have the App Insights history to check for the earliest timestamp. I probably noticed it within the 2 days before posting this issue; I spent a couple of days banging my head on it before trimming the code down to the simple ‘ping’ methods.
We moved to .NET 6 about 4 months ago; I think it was over the Xmas holiday. We took 2.6.1 in February, right after it came out.
I wrote this new Durable Function a week or so ago, and that’s when I noticed it. It may have been happening with the previous Durable Function too, but I didn’t notice, since functionally it was working, and I wouldn’t have seen the missing logs without explicitly checking.
Definitely a good idea, thanks - and really my only next option. I’m rolling back now to 2.6.0 and will update back with any findings.
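For anyone else hitting this: rolling back amounts to pinning the extension package in the project file, e.g.

<PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.6.0" />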