sdk-go: Temporal SDK crash (fatal error: concurrent map writes)
Expected Behavior
Concurrent accesses to the SDK's internal maps should be synchronized so that workflow execution does not crash.
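The expected protection is the standard Go pattern of guarding a shared map with a mutex. A minimal stand-alone sketch (guardedMap and its methods are illustrative names, not SDK code), assuming the goal is simply that concurrent writers cannot trigger Go's "fatal error: concurrent map writes":

```go
package main

import (
	"fmt"
	"sync"
)

// guardedMap wraps a plain map so every access goes through a mutex.
type guardedMap struct {
	mu sync.Mutex
	m  map[string]int
}

func (g *guardedMap) put(k string, v int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.m[k] = v
}

func (g *guardedMap) len() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.m)
}

func main() {
	g := &guardedMap{m: map[string]int{}}
	var wg sync.WaitGroup
	// 100 concurrent writers; without the mutex this can crash the runtime.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			g.put(fmt.Sprintf("k%d", i), i)
		}(i)
	}
	wg.Wait()
	fmt.Println(g.len()) // 100
}
```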
Actual Behavior
Concurrent access to an internal map causes the SDK worker to crash with "fatal error: concurrent map writes", leaving the client stuck.
Steps to Reproduce the Problem
We are unaware of a specific way to reproduce the issue. It happens intermittently, and when it does, the impact is catastrophic: it affects all workflows executing on the runner.
We are using Temporal (and the Temporal SDK) to build an orchestrator based on Serverless Workflow. We hit this issue when a parent workflow triggers a large number of child workflows (e.g. a ForEach state in SWF). During execution of the children we get stack traces like the one below:
```
fatal error: concurrent map writes
goroutine 84634 [running]:
go.temporal.io/sdk/internal.(*commandsHelper).addCommand(0xc00211ed20, {0x31d45e0, 0xc0011bbfe0})
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_command_state_machine.go:958 +0x1eb
go.temporal.io/sdk/internal.(*activityCommandStateMachine).cancel(0xc00100d0e0)
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_command_state_machine.go:591 +0x7f
go.temporal.io/sdk/internal.(*commandsHelper).requestCancelActivityTask(0x31a57a0?, {0xc0013a0580?, 0x132eb00?})
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_command_state_machine.go:1020 +0x39
go.temporal.io/sdk/internal.(*workflowEnvironmentImpl).RequestCancelActivity(0xc00111f080, {{0xc0013a0580?, 0x2705c20?}})
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_event_handlers.go:663 +0x38
go.temporal.io/sdk/internal.(*workflowEnvironmentInterceptor).ExecuteActivity.func3({0xc000a74fb8?, 0xc000615f28?}, 0x0?)
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/workflow.go:622 +0x75
go.temporal.io/sdk/internal.(*channelImpl).Close(0xc002418040?)
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_workflow.go:887 +0x97
go.temporal.io/sdk/internal.(*cancelCtx).cancel(0xc002418000, 0x0, {0x31a57a0, 0xc0000bfd40})
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/context.go:333 +0xff
go.temporal.io/sdk/internal.(*cancelCtx).cancel(0xc002c6dc20, 0x1, {0x31a57a0, 0xc0000bfd40})
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/context.go:340 +0x1e2
go.temporal.io/sdk/internal.WithCancel.func1()
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/context.go:196 +0x30
runtime.Goexit()
	/usr/local/go/src/runtime/panic.go:540 +0x1bb
go.temporal.io/sdk/internal.(*coroutineState).exit.func1({0xc000ac4000?, 0x3196760?}, 0xc000a58101?)
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_workflow.go:992 +0x17
go.temporal.io/sdk/internal.(*coroutineState).initialYield(0xc00211edc0, 0x3, {0xc000a58120, 0x19})
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_workflow.go:914 +0x89
go.temporal.io/sdk/internal.(*coroutineState).yield(...)
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_workflow.go:923
go.temporal.io/sdk/internal.(*channelImpl).Receive(0xc000f2e510, {0x31c5740, 0xc0015b28a0}, {0x0, 0x0})
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_workflow.go:715 +0x237
go.temporal.io/sdk/internal.(*futureImpl).Get(0xc0000ce3c0, {0x31c5740, 0xc0015b28a0}, {0x2718020?, 0xc000137080})
	/go/pkg/mod/go.temporal.io/sdk@v1.23.1/internal/internal_workflow.go:318 +0x66
gitlab.eng.vmware.com/core-build/swf-runtime/pkg/runtime.(*ForEach).runBatch(0x7?, {0x31c5740, 0xc0015b28a0}, {0xc000972000?, 0xc000aac501?, 0xc000844ba0?}, {0xc0007f9000?, 0xc00099f758?, 0x2c179b4?}, {0x2c00278, ...}, ...)
	/work/pkg/runtime/foreach.go:153 +0x215
```
The relevant snippet from foreach.go is shown below:
```go
// Fan out one workflow coroutine per input and collect a Future per item.
futures := lo.Map(list, func(input map[string]any, i int) workflow.Future {
	future, settable := workflow.NewFuture(ctx)
	workflow.Go(ctx, func(ctx workflow.Context) {
		config := &RunActionConfiguration{
			Actions:        actions,
			Mode:           mode,
			Params:         input,
			Workflow:       configuration.Workflow,
			EnvironmentID:  configuration.EnvironmentID,
			WorkflowURL:    configuration.WorkflowURL,
			OrganizationID: configuration.OrganizationID,
		}
		settable.Set(f.actionsRunner.Run(ctx, config))
	})
	return future
})

// Wait for every child to finish, accumulating results and errors.
var results []map[string]any
var errList error
for _, future := range futures {
	result := map[string]any{}
	if err := future.Get(ctx, &result); err != nil {
		errList = multierr.Append(errList, err)
	} else {
		results = append(results, result)
	}
}
```
Line 153 of foreach.go, the bottom frame of the trace, is the future.Get() call in the loop above.
Specifications
- Version: 1.23.1
- Platform: Linux/Kubernetes
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (9 by maintainers)
Thank you for the workaround. We applied it and did 10-12 runs at scale without seeing the issue.
@itendtolosealot I believe I found the issue. A potential workaround, until I can PR and release a fix, would be to not use defers in workflows.
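To make the workaround concrete, here is a hedged sketch of what "not use defers in workflows" means in practice: call cleanup explicitly on every return path instead of deferring it, so nothing runs during coroutine teardown via Goexit. The names (runStep, cleanup) are placeholders, not Temporal APIs; in the real code the deferred call would be something like a context cancel.

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholders standing in for real workflow operations.
var cleanups []string

func runStep(fail bool) error {
	if fail {
		return errors.New("step failed")
	}
	return nil
}

func cleanup() { cleanups = append(cleanups, "cleanup") }

// Instead of:
//
//	defer cleanup() // would also run during coroutine teardown (Goexit)
//
// call cleanup explicitly before each return:
func run(fail bool) error {
	if err := runStep(fail); err != nil {
		cleanup() // explicit, on the error path
		return err
	}
	cleanup() // explicit, on the success path
	return nil
}

func main() {
	fmt.Println(run(false), len(cleanups)) // <nil> 1
	fmt.Println(run(true), len(cleanups))  // step failed 2
}
```

The trade-off is that every return path must remember to clean up, which is exactly what defer normally buys you; here it is the price of keeping cleanup out of the Goexit path until the SDK fix lands.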
The failure is seen while the parent workflow is waiting on Futures from all the Child Executions.
We are not using raw goroutines in the workflow tasks; we use workflow.Go.
In the runner we have a single long-running housekeeping task that runs as a plain goroutine. It is not responsible for activities or tasks, and it uses synchronization primitives.
At the point when this goroutine is started there is no workflow.Context, so it cannot be launched as a workflow.Go coroutine.