dd-trace-dotnet: Performance degradation in 2.21.0

Describe the bug After installing 2.21.0, CPU usage on a couple of our microservices increased dramatically. On occasion a container running one of the services would start failing a large percentage of requests with timeouts calling downstream services and had to be terminated. Upon rolling back to 2.20.0, performance returned to normal.

To Reproduce Steps to reproduce the behavior: Install Datadog.Trace.Bundle 2.21.0
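Pinning the tracer version is the only step needed. A minimal sketch, assuming a standard NuGet package reference (the project file name here is illustrative):

```bash
# Install the affected version of the bundle into the service project
dotnet add ./MyService.csproj package Datadog.Trace.Bundle --version 2.21.0

# Rolling back to the previous version restores normal behavior
dotnet add ./MyService.csproj package Datadog.Trace.Bundle --version 2.20.0
```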

Expected behavior Updating to the latest version should not completely break our applications.

Screenshots

[Screenshot: CloudWatch Management Console]

The graph above is for an application where the impact of running the affected version is clearly visible (it shows CPU utilization for the ECS service, on a log scale for clarity).

Runtime environment (please complete the following information):

  • Instrumentation mode: using Datadog.Trace.Bundle so we can use both automatic and manual instrumentation.
  • Tracer version: 2.21.0
  • OS: Ubuntu Linux running in a Docker container on ECS
  • CLR: .NET 6

Additional context We originally saw this problem before switching to the bundle (previously we downloaded the .deb file, installed it in the container, and matched the version in code). Based on issues I’d encountered before, I thought that setup might be the cause, so I switched to using the bundle, but the problem remained. In both services where we saw this issue, rolling back this library (and only this library) was enough to return the service to a functioning state.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 30

Most upvoted comments

@Recio I think it would be great if you could open a new ticket. I pinged someone from the ASM team to look at this.

@george-zubrienko I’m glad to hear that. I was defining a strategy to investigate 😃 Out of curiosity, was it something in your code? The machine? An option enabled on our side?

If anything new comes up, do not hesitate to reach out.

@Recio Yeah, the fix in 2.22 was for a deadlock on Linux. Can you check that DD_APPSEC_ENABLED is not set, or is at least set to 0? It should not be enabled by default. I hope it will restore your happiness and sleep schedule 😃
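For anyone else verifying this, one way to check the setting inside a running container might look like the following (the container ID and task-definition snippet are illustrative, not from this thread):

```bash
# Empty output means the variable is unset, which is the expected default
docker exec <container-id> printenv DD_APPSEC_ENABLED

# To disable it explicitly, set it in the Dockerfile...
#   ENV DD_APPSEC_ENABLED=0
# ...or in the ECS task definition's container "environment" list:
#   { "name": "DD_APPSEC_ENABLED", "value": "0" }
```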

@gleocadie we’ve rolled out 2.22 on some installations, but I haven’t yet tried running code profiling for more than an hour. I’ll set one up on test now and let you know on Monday if we had any issues.

@george-zubrienko It looks like the malloc deadlock I managed to reproduce on my end and fixed in 2.22. Can you try it out and give me feedback, please? Do not hesitate to run the gdb command line if the application is still stuck. Thanks in advance.
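If the application hangs again, a typical way to capture thread backtraces with gdb might look like this (a sketch, assuming gdb is installed in the container and it has ptrace permission; adapt the process lookup to your setup):

```bash
# Attach to the stuck .NET process
gdb -p "$(pidof dotnet)"

# Inside gdb: dump every thread's backtrace to spot the deadlock, then detach
(gdb) set pagination off
(gdb) thread apply all bt
(gdb) detach
(gdb) quit
```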