skywalking: The Java Agent may cause frequent GC or OOM in extreme scenarios

Please answer these questions before submitting your issue.

  • Why do you submit this issue?
    • Question or discussion
    • Bug
    • Requirement
    • Feature or performance improvement

Requirement or improvement

The SkyWalking Java Agent is a powerful instrumentation tool that makes it much easier to build our tracing system.

We have been running SkyWalking with our Java applications in production for several months, and it mostly runs fine. Recently, however, some applications started to suffer from frequent GC and some hit OOM. We dumped the heap and, using Memory Analyzer (MAT), found a large number of TraceSegmentRef objects in it. The two cases are as follows:

Case 1: Frequent GC

In this case, the app has 1000 Dubbo handler threads, and each handler performs many RPCs and DB operations.

  • JVM Max Heap: 8g
  • Machine: 8 core 16g
  • SkyWalking Agent: 8.4.0, collect all traces

Case 2: OOM

In this case, the app has 20 RocketMQ consumer threads, and each consumer thread performs some RPCs and DB operations.

  • JVM Max Heap: 8g
  • Machine: 8 core 16g
  • SkyWalking Agent: 8.4.0, collect all traces

On the application side, I think there are 3 reasons:

  1. A sudden spike in throughput keeps all threads busy handling requests.
  2. Each request performs many RPCs and DB operations, which creates a lot of spans.
  3. Requests are handled slowly; some take 10s or even longer.
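To see how these three factors multiply, here is a rough back-of-the-envelope sketch. Only the thread count (1000, from Case 1) comes from our setup; the spans-per-request and bytes-per-span figures are hypothetical placeholders, not measurements, and the class name is made up for illustration.

```java
/** Rough estimate of tracing data pinned in memory by in-flight requests. */
class RetainedSpanEstimate {

    /**
     * Worst case: every handler thread is busy with a slow request, so every
     * span created so far by those requests is still reachable.
     */
    static long retainedBytes(int busyThreads, int spansPerRequest, int bytesPerSpan) {
        return (long) busyThreads * spansPerRequest * bytesPerSpan;
    }

    public static void main(String[] args) {
        int busyThreads = 1000;      // Dubbo handler threads in Case 1
        int spansPerRequest = 500;   // ASSUMPTION: hypothetical, not measured
        int bytesPerSpan = 1024;     // ASSUMPTION: hypothetical, not measured
        long bytes = retainedBytes(busyThreads, spansPerRequest, bytesPerSpan);
        System.out.println(bytes / (1024 * 1024) + " MiB retained while requests are in flight");
    }
}
```

The longer each request takes (reason 3), the longer this amount stays pinned, and the more likely the objects are to survive young collections and be promoted to the old generation.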

On the agent side, I have read the source code and understand some of the design:

  • A Segment, in SkyWalking terms, is the object stored in the RingBuffer on the client side, and SkyWalking has a consumer thread that consumes the RingBuffer data and sends it to the OAP.
  • Before a Segment object is put into the RingBuffer, it has to be built. Each request creates some spans, which are kept in a stack; the Segment is not finished until the stack is empty, which means the request has finished in the application. This can take some time. Meanwhile, the data is kept in a thread-local, so the garbage collector cannot collect it before the request finishes.
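My understanding of that lifecycle can be sketched as below. This is a simplified model for discussion, not the agent's real code: every class and method name here is hypothetical, and an ArrayBlockingQueue stands in for the agent's ring buffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Simplified model of "build the segment in a thread-local, hand it off when the span stack is empty". */
class SegmentBufferSketch {

    /** Hypothetical stand-in for a segment: it accumulates finished spans. */
    static final class Segment {
        final List<String> finishedSpans = new ArrayList<>();
    }

    /** Per-thread tracing state: the stack of active spans plus the segment being built. */
    static final class Context {
        final Deque<String> activeSpans = new ArrayDeque<>();
        final Segment segment = new Segment();
    }

    private static final ThreadLocal<Context> CONTEXT = new ThreadLocal<>();

    /** Stand-in for the ring buffer that a consumer thread would drain and send to the OAP. */
    static final BlockingQueue<Segment> BUFFER = new ArrayBlockingQueue<>(1024);

    static void startSpan(String operationName) {
        Context ctx = CONTEXT.get();
        if (ctx == null) {
            ctx = new Context();
            CONTEXT.set(ctx);
        }
        ctx.activeSpans.push(operationName);
    }

    static void stopSpan() {
        Context ctx = CONTEXT.get();
        // A finished span moves from the stack into the segment, so it stays reachable.
        ctx.segment.finishedSpans.add(ctx.activeSpans.pop());
        if (ctx.activeSpans.isEmpty()) {
            // Only now is the request done: release the thread-local and hand off the segment.
            CONTEXT.remove();
            BUFFER.offer(ctx.segment); // offer() drops the segment if the buffer is full
        }
    }

    /** Number of spans the current thread is still holding on to. */
    static int retainedSpans() {
        Context ctx = CONTEXT.get();
        return ctx == null ? 0 : ctx.activeSpans.size() + ctx.segment.finishedSpans.size();
    }
}
```

In this model, a request that opens an entry span and then performs N RPCs keeps N + 1 spans reachable from the thread-local until the entry span is stopped, no matter how long ago the individual RPC spans finished.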

I wonder why the segment, rather than the span, is put into the ring buffer; could we put spans there instead? I am not familiar with the purpose of the Segment design. I know we should improve our application as well, but in some scenarios slow request handling is tolerable. So what can the SkyWalking Java Agent do in such extreme scenarios? Application availability is very important, and none of us wants the APM tool to occupy a lot of memory.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

@nisiyong This issue is going to be closed once #6715 gets merged, it has 2 approvals already.

To other people reading this issue: that PR is a precautionary measure, rather than a real bug fix or a resolution of this particular issue.