opentelemetry-collector-contrib: otelcol_processor_tail_sampling_sampling_traces_on_memory only increments; it does not behave like a gauge

Component(s)

processor/tailsampling

What happened?

Description

The metric's help text indicates that it is a gauge, but the value only ever increases, as if it were just a count of the spans processed.

Steps to Reproduce

Run this collector and send 300 spans. Wait 2 minutes. Observe that the metric does not go down.
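
For reference, a minimal sender sketch that produces 300 single-span traces (this assumes the OpenTelemetry Go SDK, the collector's default OTLP/HTTP port 4318, and an arbitrary span name; none of these details are from the original report):

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export over OTLP/HTTP to the collector's default HTTP port.
	exp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("localhost:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	tracer := tp.Tracer("tail-sampling-repro")

	// Each iteration starts a new root span, so every span becomes its own
	// trace and its own entry in the processor's idToTrace map.
	for i := 0; i < 300; i++ {
		_, span := tracer.Start(ctx, "test-span")
		span.End()
	}

	// Shutdown flushes the batch span processor before exiting.
	if err := tp.Shutdown(ctx); err != nil {
		log.Fatal(err)
	}
}

After decision_wait (60s) has passed, the gauge scraped from :8888/metrics still reports 300: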

# HELP otelcol_processor_tail_sampling_sampling_traces_on_memory Tracks the number of traces current on memory
# TYPE otelcol_processor_tail_sampling_sampling_traces_on_memory gauge
otelcol_processor_tail_sampling_sampling_traces_on_memory{service_instance_id="490bb5de-9af9-4d47-8b16-afb69583fbc7",service_name="otelcol-contrib",service_version="0.79.0"} 300

Expected Result

Actual Result

Collector version

0.79.0

Environment information

otelcol-contrib_0.79.0_darwin_arm64

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      http:

exporters:
  logging:
    verbosity: detailed
    sampling_initial: 1000
    sampling_thereafter: 1000

processors:
  batch:
  tail_sampling:
    decision_wait: 60s
    policies:
      [
        {
          name: composite-policy,
          type: composite,
          composite:
            {
              max_total_spans_per_second: 10,
              policy_order: [composite-policy-errors, test-composite-always],
              composite_sub_policy:
                [
                  {
                    name: composite-policy-errors,
                    type: status_code,
                    status_code: {status_codes: [ERROR]}
                  },
                ],
              rate_allocation:
                [
                  {
                    policy: composite-policy-errors,
                    percent: 100
                  },
                ]
            }
        },
        {
          name: test-policy-8,
          type: rate_limiting,
          rate_limiting: {spans_per_second: 2}
        },
#        {
#          name: test-policy-1,
#          type: always_sample
#        },
#        {
#          name: test-policy-5,
#          type: status_code,
#          status_code: {status_codes: [ERROR]}
#        },
#        {
#          name: test-policy-4,
#          type: probabilistic,
#          probabilistic: {sampling_percentage: 50}
#        },
      ]

service:
  telemetry:
    logs:
      level: debug
    metrics:
      level: detailed
      address: ":8888"
  pipelines:
    traces:
      receivers: [otlp]
      #processors: [batch, tail_sampling]
      processors: [tail_sampling]
      exporters: [logging]

Log output

No response

Additional context

No response


Most upvoted comments

I think decision_wait controls the workflow before the code block I posted above. It's not relevant here at all.

Shouldn't they be thrown away, lowering otelcol_processor_tail_sampling_sampling_traces_on_memory accordingly?

I think you are correct. To me, that sounds like the expected behavior rather than the current one.

We may just need to set up a goroutine to drop that stale in-memory trace data. It's not currently implemented that way, though, so we'd need a PR to optimize it.
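
A conceptual sketch of such a cleanup goroutine, using a simplified stand-in for sampling.TraceData and an arbitrary sweep interval (this is not the processor's actual code):

package main

import (
	"sync"
	"sync/atomic"
	"time"
)

// traceEntry is a simplified stand-in for sampling.TraceData; only the field
// needed for eviction is sketched here.
type traceEntry struct {
	ArrivalTime time.Time
}

// startEvictionLoop periodically removes entries whose decision_wait has long
// passed and keeps an on-memory counter in sync, which is what the gauge
// would then report.
func startEvictionLoop(idToTrace *sync.Map, tracesOnMemory *atomic.Int64, decisionWait time.Duration, done <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(time.Minute) // assumed sweep interval
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				idToTrace.Range(func(key, value any) bool {
					entry := value.(*traceEntry)
					// Keep some slack beyond decision_wait so that
					// late-arriving spans can still find the final decision.
					if time.Since(entry.ArrivalTime) > 2*decisionWait {
						idToTrace.Delete(key)
						tracesOnMemory.Add(-1)
					}
					return true
				})
			}
		}
	}()
}

func main() {
	var idToTrace sync.Map
	var tracesOnMemory atomic.Int64
	done := make(chan struct{})

	idToTrace.Store("trace-1", &traceEntry{ArrivalTime: time.Now()})
	tracesOnMemory.Add(1)

	startEvictionLoop(&idToTrace, &tracesOnMemory, 60*time.Second, done)
	close(done)
}

Evicting too aggressively would break the handling of late-arriving spans discussed below, so the retention window is the real design question.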

I suggest we have a short discussion at the Collector SIG meeting tomorrow. Feel free to attend and comment: https://docs.google.com/document/d/1r2JC5MB7GupCE7N32EwGEXs9V_YIsPgoFiLP4VWVMkE/edit#heading=h.rbf22rxu3mij

Do you know why these traces are kept in memory after decision_wait has passed for them? They have already expired. Shouldn't they be thrown away, lowering otelcol_processor_tail_sampling_sampling_traces_on_memory accordingly?

The span data is deleted after decision_wait: the sync.Map idToTrace still retains every trace entry, just without its ReceivedBatches. The following is the relevant code:

// Sampled or not, remove the batches
trace.Lock()
allSpans := trace.ReceivedBatches
trace.FinalDecision = decision
trace.ReceivedBatches = ptrace.NewTraces()
trace.Unlock()

I don't think the trace ID should be deleted from memory when decision_wait expires, because spans for that trace can still be received after decision_wait, and at that point they are sampled or dropped according to the FinalDecision stored on the trace ID retained in memory.

d, loaded := tsp.idToTrace.Load(id)
if !loaded {
	d, loaded = tsp.idToTrace.LoadOrStore(id, &sampling.TraceData{
		Decisions:       initialDecisions,
		ArrivalTime:     time.Now(),
		SpanCount:       atomic.NewInt64(lenSpans),
		ReceivedBatches: ptrace.NewTraces(),
	})
}
actualData := d.(*sampling.TraceData) // d always holds a *sampling.TraceData here
if loaded {
	actualData.SpanCount.Add(lenSpans)
} else {
	...
}

// The only thing we really care about here is the final decision.
actualData.Lock()
finalDecision := actualData.FinalDecision

num_traces must be set large enough to ensure that the trace ID is still in idToTrace when the decision is made; the otelcol_processor_tail_sampling_sampling_trace_dropped_too_early metric tells you whether trace IDs are being dropped too early (a conceptual sketch of this bounded-map behavior follows the code excerpts below).

statDroppedTooEarlyCount = stats.Int64("sampling_trace_dropped_too_early", "Count of traces that needed to be dropped before the configured wait time", stats.UnitDimensionless)

for _, id := range batch {
	d, ok := tsp.idToTrace.Load(id)
	if !ok {
		metrics.idNotFoundOnMapCount++
		continue
	}
	...
}

// Later, the accumulated count is recorded:
statDroppedTooEarlyCount.M(metrics.idNotFoundOnMapCount),
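
To illustrate why num_traces matters, here is a self-contained conceptual sketch of a bounded trace-ID map with oldest-first eviction and a dropped-too-early counter (it mirrors the idea only; it is not the processor's implementation):

package main

import (
	"fmt"
	"sync"
)

// boundedTraceMap keeps at most capacity trace IDs. When a new ID would
// exceed the capacity, the oldest ID is evicted, which is what can make a
// trace disappear from memory before its sampling decision is made.
type boundedTraceMap struct {
	capacity  int
	order     []string // oldest first; stands in for the real bookkeeping
	idToTrace sync.Map
}

func (m *boundedTraceMap) add(id string, data any) {
	if len(m.order) == m.capacity {
		oldest := m.order[0]
		m.order = m.order[1:]
		m.idToTrace.Delete(oldest)
	}
	m.order = append(m.order, id)
	m.idToTrace.Store(id, data)
}

// decide looks up the trace at decision time; a miss corresponds to the
// sampling_trace_dropped_too_early counter being incremented.
func (m *boundedTraceMap) decide(id string, droppedTooEarly *int) {
	if _, ok := m.idToTrace.Load(id); !ok {
		*droppedTooEarly++
	}
}

func main() {
	m := &boundedTraceMap{capacity: 2}
	dropped := 0

	m.add("trace-1", struct{}{})
	m.add("trace-2", struct{}{})
	m.add("trace-3", struct{}{}) // evicts trace-1 before any decision was made

	m.decide("trace-1", &dropped)
	m.decide("trace-2", &dropped)

	fmt.Println("dropped too early:", dropped) // prints 1
}

With the default num_traces of 50000 and a steady span rate, this oldest-first eviction is what eventually caps memory, and each miss at decision time shows up in sampling_trace_dropped_too_early.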