opentelemetry-collector-contrib: otelcol_processor_tail_sampling_sampling_traces_on_memory only increments; it does not behave as a gauge
Component(s)
processor/tailsampling
What happened?
Description
The help text and declared type of the metric indicate that it is a gauge, but the value only ever increases, as if it were simply a count of the spans processed.
Steps to Reproduce
Run the collector with the configuration below and send 300 spans. Wait 2 minutes. Observe that the metric does not go down:
# HELP otelcol_processor_tail_sampling_sampling_traces_on_memory Tracks the number of traces current on memory
# TYPE otelcol_processor_tail_sampling_sampling_traces_on_memory gauge
otelcol_processor_tail_sampling_sampling_traces_on_memory{service_instance_id="490bb5de-9af9-4d47-8b16-afb69583fbc7",service_name="otelcol-contrib",service_version="0.79.0"} 300
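For reference, the 300 spans can be generated with a small program like the following. This is a minimal sketch using the OpenTelemetry Go SDK, not part of the original report; it assumes the OTLP/HTTP receiver is listening on the default localhost:4318.

package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP/HTTP exporter pointed at the collector's otlp receiver
	// (default endpoint assumed, since the config below does not override it).
	exp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("localhost:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		panic(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)

	// Each root span started here becomes its own trace, so this gives the
	// tail_sampling processor 300 traces to hold in memory.
	tracer := tp.Tracer("tail-sampling-repro")
	for i := 0; i < 300; i++ {
		_, span := tracer.Start(ctx, fmt.Sprintf("span-%d", i))
		span.End()
	}
}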
Expected Result
The gauge goes back down once decision_wait has passed and the traces are released from memory.
Actual Result
The metric only ever increases; after 2 minutes it still reports 300.
Collector version
0.79.0
Environment information
otelcol-contrib_0.79.0_darwin_arm64
OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      http:
exporters:
  logging:
    verbosity: detailed
    sampling_initial: 1000
    sampling_thereafter: 1000
processors:
  batch:
  tail_sampling:
    decision_wait: 60s
    policies:
      [
        {
          name: composite-policy,
          type: composite,
          composite:
            {
              max_total_spans_per_second: 10,
              policy_order: [composite-policy-errors, test-composite-always],
              composite_sub_policy:
                [
                  {
                    name: composite-policy-errors,
                    type: status_code,
                    status_code: {status_codes: [ERROR]}
                  },
                ],
              rate_allocation:
                [
                  {
                    policy: composite-policy-errors,
                    percent: 100
                  },
                ]
            }
        },
        {
          name: test-policy-8,
          type: rate_limiting,
          rate_limiting: {spans_per_second: 2}
        },
        # {
        #   name: test-policy-1,
        #   type: always_sample
        # },
        # {
        #   name: test-policy-5,
        #   type: status_code,
        #   status_code: {status_codes: [ERROR]}
        # },
        # {
        #   name: test-policy-4,
        #   type: probabilistic,
        #   probabilistic: {sampling_percentage: 50}
        # },
      ]
service:
  telemetry:
    logs:
      level: debug
    metrics:
      level: detailed
      address: ":8888"
  pipelines:
    traces:
      receivers: [otlp]
      #processors: [batch, tail_sampling]
      processors: [tail_sampling]
      exporters: [logging]
Log output
No response
Additional context
No response
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 15 (15 by maintainers)
I think decision_wait controls the workflow before the code block I posted above. It's not relevant here at all.

I think you are correct. To me, that sounds more like the expected behavior than the current one.
We may just need to set up a goroutine to drop that stale in-memory trace data, but it is currently not implemented that way, and we may need a PR to optimize it.
I suggest we have a short discussion at the Collector SIG meeting tomorrow. Feel free to attend and comment: https://docs.google.com/document/d/1r2JC5MB7GupCE7N32EwGEXs9V_YIsPgoFiLP4VWVMkE/edit#heading=h.rbf22rxu3mij
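For illustration only, here is a rough sketch of the kind of cleanup goroutine being discussed. It is not the processor's actual code and the names (traceEntry, report) are hypothetical: it periodically scans the in-memory trace map, deletes entries whose decision is final and older than decision_wait, and reports the remaining count so the on-memory metric could behave like a gauge.

package sketch

import (
	"context"
	"sync"
	"time"
)

// traceEntry is a hypothetical stand-in for the processor's per-trace state.
type traceEntry struct {
	ArrivalTime   time.Time
	DecisionFinal bool
}

// startCleanup periodically removes traces whose decision is final and whose
// decision_wait window has passed, then reports how many traces remain so a
// gauge could be set to the current in-memory count.
func startCleanup(ctx context.Context, idToTrace *sync.Map, decisionWait time.Duration, report func(remaining int64)) {
	ticker := time.NewTicker(decisionWait)
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				var remaining int64
				idToTrace.Range(func(key, value any) bool {
					entry := value.(*traceEntry)
					if entry.DecisionFinal && time.Since(entry.ArrivalTime) > decisionWait {
						idToTrace.Delete(key)
					} else {
						remaining++
					}
					return true // keep iterating
				})
				report(remaining) // e.g. set the traces-on-memory gauge
			}
		}
	}()
}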
Trace span data is deleted after decision_wait. The sync.Map idToTrace now retains all trace data except for ReceivedBatches. The following is the relevant code.

I don't think the trace ID in memory should be deleted when decision_wait expires, because span data can still arrive after decision_wait, and at that point the processor decides whether to sample according to the FinalDecision retained in memory for that trace ID. num_traces must be set large enough to ensure the trace ID is still in idToTrace when that decision is applied; the otelcol_processor_tail_sampling_sampling_trace_dropped_too_early metric shows whether trace IDs have been dropped too early.
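To illustrate that point, here is a simplified sketch of how a span arriving after decision_wait would be handled. This is not the processor's actual code; the types and helpers (lateTrace, forward, droppedTooEarly) are hypothetical. If the trace ID is still in idToTrace, the retained FinalDecision is applied; if the entry was already evicted because num_traces was too small, the span falls into the path that the dropped-too-early metric counts.

package sketch

import "sync"

// decision is a hypothetical sampling decision kept per trace ID.
type decision int

const (
	pending decision = iota
	sampled
	notSampled
)

// lateTrace is a hypothetical stand-in for the state retained in idToTrace.
type lateTrace struct {
	FinalDecision decision
}

// onLateSpan handles a span that arrives after decision_wait has expired.
func onLateSpan(idToTrace *sync.Map, traceID [16]byte, forward func(), droppedTooEarly func()) {
	v, ok := idToTrace.Load(traceID)
	if !ok {
		// The trace ID was already evicted (num_traces too small): the late span
		// can no longer be matched to a decision, which is the situation the
		// sampling_trace_dropped_too_early metric counts.
		droppedTooEarly()
		return
	}
	// The trace ID is still in memory: apply the retained FinalDecision.
	if v.(*lateTrace).FinalDecision == sampled {
		forward() // release the late span downstream
	}
}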