dapr: [Proposal] Simplify tracing configuration and resolve the potential exporting bug
In what area(s)?
/area runtime
Goals
- Remove expandParams and includeBody to simplify tracing options and remove potential bugs
- Use a tracing sampler and simplify the tracing onboarding experience
Problem
Enabling tracing currently takes three steps:
- Apply tracing config
    apiVersion: dapr.io/v1alpha1
    kind: Configuration
    metadata:
      name: tracing
    spec:
      tracing:
        enabled: true
        expandParams: false
        includeBody: true
- Apply tracing component yaml
- Add dapr.io/tracing annotation to each app
This proposal addresses the issue of tracing config. Dapr provides three options: enabled, expandParams, and includeBody.
- The enabled option enables tracing globally, although each app still needs to configure the annotation.
- The expandParams option exports all HTTP or gRPC headers into the trace span’s annotations.
- The includeBody option exports the request body into the trace span’s annotations.
If a user passes body or header values that contain non-UTF-8 characters or exceed 512 characters, the OpenCensus exporter fails to export trace spans to the telemetry backend (such as Zipkin or ocagent) without any verbose error, because:
OpenCensus trace allows only UTF-8 text within the 512-character limit for span annotations.
In this case, all trace spans in the buffer are discarded silently (see the issue). Users could disable expandParams and includeBody as a workaround, but this is not the right fix.
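A minimal sketch of the constraint described above, assuming the limit is 512 characters counted on the decoded text (the proposal states the limit but not exactly how it is counted). The name annotation_problem is hypothetical and is not part of Dapr or OpenCensus; it only illustrates the kind of check a header or body value would have to pass before being attached to a span.

```python
from typing import Optional

# Assumption: limit taken from the proposal text, counted in decoded characters.
MAX_ANNOTATION_LEN = 512


def annotation_problem(raw: bytes, limit: int = MAX_ANNOTATION_LEN) -> Optional[str]:
    """Return a reason if `raw` is unsafe as a span annotation value, else None."""
    try:
        text = raw.decode("utf-8")  # rejects byte sequences that are not valid UTF-8
    except UnicodeDecodeError:
        return "not valid UTF-8"
    if len(text) > limit:
        return f"longer than {limit} characters"
    return None
```

With expandParams or includeBody enabled, any request header or body could fail a check like this, which is why a single oversized or binary value is enough to trigger the export failure.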
Benefit of opencensus trace sampler
The standard correlation headers include four fields: version, trace-id, parent-id, trace-flags
Example:
    Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    base16(version) = 00
    base16(trace-id) = 4bf92f3577b34da6a3ce929d0e0e4736
    base16(parent-id) = 00f067aa0ba902b7
    base16(trace-flags) = 01 // sampled
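The field layout above can be sketched as a small parser. This is an illustration only, assuming the version 00 layout of the W3C traceparent header shown in the example; TraceParent and parse_traceparent are hypothetical names, not Dapr or OpenCensus APIs.

```python
import re
from typing import NamedTuple

# Assumption: version-00 layout, four lowercase hex fields separated by "-".
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> TraceParent:
    """Split a traceparent header value into its four fields."""
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The lowest bit of trace-flags is the "sampled" flag.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)
```

The sampled flag is the piece that matters for the sampler discussion: it is how the caller's sampling decision travels to the next app.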
To understand how the sampler works, assume AppA has a sampling rate of 1.0 and AppB has a rate of 0.0. According to the W3C spec, which OpenCensus follows, AppA needs to pass the standard correlation headers to AppB.
The sampling rates of AppA and AppB differ, but the OpenCensus SDK in AppB looks at the trace-flags of the parent span (AppA’s span). If the parent span is sampled, AppB’s span is sampled too. In this case, AppB’s span is sampled even though its own sampling rate is 0.
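The AppA/AppB scenario can be modeled in a few lines. This is a simplified sketch of the behavior described above, not OpenCensus code: should_sample is a hypothetical function, and it assumes a plain random draw for the local decision, whereas a real sampler may derive that decision from the trace ID.

```python
import random
from typing import Callable, Optional


def should_sample(
    local_rate: float,
    parent_sampled: Optional[bool],
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether a new span is sampled.

    A sampled parent wins over the local rate; otherwise the local
    probability applies (rng() returns a float in [0.0, 1.0)).
    """
    if parent_sampled:
        return True
    return rng() < local_rate


# AppA (rate 1.0) starts the trace, so its span is always sampled.
app_a_sampled = should_sample(1.0, parent_sampled=None)
# AppB (rate 0.0) receives trace-flags=01 from AppA and is sampled anyway.
app_b_sampled = should_sample(0.0, parent_sampled=app_a_sampled)
```

Under this model, a rate of 0 only suppresses traces that start at that app; traces that arrive already sampled are still recorded.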
Proposal
The proposal simplifies the tracing config options and delivers the correct functional behavior.
Simplified tracing config
- Default behavior: daprd enables tracing with a low sampling rate by default
- Remove the expandParams and includeBody options
  - A span annotation value must be UTF-8 and shorter than 512 characters.
  - Populating all HTTP and gRPC request headers into span annotations risks security and PII violations.
- Introduce a samplingRate option to leverage the OpenCensus probabilistic sampling concept
- Remove the enabled option
  - To turn off tracing, the user can set samplingRate to 0
  - With samplingRate=0, the Start Trace Span operation in OpenCensus will not hurt performance.
    apiVersion: dapr.io/v1alpha1
    kind: Configuration
    metadata:
      name: myappconfig
      namespace: app_namespace
    spec:
      tracing:
        samplingRate: 0.001
Changed behaviors
- Track the headers that have the dapr- prefix into span annotations instead of having expandParams.
  - Instead of exporting all headers in HTTP or gRPC requests, users can selectively expose headers to the trace span on demand. If we export all headers, authentication tokens will be exposed to the trace, which is a security violation; if the headers include users’ PII, it is a GDPR violation.
- Track response status and body in the span
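The prefix-based selection can be sketched as a simple filter. This is an illustration under assumptions, not the daprd implementation: headers_to_annotations is a hypothetical function, header names are assumed to be matched case-insensitively, and the example header names in the usage note are made up.

```python
from typing import Dict, Mapping

# Prefix named in the proposal; matching is assumed to be case-insensitive.
DAPR_PREFIX = "dapr-"


def headers_to_annotations(
    headers: Mapping[str, str], prefix: str = DAPR_PREFIX
) -> Dict[str, str]:
    """Keep only the headers the user opted in to by naming them with the prefix."""
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith(prefix)
    }
```

For a request carrying Authorization, dapr-order-id, and Dapr-Tenant headers (hypothetical names), only the two dapr- headers would reach the span, so the token is never exported.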
App-level configuration
App-level configuration is crucial in a production environment when DevOps investigates an incident. We will keep the current tracing configuration method as is, via the dapr.io/config annotation in the application deployment object/YAML.
Currently, a customer can have multiple configurations at the namespace level and select one at the app level by using the dapr.io/config annotation. Assume we have two trace configurations in a namespace.
- prod-tracing
    apiVersion: dapr.io/v1alpha1
    kind: Configuration
    metadata:
      name: prod-tracing
      namespace: app_namespace
    spec:
      tracing:
        samplingRate: 0.001
- debug-tracing
    apiVersion: dapr.io/v1alpha1
    kind: Configuration
    metadata:
      name: debug-tracing
      namespace: app_namespace
    spec:
      tracing:
        samplingRate: 1.0
In a normal production environment, all apps would use the prod-tracing config. When an incident occurs in production, the user can change dapr.io/config to debug-tracing for the problematic microservice app to investigate its traces.
Consequences
- Simplify the tracing configuration
  - Users can configure tracing at the namespace level or the app level
- Prevent the potential issue of trace span export due to the span annotation limitation.
- Offer the sampling option to reduce performance impact when user turns on tracing.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 31 (31 by maintainers)
@Haishi2016 @yaron2 @amanbha @shalabhms @RicardoNiepel @lynn-orrell
Thanks for all your feedback. I’ve updated the proposal based on it. When it comes to the minimum sampling rate, we can adjust it later; I could not come up with a good number for now.
Please review it again.
cc/ @msfussell @orizohar
I really love @amanbha’s and @lynn-orrell’s suggestions and would clearly vote for these.
Could we get rid of the required configuration but keep sane defaults? For example, by default, tracing would be enabled at a samplingRate of 0.5. If a global configuration manifest is present, it would override the baked in daprd defaults.
Then, also allow command line args (and annotations for k8s deployments) that you could set to override the defaults?
So, you’d have:
Thoughts?
@youngbupark @Haishi2016 @amanbha @yaron2 @shalabhms
After reading this whole thread, I believe:
This would mean the following:
Thoughts?