dapr: [Proposal] Simplify tracing configuration and resolve the potential exporting bug

In what area(s)?

/area runtime

Goals

  • Remove expandParams and includeBody to simplify tracing options and remove potential bugs
  • Use the tracing sampler and simplify the tracing onboarding experience

Problem

Enabling tracing currently takes three steps:

  1. Apply tracing config
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: tracing
spec:
  tracing:
    enabled: true
    expandParams: false
    includeBody: true
  2. Apply tracing component yaml
  3. Add dapr.io/tracing annotation to each app

This proposal addresses the issue of tracing config. Dapr provides three options: enabled, expandParams, and includeBody.

  • The enabled option turns tracing on globally, although each app still needs to configure the annotation.
  • The expandParams option exports all HTTP or gRPC headers into the trace span’s annotation.
  • The includeBody option exports the request body into the trace span’s annotation.

If a user passes body or header values that contain non-UTF-8 characters or are longer than 512 characters, the opencensus exporter fails to export the trace span to the telemetry backend (such as zipkin or ocagent) without any verbose error, because:

Opencensus trace allows only utf-8 512 characters for span annotation.

In this case, all trace spans in the buffer are silently discarded (see the issue). Users could disable expandParams and includeBody as a workaround, but this is not the right fix.
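The constraint above can be expressed as a small pre-export check. This is only a sketch in Python: the helper name is_exportable is hypothetical, and whether a value of exactly 512 characters is accepted is an assumption, since the proposal only states the limit informally.

```python
# Limit quoted in this proposal; the exact boundary is an assumption.
MAX_ANNOTATION_CHARS = 512


def is_exportable(value: bytes) -> bool:
    """Return True if value decodes as UTF-8 and fits within the length limit."""
    try:
        text = value.decode("utf-8")
    except UnicodeDecodeError:
        # Non-UTF-8 bytes are what causes the silent export failure described above.
        return False
    return len(text) <= MAX_ANNOTATION_CHARS
```

A check like this would let a runtime skip or truncate an offending value instead of losing every span in the buffer.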

Benefit of opencensus trace sampler

The standard correlation headers include four fields: version, trace-id, parent-id, trace-flags

examples:

Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
base16(version) = 00
base16(trace-id) = 4bf92f3577b34da6a3ce929d0e0e4736
base16(parent-id) = 00f067aa0ba902b7
base16(trace-flags) = 01  // sampled
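
As an illustration, the header value above can be split into its four fields as follows. This is a minimal Python sketch that handles only the four-field form shown here; the names TraceParent and parse_traceparent are hypothetical.

```python
from typing import NamedTuple

_HEX = set("0123456789abcdef")  # the header uses lowercase base16


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    trace_flags: int

    @property
    def sampled(self) -> bool:
        # The lowest bit of trace-flags is the "sampled" flag.
        return bool(self.trace_flags & 0x01)


def parse_traceparent(value: str) -> TraceParent:
    """Split a correlation header value into version, trace-id, parent-id, trace-flags."""
    parts = value.strip().split("-")
    if len(parts) != 4:
        raise ValueError("expected four dash-separated fields")
    expected_lengths = (2, 32, 16, 2)
    for part, length in zip(parts, expected_lengths):
        if len(part) != length or not set(part) <= _HEX:
            raise ValueError(f"malformed field: {part!r}")
    version, trace_id, parent_id, flags = parts
    return TraceParent(version, trace_id, parent_id, int(flags, 16))
```

Calling parse_traceparent on the example value yields a trace_flags of 1, so its sampled property is true.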

To understand how the sampler works, assume AppA has a sampling rate of 1.0 and AppB has a rate of 0.0. According to the W3C spec (which opencensus follows), AppA needs to pass the standard correlation headers to AppB.

The sampling rates of AppA and AppB differ, but the opencensus SDK in AppB looks at the trace-flags of the parent span (AppA’s span). If the parent span is sampled, AppB’s span is sampled as well. In this case, AppB’s span is sampled even though its sampling rate is 0.
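
The AppA/AppB behavior can be sketched in Python as below. This is a simplified model loosely based on how the opencensus probability sampler is understood to work, not its actual code; the function name should_sample and the comparison of the first 8 bytes of the trace ID against a rate-derived bound are assumptions for illustration.

```python
def should_sample(trace_id_hex: str, sampling_rate: float, parent_sampled: bool = False) -> bool:
    """Decide whether a span is sampled, honoring the parent's sampled flag first."""
    if parent_sampled:
        # A sampled parent wins regardless of the local rate (the AppB case above).
        return True
    if sampling_rate <= 0:
        return False
    if sampling_rate >= 1:
        return True
    # Assumed scheme: map the first 8 bytes of the trace ID to a 63-bit number
    # and sample when it falls below rate * 2^63.
    upper_bound = int(sampling_rate * (1 << 63))
    x = int(trace_id_hex[:16], 16) >> 1
    return x < upper_bound
```

With this model, AppB at rate 0.0 drops spans it starts itself, but still samples any span whose parent from AppA arrived with the sampled flag set. Because the decision is derived from the trace ID, services using the same rate make the same choice for the same trace.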

Open census sampling code

Proposal

The proposal simplifies the tracing config options and delivers the correct functional features.

Simplified tracing config

  • Default behavior: daprd enables tracing with a low sampling rate by default
  • Remove expandParams and includeBody options
    • A span annotation value must be UTF-8 encoded and shorter than 512 characters.
    • Populating span annotations with all headers of HTTP and gRPC requests risks security and PII violations.
  • Introduce samplingRate option to leverage the opencensus probabilistic sampling concept
  • Remove enabled option
    • To turn off tracing, users can set samplingRate to 0.
    • With samplingRate=0, the Start Trace Span operation in opencensus does not hurt performance.
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: myappconfig
  namespace: app_namespace
spec:
  tracing:
    samplingRate: 0.001

Changed behaviors

  • Record headers that have the dapr- prefix in the span annotation, replacing expandParams.
    • Instead of exporting every header of an HTTP or gRPC request, users can selectively expose headers to the trace span on demand. Exporting all headers would leak authentication tokens into traces, which is a security violation, and headers containing users’ PII would be a GDPR violation.
  • Track response status and body in the span
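
The header selection described above could look roughly like this. It is a sketch only: the helper name select_span_headers is hypothetical, and matching the dapr- prefix case-insensitively is an assumption based on HTTP header names being case-insensitive.

```python
from typing import Dict


def select_span_headers(headers: Dict[str, str], prefix: str = "dapr-") -> Dict[str, str]:
    """Keep only headers whose name starts with the prefix (case-insensitive)."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower().startswith(prefix)
    }
```

Given a request carrying Authorization, dapr-tenant, and Content-Type headers, only dapr-tenant would reach the span, so tokens and other sensitive headers stay out of the trace unless the user opts in by naming them with the prefix.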

App-level configuration

App-level configuration is crucial in a production environment when devops investigates an incident. We will keep the current method of selecting a tracing configuration, via the dapr.io/config annotation in the application deployment object/yaml.

Currently, a customer can have multiple configurations at the namespace level and select one at the app level by using the dapr.io/config annotation. Assume we have two trace configurations in a namespace.

  1. prod-tracing
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: prod-tracing
  namespace: app_namespace
spec:
  tracing:
    samplingRate: 0.001
  2. debug-tracing
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: debug-tracing
  namespace: app_namespace
spec:
  tracing:
    samplingRate: 1.0

In a normal production environment, all apps would use the prod-tracing config. When an incident occurs in production, the user can change dapr.io/config to debug-tracing for the problematic microservice app to investigate its traces.

Consequences

  • Simplify the tracing configuration: users can configure tracing at the namespace level or the app level.
  • Prevent the potential issue of trace span export due to the span annotation limitation.
  • Offer the sampling option to reduce performance impact when user turns on tracing.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 31 (31 by maintainers)

Most upvoted comments

@Haishi2016 @yaron2 @amanbha @shalabhms @RicardoNiepel @lynn-orrell

Thanks for all your feedback. I’ve updated the proposal based on it. As for the minimum sampling rate, we can adjust it later; I could not come up with a good number for now.

Please review it again.

cc/ @msfussell @orizohar

I really love @amanbha’s and @lynn-orrell’s suggestions and would clearly vote for these.

Could we get rid of the required configuration but keep sane defaults? For example, by default, tracing would be enabled at a samplingRate of 0.5. If a global configuration manifest is present, it would override the baked in daprd defaults.

Then, also allow command line args (and annotations for k8s deployments) that you could set to override the defaults?

So, you’d have:

  • daprd baked in defaults (tracing on / sampling 0.5)
  • global configuration (if present, this would override the daprd baked in defaults)
  • cmd line / annotations (if present, this would override both of the other options)
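
A minimal sketch of that precedence order in Python, purely for illustration. The function and parameter names are hypothetical, and the 0.5 default is the value suggested in this comment, not a decided one.

```python
from typing import Optional

# Default suggested in this comment; an assumption, not a shipped value.
DAPRD_DEFAULT_SAMPLING_RATE = 0.5


def effective_sampling_rate(
    global_config: Optional[float] = None,
    override: Optional[float] = None,
) -> float:
    """Resolve the rate: cmd line / annotation, then global config, then baked-in default."""
    for candidate in (override, global_config):
        # Compare against None so that an explicit 0.0 (tracing off) is respected.
        if candidate is not None:
            return candidate
    return DAPRD_DEFAULT_SAMPLING_RATE
```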

Thoughts?

@youngbupark @Haishi2016 @amanbha @yaron2 @shalabhms

After reading this whole thread, I believe:

  • A global configuration provides cluster-wide control. To set customers on the correct path, it is enabled by default in k8s (like mTLS) with a low sampling rate.
  • Then there are more granular controls, as you are suggesting (if the user wants to configure it differently for each app), but they are applied only when tracing is enabled globally. If tracing is disabled globally, they are ignored.

This would mean the following:

  • By default, in k8s the user doesn’t need to do anything, and tracing is enabled with minimal sampling.
  • Changing the global sampling rate or disabling tracing is a global config change.
  • Users only need to provide per-app settings if they want a different level for each app.
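
One way to read these rules, sketched in Python. The function name is hypothetical, and treating a global rate of 0 as "disabled" is an assumption borrowed from the samplingRate proposal above.

```python
from typing import Optional


def app_sampling_rate(global_rate: float, app_rate: Optional[float] = None) -> float:
    """Apply a per-app rate only while tracing is enabled globally."""
    if global_rate <= 0:
        # Globally disabled: per-app settings are ignored.
        return 0.0
    if app_rate is None:
        # No per-app setting: inherit the global rate.
        return global_rate
    return app_rate
```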

Thoughts?