opentelemetry-go: Resource merge semantics make upgrades problematic
Description
We’ve encountered issues upon a recent upgrade to OTEL v1.1.0 (which upgraded to semconv v1.7.0) and existing code or contrib code which still used semconv v1.4.0. See https://github.com/open-telemetry/opentelemetry-go-contrib/issues/1384 for an example.
The spec (https://github.com/open-telemetry/opentelemetry-specification/blob/bad49c714a62da5493f2d1d9bafd7ebe8c8ce7eb/specification/resource/sdk.md#merge) seems a bit contradictory in how this should be handled:
The interface MUST provide a way for an old resource and an updating resource to be merged into a new resource.
while also mentioning the merge algorithm (which is implemented to the spec in opentelemetry-go) as:
Else this is a merging error (this is the case when the Schema URL of the old and updating resources are not empty and are different).
From a consumer perspective, there is no real way to determine if all of the resources are compatible prior to calling resource.New(…), and thus we’re failing to create a resource detector.
Instead of the current algorithm, it feels like if the two resource schemas being merged are compatible (same major version), the merge should be able to proceed. Otherwise, we’ll need to coordinate an upgrade of every telemetry package together to ensure we’re on a consistent version of semconv.
Possible algorithms:
- Create merged resource using the oldest version of semconv. This might introduce new attributes which are introduced with later specs, however it should still be compatible with an older version.
- Create merged resource using the newest version of semconv (perhaps with migration of certain well known attributes - i.e.
rpc.jsonrpc.method
->rpc.method
). - Create unversioned resource if the versions don’t match.
Environment
- OS: Linux/Mac
- Architecture: x86_64
- Go Version: 1.16+
- opentelemetry-go version: v1.1.0
Steps To Reproduce
- Configure a resource with a combination of resource detectors which use different versions of semconv package (i.e. v1.4.0 + v1.7.0).
resource.New
fails because two different semconv versions are in use: https://github.com/open-telemetry/opentelemetry-go/blob/7ce58f355851d0412e45ceb79d977bc612701b3f/sdk/resource/resource.go#L190-L192
Expected behavior
Resource detector should be able to be initialized.
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 11
- Comments: 23 (13 by maintainers)
We hit this issue multiple times already because of routine package updates and without any changes from our side. The main problem is that often the issue can only be noticed in production, and it feels like the impact is not justified by the severity of the problem we’re trying to solve using schema versions. I mean, should we really care if name of some field (which may never be really looked at) was changed and now we temporarily have duplicated field? Especially considering, that often resources define mutually exclusive set of fields.
While I can imagine the cases where it is important to enforce compatible schemas, maybe we can just introduce a flag to ignore schema version incompatibility?
@MrAlias How was this resolved? I got bit by this today.
Excepting that the current strategy adopted in the related
opentelemetry-go-contrib
packages fails to ensure that each detector/instrumentation is separately implemented and available for each semconv – instead it afixes only a single semconv per release.https://github.com/open-telemetry/opentelemetry-go-contrib/issues/3657#issuecomment-1765527620
What’s more challenging still is that within that single package for a given release, there are currently three different semconvs in use –
gcp
even has differing semconv within the same ‘namespace’.While it is not necessarily one to discuss here, the problems with interoperability due to brittle design decisions – particularly with distributed systems – is rather concerning to me as a consumer of the standard, especially as it is currently adopted in Go.
Right now I’m seeing version bumps due to security related issues for the contrib packages. The problem is that they don’t abide by the compatibility contract of golang modules. Arguably the semconv strategy and contract of otel conflicts with golang’s own module contract to the extent that IMHO each semconv alteration should see a similar
vN
release of a module in the opentelemetry-go-contrib space. That is, each otel semconv has its own package for each detector/instrumentation/instrgen etc module.@pellared sorry to follow up but I’m looking for clarity on a couple of things.
as casual consumers of otel, it’s a bit unreasonable to expect them to “know” that it follows a different meaning of semver.
I understand they are different packages, but how come I have to change packages if I haven’t changed my code other than a dependency? If my app uses
v1.7.0
to implement a schemaless detector (to set some attributes from the environment), shouldn’t it continue working even if the cloud provider detector (detecting other attributes) was upgraded tov1.21.0
?It feels like
resource.New
should be versioned with semconv if it’s reliant on all detectors’ semconv versions matching.Telemetry schema and their description of changes from one version to the next are documented in the specification. The gap lies in the resource specification including a requirement for schema URL support and stating that merging resources with different schema URLs is an error without including a requirement that schema translation be applied or that there be a mechanism for indicating that schema translation should be applied.
There was significant discussion of this topic at the maintainers’ meeting today and I expect there will be more discussion of this issue at the specification SIG meeting tomorrow. I’m sorry if my earlier comment appeared to imply that I did not care about this issue or that we were not intending to do anything about it. We certainly do intend to ensure that resources with associated telemetry schema can be used easily and safely. I think that the proposal @MrAlias has made in #3944 is good, and the implementation is the logical conclusion of work started almost two years ago when we began introducing telemetry schema. However, I think it does not go far enough and the spec should instead direct that telemetry schema migration SHOULD be used to align attributes on resources to be merged, either implicitly at the newer of the two versions or explicitly at the version provided by the user. We do, however, need to be careful to ensure that we do not introduce changes to our stable SDK that are incompatible with the specification or that get ahead of specification language under consideration that may change before it is stabilized.
Closing as this is now addressed with the OTel schema.
There are plans for the collector to have a processor with the same logic and the ability to normalize all incoming attributes to a given schema version.
This is a problem we’re in the process of addressing. The semantic conventions now include a schema that describes changes from version to version in a way that will enable migrating a resource from one version to the next. We recently landed https://github.com/open-telemetry/opentelemetry-go/pull/2267 which gives us the ability to parse those schema definitions. The next step is to implement the transformations required and then to update the
Resource.Merge()
method to migrate resources with differing schemas to a common schema.Note for those following here: https://github.com/open-telemetry/opentelemetry-go/pull/4484#issuecomment-1710539976
It would be not compliant with the specification.
I created https://github.com/open-telemetry/opentelemetry-go/pull/4484 that should help to mitigate the issue.
Here’s a new repo I created to handle schema translations: https://github.com/MrAlias/otel-schema-utils
Only thing supported currently is resource translations.
I’m not sure there is anything we can do in the Go implementation at this time to improve the situation. The specification does not detail how to perform resource merging with multiple schema URLs involved, instead it defines this as an error and states that the resulting resource is undefined. There is a proposal to separate semantic conventions from the specification, which may result in a new schema family and versions, which I think would only further complicate this situation.
While I think it might be argued that it would be spec compliant to return an error and a merged resource with the schema URL removed, doing so would likely condition users to ignore the error and result in surprising behavior when the implementation changes and/or different behavior is specified.