jaeger: Jaeger ingester panics on convertProcess function

Describe the bug The Jaeger ingester panics when consuming data from Kafka and attempting to write it into Elasticsearch.

I believe the following code is the reason for the panic.

https://github.com/jaegertracing/jaeger/blob/4f55a70524b29ef87178b4deffbd5a99be5d017a/plugin/storage/es/spanstore/dbmodel/from_domain.go#L123

There seem to be no safeguards for a null process or null tags.
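
A minimal, self-contained illustration of the suspected failure mode (my own sketch, assuming a span whose Process pointer is nil reaches the converter):

package main

import (
	"fmt"

	"github.com/jaegertracing/jaeger/model"
)

func main() {
	span := &model.Span{OperationName: "op"} // Process deliberately left nil

	// The generated getter is nil-receiver-safe and simply returns nil.
	fmt.Println(span.Process.GetTags())

	// Direct field access on the nil *model.Process panics, which is what the
	// process.Tags access around from_domain.go:123 appears to be doing.
	fmt.Println(span.Process.Tags) // panic: invalid memory address or nil pointer dereference
}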

This is the stack trace:

{"level":"info","ts":1646779105.5014815,"caller":"consumer/processor_factory.go:65","msg":"Creating new processors","partition":31}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xd94dc6]

goroutine 84516 [running]:
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:123
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertSpanEmbedProcess({0x0, 0xc0003c48a0, {0x1499618, 0xf23872}}, 0xc00c544690)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:64 +0x126
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.FromDomainEmbedProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:43
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore.(*SpanWriter).WriteSpan(0xc0005f8cc0, {0x0, 0x0}, 0xc00c544690)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/writer.go:152 +0x7a
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.KafkaSpanProcessor.Process({{0x14ad480, 0x1dee2a8}, {0x14ad340, 0xc0005f8cc0}, {0x0, 0x0}}, {0x14af760, 0xc0111950e0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/span_processor.go:67 +0xd3
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator.(*retryDecorator).Process(0xc00008f180, {0x14af760, 0xc0111950e0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator/retry.go:110 +0x37
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*comittingProcessor).Process(0xc0085cfaa0, {0x14af760, 0xc0111950e0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/committing_processor.go:44 +0x5e
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*metricsDecorator).Process(0xc00f38fa00, {0x14af760, 0xc0111950e0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/metrics_decorator.go:44 +0x5b
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start.func1()
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:57 +0x42
created by github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:53 +0x10c

To Reproduce

We aren’t sure how to reproduce this as the only component panicking is the ingester.

We also haven’t seen what set of traces are causing this.

The best way to attempt to reproduce this could be to send a trace to the collector without process tags. We just haven’t tried that yet.

Expected behavior We expect the Jaeger ingester to handle any “bad” data gracefully and to expose configuration that lets us choose whether to panic or discard the data.

Screenshots N/A

Version (please complete the following information):

  • OS: Amazon Linux 2 (5.4.176-91.338.amzn2.x86_64)
  • Jaeger version: 1.31.0 (we were on 1.20.0 and did not see this issue there)
  • Elasticsearch version: 7.16.2
  • Kafka version: quay.io/strimzi/kafka:0.25.0-kafka-2.8.0
  • K8s version: 1.19 (we are also seeing this on 1.21)
  • Deployment: Kubernetes via the Jaeger Operator

What troubleshooting steps did you try?

We tried downgrading the ingesters back to 1.20.0 and saw the same issue.

We recreated the Jaeger deployment and still saw the issue.

The only temporary fix was clearing the Kafka queue, which implies bad data is making it into the queue.

Additional context We are going to add collector tags to everything, which should force process tags onto all spans if I am reading the docs correctly.

https://www.jaegertracing.io/docs/1.32/cli/#jaeger-collector-kafka

--collector.tags: One or more tags to be added to the Process tags of all spans passing through this collector. Ex: key1=value1,key2=${envVar:defaultValue}

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 38 (16 by maintainers)

Most upvoted comments

The processing in the ingester is really barebones:

https://github.com/jaegertracing/jaeger/blob/cdbdb926b669e3be9581b6389aff2d7e6b9e06f2/cmd/ingester/app/processor/span_processor.go#L61-L68

It feels weird to me to be adding data format validation at this point, especially if the root cause is data corruption, which can affect any other aspect of the span message, not just the null process that came up in the exception.

If the ingester sees an invalid message, it should just log it and skip it.
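
A rough sketch of that log-and-skip idea (mine, not actual Jaeger code; the helper name and error text are illustrative), as a validation step the Process function could run after unmarshalling and before handing the span to the storage writer:

package processor // hypothetical placement alongside span_processor.go

import (
	"errors"

	"github.com/jaegertracing/jaeger/model"
)

var errInvalidSpan = errors.New("invalid span: missing Process or service name, skipping write")

// validateSpan rejects spans that would otherwise panic the Elasticsearch writer.
func validateSpan(span *model.Span) error {
	if span == nil || span.Process == nil || span.Process.ServiceName == "" {
		return errInvalidSpan
	}
	return nil
}

On a validation failure, Process would log (or bump a metric) and return without calling WriteSpan, so one bad Kafka message no longer takes down the whole ingester.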

There was a fix for this specific NPE in 1.34.

Also, people still seem to be getting panics with 1.35 - #3765

The collector seems to be writing null messages to the Kafka topic, which seems to cause the panic in the ingester:

{"level":"debug","ts":1650237044.222346,"caller":"consumer/consumer.go:138","msg":"Got msg","msg":{"Headers":null,"Timestamp":"0001-01-01T00:00:00Z","BlockTimestamp":"0001-01-01T00:00:00Z","Key":null,"Value":null,"Topic":"tools_tools_us_prod_jaeger_spans","Partition":2,"Offset":39319}}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x198f046]

goroutine 148 [running]:
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:123
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.convertSpanEmbedProcess({0x40, 0xc000518d80, {0x2094088, 0x1b1eaf2}}, 0xc0003b85a0)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:64 +0x126
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel.FromDomain.FromDomainEmbedProcess(...)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/dbmodel/from_domain.go:43
github.com/jaegertracing/jaeger/plugin/storage/es/spanstore.(*SpanWriter).WriteSpan(0xc0005205a0, {0x0, 0x0}, 0xc0003b85a0)
	github.com/jaegertracing/jaeger/plugin/storage/es/spanstore/writer.go:152 +0x7a
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.KafkaSpanProcessor.Process({{0x20a8120, 0x29ebfc0}, {0x20a7fe0, 0xc0005205a0}, {0x0, 0x0}}, {0x20aa520, 0xc00028caa0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/span_processor.go:67 +0xd3
github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator.(*retryDecorator).Process(0xc00019c230, {0x20aa520, 0xc00028caa0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/decorator/retry.go:110 +0x37
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*comittingProcessor).Process(0xc0005d1bc0, {0x20aa520, 0xc00028caa0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/committing_processor.go:44 +0x5e
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*metricsDecorator).Process(0xc00041c340, {0x20aa520, 0xc00028caa0})
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/metrics_decorator.go:44 +0x5b
github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start.func1()
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:57 +0x42
created by github.com/jaegertracing/jaeger/cmd/ingester/app/processor.(*ParallelProcessor).Start
	github.com/jaegertracing/jaeger/cmd/ingester/app/processor/parallel_processor.go:53 +0x10c
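
A small demonstration of why such a null message slips past deserialization (assuming the gogo/protobuf-generated model types used by the ingester’s protobuf unmarshaller): an empty payload unmarshals without error and yields a span whose Process is nil, which later panics in convertProcess.

package main

import (
	"fmt"

	"github.com/gogo/protobuf/proto"
	"github.com/jaegertracing/jaeger/model"
)

func main() {
	var span model.Span

	// A Kafka message whose Value is nil deserializes without error,
	// because zero bytes is a valid encoding of an empty protobuf message...
	if err := proto.Unmarshal(nil, &span); err != nil {
		fmt.Println("unmarshal error:", err)
		return
	}

	// ...but the resulting span has a nil Process, so the later
	// process.Tags access in convertProcess dereferences a nil pointer.
	fmt.Println("Process is nil:", span.Process == nil) // Process is nil: true
}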

We are having the same issue with an Elasticsearch backend after upgrading it from 7.17.2 (Jaeger 1.30.0 works great with that version) to 7.17.5.

It seems like either process or process.Tags was null. Adding a safeguard there might solve this.

Update: my naive attempt: https://github.com/locmai/jaeger/commit/32a8e31fb9180c31a530b4004ebfafef209e81d2

Another approach could also work:

Instead of using:

tags, tagsMap := fd.convertKeyValuesString(process.Tags)

We could:

tags, tagsMap := fd.convertKeyValuesString(process.GetTags())

process.GetTags() already has a nil check that we could reuse.
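
For context, a sketch of what that nil-safe variant of convertProcess might look like (the surrounding types and field names are taken from my reading of the dbmodel package, so treat this as illustrative rather than the fix that was eventually merged):

func (fd FromDomain) convertProcess(process *model.Process) Process {
	// process.GetTags() is a nil-receiver-safe generated getter, so a nil
	// *model.Process no longer dereferences a nil pointer here.
	tags, tagsMap := fd.convertKeyValuesString(process.GetTags())
	serviceName := ""
	if process != nil {
		serviceName = process.ServiceName
	}
	return Process{
		ServiceName: serviceName,
		Tags:        tags,
		Tag:         tagsMap,
	}
}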

At this point I don’t know what is wrong with the messages that cause the panic. We already skip messages that cannot be deserialized. I have a fix for the null Process that you can try, but if it’s random corruption, it won’t help. Being isolated to the Process field does not sound random to me.

I put up a PR to sanitize spans with an empty service name or a null process. This should protect the pipeline from malformed spans submitted externally. It will not protect against malformed spans written directly to Kafka; we could do that too, but it would change the expectations of the deployment by making Kafka a public API into the ingester, which would be a new guarantee and would require the ingester to run preprocessing logic similar to the collector’s.
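
For readers following along, a hedged sketch of what such a sanitizer could look like (the placeholder string and function name are illustrative, not necessarily what the PR uses):

package sanitizer // hypothetical placement in the collector's sanitizer chain

import "github.com/jaegertracing/jaeger/model"

const serviceNamePlaceholder = "empty-service-name"

// sanitizeEmptyServiceName replaces a nil Process or an empty service name with a
// placeholder so downstream writers never dereference a nil Process.
func sanitizeEmptyServiceName(span *model.Span) *model.Span {
	if span.Process == nil {
		span.Process = &model.Process{ServiceName: serviceNamePlaceholder}
	} else if span.Process.ServiceName == "" {
		span.Process.ServiceName = serviceNamePlaceholder
	}
	return span
}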

There are a few places that use the parent function (FromDomainEmbedProcess).

Couldn’t we just add a second return value of error and bubble it up to the function that is trying to write the span?

The goal wouldn’t be to pass the bad data along or even transform it. We just want to log that a bad span made it to the ingester and have the original caller of FromDomainEmbedProcess skip the write.
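
A hedged sketch of that signature change (hypothetical; the current FromDomainEmbedProcess returns only *Span, so every caller would need updating and the file would need an errors import):

// FromDomainEmbedProcess gains an error return so callers such as the
// Elasticsearch SpanWriter can log the bad span and skip the write instead of
// panicking deeper in convertProcess.
func (fd FromDomain) FromDomainEmbedProcess(span *model.Span) (*Span, error) {
	if span == nil || span.Process == nil {
		return nil, errors.New("span has no Process; refusing to convert")
	}
	return fd.convertSpanEmbedProcess(span), nil
}

WriteSpan would then check the error, log it, and return without indexing anything, which is exactly the behavior described above.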