go: log/slog: JSONHandler should deduplicate keys

What version of Go are you using (go version)?

8edcdddb23c6d3f786b465c43b49e8d9a0015082

Does this issue reproduce with the latest release?

yes

What did you do?

	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logger.LogAttrs(
		context.Background(),
		slog.LevelWarn,
		"fancy message",
		slog.Bool("msg", false),
		slog.String("my attr", "something"),
	)

What did you expect to see?

JSONHandler should be fixed to deduplicate attribute keys that collide with the built-in keys (time, level, msg, and source). I use “source” as a key in a lot of my code right now, so I’d run into this problem.

Either

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"my attr": "something"
}

Or

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"msg.msg": false,
	"my attr": "something"
}

Or

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"fields": {
		"msg": false,
		"my attr": "something"
	}
}

Or

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"msg#01": false,
	"my attr": "something"
}

What did you see instead?

Invalid JSON. It would break our log aggregation. Default/built-in handlers should never create invalid output.

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"msg": false,
	"my attr": "something"
}

@jba

EDIT: Hey all, I’ve created a logging handler middleware that deduplicates attribute keys before they go to the JSON logger. Please check it out: https://github.com/veqryn/slog-dedup

go get github.com/veqryn/slog-dedup

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 3
  • Comments: 48 (26 by maintainers)

Most upvoted comments

I think the current situation is fine; duplicate keys aren’t invalid, just less interoperable. Changing the key name to deduplicate it seems worse.

As for the original issue, I’ll note that option 3 was why WithGroup was added (though this doesn’t affect deduplication in general).
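
For example, a minimal sketch of that approach (the group name "fields" is arbitrary): attributes added after WithGroup are nested under that group, so user keys cannot collide with the built-in time/level/msg keys.

package main

import (
	"context"
	"log/slog"
	"os"
)

func main() {
	// Everything logged through this logger goes under the "fields" group,
	// so a user attribute named "msg" can no longer clash with the top-level key.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil)).WithGroup("fields")
	logger.LogAttrs(
		context.Background(),
		slog.LevelWarn,
		"fancy message",
		slog.Bool("msg", false),
		slog.String("my attr", "something"),
	)
	// {"time":"...","level":"WARN","msg":"fancy message","fields":{"msg":false,"my attr":"something"}}
}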

Hey all, I’ve created a logging handler middleware that deduplicates attribute keys before they go to the JSON logger. Please check it out: https://github.com/veqryn/slog-dedup (I’m also looking for constructive criticism, but please do it as an issue in that repo, not here.)

go get github.com/veqryn/slog-dedup

package main

import (
	"log/slog"
	"os"

	slogdedup "github.com/veqryn/slog-dedup"
)

func main() {
	// OverwriteHandler
	overwriter := slogdedup.NewOverwriteHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(overwriter))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup overwrite handler",
	    "duplicated": "two"
	  }
	*/
	slog.Info("this is the dedup overwrite handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)

	// IgnoreHandler
	ignorer := slogdedup.NewIgnoreHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(ignorer))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup ignore handler",
	    "duplicated": "zero"
	  }
	*/
	slog.Info("this is the dedup ignore handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)

	// IncrementHandler
	incrementer := slogdedup.NewIncrementHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(incrementer))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup incrementer handler",
	    "duplicated": "zero",
	    "duplicated#01": "one",
	    "duplicated#02": "two"
	  }
	*/
	slog.Info("this is the dedup incrementer handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)

	// AppendHandler
	appender := slogdedup.NewAppendHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(appender))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup appender handler",
	    "duplicated": [
	      "zero",
	      "one",
	      "two"
	    ]
	  }
	*/
	slog.Info("this is the dedup appender handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)
}

@jba, can I get this added to the slog wiki, please? https://github.com/golang/go/wiki/Resources-for-slog It would also be great if the stdlib JSON handler mentioned somewhere in its docs that it does not prevent duplicate field/property keys in the JSON output.

Whatever solution is chosen here, I would argue that it should work even if the user explicitly provides duplicate names:

logger.LogAttrs(
	context.Background(),
	slog.LevelWarn,
	"fancy message",
	slog.String("duplicate", "alpha"),
	slog.String("duplicate", "bravo"),
	slog.String("duplicate", "charlie"),
	slog.String("duplicate", "delta"),
	slog.String("duplicate", "echo"),
)

One possible solution is to use a “#%d” scheme similar to how testing.T.Run deduplicates conflicting test names. For example:

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"duplicate": "alpha",
	"duplicate#01": "bravo",
	"duplicate#02": "charlie",
	"duplicate#03": "delta",
	"duplicate#04": "echo",
}
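
A rough sketch of that renaming, applied to a slice of attributes before they reach the handler (dedupKeys is an illustrative helper, not a proposed API):

package main

import (
	"fmt"
	"log/slog"
)

// dedupKeys leaves the first occurrence of a key untouched and renames later
// occurrences with a "#%02d" suffix: duplicate, duplicate#01, duplicate#02, ...
func dedupKeys(attrs []slog.Attr) []slog.Attr {
	seen := make(map[string]int, len(attrs))
	out := make([]slog.Attr, 0, len(attrs))
	for _, a := range attrs {
		n := seen[a.Key]
		seen[a.Key] = n + 1
		if n > 0 {
			a.Key = fmt.Sprintf("%s#%02d", a.Key, n)
		}
		out = append(out, a)
	}
	return out
}

func main() {
	attrs := dedupKeys([]slog.Attr{
		slog.String("duplicate", "alpha"),
		slog.String("duplicate", "bravo"),
		slog.String("duplicate", "charlie"),
	})
	fmt.Println(attrs) // [duplicate=alpha duplicate#01=bravo duplicate#02=charlie]
}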

@dbofmmbt

This piece is not encouraging with regard to generating duplicated fields in the stdlib’s JSONHandler, IMO, since its behavior looks unspecified.

That’s exactly it, it’s unspecified by JSON:

The RFC qualifies that:

The names within an object SHOULD be unique. ...

with SHOULD more precisely defined:

3. SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.

Relative to other Go logging libraries: duplicate keys are possible with slog, zap, and zerolog, while logrus is last-seen-wins. This could be seen as driven by performance concerns: in contrast to the others’ (fast) data models, logrus uses (expensive) maps. But it’s not the case that logrus is doing something more correct according to the JSON specification; the RFC identifies last-seen-wins as only one possible implementation, with no single correct implementation prescribed.

Option 2 is pretty clearly the correct choice.

Options 2 and 4b discard information, which seems antithetical to the point of a logging library.

I’m not sure why anyone would expect a structured logging library to never discard information. Emitting multiple attributes with the same key in a structured log event is pretty clearly going to resolve to a single “winner” for that key, most likely the last one, but really any of them would be acceptably valid.

It would be very unexpected if repeated calls to Log with the same key "duplicate" were to produce output like {"duplicate": "alpha", "duplicate#01": "bravo", "duplicate#02": "charlie"}. Among many other reasons, this behavior would make it impossible to reliably assert that a log event emitted with some key K would actually be rendered with that key K in the ultimate output format.

Structured logging defines a concept of a log event which is conceptually above any specific encoding like JSON or logfmt or et cetera. That concept of a log event is well-defined by convention, and that definition doesn’t accommodate repeated keys.

@dbofmmbt “unpredictable” is still standardized. It’s the same as having, say, a 50-digit integer: it’s standard-compliant, yet not every parser can handle it, and even those that do may interpret it differently.

Also this: https://github.com/nst/JSONTestSuite

Like what? The only other security issue I can imagine is a DoS attack where you’re tricked into logging something massive, but that’s a distinctly different class of security issue from being able to hide activity or perform remote code execution.

Consider a type which renders in one way in some version X, and a different (and richer) way in a subsequent version Y. Consider the case where that change results in sensitive information from that type which was previously not logged becoming logged. This happens easily if the type grows a new sensitive field in a new version. Such changes are in practice never considered breaking in terms of versioning. A downstream application which logs a value of that type will have one behavior if it includes version X of the relevant package, and a different behavior when it includes version Y, and that version upgrade is in practice never going to be noticed, or tested, by the downstream application.

Printf-based logging does not generally have this problem, because it is not conventional to pass values directly as Printf log arguments.

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event. So, no: it’s definitely not the case that calling a structured logging method like LogAttrs can be assumed to produce any specific log output. Structured logging is, hopefully obviously, not equivalent to log.Printf, or fmt.Printf, or anything like that.

duplicate keys aren’t invalid, just less interoperable.

Technically you are correct. Under RFC 8259, section 4, duplicate names are essentially treated as undefined behavior.

In a possible future major version of “encoding/json”, we will likely make duplicate keys an explicit error (which is still compliant behavior under RFC 8259), as this lack of checking has been exploited, resulting in severe security vulnerabilities (e.g., CVE-2017-12635). It’s unlikely that duplicate names will have such drastic security implications in logging, but strict checking of duplicate names is likely the overall direction that JSON will gradually move towards.

As an aside, Tim Bray, the author of RFC 8259, actually recommends against naïve compliance with RFC 8259 and instead recommends that implementations adhere to RFC 7493, which makes strict decisions on issues that RFC 8259 leaves undefined.

In “encoding/json/v2”, we are aiming to reject duplicate names by default, but make it fairly easy to opt out of the strict check.

The decision to reject duplicate names is driven by various security exploits that have occurred over the years that depended on abusing the unspecified behavior that occurs with duplicate names (across a variety of JSON implementations in various languages). See CVE-2017-12635 as one such example. This particular security stance doesn’t benefit logging as much, since the content of a log message is rarely used to alter some critical piece of control flow. However, JSON is used for so much more than just logging (e.g., RPCs), that it is safer to be secure by default even if it is the more annoying default for logs analysis.

Also, even if various libraries allow duplicate names, the behavior of how duplicates are handled varies widely. It is not enough to ask only whether duplicates are permissible. In the CVE mentioned above, Erlang and JavaScript disagreed about the semantic meaning of a JSON object with duplicate names. In a similar way, permitting duplicate names means that you may end up with log-analysis results that differ (perhaps wildly) depending on which JSON implementation you happened to use.

Are there any other commonly-used JSON parsers that will fail to parse {"a":1,"a":2} by default?

JavaScript, the father of JSON, is definitely not one of them. Python’s stdlib json takes the second value as well.

Maybe it’s not a good strategy to be more Catholic than the Pope.
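
For what it’s worth, Go’s own encoding/json (v1) also accepts duplicates; decoding into a map keeps the last value (a quick check, not a survey of parsers):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var m map[string]int
	err := json.Unmarshal([]byte(`{"a":1,"a":2}`), &m)
	fmt.Println(m, err) // map[a:2] <nil> (duplicates accepted, last value wins)
}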

@carlmjohnson Not just the keys: the whole log line can be dropped as invalid JSON if the JSON parser considers an object with duplicate keys invalid.

Edit: I wanted to do it later, after app-level migration, but it seems that our stack processes duplicate keys fine. The last occurrence wins. Stack: Vector + Loki.

@prochac Using one of @veqryn’s dedup handlers is a good way to address your concerns.

@jba would you mind pointing out the section where the RFC allows duplicated fields? I only found this:

When the names within an object are not unique, the behavior of software that receives such an object is unpredictable. Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.

This piece is not encouraging with regard to generating duplicated fields in the stdlib’s JSONHandler, IMO, since its behavior looks unspecified.

@codeliger, I’ll let others speak to the utility of the built-in JSON handler. But duplicate keys are valid JSON, according to the RFC.

Thanks @AndrewHarrisSPU for that quick and accurate diagnosis. @dbofmmbt, there’s nothing we can do about that multiple-mutex issue, unfortunately. In fact the original log package has the same issue. I think the best solution is to wrap your io.Writer in one that acquires the mutex during Write, then install the wrapped io.Writer in multiple loggers. A well-behaved handler should call Write once per log event, and the built-in ones do, though that is not a hard requirement.
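
A minimal sketch of that wrapped writer (the syncWriter type and names here are illustrative, not part of the stdlib):

package main

import (
	"io"
	"log/slog"
	"os"
	"sync"
)

// syncWriter serializes Write calls so several handlers can safely share one
// underlying writer such as os.Stdout.
type syncWriter struct {
	mu sync.Mutex
	w  io.Writer
}

func (s *syncWriter) Write(p []byte) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.w.Write(p)
}

func main() {
	out := &syncWriter{w: os.Stdout}
	// Both handlers now share the same mutex via the wrapped writer.
	a := slog.New(slog.NewJSONHandler(out, nil))
	b := slog.New(slog.NewJSONHandler(out, nil))
	a.Info("from a")
	b.Info("from b")
}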

About deduping or overriding, I don’t think the default handlers will add it. It’s expensive, especially combined with pre-formatting. You can easily write your own wrapping Handler whose WithAttrs method keeps a map, so calls to Logger.With will override previous values. You’ll give up pre-formatting (since you can’t format anything in that map until Handle is called) but you may not need the extra speed.
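
A rough sketch of such a wrapping Handler (all names are illustrative, and groups are not deduplicated here, so treat it as a starting point rather than a complete solution):

package dedup

import (
	"context"
	"log/slog"
)

// Handler keeps attributes from WithAttrs in a map so that a later value for a
// key overrides an earlier one; formatting is deferred until Handle.
type Handler struct {
	inner slog.Handler
	attrs map[string]slog.Attr
}

func New(inner slog.Handler) *Handler {
	return &Handler{inner: inner, attrs: map[string]slog.Attr{}}
}

func (h *Handler) Enabled(ctx context.Context, level slog.Level) bool {
	return h.inner.Enabled(ctx, level)
}

func (h *Handler) WithAttrs(attrs []slog.Attr) slog.Handler {
	m := make(map[string]slog.Attr, len(h.attrs)+len(attrs))
	for k, a := range h.attrs {
		m[k] = a
	}
	for _, a := range attrs {
		m[a.Key] = a // the last value supplied for a key wins
	}
	return &Handler{inner: h.inner, attrs: m}
}

func (h *Handler) WithGroup(name string) slog.Handler {
	// Groups are passed straight through; a real implementation would track them.
	return &Handler{inner: h.inner.WithGroup(name), attrs: h.attrs}
}

func (h *Handler) Handle(ctx context.Context, r slog.Record) error {
	merged := make(map[string]slog.Attr, len(h.attrs)+r.NumAttrs())
	for k, a := range h.attrs {
		merged[k] = a
	}
	r.Attrs(func(a slog.Attr) bool {
		merged[a.Key] = a // attrs on the record override Logger.With attrs
		return true
	})
	out := slog.NewRecord(r.Time, r.Level, r.Message, r.PC)
	for _, a := range merged { // note: map iteration order is not stable
		out.AddAttrs(a)
	}
	return h.inner.Handle(ctx, out)
}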

@AndrewHarrisSPU indeed, I was creating multiple handlers in a middleware, so each request had its own mutex to os.Stdout. Thank you! It makes a lot of sense.

Scrambled text could result from racing writes to os.Stdout. IIRC this shouldn’t happen with the code you’ve listed(?) but just to toss out an idea quickly.

edit: NewJSONHandler puts a mutex on the writer, but that mutex is only checked by that handler and handlers derived from it. So a, b := slog.New(slog.NewJSONHandler(os.Stdout, nil)), slog.New(slog.NewJSONHandler(os.Stdout, nil)) can have races between a and b. Apologies if this is the wrong track, but might be worth checking.

As far as attribute overriding goes, there’s no deduplication in the provided handlers. I think @jba was writing a Handler guide, which would be a great next step; it’s definitely possible.

What if the fields were renamed to dupe.1, dupe.2, etc., but, to keep it from going unnoticed, a new field were added with a canonical name like “duplicated-fields”: [“dupe”]? Then if you see anything called duplicated-fields in your logs, you know to go figure out what’s misconfigured.

fwiw we once used a database driver that logged our database password in plain text when in debug mode, using regular old Printf (it was patched right away)

I don’t really see any security issues with structured logging that don’t already exist with regular logging.

And it’s not like this is log4j, where we’re executing the statements in the logs or trying to resolve remote XML or something. If the data is strings, the worst case I can think of is accidental logging of sensitive data.

Another argument against option 2 (which discards data) is security.

If untrusted input is able to affect what names are present, then that can be used as a mechanism to hide malicious activity, knowing that it will not be logged. IMO, it seems better for readers to possibly choke on the malformed messages than for the malicious activity not to be properly logged. A human investigator looking at the malformed message can see the duplicate-key shenanigans at play.

One option might be to rename duplicated keys to !BADKEY(duplicateKey) as a sort of combination of @dsnet’s options 3 and 4a while carrying some symmetry with the existing error-while-logging mechanism. Then again, there is also some asymmetry if it is only JSONHandler that checks duplicates.

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event

By technical definition, I agree. However, in practice, duplicates often occur due to human error. If I make a mistake, I would rather still have all duplicates logged (whether with option 1 or 3).

I’m not sure how to evaluate whether or not a particular mistake made by consumers of an API is common enough to warrant changing the API to accommodate that mistake, especially if that change would violate the intuitive expectations of the API, as is the case here.

It would be very unexpected if repeated calls to Log with the same key "duplicate" were to produce output like {"duplicate": "alpha", "duplicate#01": "bravo", "duplicate#02": "charlie"}.

Possibly, but there are no good answers when it comes to duplicate names.

It would be nice if the API made it statically impossible to have duplicates, but I don’t see how that can be done without taking in a map[string]any as the input, which would not be performant.

Given the possibility of runtime duplicates, your options are:

  1. Emit the object with all duplicate fields
  2. Emit the object, but only keep one of the duplicate fields (e.g., the first occurrence)
  3. Rename the duplicate fields to avoid a conflict
  4. Report an error, and either
     a. still do 1, 2, or 3, or
     b. simply reject the message and not log it.

With option 1, you risk the log parser later not being able to read the message entirely or accurately. Duplicate names result in undefined behavior where the parser might ignore duplicates, only use the first one, only use the last one, or use some merged form of all duplicates.

With option 2, you avoid any ambiguities when reading, but are subtly dropping information when writing.

With option 3, you avoid any ambiguities as to the meaning when reading, but are changing the meaning when writing. At least the meaning when reading is consistent across all parsers.

Option 4a is decent as it allows the caller to at least realize that they’re possibly holding the API wrong, but I suspect most people will not check the error of logging calls.

Of all the bad options, option 4a with either 1 or 3 is probably the best approach. Options 2 and 4b discard information, which seems antithetical to the point of a logging library.