pyyaml: Duplicate keys are not handled properly

The YAML 1.2 spec says that in a valid YAML file, mapping keys must be unique.

This is not the case in pyyaml, as can be seen by loading this sample file:

a:
  - b
  - c

a:
  q: a
  q: b

The correct result of loading this file is an error. Instead, pyyaml silently keeps the last value seen for each repeated key, producing this dict: {'a': {'q': 'b'}}.
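The behavior is easy to reproduce directly (assuming PyYAML is installed; `yaml.safe_load` is the standard safe entry point):

```python
import yaml

# Both top-level "a" keys and both nested "q" keys collide. PyYAML
# raises no error; it keeps the last value seen for each repeated key,
# so the earlier list under "a" and the first "q" are silently lost.
doc = """\
a:
  - b
  - c

a:
  q: a
  q: b
"""
print(yaml.safe_load(doc))
```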

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 42
  • Comments: 20 (7 by maintainers)

Most upvoted comments

PyYAML should reject duplicate keys per the YAML spec. These are silent errors in users’ code. Do what ruamel.yaml does, and add a yaml.allow_duplicate_keys = True property for those who want to opt in.

The YAML 1.1 spec also says that keys should be unique: https://yaml.org/spec/1.1/#id861060

The content of a mapping node is an unordered set of key: value node pairs, with the restriction that each of the keys is unique

So I concatenate the general parameter file with each individual parameter file, which thus overwrites the general parameter file when there is a duplicate entry.

@parrenin I think this is misguided. Since this violates the spec, you don’t know whether the original or the overriding value will be active without looking at the implementation, which could change at any time, since it is not part of the spec. Simply reading the “base” and “overriding” settings as two separate files, and combining them as you see fit, would be more correct.

I just wanted to comment that in my case, the overwrite is a feature and not a bug. In my model, I have several “sites”, each of them with a YAML parameter file. And I also have a parameter file which applies to all sites. So I concatenate the general parameter file with each individual parameter file, which thus overwrites the general parameter file when there is a duplicate entry. So it would be helpful to at least keep an option for the overwrite feature.

@parrenin

So I concatenate the general parameter file with each individual parameter file, which thus overwrites the general parameter file when there is a duplicate entry

I think this sounds like a use case for merge keys, actually.
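Sketched with merge keys (the file layout and key names here are made up for illustration), the shared settings become an anchored mapping that each site pulls in with `<<`, so nothing relies on duplicate-key overwriting:

```yaml
defaults: &defaults       # shared settings, defined once and anchored
  timeout: 30
  retries: 3

site_a:
  <<: *defaults           # merge in the shared settings
  timeout: 60             # then override per-site values explicitly
```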

The workaround that anz-rfc posted works well enough for me, though a more canonical solution would be preferable.

some discussion of workarounds here: https://gist.github.com/pypt/94d747fe5180851196eb
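In the spirit of the workarounds collected in that gist, a loader subclass can reject duplicates before construction (the class name is illustrative, and keys are assumed hashable):

```python
import yaml

class UniqueKeyLoader(yaml.SafeLoader):
    """SafeLoader variant that raises on duplicate mapping keys."""

    def construct_mapping(self, node, deep=False):
        seen = set()
        for key_node, _value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in seen:
                raise yaml.constructor.ConstructorError(
                    "while constructing a mapping", node.start_mark,
                    f"found duplicate key {key!r}", key_node.start_mark)
            seen.add(key)
        return super().construct_mapping(node, deep)

print(yaml.load("a: 1\nb: 2", Loader=UniqueKeyLoader))  # loads normally
try:
    yaml.load("a: 1\na: 2", Loader=UniqueKeyLoader)
except yaml.constructor.ConstructorError as exc:
    print("rejected:", exc.problem)
```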

@wimglenn the correct way to use multiple merge keys is actually:

- k: v
  extra:          # merged vars from blob0 + blob1
    <<: [ *splat0, *splat1 ]

That way duplicate merge keys can be avoided.

Other than that, the spec is actually not very specific about equality of nodes. For example, 0xa and 10 are different before resolving, but equal after.
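This distinction is visible in PyYAML itself, which uses the YAML 1.1 resolvers:

```python
import yaml

# "0xa" and "10" are distinct scalars before tag resolution, but the
# 1.1 int resolver turns "0xa" into the int 10, so after resolution the
# two keys are equal and the loaded dict ends up with a single key.
print(yaml.safe_load("0xa: hex\n10: dec\n"))
```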

One problem is that this logic would break the functionality of merge keys: https://yaml.org/type/merge.html See def flatten_mapping.

To fix this, we would need a separate handling of the merge keys and the actual keys in the mapping.

I don’t know which tests were failing (haven’t tested your patch yet), but I guess the ones with merge keys were among them.
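A uniqueness check therefore has to exempt merge keys, which flatten_mapping consumes before the mapping is constructed. A minimal sketch of that separation (the function name and the (tag, key) pair representation are illustrative, not PyYAML API):

```python
MERGE_TAG = "tag:yaml.org,2002:merge"  # the tag flatten_mapping looks for
STR_TAG = "tag:yaml.org,2002:str"

def check_unique_keys(pairs):
    """pairs: (tag, key) tuples in document order for one mapping node.
    Merge keys are handled by flatten_mapping, so only the actual keys
    are checked for uniqueness."""
    seen = set()
    for tag, key in pairs:
        if tag == MERGE_TAG:
            continue  # leave merge keys to the merge-key machinery
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen.add(key)

# A merge key alongside an ordinary key passes the check.
check_unique_keys([(MERGE_TAG, "<<"), (STR_TAG, "a")])
# Two ordinary "a" keys do not.
try:
    check_unique_keys([(STR_TAG, "a"), (STR_TAG, "a")])
except ValueError as exc:
    print(exc)
```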

This seems to duplicate #41.