pyyaml: Why does pyyaml disallow trailing spaces in block scalars?

We do not permit trailing spaces for block scalars.

Block style (|, >) is overruled with quoted style (") if the string contains trailing spaces. I don’t see any reason for this, and the fact that I can’t force a certain style (even if it means losing information) has caused me a lot of grief.
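A minimal sketch of the behaviour described above, assuming a recent PyYAML release. The sample strings are made up for illustration; `default_style="|"` is simply one way to request literal block style.

```python
import yaml

clean = "line 1\nline 2\n"
trailing = "line 1   \nline 2\n"  # first line ends with three spaces

# default_style="|" asks the emitter for literal block style on every scalar.
clean_out = yaml.dump(clean, default_style="|")
trailing_out = yaml.dump(trailing, default_style="|")

print(clean_out)     # emitted as a literal block scalar
print(trailing_out)  # the requested style is overruled by a double-quoted scalar
```

The double-quoted form still loads back to the original string, so no data is lost; only the requested style is ignored.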

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 13
  • Comments: 22 (8 by maintainers)

Most upvoted comments

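It works for me:

import yaml

def yaml_multiline_string_pipe(dumper, data):
    # Strip trailing whitespace from every line so the emitter accepts block style.
    # This discards the trailing spaces, so the value does not round-trip.
    text_list = [line.rstrip() for line in data.splitlines()]
    fixed_data = "\n".join(text_list)
    # Use literal block style (|) only for multi-line strings.
    if len(text_list) > 1:
        return dumper.represent_scalar('tag:yaml.org,2002:str', fixed_data, style="|")
    return dumper.represent_scalar('tag:yaml.org,2002:str', fixed_data)

# Registers the representer for all str values on the default Dumper.
yaml.add_representer(str, yaml_multiline_string_pipe)
print(yaml.dump({"multiline":"First line         \r\nSecond line\r\nThird line"}, allow_unicode=True))

Output:

multiline: |-
  First line
  Second line
  Third line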

Thank you @perlpunk for these examples, they perfectly illustrate what I mean, and thank you both for looking into this.

My use case was that I had a large number of dictionaries that I wanted to store as yaml records, in | block style. Some of these happened to have trailing spaces and were output wrongly, as a double-quoted scalar. This was frustrating because:

  1. I had to reverse-engineer why this was happening, especially since I couldn’t find any reason in the yaml spec why my strings shouldn’t be representable in | block style. This really boggles the mind: pyyaml is needlessly deviating from the yaml spec. So this is a bug that should be fixed.
  2. I was explicitly telling the emitter that I wanted | block style. The way every other Python function that I know of works is that the option I set is obeyed, possibly leading to warnings and/or errors, and that the consequences (like losing information in edge cases) are for me to deal with. This is programming, I should be able to control things. For pyyaml to override my option due to style considerations really is not acceptable.

My ‘solution’ was to strip trailing spaces before emitting so I would get | block style across the board, so ironically, this led me to lose information I wouldn’t have lost otherwise.

@ingydotnet, I was trying to give examples, to every reader of this issue, where I can understand the wish for block style to be preserved. I was not writing this to imply that you don’t know how it works.

I think we’re all agreed this is an old implementation issue, not a spec issue. I’m willing to spend some time trying to make it work in the next release, but as @perlpunk mentioned earlier, I’d want to integrate the more comprehensive YAML test suite first to have some confidence we didn’t break anything in the process.

I’ve added this issue to the PyYAML 6.1 planning project.

Re: the yaml_multiline_string_pipe workaround above ("It works for me"):

This is not a solution as it does not preserve the spaces. You cannot round-trip this!

Identifying that this was the issue I was facing just killed a few hours for me.

At the very least, can pyyaml please throw a warning that says “cannot use style | on content with trailing spaces”?
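Until something like that exists in pyyaml itself, a user-side sketch could look like the following. `WarningDumper`, `literal_str_representer` and `has_trailing_space` are hypothetical names, not part of PyYAML, and the check only looks for lines ending in a space, which is an assumption about what triggers the fallback.

```python
import warnings
import yaml

def has_trailing_space(text):
    # True if any line of the string ends with a space character.
    return any(line.endswith(" ") for line in text.split("\n"))

def literal_str_representer(dumper, data):
    # Request literal block style (|) for multi-line strings only.
    if "\n" in data:
        if has_trailing_space(data):
            warnings.warn(
                "cannot use style | on content with trailing spaces; "
                "the emitter will fall back to a double-quoted scalar"
            )
        return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
    return dumper.represent_scalar("tag:yaml.org,2002:str", data)

class WarningDumper(yaml.SafeDumper):
    """Hypothetical dumper subclass, so the global default Dumper stays untouched."""

WarningDumper.add_representer(str, literal_str_representer)

print(yaml.dump({"k": "a  \nb\n"}, Dumper=WarningDumper))
```

This does not change what gets emitted; it only makes the style override visible instead of silent.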

@ingydotnet I agree loss of information is not an option.

For literal | block style it’s easy to imagine use cases IMHO. If you have text data in a specific format that requires trailing spaces, you still might want to dump it as a block scalar for readability.

# _ == trailing spaces
some text format: |
    line 1___
    line 2
    line 3
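To show that this is fine on the loading side, here is a small sketch that parses such a document with real trailing spaces (three spaces after "line 1" instead of the underscore placeholders). The variable names are just for illustration.

```python
import yaml

# Same document as above, with actual spaces in place of the underscores.
doc = (
    "some text format: |\n"
    "    line 1   \n"
    "    line 2\n"
    "    line 3\n"
)

loaded = yaml.safe_load(doc)
print(repr(loaded["some text format"]))
```

The loader keeps the trailing spaces, so the restriction only exists on the emitting side.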

Folded style is used when you have long lines. Imagine you have a long input line, with words separated by varying numbers of spaces.

# _ == trailing spaces
foo: >
  xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx____
  xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx____
  xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx____
  xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx____

The emitter now tries to break this up into several lines to get below a limit of characters per line.

I think there are only two possibilities to do that, double quoted and folded block. libyaml, pyyaml and ruamel all emit this as a double quoted scalar, overruling the requested block scalar style:

foo: "xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx     xxxxxxxxxx    xxxxxxxxxx
  \   xxxxxxxxxx    xxxxxxxxxx     xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx
  \    xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx    xxxxxxxxxx    \n"

Which I think is not very readable. The only advantage is that there are no trailing spaces.

I’m another person who spent a day trying to work out why my long strings weren’t being styled as block literals -.-’

I agree that trailing spaces shouldn’t be a reason to disallow block scalars. Can’t say what changes are needed to allow this, though. If we could integrate https://github.com/yaml/yaml-test-suite and check that parsing the output again returns the same parsing events we could make sure that we don’t break anything.
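A rough sketch of what such an event-level check could look like for a single document, using `yaml.parse` and `yaml.emit`. The helper names `event_signature` and `reemit` are made up for this example, and comparing only event types and scalar values (ignoring style, tags and anchors) is a simplifying assumption; a real test-suite integration would compare more.

```python
import yaml

def event_signature(text):
    """Reduce a YAML stream to (event type, scalar value) pairs, ignoring style."""
    return [
        (type(event).__name__, getattr(event, "value", None))
        for event in yaml.parse(text)
    ]

def reemit(text):
    """Parse text into events and feed them straight back to the emitter."""
    return yaml.emit(yaml.parse(text))

source = "foo: |\n  line 1   \n  line 2\n"  # literal block with trailing spaces
output = reemit(source)
same = event_signature(source) == event_signature(output)

print(output)
print("events preserved:", same)
```

If the emitter were changed to keep block style for such scalars, a check like this should still report the same events for input and output.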