pulldown-cmark: Text events in code block start with newlines

As can be seen in issue #457 , if you parse the following code block

```
test

test
```

you get

Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test"))
Text(Borrowed("\n"))
Text(Borrowed("\ntest"))
Text(Borrowed("\n"))
End(CodeBlock(Fenced(Borrowed(""))))

The strange behaviour here is that the lines start with \n, but don’t end with \n. I think this is highly unusual and makes the strings harder to work with than necessary. Desired behaviour would be:

Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test\n"))
Text(Borrowed("\n"))
Text(Borrowed("test\n"))
End(CodeBlock(Fenced(Borrowed(""))))

This behaviour would make much more sense (it is also more natural because it reflects the actual lines seen in the code block) and provides enhanced compatibility to other libraries like syntect which usually parse code on a line-by-line basis, where a line is defined as a string terminated by \n.

I am currently running into issues with this and the only remedy seems to be string slicing and copying, which costs performance.

About this issue

Original URL
State: closed
Created 4 years ago
Comments: 20 (3 by maintainers)

Most upvoted comments

The consecutive text events issue has been fixed in #686. The normalization issue is open and it could be handled in the future but it is not guaranteed.

Martin1887 on Jul 1, 2023

The TextMergeStream seems a very useful utility that could be included in the crate, and it also serves as an opportunity to emphasize in docs that usually text events are fired separately due to performance and algorithmic reasons.

If Raph approves it, I will include it in the next release.

Thanks!

Martin1887 on May 21, 2023

There’s no Rust compiler shenanigans going with the newlines. The language has no concept of line endings in strings. It’s all just codepoints.

I beg to differ: <images>

Thanks for that, I honestly did not know the compiler normalized line-endings.

Okay, I now understand that if you normalize line endings to \n and strip away the \r, and you always want to use Borrowed and not do any additional copies of the strings, you are forced to emit a new Text event whenever \r comes up because you have to jump over the \r.

Correct!

I think @BenjaminRi brings up a good point – there needs to be clarification on what a Text event is, or clients will need to handle continuous text events.

If by continuous text events you mean consecutive text events, then yes: pulldown does not guarantee that all text is combined into a single event, even when it can be. We try to do it sometimes as an optimization, but it cannot be relied on. It would be good to clarify this indeed.

Considering how we don’t emit multiple Text events for non-normalized strings, I suggest we stay consistent and don’t attempt to normalize strings. The stdlib already provides tools for correctly handling splitting strings in a platform-independent way, so I’m not convinced that this library should try and do so on the user’s behalf.

These are solid arguments. It would benefit throughput as well, as reducing the number of events we emit is one of the most effective ways to speed up the parser (and renderer!). We need to exercise some caution here though. This would be a breaking change. Clients may implicitly or even unknowingly rely on this behavior. I am not against the change, but we would need consensus that this would indeed benefit the majority of users.

edit: While it would be nice to be able guarantee that continuous blocks of text are indeed emitted as a single event, it seems to not be optimal from a performance standpoint. By the nature of the pull based parsing, we are only emitting events as we walk the tree. Consider the following markdown:

This is text. <notquitehtml This is more text.

The first pass sees the opening bracket and marks it MaybeHtml. When we do the second pass, we can only emit Text("This is text. ") since the next bit could be html. But it is not since it lacks a closing bracket, and the bit <notquitehtml ends up being more text. So even though This is text. <notquitehtml This is more text. could have been emitted as a single text event, we didn’t know that a priori. There are ways to work around this, but in my experiments they ended up being slower and more complex than just emitting separate events.

marcusklaas on Dec 6, 2020

I think @BenjaminRi brings up a good point – there needs to be clarification on what a Text event is, or clients will need to handle continuous text events.

Considering how we don’t emit multiple Text events for non-normalized strings, I suggest we stay consistent and don’t attempt to normalize strings. The stdlib already provides tools for correctly handling splitting strings in a platform-independent way, so I’m not convinced that this library should try and do so on the user’s behalf. Since I’m advocating for this change as well, I don’t mind making a PR for this if you’d like, @marcusklaas.

edward-shen on Dec 6, 2020