pulldown-cmark: Text events in code block start with newlines
As can be seen in issue #457 , if you parse the following code block
```
test
test
```
you get
Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test"))
Text(Borrowed("\n"))
Text(Borrowed("\ntest"))
Text(Borrowed("\n"))
End(CodeBlock(Fenced(Borrowed(""))))
The strange behaviour here is that the lines start with \n, but don’t end with \n. I think this is highly unusual and makes the strings harder to work with than necessary. Desired behaviour would be:
Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test\n"))
Text(Borrowed("\n"))
Text(Borrowed("test\n"))
End(CodeBlock(Fenced(Borrowed(""))))
This behaviour would make much more sense (it is also more natural because it reflects the actual lines seen in the code block) and provides enhanced compatibility to other libraries like syntect which usually parse code on a line-by-line basis, where a line is defined as a string terminated by \n.
I am currently running into issues with this and the only remedy seems to be string slicing and copying, which costs performance.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 20 (3 by maintainers)
The consecutive text events issue has been fixed in #686. The normalization issue is open and it could be handled in the future but it is not guaranteed.
The
TextMergeStreamseems a very useful utility that could be included in the crate, and it also serves as an opportunity to emphasize in docs that usually text events are fired separately due to performance and algorithmic reasons.If Raph approves it, I will include it in the next release.
Thanks!
Thanks for that, I honestly did not know the compiler normalized line-endings.
Correct!
If by continuous text events you mean consecutive text events, then yes: pulldown does not guarantee that all text is combined into a single event, even when it can be. We try to do it sometimes as an optimization, but it cannot be relied on. It would be good to clarify this indeed.
These are solid arguments. It would benefit throughput as well, as reducing the number of events we emit is one of the most effective ways to speed up the parser (and renderer!). We need to exercise some caution here though. This would be a breaking change. Clients may implicitly or even unknowingly rely on this behavior. I am not against the change, but we would need consensus that this would indeed benefit the majority of users.
edit: While it would be nice to be able guarantee that continuous blocks of text are indeed emitted as a single event, it seems to not be optimal from a performance standpoint. By the nature of the pull based parsing, we are only emitting events as we walk the tree. Consider the following markdown:
The first pass sees the opening bracket and marks it
MaybeHtml. When we do the second pass, we can only emitText("This is text. ")since the next bit could be html. But it is not since it lacks a closing bracket, and the bit<notquitehtmlends up being more text. So even thoughThis is text. <notquitehtml This is more text.could have been emitted as a single text event, we didn’t know that a priori. There are ways to work around this, but in my experiments they ended up being slower and more complex than just emitting separate events.I think @BenjaminRi brings up a good point – there needs to be clarification on what a Text event is, or clients will need to handle continuous text events.
Considering how we don’t emit multiple Text events for non-normalized strings, I suggest we stay consistent and don’t attempt to normalize strings. The stdlib already provides tools for correctly handling splitting strings in a platform-independent way, so I’m not convinced that this library should try and do so on the user’s behalf. Since I’m advocating for this change as well, I don’t mind making a PR for this if you’d like, @marcusklaas.