gitea: Web interface does not render composed unicode characters correctly

Description

All composed UTF-8 characters, like s̄_b, ṡ_b, etc., are not rendered correctly in Gitea.

Screenshots

See how s̄_b is being rendered. It even warns that there are hidden Unicode characters in this line.

[Image: Screenshot 2022-06-07 at 14.54.35]

Gitea Version

1.16.8

Can you reproduce the bug on the Gitea demo site?

Yes

Operating System

macOS

Browser Version

Safari 15.4

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (13 by maintainers)

Most upvoted comments

So it’s definitely a Safari bug: https://jsfiddle.net/9j0zc4su/

But should <span>&#x1F44B;</span>&#x1F3FD; really be merged? In Safari, <span>&#x1F44B;&#x1F3FD;</span> works.

I think composition should not be influenced by tag boundaries and other browsers seem to agree as well.

GitHub doesn’t report the warning.

Github is wrong to not report that there is something odd. ë and ë are not the same characters.
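
To make the difference concrete, here is a minimal sketch. It assumes the two characters above are the precomposed form (U+00EB) and the decomposed form (e followed by U+0308); the helper name `hasCombiningMark` is made up for this illustration and is not part of Gitea.

```go
package main

import (
	"fmt"
	"unicode"
)

// hasCombiningMark reports whether s contains any rune in the Unicode
// category Mn (nonspacing mark), such as U+0308 COMBINING DIAERESIS.
// Hypothetical helper, for illustration only.
func hasCombiningMark(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Mn, r) {
			return true
		}
	}
	return false
}

func main() {
	precomposed := "\u00eb" // one code point: LATIN SMALL LETTER E WITH DIAERESIS
	decomposed := "e\u0308" // "e" followed by COMBINING DIAERESIS

	fmt.Println(precomposed == decomposed)                                   // false
	fmt.Println(len([]rune(precomposed)), len([]rune(decomposed)))           // 1 2
	fmt.Println(hasCombiningMark(precomposed), hasCombiningMark(decomposed)) // false true
}
```

The two strings usually look identical on screen but compare unequal byte for byte, which is why flagging the decomposed form can be useful.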


The problem stems from Safari’s rendering. The code looks like:

<span class="n">s<span class="escaped-code-point" data-escaped="[U+0304]"><span class="char">̄</span></span>_b</span>

The <span class="escaped-code-point" data-escaped="[U+0304]"><span class="char"> wrapper is entirely inline and zero width, so the overbar (U+0304) should still be rendered over the s. Safari renders this incorrectly.


Now, how could we fix this? Without access to a Safari browser I’m not sure. Is there any way to get Safari to just do the right thing with the spacing of the combining character here?


We do this splitting because it makes writing the escaping/unescaping extremely easy for combining marks:

https://github.com/go-gitea/gitea/blob/ac88f21ecc5612befe51f7ab6ffcb76c681daba5/modules/charset/escape.go#L166-L179
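
Roughly, the per-rune approach looks like the sketch below. This is not the code at the link above, just a simplified illustration: `escapeMarks` is a hypothetical name, it only treats category Mn as needing an escape, and it assumes the input is already HTML-safe.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// escapeMarks wraps every nonspacing mark (category Mn) in its own span.
// Each rune is handled on its own, so no state is carried between runes.
// Hypothetical sketch: other HTML escaping is deliberately left out.
func escapeMarks(s string) string {
	var sb strings.Builder
	for _, r := range s {
		if unicode.Is(unicode.Mn, r) {
			fmt.Fprintf(&sb,
				`<span class="escaped-code-point" data-escaped="[U+%04X]"><span class="char">%c</span></span>`,
				r, r)
			continue
		}
		sb.WriteRune(r)
	}
	return sb.String()
}

func main() {
	// "s" + U+0304 COMBINING MACRON + "_b", as in the report.
	fmt.Println(escapeMarks("s\u0304_b"))
}
```

Because the decision depends only on the current rune, unescaping is just as simple, but the base character ends up outside the span that holds its mark.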

In order not to do it, we’d have to coalesce the code points that combine and emit a single escape block for them, e.g.:

<span class="n"><span class="escaped-code-point" data-escaped="s[U+0304]"><span class="char">s̄</span></span>_b</span>

That coalescing would require the escaper to understand the rules for rendering combining characters and to keep state for handling them.
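
A minimal sketch of what such a coalescing escaper might look like. The function name `escapeCoalesced` is hypothetical, and the grouping rule (any rune plus all category Mn marks that directly follow it) is an arbitrary simplification rather than a full grapheme cluster implementation. It also assumes the input is already HTML-safe.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

func isMark(r rune) bool { return unicode.Is(unicode.Mn, r) }

// escapeCoalesced groups a rune with the nonspacing marks that follow it
// and emits one escape span for the whole group. Runes with no marks
// attached are copied through unchanged. Hypothetical sketch only.
func escapeCoalesced(s string) string {
	runes := []rune(s)
	var sb strings.Builder
	for i := 0; i < len(runes); {
		// Extend j past every mark that directly follows runes[i].
		j := i + 1
		for j < len(runes) && isMark(runes[j]) {
			j++
		}
		if j == i+1 && !isMark(runes[i]) {
			sb.WriteRune(runes[i]) // ordinary rune, nothing to escape
			i = j
			continue
		}
		// Label: the base rune as-is, each mark as [U+XXXX].
		var label strings.Builder
		for k, r := range runes[i:j] {
			if k == 0 && !isMark(r) {
				label.WriteRune(r)
			} else {
				fmt.Fprintf(&label, "[U+%04X]", r)
			}
		}
		fmt.Fprintf(&sb,
			`<span class="escaped-code-point" data-escaped="%s"><span class="char">%s</span></span>`,
			label.String(), string(runes[i:j]))
		i = j
	}
	return sb.String()
}

func main() {
	fmt.Println(escapeCoalesced("s\u0304_b"))
}
```

Even this small version needs lookahead across runes, and it silently makes choices (a leading space is treated as a base, stacked marks are all merged) that the real escaper would have to decide deliberately.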

For example, if we had: ̄ (that is, U+0020 followed by U+0304), should that be coalesced as <space> + [U+0304] or kept as [U+0304]? How about multiple combining characters, e.g. ē̂ (e + U+0304 + U+0302)? And how about when/if we get round to properly dealing with ambiguous characters like с and С (which are not the same as c and C)? If these have combining characters, do we coalesce or escape the ambiguous character separately?
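
For the ambiguous-character case, a small sketch of how the lookalikes differ at the code point level. `scriptOf` is a hypothetical helper that only distinguishes Latin from Cyrillic; real ambiguity detection would need a proper confusables table, which this does not attempt.

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptOf returns a coarse script name for r. Hypothetical helper that
// only knows about Latin and Cyrillic, for illustration.
func scriptOf(r rune) string {
	switch {
	case unicode.Is(unicode.Latin, r):
		return "Latin"
	case unicode.Is(unicode.Cyrillic, r):
		return "Cyrillic"
	default:
		return "other"
	}
}

func main() {
	// Latin c/C next to their Cyrillic lookalikes U+0441 and U+0421.
	for _, r := range []rune{'c', '\u0441', 'C', '\u0421'} {
		fmt.Printf("U+%04X %s\n", r, scriptOf(r))
	}
}
```

A coalescing escaper would have to decide whether a Cyrillic base followed by a combining mark is one escaped unit or two, which is exactly the open question above.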


If it is possible to get Safari to just render the combining character in the right place, that would be far more helpful.