terminal: Incorrect display of characters written on top of the wide emojis
Environment
Windows build number: 10.0.18363.0
Windows Terminal version: 0.8.10091.0
Steps to reproduce
In PowerShell, type two wide emojis, move the cursor back (e.g. with ESC[nD) so that it sits on top of or between the emojis, and then type a narrow character over them, or copy and paste the following:
"Place X at the end: 👨👨" + "X"
"Place X one left: 👨👨" + [char]0x1b + "[1D" + "X"
"Place X two left: 👨👨" + [char]0x1b + "[2D" + "X"
"Place X three left: 👨👨" + [char]0x1b + "[3D" + "X"
"Place X four left: 👨👨" + [char]0x1b + "[4D" + "X"
"Place X five left: 👨👨" + [char]0x1b + "[5D" + "X"
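For anyone reproducing this outside PowerShell, here is a minimal Python sketch that builds the same test strings. The helper names `cursor_back` and `repro_lines` are made up for illustration; the only escape sequence used is the one from the steps above, ESC[nD (cursor back n columns).

```python
ESC = "\x1b"


def cursor_back(n: int) -> str:
    """Return ESC[nD, which moves the cursor n columns to the left."""
    return f"{ESC}[{n}D"


def repro_lines(emoji: str = "👨", max_back: int = 5) -> list[str]:
    """Build the test lines: two wide emojis, then an X placed 0..max_back columns back."""
    lines = [f"Place X at the end: {emoji}{emoji}X"]
    for n in range(1, max_back + 1):
        lines.append(f"Place X {n} left: {emoji}{emoji}{cursor_back(n)}X")
    return lines


if __name__ == "__main__":
    # Print each line so the terminal under test interprets the escape sequences.
    for line in repro_lines():
        print(line)
```

Running the script in the terminal under test prints the six lines; how each one renders depends on the terminal.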
Expected behavior
“X” is placed as expected:
Actual behavior
The letter “X” is displayed incorrectly, moreover, emojis are unexpectedly shifted:
About this issue
- State: closed
- Created 4 years ago
- Comments: 43 (7 by maintainers)
I’d like to understand this scenario a bit more.
Consider an application drawing a UI over top of a two-cell character.
Consider the case of an application wrapping a two-cell character at the edge of the screen.
At the right side of my screen, I see:
and on the left, I see
It does not seem trivial as a human to (mentally) reconstruct a symbol from an ideographic language split and moved to the other side of the screen. It’s a readability disaster, and no other application anywhere (even complicated word processors!) implements wrapping that breaks a character in half. I’m sure there’s a very good reason why.
I think we need to let UI libraries, ones that operate on pixel buffers instead of cellular text buffers, solve this problem for graphical UIs and not try to add single-cell occlusion to terminals to bring them closer to word processors.
I believe that the only correct way to handle a partial destruction of a double-width character is to remove the remaining half. There is no way for us to properly cut an emoji, or a CJK symbol that spans two cells, in half. Its meaning will be lost, and half of a character is not a unit that is representable in any encoding scheme or language.
RXVT-Unicode, which I believe set the standard for Unicode use in terminals, treats your reproduction case as follows:
It doesn’t support Emoji, but it does support a double-width glyph. Printing X over half of the double-width glyph destroys it.
The Windows Terminal currently has a bug where it does not destroy the double-width glyph. That, we should fix. I think it’s tracked elsewhere, though.
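To make the "destroy the remaining half" rule concrete, here is a minimal Python sketch of a single buffer row. It is not how Windows Terminal or RXVT-Unicode is implemented; the cell representation (a `(text, kind)` tuple with kinds `"single"`, `"lead"`, `"trail"`) and the function names are assumptions chosen for illustration.

```python
BLANK = (" ", "single")


def erase_overlapped_wide(row, col):
    """If col holds one half of a wide glyph, blank the other half."""
    _, kind = row[col]
    if kind == "lead" and col + 1 < len(row):
        row[col + 1] = BLANK
    elif kind == "trail" and col > 0:
        row[col - 1] = BLANK


def put_narrow(row, col, ch):
    """Write a one-cell character, destroying any wide glyph it lands on."""
    erase_overlapped_wide(row, col)
    row[col] = (ch, "single")


def put_wide(row, col, ch):
    """Write a two-cell character into col and col + 1."""
    if col + 1 >= len(row):
        raise ValueError("wide glyph does not fit in this row")
    erase_overlapped_wide(row, col)
    erase_overlapped_wide(row, col + 1)
    row[col] = (ch, "lead")
    row[col + 1] = ("", "trail")  # the trailing half carries no text of its own


def render(row):
    """Join the cell texts; a wide glyph contributes its text once."""
    return "".join(text for text, _ in row)


row = [BLANK] * 6
put_wide(row, 0, "漢")
put_wide(row, 2, "字")
put_narrow(row, 3, "X")  # lands on the trailing half of "字"
```

After the last call, the cell that held the leading half of "字" is blank, "X" occupies column 3, and "漢" is untouched.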
Discussion is moved to the Terminals Working Group\Specifications\Issues.
Character Segmentation and Scaling (Fractaling)
1. Abstract
This paper provides a general description of a solution to the problem of presenting and processing multi-sized (up to 4x4 cells) characters in a cell-based grid, where each cell represents one-to-one the visible part (fraction) of the character.
The following applications are related to multisize characters and character segmentation:
The solution makes it possible to manipulate and store individual fractions of a whole character in a single cell for display, to display multi-size characters in a cell-based grid, and even to split them vertically. It also solves the problem of displaying wide characters in terminals by letting the terminal, or the application running in it, decide how wide a character will be, rather than relying on external data sources for these values, which are subject to regular changes.
2. Solution
2.1 Definitions
According to Unicode® Standard Annex #29, “UNICODE TEXT SEGMENTATION”
This paper defines a character as a user-perceived character (or grapheme cluster).
2.2 Mathematical Presentation
To correctly display either a whole character of any size (up to 4x4 cells) or any selected character segment, only four numeric parameters Ps = Dx, Nx, Dy, Ny are required, each with a value from 1 to 4.
Parameters
- Dx: count of parts along the X-axis
- Nx: either the width of the whole character or the segment selector among the Dx available parts, from left to right along the X-axis
- Dy: count of parts along the Y-axis
- Ny: either the height of the whole character or the segment selector among the Dy available parts, from top to bottom along the Y-axis
Interpretation
There are several possible cases (for each axis respectively):
- D = 0
- N <= D: select part N of the character from the D available parts and use it as a single-cell character (along the corresponding axis).
- N > D AND D = ANY: the character occupies N cells.
2.3 Storing In Memory
Screen Buffer / Monospaced Text File
Each multisize character of size n x m larger than 1x1 is stored in the screen buffer (or monospaced text file) of size W x H as an n x m matrix of cells.
Example:
A 3x2 stretched character "A" is located at x=3, y=2 in the screen buffer (or monospaced text file).
Ps can be packed into one byte, so the screen buffer overhead is 1 byte per cell.
Also, there are only 256 variants for the Unicode modifier character, with values 0 - 255.
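The storage scheme above can be sketched in Python. This is an illustration under stated assumptions, not part of the proposal: the byte layout reuses the formula P = (Nx-1) + (Dx-1) * 4 + (Ny-1) * 16 + (Dy-1) * 64 given in section 3.2, coordinates are assumed to be 0-based with (x, y) as the top-left cell, n is assumed to be the column count and m the row count, and the names `pack_ps`, `unpack_ps`, and `store_multisize` are hypothetical.

```python
def pack_ps(dx, nx, dy, ny):
    """Pack four parameters, each in 1..4, into one byte (0..255)."""
    for value in (dx, nx, dy, ny):
        if not 1 <= value <= 4:
            raise ValueError("each parameter must be in 1..4")
    return (nx - 1) + (dx - 1) * 4 + (ny - 1) * 16 + (dy - 1) * 64


def unpack_ps(p):
    """Inverse of pack_ps: return (dx, nx, dy, ny)."""
    nx = p % 4 + 1
    dx = (p // 4) % 4 + 1
    ny = (p // 16) % 4 + 1
    dy = (p // 64) % 4 + 1
    return dx, nx, dy, ny


def store_multisize(buffer, x, y, ch, n, m):
    """Store ch as an n x m matrix of cells, one fraction per cell.

    Each cell keeps the character plus a packed Ps byte that selects
    fraction (col + 1) of n along X and (row + 1) of m along Y, so every
    stored cell satisfies N <= D.
    """
    for row in range(m):
        for col in range(n):
            buffer[y + row][x + col] = (ch, pack_ps(n, col + 1, m, row + 1))


W, H = 8, 5
# An ordinary 1x1 cell is assumed to carry Ps = (1, 1, 1, 1), which packs to 0.
screen = [[(" ", pack_ps(1, 1, 1, 1)) for _ in range(W)] for _ in range(H)]
store_multisize(screen, 3, 2, "A", 3, 2)
```

With this layout the six cells from (3, 2) to (5, 3) each hold "A" together with a different packed byte, and any single cell can be redrawn or overwritten independently of its neighbours.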
Characters with parameters N > D are not allowed to be stored in a cell-based grid. When such a character is to be printed to the grid, it must be segmented for each grid cell, and the parameters are recalculated for each filled cell.
2.4 Naming
2.4.1 VT-Sequence
Variants of the name for the VT-sequence
2.4.2 Unicode Standard
Name of the Unicode modifier letter
3. Usage
3.1 Unicode Standard
The latest Unicode Standard defines three types of variation sequences:
Only those three types of variation sequences are sanctioned for use by conformant implementations.
According to the Standardized variation sequences FAQ
According to Section 23.4, Variation Selectors, UTR #25
Variation Sequence
Placement in the Text
If such a modifier appears first in the input stream, the terminal should reflow the text, as it does on window resize.
3.2 VT-Sequence
XTerm Control Sequences, Functions using CSI
Assign the VT-sequence as a CSI/SGR command, because it defines the character rendition state and sets the appearance of the following characters.
- n1, n2, n3, n4 are from 0 to 4.
- n1 = Dx, n2 = Nx, n3 = Dy, n4 = Ny.
- P = (Nx-1) + (Dx-1) * 4 + (Ny-1) * 16 + (Dy-1) * 64, from 0 to 255.
- 110: missing numbers are treated as 1.
- 111: missing number is treated as 0.
- ESC[m (all attributes off) also resets the scaling mode.
- 110, 111 suggest... yours.
4. Expected Behavior
Regardless of the size (cwidth) a character has, this approach lets you put a wide character into a single cell if you want.
4.1 Unicode Standard
4.2 VT-Sequence
4.2.1 Printing
Output examples (VT sequence <SCALE;;;>)
With this technique it is also possible to print mathematical expressions and multi-level formulas (monospaced text documents with formulas, CJK, wide emoji, and so on are Unicode problems that extend beyond the terminal world).
Line Wrap
Side effects
4.2.2 Capturing
5. Applications
5.1 Cost of Initial Implementation
6. Existing Infrastructure Compatibility
7. Security Issues
7.1 Unicode Security Considerations
Unicode Technical Report #36
This section describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account.
For example, consider visual spoofing, where a similarity in visual appearance fools a user and causes him or her to take unsafe actions.
Emoji cannot be split in half. Can you point to an example of a terminal that handles emoji in the way your “expected behavior” image reports? Thanks!