terminal: C1 control characters detection breaks output (regression)
Windows Terminal version (or Windows build number)
1.9.1445.0
Other Software
No response
Steps to reproduce
Compile and run the following code:
#include <stdio.h>
#include <windows.h>

int main()
{
    // Every byte in the range 0x80-0x9F, interpreted as codepage 1252 text.
    const char data[] = "\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F";
    wchar_t buffer[sizeof(data)];

    // The last argument is the buffer size in wide characters, not in bytes.
    if (!MultiByteToWideChar(1252, MB_USEGLYPHCHARS, data, -1, buffer, sizeof(buffer) / sizeof(wchar_t)))
    {
        printf("%lu\n", GetLastError());
    }

    DWORD n;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), buffer, sizeof(buffer) / sizeof(wchar_t), &n, 0);
    printf("\n\n");

    // Report every byte that came out of the conversion with its value unchanged.
    for (int i = 0; i != sizeof(data) - 1; ++i)
    {
        if (buffer[i] == (unsigned char)data[i])
        {
            printf("%04X not converted\n", (unsigned char)data[i]);
        }
    }
}
Expected Behavior
€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ or something similar, depending on your output codepage
Actual Behavior
€‚ƒ„…†‡ˆ‰Š‹Œ - most of the characters are missing.
After f91b53d5fdc5b20387e297357668f5f14a795d4c the whole range 0x80 - 0x9F is considered control characters.
The comment above boldly claims that
“we do not need to worry about confusion whether a single byte, for example, \x9b in a single-byte stream represents a C1 CSI or some other glyph, because by the time we get here, everything is Unicode. Knowing whether a single-byte \x9b represents a single-character C1 CSI or some other glyph is handled by MultiByteToWideChar before we get here (if the stream was not already UTF-16). For instance, in CP_ACP, if a \x9b shows up, it will get converted to \x203a. So, if we get here, and have a \x009b, we know that it unambiguously represents a C1 CSI”
but that is simply not true. As the example code above demonstrates, \x81, \x8D, \x8F, \x90, and \x9D are not handled by MultiByteToWideChar (at least in codepage 1252, ANSI - Latin I, which is hopefully popular enough). So no, not everything is Unicode by the time we get here, and no, we do need to worry about such confusion and implement a proper check to avoid breaking existing applications.
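To make the ambiguity concrete without a Windows machine, here is a minimal portable sketch. It assumes the standard codepage 1252 assignments for 0x80-0x9F and models the pass-through of unassigned bytes that the repro above reports; it does not call MultiByteToWideChar. The names cp1252_byte_to_utf16, is_c1_control and find_ambiguous_bytes are hypothetical helpers invented for this illustration and are not part of the terminal codebase.

```c
#include <assert.h>
#include <stddef.h>

/* Codepage 1252 assignments for bytes 0x80..0x9F.
   0 marks a byte that has no character assigned in this codepage. */
static const unsigned short CP1252_HIGH[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Assumption: an unassigned byte passes through with the same numeric
   value, which is the behavior the repro in this report observes. */
unsigned short cp1252_byte_to_utf16(unsigned char b)
{
    if (b < 0x80 || b > 0x9F)
        return b;
    unsigned short mapped = CP1252_HIGH[b - 0x80];
    return mapped != 0 ? mapped : b;
}

/* True for code units in the C1 control range U+0080..U+009F. */
int is_c1_control(unsigned short ch)
{
    return ch >= 0x80 && ch <= 0x9F;
}

/* Collects the bytes in 0x80..0x9F that still look like C1 controls
   after conversion. Returns how many were stored in out. */
size_t find_ambiguous_bytes(unsigned char *out, size_t capacity)
{
    size_t count = 0;
    for (unsigned b = 0x80; b <= 0x9F; ++b) {
        if (is_c1_control(cp1252_byte_to_utf16((unsigned char)b))) {
            assert(count < capacity);
            out[count++] = (unsigned char)b;
        }
    }
    return count;
}
```

Called with a 32-byte output buffer, find_ambiguous_bytes stores exactly the five bytes listed above (0x81, 0x8D, 0x8F, 0x90, 0x9D). After conversion these are indistinguishable from real C1 controls, so a parser that sees only UTF-16 cannot tell the two cases apart.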
About this issue
- State: closed
- Created 3 years ago
- Comments: 25 (8 by maintainers)
Commits related to this issue
- Disable the acceptance of C1 control codes by default (#11690) There are some code pages with "unmapped" code points in the C1 range, which results in them being translated into Unicode C1 control c... — committed to microsoft/terminal by j4james 3 years ago
- Fix broken VT stuff (enable C1 control chars) Conhost/Terminal supported C1 control sequences for a while... but then that apparently caused some problems. So they disabled them by default, which cau... — committed to microsoft/DbgShell by jazzdelightsme 2 years ago
I’ll probably prepare this one for isolated ingestion into Windows so it can be released in a servicing update. 😄 Folks may eventually be more broadly upset that conhost is “acting weird” and “displaying corrupt text” and “stole my lunch.”
A few more thoughts:
dir, sit back and watch the world burn:
Don’t ask “but why?” - users can, so they will. And incorrect codepage conversions are still a thing, especially during processing of various metadata. There are lots and lots of weird file names in the wild.
Sanitising file names, everywhere, even in scenarios not related to outputting anything for the sake of the feature that is “almost never used”… 🤔 Supposedly sooner or later C1 will make it into the conhost and this is where the fun begins.