terminal: C1 control characters detection breaks output (regression)

Windows Terminal version (or Windows build number)

1.9.1445.0

Other Software

No response

Steps to reproduce

Compile and run the following code:

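#include <windows.h>
#include <stdio.h>

int main()
{
	// Every byte of the C1 range, 0x80 - 0x9F
	const char data[] = "\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F";
	wchar_t buffer[sizeof(data)];
	// The last argument is a count of wide characters, not bytes
	if (!MultiByteToWideChar(1252, MB_USEGLYPHCHARS, data, -1, buffer, sizeof(buffer) / sizeof(wchar_t)))
	{
		printf("%lu\n", GetLastError());
		return 1;
	}

	DWORD n;
	// Write the converted text without its terminating null
	WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), buffer, sizeof(buffer) / sizeof(wchar_t) - 1, &n, 0);

	printf("\n\n");

	// Report every byte that came back as the same code point, i.e. is still in the C1 range
	for (size_t i = 0; i != sizeof(data) - 1; ++i)
	{
		if (buffer[i] == (unsigned char)data[i])
		{
			printf("%04X not converted\n", (unsigned char)data[i]);
		}
	}
}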

Expected Behavior

€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ or something similar, depending on your output codepage

Actual Behavior

€‚ƒ„…†‡ˆ‰Š‹Œ - most of the characters are missing.

After commit f91b53d5fdc5b20387e297357668f5f14a795d4c the whole range 0x80 - 0x9F is treated as control characters.

The comment above boldly claims that

“we do not need to worry about confusion whether a single byte, for example, \x9b in a single-byte stream represents a C1 CSI or some other glyph, because by the time we get here, everything is Unicode. Knowing whether a single-byte \x9b represents a single-character C1 CSI or some other glyph is handled by MultiByteToWideChar before we get here (if the stream was not already UTF-16). For instance, in CP_ACP, if a \x9b shows up, it will get converted to \x203a. So, if we get here, and have a \x009b, we know that it unambiguously represents a C1 CSI”

But that is simply not true. As the example code above demonstrates, \x81, \x8D, \x8F, \x90, and \x9D are not handled by MultiByteToWideChar (at least in codepage 1252, ANSI - Latin I, which is hopefully popular enough). They come out as code points in the same 0x80 - 0x9F range. So no, not everything has been mapped out of the C1 range by the time we get here, and yes, we do need to worry about such confusion and implement a proper check to avoid breaking existing applications.
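To make the gap concrete without a Windows machine, here is a minimal portable sketch of the Windows-1252 mappings for 0x80 - 0x9F. The table follows the published codepage 1252 layout. The names `cp1252_decode`, `lands_in_c1` and `count_c1_passthrough` are hypothetical helpers invented for this illustration. Passing unassigned bytes through unchanged is an assumption that mirrors what the repro above reports, not a statement about how MultiByteToWideChar is implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Windows-1252 mappings for bytes 0x80..0x9F.
   0 marks a byte with no assigned glyph in the codepage. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: the code point a 1252 decoder would yield for one byte.
   Assumption: unassigned bytes pass through unchanged, as the repro observes. */
uint16_t cp1252_decode(unsigned char b)
{
    if (b < 0x80 || b > 0x9F) {
        return b; /* outside 0x80..0x9F, 1252 matches Latin-1 */
    }
    uint16_t mapped = cp1252_high[b - 0x80];
    return mapped ? mapped : b;
}

/* Nonzero if the decoded code point is still inside the C1 range. */
int lands_in_c1(unsigned char b)
{
    uint16_t cp = cp1252_decode(b);
    return cp >= 0x80 && cp <= 0x9F;
}

/* Counts how many bytes in 0x80..0x9F decode to a C1 code point. */
size_t count_c1_passthrough(void)
{
    size_t count = 0;
    for (unsigned b = 0x80; b <= 0x9F; ++b) {
        if (lands_in_c1((unsigned char)b)) {
            ++count;
        }
    }
    return count;
}
```

Under these assumptions `lands_in_c1` is true for exactly the five bytes listed above, while \x9b decodes to \x203a as the quoted comment says. Any parser that treats every code point in 0x80 - 0x9F as a control character will therefore swallow those five.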

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (8 by maintainers)

Most upvoted comments

I’ll probably prepare this one for isolated ingestion into Windows so it can be released in a servicing update. 😄 Folks may eventually be more broadly upset that conhost is “acting weird” and “displaying corrupt text” and “stole my lunch.”

A few more thoughts:

Except for SS2 and SS3 in EUC-JP text, and NEL in text transcoded from EBCDIC, the 8-bit forms of these codes are almost never used. CSI, DCS and OSC are used to control text terminals and terminal emulators, but almost always by using their 7-bit escape code representations. Their ISO/IEC 2022 compliant single-byte representations are invalid in UTF-8, and the UTF-8 encodings of their corresponding codepoints are two bytes long like their escape code forms (for instance, CSI at U+009B is encoded as the bytes 0xC2, 0x9B in UTF-8), so there is no advantage to using them rather than the equivalent two-byte escape sequence. When these codes appear in modern documents, web pages, e-mail messages, etc., they are usually intended to be printing characters at that position in a proprietary encoding such as Windows-1252 or Mac OS Roman that use the C1 codes to provide additional graphic characters.
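The byte counts in that quote are easy to check. Below is a small sketch of a two-byte UTF-8 encoder; `utf8_encode_small` is a hypothetical name for this illustration, and it deliberately handles only code points below U+0800. For CSI at U+009B it yields 0xC2 0x9B, the same length as the 7-bit form ESC [ (0x1B 0x5B).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a code point below U+0800 as UTF-8.
   Returns the number of bytes written (1 or 2), or 0 if cp is out of range. */
size_t utf8_encode_small(uint32_t cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp; /* ASCII is encoded as itself */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte: 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation: 10xxxxxx */
        return 2;
    }
    return 0; /* three- and four-byte forms are out of scope for this sketch */
}
```

Note that the second byte of the encoded CSI is itself 0x9B, so a consumer that looks at bytes rather than decoded code points could still mistake it for a raw C1 byte.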

  • Windows does allow 0x80 - 0x9F in filenames. You can literally create a file named “€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ”, type dir, sit back and watch the world burn:

[two images]

Don’t ask “but why?” - users can, so they will. And incorrect codepage conversions are still a thing, especially during processing of various metadata. There are lots and lots of weird file names in the wild.

Sanitising file names, everywhere, even in scenarios not related to outputting anything for the sake of the feature that is “almost never used”… 🤔 Supposedly sooner or later C1 will make it into the conhost and this is where the fun begins.