terminal: Bad characters occasionally displayed when writing lots of identical UTF-8 lines

  • Your Windows build number: (Type ver at a Windows Command Prompt) Microsoft Windows [Version 10.0.18342.8] (But this happens on all versions of Windows since at least Windows 7.)

  • What you’re doing and what’s happening: (Copy & paste specific commands and their output, or include screen shots) In the cmd.exe console, set to code page 65001, type a large UTF-8 file containing hundreds of short lines with just the characters “ü€ü€ü€ü€ü€”

  • What’s wrong / what should be happening instead: 95% of the lines are displayed correctly, but about 5% contain spurious characters. Ex: “ü���ü€ü€ü€ü€” This happens with any application, not just cmd.exe’s built-in TYPE command. So I suspect this is a bug in the UTF-8 to Unicode conversion routine in the console output handler. For more details, and files and scripts for reproducing it, see the discussion thread on this forum: https://www.dostips.com/forum/viewtopic.php?f=3&t=9017
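The symptom is consistent with the output path decoding each fixed-size read buffer independently: any multi-byte UTF-8 sequence that straddles a buffer boundary decodes as replacement characters. A minimal, self-contained sketch of that failure mode (a simplified decoder for illustration only, not the console's actual conversion routine; it skips overlong/surrogate checks):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified illustrative decoder: UTF-8 to code points, emitting
// U+FFFD for every byte of an invalid or truncated sequence.
std::vector<char32_t> decodeUtf8(const std::string& bytes)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const auto b = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }

        if (len == 0 || i + len > bytes.size())
        {
            out.push_back(0xFFFD); // stray trail byte or truncated sequence
            ++i;
            continue;
        }

        bool ok = true;
        for (std::size_t j = 1; j < len; ++j)
        {
            const auto t = static_cast<unsigned char>(bytes[i + j]);
            if ((t & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (t & 0x3F);
        }
        if (!ok)
        {
            out.push_back(0xFFFD); // lead byte without its trail bytes
            ++i;
            continue;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Decoding the same bytes in two independent chunks, as fixed-size
// read buffers would deliver them, mangles any sequence that
// straddles the chunk boundary.
std::vector<char32_t> decodeInChunks(const std::string& bytes, std::size_t cut)
{
    auto first = decodeUtf8(bytes.substr(0, cut));
    const auto second = decodeUtf8(bytes.substr(cut));
    first.insert(first.end(), second.begin(), second.end());
    return first;
}
```

With the five bytes of "ü€" (`C3 BC E2 82 AC`) split after the third byte, the whole-buffer decode yields "ü€" while the chunked decode yields "ü" followed by three replacement characters: exactly the "ü���" pattern in the report.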

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 36 (34 by maintainers)

Most upvoted comments

I updated the latest test code again (with VT processing enabled) in order to see how it might be related to the occurrence of � (Unicode Replacement Character, U+FFFD) as it appears in #455 and #666. For that reason I’ve stolen @PhMajerus’s test pattern, converted it to UTF-8, and attached it to the above ZIP archive. TYPE only: (screenshot)

Piped to the test code: (screenshot)

The replacement characters don’t appear at the same positions as in the linked issues, but I guess that’s due to the different character-buffer sizes in the Console and Terminal source code.

Sorry, that’s what this proposal is! When we encounter an incomplete codepoint we need to wait for the next trip around the I/O loop to complete it. We can still send everything up to but not including the incomplete codepoint.

My point about ReadFile was more that it’s safer/easier to cache the partial data and wait for the next read loop than to try to expand the buffer and read a couple more bytes immediately.
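The cache-and-wait idea can be sketched as a helper that reports how many bytes at the end of a chunk belong to a sequence whose trail bytes haven’t arrived yet; the caller stashes that suffix and prepends it to the next read. (`incompleteSuffixLen` is a hypothetical name, and this is a sketch, not the actual conhost code.)

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count how many bytes at the end of `buf` form
// the start of a UTF-8 sequence whose trail bytes have not arrived yet.
std::size_t incompleteSuffixLen(const std::string& buf)
{
    // A cut-off lead byte can sit at most 3 positions before the end
    // (a 4-byte sequence missing all three of its trail bytes).
    const std::size_t limit = buf.size() < 3 ? buf.size() : 3;
    for (std::size_t back = 1; back <= limit; ++back)
    {
        const auto b = static_cast<unsigned char>(buf[buf.size() - back]);
        if ((b & 0b1100'0000) == 0b1000'0000)
        {
            continue; // trail byte: keep scanning backwards for the lead
        }
        std::size_t need = 0;        // total bytes the lead byte promises
        if ((b & 0b1110'0000) == 0b1100'0000)      need = 2;
        else if ((b & 0b1111'0000) == 0b1110'0000) need = 3;
        else if ((b & 0b1111'1000) == 0b1111'0000) need = 4;
        // Incomplete only if the lead promises more bytes than remain.
        return need > back ? back : 0;
    }
    return 0; // only trail bytes (or empty): nothing useful to defer
}
```

Everything before the reported suffix can be converted and sent immediately; only the deferred bytes wait for the next trip around the I/O loop.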

Even if there’s an issue in conhost, I’m convinced there’s an issue in Terminal as well.

https://github.com/microsoft/terminal/blob/4c47631bf4aa907aad4f7088bc7edc7e5cde11b9/src/cascadia/TerminalConnection/ConhostConnection.cpp#L188-L190

We are absolutely cutting UTF-8 sequences in half right there.

We can play fun games and cache partial sequences (partial code below), but it does add complexity. It’s probably necessary complexity, though!

// high bit set -> the last byte belongs to a multi-byte sequence
if ((buffer[iLastGoodByte] & 0b1'0000000) == 0b1'0000000)
{
    // does the encoding fit?
    const auto leadByte = buffer[iLastGoodByte];
    const auto sequenceLen = dwRead - iLastGoodByte;
    if ((((leadByte & 0b111'00000) == 0b110'00000) && sequenceLen < 2) ||  // lead reqs 2 bytes
        (((leadByte & 0b1111'0000) == 0b1110'0000) && sequenceLen < 3) ||  // lead reqs 3 bytes
        (((leadByte & 0b11111'000) == 0b11110'000) && sequenceLen < 4))    // lead reqs 4 bytes
    {
        // stash the incomplete tail until the next read completes it
        ::memmove_s(utf8Partials, std::extent<decltype(utf8Partials)>::value, buffer + iLastGoodByte, dwRead - iLastGoodByte);
        // ... do the rest ...
    }
}

I hacked it up to write 0 1 2 3 over the lead bytes where we cut off the ends of partial sequences, and A B C D over the trail bytes where we cut off the beginnings of partial sequences, and we get this:

(screenshot)

(you can see where one ReadFile ended and the next began where we transitioned from numbers to letters.)
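The debugging hack described above could look roughly like this (an assumed reconstruction, not the actual patch, and `markCutPoints` is a hypothetical name): overwrite a truncated lead-byte stub at the end of a chunk with digits, and an orphaned trail-byte prefix at the start with letters, so the cut points show up in the output.

```cpp
#include <cstddef>
#include <string>

// Assumed reconstruction of the debugging hack: make chunk-boundary
// cuts visible by overwriting broken sequences with printable markers.
void markCutPoints(std::string& chunk)
{
    // Orphaned trail bytes at the start lost their lead byte to the
    // previous chunk; overwrite them with 'A', 'B', 'C'.
    std::size_t i = 0;
    while (i < chunk.size() && i < 3 &&
           (static_cast<unsigned char>(chunk[i]) & 0b1100'0000) == 0b1000'0000)
    {
        chunk[i] = static_cast<char>('A' + i);
        ++i;
    }

    // Truncated sequence at the end: scan the last three positions for
    // a lead byte that promises more bytes than remain, then overwrite
    // the whole stub with '0', '1', '2'.
    const std::size_t limit = chunk.size() < 3 ? chunk.size() : 3;
    for (std::size_t back = 1; back <= limit; ++back)
    {
        const auto b = static_cast<unsigned char>(chunk[chunk.size() - back]);
        if ((b & 0b1100'0000) == 0b1000'0000)
        {
            continue; // trail byte: keep scanning backwards
        }
        std::size_t need = 0;
        if ((b & 0b1110'0000) == 0b1100'0000)      need = 2;
        else if ((b & 0b1111'0000) == 0b1110'0000) need = 3;
        else if ((b & 0b1111'1000) == 0b1111'0000) need = 4;
        if (need > back) // the sequence was cut off at the boundary
        {
            for (std::size_t j = 0; j < back; ++j)
            {
                chunk[chunk.size() - back + j] = static_cast<char>('0' + j);
            }
        }
        break; // found a non-trail byte; done either way
    }
}
```

For example, splitting "ü€" (`C3 BC E2 82 AC`) after three bytes marks the first chunk as `C3 BC 30` ("ü0") and the second as "AB", which is how the transition from digits to letters in the screenshot locates each ReadFile boundary.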

It wouldn’t be strictly easier to just keep calling ReadFile until we had a complete UTF-8 codepoint, because we would still need to figure out when we didn’t have a complete one.

I’m taking this because it feels similar to some of the other bugs I have assigned to me talking about U+FFFD.

As best we can!

That’s possible. There’s also a chance that we’re getting the 3-byte UTF-8-encoded emoji in two parts from the application.