terminal: Bad characters occasionally displayed when writing lots of identical UTF-8 lines

  • Your Windows build number: (Type ver at a Windows Command Prompt) Microsoft Windows [Version 10.0.18342.8] (But this happens on all versions of Windows since at least Windows 7.)

  • What you’re doing and what’s happening: (Copy & paste specific commands and their output, or include screen shots) In the cmd.exe console, set to code page 65001, type a large UTF-8 file containing hundreds of short lines with just the characters “ü€ü€ü€ü€ü€”

  • What’s wrong / what should be happening instead: 95% of the lines are displayed correctly, but about 5% contain spurious characters. Ex: “ü���ü€ü€ü€ü€” This happens with any application, not just cmd.exe’s built-in TYPE command. So I suspect this is a bug in the UTF-8 to Unicode conversion routine in the console output handler. For more details, and files and scripts for reproducing it, see the discussion thread on this forum: https://www.dostips.com/forum/viewtopic.php?f=3&t=9017
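The symptom is consistent with the output path decoding each fixed-size read buffer independently: any multi-byte UTF-8 sequence that straddles a buffer boundary decodes as replacement characters. A minimal, self-contained sketch of that failure mode (a simplified decoder for illustration only, not the console's actual conversion routine; it skips overlong/surrogate checks):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified illustrative decoder: UTF-8 to code points, emitting
// U+FFFD for every byte of an invalid or truncated sequence.
std::vector<char32_t> decodeUtf8(const std::string& bytes)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const auto b = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }

        if (len == 0 || i + len > bytes.size())
        {
            out.push_back(0xFFFD); // stray trail byte or truncated sequence
            ++i;
            continue;
        }

        bool ok = true;
        for (std::size_t j = 1; j < len; ++j)
        {
            const auto t = static_cast<unsigned char>(bytes[i + j]);
            if ((t & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (t & 0x3F);
        }
        if (!ok)
        {
            out.push_back(0xFFFD); // lead byte without its trail bytes
            ++i;
            continue;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Decoding the same bytes in two independent chunks, as fixed-size
// read buffers would deliver them, mangles any sequence that
// straddles the chunk boundary.
std::vector<char32_t> decodeInChunks(const std::string& bytes, std::size_t cut)
{
    auto first = decodeUtf8(bytes.substr(0, cut));
    const auto second = decodeUtf8(bytes.substr(cut));
    first.insert(first.end(), second.begin(), second.end());
    return first;
}
```

With the five bytes of "ü€" (`C3 BC E2 82 AC`) split after the third byte, the whole-buffer decode yields "ü€" while the chunked decode yields "ü" followed by three replacement characters: exactly the "ü���" pattern in the report.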

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 36 (34 by maintainers)

Most upvoted comments

I updated the latest test code again (with VT processing enabled) in order to see how it might be related to the occurrence of � (Unicode Replacement Character, U+FFFD) as it appears in #455 and #666. For that reason I’ve stolen @PhMajerus’s test pattern, converted it to UTF-8, and attached it to the above ZIP archive. TYPE only: (screenshot)

Piped to the test code: (screenshot)

The replacement characters don’t appear at the same positions as in the linked issues, but I guess that’s due to the different character-buffer sizes in the Console and Terminal source code.

Sorry, that’s what this proposal is! When we encounter an incomplete codepoint we need to wait for the next trip around the I/O loop to complete it. We can still send everything up to but not including the incomplete codepoint.

My point about ReadFile was more that it’s safer/easier to cache the partial data and wait for the next read loop than to try to expand the buffer and read a couple more bytes immediately.
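The cache-and-wait idea can be sketched as a helper that reports how many bytes at the end of a chunk belong to a sequence whose trail bytes haven’t arrived yet; the caller stashes that suffix and prepends it to the next read. (`incompleteSuffixLen` is a hypothetical name, and this is a sketch, not the actual conhost code.)

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count how many bytes at the end of `buf` form
// the start of a UTF-8 sequence whose trail bytes have not arrived yet.
std::size_t incompleteSuffixLen(const std::string& buf)
{
    // A cut-off lead byte can sit at most 3 positions before the end
    // (a 4-byte sequence missing all three of its trail bytes).
    const std::size_t limit = buf.size() < 3 ? buf.size() : 3;
    for (std::size_t back = 1; back <= limit; ++back)
    {
        const auto b = static_cast<unsigned char>(buf[buf.size() - back]);
        if ((b & 0b1100'0000) == 0b1000'0000)
        {
            continue; // trail byte: keep scanning backwards for the lead
        }
        std::size_t need = 0;        // total bytes the lead byte promises
        if ((b & 0b1110'0000) == 0b1100'0000)      need = 2;
        else if ((b & 0b1111'0000) == 0b1110'0000) need = 3;
        else if ((b & 0b1111'1000) == 0b1111'0000) need = 4;
        // Incomplete only if the lead promises more bytes than remain.
        return need > back ? back : 0;
    }
    return 0; // only trail bytes (or empty): nothing useful to defer
}
```

Everything before the reported suffix can be converted and sent immediately; only the deferred bytes wait for the next trip around the I/O loop.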

Even if there’s an issue in conhost, I’m convinced there’s an issue in Terminal as well.

https://github.com/microsoft/terminal/blob/4c47631bf4aa907aad4f7088bc7edc7e5cde11b9/src/cascadia/TerminalConnection/ConhostConnection.cpp#L188-L190

We are absolutely cutting UTF-8 sequences in half right there.

We can play fun games and cache partial sequences (partial code below), but it does add complexity. It’s probably necessary complexity, though!

// high bit set -> the last byte belongs to a multi-byte sequence
if ((buffer[iLastGoodByte] & 0b1'0000000) == 0b1'0000000)
{
    // does the encoding fit?
    const auto leadByte = buffer[iLastGoodByte];
    const auto sequenceLen = dwRead - iLastGoodByte;
    if ((((leadByte & 0b111'00000) == 0b110'00000) && sequenceLen < 2) ||  // lead reqs 2 bytes
        (((leadByte & 0b1111'0000) == 0b1110'0000) && sequenceLen < 3) ||  // lead reqs 3 bytes
        (((leadByte & 0b11111'000) == 0b11110'000) && sequenceLen < 4))    // lead reqs 4 bytes
    {
        // stash the incomplete tail until the next read completes it
        ::memmove_s(utf8Partials, std::extent<decltype(utf8Partials)>::value, buffer + iLastGoodByte, dwRead - iLastGoodByte);
        // ... do the rest ...
    }
}

I hacked it up to write 0 1 2 3 over the lead bytes where we cut off the ends of partial sequences, and A B C D over the trail bytes where we cut off the beginnings of partial sequences, and we get this:

(screenshot)

(you can see where one ReadFile ended and the next began where we transitioned from numbers to letters.)
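The debugging hack described above could look roughly like this (an assumed reconstruction, not the actual patch, and `markCutPoints` is a hypothetical name): overwrite a truncated lead-byte stub at the end of a chunk with digits, and an orphaned trail-byte prefix at the start with letters, so the cut points show up in the output.

```cpp
#include <cstddef>
#include <string>

// Assumed reconstruction of the debugging hack: make chunk-boundary
// cuts visible by overwriting broken sequences with printable markers.
void markCutPoints(std::string& chunk)
{
    // Orphaned trail bytes at the start lost their lead byte to the
    // previous chunk; overwrite them with 'A', 'B', 'C'.
    std::size_t i = 0;
    while (i < chunk.size() && i < 3 &&
           (static_cast<unsigned char>(chunk[i]) & 0b1100'0000) == 0b1000'0000)
    {
        chunk[i] = static_cast<char>('A' + i);
        ++i;
    }

    // Truncated sequence at the end: scan the last three positions for
    // a lead byte that promises more bytes than remain, then overwrite
    // the whole stub with '0', '1', '2'.
    const std::size_t limit = chunk.size() < 3 ? chunk.size() : 3;
    for (std::size_t back = 1; back <= limit; ++back)
    {
        const auto b = static_cast<unsigned char>(chunk[chunk.size() - back]);
        if ((b & 0b1100'0000) == 0b1000'0000)
        {
            continue; // trail byte: keep scanning backwards
        }
        std::size_t need = 0;
        if ((b & 0b1110'0000) == 0b1100'0000)      need = 2;
        else if ((b & 0b1111'0000) == 0b1110'0000) need = 3;
        else if ((b & 0b1111'1000) == 0b1111'0000) need = 4;
        if (need > back) // the sequence was cut off at the boundary
        {
            for (std::size_t j = 0; j < back; ++j)
            {
                chunk[chunk.size() - back + j] = static_cast<char>('0' + j);
            }
        }
        break; // found a non-trail byte; done either way
    }
}
```

For example, splitting "ü€" (`C3 BC E2 82 AC`) after three bytes marks the first chunk as `C3 BC 30` ("ü0") and the second as "AB", which is how the transition from digits to letters in the screenshot locates each ReadFile boundary.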

It wouldn’t be strictly easier to just keep calling ReadFile until we had a complete UTF-8 codepoint, because we would still need to figure out when we didn’t have a complete one.

I’m taking this because it feels similar to some of the other bugs I have assigned to me talking about U+FFFD.

As best we can!

That’s possible. There’s also a chance that we’re getting the 3-byte UTF-8-encoded emoji in two parts from the application.