terminal: Very slow rendering of colored text

WriteConsole() performs reasonably well when writing plain uncolored text to the console (~2ms at best, ~20ms at worst with a 240x64 buffer), but is incredibly slow when writing text with color escape sequences. Below is some code that demonstrates this.

I write an entire screen worth of colored text as fast as possible shifting every character to the one next to it in ASCII as to avoid any possible caching that might happen. I measure the time it takes for each WriteConsole() call and print the results at the end. No third-party libraries are needed to compile.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <windows.h>

#define NUM_OF_MEASUREMENTS 100


static LONGLONG perf_counter_freq;


LONGLONG
get_timestamp(void)
{
    LARGE_INTEGER timestamp;
    QueryPerformanceCounter(&timestamp);
    LONGLONG result = timestamp.QuadPart;
    return result;
}


float
get_seconds_elapsed(LONGLONG start, LONGLONG end)
{
    float result = (float)(end - start) / (float)perf_counter_freq;
    return result;
}


void
generate_text(char start_char, char *buffer, size_t buffer_size)
{
    for (size_t i = 0; i < buffer_size; i++)
    {
        buffer[i] = 'A' + (((start_char - 'A') + i) % 26);
    }
}


char *
copy_string(char *dest, char *source)
{
    while (*source)
    {
        *dest++ = *source++;
    }

    return dest;
}


int main(int argc, char **argv)
{
    HANDLE con_out = CreateFile("CONOUT$", GENERIC_READ | GENERIC_WRITE, FILE_SHARE_WRITE, 0, OPEN_EXISTING, 0, 0);
    if (con_out == INVALID_HANDLE_VALUE) goto cleanup;

    DWORD console_mode;
    GetConsoleMode(con_out, &console_mode);

    console_mode |= ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_WRAP_AT_EOL_OUTPUT;
    SetConsoleMode(con_out, console_mode);

    CONSOLE_SCREEN_BUFFER_INFO buffer_info;
    GetConsoleScreenBufferInfo(con_out, &buffer_info);

    LARGE_INTEGER perf_counter_freq_result;
    QueryPerformanceFrequency(&perf_counter_freq_result);
    perf_counter_freq = perf_counter_freq_result.QuadPart;

    int console_width  = (buffer_info.srWindow.Right  - buffer_info.srWindow.Left) + 1;
    int console_height = (buffer_info.srWindow.Bottom - buffer_info.srWindow.Top);
    size_t buffer_size = console_width * console_height;

    char *buffer1 = (char*)malloc(buffer_size);
    if (!buffer1) goto cleanup;

    char *buffer2 = (char*)malloc(buffer_size);
    if (!buffer2) goto cleanup;

    generate_text('A', buffer1, buffer_size);
    generate_text('B', buffer2, buffer_size);

    size_t screen_buffer_size = 1048576;
    char *screen_buffer = (char*)malloc(screen_buffer_size);
    if (!screen_buffer) goto cleanup;

    int measurement_index = 0;
    int num_of_measurements = NUM_OF_MEASUREMENTS;
    float *measurements = (float*)malloc(num_of_measurements*sizeof(float));
    if (!measurements) goto cleanup;

    int c = 0;
    int b = 1;
    for (int count = 0; count < num_of_measurements; count++)
    {
        char *src = b ? buffer1 : buffer2;
        char *at = screen_buffer;

        at = copy_string(at, "\x1b[H");
        for (int j = 0; j < console_height; j++)
        {
            size_t row_offset = j * console_width;
            char *row = &src[row_offset];

            for (int i = 0; i < console_width; i++)
            {
                at = copy_string(at, "\x1b[38;2;");
                switch(c)
                {
                    case 1:  { at = copy_string(at, "28;215;119m"); } break;
                    case 2:  { at = copy_string(at, "0;159;255m");  } break;
                    default: { at = copy_string(at, "226;51;73m");  } break;
                }
                c = (c + 1) % 3;

                *at++ = row[i];
                at = copy_string(at, "\x1b[0m");
            }
            at = copy_string(at, "\n");
        }

        LONGLONG start = get_timestamp();

        DWORD written = 0;
        size_t chars_to_write = at - screen_buffer;
        WriteConsole(con_out, screen_buffer, (DWORD)chars_to_write, &written, 0);

        LONGLONG end = get_timestamp();

        float seconds_elapsed = get_seconds_elapsed(start, end);
        measurements[measurement_index++] = seconds_elapsed;

        b = !b;
    }

    DWORD written = 0;
    WriteConsole(con_out, "\x1b[2J\x1b[H", 7, &written, 0);

    float fastest = measurements[0];
    float slowest = measurements[0];
    float total = 0;
    for (int i = 0; i < num_of_measurements; i++)
    {
        float measurement = measurements[i];

        fastest = measurement < fastest ? measurement : fastest;
        slowest = measurement > slowest ? measurement : slowest;
        total += measurement;
    }
    float average = total / num_of_measurements;

    printf("Fastest: %.4f\n", fastest);
    printf("Slowest: %.4f\n", slowest);
    printf("Average: %.4f\n", average);

cleanup:
    if (con_out != INVALID_HANDLE_VALUE) CloseHandle(con_out);
    if (buffer1)       free(buffer1);
    if (buffer2)       free(buffer2);
    if (screen_buffer) free(screen_buffer);
    if (measurements)  free(measurements);
}

Is there anything I can do to make this run fast enough for animation (at least 30 FPS)? Techniques like only writing what changed are not an option, although I doubt that doing that would be faster than just writing the whole buffer to the console at once. I know about WriteConsoleOutput() and double-buffering with SetConsoleActiveScreenBuffer() but I don’t know if they could be made to work with VT escape sequences.

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 6
  • Comments: 15 (6 by maintainers)

Most upvoted comments

@zadjii-msft I didn’t mean to come off as insulting. I found this thread, saw that it was added to a milestone 2 years ago, and then nothing else, so I assumed (apparently incorrectly) that nobody was paying it any mind. Good to hear you guys are on it!

@lhecker Thanks for the update. I suppose maybe it is just the line breaks that are causing it to run so slowly. I also made an assumption about the cause of the slow throughput because all of the output from my library has some escape codes, even if it’s just to use the default foreground and background colors.

Sorry to hear that it is such a difficult technical challenge. Perhaps you will have an epiphany, or these individual tweaks mentioned by @zadjii-msft will have a cumulative effect and make it much less noticeable.

Honestly, if I hadn’t run a performance test that hammers the console as fast as one CPU can do it, I would probably not have noticed. The performance is ‘good enough’ for most applications, but I believe it is probably the bottleneck for any super intensive operation that is also logging its progress to stdout.

Anyhow, godspeed, and I’ll keep my eye on this thread for updates (and if you guys need a guinea pig, I’ll volunteer to test dev builds).

Cheers

Unfortunately even with the faster code we’re not at a state yet where watching a video at this resolution would be pleasant. We’re currently limited to something in the ballpark of 60MB/s of VT processing at the moment which at 640x180 cells as in your screenshot is 5-10 FPS. That’s roughly 1 order of magnitude faster than it used to be (so, try it out if you can!), but we’d need to make it another order of magnitude faster to be decent. That’s theoretically possible, but requires lots and lots and lots of code refactorizations in this code base. [^1]

[^1]: The renderer holds the text buffer lock across D3D and D2D calls, but fixing that is difficult because there’s tons of places that access the buffer directly via huge, narrow interfaces. There’s the UTF8<>UTF16 translation which happens 3 times and it uses stringapiset.h which isn’t the fastest. The console calls result in lots of individual heap allocations due to the reliance on the STL (as most C++ projects do; but that’s bad for perf). The VT parser heavily relies on dynamic dispatch which results in a disappointing upper perf ceiling due to CFG. It uses lots of if conditions for parsing and could use a faster LL(1) design with LUTs instead. The text buffer stores VT sequences and text split up separately using RLE compression for the sequences (as metadata), which is also slow. Basically, there’s a long list of long-term cleanups that need to occur and they have been and continue to be long-term ongoing.

image

It takes like a full second to write a frame

This is still an issue in late 2023. I publish a library that, among other things, writes ANSI escape colored text to the console (libsir).

On Windows (using WriteConsole), the performance is so much worse than any other platform/terminal emulator, that I’ve had to cut the number of iterations to 10% of normal for the performance evaluation test.

Here are some numbers (from the same machine; everything other than macOS is a VM):

Platform # lines Elapsed Lines/sec Emulator
macOS 1m 7.418s 134807.2 iTerm2
Debian 12 1m 7.055s 141743.4 gnome-terminal
FreeBSD 1m 3.310s 302114.8 Konsole
Windows 100k 16.596s 6025.6 Windows Terminal

I know that Microsoft can do better than this, so why has this bug report been ignored for nearly 4 years now? It’s the age of cross-platform software, and compatibility (which has never been Windows’ strong suit) is becoming more important. You took it seriously enough to implement the parsing and rendering of the color codes, so why not finish the job and make it at least compete with other emulators in terms of throughput?

I, for one, would really appreciate it!

I noticed that in general, rendering of a text in Windows terminal is very slow

Indeed. As a point of comparison, running seq 1 70000 takes this long on various terminals (local and remote), as determined by using a physical stopwatch:

Windows Terminal + Ubuntu WSL: 12.99s Ubuntu WSL: 5.32s Ubuntu WSL + tmux: 0.70s PuTTY (ssh to remote linux machine): 2.55s VSCode terminal (in remote SSH session to linux machine): 1.15s Ubuntu (in Konsole running natively in X11 on slower machine): 0.55s