zsh-bench: Prompt times don't match the real times in a screen recording, and zsh4humans seems to be getting quite an advantage

It’s hard to be sure, just by looking at its code, how exactly zsh-bench is measuring what it claims to be measuring.

I’m going to trust a screen recording as the source of truth for the prompt times on my machine. I’ve recorded zim and zsh4humans, 3 times each. Each time the steps are:

  1. Sleep 1 second. During this time I type exit and press ENTER, so it’s buffered as input for the next steps.
  2. Print the framework name. I consider that counting starts when the output of this print appears in the recording.
  3. Start a new shell with HOME set to the specific framework’s installation dir. I consider that counting ends when the prompt appears.

This translates to:

for f in zim zsh4humans; do repeat 3; do sleep 1; print ${f}; HOME=${PWD:h}/${f} zsh -li; done; done

The steps above are executed in a dir containing a git repo with 1,000 directories and 10,000 files, set up using the code here and here.

Each framework was installed with the setup scripts here and here, for zim and zsh4humans respectively.

This should be enough to guarantee that the recordings are using the same scenario used by zsh-bench.

This is the screen recording:

[Screen recording: Kapture 2021-10-24 at 16 54 36]

These are the times extracted by checking the recording frames:

framework    run  begin_frame  prompt_frame  frames       ms
zim            1          161           169       9  300.000
zim            2          200           207       8  266.667
zim            3          238           246       9  300.000
zsh4humans     1          276           282       7  233.333
zsh4humans     2          316           321       6  200.000
zsh4humans     3          354           359       6  200.000

The recording is 30 fps, so each frame represents 33.333 ms. We can consider the error of each measurement above to be ±33.333 ms.
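
For reference, the ms values above are just the inclusive frame counts converted at 30 fps. A quick sanity check in zsh (plain arithmetic, nothing from zsh-bench):

# frames = prompt_frame - begin_frame + 1 (both endpoint frames are counted)
# ms     = frames * 1000 / 30
printf '%.3f\n' $(( (169 - 161 + 1) * 1000 / 30.0 ))  # zim run 1 -> 300.000
printf '%.3f\n' $(( (282 - 276 + 1) * 1000 / 30.0 ))  # zsh4humans run 1 -> 233.333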

This is what I get when running zsh-bench on the same machine. I ran it 3 times for each of the mentioned frameworks:

❯ ./zsh-bench zim
==> setting up a container for benchmarking ...
==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=115.790
first_command_lag_ms=149.910
command_lag_ms=62.535
input_lag_ms=28.830
exit_time_ms=47.530

❯ ./zsh-bench zim
==> setting up a container for benchmarking ...
==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=196.126
first_command_lag_ms=229.765
command_lag_ms=142.110
input_lag_ms=29.402
exit_time_ms=48.146

❯ ./zsh-bench zim
==> setting up a container for benchmarking ...
==> benchmarking zim ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=196.878
first_command_lag_ms=231.291
command_lag_ms=140.756
input_lag_ms=27.556
exit_time_ms=48.560

❯ ./zsh-bench zsh4humans
==> setting up a container for benchmarking ...
==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=30.361
first_command_lag_ms=100.323
command_lag_ms=4.997
input_lag_ms=14.536
exit_time_ms=10.895

❯ ./zsh-bench zsh4humans
==> setting up a container for benchmarking ...
==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=30.458
first_command_lag_ms=101.612
command_lag_ms=5.198
input_lag_ms=12.339
exit_time_ms=11.051

❯ ./zsh-bench zsh4humans
==> setting up a container for benchmarking ...
==> benchmarking zsh4humans ...
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=29.819
first_command_lag_ms=108.122
command_lag_ms=5.123
input_lag_ms=13.146
exit_time_ms=11.05

What is odd

The zim first_prompt_lag_ms times in zsh-bench fluctuate quite a lot: an average of 169.598 ms with a stdev of 46.601! In the recordings the average was 288.889 ms with a stdev of 19.245, which is actually a stdev of 0.577 in terms of frames (just one frame of difference in one of the recordings). zsh-bench should be more precise than a 30 fps recording.

Also, from zsh-bench’s code it looks like this output is the min value. It would be good to know the stdev of the values inside a zsh-bench run, to make sure the values are not fluctuating.

I could compare the minimum recording time with the minimum zsh-bench time for each framework, but let’s use the averages, since the stdev of zsh-bench for zim was so high. For completeness: the first_prompt_lag_ms average for zsh4humans in zsh-bench is 30.213 ms, with a stdev of 0.344. In the recordings the average was 211.111 ms, with a stdev of 19.245 (again a stdev of 0.577 in terms of frames; this time it was one frame more in one of the recordings).
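
For anyone double-checking these statistics, here is one way to reproduce the mean and sample stdev (plain awk, assuming it’s available; the input is the three zim first_prompt_lag_ms values from above):

print 115.790 196.126 196.878 |
  awk '{ for (i = 1; i <= NF; i++) { s += $i; q += $i * $i }
         m = s / NF
         printf "mean=%.3f stdev=%.3f\n", m, sqrt((q - NF * m * m) / (NF - 1)) }'
# -> mean=169.598 stdev=46.601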

Then, it’s odd that zim is 1.703x faster in zsh-bench than in the recordings (I was expecting much closer values), and zsh4humans is 6.987x faster in zsh-bench than in the recordings!

If anything, zsh-bench is favoring zsh4humans by a lot, both by giving it more stable measurements and by giving it times way faster than the real ones.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 48 (33 by maintainers)

Most upvoted comments

To summarize what happened since this issue was opened:

Thanks to @romkatv’s input and to his thoughts in zsh-bench, this is what we’ve improved in Zim:

  • We changed our benchmarks to measure the time to the first prompt appearance (instead of the time to run exit). We’re using a different tool (expect) than zsh-bench (script) to measure that time, and @romkatv helped us make sure the times from our different approaches match (a rough sketch of the expect idea follows after this list).
  • We don’t compile Zsh startup scripts, and we don’t compile any scripts in the background anymore. The former is considered “cutting corners” in zsh-bench, and the latter unreliable. In fact, now Zim does not run anything in the background during your shell experience.
  • In our completion module, we’re using the same improved completion dumpfile check that zsh4humans uses during startup, since using compinit -C alone is also considered “cutting corners” in zsh-bench (a rough sketch of such a check appears below).
  • We’ve also updated our templates to set ZSH_AUTOSUGGEST_MANUAL_REBIND=1 and source the zsh-users/zsh-autosuggestions module last.
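
As a rough illustration of the expect-based measurement mentioned in the first bullet (a minimal sketch, not Zim’s actual benchmark code; it assumes expect(1) is installed and that the prompt ends in "%"):

expect -c '
  set t0 [clock milliseconds]
  spawn zsh -li
  expect -re {%\s*$}
  puts "first prompt after [expr {[clock milliseconds] - $t0}] ms"
'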

2023-02-26 EDIT: We were still insisting on delaying the completion dumpfile check, but we’re now finally doing the same improved dumpfile check that zsh4humans uses during startup.
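
For context, this is roughly what such a dumpfile check can look like (an illustrative sketch only, not the exact zsh4humans logic): run the full, slow compinit only when some directory in $fpath is newer than the existing dumpfile, and trust the dump with compinit -C otherwise.

# illustrative sketch, not the exact zsh4humans code
autoload -Uz compinit
zdump=${ZDOTDIR:-$HOME}/.zcompdump
rebuild=0
if [[ ! -e $zdump ]]; then
  rebuild=1
else
  for d in $fpath; do
    # -nt means "newer than"; a changed completion dir invalidates the dump
    if [[ $d -nt $zdump ]]; then rebuild=1; break; fi
  done
fi
if (( rebuild )); then
  compinit -d $zdump     # full (slow) check; rebuilds the dumpfile
else
  compinit -C -d $zdump  # trust the existing dumpfile and skip the checks
fi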

Latency reported by zsh-framework-benchmark matches first_prompt_lag from zsh-bench within 1 ms. There are two exceptions where zsh-bench reports higher latency: zinit-light with +5.5 ms (+3.5%) and ohmyzsh with +19 ms (+32%). These deviations are large enough to require an explanation. I haven’t looked at them yet.

@ericbn Both exceptions have disappeared with your fix: https://github.com/zimfw/zsh-framework-benchmark/pull/10

framework    zsh-framework-benchmark (ms)  zsh-bench (ms)  diff (ms)
ohmyzsh                            77.810          77.402     -0.408
zinit-light                       161.342         159.638     -1.704

Now all zsh-framework-benchmark results match first_prompt_lag from zsh-bench --standalone --git no.

zsh-users/zsh-history-substring-search is enabled by default in zsh4humans.

On a related note, if you run touch 1/1/1/1 (this file exists), zim’s prompt will erroneously display [!]. If you run git status, [!] will disappear. This is a bug in zim as far as I can tell.

Opened https://github.com/zimfw/zimfw/issues/441.

I’d be interested to see these benchmarks run on a variety of devices.

Cool. If you run the benchmarks on other devices, please share the results.

Out of curiosity I’ve benchmarked zim and zsh4humans on a Raspberry Pi 4 4G Model B with a micro SD card for storage.

==> benchmarking zim
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=315.364
first_command_lag_ms=387.934
command_lag_ms=229.512
input_lag_ms=69.796
exit_time_ms=65.726

==> benchmarking zsh4humans
creates_tty=1
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=1
has_git_prompt=1
first_prompt_lag_ms=61.945
first_command_lag_ms=233.866
command_lag_ms=13.231
input_lag_ms=56.386
exit_time_ms=19.132

Edit: See more accurate results below.

Sorry, just catching up on all this and discussion in the other benchmarking repo.

@ericbn My memory is hazy and I don’t remember if there is any interaction with login shells when using zpty, which my original script used; if there is/was, it may not be documented.

Awesome that you were able to generate so many stats. Reading all of this was really the best kind of issue report 😃

@romkatv this is a really cool tool, although I’ll admit I haven’t looked deeply into your zsh4humans project. If the claim to fame is that it is fast, awesome! Fixing the horrendously slow prompt lag on slower devices (first generation raspberry pis, old old intel chips, etc) that came with other frameworks was one of my goals when I started zim.

Part of the original design of zim was also to allow the “heavier” tasks to run at the login shell, with the idea that it usually only happens once per interactive session. That is to say, when you sit down and login or connect, you’re only going to do it once per session/terminal. This included things like zcompile operations, initialization of the ssh/gpg agents, etc. My perception was, “if the user is forced to wait, make them wait one time and only one time”.

In that same line referenced above, you can see that an explicit and seemingly superfluous zsh process is launched with ZDOTDIR pointing at the rc files of the to-be-benchmarked framework. This explicitly allowed first-time initialization of each framework (i.e., the first prompt lag was deliberately not benchmarked). This isn’t to say that it doesn’t matter; it was just an intentional compromise.

The design obviously wasn’t perfect, but it was a good compromise versus trying to make things asynchronous until the cows come home. By happenstance, working on my original zsh framework benchmarking tool is where I learned precisely how painful that is 😉

My only question, do you run login shells frequently? If you open a new terminal is it always a login shell? I ask only because it doesn’t match my personal experience.

I’d be interested to see these benchmarks run on a variety of devices. There is a huge difference in speed based on IO devices, which is why the original benchmarks for zim ran on a variety of devices. In some circumstances, a machine with a Celeron chip and an NVMe drive will outperform a machine with a Ryzen Threadripper and an HDD. (Zim was knowingly designed with file IO in mind, hence the zcompile of user RC files that you disagree with.) You reference your CPU here, but not the disk: https://github.com/romkatv/zsh-bench#full-benchmark-data

cheers! 🍻

I purposely want to keep the steps where the screen is recorded and the recording frames are verified as manual steps, as I want to make sure no automation is interfering with the results.

I’d rather have both of these automated and then we can also manually perform the same steps to double-check that the automation produces the same results.

(Frankly, I don’t think recording a video is a wise approach. Recording the raw writes to the TTY is both easier and more precise. But if you want to go with video, that’s fine with me. It’s more work, but the numbers should match eventually after we iron out potential quirks with vsync, adaptive frame rate, codecs, buffering, etc.)

Performance on Linux can be very different from performance on macOS (I’m assuming your host OS is macOS).

Yes, I’m using macOS, and yes, there seems to be more going on in the /etc startup files on macOS than on Ubuntu, which might explain why zsh-bench reports a quicker time than the recording I did on my macOS machine.

You can also run zsh-bench without positional arguments to benchmark your own local login shell. You’ll have to install script from util-linux though, and I don’t know how easy or hard that is on macOS. If you manage to install it, create a local user, set up any kind of zsh environment you like, and run zsh-bench to see how that shell performs on macOS. You can also use your regular user if you have a way to restore your rc files (e.g., from git).

What do you think about running the benchmarks with zsh -o NO_GLOBAL_RCS -li in the future?

The config called “zim” is for the default experience. It measures the performance of zim that a potential user would see if they installed zim and made no modifications. Similarly with “zsh4humans”. The config called zim+ disables everything (including global rcs) and enables only the core functionality (what’s considered core is in the docs of zsh-bench) with all possible optimizations. It shows the baseline latency for a config that one could build on top of zim. If zsh-bench ran all configs with no_global_rcs, it wouldn’t be able to benchmark some configs that people actually use. E.g., it wouldn’t be able to tell us what kind of performance people should expect if they installed zim or zsh4humans and used it without modifications.

I know. I just wanted to show how the results fluctuate between subsequent zsh-bench zim runs.

It looks like one run was an outlier and the other two runs had nearly identical first prompt lag. Perhaps something was running on your machine at the same time? Try using --iters 64, close all applications, and don’t do anything while the benchmark is running. As I mentioned, this is the value I used for all published benchmark results.
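
That is, for the zim playground:

./zsh-bench --iters 64 zim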

I thought you were measuring the first command lag like this: Type the command before the prompt appears, and measure how long it takes for the command output to actually appear after the prompt appears. (If we type it after the prompt appears, there’s already a lag between the time the prompt appeared and the time the command was typed, at least when typing manually.) If not, then how are you measuring it? And which command are you specifically using in zsh-bench?

Yes, that’s how zsh-bench measures first command lag. But you aren’t measuring first command lag; you are measuring first prompt lag. If you want to measure first command lag, you need to buffer print first""-""command and then track when “first-command” appears on the screen. Note that first command lag depends on the length of the command if syntax highlighting is enabled. Buffering “exit” doesn’t allow you to measure first command lag and doesn’t help with measuring first prompt lag.
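
For instance, the recording loop from the top of this issue could be adapted along these lines (a sketch under the same playground assumptions as before):

# during the 1-second sleep, type:  print first""-""command   and press ENTER
sleep 1; print zsh4humans; HOME=${PWD:h}/zsh4humans zsh -li
# counting ends when the literal "first-command" appears on screen; the typed
# line itself never contains that exact string, so the echoed input can't be
# mistaken for the command's output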

I forgot to mention that you can replay the screen of the TTY that zsh-bench is acting on. Let’s say you are in playground zim. Run this:

rm -rf -- '' /tmp/zsh-bench-*(D) && ~/zsh-bench/zsh-bench --iters 1 --keep

Edit: I’ve added the crucial --keep to the command above.

Once this command finishes, you can see what zsh-bench was doing:

clear; scriptreplay -t /tmp/zsh-bench-*/timing /tmp/zsh-bench-*/out

This works only if your TERM is compatible with xterm-256color. For a slower replay, you can multiply all numbers in the first column of /tmp/zsh-bench-*/timing by a constant and/or add a constant to them.
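
For example, to slow the replay down 3x by scaling the delays (awk assumed to be available; paths as produced by the --keep run above):

awk '{ $1 = $1 * 3; print }' /tmp/zsh-bench-*/timing > /tmp/timing.slow
clear; scriptreplay -t /tmp/timing.slow /tmp/zsh-bench-*/out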

This method is very useful when debugging zsh configs. It allows you to see errors if there are any and to generally verify that things work as they are supposed to.

You’ll see that at first zsh-bench issues a short command that sources something. That’s the first command and it gets sent to the TTY right away without waiting for the first prompt. The sourced file prints, and the result of that printing signifies execution of the first command.

Then zsh-bench starts measuring input latency. It prints a fairly long command composed of a bunch of escaped abc tokens (they are escaped to avoid triggering alias expansion, which can significantly slow down syntax highlighting), types one more character and waits for that character to be processed by zle. This is performed several times and then the command line is cleared without execution.

Then zsh-bench measures command lag by spamming ENTER and seeing how many prompts it would get.

Once you play around with script and zsh-bench a bit and verify some numbers with video recording, it shouldn’t be hard to see how zsh-bench measures things, and that what it measures corresponds to what we actually want to measure. You can even run clear; print start; script -fqet$HOME/timing ~/typescript in the container and record a video. This will allow you to verify that latencies extracted from the logs of script match the latencies that you get from video.